Armin Ronacher's Thoughts and Writings

Absurd Workflows: Durable Execution With Just Postgres

Mon, 03 Nov 2025 00:00:00 +0000

It’s probably no surprise to you that we’re building agents somewhere. Everybody does it. Building a good agent, however, brings back some of the historic challenges involving durable execution.

Entirely unsurprisingly, a lot of people are now building durable execution systems. Many of these, however, are incredibly complex and require you to sign up for another third-party service. I generally try to avoid bringing in extra complexity if I can avoid it, so I wanted to see how far I can go with just Postgres. To this end, I wrote Absurd ¹, a tiny SQL-only library with a very thin SDK to enable durable workflows on top of just Postgres — no extension needed.

Durable Execution 101

Durable execution (or durable workflows) is a way to run long-lived, reliable functions that can survive crashes, restarts, and network failures without losing state or duplicating work. Durable execution can be thought of as the combination of a queue system and a state store that remembers the most recently seen execution state.

Because Postgres is excellent at queues thanks to SELECT ... FOR UPDATE SKIP LOCKED, you can use it for the queue (e.g., with pgmq). And because it’s a database, you can also use it to store the state.

The state is important. With durable execution, instead of running your logic in memory, the goal is to decompose a task into smaller pieces (step functions) and record every step and decision. When the process stops (whether it fails, intentionally suspends, or a machine dies) the engine can replay those events to restore the exact state and continue where it left off, as if nothing happened.

Absurd At A High Level

Absurd at the core is a single .sql file (absurd.sql) which needs to be applied to a database of your choice. That SQL file’s goal is to move the complexity of SDKs into the database. SDKs then make the system convenient by abstracting the low-level operations in a way that leverages the ergonomics of the language you are working with.

The system is very simple: A task dispatches onto a given queue from where a worker picks it up to work on. Tasks are subdivided into steps, which are executed in sequence by the worker. Tasks can be suspended or fail, and when that happens, they execute again (a run). The result of a step is stored in the database (a checkpoint). To avoid repeating work, checkpoints are automatically loaded from the state storage in Postgres again.

Additionally, tasks can sleep or suspend for events and wait until they are emitted. Events are cached, which means they are race-free.

With Agents

What is the relationship of agents with workflows? Normally, workflows are DAGs defined by a human ahead of time. AI agents, on the other hand, define their own adventure as they go. That means they are basically a workflow with mostly a single step that iterates over changing state until it determines that it has completed. Absurd enables this by automatically counting up steps if they are repeated:

absurd.registerTask({name: "my-agent"}, async (params, ctx) => {
  let messages = [{role: "user", content: params.prompt}];
  let step = 0;
  while (step++ < 20) {
    const { newMessages, finishReason } = await ctx.step("iteration", async () => {
      return await singleStep(messages);
    });
    messages.push(...newMessages);
    if (finishReason !== "tool-calls") {
      break;
    }
  }
});

This defines a single task named my-agent, and it has just a single step. The return value is the changed state, but the current state is passed in as an argument. Every time the step function is executed, the data is looked up first from the checkpoint store. The first checkpoint will be iteration, the second iteration#2, iteration#3, etc. Each state only stores the new messages it generated, not the entire message history.

If a step fails, the task fails and will be retried. And because of checkpoint storage, if you crash in step 5, the first 4 steps will be loaded automatically from the store. Steps are never retried, only tasks.

How do you kick it off? Simply enqueue it:

await absurd.spawn("my-agent", {
  prompt: "What's the weather like in Boston?"
}, {
  maxAttempts: 3,
});

And if you are curious, this is an example implementation of the singleStep function used above:

Single step function

Regulation Isn’t the European Trap — Resignation Is

Tue, 21 Oct 2025 00:00:00 +0000

Plenty has been written about how hard it is to build in Europe versus the US. The list is always the same with little process: brittle politics, dense bureaucracy, mandatory notaries, endless and rigid KYC and AML processes. Fine. I know, you know.

I’m not here to add another complaint to the pile (but if we meet over a beer or coffee, I’m happy to unload a lot of hilarious anecdotes on you). The unfortunate reality is that most of these constraints won’t change in my lifetime and maybe ever. Europe is not culturally aligned with entrepreneurship, it’s opposed to the idea of employee equity, and our laws reflect that.

What bothers me isn’t the rules — it’s the posture that develops from it within people that should know better. Across the system, everyone points at someone else. If a process takes 10 steps, you’ll find 10 people who feel absolved of responsibility because they can cite 9 other blockers. Friction becomes a moral license to do a mediocre job (while lamenting about it).

The vibe is: “Because the system is slow, I can be slow. Because there are rules, I don’t need judgment. Because there’s risk, I don’t need initiative.” And then we all nod along and nothing moves.

There are excellent people here; I’ve worked with them. But they are fighting upstream against a default of low agency. When the process is bad, too many people collapse into it. Communication narrows to the shortest possible message. Friday after 2pm, the notary won’t reply — and the notary surely will blame labor costs or regulation for why service ends there. The bank will cite compliance for why they don’t need to do anything. The registrar will point at some law that allows them to demand a translation of a document by a court appointed translator. Everyone has a reason. No one owns the outcome.

Meanwhile, in the US, our counsel replies when it matters, even after hours. Bankers answer the same day. The instinct is to enable progress, not enumerate reasons you can’t have it. The goal is the outcome and the rules are constraints to navigate, not a shield to hide behind.

So what’s the point? I can’t fix politics. What I can do: act with agency, and surround myself with people who do the same and speak in support of it. Work with those who start from “how do we make this work?” not “why this can’t work.” Name the absurdities without using them as cover. Be transparent, move anyway and tell people.

Nothing stops a notary from designing an onboarding flow that gets an Austrian company set up in five days — standardized KYC packets, templated resolutions, scheduled signing slots, clear checklists, async updates, a bias for same-day feedback. That could exist right now. It rarely does or falls short.

Yes, much in Europe is objectively worse for builders. We have to accept it. Then squeeze everything you can from what is in your control:

Own the handoff. When you’re step 3 of 10, behave like step 10 depends on you and behave like you control all 10 steps. Anticipate blockers further down the line. Move same day. Eliminate ambiguity. Close loops.
Default to clarity. Send checklists. Preempt the next two questions. Reduce the number of touches.
Model urgency without theatrics. Be calm, fast, and precise. Don’t make your customer chase you.
Use judgment. Rules exist and we can’t break them all. But we can work with them and be guided by them.

Select for agency. Choose partners who answer promptly when it’s material and who don’t confuse process with progress.

The trap is not only regulation. It’s the learned helplessness it breeds. If we let friction set our standards, we become the friction. We won’t legislate our way to a US-style environment anytime soon. But we don’t need permission to be better operators inside a bad one.

That’s the contrast and it’s the part we control.

Postscript: Comparing Europe to the US triggers people and I’m concious of that. Maturity is holding two truths at once: they do some things right and some things wrong and so do we. You don’t win by talking others down or praying for their failure. I’d rather see both Europe and the US succeed than celebrate Europe failing slightly less.

And no, saying I feel gratitude and happiness when I get a midnight reply doesn’t make me anti-work-life balance (I am not). It means when something is truly time-critical, fast, clear action lifts everyone. The times someone sent a document in minutes, late at night, both sides felt good about it when it mattered. Responsiveness, used with judgment, is not exploitation; it’s respect for outcomes and the relationships we form.

Building an Agent That Leverages Throwaway Code

Fri, 17 Oct 2025 00:00:00 +0000

In August I wrote about my experiments with replacing MCP (Model Context Protocol) with code. In the time since I utilized that idea for exploring non-coding agents at Earendil. And I’m not alone! In the meantime, multiple people have explored this space and I felt it was worth sharing some updated findings. The general idea is pretty simple. Agents are very good at writing code, so why don’t we let them write throw-away code to solve problems that are not related to code at all?

I want to show you how and what I’m doing to give you some ideas of what works and why this is much simpler than you might think.

Pyodide is the Dark Horse

The first thing you have to realize is that Pyodide is secretly becoming a pretty big deal for a lot of agentic interactions. What is Pyodide? Pyodide is an open source project that makes a standard Python interpreter available via a WebAssembly runtime. What is neat about it is that it has an installer called micropip that allows it to install dependencies from PyPI. It also targets the emscripten runtime environment, which means there is a pretty good standard Unix setup around the interpreter that you can interact with.

Getting Pyodide to run is shockingly simple if you have a Node environment. You can directly install it from npm. What makes this so cool is that you can also interact with the virtual file system, which allows you to create a persistent runtime environment that interacts with the outside world. You can also get hosted Pyodide at this point from a whole bunch of startups, but you can actually get this running on your own machine and infrastructure very easily if you want to.

The way I found this to work best is if you banish Pyodide into a web worker. This allows you to interrupt it in case it runs into time limits.

A big reason why Pyodide is such a powerful runtime, is because Python has an amazing ecosystem of well established libraries that the models know about. From manipulating PDFs or word documents, to creating images, it’s all there.

File Systems Are King

Another vital ingredient to a code interpreter is having a file system.

Not just any file system though. I like to set up a virtual file system that I intercept so that I can provide it with access to remote resources from specific file system locations. For instance, you can have a folder on the file system that exposes files which are just resources that come from your own backend API. If the agent then chooses to read from those files, you can from outside the sandbox make a safe HTTP request to bring that resource into play. The sandbox itself does not have network access, so it’s only the file system that gates access to resources.

The reason the file system is so good is that agents just know so much about how they work, and you can provide safe access to resources through some external system outside of the sandbox. You can provide read-only access to some resources and write access to others, then access the created artifacts from the outside again.

Now actually doing that is a tad tricky because the emscripten file system is sync, and most of the interesting things you can do are async. The option that I ended up going with is to move the fetch-like async logic into another web worker and use Atomics.wait to block. If your entire Pyodide runtime is in a web worker, that’s not as bad as it looks.

That said, I wish the emscripten file system API was changed to support stack swiching instead of this. While it’s now possible to hide async promises behind sync abstractions within Pyodide with call_sync, the same approach does not work for the emscripten JavaScript FS API.

I have a full example of this at the end, but the simplified pseudocode that I ended up with looks like this:

// main thread: wrap a worker so fetch() looks synchronous
fetch(url) {
  const signalBuffer = new SharedArrayBuffer(4);
  const signal = new Int32Array(signalBuffer);
  const { port1, port2 } = new MessageChannel();
  this.worker.postMessage({url, signalBuffer, port: port2}, [port2]);

  Atomics.wait(signal, 0, 0);                   // park until worker flips the signal
  const message = receiveMessageOnPort(port1);  // MessageChannel gives the payload
  port1.close();

  if (message.message.status !== "ok") {
    throw new Error(message.message.error.message);
  }
  return message.message.data;
}

// worker thread: perform async fetch, then wake the main thread
parentPort.on("message", async ({ url, signalBuffer, port }) => {
  const signal = new Int32Array(signalBuffer);
  try {
    const bytes = await fetch(url).then(r => {
      if (!r.ok) throw new Error(`HTTP ${r.status}`);
      return r.arrayBuffer();
    });
    port.postMessage({ status: "ok", data: new Uint8Array(bytes) });
    Atomics.store(signal, 0, 1);          // mark success
  } catch (error) {
    port.postMessage({ status: "error", error: serialize(error) });
    Atomics.store(signal, 0, -1);         // mark failure
  } finally {
    Atomics.notify(signal, 0);            // unblock the waiting main thread
    port.close();
  }
});

Durable Execution

Lastly now that you have agents running, you really need durable execution. I would describe durable execution as the idea of being able to retry a complex workflow safely without losing progress. The reason for this is that agents can take a very long time, and if they interrupt, you want to bring them back to the state they were in. This has become a pretty hot topic. There are a lot of startups in that space and you can buy yourself a tool off the shelf if you want to.

What is a little bit disappointing is that there is no truly simple durable execution system. By that I mean something that just runs on top of Postgres and/or Redis in the same way as, for instance, there is pgmq.

The easiest way to shoehorn this yourself is to use queues to restart your tasks and to cache away the temporary steps from your execution. Basically, you compose your task from multiple steps and each of the steps just has a very simple cache key. It’s really just that simple:

function myAgenticLoop(taskID, initialState) {
  let stepCount = 0;
  let state = initialState;
  while (stepCount < MAX_STEPS) {
    let cacheKey = `${taskID}:${stepCount}`;
    let cachedState = loadStateFromCache(cacheKey);
    if (cachedState !== null) {
      state = cachedState.state;
    } else {
      state = runAgenticStep(state);
      storeStateInCache(cacheKey, state);
    }
    stepCount++;
    if (reachedEndCondition(state)) {
      break;
    }
  }
  return state;
}

You can improve on this greatly, but this is the general idea. The state is basically the conversation log and whatever else you need to keep around for the tool execution (e.g., whatever was thrown on the file system).

What Other Than Code?

What tools does an agent need that are not code? Well, the code needs to be able to do something interesting so you need to give it access to something. The most interesting access you can provide is via the file system, as mentioned. But there are also other tools you might want to expose. What Cloudflare proposed is connecting to MCP servers and exposing their tools to the code interpreter. I think this is a quite interesting approach and to some degree it’s probably where you want to go.

Some tools that I find interesting:

Describe: a tool that just lets the agent run more inference, mostly with files that the code interpreter generated. For instance if you have a zip file it’s quite fun to see the code interpreter use Python to unpack it. But if then that unpacked file is a jpg, you will need to go back to inference to understand it.
Help: a tool that just … brings up help. Again, can be with inference for basic RAG, or similar. I found it quite interesting to let the AI ask it for help. For example, you want the manual tool to allow a query like “Which Python code should I write to create a chart for the given XLSX file?” On the other hand, you can also just stash away some instructions in .md files on the virtual file system and have the code interpreter read it. It’s all an option.

Putting it Together

If you want to see what this roughly looks like, I vibe-coded a simple version of this together. It uses a made-up example but it does show how a sandbox with very little tool availability can create surprising results: mitsuhiko/mini-agent.

When you run it, it looks up the current IP from a special network drive that triggers an async fetch, and then it (usually) uses pillow or matplotlib to make an image of that IP address. Pretty pointless, but a lot of fun!

What an agent run looks like

90%

Mon, 29 Sep 2025 00:00:00 +0000

“I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code”

— Dario Amodei

Three months ago I said that AI changes everything. I came to that after plenty of skepticism. There are still good reasons to doubt that AI will write all code, but my current reality is close.

For the infrastructure component I started at my new company, I’m probably north of 90% AI-written code. I don’t want to convince you — just share what I learned. In parts, because I approached this project differently from my first experiments with AI-assisted coding.

The service is written in Go with few dependencies and an OpenAPI-compatible REST API. At its core, it sends and receives emails. I also generated SDKs for Python and TypeScript with a custom SDK generator. In total: about 40,000 lines, including Go, YAML, Pulumi, and some custom SDK glue.

I set a high bar, especially that I can operate it reliably. I’ve run similar systems before and knew what I wanted.

Setting it in Context

Some startups are already near 100% AI-generated. I know, because many build in the open and you can see their code. Whether that works long-term remains to be seen. I still treat every line as my responsibility, judged as if I wrote it myself. AI doesn’t change that.

There are no weird files that shouldn’t belong there, no duplicate implementations, and no emojis all over the place. The comments still follow the style I want and, crucially, often aren’t there. I pay close attention to the fundamentals of system architecture, code layout, and database interaction. I’m incredibly opinionated. As a result, there are certain things I don’t let the AI do. I know it won’t reach the point where I could sign off on a commit. That’s why it’s not 100%.

As contrast: another quick prototype we built is a mess of unclear database tables, markdown file clutter in the repo, and boatloads of unwanted emojis. It served its purpose — validate an idea — but wasn’t built to last, and we had no expectation to that end.

Foundation Building

I began in the traditional way: system design, schema, architecture. At this state I don’t let the AI write, but I loop it in AI as a kind of rubber duck. The back-and-forth helps me see mistakes, even if I don’t need or trust the answers.

I did get the foundation wrong once. I initially argued myself into a more complex setup than I wanted. That’s a part where I later used the LLM to redo a larger part early and clean it up.

For AI-generated or AI-supported code, I now end up with a stack that looks something like something I often wanted, but was too hard to do by hand:

Raw SQL: This is probably the biggest change to how I used to write code. I really like using an ORM, but I don’t like some of its effects. In particular, once you approach the ORM’s limits, you’re forced to switch to handwritten SQL. That mapping is often tedious because you lose some of the powers the ORM gives you. Another consequence is that it’s very hard to find the underlying queries, which makes debugging harder. Seeing the actual SQL in your code and in the database log is powerful. You always lose that with an ORM.

The fact that I no longer have to write SQL because the AI does it for me is a game changer.

I also use raw SQL for migrations now.
OpenAPI first: I tried various approaches here. There are many frameworks you can use. I ended up first generating the OpenAPI specification and then using code generation from there to the interface layer. This approach works better with AI-generated code. The OpenAPI specification is now the canonical one that both clients and server shim is based on.

Iteration

Today I use Claude Code and Codex. Each has strengths, but the constant is Codex for code review after PRs. It’s very good at that. Claude is indispensable still when debugging and needing a lot of tool access (eg: why do I have a deadlock, why is there corrupted data in the database etc.). The working together of the two is where it’s most magical. Claude might find the data, Codex might understand it better.

I cannot stress enough how bad the code from these agents can be if you’re not careful. While they understand system architecture and how to build something, they can’t keep the whole picture in scope. They will recreate things that already exist. They create abstractions that are completely inappropriate for the scale of the problem.

You constantly need to learn how to bring the right information to the context. For me, this means pointing the AI to existing implementations and giving it very specific instructions on how to follow along.

I generally create PR-sized chunks that I can review. There are two paths to this:

Agent loop with finishing touches: Prompt until the result is close, then clean up.
Lockstep loop: Earlier I went edit by edit. Now I lean on the first method most of the time, keeping a todo list for cleanups before merge.

It requires intuition to know when each approach is more likely to lead to the right results. Familiarity with the agent also helps understanding when a task will not go anywhere, avoiding wasted cycles.

Where It Fails

The most important piece of working with an agent is the same as regular software engineering. You need to understand your state machines, how the system behaves at any point in time, your database.

It is easy to create systems that appear to behave correctly but have unclear runtime behavior when relying on agents. For instance, the AI doesn’t fully comprehend threading or goroutines. If you don’t keep the bad decisions at bay early it, you won’t be able to operate it in a stable manner later.

Here’s an example: I asked it to build a rate limiter. It “worked” but lacked jitter and used poor storage decisions. Easy to fix if you know rate limiters, dangerous if you don’t.

Agents also operate on conventional wisdom from the internet and in tern do things I would never do myself. It loves to use dependencies (particularly outdated ones). It loves to swallow errors and take away all tracebacks. I’d rather uphold strong invariants and let code crash loudly when they fail, than hide problems. If you don’t fight this, you end up with opaque, unobservable systems.

Where It Shines

For me, this has reached the point where I can’t imagine working any other way. Yes, I could probably have done it without AI. But I would have built a different system in parts because I would have made different trade-offs. This way of working unlocks paths I’d normally skip or defer.

Here are some of the things I enjoyed a lot on this project:

Research + code, instead of research and code later: Some things that would have taken me a day or two to figure out now take 10 to 15 minutes.
It allows me to directly play with one or two implementations of a problem. It moves me from abstract contemplation to hands on evaluation.
Trying out things: I tried three different OpenAPI implementations and approaches in a day.
Constant refactoring: The code looks more organized than it would otherwise have been because the cost of refactoring is quite low. You need to know what you do, but if set up well, refactoring becomes easy.
Infrastructure: Claude got me through AWS and Pulumi. Work I generally dislike became a few days instead of weeks. It also debugged the setup issues as it was going through them. I barely had to read the docs.
Adopting new patterns: While they suck at writing tests, they turned out great at setting up test infrastructure I didn’t know I needed. I got a recommendation on Twitter to use testcontainers for testing against Postgres. The approach runs migrations once and then creates database clones per test. That turns out to be super useful. It would have been quite an involved project to migrate to. Claude did it in an hour for all tests.
SQL quality: It writes solid SQL I could never remember. I just need to review which I can. But to this day I suck at remembering MERGE and WITH when writing it.

What does it mean?

Is 90% of code going to be written by AI? I don’t know. What I do know is, that for me, on this project, the answer is already yes. I’m part of that growing subset of developers who are building real systems this way.

At the same time, for me, AI doesn’t own the code. I still review every line, shape the architecture, and carry the responsibility for how it runs in production. But the sheer volume of what I now let an agent generate would have been unthinkable even six months ago.

That’s why I’m convinced this isn’t some far-off prediction. It’s already here — just unevenly distributed — and the number of developers working like this is only going to grow.

That said, none of this removes the need to actually be a good engineer. If you let the AI take over without judgment, you’ll end up with brittle systems and painful surprises (data loss, security holes, unscalable software). The tools are powerful, but they don’t absolve you of responsibility.

What’s a Foreigner?

Sun, 14 Sep 2025 00:00:00 +0000

Across many countries, resistance to immigration is rising — even places with little immigration, like Japan, now see rallies against it. I’m not going to take a side here. I want to examine a simpler question: who do we mean when we say “foreigner”?

I would argue there isn’t a universal answer. Laws differ, but so do social definitions. In Vienna, where I live, immigration is visible: roughly half of primary school children don’t speak German at home. Austria makes citizenship hard to obtain. Many people born here aren’t citizens; at the same time, EU citizens living here have broad rights and labor-market access similar to native Austrians. Over my lifetime, the fear of foreigners has shifted: once aimed at nearby Eastern Europeans, it now falls more on people from outside the EU, often framed through religion or culture. Practically, “foreigner” increasingly ends up meaning “non-EU.” Keep in mind that over the last 30 years the EU went from 12 countries to 27. That’s a signifcant increase in social mobility.

I believe this is quite different from what is happening in the United States. The present-day US debate is more tightly tied to citizenship and allegiance, which is partly why current fights there include attempts to narrow who gets citizenship at birth. The worry is less about which foreigners come and more about the terms of becoming American and whether newcomers will embrace what some define as American values.

Inside the EU, the concept of EU citizenship changes social reality. Free movement, aligned standards, interoperable social systems, and easier labor mobility make EU citizens feel less “foreign” to each other — despite real frictions. The UK before Brexit was a notable exception: less integrated in visible ways and more hostile to Central and Eastern European workers. Perhaps another sign that the level of integration matters. In practical terms, allegiances are also much less clearly defined in the EU. There are people who live their entire live in other EU countries and whos allegiance is no longer clearly aligned to any one country.

Legal immigration itself is widely misunderstood. Most systems are both far more restrictive in some areas and far more permissive than people assume. On the one hand, what’s called “illegal” is often entirely lawful. Many who are considered “illegal” are legally awaiting pending asylum decisions or are accepted refugees. These are processes many think shouldn’t exist, but they are, in fact, legal. On the other hand, the requirements for non-asylum immigration are very high, and most citizens of a country themselves would not qualify for skilled immigration visas. Meanwhile, the notion that a country could simply “remove all foreigners” runs into practical and ethical dead ends. Mobility pressures aren’t going away; they’re reinforced by universities, corporations, individual employers, demographics, and geopolitics.

Citizenship is just a small wrinkle. In Austria, you generally need to pass a modest German exam and renounce your prior citizenship. That creates odd outcomes: native-born non-citizens who speak perfect German but lack a passport, and naturalized citizens who never fully learned the language. Legally clear, socially messy — and not unique to Austria. The high hurdle to obtaining a passport also leads many educated people to intentionally opt out of becoming citizens. The cost that comes with renouncing a passport is not to be underestimated.

Where does this leave us? The realities of international mobility leave our current categories of immigration straining and misaligned with what the population at large thinks immigration should look like. Economic anxiety, war, and political polarization are making some groups of foreigners targets, while the deeper drivers behind immigration will only keep intensifying.

Perhaps we need to admit that we’re all struggling with these questions. The person worried about their community or country changing too quickly and the immigrant seeking a better life are both responding to forces larger than themselves. In a world where capital moves freely but most people cannot, where climate change might soon displace millions, and where birth rates are collapsing in wealthy nations, our immigration systems will be tested and stressed, and our current laws and regulations are likely inadequate.

996

Thu, 04 Sep 2025 00:00:00 +0000

“Amazing salary, hackerhouse in SF, crazy equity. 996. Our mission is OSS.” — Gregor Zunic

“The current vibe is no drinking, no drugs, 9-9-6, […].” — Daksh Gupta

“The truth is, China’s really doing ‘007’ now—midnight to midnight, seven days a week […] if you want to build a $10 billion company, you have to work seven days a week.” — Harry Stebbings

I love work. I love working late nights, hacking on things. This week I didn’t go to sleep before midnight once. And yet…

I also love my wife and kids. I love long walks, contemplating life over good coffee, and deep, meaningful conversations. None of this would be possible if my life was defined by 12 hour days, six days a week. More importantly, a successful company is not a sprint, it’s a marathon.

And this is when this is your own company! When you devote 72 hours a week to someone else’s startup, you need to really think about that arrangement a few times. I find it highly irresponsible for a founder to promote that model. As a founder, you are not an employee, and your risks and leverage are fundamentally different.

I will always advocate for putting the time in because it is what brought me happiness. Intensity, and giving a shit about what I’m doing, will always matter to me. But you don’t measure that by the energy you put in, or the hours you’re sitting in the office, but the output you produce. Burning out on twelve-hour days, six days a week, has no prize at the end. It’s unsustainable, it shouldn’t be the standard and it sure as hell should not be seen as a positive sign of a company.

I’ve pulled many all-nighters, and I’ve enjoyed them. I still do. But they’re enjoyable in the right context, for the right reasons, and when that is a completely personal choice, not the basis of company culture.

And that all-nighter? It comes with a fucked up and unproductive morning the day after.

When someone promotes a 996 work culture, we should push back.

Passkeys and Modern Authentication

Tue, 02 Sep 2025 00:00:00 +0000

There is an ongoing trend in the industry to move people away from username and password towards passkeys. The intentions here are good, and I would assume that this has a significant net benefit for the average consumer. At the same time, the underlying standard has some peculiarities. These enable behaviors by large corporations, employers, and governments that are worth thinking about.

Attestations

One potential source of problems here is the attestation system. It allows the authenticator to provide more information about what it is to the website that you’re authenticating with. In particular it is what tells a website if you have a Yubikey plugged in versus something like 1password. This is the mechanism by which the Austrian government, for instance, prevents you from using an Open Source or any other software-based authenticator to sign in to do your taxes, access medical records or do anything else that is protected by eID. Instead you have to buy a whitelisted hardware token.

Attestations themselves are not used by software authenticators today, or anything that syncs. Both Apple and Google do not expose attestation data in their own software authenticators (Keychain and Google Authenticator) for consumer passkeys. However, they will pass through attestation data from hardware tokens just fine. Both of them also, to the best of my knowledge, expose attestation data for enterprises through Mobile Device Management.

One could make the argument that it is unlikely that attestation data will be used at scale to create vendor lock-in. However, I’m not sufficiently convinced that this won’t create sub-ecosystems where we see exactly that happening. If for no other reason, this API exists and it has already been used to restrict keys for governmental sign-in systems.

Auth Lock-in

One slightly more concerning issue today is that there is effectively no way to export private keys between authentication password managers. You need to enroll all of your ecosystems individually into a password manager. An attempt by an open source password manager to reveal private keys to the user was ruled insecure and should not be supported. This taking away agency from the user is not an accident. You can also see this with the passkey export specification which comes with a protocol that, while enabling exports in principle, encourages a system to system transfer that does not hand over the user’s credentials to the user. ¹

This might be for good intentions, but it also creates problems. As someone recently trying to leave the Apple ecosystem step by step, I have noticed how many services are now bound to an iCloud-based passkey. Particularly when it comes to Apple, this fear is not entirely unwarranted. Sign-in with Apple using non-shared email addresses makes it very hard to migrate to Android unless you retain an iCloud subscription.

Obviously, one could pay for an authenticator like 1Password, which at least is ecosystem independent. However, not everybody is in a situation where they can afford to pay for basic services like password managers.

Sneaky Onboarding

One reason why passkeys are adopted so well today is because it happens automatically for many. I discovered that non-technical family members now all have passkeys for some services, and they did not even notice doing that. A notable example is Amazon. After every sign-in, it attempts to enroll you into a passkey automatically without clear notification. It just brings up the fingerprint prompt, and users will instinctively touch it.

If you use different types of devices to authenticate — for instance, a Windows and an iOS device — you may eventually have both authenticators associated. This now covers the devices you already use. However, it can make moving to a completely different ecosystem later much harder.

We Are Run By Corporations

For many years already, people lose access to their Google account every day and can never regain it. Google is well known for terminating accounts without stating any reasons. With that comes the loss of access to your data. In this case, you also lose your credentials for third-party websites.

There is no legal recourse for this and no mechanism for appeal. You just have to hope that you’re a good citizen and not doing anything that would upset Google’s account flagging systems.

As a sufficiently technical person, you might weigh the risks, but others will not. Many years ago, I tried to help another family gain access to their child’s Facebook account after they passed away. Even then, it was a bureaucratic nightmare where there was little support by Facebook to make it happen. There is a real risk that access becomes much harder for families. This is particularly true in situations where someone is incapacitated or dead. The more we move away from basic authentication systems, the worse this becomes. It’s also really inconvenient when you are not on your own devices. Signing into my accounts on my children’s devices has turned from a straightforward process to an incredibly frustrating experience. I find myself juggling all kinds of different apps and flows.

Complexity and Gatekeepers Everywhere

Every once in a while, I find myself in a situation where I have very little foundation to build on. This is mostly just because of a hobby. I like to see how things work and build them from scratch. Increasingly, that has become harder. Many username and password authentication schemes have been replaced with OAuth sign-ins over the years. Nowadays, some services are moving towards passkeys, though most places do not enforce these yet. If you want to build an operating system from scratch, or even just build a client yourself, you often find yourself needing to do a lot of yak-shaving. All this work is necessary just to get basic things working.

I think this is at least something to be wary of. It doesn’t mean that bad things will necessarily happen, but there is potential for loss of individual agency.

An accelerated version of this has been seen with email. Accessing your own personal IMAP account from Google today has been significantly restricted under security arguments. Getting OAuth credentials that can access someone’s IMAP accounts with their approval has become increasingly harder. It is also very costly.

Username and password authentication has largely been removed. Even the app-specific passwords on Google are now entirely undocumented. They are no longer exposed in the settings unless you know the link ².

What Does Any Of This Mean?

I don’t know. I am both a user of passkeys and generally wary of making myself overly dependent on tech giants and complex solutions. I’m noticing an increased reliance and potential loss of access to my own data. This does abstractly concern me. Not to the degree that it changes anything I’m doing, but still. As annoying as managing usernames and passwords was, I don’t think I have ever spent so much time authenticating on a daily basis. The systems that we now need to interface with for authentication are vast and complex.

This might just be the path we’re going. However, it is also one where we maybe want to reflect a little bit on whether this is really what we want.

Edit: I reworded the statement about pass key exports to not misrepresent the original comment on GitHub.

The details can be debated, but the protocol explicitly does not permit a user to just hold on to a symmetrically encrypted export (or even a plain text one). The best option is the HPKE scheme.↩
This OAuth dependency also puts Open Source projects in an interesting situation. For instance, the Thunderbird client ships with OAuth credentials for Google when you download it from Mozilla. However, if you self-compile it, you don’t have that access.↩

Your MCP Doesn’t Need 30 Tools: It Needs Code

Mon, 18 Aug 2025 00:00:00 +0000

I wrote a while back about why code performs better than MCP (Model Context Protocol) for some tasks. In particular, I pointed out that if you have command line tools available, agentic coding tools seem very happy to use those. In the meantime, I learned a few more things that put some nuance to this. There are a handful of challenges with CLI-based tools that are rather hard to resolve and require further examination.

In this blog post, I want to present the (not so novel) idea that an interesting approach is using MCP servers exposing a single tool, that accepts programming code as tool inputs.

CLI Challenges

The first and most obvious challenge with CLI tools is that they are sometimes platform-dependent, version-dependent, and at times undocumented. This has meant that I routinely encounter failures when using tools on first use.

A good example of this is when the tool usage requires non-ASCII string inputs. For instance, Sonnet and Opus are both sometimes unsure how to feed newlines or control characters via shell arguments. This is unfortunate but ironically not entirely unique to shell tools either. For instance, when you program with C and compile it, trailing newlines are needed. At times, agentic coding tools really struggle with appending an empty line to the end of a file, and you can find some quite impressive tool loops to work around this issue.

This becomes particularly frustrating when your tool is absolutely not in the training set and uses unknown syntax. In that case, getting agents to use it can become quite a frustrating experience.

Another issue is that in some agents (Claude Code in particular), there is an extra pass taking place for shell invocations: the security preflight. Before executing a tool, Claude also runs it through the fast Haiku model to determine if the tool will do something dangerous and avoid the invocation. This further slows down tool use when multiple turns are needed.

In general, doing multiple turns is very hard with CLI tools because you need to teach the agent how to manage sessions. A good example of this is when you ask it to use tmux for remote-controlling an LLDB session. It’s absolutely capable of doing it, but it can lose track of the state of its tmux session. During some tests, I ended up with it renaming the session halfway through, forgetting that it had a session (and thus not killing it).

This is particularly frustrating because the failure case can be that it starts from scratch or moves on to other tools just because it got a small detail wrong.

Composability

Unfortunately, when moving to MCP, you immediately lose the ability to compose without inference (at least today). One of the reasons lldb can be remote-controlled with tmux at all is that the agent manages to compose quite well. How does it do that? It uses basic tmux commands such as tmux send-keys to send inputs or tmux capture-pane to get the output, which don’t require a lot of extra tooling. It then chains commands like sleep and tmux capture-pane to ensure it doesn’t read output too early. Likewise, when it starts to fail with encoding more complex characters, it sometimes changes its approach and might even use base64 -d.

The command line really isn’t just one tool — it’s a series of tools that can be composed through a programming language: bash. The most interesting uses are when you ask it to write tools that it can reuse later. It will start composing large scripts out of these one-liners. All of that is hard with MCP today.

Better Approach To MCP?

It’s very clear that there are limits to what these shell tools can do. At some point, you start to fight those tools. They are in many ways only as good as their user interface, and some of these user interfaces are just inherently tricky. For instance, when evaluated, tmux performs better than GNU screen, largely because the command-line interface of tmux is better and less error-prone. But either way, it requires the agent to maintain a stateful session, and it’s not particularly good at this today.

What is stateful out of the box, however, is MCP. One surprisingly useful way of running an MCP server is to make it an MCP server with a single tool (the ubertool) which is just a Python interpreter that runs eval() with retained state. It maintains state in the background and exposes tools that the agent already knows how to use.

I did this experiment in a few ways now, the one that is public is pexpect-mcp. It’s an MCP that exposes a single tool called pexpect_tool. It is, however, in many ways a misnomer. It’s not really a pexpect tool — it’s a Python interpreter running out of a virtualenv that has pexpect installed.

What is pexpect? It is the Python port of the ancient expect command-line tool which allows one to interact with command-line programs through scripts. The documentation describes expect as a “program that ‘talks’ to other interactive programs according to a script.”

What is special about pexpect is that it’s old, has a stable API, and has been used all over the place. You could wrap expect or pexpect with lots of different MCP tools like pexpect_expect, pexpect_sendline, pexpect_spawn, and more. That’s because the pexpect.Spawn class exposes 36 different API functions! That’s a lot. But many of these cannot be used in isolation well anyway. Take this motivating example from the docs:

child = pexpect.spawn('scp foo user@example.com:.')
child.expect('Password:')
child.sendline(mypassword)

Even the most basic use here involves three chained tool calls. And that doesn’t include error handling, which one might also want to encode.

So instead, a much more interesting way to have this entire thing run is to just have the command language to the MCP be Python. The MCP server turns into a stateful Python interpreter, and the tool just lets it send Python code that is evaluated with the same state as before. There is some extra support in the MCP server to make the experience more reliable (like timeout support), but for the most part, the interface is to just send Python code. In fact, the exact script from above is what an MCP client is expected to send.

The tool description just says this:

Execute Python code in a pexpect session. Can spawn processes and interact with
them.

Args:
  `code`: Python code to execute. Use 'child' variable to interact with the
  spawned process. The pexpect library is already imported. Use
  `pexpect.spawn(...)` to spawn something. timeout: Optional timeout in seconds.
  If not provided, uses global `TIMEOUT` (default 30s).

Example:
  child = pexpect.spawn('lldb ./mytool')
  child.expect("(lldb)")

Returns:
  The result of the code execution or an error message.

This works because the interface to the MCP is now not just individual tools it has never seen — it’s a programming language that it understands very well, with additional access to an SDK (pexpect) that it has also seen and learned all the patterns from. We’re relegating the MCP to do the thing that it does really well: session management and guiding the tool through a built-in prompt.

More importantly, the code that it writes is very similar to what it might put into a reusable script. There is so little plumbing in the actual MCP that you can tell the agent after the session to write a reusable pexpect script from what it learned in the session. That works because all the commands it ran are just Python — they’re still in the context, and the lift from that to a reusable Python script is low.

Do It, Then Do It Again

Now I don’t want to bore you too much with lots of Claude output, but I took a crashing demo app that Mario wrote and asked it to debug with LLDB through pexpect_tool. Here is what that looked like:

Expand to see entire session

In Support Of Shitty Types

Mon, 04 Aug 2025 00:00:00 +0000

You probably know that I love Rust and TypeScript, and I’m a big proponent of good typing systems. One of the reasons I find them useful is that they enable autocomplete, which is generally a good feature. Having a well-integrated type system that makes sense and gives you optimization potential for memory layouts is generally a good idea.

From that, you’d naturally think this would also be great for agentic coding tools. There’s clearly some benefit to it. If you have an agent write TypeScript and the agent adds types, it performs well. I don’t know if it outperforms raw JavaScript, but at the very least it doesn’t seem to do any harm.

But most agentic tools don’t have access to an LSP (language server protocol). My experiments with agentic coding tools that do have LSP access (with type information available) haven’t meaningfully benefited from it. The LSP protocol slows things down and pollutes the context significantly. Also, the models haven’t been trained sufficiently to understand how to work with this information. Just getting a type check failure from the compiler in text form yields better results.

What you end up with is an agent coding loop that, without type checks enabled, results in the agent making forward progress by writing code and putting types somewhere. As long as this compiles to some version of JavaScript (if you use Bun, much of it ends up type-erased), it creates working code. And from there it continues. But that’s bad progress—it’s the type of progress where it needs to come back after and clean up the types.

It’s curious because types are obviously being written but they’re largely being ignored. If you do put the type check into the loop, my tests actually showed worse performance. That’s because the agent manages to get the code running, and only after it’s done does it run the type check. Only then, maybe at a much later point, does it realize it made type errors. Then it starts fixing them, maybe goes in a loop, and wastes a ton of context. If you make it do the type checks after every single edit, you end up eating even more into the context.

This gets really bad when the types themselves are incredibly complicated and non-obvious. TypeScript has arcane expression functionality, and some libraries go overboard with complex constructs (e.g., conditional types). LLMs have little clue how to read any of this. For instance, if you give it access to the .d.ts files from TanStack Router and the forward declaration stuff it uses for the router system to work properly, it doesn’t understand any of it. It guesses, and sometimes guesses badly. It’s utterly confused. When it runs into type errors, it performs all kinds of manipulations, none of which are helpful.

Python typing has an even worse problem, because there we have to work with a very complicated ecosystem where different type checkers cannot even agree on how type checking should work. That means that the LLM, at least from my testing, is not even fully capable of understanding how to resolve type check errors from tools which are not from mypy. It’s not universally bad, but if you actually end up with a complex type checking error that you cannot resolve yourself, it is shocking how the LLM is also often not able to fully figure out what’s going on, or at least needs multiple attempts.

As a shining example of types adding a lot of value we have Go. Go’s types are much less expressive and very structural. Things conform to interfaces purely by having certain methods. The LLM does not need to understand much to comprehend that. Also, the types that Go has are rather strictly enforced. If they are wrong, it won’t compile. Because Go has a much simpler type system that doesn’t support complicated constructs, it works much better—both for LLMs to understand the code they produce and for the LLM to understand real-world libraries you might give to an LLM.

I don’t really know what to do with this, but these behaviors suggest there’s a lot more value in best-effort type systems or type hints like JSDoc. Because at least as far as the LLM is concerned, it doesn’t need to fully understand the types, it just needs to have a rough understanding of what type some object probably is. For the LLM it’s more important that the type name in the error message aligns with the type name in source.

I think it’s an interesting question whether this behavior of LLMs today will influence future language design. I don’t know if it will, but I think it gives a lot of credence to some of the decisions that led to languages like Go and Java. As critical as I have been in the past about their rather simple approaches to problems and having a design that maybe doesn’t hold developers in a particularly high regard, I now think that they actually are measurably in a very good spot. There is more elegance to their design than I gave it credit for.

Agentic Coding Things That Didn’t Work

Wed, 30 Jul 2025 00:00:00 +0000

Using Claude Code and other agentic coding tools has become all the rage. Not only is it getting millions of downloads, but these tools are also gaining features that help streamline workflows. As you know, I got very excited about agentic coding in May, and I’ve tried many of the new features that have been added. I’ve spent considerable time exploring everything on my plate.

But oddly enough, very little of what I attempted I ended up sticking with. Most of my attempts didn’t last, and I thought it might be interesting to share what didn’t work. This doesn’t mean these approaches won’t work or are bad ideas; it just means I didn’t manage to make them work. Maybe there’s something to learn from these failures for others.

Rules of Automation

The best way to think about the approach that I use is:

I only automate things that I do regularly.
If I create an automation for something that I do regularly, but then I stop using the automation, I consider it a failed automation and I delete it.

Non-working automations turn out to be quite common. Either I can’t get myself to use them, I forget about them, or I end up fine-tuning them endlessly. For me, deleting a failed workflow helper is crucial. You don’t want unused Claude commands cluttering your workspace and confusing others.

So I end up doing the simplest thing possible most of the time: just talk to the machine more, give it more context, keep the audio input going, and dump my train of thought into the prompt. And that is 95% of my workflow. The rest might be good use of copy/paste.

Slash Commands

Slash commands allow you to preload prompts to have them readily available in a session. I expected these to be more useful than they ended up being. I do use them, but many of the ones that I added I ended up never using.

There are some limitations with slash commands that make them less useful than they could be. One limitation is that there’s only one way to pass arguments, and it’s unstructured. This proves suboptimal in practice for my uses. Another issue I keep running into with Claude Code is that if you do use a slash command, the argument to the slash command for some reason does not support file-based autocomplete.

To make them work better, I often ask Claude to use the current Git state to determine which files to operate on. For instance, I have a command in this blog that fixes grammar mistakes. It operates almost entirely from the current git status context because providing filenames explicitly is tedious without autocomplete.

Here is one of the few slash commands I actually do use:

## Context

- git status: !`git status`
- Explicitly mentioned file to fix: "$ARGUMENTS"

## Your task

Based on the above information, I want you to edit the mentioned file or files
for grammar mistakes.  Make a backup (eg: change file.md to file.md.bak) so I
can diff it later.  If the backup file already exists, delete it.

If a blog post was explicitly provided, edit that; otherwise, edit the ones
that have pending changes or are untracked.

My workflow now assumes that Claude can determine which files I mean from the Git status virtually every time, making explicit arguments largely unnecessary.

Here are some of the many slash commands that I built at one point but ended up not using:

/fix-bug: I had a command that instructed Claude to fix bugs by pulling issues from GitHub and adding extra context. But I saw no meaningful improvement over simply mentioning the GitHub issue URL and voicing my thoughts about how to fix it.
/commit: I tried getting Claude to write good commit messages, but they never matched my style. I stopped using this command, though I haven’t given up on the idea entirely.
/add-tests: I really hoped this would work. My idea was to have Claude skip tests during development, then use an elaborate reusable prompt to generate them properly at the end. But this approach wasn’t consistently better than automatic test generation, which I’m still not satisfied with overall.
/fix-nits: I had a command to fix linting issues and run formatters. I stopped using it because it never became muscle memory, and Claude already knows how to do this. I can just tell it “fix lint” in the CLAUDE.md file without needing a slash command.
/next-todo: I track small items in a to-do.md file and had a command to pull the next item and work on it. Even here, workflow automation didn’t help much. I use this command far less than expected.

So if I’m using fewer slash commands, what am I doing instead?

Speech-to-text. Cannot stress this enough but talking to the machine means you’re more likely to share more about what you want it to do.
I maintain some basic prompts and context for copy-pasting at the end or the beginning of what I entered.

Copy/paste is really, really useful because of how fuzzy LLMs are. For instance, I maintain link collections that I paste in when needed. Sometimes I fetch files proactively, drop them into a git-ignored folder, and mention them. It’s simple, easy, and effective. You still need to be somewhat selective to avoid polluting your context too much, but compared to having it spelunk in the wrong places, more text doesn’t harm as much.

Hooks

I tried hard to make hooks work, but I haven’t seen any efficiency gains from them yet. I think part of the problem is that I use yolo mode. I wish hooks could actually manipulate what gets executed. The only way to guide Claude today is through denies, which don’t work in yolo mode. For instance, I tried using hooks to make it use uv instead of regular Python, but I was unable to do so. Instead, I ended up preloading executables on the PATH that override the default ones, steering Claude toward the right tools.

For instance, this is really my hack for making it use uv run python instead of python more reliably:

#!/bin/sh
echo "This project uses uv, please use 'uv run python' instead."
exit 1

I really just have a bunch of these in .claude/interceptors and preload that folder onto PATH before launching Claude:

CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR=1 \
    PATH="`pwd`/.claude/interceptors:${PATH}" \
    claude --dangerously-skip-permissions

I also found it hard to hook into the right moment. I wish I could run formatters at the end of a long edit session. Currently, you must run formatters after each Edit tool operation, which often forces Claude to re-read files, wasting context. Even with the Edit tool hook, I’m not sure if I’m going to keep using it.

I’m actually really curious whether people manage to get good use out of hooks. I’ve seen some discussions on Twitter that suggest there are some really good ways of making them work, but I just went with much simpler solutions instead.

Claude Print Mode

I was initially very bullish on Claude’s print mode. I tried hard to have Claude generate scripts that used print mode internally. For instance, I had it create a mock data loading script — mostly deterministic code with a small inference component to generate test data using Claude Code.

The challenge is achieving reliability, which hasn’t worked well for me yet. Print mode is slow and difficult to debug. So I use it far less than I’d like, despite loving the concept of mostly deterministic scripts with small inference components. Whether using the Claude SDK or the command-line print flag, I haven’t achieved the results I hoped for.

I’m drawn to Print Mode because inference is too much like a slot machine. Many programming tasks are actually quite rigid and deterministic. We love linters and formatters because they’re unambiguous. Anything we can fully automate, we should. Using an LLM for tasks that don’t require inference is the wrong approach in my book.

That’s what makes print mode appealing. If only it worked better. Use an LLM for the commit message, but regular scripts for the commit and gh pr commands. Make mock data loading 90% deterministic with only 10% inference.

I still use it, but I see more potential than I am currently leveraging.

Sub Tasks and Sub Agents

I use the task tool frequently for basic parallelization and context isolation. Anthropic recently launched an agents feature meant to streamline this process, but I haven’t found it easier to use.

Sub-tasks and sub-agents enable parallelism, but you must be careful. Tasks that don’t parallelize well — especially those mixing reads and writes — create chaos. Outside of investigative tasks, I don’t get good results. While sub-agents should preserve context better, I often get better results by starting new sessions, writing thoughts to Markdown files, or even switching to o3 in the chat interface.

Does It Help?

What’s interesting about workflow automation is that without rigorous rules that you consistently follow as a developer, simply taking time to talk to the machine and give clear instructions outperforms elaborate pre-written prompts.

For instance, I don’t use emojis or commit prefixes. I don’t enforce templates for pull requests either. As a result, there’s less structure for me to teach the machine.

I also lack the time and motivation to thoroughly evaluate all my created workflows. This prevents me from gaining confidence in their value.

Context engineering and management remain major challenges. Despite my efforts to help agents pull the right data from various files and commands, they don’t yet succeed reliably. They pull in too much or too little. Long sessions lead to forgotten context from the beginning. Whether done manually or with slash commands, the results feel too random. It’s hard enough with ad-hoc approaches, but static prompts and commands make it even harder.

The rule I have now is that if I do want to automate something, I must have done it a few times already, and then I evaluate whether the agent gets any better results through my automation. There’s no exact science to it, but I mostly measure that right now by letting it do the same task three times and looking at the variance manually as measured by: would I accept the result.

Keeping The Brain On

Forcing myself to evaluate the automation has another benefit: I’m less likely to just blindly assume it helps me.

Because there is a big hidden risk with automation through LLMs: it encourages mental disengagement. When you stop thinking like an engineer, quality drops, time gets wasted and you don’t understand and learn. LLMs are already bad enough as they are, but whenever I lean in on automation I notice that it becomes even easier to disengage. I tend to overestimate the agent’s capabilities with time. There are real dragons there!

You can still review things as they land, but it becomes increasingly harder to do so later. While LLMs are reducing the cost of refactoring, the cost doesn’t drop to zero, and regressions are common.