
We Can Just Measure Things

written on Tuesday, June 17, 2025

This week I spent time with friends letting agents go wild to see what we could build in 24 hours. I took some notes for myself to reflect on that experience. I won't bore you with another vibecoding post, but you can read Peter's post about how that went.

As fun as it was, it was also frustrating in other, entirely predictable ways. It became a meme how much I hated working with Xcode for this project. This got me thinking: this has been an entirely unacceptable experience for a long time, but with programming agents, the pain becomes measurable.

When I first dove into programming I found the idea of RTFM quite hilarious. “Why are you asking dumb questions, just read it up.” The unfortunate reality is that the manual often doesn't exist — or is wrong. In fact, we as engineers are quite willing to subject each other to completely inadequate tooling, bad or missing documentation and ridiculous API footguns all the time. “User error” is what we used to call this; nowadays it's a “skill issue”. It puts the blame on the user and absolves the creator, at least momentarily. For APIs it can be random crashes if you use a function wrong; for programs it can be an impossible-to-navigate UI or a lack of error messages. There are many different ways in which we humans get stuck.

What agents change about this is that I can subject them to something I wouldn't really want to subject other developers to: measuring. I picked the language for my current project by running basic evals and it worked well. I learned from that that there are objectively better and worse languages when it comes to my particular problem. The choice, however, is not just about how much the AI knows about the language from the corpus of examples it saw during training. It's also tooling, the inherent capabilities of the language, ecosystem churn and other aspects.
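To make “basic evals” a little more concrete, here is a minimal sketch of what such a comparison could look like. The `run-agent` command, the task prompt and the per-language check commands are placeholders I made up for illustration, not a real harness; the point is only the shape of the measurement: same task, fresh directory, count how often the result actually builds and passes its tests.

```python
# Minimal sketch of a language-selection eval.
# Assumption: a hypothetical `run-agent` CLI that takes a prompt and a
# working directory; swap in whatever agent runner you actually use.
import subprocess
import tempfile

TASK_PROMPT = "Implement a small CLI that parses a CSV file and prints column stats."

# Candidate languages and the command that verifies the agent's output.
CANDIDATES = {
    "go": ["go", "test", "./..."],
    "rust": ["cargo", "test"],
    "python": ["python", "-m", "pytest"],
}

RUNS_PER_LANGUAGE = 5  # repeat to smooth out agent variance


def run_once(language: str, check_cmd: list[str]) -> bool:
    """Let the agent attempt the task in a fresh directory, then verify."""
    with tempfile.TemporaryDirectory() as workdir:
        agent = subprocess.run(
            ["run-agent", "--cwd", workdir, "--prompt",
             f"{TASK_PROMPT} Use {language}."],
            capture_output=True, text=True, timeout=1800,
        )
        if agent.returncode != 0:
            return False
        check = subprocess.run(check_cmd, cwd=workdir, capture_output=True)
        return check.returncode == 0


def main() -> None:
    for language, check_cmd in CANDIDATES.items():
        successes = sum(
            run_once(language, check_cmd) for _ in range(RUNS_PER_LANGUAGE)
        )
        print(f"{language}: {successes}/{RUNS_PER_LANGUAGE} runs passed")


if __name__ == "__main__":
    main()
```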

Using agents to measure code quality is great because agents don't judge me, but they do judge the code they are writing. Not all agents will swear, but they will express frustration with libraries when their loops don't go well, or they will give up. That opens up an opportunity to bring some measurement not to agent performance, but to the health of a project.

We should pay more attention to how healthy engineering teams are, and that starts with the code base. Using agents, we can put some numbers to it in a way we cannot with humans (or only in a very slow and expensive way). We can figure out how successful agents are at using the things we are creating, in rather objective ways, which is in many ways a proxy for how humans experience working with the code. Getting together fresh souls to walk them through a tutorial or some tasks is laborious and expensive. Getting agents that have never seen a codebase to start using a library is repeatable, rather cheap, fast and, if set up the right way, very objective. It also takes the emotion out of it, and you can run the experiment multiple times.
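A similar sketch works for judging a codebase or library rather than a language: run a number of fresh agent sessions against the same tasks, then count successes and frustration signals in the transcripts. The results-directory layout and the marker phrases below are assumptions made up for illustration; what matters is only that the numbers are repeatable and comparable across runs.

```python
# Sketch of scoring project "health" from repeated fresh-agent runs.
# Assumption: each run leaves a plain-text transcript plus a status file in
# a results directory; the layout and marker phrases are hypothetical.
import json
import re
from collections import Counter
from pathlib import Path

RESULTS_DIR = Path("agent-runs")  # one subdirectory per fresh-agent run

# Crude signals of an agent having a bad time with the codebase.
FRUSTRATION_MARKERS = [
    r"giving up",
    r"as a last resort",
    r"workaround",
    r"deprecat",
    r"could not find .* in the documentation",
]


def score_run(run_dir: Path) -> tuple[bool, Counter]:
    """Return (task succeeded, frustration-marker counts) for one run."""
    status = json.loads((run_dir / "status.json").read_text())
    transcript = (run_dir / "transcript.txt").read_text().lower()
    hits = Counter()
    for marker in FRUSTRATION_MARKERS:
        hits[marker] += len(re.findall(marker, transcript))
    return status.get("success", False), hits


def main() -> None:
    runs = sorted(p for p in RESULTS_DIR.iterdir() if p.is_dir())
    successes, total_hits = 0, Counter()
    for run_dir in runs:
        ok, hits = score_run(run_dir)
        successes += ok
        total_hits.update(hits)
    print(f"success rate: {successes}/{len(runs)}")
    for marker, count in total_hits.most_common():
        print(f"{count:4d}  {marker}")


if __name__ == "__main__":
    main()
```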

Now obviously we can have debates over whether the type of code we write with an agent is objectively beautiful, or whether the way agents execute tools creates the right type of tools. This is a debate worth having. Right at this very moment, though, what programming agents need to be successful is rather well aligned with what humans need.

So what works better than other things? For now the basic indicator, for agents and humans alike, is simple:

When an agent struggles, so does a human. There is a lot of code and tooling out there which is objectively not good but, for one reason or another, became dominant. If you want to start paying attention to technology choices, or you want to start writing your own libraries, you can now use agents to evaluate the developer experience.

Because so can your users. I can confidently say it's not just me who does not like Xcode; my agent also expresses frustration, measurably so.
