Armin Ronacher's Thoughts and Writings

Constraints are Good: Python's Metadata Dilemma

written on November 26, 2024

There is currently an effort underway to build a new universal lockfile standard for Python, most of which is taking place on the Python discussion forum. This initiative has highlighted the difficulty of creating a standard that satisfies everyone. It has become clear that different Python packaging tools are having slightly different ideas in mind of what a lockfile is supposed to look like or even be used for.

In those discussions however also a small other aspect re-emerged: Python has a metadata problem. Python's metadata system is too complex and suffers from what I would call “lack of constraints”.

JavaScript: Example of Useful Constraints

JavaScript provides an excellent example of how constraints can simplify and improve a system. In JavaScript, metadata is straightforward. Whether you develop against a package locally or if you are using a package from npm, metadata represents itself the same way. There is a single package.json file that contains the most important metadata of a package such as name, version or dependencies. This simplicity imposes significant but beneficial constraints:

These constraints offer several benefits:

Python: The Cost of Too Few Constraints

In contrast, Python has historically placed very few constraints on metadata. For example, the old setup.py based build system essentially allowed arbitrary code execution during the build process. At one point it was at least strongly suggested that the version produced by that build step better match what is uploaded to PyPI. However, in practice, if you lie about the version that is okay too. You could upload a source distribution to PyPI that claims it's 2.0 but will in fact install 2.0+somethinghere or a completely different version entirely.

What happens is that both before a package is published to PyPI and when a package is installed locally after downloading, the metadata is generated from scratch. Not only does that mean the metadata does not have to match, it also means that it's allowed to be completely different. It's absolutely okay for a package to claim it depends on cool-dependency on your machine, but on uncool-dependency on my machine. Or to dependent on different packages depending on the time of the day of the phase of the moon.

Editable installs and caching are particularly problematic since metadata could become invalid almost immediately after being written. 3

Some of this has been somewhat improved because the new pyproject.toml standard encourages static metadata. However build systems are entirely allowed to override that by falling back to what is called “dynamic metadata” and this is something that is commonly done.

In practice this system incurs a tremendous tax to everybody that can be easily missed.

Not getting this right can be the difference of hitting one static URL with all the metadata, and downloading a zip file, creating a virtualenv, installing build dependencies, generating an entire sdist and then reading the final generated metadata. Many orders of magnitude difference in time it takes to execute.

This also extends to caching. If the metadata can constantly change, how would a resolver cache it? Is it required to build all possible source distributions to determine the metadata as part of resolving?

Having support for dynamic metadata also means that developers continue to maintain elaborate and confusing systems. For instance there is a plugin for hatch that dynamically creates a readme 4, requiring even arbitrary Python code to run to display documentation. There are plugins to automatically change versions to incorporate git version hashes. As a result to figure out what version you actually have installed it's not just enough to look into a single file, you might have to rely on a tool to tell you what's going on.

Moving The Cheese

The challenge with dynamic metadata in Python is vast, but unless you are writing a resolver or packaging tool, you're not going to experience the pain as much. You might in fact quite enjoy the power of dynamic metadata. Unsurprisingly bringing up the idea to remove it is very badly received. There are so many workflows seemingly relying on it.

At this point fixing this problem might be really hard because it's a social problem more than a technical one. If the constraint would have been placed there in the first place, these weird use cases would never have emerged. But because the constraints were not there, people were free to go to town with leveraging it with all the consequences it causes.

I think at this point it's worth moving the cheese, but it's unclear if this can be done through a standard. Maybe the solution will be for tools like uv or poetry to warn if dynamic metadata is used and strongly discourage it. Then over time the users of packages that use dynamic metadata will start to urge the package authors to stop using it.

The cost of dynamic metadata is real, but it's felt only in small ways all the time. You notice it a bit when your resolver is slower than it has to, you notice it if your packaging tool installs the wrong dependency, you notice it if you need to read the manual for the first time when you need to reconfigure your cache-key or force a package to constantly reinstall, you notice it if you need to re-install your local dependencies over and over for them not to break. There are many ways you notice it. You don't notice it as a roadblock, just as a tiny, tiny tax. Except that is a tax we all pay and it makes the user experience significantly worse compared to what it could be.

The deeper lesson here is that if you give developers too much flexibility, they will inevitably push the boundaries and that can have significant downsides as we can see. Because Python's packaging ecosystem lacked constraints from the start, imposing them now has become a daunting challenge. Meanwhile, other ecosystems, like JavaScript's, took a more structured approach early on, avoiding many of these pitfalls entirely.

  1. You can see how this works in action for sentry-cli for instance. The @sentry/cli package declares all its platform specific dependencies as optionalDependencies (relevant package.json). Each platform build has a filter in its package.json for os and cpu. For instance this is what the arm64 linux binary dependency looks like: package.json. npm will attempt to install all optional dependencies, but it will skip over the ones that are not compatible with the current platform.

  2. For @sentry/cli at version 2.39.0 for instance this means that this singular URL will return all the information that a resolver needs: registry.npmjs.org/@sentry/cli/2.39.0

  3. A common error in the past was to receive a pkg_resources.DistributionNotFound exception when trying to run a script in local development

  4. I got some flak on Bluesky for throwing readme generators under the bus. While they do not present the same problem when it comes to metadata like dependencies and versions do, they do still increase the complexity. In an ideal world what you find in site-packages represents what you have in your version control and there is a README.md file right there. That's what you have in JavaScript, Rust and plenty of other ecosystems. What we have however is a build step (either dynamic or copying) taking that readme file, and placing it in a RFC 5322 header encoded file in a dist info. So instead of "command clicking" on a dependency and finding the readme, we need special tools or arcane knowledge if we want to read the readme files locally.

This entry was tagged python