written on Tuesday, November 26, 2024
There is currently an effort underway to build a new universal lockfile standard for Python, most of which is taking place on the Python discussion forum. This initiative has highlighted the difficulty of creating a standard that satisfies everyone. It has become clear that different Python packaging tools have slightly different ideas of what a lockfile is supposed to look like, or even what it is used for.
In those discussions, however, another small aspect re-emerged: Python has a metadata problem. Python's metadata system is too complex and suffers from what I would call a “lack of constraints”.
JavaScript provides an excellent example of how constraints can simplify and improve a system. In JavaScript, metadata is straightforward. Whether you are developing a package locally or using one from npm, the metadata presents itself the same way. There is a single package.json file that contains the most important metadata of a package, such as its name, version, or dependencies. This simplicity imposes significant but beneficial constraints.
These constraints offer several benefits: a resolver can learn everything it needs to know about a release from a single request to the registry [2], and even platform-specific binaries can be handled declaratively, via optionalDependencies that the installer filters by operating system and CPU architecture [1].
In contrast, Python has historically placed very few constraints on metadata. For example, the old setup.py based build system essentially allowed arbitrary code execution during the build process. At one point it was at least strongly suggested that the version produced by that build step should match what is uploaded to PyPI. In practice, however, lying about the version is tolerated: you can upload a source distribution to PyPI that claims to be 2.0 but will in fact install 2.0+somethinghere, or a completely different version entirely.
What happens is that both before a package is published to PyPI and when a package is installed locally after downloading, the metadata is generated from scratch. Not only does that mean the metadata does not have to match, it also means that it's allowed to be completely different. It's absolutely okay for a package to claim it depends on cool-dependency on your machine, but on uncool-dependency on my machine. Or to depend on different packages depending on the time of day or the phase of the moon.
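To make this concrete, here is a hypothetical (and entirely legal) setup.py in the spirit of the above. The package name and dependency names are made up for illustration; the point is simply that both the version and the dependency list are computed fresh every time the file runs:

```python
# setup.py -- hypothetical sketch of what unconstrained, dynamic metadata allows.
# Every build or installation re-runs this code, so the resulting metadata can
# differ between machines, days, or environments.
import datetime
import platform

from setuptools import setup

# The version is invented at build time; it need not match whatever the
# sdist on PyPI claims to be.
version = "2.0+" + datetime.date.today().strftime("%Y%m%d")

# The dependencies differ depending on where (or when) the build happens.
if platform.system() == "Linux":
    dependencies = ["cool-dependency"]
else:
    dependencies = ["uncool-dependency"]

setup(
    name="example-package",
    version=version,
    install_requires=dependencies,
)
```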
Editable installs and caching are particularly problematic since metadata could become invalid almost immediately after being written. [3]
Some of this has improved somewhat because the new pyproject.toml standard encourages static metadata. However, build systems are still entirely allowed to override that by falling back to what is called “dynamic metadata”, and this is commonly done.
In practice this system imposes a tremendous tax on everybody, one that is easily missed.
Disjointed and complex metadata access: there is no clear relationship between the PyPI package name and the installed Python modules. If you know what the PyPI package name is, you can access metadata via importlib.metadata. That metadata is not read from pyproject.toml, even if it's static; instead the package name is used to look up the metadata from the .dist-info folder (specifically the METADATA file therein) installed into site-packages.
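As a rough illustration (using flask purely as a stand-in for any installed distribution), this is what that access path looks like:

```python
# Metadata is served from the .dist-info directory in site-packages (the
# METADATA file written at install time), not from pyproject.toml.
from importlib.metadata import metadata, requires, version

print(version("flask"))               # whatever version was installed
print(metadata("flask")["Summary"])   # parsed from .dist-info/METADATA
print(requires("flask"))              # dependencies as recorded at install time

# Editing pyproject.toml changes none of the above until the package is
# re-installed and the .dist-info folder is regenerated.
```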
Mandatory metadata re-generation: As a consequence, if you edit pyproject.toml to change a piece of metadata, you need to re-install the package for that metadata to be updated in the .dist-info. People commonly forget to do that, so desynchronized metadata is very common. This is true even for static metadata today!
Unclear cache invalidation: Because metadata can be dynamic, it's not clear when a package should automatically be re-installed; when dynamic metadata is used, it's not enough to just track pyproject.toml for changes. uv, for instance, has a rather complex, explicit cache management system that lets you help it detect outdated metadata: if you know that the version information incorporates the git hash, you can tell uv to pay attention to git commits (sketched below). This obviously is non-standardized, requires uv to understand version control systems, and is also not shared with other tools.
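For reference, at the time of writing such a hint looks roughly like this in pyproject.toml (the exact keys are uv-specific and may change between versions):

```toml
# pyproject.toml -- sketch of uv's explicit cache invalidation hints
[tool.uv]
# Consider the cached metadata stale when pyproject.toml changes or when a
# new git commit is made (e.g. because the version embeds the commit hash).
cache-keys = [{ file = "pyproject.toml" }, { git = { commit = true } }]
```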
Fragmented metadata storage: even where generated metadata ends up is complicated, and different systems have slightly different behavior for where they store it.
Dynamic metadata makes resolvers slow: Dynamic metadata makes the job of resolvers and installers very hard and slows them down. Today, for instance, even advanced resolvers like poetry or uv are sometimes unable to install the right packages, because they assume that dependency metadata is consistent across sdists and wheels. Yet there are plenty of sdists on PyPI that publish incomplete dependency metadata: whatever the build step happened to produce on the developer's machine is what ends up cached on PyPI.
Not getting this right can be the difference between hitting one static URL that contains all the metadata, and downloading an archive, creating a virtualenv, installing build dependencies, building the entire sdist, and only then reading the final generated metadata. That is many orders of magnitude of difference in execution time.
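For the fast path, PyPI's JSON API can in principle hand a resolver the dependency metadata of a release from a single static URL. The sketch below (using flask as an arbitrary example) shows the idea, and also why it breaks down when the uploaded metadata is incomplete:

```python
# Minimal sketch of the "one static URL" fast path using PyPI's JSON API.
import json
from urllib.request import urlopen


def fetch_requires_dist(name: str, version: str) -> list[str] | None:
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urlopen(url) as response:
        info = json.load(response)["info"]
    # For releases whose uploaded metadata is missing or untrustworthy this is
    # None or incomplete, and a resolver has to fall back to downloading and
    # building the sdist just to learn what the package depends on.
    return info.get("requires_dist")


print(fetch_requires_dist("flask", "3.0.0"))
```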
This also extends to caching. If the metadata can constantly change, how would a resolver cache it? Is it required to build all possible source distributions to determine the metadata as part of resolving?
Cognitive complexity: The system introduces an enormous cognitive overhead which makes it very hard for users to understand, particularly when things go wrong. Incorrectly cached metadata can be almost impossible for a user to debug, because they do not understand what is going on. Their pyproject.toml shows the right information, yet for some reason it behaves incorrectly. Most people don't know what "egg info" or "dist info" is, or why an sdist has its metadata in a different location than a wheel or a local checkout.
Having support for dynamic metadata also means that developers continue to maintain elaborate and confusing systems. For instance, there is a plugin for hatch that dynamically creates a readme, which requires arbitrary Python code to run just to display documentation. There are plugins to automatically change versions to incorporate git hashes. As a result, to figure out what version you actually have installed it's not enough to look at a single file; you might have to rely on a tool to tell you what's going on.
The challenge with dynamic metadata in Python is vast, but unless you are writing a resolver or packaging tool, you're not going to experience the pain as much. You might in fact quite enjoy the power of dynamic metadata. Unsurprisingly, bringing up the idea of removing it is very badly received; too many workflows seemingly rely on it.
At this point fixing this problem might be really hard, because it's a social problem more than a technical one. Had the constraints been in place from the start, these weird use cases would never have emerged. But because the constraints were not there, people were free to go to town leveraging the flexibility, with all the consequences that causes.
I think at this point it's worth moving the cheese, but it's unclear if this can be done through a standard. Maybe the solution will be for tools like uv or poetry to warn if dynamic metadata is used and strongly discourage it. Then over time the users of packages that use dynamic metadata will start to urge the package authors to stop using it.
The cost of dynamic metadata is real, but it's felt only in small ways all the time. You notice it a bit when your resolver is slower than it has to be, you notice it if your packaging tool installs the wrong dependency, you notice it when you have to read the manual for the first time because you need to reconfigure your cache key or force a package to constantly reinstall, you notice it when you have to re-install your local dependencies over and over so they don't break. There are many ways you notice it. You don't notice it as a roadblock, just as a tiny, tiny tax. Except that it is a tax we all pay, and it makes the user experience significantly worse than it could be.
The deeper lesson here is that if you give developers too much flexibility, they will inevitably push the boundaries, and as we can see, that can have significant downsides. Because Python's packaging ecosystem lacked constraints from the start, imposing them now has become a daunting challenge. Meanwhile, other ecosystems, like JavaScript's, took a more structured approach early on, avoiding many of these pitfalls entirely.
[1] You can see how this works in action for sentry-cli for instance. The @sentry/cli package declares all its platform-specific dependencies as optionalDependencies (relevant package.json). Each platform build has a filter in its package.json for os and cpu. For instance this is what the arm64 linux binary dependency looks like: package.json. npm will attempt to install all optional dependencies, but it will skip over the ones that are not compatible with the current platform.
[2] For @sentry/cli at version 2.39.0 for instance this means that this singular URL will return all the information that a resolver needs: registry.npmjs.org/@sentry/cli/2.39.0
[3] A common error in the past was to receive a pkg_resources.DistributionNotFound exception when trying to run a script in local development.