written on Thursday, March 24, 2016
Like everybody else this week, we had fun with the left-pad disaster. We're from the Python community, and our exposure to the node ecosystem is primarily on the client side. We're big fans of the ecosystem that has developed around React, and as such quite a bit of our daily workflow involves npm.
What frustrated me personally about the conversation that took place on the internet over the last few days, however, has nothing to do with npm, the guy who deleted his packages, any potential trademark disputes or the supposed inability of the JavaScript community to write functions to pad strings. It has more to do with how the ecosystem evolving around npm has created a most dangerous and irresponsible environment, which in many ways leaves me scared.
My opinion very quickly went from “Oh that's funny” to “This concerns me”.
When the left-pad disaster struck I had a brief look at Sentry's dependency tree. I should probably have done that before, but as long as things work you don't really tend to do that. At the time of writing we have 39 dependencies in our package.json. These dependencies are strongly vetted in the sense that we do not include anything there that we did not investigate properly. What we cannot do, however, is investigate every single dependency of those dependencies as well. The reason for this is how node dependencies explode: while we have 39 direct dependencies, it turns out we have more than a thousand dependencies in total.
To give you a comparison: the Sentry backend (Sentry server) has 45 direct dependencies. If you resolve all dependencies and install them as well, you end up with a total of 65 packages, which is significantly less. That is only 20 additional packages on top of what we depend on directly. The typical Python project would be similar. For instance the Flask framework depends on three (soon to be four with Click added) other packages: Werkzeug, Jinja2 and itsdangerous. Jinja2 additionally depends on MarkupSafe. All of those packages are written by the same author, however, and are merely split along rough lines of responsibility.
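The difference between direct and total dependencies can be made concrete with a small graph walk. The following is only a sketch: the graph is hand-written from the Flask example above (without Click), and `resolveAll` is a hypothetical helper, not how pip or npm actually resolve packages.

```javascript
// Dependency graph taken from the Flask example in the text.
// Each key maps a package to the packages it depends on directly.
const graph = {
  Flask: ["Werkzeug", "Jinja2", "itsdangerous"],
  Werkzeug: [],
  Jinja2: ["MarkupSafe"],
  itsdangerous: [],
  MarkupSafe: [],
};

// Hypothetical helper: collect every package reachable from `root`,
// i.e. direct dependencies plus everything they pull in.
function resolveAll(graph, root) {
  const seen = new Set();
  const stack = [...(graph[root] || [])];
  while (stack.length > 0) {
    const name = stack.pop();
    if (seen.has(name)) continue; // already visited
    seen.add(name);
    stack.push(...(graph[name] || []));
  }
  return seen;
}

const direct = graph.Flask.length;
const total = resolveAll(graph, "Flask").size;
console.log(`direct: ${direct}, total: ${total}`); // direct: 3, total: 4
```

For Flask the total is barely larger than the direct count. The same walk over an npm tree is what turns 39 direct dependencies into more than a thousand.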
Why is that important?
Let's talk about the cost of dependencies first. There are a few costs associated with every dependency, and most of you who have been programming for a few years will have encountered them.
The most obvious cost is that packages need to be downloaded from somewhere. This is a direct cost. The most shocking example I encountered for this is the isarray npm package. It's currently being downloaded just short of 19 million times a month from npm. The entire contents of that package can fit into a single line:
module.exports = Array.isArray || function(a) { return {}.toString.call(a) == '[object Array]' }
However, in addition to this line there is a bunch of extra content in the package. You actually end up downloading a 2.5KB tarball because of all the extra metadata, readme, license file, travis config, unit tests and makefile. On top of that npm adds 6KB for its own metadata. Let's round it to 8KB that need to be downloaded. Multiplied by the total number of downloads last month, the node community downloaded roughly 140GB worth of isarray. Measured by size, that's half of the monthly downloads Flask achieves.
The footprint of Sentry's server component is big when you add up all the dependencies. Yet the entire installation of Sentry from PyPI takes about 30 seconds, including compiling lxml. Installing the more than 1000 dependencies for the UI, though, takes I think about 5 minutes, even though you end up with a fraction of the code afterwards. Also, the further away you are from the npm CDN node, the higher the price you pay for the network roundtrips. I threw away my node cache for fun and ran npm install on Sentry. It takes about 4.5 minutes. And that's with good latency to npm, on an above-average network connection and a top-of-the-line MacBook Pro with an SSD. I don't want to know what the experience is for people on unreliable network connections. Afterwards I end up with 165MB in node_modules. For comparison, the entirety of Sentry's backend dependencies on the file system, including all metadata, is 60MB.
When we have a thousand different dependencies we have a thousand different licenses and copyright files. It really makes me wonder what the license screen of a node-powered desktop application would look like. But it's not only a thousand licenses, it's also a huge number of independent developers.
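The estimate above is simple enough to check. This is a back-of-the-envelope sketch using the figures from the text; the download count of 18 million is an assumption standing in for "short of 19 million", and decimal units (1GB = 1,000,000KB) are assumed as well.

```javascript
// Figures quoted in the text (approximate).
const perInstallKB = 8;          // 2.5KB tarball + 6KB npm metadata, rounded to 8KB
const downloadsPerMonth = 18e6;  // assumption: "short of 19 million" taken as 18 million

// Assumption: decimal units, 1GB = 1,000,000KB.
const totalGB = (perInstallKB * downloadsPerMonth) / 1e6;
console.log(`${totalGB}GB per month`); // 144GB per month
```

That lands at 144GB, in line with the roughly 140GB figure.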
This leads me to what my actual issue with micro-dependencies is: we do not have trust solved. Every once in a while people will bring up how we all would be better off if we PGP signed our Python packages. I think what a lot of people miss in the process is that signatures were never a technical problem but a trust and scaling problem.
I want to give you a practical example of what I mean by this. Say you build a program based on the Flask framework. You pull in a total of 4-5 dependencies for Flask alone, which are all signed off by me. The attack vectors to get untrusted code into Flask are:
All of those attack vectors I cover. I use my own software and monitor what releases are on PyPI, which is also the only place to install my software from. I 2FA all my logins where possible, I use long randomly generated passwords where I cannot, etc. None of my libraries use a dependency whose developer I do not trust. In essence, if you use Flask you only need to trust me to not be malicious or idiotic. Generally, by vetting me as a person (or maybe at a later point an organization that releases my libraries) you can be reasonably sure that what you install is what you expect and not something dangerous. If you develop large-scale Python applications you can do this for all your dependencies and you end up with a reasonably short list. More than that: because Python's import system is very limited, you end up with only one version of each library, so when you want to go into detail and sign off on releases you only need to do it once.
Back to Sentry's use of npm. It turns out we have four different versions of the same query string library because of different version pinning by different libraries. Fun.
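Finding such duplicates is a matter of walking the installed tree and grouping versions by package name. The following is a sketch under assumptions: the tree shape loosely resembles what `npm ls --json` prints (nested `dependencies` objects with a `version` each), and all package names and versions in `sampleTree` are made up for illustration.

```javascript
// Walk a nested dependency tree and record every version seen per package name.
function collectVersions(node, seen = new Map()) {
  for (const [name, dep] of Object.entries(node.dependencies || {})) {
    if (!seen.has(name)) seen.set(name, new Set());
    seen.get(name).add(dep.version);
    collectVersions(dep, seen); // recurse into nested dependencies
  }
  return seen;
}

// Keep only the packages that are installed in more than one version.
function findDuplicates(tree) {
  const dupes = {};
  for (const [name, versions] of collectVersions(tree)) {
    if (versions.size > 1) dupes[name] = [...versions].sort();
  }
  return dupes;
}

// Made-up sample data; "example-querystring" is a hypothetical package.
const sampleTree = {
  name: "app",
  version: "1.0.0",
  dependencies: {
    "lib-a": {
      version: "2.0.0",
      dependencies: { "example-querystring": { version: "1.2.0" } },
    },
    "lib-b": {
      version: "0.3.1",
      dependencies: { "example-querystring": { version: "2.0.1" } },
    },
    "example-querystring": { version: "2.0.1" },
  },
};

const duplicates = findDuplicates(sampleTree);
console.log(duplicates); // { 'example-querystring': [ '1.2.0', '2.0.1' ] }
```

Every entry in that result is a library you would have to review once per version if you wanted to sign off on it.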
Those dependencies can easily end up being high value targets because of how few people know about them. juliangruber's "isarray" has 15 stars on GitHub and only two people watch the repository. It's downloaded 18 million times a month. Sentry depends on it 20 times: 14 times it's a pin for 0.0.1, once it's a pin for ^1.0.0 and 5 times for ~1.0.0. Any pin for anything other than a strict version match is a disaster waiting to happen if someone managed to push out a point release for it by stealing juliangruber's credentials.
Now one could argue that the same problem applies if people hack my account and push out a new Flask release. But I can promise you I will notice a release of one of my ~5 libraries because a) I monitor those packages and b) other people would notice a release. I doubt people would notice a new isarray release. Yet isarray is not sandboxed and runs with the same rights as the rest of the code you have.
For instance sindresorhus maintains 827 npm packages, most of which are probably one-liners. I have no idea how good his opsec is, but my assumption is that it's significantly harder for him to ensure that all of those are actually his releases than it is for me, as I only have to look over a handful.
There is common talk that package signatures would solve a lot of those issues, but at the end of the day, because of the trust we already place in PyPI and npm, we get very little extra security from a package signature compared to just trusting the username/password auth on package publish.
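To see why only the strict pins are safe, here is a deliberately simplified sketch of how caret and tilde ranges behave. It only handles plain `x.y.z` versions with an optional `^` or `~` prefix and is not the real semver implementation npm uses; `satisfies` and `exposedCount` are hypothetical helpers, and the counts are the ones quoted above. It also assumes a fresh install without a shrinkwrap file.

```javascript
// Parse "x.y.z" into [x, y, z].
function parse(version) {
  return version.split(".").map(Number);
}

// Compare two parsed versions: -1, 0 or 1.
function compare(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Simplified range check: exact match, ~x.y.z or ^x.y.z only.
function satisfies(version, spec) {
  const v = parse(version);
  const prefix = spec[0];
  if (prefix !== "^" && prefix !== "~") {
    return compare(v, parse(spec)) === 0; // strict pin
  }
  const base = parse(spec.slice(1));
  let upper;
  if (prefix === "~") {
    upper = [base[0], base[1] + 1, 0];   // ~1.0.0 allows >=1.0.0 <1.1.0
  } else if (base[0] > 0) {
    upper = [base[0] + 1, 0, 0];         // ^1.0.0 allows >=1.0.0 <2.0.0
  } else if (base[1] > 0) {
    upper = [0, base[1] + 1, 0];         // ^0.2.3 allows >=0.2.3 <0.3.0
  } else {
    upper = [0, 0, base[2] + 1];         // ^0.0.1 allows only 0.0.1
  }
  return compare(v, base) >= 0 && compare(v, upper) < 0;
}

// How Sentry's tree referenced isarray, per the text.
const sentryPins = [
  { spec: "0.0.1", count: 14 },
  { spec: "^1.0.0", count: 1 },
  { spec: "~1.0.0", count: 5 },
];

// Number of dependents that would accept a newly published version.
function exposedCount(newVersion) {
  return sentryPins
    .filter((pin) => satisfies(newVersion, pin.spec))
    .reduce((sum, pin) => sum + pin.count, 0);
}

console.log(exposedCount("1.0.1")); // 6
console.log(exposedCount("1.1.0")); // 1
console.log(exposedCount("2.0.0")); // 0
```

Under these assumptions a malicious 1.0.1 release would be accepted by 6 of the 20 places that depend on isarray, without anyone changing a line in package.json.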
Why package signatures are not the holy grail was covered by Donald Stufft aka Mr PyPI. You should definitely read that since he's describing the overarching issue much better than I could ever do.
To be perfectly honest: I'm legitimately scared about the integrity of node's ecosystem and this worry does not go away. Among other things I'm using keybase, and keybase uses unpinned node libraries left and right. keybase has 225 node dependencies from a quick look. Among those are many partially pinned one-liner libraries for which it would be easy enough to roll out a backdoored update if one got hold of the credentials.
Update: it has been pointed out that keybase uses shrinkwrap in the node client and that the new client is written in Go. Source
If micro-dependencies want to have a future then something must change in npm. Maybe they would have to get a specific tag so that the system can automatically run analysis to spot unexpected updates. Probably they should require a CC0 license to simplify copyright dialogs etc.
But as it stands right now I feel like this entire thing is a huge disaster waiting to happen, and if you are not using npm shrinkwrap yet you had better get started quickly.