“STD” stands for Sleazy, Tattered and Dead
My latest discovery of a behavior bug in Python earned me some negative comments. I have to admit that the way I blogged about it and how I reported the bug was not that fair. It was just one bug in a million and I was surprised how late it was discovered. I was really puzzled because of that.
Anyhow. That's not what I want to blog about here; more about my bad experiences with the standard library in general. Everybody who knows me knows that I hate things quickly and that I'm crazy about beautiful code. Especially in Python land there are some guys like Christopher Lenz and the trac team or Georg Brandl who write beautiful code just as I like it :)
The Python Standard Library
But why am I especially mad about the standard library? The reason for this is that the standard library has some problems (caused by the fact that it's the standard library).
A lot of stuff ended up in Python a long, long time ago. And that was fine for the time. I can't blame anyone for the state of the standard library. A lot of stuff was added to it long before I even knew what a computer was. Let's take cgi as an excellent example. When cgi was created, CGI was a nice little protocol that just worked. And I really have to give kudos to the developers of this library for doing tons of stuff right. It may sound awkward if you know the library in detail, but believe me, the fact alone that it worked nearly flawlessly with WSGI is noteworthy. It's incredible if you think that WSGI was added years after the library was written. So some forward thinking in terms of “decoupling” it from CGI served the library very well.
However the age of cgi shines through. I just recently discovered that the infamous cgi.FieldStorage provides multipart/mixed support. This is incredible. While it's specified as part of HTML4 it was never implemented in a browser people actually used on the world-wide web.
The downside is that, like many other libraries in the stdlib, it mixes a high-level API with low-level parsing features. For example if you access a key in a field storage you can't trust what comes back. It could be a string, a FieldStorage, or a list of strings or FieldStorages or both at the same time, thanks to the multipart/mixed support. Hardly anyone knows that, because people trust the library to return the value they expect. I guess there are enough Python scripts out there that would die with an internal server error if you pass input data via a form that changes a field from a string into a FieldStorage. The days when you could trust your user to submit the data you expect are long gone. And security and browser bugs are things that change on a nearly monthly basis.
To safely use FieldStorage you have to un-magicify it by walking it and unpacking the data into something you can trust: moving all files into one dict, all strings into another one etc. So in older Werkzeug versions and in current Paste/WebOb versions the field storage is traversed and preprocessed before the data is handed to the developer.
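A minimal sketch of that unpacking step, under loud assumptions: it does not import cgi at all, it takes a plain mapping rather than a real FieldStorage, and `flatten_fields` and `FakeUpload` are hypothetical names invented for this illustration. It is not the code Werkzeug or WebOb actually use. An item counts as an upload if it carries a non-empty `filename` attribute; everything else is coerced to text.

```python
def flatten_fields(fields):
    """Split a "magic" form mapping into two plain dicts.

    `fields` maps names to a single value or a list of values.
    Returns (strings, files); both map each name to a list.
    """
    strings, files = {}, {}
    for name, value in fields.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if getattr(item, "filename", None):
                # Looks like an upload: keep the object itself.
                files.setdefault(name, []).append(item)
            else:
                # Field-like objects keep their text in `.value`;
                # plain strings are used as they are.
                text = getattr(item, "value", item)
                strings.setdefault(name, []).append(str(text))
    return strings, files


class FakeUpload:
    """Hypothetical stand-in for an uploaded-file item."""

    def __init__(self, filename, value):
        self.filename = filename
        self.value = value
```

After this step every lookup has one predictable shape (a list of strings, or a list of upload objects), so a field that unexpectedly changes type can no longer crash the handler.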
And the cgi module is one of those I had the least problems with. Besides having an archaic API it also features some serious fuckups like accessing sys.argv when you least expect it, undocumented logging code and years of backwards compatibility. Thanks to the magic API it was also impossible to select the upload storage based on the content length or stop parsing if resources are exhausted (someone trying to submit a gigabyte of form data to the server, which is always stored in memory).
The Cookie module is one of my “favourites”. It comes with backwards compatible code that can be used to let an attacker execute arbitrary code on the server and has an API that is so weird and magical that few in the history of Python frameworks exposed that API to the user. On parsing errors it drops all the cookie values and it does not cope very well with real-world cookies, which means that you can lose a cookie very fast. The morsel stuff it uses internally is written in a way that you can only add support for stuff like HttpOnly by subclassing it and overriding builtin and undocumented attributes.
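To illustrate the alternative to dropping everything on the first parsing error, here is a rough sketch of a lenient parser for the Cookie request header. `parse_cookie_header` is a hypothetical helper, not part of the Cookie module, and it ignores cases such as semicolons inside quoted values.

```python
def parse_cookie_header(header):
    """Parse a Cookie request header, skipping malformed pairs."""
    cookies = {}
    for chunk in header.split(";"):
        name, sep, value = chunk.strip().partition("=")
        name = name.strip()
        if not sep or not name:
            # Not a name=value pair: drop this chunk only,
            # keep whatever else was parsed.
            continue
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]  # strip surrounding double quotes
        cookies[name] = value
    return cookies
```

The point of the design is that one broken pair costs you one cookie, not the whole jar.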
Until recently urllib didn't have proper timeout support, making it practically impossible to use it safely in a web application. The socket.setdefaulttimeout() hacks have so many problems, don't even get me started. And I'm not mad that there was no timeout support. My problem solely is that the library was written in a way that it's impossible to add such missing features by hand without rewriting the library.
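One of the problems with the setdefaulttimeout() approach can be shown without any network access: the setting is process-wide, so every socket created afterwards inherits it. The sketch below uses the Python 3 module names (`urllib.request`); the helper names and the 5 second default are assumptions made for this example.

```python
import socket
import urllib.request


def default_timeout_seen_by_new_socket(seconds):
    """Return the timeout an unrelated, freshly created socket ends up with."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)  # global, affects the whole process
    try:
        unrelated = socket.socket()  # e.g. a socket opened by a database driver
        try:
            return unrelated.gettimeout()
        finally:
            unrelated.close()
    finally:
        socket.setdefaulttimeout(previous)  # restore the global state


def fetch(url, timeout=5.0):
    """Per-call timeout: only this one request is affected."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

With the `timeout` argument the limit travels with the call instead of living in global state shared by every thread and library in the process.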
BaseHTTPServer is another library that has magic built-in. Without copy/paste of undocumented code you can't write a web server that listens for all HTTP methods. (Not true, there is a way. You could override __getattr__, look for do_* attributes and forward them to a proxy method but …)
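The __getattr__ workaround could look roughly like this. The sketch uses the Python 3 location of the class (`http.server`); `AnyMethodHandler` and `handle_any` are names made up for this example, and the response body is only a placeholder.

```python
from http.server import BaseHTTPRequestHandler


class AnyMethodHandler(BaseHTTPRequestHandler):
    """Answer every HTTP method through one proxy method."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, which is the
        # case for every do_* method that is not defined on the class.
        if name.startswith("do_"):
            return self.handle_any
        raise AttributeError(name)

    def handle_any(self):
        # self.command holds the request method, e.g. "PROPFIND".
        body = ("%s %s\n" % (self.command, self.path)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if self.command != "HEAD":
            self.wfile.write(body)
```

To try it, pass the class to `http.server.HTTPServer` together with an address of your choice and call `serve_forever()`. It works, but it relies on the base class looking up `do_<METHOD>` via attribute access, which is exactly the kind of magic complained about above.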
Do you know the codeop module? It's used to implement the Python-version of the Python interactive shell. It works like this: compile the code, compile the code with a newline at the end, compile the code with two newlines at the end and compare the string value of exceptions raised to figure out if we are at the end of the input -.-
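From the outside, the result of that dance is a three-way answer. A small sketch using `codeop.compile_command`, with `classify` being a hypothetical wrapper written for this example:

```python
import codeop


def classify(source):
    """Return 'complete', 'incomplete' or 'invalid' for a chunk of input."""
    try:
        code = codeop.compile_command(source)
    except (SyntaxError, ValueError, OverflowError):
        # The input can never become valid, no matter what follows.
        return "invalid"
    # None means: syntactically fine so far, but more input is needed.
    return "incomplete" if code is None else "complete"
```

For example `classify("x = 1")` gives `'complete'`, `classify("if x:")` gives `'incomplete'` and `classify("a b")` gives `'invalid'`.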
Until Python 2.6 there was no documented way to load a file from a Zip file as a file-like object, rather than as a complete string.
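For reference, the file-like access looks like this with `ZipFile.open()`. The archive is built in memory so the sketch is self-contained; the member name, the helper names and the tiny chunk size are made up for the demonstration.

```python
import io
import zipfile


def build_demo_archive():
    """Create a small in-memory zip archive with one member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("hello.txt", "hello world")
    buffer.seek(0)
    return buffer


def read_member_in_chunks(archive, member, chunk_size=4):
    """Yield the contents of one archive member piece by piece."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as fp:  # file-like object, not one big string
            while True:
                chunk = fp.read(chunk_size)
                if not chunk:
                    break
                yield chunk
```

Joining the chunks from `read_member_in_chunks(build_demo_archive(), "hello.txt")` gives back `b"hello world"` without ever holding the full member in a single read.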
Do you know imaplib? In the real world it's nearly unusable because it stops half way and returns values in a half-parsed and undocumented format, making it impossible to actually do anything useful with it except for the very basics.
And I'm not talking about stuff that was now finally deprecated like dircache, sv or god knows what is in the stdlib nobody knows about or locale which is not process-safe and so useless that the Babel guys see no way except reimplementing everything from it as separate new library.
Why is it in that Sad State?
How does stuff go into the standard library? Maybe we should go there. I'm not sure how that happens. Some stuff went into the standard library I'm very happy about. Modules like threading, multiprocessing, urllib, json etc. These modules have one thing in common: They either implement something that is heavily platform dependent, essential or standardized and stable. Other stuff went into the standard library just to decay there. For example we have the cgi module, the webbrowser module (which should be part of the GUI libraries, not a Programming language's standard library) etc.
What is a standard library for anyways? A standard library is shipped as part of the language and should make it possible to make applications platform independent. To provide often used features and implement them in a way that everybody can use it, from any context. Not just single-threaded command line applications.
A cool standard library would provide IO access, filesystem and platform introspection helpers, access to the programming language internals (an interface for debuggers, access to the AST / compiler / bytecode / garbage collector etc.), support for package distribution, ways to extend the import system etc. There would be kick-ass unicode support, regular expressions, datetime objects, collections and other data structures etc.
Not standard library worthy is stuff like any kind of web development support. These things change quicker than you can sing the Spam song. Also a UI toolkit like Tk is something that's not standard library worthy (especially because it's rendering widgets like it's 1985). Why is there support for wave files? Especially in such a useful way. Why are there 5 or more file-system databases like bsddb? Why is there an SQLite adapter shipped? Why do we have parsers for robots.txt files? plist? asyncore? commands / popen2 and tons of other redundant ways to invoke external applications and get the output? The builtin XML support is in such a bad state due to the fact that XML and the technologies that make it worthwhile are so complex that they require more bugfixes / releases of the libraries that implement them, or change so quickly that the standard library can't keep up. Minidom is annoying, the standard etree doesn't even support printing of XML documents with custom namespaces without falling back to unreadable names. (Remember that XML was sold to us as human readable?)
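The etree namespace complaint is easy to reproduce. The sketch below shows the generated `ns0` prefix and, as a contrast, `ElementTree.register_namespace()`, which only exists in later Python versions (2.7 and 3.2 onwards) and changes process-wide state. The namespace URI, the `ex` prefix and the `serialize` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # hypothetical namespace URI


def serialize(prefix=None):
    """Serialize a tiny namespaced document, optionally with a chosen prefix."""
    if prefix is not None:
        # Global registry: affects all later serialization in the process.
        ET.register_namespace(prefix, NS)
    root = ET.Element("{%s}root" % NS)
    ET.SubElement(root, "{%s}child" % NS).text = "hi"
    return ET.tostring(root, encoding="unicode")
```

Calling `serialize()` first produces element names like `ns0:root`; calling `serialize("ex")` afterwards produces `ex:root`. Even the fix is a global setting rather than a per-document one.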
Your area of expertise != Our area of expertise
I'm one of those developers who really like to write libraries with a nice API and who read through tons of RFCs, blog posts on similar topics etc. to deliver a nearly-perfect library in the end. Of course I fail in delivering perfect libraries. Far from it. However I try to improve the stuff I write over time, learning from my mistakes and fixing them. From nearly two years of Jinja development, the feedback I've got, studying similar code and more, I was able to collect some knowledge about how template engines in Python may work and what can be changed in the language to improve the experience. I just recently started diving into the gory details of HTTP, browser bugs and everything else. I had a look at earlier code I've written and had to notice that I was stupid and solved problems in a way that only seemed to work, without seeing the bigger picture. This experience comes over time and it takes a couple of releases to really come up with an implementation that works like it should.
I've seen from other projects that I'm not alone with that. Compare older Django versions with more recent ones. Earlier Django versions monkey patched modules to move models into other modules, CherryPy started as a standalone server in the pre-WSGI days that even went as far as implementing a Python-inspired language for the application code that compiles down to Python (I'm not exactly sure how that worked. I just remember something like that. Correct me if I'm wrong). Zope is in its third iteration as well, Ian seems to have learned from mistakes as well and fixes them in WebOb now, Genshi took over from Kid and is its unofficial successor, fixing problems learned from there etc.
This is something you can't do in the standard library. Once code is there, it sticks. So nobody can be blamed for problems in the standard library. This is what happens if code ends up there. This is the effect a standard library has on code.
So far I have just contributed two modules to the standard library. One is the ast module which provides compiler.ast like access to the new Python AST, incorporating experiences I've got when working on Jinja and Genshi. The other one is the ordered dict which isn't yet there, but where I suppose it will be accepted in one way or another. The experience for those two libraries was interesting.
The intentions I had with the ast module seem to clash a bit with Guido's beliefs about Python. When Google launched the AppEngine, Christopher Lenz and I had a discussion with Guido via mail about why the _ast module (the internal module used by ast) was unavailable there. Between the lines you could hear that he was not very happy with giving Python modules access to the compiler:
IMO it's more that because it was available people flocked to it as a timesaver. As the compiler package has turned out to be a ridiculous maintenance nightmare, nobody really wants to support that any more.
Hopefully the pgen2 package (which is more flexible *and* more limited) is easier to use. I can highly recommend it.
pgen2, if you don't know it, is a (slowish) Python parser written in Python that is used by the 2to3 tool and Sphinx. I noticed Guido's dislike of Python code generating and compiling Python code at the last djangocon as well. He started his keynote by joking about how the Django template engine is superior to anything else out there. (Of course I don't know if he meant the implementation or the philosophy, but something inside me told me he was happy that it was evaluating a custom AST and not compiling down to Python)
I suppose that's fine. Python is his brain child, but I was hoping he could see that for quite a few situations it would be helpful to have an AST to play around and compile it down to Python bytecode.
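What "play around with an AST and compile it down to bytecode" means in practice, as a toy sketch: parse an expression, rewrite the tree, hand it to `compile()`. `DoubleNumbers` and `evaluate_doubled` are made-up names, the transformation is deliberately silly, and `ast.Constant` assumes Python 3.8 or newer. This is not how Jinja or Genshi use the module.

```python
import ast


class DoubleNumbers(ast.NodeTransformer):
    """Rewrite every integer literal n into 2 * n."""

    def visit_Constant(self, node):
        if isinstance(node.value, int) and not isinstance(node.value, bool):
            doubled = ast.Constant(value=node.value * 2)
            return ast.copy_location(doubled, node)
        return node


def evaluate_doubled(source):
    """Parse an expression, transform its AST, compile and evaluate it."""
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(DoubleNumbers().visit(tree))
    code = compile(tree, filename="<ast>", mode="eval")  # AST -> bytecode
    return eval(code, {"__builtins__": {}})
```

`evaluate_doubled("1 + 2 * 3")` evaluates `2 + 4 * 6` and returns 26. A template engine does the same thing on a larger scale: build nodes, fix up locations, compile once, run many times.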
So what does this have to do with the standard library? A lot if you think about it. It basically means that a library in the standard library is no longer the library of the person who wrote it. It's part of a bigger plan. Suddenly different rules apply. Updates are distributed with Python, as I've said earlier already. But that's not the only thing that changes. The philosophy changes as well. Normally if I notice that something does not work as expected, I consider changing it with a deprecation warning or starting a separate library that is backwards incompatible but fixes those problems (like I did with Jinja 2). In the standard library you are forced to live with some bugs if they are not fixable in a backwards compatible way. Someone else will suddenly decide that changes won't go into a library because they would break code, something the Python team can't allow.
And this is a great thing. It means that updating from one Python version to another is in general very painless. It just has negative implications on libraries that ended up there too early or have to be changed to stay up with latest developments.
On the other hand stuff that does belong into the standard library should get some more love. Why is there no function yielding file names in a directory instead of returning a list? Why don't we have a module that gives us colors for the terminal in a platform-independent way? What about adding unicode support to Python's regular expressions? Or implement some more UTRs for the unicodedata module? Platform independent file locking and file change notifications? That would be honking great!
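The first wish on that list was granted much later: `os.scandir()` arrived in Python 3.5 and returns a lazy iterator instead of a list. A short sketch, with `iter_file_names` and `demo` being hypothetical helpers and the file names invented for the demonstration:

```python
import os
import tempfile


def iter_file_names(directory):
    """Lazily yield the names of regular files in `directory`."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.name


def demo():
    """Create a throwaway directory and list the files in it."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.txt", "b.txt"):
            open(os.path.join(tmp, name), "w").close()
        os.mkdir(os.path.join(tmp, "sub"))  # directories are skipped
        return sorted(iter_file_names(tmp))
```

Because names are produced one at a time, a directory with millions of entries no longer has to be materialized as one list before the first file can be processed.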
Where to Go?
If there is one thing I want to say with this blog post, it's that I strongly support the idea of making the standard library as light as possible and to improve the package distribution problem which still exists. Ever since virtualenv came around I'm no longer installing packages system wide so that I can have different versions in place. Maybe someone could even come up with a PEP to support loading different versions of the same library into the Python interpreter. Imagine you could install different versions of SQLAlchemy via debian's apt-get and the application could require a specific version. If the package installation is easy and simple there would be no problem with moving “essentials” like the urllib, cgi, sqlite or all the XML modules outside of the standard library and on the Python package index.
The great libraries are great because they are actively developed. And we should take advantage of that!
As always read this post with a grain of salt. The fact that I'm still a Python Lover, with all the mistakes and limitations it has strongly speaks for it.
Wow, great post. I really enjoyed reading it.
— Pascal on Monday, March 2, 2009 22:13 #
Do you share the same aversion to urllib2 as urllib?
— Benjamin Peterson on Monday, March 2, 2009 23:21 #
@2: urllib2 shared the same weaknesses until 2.6 came along. It didn't provide a timeout either and was exactly as impossible to subclass in a way that a timeout could have been added without rewriting the module.
— Armin Ronacher on Monday, March 2, 2009 23:23 #
I think you need to improve the package distribution problem before you can make the library as light as possible. And since people have been working on that for over a decade and don't have anything close to a good solution (virtualenv alone is woefully inadequate) I'm not holding my breath.
Also, to make a good case you'd need to talk about Python's historical Batteries Included philosophy and compare that with perl's CPAN approach and the relative merits, popularity growth, etc.
— Justus on Monday, March 2, 2009 23:42 #
Thanks for expanding on this Armin. While I still don't agree the stdlib is hopeless, I'd agree that to an extent, there are some technologies which simply move too fast to include in the stdlib.
I think the issue at work is actually multi-fold. I think issue one is that some of the older modules suffer from a lack of a dedicated maintainer - I suspect the stdlib size has outstripped the size of the dev team. Not having a dedicated maintainer who can really focus on those modules means that they'll get bugfixes, but more aggressive changes don't happen.
I think another issue at work is simply that people do go and roll their own, rather than offer to pick up an existing implementation, or push for changes in the stdlib.
It's possible for something to exist in both the stdlib, and on pypi - distutils and multiprocessing both have backports that include bugfixes that went into the more recent python versions.
All told, sure, there's some things in the stdlib which need help (and additional help would be appreciated). There is something to be said for batteries included though - with a basic python installation, it's possible to get a wide range of things done, even if they're not optimal implementations.
Some companies and environments ban things such as the pypi and CPAN (for perl), as they don't allow the base installation to be modified - for this very simple reason, it's great to have a wide range of things in the stdlib.
But that being said, those things should evolve with technology, and be the best of breed implementations (and possibly contain the least amount of magic).
As for your module loading ideas - ping Tarek - he's the current distutils maintainer, and I know he has big plans.
— Jesse Noller on Monday, March 2, 2009 23:54 #
Nice read! I almost agree.
It's just ... Python is used a lot for simple one-off scripts (and maybe they're not so simple, like temporarily setting up a tcp server to test something). I might not want to download several third-party libraries to do stuff. I might not even have a way to download external libraries to that computer. But Python is most likely there, and it should be useful by itself.
Still, you're right about the problems. It's especially problematic if it means people use a stagnating standard library module instead of developing something better themselves.
— Simon Percivall on Tuesday, March 3, 2009 0:27 #
I think you are plain wrong when you bash the stdlib. Batteries included is what makes Python so cool. When you read the Python Cookbook, tons of recipes don't need an external package and that's great. There's no real standard lib in C or C++ and that's what makes those languages so lame, because you have to re-invent the wheel every day. In my company there are probably as many string libraries as there are employees. Do you think that's a good thing? And setting up CPAN is such a pain, plus there are some people who don't use debian so the apt-get answer does not work.
Why don't you fix packages instead, by adding a 2 at the end (urllib2) if you're going to break backward compatibility ?
— Benjamin Sergeant on Tuesday, March 3, 2009 2:12 #
I think that having `sqlite` in `stdlib` is great. However, I agree with you regarding `stdlib` stagnation, just look at `datetime` & friends.
— Suraj on Tuesday, March 3, 2009 4:34 #
I think you have some valid points, but I think it is worthwhile to step back from your vantage point as an experienced developer and try to see the standard library through the lens of someone new to Python or programming in general.
When that person installs Python or discovers it installed in /usr/bin, they may not know much about it. Maybe they'll read the tutorial on python.org. That won't tell them about easy_install or PyPI. Maybe they will want to write a small GUI application to learn a little something about GUI programming. I'm glad that in most cases they won't have to chase down packages to do so.
Back when I was new to Python and programming in general I was thrilled to find telnetlib in the standard library. It was very useful for some mundane sysadmin tasks and really whetted my appetite for more of Python. We all know there is plenty wrong with telnet in general and telnetlib probably has its share of flaws as well.
The standard library that you propose sounds useful but also very boring to me. I still get a kick out of all that is available with a fresh Python install and I'd be sad to see the eclectic (and still useful) standard library get replaced with "kick-ass unicode support ... and platform introspection helpers".
— Christian Wyglendowski on Tuesday, March 3, 2009 4:54 #
I don't agree with everything in your post, but I think you have some good ideas. I am very happy that someone with your dedication and passion for coding is putting work back into the standard library.
I both agree and disagree with you about the standard library. I actually like how conservative Guido and the rest of the Python developers are with the standard library. There are some pretty big bugs every once in a while, as you point out, but for the most part things work. I actually really, really like the standard library, but think it could get better too.
One thing I strongly agree with you on is that Python has a real problem with package management, especially when it comes to packages that depend on other packages. This is a substantial problem that is on my personal list of the top 3 warts of Python. It is easy to demonstrate this problem. Install any web framework, you will fail over 50% of the time....
— Noah Gift on Tuesday, March 3, 2009 7:49 #
You may be right that some aspects of the stdlib are less than perfect. But removing them, and slimming down the stdlib, is NOT the answer. Many people use Python in environments where "just installing an extra module" is far from easy - corporate environments, shared web hosting, Google AppEngine, etc. In those environments, the fact that there is something in the stdlib is not just useful, it's crucial.
And furthermore, even if (say) the CGI module is broken, it is still "good enough" for lots of uses. I'm not a web developer, and I don't know much about web applications. But with the stdlib, I can knock up a basic CGI program to wrap up some scripts on my server pretty easily. If I had to locate a CGI library, I'd never bother (or worse still, I'd use another language! :-))
In my view, the solution is for people who find issues with the stdlib, particularly if they find they need to "add such missing features by hand [by] rewriting the library", to contribute their fixes/enhancements back to the stdlib. I know such patches may languish in the tracker for far too long, but if people stop trying to fix the stdlib, it'll just stay broken.
From my POV, I'm hugely grateful that the writers of the stdlib have contributed their code, so that I may use it - no matter how limited it might be.
— Paul Moore on Tuesday, March 3, 2009 12:09 #
and I thought it was "sexually transmitted disease".. well at least the last one seems to be true ;)
— Ron on Tuesday, March 3, 2009 13:20 #
Except for corporate environments in all these situations you can easily add other modules. That's how Python works. Everything in the Python path is looked up. Just copying the package into the PYTHONPATH (a variable you can set yourself) is enough.
That's not a good argument.
Also I did say that it is important to further improve the installation and distribution of Python packages. Just because it may suck currently, it does not mean it has to suck in the future.
People went over to writing separate libraries (finally). I think the standard library will continue to rot for a while and then a team will jump on it and decide to deprecate some of the most rotten stuff.
That's what happened before the 3.0 release and happened during the 2.x release cycle as well.
— Armin Ronacher on Tuesday, March 3, 2009 13:22 #
(which should be part of the GUI libraries, not a Programming language's standard library etc.).
should be
(which should be part of the GUI libraries, not a Programming language's standard library) etc.
— Eddie on Tuesday, March 3, 2009 16:10 #
I notice a lot of your complaints are focused on http and other web protocol packages. I don't hear hate for re, sys, and lots of other packages :) In my experience, cgi, urllib2 and some others aren't very good, and others (like shelve, optparse) can use some major re-writing. I tend to view the stdlib as a good starting point, until you need something better.
Maybe it would be a good project to rebuild some of the old libraries from scratch? I'd sure be game to work on optparse and shelve, and I'm guessing others would have similar pets. I'd love a great cgi module, for one.
— Gregg Lind on Tuesday, March 3, 2009 16:52 #
@15: re is a thing on its own. Some things of it are just weird (for example that you can't perform type checks and that pattern/match objects are missing a __class__ attribute). It's also missing unicode support as I've written earlier. But in general the library is nicely done. For a long time it implemented stuff many other libraries did not provide. Unfortunately it seems to be unmaintained and parts of it are written in Python which makes it not that nice to interface from the C API.
I think you can expect web support modules that are uncoupled from Werkzeug/Paste/Django soon. Ben Bangert from the Pylons team contacted me if I would be interested in working with them on a separate library that implements all that ugly header parsing etc. If all works out as expected, there will be a library.
However I strongly vote against putting that into the standard library as web stuff is a moving target. The IE guys are putting out new HTTP headers like mad and browser bugs change over time.
— Armin Ronacher on Tuesday, March 3, 2009 17:15 #
— Christopher Lenz on Tuesday, March 3, 2009 21:18 #
@17: Right. The parsing as such would be something stable. But who except for web guys needs HTTP header parsing? I mean, why ship that as part of Python if only web stuff uses it? As soon as the packaging issues are solved it should be straightforward to install a package everywhere :-)
— Armin Ronacher on Tuesday, March 3, 2009 21:43 #
The standard library was good once upon a time.
It still is good I think... but now there's way more good code outside of the stdlib. Using many of the modules in there is not the best thing to do. eg, there are better web,gui,database, xml, networking, graphics and game libraries. Pretty much in every field of programming python has some better add-ons compared to what the stdlib provides.
Now we have better programmers, and better tools to help us work together(improving version control, and collaboration software).
It's now easy for a python package to have 1-100 developers.
It's also fairly easy for people to distribute packages now. However that is improving too -- with a few different people and projects working on improving the situation.
Note, that there are more than 5868 packages listed on the python package index. That's only a small proportion of all python code available too.
This is why py3k has had terrible uptake. Python isn't in the core anymore... python is in the libraries. There is way more good stuff available outside of python stdlib. My guess would be that lots of stdlib libraries have better external replacements.
At this point in pythons lifetime it makes sense to keep the stdlib stable, and to concentrate on package management... and work towards allowing all of those packages to reach a high quality more easily.
Anyway... I'm glad someone has come and given us an education about the potential badness of STDs.
— Rene Dudfield on Tuesday, March 3, 2009 22:36 #
I agree with a lot of your opinions regarding the standard library. There are far too many modules, and it can be quite confusing especially when functionality is duplicated.
However, I have to take exception with your criticism of the inclusion of the sqlite bindings. In my opinion this is one of the killer features of Python. If you have an application that needs to use relational database type features, it's a great boon to know that you can freely do so and the module will be there no matter where you run your app. The fact that sqlite supports creating in-memory databases without a file backing makes this even more powerful. It really provides a neat way to have a more powerful queryable datastructure available.
— Kamil Kisiel on Tuesday, March 3, 2009 23:19 #
@Gregg Lind, you wrote, "Maybe it would be a good project to rebuild some of the old libraries from scratch? I'd sure be game to work on optparse and shelve, and I'm guessing others would have similar pets. I'd love a great cgi module, for one."
I love this idea. Something like boost, but for Python. The focus would be on standard library reimplementations, and then it could serve as a staging ground for inclusion in the 'real' standard library, just like Boost does.
I really want to work on a reimplementation of Date/DateTime support that is loosely based on the Joda-Time API from Java:
joda-time.sourceforge.net/
I think you could take the best of that API, plus Python's datetime and mx.DateTime, and create an awesome and truly useful Pythonic DateTime API.
I'm sure people have similar feelings about things like other standard data structures and the like.
— pixelmonkey on Wednesday, March 4, 2009 3:25 #
My understanding is Guido considers the stdlib not part of the core language, and as such its shortcomings fall through the cracks. Is there some way to get the priority of stdlib maintenance (and documentation, which is also very spotty) kicked up to a higher plane in the eyes of core developers? Reading Noah Gift's comment, I was reminded of how stdlib shortcomings will keep Python from replacing Perl (or good old shell) in any serious Sys Admin environment unless there's an evangelist pushing the change. Here's the documentation from shutil.copytree:
<blockquote>
copytree(src, dst, symlinks=False)
Recursively copy a directory tree using copy2().
The destination directory must not already exist. If exception(s) occur, an Error is raised with a list of reasons.
If the optional symlinks flag is true, symbolic links in the source tree result in symbolic links in the destination tree; if it is false, the contents of the files pointed to by symbolic links are copied.
XXX Consider this example code rather than the ultimate tool. </blockquote>
— sk on Saturday, March 7, 2009 1:49 #
Minor pedantic twitch re
the days when you could trust user input
are long gone
Thank goodness those days are gone, but
did they ever exist?
How did they survive the introduction of gets() ?
— Alan on Sunday, March 8, 2009 2:26 #
Interesting point. I guess I'm only using less than 5% of the stdlib on a daily basis. Never had a problem with it being "crowded". But stale code is bad, of course. I noticed that recently on another playground: jQuery plugins (found in the official SVN tree and hosted on jquery.com): many of them are outdated crap. I'd be concerned if that were the case with Python, too.
— Fabian Neumann on Friday, March 13, 2009 12:52 #
You don't seem to see the point of a stdlib because of the way you are using Python: you develop in a given field of expertise (web development) and set up a virtualenv with the best existing tools. Whether they are in the stdlib or not doesn't matter to you.
My use case is different: on my home computer, I use Python for various small admin tasks, more often than not directly in the interpreter. For that, I need Python to be like a swiss army knife: yeah, it might not have the best saw or scissors in the world, but I'm glad it has them.
For that use case, even the best packaging system in the world (anyone feels like reinventing Debian?) is not gonna help. I don't want to wait for dependencies to be downloaded, let alone compiled (which on Windows also needs a MS compiler...), I don't want 10 different python installations because applications have conflicting dependencies.
So when discussing improvements to the stdlib, think of the people who really depend on it, not about your own use cases. For example, aggressively deprecating bitrotting modules is useful, because they are just wasting user time. But plotting ways to eventually get rid of the stdlib is not.
— Baptiste on Monday, March 16, 2009 15:43 #