Armin Ronacher

Plurk Solace Released

written by Armin Ronacher, on Wednesday, September 2, 2009 14:04.

Yesterday we released Solace under the BSD license, free for everybody to use and modify. Solace is a multilingual support platform inspired by Stack Overflow.

That's the project I spent most of my time on over the last three or four weeks and I'm very happy to share the results with you. If you're not yet familiar with the concept of Stack Overflow, think of it like a bulletin board where you can vote on topics and replies. The topics are the questions and the replies are possible answers. One can accept the best answer and that is then marked as the correct one, unless someone comes and accepts a different reply as the answer. Like with Stack Overflow, Solace has a reputation system and gives you badges. The more reputation you have, the more features are available to you. For example, over a specific threshold you are able to edit other people's posts, downvote others etc.

We do not have a demo version installed yet, but you can already install it on your local machine in less than a minute. Just download the tarball from the Python Package Index, unpack it and follow the instructions in the README file. It's distributed as a standard Python setuptools packaged application that you can run from a virtualenv easily. (Note that there are problems with virtualenv on Snow Leopard at the moment, so google for solutions there or wait for virtualenv to get fixed.)

What does Solace do currently besides the very basic things?

  • Multi-lingual user interface. At the time of this writing only English and German are supported, but with the help of the community we should see many more languages soon.
  • The same instance can host multiple languages at once. The whole system from the ground up was developed with that in mind.
  • There is a RESTful API that allows you to query the database. Later we will also provide write access to it.
  • The badge system is in place and can be modified easily (currently by monkeypatching, but that will get a documented interface)
  • The Auth-system is pluggable. Integrate Solace with your own applications easily.
  • Replies and Answers are revisioned and a diff-view is provided.

Oh yes, and of course it's developed in Python as a standard WSGI application. If you like hacking on the code that should make things really easy :)

Grab it while it's hot and tell us what you think about it.

Pro/Cons about Werkzeug, WebOb and Django

written by Armin Ronacher, on Wednesday, August 5, 2009 7:54.

Yesterday I had a discussion with Ben Bangert from the Pylons team, Philip Jenvey and zepolen from the pylons IRC channel. The topic of the discussion was why we have Request and Response objects in Werkzeug, WebOb and Django and what we could do to improve the situation a bit.

We decided on writing down what we like or dislike about these three systems in order to find out in which direction to go, so this is my attempt. Please keep in mind that these are my opinions only!

WebOb

Let's start with WebOb which is the smallest of the three libraries in question. WebOb really just sticks to the basics and provides request and response objects and some data structures required.

The philosophy of WebOb is to stay as compatible to paste as possible and that modifications on the request object appear in the WSGI environment. That basically means that when you do anything on the request object and you create another one later from the same environment you will see your modifications again.

This is without doubt something that neither Werkzeug nor Django does. Both Werkzeug and Django consider the incoming request something you should not modify; after all, it came from the client. If you need to create a request or WSGI environment in Werkzeug, you get a separate utility that is designed for exactly that purpose.

While I have to admit that the idea of a reflecting request object is tempting, I don't think it's a good idea. Using the WSGI environment as a communication channel seems wrong to me. The main problem with it is that WebOb cannot achieve what it's doing with standard environment keys. There are currently five WebOb keys in the environment for “caching” purposes and for compatibility with paste it also understands a couple of paste environment keys.

The idea is that other applications can get a request again at a completely different point, but I'm not sure if WSGI is the correct solution for that particular problem. Building reusable applications on top of the complex WSGI middleware system seems to be the wrong layer to me.

Some other parts where I don't agree with the WebOb concepts:

  • The parsing of the data is implemented either in private functions or directly in the request object. I strongly prefer giving the user the choice to access the parser separately. Sometimes you really just need a cookie parsed, why create a full request object then?
  • WebOb uses request.GET and request.POST for URL parameters and form data. Because you can have URL parameters in non-GET requests as well this is misleading, for POST data it's wrong as well because form data is available in more than just POST requests. Accessing request.POST to get form data in a PUT request seems wrong.
  • WebOb still uses cgi.FieldStorage, and not only internally: it also puts those objects into the POST dict. This is not the best idea for multiple reasons. First of all, users are encouraged to trust their submitted data and blindly expect a field storage object if they have an upload field in their form. One could easily cause trouble by sending forged requests to the application. If logging is set up, the administrator is sent tons of error mails instantly. I strongly prefer storing uploaded files in a separate dictionary like Django and Werkzeug do. The other problem with using FieldStorage as a parser is that it's not WSGI compliant, because it requires a size argument on the readline function, and that it has a weird API. You can't easily tell it to not accept more than n bytes in memory and to switch between in-memory uploading and a temporary file based on the length of the transmitted data. Also cgi.FieldStorage supports nested files, which no browser supports and which could cause internal server errors as well, because very few developers know that a) nested uploads exist and b) the field storage object behaves differently if a nested uploaded file is transmitted.
  • Also WebOb barks on invalid cookies and throws away all of them if one is broken. This is especially annoying if you're dealing with cookies outside of your control that use invalid characters (stuff such as advertisement cookies)
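To make the first and last points of the list above a bit more concrete, here is a minimal sketch of a standalone cookie parser that needs no request object and skips broken pairs instead of discarding the whole header. The function name parse_cookie_header is hypothetical, and this is a deliberately simplified illustration, not the parser that WebOb, Werkzeug or Django actually ship.

```python
def parse_cookie_header(header):
    """Parse a Cookie header value into a plain dict.

    Simplified on purpose: no unquoting of escape sequences, no charset
    handling. Malformed pairs are skipped, the rest is kept.
    """
    cookies = {}
    for chunk in header.split(";"):
        name, sep, value = chunk.strip().partition("=")
        if not sep or not name:
            # Not a name=value pair: ignore it but keep parsing the rest.
            continue
        cookies[name] = value.strip('"')
    return cookies


# One broken entry and one cookie with an unusual name in the middle.
parsed = parse_cookie_header('session=abc123; broken; ad:track=1; theme="dark"')
```

With this approach the session cookie survives even though another cookie in the same header is malformed.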

Now to the parts where WebOb wins over Django and Werkzeug:

  • Unlike Django and Werkzeug WebOb provides not only a unicode API but also a bytestring based API. This could help existing applications that are not unicode ready yet. Downside is that with the current plans of Graham for WSGI on Python 3 there do not seem to be ways to support it on Python 3.
  • WebOb supports the HTTP range feature.
  • The charset can be switched on the fly in WebOb. In Werkzeug you set the charset for your request/response object and from that point onwards it's used no matter what. In Django the charset is application wide.

An interesting thing is that WebOb uses datetime objects with timezone information. The tzinfo attribute is set to a tzinfo object with a UTC offset of zero. That's different from Werkzeug and Django, which use offset-naive datetime objects. This matters because Python treats the two kinds differently and does not support operations that mix them. Unfortunately the datetime module makes it hard to decide what to do. Personally I decided to use datetime objects that have no tzinfo set and only dates in UTC.
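The incompatibility is easy to reproduce with the standard library alone. The following is a small sketch for current Python 3 (datetime.timezone did not exist when this was written); the helper can_mix and the variable names are made up for illustration.

```python
from datetime import datetime, timezone

# Offset-aware, similar in spirit to what WebOb hands out (UTC offset of zero).
aware = datetime(2009, 8, 5, 7, 54, tzinfo=timezone.utc)
# Offset-naive, only UTC by convention.
naive = datetime(2009, 8, 5, 7, 54)


def can_mix(a, b):
    """Return True if Python allows subtracting b from a."""
    try:
        a - b
    except TypeError:
        return False
    return True


mixing_works = can_mix(aware, naive)

# Dropping the tzinfo gives a naive object that interoperates again.
stripped = aware.replace(tzinfo=None)
same_moment = stripped == naive
```

Whichever convention a library picks, everything that touches its datetime objects has to follow the same one.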

Werkzeug

In terms of code base size Werkzeug's next. The problem with Werkzeug certainly is that it does not really know what belongs into it and what not. That situation will slightly improve with the next version of it when some deprecated interfaces go away and when the debugger is moved into a new library together with all sorts of debugging tools such as profilers, leak finders and more (enter flickzeug).

Werkzeug is based on the principle that things should have a nice API but at the same time allow you to use the underlying functions. For example you can easily access request.form to get a dict of uploaded form data, but at the same time you can call werkzeug.parse_form_data to parse the stuff into a multidict. You can even go a layer down and tell Werkzeug to not use the multidict and provide a custom container or a standard dict, list, whatever.

Also Werkzeug has a slightly different goal than WebOb. WebOb focuses on the request and response object only, Werkzeug provides all kind of useful helpers for web applications. The idea is that if there is a function you can use, you are more likely to use it than that you reimplement it. For example many applications take the uploaded file name and just create a file with the same name. This however turns out to be a security problem so Werkzeug gives you a function (werkzeug.secure_filename) you can use to get a secure version of the filename that also is limited to ASCII characters.
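To show the kind of problem such a helper guards against, here is a standard-library-only sketch. The function sanitize_filename is a hypothetical, simplified stand-in; it is not the implementation of werkzeug.secure_filename and makes no claim to match it in every case.

```python
import re
import unicodedata

# Everything outside this conservative ASCII set is dropped.
_unsafe_chars = re.compile(r"[^A-Za-z0-9_.-]")


def sanitize_filename(filename):
    """Return an ASCII-only name without path separators or leading dots."""
    # Decompose accented characters, then drop whatever is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", filename)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Path separators become whitespace so "../" cannot climb directories.
    for sep in ("/", "\\"):
        ascii_name = ascii_name.replace(sep, " ")
    cleaned = _unsafe_chars.sub("", "_".join(ascii_name.split()))
    # Leading dots and underscores are stripped as well.
    return cleaned.strip("._")


traversal = sanitize_filename("../../../etc/passwd")
spaces = sanitize_filename("My cool movie.mov")
```

An application that writes uploads under the sanitized name can no longer be tricked into leaving its upload directory through the file name alone. Note that the result can be an empty string, which the caller would still have to handle.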

So obviously there is a lot of stuff in Werkzeug you probably would not expect there.

So here are some of the things I like especially about Werkzeug:

  • The request/response objects. They are designed to be lightweight and can be extended using mixins. Werkzeug also provides full-featured request objects that implement all shipped mixins. Also the request/response objects are not doing any parsing or dumping, that is all available through separate functions as well which makes the code readable and easy to extend.
  • It fixes many problems with the standard library or reimplements broken features. It does not depend on the cgi.FieldStorage since 0.5, allows you to limit the uploaded data before it's consumed. That way an attacker cannot exhaust server resources.
  • The data structures provide handy helpers such as raising key errors that are also bad request exceptions so that if you're not catching them, you are at least not generating internal server errors as long as the base HTTPException is caught.
  • Werkzeug uses a non-data descriptor for the properties on the request and response objects. The first time you access the property code is executed and that is stuffed into the dict. After that there is no runtime penalty when accessing the attributes.
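The non-data descriptor trick from the last point can be sketched in a few lines. This is an illustration of the technique and not Werkzeug's actual code; the Request class, its args property and the parse_count counter are made up for the example. Current Python versions ship functools.cached_property, which follows a similar idea.

```python
class cached_property:
    """Non-data descriptor: it defines __get__ only, no __set__ or __delete__."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__
        self.__doc__ = func.__doc__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = self.func(obj)
        # Entries in the instance __dict__ win over non-data descriptors,
        # so the next attribute access never reaches __get__ again.
        obj.__dict__[self.name] = value
        return value


class Request:
    def __init__(self, environ):
        self.environ = environ
        self.parse_count = 0

    @cached_property
    def args(self):
        # Counts how often the (deliberately naive) parsing really runs.
        self.parse_count += 1
        query = self.environ.get("QUERY_STRING", "")
        return dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)


request = Request({"QUERY_STRING": "page=2&sort=votes"})
first = request.args
second = request.args
```

After the first access the value is an ordinary instance attribute, which is why there is no per-access overhead from then on.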

And of course here is the list of things that are not that nice:

  • It's too large for a library that only wants to implement request and response objects.
  • There is no support for if-range and friends.
  • The response stream is essentially useless because each write() ends up as a separate “item” in the application iterator, and each item is followed by a flush.
  • The MultiDict is unordered which means that some information is lost.
  • The response object modifies itself on __call__. This allows some neat things like automatically fixing the location header, but in general that should happen temporarily when called as WSGI application instead of modifying the object.

Django

Now Django isn't exactly a reusable library for WSGI applications but it does have a request and response object with an API, so here are my thoughts on it:

  • URL arguments are called request.GET like in WebOb, but files and form data were split up into request.POST and request.FILES.
  • The request object is unicode only and the encoding can be set dynamically.
  • Problem is, they don't work with non-Django WSGI applications.

Chances on a common Request Object?

WebOb and Werkzeug will stick around, and the chances that Django starts depending on external libraries for the Request object are very, very low. However it could be possible to share the implementation of the HTTP parsers etc.

To be humble, I would not want to break Werkzeug into two libraries for utilities and request/response objects and parsers because of the current packaging situation. A lot of small stuff I work on works perfectly fine with nothing but what Werkzeug provides which is pretty handy. So yes, it's selfish to not break it up, but that's how I feel about the situation currently.

NIH in the WSGI World

written by Armin Ronacher, on Thursday, July 30, 2009 19:14.

Today I've seen yet another WSGI powered microframework. It does not do anything another framework does not do, but it exists. Which is not a problem per se. It probably does some things differently to other things out there and that would be perfectly okay. Except … more than half the code is repetitive WSGI bridging.

Seriously guys, stop doing that. For the following reasons:

  • cgi as a module is for CGI, not WSGI. Don't try to use it for WSGI applications unless you know what you're doing. It's expecting a WSGI server that implements readline() with a size hint which is not compliant to the specification. Also with the wrong invocation it will read your command line arguments and incorporate those into the parsing process and other weird things.
  • Half the frameworks out there are not implementing proper multi dicts or try to leave that to the user by returning either lists or strings from URL parsing or whatever can yield multiple values for a key.
  • Most frameworks get unicode terribly wrong. How hard can it be to properly implement unicode …
  • Some of you are expecting EOF on input streams which might not be there.
  • URL-decoding the path info is not something you should do. I know there are WSGI servers that are doing it wrong, but those things have to be fixed in a fixer middleware and not in your framework. You cannot reliably auto-detect that.
  • Many frameworks are catching system exceptions such as SystemExit and GeneratorExit and others that can cause ugly problems.
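As an illustration of the multi dict point in the list above: a key in a query string or form can carry several values, and the container should expose both the common single-value case and the full list. The class below is a bare-bones sketch, not the MultiDict that Werkzeug or WebOb ship; its name and the getlist method are simply chosen for the example.

```python
from urllib.parse import parse_qsl


class MultiDict(dict):
    """Stores a list of values per key, returns the first one by default."""

    def __init__(self, pairs=()):
        super().__init__()
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        dict.setdefault(self, key, []).append(value)

    def __getitem__(self, key):
        # Raises KeyError for unknown keys, like a regular dict.
        return dict.__getitem__(self, key)[0]

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default

    def getlist(self, key):
        # Always a list, empty if the key is unknown.
        return list(dict.get(self, key, []))


args = MultiDict(parse_qsl("tag=python&tag=wsgi&page=2"))
first_tag = args["tag"]
all_tags = args.getlist("tag")
```

The caller never has to check whether a lookup returned a string or a list, which is exactly the guessing game the frameworks criticized above push onto their users.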

There are libraries like WebOb and Werkzeug that are not doing anything besides the very basic things such as form data parsing and stuff. Especially Werkzeug can be used totally low-level, without any request or response objects. It's just doing the boring parts you don't want to implement anyway.

Why shouldn't you reimplement it yourself? There is not much win by doing that. A single dependency for your framework won't kill anyone. The microframeworks in the Ruby World all depend on various stuff (such as Rack). The main reason for not reimplementing are server and browser bugs, limitations in WSGI that have to be worked around, complex issues that are hard to get right and other things where people should rather collaborate.

It's already problematic that there is the Django core, WebOb and Werkzeug that are all implementing low-level parsing and similar things, but I'm pretty sure that we can do better there. For future bugs in Werkzeug my policy is to check for similar problems in both WebOb and Django to ensure that nobody is missing anything here.

So I beg you: If you're working on a microframework, depend on WebOb/Werkzeug/Django, whatever or at least steal the code with copy/paste. Talk back to other developers and share patches. You don't win anything by reimplementing basic things on your own. Not even an understanding of HTTP or WSGI, those things turn out to be only learned by reading the specifications carefully.

Singletons and their Problems in Python

written by Armin Ronacher, on Friday, July 24, 2009 22:33.

The infamous Singleton design pattern is now widely seen as stupid and evil and also causes some hatred. Fortunately singletons in Python are not that common and few people use them. It seems to be a natural thing not to create a singleton class.

But beware. Just because you do not implement the singleton design pattern it does not mean you avoid the core problem of a singleton. The core problem of a singleton is the global, shared state. A singleton is nothing more than a glorified global variable and in languages like Java there are many reasons why you would want to use something like a singleton. In Python we have something different for singletons, and it has a very innocent name that hides the gory details: module.

That's right folks: a Python module is a singleton. And it shares the same problems of the singleton pattern just that it's a little bit worse.

Namespaces

So let's dive into the problems of Python modules by looking at a completely different language. Let's compare our beloved Python modules with C# namespaces for a moment. If you don't know C#, let me show you how they are declared:

namespace MyNamespace {
  class MyClass {
  }
}

So as you can see, a namespace in C# is something you specify explicitly. On the surface that looks like the big difference between a Python module and a C# namespace. However, that's really just the surface. The biggest difference is that a C# namespace is something like a folder. You put stuff into it so that you can better organize it. And the only stuff you can stuff into a C# namespace are classes, interfaces, other namespaces, enums and delegates (something like a function prototype in C).

In Python a module is an object. And that object is an instance of a class called module and it has as many attributes as you like. You can put whatever you want on it. What's stored on there are usually the imported objects, the classes and functions declared in that module and other global variables or constants.
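This can be checked directly in the interpreter. A short sketch using only the standard library; the module name scratch is made up for the example.

```python
import mimetypes
import types

# Every imported module is an instance of the same class.
is_module_instance = isinstance(mimetypes, types.ModuleType)

# Module objects can also be created by hand and take arbitrary attributes.
scratch = types.ModuleType("scratch")
scratch.counter = 0
scratch.counter += 1

# Functions and globals of a module simply live in its __dict__.
has_function = "guess_type" in vars(mimetypes)
```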

That means the big difference is that a Python module has the ability to store state, a C# namespace does not. There is nothing you can store on a C# namespace that could change at runtime. That means the only thing “stored” on a C# namespace is compiled code that was loaded from an assembly (something like a .pyc file in Python, just more portable).

So what are the implications?

There can only be one…

So I have already told you that modules in Python are simple objects with attributes. So what happens if you write import meh (Ignoring the obscure details about the Python import system)? First the Python interpreter checks if the module was already imported and if yes, it's using the already imported module, otherwise it's creating a new module object and executes the code that creates it.

The already imported modules are stored in a special dictionary inside the interpreter. sys.modules points to this dictionary, so you can access that from the Python code. Each module that was already imported (and also each module that is known to not exist) is stored in there to ensure there will only be one. So as you might have guessed, it's what we call a singleton.
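A quick way to see the cache at work, sketched with the standard library. Removing an entry from sys.modules is done here for illustration only and is restored right away.

```python
import importlib
import mimetypes
import sys

# Importing twice hands back the very same object from the cache.
first = sys.modules["mimetypes"]
second = importlib.import_module("mimetypes")
same_object = first is second is mimetypes

# Without the cache entry the module code is executed again, which
# produces a second, independent module object.
saved = sys.modules.pop("mimetypes")
fresh = importlib.import_module("mimetypes")
fresh_is_new = fresh is not saved

# Put the original back so the rest of the process sees one module again.
sys.modules["mimetypes"] = saved
```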

The second step, the execution of code to create the module attributes is the second “problem” here. It's what creates the shared state or what can create the shared state. In order to not talk about irrelevant things, let's have a look at one of the modules from the standard library, the mimetypes module.

Have a look:

inited = False

def init(files=None):
    global inited
    db = MimeTypes()
    ...

This is actual code from the mimetypes module that ships with Python, just with the more gory details removed. The point is, there is shared state. And the shared state is a boolean flag that is True if the module was initialized or False if it was not. Now that particular case is probably not that problematic (trust me, it is) because mimetypes initializes itself, but you can see that there is a files parameter to the init function. If you pass a list of files to that function, it will reinitialize the mime database in memory with the mime information from those files. Now imagine what would happen if you have two libraries initializing mimetypes with two different sources …
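The two-libraries scenario can be played through in a few lines. This sketch uses mimetypes.add_type instead of init(files) so that it does not depend on files on disk; the extension .xyzdemo and the type names are invented for the example.

```python
import mimetypes

# "Library A" registers its idea of the extension on the module-wide database.
mimetypes.add_type("application/x-library-a", ".xyzdemo")
seen_by_a = mimetypes.guess_type("report.xyzdemo")[0]

# "Library B", loaded later in the same process, registers something else.
mimetypes.add_type("application/x-library-b", ".xyzdemo")
seen_by_a_afterwards = mimetypes.guess_type("report.xyzdemo")[0]

# A private MimeTypes instance keeps its state to itself.
private_db = mimetypes.MimeTypes()
private_db.add_type("application/x-library-a", ".xyzdemo")
private_result = private_db.guess_type("report.xyzdemo")[0]
```

Library A gets a different answer after library B has been imported, although it never changed anything itself. The instance-based variant avoids that, but only if every library bothers to use it.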

Now obviously, that's a problem of the library that implements it not of Python itself. Nobody should have shared state in module scope. Unfortunately there are many standard library modules that have that (cgi, logging, mimetypes, csv, …) and it seems to be standard practice in Python world. There is a lot of shared state in Django and nearly all modern frameworks, not just for the web.

Let it be None?

Now before I ask for more than one, I want to ask for none. Because this is the problem that freaks me out the most. I'm mainly doing Python web development and that means I have some long running processes that are managed by some external server I don't really control. Not only do I work with Python, I'm also obsessed with the idea of having extensible systems. Which is why a project of mine has a plugin interface. Users can upload new plugins in the web interface and activate and deactivate them.

What does all of that have to do with singletons and modules? Unfortunately too much. I told you already that once a module is imported, it's stored in that sys.modules thing. Now imagine a user uploads a new version of a plugin, he upgrades it. In order for the new code to load you would first have to shutdown the Python interpreter and restart it again. Unfortunately there is no way for a WSGI application to request a restart from the webserver.

So how does one unload a module to reload a new version? There is no documented way for that, and the thing I'm doing is dangerous, not portable, kills little kittens and you should never, ever do that.

The road to insanity or code reloading in Python:

  1. Put your reloading code into a separate module, one with a special name (zine._core in my case)
  2. Have some sort of lock.
  3. Acquire that lock, and do that when you're sure no other thread is executing code from your package (haha, good luck)
  4. Clear all modules from sys.modules that belong to your code, except the one that implements the reloader.
  5. Import your package again and execute the code that sets up the application again.
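The steps above can be sketched roughly like this. It is not the code Zine uses; purge_package, the demoapp package and its submodules are hypothetical stand-ins, and the lock only protects the purge itself, not threads that are still running old code.

```python
import sys
import threading
import types

_reload_lock = threading.Lock()


def purge_package(package, keep=()):
    """Remove a package and its submodules from sys.modules.

    Names listed in keep (such as the module hosting the reloader) survive.
    Returns the sorted names that were removed.
    """
    prefix = package + "."
    with _reload_lock:
        doomed = [
            name
            for name in list(sys.modules)
            if (name == package or name.startswith(prefix)) and name not in keep
        ]
        for name in doomed:
            del sys.modules[name]
    return sorted(doomed)


# Demonstration with stand-in modules instead of a real application.
for name in ("demoapp", "demoapp._core", "demoapp.views", "demoapp.plugins.theme"):
    sys.modules[name] = types.ModuleType(name)

removed = purge_package("demoapp", keep=("demoapp._core",))
```

With a real package on disk, step 5 would follow as a regular import of the package plus whatever setup code the application needs.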

This is dangerous and stupid. Imagine what happens if a thread is still active in the old code and you kick away the modules it's executing in. Because of weak references you could get rid of the global scope (the module one) a called function is still weakly referencing and the function would break with an obscure error.

Currently there is no solution for that problem, and I don't expect one to appear in Python anytime soon, at least not without breaking stuff. Because what we would need is …

… more Singletons

If one singleton does not solve the problem, a second one could. That's the point where you should disagree with me and call me names, but let me explain myself first. The problem is shared state, but why is shared state the problem? In Python development we seem to love shared state, a whole lot. And it does make development simple and lets you learn and understand the language quickly. The shared state is usually stored on modules or stuff stored on modules, so modules seem to be the root of all evil. There can only be one version of a module, what does this mean for us? Imagine we have one running Python interpreter, the following things do not work:

  • that interpreter runs application A and application B, A wants libfoo in version 1.0, B wants libfoo in version 2.0, both API incompatible
  • we can't reload code on the fly because we would have to tear everything down first and restart; we can't load the new version of the code, slowly move over to it and get rid of the old code with the help of the garbage collector when it's no longer needed.
  • we can't have two instances of the same application running in the same process that want different search paths for plugins loaded with the regular import API (instance 1 loading the modules below app.plugins from /var/www/instance1/plugins and instance 2 loading the modules from /var/www/instance2/plugins)

The funny (and sad) part is that all these nice things do not work just because of one single object: sys.modules, the übersingleton of Python.

But we can't get rid of it because our modules are objects and we want to get the same module back if we import it in two different modules. So if we can't get rid of the singleton, add some more!

This solution would solve the problems of the three cases outlined above, but there would still be many problems left. Also there is no way this could be implemented in a backwards compatible fashion in Python due to the fact how pickle imports objects and how we refer to objects, but this is how it could work:

Tagging sys.modules

Currently the key for the items in sys.modules is the name of the module. In an ideal world, the keys would be tuples in the form (module_name, tag) where tag could be used for the following things:

  • specify a specific version of the library (like '1.0')
  • a secondary import of a library (like mimetypes import for library B)
  • a random ad-hoc identifier to enforce fresh imports (think about testsuites and benchmarks that need to work on clean imports of a library because of … well … shared state …)

How to express which tag to use?

# a string literal as tag
from sqlalchemy['0.6'] import create_engine

# the contents of a variable as tag
from zine.plugins[my_instance] import myrtle_theme

What if no tag is provided? No idea man.
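The import syntax above does not exist, but the (module_name, tag) lookup can be approximated with importlib in current Python 3. This is only a sketch under clear assumptions: import_tagged and the _tagged_modules registry are hypothetical names, only the requested module itself is duplicated (its own imports still go through sys.modules), and it only works for modules whose loader can simply execute them a second time.

```python
import importlib.util

# Private registry keyed by (module_name, tag) instead of the name alone.
_tagged_modules = {}


def import_tagged(name, tag):
    """Return a module object that is private to the given tag."""
    key = (name, tag)
    if key not in _tagged_modules:
        spec = importlib.util.find_spec(name)
        module = importlib.util.module_from_spec(spec)
        # Executes the module code into the new object without ever
        # registering it in sys.modules.
        spec.loader.exec_module(module)
        _tagged_modules[key] = module
    return _tagged_modules[key]


# Two "libraries" get their own copy of mimetypes and its shared state.
for_a = import_tagged("mimetypes", "library-a")
for_b = import_tagged("mimetypes", "library-b")
for_a.add_type("application/x-library-a", ".xyzdemo")

type_in_a = for_a.guess_type("report.xyzdemo")[0]
type_in_b = for_b.guess_type("report.xyzdemo")[0]
```

The change made through the first copy is invisible in the second, which is the isolation the tags are supposed to buy. Pickle and everything else that looks up modules by plain name would of course not know about any of this.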

What's your Point?

I guess … there is none. It shows a problem I have with Python and provides the first part of a solution. It explains why Zine is doing funny things and why there can only be one Zine instance per interpreter. It's some brainstorming I wanted to share with the world and maybe someone can use that to implement a new dynamic language that fixes that problem. It's not like that's a problem only Python has …

free VS free

written by Armin Ronacher, on Tuesday, July 14, 2009 21:09.

Seems like my favourite discussion is back. In the ring are two guys: on one side Zed Shaw, the developer of lamson and mongrel, on the other side Jacob Kaplan-Moss, Django's BDFL.

This time the discussion seems to be entitled "Because the only thing better than an arbitrarily restrictive license is an ambiguously restrictive license" [via twitter]. I won't warm up the discussion with new arguments (promised) but what I found most interesting about the discussion is Zed's blog post why he's using the (A/L)GPL. Basically what he's saying is that he does not want to be burned again like he was with Mongrel and uses the GPL to force people to contribute.

I'm not exactly sure how that supports freedom. I might be idealistic here, but what motivates me the most about the open source libraries I work on is how they are used. I got mails from developers in many companies that are using various Pocoo libraries internally and cannot contribute patches due to restrictions in the company structure. Every once in a while I get patches those developers craft in their free time and very often I don't get any. However the point is, that I can see people using my stuff which motivates.

I'm not making money with my libraries, but that's probably because I'm not a friend of selling code. I love to give the stuff away I'm working on, and get paid for support if one needs it. And so far this worked flawlessly for me.

Forcing people to freedom is not exactly my definition of being free.

So dear users: Use my stuff, have fun with it. Letting me know that you're doing so is the best reward I can think of. And if you can contribute patches, that's even better.