The Latest Downtime

From yesterday 18:00 until now, pocoo.org and all the related domains (including wiki.python.de and pygments.org) were down because we moved all the Xen instances to a new Hetzner server. However, the RAM usage by the instances is still unchanged, so there will be another small downtime in the next few days. And small is ~20 minutes. Because I expected the migration to take less time, I have to postpone the upcoming Jinja release until Saturday.

Additionally we’re struggling with a mod_wsgi bug that is probably not mod_wsgi’s fault. Under certain conditions a C extension seems not to release the GIL, with the result that one Apache process consumes all the processor power available.

Sorry for the inconvenience caused.

Update: resource relocation done (for the moment at least)

4 Responses to “The Latest Downtime”

  1. FWIW, if it is correct that it is the C extension module that is blocking or going into a tight loop without releasing the GIL, then technically this could also cause a problem under mod_python, or even in a standalone Python based web server. This is what makes the problem so strange, as one would expect to see those other hosting systems affected as well.

    In the case of mod_python, or embedded mode of mod_wsgi, one process no longer processing requests would cause Apache to go and create extra child processes, up to its max, in order to process new requests. Thus, although host performance would be reduced if something is running in a tight loop, the server would still keep functioning, at least until enough processes got stuck in this state that the machine just couldn’t handle the load. I have actually heard of some cases like this, so it may be the same issue.

    For a standalone Python web server running in a single process, then the whole server would just hang and only restarting it would fix it.

    The question thus is whether the issue is being seen with other hosting solutions, or whether for some reason the problematic C extension module is acting a bit differently when run under mod_wsgi.

    Anyway, on the presumption that the problems are caused by a C extension module not releasing the GIL, I have added code for mod_wsgi 2.0 which will detect a deadlock on the Python GIL and forcibly shut down the daemon process.

    http://groups.google.com/group/modwsgi/browse_frm/thread/1f1e8139123465d6

    This should help to at least recover from such a problem occurring with a problematic C extension module.

    Comment by Graham Dumpleton — Friday, November 16th, 2007 @ 12:09 am
  2. I upgraded the server to mod_wsgi trunk today and it successfully recovered from one of those deadlocks/gil problems/“funny things”. We also disabled the loading of the svn module and the problem still appeared, so the svn lib (unfortunately) is not to blame.

    Maybe some libraries have problems when run in a sub interpreter. That would explain why things behave differently when run under mod_wsgi.

    Comment by Armin Ronacher — Friday, November 16th, 2007 @ 12:12 am
  3. Good to hear the deadlock check worked in practice then.

    As for it being to do with a sub interpreter rather than the main interpreter, I don’t think this is the case, as the other site seeing this issue is definitely setting WSGIApplicationGroup to %{GLOBAL} and so forcing use of the main interpreter.

    The other site was also suspicious of ClearSilver and the database adapter. I haven’t completely ruled out yet something odd happening in mod_wsgi, but then it seems to only occur when using Trac.

    Comment by Graham Dumpleton — Friday, November 16th, 2007 @ 12:25 am
  4. We use Trac 0.11, so no ClearSilver. But both trac-hacks and pocoo.org use postgres :-/

    Comment by Armin Ronacher — Friday, November 16th, 2007 @ 9:35 am
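
For readers wondering how a deadlock check like the one discussed in the comments might work in principle, here is a minimal sketch in Python. It is not mod_wsgi's actual code, and the class name `GilHeartbeat`, its methods and the timeout value are all hypothetical. The idea assumed here is a heartbeat: a Python thread, which can only run while it holds the GIL, refreshes a timestamp at regular intervals, and a monitor that does not depend on the GIL treats a stale timestamp as a sign that the interpreter is stuck.

```python
import time


class GilHeartbeat:
    """Records when Python code last managed to run (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence counted as a stall
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        # Meant to be called periodically from a Python thread. If a
        # C extension never releases the GIL, this call stops happening.
        self.last_beat = self.clock()

    def is_stalled(self):
        # Meant to be evaluated by a monitor outside the GIL's reach
        # (a native thread or a supervising process in a real setup).
        return self.clock() - self.last_beat > self.timeout
```

In a real deployment the monitor would have to live outside the interpreter, since a pure Python monitor thread would be blocked by the very same stuck GIL. Once it sees a stall, it would shut the process down so that a fresh one can take over.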
