Pros and Cons about Python 3
I briefly expressed my disagreement with the Python 3 development decisions earlier, so I want to elaborate on that a bit. Having previously addressed some of the problems I have with Python 3, I took the time to create a list of things that were solved in a way other than I expected or hoped.
Let's start with the biggest grief of mine...
Unicode Support
If you look at any of the pocoo libraries, they all use unicode. In fact, Jinja2 and Werkzeug even enforce unicode, so you can't use them with byte strings internally unless you do the encode dance. Why? Because I believe in unicode, which is not too surprising: German, the official language of Austria, uses some non-ASCII letters, and yet tons of deployed systems force me to substitute umlauts with Latin letters just because someone had a limited horizon.
But the unicode world is complex and Python does not care about unicode too much, neither in Python 2 nor in Python 3. So what does not work about unicode in Python? In German, for example, there are words like "Fuß" (which means foot). The last letter there is a so-called "Scharfes Es" or "Eszett". The former means "sharp s" and is usually represented as an "ſs" ligature; the latter is an "ſz" ligature and is used in blackletter fonts (the letter "ſ" is no longer in use but worked similarly to a Latin "s"). Another common way was using an "ſs" ligature. Because this letter never occurs at the beginning of a word, there was never an uppercase character for it (one was introduced lately, but nobody uses it). However, it is pretty common to use title case or uppercase for emphasis, so there is a need for that letter to exist in uppercase. The common replacement you see is a doubled "s" or "sz". So "Fuß" becomes "FUSS", "Maße" becomes "MASZE" etc. There are some variations, but basically it means that one letter becomes two.
However, that does not work in Python. The Python unicode implementation cannot do two things: it cannot replace one letter with two when changing the case, and it does not accept locale information for character mappings. The latter is necessary for languages like Turkish, where an uppercase "I" is lowercased to "ı" and not "i". (And I will not complain about the shared state of the locale library, which of course stayed in Python 3.)
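To see where a given interpreter stands on both points, a small check like the following can help. It is only a sketch, and the results depend on the version: the values noted in the comments assume CPython 3.3 or later, where str.upper() applies the one-to-many mapping, while Python 2 and the early Python 3 releases left "ß" unchanged. The Turkish case is not handled by the string methods in any version, because they take no locale argument.

```python
# Case mapping check. Expected values assume CPython 3.3+;
# Python 2's unicode.upper() left "ß" untouched.
word = "Fuß"

upper = word.upper()
print(upper, len(word), len(upper))  # one letter became two on 3.3+

# The string methods take no locale, so Turkish rules cannot be requested:
# "I" always lowercases to "i", never to dotless "ı" (U+0131).
lower_i = "I".lower()
print(lower_i == "i")       # True
print(lower_i == "\u0131")  # False
```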
Another problem is that Python uses UCS2 or UCS4 internally and that shines through. So if you have an UCS2 build, len() called on strings does not give you the number of characters in the string, but the number of UCS2 interpreted characters which might not be the same. In fact, every character outside the basic plane will be counted wrongly. Because UTF-16 (UCS2 with support for surrogate pairs, which allow you to use characters outside the basic plane) is a variable-length encoding, it has the same problem as UTF-8 as an internal encoding: it makes slicing a nontrivial operation. Another problem arises: binary extensions have to be compiled for UCS2 and UCS4 Pythons separately. And last time I checked, setuptools did not allow you to publish both builds on PyPI and pull the correct one. In fact, by default the filename carries no information that would make it possible to provide both extensions.
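The mismatch is easy to reproduce by counting UTF-16 code units by hand. This is a sketch that assumes a current CPython (3.3 or later), where the UCS2/UCS4 build distinction no longer exists and len() counts code points. The second number is what len() would have reported on a UCS2 build.

```python
# Count code points versus UTF-16 code units for a string that
# contains a character outside the Basic Multilingual Plane.
text = "a\U0001D11E"  # "a" followed by MUSICAL SYMBOL G CLEF (U+1D11E)

code_points = len(text)  # CPython 3.3+ counts code points
utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per code unit

print(code_points)  # 2
print(utf16_units)  # 3: the clef needs a surrogate pair
```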
So what they did "improve" with Python 3 was making unicode strings the default. And they did that in a very backwards incompatible and IMO problematic way: they degraded bytestrings from strings to glorified integer arrays and enforced unicode on non-unicode protocols.
Just to give you an example: when you iterate over bytestrings in Python 2, the iterator yields a bytestring of length 1 for each character, containing that character. While I was always a harsh opponent of strings being iterable, it was something everybody relied on. In Python 3, bytestrings are bytes objects, which are basically arrays of integers that look like strings in the repr but yield the bytes as integers. So if you had code that relied on the iteration returning chars, your code will break. And yes, Python 3 breaks backwards compatibility, but this is something that 2to3 does not pick up and most likely you will not either. At least in my situation it took me a long time to track down the problem, because some other implicit conversion was happening at another place.
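The difference fits in a few lines. This is a minimal sketch for Python 3. The last part shows one possible way to get the Python 2 behavior back by slicing, which is my own workaround and not something 2to3 generates.

```python
data = b"abc"

# Iterating over bytes yields integers in Python 3 ...
items = list(data)
print(items)  # [97, 98, 99]

# ... and so does indexing, while slicing still returns bytes.
first_by_index = data[0]
first_by_slice = data[0:1]
print(first_by_index, first_by_slice)  # 97 b'a'

# One way to get length-1 bytes objects back, as Python 2 iteration did:
chars = [data[i:i + 1] for i in range(len(data))]
print(chars)  # [b'a', b'b', b'c']
```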
Now at the same time unicode strings continue to yield unicode strings of length 1 on iteration. That means suddenly bytes and unicode objects have different semantics, making it impossible to provide one interface for both bytestrings and unicode. There were tons of places where libraries accepted both unicode and bytes in Python 2 because it "made sense". A good example is URL handling. URLs are encodingless. Some schemes hint at a default charset, but in reality no such thing exists. However, applications themselves knew what encoding the URL would use, so they happily passed unicode strings to the URL encoder, which would use the application URL encoding to ensure the URL was properly quoted. In Django, Werkzeug and probably many more libraries, if you passed unicode to the URL encoder, it would by default encode to UTF-8. However, if the URL came from another source with unknown encoding, it was possible to transparently pass the URL on. The decoding of URLs from somewhere else usually happened encoding-less as well. Many applications, for example, check the referrer of the page to see if the user came from a search engine and, if so, grab the search keywords from the referrer and highlight them on the current page. In that situation you can keep a list of the known encodings of referrers and decode the referrer URL accordingly. In Python 3 the URL module in many situations uses a UTF-8 default encoding, or requires the URL to be UTF-8 encoded, or provides a completely different and limited interface for byte URLs.
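A small sketch of what that looks like with urllib.parse in Python 3. The referrer value "M%FCnchen" is made up for illustration and assumed to be percent-encoded Latin-1.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Quoting text: UTF-8 is the default, another charset must be named.
quoted_utf8 = quote("ü")
quoted_latin1 = quote("ü", encoding="latin-1")
print(quoted_utf8, quoted_latin1)  # %C3%BC %FC

# Hypothetical query value from a referrer, percent-encoded as Latin-1.
raw = "M%FCnchen"

# Decoding with the UTF-8 default replaces the invalid byte with U+FFFD.
lossy = unquote(raw)

# Decoding with the charset the application knows about works.
known = unquote(raw, encoding="latin-1")

# Staying encoding-less means dropping down to bytes explicitly.
as_bytes = unquote_to_bytes(raw)

print(repr(lossy), known, as_bytes)
```

The text functions and the bytes function have different return types, so a wrapper that wants to pass unknown URLs through untouched has to branch between them.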
Sure, it might be sufficient for 98% of all users, but there are non obvious implications: a library that wraps urllib/urlparse and whatnot cannot reuse the same code for Python 2 and 3. When I started supporting IRIs in Werkzeug (basically the URL successors with proper encoding, already somewhat used by browsers) I chose to abandon the urllib module altogether and write my own simple decoder to make it easier to later port that thing over to Python 3 without changed semantics.
There are other examples as well: filesystem access. Python 3 assumes your filesystem has an encoding, but many Linux systems do not. In fact, not even OS X enforces an encoding. You can happily use fopen to create a file whose name does not look like UTF-8 at all. And even there, the situation is a lot more complex because on OS X, different unicode normalization rules apply to the filesystem than to the applications themselves. So even if you are using Python 3, you will still have to manually convert the filename to a different normalization form when you want to compare filenames on the filesystem.
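A sketch of both issues follows. The helper name same_filename and the sample names are made up for illustration, and which normalization form a given filesystem really stores is an assumption you would have to check per platform.

```python
import unicodedata

composed = "M\u00fcnchen.txt"     # "ü" as one code point (NFC)
decomposed = "Mu\u0308nchen.txt"  # "u" + combining diaeresis (NFD)

# Same text on screen, but the strings do not compare equal.
naive_equal = composed == decomposed
print(naive_equal)  # False

def same_filename(a, b):
    """Compare two names after bringing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_filename(composed, decomposed))  # True

# Names that are not valid UTF-8 can still be carried through str
# with the surrogateescape error handler and restored byte for byte.
raw_name = b"\xff\xfe\xee"
as_text = raw_name.decode("utf-8", "surrogateescape")
round_trip = as_text.encode("utf-8", "surrogateescape")
print(round_trip == raw_name)  # True
```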
When I looked at the unicode stuff in Python 3 I did not see much value over nicely written libraries in Python 2 that enforced unicode usage. In fact, the update makes it especially hard to convert such libraries (that required unicode) to Python 3, because 2to3 assumes you are using byte strings and not unicode.
The case of "super" and other Quirks
I wonder how this slipped past everybody and why Guido is okay with that, but the new super non-keyword-keyword in Python 3 is just wrong. The fact that this code works is alarming already:
class Foo(Bar):
    def foo(self):
        super().foo()
Assuming you know how function invocation used to work in Python, the fact that this works smells. But it gets a lot worse, because this code does not work:
_super = super

class Foo(Bar):
    def foo(self):
        _super().foo()
That's just wrong. The use of the name of a global function (which btw I can reassign!) should never affect the bytecode generated; that's what keywords are for! Python also did not optimize while True loops because someone could reassign True, but suddenly it's sort of okay to do that. Also, why have self explicit when some magic in the compiler is now suddenly able to inject new symbols into the code? From that point onwards it is a one-liner to make self implicit, and suddenly there is no reason for self being the explicit first parameter any more.
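Here is a self-contained version that can be run as a script on CPython 3 to watch this happen. The class names Works, Breaks and Explicit are mine and only there for illustration. The last two lines show that the difference is already baked into the compiled function objects.

```python
class Bar:
    def foo(self):
        return "Bar.foo"

_super = super  # plain alias of the builtin

class Works(Bar):
    def foo(self):
        return super().foo()   # compiler sees the name "super"

class Breaks(Bar):
    def foo(self):
        return _super().foo()  # no "super" here, so no __class__ cell

class Explicit(Bar):
    def foo(self):
        return _super(Explicit, self).foo()  # explicit form needs no magic

print(Works().foo())     # Bar.foo
print(Explicit().foo())  # Bar.foo

error = None
try:
    Breaks().foo()
except RuntimeError as exc:
    error = exc
print(type(error).__name__)  # RuntimeError

# The compiler only adds the hidden closure variable when it sees "super".
print(Works.foo.__code__.co_freevars)   # ('__class__',)
print(Breaks.foo.__code__.co_freevars)  # ()
```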
From what I remember, this was done to optimize the code. That's true, they do optimize something, but at the same time a call to a global function from a method in a loop will do a dict lookup every time it is invoked. It has to: another thread could reassign the global function, and if the old one had been pulled into a local "register" (fastlocal or similar), the code would no longer call the new function. And if you think "that's undefined behavior", I beg you to look into the mimetypes library. It shows that no Python implementation anywhere could be conforming if it optimized global lookups away.
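The late lookup is observable without any threads. This is a minimal sketch meant to be run as a script; the function names are hypothetical.

```python
def helper():
    return "old"

def caller():
    # "helper" is looked up in the module globals on every call.
    return helper()

before = caller()

def replacement():
    return "new"

# Rebind the global name, as another thread or module could do.
helper = replacement
after = caller()

print(before, after)  # old new
```

An implementation that cached the first helper inside caller would print "old old" here, and code that relies on rebinding module-level functions would silently stop working.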
What I wished for Python 3 was to remove really useless dynamic features like pulling in functions on every call to allow more compiler/interpreter optimizations, easier multithreading support and everything.
What I was also wishing for in Python 3 was a better interpreter interface, and a reworked GIL or no GIL at all. I would love to be able to use multiple Python interpreters per application, some sort of reentrant interpreter. That would simplify embedding Python into other applications and expand the possibilities. Just look at how V8 works internally to get an idea of what I was hoping for. I also wished there was builtin support for freezing objects (no longer a frozenset, just freeze the set, and then finally be able to do the same for lists etc.). Builtin support for proxying would be nice too. The hacks that thread-local libraries and the weakref module use to proxy objects are just wrong, wrong, wrong (and unreliable as well). Imports are still horribly implemented, and the standard library is still inconsistent or limited (and now even broken; cgi.FieldStorage in Python 3, anyone?).
What's cool about Python 3?
What I really like is the new nonlocal statement. I was longing for that for a long, long time. True and False being keywords (that should have been in there for longer), finally easier division semantics, improved metaclasses, class decorators, no more classic classes, dict views, the builtins returning iterables instead of lists, etc. (Though they should have added improved repr support that would allow me to introspect those iterators and freeze them at the same time [which I guess would once again require a cleaner and improved interpreter design to get right].)
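For anyone who has not used it yet, this is roughly what nonlocal buys you: a closure that rebinds a variable of the enclosing function, without the mutable-list workaround Python 2 needed. The names are just for illustration.

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count  # rebind the variable of the enclosing function
        count += 1
        return count

    return increment

counter = make_counter()
results = [counter(), counter(), counter()]
print(results)  # [1, 2, 3]
```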
Conclusion
But that does not justify a new version of Python. Instead they could have added a strict mode and let the old code run emulated. They could have expanded that strict mode to allow access to new features of the language, added support for compiler optimizations and so much more. (JavaScript is currently getting such a strict mode.)
So yes, I am disappointed with how Python 3 worked out. They could have done so much more, or skipped Python 3 altogether and put the cool stuff into an optional strict mode in Python 2.
Minor nitpick regarding OS X:
While you can give arbitrary data to fopen(), HFS+ still uses UTF-16, and any data that isn't valid UTF-16 is percent-encoded:
$ echo hi > $'\xff\xfe\xee'
$ cat $'\xff\xfe\xee'
hi
$ ls
%FF%FE%EE
$ cat %FF%FE%EE
hi
You could never run into a situation where you couldn't decode a file name.
— Brodie Rao on Thursday, January 7, 2010 17:36 #
Frankly I still don't understand why it is necessary to use the nonlocal directive at all.
Why can't it just use proper scoping as JavaScript does? That is what it usually does as long as you are only reading from the variable ... yet as soon as you assign to a name all hell breaks loose and you need to specify it as nonlocal etc ... it's just so stupid
— Toot on Thursday, January 7, 2010 19:13 #
Slicing is actually trivial in UTF-16, specifically because we use code units for indexing rather than scalar values. I would prefer specialized methods for iterating over scalar values/grapheme clusters/words/etc, or enumerating their indexes, and perhaps even deprecating the default iteration.
Every modern URI scheme requires UTF-8 encoding. They can't go back and fix http, but UTF-8 is the de facto standard there too. Ultimately, it is text, and text is meant to be displayed to the user, so you need the encoding. Maybe Python should be a little more tolerant of bad URLs, but not at the cost of good URLs.
Windows doesn't enforce the encoding either. It stores 16 bit units and you can stick any garbage you want in there. Ultimately though, a filesystem's names are text (otherwise you'd use simple numeric identifiers), and the APIs should focus on using it as such.
Some of the key filesystem APIs do accept bytes arguments though, and switch to returning bytes in that case. More could be added if needed.
— Adam Olsen on Thursday, January 7, 2010 19:25 #
@2: JavaScript does not have proper scoping. JavaScript just does it the other way round. Instead of saying nonlocal you have to say var when you mean the opposite. And most of the time I want it the other way round so that way I have to type less. Python 3 wins over JavaScript there.
— Armin Ronacher on Thursday, January 7, 2010 19:31 #
Yes, in Python 3 strings are always unicode. And this is a "pro". Your first point is about some features that the builtin unicode support lacks. But it's lacking in Python 2 as well, so that's not a question of 2 vs 3.
Your complaint about _super() makes no sense. Yes, super() works only if it's called super, and not _super, but if you have renamed it (for some incomprehensible reason), then you know you have renamed it, and can use _super(Foo, self) instead, which still works. OK, you may think that super() shouldn't be allowed, and you should have to write super(Foo, self) (and you would have a point, super() is a bit implicit), but that you can't reassign super and still use it implicitly is a nonsense argument, and not a drawback of Python 3, as you can simply either use it explicitly, and/or not reassign super. It's not like there is ever any reason to do so.
— Lennart Regebro on Thursday, January 7, 2010 20:20 #
@5: 'Yes, in Python 3 strings are always unicode. And this is a "pro".'. In an ideal world we would use unicode everywhere, turns out, we do not. Changing the semantics of byte strings then makes it a lot harder than it has to, to support those who are not as fortunate to use unicode (eg: HTTP).
Python was known for not being magical, super is magical. It means you are unable to implement something like super() in either an extension or within Python. I can't see why that would be a good thing.
— Armin Ronacher on Thursday, January 7, 2010 20:32 #
— matelot on Thursday, January 7, 2010 20:49 #
@6: Pragmatically, the new way of using super() is more useful than the old way, which was way too verbose for common usage. I agree that it's too magical though (and I feel the same way about locals() and globals()).
In general, I think Python's evolution would go better if they had (in hindsight) focused on long transition periods, rather than simply delaying breaking anything until one giant breakage. Declare to always have 4 major releases with both options available, then remove the old one for the 5th. For example, new-style classes were added in python 2.2, all the way back in 2001. python 2.6 could remove it, in 2006. Some pain for python's developers, but better than the alternatives.
— Adam Olsen on Thursday, January 7, 2010 22:26 #
Ooops, python2.6 was released in 2008. Still applies though.
— Adam Olsen on Thursday, January 7, 2010 22:30 #
@8: locals() and globals() can be implemented in Python itself via getframe:
locals = lambda: sys._getframe(1).f_locals
Emulating super() is not possible, however, because unless there is a call to a global called super, the required code (which would inject an implicit __class__ local into the local frame) is not generated.
— Armin Ronacher on Thursday, January 7, 2010 23:25 #
Chris Lenz has a more detailed writeup about Unicode in Python - see www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
It would be nice to see some "standard" libraries for Unicode in Python to make it actually usable.
On the GIL side I develop C extensions and love the concept of the GIL. I don't have to worry about objects being mangled in another thread while I am working on them. I then release the GIL when it is needed, and again only have to worry about object changes immediately after the GIL is reacquired.
If the GIL was not there then writing extensions would be a nightmare. You would have to be taking locks all over the place (imagine something as simple as looking up a value in a dictionary).
I'd also like to see pypi significantly improved. Binary extensions have many other issues such as where shared libraries they link to are found.
Finally I think Python should switch to free compilers on Windows and make it easy to download a free one for users to build extensions themselves. Requiring random versioned Microsoft compilers that are not free (in most senses of the word) and that disappear from Microsoft's web site on a whim is not user friendly!
— Roger on Friday, January 8, 2010 0:09 #
@11: For Python 2.6 and up you can use the free-as-in-beer "2008 Express Edition" to compile extensions. Also, it was always possible to compile extensions using MinGW.
— Waldemar on Friday, January 8, 2010 1:27 #
@12/Waldemar: You apparently forget Microsoft's history. They randomly make Express Editions appear and then take them away later (again at random). This is generally not a problem if you only run the latest and greatest Python. However it turns out to be a way bigger problem if you have older Python versions around since they don't offer the older Express Editions.
And yes I do compile my extension with MinGW. Unfortunately Python itself breaks that every now and again. For example I was able to compile my extension with 3.1.0 and MinGW just fine but it wouldn't work any more with 3.1.1. (I may have some confusion over what numbers worked and didn't. I produce Windows binaries of my extension for every Python version from 2.3 to 3.1 and had been using MinGW for all of them until something was broken in distutils for the 3.x versions.)
The MinGW folks also explicitly claim that they do not support any MS C runtime except what corresponds to version 6. It just so happens to work with more recent versions but that is not by design.
— Roger on Friday, January 8, 2010 6:41 #
"So if you have an UCS2 build, len() called on strings does not give you the number of characters in the string, but the number of UCS2 interpreted characters which might not be the same."
You mean UCS2 code points, there is no such thing as "interpreted characters".
— Lorenzo Gatti on Friday, January 8, 2010 17:27 #
"In an ideal world we would use unicode everywhere, turns out, we do not."
Wrong. In an ideal world we would use unicode for text, and something that is not a string for binary data. Which is what Python 3 does.
"Changing the semantics of byte strings then makes it a lot harder than it has to, to support those who are not as fortunate to use unicode (eg: HTTP)."
That's plain incorrect.
— Lennart Regebro on Friday, January 8, 2010 18:04 #
@10: Just because you can do it doesn't mean you should. These functions violate all the normal rules about encapsulation. The only reason they're accepted is that the standard usage doesn't NEED to violate encapsulation; it's just how the implementation and API happen to do it.
For instance, making super a keyword would only be a superficial change to most usage, but would eliminate the dynamic scoping.
— Adam Olsen on Friday, January 8, 2010 21:07 #
@15: HTTP headers are iso-8859-15 which is not unicode.
— Armin Ronacher on Saturday, January 9, 2010 0:50 #
@17: ISO 8859-15 is a subset of unicode. The text it's storing can be considered unicode.
Besides that, RFC 2616 specifies ISO 8859-1, and you shouldn't use anything outside of ASCII for headers anyway.
— Adam Olsen on Sunday, January 10, 2010 3:28 #
@18: With that logic ASCII can be considered Unicode.
— Armin Ronacher on Sunday, January 10, 2010 13:16 #
@19: Why yes, the ASCII character set is a subset of the Unicode character set. That's why it can be displayed in unicode-based apps. Otherwise web browsers and the like would need to invent their own meta-character set that could cover both the unicode character set and the hypothetical incompatible character set.
— Adam Olsen on Wednesday, January 13, 2010 3:06 #
What server is required to run Python? My site www.gisearch.com is in PHP, please advise what I can do with Python.
— Vet on Thursday, January 21, 2010 10:10 #
Actually my biggest Python project to date uses 2 out of those 3 things. I'm talking about moobot, an IRC bot that I'm pretty much now the sole maintainer of. It uses a database (MySQL or PostgreSQL, your choice) rather extensively in a huge portion of its modules, and its performance is definitely more than adequate for an IRC bot. I'm not sure what you mean by "GUI DB applications" however ... could you clarify? And as far as CGI stuff, I've created several CGI scripts that give some statistics for moobot, all in Python as well. So, it works for what you need.
— 3Dfabrique on Monday, January 25, 2010 2:44 #