Friday, August 20, 2010

Python 3 and WSGI

I like Python.  I really do.  It's clean, easy to read, quick to write, and usually fun.  It's been our primary choice when developing for the web at work, and in its current state it fits that role exceptionally well.

Well, kind of.

You see, the WSGI standard that most Python web frameworks and libraries adhere to right now doesn't work in Python 3.x.  Why?  It has to do with how strings are implemented in 3.x vs 2.x.

In Python 2.x, the str data type was a series of bytes with methods that made it convenient to treat them as, well, strings.  There was also the unicode type that had many of the same methods.  One thing that Python 2.x did that was really annoying, though, was that it would implicitly change data back and forth between the two types on you.  This led to all kinds of hard-to-find bugs (I ran into TONS of them while working on my first project for my current employer, a web crawler).  Especially in web programming, there were times where it was unclear whether a value was str or unicode.

Enter Python 3.x, which was largely motivated by a desire to clean up weird bugs like the above in the language.  To fix the above problem, what they did was this: Python 2.x's unicode became Python 3.x's str, and Python 2.x's str became Python 3.x's bytes type (which I'll loosely call byte arrays).  Kind of.  The new type keeps many of the old str methods, but in the 3.x releases so far it has no string formatting and will not mix with str without an explicit conversion.  The Python core developers felt that letting byte arrays stand in for text the way the old str did would result in exactly the same problem; developers would inevitably get strs and byte arrays mixed up.
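
Here is a minimal Python 3 sketch of that split.  The variable names are made up for illustration; the point is only that text and bytes no longer mix on their own, so every crossing has to name an encoding.

```python
# Python 3: str is text, bytes is raw data; nothing converts implicitly.
text = "caf\u00e9"            # str (what Python 2.x called unicode)
raw = text.encode("utf-8")    # bytes; the encoding is chosen explicitly

try:
    combined = text + raw     # Python 2.x would attempt an implicit ASCII decode here
except TypeError:
    combined = None           # Python 3 refuses to guess

round_trip = raw.decode("utf-8")  # an explicit decode gets the text back
```

The four-character string becomes five bytes under UTF-8, which is exactly the kind of detail the old implicit conversions hid.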

This obviously affects anything that has to deal with strings, but it especially affects WSGI.  A lot of WSGI implementations seemed to rely on the implicit type changing behavior from before, and Python 3.x breaks that pretty hard.  So some of the people involved with the original WSGI spec got together and tried to propose new solutions.

From what I have been able to ascertain, there are three main camps:


  • Make everything native (that is, unicode) strings
  • Make everything byte arrays
  • Use a combination of them (usually bytes or strings in the header, and the body being the other)


There is also a fourth view, to petition the Python core developers to re-introduce the string methods on byte arrays, or at the very least create a new data type that does so.  This hasn't gotten much traction from either the web community or Python core.

Using native strings across the board sounds nice, except that HTTP isn't defined in terms of Unicode; on the wire it's bytes, and the headers are essentially ASCII.  So, there would have to be a conversion to a byte array at the response/request boundary, when the data is leaving or entering the server.  This could result in data loss, depending on the encoding used by the WSGI application(s).
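
To make the data-loss worry concrete, here is a small sketch.  The sample value and the choice of encodings are my own assumptions for illustration, not something any of the proposals mandates.

```python
# A native (unicode) string that a hypothetical application wants to send.
value = "na\u00efve \u2603"   # contains U+00EF and U+2603 (a snowman)

# UTF-8 can represent every character, so the round trip is lossless.
utf8_round_trip = value.encode("utf-8").decode("utf-8")

# Latin-1 cannot represent U+2603, so a strict encode fails outright...
try:
    value.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...and a lenient encode silently swaps the character for "?".
lossy = value.encode("latin-1", errors="replace").decode("latin-1")
```

Whether the boundary fails loudly or degrades quietly depends entirely on which encoding and error mode the server picks.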

And speaking of encoding, it seems unclear which would be the default.  There are a lot of Unicode encodings, and without a clear definition of which one is the 'default', it becomes hard for an implementation that relies on WSGI middleware-wrapping to keep them straight.
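
A quick sketch of why the missing default matters: the same bytes read under two different assumed encodings.  The sample word is arbitrary.

```python
# Bytes written by one layer that assumed UTF-8.
wire = "r\u00e9sum\u00e9".encode("utf-8")

as_utf8 = wire.decode("utf-8")       # the reading the writer intended
as_latin1 = wire.decode("latin-1")   # never raises, but yields mojibake
```

The Latin-1 decode does not fail, so a layer that guesses wrong just passes the garbled text along.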

That leads us to the byte array proposal.  This matches up with the HTTP spec quite well, and would put us back where we are now in Python 2.x, right?  Unfortunately, that's not quite true.  Again, because byte arrays don't behave like the old strings (no formatting, no mixing with text), you can't do real text work on them inside of your application without an explicit conversion into Unicode, which again runs into the problem of which encoding to use.

Compounding that, when using WSGI middleware, an implementation with bytes would have to decode-act-encode in every single middleware application, and that cost could certainly add up.
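
As a rough sketch of that per-layer cost, assume a bytes-only convention where status, headers, and body are all bytes.  Everything here (uppercase_middleware, hello_app, fake_start_response, the UTF-8 default) is hypothetical and not taken from any proposal.

```python
def uppercase_middleware(app, encoding="utf-8"):
    """Hypothetical middleware for an assumed bytes-only convention."""
    def wrapped(environ, start_response):
        # Assumes each chunk holds complete characters in the given encoding.
        for chunk in app(environ, start_response):
            text = chunk.decode(encoding)   # decode
            text = text.upper()             # act on real text
            yield text.encode(encoding)     # encode again
    return wrapped

def hello_app(environ, start_response):
    # Bytes everywhere: an assumption about what "all bytes" might look like.
    start_response(b"200 OK", [(b"Content-Type", b"text/plain")])
    return [b"hello ", b"w\xc3\xb6rld"]

calls = []
def fake_start_response(status, headers):
    # Stand-in for a server's start_response; it only records the call.
    calls.append((status, headers))

# Two layers means two full decode/encode passes over the same body.
stacked = uppercase_middleware(uppercase_middleware(hello_app))
body = b"".join(stacked({}, fake_start_response))
```

Each wrapper repeats the round trip because it cannot know whether the layer below already decoded anything.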

Finally, there are the mixed approaches.  These carry the same problems as the first two approaches, along with the added confusion of working with two different Python types in a single response or request.  And some proposals have even had mixed data types inside of the WSGI environ dictionary, something I'm sure would be a headache.
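
For a feel of the two-types-at-once problem, here is a sketch of one possible mixed convention: native strings for the status and headers, bytes for the body.  The application, the fake start_response, and the split itself are assumptions for illustration only.

```python
def mixed_app(environ, start_response):
    # Hypothetical mixed convention: str for status/headers, bytes for body.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return ["sn\u00f6".encode("utf-8")]

captured = {}
def fake_start_response(status, headers):
    # Stand-in for a server's start_response; it only records the arguments.
    captured["status"] = status
    captured["headers"] = headers

body_chunks = list(mixed_app({}, fake_start_response))

# Collect the type of every value that crossed the boundary.
types_seen = {type(captured["status"])}
for name, val in captured["headers"]:
    types_seen.update({type(name), type(val)})
types_seen.update(type(chunk) for chunk in body_chunks)
```

Anything sitting between the server and the application has to handle both types and remember which one belongs where.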

So where does that put us?

Right now, I don't know of any concrete attempts to build any code implementing any of these proposals.  None of the library authors (Paste, BFG, Werkzeug) want to take the time to do a large-scale conversion for fear it would be wasted effort if that particular proposal lost out.  At least, that's my understanding from digging through their blogs and the mailing lists.

Everyone agrees that a new standard is necessary for Python 3.  After all, 2.7 was released last month, and it's the last of the 2.x line.  Sure, it will still work, but it's not going to receive the new features that future 3.x versions will, like the planned Unladen Swallow merge.

It's hard to predict what's going to happen.  My fear is that either something will just get implemented without much input, or a PEP will be accepted just to break the deadlock, and we'll end up with a problematic standard.

Maybe that's everyone else's fear, too.
