Programming Python (121 page)

Authors: Mark Lutz

Unicode, Internationalization, and the Python 3.1 email Package

Now that I’ve shown you how “cool” the email package is, I
unfortunately need to let you know that it’s not completely operational
in Python 3.1. The email package works as shown for simple messages,
but is severely impacted by Python 3.X’s Unicode/bytes string dichotomy
in a number of ways.

In short, the email package in Python 3.1 is still somewhat coded to
operate in the realm of 2.X str text strings. Because these have
become Unicode in 3.X, and because some tools that email uses are now
oriented toward bytes strings, which do not mix freely with str, a
variety of conflicts crop up and cause issues for programs that depend
upon this module.

At this writing, a new version of email is being developed which will
handle bytes and Unicode encodings better, but the going consensus is
that it won’t be folded back into Python until release 3.3 or later,
long after this book’s release. Although a few patches might make their
way into 3.2, the current sense is that fully addressing the package’s
problems appears to require a full redesign.

To be fair, it’s a substantial problem. Email has historically
been oriented toward single-byte ASCII text, and generalizing it for
Unicode is difficult to do well. In fact, the same holds true for most
of the Internet today—as discussed elsewhere in this chapter, FTP, POP,
SMTP, and even webpage bytes fetched over HTTP pose the same sorts of
issues. Interpreting the bytes shipped over networks as text is easy if
the mapping is one-to-one, but allowing for arbitrary
Unicode encoding in that text opens a Pandora’s box of
dilemmas. The extra complexity is necessary today, but, as email
attests, can be a daunting task.

Frankly, I considered not releasing this edition of this book
until this package’s issues could be resolved, but I decided to go
forward because a new email package may be years away (two Python
releases, by all accounts). Moreover, the
issues serve as a case study of the types of problems you’ll run into in
the real world of large-scale software development. Things change over
time, and program code is no exception.

Instead, this book’s examples provide new Unicode and
Internationalization support but adopt policies to work around issues
where possible. Programs in books are meant to be educational, after
all, not commercially viable. Given the state of the email package
that the examples depend on,
though, the solutions used here might not be completely universal, and
there may be additional Unicode issues lurking. To address the future,
watch this book’s website (described in the Preface) for updated notes
and code examples if/when the anticipated new email package appears.
Here, we’ll work with what we have.

The good news is that we’ll be able to make use of email in its
current form to build fairly
sophisticated and full-featured email clients in this book anyhow. It
still offers an amazing number of tools, including MIME encoding and
decoding, message formatting and parsing, Internationalized headers
extraction and construction, and more. The bad news is that this will
require a handful of obscure workarounds and may need to be changed in
the future, though few software projects are exempt from such
realities.

Because email’s limitations have implications for later email code in
this book, I’m going to
quickly run through them in this section. Some of this can be safely
saved for later reference, but parts of later examples may be difficult
to understand if you don’t have this background. The upside is that
exploring the package’s limitations here also serves as a vehicle for
digging a bit deeper into the email package’s interfaces in general.

Parser decoding requirement

The first Unicode issue in Python 3.1’s email package is nearly a
showstopper in some contexts: the bytes strings of the sort produced
by poplib for mail fetches must be decoded to str prior to parsing
with email. Unfortunately, because there may not be enough information
to know how to decode the message bytes per Unicode, some clients of
this package may need to be generalized to detect whole-message
encodings prior to parsing; in worst cases other than email that may
mandate mixed data types, the current package cannot be used at all.
Here’s the issue live:

>>> text                                    # from prior example in this section
'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ver...'
>>> btext = text.encode()
>>> btext
b'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ve...'
>>> msg = Parser().parsestr(text)           # email parser expects Unicode str
>>> msg = Parser().parsestr(btext)          # but poplib fetches email as bytes!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\email\parser.py", line 82, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
TypeError: initial_value must be str or None, not bytes
>>> msg = Parser().parsestr(btext.decode())           # okay per default
>>> msg = Parser().parsestr(btext.decode('utf8'))     # ascii encoded (default)
>>> msg = Parser().parsestr(btext.decode('latin1'))   # ascii is same in all 3
>>> msg = Parser().parsestr(btext.decode('ascii'))
This is less than ideal, as a bytes-based email would be able to
handle message encodings more directly. As mentioned, though, the
email package is not really fully functional in Python 3.1, because of
its legacy str focus, and the sharp distinction that Python 3.X makes
between Unicode text and byte strings. In this case, its parser should
accept bytes and not expect clients to know how to decode.
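
To connect this to a real fetch, here is a minimal sketch of the
bytes-to-str step a client has to perform itself. The list of bytes
lines is a made-up stand-in for the line list in the
(response, lines, octets) result of poplib's retr call, and the
Latin-1 choice is only an assumption for illustration:

```python
from email.parser import Parser

# Simulated stand-in for the bytes lines a POP fetch hands back;
# a real client would take these from poplib instead.
fetched_lines = [
    b'From: Lancelot',
    b'Subject: Quest',
    b'',
    b'Line?...',
]

raw_bytes = b'\n'.join(fetched_lines)     # still bytes: parsestr rejects this
mail_text = raw_bytes.decode('latin1')    # assumed encoding, chosen by the client
msg = Parser().parsestr(mail_text)        # the parser gets the str it expects

print(msg['From'], '|', msg.get_payload())
```

The decode call is the only place where an encoding name appears, which
is exactly the decision the parser leaves to its clients.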

Because of that, this book’s email clients take simplistic approaches
to decoding fetched message bytes to be parsed by email. Specifically,
full-text decoding will try a user-configurable encoding name, then
fall back on trying common types as a heuristic, and finally attempt
to decode just message headers.
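
The three-step policy just described might be sketched roughly as
follows. This is not the code used by the book's clients: the function
name decode_full_text, its parameters, the fallback list, and the
placeholder body text are all hypothetical, chosen only to show the
order of attempts:

```python
def decode_full_text(raw_bytes, user_encoding='utf8', fallbacks=('ascii', 'utf8')):
    """
    Hypothetical helper: return (text, how) for raw message bytes.
    Tries the user's encoding, then each fallback, then headers only.
    """
    for name in (user_encoding,) + tuple(fallbacks):
        try:
            return raw_bytes.decode(name), name
        except (UnicodeDecodeError, LookupError):   # bad bytes or unknown codec name
            continue

    # Last resort: keep only the header block, which is normally ASCII,
    # and replace stray non-ASCII bytes instead of failing again.
    normalized = raw_bytes.replace(b'\r\n', b'\n')
    header_bytes = normalized.split(b'\n\n', 1)[0]
    header_text = header_bytes.decode('ascii', errors='replace')
    return header_text + '\n\n[message body could not be decoded]', 'headers-only'


text, how = decode_full_text(b'Subject: hi\n\nbody')
print(how)
```

One design point worth noting: Latin-1 maps every possible byte to a
character, so adding it to the fallback list means the headers-only
step is never reached. Whether that is desirable depends on whether a
possibly wrong decoding is better than a missing body.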

This will suffice for the examples shown but may need to be
enhanced for broader applicability. In some cases, encoding may have
to be determined by other schemes such as inspecting email headers (if
present at all), guessing from bytes structure analysis, or dynamic
user feedback.
Adding such enhancements in a robust fashion is likely too complex to
attempt in a book’s example code, and it is better performed in common
standard library tools in any event.
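
As one illustration of the header-inspection idea, the sketch below
scans the raw bytes for a charset parameter without parsing the
message. The helper name sniff_charset, the pattern, and the Latin-1
default are assumptions made for this example, and the approach is a
crude heuristic rather than a robust solution: a multipart message can
declare a different charset for each part, and the first match says
nothing reliable about the message as a whole.

```python
import re

# Matches charset=name or charset="name" anywhere in the raw bytes.
CHARSET_PATTERN = re.compile(br'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.IGNORECASE)

def sniff_charset(raw_bytes, default='latin1'):
    """Hypothetical helper: guess a decoding name from the first charset= found."""
    match = CHARSET_PATTERN.search(raw_bytes)
    if match is None:
        return default                       # nothing declared: fall back
    return match.group(1).decode('ascii')    # the pattern admits ASCII names only


raw = b'Content-Type: text/plain; charset="utf-8"\n\nbody'
print(sniff_charset(raw))
```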

Really, robust decoding of mail text may not be possible today
at all, if it requires headers inspections—we can’t inspect a
message’s encoding information headers unless we parse the message,
but we can’t parse a message with 3.1’s email package unless we
already know the encoding. That is, scripts may need to parse in order
to decode, but they need to decode in order to parse! The byte strings
of poplib and Unicode strings of email in 3.1 are fundamentally at
odds. Even
within its own libraries, Python 3.X’s changes have created a
chicken-and-egg dependency problem that still exists nearly two years
after 3.0’s release.

Short of writing our own email parser, or pursuing other
similarly complex approaches, the best bet today for fetched messages
seems to be decoding per user preferences and defaults, and that’s how
we’ll proceed in this edition. The PyMailGUI client of Chapter 14, for
instance, will allow Unicode encodings for full mail text to be set on
a per-session basis.

The real issue, of course, is that email in general is
inherently complicated by the presence of arbitrary text encodings.
Besides full mail text, we also must consider Unicode encoding issues
for the text components of a message once it’s parsed—both its text
parts and its message headers. To see why, let’s
move on.

Note

Related Issue for CGI scripts: I should also note that the full text
decoding issue may not be as large a factor for email as it is for
some other email package clients. Because the original email standards
call for ASCII text and require binary data to be MIME encoded, most
emails are likely to decode properly according to a 7- or 8-bit
encoding such as Latin-1.

As we’ll see in Chapter 15, though, a more insurmountable and related
issue looms for server-side scripts that support CGI file uploads on
the Web. Because Python’s CGI module also uses the email package to
parse multipart form data; because this package requires data to be
decoded to str for parsing; and because such data might have mixed
text and binary data (including raw binary data that is not
MIME-encoded, text of any encoding, and even arbitrary combinations of
these), these uploads fail in Python 3.1 if any binary or incompatible
text files are included. The cgi module triggers Unicode decoding or
type errors internally, before the Python script has a chance to
intervene.

CGI uploads worked in Python 2.X, because the str type represented
both possibly encoded text and binary data. Saving this type’s content
to a binary mode file as a string of bytes in 2.X sufficed for both
arbitrary text and binary data such as images. Email parsing worked in
2.X for the same reason. For better or worse, the 3.X str/bytes
dichotomy makes this generality impossible.

In other words, although we can generally work around the email
parser’s str requirement for fetched emails by decoding per an 8-bit
encoding, it’s much more malignant for web scripting today. Watch for
more details on this in Chapter 15, and stay tuned for a future fix,
which may have materialized by the time you read these words.

Text payload encodings: Handling mixed type results

Our next email Unicode issue seems to fly in the face of Python’s
generic programming model: the data types of message payload objects
may differ, depending on how they are fetched. Especially for programs
that walk and process payloads of mail parts generically, this
complicates code.

Specifically, the Message object’s get_payload method we used earlier
accepts an optional decode argument to control automatic email-style
MIME decoding (e.g., Base64, uuencode, quoted-printable). If this
argument is passed in as 1 (or equivalently, True), the payload’s data
is MIME-decoded when fetched, if required. Because this argument is so
useful for complex messages with arbitrary parts, it will normally be
passed as true in all cases. Binary parts are normally MIME-encoded,
but even text parts might also be present in Base64 or another MIME
form if their bytes fall outside email standards. Some types of
Unicode text, for example, require MIME encoding.

The upshot is that get_payload normally returns str strings for str
text parts, but returns bytes strings if its decode argument is true,
even if the message part is known to be text by nature. If this
argument is not used, the payload’s type depends upon how it was set:
str or bytes. Because Python 3.X does not allow str and bytes to be
mixed freely, clients that need to use the result in text processing
or store it in files need to accommodate the difference. Let’s run
some code to illustrate:

>>> from email.message import Message
>>> m = Message()
>>> m['From'] = 'Lancelot'
>>> m.set_payload('Line?...')
>>> m['From']
'Lancelot'
>>> m.get_payload()                    # str, if payload is str
'Line?...'
>>> m.get_payload(decode=1)            # bytes, if MIME decode (same as decode=True)
b'Line?...'

The combination of these different return types and Python 3.X’s
strict str/bytes dichotomy can cause problems in code that processes
the result unless it decodes carefully:

>>> m.get_payload(decode=True) + 'spam'             # can't mix in 3.X!
TypeError: can't concat bytes to str
>>> m.get_payload(decode=True).decode() + 'spam'    # convert if required
'Line?...spam'

To make sense of these examples, it may help to remember that
there are two different concepts of “encoding” for email text:

  • Email-style MIME encodings such as Base64, uuencode, and
    quoted-printable, which are applied to binary and otherwise
    unusual content to make them acceptable for transmission in
    email text

  • Unicode text encodings for strings in general, which apply to
    message text as well as its parts, and may be required after
    MIME encoding for text message parts

The email package handles email-style MIME encodings automatically
when we pass decode=1 to fetch parsed payloads, or generate text for
messages that have nonprintable parts, but scripts still need to take
Unicode encodings into consideration because of Python 3.X’s sharp
string types differentiation. For example, the first decode in the
following refers to MIME, and the second to Unicode:

m.get_payload(decode=True).decode()  # to bytes via MIME, then to str via Unicode
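
The two layers are easier to tell apart on a part that really is
MIME-encoded. The following is a small sketch, assuming a hand-built
message whose UTF-8 text body has been Base64-encoded; the sample text
and variable names are made up for illustration:

```python
import base64
from email.parser import Parser

original = 'caf\xe9'                                  # non-ASCII sample text
b64_body = base64.b64encode(original.encode('utf8')).decode('ascii')

raw = ('Content-Type: text/plain; charset="utf-8"\n'
       'Content-Transfer-Encoding: base64\n'
       '\n' +
       b64_body + '\n')
msg = Parser().parsestr(raw)

encoded = msg.get_payload()                       # str: still Base64 text
decoded = msg.get_payload(decode=True)            # bytes: MIME layer removed
text = decoded.decode(msg.get_content_charset())  # str: Unicode layer removed

print(type(encoded).__name__, type(decoded).__name__, type(text).__name__)
```

Here the charset name comes from the part's own Content-Type header
rather than from a default, which is the safer choice whenever the
header is present.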

Even without the MIME decode argument, the payload type may also
differ if it is stored in different forms:

>>> m = Message(); m.set_payload('spam'); m.get_payload()     # fetched as stored
'spam'
>>> m = Message(); m.set_payload(b'spam'); m.get_payload()
b'spam'

Moreover, the same holds true for the text-specific MIME subclass
(though as we’ll see later in this section, we cannot pass a bytes to
its constructor to force a binary payload):

>>> from email.mime.text import MIMEText
>>> m = MIMEText('Line...?')
>>> m['From'] = 'Lancelot'
>>> m['From']
'Lancelot'
>>> m.get_payload()
'Line...?'
>>> m.get_payload(decode=1)
b'Line...?'

Unfortunately, the fact that payloads might be either str or bytes
today not only flies in the face of Python’s type-neutral mindset, it
can complicate your code: scripts may need to convert in contexts that
require one or the other type. For instance, GUI libraries might allow
both, but file saves and web page content generation may be less
flexible. In our example programs, we’ll process payloads as bytes
whenever possible, but decode to str text in cases where required
using the encoding information available in the header API described
in the next section.
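
A rough sketch of that policy for code that walks parts generically
appears below. The helper name payload_as_text, its Latin-1 default,
and the sample message are hypothetical; the book's own programs may
organize this differently:

```python
from email.parser import Parser

def payload_as_text(part, default_encoding='latin1'):
    """Hypothetical helper: return a part's payload as str in all cases."""
    payload = part.get_payload(decode=True)     # bytes after MIME decoding
    if payload is None:                         # multipart containers have no body
        return ''
    if isinstance(payload, str):                # defensive: already text
        return payload
    encoding = part.get_content_charset() or default_encoding
    try:
        return payload.decode(encoding)
    except (UnicodeDecodeError, LookupError):   # wrong or unknown charset name
        return payload.decode(default_encoding, errors='replace')


raw = ('MIME-Version: 1.0\n'
       'Content-Type: multipart/mixed; boundary="BOUNDARY"\n'
       '\n'
       '--BOUNDARY\n'
       'Content-Type: text/plain; charset="us-ascii"\n'
       '\n'
       'nice red uniforms\n'
       '--BOUNDARY\n'
       'Content-Type: text/plain; charset="utf-8"\n'
       'Content-Transfer-Encoding: quoted-printable\n'
       '\n'
       'caf=C3=A9\n'
       '--BOUNDARY--\n')

msg = Parser().parsestr(raw)
texts = [payload_as_text(part) for part in msg.walk() if not part.is_multipart()]
print(texts)
```

Callers that need bytes, such as binary-mode file saves, would skip
the helper and use the result of get_payload(decode=True) directly.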
