Programming Python

Authors: Mark Lutz

Other String Concepts in Python 3.X: Unicode and bytes

Technically speaking, the Python 3.X string story is a bit richer than
I’ve implied here. What I’ve shown so far is the str object type: a
sequence of characters (technically, Unicode “code points” represented
as Unicode “code units”) which represents both ASCII and wider Unicode
text, and handles encoding and decoding both manually on request and
automatically on file transfers. Strings are coded in quotes (e.g.,
'abc'), along with various syntax for coding non-ASCII text (e.g.,
'\xc4\xe8', '\u00c4\u00e8').

Really, though, 3.X has two additional string types that support
most str string operations: bytes, a sequence of short integers for
representing 8-bit binary data, and bytearray, a mutable variant of
bytes. You generally know you are dealing with bytes if strings display
or are coded with a leading “b” character before the opening quote
(e.g., b'abc', b'\xc4\xe8'). As we’ll see in Chapter 4, files in 3.X
follow a similar dichotomy, using str in text mode (which also handles
Unicode encodings and line-end conversions) and bytes in binary mode
(which transfers bytes to and from files unchanged). And in Chapter 5,
we’ll see the same distinction for tools like sockets, which deal in
byte strings today.
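
As a rough illustration of the distinction just described, the following
sketch encodes a str to bytes and back, and changes a bytearray in place.
The sample text and the choice of the Latin-1 and UTF-8 encodings are
assumptions made for the demo only.

```python
# A sketch of the str / bytes / bytearray distinction in Python 3.X.
text = '\u00c4\u00e8'                    # str: two Unicode code points
data = text.encode('latin-1')            # manual encoding: str -> bytes
utf8 = text.encode('utf-8')              # same text, different encoding

print(data)                              # b'\xc4\xe8': one byte per character
print(len(text), len(data), len(utf8))   # 2 2 4: characters vs. encoded bytes
print(data.decode('latin-1') == text)    # True: decoding reverses encoding

buf = bytearray(b'abc')                  # bytearray: a mutable variant of bytes
buf[0] = ord('x')                        # in-place change (bytes does not allow this)
print(buf)                               # bytearray(b'xbc')

print(b'abc'[0])                         # 97: indexing bytes yields a small integer
```

Note that the number of bytes depends on the encoding chosen, while the
number of characters in the str does not.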

Unicode text is used in internationalized applications, and many
of Python’s binary-oriented tools deal in byte strings today. This
includes some file tools we’ll meet along the way, such as the open
call, and the os.listdir and os.walk tools we’ll study in upcoming
chapters. As we’ll see, even simple directory tools sometimes have to be
aware of Unicode in file content and names. Moreover, tools such as
object pickling and binary data parsing are byte-oriented today.

Later in the book, we’ll find that Unicode also pops up today in
the text displayed in GUIs; the bytes shipped over networks; Internet
standards such as email; and even some persistence topics such as DBM
files and shelves. Any interface that deals in text necessarily deals in
Unicode today, because str is Unicode, whether ASCII or wider. Once we
reach the realm of the applications programming presented in this book,
Unicode is no longer an optional topic for most Python 3.X
programmers.

In this book, we’ll defer further coverage of Unicode until we can
see it in the context of application topics and practical programs. For
more fundamental details on how 3.X’s Unicode text and binary data
support impact both string and file usage in some roles, please see
Learning Python, Fourth Edition; since this is officially a core
language topic, it enjoys in-depth coverage and a full 45-page dedicated
chapter in that book.

File Operation Basics

Besides processing strings, the more.py script also uses files: it
opens the external file whose name is listed on the command line using
the built-in open function, and it reads that file’s text into memory
all at once with the file object read method. Since file objects
returned by open are part of the core Python language itself, I assume
that you have at least a passing familiarity with them at this point in
the text. But just in case you’ve flipped to this chapter early on in
your Pythonhood, the following calls load a file’s contents into a
string, load a fixed-size set of bytes into a string, load a file’s
contents into a list of line strings, and load the next line in the file
into a string, respectively:

open('file').read()            # read entire file into string
open('file').read(N)           # read next N characters (bytes in binary mode)
open('file').readlines()       # read entire file into line strings list
open('file').readline()        # read next line, through '\n'

As we’ll see in a moment, these calls can also be applied to shell
commands in Python to read their output. File objects also have write
methods for sending strings to the associated file. File-related topics
are covered in depth in Chapter 4, but making an output file and reading
it back is easy in Python:

>>> file = open('spam.txt', 'w')        # create file spam.txt
>>> file.write(('spam' * 5) + '\n')     # write text: returns #characters written
21
>>> file.close()

>>> file = open('spam.txt')             # or open('spam.txt').read()
>>> text = file.read()                  # read into a string
>>> text
'spamspamspamspamspam\n'

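
The same write-then-read steps can also be sketched as a script. This
version is only an illustration: it assumes a temporary directory
(via the standard tempfile module) so that nothing is left behind, and it
uses with statements so that files are closed automatically instead of by
explicit close calls.

```python
import os
import tempfile

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'spam.txt')

    with open(path, 'w') as file:               # closed on block exit
        count = file.write(('spam' * 5) + '\n') # returns characters written
        file.write('eggs\n')

    with open(path) as file:
        whole = file.read()                     # entire file as one string

    with open(path) as file:
        first = file.readline()                 # next line, through '\n'
        rest = file.readlines()                 # remaining lines as a list

print(count)            # 21
print(repr(whole))      # 'spamspamspamspamspam\neggs\n'
print(repr(first))      # 'spamspamspamspamspam\n'
print(rest)             # ['eggs\n']
```
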
Using Programs in Two Ways

Also by way of review, the last few lines in the more.py file in
Example 2-1 introduce one of the first big concepts in shell tool
programming. They instrument the file to be used in either of two ways:
as a script or as a library.

Recall that every Python module has a built-in __name__ variable
that Python sets to the '__main__' string only when the file is run as a
program, not when it’s imported as a library. Because of that, the more
function in this file is executed automatically by the last line in the
file when this script is run as a top-level program, but not when it is
imported elsewhere. This simple trick turns out to be one key to writing
reusable script code: by coding program logic as functions rather than
as top-level code, you can also import and reuse it in other scripts.
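
To make the dual-mode idea concrete, here is a minimal sketch of a file
that works both ways. The pages function is a hypothetical,
non-interactive cousin of more, invented for this illustration; it is not
part of the book’s examples.

```python
def pages(text, numlines=15):
    """Split text into a list of pages, each a list of at most numlines lines."""
    lines = text.splitlines()
    return [lines[i:i + numlines] for i in range(0, len(lines), numlines)]

if __name__ == '__main__':
    # Runs only when this file is launched as a program, not when imported.
    for page in pages('a\nb\nc\nd\ne', numlines=2):
        print(page)
```

When run as a script, this prints ['a', 'b'], ['c', 'd'], and ['e'] on
separate lines; when imported, nothing runs until the importer calls
pages itself.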

The upshot is that we can run more.py by itself or import and call
its more function elsewhere. When running the file as a top-level
program, we list on the command line the name of a file to be read and
paged: as I’ll describe in more depth in the next chapter, words typed
in the command that is used to start a program show up in the built-in
sys.argv list in Python. For example, here is the script file in action,
paging itself (be sure to type this command line in your PP4E\System
directory, or it won’t find the input file; more on command lines
later):

C:\...\PP4E\System> python more.py more.py
"""
split and interactively page a string or file of text
"""

def more(text, numlines=15):
    lines = text.splitlines()                # like split('\n') but no '' at end
    while lines:
        chunk = lines[:numlines]
        lines = lines[numlines:]
        for line in chunk: print(line)
More?y
        if lines and input('More?') not in ['y', 'Y']: break

if __name__ == '__main__':
    import sys                               # when run, not imported
    more(open(sys.argv[1]).read(), 10)       # page contents of file on cmdline

When the more.py file is imported, we pass an explicit string to its
more function, and this is exactly the sort of utility we need for
documentation text. Running this utility on the sys module’s
documentation string gives us a bit more information in human-readable
form about what’s available to scripts:

C:\...\PP4E\System> python
>>> from more import more
>>> import sys
>>> more(sys.__doc__)
This module provides access to some objects used or maintained by the
interpreter and to functions that interact strongly with the interpreter.
Dynamic objects:
argv -- command line arguments; argv[0] is the script pathname if known
path -- module search path; path[0] is the script directory, else ''
modules -- dictionary of loaded modules
displayhook -- called to show results in an interactive session
excepthook -- called to handle any uncaught exception other than SystemExit
To customize printing in an interactive session or to install a custom
top-level exception handler, assign other functions to replace these.
stdin -- standard input file object; used by input()
More?

Pressing “y” or “Y” here makes the function display the next few
lines of documentation, and then prompt again, unless you’ve run past
the end of the lines list. Try this on your own machine to see what the
rest of the module’s documentation string looks like. Also try
experimenting by passing a different window size in the second argument:
more(sys.__doc__, 5) shows just 5 lines at a time.

Python Library Manuals

If that still isn’t enough detail, your next step is to read the
Python library manual’s entry for sys to get the full story. All of
Python’s standard manuals are available online, and they often install
alongside Python itself. On Windows, the standard manuals are installed
automatically, but here are a few simple pointers:

  • On Windows, click the Start button, pick All Programs, select
    the Python entry there, and then choose the Python Manuals item. The
    manuals should magically appear on your display; as of Python 2.4,
    the manuals are provided as a Windows help file and so support
    searching and navigation.

  • On Linux or Mac OS X, you may be able to click on the manuals’
    entries in a file explorer or start your browser from a shell
    command line and navigate to the library manual’s HTML files on your
    machine.

  • If you can’t find the manuals on your computer, you can always
    read them online. Go to Python’s website at http://www.python.org
    and follow the documentation links there. This website also has a
    simple searching utility for the manuals.

However you get started, be sure to pick the Library manual for
things such as sys; this manual documents all of the standard library,
built-in types and functions, and more. Python’s standard manual set
also includes a short tutorial, a language reference, extending
references, and more.

Commercially Published References

At the risk of sounding like a marketing droid, I should mention
that you can also purchase the Python manual set, printed and bound; see
the book information page at http://www.python.org for details and
links. Commercially published Python reference books are also available
today, including Python Essential Reference, Python in a Nutshell,
Python Standard Library, and Python Pocket Reference. Some of these
books are more complete and come with examples, but the last one serves
as a convenient memory jogger once you’ve taken a library tour or
two.[4]

[3] They may also work their way into your subconscious. Python
newcomers sometimes describe a phenomenon in which they “dream in
Python” (insert overly simplistic Freudian analysis here…).

[4] Full disclosure: I also wrote the last of the books listed as
a replacement for the reference appendix that appeared in the first
edition of this book; it’s meant to be a supplement to the text
you’re reading, and its latest edition also serves as a translation
resource for Python 2.X readers. As explained in the Preface, the
book you’re holding is meant as tutorial, not reference, so you’ll
probably want to find some sort of reference resource eventually
(though I’m nearly narcissistic enough to require that it be
mine).

Introducing the sys Module

But enough about documentation sources (and scripting basics); let’s
move on to system module details. As mentioned earlier, the sys and os
modules form the core of much of Python’s system-related tool set. To see
how, we’ll turn to a quick, interactive tour through some of the tools in
these two modules before applying them in bigger examples. We’ll start
with sys, the smaller of the two; remember that to see a full list of all
the attributes in sys, you need to pass it to the dir function (or see
where we did so earlier in this chapter).

Platforms and Versions

Like most modules, sys includes both informational names and
functions that take action. For instance, its attributes give us the
name of the underlying operating system on which the platform code is
running, the largest possible “natively sized” integer on this machine
(though integers can be arbitrarily long in Python 3.X), and the version
number of the Python interpreter running our code:

C:\...\PP4E\System> python
>>> import sys
>>> sys.platform, sys.maxsize, sys.version
('win32', 2147483647, '3.1.1 (r311:74483, Aug 17 2009, 17:02:12) ...more deleted...')
>>> if sys.platform[:3] == 'win': print('hello windows')
...
hello windows

If you have code that must act differently on different machines,
simply test the sys.platform string as done here; although most of
Python is cross-platform, nonportable tools are usually wrapped in if
tests like the one here. For instance, we’ll see later that some program
launch and low-level console interaction tools may vary per platform;
simply test sys.platform to pick the right tool for the machine on which
your script is running.
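
One way to package such tests is sketched below. The platform_family
function and the family names it returns are hypothetical, chosen only
for illustration; the sys.platform prefixes it checks ('win', 'linux',
'darwin') are the values Python reports on Windows, Linux, and Mac OS X.

```python
import sys

def platform_family(name=sys.platform):
    """Map a sys.platform string to a coarse family name (illustrative only)."""
    if name.startswith('win'):
        return 'windows'
    elif name.startswith('linux'):
        return 'linux'
    elif name == 'darwin':
        return 'mac'
    return 'other'

print(sys.platform, '=>', platform_family())   # depends on where this runs
print(sys.version_info[:2])                    # version as a comparable tuple
```

Testing with startswith rather than equality allows for variations in
the rest of the platform string, much like the [:3] slice used earlier.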

The Module Search Path

The sys module also lets us inspect the module search path both
interactively and within a Python program. sys.path is a list of
directory name strings representing the true search path in a running
Python interpreter. When a module is imported, Python scans this list
from left to right, searching for the module’s file on each directory
named in the list. Because of that, this is the place to look to verify
that your search path is really set as intended.[5]

The sys.path list is simply initialized from your PYTHONPATH
setting, the content of any .pth path files located in Python’s
directories on your machine, and system defaults when the interpreter is
first started up. In fact, if you inspect sys.path interactively, you’ll
notice quite a few directories that are not on your PYTHONPATH: sys.path
also includes an indicator for the script’s home directory (an empty
string, something I’ll explain in more detail after we meet os.getcwd)
and a set of standard library directories that may vary per
installation:

>>> sys.path
['', 'C:\\PP4thEd\\Examples', ...plus standard library paths deleted... ]

Surprisingly, sys.path can actually be changed by a program, too. A
script can use list operations such as append, extend, insert, pop, and
remove, as well as the del statement to configure the search path at
runtime to include all the source directories to which it needs access.
Python always uses the current sys.path setting to import, no matter
what you’ve changed it to:

>>> sys.path.append(r'C:\mydir')
>>> sys.path
['', 'C:\\PP4thEd\\Examples', ...more deleted..., 'C:\\mydir']

Changing sys.path directly like this is an alternative to setting
your PYTHONPATH shell variable, but not a very permanent one. Changes to
sys.path are retained only until the Python process ends, and they must
be remade every time you start a new Python program or session. However,
some types of programs (e.g., scripts that run on a web server) may not
be able to depend on PYTHONPATH settings; such scripts can instead
configure sys.path on startup to include all the directories from which
they will need to import modules. For a more concrete use case, see
Example 1-34 in the prior chapter; there we had to tweak the search path
dynamically this way, because the web server violated our import path
assumptions.
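
The following sketch shows the effect of such a runtime change from start
to finish. Everything specific in it is an assumption made for the demo:
it creates a throwaway directory with tempfile, writes a tiny module
under the invented name pathdemo_mod, and imports it after adding the
directory to sys.path.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a tiny module into a directory that is not on the search path.
    with open(os.path.join(tmpdir, 'pathdemo_mod.py'), 'w') as f:
        f.write('answer = 42\n')

    sys.path.insert(0, tmpdir)           # front of the list: searched first
    try:
        importlib.invalidate_caches()    # make sure the new file is noticed
        mod = importlib.import_module('pathdemo_mod')
        print(mod.answer)                # 42
    finally:
        sys.path.remove(tmpdir)          # undo the change when done

print(tmpdir in sys.path)                # False: the change was not permanent
```

Without the sys.path.insert call, the import would fail, because the
temporary directory is not part of the path Python searches.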

Windows Directory Paths

Notice the use of a raw string literal in the sys.path configuration
code: because backslashes normally introduce escape code sequences in
Python strings, Windows users should be sure to either double up on
backslashes when using them in DOS directory path strings (e.g., in
"C:\\dir", \\ is an escape sequence that really means \), or use raw
string constants to retain backslashes literally (e.g., r"C:\dir").

If you inspect directory paths on Windows (as in the sys.path
interaction listing), Python prints double \\ to mean a single \.
Technically, you can get away with a single \ in a string if it is
followed by a character Python does not recognize as the rest of an
escape sequence, but doubles and raw strings are usually easier than
memorizing escape code tables.

Also note that most Python library calls accept either forward (/)
or backward (\) slashes as directory path separators, regardless of the
underlying platform. That is, / usually works on Windows too and aids in
making scripts portable to Unix. Tools in the os and os.path modules,
described later in this chapter, further aid in script path
portability.
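
The backslash rules above can be verified with a short sketch. It uses
the standard ntpath module, which is the Windows flavor of os.path and
can be imported on any platform; the directory and file names are made up
for the demo.

```python
import ntpath       # Windows version of os.path, importable anywhere

doubled = "C:\\dir\\new"            # each \\ is an escape for one backslash
raw = r"C:\dir\new"                 # raw string: backslashes kept literally
print(doubled == raw, len(raw))     # True 10

broken = "C:\new"                   # oops: \n here is a newline escape
print(len(broken), len(r"C:\new"))  # 5 6

joined = ntpath.join(r"C:\dir", "file.txt")
print(joined)                                   # C:\dir\file.txt
print(ntpath.normpath("C:/dir/sub/file.txt"))   # C:\dir\sub\file.txt
```

In portable scripts, os.path.join picks the right separator for the host
platform automatically; ntpath is used here only so that the output is
the same wherever the sketch is run.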

The Loaded Modules Table

The sys module also contains hooks into the interpreter;
sys.modules, for example, is a dictionary containing one name:module
entry for every module imported in your Python session or program
(really, in the calling Python process):

>>> sys.modules
{'reprlib': <module 'reprlib' from '...'>, ...more deleted...
>>> list(sys.modules.keys())
['reprlib', 'heapq', '__future__', 'sre_compile', '_collections', 'locale', '_sre',
'functools', 'encodings', 'site', 'operator', 'io', '__main__', ...more deleted... ]
>>> sys
<module 'sys' (built-in)>
>>> sys.modules['sys']
<module 'sys' (built-in)>
We might use such a hook to write programs that display or
otherwise process all the modules loaded by a program (just iterate over
the keys of
sys.modules
).
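
For instance, a loaded-modules lister along these lines might look like
the following sketch. The set of modules it reports will differ from one
interpreter and program to the next, so no particular output is implied;
the json import is there only to guarantee one known entry.

```python
import sys
import json                              # any import adds a table entry

names = sorted(sys.modules)              # iterating a dict yields its keys
print(len(names), 'modules loaded')
print('json' in sys.modules)             # True: imported above
print(sys.modules['sys'] is sys)         # True: same module object

# Show where the first few modules came from; built-ins have no file.
for name in names[:5]:
    module = sys.modules[name]
    print(name, '=>', getattr(module, '__file__', '(no file)'))
```
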

Also in the interpreter hooks category, an object’s reference count
is available via sys.getrefcount, and the names of modules built into
the Python executable are listed in sys.builtin_module_names. See
Python’s library manual for details; these are mostly Python internals
information, but such hooks can sometimes become important to
programmers writing tools for other programmers to use.
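
A brief sketch of both hooks follows. Reference counts are an
implementation detail of standard CPython, so this code compares two
counts instead of relying on any absolute value; the names obj and alias
are arbitrary.

```python
import sys

obj = object()
before = sys.getrefcount(obj)        # count includes the call's own temporary
alias = obj                          # bind one more name to the same object
after = sys.getrefcount(obj)
print(after - before)                # 1 in CPython: one new reference

print('sys' in sys.builtin_module_names)    # True: sys is compiled in
print(len(sys.builtin_module_names) > 0)    # True
```
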

Exception Details

Other attributes in the sys module allow us to fetch all the
information related to the most recently raised Python exception. This
is handy if we want to process exceptions in a more generic fashion. For
instance, the sys.exc_info function returns a tuple with the latest
exception’s type, value, and traceback object. In the all class-based
exception model that Python 3 uses, the first two of these correspond to
the most recently raised exception’s class, and the instance of it which
was raised:

>>> try:
...     raise IndexError
... except:
...     print(sys.exc_info())
...
(<class 'IndexError'>, IndexError(), <traceback object at 0x...>)

We might use such information to format our own error message to
display in a GUI pop-up window or HTML web page (recall that by default,
uncaught exceptions terminate programs with a Python error display). The
first two items returned by this call have reasonable string displays
when printed directly, and the third is a traceback object that can be
processed with the standard traceback module:

>>> import traceback, sys
>>> def grail(x):
...     raise TypeError('already got one')
...
>>> try:
...     grail('arthur')
... except:
...     exc_info = sys.exc_info()
...     print(exc_info[0])
...     print(exc_info[1])
...     traceback.print_tb(exc_info[2])
...
<class 'TypeError'>
already got one
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 2, in grail

The traceback module can also format messages as strings and route
them to specific file objects; see the Python library manual for more
details.
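
As a rough sketch of those two options, the code below reuses the grail
function from the listing above, builds a one-line message and a full
report as strings, and routes the traceback to an in-memory file object.
The use of io.StringIO as the target is an assumption for the demo; any
object with a write method would do.

```python
import io
import sys
import traceback

def grail(x):
    raise TypeError('already got one')

try:
    grail('arthur')
except TypeError:
    exc_type, exc_value, exc_tb = sys.exc_info()
    message = '%s: %s' % (exc_type.__name__, exc_value)
    report = traceback.format_exc()            # full error text as one string
    buffer = io.StringIO()
    traceback.print_tb(exc_tb, file=buffer)    # route to a file-like object

print(message)                        # TypeError: already got one
print(report.splitlines()[-1])        # TypeError: already got one
print('grail' in buffer.getvalue())   # True: the stack entry was captured
```

A string like message here is the sort of text that could be shown in a
GUI pop-up or web page instead of the default error display.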

Other sys Module Exports

The sys module exports additional commonly-used tools that we will
meet in the context of larger topics and examples introduced later in
this part of the book. For instance:

  • Command-line arguments show up as a list of strings called
    sys.argv.

  • Standard streams are available as sys.stdin, sys.stdout, and
    sys.stderr.

  • Program exit can be forced with sys.exit calls.

Since these lead us to bigger topics, though, we will cover them
in sections of their own.
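
Until then, the following small sketch previews how the three fit
together. The main function and the 'demo.py' argument list are
hypothetical; a real script would normally pass sys.argv itself. The
sketch also catches the SystemExit exception that sys.exit raises, which
a real script would usually let propagate to end the program.

```python
import sys

def main(argv):
    """Hypothetical entry point: echo arguments and return an exit status."""
    if len(argv) < 2:
        sys.stderr.write('usage: %s word...\n' % argv[0])   # errors to stderr
        return 1
    sys.stdout.write(' '.join(argv[1:]) + '\n')             # results to stdout
    return 0

# sys.exit works by raising SystemExit, so it can be caught like any exception.
try:
    sys.exit(main(['demo.py', 'spam', 'eggs']))             # prints: spam eggs
except SystemExit as exc:
    status = exc.code

print(status)       # 0
```
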

[5] It’s not impossible that Python sees PYTHONPATH differently than
you do. A syntax error in your system shell configuration files may
botch the setting of PYTHONPATH, even if it looks fine to you. On
Windows, for example, if a space appears around the = of a DOS set
command in your configuration file (e.g., set NAME = VALUE), you may
actually set NAME to an empty string, not to VALUE!
