Programming Python

Authors: Mark Lutz

Part V. Tools and Techniques

This part of the book presents a collection of additional Python
application topics. Most of the tools presented along the way can be used
in a wide variety of application domains. You’ll find the following
chapters here:

Chapter 17

This chapter covers commonly used and advanced Python
techniques for storing information between program executions—DBM
files, object pickling, object shelves, and Python’s SQL database
API—and briefly introduces full-blown OODBs such as ZODB, as well as
ORMs such as SQLObject and SQLAlchemy. The Python standard library’s
SQLite support is used for the SQL examples, but the API is portable
to enterprise-level systems such as MySQL.

Chapter 18

This chapter explores techniques for implementing more
advanced data structures in Python—stacks, sets, binary search
trees, graphs, and the like. In Python, these take the form of
object implementations.

Chapter 19

This chapter addresses Python tools and techniques for parsing
text-based information—string splits and joins, regular expression
matching, XML parsing, recursive descent parsing, and more advanced
language-based topics.

Chapter 20

This chapter introduces integration techniques—both extending
Python with compiled libraries and embedding Python code in other
applications. While the main focus here is on linking Python with
compiled C code, we’ll also investigate integration with Java, .NET,
and more. This chapter assumes that you know how to read C programs,
and it is intended mostly for developers responsible for
implementing application integration layers.

This is the last technical part of the book, and it makes heavy use
of tools presented earlier in the text to help underscore the notion of
code reuse. For instance, a calculator GUI (PyCalc) serves to demonstrate
language processing and code reuse concepts.

Chapter 17. Databases and Persistence
“Give Me an Order of Persistence, but Hold the Pickles”

So far in this book, we’ve used Python in the system programming,
GUI development, and Internet scripting domains—three of Python’s most
common applications, and representative of its use as an application
programming language at large. In the next four chapters, we’re going to
take a quick look at other major Python programming topics: persistent
data, data structure techniques, text and language processing, and
Python/C integration.

These four topics are not really application areas themselves, but
they are techniques that span domains. The database topics in this
chapter, for instance, can be applied on the Web, in desktop GUI
applications, and so on. Text processing is a similarly general tool.
Moreover, none of these final four topics is covered exhaustively (each
could easily fill a book alone), but we’ll sample Python in action in
these domains and highlight their core concepts and tools. If any of these
chapters spark your interest, additional resources are readily available
in the Python world.

Persistence Options in Python

In this chapter, our focus is on persistent data: the kind that outlives
the program that creates it. That’s not true by default for objects a
script constructs, of course; things like lists, dictionaries, and even
class instance objects live in your computer’s memory and are lost as soon
as the script ends. To make data live longer, we need to do something
special. In Python programming, there are today at least seven traditional
ways to save information in between program executions:

Flat files

Text and bytes stored directly on your computer

DBM keyed files

Keyed access to strings stored in dictionary-like files

Pickled objects

Serialized Python objects saved to files and streams

Shelve files

Pickled Python objects saved in DBM keyed files

Object-oriented databases (OODBs)

Persistent Python objects stored in persistent dictionaries (ZODB,
Durus)

SQL relational databases (RDBMSs)

Table-based storage that supports SQL queries (SQLite, MySQL,
PostgreSQL, etc.)

Object relational mappers (ORMs)

Mediators that map Python classes to relational tables (SQLObject,
SQLAlchemy)
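As a rough preview of how three of the standard library options on this
list look in code, here is a minimal sketch. The sample record, the file
names, and the scratch directory are all hypothetical choices made for
illustration; each tool is covered properly later in the chapter.

```python
import os
import pickle
import shelve
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()               # scratch directory for this demo
record = {'name': 'Bob', 'age': 42}        # hypothetical sample data

# Pickled object: serialize one object to a flat binary file
pickle_path = os.path.join(workdir, 'record.pkl')
with open(pickle_path, 'wb') as f:
    pickle.dump(record, f)
with open(pickle_path, 'rb') as f:
    from_pickle = pickle.load(f)

# Shelve file: pickled objects stored by string key in a DBM file
shelf_path = os.path.join(workdir, 'people')
shelf = shelve.open(shelf_path)
shelf['bob'] = record
shelf.close()
shelf = shelve.open(shelf_path)
from_shelf = shelf['bob']
shelf.close()

# SQL database: table-based storage queried with SQL (in-memory SQLite)
conn = sqlite3.connect(':memory:')
conn.execute('create table people (name text, age integer)')
conn.execute('insert into people values (?, ?)',
             (record['name'], record['age']))
from_sql = conn.execute('select name, age from people').fetchall()
conn.close()

print(from_pickle, from_shelf, from_sql)
```

The pickle and shelve versions hand back an equal dictionary, while the
SQL version returns rows as tuples; that difference in data model is much
of what separates the options above.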

In some sense, Python’s interfaces to network-based object transmission
protocols such as SOAP, XML-RPC, and CORBA also offer persistence options,
but they are beyond the scope of this chapter. Here, our interest is in
techniques that allow a program to store its data directly and, usually,
on the local machine. Although some database servers may operate on a
physically remote machine on a network, this is largely transparent to
most of the techniques we’ll study here.

We studied Python’s simple (or “flat”) file interfaces in earnest in
Chapter 4, and we have been using them ever since. Python provides
standard access to both the stdio filesystem (through the built-in open
function) and lower-level descriptor-based files (with the built-in os
module). For simple data storage tasks, these are all that many scripts
need. To save data for use in a future program run, simply write it out to
a newly opened file on your computer in text or binary mode, and read it
back from that file later. As we’ve seen, for more advanced tasks, Python
also supports other file-like interfaces such as pipes, fifos, and
sockets.
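As a quick refresher, the write-now, read-later pattern might look like
the following sketch. The file name and its contents are hypothetical, and
a temporary directory stands in for wherever a real script would keep its
data.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()                   # scratch directory
path = os.path.join(workdir, 'notes.txt')      # hypothetical file name

# Text mode: str in, str out (newline='\n' disables newline translation)
with open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.write('spam\neggs\n')

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()

# Binary mode: the same file viewed as raw bytes
with open(path, 'rb') as f:
    raw = f.read()

# Descriptor-based access through the os module
fd = os.open(path, os.O_RDONLY)
first = os.read(fd, 4)                         # read at most 4 bytes
os.close(fd)

print(lines, raw, first)
```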

Since we’ve already explored flat files, I won’t say more about them
here. The rest of this chapter introduces the remaining topics on the
preceding list. At the end, we’ll also meet a GUI program for browsing the
contents of things such as shelves and DBM files. Before that, though, we
need to learn what manner of beast these are.

Note

Fourth edition coverage note: The prior edition of this book used the
mysql-python interface to the MySQL relational database system, as well as
the ZODB object database system. As I update this chapter in June 2010,
neither of these is yet available for Python 3.X, the version of Python
used in this edition. Because of that, most ZODB information has been
trimmed, and the SQL database examples here were changed to use the SQLite
in-process database system that ships with Python 3.X as part of its
standard library. The prior edition’s ZODB and MySQL examples and
overviews are still available in the examples package, as described later.
Because Python’s SQL database API is portable, though, the SQLite code
here should work largely unchanged on most other systems.

DBM Files

Flat files are handy for simple persistence tasks, but they are generally
geared toward a sequential processing mode. Although it is possible to
jump around to arbitrary locations with seek calls, flat files don’t
provide much structure to data beyond the notion of bytes and text lines.

DBM files, a standard tool in the Python library for database
management, improve on that by providing key-based access to stored text
strings. They implement a random-access, single-key view on stored data.
For instance, information related to objects can be stored in a DBM file
using a unique key per object and later can be fetched back directly with
the same key. DBM files are implemented by a variety of underlying modules
(including one coded in Python), but if you have Python, you have a
DBM.

Using DBM Files

Although DBM filesystems have to do a bit of work to map chunks of stored
data to keys for fast retrieval (technically, they generally use a
technique called hashing to store data in files), your scripts don’t need
to care about the action going on behind the scenes. In fact, DBM is one
of the easiest ways to save information in Python: DBM files behave so
much like in-memory dictionaries that you may forget you’re actually
dealing with a file at all. For instance, given a DBM file object:

  • Indexing by key fetches data from the file.

  • Assigning to an index stores data in the file.

DBM file objects also support common dictionary methods such as
keys-list fetches and tests and key deletions. The DBM library itself is
hidden behind this simple model. Since it is so simple, let’s jump right
into an interactive example that creates a DBM file and shows how the
interface works:

C:\...\PP4E\Dbase> python
>>> import dbm                              # get interface: bsddb, gnu, ndbm, dumb
>>> file = dbm.open('movie', 'c')           # make a DBM file called 'movie'
>>> file['Batman'] = 'Pow!'                 # store a string under key 'Batman'
>>> file.keys()                             # get the file's key directory
[b'Batman']
>>> file['Batman']                          # fetch value for key 'Batman'
b'Pow!'
>>> who = ['Robin', 'Cat-woman', 'Joker']
>>> what = ['Bang!', 'Splat!', 'Wham!']
>>> for i in range(len(who)):
...     file[who[i]] = what[i]              # add 3 more "records"
...
>>> file.keys()
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> len(file), 'Robin' in file, file['Joker']
(4, True, b'Wham!')
>>> file.close()                            # close sometimes required

Internally, importing the dbm standard library module automatically loads
whatever DBM interface is available in your Python interpreter (attempting
alternatives in a fixed order), and opening the new DBM file creates one
or more external files with names that start with the string 'movie' (more
on the details in a moment). But after the import and open, a DBM file is
virtually indistinguishable from a dictionary.

In effect, the object called file here can be thought of as a dictionary
mapped to an external file called movie; the only obvious differences are
that keys must be strings (not arbitrary immutables), and we need to
remember to open to access and close after changes.

Unlike normal dictionaries, though, the contents of file are retained
between Python program runs. If we come back later and restart Python, our
dictionary is still available. Again, DBM files are like dictionaries that
must be opened:

C:\...\PP4E\Dbase> python
>>> import dbm
>>> file = dbm.open('movie', 'c')           # open existing DBM file
>>> file['Batman']
b'Pow!'
>>> file.keys()                             # keys gives an index list
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> for key in file.keys(): print(key, file[key])
...
b'Cat-woman' b'Splat!'
b'Batman' b'Pow!'
b'Joker' b'Wham!'
b'Robin' b'Bang!'

Notice how DBM files return a real list for the keys call; not shown here,
their values method instead returns an iterable view like dictionaries.
Further, DBM files always store both keys and values as bytes objects;
interpretation as arbitrary types of Unicode text is left to the client
application. We can use either bytes or str strings in our code when
accessing or storing keys and values: using bytes allows your keys and
values to retain arbitrary Unicode encodings, but str objects in our code
will be encoded to bytes internally using the UTF-8 Unicode encoding by
Python’s DBM implementation.
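The following sketch illustrates the str and bytes rules just described.
It opens the file with dbm.dumb directly, an assumption made here only so
the behavior does not depend on which DBM libraries a given Python has
installed; the Latin-1 value and the integer key are hypothetical test
inputs.

```python
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'movie')
db = dbm.dumb.open(path, 'c')

db['Batman'] = 'Pow!'                      # str key and value: stored as UTF-8
db[b'Robin'] = 'Bäng!'.encode('latin-1')   # bytes: stored exactly as given

via_bytes = db[b'Batman']                  # same entry, fetched by bytes key
via_str = db['Robin']                      # str key is encoded before lookup
robin_text = via_str.decode('latin-1')     # decoding is the client's job

# Keys must be str or bytes, not arbitrary immutable objects
try:
    db[42] = 'nope'
except TypeError as exc:
    key_problem = str(exc)
else:
    key_problem = None

db.close()
print(via_bytes, via_str, robin_text, key_problem)
```

Because the file holds only bytes, the program that stored a non-UTF-8
value must remember which encoding to use when reading it back.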

Still, we can always decode to Unicode str strings to display in a more
friendly fashion if desired, and DBM files have a keys iterator just like
dictionaries. Moreover, assigning and deleting keys updates the DBM file,
and we should close after making changes (this ensures that changes are
flushed to disk):

>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Pow!
Joker Wham!
Robin Bang!
>>> file['Batman'] = 'Ka-Boom!'             # change Batman slot
>>> del file['Robin']                       # delete the Robin entry
>>> file.close()                            # close it after changes

Apart from having to import the interface and open and close the
DBM file, Python programs don’t have to know anything about DBM itself.
DBM modules achieve this integration by overloading the indexing
operations and routing them to more primitive library tools. But you’d
never know that from looking at this Python code—DBM files look like
normal Python dictionaries, stored on external files. Changes made to
them are retained indefinitely:

C:\...\PP4E\Dbase> python
>>> import dbm                              # open DBM file again
>>> file = dbm.open('movie', 'c')
>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Ka-Boom!
Joker Wham!

As you can see, this is about as simple as it can be. Table 17-1 lists the
most commonly used DBM file operations. Once such a file is opened, it is
processed just as though it were an in-memory Python dictionary. Items are
fetched by indexing the file object by key and are stored by assigning to
a key.

Table 17-1. DBM file operations

Python code                        Action   Description
---------------------------------  -------  -------------------------------------------
import dbm                         Import   Get DBM implementation
file = dbm.open('filename', 'c')   Open     Create or open an existing DBM file for I/O
file['key'] = 'value'              Store    Create or change the entry for key
value = file['key']                Fetch    Load the value for the entry key
count = len(file)                  Size     Return the number of entries stored
index = file.keys()                Index    Fetch the stored keys list (not a view)
found = 'key' in file              Query    See if there’s an entry for key
del file['key']                    Delete   Remove the entry for key
for key in file:                   Iterate  Iterate over stored keys
file.close()                       Close    Manual close, not always needed

DBM Details: Files, Portability, and Close

Despite the dictionary-like interface, DBM files really do map to one or
more external files. For instance, the underlying default dbm interface
used by Python 3.1 on Windows writes two files, movie.dir and movie.dat,
when a DBM file called movie is made, and saves a movie.bak on later
opens. If your Python has access to a different underlying keyed-file
interface, different external files might show up on your computer.

Technically, the module dbm is really an interface to whatever DBM-like
filesystem you have available in your Python:

  • When opening an already existing DBM file, dbm tries to determine
    the system that created it with the dbm.whichdb function. This
    determination is based upon the content of the database itself.

  • When creating a new file, dbm today tries a set of keyed-file
    interface modules in a fixed order. According to its documentation,
    it attempts to import the interfaces dbm.bsd, dbm.gnu, dbm.ndbm, or
    dbm.dumb, and uses the first that succeeds. Pythons without any of
    these automatically fall back on an all-Python and always-present
    implementation called dbm.dumb, which is not really “dumb,” of
    course, but may not be as fast or robust as other options.
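The detection step in the first bullet can be observed directly. In this
sketch the file is created with dbm.dumb on purpose (an assumption made so
that the result is the same on any Python), and the generic dbm.open is
then left to work out which module to use. The names movie and nosuchfile
are hypothetical.

```python
import dbm
import dbm.dumb
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'movie')

db = dbm.dumb.open(path, 'c')            # force the all-Python fallback
db['Batman'] = 'Pow!'
db.close()

created = sorted(os.listdir(workdir))    # external files behind 'movie'
backend = dbm.whichdb(path)              # name of the module that made them

db = dbm.open(path, 'r')                 # generic open picks the same module
value = db['Batman']
db.close()

# whichdb returns None when there is no such database to inspect
missing = dbm.whichdb(os.path.join(workdir, 'nosuchfile'))

print(created, backend, value, missing)
```

The exact set of files listed may vary with the Python version, which is
one more reason to let whichdb, rather than file names alone, identify a
DBM file.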

Future Pythons are free to change this selection order, and may even add
additional alternatives to it. You normally don’t need to care about any
of this, though, unless you delete any of the files your DBM creates, or
transfer them between machines with different configurations. If you need
to care about the portability of your DBM files (and as we’ll see later,
by proxy, that of your shelve files), you should configure machines such
that all have the same DBM interface installed or rely upon the dumb
fallback. For example, the Berkeley DB package (a.k.a. bsddb) used by
dbm.bsd is widely available and portable.

Note that DBM files may or may not need to be explicitly closed, per the
last entry in Table 17-1. Some DBM files don’t require a close call, but
some depend on it to flush changes out to disk. On such systems, your file
may be corrupted if you omit the close call. Unfortunately, the default
DBM in some older Windows Pythons, dbhash (a.k.a. bsddb), is one of the
DBM systems that requires a close call to avoid data loss. As a rule of
thumb, always close your DBM files explicitly after making changes and
before your program exits to avoid potential problems; it’s essentially a
“commit” operation for these files. This rule extends by proxy to shelves,
a topic we’ll meet later in this chapter.
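One way to make the close call hard to forget is to tie it to a block of
code. The sketch below shows two common idioms, try/finally and
contextlib.closing; neither is specific to DBM files. It uses dbm.dumb
only as a portable stand-in for whatever DBM your Python selects, and the
keys are the same hypothetical movie entries used earlier.

```python
import contextlib
import dbm.dumb
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'movie')

# Option 1: try/finally closes the file even if an update raises
db = dbm.dumb.open(path, 'c')
try:
    db['Batman'] = 'Pow!'
finally:
    db.close()

# Option 2: contextlib.closing calls close() when the with block exits
with contextlib.closing(dbm.dumb.open(path, 'c')) as db:
    db['Joker'] = 'Wham!'

# Reopen read-only to confirm both updates reached the file
with contextlib.closing(dbm.dumb.open(path, 'r')) as db:
    saved = {key.decode(): db[key].decode() for key in db.keys()}

print(saved)
```

Pythons newer than the one used in this edition (3.4 and later) also let
the object returned by dbm.open be used in a with statement directly.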

Note

Recent changes: Be sure to also pass a string 'c' as a second argument
when calling dbm.open, to force Python to create the file if it does not
yet exist and to simply open it for reads and writes otherwise. This used
to be the default behavior but is no longer. You do not need the 'c'
argument when opening shelves, discussed ahead: they still use an “open or
create” 'c' mode by default if passed no open mode argument. Other open
mode strings can be passed to dbm, including 'n' to always create the
file, and 'r' for read-only access to an existing file, which is the new
default. See the Python library manual for more details.

In addition, Python 3.X stores both key and value strings as bytes instead
of str as we’ve seen (which turns out to be convenient for pickled data in
shelves, discussed ahead) and no longer ships with bsddb as a standard
component. It’s available independently on the Web as a third-party
extension, but in its absence Python falls back on its own DBM file
implementation. Since the underlying DBM implementation rules are prone to
change with time, you should always consult Python’s library manuals as
well as the dbm module’s standard library source code for more
information.
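The effect of the open mode strings described in the preceding note can be
checked with a short script such as this one. It is only a sketch: it
relies on whichever DBM interface dbm.open selects on your machine, and
the file name is again a hypothetical one in a scratch directory.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'movie')

# 'r' refuses to create a file that does not exist yet
try:
    dbm.open(path, 'r')
except dbm.error:                 # dbm.error is a tuple of exception classes
    read_missing_failed = True
else:
    read_missing_failed = False

db = dbm.open(path, 'c')          # 'c': create if missing, else open read/write
db['Batman'] = 'Pow!'
db.close()

db = dbm.open(path, 'r')          # 'r': read-only access to an existing file
value = db['Batman']
db.close()

db = dbm.open(path, 'n')          # 'n': always start with a new, empty file
count_after_new = len(db)
db.close()

print(read_missing_failed, value, count_after_new)
```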
