Imagine, if you will, that you need to keep track of information
about people for some reason. Maybe you want to store an address book on
your computer, or perhaps you need to keep track of employees in a small
business. For whatever reason, you want to write a program that keeps
track of details about these people. In other words, you want to keep
records in a database—to permanently store lists of people’s attributes on
your computer.
Naturally, there are off-the-shelf programs for managing databases
like these. By writing a program for this task yourself, however, you’ll
have complete control over its operation. You can add code for special
cases and behaviors that precoded software may not have anticipated. You
won’t have to install and learn to use yet another database product. And
you won’t be at the mercy of a software vendor to fix bugs or add new
features. You decide to write a Python program to manage your
people.
If we’re going to
store records in a database, the first step is probably
deciding what those records will look like. There are a variety of ways to
represent information about people in the Python language. Built-in object
types such as lists and dictionaries are often sufficient, especially if
we don’t initially care about processing the data we store.
Lists, for example,
can collect attributes about people in a positionally
ordered way. Start up your Python interactive interpreter and type the
following two statements:
>>>bob = ['Bob Smith', 42, 30000, 'software']
>>>sue = ['Sue Jones', 45, 40000, 'hardware']
We’ve just made two records, albeit simple ones, to represent two
people, Bob and Sue (my apologies if you really are Bob or Sue,
generically or otherwise
[
2
]
). Each record is a list of four properties: name, age,
pay, and job fields. To access these fields, we simply index by
position; the result is in parentheses here because it is a tuple of two
results:
>>>bob[0], sue[2]
# fetch name, pay
('Bob Smith', 40000)
Processing records is easy with this representation; we just use
list operations. For example, we can extract a last name
by splitting the name field on blanks and grabbing the last part, and we
can give someone a raise by changing their list in-place:
>>>bob[0].split()[-1]
# what's bob's last name?
'Smith'
>>>sue[2] *= 1.25
# give sue a 25% raise
>>>sue
['Sue Jones', 45, 50000.0, 'hardware']
The last-name expression here proceeds from left to right: we
fetch Bob’s name, split it into a list of substrings around spaces, and
index his last name (run it one step at a time to see how).
Since this is
the first code in this book, here are some quick
pragmatic pointers for reference:
This code may be typed in the IDLE GUI; after typingpython
at a shell prompt
(or the full directory path to it if it’s not on your system
path); and so on.
The>>>
characters are Python’s prompt (not code you type
yourself).
The informational lines that Python prints when this prompt
starts up are usually omitted in this book to save space.
I’m running all of this book’s code under Python 3.1;
results in any 3.X release should be similar (barring
unforeseeable Python changes, of course).
Apart from some system and C integration code, most of this
book’s examples are run under Windows 7, though thanks to Python
portability, it generally doesn’t matter unless stated
otherwise.
If you’ve never run Python code this way before, see an
introductory resource such as O’Reilly’s
Learning
Python
for help with getting started. I’ll also
have a few words to say about running code saved in script files later
in this chapter.
Of course,
what we’ve really coded so far is just two variables,
not a database; to collect Bob and Sue into a unit, we might simply
stuff them into another list:
>>>people = [bob, sue]
# reference in list of lists
>>>for person in people:
print(person)
['Bob Smith', 42, 30000, 'software']
['Sue Jones', 45, 50000.0, 'hardware']
Now the people list represents our database. We can fetch
specific records by their relative positions and process them one at a
time, in loops:
>>>people[1][0]
'Sue Jones'
>>>for person in people:
print(person[0].split()[-1])
# print last namesperson[2] *= 1.20
# give each a 20% raise
Smith
Jones
>>>for person in people: print(person[2])
# check new pay
36000.0
60000.0
Now that we have a list, we can also collect values from records
using some of Python’s more powerful iteration tools, such as list
comprehensions, maps, and generator
expressions
:
>>>pays = [person[2] for person in people]
# collect all pay
>>>pays
[36000.0, 60000.0]
>>>pays = map((lambda x: x[2]), people)
# ditto (map is a generator in 3.X)
>>>list(pays)
[36000.0, 60000.0]
>>>sum(person[2] for person in people)
# generator expression, sum built-in
96000.0
To add a record to the database, the usual list operations, such
asappend
andextend
, will suffice:
>>>people.append(['Tom', 50, 0, None])
>>>len(people)
3
>>>people[-1][0]
'Tom'
Lists work for our people database, and they might be sufficient
for some programs, but they suffer from a few major flaws. For one
thing, Bob and Sue, at this point, are just fleeting objects in memory
that will disappear once we exit Python. For another, every time we
want to extract a last name or give a raise, we’ll have to repeat the
kinds of code we just typed; that could become a problem if we ever
change the way those operations work—we may have to update many places
in our code. We’ll address these issues in a few moments.
Perhaps more
fundamentally, accessing fields by position in a list
requires us to memorize what each position means: if you see a bit of
code indexing a record on magic position 2, how can you tell it is
extracting a pay? In terms of understanding the code, it might be
better to associate a field name with a field value.
We might try to associate names with relative positions by using
the Pythonrange
built-in
function, which generates successive integers when used in iteration
contexts (such as the sequence assignment used initially here):
>>>NAME, AGE, PAY = range(3)
# 0, 1, and 2
>>>bob = ['Bob Smith', 42, 10000]
>>>bob[NAME]
'Bob Smith'
>>>PAY, bob[PAY]
(2, 10000)
This addresses readability: the three uppercase variables
essentially become field names. This makes our code dependent on the
field position assignments, though—we have to remember to update the
range assignments whenever we change record structure. Because they
are not directly associated, the names and records may become out of
sync over time and require a maintenance step.
Moreover, because the field names are independent variables,
there is no direct mapping from a record list back to its field’s
names. A raw record list, for instance, provides no way to label its
values with field names in a formatted display. In the preceding
record, without additional code, there is no path from value 42 to
label AGE:bob.index(42)
gives 1,
the value of AGE, but not the name AGE itself.
We might also try this by using lists of tuples, where the
tuples record both a field name and a value; better yet, a list of
lists would allow for updates (tuples are immutable). Here’s what that
idea translates to, with slightly simpler records:
>>>bob = [['name', 'Bob Smith'], ['age', 42], ['pay', 10000]]
>>>sue = [['name', 'Sue Jones'], ['age', 45], ['pay', 20000]]
>>>people = [bob, sue]
This really doesn’t fix the problem, though, because we still
have to index by position in order to fetch fields:
>>>for person in people:
print(person[0][1], person[2][1])
# name, pay
Bob Smith 10000
Sue Jones 20000
>>>[person[0][1] for person in people]
# collect names
['Bob Smith', 'Sue Jones']
>>>for person in people:
print(person[0][1].split()[-1])
# get last namesperson[2][1] *= 1.10
# give a 10% raise
Smith
Jones
>>>for person in people: print(person[2])
['pay', 11000.0]
['pay', 22000.0]
All we’ve really done here is add an extra level of positional
indexing. To do better, we might inspect field names in loops to find
the one we want (the loop uses tuple assignment here to unpack the
name/value pairs):
>>>for person in people:
for (name, value) in person:
if name == 'name': print(value)
# find a specific field
Bob Smith
Sue Jones
Better yet, we can code a fetcher function to do the job for
us:
>>>def field(record, label):
for (fname, fvalue) in record:
if fname == label:
# find any field by namereturn fvalue
>>>field(bob, 'name')
'Bob Smith'
>>>field(sue, 'pay')
22000.0
>>>for rec in people:
print(field(rec, 'age'))
# print all ages
42
45
If we proceed down this path, we’ll eventually wind up with a
set of record interface functions that generically map field names to
field data. If you’ve done any Python coding in the past, though, you
probably already know that there is an easier way to code this sort of
association, and you can probably guess where we’re headed in the next
section.
The list-based
record representations in the prior section work, though
not without some cost in terms of performance required to search for
field names (assuming you need to care about milliseconds and such). But
if you already know some Python, you also know that there are more
efficient and convenient ways to associate property names and values.
The built-in dictionary object is a natural:
>>>bob = {'name': 'Bob Smith', 'age': 42, 'pay': 30000, 'job': 'dev'}
>>>sue = {'name': 'Sue Jones', 'age': 45, 'pay': 40000, 'job': 'hdw'}
Now, Bob and Sue are objects that map field names to values
automatically, and they make our code more understandable and
meaningful. We don’t have to remember what a numeric offset means, and
we let Python search for the value associated with a field’s name with
its efficient dictionary indexing:
>>>bob['name'], sue['pay']
# not bob[0], sue[2]
('Bob Smith', 40000)
>>>bob['name'].split()[-1]
'Smith'
>>>sue['pay'] *= 1.10
>>>sue['pay']
44000.0
Because fields are accessed mnemonically now, they are more
meaningful to those who read your code (including you).
Dictionaries
turn out to be so useful in Python programming that
there are even more convenient ways to code them than the traditional
literal syntax shown earlier—e.g., with keyword arguments and the type
constructor, as long as the keys are all strings:
>>>bob = dict(name='Bob Smith', age=42, pay=30000, job='dev')
>>>sue = dict(name='Sue Jones', age=45, pay=40000, job='hdw')
>>>bob
{'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
>>>sue
{'pay': 40000, 'job': 'hdw', 'age': 45, 'name': 'Sue Jones'}
by filling out a dictionary one field at a time (recall that
dictionary keys are pseudo-randomly ordered):
>>>sue = {}
>>>sue['name'] = 'Sue Jones'
>>>sue['age'] = 45
>>>sue['pay'] = 40000
>>>sue['job'] = 'hdw'
>>>sue
{'job': 'hdw', 'pay': 40000, 'age': 45, 'name': 'Sue Jones'}
and by zipping together name/value lists:
>>>names = ['name', 'age', 'pay', 'job']
>>>values = ['Sue Jones', 45, 40000, 'hdw']
>>>list(zip(names, values))
[('name', 'Sue Jones'), ('age', 45), ('pay', 40000), ('job', 'hdw')]
>>>sue = dict(zip(names, values))
>>>sue
{'job': 'hdw', 'pay': 40000, 'age': 45, 'name': 'Sue Jones'}
We can even make dictionaries from a sequence of key values and
an optional starting value for all the keys (handy to initialize an
empty dictionary):
>>>fields = ('name', 'age', 'job', 'pay')
>>>record = dict.fromkeys(fields, '?')
>>>record
{'job': '?', 'pay': '?', 'age': '?', 'name': '?'}
Regardless of
how we code them, we still need to collect our
dictionary-based records into a database; a list does the trick again,
as long as we don’t require access by key at the top level:
>>>bob
{'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
>>>sue
{'job': 'hdw', 'pay': 40000, 'age': 45, 'name': 'Sue Jones'}
>>>people = [bob, sue]
# reference in a list
>>>for person in people:
print(person['name'], person['pay'], sep=', ')
# all name, pay
Bob Smith, 30000
Sue Jones, 40000
>>>for person in people:
if person['name'] == 'Sue Jones':
# fetch sue's payprint(person['pay'])
40000
Iteration tools work just as well here, but we use keys rather
than obscure positions (in database terms, the list comprehension and
map in the following code project the database on the “name” field
column):
>>>names = [person['name'] for person in people]
# collect names
>>>names
['Bob Smith', 'Sue Jones']
>>>list(map((lambda x: x['name']), people))
# ditto, generate
['Bob Smith', 'Sue Jones']
>>>sum(person['pay'] for person in people)
# sum all pay
70000
Interestingly, tools such as list comprehensions and on-demand
generator expressions can even approach the utility of SQL queries
here, albeit operating on in-memory
objects
:
>>>[rec['name'] for rec in people if rec['age'] >= 45]
# SQL-ish query
['Sue Jones']
>>>[(rec['age'] ** 2 if rec['age'] >= 45 else rec['age']) for rec in people]
[42, 2025]
>>>G = (rec['name'] for rec in people if rec['age'] >= 45)
>>>next(G)
'Sue Jones'
>>>G = ((rec['age'] ** 2 if rec['age'] >= 45 else rec['age']) for rec in people)
>>>G.__next__()
42
And because dictionaries are normal Python objects, these
records can also be accessed and updated with normal Python
syntax:
>>>for person in people:
print(person['name'].split()[-1])
# last nameperson['pay'] *= 1.10
# a 10% raise
Smith
Jones
>>>for person in people: print(person['pay'])
33000.0
44000.0
Incidentally, we
could avoid the last-name extraction code in the prior
examples by further structuring our records. Because all of Python’s
compound datatypes can be nested inside each other and as deeply as we
like, we can build up fairly complex information structures
easily—simply type the object’s syntax, and Python does all the work
of building the components, linking memory structures, and later
reclaiming their space. This is one of the great advantages of a
scripting language such as Python.
The following, for instance, represents a more structured record
by nesting a dictionary, list, and tuple inside another
dictionary:
>>>bob2 = {'name': {'first': 'Bob', 'last': 'Smith'},
'age': 42,
'job': ['software', 'writing'],
'pay': (40000, 50000)}
Because this record contains nested
structures, we simply index twice to go two levels deep:
>>>bob2['name']
# bob's full name
{'last': 'Smith', 'first': 'Bob'}
>>>bob2['name']['last']
# bob's last name
'Smith'
>>>bob2['pay'][1]
# bob's upper pay
50000
The name field is another dictionary here, so instead of
splitting up a string, we simply index to fetch the last name.
Moreover, people can have many jobs, as well as minimum and maximum
pay limits. In fact, Python becomes a sort of query language in such
cases—we can fetch or change nested data with the usual object
operations:
>>>for job in bob2['job']: print(job)
# all of bob's jobs
software
writing
>>bob2['job'][-1]
# bob's last job
'writing'
>>>bob2['job'].append('janitor')
# bob gets a new job
>>>bob2
{'job': ['software', 'writing', 'janitor'], 'pay': (40000, 50000), 'age': 42, 'name':
{'last': 'Smith', 'first': 'Bob'}}
It’s OK to grow the nested list withappend
, because it
is really an independent object. Such nesting can come in handy for
more sophisticated applications; to keep ours simple, we’ll stick to
the original flat record structure.
One last twist on
our people database: we can get a little more mileage
out of dictionaries here by using one to represent the database
itself. That is, we can use a dictionary of dictionaries—the outer
dictionary is the database, and the nested dictionaries are the
records within it. Rather than a simple list of records, a
dictionary-based database allows us to store and retrieve records by
symbolic key:
>>>bob = dict(name='Bob Smith', age=42, pay=30000, job='dev')
>>>sue = dict(name='Sue Jones', age=45, pay=40000, job='hdw')
>>>bob
{'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}
>>>db = {}
>>>db['bob'] = bob
# reference in a dict of dicts
>>>db['sue'] = sue
>>>
>>>db['bob']['name']
# fetch bob's name
'Bob Smith'
>>>db['sue']['pay'] = 50000
# change sue's pay
>>>db['sue']['pay']
# fetch sue's pay
50000
Notice how this structure allows us to access a record directly
instead of searching for it in a loop—we get to Bob’s name immediately
by indexing on keybob
. This really
is a dictionary of dictionaries, though you won’t see all the gory
details unless you display the database all at once (the
Pythonpprint
pretty-printer module can help with legibility here):
>>>db
{'bob': {'pay': 30000, 'job': 'dev', 'age': 42, 'name': 'Bob Smith'}, 'sue':
{'pay': 50000, 'job': 'hdw', 'age': 45, 'name': 'Sue Jones'}}
>>>import pprint
>>>pprint.pprint(db)
{'bob': {'age': 42, 'job': 'dev', 'name': 'Bob Smith', 'pay': 30000},
'sue': {'age': 45, 'job': 'hdw', 'name': 'Sue Jones', 'pay': 50000}}
If we still need to step through the database one record at a
time, we can now rely on dictionary iterators.
In recent Python releases, a dictionary iterator
produces one key in afor
loop each
time through (for compatibility with earlier releases, we can also
call thedb.keys
method
explicitly in thefor
loop rather
than saying justdb
, but since
Python 3’skeys
result is a generator, the effect is
roughly the same):
>>>for key in db:
print(key, '=>', db[key]['name'])
bob => Bob Smith
sue => Sue Jones
>>>for key in db:
print(key, '=>', db[key]['pay'])
bob => 30000
sue => 50000
To visit all records, either index by key as you go:
>>>for key in db:
print(db[key]['name'].split()[-1])
db[key]['pay'] *= 1.10
Smith
Jones
or step through the dictionary’s values to access records
directly:
>>>for record in db.values(): print(record['pay'])
33000.0
55000.0
>>>x = [db[key]['name'] for key in db]
>>>x
['Bob Smith', 'Sue Jones']
>>>x = [rec['name'] for rec in db.values()]
>>>x
['Bob Smith', 'Sue Jones']
And to add a new record, simply assign it to a new key; this is
just a dictionary, after all:
>>>db['tom'] = dict(name='Tom', age=50, job=None, pay=0)
>>>
>>>db['tom']
{'pay': 0, 'job': None, 'age': 50, 'name': 'Tom'}
>>>db['tom']['name']
'Tom'
>>>list(db.keys())
['bob', 'sue', 'tom']
>>>len(db)
3
>>>[rec['age'] for rec in db.values()]
[42, 45, 50]
>>>[rec['name'] for rec in db.values() if rec['age'] >= 45]
# SQL-ish query
['Sue Jones', 'Tom']
Although our database is still a transient object in memory, it
turns out that this
dictionary
-of-
dictionaries
format corresponds exactly
to a system that saves objects permanently—the
shelve (yes, this should probably be
shelf
, grammatically speaking, but the Python
module name and term is
shelve
). To learn how,
let’s move on to
the next section.