Programming Python (181 page)

Read Programming Python Online

Authors: Mark Lutz

Tags: #COMPUTERS / Programming Languages / Python

BOOK: Programming Python
3.48Mb size Format: txt, pdf, ePub

[
70
]
Because Python is so dynamic, guesses about relative
performance in Python are just as likely to be wrong as right.
Moreover, their accuracy is prone to change over time. Trust me on
this. I’ve made sweeping statements about performance in other
books, only to be made wrong by a later Python release that
optimized some operations more than others. Performance measurement
in Python is both nontrivial and an ongoing task. In general, code
for readability first, and worry about performance later, but always
gather data to support your optimization efforts.

Implementing Sets

Another commonly used
data structure is the
set
, a collection
of objects that support
operations such as:

Intersection

Make a new set with all items in common.

Union

Make a new set with all items in either operand.

Membership

Test whether an item exists in a set.

Other operations, such as difference and subset tests, can be useful
as well, depending on the intended use. Sets come in handy for dealing
with abstract group combinations. For instance, given a set of engineers
and a set of writers, you can pick out individuals who do both activities
by intersecting the two sets. A union of such sets would contain either
type of individual, but would include any given individual only once. This
latter property also makes sets ideal for removing duplicates from
collections—simply convert to and from a set to filter out repeats.

In fact, we relied on such operations in earlier chapters; PyMailGUI
in
Chapter 14
, for example, used
intersection, union, and difference to manage the set of active mail
downloads, and filtered out duplicate recipients in multiple contexts with
set conversion. Sets are a widely relevant tool on practical
programs.

Built-in Options

If you’ve studied the core
Python language, you should already know that, as for
stacks, Python comes with built-in support here as well. Here, though,
the support is even more direct—Python’s set datatype provides standard
and optimized set operations today. As a quick review, built-in set
usage is straightforward: set objects are initially created by calling
the type name with an iterable or sequence giving the components of the
set or by running a set comprehension expression:

>>>
x = set('abcde')
# make set from an iterable/sequence
>>>
y = {c for c in 'bdxyz'}
# same via set comprehension expression
>>>
x
{'a', 'c', 'b', 'e', 'd'}
>>>
y
{'y', 'x', 'b', 'd', 'z'}

Once you have a set, all the usual operations are available; here
are the most common:

>>>
'e' in x
# membership
True
>>>
x – y
# difference
{'a', 'c', 'e'}
>>>
x & y
# intersection
{'b', 'd'}
>>>
x | y
# union
{'a', 'c', 'b', 'e', 'd', 'y', 'x', 'z'}

Interestingly, just like the dictionaries, built-in sets are
unordered, and require that all set components be hashable (immutable).
Making a set with a dictionary of items works, but only because
set
uses the dictionary iterator, which
returns the next key on each iteration (it ignores key values):

>>>
x = set(['spam', 'ham', 'eggs'])
# sequence of immutables
>>>
x
{'eggs', 'ham', 'spam'}
>>>
x = {'spam', 'ham', 'eggs'}
# same but set literal if items known
>>>
x
{'eggs', 'ham', 'spam'}
>>>
x = set([['spam', 'ham'], ['eggs']])
# immutables do not work as items
TypeError: unhashable type: 'list'
>>>
x = set({'spam':[1, 1], 'ham': [2, 2], 'eggs':[3, 3]})
>>> x
{'eggs', 'ham', 'spam'}

Plus there are additional operations we won’t illuminate here—see
a core language text such as
Learning
Python
for more details. For instance, built-in sets
also support operations such as superset testing, and they come in two
flavors: mutable and frozen (frozen sets are hashable, and thus usable
in sets of sets). Moreover, set comprehensions are more powerful than
suggested, and sets are a natural at duplicate removal:

>>>
y = {c.upper() * 4 for c in 'spamham'}
# set comprehension
>>>
y
{'SSSS', 'AAAA', 'MMMM', 'HHHH', 'PPPP'}
>>>
>>>
list(set([1, 2, 3, 1, 2]))
# remove duplicates from a list
[1, 2, 3]

As for stacks, though, the built-in set type might not by itself
achieve all our goals. Moreover, homegrown set implementations turn out
to be an ideal vehicle for studying custom data structure
implementations in Python. Although the end result may not compete with
the performance of built-in set objects today, the code can still be
instructive to read, and fun to experiment with.

Also as for stacks, a custom set implementation will generally be
based upon other built-in types. Python lists, tuples, and strings come
close to the notion of a set: the
in
operator tests membership,
for
iterates, and so on. Here, we’ll add operations not directly supported
by Python sequences. In effect, we’re
extending
built-in types for unique requirements.

Set Functions

As before, let’s
first start out with a function-based set manager. But
this time, instead of managing a shared set object in a module, let’s
define functions to implement set operations on passed-in Python
sequences (see
Example 18-8
).

Example 18-8. PP4E\Dstruct\Basic\inter.py

"set operations for two sequences"
def intersect(seq1, seq2):
res = [] # start with an empty list
for x in seq1: # scan the first sequence
if x in seq2:
res.append(x) # add common items to the end
return res
def union(seq1, seq2):
res = list(seq1) # make a copy of seq1
for x in seq2: # add new items in seq2
if not x in res:
res.append(x)
return res

These functions work on any type of sequence—lists strings,
tuples, and other iterable objects that conform to the protocols
expected by these functions (
for
loops,
in
membership tests). In fact,
we can even use them on mixed object types: the last two commands in the
following test compute the intersection and union of a list and a tuple.
As usual in Python, the object
interface
is what
matters, not the specific types:

C:\...\PP4E\Dstruct\Basic>
python
>>>
from inter import *
>>>
s1 = "SPAM"
>>>
s2 = "SCAM"
>>>
intersect(s1, s2), union(s1, s2)
(['S', 'A', 'M'], ['S', 'P', 'A', 'M', 'C'])
>>>
intersect([1,2,3], (1,4))
[1]
>>>
union([1,2,3], (1,4))
[1, 2, 3, 4]

Notice that the result is always a list here, regardless of the
type of sequences passed in. We could work around this by converting
types or by using a class to sidestep this issue (and we will in a
moment). But type conversions aren’t clear-cut if the operands are
mixed-type sequences. Which type do we convert to?

Supporting multiple operands

If we’re going to use the
intersect
and
union
functions as general tools, one useful
extension is support for multiple arguments (i.e., more than two). The
functions in
Example 18-9
use Python’s variable-length argument lists feature to compute the
intersection and union of arbitrarily many
operands.

Example 18-9. PP4E\Dstruct\Basic\inter2.py

"set operations for multiple sequences"
def intersect(*args):
res = []
for x in args[0]: # scan the first list
for other in args[1:]: # for all other arguments
if x not in other: break # this item in each one?
else:
res.append(x) # add common items to the end
return res
def union(*args):
res = []
for seq in args: # for all sequence-arguments
for x in seq: # for all nodes in argument
if not x in res:
res.append(x) # add new items to result
return res

These multi-operand functions work on sequences in the same way
as the originals, but they also support three or more operands. Notice
the use of an
else
on the
intersection’s
for
loop here to detect common items. Also
note that the last two examples in the following session work on lists
with embedded compound objects: the
in
tests used by the
intersect
and
union
functions apply equality testing to
sequence nodes recursively, as deep as necessary to determine
collection comparison results:

C:\...\PP4E\Dstruct\Basic>
python
>>>
from inter2 import *
>>>
s1, s2, s3 = 'SPAM', 'SLAM', 'SCAM'
>>>
intersect(s1, s2)
['S', 'A', 'M']
>>>
intersect(s1, s2, s3)
['S', 'A', 'M']
>>>
intersect(s1, s2, s3, 'HAM')
['A', 'M']
>>>
union(s1, s2), union(s1, s2, s3)
(['S', 'P', 'A', 'M', 'L'], ['S', 'P', 'A', 'M', 'L', 'C'])
>>>
intersect([1, 2, 3], (1, 4), range(5))
# 3.X: range okay
[1]
>>>
s1 = (9, (3.14, 1), "bye", [1, 2], "mello")
>>>
s2 = [[1, 2], "hello", (3.14, 0), 9]
>>>
intersect(s1, s2)
[9, [1, 2]]
>>>
union(s1, s2)
[9, (3.14, 1), 'bye', [1, 2], 'mello', 'hello', (3.14, 0)]
Set Classes

The preceding
section’s set functions can operate on a variety of
objects, but they aren’t as friendly as true objects. Among other
things, your scripts need to keep track of the sequences passed into
these functions manually. Classes can do better: the class in
Example 18-10
implements a set
object that can hold any type of object. Like the stack classes, it’s
essentially a wrapper around a Python list with extra set
operations.

Example 18-10. PP4E\Dstruct\Basic\set.py

"multi-instance, customizable, encapsulated set class"
class Set:
def __init__(self, value = []): # on object creation
self.data = [] # manages a local list
self.concat(value)
def intersect(self, other): # other is any sequence type
res = [] # self is the instance subject
for x in self.data:
if x in other:
res.append(x)
return Set(res) # return a new Set
def union(self, other):
res = self.data[:] # make a copy of my list
for x in other:
if not x in res:
res.append(x)
return Set(res)
def concat(self, value): # value: a list, string, Set...
for x in value: # filters out duplicates
if not x in self.data:
self.data.append(x)
def __len__(self): return len(self.data)
def __getitem__(self, key): return self.data[key]
def __and__(self, other): return self.intersect(other)
def __or__(self, other): return self.union(other)
def __repr__(self): return ''

The
Set
class is used like the
Stack
class we saw earlier in this
chapter: we make instances and apply sequence operators plus unique set
operations to them. Intersection and union can be called as methods, or
by using the
&
and
|
operators normally used for built-in integer
objects. Because we can string operators in expressions now (e.g.,
x & y & z
), there is no
obvious need to support multiple operands in
intersect
/
union
methods here (though this model’s need
to create temporary objects within expressions might eventually come to
bear on performance). As with all rightly packaged objects, we can
either use the
Set
class within a
program or test it interactively as
follows:

>>>
from set import Set
>>>
users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>>
users2 = Set(['Jerry', 'Howard', 'Carol'])
>>>
users3 = Set(['Emily', 'Carol'])
>>>
users1 & users2

>>>
users1 | users2

>>>
users1 | users2 & users3

>>>
(users1 | users2) & users3

>>>
users1.data
['Bob', 'Emily', 'Howard', 'Peeper']
Optimization: Moving Sets to Dictionaries

Once you start using
the
Set
class, the
first problem you might encounter is its performance: its nested
for
loops and
in
scans become exponentially slow. That
slowness may or may not be significant in your applications, but library
classes should generally be coded as efficiently as possible.

One way to optimize set performance is by changing the
implementation to use dictionaries rather than lists for storing sets
internally—items may be stored as the keys of a dictionary whose values
are all
None
. Because lookup time is
constant and short for dictionaries, the
in
list scans of the original set can be
replaced with direct dictionary fetches in this scheme. In traditional
terms, moving sets to dictionaries replaces slow linear searches with
fast hashtable fetches. A computer scientist would explain this by
saying that the repeated nested scanning of the list-based intersection
is an
exponential
algorithm in terms of its
complexity, but dictionaries can be
linear
.

The module in
Example 18-11
implements this
idea. Its class is a subclass of the original set, and it redefines the
methods that deal with the internal representation but inherits others.
The inherited
&
and
|
methods trigger the new
intersect
and
union
methods here, and the inherited
len
method works on dictionaries as is. As
long as
Set
clients are not dependent
on the order of items in a set, most can switch to this version directly
by just changing the name of the module from which
Set
is imported; the class name is the
same.

Example 18-11. PP4E\Dstruct\Basic\fastset.py

"optimize with linear-time scans using dictionaries"
import set
# fastset.Set extends set.Set
class Set(set.Set):
def __init__(self, value = []):
self.data = {} # manages a local dictionary
self.concat(value) # hashing: linear search times
def intersect(self, other):
res = {}
for x in other: # other: a sequence or Set
if x in self.data: # use hash-table lookup; 3.X
res[x] = None
return Set(res.keys()) # a new dictionary-based Set
def union(self, other):
res = {} # other: a sequence or Set
for x in other: # scan each set just once
res[x] = None
for x in self.data.keys(): # '&' and '|' come back here
res[x] = None # so they make new fastset's
return Set(res.keys())
def concat(self, value):
for x in value: self.data[x] = None
# inherit and, or, len
def __getitem__(self, ix):
return list(self.data.keys())[ix] # 3.X: list()
def __repr__(self):
return '' % list(self.data.keys()) # ditto

This works about the same as the previous version, even though the
internal implementation is radically different:

>>>
from fastset import Set
>>>
users1 = Set(['Bob', 'Emily', 'Howard', 'Peeper'])
>>>
users2 = Set(['Jerry', 'Howard', 'Carol'])
>>>
users3 = Set(['Emily', 'Carol'])
>>>
users1 & users2

>>>
users1 | users2

>>>
users1 | users2 & users3

>>>
(users1 | users2) & users3

>>>
users1.data
{'Peeper': None, 'Bob': None, 'Howard': None, 'Emily': None}

The main functional difference in this version is the
order
of items in the set: because dictionaries are
randomly ordered, this set’s order will differ from the original. The
order of results can even vary across Python releases (in fact it did,
between Python 2.X and 3.X in the third and fourth editions of this
book). For instance, you can store compound objects in sets, but the
order of items varies in this version:

>>>
import set, fastset
>>>
a = set.Set([(1,2), (3,4), (5,6)])
>>>
b = set.Set([(3,4), (7,8)])
>>>
a & b

>>>
a | b

>>>
a = fastset.Set([(1,2), (3,4), (5,6)])
>>>
b = fastset.Set([(3,4), (7,8)])
>>>
a & b

>>>
a | b

>>>
b | a

Sets aren’t supposed to be ordered anyhow, so this isn’t a
showstopper. A deviation that might matter, though, is that this version
cannot be used to store
unhashable
(that is,
immutable
) objects. This stems from the fact that
dictionary keys must be immutable. Because values are stored in keys,
dictionary sets can contain only things such as tuples, strings,
numbers, and class objects with immutable signatures. Mutable objects
such as lists and dictionaries won’t work directly in this
dictionary-based set, but do in the original set class. Tuples do work
here as compound set items, though, because they are immutable; Python
computes hash values and tests key equality as expected:

>>>
set.Set([[1, 2],[3, 4]])

>>>
fastset.Set([[1, 2],[3, 4]])
TypeError: unhashable type: 'list'
>>>
x = fastset.Set([(1, 2), (3, 4)])
>>>
x & fastset.Set([(3, 4), (1, 5)])

Timing the results under Python 3.1

So how did we do on the optimization front this time? Again,
guesses aren’t usually good enough, though algorithmic complexity
seems a compelling piece of evidence here. To be sure,
Example 18-12
codes a script to
compare set class performance. It reuses the
timer
module of
Example 18-6
used earlier to
compare stacks (our code may implement different objects, but it
doesn’t warp time).

Example 18-12. PP4E\Dstruct\Basic\settime.py

"compare set alternatives performance"
import timer, sys
import set, fastset
def setops(Class): # 3.X: range okay
a = Class(range(50)) # a 50-integer set
b = Class(range(20)) # a 20-integer set
c = Class(range(10))
d = Class(range(5))
for i in range(5):
t = a & b & c & d # 3 intersections
t = a | b | c | d # 3 unions
if __name__ == '__main__':
rept = int(sys.argv[1])
print('set => ', timer.test(rept, setops, set.Set))
print('fastset =>', timer.test(rept, setops, fastset.Set))

The
setops
function makes
four sets and combines them with intersection and union operators five
times. A command-line argument controls the number of times this whole
process is repeated. More accurately, each call to
setops
makes 34
Set
instances (4 + [5 × (3 + 3)]) and runs
the
intersect
and
union
methods 15 times each (5 × 3) in the
for
loop’s body. The performance
improvement is equally dramatic this time around, on the same Windows
7 laptop under Python 3.1:

C:\...\PP4E\Dstruct\Basic>
python settime.py 50
set => 0.637593916437
fastset => 0.20435049302
C:\...\PP4E\Dstruct\Basic>
python settime.py 100
set => 1.21924758303
fastset => 0.393896570828
C:\...\PP4E\Dstruct\Basic>
python settime.py 200
set => 2.51036677716
fastset => 0.802708664223

These results will vary per machine, and they may vary per
Python release. But at least for this specific test case, the
dictionary-based set implementation (
fastest
) is roughly
three
times
faster than the simple list-based set (
set
). In fact, this threefold speedup is
probably sufficient. Python dictionaries are already optimized
hashtables that you might be hard-pressed to improve on. Unless there
is evidence that dictionary-based sets are still too slow, our work
here is probably done.

By comparison, results for Python 2.4 in the prior edition of
this book showed
fastest
to be six times faster than
set
in all cases. Either iteration
operations sped up, or dictionary operations slowed down in 3.X. In
the even older Python 1.5.2 and second edition, the relative results
were the same as they are today in Python 3.1. In any event, this well
underscores the fact that you must test performance on your machine
and your Python—today’s Python performance observation may easily be
tomorrow’s historic
anecdote.

Other books

Bound Together by Eliza Jane
Jess the Lonely Puppy by Holly Webb
Shatter - Sins of the Sidhe by Briana Michaels
Evolution's Captain by Peter Nichols
When You Make It Home by Claire Ashby
Cody's Varsity Rush by Todd Hafer
Ever After by Ryan, Carrie Ann, Harte, Marie, Royce, Rebecca, Davis, Lia, Shaw, Leia