Programming Python (41 page)

Authors: Mark Lutz

Copying Directory Trees

My CD writer sometimes does weird things. In fact, copies of files with
odd names can be totally botched on the CD, even though other files show
up in one piece. That’s not necessarily a showstopper; if just a few files
are trashed in a big CD backup copy, I can always copy the offending files
elsewhere one at a time. Unfortunately, drag-and-drop copies on some
versions of Windows don’t play nicely with such a CD: the copy operation
stops and exits the moment the first bad file is encountered. You get only
as many files as were copied up to the error, but no more.

In fact, this is not limited to CD copies. I’ve run into similar
problems when trying to back up my laptop’s hard drive to another
drive—the drag-and-drop copy stops with an error as soon as it reaches a
file with a name that is too long or odd to copy (common in saved web
pages). The last 30 minutes spent copying is wasted time; frustrating, to
say the least!

There may be some magical Windows setting to work around this
feature, but I gave up hunting for one as soon as I realized that it would
be easier to code a copier in Python. The cpall.py script in
Example 6-10 is one way
to do it. With this script, I control what happens when bad files are
found—I can skip over them with Python exception handlers, for instance.
Moreover, this tool works with the same interface and effect on other
platforms. It seems to me, at least, that a few minutes spent writing a
portable and reusable Python script to meet a need is a better investment
than looking for solutions that work on only one platform (if at
all).

Example 6-10. PP4E\System\Filetools\cpall.py

"""
################################################################################
Usage: "python cpall.py dirFrom dirTo".
Recursive copy of a directory tree. Works like a "cp -r dirFrom/* dirTo"
Unix command, and assumes that dirFrom and dirTo are both directories.
Was written to get around fatal error messages under Windows drag-and-drop
copies (the first bad file ends the entire copy operation immediately),
but also allows for coding more customized copy operations in Python.
################################################################################
"""
import os, sys
maxfileload = 1000000
blksize = 1024 * 500

def copyfile(pathFrom, pathTo, maxfileload=maxfileload):
    """
    Copy one file pathFrom to pathTo, byte for byte;
    uses binary file modes to suppress Unicode decode and endline transform
    """
    if os.path.getsize(pathFrom) <= maxfileload:
        bytesFrom = open(pathFrom, 'rb').read()   # read small file all at once
        open(pathTo, 'wb').write(bytesFrom)
    else:
        fileFrom = open(pathFrom, 'rb')           # read big files in chunks
        fileTo   = open(pathTo,   'wb')           # need b mode for both
        while True:
            bytesFrom = fileFrom.read(blksize)    # get one block, less at end
            if not bytesFrom: break               # empty after last chunk
            fileTo.write(bytesFrom)

def copytree(dirFrom, dirTo, verbose=0):
    """
    Copy contents of dirFrom and below to dirTo, return (files, dirs) counts;
    may need to use bytes for dirnames if undecodable on other platforms;
    may need to do more file type checking on Unix: skip links, fifos, etc.
    """
    fcount = dcount = 0
    for filename in os.listdir(dirFrom):                  # for files/dirs here
        pathFrom = os.path.join(dirFrom, filename)
        pathTo   = os.path.join(dirTo,   filename)        # extend both paths
        if not os.path.isdir(pathFrom):                   # copy simple files
            try:
                if verbose > 1: print('copying', pathFrom, 'to', pathTo)
                copyfile(pathFrom, pathTo)
                fcount += 1
            except:
                print('Error copying', pathFrom, 'to', pathTo, '--skipped')
                print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            if verbose: print('copying dir', pathFrom, 'to', pathTo)
            try:
                os.mkdir(pathTo)                          # make new subdir
                below = copytree(pathFrom, pathTo, verbose)   # recur into subdirs
                fcount += below[0]                        # add subdir counts
                dcount += below[1]
                dcount += 1
            except:
                print('Error creating', pathTo, '--skipped')
                print(sys.exc_info()[0], sys.exc_info()[1])
    return (fcount, dcount)

def getargs():
    """
    Get and verify directory name arguments, returns default None on errors
    """
    try:
        dirFrom, dirTo = sys.argv[1:]
    except:
        print('Usage error: cpall.py dirFrom dirTo')
    else:
        if not os.path.isdir(dirFrom):
            print('Error: dirFrom is not a directory')
        elif not os.path.exists(dirTo):
            os.mkdir(dirTo)
            print('Note: dirTo was created')
            return (dirFrom, dirTo)
        else:
            print('Warning: dirTo already exists')
            if hasattr(os.path, 'samefile'):
                same = os.path.samefile(dirFrom, dirTo)
            else:
                same = os.path.abspath(dirFrom) == os.path.abspath(dirTo)
            if same:
                print('Error: dirFrom same as dirTo')
            else:
                return (dirFrom, dirTo)

if __name__ == '__main__':
    import time
    dirstuple = getargs()
    if dirstuple:
        print('Copying...')
        # note: time.clock was removed in Python 3.8;
        # time.perf_counter() is the replacement on newer Pythons
        start = time.clock()
        fcount, dcount = copytree(*dirstuple)
        print('Copied', fcount, 'files,', dcount, 'directories', end=' ')
        print('in', time.clock() - start, 'seconds')

This script implements its own recursive tree traversal logic and
keeps track of both the “from” and “to” directory paths as it goes. At
every level, it copies over simple files, creates directories in the “to”
path, and recurs into subdirectories with “from” and “to” paths extended
by one level. There are other ways to code this task (e.g., we might
change the working directory along the way with os.chdir calls, or there
is probably an os.walk solution that replaces from and to path prefixes
as it walks), but extending paths on recursive descent works well in this
script.
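
For comparison, here is a rough sketch of the os.walk alternative mentioned
above. It is not code from the book's examples: the copytree_walk name and
its error handling are assumptions made for illustration, it relies on the
standard library's shutil.copyfile for the per-file copy, and it assumes
that the "to" directory already exists.

```python
import os
import shutil

def copytree_walk(dir_from, dir_to):
    """
    Copy the tree at dir_from into the existing directory dir_to,
    skipping anything that fails; return (files, dirs) counts.
    Hypothetical variant of copytree, for illustration only.
    """
    fcount = dcount = 0
    for this_dir, subdirs, files in os.walk(dir_from):
        # swap the "from" path prefix for the "to" path prefix
        relative = os.path.relpath(this_dir, dir_from)
        target_dir = os.path.normpath(os.path.join(dir_to, relative))

        for name in list(subdirs):             # iterate a copy: we may prune
            try:
                os.mkdir(os.path.join(target_dir, name))
                dcount += 1
            except OSError as exc:
                print('Error creating', name, '--skipped:', exc)
                subdirs.remove(name)           # tell os.walk not to descend

        for name in files:
            try:
                shutil.copyfile(os.path.join(this_dir, name),
                                os.path.join(target_dir, name))
                fcount += 1
            except OSError as exc:
                print('Error copying', name, '--skipped:', exc)
    return fcount, dcount
```

Pruning the subdirs list in place works because os.walk runs top-down by
default and consults that list before descending. Whether this reads better
than the explicit recursion in cpall.py is largely a matter of taste.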

Notice this script’s reusable copyfile function: just in case there are
multigigabyte files in the tree to be copied, it uses a file’s size to
decide whether it should be read all at once or in chunks (remember, the
file read method without arguments actually loads the entire file into a
single in-memory string, a bytes object in binary mode). We choose fairly
large file and block sizes, because the more we read at once in Python,
the faster our scripts will typically run. This is more efficient than it
may sound; strings left behind by prior reads will be garbage collected
and reused as we go. We’re using binary file modes here again, too, to
suppress the Unicode encodings and end-of-line translations of text
files—trees may contain arbitrary kinds of files.
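
As a minimal sketch of the chunked branch on its own, the following variant
always copies block by block and uses with statements so that both files are
closed even if a read or write fails. The copyfile_chunked name is
hypothetical, not part of cpall.py; the default block size simply mirrors
the script's blksize.

```python
def copyfile_chunked(path_from, path_to, blksize=1024 * 500):
    """
    Copy path_from to path_to in binary mode, blksize bytes at a time.
    Hypothetical helper, for illustration only.
    """
    with open(path_from, 'rb') as file_from, open(path_to, 'wb') as file_to:
        while True:
            chunk = file_from.read(blksize)    # one block, less at the end
            if not chunk:                      # empty bytes means end-of-file
                break
            file_to.write(chunk)
```

The standard library's shutil.copyfileobj performs a similar block-by-block
loop on two open file objects, if you would rather not code the loop
yourself.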

Also notice that this script creates the “to” directory if needed,
but it assumes that the directory is empty when a copy starts up; for
accuracy, be sure to remove the target directory before copying a new tree
to its name, or old files may linger in the target tree (we could
automatically remove the target first, but this may not always be
desired). This script also tries to determine if the source and target are
the same; on Unix-like platforms with oddities such as links,
os.path.samefile does a more accurate job than comparing absolute file
names (different file names may be the same file).
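
The same-directory test can be pulled out into a small helper, sketched
here under the assumption that both paths already exist (os.path.samefile
raises an exception otherwise). The same_directory name is made up for this
illustration.

```python
import os

def same_directory(dir1, dir2):
    """
    True if dir1 and dir2 name the same existing directory.
    Hypothetical helper, for illustration only.
    """
    if hasattr(os.path, 'samefile'):
        # compares what the two paths resolve to, so links are handled
        return os.path.samefile(dir1, dir2)
    # fallback: compare normalized absolute names only
    norm = lambda path: os.path.normcase(os.path.abspath(path))
    return norm(dir1) == norm(dir2)
```

With a symbolic link that points at a directory, the name comparison in the
fallback reports two different directories, while os.path.samefile
recognizes them as one.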

Here is a copy of a big book examples tree (I use the tree from the
prior edition throughout this chapter) in action on Windows; pass in the
name of the “from” and “to” directories to kick off the process, redirect
the output to a file if there are too many error messages to read all at
once (e.g., > output.txt), and run an rm -r or rmdir /S shell command (or
similar platform-specific tool) to delete the target directory first if
needed:

C:\...\PP4E\System\Filetools> rmdir /S copytemp
copytemp, Are you sure (Y/N)? y
C:\...\PP4E\System\Filetools> cpall.py C:\temp\PP3E\Examples copytemp
Note: dirTo was created
Copying...
Copied 1430 files, 185 directories in 10.4470980971 seconds
C:\...\PP4E\System\Filetools> fc /B copytemp\PP3E\Launcher.py C:\temp\PP3E\Examples\PP3E\Launcher.py
Comparing files COPYTEMP\PP3E\Launcher.py and C:\TEMP\PP3E\EXAMPLES\PP3E\LAUNCHER.PY
FC: no differences encountered

You can use the copy function’s verbose argument to trace the process if
you wish. At the time I wrote this edition in 2010, this test run copied a
tree of 1,430 files and 185 directories in 10 seconds on my woefully
underpowered netbook machine (the built-in time.clock call is used to
query the system time in seconds); it may run arbitrarily faster or slower
for you. Still, this
is at least as fast as the best drag-and-drop I’ve timed on this
machine.

So how does this script work around bad files on a CD backup? The
secret is that it catches and ignores file exceptions, and it keeps
walking. To copy all the
files that are good on a CD, I simply run a command line such as this
one:

C:\...\PP4E\System\Filetools> python cpall.py G:\Examples C:\PP3E\Examples

Because the CD is addressed as “G:” on my Windows machine, this is
the command-line equivalent of drag-and-drop copying from an item in the
CD’s top-level folder, except that the Python script will recover from
errors on the CD and get the rest. On copy errors, it prints a message to
standard output and continues; for big copies, you’ll probably want to
redirect the script’s output to a file for later inspection.

In general, cpall can be passed any absolute directory path on your
machine, even those that indicate devices such as CDs. To make this go on
Linux, pass the directory where your CD is mounted (often something like
/mnt/cdrom or /media/cdrom; the device file /dev/cdrom itself is not a
directory). Once you’ve copied a tree this way, you still might want to
verify; to see how, let’s move on to the next example.

Comparing Directory Trees

Engineers can be a paranoid sort (but you didn’t hear that from me).
At least I am. It comes from decades of seeing things go terribly wrong, I
suppose. When I create a CD backup of my hard drive, for instance, there’s
still something a bit too magical about the process to trust the CD writer
program to do the right thing. Maybe I should, but it’s tough to have a
lot of faith in tools that occasionally trash files and seem to crash my
Windows machine every third Tuesday of the month. When push comes to
shove, it’s nice to be able to verify that data copied to a backup CD is
the same as the original, or at least to spot deviations from the
original, as soon as possible. If a backup is ever needed, it will be
really needed.

Because data CDs are accessible as simple directory trees in the
file system, we are once again in the realm of tree walkers—to verify a
backup CD, we simply need to walk its top-level directory. If our script
is general enough, we will also be able to use it to verify other copy
operations as well—e.g., downloaded tar files, hard-drive backups, and so
on. In fact, the combination of the cpall script of the prior section and
a general tree comparison would provide a portable and scriptable way to
copy and verify data sets.

We’ve already studied generic directory tree walkers, but they won’t
help us here directly: we need to walk two directories in parallel and
inspect common files along the way. Moreover,
walking either one of the two directories won’t allow us to spot files and
directories that exist only in the other. Something more custom and
recursive seems in order here.

Finding Directory Differences

Before we start coding, the first thing we need to clarify is what it
means to
compare two directory trees. If both trees have exactly the same branch
structure and depth, this problem reduces to comparing corresponding
files in each tree. In general, though, the trees can have arbitrarily
different shapes, depths, and so on.

More generally, the contents of a directory in one tree may have
more or fewer entries than the corresponding directory in the other
tree. If those differing contents are filenames, there is no
corresponding file to compare with; if they are directory names, there
is no corresponding branch to descend through. In fact, the only way to
detect files and directories that appear in one tree but not the other
is to detect differences in each level’s directory.

In other words, a tree comparison algorithm will also have to
perform directory comparisons along the way. Because this is a nested and
simpler operation, let’s start by coding and debugging a single-directory
comparison of filenames in Example 6-11.

Example 6-11. PP4E\System\Filetools\dirdiff.py

"""
################################################################################
Usage: python dirdiff.py dir1-path dir2-path
Compare two directories to find files that exist in one but not the other.
This version uses the os.listdir function and list difference. Note that
this script checks only filenames, not file contents--see diffall.py for an
extension that does the latter by comparing .read() results.
################################################################################
"""
import os, sys

def reportdiffs(unique1, unique2, dir1, dir2):
    """
    Generate diffs report for one dir: part of comparedirs output
    """
    if not (unique1 or unique2):
        print('Directory lists are identical')
    else:
        if unique1:
            print('Files unique to', dir1)
            for file in unique1:
                print('...', file)
        if unique2:
            print('Files unique to', dir2)
            for file in unique2:
                print('...', file)

def difference(seq1, seq2):
    """
    Return all items in seq1 only;
    a set(seq1) - set(seq2) would work too, but sets are randomly
    ordered, so any platform-dependent directory order would be lost
    """
    return [item for item in seq1 if item not in seq2]

def comparedirs(dir1, dir2, files1=None, files2=None):
    """
    Compare directory contents, but not actual files;
    may need bytes listdir arg for undecodable filenames on some platforms
    """
    print('Comparing', dir1, 'to', dir2)
    files1  = os.listdir(dir1) if files1 is None else files1
    files2  = os.listdir(dir2) if files2 is None else files2
    unique1 = difference(files1, files2)
    unique2 = difference(files2, files1)
    reportdiffs(unique1, unique2, dir1, dir2)
    return not (unique1 or unique2)               # true if no diffs

def getargs():
    "Args for command-line mode"
    try:
        dir1, dir2 = sys.argv[1:]                 # 2 command-line args
    except:
        print('Usage: dirdiff.py dir1 dir2')
        sys.exit(1)
    else:
        return (dir1, dir2)

if __name__ == '__main__':
    dir1, dir2 = getargs()
    comparedirs(dir1, dir2)

Given listings of names in two directories, this script simply
picks out unique names in the first and unique names in the second, and
reports any unique names found as differences (that is, files in one
directory but not the other). Its comparedirs function returns a true
result if no differences were found, which is useful for detecting
differences in callers.

Let’s run this script on a few directories; differences are
detected and reported as names unique in either passed-in directory
pathname. Notice that this is only a structural comparison that just
checks names in listings, not file contents (we’ll
add the latter in a moment):

C:\...\PP4E\System\Filetools> dirdiff.py C:\temp\PP3E\Examples copytemp
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical
C:\...\PP4E\System\Filetools> dirdiff.py C:\temp\PP3E\Examples\PP3E\System ..
Comparing C:\temp\PP3E\Examples\PP3E\System to ..
Files unique to C:\temp\PP3E\Examples\PP3E\System
... App
... Exits
... Media
... moreplus.py
Files unique to ..
... more.pyc
... spam.txt
... Tester
... __init__.pyc

The difference function is the heart of this script: it performs a simple
list difference operation. When applied to directories, unique items
represent tree differences, and common items are names of files or
subdirectories that merit further comparisons or traversals. In fact, in
Python 2.4 and later, we could also use the built-in set object type if we
don’t care about the order in the results; because sets are not sequences,
they would not maintain any original and possibly platform-specific
left-to-right order of the directory listings provided by os.listdir. For
that reason (and to avoid requiring users to upgrade), we’ll keep using
our own comprehension-based function instead of sets.
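
To make the trade-off concrete, here is a small sketch of two set-based
alternatives to the script's difference function. Neither is part of
dirdiff.py, and both function names are invented for this illustration: the
first gives up the listing order and sorts instead, and the second keeps
the first listing's order while using a set only for the membership test.

```python
def difference_sets(seq1, seq2):
    """Items in seq1 only; listing order is lost, so sort for repeatability."""
    return sorted(set(seq1) - set(seq2))

def difference_ordered(seq1, seq2):
    """Items in seq1 only, in seq1's order; the set just speeds up lookups."""
    exclude = set(seq2)
    return [item for item in seq1 if item not in exclude]
```

For directories with a handful of names the difference is academic; the
second form may be worth considering only for very large listings, where
repeated list searches become noticeable.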

Finding Tree Differences

We’ve just coded a directory
comparison tool that picks out unique files and
directories. Now all we need is a tree walker that applies dirdiff at each
level to report unique items, explicitly compares the contents of files in
common, and descends through directories in common. Example 6-12 fits the
bill.

Example 6-12. PP4E\System\Filetools\diffall.py

"""
################################################################################
Usage: "python diffall.py dir1 dir2".
Recursive directory tree comparison: report unique files that exist in only
dir1 or dir2, report files of the same name in dir1 and dir2 with differing
contents, report instances of same name but different type in dir1 and dir2,
and do the same for all subdirectories of the same names in and below dir1
and dir2. A summary of diffs appears at end of output, but search redirected
output for "DIFF" and "unique" strings for further details. New: (3E) limit
reads to 1M for large files, (3E) catch same name=file/dir, (4E) avoid extra
os.listdir() calls in dirdiff.comparedirs() by passing results here along.
################################################################################
"""
import os, dirdiff
blocksize = 1024 * 1024              # up to 1M per read

def intersect(seq1, seq2):
    """
    Return all items in both seq1 and seq2;
    a set(seq1) & set(seq2) would work too, but sets are randomly
    ordered, so any platform-dependent directory order would be lost
    """
    return [item for item in seq1 if item in seq2]

def comparetrees(dir1, dir2, diffs, verbose=False):
    """
    Compare all subdirectories and files in two directory trees;
    uses binary files to prevent Unicode decoding and endline transforms,
    as trees might contain arbitrary binary files as well as arbitrary text;
    may need bytes listdir arg for undecodable filenames on some platforms
    """
    # compare file name lists
    print('-' * 20)
    names1 = os.listdir(dir1)
    names2 = os.listdir(dir2)
    if not dirdiff.comparedirs(dir1, dir2, names1, names2):
        diffs.append('unique files at %s - %s' % (dir1, dir2))

    print('Comparing contents')
    common = intersect(names1, names2)
    missed = common[:]

    # compare contents of files in common
    for name in common:
        path1 = os.path.join(dir1, name)
        path2 = os.path.join(dir2, name)
        if os.path.isfile(path1) and os.path.isfile(path2):
            missed.remove(name)
            file1 = open(path1, 'rb')
            file2 = open(path2, 'rb')
            while True:
                bytes1 = file1.read(blocksize)
                bytes2 = file2.read(blocksize)
                if (not bytes1) and (not bytes2):
                    if verbose: print(name, 'matches')
                    break
                if bytes1 != bytes2:
                    diffs.append('files differ at %s - %s' % (path1, path2))
                    print(name, 'DIFFERS')
                    break

    # recur to compare directories in common
    for name in common:
        path1 = os.path.join(dir1, name)
        path2 = os.path.join(dir2, name)
        if os.path.isdir(path1) and os.path.isdir(path2):
            missed.remove(name)
            comparetrees(path1, path2, diffs, verbose)

    # same name but not both files or dirs?
    for name in missed:
        diffs.append('files missed at %s - %s: %s' % (dir1, dir2, name))
        print(name, 'DIFFERS')

if __name__ == '__main__':
    dir1, dir2 = dirdiff.getargs()
    diffs = []
    comparetrees(dir1, dir2, diffs, True)         # changes diffs in-place
    print('=' * 40)                               # walk, report diffs list
    if not diffs:
        print('No diffs found.')
    else:
        print('Diffs found:', len(diffs))
        for diff in diffs: print('-', diff)

At each directory in the tree, this script simply runs the dirdiff tool
to detect unique names, and then compares names in common by intersecting
directory lists. It
uses recursive function calls to traverse the tree and visits
subdirectories only after comparing all the files at each level so that
the output is more coherent to read (the trace output for subdirectories
appears after that for files; it is not intermixed).
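
For readers who would rather lean on the standard library, the same three
checks (unique names, differing contents, and same name but different kind)
can be roughly sketched with the filecmp module. This is an alternative
under stated assumptions, not the book's method: the report_tree function
and its (kind, path) tuples are invented here, and the traversal order and
output differ from diffall.py.

```python
import filecmp
import os

def report_tree(dir1, dir2, diffs):
    """
    Append (kind, path) tuples to diffs for each difference found.
    Hypothetical filecmp-based variant, for illustration only.
    """
    cmp = filecmp.dircmp(dir1, dir2, ignore=[])    # ignore=[]: skip no names

    # names that appear on one side only
    for name in cmp.left_only:
        diffs.append(('unique', os.path.join(dir1, name)))
    for name in cmp.right_only:
        diffs.append(('unique', os.path.join(dir2, name)))

    # shallow=False compares file contents, not just os.stat results
    match, mismatch, errors = filecmp.cmpfiles(
        dir1, dir2, cmp.common_files, shallow=False)
    for name in mismatch + errors:
        diffs.append(('differs', os.path.join(dir1, name)))

    # same name but different kinds (e.g., file versus directory)
    for name in cmp.common_funny:
        diffs.append(('type mismatch', os.path.join(dir1, name)))

    # descend into directories present on both sides
    for name in cmp.common_dirs:
        report_tree(os.path.join(dir1, name), os.path.join(dir2, name), diffs)
```

Passing an empty list for ignore matters here because dircmp skips a
default set of names otherwise. Coding the walker by hand, as diffall.py
does, keeps full control over the report format and the order of traversal.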

Notice the missed list, added in the third edition of this book; it’s
very unlikely, but not impossible, that the same name might be a file in
one directory and a subdirectory in the other. Also notice the blocksize
variable; much like the tree copy script we saw earlier, instead of
blindly reading entire files into memory all at once, we limit each read
to grab up to 1 MB at a time, just in case any files in the directories
are too big to be loaded into available memory. Without this limit, I ran
into MemoryError exceptions on some machines with a prior version of this
script that read both files all at once, like this:

bytes1 = open(path1, 'rb').read()
bytes2 = open(path2, 'rb').read()
if bytes1 == bytes2: ...

This code was simpler, but is less practical for very large files
that can’t fit into your available memory space (consider CD and DVD
image files, for example). In the new version’s loop, the file reads
return what is left when there is less than 1 MB present or remaining
and return empty strings at end-of-file. Files match if all blocks read
are the same, and they reach end-of-file at the same time.
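
Isolated from the tree walker, the block-by-block test might be sketched as
the function below. The files_match name is hypothetical; the loop follows
the same logic as the one in comparetrees, and the with statement is added
so that both files are closed on exit.

```python
def files_match(path1, path2, blocksize=1024 * 1024):
    """
    True if both files hold the same bytes, reading blocksize at a time.
    Hypothetical helper, for illustration only.
    """
    with open(path1, 'rb') as file1, open(path2, 'rb') as file2:
        while True:
            bytes1 = file1.read(blocksize)
            bytes2 = file2.read(blocksize)
            if bytes1 != bytes2:       # differing block, or one file ended early
                return False
            if not bytes1:             # both empty: end-of-file at the same time
                return True
```

The standard library offers a comparable check in filecmp.cmp(path1, path2,
shallow=False), which also compares contents in blocks.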

We’re also dealing in binary files and byte strings again to
suppress Unicode decoding and end-line translations for file content,
because trees may contain arbitrary binary and text files. The usual
note about changing this to pass byte strings to os.listdir on platforms
where filenames may generate Unicode decoding errors applies here as well
(e.g., pass dir1.encode()). On some platforms,
you may also want to detect and skip certain kinds of special files in
order to be fully general, but these were not in my trees, so they are
not in my script.
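
As a quick illustration of that bytes option: passing a bytes directory
name to os.listdir makes it return bytes filenames, which sidesteps
decoding altogether. The helper name below is made up, and os.fsencode
(available in Python 3.2 and later) is used here in place of a bare
encode() call so that the file system's own encoding is applied.

```python
import os

def list_names_as_bytes(dirpath):
    """
    List dirpath, returning names as undecoded bytes objects.
    Hypothetical helper, for illustration only.
    """
    return os.listdir(os.fsencode(dirpath))
```

Paths built from such names must be all bytes too; for instance,
os.path.join(os.fsencode(dirpath), name) works, but mixing str and bytes
arguments in a single os.path.join call raises an exception.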

One minor change for the fourth edition of this book: os.listdir
results are now gathered just once per subdirectory and passed along, to
avoid extra calls in dirdiff; not a huge win, but every cycle counts
on the pitifully underpowered netbook I used when writing this
edition.

Running the Script

Since we’ve already studied the tree-walking tools this script
employs, let’s jump right into a few example runs. When run on identical
trees, status messages scroll during the traversal, and a
No diffs found. message appears at the end:

C:\...\PP4E\System\Filetools> diffall.py C:\temp\PP3E\Examples copytemp > diffs.txt
C:\...\PP4E\System\Filetools> type diffs.txt | more
--------------------
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical
Comparing contents
README-root.txt matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E to copytemp\PP3E
Directory lists are identical
Comparing contents
echoEnvironment.pyw matches
LaunchBrowser.pyw matches
Launcher.py matches
Launcher.pyc matches
...over 2,000 more lines omitted...
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\TempParts to copytemp\PP3E\TempParts
Directory lists are identical
Comparing contents
109_0237.JPG matches
lawnlake1-jan-03.jpg matches
part-001.txt matches
part-002.html matches
========================================
No diffs found.

I usually run this with the verbose flag passed in as True, and redirect
output to a file (for big trees, it produces too much output to scroll
through comfortably); use False to watch fewer status messages fly by. To
show how differences are reported, we need to generate a few;
for simplicity, I’ll manually change a few files scattered about one of
the trees, but you could also run a global search-and-replace script
like the one we’ll write later in this chapter. While we’re at it, let’s
remove a few common files so that directory uniqueness differences show
up on the scope, too; the last two removal commands in the following
will generate one difference in the same directory in different
trees:

C:\...\PP4E\System\Filetools> notepad copytemp\PP3E\README-PP3E.txt
C:\...\PP4E\System\Filetools> notepad copytemp\PP3E\System\Filetools\commands.py
C:\...\PP4E\System\Filetools> notepad C:\temp\PP3E\Examples\PP3E\__init__.py
C:\...\PP4E\System\Filetools> del copytemp\PP3E\System\Filetools\cpall_visitor.py
C:\...\PP4E\System\Filetools> del copytemp\PP3E\Launcher.py
C:\...\PP4E\System\Filetools> del C:\temp\PP3E\Examples\PP3E\PyGadgets.py

Now, rerun the comparison walker to pick out differences and
redirect its output report to a file for easy inspection. The following
lists just the parts of the output report that identify differences. In
typical use, I inspect the summary at the bottom of the report first,
and then search for the strings "DIFF" and "unique" in the report’s text
if I need more information about the differences summarized; this
interface could be
much more user-friendly, of course, but it does the job for me:

C:\...\PP4E\System\Filetools> diffall.py C:\temp\PP3E\Examples copytemp > diff2.txt
C:\...\PP4E\System\Filetools> notepad diff2.txt
--------------------
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical
Comparing contents
README-root.txt matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E to copytemp\PP3E
Files unique to C:\temp\PP3E\Examples\PP3E
... Launcher.py
Files unique to copytemp\PP3E
... PyGadgets.py
Comparing contents
echoEnvironment.pyw matches
LaunchBrowser.pyw matches
Launcher.pyc matches
...more omitted...
PyGadgets_bar.pyw matches
README-PP3E.txt DIFFERS
todos.py matches
tounix.py matches
__init__.py DIFFERS
__init__.pyc matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\System\Filetools to copytemp\PP3E\System\Fil...
Files unique to C:\temp\PP3E\Examples\PP3E\System\Filetools
... cpall_visitor.py
Comparing contents
commands.py DIFFERS
cpall.py matches
...more omitted...
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\TempParts to copytemp\PP3E\TempParts
Directory lists are identical
Comparing contents
109_0237.JPG matches
lawnlake1-jan-03.jpg matches
part-001.txt matches
part-002.html matches
========================================
Diffs found: 5
- unique files at C:\temp\PP3E\Examples\PP3E - copytemp\PP3E
- files differ at C:\temp\PP3E\Examples\PP3E\README-PP3E.txt -
      copytemp\PP3E\README-PP3E.txt
- files differ at C:\temp\PP3E\Examples\PP3E\__init__.py -
      copytemp\PP3E\__init__.py
- unique files at C:\temp\PP3E\Examples\PP3E\System\Filetools -
      copytemp\PP3E\System\Filetools
- files differ at C:\temp\PP3E\Examples\PP3E\System\Filetools\commands.py -
      copytemp\PP3E\System\Filetools\commands.py

I added line breaks and tabs in a few of these output lines to
make them fit on this page, but the report is simple to understand. In a
tree with 1,430 files and 185 directories, we found five differences—the
three files we changed by edits, and the two directories we threw out of
sync with the three removal commands.
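
The habit of searching a saved report for the "DIFF" and "unique" strings
is easy to automate. The following is a minimal sketch, assuming the report
text has already been read from a redirected output file; the
interesting_lines name and its keys parameter are invented for this
example.

```python
def interesting_lines(report_text, keys=('DIFF', 'unique')):
    """
    Return the report lines that contain any of the key strings.
    Hypothetical helper, for illustration only.
    """
    return [line for line in report_text.splitlines()
            if any(key in line for key in keys)]
```

For a file saved as diff2.txt, a call such as
interesting_lines(open('diff2.txt').read()) would pick out the DIFFERS and
"unique" lines from the trace and the summary alike, because the search is
case-sensitive and a plain substring match.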
