Programming Python

Authors: Mark Lutz

Walking Directory Trees

You may have noticed that almost all of the techniques in this section
so far return the names of files in only a single directory (globbing
with more involved patterns is the only exception). That’s fine for many
tasks, but what if you want to apply an operation to every file in every
directory and subdirectory in an entire directory tree?

For instance, suppose again that we need to find every occurrence
of a global name in our Python scripts. This time, though, our scripts
are arranged into a module package: a directory with nested
subdirectories, which may have subdirectories of their own. We could
rerun our hypothetical single-directory searcher manually in every
directory in the tree, but that’s tedious, error prone, and just plain
not fun.

Luckily, in Python it’s almost as easy to process a directory tree
as it is to inspect a single directory. We can either write a recursive
routine to traverse the tree, or use a tree-walker utility built into
the os module. Such tools can be used to search, copy, compare, and
otherwise process arbitrary directory trees on any platform that Python
runs on (and that’s just about everywhere).

The os.walk visitor

To make it easy to apply an operation to all files in a complete
directory tree, Python comes with a utility that scans trees for us and
runs code we provide at every directory along the way: the os.walk
function is called with a directory root name and automatically walks
the entire tree at root and below.

Operationally, os.walk is a generator function: at each directory in
the tree, it yields a three-item tuple containing the name of the
current directory, a list of all the subdirectories in it, and a list of
all the files in it, in that order. Because it’s a generator, its walk
is usually run by a for loop (or other iteration tool); on each
iteration, the walker advances to the next subdirectory, and the loop
runs its code for the next level of the tree (for instance, opening
and searching all the files at that level).

That description might sound complex the first time you hear it,
but os.walk is fairly straightforward once you get the hang of it. In
the following, for example, the loop body’s code is run for each
directory in the tree rooted at the current working directory (.). Along
the way, the loop simply prints the directory name and all the files at
the current level after prepending the directory name. It’s simpler in
Python than in English (I removed the PP3E subdirectory for this test to
keep the output short):

>>> import os
>>> for (dirname, subshere, fileshere) in os.walk('.'):
...     print('[' + dirname + ']')
...     for fname in fileshere:
...         print(os.path.join(dirname, fname))     # handle one file
...
[.]
.\random.bin
.\spam.txt
.\temp.bin
.\temp.txt
[.\parts]
.\parts\part0001
.\parts\part0002
.\parts\part0003
.\parts\part0004

In other words, we’ve coded our own custom and easily changed
recursive directory listing tool in Python. Because this may be
something we would like to tweak and reuse elsewhere, let’s make it
permanently available in a module file, as shown in Example 4-4, now
that we’ve worked out the details interactively.

Example 4-4. PP4E\System\Filetools\lister_walk.py

"list file tree with os.walk"
import sys, os

def lister(root):                                           # for a root dir
    for (thisdir, subshere, fileshere) in os.walk(root):    # generate dirs in tree
        print('[' + thisdir + ']')
        for fname in fileshere:                             # print files in this dir
            path = os.path.join(thisdir, fname)             # add dir name prefix
            print(path)

if __name__ == '__main__':
    lister(sys.argv[1])                                     # dir name in cmdline

When packaged this way, the code can also be run from a shell
command line. Here it is being launched with the root directory to be
listed passed in as a command-line argument:

C:\...\PP4E\System\Filetools> python lister_walk.py C:\temp\test
[C:\temp\test]
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004

Here’s a more involved example of os.walk in action. Suppose you have
a directory tree of files and you want to find all Python source files
within it that reference the mimetypes module we’ll study in Chapter 6.
The following is one (albeit hardcoded and overly specific) way to
accomplish this task:

>>> import os
>>> matches = []
>>> for (dirname, dirshere, fileshere) in os.walk(r'C:\temp\PP3E\Examples'):
...     for filename in fileshere:
...         if filename.endswith('.py'):
...             pathname = os.path.join(dirname, filename)
...             if 'mimetypes' in open(pathname).read():
...                 matches.append(pathname)
...
>>> for name in matches: print(name)
...
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py

This code loops through all the files at each level, looking for
files with .py at the end of their names and which contain the search
string. When a match is found, its full name is appended to the results
list object; alternatively, we could also simply build a list of all .py
files and search each in a for loop after the walk. Since we’re going to
code a much more general solution to this type of problem in Chapter 6,
though, we’ll let this stand for now.
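
The two-pass alternative just mentioned can be sketched as a reusable
function. This is only an illustration, not the Chapter 6 code: the
function name find_in_py_files is hypothetical, and the choices to read
files as UTF-8 with errors='replace' and to skip unreadable files are
assumptions made to keep the sketch from failing on odd content.

```python
import os

def find_in_py_files(root, text):
    """Return paths of .py files under root whose content contains text."""
    # Pass 1: collect every .py path during the walk
    pyfiles = []
    for (dirname, dirshere, fileshere) in os.walk(root):
        for filename in fileshere:
            if filename.endswith('.py'):
                pyfiles.append(os.path.join(dirname, filename))

    # Pass 2: search each collected file after the walk has finished
    matches = []
    for pathname in pyfiles:
        try:
            # Assumption: treat content as UTF-8 and replace undecodable bytes
            with open(pathname, encoding='utf-8', errors='replace') as file:
                if text in file.read():
                    matches.append(pathname)
        except OSError:
            pass                      # assumption: silently skip unreadable files
    return matches
```

A call such as find_in_py_files(r'C:\temp\PP3E\Examples', 'mimetypes')
would stand in for the interactive loop above. Using a with statement
also closes each file promptly instead of leaving that to garbage
collection.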

If you want to see what’s really going on in the os.walk generator,
call its __next__ method (or equivalently, pass it to the next built-in
function) manually a few times, just as the for loop does automatically;
each time, you advance to the next subdirectory in the tree:

>>> gen = os.walk(r'C:\temp\test')
>>> gen.__next__()
('C:\\temp\\test', ['parts'], ['random.bin', 'spam.txt', 'temp.bin', 'temp.txt'])
>>> gen.__next__()
('C:\\temp\\test\\parts', [], ['part0001', 'part0002', 'part0003', 'part0004'])
>>> gen.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

The library manual documents os.walk further than we will here. For
instance, it supports bottom-up instead of top-down walks with its
optional topdown=False argument, and in the default top-down mode
callers may prune tree branches by deleting names in the subdirectories
lists of the yielded tuples.

Internally, in the Python release this book covers, the os.walk call
generates filename lists at each level with the os.listdir call we met
earlier, which collects both file and directory names in no particular
order and returns them without their directory paths; os.walk segregates
this list into subdirectories and files (technically, nondirectories)
before yielding a result. (Python 3.5 and later use the related
os.scandir call for this step, with the same visible results.) Also note
that walk uses the very same subdirectories list it yields to callers in
order to later descend into subdirectories. Because lists are mutable
objects that can be changed in place, if your code modifies the yielded
subdirectory names list during a top-down walk, it will impact what walk
does next. For example, deleting directory names will prune traversal
branches, and sorting the list will order the walk.
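
As a minimal sketch of that in-place technique, the following generator
both prunes and orders a top-down walk. The function name
walk_sorted_pruned and the default names to skip are hypothetical,
chosen only for illustration.

```python
import os

def walk_sorted_pruned(root, skip=('__pycache__', '.git')):
    """Yield directory paths top-down, in sorted order, skipping some names."""
    for (thisdir, subshere, fileshere) in os.walk(root):    # topdown=True is the default
        # Slice assignment changes the same list object that os.walk holds;
        # rebinding the name (subshere = ...) would have no effect on the walk
        subshere[:] = sorted(name for name in subshere if name not in skip)
        yield thisdir
```

The key detail is the slice assignment: os.walk consults its own
reference to the list, so only in-place changes are seen. With
topdown=False the subdirectories have already been visited by the time
a tuple is yielded, so this kind of pruning has no effect there.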

Recursive os.listdir traversals

The os.walk tool does the work of tree traversals for us; we simply
provide loop code with task-specific logic. However, it’s sometimes
more flexible and hardly any more work to do the walking ourselves.
The following script recodes the directory listing script with a
manual recursive traversal function (a function that calls itself to
repeat its actions). The mylister function in Example 4-5 is almost the
same as lister in Example 4-4 but calls os.listdir to generate file
paths manually and calls itself recursively to descend into
subdirectories.

Example 4-5. PP4E\System\Filetools\lister_recur.py

# list files in dir tree by recursion
import sys, os

def mylister(currdir):
    print('[' + currdir + ']')
    for file in os.listdir(currdir):                # list files here
        path = os.path.join(currdir, file)          # add dir path back
        if not os.path.isdir(path):
            print(path)
        else:
            mylister(path)                          # recur into subdirs

if __name__ == '__main__':
    mylister(sys.argv[1])                           # dir name in cmdline

As usual, this file can be both imported and called or run as a
script, though the fact that its result is printed text makes it less
useful as an imported component unless its output stream is captured
by another program.

When run as a script, this file’s output is equivalent to that of
Example 4-4, but not identical: unlike the os.walk version, our
recursive walker here doesn’t order the walk to visit files before
stepping into subdirectories. It could by looping through the filenames
list twice (selecting files first), but as coded, the order is dependent
on os.listdir results. For most use cases, the walk order would be
irrelevant:

C:\...\PP4E\System\Filetools> python lister_recur.py C:\temp\test
[C:\temp\test]
[C:\temp\test\parts]
C:\temp\test\parts\part0001
C:\temp\test\parts\part0002
C:\temp\test\parts\part0003
C:\temp\test\parts\part0004
C:\temp\test\random.bin
C:\temp\test\spam.txt
C:\temp\test\temp.bin
C:\temp\test\temp.txt
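
For comparison, here is a hedged sketch of the two-loop variant
described above, which reports files before descending. The name
mylister_files_first and its output parameter are hypothetical additions
(the parameter simply lets a caller collect lines instead of printing),
and sorting the os.listdir result is an extra assumption that makes the
order repeatable.

```python
import os

def mylister_files_first(currdir, output=print):
    """List a tree recursively, showing each directory's files before its subdirs."""
    output('[' + currdir + ']')
    # Assumption: sort names so results do not depend on os.listdir order
    paths = [os.path.join(currdir, name) for name in sorted(os.listdir(currdir))]
    for path in paths:                              # loop 1: files at this level
        if not os.path.isdir(path):
            output(path)
    for path in paths:                              # loop 2: recur into subdirs
        if os.path.isdir(path):
            mylister_files_first(path, output)
```

Passing a list’s append method as output, as in
mylister_files_first(root, lines.append), captures the listing for use
by an importing program rather than printing it.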

We’ll make better use of most of this section’s techniques in
later examples in Chapter 6 and in this book at large. For example,
scripts for copying and comparing directory trees use the tree-walker
techniques introduced here. Watch for these tools in action along the
way. We’ll also code a find utility in Chapter 6 that combines the tree
traversal of os.walk with the filename pattern expansion of glob.glob.
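
To give a feel for that combination ahead of time, the following is a
rough sketch only, not the Chapter 6 utility. It assumes the standard
library’s fnmatch.fnmatch function for matching each name against a
glob-style pattern; the function name find and its parameters are
illustrative.

```python
import fnmatch
import os

def find(pattern, startdir=os.curdir):
    """Generate paths under startdir whose base names match a glob-style pattern."""
    for (thisdir, subshere, fileshere) in os.walk(startdir):
        for name in subshere + fileshere:           # check dirs and files alike
            if fnmatch.fnmatch(name, pattern):      # shell-style match: *, ?, [...]
                yield os.path.join(thisdir, name)
```

Because this is a generator, callers can loop over results as they are
found, or wrap the call in list or sorted to collect them all, as in
sorted(find('*.py', r'C:\temp\test')).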

Handling Unicode Filenames in 3.X: listdir, walk, glob

Because all normal strings are Unicode in Python 3.X, the directory
and file names generated by os.listdir, os.walk, and glob.glob so far in
this chapter are technically Unicode strings. This can have some
ramifications if your directories contain unusual names that might not
decode properly.

Technically, because filenames may contain arbitrary text, the
os.listdir call works in two modes in 3.X: given a bytes argument, this
function will return filenames as encoded byte strings; given a normal
str string argument, it instead returns filenames as Unicode strings,
decoded per the filesystem’s encoding scheme:

C:\...\PP4E\System\Filetools> python
>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']
>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']

The byte string version can be used if undecodable file names may
be present. Because os.walk and glob.glob both work by calling
os.listdir internally, they inherit this behavior by proxy. The os.walk
tree walker, for example, calls os.listdir at each directory level;
passing byte string arguments suppresses decoding and returns byte
string results:

>>> for (dir, subs, files) in os.walk('..'): print(dir)
...
..
..\Environment
..\Filetools
..\Processes
>>> for (dir, subs, files) in os.walk(b'..'): print(dir)
...
b'..'
b'..\\Environment'
b'..\\Filetools'
b'..\\Processes'

The glob.glob tool similarly calls os.listdir internally before
applying name patterns, and so also returns undecoded byte string names
for byte string arguments (raw strings are used here so that the
backslash in the pattern is taken literally):

>>> import glob
>>> glob.glob(r'.\*')[:3]
['.\\bigext-out.txt', '.\\bigext-tree.py', '.\\bigpy-dir.py']
>>> glob.glob(rb'.\*')[:3]
[b'.\\bigext-out.txt', b'.\\bigext-tree.py', b'.\\bigpy-dir.py']

Given a normal string name (as a command-line argument, for
example), you can force the issue by converting to byte strings with
manual encoding to suppress decoding:

>>> name = '.'
>>> os.listdir(name.encode())[:4]
[b'bigext-out.txt', b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py']

The upshot is that if your directories may contain names which
cannot be decoded according to the underlying platform’s Unicode
encoding scheme, you may need to pass byte strings to these tools to
avoid Unicode encoding errors. You’ll get byte strings back, which may
be less readable if printed, but you’ll avoid errors while traversing
directories and files.

This might be especially useful on systems that use simple
encodings such as ASCII or Latin-1, but may contain files with
arbitrarily encoded names from cross-machine copies, the Web, and so on.
Depending upon context, exception handlers may be used to suppress some
types of encoding errors as well.
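
The following sketch combines the two ideas: it walks with byte string
names so the traversal itself cannot fail on decoding, and uses an
exception handler only when converting names for display. The function
name list_names_safely is hypothetical, and decoding per
sys.getfilesystemencoding with a fallback to errors='replace' is an
assumption about what a script might want, not a prescribed policy.

```python
import os
import sys

def list_names_safely(root):
    """Walk root using bytes names; return a printable str path for every file."""
    encoding = sys.getfilesystemencoding()
    results = []
    for (thisdir, subshere, fileshere) in os.walk(os.fsencode(root)):
        for fname in fileshere:
            path = os.path.join(thisdir, fname)     # still a bytes object here
            try:
                text = path.decode(encoding)        # normal case: decodes cleanly
            except UnicodeDecodeError:
                # Assumption: substitute a marker for undecodable bytes
                text = path.decode(encoding, errors='replace')
            results.append(text)
    return results
```

The original bytes paths remain the reliable way to open or rename such
files; the decoded text built here is meant for display only.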

We’ll see an example of how this can matter in the first section
of Chapter 6, where an undecodable directory name generates an error if
printed during a full disk scan (although that specific error seems more
related to printing than to decoding in general).

Note that the basic open built-in function allows the name of the file
being opened to be passed as either Unicode str or raw bytes, too,
though this is used only to name the file initially; the additional mode
argument determines whether the file’s content is handled in text or
binary modes. Passing a byte string filename allows you to name files
with arbitrarily encoded names.

Unicode policies: File content versus file names

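In fact, it’s important to keep in mind that there are two different
Unicode concepts related to files: the encoding of file content and the
encoding of file names. Python reports related platform settings through
two different calls in the sys module. Strictly speaking,
sys.getdefaultencoding names the default used by str.encode and
bytes.decode, while the default that open applies to text file content
comes from the locale module’s getpreferredencoding call and may differ
from it. On Windows 7:

>>> import sys
>>> sys.getdefaultencoding()          # default for str.encode, bytes.decode
'utf-8'
>>> sys.getfilesystemencoding()       # file name encoding, platform scheme
'mbcs'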

These settings allow you to be explicit when needed: the content
encoding is used when data is read and written to the file, and the
name encoding is used when dealing with names prior to transferring
data. In addition, using bytes for file name tools may work around
incompatibilities with the underlying file system’s scheme, and opening
files in binary mode can suppress Unicode decoding errors for content.

As we’ve seen, though, opening text files in binary mode
may also mean that the raw and still-encoded text will
not match search strings as expected: search strings must also be byte
strings encoded per a specific and possibly incompatible encoding
scheme. In fact, this approach essentially mimics the behavior of text
files in Python 2.X, and underscores why elevating Unicode in 3.X is
generally desirable—such text files sometimes may appear to work even
though they probably shouldn’t. On the other hand, opening text in
binary mode to suppress Unicode content decoding and avoid decoding
errors might still be useful if you do not wish to skip undecodable
files and content is largely irrelevant.

As a rule of thumb, you should try to always provide an encoding
name for text content if it might be outside the platform default, and
you should rely on the default Unicode API for file names in most
cases. Again, see Python’s manuals for more on the Unicode file name
story than we have space to cover fully here, and see Learning Python,
Fourth Edition, for more on Unicode in general.
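
A small self-contained sketch makes the content side of this concrete.
The file name, the sample text, and the UTF-8 and Latin-1 choices are
all assumptions picked for illustration: a str search key works in text
mode, while in binary mode the key must be bytes encoded the same way as
the file.

```python
import os
import tempfile

key = 'sp\u00e4m'                                   # "spam" spelled with an a-umlaut

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'spam.txt')            # hypothetical file name
    with open(path, 'w', encoding='utf-8') as file:
        file.write(key + ' and eggs\n')             # explicit content encoding

    # Text mode: content is decoded on input, so search with a str
    with open(path, encoding='utf-8') as file:
        found_text = key in file.read()

    # Binary mode: content stays encoded, so the key must be bytes too
    with open(path, 'rb') as file:
        data = file.read()
    found_utf8 = key.encode('utf-8') in data        # same encoding as the file
    found_latin1 = key.encode('latin-1') in data    # different encoding: no match

print(found_text, found_utf8, found_latin1)
```

The Latin-1 key fails to match because the same character is stored as a
different byte sequence in the UTF-8 file, which is the mismatch the
text above warns about for binary-mode searches.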

In Chapter 6, we’re going to put the tools we met in this chapter to
realistic use. For example, we’ll apply file and directory tools to
implement file splitters, testing systems, directory copies and
compares, and a variety of utilities based on tree walking. We’ll find
that Python’s directory tools we met here have an enabling quality that
allows us to automate a large set of real-world tasks. First, though,
Chapter 5 concludes our basic tool survey by exploring another system
topic that tends to weave its way into a wide variety of application
domains: parallel processing in Python.
