Programming Python

Authors: Mark Lutz

A Quick Game of “Find the Biggest Python File”

Quick: what’s the biggest Python source file on your computer? This
was the query innocently posed by a student in one of my Python classes.
Because I didn’t know either, it became an official exercise in subsequent
classes, and it provides a good example of ways to apply Python system
tools for a realistic purpose in this book. Really, the query is a bit
vague, because its scope is unclear. Do we mean the largest Python file in
a directory, in a full directory tree, in the standard library, on the
module import search path, or on your entire hard drive? Different scopes
imply different solutions.

Scanning the Standard Library Directory

For instance, Example 6-1 is a first-cut solution that looks for the
biggest Python file in one directory. That is a limited scope, but
enough to get started.

Example 6-1. PP4E\System\Filetools\bigpy-dir.py

"""
Find the largest Python source file in a single directory.
Search Windows Python source lib, unless dir command-line arg.
"""
import os, glob, sys
dirname = r'C:\Python31\Lib' if len(sys.argv) == 1 else sys.argv[1]
allsizes = []
allpy = glob.glob(dirname + os.sep + '*.py')
for filename in allpy:
    filesize = os.path.getsize(filename)
    allsizes.append((filesize, filename))
allsizes.sort()
print(allsizes[:2])
print(allsizes[-2:])

This script uses the glob module to run through a directory’s files and
detects the largest by storing sizes and names on a list that is sorted
at the end. Because size appears first in the list’s tuples, it will
dominate the ascending value sort, and the largest percolates to the end
of the list. We could instead keep track of the currently largest as we
go, but the list scheme is more flexible. When run, this script scans
the Python standard library’s source directory on Windows, unless you
pass a different directory on the command line, and it prints both the
two smallest and the two largest files it finds:

C:\...\PP4E\System\Filetools> bigpy-dir.py
[(0, 'C:\\Python31\\Lib\\build_class.py'), (56, 'C:\\Python31\\Lib\\struct.py')]
[(147086, 'C:\\Python31\\Lib\\turtle.py'), (211238, 'C:\\Python31\\Lib\\decimal.py')]

C:\...\PP4E\System\Filetools> bigpy-dir.py .
[(21, '.\\__init__.py'), (461, '.\\bigpy-dir.py')]
[(1940, '.\\bigext-tree.py'), (2547, '.\\split.py')]

C:\...\PP4E\System\Filetools> bigpy-dir.py ..
[(21, '..\\__init__.py'), (29, '..\\testargv.py')]
[(541, '..\\testargv2.py'), (549, '..\\more.py')]
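
The alternative mentioned earlier, keeping track of the largest file as
we go instead of sorting a full list, might look something like the
following. This is only a sketch and not one of the book's examples: the
function name biggest_py is hypothetical, and returning None for a
directory with no .py files is an assumption made here for simplicity.

```python
import glob
import os

def biggest_py(dirname):
    """Return (size, filename) for the largest .py file in dirname, or None.

    Hypothetical helper: keeps only the current winner instead of a list.
    """
    biggest = None
    for filename in glob.glob(os.path.join(dirname, '*.py')):
        candidate = (os.path.getsize(filename), filename)
        # Tuples compare item by item, so size is compared first
        if biggest is None or candidate > biggest:
            biggest = candidate
    return biggest
```

Calling biggest_py(r'C:\Python31\Lib') would report only the single
winner. The trade-off is the one the text describes: this uses less
memory, but the sorted list can also report the smallest files or the
top few for free.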

Scanning the Standard Library Tree

The prior section’s solution works, but it’s obviously a partial answer:
Python files are usually located in more than one directory. Even within
the standard library, there are many subdirectories for module packages,
and they may be arbitrarily nested. We really need to traverse an entire
directory tree. Moreover, the first output above is difficult to read;
Python’s pprint (for “pretty print”) module can help here. Example 6-2
puts these extensions into code.

Example 6-2. PP4E\System\Filetools\bigpy-tree.py

"""
Find the largest Python source file in an entire directory tree.
Search the Python source lib, use pprint to display results nicely.
"""
import sys, os, pprint
trace = False
if sys.platform.startswith('win'):
    dirname = r'C:\Python31\Lib'          # Windows
else:
    dirname = '/usr/lib/python'           # Unix, Linux, Cygwin

allsizes = []
for (thisDir, subsHere, filesHere) in os.walk(dirname):
    if trace: print(thisDir)
    for filename in filesHere:
        if filename.endswith('.py'):
            if trace: print('...', filename)
            fullname = os.path.join(thisDir, filename)
            fullsize = os.path.getsize(fullname)
            allsizes.append((fullsize, fullname))

allsizes.sort()
pprint.pprint(allsizes[:2])
pprint.pprint(allsizes[-2:])

When run, this new version uses os.walk to search an entire tree of
directories for the largest Python source file. Change this script’s
trace variable if you want to track its progress through the tree. As
coded, it searches the Python standard library’s source tree, tailored
for Windows and Unix-like locations:

C:\...\PP4E\System\Filetools> bigpy-tree.py
[(0, 'C:\\Python31\\Lib\\build_class.py'),
 (0, 'C:\\Python31\\Lib\\email\\mime\\__init__.py')]
[(211238, 'C:\\Python31\\Lib\\decimal.py'),
 (380582, 'C:\\Python31\\Lib\\pydoc_data\\topics.py')]
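
The tree walk can also be packaged as a generator, so that the traversal
is reusable separately from the reporting. The following is a sketch
under that assumption; walk_sizes is a hypothetical name, and it makes
the same os.walk, os.path.join, and os.path.getsize calls as Example 6-2
while taking the root directory and extension as arguments.

```python
import os

def walk_sizes(root, ext='.py'):
    """Yield (size, path) for each file under root whose name ends with ext.

    Hypothetical helper: os.walk hands back one (dir, subdirs, files)
    tuple per directory in the tree, at any nesting depth.
    """
    for thisdir, subshere, fileshere in os.walk(root):
        for filename in fileshere:
            if filename.endswith(ext):
                fullname = os.path.join(thisdir, filename)
                yield os.path.getsize(fullname), fullname
```

With this in place, sorted(walk_sizes(dirname)) builds the same kind of
list that the script sorts and slices, and pprint.pprint can display
either end of it.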

Scanning the Module Search Path

Sure enough, the prior section’s script found smallest and largest files
in subdirectories. While searching Python’s entire standard library tree
this way is more inclusive, it’s still incomplete: there may be
additional modules installed elsewhere on your computer, which are
accessible from the module import search path but outside Python’s
source tree. To be more exhaustive, we could instead essentially perform
the same tree search, but for every directory on the module import
search path. Example 6-3 adds this extension to include every importable
Python-coded module on your computer, located both on the path directly
and nested in package directory trees.

Example 6-3. PP4E\System\Filetools\bigpy-path.py

"""
Find the largest Python source file on the module import search path.
Skip already-visited directories, normalize path and case so they will
match properly, and include line counts in pprinted result. It's not
enough to use os.environ['PYTHONPATH']: this is a subset of sys.path.
"""
import sys, os, pprint
trace = 0 # 1=dirs, 2=+files
visited = {}
allsizes = []
for srcdir in sys.path:
    for (thisDir, subsHere, filesHere) in os.walk(srcdir):
        if trace > 0: print(thisDir)
        thisDir = os.path.normpath(thisDir)
        fixcase = os.path.normcase(thisDir)
        if fixcase in visited:
            continue
        else:
            visited[fixcase] = True
        for filename in filesHere:
            if filename.endswith('.py'):
                if trace > 1: print('...', filename)
                pypath = os.path.join(thisDir, filename)
                try:
                    pysize = os.path.getsize(pypath)
                except os.error:
                    print('skipping', pypath, sys.exc_info()[0])
                else:
                    pylines = len(open(pypath, 'rb').readlines())
                    allsizes.append((pysize, pylines, pypath))

print('By size...')
allsizes.sort()
pprint.pprint(allsizes[:3])
pprint.pprint(allsizes[-3:])

print('By lines...')
allsizes.sort(key=lambda x: x[1])
pprint.pprint(allsizes[:3])
pprint.pprint(allsizes[-3:])

When run, this script marches down the module import path and, for
each valid directory it contains, attempts to search the entire tree
rooted there. In fact, it nests loops three deep—for items on the path,
directories in the item’s tree, and files in the directory. Because the
module path may contain directories named in arbitrary ways, along the
way this script must take care to:

  • Normalize directory paths—fixing up slashes and dots to map
    directories to a common form.

  • Normalize directory name case—converting to lowercase on
    case-insensitive Windows, so that same names match by string
    equality, but leaving case unchanged on Unix, where it
    matters.

  • Detect repeats to avoid visiting the same directory twice (the same
    directory might be reached from more than one entry on sys.path).

  • Skip any file-like item in the tree for which os.path.getsize fails
    (by default os.walk itself silently ignores things it cannot treat
    as directories, both at the top of and within the tree).

  • Avoid potential Unicode decoding errors in file content by opening
    files in binary mode in order to count their lines. Text mode
    requires decodable content, and some files in Python 3.1’s library
    tree cannot be decoded properly on Windows. Catching Unicode
    exceptions with a try statement would avoid program exits, too, but
    might skip candidate files.
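
To make the first three points concrete, here is a small sketch of how
normalization lets differently spelled names for one directory match by
string equality. It uses the standard ntpath and posixpath modules
directly (os.path is one or the other, depending on platform) so that
the results are the same wherever it runs. The helper name visit_key and
the sample paths are made up for illustration.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # Unix path rules, importable on any platform

def visit_key(path, pathmod):
    """Return the string used to detect repeat visits (hypothetical helper)."""
    return pathmod.normcase(pathmod.normpath(path))

visited = set()
samples = [r'C:\Python31\Lib', 'c:/python31/./lib', r'C:\Python31\Lib\email\..']
for entry in samples:
    key = visit_key(entry, ntpath)
    status = 'repeat' if key in visited else 'new'
    print(entry, '=>', key, status)
    visited.add(key)

# Under Unix rules case is significant, so these two names stay distinct
print(visit_key('/usr/lib/Python', posixpath) ==
      visit_key('/usr/lib/python', posixpath))        # False
```

All three Windows-style names reduce to the same key, so only the first
counts as new; the two Unix-style names differ only in case and are left
as separate directories.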

This version also adds line counts; this might add significant run time
to this script too, but it’s a useful metric to report. In fact, this
version uses this value as a sort key to report the three largest and
smallest files by line counts too, which may differ from results based
upon raw file size. Here’s the script in action in Python 3.1 on my
Windows 7 machine; since these results depend on platform, installed
extensions, and path settings, your sys.path and largest and smallest
files may vary:

C:\...\PP4E\System\Filetools> bigpy-path.py
By size...
[(0, 0, 'C:\\Python31\\lib\\build_class.py'),
(0, 0, 'C:\\Python31\\lib\\email\\mime\\__init__.py'),
(0, 0, 'C:\\Python31\\lib\\email\\test\\__init__.py')]
[(161613, 3754, 'C:\\Python31\\lib\\tkinter\\__init__.py'),
(211238, 5768, 'C:\\Python31\\lib\\decimal.py'),
(380582, 78, 'C:\\Python31\\lib\\pydoc_data\\topics.py')]
By lines...
[(0, 0, 'C:\\Python31\\lib\\build_class.py'),
(0, 0, 'C:\\Python31\\lib\\email\\mime\\__init__.py'),
(0, 0, 'C:\\Python31\\lib\\email\\test\\__init__.py')]
[(147086, 4132, 'C:\\Python31\\lib\\turtle.py'),
(150069, 4268, 'C:\\Python31\\lib\\test\\test_descr.py'),
(211238, 5768, 'C:\\Python31\\lib\\decimal.py')]

Again, change this script’s trace variable if you want to track its
progress through the tree. As you can see, the results for largest files
differ when viewed by size and lines, a disparity which we’ll probably
have to hash out in our next requirements meeting.
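
The disparity comes down to which tuple item drives the sort. As a
small illustration, the sketch below sorts three of the (bytes, lines,
path) tuples copied from the run above, first by default tuple order and
then by line count; the variable names are chosen here for the example.

```python
# (bytes, lines, path) tuples copied from the sample run above
sample = [
    (161613, 3754, r'C:\Python31\lib\tkinter\__init__.py'),
    (211238, 5768, r'C:\Python31\lib\decimal.py'),
    (380582, 78,   r'C:\Python31\lib\pydoc_data\topics.py'),
]

by_size  = sorted(sample)                       # first item (bytes) dominates
by_lines = sorted(sample, key=lambda x: x[1])   # compare line counts only

print('largest by size: ', by_size[-1][2])      # ...\pydoc_data\topics.py
print('largest by lines:', by_lines[-1][2])     # ...\decimal.py
```

topics.py wins on bytes but holds only 78 lines, so it drops to the
bottom of the list when line count is the key.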

Scanning the Entire Machine

Finally, although searching trees rooted in the module import path
normally includes every Python source file you can import on your
computer, it’s still not complete. Technically, this approach checks
only modules; Python source files which are top-level scripts run
directly do not need to be included in the module path. Moreover, the
module search path may be manually changed by some scripts dynamically
at runtime (for example, by direct sys.path updates in scripts that run
on web servers) to include additional directories that Example 6-3
won’t catch.

Ultimately, finding the largest source file on your computer requires
searching your entire drive, a feat which our tree searcher in
Example 6-2 almost supports, if we generalize it to accept the root
directory name as an argument and add some of the bells and whistles of
the path searcher version (we really want to avoid visiting the same
directory twice if we’re scanning an entire machine, and we might as
well skip errors and check line-based sizes if we’re investing the
time). Example 6-4 implements such general tree scans, outfitted for the
heavier lifting required for scanning drives.

Example 6-4. PP4E\System\Filetools\bigext-tree.py

"""
Find the largest file of a given type in an arbitrary directory tree.
Avoid repeat paths, catch errors, add tracing and line count size.
Also uses sets, file iterators and generator to avoid loading entire
file, and attempts to work around undecodable dir/file name prints.
"""
import os, pprint
from sys import argv, exc_info
trace = 1 # 0=off, 1=dirs, 2=+files
dirname, extname = os.curdir, '.py' # default is .py files in cwd
if len(argv) > 1: dirname = argv[1] # ex: C:\, C:\Python31\Lib
if len(argv) > 2: extname = argv[2] # ex: .pyw, .txt
if len(argv) > 3: trace = int(argv[3]) # ex: ". .py 2"

def tryprint(arg):
    try:
        print(arg)                          # unprintable filename?
    except UnicodeEncodeError:
        print(arg.encode())                 # try raw byte string

visited = set()
allsizes = []
for (thisDir, subsHere, filesHere) in os.walk(dirname):
    if trace: tryprint(thisDir)
    thisDir = os.path.normpath(thisDir)
    fixname = os.path.normcase(thisDir)
    if fixname in visited:
        if trace: tryprint('skipping ' + thisDir)
    else:
        visited.add(fixname)
        for filename in filesHere:
            if filename.endswith(extname):
                if trace > 1: tryprint('+++' + filename)
                fullname = os.path.join(thisDir, filename)
                try:
                    bytesize = os.path.getsize(fullname)
                    linesize = sum(+1 for line in open(fullname, 'rb'))
                except Exception:
                    print('error', exc_info()[0])
                else:
                    allsizes.append((bytesize, linesize, fullname))

for (title, key) in [('bytes', 0), ('lines', 1)]:
    print('\nBy %s...' % title)
    allsizes.sort(key=lambda x: x[key])
    pprint.pprint(allsizes[:3])
    pprint.pprint(allsizes[-3:])

Unlike the prior tree version, this one allows us to search in
specific directories, and for specific extensions. The default is to
simply search the current working directory for Python files:

C:\...\PP4E\System\Filetools> bigext-tree.py
.
By bytes...
[(21, 1, '.\\__init__.py'),
(461, 17, '.\\bigpy-dir.py'),
(818, 25, '.\\bigpy-tree.py')]
[(1696, 48, '.\\join.py'),
(1940, 49, '.\\bigext-tree.py'),
(2547, 57, '.\\split.py')]
By lines...
[(21, 1, '.\\__init__.py'),
(461, 17, '.\\bigpy-dir.py'),
(818, 25, '.\\bigpy-tree.py')]
[(1696, 48, '.\\join.py'),
(1940, 49, '.\\bigext-tree.py'),
(2547, 57, '.\\split.py')]
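
The line count in Example 6-4 relies on two ideas from its docstring:
file iterators read one line at a time, and a generator expression
counts them without building a list. The sketch below isolates that
step in a hypothetical count_lines function; the with statement is an
addition made here so the file is closed promptly, rather than
something the original script does.

```python
import os
import tempfile

def count_lines(path):
    """Count lines by iterating the file in binary mode, one line at a time."""
    with open(path, 'rb') as file:          # binary mode: no Unicode decoding
        return sum(1 for line in file)      # generator: file is never fully loaded

# Demo: bytes that are not valid UTF-8 still count cleanly in binary mode
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'first\n\xff\xfe\nlast')
print(count_lines(tmp.name))                # 3
os.remove(tmp.name)
```

The final line has no trailing newline but is still counted, which
matches what the readlines-based count in Example 6-3 would report.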

For more custom work, we can pass in a directory name, extension
type, and trace level on the command-line now (trace level 0 disables
tracing, and 1, the default, shows directories visited along the
way):

C:\...\PP4E\System\Filetools> bigext-tree.py .. .py 0
By bytes...
[(21, 1, '..\\__init__.py'),
(21, 1, '..\\Filetools\\__init__.py'),
(28, 1, '..\\Streams\\hello-out.py')]
[(2278, 67, '..\\Processes\\multi2.py'),
(2547, 57, '..\\Filetools\\split.py'),
(4361, 105, '..\\Tester\\tester.py')]
By lines...
[(21, 1, '..\\__init__.py'),
(21, 1, '..\\Filetools\\__init__.py'),
(28, 1, '..\\Streams\\hello-out.py')]
[(2547, 57, '..\\Filetools\\split.py'),
(2278, 67, '..\\Processes\\multi2.py'),
(4361, 105, '..\\Tester\\tester.py')]

This script also lets us scan for different file types; here it is
picking out the smallest and largest text file from one level up (at the
time I ran this script, at least):

C:\...\PP4E\System\Filetools> bigext-tree.py .. .txt 1
..
..\Environment
..\Filetools
..\Processes
..\Streams
..\Tester
..\Tester\Args
..\Tester\Errors
..\Tester\Inputs
..\Tester\Outputs
..\Tester\Scripts
..\Tester\xxold
..\Threads
By bytes...
[(4, 2, '..\\Streams\\input.txt'),
(13, 1, '..\\Streams\\hello-in.txt'),
(20, 4, '..\\Streams\\data.txt')]
[(104, 4, '..\\Streams\\output.txt'),
(172, 3, '..\\Tester\\xxold\\README.txt.txt'),
(435, 4, '..\\Filetools\\temp.txt')]
By lines...
[(13, 1, '..\\Streams\\hello-in.txt'),
(22, 1, '..\\spam.txt'),
(4, 2, '..\\Streams\\input.txt')]
[(20, 4, '..\\Streams\\data.txt'),
(104, 4, '..\\Streams\\output.txt'),
(435, 4, '..\\Filetools\\temp.txt')]
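
The script matches one extension per run. If several extensions were
wanted in a single scan, one possibility (not something Example 6-4
does) is to rely on the fact that str.endswith also accepts a tuple of
suffixes. The matches helper and the sample names below are
hypothetical.

```python
def matches(filename, extnames):
    """True if filename ends with any of the given extensions."""
    return filename.endswith(tuple(extnames))

names = ['spam.py', 'gui.pyw', 'notes.txt', 'README']
print([name for name in names if matches(name, ['.py', '.pyw'])])
# ['spam.py', 'gui.pyw']
```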

And now, to search your entire system, simply pass in your machine’s
root directory name (use / instead of C:\ on Unix-like machines), along
with an optional file extension type (.py is just the default now). The
winner is…(please, no wagering):

C:\...\PP4E\dev\Examples\PP4E\System\Filetools> bigext-tree.py C:\
C:\
C:\$Recycle.Bin
C:\$Recycle.Bin\S-1-5-21-3951091421-2436271001-910485044-1004
C:\cygwin
C:\cygwin\bin
C:\cygwin\cygdrive
C:\cygwin\dev
C:\cygwin\dev\mqueue
C:\cygwin\dev\shm
C:\cygwin\etc
...MANY more lines omitted...
By bytes...
[(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\build_class.py'),
(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\mime\\__init__.py'),
(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\test\\__init__.py')]
[(380582, 78, 'C:\\Python31\\Lib\\pydoc_data\\topics.py'),
(398157, 83, 'C:\\...\\Install\\Source\\Python-2.6\\Lib\\pydoc_topics.py'),
(412434, 83, 'C:\\Python26\\Lib\\pydoc_topics.py')]
By lines...
[(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\build_class.py'),
(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\mime\\__init__.py'),
(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\test\\__init__.py')]
[(204107, 5589, 'C:\\...\\Install\\Source\\Python-3.0\\Lib\\decimal.py'),
(205470, 5768, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\decimal.py'),
(211238, 5768, 'C:\\Python31\\Lib\\decimal.py')]

The script’s trace logic is preset to allow you to monitor its directory
progress. I’ve shortened some directory names to protect the innocent
here (and to fit on this page). This command may take a long time to
finish on your computer: on my sadly underpowered Windows 7 netbook, it
took 11 minutes to scan a solid state drive with some 59G of data, 200K
files, and 25K directories when the system was lightly loaded (8 minutes
when not tracing directory names, but half an hour when many other
applications were running). Nevertheless, it provides the most
exhaustive solution to the original query of all our attempts.

This is also as complete a solution as we have space for in this book.
For more fun, consider that you may need to scan more than one drive,
and some Python source files may also appear in zip archives, whether on
the module path or not (os.walk silently ignores zip files in
Example 6-3). They might also be named in other ways: with .pyw
extensions to suppress shell pop-ups on Windows, and with arbitrary
extensions for some top-level scripts. In fact, top-level scripts might
have no filename extension at all, even though they are Python source
files. And while they’re generally not Python files, some importable
modules may also appear in frozen binaries or be statically linked into
the Python executable. In the interest of space, we’ll leave such
higher-resolution (and potentially intractable!) search extensions as
suggested exercises.
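
As a starting point for the zip archive exercise, the standard zipfile
module can list an archive’s members and their uncompressed sizes
without extracting anything. The sketch below is one possible approach,
not part of the book’s examples; zip_py_sizes is a hypothetical name,
and joining the archive path and member name with a slash is just a
display convention assumed here.

```python
import zipfile

def zip_py_sizes(zippath, ext='.py'):
    """Yield (uncompressed size, 'archive/member') for matching members."""
    with zipfile.ZipFile(zippath) as archive:
        for info in archive.infolist():
            if info.filename.endswith(ext):
                # file_size is the member's size before compression
                yield info.file_size, zippath + '/' + info.filename
```

The tuples have the same (size, name) shape as those collected in
Example 6-2, so they could be appended to the same list before sorting.
Counting lines would additionally require reading each member’s bytes.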
