Programming Python

Authors: Mark Lutz

Reading lines with file iterators

In older versions of Python, the traditional way to read a file line
by line in a for loop was to read the file into a list that could be
stepped through as usual:

>>> file = open('data.txt')
>>> for line in file.readlines():        # DON'T DO THIS ANYMORE!
...     print(line, end='')

If you’ve already studied the core language using a first book like
Learning Python, you may already know that this coding pattern is
actually more work than is needed today, both for you and your
computer’s memory. In recent Pythons, the file object includes an
iterator which is smart enough to grab just one line per request in
all iteration contexts, including for loops and list comprehensions.
The practical benefit of this extension is that you no longer need to
call readlines in a for loop to scan line by line; the iterator reads
lines on request automatically:

>>> file = open('data.txt')
>>> for line in file:                    # no need to call readlines
...     print(line, end='')              # iterator reads next line each time
...
Hello file world!
Bye file world.

Better still, you can open the file in the loop statement
itself, as a temporary which will be automatically closed on garbage
collection when the loop ends (that’s normally the file’s sole
reference):

>>> for line in open('data.txt'):        # even shorter: temporary file object
...     print(line, end='')              # auto-closed when garbage collected
...
Hello file world!
Bye file world.
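Closing on garbage collection is prompt in CPython, but the timing is not guaranteed on every Python implementation. As a minimal sketch, assuming a scratch file created with tempfile to stand in for data.txt, the same line iterator can be combined with a with statement so the file is closed as soon as the block exits:

```python
import os
import tempfile

# Hypothetical scratch file standing in for the data.txt used above.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as out:
    out.write('Hello file world!\nBye file world.\n')

lines_seen = []
with open(path) as file:           # file is closed when the block exits
    for line in file:              # same line iterator as in the loops above
        lines_seen.append(line)

closed_after_block = file.closed   # closed without waiting for the collector
```

The trade-off is one extra line of code in exchange for a close that does not depend on when the temporary object is reclaimed.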

Moreover, this file line-iterator form does not load the entire
file into a list of lines all at once, so it is more space
efficient for large text files. Because of that, this is the
prescribed way to read line by line today. If you want to see what
really happens inside the for loop, you can use the iterator manually;
it’s just a __next__ method (run by the next built-in function), which
is similar to calling the readline method each time through, except
that read methods return an empty string at end-of-file (EOF) and the
iterator raises an exception to end the iteration:

>>> file = open('data.txt')              # read methods: empty at EOF
>>> file.readline()
'Hello file world!\n'
>>> file.readline()
'Bye file world.\n'
>>> file.readline()
''

>>> file = open('data.txt')              # iterators: exception at EOF
>>> file.__next__()                      # no need to call iter(file) first,
'Hello file world!\n'                    # since files are their own iterator
>>> file.__next__()
'Bye file world.\n'
>>> file.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
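To connect the manual calls above with what a for loop does, here is a rough sketch of the same protocol written out by hand. The scratch file and the variable names are assumptions made for illustration; the loop simply calls next until StopIteration is raised:

```python
import os
import tempfile

# Hypothetical scratch file standing in for data.txt.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as out:
    out.write('Hello file world!\nBye file world.\n')

collected = []
with open(path) as file:
    iterator = iter(file)              # a file returns itself here
    same_object = iterator is file
    while True:
        try:
            line = next(iterator)      # runs file.__next__()
        except StopIteration:          # raised at end-of-file
            break
        collected.append(line)
```

This is only an emulation for study; in real code the plain for loop is shorter and does the same work.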

Interestingly, iterators are automatically used in all iteration
contexts, including the list constructor call, list comprehension
expressions, map calls, and in membership checks:

>>> open('data.txt').readlines()                           # always read lines
['Hello file world!\n', 'Bye file world.\n']
>>> list(open('data.txt'))                                 # force line iteration
['Hello file world!\n', 'Bye file world.\n']
>>> lines = [line.rstrip() for line in open('data.txt')]   # comprehension
>>> lines
['Hello file world!', 'Bye file world.']
>>> lines = [line.upper() for line in open('data.txt')]    # arbitrary actions
>>> lines
['HELLO FILE WORLD!\n', 'BYE FILE WORLD.\n']
>>> list(map(str.split, open('data.txt')))                 # apply a function
[['Hello', 'file', 'world!'], ['Bye', 'file', 'world.']]
>>> line = 'Hello file world!\n'
>>> line in open('data.txt')                               # line membership
True

Iterators may seem somewhat implicit at first glance, but
they’re representative of the many ways that Python makes developers’
lives easier over time.

Other open options

Besides the w and (default) r file open modes, most platforms support
an a mode string, meaning “append.” In this output mode, write methods
add data to the end of the file, and the open call will not erase the
current contents of the file:

>>> file = open('data.txt', 'a')         # open in append mode: doesn't erase
>>> file.write('The Life of Brian')      # added at end of existing data
>>> file.close()
>>>
>>> open('data.txt').read()              # open and read entire file
'Hello file world!\nBye file world.\nThe Life of Brian'
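As a small self-contained check of the difference between append and write modes, the following sketch uses a hypothetical scratch file (created with tempfile, not part of the session above) and compares the file content after an a open with the content after a w open:

```python
import os
import tempfile

# Hypothetical scratch file; the session above uses data.txt instead.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

with open(path, 'w') as f:
    f.write('Hello file world!\n')

with open(path, 'a') as f:             # append: existing content is kept
    f.write('The Life of Brian')
with open(path) as f:
    after_append = f.read()

with open(path, 'w') as f:             # write: existing content is erased
    f.write('Spam')
with open(path) as f:
    after_write = f.read()
```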

In fact, although most files are opened using the sorts of calls
we just ran, open actually supports additional arguments for more
specific processing needs, the first three of which are the most
commonly used: the filename, the open mode, and a buffering
specification. All but the first of these are optional: if omitted,
the open mode argument defaults to r (input), and the buffering policy
is to enable full buffering. For special needs, here are a few things
you should know about these three open arguments:

Filename

As mentioned earlier, filenames can include an explicit
directory path to refer to files in arbitrary places on your
computer; if they do not, they are taken to be names relative to
the current working directory (described in the prior chapter).
In general, most filename forms you can type in your system
shell will work in an open call. For instance, a relative filename
argument r'..\temp\spam.txt' on Windows means spam.txt in the temp
subdirectory of the current working directory’s parent: up one, and
down to directory temp.
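To make the "up one, and down to temp" reading concrete, the sketch below resolves such a relative name without opening anything. The names temp and spam.txt are taken from the example in the text; using os.path here is an illustrative choice, not something open requires:

```python
import os

# Portable spelling of the relative name; on Windows this is '..\temp\spam.txt'.
relative = os.path.join(os.pardir, 'temp', 'spam.txt')

# Relative names are resolved against the current working directory;
# abspath applies the same resolution without touching the file system.
absolute = os.path.abspath(relative)

# Built independently: the parent of the working directory, then temp/spam.txt.
parent_of_cwd = os.path.dirname(os.getcwd())
expected = os.path.join(parent_of_cwd, 'temp', 'spam.txt')
```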

Open mode

The open function accepts other modes, too, some of which we’ll see
at work later in this chapter: r+, w+, and a+ to open for reads and
writes, and any mode string with a b to designate binary mode.
For instance, mode r+ means both reads and writes are allowed on an
existing file; w+ allows reads and writes but creates the file anew,
erasing any prior content; rb and wb read and write data in binary
mode without any translations; and wb+ and r+b both combine binary
mode and input plus output. In general, the mode string defaults to r
for read but can be w for write and a for append, and you may add a +
for update, as well as a b or t for binary or text mode; order is
largely irrelevant.

As we’ll see later in this chapter, the + modes are often used in
conjunction with the file object’s seek method to achieve random
read/write access. Regardless of mode, file contents are always
strings in Python programs: read methods return a string, and we pass
a string to write methods. As also described later, though, the mode
string implies which type of string is used: str for text mode or
bytes and other byte string types for binary mode.
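The update modes described above can be tried out in a few lines. This is a minimal sketch under the assumption of a throwaway scratch file; it shows that r+ keeps existing content, w+ starts from an empty file, and the b in a mode string switches the result type to bytes:

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')
with open(path, 'w') as f:
    f.write('spam and eggs')

with open(path, 'r+') as f:            # read and write, content retained
    first_word = f.read(4)
    f.seek(0)                          # back to the front before writing
    f.write('SPAM')
with open(path) as f:
    after_rplus = f.read()

with open(path, 'w+') as f:            # read and write, content erased
    before_write = f.read()
    f.write('ham')
    f.seek(0)
    after_wplus = f.read()

with open(path, 'r+b') as f:           # binary update mode returns bytes
    raw_one = f.read()
with open(path, 'rb+') as f:           # same mode, letters in another order
    raw_two = f.read()
```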

Buffering policy

The open call also takes an optional third buffering policy argument
which lets you control buffering for the file, that is, the way that
data is queued up before being transferred, to boost performance. If
passed, 0 means file operations are unbuffered (data is transferred
immediately, but this is allowed in binary modes only), 1 means they
are line buffered, and any other positive value means to use full
buffering (which is the default, if no buffering argument is
passed).
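The effect of the buffering argument can be observed by reading a file back while it is still open for output. The following is a sketch under assumptions: the scratch file is hypothetical, and the content seen before a flush depends on the buffer size in use, so the code records it without relying on it:

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'buffered.bin')

with open(path, 'wb', buffering=0) as unbuffered:   # 0: binary modes only
    unbuffered.write(b'spam')
    with open(path, 'rb') as reader:
        seen_unbuffered = reader.read()             # data already in the file

with open(path, 'wb') as buffered:                  # default: full buffering
    buffered.write(b'ham')
    with open(path, 'rb') as reader:
        seen_before_flush = reader.read()           # typically still empty
    buffered.flush()                                # push queued data out
    with open(path, 'rb') as reader:
        seen_after_flush = reader.read()

try:
    open(path, 'w', buffering=0)                    # text mode rejects 0
except ValueError as exc:
    text_mode_error = exc
else:
    text_mode_error = None
```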

As usual, Python’s library manual and reference texts have the
full story on additional open arguments beyond these three. For
instance, the open call supports additional arguments related to the
end-of-line mapping behavior and the automatic Unicode encoding of
content performed for text-mode files. Since we’ll discuss both of
these concepts in the next section, let’s move ahead.

Binary and Text Files

All of the preceding examples process simple text files, but Python
scripts can also open and process files containing binary data: JPEG
images, audio clips, packed binary data produced by FORTRAN and C
programs, encoded text, and anything else that can be stored in files
as bytes. The primary difference in terms of your code is the mode
argument passed to the built-in open function:

>>> file = open('data.txt', 'wb')        # open binary output file
>>> file = open('data.txt', 'rb')        # open binary input file

Once you’ve opened binary files in this way, you may read and
write their contents using the same methods just illustrated: read,
write, and so on. The readline and readlines methods as well as the
file’s line iterator still work here for text files opened in binary
mode, but they don’t make sense for truly binary data that isn’t line
oriented (end-of-line bytes are meaningless, if they appear at all).

In all cases, data transferred between files and your programs is
represented as Python strings within scripts, even if it is binary
data. For binary mode files, though, file content is represented as
byte strings. Continuing with our text file from preceding examples:

>>> open('data.txt').read()              # text mode: str
'Hello file world!\nBye file world.\nThe Life of Brian'
>>> open('data.txt', 'rb').read()        # binary mode: bytes
b'Hello file world!\r\nBye file world.\r\nThe Life of Brian'
>>> file = open('data.txt', 'rb')
>>> for line in file: print(line)
...
b'Hello file world!\r\n'
b'Bye file world.\r\n'
b'The Life of Brian'

This occurs because Python 3.X treats text-mode files as Unicode, and
automatically decodes content on input and encodes it on output.
Binary mode files instead give us access to file content as raw byte
strings, with no translation of content; they reflect exactly what is
stored on the file. Because str strings are always Unicode text in
3.X, the special bytes string is required to represent binary data as
a sequence of byte-size integers which may contain any 8-bit value.
Because normal and byte strings have almost identical operation sets,
many programs can largely take this on faith; but keep in mind that
you really must open truly binary data in binary mode for input,
because it will not generally be decodable as Unicode text.

Similarly, you must also supply byte strings for binary mode
output: normal strings are not raw binary data, but are decoded
Unicode characters (a.k.a. code points) which are encoded to binary on
text-mode output:

>>> open('data.bin', 'wb').write(b'Spam\n')
5
>>> open('data.bin', 'rb').read()
b'Spam\n'
>>> open('data.bin', 'wb').write('spam\n')
TypeError: must be bytes or buffer, not str

But notice that this file’s line ends with just \n, instead of the
Windows \r\n that showed up in the preceding example for the text file
in binary mode. Strictly speaking, binary mode disables Unicode
encoding translation, but it also prevents the automatic end-of-line
character translation performed by text-mode files by default. Before
we can understand this fully, though, we need to study the two main
ways in which text files differ from binary.
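A short sketch may help tie the two observations together: a binary-mode file rejects str, and once the text is encoded by hand, the bytes land in the file exactly as given. The scratch file and the choice of the ascii encoding are assumptions made for the example:

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
text = 'Spam\n'

with open(path, 'wb') as f:
    try:
        f.write(text)                      # str is not accepted in binary mode
    except TypeError:
        str_rejected = True
    else:
        str_rejected = False
    count = f.write(text.encode('ascii'))  # encode manually, then write bytes

with open(path, 'rb') as f:
    raw = f.read()                         # same bytes on every platform
```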

Unicode encodings for text files

As mentioned earlier, text-mode file objects always translate data
according to a default or provided Unicode encoding type when the data
is transferred to and from external files. Their content is encoded on
files, but decoded in memory. Binary mode files don’t perform any such
translation, which is what we want for truly binary data. For
instance, consider the following string, which embeds a Unicode
character whose binary value is outside the normal 7-bit range of the
ASCII encoding standard:

>>> data = 'sp\xe4m'
>>> data
'späm'
>>> 0xe4, bin(0xe4), chr(0xe4)
(228, '0b11100100', 'ä')

It’s possible to manually encode this string according to a
variety of Unicode encoding types; its raw binary byte string form is
different under some encodings:

>>> data.encode('latin1')                # 8-bit characters: ascii + extras
b'sp\xe4m'
>>> data.encode('utf8')                  # 2 bytes for special characters only
b'sp\xc3\xa4m'
>>> data.encode('ascii')                 # does not encode per ascii
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 2:
ordinal not in range(128)

Python displays printable characters in these strings normally,
but nonprintable bytes show as \xNN hexadecimal escapes which become
more prevalent under more sophisticated encoding schemes (cp500 in the
following is an EBCDIC encoding):

>>> data.encode('utf16')                 # 2 bytes per character plus preamble
b'\xff\xfes\x00p\x00\xe4\x00m\x00'
>>> data.encode('cp500')                 # an ebcdic encoding: very different
b'\xa2\x97C\x94'

The encoded results here reflect the string’s raw binary form
when stored in files. Manual encoding is usually unnecessary, though,
because text files handle encodings automatically on data transfers:
reads decode and writes encode, according to the encoding name passed
in (or a default for the underlying platform: see
sys.getdefaultencoding, and note that open itself takes its default
from locale.getpreferredencoding, which can differ). Continuing our
interactive session:

>>> open('data.txt', 'w', encoding='latin1').write(data)
4
>>> open('data.txt', 'r', encoding='latin1').read()
'späm'
>>> open('data.txt', 'rb').read()
b'sp\xe4m'

If we open in binary mode, though, no encoding translation
occurs; the last command in the preceding example shows us what’s
actually stored on the file. To see how file content differs for other
encodings, let’s save the same string again:

>>> open('data.txt', 'w', encoding='utf8').write(data)     # encode data per utf8
4
>>> open('data.txt', 'r', encoding='utf8').read()          # decode: undo encoding
'späm'
>>> open('data.txt', 'rb').read()                          # no data translations
b'sp\xc3\xa4m'

This time, raw file content is different, but text mode’s
auto-decoding makes the string the same by the time it’s read back by
our script. Really, encodings pertain only to strings while they are
in files; once they are loaded into memory, strings are simply
sequences of Unicode characters (“code points”). This translation step
is what we want for text files, but not for binary. Because binary
modes skip the translation, you’ll want to use them for truly binary
data. In fact, you usually must: trying to write unencodable data or
read undecodable data is an error:

>>> open('data.txt', 'w', encoding='ascii').write(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 2:
ordinal not in range(128)
>>> open(r'C:\Python31\python.exe', 'r').read()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2:
character maps to <undefined>
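A mismatch between the encoding used to write a file and the one used to read it does not always fail loudly. In this sketch, which assumes a hypothetical scratch file, the same UTF-8 bytes are read back under three encodings: the right one, one that decodes to the wrong characters without complaint, and one that raises an error:

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
data = 'sp\xe4m'

with open(path, 'w', encoding='utf8') as f:
    f.write(data)

with open(path, 'rb') as f:
    raw = f.read()                       # what is really stored in the file

with open(path, encoding='utf8') as f:
    round_trip = f.read()                # matching encoding: original text

with open(path, encoding='latin1') as f:
    mismatched = f.read()                # decodes, but to different characters

try:
    with open(path, encoding='ascii') as f:
        f.read()
except UnicodeDecodeError:
    ascii_failed = True                  # bytes above 127 are not ASCII
else:
    ascii_failed = False
```

The silent case is the more dangerous one in practice, since nothing signals that the text read is not the text written.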

Binary mode is also a last resort for reading text files, if
they cannot be decoded per the underlying platform’s default, and the
encoding type is unknown. The following recreates the original strings
if encoding type is known, but fails if it is not known unless binary
mode is used (such failure may occur either on inputting the data or
printing it, but it fails nevertheless):

>>> open('data.txt', 'w', encoding='cp500').writelines(['spam\n', 'ham\n'])
>>> open('data.txt', 'r', encoding='cp500').readlines()
['spam\n', 'ham\n']
>>> open('data.txt', 'r').readlines()
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2:
character maps to <undefined>
>>> open('data.txt', 'rb').readlines()
[b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%']
>>> open('data.txt', 'rb').read()
b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%'

If all your text is ASCII, you generally can ignore encoding altogether; data
in files maps directly to characters in strings, because ASCII is a
subset of most platforms’ default encodings. If you must process files
created with other encodings, and possibly on different platforms
(obtained from the Web, for instance), binary mode may be required if
encoding type is unknown. Keep in mind, however, that text in
still-encoded binary form might not work as you expect: because it is
encoded per a given encoding scheme, it might not accurately compare
or combine with text encoded in other schemes.
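The caution about comparing still-encoded text can be seen with the sample string used earlier. This is a small illustration using only str.encode and bytes.decode; the variable names are chosen for the example:

```python
text = 'sp\xe4m'

as_latin1 = text.encode('latin1')        # one byte for the special character
as_utf8 = text.encode('utf8')            # two bytes for the special character

# The same text compares unequal while it is still in encoded form.
raw_equal = as_latin1 == as_utf8

# Decoding each with its own encoding recovers comparable strings.
decoded_equal = as_latin1.decode('latin1') == as_utf8.decode('utf8')

# A byte-level search key built for one encoding misses the other.
key = '\xe4'.encode('latin1')
found_in_latin1 = key in as_latin1
found_in_utf8 = key in as_utf8
```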

Again, see other resources for more on the Unicode story. We’ll
revisit the Unicode story at various points in this book, especially
in Chapter 9, to see how it relates to the tkinter Text widget, and in
Part IV, covering Internet programming, to learn what it means for
data shipped over networks by protocols such as FTP, email, and the
Web at large. Text files have another feature, though, which is
similarly a nonfeature for binary data: line-end translations, the
topic of the next section.

End-of-line translations for text files

For historical reasons, the end of a line of text in a file is
represented by different characters on different platforms. It’s a
single \n character on Unix-like platforms, but the two-character
sequence \r\n on Windows. That’s why files moved between Linux and
Windows may look odd in your text editor after transfer: they may
still be stored using the original platform’s end-of-line convention.

For example, most Windows editors handle text in Unix format,
but Notepad has been a notable exception: text files copied from Unix
or Linux may look like one long line when viewed in Notepad, with
strange characters inside (\n). Similarly, transferring a file from
Windows to Unix in binary mode retains the \r characters (which often
appear as ^M in text editors).

Python scripts that process text files don’t normally have to
care, because the file object automatically maps the DOS \r\n sequence
to a single \n. It works like this by default when scripts are run on
Windows:

  • For files opened in text mode, \r\n is translated to \n when
    input.

  • For files opened in text mode, \n is translated to \r\n when
    output.

  • For files opened in binary mode, no translation occurs on
    input or output.

On Unix-like platforms, no translations occur on output, because \n
is used in files (on input, Python 3’s default text mode still maps
any \r\n or \r it encounters to \n). You should keep in mind two
important consequences of these rules. First, the end-of-line
character for text-mode files is almost always represented as a single
\n within Python scripts, regardless of how it is stored in external
files on the underlying platform. By mapping to and from \n on input
and output, Python hides the platform-specific difference.
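The mapping rules above describe Windows defaults, but they can be reproduced on any platform for experimentation. The sketch below assumes a hypothetical scratch file and uses the newline argument of open to request Windows-style line ends on output explicitly, rather than relying on the platform default:

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# newline='\r\n' asks for each \n written to be stored as \r\n,
# which is what text mode does by default on Windows.
with open(path, 'w', newline='\r\n') as f:
    count = f.write('shrubbery\n')       # counts characters passed in

with open(path, 'rb') as f:
    raw = f.read()                       # binary: the bytes actually stored

with open(path, 'r') as f:
    text = f.read()                      # text: \r\n mapped back to \n
```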

The second consequence of the mapping is subtler: when
processing binary files, binary open modes (e.g., rb, wb) effectively
turn off line-end translations. If they did not, the translations
listed previously could very well corrupt data as it is input or
output: a random \r in data might be dropped on input, or added for a
\n in the data on output. The net effect is that your binary data
would be trashed when read and written, which is probably not quite
what you want for your audio files and images!

This issue has become almost secondary in Python 3.X, because we
generally cannot use binary data with text-mode files anyhow: since
text-mode files automatically apply Unicode encodings to content,
transfers will generally fail when the data cannot be decoded on input
or encoded on output. Using binary mode avoids Unicode errors, and
automatically disables line-end translations as well (Unicode errors
can also be caught in try statements). Still, the fact that binary
mode prevents end-of-line translations to protect file content is best
noted as a separate feature, especially if you work in an ASCII-only
world where Unicode encoding issues are irrelevant.
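As a sketch of catching Unicode errors in a try statement, the hypothetical helper below (read_text_or_bytes is a name invented here, as are the scratch files) first attempts a text-mode read and falls back to binary mode when the content cannot be decoded:

```python
import os
import tempfile

def read_text_or_bytes(path, encoding='utf8'):
    """Return str if the file decodes per encoding, else its raw bytes."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, 'rb') as f:      # binary mode: no decoding attempted
            return f.read()

# Hypothetical scratch files: one decodable, one not.
folder = tempfile.mkdtemp()
text_path = os.path.join(folder, 'text.txt')
binary_path = os.path.join(folder, 'data.bin')

with open(text_path, 'w', encoding='utf8') as f:
    f.write('sp\xe4m')
with open(binary_path, 'wb') as f:
    f.write(b'\xff\xfe\x00\x01')         # not valid UTF-8

as_text = read_text_or_bytes(text_path)
as_bytes = read_text_or_bytes(binary_path)
```

Returning two different types is a simplification for the example; callers would need to check which one they received.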

Here’s the end-of-line translation at work in Python 3.1 on
Windows. Text mode translates to and from the platform-specific
line-end sequence so our scripts are portable:

>>> open('temp.txt', 'w').write('shrubbery\n')   # text output mode: \n -> \r\n
10
>>> open('temp.txt', 'rb').read()                # binary input: actual file bytes
b'shrubbery\r\n'
>>> open('temp.txt', 'r').read()                 # text input mode: \r\n -> \n
'shrubbery\n'

By contrast, writing data in binary mode prevents all
translations as expected, even if the data happens to contain bytes
that are part of line-ends in text mode (byte strings print their
characters as ASCII if printable, else as hexadecimal escapes):

>>> data = b'a\0b\rc\r\nd'               # 4 escape code bytes, 4 normal
>>> len(data)
8
>>> open('temp.bin', 'wb').write(data)   # write binary data to file as is
8
>>> open('temp.bin', 'rb').read()        # read as binary: no translation
b'a\x00b\rc\r\nd'

But reading binary data in text mode, whether accidental or not,
can corrupt the data when transferred because of line-end translations
(assuming it passes as decodable at all; ASCII bytes like these do on
this Windows platform):

>>> open('temp.bin', 'r').read()         # text mode read: botches \r !
'a\x00b\nc\nd'

Similarly, writing binary data in text mode can have the same
effect: line-end bytes may be changed or inserted (again, assuming the
data is encodable per the platform’s default):

>>> open('temp.bin', 'w').write(data)    # must pass str for text mode
TypeError: must be str, not bytes        # use bytes.decode() for to-str
>>> data.decode()
'a\x00b\rc\r\nd'
>>> open('temp.bin', 'w').write(data.decode())
8
>>> open('temp.bin', 'rb').read()        # text mode write: added \r !
b'a\x00b\rc\r\r\nd'
>>> open('temp.bin', 'r').read()         # again drops, alters \r on input
'a\x00b\nc\n\nd'

The short story to remember here is that you should generally
use \n to refer to end-line in all your text file content, and you
should always open binary data in binary file modes to suppress both
end-of-line translations and any Unicode encodings. A file’s content
generally determines its open mode, and file open modes usually
process file content exactly as we want.

Keep in mind, though, that you might also need to use binary
file modes for text in special contexts. For instance, in Chapter 6’s
examples, we’ll sometimes open text files in binary mode to avoid
possible Unicode decoding errors, for files generated on arbitrary
platforms that may have been encoded in arbitrary ways. Doing so
avoids encoding errors, but also can mean that some text might not
work as expected: searches might not always be accurate when applied
to such raw text, since the search key must be a bytes string,
formatted and encoded according to a specific and possibly
incompatible encoding scheme.

In Chapter 11’s PyEdit, we’ll also need to catch Unicode exceptions
in a “grep” directory file search utility, and we’ll go further to
allow Unicode encodings to be specified for file content across entire
trees. Moreover, a script that attempts to translate between different
platforms’ end-of-line character conventions explicitly may need to
read text in binary mode to retain the original line-end
representation truly present in the file; in text mode, line ends
would already be translated to \n by the time they reached the script.
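A line-end converter of the kind just mentioned might look roughly like the following. This is a sketch, not the book’s own script: the function names and the scratch file are invented for illustration, and only the \n and \r\n conventions are handled:

```python
import os
import tempfile

def to_unix_line_ends(raw):
    """Map \\r\\n to \\n in a bytes string read in binary mode."""
    return raw.replace(b'\r\n', b'\n')

def to_dos_line_ends(raw):
    """Map every line end to \\r\\n, without doubling existing \\r\\n."""
    return to_unix_line_ends(raw).replace(b'\n', b'\r\n')

# Hypothetical scratch file holding mixed line ends.
path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')
with open(path, 'wb') as f:
    f.write(b'spam\nham\r\neggs\n')

with open(path, 'rb') as f:              # binary: line ends arrive untouched
    original = f.read()

with open(path, 'wb') as f:              # binary: nothing is added on output
    f.write(to_dos_line_ends(original))

with open(path, 'rb') as f:
    converted = f.read()
```

Normalizing to \n first keeps the conversion idempotent, so running it twice on the same file gives the same bytes.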

It’s also possible to disable or further tailor end-of-line
translations in text mode with additional open arguments we will
finesse here. See the newline argument in open reference documentation
for details; in short, passing an empty string to this argument also
prevents line-end translation but retains other text-mode behavior.
For this chapter, let’s turn next to two common use cases for binary
data files: packed binary data and random access.
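Before moving on, here is a brief sketch of the newline argument mentioned above, assuming a hypothetical scratch file. Passing an empty string keeps text mode’s str results and decoding while leaving line ends exactly as stored:

```python
import os
import tempfile

# Hypothetical scratch file with one DOS and one Unix line end.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')
with open(path, 'wb') as f:
    f.write(b'spam\r\nham\n')

with open(path, 'r') as f:
    translated = f.read()                # default: \r\n mapped to \n

with open(path, 'r', newline='') as f:
    untranslated = f.read()              # still str, line ends preserved

with open(path, 'w', newline='') as f:
    f.write('eggs\n')                    # \n written as is on any platform
with open(path, 'rb') as f:
    raw = f.read()
```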
