Programming Python

Authors: Mark Lutz

Parsing packed binary data with the struct module

By using the letter b in the open call, you can open binary datafiles in
a platform-neutral way and read and write their content with normal
file object methods. But how do you process binary data once it has
been read? It will be returned to your script as a simple string of
bytes, most of which are probably not printable characters.

If you just need to pass binary data along to another file or
program, your work is done; for instance, simply pass the byte string
to another file opened in binary mode. And if you just need to extract
a number of bytes from a specific position, string slicing will do the
job; you can even follow up with bitwise operations if you need to. To
get at the contents of binary data in a structured way, though, as well
as to construct its contents, the standard library struct module is a
more powerful alternative.

The struct module provides calls to pack and unpack binary data, as
though the data were laid out in a C-language struct declaration. It
can also compose and decompose data using any endianness you desire
(endianness determines whether the most significant byte of a
multibyte number is stored first or last). Building a binary datafile,
for instance, is straightforward: pack Python values into a byte
string and write them to a file. The format string in the pack call
here means big-endian (>), with a 4-byte integer, a four-character
byte string, a 2-byte short integer, and a 4-byte floating-point
number:

>>> import struct
>>> data = struct.pack('>i4shf', 2, b'spam', 3, 1.234)
>>> data
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> file = open('data.bin', 'wb')
>>> file.write(data)
14
>>> file.close()
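
To make the byte-order point concrete, here is a minimal sketch that
packs the same integer both ways and checks the size of the record
layout used above. The variable names are ours, chosen only for
illustration; the calls themselves (struct.pack, struct.unpack, and
struct.calcsize) are standard.

```python
import struct

# The same value packed with two byte orders.
big = struct.pack('>i', 2)      # most significant byte first
little = struct.pack('<i', 2)   # least significant byte first

# Size in bytes of the record layout used in the text:
# 4 (int) + 4 (string) + 2 (short) + 4 (float), no padding with '>'.
record_size = struct.calcsize('>i4shf')

# Round trip: pack, then unpack with the same format string.
packed = struct.pack('>i4shf', 2, b'spam', 3, 1.234)
ident, name, count, value = struct.unpack('>i4shf', packed)
```

The float comes back only approximately equal to 1.234, because the f
code stores it in four bytes of single precision.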

Notice how the struct module returns a bytes string: we’re in the
realm of binary data here, not text, and must use binary mode files to
store it. As usual, Python displays most of the packed binary data’s
bytes here with \xNN hexadecimal escape sequences, because the bytes
are not printable characters. To parse data like that which we just
produced, read it off the file and pass it to the struct module with
the same format string; you get back a tuple containing the values
parsed out of the string and converted to Python objects:

>>> import struct
>>> file = open('data.bin', 'rb')
>>> bytes = file.read()
>>> values = struct.unpack('>i4shf', bytes)
>>> values
(2, b'spam', 3, 1.2339999675750732)

Parsed-out strings are byte strings again, and we can apply
string and bitwise operations to probe deeper:

>>> bin(values[0] | 0b1)                        # accessing bits and bytes
'0b11'
>>> values[1], list(values[1]), values[1][0]
(b'spam', [115, 112, 97, 109], 115)

Also note that slicing comes in handy in this domain; to grab
just the four-character string in the middle of the packed binary data
we just read, we can simply slice it out. Numeric values could
similarly be sliced out and then passed to struct.unpack for
conversion:

>>> bytes
b'\x00\x00\x00\x02spam\x00\x03?\x9d\xf3\xb6'
>>> bytes[4:8]
b'spam'
>>> number = bytes[8:10]
>>> number
b'\x00\x03'
>>> struct.unpack('>h', number)
(3,)
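
As an alternative to slicing first, the sketch below pulls the same two
fields out by byte offset. It assumes the record layout packed earlier;
struct.unpack_from and int.from_bytes are standard calls, but this is
just one way to do it, not the only one.

```python
import struct

packed = struct.pack('>i4shf', 2, b'spam', 3, 1.234)

# unpack_from reads at a byte offset, so no intermediate slice is needed.
(name,) = struct.unpack_from('4s', packed, 4)    # bytes 4 through 7
(count,) = struct.unpack_from('>h', packed, 8)   # bytes 8 and 9

# int.from_bytes is another option for a single integer field.
count_again = int.from_bytes(packed[8:10], byteorder='big', signed=True)
```

Both forms return the same values as the slice-then-unpack session;
unpack_from simply saves the temporary byte string.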

Packed binary data crops up in many contexts, including some
networking tasks, and in data produced by other programming languages.
Because it’s not part of every programming job’s description, though,
we’ll defer to the struct module’s entry in the Python library manual
for more details.

Random access files

Binary files also typically see action in random access processing.
Earlier, we mentioned that adding a + to the open mode string allows a
file to be both read and written. This mode is typically used in
conjunction with the file object’s seek method to support random
read/write access. Such flexible file processing modes allow us to
read bytes from one location, write to another, and so on. When
scripts combine this with binary file modes, they may fetch and update
arbitrary bytes within a file.

We used seek earlier to rewind files instead of closing and
reopening. As mentioned, read and write operations always take place
at the current position in the file; files normally start at offset 0
when opened and advance as data is transferred. The seek call lets us
move to a new position for the next transfer operation by passing in a
byte offset.

Python’s seek method also accepts an optional second argument that
has one of three values: 0 for absolute file positioning (the
default); 1 to seek relative to the current position; and 2 to seek
relative to the file’s end. That’s why passing just an offset of 0 to
seek is roughly a file rewind operation: it repositions the file to
its absolute start. In general, seek supports random access on a
byte-offset basis. Seeking to a multiple of a record’s size in a
binary file, for instance, allows us to fetch a record by its relative
position.
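
Here is a small sketch of the three positioning modes. To keep it
self-contained it uses an in-memory io.BytesIO object as a stand-in for
a binary mode file (an assumption made for illustration; a real file
opened with 'rb' accepts the same calls), and the names os.SEEK_SET,
os.SEEK_CUR, and os.SEEK_END are the standard spellings of 0, 1, and 2.

```python
import io
import os

RECLEN = 8  # record size assumed for this sketch
stream = io.BytesIO(b'ssssssssppppppppaaaaaaaammmmmmmm')

stream.seek(RECLEN * 1, os.SEEK_SET)    # mode 0: absolute offset 8
second = stream.read(RECLEN)            # position is now 16

stream.seek(RECLEN, os.SEEK_CUR)        # mode 1: skip one record, to 24
fourth = stream.read(RECLEN)            # position is now 32 (the end)

stream.seek(-RECLEN * 2, os.SEEK_END)   # mode 2: 16 bytes before the end
third = stream.read(RECLEN)
position = stream.tell()                # offset after the last read
```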

Although you can use seek without + modes in open (e.g., to just read
from random locations), it’s most flexible when combined with
input/output files. And while you can perform random access in text
mode, too, the fact that text modes perform Unicode encodings and
line-end translations makes them difficult to use when absolute byte
offsets and lengths are required for seeks and reads; your data may
look very different when stored in files. Text mode may also make your
data nonportable to platforms with different default encodings, unless
you’re willing to always specify an explicit encoding for opens.
Except for simple unencoded ASCII text without line-ends, seek tends
to work best with binary mode files.

To demonstrate, let’s create a file in w+b mode (equivalent to wb+)
and write some data to it; this mode allows us to both read and write,
but initializes the file to be empty if it’s already present (all w
modes do). After writing some data, we seek back to file start to read
its content (some integer return values are omitted in this example
again for brevity):

>>> records = [bytes([char] * 8) for char in b'spam']
>>> records
[b'ssssssss', b'pppppppp', b'aaaaaaaa', b'mmmmmmmm']
>>> file = open('random.bin', 'w+b')
>>> for rec in records:                         # write four records
...     size = file.write(rec)                  # bytes for binary mode
...
>>> file.flush()
>>> pos = file.seek(0)                          # read entire file
>>> print(file.read())
b'ssssssssppppppppaaaaaaaammmmmmmm'

Now, let’s reopen our file in r+b mode; this mode allows both reads
and writes again, but does not initialize the file to be empty. This
time, we seek and read in multiples of the size of data items
(“records”) stored, to both fetch and update them at random:

c:\temp> python
>>> file = open('random.bin', 'r+b')
>>> print(file.read())                          # read entire file
b'ssssssssppppppppaaaaaaaammmmmmmm'
>>> record = b'X' * 8
>>> file.seek(0)                                # update first record
>>> file.write(record)
>>> file.seek(len(record) * 2)                  # update third record
>>> file.write(b'Y' * 8)
>>> file.seek(8)
>>> file.read(len(record))                      # fetch second record
b'pppppppp'
>>> file.read(len(record))                      # fetch next (third) record
b'YYYYYYYY'
>>> file.seek(0)                                # read entire file
>>> file.read()
b'XXXXXXXXppppppppYYYYYYYYmmmmmmmm'

c:\temp> type random.bin                        # the view outside Python
XXXXXXXXppppppppYYYYYYYYmmmmmmmm
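
The same seek-by-record pattern can be wrapped in a pair of helper
functions. The sketch below is one possible packaging, not code from
this book: read_record and write_record are hypothetical names, the
record length of 8 matches the session above, and a temporary directory
stands in for c:\temp.

```python
import os
import tempfile

RECLEN = 8  # fixed record size, as in the interactive session

def read_record(file, index, reclen=RECLEN):
    """Fetch the record at a zero-based index from a binary mode file."""
    file.seek(index * reclen)
    return file.read(reclen)

def write_record(file, index, record, reclen=RECLEN):
    """Overwrite the record at a zero-based index in place."""
    if len(record) != reclen:
        raise ValueError('record must be exactly %d bytes' % reclen)
    file.seek(index * reclen)
    file.write(record)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'random.bin')
    with open(path, 'w+b') as file:
        for char in b'spam':                    # write four records
            file.write(bytes([char] * RECLEN))
        write_record(file, 0, b'X' * RECLEN)    # update first record
        write_record(file, 2, b'Y' * RECLEN)    # update third record
        second = read_record(file, 1)           # fetch second record
        file.seek(0)
        content = file.read()                   # read entire file
```

Checking the record length before writing keeps a stray update from
shifting every later record off its offset.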

Finally, keep in mind that seek can be used to achieve random access,
even if it’s just for input. The following seeks in multiples of
record size to read (but not write) fixed-length records at random.
Notice that it also uses r text mode: since this data is simple ASCII
text bytes and has no line-ends, text and binary modes work the same
on this platform:

c:\temp> python
>>> file = open('random.bin', 'r')              # text mode ok if no encoding/endlines
>>> reclen = 8
>>> file.seek(reclen * 3)                       # fetch record 4
>>> file.read(reclen)
'mmmmmmmm'
>>> file.seek(reclen * 1)                       # fetch record 2
>>> file.read(reclen)
'pppppppp'

>>> file = open('random.bin', 'rb')             # binary mode works the same here
>>> file.seek(reclen * 2)                       # fetch record 3
>>> file.read(reclen)                           # returns byte strings
b'YYYYYYYY'

But unless your file’s content is always a simple unencoded text
form like ASCII and has no translated line-ends, text mode should not
generally be used if you are going to seek: line-ends may be translated
on Windows and Unicode encodings may make arbitrary transformations,
both of which can make absolute seek offsets difficult to use. In the
following, for example, the positions of characters after the first
non-ASCII no longer match between the string in Python and its encoded
representation on the file:

>>> data = 'sp\xe4m'                            # data to your script
>>> data, len(data)                             # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8'))   # bytes written to file
(b'sp\xc3\xa4m', 5)

>>> f = open('test', mode='w+', encoding='utf8')    # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0); f.read(1)                        # ascii bytes work
's'
>>> f.seek(2); f.read(1)                        # as does 2-byte nonascii
'ä'
>>> data[3]                                     # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0: unexpected code byte
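
If you do need to find a character inside encoded text, one workaround
is to compute its byte offset yourself and seek in binary mode. The
sketch below assumes UTF-8 data; byte_offset is a hypothetical helper
written for this example, and io.BytesIO again stands in for a binary
mode file.

```python
import io

text = 'sp\xe4m'
encoded = text.encode('utf8')           # 5 bytes for 4 characters

def byte_offset(text, char_index, encoding='utf8'):
    """Byte position of text[char_index] once text is encoded."""
    return len(text[:char_index].encode(encoding))

offset = byte_offset(text, 3)           # where 'm' starts in the bytes

stream = io.BytesIO(encoded)            # stands in for a binary mode file
stream.seek(offset)
last_char = stream.read().decode('utf8')
```

This costs an encoding pass over the leading text, so it is better
suited to small strings than to large files.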

As you can see, Python’s file modes provide flexible file
processing for programs that require it. In fact, the os module offers
even more file processing options, as the next section describes.

Lower-Level File Tools in the os Module

The os module contains an additional set of file-processing functions
that are distinct from the built-in file object tools demonstrated in
previous examples. For instance, here is a partial list of os
file-related calls:

os.open(path, flags, mode)
    Opens a file and returns its descriptor

os.read(descriptor, N)
    Reads at most N bytes and returns a byte string

os.write(descriptor, string)
    Writes bytes in byte string string to the file

os.lseek(descriptor, position, how)
    Moves to position in the file

Technically, os calls process files by their descriptors, which are
integer codes or “handles” that identify files in the operating system.
Descriptor-based files deal in raw bytes, and have no notion of the
line-end or Unicode translations for text that we studied in the prior
section. In fact, apart from extras like buffering, descriptor-based
files generally correspond to binary mode file objects, and we similarly
read and write bytes strings, not str strings. However, because the
descriptor-based file tools in os are lower level and more complex than
the built-in file objects created with the built-in open function, you
should generally use the latter for all but very special file-processing
needs.[9]
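
As a quick, hedged illustration of the four calls listed above, the
following sketch creates a scratch file by descriptor, writes to it,
seeks back, and reads it in two steps. The filename is hypothetical,
and the getattr trick is our own way of coping with the fact that
os.O_BINARY is defined only on Windows.

```python
import os
import tempfile

# O_BINARY exists only on Windows; use 0 (no extra flag) elsewhere.
flags = os.O_RDWR | os.O_CREAT | getattr(os, 'O_BINARY', 0)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'spam.bin')     # hypothetical filename
    fd = os.open(path, flags, 0o600)            # returns an integer descriptor
    try:
        written = os.write(fd, b'Hello descriptor world\n')
        os.lseek(fd, 0, os.SEEK_SET)            # back to the start of the file
        first = os.read(fd, 5)                  # at most 5 bytes
        rest = os.read(fd, 100)                 # at most 100, fewer if at end
    finally:
        os.close(fd)                            # descriptors are closed explicitly
```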

Using os.open files

To give you the general flavor of this tool set, though, let’s run a
few interactive experiments. Although built-in file objects and os
module descriptor files are processed with distinct tool sets, they
are in fact related: the file object implementation simply adds a
layer of logic on top of descriptor-based files.

In fact, the fileno file object method returns the integer descriptor
associated with a built-in file object. For instance, the standard
stream file objects have descriptors 0, 1, and 2; calling the os.write
function to send data to stdout by descriptor has the same effect as
calling the sys.stdout.write method:

>>> import sys
>>> for stream in (sys.stdin, sys.stdout, sys.stderr):
...     print(stream.fileno())
...
0
1
2

>>> sys.stdout.write('Hello stdio world\n')     # write via file method
Hello stdio world
18
>>> import os
>>> os.write(1, b'Hello descriptor world\n')    # write via os module
Hello descriptor world
23

Because file objects we open explicitly behave the same way,
it’s also possible to process a given real external file on the
underlying computer through the built-in open function, tools in the
os module, or both (some integer return values are omitted here for
brevity):

>>> file = open(r'C:\temp\spam.txt', 'w')       # create external file, object
>>> file.write('Hello stdio file\n')            # write via file object method
>>> file.flush()                                # else os.write to disk first!
>>> fd = file.fileno()                          # get descriptor from object
>>> fd
3
>>> import os
>>> os.write(fd, b'Hello descriptor file\n')    # write via os module
>>> file.close()

C:\temp> type spam.txt                          # lines from both schemes
Hello stdio file
Hello descriptor file
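
The flush call in that session matters because the file object buffers
its output, while os.write goes straight to the descriptor. The sketch
below repeats the experiment in a portable form; the temporary
directory and the newline='\n' argument (which turns off line-end
translation so the result is the same on every platform) are our
additions.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'spam.txt')
    with open(path, 'w', newline='\n') as file:
        file.write('Hello stdio file\n')        # held in the object's buffer
        file.flush()                            # push it to the descriptor first
        os.write(file.fileno(), b'Hello descriptor file\n')
    with open(path, 'rb') as check:
        content = check.read()                  # bytes as stored on disk
```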

os.open mode flags

So why the extra file tools in os? In short, they give more low-level
control over file processing. The built-in open function is easy to
use, but it may be limited by the underlying filesystem that it uses,
and it may add extra behavior that we do not want. The os module lets
scripts be more specific; for example, the following opens a
descriptor-based file in read-write and binary modes by performing a
bitwise “or” on two mode flags exported by os (the O_BINARY flag is
defined on Windows only):

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> os.read(fdfile, 20)
b'Hello stdio file\r\nHe'
>>> os.lseek(fdfile, 0, 0)                      # go back to start of file
>>> os.read(fdfile, 100)                        # binary mode retains "\r\n"
b'Hello stdio file\r\nHello descriptor file\n'
>>> os.lseek(fdfile, 0, 0)
>>> os.write(fdfile, b'HELLO')                  # overwrite first 5 bytes
5

C:\temp> type spam.txt
HELLO stdio file
Hello descriptor file

In this case, binary mode strings rb+ and r+b in the basic open call
are equivalent:

>>> file = open(r'C:\temp\spam.txt', 'rb+')     # same but with open/objects
>>> file.read(20)
b'HELLO stdio file\r\nHe'
>>> file.seek(0)
>>> file.read(100)
b'HELLO stdio file\r\nHello descriptor file\n'
>>> file.seek(0)
>>> file.write(b'Jello')
5
>>> file.seek(0)
>>> file.read()
b'Jello stdio file\r\nHello descriptor file\n'

But on some systems, os.open flags let us specify more advanced
things like exclusive access (O_EXCL) and nonblocking modes
(O_NONBLOCK) when a file is opened. Some of these flags are not
portable across platforms (another reason to use built-in file objects
most of the time); see the library manual or run a dir(os) call on
your machine for an exhaustive list of other open flags available.

One final note here: using os.open with the O_EXCL flag is the most
portable way to lock files for concurrent updates or other process
synchronization in Python today (combined with O_CREAT, it makes the
open fail if the file already exists). We’ll see contexts where this
can matter in the next chapter, when we begin to explore
multiprocessing tools. Programs running in parallel on a server
machine, for instance, may need to lock files before performing
updates, if multiple threads or processes might attempt such updates
at the same time.
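
To show the general idea, here is a minimal lock-file sketch built on
O_CREAT and O_EXCL. The function names and the lock filename are
hypothetical, and the scheme is purely cooperative: it works only if
every program involved checks the same lock file, and a crashed program
can leave a stale lock behind.

```python
import os
import tempfile

def try_lock(lockpath):
    """Return True if we created the lock file, False if it already exists."""
    try:
        # O_CREAT | O_EXCL: create the file, failing if it is already there.
        fd = os.open(lockpath, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, 'w') as lockfile:
        lockfile.write(str(os.getpid()))        # record the owner for debugging
    return True

def unlock(lockpath):
    """Release the lock by deleting the lock file."""
    os.remove(lockpath)

with tempfile.TemporaryDirectory() as tmpdir:
    lockpath = os.path.join(tmpdir, 'update.lock')
    first = try_lock(lockpath)      # nobody holds the lock yet
    second = try_lock(lockpath)     # already held, so this attempt fails
    unlock(lockpath)
    third = try_lock(lockpath)      # free again after the release
    unlock(lockpath)
```

Because the existence test and the creation happen in a single
operating system call, two processes cannot both succeed, which is what
separates this from a check with os.path.exists followed by open.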

Wrapping descriptors in file objects

We saw earlier how to go from file object to file descriptor with the
fileno file object method; given a descriptor, we can use os module
tools for lower-level file access to the underlying file. We can also
go the other way: the os.fdopen call wraps a file descriptor in a file
object. Because conversions work both ways, we can generally use
either tool set, file object or os module:

>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = os.fdopen(fdfile, 'rb')
>>> objfile.read()
b'Jello stdio file\r\nHello descriptor file\n'

In fact, we can wrap a file descriptor in either a binary or
text-mode file object: in text mode, reads and writes perform the
Unicode encodings and line-end translations we studied earlier and
deal in str strings instead of bytes:

C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> objfile = os.fdopen(fdfile, 'r')
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'

In Python 3.X, the built-in open call also accepts a file descriptor
instead of a file name string; in this mode it works much like
os.fdopen, but gives you greater control. For example, you can use
additional arguments to specify a nondefault Unicode encoding for text
and suppress the default descriptor close. Really, though, os.fdopen
accepts the same extra-control arguments in 3.X, because it has been
redefined to do little but call back to the built-in open (see os.py
in the standard library):

C:\...\PP4E\System> python
>>> import os
>>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
>>> fdfile
3
>>> objfile = open(fdfile, 'r', encoding='latin1', closefd=False)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
>>> objfile = os.fdopen(fdfile, 'r', encoding='latin1', closefd=True)
>>> objfile.seek(0)
>>> objfile.read()
'Jello stdio file\nHello descriptor file\n'
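
The effect of closefd=False is easy to miss in an interactive session,
so here is a short sketch that makes it visible: the wrapping file
object is closed, yet the descriptor is still usable afterward. The
scratch file and its content are set up only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'spam.txt')
    with open(path, 'wb') as out:
        out.write(b'Jello stdio file\nHello descriptor file\n')

    # O_BINARY exists only on Windows; use 0 (no extra flag) elsewhere.
    fd = os.open(path, os.O_RDWR | getattr(os, 'O_BINARY', 0))

    # Wrap the descriptor, but tell open not to close it with the object.
    with open(fd, 'r', encoding='latin1', closefd=False) as objfile:
        text = objfile.read()

    # objfile is closed here, but the descriptor itself is still open.
    os.lseek(fd, 0, os.SEEK_SET)
    raw = os.read(fd, 5)
    os.close(fd)                    # now it is our job to close it
```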

We’ll make use of this file object wrapper technique to simplify
text-oriented pipes and other descriptor-like objects later in this
book (e.g., sockets have a makefile method which achieves similar
effects).
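
As a small preview of the pipe case, the sketch below wraps both ends
of an os.pipe in text-mode file objects. It is a simplified,
single-process illustration: the message is short enough to fit in the
pipe’s buffer, so writing before reading does not block, and the UTF-8
encoding is our choice for the example.

```python
import os

read_fd, write_fd = os.pipe()       # two raw descriptors

# Wrap the write end so we can send str text instead of bytes.
writer = os.fdopen(write_fd, 'w', encoding='utf8')
writer.write('sp\xe4m line\n')
writer.close()                      # flushes, then closes the descriptor

# Wrap the read end to get decoding and line-based reads for free.
reader = os.fdopen(read_fd, 'r', encoding='utf8')
line = reader.readline()
reader.close()
```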

Other os module file tools

The os module also includes an assortment of file tools that accept a
file pathname string and accomplish file-related tasks such as
renaming (os.rename), deleting (os.remove), and changing the file’s
owner and permission settings (os.chown, os.chmod). Let’s step through
a few examples of these tools in action:

>>> os.chmod('spam.txt', 0o777)                 # enable all accesses

This os.chmod file permissions call passes a 9-bit string composed of
three sets of three bits each. From left to right, the three sets
represent the file’s owning user, the file’s group, and all others.
Within each set, the three bits reflect read, write, and execute
access permissions. When a bit is “1” in this string, it means that
the corresponding operation is allowed for the accessor. For instance,
octal 0o777 is a string of nine “1” bits in binary, so it enables all
three kinds of accesses for all three user groups; octal 0o600 means
that the file can be read and written only by the user that owns it
(when written in binary, 0o600 octal is really bits 110 000 000).
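
If the bit arithmetic is hard to picture, the standard stat module can
spell it out. The following sketch only formats permission values; it
does not touch any file. The stat.S_IFREG constant is added so that
stat.filemode renders the value as a regular file, in the style of a
Unix ls -l listing.

```python
import stat

# Render permission bits the way "ls -l" would for a regular file.
all_access = stat.filemode(stat.S_IFREG | 0o777)
owner_only = stat.filemode(stat.S_IFREG | 0o600)

# The same 0o600 value as nine binary digits: user, group, others.
bits = format(0o600, '09b')

# Named constants can be combined instead of writing octal by hand.
composed = stat.S_IRUSR | stat.S_IWUSR      # owner read + owner write
```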

This scheme stems from Unix file permission settings, but the
call works on Windows as well, although Windows honors only the
read-only setting and ignores the other bits. If it’s puzzling, see
your system’s documentation (e.g., a Unix manpage) for chmod.
Moving on:

>>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')     # from, to
>>> os.remove(r'C:\temp\spam.txt')                          # delete file?
WindowsError: [Error 2] The system cannot find the file specified: 'C:\\temp\\...'
>>> os.remove(r'C:\temp\eggs.txt')

The os.rename call used here changes a file’s name; the os.remove
file deletion call deletes a file from your system and is synonymous
with os.unlink (the latter reflects the call’s name on Unix but was
obscure to users of other platforms).[10] The os module also exports
the stat system call:

>>> open('spam.txt', 'w').write('Hello stat world\n')       # +1 for \r added
17
>>> import os
>>> info = os.stat(r'C:\temp\spam.txt')
>>> info
nt.stat_result(st_mode=33206, st_ino=0, st_dev=0, st_nlink=0, st_uid=0, st_gid=0,
st_size=18, st_atime=1267645806, st_mtime=1267646072, st_ctime=1267645806)
>>> info.st_mode, info.st_size                  # via named-tuple item attr names
(33206, 18)
>>> import stat
>>> info[stat.ST_MODE], info[stat.ST_SIZE]      # via stat module presets
(33206, 18)
>>> stat.S_ISDIR(info.st_mode), stat.S_ISREG(info.st_mode)
(False, True)

The os.stat call returns a tuple of values (really, in 3.X a special
kind of tuple with named items) giving low-level information about the
named file, and the stat module exports constants and functions for
querying this information in a portable way. For instance, indexing an
os.stat result on offset stat.ST_SIZE returns the file’s size, and
calling stat.S_ISDIR with the mode item from an os.stat result checks
whether the file is a directory. As shown earlier, though, both of
these operations are available in the os.path module, too, so it’s
rarely necessary to use os.stat except for low-level file queries:

>>> path = r'C:\temp\spam.txt'
>>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
(False, True, 18)
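
To confirm that the two routes agree, the sketch below asks the same
three questions through os.stat and through os.path. It writes its
scratch file in binary mode, so the size is 17 bytes on every platform
rather than the 18 that Windows text mode produced above; the temporary
directory is again our stand-in for C:\temp.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'spam.txt')
    with open(path, 'wb') as out:
        out.write(b'Hello stat world\n')    # 17 bytes, no \r added

    info = os.stat(path)

    # Low-level route: pick fields out of the stat result.
    via_stat = (stat.S_ISDIR(info.st_mode),
                stat.S_ISREG(info.st_mode),
                info.st_size)

    # Higher-level route: the os.path conveniences.
    via_path = (os.path.isdir(path),
                os.path.isfile(path),
                os.path.getsize(path))
```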
