Programming Python

Authors: Mark Lutz

Using the re Module

The Python re module comes with functions that can search for patterns right
away or make compiled pattern objects for running matches later. Pattern
objects (and module search calls) in turn generate match objects, which
contain information about successful matches and matched substrings. For
reference, the next few sections describe the module’s interfaces and
some of the operators you can use to code patterns.

Module functions

The top level of the module provides functions for matching,
substitution, precompiling, and so on:

compile(pattern [, flags])

Compile a regular expression pattern string into a regular
expression pattern object, for later matching. See the reference
manual or Python Pocket Reference for the flags argument’s meaning.

match(pattern, string [, flags])

If zero or more characters at the start of string match the pattern
string, return a corresponding match object, or None if no match is
found. Roughly like a search for a pattern that begins with the ^
operator.

search(pattern, string [, flags])

Scan through string for a location matching pattern, and return a
corresponding match object, or None if no match is found.

findall(pattern, string [, flags])

Return a list of strings giving all nonoverlapping matches of
pattern in string. If the pattern contains one group, returns a list
of that group’s matches instead, and a list of tuples if the pattern
has more than one group.

finditer(pattern, string [, flags])

Return an iterator over all nonoverlapping matches of pattern in
string; the iterator yields match objects.

split(pattern, string [, maxsplit, flags])

Split string by occurrences of pattern. If capturing parentheses
(()) are used in the pattern, the text of all groups in the pattern
is also returned in the resulting list.

sub(pattern, repl, string [, count, flags])

Return the string obtained by replacing the (first count) leftmost
nonoverlapping occurrences of pattern (a string or a pattern object)
in string by repl (which may be a string with backslash escapes that
may back-reference a matched group, or a function that is passed a
single match object and returns the replacement string).

subn(pattern, repl, string [, count, flags])

Same as sub, but returns a tuple: (new-string,
number-of-substitutions-made).

escape(string)

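Return string with all nonalphanumeric characters backslashed, so
that the result can be compiled as a pattern that matches the
original string literally. (Recent Python versions backslash only
the characters that are special in patterns.)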
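
As a quick illustration of the module-level functions just listed,
here is a minimal sketch. The sample text and the variable names are
made up for this example and do not come from the book; the comments
describe what each call is expected to return.

```python
import re

# Hypothetical sample data, used only for illustration.
text = "spam=1, eggs=22, ham=333"

# match anchors at the start of the string; search scans forward.
at_start = re.match(r"eggs", text)      # None: text starts with "spam"
anywhere = re.search(r"eggs", text)     # a match object

# findall: no groups gives whole matches, two groups give tuples.
numbers = re.findall(r"\d+", text)
pairs = re.findall(r"(\w+)=(\d+)", text)

# finditer yields match objects, so positions are available too.
starts = [m.start() for m in re.finditer(r"\d+", text)]

# split: a capturing group keeps the separators in the result.
plain = re.split(r", ", text)
kept = re.split(r"(, )", text, maxsplit=1)

# sub with back-references in repl, and subn reporting the count.
swapped = re.sub(r"(\w+)=(\d+)", r"\2=\1", text, count=2)
zeroed, count = re.subn(r"\d+", "0", text)

# escape turns pattern-special characters into literals.
literal = re.compile(re.escape("a.b*c"))
```

Passing count or maxsplit by keyword, as above, keeps these calls
readable and avoids confusing them with the flags argument.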

Compiled pattern objects

At the next level, pattern objects provide similar attributes, but the
pattern string is implied. The re.compile function in the previous
section is useful to optimize patterns that may be matched more than
once (a precompiled pattern avoids repeating the compile step, or the
lookup of a cached compiled form, on every call). Pattern objects
returned by re.compile have these sorts of attributes:

match(string [, pos [, endpos]])
search(string [, pos [, endpos]])
findall(string [, pos [, endpos]])
finditer(string [, pos [, endpos]])
split(string [, maxsplit])
sub(repl, string [, count])
subn(repl, string [, count])

These are the same as the re module functions, but the pattern is
implied, and pos and endpos give start/end string indexes for the
match.
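
To make the pos and endpos arguments concrete, here is a small
sketch using a precompiled pattern. The pattern, the sample string,
and the variable names are assumptions chosen for illustration only.

```python
import re

# Compile once, then reuse the pattern object's methods.
word = re.compile(r"[a-z]+")
text = "123 spam eggs"      # hypothetical sample string

whole = word.search(text)               # first word anywhere: 'spam'
later = word.search(text, 8)            # start scanning at index 8
bounded = word.findall(text, 4, 11)     # only look at text[4:11]
no_match = word.match(text)             # None: text starts with digits
at_pos = word.match(text, 4)            # anchored at index 4 instead
```

Note that endpos cuts a match short: findall sees only "spam eg"
here, so the second word comes back truncated.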

Match objects

Finally, when a match or search function or method is successful, you
get back a match object (None comes back on failed matches). Match
objects export a set of attributes of their own, including:

group(g)
group(g1, g2, ...)

Returns the substring that matched a parenthesized group
(or groups) in the pattern. Accepts group numbers or names. Group
numbers start at 1; group 0 is the entire string matched by the
pattern. Returns a tuple when passed multiple group numbers, and
group number defaults to 0 if omitted.

groups()

Returns a tuple of all groups’ substrings of the match
(for group numbers 1 and higher).

groupdict()

Returns a dictionary containing all named groups of the
match (see (?P<name>R) syntax ahead).

start([group])
end([group])

Indices of the start and end of the substring matched by group (or
the entire matched string, if no group is passed).

span([group])

Returns the two-item tuple: (start(group), end(group)).

expand(template)

Performs backslash group substitutions on template; see the Python
library manual.
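
The match object attributes above can be tried out with a short
sketch like the following. The pattern, its group names (key and
value), and the sample string are hypothetical, chosen only to show
each call.

```python
import re

# Two named groups; the sample string is made up for illustration.
m = re.search(r"(?P<key>\w+)=(?P<value>\d+)", "set size=42 now")

entire = m.group()               # group 0: the whole match
both = m.group(1, 2)             # several groups at once: a tuple
by_name = m.group("value")       # groups can be fetched by name
all_groups = m.groups()          # groups 1 and higher
named = m.groupdict()            # named groups as a dictionary
bounds = (m.start(), m.end())    # indexes of the whole match
value_span = m.span("value")     # (start, end) of one group
summary = m.expand(r"\g<value> is \1")   # template substitution
```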

Regular expression patterns

Regular expression strings are built up by concatenating
single-character regular expression forms, shown in Table 19-1. The
longest-matching string is usually matched by each form, except for
the nongreedy operators. In the table, R means any regular expression
form, C is a character, and N denotes a digit.

Table 19-1. re pattern syntax

Operator                  Interpretation
------------------------  ---------------------------------------------
.                         Matches any character (including newline if
                          DOTALL flag is specified or (?s) at pattern
                          front)
^                         Matches start of the string (of every line
                          in MULTILINE mode)
$                         Matches end of the string (of every line in
                          MULTILINE mode)
C                         Any nonspecial (or backslash-escaped)
                          character matches itself
R*                        Zero or more of preceding regular expression
                          R (as many as possible)
R+                        One or more of preceding regular expression
                          R (as many as possible)
R?                        Zero or one occurrence of preceding regular
                          expression R (optional)
R{m}                      Matches exactly m copies of preceding R:
                          a{5} matches 'aaaaa'
R{m,n}                    Matches from m to n repetitions of preceding
                          regular expression R
R*?, R+?, R??, R{m,n}?    Same as *, +, ?, and {m,n}, but match as few
                          characters/times as possible; these are
                          known as nongreedy match operators
[...]                     Defines character set: e.g., [a-zA-Z] to
                          match all letters (alternatives, with - for
                          ranges)
[^...]                    Defines complemented character set: matches
                          if char is not in set
\                         Escapes special chars (e.g., *?+|()) and
                          introduces special sequences in Table 19-2
\\                        Matches a literal \ (write as '\\\\' in a
                          normal string pattern, or use r'\\')
\N                        Matches the contents of the group of the
                          same number N: r'(.+) \1' matches “42 42”
R|R                       Alternative: matches left or right R
RR                        Concatenation: match both Rs
(R)                       Matches any regular expression inside (),
                          and delimits a group (retains matched
                          substring)
(?:R)                     Same as (R) but simply delimits part R and
                          does not denote a saved group
(?=R)                     Look-ahead assertion: matches if R matches
                          next, but doesn’t consume any of the string
                          (e.g., X(?=Y) matches X only if followed by
                          Y)
(?!R)                     Matches if R doesn’t match next; negative of
                          (?=R)
(?P<name>R)               Matches any regular expression inside (),
                          and delimits a named group
(?P=name)                 Matches whatever text was matched by the
                          earlier group named name
(?#...)                   A comment; ignored
(?letter)                 Set mode flag; letter is one of aiLmsux (see
                          the library manual)
(?<=R)                    Look-behind assertion: matches if the
                          current position in the string is preceded
                          by a match of R that ends at the current
                          position
(?<!R)                    Matches if the current position in the
                          string is not preceded by a match for R;
                          negative of (?<=R)
(?(id/name)yespattern|nopattern)
                          Will try to match with yespattern if the
                          group with given id or name exists (that is,
                          it matched), else with optional nopattern
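
A few of the operators in Table 19-1 are easier to tell apart when
run side by side. The following is a minimal sketch; all sample
strings and variable names are invented for illustration.

```python
import re

markup = "<b>bold</b><i>italic</i>"     # hypothetical sample text

# Greedy .* runs to the last '>'; nongreedy .*? stops at the first.
greedy = re.findall(r"<.*>", markup)
lazy = re.findall(r"<.*?>", markup)

# {m,n} is greedy too: it takes up to n copies when available.
run = re.search(r"a{2,3}", "caaaat").group()

# Look-ahead and look-behind test the context without consuming it.
py_names = re.findall(r"\w+(?=\.py)", "a.py b.txt cd.py")
priced = re.findall(r"(?<=\$)\d+", "cost $15, was 20")
unpriced = re.findall(r"(?<!\$)\b\d+", "cost $15, was 20")

# A named group plus a back-reference to it: the closing quote must
# be the same character as the opening one.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()

# Conditional: require ')' only if group 1 matched '('.
parens = re.compile(r"(\()?\d+(?(1)\))")
balanced = [s for s in ["42", "(42)", "(42", "42)"]
            if parens.fullmatch(s)]
```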

Within patterns, ranges and selections can be combined. For instance,
[a-zA-Z0-9_]+ matches the longest possible string of one or more
letters, digits, or underscores. Special characters can be escaped as
usual in Python strings: [\t ]* matches zero or more tabs and spaces
(i.e., it skips such whitespace).
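
The two patterns just mentioned can be checked with a short sketch.
The sample strings and names below are assumptions made for this
example.

```python
import re

# One or more letters, digits, or underscores (longest possible run).
ident = re.compile(r"[a-zA-Z0-9_]+")
names = ident.findall("x1 = spam_2 + 10")
first = ident.match("spam_2+10").group()

# Zero or more tabs and spaces: end() tells how much was skipped.
skip = re.compile("[\t ]*")
offset = skip.match("\t  key = value").end()
nothing = skip.match("key").end()       # zero matches still succeed

# The raw-string spelling behaves the same, because the pattern
# parser also understands \t.
same = re.compile(r"[\t ]*").match("\t  key = value").end()
```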

The parenthesized grouping construct, (R), lets you extract matched
substrings after a successful match. The portion of the string matched
by the expression in parentheses is retained in a numbered register.
It’s available through the group method of a match object after a
successful match.

In addition to the entries in this table, special sequences in
Table 19-2 can be used in patterns, too. Because of Python string
rules, you sometimes must double up on backslashes (\\) or use Python
raw strings (r'...') to retain backslashes in the pattern verbatim.
In normal strings, Python keeps a backslash as is if the letter
following it is not recognized as an escape code, though recent
Python versions issue a warning for such unrecognized escapes, so raw
strings are the safer choice. Some of the entries in Table 19-2 are
affected by Unicode when matching str instead of bytes, and an ASCII
flag may be set to emulate the behavior for bytes; see Python’s
manuals for more details.
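
The backslash and Unicode points above are easy to get wrong, so
here is a small sketch of both. The sample strings are made up for
illustration; non-ASCII letters are written as \u escapes to keep
the source itself ASCII-only.

```python
import re

# A doubled backslash and a raw string spell the same pattern.
same_pattern = ("\\d+" == r"\d+")

# In a normal string, \b is a backspace character, not a word
# boundary; the raw string keeps the backslash for the re parser.
plain_len = len("\bcat\b")      # backspace, c, a, t, backspace
raw_len = len(r"\bcat\b")       # backslash and 'b' kept separately
missed = re.search("\bcat\b", "a cat")
found = re.search(r"\bcat\b", "a cat").group()

# Matching a literal backslash: four in a normal string, two raw.
path = "C:\\temp\\x"
slashes_raw = re.findall(r"\\", path)
slashes_plain = re.findall("\\\\", path)

# \w follows Unicode rules for str, unless the ASCII flag is set;
# bytes patterns use the ASCII rules.
sample = "na\u00efve caf\u00e9"
words_unicode = re.findall(r"\w+", sample)
words_ascii = re.findall(r"\w+", sample, re.ASCII)
words_bytes = re.findall(rb"\w+", "na\u00efve".encode("utf-8"))
```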

Table 19-2. re special sequences

Sequence    Interpretation
----------  ---------------------------------------------------------
\number     Matches text of group number (numbered from 1)
\A          Matches only at the start of the string
\b          Empty string at word boundaries
\B          Empty string not at word boundaries
\d          Any decimal digit character ([0-9] for ASCII)
\D          Any character that is not a decimal digit ([^0-9] for
            ASCII)
\s          Any whitespace character ([ \t\n\r\f\v] for ASCII)
\S          Any nonwhitespace character ([^ \t\n\r\f\v] for ASCII)
\w          Any alphanumeric character ([a-zA-Z0-9_] for ASCII)
\W          Any nonalphanumeric character ([^a-zA-Z0-9_] for ASCII)
\Z          Matches only at the end of the string
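
To show the special sequences of Table 19-2 in use, here is a brief
sketch. The sample strings and variable names are hypothetical; the
last four lines contrast \A and \Z with ^ and $ in MULTILINE mode.

```python
import re

lines = "id 42\nid 7"       # hypothetical two-line sample

digits = re.findall(r"\d+", lines)
fields = re.split(r"\s+", lines)
only_digits = re.sub(r"\D", "", "tel: 555-1234")
cleaned = re.sub(r"\W", "_", "a-b c")

# \b requires a word boundary; \B requires that there is none.
whole_words = re.findall(r"\bid\b", "id idle grid id")
inner = re.findall(r"\Bid", "id idle grid id")   # only inside "grid"

# ^ and $ match at every line in MULTILINE mode; \A and \Z still
# match only at the very start and end of the string.
line_starts = re.findall(r"^id", lines, re.MULTILINE)
string_start = re.findall(r"\Aid", lines, re.MULTILINE)
line_ends = re.findall(r"\d+$", lines, re.MULTILINE)
string_end = re.findall(r"\d+\Z", lines, re.MULTILINE)
```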

Most of the standard escapes supported by Python string literals are
also accepted by the regular expression parser: \a, \b, \f, \n, \r,
\t, \v, \x, and \\. The Python library manual gives these escapes’
interpretation and additional details on pattern syntax in general.
But to further demonstrate how the re pattern syntax is typically
used in scripts, let’s go back to writing some code.
