There are some metacharacters that we haven't covered yet. Most of
them will be covered in this section.
Some of the remaining metacharacters to be discussed are
zero-width assertions. They don't cause the engine to advance
through the string; instead, they consume no characters
and simply succeed or fail. For example, \b is an
assertion that the current position is located at a word boundary; the
position isn't changed by the \b at all. This means that
zero-width assertions should never be repeated, because if they match
once at a given location, they can obviously be matched an infinite
number of times.
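A minimal sketch of what "zero-width" means in practice, assuming the standard re module. The variable names are only illustrative. The match returned for \b is an empty string, and its start and end positions are the same.

```python
import re

# \b succeeds at position 0 of 'hello world' without consuming anything.
m = re.search(r'\b', 'hello world')
span = m.span()
matched_text = m.group()
print(span)
print(repr(matched_text))

# Every word-boundary position in the string; each match is empty.
positions = [b.start() for b in re.finditer(r'\b', 'hello world')]
print(positions)
```

The boundaries fall before "h", after "o", before "w", and after "d"; none of them uses up a character.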
- |
- Alternation, or the "or" operator.
If A and B are regular expressions,
A|B will match any string that matches either "A" or "B".
| has very low precedence, in order to make it work reasonably when
you're alternating multi-character strings.
Crow|Servo will match either "Crow" or "Servo", not
"Cro", a "w" or an "S", and "ervo".
To match a literal "|",
use \|, or enclose it inside a character class, as in [|].
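The precedence rule can be checked with a short sketch like the following. The test strings and variable names are made up for illustration; the grouped pattern is included only to show what the other reading of the RE would look like.

```python
import re

p = re.compile('Crow|Servo')

# Each alternative is the whole multi-character string.
whole_crow = p.match('Crow')
whole_servo = p.match('Servo')

# 'CroServo' would match if | only applied to 'w' and 'S'; it doesn't.
mixed = p.match('CroServo')

# Parentheses are needed to get that narrower alternation.
grouped = re.match('Cro(w|S)ervo', 'CroServo')

# Two ways to match a literal '|'.
pipe_escaped = re.search(r'a\|b', 'a|b')
pipe_class = re.search('a[|]b', 'a|b')

print(whole_crow.group())
print(whole_servo.group())
print(mixed)
print(grouped.group())
```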
- ^
- Matches at the beginning of lines. Unless the
MULTILINE flag has been set, this will only match at the
beginning of the string. In MULTILINE mode, this also
matches immediately after each newline within the string.
For example, if you wish to match the word "From" only at the
beginning of a line, the RE to use is ^From.
>>> print re.match('^From', 'From Here to Eternity')
<re.MatchObject instance at 80c1520>
>>> print re.match('^From', 'Reciting From Memory')
None
To match a literal "^",
use \^, or enclose it inside a character class, as in [\^].
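The effect of the MULTILINE flag on ^ can be seen in a sketch such as this one. The sample text is invented for the illustration; re.finditer is used only to collect the positions where the RE matches.

```python
import re

text = 'From here\nFrom there\nNot From'

# Without MULTILINE, ^ matches only at the start of the string.
default_starts = [m.start() for m in re.finditer('^From', text)]

# With MULTILINE, ^ also matches right after each newline.
multiline_starts = [m.start()
                    for m in re.finditer('^From', text, re.MULTILINE)]

# Two ways to match a literal '^'.
caret_escaped = re.search(r'\^', 'a^b').start()
caret_class = re.search(r'[\^]', 'a^b').start()

print(default_starts)
print(multiline_starts)
```

The "From" on the third line is never matched in either mode, because it isn't at the beginning of a line.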
- $
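- Matches at the end of a line, which is defined as
either the end of the string, or any location followed by a newline
character. Unless the MULTILINE flag has been set, the only newline
that counts is one at the very end of the string.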
>>> print re.search('}$', '{block}')
<re.MatchObject instance at 80adfa8>
>>> print re.search('}$', '{block} ')
None
>>> print re.search('}$', '{block}\n')
<re.MatchObject instance at 80adfa8>
To match a literal "$",
use \$, or enclose it inside a character class, as in [$].
- \A
- Matches only at the start of the string. When not
in MULTILINE mode, \A and ^ are effectively
the same. In MULTILINE mode, however, they're different;
\A still matches only at the beginning of the string, but
^ may match at several locations inside the string (anywhere
following a newline character).
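As a rough illustration of the difference, the following sketch compares the two in MULTILINE mode. The sample text and variable names are assumptions made for the example.

```python
import re

text = 'From here\nFrom there'

# In MULTILINE mode, ^ matches at the start of every line...
caret_starts = [m.start()
                for m in re.finditer('^From', text, re.MULTILINE)]

# ...while \A still matches only at the start of the string.
a_starts = [m.start()
            for m in re.finditer(r'\AFrom', text, re.MULTILINE)]

print(caret_starts)
print(a_starts)
```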
- \Z
- Matches only at the end of the string.
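A small sketch contrasting \Z with $, reusing the "{block}" string from the $ examples. The variable names are only illustrative. $ tolerates a trailing newline, while \Z does not.

```python
import re

# $ matches just before the newline that ends the string.
dollar = re.search('}$', '{block}\n')

# \Z insists on the very end of the string, so the newline gets in the way.
z_with_newline = re.search(r'}\Z', '{block}\n')

# Without the trailing newline, \Z matches.
z_exact = re.search(r'}\Z', '{block}')

print(dollar is not None)
print(z_with_newline)
print(z_exact is not None)
```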
- \b
- Word boundary.
This is a zero-width assertion that matches only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.
The following example matches "class" only when it's a complete
word; it won't match when it's contained inside another word.
>>> p = re.compile(r'\bclass\b')
>>> print p.search('no class at all')
<re.MatchObject instance at 80c8f28>
>>> print p.search('the declassified algorithm')
None
>>> print p.search('one subclass is')
None
There are two subtleties you should remember when using this special
sequence. First, this is the worst collision between Python's string
literals and regular expression sequences. In Python's string
literals, "\b" is the backspace character, ASCII value 8. If
you're not using raw strings, then Python will convert the "\b" to
a backspace, and your RE won't match as you expect it to. The
following example looks the same as our previous RE, but omits
the "r" in front of the RE string.
>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')
<re.MatchObject instance at 80c3ee0>
Second, inside a character class, where there's no use for this
assertion, \b represents the backspace character, for
compatibility with Python's string literals.
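Both subtleties can be made visible with a sketch like the one below. The names cooked and raw are just labels chosen for this example. It compares what Python actually stores for each string literal, then uses [\b] to find a backspace character.

```python
import re

cooked = '\bclass\b'    # Python turns each \b into a backspace
raw = r'\bclass\b'      # raw string: backslash and 'b' are kept

print(len(cooked))      # 5 letters plus 2 backspace characters
print(len(raw))         # 5 letters plus 2 two-character sequences
print(cooked[0] == '\x08')

# Inside a character class, \b means the backspace character.
backspace_pos = re.search(r'[\b]', 'a\x08b').start()
print(backspace_pos)
```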
- \B
- Another zero-width assertion, this is the
opposite of \b, only matching when the current
position is not at a word boundary.
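For comparison with the \b example, here is a sketch that runs the same three strings against \B. The variable names are invented for the illustration. The RE now matches "class" only when it is buried inside a longer word on both sides.

```python
import re

p = re.compile(r'\Bclass\B')

# A standalone word has boundaries on both sides, so \B fails.
standalone = p.search('no class at all')

# Inside 'declassified' there are word characters on both sides.
inside = p.search('the declassified algorithm')

# 'subclass' ends at a word boundary, so the trailing \B fails.
at_end = p.search('one subclass is')

print(standalone)
print(inside.span())
print(at_end)
```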