2.1 Matching Characters

Most letters and characters will simply match themselves. For example, the regular expression test will match the string "test" exactly. (You can enable a case-insensitive mode that would let this RE match "Test" or "TEST" as well; more about this later.)
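As a minimal sketch of the point above, using Python's standard re module (the variable names here are illustrative, not part of the HOWTO), a literal pattern matches only the same sequence of characters unless case-insensitive mode is requested:

```python
import re

pattern = re.compile("test")

# Ordinary characters match themselves, so "test" is found in "test".
exact = pattern.search("test") is not None

# Matching is case-sensitive by default, so "TEST" is not matched.
wrong_case = pattern.search("TEST") is not None

# The re.IGNORECASE flag turns on case-insensitive matching.
ignore_case = re.search("test", "TEST", re.IGNORECASE) is not None

print(exact, wrong_case, ignore_case)
```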

There are exceptions to this rule: some characters are special metacharacters, and don't match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Much of this document is devoted to discussing the various metacharacters and what they do.

Here's a complete list of the metacharacters; their meanings will be discussed in the rest of this HOWTO.

. ^ $ * + ? { [ \ | ( )

The first metacharacter we'll look at is "["; it opens a character class, which is a set of characters that you wish to match, and the class is closed by a matching "]". Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a "-". For example, [abc] will match any of the characters "a", "b", or "c"; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].
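The following sketch illustrates listed characters and ranges. It uses re.findall, which returns every non-overlapping match as a list; the sample strings are made up for the example.

```python
import re

# Listing characters individually and giving a range select the same set.
abc_list = re.findall(r"[abc]", "a big cab")
abc_range = re.findall(r"[a-c]", "a big cab")

# [a-z] matches lowercase letters only; "H", "W", digits and spaces are skipped.
lowercase = re.findall(r"[a-z]", "Hello World 42")

print(abc_list)    # ['a', 'b', 'c', 'a', 'b']
print(abc_range)   # ['a', 'b', 'c', 'a', 'b']
print(lowercase)   # ['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
```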

Most metacharacters are not active inside classes. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it's stripped of its special nature.
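To see the difference, the sketch below compares "$" inside a class with "$" on its own. The sample string is invented for illustration.

```python
import re

text = "mark: $5"

# Inside the class, "$" is just another character to match.
in_class = re.findall(r"[akm$]", text)

# Outside a class, "$" is a metacharacter: it matches at the end of the
# string and consumes no characters, so the match is empty.
bare = re.search(r"$", text)

print(in_class)                    # ['m', 'a', 'k', '$']
print(bare.start(), bare.group())  # 8 and an empty string
```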

You can match the characters not listed within a class by complementing the set. This is indicated by including a "^" as the first character of the class; a "^" anywhere else in the class will simply match the "^" character. For example, [^5] will match any character except "5".
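A short sketch of complementing, with made-up input strings: a leading "^" negates the class, while a "^" in any other position is an ordinary member of the set.

```python
import re

# "^" first: match every character that is NOT "5".
not_five = re.findall(r"[^5]", "1525")

# "^" not first: the class contains the two literal characters "5" and "^".
five_or_caret = re.findall(r"[5^]", "5^2")

print(not_five)       # ['1', '2']
print(five_or_caret)  # ['5', '^']
```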

Perhaps the most important metacharacter is the backslash, "\". As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It's also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a "[" or "\", you can precede them with a backslash to remove their special meaning: \[ or \\.
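The escaping rule can be sketched as follows. Raw string literals (the r"..." prefix) are used so that Python itself leaves the backslashes alone; re.escape is a standard-library helper that adds the backslashes for you, shown here as one possible convenience rather than something this section requires.

```python
import re

# \[ matches a literal "[" instead of opening a character class.
brackets = re.findall(r"\[", "a[0]")

# \\ matches a single literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\x")

# re.escape() backslash-escapes the metacharacters in an arbitrary string.
literal = re.search(re.escape("[x]"), "a[x]b").group()

print(brackets)          # ['[']
print(len(backslashes))  # 2
print(literal)           # [x]
```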

Some of these special sequences represent predefined sets of characters that are often useful, such as the set of digits, or the set of letters, or the set of anything that isn't whitespace. The following predefined special sequences are available:

\d: Matches any decimal digit; this is equivalent to the class [0-9].
\D: Matches any non-digit character; this is equivalent to the class [^0-9].
\s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
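The sequences listed above can be tried out with a small sketch like this one. The sample text is ASCII-only and invented for the example; note that on Python 3 string patterns these sequences also match non-ASCII digits, letters, and whitespace unless the re.ASCII flag is given, so the class equivalences hold exactly only in ASCII mode.

```python
import re

text = "Room 42, floor 3\n"

digits = re.findall(r"\d", text)         # decimal digits
non_digits = re.findall(r"\D", "a1b2")   # everything except digits
spaces = re.findall(r"\s", text)         # whitespace, including the newline
words = re.findall(r"\w+", text)         # runs of alphanumerics and "_"
non_words = re.findall(r"\W", text)      # everything \w does not match

print(digits)      # ['4', '2', '3']
print(non_digits)  # ['a', 'b']
print(spaces)      # [' ', ' ', ' ', '\n']
print(words)       # ['Room', '42', 'floor', '3']
print(non_words)   # [' ', ',', ' ', ' ', '\n']
```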

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or "," or ".".
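For instance, a quick sketch with an invented input string shows the class picking out whitespace, commas, and periods while skipping the letters:

```python
import re

# [\s,.] matches one whitespace character, one "," or one ".".
separators = re.findall(r"[\s,.]", "a, b.c")

print(separators)  # [',', ' ', '.']
```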

The final metacharacter in this section is ".". It matches anything except a newline character, and there's an alternate mode (re.DOTALL) where it will match even a newline. "." is often used where you want to match "any character".
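The effect of re.DOTALL can be sketched on a two-line string (the input is made up for the example):

```python
import re

text = "a\nb"

# By default "." skips the newline between "a" and "b".
default_mode = re.findall(r".", text)

# With re.DOTALL, "." matches the newline as well.
dotall_mode = re.findall(r".", text, re.DOTALL)

print(default_mode)  # ['a', 'b']
print(dotall_mode)   # ['a', '\n', 'b']
```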