3.5 Compilation Flags

Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE, and a short, one-letter form such as I. (If you're familiar with Perl's pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
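As a minimal sketch of combining flags (written for Python 3; the pattern and sample text are made up for illustration), the following compiles one RE with both I and M set:

```python
import re

# Combine IGNORECASE and MULTILINE with bitwise OR.
pattern = re.compile(r"^spam$", re.I | re.M)

text = "eggs\nSPAM\nham"   # hypothetical sample text
match = pattern.search(text)
print(match.group())       # SPAM

# With only one of the two flags, the same search fails.
only_i = re.compile(r"^spam$", re.I).search(text)
only_m = re.compile(r"^spam$", re.M).search(text)
print(only_i, only_m)      # None None

# The long and short names refer to the same flag.
same_flag = (re.I == re.IGNORECASE) and (re.X == re.VERBOSE)
```

Both flags are needed here: M lets ^ and $ match around the second line, and I lets "spam" match "SPAM".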

Here's a table of the available flags, followed by a more detailed explanation of each one.

Flag             Meaning
DOTALL, S        Make . match any character, including newlines
IGNORECASE, I    Do case-insensitive matches
LOCALE, L        Do a locale-aware match
MULTILINE, M     Multi-line matching, affecting ^ and $
VERBOSE, X       Enable verbose REs, which can be organized more cleanly and understandably

I
IGNORECASE
Perform case-insensitive matching; character classes and literal strings will match letters by ignoring case. For example, [A-Z] will match lowercase letters, too, and Spam will match "Spam", "spam", or "spAM".[1] This doesn't take the current locale into account.
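A short sketch of the behaviour described above, written for Python 3 (the variable names are only illustrative):

```python
import re

# A literal string matches regardless of case.
p = re.compile(r"spam", re.IGNORECASE)
results = [bool(p.match(s)) for s in ("Spam", "spam", "spAM", "ham")]
print(results)   # [True, True, True, False]

# A character class written with uppercase letters also matches lowercase.
letters = re.compile(r"[A-Z]+", re.I)
matched = letters.match("abcDEF").group()
print(matched)   # abcDEF
```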

L
LOCALE
Make \w, \W, \b, and \B dependent on the current locale.

Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French text, you'd want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z0-9_]; it won't match "é" or "ç". If your system is configured properly, and a French locale is selected, certain C functions will tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you'd expect.

M
MULTILINE
Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches both at the end of the string and at the end of each line (immediately preceding each newline).
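The difference can be seen with a small sketch (Python 3; the three-line sample string is an assumption made for illustration):

```python
import re

text = "first line\nsecond line\nthird line\n"

# Without MULTILINE, ^ matches only at the start of the string.
starts_default = re.findall(r"^\w+", text)
print(starts_default)      # ['first']

# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)
print(starts_multiline)    # ['first', 'second', 'third']

# Without MULTILINE, $ matches only at the very end
# (here, just before the trailing newline).
ends_default = re.findall(r"\w+$", text)
print(ends_default)        # ['line']

# With MULTILINE, $ also matches before every newline.
ends_multiline = re.findall(r"\w+$", text, re.M)
print(ends_multiline)      # ['line', 'line', 'line']
```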

S
DOTALL
Makes the "." special character match any character at all, including a newline; without this flag, "." will match anything except a newline.
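For instance, a greedy .* stops at the first newline unless DOTALL is set. A minimal sketch in Python 3, using a made-up two-line string:

```python
import re

text = "line one\nline two"

# By default "." refuses to cross the newline.
without_flag = re.match(r".*", text).group()
print(repr(without_flag))   # 'line one'

# With DOTALL, "." matches the newline as well.
with_flag = re.match(r".*", text, re.DOTALL).group()
print(repr(with_flag))      # 'line one\nline two'
```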

X
VERBOSE
This flag allows you to write regular expressions that are more readable, by giving you more flexibility in how you can format them. When this flag has been specified, whitespace within the RE string is ignored, except when in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly. It also enables you to put comments within a RE; comments are marked by a "#" that's neither in a character class nor preceded by an unescaped backslash. Comments are simply ignored.

For example, here's a RE that uses re.VERBOSE; see how much easier it is to read?

import re

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference ('#' must sit in a class here)
 (?P<char>
   [0-9]+[^0-9]      # Decimal form
   | 0[0-7]+[^0-7]   # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)

Without the verbose setting, the RE would look like this:

charref = re.compile("&#(?P<char>[0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")

In the above example, Python's automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it's still more difficult to understand than the version using re.VERBOSE.
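One consequence of the comment rule is that a literal "#" in a verbose RE has to be escaped or placed in a character class. The sketch below (Python 3) uses a simplified, hypothetical entity pattern with a trailing semicolon, not the charref RE itself, to check that a verbose RE and its compact equivalent agree, and to show what happens when the "#" is left unprotected:

```python
import re

# Simplified illustration: decimal entity references such as "&#233;".
verbose = re.compile(r"""
    &[#]             # literal '#', protected by a character class
    (?P<num>[0-9]+)  # one or more decimal digits
    ;                # closing semicolon
""", re.VERBOSE)

compact = re.compile(r"&#(?P<num>[0-9]+);")

# Both forms accept and reject the same sample inputs.
samples = ("&#233;", "&#x41;", "& #233;")
agree = all((verbose.match(s) is None) == (compact.match(s) is None)
            for s in samples)

num = verbose.match("&#233;").group("num")
print(agree, num)   # True 233

# Pitfall: an unprotected '#' starts a comment, so this RE is just "&".
careless = re.compile(r"&#233;", re.VERBOSE)
matches_bare_ampersand = careless.match("&") is not None
print(matches_bare_ampersand)   # True
```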



Footnotes

[1] "Spam", "spam", "spam", "spam", "spam", ...