3. Pattern Differences

The regex module uses Emacs-style regular expressions, while the re module uses Perl-style expressions. The largest difference between the two is that in Emacs-style syntax, the grouping and alternation operators are written with a backslash in front of them. For example, groups are delimited by "\(" and "\)" and alternatives are separated by "\|"; a bare "(", ")", or "|" matches the literal character. This clutters moderately complicated expressions with lots of backslashes. Unfortunately, Python's string literals also use the backslash as an escape character, so it's frequently necessary to add backslashes in front of backslashes; the string literal "\\\\" is required to match a single "\", for example.
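
The backslash doubling is a property of Python string literals, so it affects re patterns too. The following is a minimal sketch using the re module; the variable names are made up for illustration, and the raw-string spelling r"\\" is shown only as an alternative way to write the same two-character pattern.

```python
import re

# In an ordinary string literal, "\\\\" denotes two backslash characters,
# which the regular expression engine reads as one escaped backslash.
ordinary = "\\\\"

# A raw string literal spells the same two-character pattern as r"\\".
raw = r"\\"

subject = "a\\b"  # the three characters a, \, b

same_pattern = (ordinary == raw)
found = re.search(ordinary, subject)
matched_text = found.group(0)  # the single backslash in the subject
```

Either spelling hands the same pattern to the engine; the raw string just removes one layer of escaping.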

In Perl-style expressions, things are just the opposite; "\(" and "\)" match the literal characters "(" and ")", while "(" and ")" in a pattern denote grouping. This makes patterns neater, since you'll rarely need to match literal "(" and ")" characters but will often use grouping.

regex pattern:

\(\w+\|[0-9]+\)

re pattern:

(\w+|[0-9]+)
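
As a rough illustration of the re behaviour, the sketch below compiles the pattern above and a second, hypothetical pattern that escapes the parentheses to match them literally. The sample strings and variable names are assumptions chosen for the example.

```python
import re

# Unescaped parentheses form a group; "|" separates the alternatives.
grouped = re.compile(r"(\w+|[0-9]+)")
m = grouped.match("spam 42")
first_group = m.group(1)

# Escaped parentheses match the literal characters "(" and ")".
# The inner, unescaped pair still forms a group.
literal = re.compile(r"\((\w+)\)")
m2 = literal.search("call (spam) now")
whole = m2.group(0)
inner = m2.group(1)
```

Here first_group holds the word matched by the group, whole includes the literal parentheses, and inner holds only the text between them.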

The Perl syntax also has more character classes that allow simplifying some expressions. The regex module only supports "\w" to match alphanumeric characters, and "\W" to match non-alphanumeric characters. The re module adds "\d" and "\D" for digits and non-digits, and "\s" and "\S" for whitespace and non-whitespace characters.

regex pattern:

[0-9]+[ \t\n]+

re pattern:

\d+\s+
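
Both spellings are accepted by the re module, so they can be compared directly. This sketch (sample strings are assumptions) checks that they agree on typical input. It also shows that "\s" is slightly broader than "[ \t\n]", since it matches other whitespace such as a carriage return.

```python
import re

old_style = re.compile(r"[0-9]+[ \t\n]+")
new_style = re.compile(r"\d+\s+")

# Digits followed by a space, a tab, and a newline.
sample = "123 \t\nabc"
old_match = old_style.match(sample).group(0)
new_match = new_style.match(sample).group(0)

# A carriage return is whitespace for \s but is not in [ \t\n].
cr_sample = "7\rx"
old_cr = old_style.match(cr_sample)
new_cr = new_style.match(cr_sample)
```

When translating an old pattern, it is worth checking whether the wider meaning of "\s" is acceptable for the data being matched.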

Regular expressions can get very complicated and difficult to understand. To make expressions clearer, the re.VERBOSE flag can be specified. This flag causes whitespace outside of a character class to be ignored, and a "#" symbol outside of a character class is treated as the start of a comment, extending to the end of the line. This means a pattern can be put inside a triple-quoted string, and then formatted for clarity.

re code:

import re

# A raw string keeps the backslashes in \w, \s, and \d intact.
pat = re.compile(r"""
(?P<command> # A command contains ...
  \w+)       # ... a word ...
  \s+        # ... followed by whitespace ...
(?P<var>     # ... and a variable name
  (?!\d)     # Lookahead assertion: can't start with a digit
  \w+        # Match a word
)""", re.VERBOSE)
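
To show how such a pattern might be used, the sketch below repeats it in a self-contained form and reads the named groups back from a match. The input strings "set width" and "set 9lives" are assumptions made up for the example.

```python
import re

pat = re.compile(r"""
(?P<command>   # A command contains ...
  \w+)         # ... a word ...
  \s+          # ... followed by whitespace ...
(?P<var>       # ... and a variable name
  (?!\d)       # Lookahead assertion: can't start with a digit
  \w+          # Match a word
)""", re.VERBOSE)

m = pat.match("set width")
command = m.group("command")
var = m.group("var")

# The lookahead rejects a name that begins with a digit,
# so the whole match fails and match() returns None.
rejected = pat.match("set 9lives")
```

The whitespace and comments inside the triple-quoted string are ignored by the engine, so the compiled pattern behaves like the compact form (?P<command>\w+)\s+(?P<var>(?!\d)\w+).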

If the re.VERBOSE flag seems a bit easy to overlook, off at the end of the statement, you can put (?x) at the start of the pattern; this has the same effect as specifying re.VERBOSE, but makes the flag setting part of the pattern. There are similar extensions to specify re.DOTALL, re.IGNORECASE, re.LOCALE, and re.MULTILINE: (?s), (?i), (?L), and (?m).
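
A minimal sketch of the two spellings side by side follows. The "kb" pattern and the sample strings are hypothetical, chosen only to exercise the flags. The inline flags are placed at the very start of the pattern string.

```python
import re

# Flags passed as an argument to re.compile().
flag_arg = re.compile(r"""
    \d+   # one or more digits
    \s*   # optional whitespace
    kb    # a unit suffix
""", re.VERBOSE | re.IGNORECASE)

# The same flags written into the pattern itself.
inline = re.compile(r"""(?xi)
    \d+   # one or more digits
    \s*   # optional whitespace
    kb    # a unit suffix
""")

arg_result = flag_arg.match("64 KB").group(0)
inline_result = inline.match("64 KB").group(0)

# (?s) makes "." match a newline as well, like re.DOTALL.
dotall_inline = re.match(r"(?s)a.b", "a\nb")
no_dotall = re.match(r"a.b", "a\nb")
```

Both compiled patterns accept the same input; the inline form simply keeps the flag next to the text it affects.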

A module to automatically translate expressions from the old regex syntax to the new syntax has been written, and is available as "reconvert.py" in Python 1.5b2. (I haven't really looked at it yet.)