The top-level functions in the re module are quite similar to those in the regex module; there are a few new functions and some new optional arguments, but these can mostly be ignored when converting regex code. re is now the only module required to access all the available functionality. regsub has been swallowed by re, and is available as the sub(), subn(), and split() functions. There's no equivalent of regex_syntax; re supports only one syntax, and you can't change it. If you want alternative regex syntaxes, you'll have to manually parse the syntax and convert it to the basic Perl-like syntax; sorry!
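As a rough illustration of the three functions that take over from regsub, here is a minimal sketch; the sample strings and variable names are invented for this example.

```python
import re

# sub() replaces every match and returns the new string.
collapsed = re.sub(r'\s+', ' ', 'a   b \t c')

# subn() does the same but also reports how many replacements were made.
replaced, count = re.subn(r'[0-9]+', '#', 'room 12, floor 3')

# split() breaks the string apart wherever the pattern matches.
fields = re.split(r'\s*,\s*', 'red , green,blue')

print(collapsed)        # a b c
print(replaced, count)  # room #, floor # 2
print(fields)           # ['red', 'green', 'blue']
```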
The functions in the regex module return an integer giving the length or position of any match, or -1 if no match was found. The subgroups from the match are then available as attributes of the compiled pattern object: regs, last, and so forth. This doesn't interact well with threads, because two threads may use the same pattern object at almost the same time; the results from the second thread's operation will then stomp on the first thread's results.
To fix this problem, functions in the re module return a MatchObject instance, or None if the match failed. Pattern objects now have no attributes that change after the object is created. Code must therefore be converted to store the MatchObject instance in a variable, and check for None to determine if a match was found:
regex code:
pat = regex.compile('[0-9]+')
if pat.match(strvar) == -1:
    print 'No match'
re code:
pat = re.compile('[0-9]+')
m = pat.match(strvar)
if m is None:
    print 'No match'
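The point about results no longer living on the pattern object can be seen in a small sketch like the following. The input strings are made up, and the example is single-threaded; it only shows that each call hands back its own independent result.

```python
import re

pat = re.compile('[0-9]+')

# Each call returns its own match object (or None); nothing is stored on pat.
first = pat.match('123abc')
second = pat.match('4567xyz')
missing = pat.match('abc')

if missing is None:
    print('No match')

# The earlier result is unaffected by the later call on the same pattern.
print(first.group(0))   # 123
print(second.group(0))  # 4567
```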
The search() and match() functions have the same parameters in both modules; the re module returns None or a MatchObject instance instead of an integer, and adds an optional flags argument. Of course, the re versions use the new regular expression syntax (see "Pattern Differences", below).
regex code:
result = regex.match('\\w+', 'abc abc')
re code:
result = re.match('\\w+', 'abc abc')
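A short sketch of how the return values and the optional flags argument behave in re; the sample text is arbitrary. It also shows that match() only succeeds at the start of the string, while search() scans forward.

```python
import re

text = 'abc ABC'

# match() only succeeds at the start of the string; search() scans forward.
at_start = re.match(r'\w+', text)
anywhere = re.search(r'[A-Z]+', text)

# Without flags the uppercase-only pattern fails at position 0.
no_flags = re.match(r'[A-Z]+', text)

# The optional third argument supplies flags.
with_flags = re.match(r'[A-Z]+', text, re.IGNORECASE)

print(at_start.group(0))    # abc
print(anywhere.group(0))    # ABC
print(no_flags)             # None
print(with_flags.group(0))  # abc
```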
Another thing that's disappeared is the translate argument to the compile() function; in the regex module, a 256-character string can be passed to indicate how characters should be translated before matching them. This feature was often used to perform case-insensitive matching, or to map the digits 0-9 to 0 to simplify patterns that matched digits. With the re.IGNORECASE and re.LOCALE flags, and special sequences such as "\d", the need for this functionality is greatly reduced; since the on-the-fly translation complicated the matching engine and made it slower, the feature was dropped. If you still need it, you'll have to explicitly call string.translate() on your target string before running your regular expression on it.
regex code:
pat = regex.compile('[abc]', translation)
result = pat.match(str_var)
re code:
pat = re.compile('[abc]')
result = pat.match(string.translate(str_var, translation))
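On Python 3 the string.translate() function no longer exists, so the same idea would have to use the str.translate() method instead. The following is only a sketch under that assumption; the table, pattern, and sample string are invented, and it reproduces the digit-folding trick mentioned above.

```python
import re

# Assumption: Python 3, where str.maketrans() builds the table and
# str.translate() applies it (string.translate() is Python 2 only).
digits_to_zero = str.maketrans('123456789', '000000000')

pat = re.compile('0+')
str_var = '2024 units'

# Translate first, then match against the translated copy.
m = pat.match(str_var.translate(digits_to_zero))

print(m.group(0))  # 0000
print(m.span())    # (0, 4)
```

Because the table maps one character to one character, the positions reported by the match still line up with the original string.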
Some programs use a translation string to convert non-ASCII characters such as é or ø to characters in the range A-Za-z so they can be matched by "\w". These programs can instead specify the re.LOCALE flag, which causes "\w" to match the alphabetic characters defined in the current locale.
The most common use of the translation string is to do a case-insensitive match by passing regex.casefold. The re equivalent is to pass re.IGNORECASE (or re.I, which is the same thing) as the flags argument to re.compile().
regex code:
pat = regex.compile('[abc]', regex.casefold)
re code:
pat = re.compile('[abc]', re.IGNORECASE)
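For instance, a compiled case-insensitive pattern might be used as in this small sketch (the input string is arbitrary):

```python
import re

pat = re.compile('[abc]+', re.IGNORECASE)

# re.I is just a shorter name for the same flag.
same_flag = (re.I == re.IGNORECASE)

m = pat.match('AbCd')

print(same_flag)   # True
print(m.group(0))  # AbC
```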
regex.symcomp() is no longer required; named groups are always available using the (?P<name>...) syntax.
regex code:
pat = regex.symcomp( "(<integer>[0-9]+)" )
re code:
pat = re.compile( "(?P<integer>[0-9]+)" )
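A brief sketch of retrieving named groups from the resulting match object; the second group name, fraction, and the sample input are made up for illustration.

```python
import re

pat = re.compile(r'(?P<integer>[0-9]+)\.(?P<fraction>[0-9]+)')
m = pat.match('3.14159')

# A named group can be fetched by name or by its position.
print(m.group('integer'))   # 3
print(m.group(1))           # 3
print(m.group('fraction'))  # 14159
```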
With the regex module, the group(args) method of pattern objects takes zero or more arguments, either integers or strings containing symbolic group names. group(args) returns a tuple containing the corresponding groups from the last pattern match. With no arguments, .group() returns a tuple containing all the groups in the pattern. In all these cases, if the tuple would only contain a single element, just that single string is returned. This is inconsistent, but convenient:
x = pat.group(1)
x, y = pat.group(1, 2)
The re module moves this method to MatchObject instances. The action of group() with no arguments has been changed for consistency with the start(), end(), and span() methods; they all assume group 0 as the default. To get a tuple containing all the subgroups, use the groups() method.
regex code:
substring = pattern.group(0)
re code:
substring = match.group(0)   # or, equivalently, match.group()
regex code:
substrings = pattern.group()
re code:
substrings = match.groups()
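The defaults described above can be checked with a sketch along these lines; the pattern and sample strings are arbitrary. Note that groups() returns a tuple even when the pattern has only one group.

```python
import re

m = re.match(r'(\w+) (\w+)', 'hello world!')

# group(), start(), end() and span() all default to group 0 (the whole match).
print(m.group())      # hello world
print(m.group(0))     # hello world
print(m.span())       # (0, 11)

# Several arguments to group() give a tuple; groups() gives all subgroups.
print(m.group(1, 2))  # ('hello', 'world')
print(m.groups())     # ('hello', 'world')
print(m.start(2), m.end(2))  # 6 11

# With a single group, groups() is still a one-element tuple.
single = re.match(r'(\d+)', '42 apples')
print(single.groups())  # ('42',)
```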