Python - Regex

General

all regex functions are in the re module

Regex Testers

Four Steps for Python Regex with search

import the re module
pass the regex string to re.compile() to get a pattern object
pass the text string to the pattern object's search() method to get a match object
call the match object's group() method to get the string of the matched text

An example of the 4 steps

import re
phone_num_pattern_obj = re.compile(r'\d{3}-\d{3}-\d{4}')
match_obj = phone_num_pattern_obj.search('My number is 666-777-9999.')
match_obj.group()
Output => '666-777-9999'

further explanation
- phone_num_pattern_obj = re.compile(r'\d{3}-\d{3}-\d{4}')
  - passing the regular expression string to re.compile() returns a pattern object
  - you only need to compile the pattern object once; after that you can call the pattern object's search() method on as many different strings as you want
- match_obj = phone_num_pattern_obj.search('My number is 666-777-9999.')
  - a pattern object's search() method searches the string it is passed for any matches to the regex
  - the search() method will return None if the regex pattern isn't found in the string
  - if the pattern is found, the search() method returns a match object, which will have a group() method that returns a string of the matched text
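
Because search() can return None, calling group() straight away fails with an AttributeError when there is no match. A minimal sketch of guarding against that; the helper name find_phone is made up for this example and is not part of the re module:

```python
import re

phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone(text):
    """Hypothetical helper: return the first phone number in text, or None."""
    match_obj = phone_re.search(text)
    if match_obj is None:
        # no match: there is no match object to call group() on
        return None
    return match_obj.group()

found = find_phone('My number is 666-777-9999.')
missing = find_phone('No number here.')
print(found)    # 666-777-9999
print(missing)  # None
```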

Matching a Phone Number like 666-777-9999

\d matches a single decimal digit, 0 through 9
Matching a phone number could look like this
- r'\d\d\d-\d\d\d-\d\d\d\d'
- it could be simplified to this
- r'\d{3}-\d{3}-\d{4}'
  - the r prefix before the opening quote marks a raw string
  - regex strings often contain backslashes; in a raw string they don't need to be doubled, so you can write r'\d' instead of '\\d'
  - Reads as: match 3 digits, a dash, 3 digits, a dash, 4 digits
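
A quick check of the raw-string point, as a sketch: the raw string and the doubled-backslash string are the same string, and the long and short forms of the pattern match the same text. The variable names are only for illustration.

```python
import re

raw_pattern = r'\d{3}-\d{3}-\d{4}'
escaped_pattern = '\\d{3}-\\d{3}-\\d{4}'   # same pattern without the r prefix
long_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'   # same pattern without {n} repetition

same = raw_pattern == escaped_pattern
print(same)  # True

text = 'My number is 666-777-9999.'
short_match = re.search(raw_pattern, text).group()
long_match = re.search(long_pattern, text).group()
print(short_match == long_match)  # True
```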

Parentheses and Regex

Use case: you want to separate one part of the matched text, like the area code of a phone number
Adding parens creates groups in the regex string
- r'(\d\d\d)-(\d\d\d-\d\d\d\d)'
  - Then use the group() method of match objects to grab the match from just one group
the first set of parens is group 1
the second set of parens is group 2
group(0), or group() with no argument, returns the entire matched text

import re
phone_re = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_re.search('My number is 666-777-9999.')
mo.group(1)
=> '666'
mo.group(2)
=> '777-9999'
mo.group(0)
=> '666-777-9999'
mo.group()
=> '666-777-9999'
mo.groups()
=> ('666', '777-9999')
area_code, main_number = mo.groups()
print(area_code)
=> 666
print(main_number)
=> 777-9999

mo.groups() returns a tuple, so you can use multiple assignment to assign each value to a separate variable

Using Escape Characters

Even with a raw string, you would still need to escape parens if you wanted to use them in the phone number like this: (666) 777-9999
Escape the parens to match them

import re
pattern = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = pattern.search('My phone number is (666) 777-9999.')
mo.group(1)
=> '(666)'
mo.group(2)
=> '777-9999'

Matching Characters from Alternate Groups

The pipe is the alternation operator
- Ex: r'Cat|Dog'
See ATBSWP 191 for a long example of the alternation operator used with matching groups
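
A small sketch of the alternation operator, on its own and inside a group. The sample sentences and the "food" pattern are made up for illustration and are not taken from the book example.

```python
import re

pet_re = re.compile(r'Cat|Dog')
# search() returns whichever alternative appears first in the text
first_pet = pet_re.search('Dog and Cat are friends.').group()
all_pets = pet_re.findall('Dog and Cat are friends.')
print(first_pet)  # Dog
print(all_pets)   # ['Dog', 'Cat']

# inside parens, the pipe only applies to the group, and the rest is shared
food_re = re.compile(r'(Cat|Dog) food')
mo = food_re.search('I bought Dog food and Cat food.')
print(mo.group())   # Dog food
print(mo.group(1))  # Dog
```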

search vs. findall

search() returns a match object for only the first matched text in the searched string
findall() returns the strings of every match in the searched string
The caveat to findall()
- It returns a list of strings as long as there are no groups in the regex
  - If the regex has two or more groups, it returns a list of tuples instead
    - Each tuple represents a single match, and the tuple has strings for each group in the regex
Another caveat of findall()
- it doesn't overlap matches
- if you match 3 digits in 1234, it returns 123 but not 234, even though 234 also fits the pattern
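
The non-overlapping behavior can be seen directly. The second half of this sketch uses a lookahead with a capturing group, a common workaround that these notes don't otherwise cover, so treat it as an aside rather than the standard approach.

```python
import re

three_digits = re.compile(r'\d{3}')
non_overlapping = three_digits.findall('1234')
print(non_overlapping)  # ['123']

# Aside: (?=...) is a lookahead, which matches without consuming any text,
# so the scan can restart one character later and capture overlapping runs
overlapping = re.findall(r'(?=(\d{3}))', '1234')
print(overlapping)  # ['123', '234']
```
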
Steps with findall
import the re module
pass the regex string to re.compile() to get a pattern object
pass the text string to the pattern object's findall() method
there is no group() step because findall() returns plain strings (or tuples of strings), not match objects

import re
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
pattern.findall('Cell: 666-777-9999 Work 111-222-3333')
=> ['666-777-9999', '111-222-3333']

import re
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
pattern.findall('Cell: 666-777-9999 Work 111-222-3333')
=> [('666', '777', '9999'), ('111', '222', '3333')]

the first example doesn't use parens and returns a list of strings
the second example uses parens and returns a list of tuples
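
Two related cases, sketched with the same sample text. With exactly one group, findall() returns a list of that group's strings. If match objects are wanted for every match, one option is the pattern object's finditer() method, which is part of the re module but not otherwise covered in these notes.

```python
import re

text = 'Cell: 666-777-9999 Work 111-222-3333'

# exactly one group: a list of strings holding just that group's text
area_code_re = re.compile(r'(\d{3})-\d{3}-\d{4}')
area_codes = area_code_re.findall(text)
print(area_codes)  # ['666', '111']

# finditer() yields a match object per match, so group() works as with search()
phone_re = re.compile(r'(\d{3})-(\d{3}-\d{4})')
pairs = [(mo.group(), mo.group(1)) for mo in phone_re.finditer(text)]
print(pairs)  # [('666-777-9999', '666'), ('111-222-3333', '111')]
```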

Troubleshooting

the error "unterminated subpattern at position 0"
- indicates a closing paren is missing
- the position is the index of the opening paren that was never closed
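
A sketch that reproduces the error on purpose by leaving out a closing paren, then catches it. re.error is the exception the re module raises for an invalid pattern; the exact wording of the message may vary between Python versions.

```python
import re

message = None
try:
    # the opening paren at index 0 is never closed
    re.compile(r'(\d\d\d-\d\d\d\d')
except re.error as err:
    message = str(err)

print(message)
```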