Python - Regex
Latest revision as of 18:40, 26 December 2025
General
- all regex functions are in the re module
Regex Testers
- https://regex101.com
Four Steps for Python Regex with search
- import the re module
- pass the regex string to re.compile() to get a pattern object
- pass the text string to the pattern object's search() method to get a match object
- call the match object's group() method to get the string of the matched text
- An example of the 4 steps
import re
phone_num_pattern_obj = re.compile(r'\d{3}-\d{3}-\d{4}')
match_obj = phone_num_pattern_obj.search('My number is 666-777-9999.')
match_obj.group()
Output => '666-777-9999'
- further explanation
- phone_num_pattern_obj = re.compile(r'\d{3}-\d{3}-\d{4}')
- passing the regular expression string to re.compile() returns a pattern object
- you only need to compile the pattern once; after that you can call the pattern object's search() method on as many different strings as you want
- match_obj = phone_num_pattern_obj.search('My number is 666-777-9999.')
- a pattern object's search() method scans the string it is passed for the first match of the regex
- search() returns None if the pattern isn't found in the string
- if the pattern is found, search() returns a match object, whose group() method returns the matched text as a string
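Because search() can return None, calling group() on the result without checking raises an AttributeError. A minimal sketch of compiling once, reusing the pattern, and guarding against None; the sample_texts list is made up for illustration:

```python
import re

# Compile once, then reuse the pattern object for several strings.
phone_num_pattern_obj = re.compile(r'\d{3}-\d{3}-\d{4}')

# sample_texts is hypothetical data, not from the notes above.
sample_texts = ['My number is 666-777-9999.', 'No number here.']

found = []
for text in sample_texts:
    match_obj = phone_num_pattern_obj.search(text)
    if match_obj is None:
        # No match: there is no match object, so group() cannot be called.
        found.append(None)
    else:
        found.append(match_obj.group())

print(found)
```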
Matching a Phone Number like 666-777-9999
- \d matches one decimal digit, 0 through 9
- Matching a phone number could look like this
- r'\d\d\d-\d\d\d-\d\d\d\d'
- it could be simplified to this
- r'\d{3}-\d{3}-\d{4}'
- r' indicates a raw string
- regex strings often contain backslashes; a raw string passes them through unchanged, so you write r'\d' instead of '\\d'
- Reads as: match 3 digits, a dash, 3 digits, a dash, 4 digits
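The raw-string and {3} shorthand points above can be checked directly. A small sketch; the variable names are made up for illustration:

```python
import re

# A raw string and a manually escaped string hold the same characters.
raw_pattern = r'\d{3}-\d{3}-\d{4}'
escaped_pattern = '\\d{3}-\\d{3}-\\d{4}'
same_string = raw_pattern == escaped_pattern

# The long form and the {n} form match the same text.
text = 'My number is 666-777-9999.'
long_result = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d').search(text).group()
short_result = re.compile(raw_pattern).search(text).group()

print(same_string, long_result, short_result)
```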
Parentheses and Regex
- Use case, you want to separate one part of the matched text, like the area code of a phone number
- Adding parens creates groups in the regex string
- r'(\d\d\d)-(\d\d\d-\d\d\d\d)'
- Then use the group() method of match objects to grab the match from just one group
- the first set of parens is group 1
- the second set of parens is group 2
- group(0), or group() with no argument, returns the entire matched text
import re
phone_re = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_re.search('My number is 666-777-9999.')
mo.group(1)
=> '666'
mo.group(2)
=> '777-9999'
mo.group(0)
=> '666-777-9999'
mo.group()
=> '666-777-9999'
mo.groups()
=> ('666', '777-9999')
area_code, main_number = mo.groups()
print(area_code)
=> 666
print(main_number)
=> 777-9999
- mo.groups() returns a tuple, so you can use multiple assignment to assign each value to a separate variable
Using Escape Characters
- Even with a raw string, you would still need to escape parens if you wanted to use them in the phone number like this: (666) 777-9999
- Escape the parens to match them
import re
pattern = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = pattern.search('My phone number is (666) 777-9999.')
mo.group(1)
=> '(666)'
mo.group(2)
=> '777-9999'
Matching Characters from Alternate Groups
- The pipe is the alternation operator
- Ex: r'Cat|Dog'
- See ATBSWP 191 for a long example of the alternation operator used with matching groups
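A short sketch of the alternation operator, alone and inside a group. The sentences and the Bat(man|mobile) pattern are made-up illustrations, not the example from the book page cited above:

```python
import re

pet_pattern = re.compile(r'Cat|Dog')
sentence = 'Dog and Cat are friends.'

# search() stops at the first alternative that matches, scanning left to right.
first_pet = pet_pattern.search(sentence).group()

# findall() collects every match of either alternative.
all_pets = pet_pattern.findall(sentence)

# Parens limit the alternation to one part of the pattern.
bat_pattern = re.compile(r'Bat(man|mobile)')
mo = bat_pattern.search('The Batmobile lost a wheel.')
whole_match = mo.group()   # the full matched text
suffix = mo.group(1)       # only the part matched inside the parens

print(first_pet, all_pets, whole_match, suffix)
```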
search vs. findall
- search() returns a match object for only the first matched text in the searched string
- findall() returns the strings of every match in the searched string
- The caveat to findall()
- It returns a list of strings as long as there are no groups in the regex
- If the regex has two or more groups, it returns a list of tuples
- Each tuple represents a single match, and the tuple has one string for each group in the regex
- If the regex has exactly one group, it returns a list of the strings matched by that group
- Another caveat of findall()
- it doesn't return overlapping matches
- if you match 3 digits in 1234, it returns 123 but not 234, even though 234 also fits the pattern
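The non-overlapping behavior can be seen with the 1234 case. The last line is one possible workaround, assuming overlapping matches are actually wanted: a lookahead consumes no characters, so the scan advances one position at a time.

```python
import re

three_digits = re.compile(r'\d{3}')

# After matching '123', scanning resumes at '4', so '234' is never tried.
non_overlapping = three_digits.findall('1234')

# With six digits there is room for two separate matches.
two_chunks = three_digits.findall('123456')

# Workaround sketch: capture inside a zero-width lookahead.
overlapping = re.findall(r'(?=(\d{3}))', '1234')

print(non_overlapping, two_chunks, overlapping)
```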
- Steps with findall
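- import the re module
- pass the regex string to re.compile() to get a pattern object
- pass the text string to the pattern object's findall() method
- findall() returns the matched strings directly rather than match objects, so there is no group() call; use finditer() when you need a match object for each match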
import re
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
pattern.findall('Cell: 666-777-9999 Work 111-222-3333')
=> ['666-777-9999', '111-222-3333']
import re
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
pattern.findall('Cell: 666-777-9999 Work 111-222-3333')
=> [('666', '777', '9999'), ('111', '222', '3333')]
- the first example doesn't use parens and returns a list of strings
- the second example uses parens and returns a list of tuples
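When the regex has groups but you still want the whole matched text, one option is the pattern object's finditer() method, which yields a match object per match. A sketch reusing the text from the examples above; the variable names are made up for illustration:

```python
import re

text = 'Cell: 666-777-9999 Work 111-222-3333'
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# finditer() yields match objects, so group() works as it does with search().
full_numbers = [mo.group() for mo in pattern.finditer(text)]
area_codes = [mo.group(1) for mo in pattern.finditer(text)]

# With exactly one group, findall() returns that group's strings, not tuples.
one_group = re.findall(r'(\d{3})-\d{3}-\d{4}', text)

print(full_numbers, area_codes, one_group)
```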
Troubleshooting
- the error "unterminated subpattern at position 0"
- indicates a closing paren is missing; the position is the index of the unclosed opening paren in the pattern
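A sketch that reproduces the error on purpose and then fixes it. The broken pattern is made up for illustration; the exact wording of the message may differ between Python versions.

```python
import re

# The opening paren at position 0 is never closed.
broken = r'(\d{3}-\d{4}'

try:
    re.compile(broken)
    message = None
except re.error as err:
    # str(err) includes the description and the position of the problem.
    message = str(err)

print(message)

# Closing the group fixes the pattern.
fixed = re.compile(r'(\d{3})-\d{4}')
prefix = fixed.search('777-9999').group(1)
print(prefix)
```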