A regular expression is a special sequence of characters that forms a search pattern. It can be used to check whether a string contains a specified pattern, to extract all occurrences of that pattern, and much more.
Regex is everywhere, from validating email addresses, passwords, and date formats to powering search engines, so it is an essential skill for any developer. Most programming languages provide regex capabilities.
If you're familiar with Linux, you have probably already seen regular expressions in sed and grep commands. In this tutorial, we'll be using the re module in Python. Here are the techniques we're going to cover: matching a string against a pattern with re.match(), searching for a pattern with re.search(), finding multiple matches with re.finditer(), and replacing matches with re.sub().
We won't be covering the basics of constructing regular expressions from scratch in this tutorial. Instead, we'll focus on how you can use regex in Python effectively.
To demonstrate the re.match() function, say you want to validate user passwords. For instance, you want to ensure the password they enter is at least 8 characters long and contains at least one digit. The following code does that:
import re # stands for regular expression
# a regular expression for validating a password
match_regex = r"^(?=.*[0-9]).{8,}$"
# a list of example passwords
passwords = ["pwd", "password", "password1"]
for pwd in passwords:
    m = re.match(match_regex, pwd)
    print(f"Password: {pwd}, validate password strength: {bool(m)}")
match_regex is the regular expression responsible for validating the password criteria we mentioned earlier:
^ : Matches the start of the string.
(?=.*[0-9]) : Ensures the string has at least one digit.
.{8,} : Ensures the string has at least 8 characters.
$ : Matches the end of the string.
We then tested the pattern against a list of passwords. Here is the output:
Password: pwd, validate password strength: False
Password: password, validate password strength: False
Password: password1, validate password strength: True
As expected, validation failed for the first two passwords and succeeded for the last. The first password (pwd) has fewer than 8 characters, the second doesn't include a digit, whereas the third has at least 8 characters and contains a digit.
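Note that we wrapped the re.match() call with the built-in bool() function to get a boolean that indicates whether the string matches the pattern. This works because re.match() returns a match object (which is truthy) on success and None on failure.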
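As a rough sketch of how this check could be packaged for reuse, the snippet below compiles the same pattern once and wraps it in a helper. The name is_strong_password is hypothetical and not part of the original code. The last lines also illustrate that re.match() only tries the pattern at the beginning of the string, whereas re.search() scans the whole string.

```python
import re

# same pattern as above, compiled once so it can be reused
PASSWORD_PATTERN = re.compile(r"^(?=.*[0-9]).{8,}$")

def is_strong_password(password):
    """Hypothetical helper: True if the password has 8+ characters and a digit."""
    return PASSWORD_PATTERN.match(password) is not None

results = {pwd: is_strong_password(pwd) for pwd in ["pwd", "password", "password1"]}
print(results)  # {'pwd': False, 'password': False, 'password1': True}

# re.match() only tries the pattern at the start of the string,
# while re.search() looks for it anywhere in the string
digit_at_start = re.match(r"[0-9]", "password1")   # None: the string starts with "p"
digit_anywhere = re.search(r"[0-9]", "password1")  # match object for the trailing "1"
print(digit_at_start, digit_anywhere.group())
```

Comparing against None instead of calling bool() is only a matter of style here; both give the same result.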
A good way to demonstrate the re.search() function is to search for a specific pattern in a string. For this section, we'll try to extract an IPv4 address from part of the output of the ipconfig command on Windows:
import re
# part of ipconfig output
example_text = """
Wireless LAN adapter Wi-Fi:
Connection-specific DNS Suffix . :
Link-local IPv6 Address . . . . . : fe80::380e:9710:5172:caee%2
IPv4 Address. . . . . . . . . . . : 192.168.1.100
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . . : 192.168.1.1
"""
# regex for IPv4 address
ip_address_regex = r"((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}"
# use re.search() method to get the match object
match = re.search(ip_address_regex, example_text)
print(match)
Don't worry too much about the ip_address_regex expression. It basically validates an IPv4 address (making sure that each of the 4 numbers doesn't exceed 255).
We used re.search() in this case to search for a valid IPv4 address. Here is the output:
<_sre.SRE_Match object; span=(281, 292), match='192.168.1.1'>
re.search() returns a match object that holds the start and end indices of the string found as well as the actual string. In this case, it returned '192.168.1.1' as the matched string. You can use:
match.start() to get the index of the first character of the found pattern.
match.end() to get the index just past the last character of the found pattern.
match.span() to get both start and end as a tuple (start, end).
match.group() to get the actual string found.
As you can see, re.search() returns only a single match. Because this pattern ends with $, which by default matches only at the end of the string (or just before its trailing newline), the address it picked up is the one on the last line (192.168.1.1), and the other valid IP addresses are ignored. In the next section, we'll see how to extract multiple matches in a string.
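To make the behaviour of the match object and of the $ anchor more concrete, here is a minimal sketch using the same pattern on a shortened, made-up sample (the sample string is an assumption for illustration, not real ipconfig output). It compares the default behaviour with the re.MULTILINE flag, which lets $ match at the end of every line.

```python
import re

# the IPv4 pattern from the tutorial
ip_address_regex = r"((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}"
# shortened, made-up sample with two addresses on separate lines
sample = "IPv4 Address: 192.168.1.100\nDefault Gateway: 192.168.1.1\n"

# by default "$" matches only at the end of the string (or before its final newline)
default_match = re.search(ip_address_regex, sample)
# with re.MULTILINE, "$" also matches at the end of every line
multiline_match = re.search(ip_address_regex, sample, flags=re.MULTILINE)

print(default_match.group())    # 192.168.1.1
print(multiline_match.group())  # 192.168.1.100

start, end = multiline_match.span()
# end is one past the last matched character, so slicing reproduces the match
print(sample[start:end] == multiline_match.group())  # True
```

Whether re.MULTILINE is appropriate depends on the input. For text that contains one address per line, as ipconfig output does, it lets the pattern find the first address rather than only the one on the last line.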
We'll be using the output of the same command (ipconfig), but this time we'll use regular expressions to match MAC addresses:
import re
# fake ipconfig output
example_text = """
Ethernet adapter Ethernet:
Media State . . . . . . . . . . . : Media disconnected
Physical Address. . . . . . . . . : 88-90-E6-28-35-FA
Ethernet adapter Ethernet 2:
Physical Address. . . . . . . . . : 04-00-4C-4F-4F-60
Autoconfiguration IPv4 Address. . : 169.254.204.56(Preferred)
Wireless LAN adapter Local Area Connection* 2:
Media State . . . . . . . . . . . : Media disconnected
Physical Address. . . . . . . . . : B8-21-5E-D3-66-98
Wireless LAN adapter Wi-Fi:
Physical Address. . . . . . . . . : A0-00-79-AA-62-74
IPv4 Address. . . . . . . . . . . : 192.168.1.101(Preferred)
Default Gateway . . . . . . . . . : 192.168.1.1
"""
# regex for MAC address
mac_address_regex = r"([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})"
# iterate over matches and extract MAC addresses
extracted_mac_addresses = [ m.group(0) for m in re.finditer(mac_address_regex, example_text) ]
print(extracted_mac_addresses)
After defining the regular expression, we used the re.finditer() function to find all occurrences of MAC addresses in the string passed.
Since finditer() returns an iterator of match objects, we used a list comprehension to extract only the found MAC addresses using group(0) (the entire match). Check out the output:
['88-90-E6-28-35-FA', '04-00-4C-4F-4F-60', 'B8-21-5E-D3-66-98', 'A0-00-79-AA-62-74']
Awesome, we have successfully extracted all MAC addresses in that string. In the next section, we'll see how to use regex to replace occurrences of the pattern in strings.
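Before moving on, one detail about the MAC address example is worth a small sketch: why it uses finditer() with group(0) rather than re.findall(). When a pattern contains capturing groups, findall() returns the groups instead of the whole match. The one-line sample string and the non_capturing variant of the pattern below are assumptions added for illustration.

```python
import re

mac_address_regex = r"([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})"
sample = "Physical Address. . . : 88-90-E6-28-35-FA"

# with capturing groups, findall() returns the groups, not the whole match;
# a repeated group keeps only its last repetition
grouped = re.findall(mac_address_regex, sample)
print(grouped)  # [('35-', 'FA')]

# rewriting the groups as non-capturing (?:...) makes findall() return whole matches
non_capturing = r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}"
whole = re.findall(non_capturing, sample)
print(whole)  # ['88-90-E6-28-35-FA']
```

Either approach works; finditer() has the advantage of also giving access to the position of each match through the match object.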
If you have experience with web scraping, you may have encountered a website that uses a service like CloudFlare to hide email addresses from email harvester tools. In this section, we will do exactly that: given a string that contains email addresses, we will replace each address with a '[email protected]' token:
import re
# a basic regular expression for email matching
email_regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
# example text to test with
example_text = """
Subject: This is a text email!
From: John Doe <john@doe.com>
Some text here!
===============================
Subject: This is another email!
From: Abdou Rockikz <example@domain.com>
Some other text!
"""
# substitute any email found with [email protected]
print(re.sub(email_regex, "[email protected]", example_text))
We used the re.sub() function, which takes 3 arguments: the first is the regular expression (the pattern), the second is the replacement for each match found, and the third is the target string. Here is the output:
Subject: This is a text email!
From: John Doe <[email protected]>
Some text here!
===============================
Subject: This is another email!
From: Abdou Rockikz <[email protected]>
Some other text!
Great, as we expected, the re.sub() function returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the string with the replacement specified (2nd argument).
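re.sub() is a bit more flexible than the example above shows. As a minimal sketch, the replacement can also be a function that receives each match object, and the optional count argument limits how many matches are replaced. The helper mask_local_part, the sample text, and the "[hidden]" token are made up for illustration.

```python
import re

email_regex = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
text = "Contact john@doe.com or example@domain.com"

def mask_local_part(match):
    """Hypothetical helper: keep the domain, hide everything before the @."""
    _, domain = match.group().split("@", 1)
    return "***@" + domain

# the replacement can be a function that is called once per match
masked_all = re.sub(email_regex, mask_local_part, text)
# count=1 replaces only the first (leftmost) match
masked_first = re.sub(email_regex, "[hidden]", text, count=1)

print(masked_all)    # Contact ***@doe.com or ***@domain.com
print(masked_first)  # Contact [hidden] or example@domain.com
```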
Now you have the skills to use regular expressions in Python. Note that we didn't cover all the functions provided by the re module; there are other handy ones like split() and fullmatch(), so I highly encourage you to check Python's official documentation.
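For a quick taste of those two functions, here is a minimal sketch. The sample strings and the date pattern are made up for illustration.

```python
import re

# re.split() cuts a string wherever the pattern matches
fields = re.split(r"\s*[,;]\s*", "alpha, beta;gamma ,  delta")
print(fields)  # ['alpha', 'beta', 'gamma', 'delta']

# re.fullmatch() succeeds only if the entire string matches the pattern
date_regex = r"\d{4}-\d{2}-\d{2}"
exact = bool(re.fullmatch(date_regex, "2024-01-31"))
with_suffix = bool(re.fullmatch(date_regex, "2024-01-31 10:00"))
prefix_only = bool(re.match(date_regex, "2024-01-31 10:00"))
print(exact, with_suffix, prefix_only)  # True False True
```

The last line shows the difference from re.match(), which only requires the pattern to match at the start of the string.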
If you aren't sure how to build regular expressions for your needs, you can check either the official documentation or this tutorial.
Learn also: How to Make an Email Extractor in Python.
Happy Coding ♥