A REGEX PRIMER
2022-10-24
Introduction
Regular expressions are one of the most useful tools for extracting information from raw data. Developers, sysadmins, data scientists and even editors can benefit from learning this powerful language. In this guide we will try to learn regex through practical examples.
To follow this tutorial, I suggest you use regex101.com, a web-based interactive regex debugger that shows the result of a regular expression in real time. You can also use any text editor with a regex engine (vim, vscode, emacs, notepad++, word, etc.) or a programming language that supports text matching through regular expressions (Python, JavaScript, Java, C#, C++, etc.).
What is a Regular Expression?
From a formal language theory point of view, a regular expression describes a regular language, i.e. a language that can be recognized by a nondeterministic finite automaton (NFA). The NFA can be built from the expression with Thompson's construction and then made deterministic using the powerset (subset) construction. The result is a deterministic finite automaton (DFA) that can be run on the target string to recognize the parts that match the original expression.
The name “regular” was coined by the American mathematician Stephen Cole Kleene (one of the students of Alonzo Church) in a paper, to describe regular events. In the Chomsky hierarchy, type-3 grammars generate exactly the regular languages.
Theory aside, you can see a regular expression as a pattern to extract information from a string.
Regex Engines Support?
Before getting started with the actual tutorial, we need to point out that every regex engine (i.e. the software that runs regular expressions) works in a slightly different way from the others. In this tutorial all the examples are tested with the Python engine. If you are using a different programming language (or a text editor), you may need to change some syntax elements to get the expressions running correctly.
Match Any Character
Let us get started with the most basic regex operator: the dot metacharacter. This operator will match any single character with the only exception of line breaks. Suppose that you want to match all the lines of length three from the following buffer:
regular expression
regex
foo bar
guide
bar
The dot operator allows us to write the following regular expression:
...
But that matches any run of three characters, in any position of the string, so longer lines match as well. What we want to do is to apply the regular expression at word boundaries, i.e. at the beginning and at the end of words. To do so, we can use the \b metacharacter:
\b...\b
The matched strings are:
foo bar
bar
Note that the line foo bar is matched because it contains three-letter words, even though the line itself is longer than three characters. We will soon see that there are other, more precise ways to accomplish the same result.
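To try this outside regex101.com, here is a minimal Python sketch using the standard re module on the buffer above. The variable names are illustrative and not part of the original examples.

```python
import re

buffer = ["regular expression", "regex", "foo bar", "guide", "bar"]

# "..." succeeds on every line that holds at least three characters.
dot_lines = [line for line in buffer if re.search(r"...", line)]

# "\b...\b" only succeeds on three characters delimited by word boundaries.
word_matches = {line: re.findall(r"\b...\b", line) for line in buffer}

print(dot_lines)
print(word_matches)
```

The first list contains every line of the buffer, while the dictionary holds matches only for foo bar and bar.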
Anchors Metacharacters
In the previous section we talked about how the \b operator can be used to apply a specific pattern at word boundaries, i.e. the start and the end of a word. Similarly, the anchor metacharacters (^ and $) match a given pattern at the beginning and at the end of a string, respectively. Let us see an example:
Given the same text buffer:
regular expression
regex
foo bar
guide
bar
we want to write a regular expression that matches only the lines consisting solely of the word bar. If we had tried to apply the \b operator (\bbar\b), we would have matched both the third and the fifth line, since both lines contain the word “bar”; if we use the anchor operators instead, the regex engine will match only the last line, i.e. bar, as we expected. Thus, the correct regular expression is:
^bar$
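The difference between the two patterns can be checked with a short Python sketch. One assumption worth stating: in Python, ^ and $ anchor to the whole string unless the re.MULTILINE flag is passed, so the flag is needed when the buffer is a single multi-line string.

```python
import re

text = "regular expression\nregex\nfoo bar\nguide\nbar"

# Word boundaries: every line that contains the word "bar".
boundary_hits = [line for line in text.splitlines() if re.search(r"\bbar\b", line)]

# Anchors with MULTILINE: only lines that are exactly "bar".
anchored_hits = re.findall(r"^bar$", text, flags=re.MULTILINE)

# Without MULTILINE the anchors refer to the whole string, so nothing matches.
without_flag = re.findall(r"^bar$", text)

print(boundary_hits, anchored_hits, without_flag)
```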
Letters, Digits and Special Characters
So far, we saw how to deal with any kind of character, but what if we want to match a letter of the alphabet, a number or a special character?
The simplest way to match a character of the English alphabet, a digit or an underscore is to use the \w metacharacter. This operator will match anything in the range of a-zA-Z (both lowercase and uppercase), 0-9 and _. Similarly, to match a digit between 0 and 9 we can use the \d metacharacter.
Special characters, on the other hand, must be escaped. For instance, if we are trying to match a dot (.) in a string, the appropriate regular expression is \. since the dot operator has a different meaning for the regex engine. The same goes for the anchors and for any other reserved character we will encounter during the rest of this guide.
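As a small illustration of escaping, the sketch below compares an unescaped and an escaped dot in Python. The sample text is made up for the example, and re.escape is shown as one possible way to escape a literal string automatically.

```python
import re

text = "v1.0 v1x0"

# Unescaped: the dot matches any character, so both tokens match.
unescaped = re.findall(r"v1.0", text)

# Escaped: the dot must be a literal ".".
escaped = re.findall(r"v1\.0", text)

# re.escape adds the backslashes for us.
auto_escaped = re.escape("1.0")

print(unescaped, escaped, auto_escaped)
```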
Let us take a look at an example:
Given the following list of passwords:
Qwry23
Qtrh23
Badpassword
goodpassword16
#fgh32
2xdg16
we want to match a password if and only if:
- It starts with a letter of the alphabet (lowercase or uppercase), a digit or an underscore;
- It ends with two digits;
- Its length is exactly 6 characters.
To do so, we need:
- The first character to be in a-zA-Z0-9_ (\w);
- To match any character three times (...);
- To match any digit between 0 and 9 two times (\d\d).
The resulting regular expression is:
^\w...\d\d$
which produces:
Qwry23
Qtrh23
2xdg16
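The same filter can be sketched in Python as follows; the list and variable names are only illustrative.

```python
import re

passwords = ["Qwry23", "Qtrh23", "Badpassword", "goodpassword16", "#fgh32", "2xdg16"]

# One word character, any three characters, two digits, anchored on both sides.
pattern = re.compile(r"^\w...\d\d$")

accepted = [password for password in passwords if pattern.search(password)]
print(accepted)
```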
Kleene Star, Kleene Plus and Optionality
One of the missing features in the previous sections was the ability to repeat a matched character an arbitrary number of times. Suppose for instance that we want to match a word of arbitrary length containing letters of the alphabet or digits in the range 0-9, for example:
regex
guide
foo
bar
A naive approach could be to repeat the \w metacharacter several times (i.e. \w\w\w...). The problem with this approach is that we do not know how long a word could be. A solution to this problem is to use either the Kleene Star operator (*) or the Kleene Plus operator (+). The star will match the previous character zero or more times, while the plus operator will match the previous character one or more times. In the previous example, since we want to match the previous character (\w) at least one time, we can use the plus operator:
^\w+
This will match all the lines in the original text.
Let us see another example. Suppose that we want to match all the lines in the following buffer:
xxxyyyyzzz
xzzz
xxxyyz
Since all the lines have at least one x, we can use the plus operator (x+); the y, on the other hand, does not occur everywhere, so we want to match it zero or more times (y*). The z, finally, occurs in all three rows, so we can use the plus operator (z+). The complete regex is:
^x+y*z+$
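A quick way to check the star and plus operators is the Python sketch below. The extra line yyzz is a hypothetical counter-example added here (it is not part of the original buffer) to show that x+ really requires at least one x.

```python
import re

# The last entry is a made-up counter-example with no leading "x".
lines = ["xxxyyyyzzz", "xzzz", "xxxyyz", "yyzz"]

pattern = re.compile(r"^x+y*z+$")
matched = [line for line in lines if pattern.search(line)]

print(matched)
```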
The other operator is the question mark (?). This operator will match the previous character zero or one times and is used to indicate optionality. Let us see a very common example.
Suppose that we want to match both the words color and colour from a text file. To do that, we need to tell the regex engine that the letter “u” is optional. To do so, we can use this regular expression:
colou?r
You may be wondering why we did not use the star operator. The reason is that the star operator would have matched any number of occurrences of the letter u (e.g. colouur, colouuuuuur and so on), while here we want to match it either zero or one time.
Another variant of the question mark operator is the double question mark (??). The difference between these two operators is that the ? operator will try to match the pattern first and, if that fails, it will fall back to the empty string on backtrack. The ?? operator, on the other hand, will first try the empty string and, if that fails, it will try to match the pattern. The single question mark is said to be greedy, while the double question mark is lazy.
If the last part of this section didn't make much sense, just keep reading. In the next section we will introduce the notions of greediness and laziness.
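Before moving on, the behaviour of ?, * and ?? described in this section can be observed with a short Python sketch. The word list is illustrative; re.fullmatch is used so that the whole word has to match.

```python
import re

words = ["color", "colour", "colouur"]

# "u?" allows zero or one "u"; "u*" allows any number of them.
optional = [word for word in words if re.fullmatch(r"colou?r", word)]
starred = [word for word in words if re.fullmatch(r"colou*r", word)]

# Greedy "?" takes the "u" when it can; lazy "??" tries the empty string first.
greedy = re.match(r"colou?", "colour").group()
lazy = re.match(r"colou??", "colour").group()

print(optional, starred, greedy, lazy)
```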
Greediness vs Laziness
Most modern regular expression engines support both a greedy and a lazy version of quantifiers. The greedy operators will try to match the pattern as many times as possible while the lazy operators will try to match the pattern as few times as possible. To understand what this actually means, we will see a practical example.
Let us suppose that we want to parse the class and the ID name of an HTML element, for example from:
<div class="container" id="d1">Content</div>
we want to get container and d1. A first approach could be to match one or more characters in between the double quotes (""), something like this:
".+"
Surprisingly the result is:
"container" id="d1"
The problem with our solution is that the *, + and ? operators are all greedy. Here is what the engine does when it encounters a greedy operator:
- The engine starts matching the string when it encounters the first quote.
- Since the + operator is greedy, the engine does not stop at the next quote: it keeps consuming characters, including the quotes of the id field, until the end of the line.
- When it reaches the end of the line, the engine must go back to find the last quote that lets the pattern succeed. This operation is known as backtracking.
The result of the regular expression is therefore everything between the first and the last quote of the line. To obtain the expected result, we need to use a lazy operator, i.e. an operator that stops matching the string as soon as it encounters the quote character. The lazy versions of the star, the plus and the question mark operators are, respectively, *?, +? and ??. Thus, the correct regular expression is:
".+?"
which produces "container" and "d1" as results.
Below is a table with the greedy and lazy quantifiers:
Greedy Operator | Lazy Operator | Meaning |
---|---|---|
* | *? | Matches zero or more times. |
+ | +? | Matches one or more times. |
? | ?? | Matches zero or one time. |
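The greedy and lazy behaviours can be compared side by side with the following Python sketch, which reuses the HTML element from this section. The third pattern, with a capture group around the lazy part, is an addition of this example to drop the surrounding quotes.

```python
import re

html = '<div class="container" id="d1">Content</div>'

# Greedy: runs to the last quote of the line.
greedy = re.findall(r'".+"', html)

# Lazy: stops at the first closing quote it can find.
lazy = re.findall(r'".+?"', html)

# Same lazy pattern, capturing only what is inside the quotes.
values = re.findall(r'"(.+?)"', html)

print(greedy, lazy, values)
```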
Character Sets
Another useful regex feature is the ability to match a character from a set of symbols. To do that, we can simply wrap the characters we want to search for inside square brackets; the regex engine will match a single element of that list. Take a look at the following example:
Suppose that we want to match all names starting with “J”, “D” or “M” from a list of names:
John
Johnathan
David
Dennis
Micheal
Mike
Sean
Charlotte
To do that we can specify that the first letter must be either a J, a D or an M ([JDM]) and that the rest of the word is made of characters between “a” and “z” ([a-z]). The full expression is:
[JDM][a-z]+
which will match the following names:
John
Johnathan
David
Dennis
Micheal
Mike
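In Python, the same selection could look like the sketch below; re.match is used here because it only tries the pattern at the beginning of each name.

```python
import re

names = ["John", "Johnathan", "David", "Dennis", "Micheal", "Mike", "Sean", "Charlotte"]

# First letter from the set J, D, M, then one or more lowercase letters.
pattern = re.compile(r"[JDM][a-z]+")
selected = [name for name in names if pattern.match(name)]

print(selected)
```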
Character Ranges
As you can see, you can also specify ranges inside character sets. For instance, to match every character between B and F (uppercase) we can use the expression [B-F]; to match any digit between 0 and 9 we can use the expression [0-9]. In fact, the metacharacters \w and \d we saw in the previous sections are, for ASCII text, just aliases for [a-zA-Z0-9_] and [0-9], respectively. Another useful character set alias is \s, which matches any whitespace character, i.e. [\r\n\t\f\v ].
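One caveat for the Python engine used in this guide: on ordinary (Unicode) strings, \w also matches letters outside the ASCII range unless the re.ASCII flag is passed. The sketch below illustrates this with an accented letter chosen for the example.

```python
import re

probe = "\u00e9"  # the single character "é"

# Default behaviour on str patterns: \w is Unicode-aware.
matches_w = bool(re.fullmatch(r"\w", probe))

# With re.ASCII, \w behaves like [a-zA-Z0-9_].
matches_w_ascii = bool(re.fullmatch(r"\w", probe, flags=re.ASCII))

# The explicit character set never matches the accented letter.
matches_class = bool(re.fullmatch(r"[a-zA-Z0-9_]", probe))

print(matches_w, matches_w_ascii, matches_class)
```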
Excluding Characters
Character sets can also be used to exclude certain characters (or ranges of characters) from the pattern. To do that, we can use the hat symbol (^) inside the square brackets.
For instance, to exclude the lines that begin with “S” or “C” from the following list of names:
John
Johnathan
David
Dennis
Micheal
Mike
Sean
Charlotte
we can write
^[^SC][a-z]+
which will match the following names:
John
Johnathan
David
Dennis
Micheal
Mike
Do note that the ^ operator has two different meanings here: one inside the brackets (exclusion) and another outside the brackets (anchor).
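A minimal Python sketch of the exclusion pattern, applied name by name, is shown below; the variable names are illustrative.

```python
import re

names = ["John", "Johnathan", "David", "Dennis", "Micheal", "Mike", "Sean", "Charlotte"]

# Outer ^ anchors to the start; [^SC] rejects a first letter "S" or "C".
pattern = re.compile(r"^[^SC][a-z]+")
kept = [name for name in names if pattern.search(name)]

print(kept)
```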
Repetition Quantifiers
So far, we saw three different quantifiers: one to match zero or one time (?), one to match zero or more times (*) and one to match one or more times (+). In this section we will see how to match a precise number of times. Regex engines support repetition through curly braces ({}). For example, to match a certain character exactly 5 times, you can use the regular expression
x{5}
To match it at least 5 times you use:
x{5,}
To match it at most 5 times you can use (Python accepts the omitted lower bound; in engines that do not, write x{0,5} instead):
x{,5}
And, finally, to match it at least 3 times, but not more than 5 times, you use:
x{3,5}
Let us see an example:
Suppose that you want to match all the domain names that have at least 2 characters in the TLD (i.e., example.com, test.it, etc.) from the following buffer:
example.com
test.it
youtube.org
google.c
To do so, we can use:
\b\.[a-z]{2,}\b
which will match the TLD of the first three rows, as expected.
Like the other quantifiers (*, +, ?), {} is also greedy by default. If you want it to be lazy, be sure to append a ? after the closing curly bracket.
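The sketch below checks the domain example in Python and then contrasts the greedy and lazy forms of a bounded quantifier. The string of seven x characters is made up for the illustration.

```python
import re

domains = ["example.com", "test.it", "youtube.org", "google.c"]

# A dot followed by at least two lowercase letters, ending on a word boundary.
tld = re.compile(r"\b\.[a-z]{2,}\b")
with_tld = [domain for domain in domains if tld.search(domain)]

# Greedy takes as many repetitions as allowed, lazy as few as allowed.
greedy = re.search(r"x{3,5}", "xxxxxxx").group()
lazy = re.search(r"x{3,5}?", "xxxxxxx").group()

print(with_tld, greedy, lazy)
```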
Capture Groups
Up to this point, we saw how regular expressions can be used to match text inside a text file. While regex is a useful tool for finding and replacing strings inside a text file, that is not its only purpose. Regular expressions can also be used to extract (or capture) information from a string to be processed, for instance, inside a program. To capture the result of a regular expression we use a pair of parentheses (()).
Suppose that we want to parse a simple configuration file where each line is in the form:
<key> = <val>;
and to store each value in a Python dictionary. An example of the previous schema is:
width= 640px;
height = 480px;
scale = 1.5679;
color = #99ff66;
default_status ="inactive";
val == "foo";
val = bar
The regular expression is
^(\w+)\s*=\s*([\w\.\#\"]+);$
Let us break it down.
- (\w+): captures the name of the attribute (i.e., a word containing letters, digits or underscores);
- \s*: matches zero or more spaces (the user may not add a space, like in the first line);
- =: matches a single equal sign;
- \s*: matches zero or more spaces;
- ([\w\.\#\"]+): captures the value, i.e. one or more letters, digits, underscores, dots, sharp symbols (#) or quotes (");
- ;$: matches only those lines that end with a semicolon.
import re

entries = [
    'width= 640px;',
    'height = 480px;',
    'scale = 1.5679;',
    'color = #99ff66;',
    'default_status ="inactive";',
    # Invalid entries
    'val == "foo";',
    'val = bar'
]

exp = r"^(\w+)\s*=\s*([\w\.\#\"]+);$"
keys = {}

def main():
    # re.search returns None for the entries that do not follow the schema
    matches = [re.search(exp, entry) for entry in entries]
    for match in matches:
        if match:
            # Group 1 is the key, group 2 is the value
            keys[match.group(1)] = match.group(2)
    print(keys)

if __name__ == "__main__":
    main()
The output is:
{'width': '640px', 'height': '480px', 'scale': '1.5679', 'color': '#99ff66', 'default_status': '"inactive"'}
As you can see, capture group indexes start from 1. We will soon see how to name groups and how to refer to them.
Conditional Capture Groups
Capture groups allow us to use an OR operator (|) to match any one of a set of alternatives specified in the pattern. For example, if we want to match all the domain names with .com or .org as the TLD, we can use the following expression.
List of domains:
google.com
kernel.org
songdata.io
Regular expression:
^\w+\.(com|org)$
which filters out the last line.
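A possible Python version of this filter is sketched below. Since the alternatives sit inside a capture group, the matched TLD can also be read back with group(1).

```python
import re

domains = ["google.com", "kernel.org", "songdata.io"]
pattern = re.compile(r"^\w+\.(com|org)$")

# Keep the domains that match and collect the alternative that was chosen.
matched = [domain for domain in domains if pattern.search(domain)]
tlds = [pattern.search(domain).group(1) for domain in matched]

print(matched, tlds)
```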
Nested Capture Groups
On some occasions, it may be useful to capture multiple groups at the same time. For instance, in the following list of file names
secret_code.txt
database_backup.sql
regular_expression_article.md
list_of_sales2022.csv
it may be useful to extract the complete filename (database_backup.sql), the name without the extension (database_backup) and the extension (sql). To do that, we need to write a nested capture group:
^((\w+)\.(\w+))$
Here the outer group ((\w+)\.(\w+)) captures the whole filename, the first inner (\w+) captures the name without extension and the second inner (\w+) captures the extension.
In other words, for each row we get:
["secret_code.txt", "secret_code", "txt"]
["database_backup.sql", "database_backup", "sql"]
["regular_expression_article.md", "regular_expression_article", "md"]
["list_of_sales2022.csv", "list_of_sales2022", "csv"]
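These rows can be reproduced in Python with the groups() method of a match object, as in the following sketch. Groups are numbered by the position of their opening parenthesis, so the outer group comes first.

```python
import re

filenames = [
    "secret_code.txt",
    "database_backup.sql",
    "regular_expression_article.md",
    "list_of_sales2022.csv",
]

pattern = re.compile(r"^((\w+)\.(\w+))$")

# groups() returns (whole filename, name, extension) for each match.
rows = [list(pattern.search(name).groups()) for name in filenames]

for row in rows:
    print(row)
```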
Numbered Backreferences
Capture groups can be referenced in later parts of the regular expression using their index(starting from 1). This action is known as backreferencing. Suppose for example that we want to remove duplicate words in a text file:
The quick quick brown fox fox jumps over the lazy dog
The The quick brown fox jumps over the lazy dog
Here the words quick, fox and The are duplicated. We can avoid fixing these typos manually by using the following regular expression:
\b([a-zA-Z]+)\s\1\b
Where:
- ([a-zA-Z]+)\s: captures a word followed by a whitespace character;
- \1: refers to the previous capture group (i.e. it matches the same text).
To remove the duplicates, use \1 in the “replace” textbox of your text editor.
Do note that some editors/programming languages use a different syntax to refer to capture groups. For instance, Visual Studio Code uses $n (where n is the index of the capture group) instead of \n.
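In Python, the same replacement can be sketched with re.sub, where \1 in the replacement string stands for the text captured by the first group.

```python
import re

lines = [
    "The quick quick brown fox fox jumps over the lazy dog",
    "The The quick brown fox jumps over the lazy dog",
]

pattern = re.compile(r"\b([a-zA-Z]+)\s\1\b")

# Replace each "word word" pair with a single copy of the captured word.
fixed = [pattern.sub(r"\1", line) for line in lines]

print(*fixed, sep="\n")
```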
You can see these backreferences as a sort of variable to refer to previous groups. Let us see another common example:
Let us say that we want to match the opening and closing tag of an HTML element without repeating the same pattern twice (i.e., for <a> and for </a>).
For instance, we want to match every line from the following file:
<p>This is a paragraph</p>
<I>This is italic</I>
<b>This is bold</b>
<div>This is a div</div>
To do that we can write:
<([A-Za-z0-9]+)>.*?</\1>
Where:
- <([A-Za-z0-9]+)>: captures the name of the tag (both uppercase and lowercase);
- .*?: matches the content of the element;
- </\1>: matches the closing tag (i.e., it refers to the string captured by the first capture group).
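The sketch below runs this pattern in Python. The last entry, with mismatched tags, is a hypothetical line added for the example to show that the backreference rejects a closing tag with a different name.

```python
import re

lines = [
    "<p>This is a paragraph</p>",
    "<I>This is italic</I>",
    "<b>This is bold</b>",
    "<div>This is a div</div>",
    "<b>Mismatched tags</i>",  # made-up counter-example
]

pattern = re.compile(r"<([A-Za-z0-9]+)>.*?</\1>")

# Collect the tag name of every line whose closing tag matches the opening one.
tags = [match.group(1) for match in map(pattern.search, lines) if match]

print(tags)
```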
Keep in mind that regular expressions are suited only to simple formats such as ini and yaml; JSON is already too complex.
Named Backreferences
Referring to previously captured groups using indexes is not always very clear. When you are working with large and complex regular expressions, and you have to refer to multiple groups at the same time, you may find it useful to give these groups an explicit name. To give a name to a capture group and to refer to it later, use the following syntax:
(?P<custom_name>)
(?P=custom_name)
The first one is used on group declaration while the second one is used for back-referencing.
For instance, the duplicate words example becomes:
\b(?P<duplicate>[a-zA-Z]+)\s(?P=duplicate)\b
Named back-referencing does not replace group indexes. You can still refer to the duplicate group using the index \1 (or $1).
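Here is a minimal Python sketch of named groups at work. Inside the pattern the group is referenced with (?P=duplicate); in the replacement string of re.sub the equivalent syntax is \g<duplicate>. The sample sentence is taken from the duplicate words example.

```python
import re

pattern = re.compile(r"\b(?P<duplicate>[a-zA-Z]+)\s(?P=duplicate)\b")
text = "The The quick brown fox"

match = pattern.search(text)

# The same group is reachable by name and by index.
by_name = match.group("duplicate")
by_index = match.group(1)

# \g<name> refers to a named group in the replacement string.
fixed = pattern.sub(r"\g<duplicate>", text)

print(by_name, by_index, fixed)
```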
Anonymous Capture Group
We saw so far that capture groups can be accessed using back-referencing. Sometimes, though, we want to create a capture group to organize the regular expression, but we do not want to be able to access it later. In such cases we can create an anonymous capture group(also known as non-capturing group). The syntax for an anonymous capture group is the following:
(?:)
To give an example of why anonymous capture groups can be useful, let us go back
to the example about nested capture groups:
secret_code.txt
database_backup.sql
regular_expression_article.md
list_of_sales2022.csv
Instead of retrieving the whole filename (database_backup.sql), we want to be able to access only the name without extension (database_backup) and the extension (sql) while keeping them inside a nested group. To do so, we can refactor the regular expression from this:
^((\w+)\.(\w+))$
to this
^(?:(\w+)\.(\w+))$
which gives us the following rows:
["secret_code", "txt"]
["database_backup", "sql"]
["regular_expression_article", "md"]
["list_of_sales2022", "csv"]
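The effect of the non-capturing group can be verified in Python by comparing the number of groups reported by the two patterns, as in this sketch.

```python
import re

filenames = [
    "secret_code.txt",
    "database_backup.sql",
    "regular_expression_article.md",
    "list_of_sales2022.csv",
]

capturing = re.compile(r"^((\w+)\.(\w+))$")
non_capturing = re.compile(r"^(?:(\w+)\.(\w+))$")

# (?:...) groups the expression without creating a numbered group.
rows = [list(non_capturing.search(name).groups()) for name in filenames]

print(capturing.groups, non_capturing.groups)
print(rows)
```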
Practical Examples
Now that we covered pretty much all the basic concepts of regular expressions, we can use them to solve some practical problems. In the next sections we will solve four typical sysadmin/programmer tasks using regex in Python. Keep in mind that the following regular expressions are not suitable for every scenario; you may need to adjust them according to your needs before using them in production.
1. Quotation Mark Replacement
Given a string surrounded by single quotation marks (`), double quotation marks ("") or angle brackets (<<>>), replace them with Polish quotation marks (,,'').
I.e., from this:
`Your time is limited, so don't waste it living someone else's life.`
""Life is what happens when you're busy making other plans.""
<<Tell me and I forget. Teach me and I remember. Involve me and I learn.>>
,,It is during our darkest moments that we must focus to see the light.''
we want this:
,,Your time is limited, so don't waste it living someone else's life.''
,,Life is what happens when you're busy making other plans.''
,,Tell me and I forget. Teach me and I remember. Involve me and I learn.''
,,It is during our darkest moments that we must focus to see the light.''
The regular expression is:
^[`\"<]{1,2}([\w\s'.,]+)[`\">]{1,2}$
where:
- ^[`\"<]{1,2}: matches the opening quotes either one or two times;
- ([\w\s'.,]+): captures the text inside the quotes;
- [`\">]{1,2}$: matches the closing quotes either one or two times.
import re

phrases = [
    "`Your time is limited, so don't waste it living someone else's life.`",
    "\"\"Life is what happens when you're busy making other plans.\"\"",
    "<<Tell me and I forget. Teach me and I remember. Involve me and I learn.>>",
    ",,It is during our darkest moments that we must focus to see the light.''"
]

exp = r"^[`\"<]{1,2}([\w\s'.,]+)[`\">]{1,2}$"

def main():
    # \1 is the quoted text; phrases that do not match are left untouched
    res = [re.sub(exp, r",,\1''", phrase) for phrase in phrases]
    print(*res, sep='\n')

if __name__ == "__main__":
    main()
2. Log File
Given a log file of the following form:
[LEVEL] [YYYY/MM/DD - HH:MM:SS.uuu] | <message>
parse the LEVEL, the date (YYYY/MM/DD - HH:MM:SS.uuu) and the message.
I.e., from this:
[I] [2022/10/22 - 15:57:30.157] | Server up and running
[I] [2022/10/22 - 16:40:23.472] | Server accepted a new connection
[W] [2022/10/22 - 17:45:00.123] | Server has reached memory limit
[E] [2022/10/22 - 17:47:32.100] | Server not responding, trying to restart...
[E] [2022/10/22 - 17:48:00.000] |Server crashed.
We want this:
Level: I
Date: 2022/10/22 - 15:57:30.157
Message: Server up and running
The regular expression is:
^\[(?P<level>[CDEIW])\]\s\[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*(?P<message>[A-Za-z,. ]+)$
where:
- ^\[(?P<level>[CDEIW])\]\s: captures the log level, i.e. a letter that can be either Info, Warning, Error, Critical or Debug;
- \[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*: captures the date;
- (?P<message>[A-Za-z,. ]+): captures the message, i.e. a string with alphabet letters and syntactical elements.
import re

logs = [
    "[I] [2022/10/22 - 15:57:30.157] | Server up and running",
    "[I] [2022/10/22 - 16:40:23.472] | Server accepted a new connection",
    "[W] [2022/10/22 - 17:45:00.123] | Server has reached memory limit",
    "[E] [2022/10/22 - 17:47:32.100] | Server not responding, trying to restart...",
    "[E] [2022/10/22 - 17:48:00.000] |Server crashed."
]

exp = r"^\[(?P<level>[CDEIW])\]\s\[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*(?P<message>[A-Za-z,. ]+)$"

def main():
    for log in logs:
        match = re.search(exp, log)
        # Skip the lines that do not follow the expected format
        if match is None:
            continue
        # Named groups can be read by name instead of by index
        print(f"Level: {match.group('level')}\nDate: {match.group('date')}\nMessage: {match.group('message')}\n")

if __name__ == "__main__":
    main()
3. Parse URLs
Given a URL of the following form:
<PROTOCOL>://<ADDRESS>:<PORT>/<RESOURCE>
extract the protocol, the address, the port (if it exists) and the resource.
For instance, from this:
https://www.google.com/search?q=regex+tutorial
we want this:
Protocol: https
Address: www.google.com
Query: search?q=regex+tutorial
The regular expression to do that is:
^(\w+)://([\w\-\.]+):?(\d+)?\/(.*)$
where:
- ^(\w+)://: captures the protocol;
- ([\w\-\.]+): captures the address;
- :?(\d+)?: captures the port if it exists;
- (.*): captures the resource.
import re

urls = [
    "https://marcocetica.com/posts/wireguard_pihole/",
    "https://www.google.com/search?q=regex+tutorial",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "file://localhost:3000/archive.zip"
]

exp = r"^(\w+)://([\w\-\.]+):?(\d+)?\/(.*)$"

def main():
    for url in urls:
        match = re.search(exp, url)
        # Skip the URLs that do not follow the expected form
        if match is None:
            continue
        print(f"For URL: {url}")
        print(f"Protocol: {match.group(1)}")
        print(f"Address: {match.group(2)}")
        # Group 3 is None when the URL does not specify a port
        if match.group(3):
            print(f"Port: {match.group(3)}")
        print(f"Query: {match.group(4)}\n")

if __name__ == "__main__":
    main()
4. Parse IPv4 Addresses
Given a list containing IPv4 addresses, we want to filter out invalid entries.
For example, from this list:
33.150.44.43
145.158.214.165
74.50.35.148
149.184.69.21
86.120.183.166
91.133.69.226
19.111.62.200
71.241.160.194
169.180.151.86
114.136.215.231
127.0.0.1
192.168.1.64.
320.410.0.1
192.168
127.0.0.1.5
10.01.1.02
10.0.0.02
1...0
we want this list:
33.150.44.43
145.158.214.165
74.50.35.148
149.184.69.21
86.120.183.166
91.133.69.226
19.111.62.200
71.241.160.194
169.180.151.86
114.136.215.231
127.0.0.1
The regular expression to do that is:
^\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\b$
Where:
- (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}: matches the first three octets, each of the form 25X, 2XX, 1XX, XX or X and followed by a dot;
- (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]): matches the last octet, of the form 25X, 2XX, 1XX, XX or X.
import re

ip_addresses = [
    "33.150.44.43",
    "145.158.214.165",
    "74.50.35.148",
    "149.184.69.21",
    "86.120.183.166",
    "91.133.69.226",
    "19.111.62.200",
    "71.241.160.194",
    "169.180.151.86",
    "114.136.215.231",
    "127.0.0.1",
    # Invalid addresses
    "192.168.1.64.",
    "320.410.0.1",
    "192.168",
    "127.0.0.1.5",
    "10.01.1.02",
    "10.0.0.02",
    "1...0"
]

exp = r"^\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\b$"

def main():
    print("Valid addresses: ")
    # Keep only the addresses on which the pattern matches
    valid = [ip for ip in ip_addresses if re.search(exp, ip)]
    print(*valid, sep='\n')

if __name__ == "__main__":
    main()