A REGEX PRIMER

2022-10-24

Introduction

Regular expressions are one of the most useful tools for extracting information from raw data. Developers, sysadmins, data scientists and even editors can benefit from learning this powerful language. In this guide we will learn regex through practical examples.

To follow this tutorial, I suggest using regex101.com, a web-based interactive regex debugger that shows the result of a regular expression in real time. You can also use any text editor with a regex engine (vim, vscode, emacs, notepad++, word, etc.) or a programming language that supports text matching through regular expressions (Python, Javascript, Java, C#, C++, etc.).

What is a Regular Expression?

From a formal language theory point of view, a regular expression is a notation that describes a regular language, i.e. a language that can be recognized by a finite automaton. A regular expression can be converted into a nondeterministic finite automaton (NFA) by applying Thompson's construction, and the NFA can then be made deterministic using the powerset (subset) construction. The result is a deterministic finite automaton (DFA) that can be run on the target string to recognize the text that matches the original expression.

The name “regular” was coined by the American mathematician Stephen Cole Kleene (one of the students of Alonzo Church) in a paper describing what he called regular events. In the Chomsky hierarchy, regular languages are the languages generated by type-3 grammars.

Theory aside, you can see a regular expression as a pattern to extract information from a string.

Regex Engines Support?

Before getting started with the actual tutorial, we need to point out that every regex engine (i.e. the software that runs regular expressions) works in a slightly different way from the others. In this tutorial, all the examples are tested with the Python engine. If you are using a different programming language (or a text editor), you may need to change some syntax elements in order to get the expression running correctly.

Match Any Character

Let us get started with the most basic regex operator: the dot metacharacter. This operator matches any single character, with the only exception of line breaks. Suppose that you want to find all the lines containing a word of length three in the following buffer:


regular expression
regex
foo bar
guide
bar

The dot operator allows us to write the following regular expression:

...

But that matches any three consecutive characters at any position of the string. What we want to do is to apply the regular expression at word boundaries, i.e. at the beginning and at the end of words. To do so, we can use the \b metacharacter:


\b...\b

The lines containing a match are:


foo bar
bar

as expected. We will soon see that there are other, more efficient ways to accomplish the same result.
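As a quick check, the two patterns above can be compared in Python. This is only a minimal sketch: the `lines` list mirrors the buffer, and the names `unanchored` and `bounded` are made up for illustration.

```python
import re

# The same buffer used above, one entry per line
lines = ["regular expression", "regex", "foo bar", "guide", "bar"]

# Without word boundaries, "..." matches any three consecutive characters,
# so every line that has at least three characters is a hit
unanchored = [line for line in lines if re.search(r"...", line)]

# With \b on both sides, only three-character words are matched
bounded = {line: re.findall(r"\b...\b", line) for line in lines}

print(unanchored)
for line, words in bounded.items():
    print(f"{line!r}: {words}")
```

Only "foo bar" and "bar" should produce a non-empty list in the second loop.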

Anchors Metacharacters

In the previous section we talked about how the \b operator can be used to apply a specific pattern at word boundaries, i.e. the start and the end of a word. Similarly, the anchor metacharacters, ^ and $, match a given pattern at the beginning and at the end of a string, respectively. Let us see an example:

Given the same text buffer:


regular expression
regex
foo bar
guide
bar

we want to write a regular expression that matches all the lines that start and end with bar.

If we had tried to apply the \b operator (\bbar\b), we would have matched both the third and the fifth line, since both lines contain the word “bar”; but if we use the anchor operators, the regex engine will match only the last line, i.e. bar, as we expected. Thus, the correct regular expression is:


^bar$
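The difference between the two approaches can be sketched in Python as follows. The variable names are illustrative, and the last statement assumes the buffer is held as a single multi-line string, in which case Python needs the `re.MULTILINE` flag for ^ and $ to match at every line.

```python
import re

lines = ["regular expression", "regex", "foo bar", "guide", "bar"]

# \bbar\b only requires "bar" to be a whole word somewhere in the line
word_hits = [line for line in lines if re.search(r"\bbar\b", line)]

# ^bar$ requires the whole string to be exactly "bar"
anchored_hits = [line for line in lines if re.search(r"^bar$", line)]

# On a single multi-line string, re.MULTILINE makes ^ and $ work per line
multiline_hits = re.findall(r"^bar$", "\n".join(lines), re.MULTILINE)

print(word_hits)       # ['foo bar', 'bar']
print(anchored_hits)   # ['bar']
print(multiline_hits)  # ['bar']
```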

Letters, Digits and Special Characters

So far, we have seen how to deal with any kind of character, but what if we want to match a letter of the alphabet, a number or a special character?

The simplest way to match a character of the English alphabet, a digit or an underscore is to use the \w metacharacter. This operator matches anything in the ranges a-z and A-Z (both lowercase and uppercase), 0-9 and _. Note that in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set.

Similarly, to match a digit between 0 and 9 we can use the \d metacharacter.

Special characters, on the other hand, must be escaped. For instance, if we are trying to match a dot (.) in a string, the appropriate regular expression is \. since the unescaped dot has a different meaning for the regex engine. The same goes for the anchors and for any other reserved character we will encounter during the rest of this guide.

Let us take a look at an example:

Given the following list of passwords:


Qwry23
Qtrh23
Badpassword
goodpassword16
#fgh32
2xdg16

we want to match a password if and only if:

It starts with a letter of the alphabet(both lowercase and uppercase) or a digit;
It ends with two digits;
Its length is exactly 6 characters.

This means that we want:

The first character to be between a-zA-Z0-9_(\w);
To match any character three times(...);
To match any number between 0 and 9 two times(\d\d).

The complete regular expression is:


^\w...\d\d$

which produces:


Qwry23
Qtrh23
2xdg16
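The password filter can be reproduced with a few lines of Python. This is a sketch under the assumption that the passwords are stored in a list; `passwords`, `pattern` and `accepted` are illustrative names.

```python
import re

passwords = ["Qwry23", "Qtrh23", "Badpassword", "goodpassword16", "#fgh32", "2xdg16"]

# One word character, any three characters, then two digits: six in total
pattern = r"^\w...\d\d$"

accepted = [pw for pw in passwords if re.search(pattern, pw)]
print(accepted)  # ['Qwry23', 'Qtrh23', '2xdg16']
```

Note that "#fgh32" is rejected because # is not a word character, while the two longer passwords are rejected by the anchors.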

Kleene Star, Kleene Plus and Optionality

One feature missing from the previous sections is the ability to repeat a matched character an arbitrary number of times. Suppose, for instance, that we want to match a word of arbitrary length containing letters of the alphabet or digits in the range 0-9, for example:


regex
guide
foo
bar

A naive approach could be to repeat the \w metacharacter several times (i.e. \w\w\w...). The problem with this approach is that we do not know how long a word could be. A solution to this problem is to use either the Kleene Star operator (*) or the Kleene Plus operator (+). The star matches the previous token zero or more times, while the plus matches the previous token one or more times. In the previous example, since we want to match the previous token (\w) at least one time, we can use the plus operator:


^\w+

This will match all the lines in the original text.

Let us see another example. Suppose that we want to match all the lines in the following buffer:


xxxyyyyzzz
xzzz
xxxyyz

Since all the lines have at least one x, we can use the plus operator(x+); the y on the other hand, does not occur everywhere, so we want to match it zero or more times(y*). The z, finally, occurs in all three rows, so we can use the plus operator(z+). The complete regex is:


^x+y*z+$
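Both expressions of this section can be tried in Python. The sketch below reuses the two buffers; the entries of the second input list in the last statement ("yyzz" and "xxyy") are made-up strings added only to show what the pattern rejects.

```python
import re

words = ["regex", "guide", "foo", "bar"]
rows = ["xxxyyyyzzz", "xzzz", "xxxyyz"]

# \w+ needs at least one word character, so every word matches entirely
word_matches = [re.search(r"^\w+", w).group() for w in words]

# x and z are mandatory (+) while y may be absent (*)
row_matches = [row for row in rows if re.search(r"^x+y*z+$", row)]

# Hypothetical inputs missing a mandatory letter are rejected
rejected = [row for row in ["yyzz", "xxyy"] if not re.search(r"^x+y*z+$", row)]

print(word_matches)  # ['regex', 'guide', 'foo', 'bar']
print(row_matches)   # ['xxxyyyyzzz', 'xzzz', 'xxxyyz']
print(rejected)      # ['yyzz', 'xxyy']
```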

The third operator is the question mark (?). This operator matches the previous token zero or one time and is used to indicate optionality. Let us see a very common example.

Suppose that we want to match both the words color and colour from a text file. To do that, we need to tell the regex engine that the letter “u” is optional. To do so, we can use this regular expression:


colou?r

You may be wondering why we did not use the star operator. The reason is that the star operator would have matched any number of occurrences of the letter u (e.g. colouur, colouuuuuur and so on), while here we want to match it either zero or one time.
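The difference between ? and * in this example can be verified with a short Python sketch. The `candidates` list is invented for the purpose, and `re.fullmatch` is used so that the whole word has to match.

```python
import re

candidates = ["color", "colour", "colouur", "colr"]

# u? accepts zero or one "u"
optional_u = [w for w in candidates if re.fullmatch(r"colou?r", w)]

# u* accepts any number of "u", including the misspelled "colouur"
any_u = [w for w in candidates if re.fullmatch(r"colou*r", w)]

print(optional_u)  # ['color', 'colour']
print(any_u)       # ['color', 'colour', 'colouur']
```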

Another variant of the question mark operator is the double question mark (??). The difference between these two operators is the order in which the alternatives are tried: the ? operator first tries to match the pattern and, if the rest of the expression then fails, backtracks and tries the empty match. The ?? operator, on the other hand, first tries the empty match and, only if that fails, tries to match the pattern.

The single question mark is said to be greedy, while the double question mark is lazy.

If the last part of this section didn't make much sense, just keep reading. In the next section we will introduce the notions of greediness and laziness.

Greediness vs Laziness

Most modern regular expression engines support both a greedy and a lazy version of each quantifier. A greedy operator tries to match the pattern as many times as possible, while a lazy operator tries to match the pattern as few times as possible. To understand what this actually means, we will see a practical example.

Let us suppose that we want to parse the class and the ID name of an HTML element, for example from:


<div class="container" id="d1">Content</div>

we want to get container and d1. A first approach could be to match one or more characters in between the double quotes (""), something like this:


".+"

Surprisingly the result is:


"container" id="d1"

The problem with our solution is that both * and + (as well as ?) are greedy operators. Here is what the engine does, step by step, when it encounters a greedy operator.

The engine starts matching the string when it encounters the first quote.

It does not stop when it finds the next quote, since the dot matches a quote as well.

The same goes for the first quote of the id field

and for the second one.

Since the + operator is greedy, the engine keeps consuming characters until the end of the line. When it reaches the end of the row, the engine must go back to find the last quote; this operation is known as backtracking.

The text between the first quote and the last quote is the result of the regular expression. To obtain the expected result, we need to use a lazy operator, i.e. an operator that stops matching the string as soon as it encounters the quote character. The lazy versions of the star, the plus and the question mark operators are, respectively, *?, +? and ??.

Thus, the correct regular expression is:


".+?"

which produces "container" and "d1" as results.
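The two behaviours are easy to compare in Python with `re.findall`. This is just a sketch on the HTML line used above; the variable names are arbitrary.

```python
import re

html = '<div class="container" id="d1">Content</div>'

# Greedy: .+ runs to the end of the line and backtracks to the LAST quote
greedy = re.findall(r'".+"', html)

# Lazy: .+? stops at the first closing quote it can reach
lazy = re.findall(r'".+?"', html)

print(greedy)  # ['"container" id="d1"']
print(lazy)    # ['"container"', '"d1"']
```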

The table below lists the greedy and lazy quantifiers:

Greedy Operator	Lazy Operator	Meaning
`*`	`*?`	Matches zero or more times.
`+`	`+?`	Matches one or more times.
`?`	`??`	Matches zero or one time.

Character Sets

Another useful regex feature is the ability to match a character from a set of symbols. To do that, we can simply wrap the characters we want to search for inside square brackets; the regex engine will match a single element of that set. Take a look at the following example:

Suppose that we want to match all names starting with “J”, “D” or “M” from a list of names:


John
Johnathan
David
Dennis
Micheal
Mike

Sean
Charlotte

To do that we can specify that the first letter must be either a J, a D or an M ([JDM]) and that the rest of the word is made of characters between “a” and “z” ([a-z]+). The full expression is:


[JDM][a-z]+

which will match the following names:


John
Johnathan
David
Dennis
Micheal
Mike
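In Python the same selection could look like the sketch below, where the `names` list reproduces the buffer (without the empty line) and `selected` is an illustrative name.

```python
import re

names = ["John", "Johnathan", "David", "Dennis", "Micheal", "Mike", "Sean", "Charlotte"]

# First letter from the set J, D, M, followed by one or more lowercase letters
selected = [name for name in names if re.search(r"[JDM][a-z]+", name)]

print(selected)  # ['John', 'Johnathan', 'David', 'Dennis', 'Micheal', 'Mike']
```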

Character Ranges

As you can see, you can also specify ranges inside character sets. For instance, to match every character between B and F(uppercase) we can use the expression [B-F], to match any number between 0 and 9 we can use the expression [0-9].

In fact, the metacharacters \w and \d we saw in the previous sections are just aliases for [a-zA-Z0-9_] and [0-9], respectively (at least for ASCII text). Another useful character set alias is \s, which matches any whitespace character, i.e. [\r\n\t\f\v ].

Excluding Characters

Character sets can also be used to exclude certain characters(or ranges of characters) from the pattern. To do that, we can use the hat symbol(^) inside the square brackets. For instance, to exclude the lines that begin with “S” or “C” from the following list of names:


John
Johnathan
David
Dennis
Micheal
Mike

Sean
Charlotte

we can write


^[^SC][a-z]+

which will match the following names:


John
Johnathan
David
Dennis
Micheal
Mike

Do note that the ^ operator has two different meanings here: one inside the brackets(exclusion) and another outside the brackets(anchor).
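A small Python sketch of the exclusion, again on a list that mirrors the buffer; `kept` is an illustrative name.

```python
import re

names = ["John", "Johnathan", "David", "Dennis", "Micheal", "Mike", "Sean", "Charlotte"]

# The first ^ anchors the match at the start of the string,
# while [^SC] accepts any first character except S and C
kept = [name for name in names if re.search(r"^[^SC][a-z]+", name)]

print(kept)  # ['John', 'Johnathan', 'David', 'Dennis', 'Micheal', 'Mike']
```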

Repetition Quantifiers

So far, we have seen three different quantifiers: one to match zero or one time (?), one to match zero or more times (*) and one to match one or more times (+). In this section we will see how to match a precise number of times.

Regex engines support repetition through curly braces({}). For example to match a certain character exactly 5 times, you can use the regular expression


x{5}

To match it at least 5 times you use:


x{5,}

To match it at most 5 times you can use (in Python; some other engines require the explicit form x{0,5}):


x{,5}

And, finally, to match it at least 3 times, but not more than 5 times, you use:


x{3,5}

Let us see an example:

Suppose that you want to match all the domain names that have at least 2 characters in the TLD (i.e., example.com, test.it, etc.) from the following buffer:


example.com
test.it
youtube.org

google.c

To do so, we can use:


\b\.[a-z]{2,}\b

Which will match the first three rows as expected.

Like the other quantifiers (*, +, ?), {} is also greedy by default. If you want it to be lazy, be sure to append a ? after the closing curly brace.
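The four forms of the curly-brace quantifier, the domain example and the lazy variant can be checked with the following Python sketch. The helper `lengths` and the `samples` list are hypothetical names introduced only for this demonstration.

```python
import re

# 'x', 'xx', ..., 'xxxxxxx'
samples = ["x" * n for n in range(1, 8)]

def lengths(pattern):
    """Return the lengths of the samples that the pattern matches entirely."""
    return [len(s) for s in samples if re.fullmatch(pattern, s)]

exactly_five = lengths(r"x{5}")     # [5]
at_least_five = lengths(r"x{5,}")   # [5, 6, 7]
at_most_five = lengths(r"x{,5}")    # [1, 2, 3, 4, 5]
three_to_five = lengths(r"x{3,5}")  # [3, 4, 5]

domains = ["example.com", "test.it", "youtube.org", "google.c"]
with_tld = [d for d in domains if re.search(r"\b\.[a-z]{2,}\b", d)]

# Greedy takes as many repetitions as allowed, lazy as few as allowed
greedy = re.search(r"x{3,5}", "xxxxxxx").group()  # 'xxxxx'
lazy = re.search(r"x{3,5}?", "xxxxxxx").group()   # 'xxx'

print(exactly_five, at_least_five, at_most_five, three_to_five)
print(with_tld)
print(greedy, lazy)
```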

Capture Groups

Up to this point, we have seen how regular expressions can be used to match text inside a text file. While regex is a useful tool for finding and replacing strings inside a text file, that is not its only purpose. Regular expressions can also be used to extract (or capture) information from a string to be processed, for instance, inside a program. To capture the result of a regular expression we use a pair of parentheses (()).

Suppose that we want to parse a simple configuration file where each line is in the form:


<key> = <val>;

and to store each value in a Python dictionary. An example of the previous schema is:


width= 640px;
height =   480px;
scale    =   1.5679;
color = #99ff66;
default_status ="inactive";

val == "foo";
val = bar

The regular expression is


^(\w+)\s*=\s*([\w\.\#\"]+);$

Let us break it down.

(\w+): captures the name of the attribute(i.e., a word containing letters, digits or underscore);
\s*: matches zero or more spaces(the user may not add a space, like in the first line);
=: matches a single equal sign;
\s*: matches zero or more spaces;
([\w\.\#\"]+): captures a value made of letters, digits, underscores, dots, the sharp symbol (#) or quotes (");
;$: matches only those lines that end with a semicolon.

We can see a working example in Python:


import re

entries = [
    'width= 640px;',
    'height =   480px;',
    'scale    =   1.5679;',
    'color = #99ff66;',
    'default_status ="inactive";',
    # Invalid keys
    'val == "foo";',
    'val = bar'
]

exp = r"^(\w+)\s*=\s*([\w\.\#\"]+);$"
keys = {}

def main():
    matches = [re.search(exp, entry) for entry in entries]
    for match in matches:
        if match:
            keys[match.group(1)] = match.group(2)

    print(keys)

if __name__ == "__main__":
    main()

The output is:


{'width': '640px', 'height': '480px', 'scale': '1.5679', 'color': '#99ff66', 'default_status': '"inactive"'}

As you can see, capture group indexes start from 1. We will soon see how to name groups and how to refer to them.

Conditional Capture Groups

Capture groups allow us to use an OR operator (|), also known as alternation, to match any one of several alternative patterns. For example, if we want to match all the domain names with .com or .org as the TLD, we can use the following expression.

List of domains:


google.com
kernel.org

songdata.io

Regular expression:


^\w+\.(com|org)$

which filters out the last line.
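A possible Python version of this filter is sketched below. The dictionary `tlds` is an illustrative name; it maps each accepted domain to the alternative that was captured.

```python
import re

domains = ["google.com", "kernel.org", "songdata.io"]
pattern = r"^\w+\.(com|org)$"

# Keep the domains that match and remember which alternative was captured
tlds = {}
for domain in domains:
    match = re.search(pattern, domain)
    if match:
        tlds[domain] = match.group(1)

print(tlds)  # {'google.com': 'com', 'kernel.org': 'org'}
```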

Nested Capture Groups

On some occasions, it may be useful to capture multiple groups at the same time. For instance, in the following list of file names


secret_code.txt
database_backup.sql
regular_expression_article.md
list_of_sales2022.csv

it may be useful to extract the complete filename(database_backup.sql), the name without the extension(database_backup) and the extension(sql). To do that, we need to write a nested capture group:


^((\w+)\.(\w+))$

Thus, the outer group ((\w+)\.(\w+)) captures the whole filename, the first inner group (\w+) captures the name without the extension and the second inner group (\w+) captures the extension. In other words, for each row we get:


["secret_code.txt", "secret_code", "txt"]
["database_backup.sql", "database_backup", "sql"]
["regular_expression_article.md", "regular_expression_article", "md"]
["list_of_sales2022.csv", "list_of_sales2022", "csv"]
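The rows above can be obtained in Python through `match.groups()`, which returns the groups ordered by the position of their opening parenthesis. A minimal sketch, with `files` and `rows` as illustrative names:

```python
import re

files = [
    "secret_code.txt",
    "database_backup.sql",
    "regular_expression_article.md",
    "list_of_sales2022.csv",
]

pattern = r"^((\w+)\.(\w+))$"

# Group 1 is the outer group, groups 2 and 3 are the inner ones
rows = [list(re.search(pattern, name).groups()) for name in files]

for row in rows:
    print(row)
```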

Numbered Backreferences

Capture groups can be referenced in later parts of the regular expression using their index(starting from 1). This action is known as backreferencing. Suppose for example that we want to remove duplicate words in a text file:


The quick quick brown fox fox jumps over the lazy dog
The The quick brown fox jumps over the lazy dog

Here the words quick, fox and The are duplicated. We can avoid fixing this typo manually using the following regular expression:


\b([a-zA-Z]+)\s\1\b

Where:

([a-zA-Z]+)\s: captures a word and then matches the whitespace character that follows it;
\1: refers to the previous capture group(i.e. it matches the same text).

To remove duplicate words, you can just write \1 in the “replace” textbox of your text editor. Do note that some editors/programming languages use a different syntax to refer to capture groups. For instance, Visual Studio Code uses $n (where n is the index of the capture group) instead of \n.
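In Python the same replacement can be done with `re.sub`, which accepts \1 in the replacement string. The sketch below assumes the two sentences are stored in a list called `lines` (an illustrative name).

```python
import re

lines = [
    "The quick quick brown fox fox jumps over the lazy dog",
    "The The quick brown fox jumps over the lazy dog",
]

pattern = r"\b([a-zA-Z]+)\s\1\b"

# Replace each "word word" pair with a single occurrence of the word
cleaned = [re.sub(pattern, r"\1", line) for line in lines]

print(*cleaned, sep="\n")
```

Both lines should come out as "The quick brown fox jumps over the lazy dog".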

You can see these backreferences as a sort of variable to refer to previous groups. Let us see another common example:

Let us say that we want to match the opening and closing tag of an HTML element without repeating the same pattern twice(i.e., for <a> and for </a>). For instance, we want to match every line from the following file:


<p>This is a paragraph</p>
<I>This is italic</I>
<b>This is bold</b>
<div>This is a div</div>

To do that we can write:


<([A-Za-z0-9]+)>.*?</\1>

Where:

<([A-Za-z0-9]+)>: matches the opening tag and captures its name (both uppercase and lowercase);
.*?: matches the content of the element;
</\1>: matches the closing tag(i.e., it refers to the string captured by the first capture group).

You will soon realize that regular expressions are not very suitable for parsing real world HTML. In fact, if you try to use the previous expression to match an element containing a class name, an id or any other attribute, it will fail immediately. Deterministic finite automata just are not suitable for this purpose. Complex languages should always be tokenized by a lexical analyzer and then parsed into an abstract syntax tree. The only languages I would personally parse using regular expressions are ini and yaml; JSON is already too complex.
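Both the tag-matching expression and its limitation can be observed with a short Python sketch. The last two input strings are made up for the demonstration: one has mismatched tags and the other carries an attribute.

```python
import re

elements = [
    "<p>This is a paragraph</p>",
    "<I>This is italic</I>",
    "<b>This is bold</b>",
    "<div>This is a div</div>",
]

pattern = r"<([A-Za-z0-9]+)>.*?</\1>"

# \1 forces the closing tag to repeat the captured opening tag name
tags = [re.search(pattern, element).group(1) for element in elements]

# A closing tag with a different name does not match
mismatched = re.search(pattern, "<b>This is bold</i>")

# An opening tag with an attribute does not match either
with_attribute = re.search(pattern, '<div class="container">Content</div>')

print(tags)            # ['p', 'I', 'b', 'div']
print(mismatched)      # None
print(with_attribute)  # None
```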

Named Backreferences

Referring to previously captured groups by index is not always very clear. When you are working with large and complex regular expressions, and you have to refer to multiple groups at the same time, you may find it useful to give these groups an explicit name. To give a name to a capture group, use the following syntax:


(?P<custom_name>)
(?P=custom_name)

The first one is used on group declaration while the second one is used for back-referencing.

For instance, the duplicate words example becomes:


\b(?P<duplicate>[a-zA-Z]+)\s(?P=duplicate)\b

Named back-referencing does not replace group indexes. You can still refer to the duplicate group using the index \1 (or $1).
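In Python, a named group can be read with `match.group("name")` and referenced in a replacement string with `\g<name>`. The following sketch uses a made-up sample sentence.

```python
import re

pattern = r"\b(?P<duplicate>[a-zA-Z]+)\s(?P=duplicate)\b"
text = "The quick quick brown fox"

match = re.search(pattern, text)

# The same group is reachable by name and by index
by_name = match.group("duplicate")
by_index = match.group(1)

# \g<duplicate> is the replacement-string syntax for a named group
cleaned = re.sub(pattern, r"\g<duplicate>", text)

print(by_name, by_index)  # quick quick
print(cleaned)            # The quick brown fox
```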

Anonymous Capture Group

We have seen so far that capture groups can be accessed using back-referencing. Sometimes, though, we want to create a group to organize the regular expression, but we do not want to be able to access it later. In such cases we can create an anonymous capture group (also known as a non-capturing group). The syntax for an anonymous capture group is the following:


(?:)

To give an example of why anonymous capture groups can be useful, let us go back to the example about nested capture groups:


secret_code.txt
database_backup.sql
regular_expression_article.md
list_of_sales2022.csv

Instead of retrieving the whole filename(database_backup.sql) we want to be able to access only the name without extension(database_backup) and the extension(sql) while keeping them inside a nested capture group. To do so, we can refactor the regular expression from this:


^((\w+)\.(\w+))$

to this


^(?:(\w+)\.(\w+))$

Which gives us the following rows:


["secret_code", "txt"]
["database_backup", "sql"]
["regular_expression_article", "md"]
["list_of_sales2022", "csv"]
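The effect of the non-capturing group can be seen in Python by inspecting `match.groups()` again. In the sketch below the second pattern and the file name "archive.tar.gz" are hypothetical additions, meant to show a case where a group is needed only to repeat a sub-pattern.

```python
import re

files = [
    "secret_code.txt",
    "database_backup.sql",
    "regular_expression_article.md",
    "list_of_sales2022.csv",
]

# The outer (?:...) group is not numbered, so only two groups are returned
rows = [list(re.search(r"^(?:(\w+)\.(\w+))$", name).groups()) for name in files]

# Grouping without capturing: repeat "name." and keep only the last extension
last_extension = re.search(r"^(?:\w+\.)+(\w+)$", "archive.tar.gz").group(1)

for row in rows:
    print(row)
print(last_extension)  # gz
```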

Practical Examples

Now that we have covered pretty much all the basic concepts of regular expressions, we can use them to solve some practical problems. In the next sections we will solve four typical sysadmin/programmer tasks using regex in Python. Keep in mind that the following regular expressions are not suitable for every scenario; you may need to adapt them to your needs before using them in production.

1. Quotation Mark Replacement

> Given a string surrounded by backticks (`), double quotation marks ("") or angle brackets (<<>>), replace them with Polish quotation marks (,,'').

i.e., from this:


`Your time is limited, so don't waste it living someone else's life.`
""Life is what happens when you're busy making other plans.""
<<Tell me and I forget. Teach me and I remember. Involve me and I learn.>>
,,It is during our darkest moments that we must focus to see the light.''

we want this:


,,Your time is limited, so don't waste it living someone else's life.''
,,Life is what happens when you're busy making other plans.''
,,Tell me and I forget. Teach me and I remember. Involve me and I learn.''
,,It is during our darkest moments that we must focus to see the light.''

The regular expression is:


^[`\"<]{1,2}([\w\s'.,]+)[`\">]{1,2}$

where:

^[`\"<]{1,2}: matches the opening quotes either one or two times;
([\w\s'.,]+): captures the text inside the quotes;
[`\">]{1,2}$: matches the closing quotes either one or two times.

In Python this is:


import re

phrases = [
    "`Your time is limited, so don't waste it living someone else's life.`",
    "\"\"Life is what happens when you're busy making other plans.\"\"",
    "<<Tell me and I forget. Teach me and I remember. Involve me and I learn.>>",
    ",,It is during our darkest moments that we must focus to see the light.''"
]

exp = r"^[`\"<]{1,2}([\w\s'.,]+)[`\">]{1,2}$"

def main():
    res = [re.sub(exp, r",,\1''", phrase) for phrase in phrases]
    print(*res, sep='\n')

if __name__ == "__main__":
    main()

2. Log File

> Given a log file of the following form:
> [LEVEL] [YYYY/MM/DD - HH:MM:SS.uuu] | <message>
> Parse the LEVEL, the date (YYYY/MM/DD - HH:MM:SS.uuu) and the message.

i.e., from this:


[I] [2022/10/22 - 15:57:30.157] | Server up and running
[I] [2022/10/22 - 16:40:23.472] | Server accepted a new connection
[W] [2022/10/22 - 17:45:00.123] | Server has reached memory limit
[E] [2022/10/22 - 17:47:32.100] | Server not responding, trying to restart...
[E] [2022/10/22 - 17:48:00.000] |Server crashed.

We want this:


Level: I
Date: 2022/10/22 - 15:57:30.157
Message: Server up and running

The regular expression is:


^\[(?P<level>[CDEIW])\]\s\[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*(?P<message>[A-Za-z,. ]+)$

where:

^\[(?P<level>[CDEIW])\]\s: captures the log level, i.e. a single letter standing for Info, Warning, Error, Critical or Debug;
\[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*: captures the date;
(?P<message>[A-Za-z,. ]+): captures the message, i.e. a string with alphabet letters and syntactical elements.

In Python this is:


import re

logs = [
    "[I] [2022/10/22 - 15:57:30.157] | Server up and running",
    "[I] [2022/10/22 - 16:40:23.472] | Server accepted a new connection",
    "[W] [2022/10/22 - 17:45:00.123] | Server has reached memory limit",
    "[E] [2022/10/22 - 17:47:32.100] | Server not responding, trying to restart...",
    "[E] [2022/10/22 - 17:48:00.000] |Server crashed."
]

exp = r"^\[(?P<level>[CDEIW])\]\s\[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*(?P<message>[A-Za-z,. ]+)$"

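def main():
    for log in logs:
        match = re.search(exp, log)
        if match is None:
            # Skip lines that do not follow the expected format
            continue
        # Named groups are easier to read than numeric indexes
        print(f"Level: {match.group('level')}\nDate: {match.group('date')}\nMessage: {match.group('message')}\n")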

if __name__ == "__main__":
    main()

3. Parse URLs

> Given a URL of the following form:
> <PROTOCOL>://<ADDRESS>:<PORT>/<RESOURCE>
> Extract the protocol, the address, the port (if it exists) and the resource.

For instance, from this


https://www.google.com/search?q=regex+tutorial

we want this:


Protocol: https
Address: www.google.com
Query: search?q=regex+tutorial

The regular expression to do that is:


^(\w+)://([\w\-\.]+):?(\d+)?\/(.*)$

where:

^(\w+)://: captures the protocol;
([\w\-\.]+): captures the address;
:?(\d+)?: captures the port if it exists;
(.*): captures the resource.

In Python this is:


import re

urls = [
    "https://marcocetica.com/posts/wireguard_pihole/",
    "https://www.google.com/search?q=regex+tutorial",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "file://localhost:3000/archive.zip"
]

exp = r"^(\w+)://([\w\-\.]+):?(\d+)?\/(.*)$"

def main():
    for url in urls:
        match = re.search(exp, url)
        if match is None:
            # Skip URLs that do not follow the expected format
            continue
        protocol, address, port, resource = match.groups()
        print(f"For URL: {url}")
        print(f"Protocol: {protocol}")
        print(f"Address: {address}")
        if port:
            # The port group is None when the URL has no explicit port
            print(f"Port: {port}")
        print(f"Query: {resource}\n")

if __name__ == "__main__":
    main()

4. Parse IPv4 Addresses

> Given a list of IPv4 addresses, we want to filter out the invalid entries.

For example, from this list:


33.150.44.43
145.158.214.165
74.50.35.148
149.184.69.21
86.120.183.166
91.133.69.226
19.111.62.200
71.241.160.194
169.180.151.86
114.136.215.231
127.0.0.1

192.168.1.64.
320.410.0.1
192.168
127.0.0.1.5
10.01.1.02
10.0.0.02
1...0

we want this list:


33.150.44.43
145.158.214.165
74.50.35.148
149.184.69.21
86.120.183.166
91.133.69.226
19.111.62.200
71.241.160.194
169.180.151.86
114.136.215.231
127.0.0.1

The regular expression to do that is:


^\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\b$

Where:

(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}: matches the first three octets, each of the form 25X, 2XX, 1XX, XX or X and followed by a dot;
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]): matches the last octet, of the form 25X, 2XX, 1XX, XX or X.

In Python this is:


import re

ip_addresses = [
    "33.150.44.43",
    "145.158.214.165",
    "74.50.35.148",
    "149.184.69.21",
    "86.120.183.166",
    "91.133.69.226",
    "19.111.62.200",
    "71.241.160.194",
    "169.180.151.86",
    "114.136.215.231",
    "127.0.0.1",
    # Invalid addresses
    "192.168.1.64.",
    "320.410.0.1",
    "192.168",
    "127.0.0.1.5",
    "10.01.1.02",
    "10.0.0.02",
    "1...0"
]

exp = r"^\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\b$"

def main():
    print("Valid addresses: ")
    # re.search returns None (falsy) when the address does not match
    valid = [ip for ip in ip_addresses if re.search(exp, ip)]
    print(*valid, sep='\n')

if __name__ == "__main__":
    main()