# A REGEX PRIMER

2022-10-24

## Introduction

Regular expressions are one of the most useful tools for extracting information from raw data. Developers, sysadmins, data scientists and even editors can benefit from learning this powerful language. In this guide we will learn regex through practical examples.

To follow this tutorial, I suggest using regex101.com, a web-based interactive regex debugger that shows the result of a regular expression in real time. You can also use any text editor with a regex engine (vim, vscode, emacs, notepad++, word, etc.) or a programming language that supports text matching through regular expressions (Python, JavaScript, Java, C#, C++, etc.).

## What is a Regular Expression?

From a formal language theory point of view, a regular expression is a notation that describes a regular language, i.e. a language that can be recognized by a finite automaton. A regular expression can be converted into a nondeterministic finite automaton (NFA) by applying Thompson's construction, and the NFA can then be made deterministic using the powerset (subset) construction. The result is a deterministic finite automaton (DFA) that can be run on the target string to recognize the text that matches the original expression.

The name “regular” was coined by the American mathematician Stephen Cole Kleene (one of the students of Alonzo Church) in his paper describing regular events. In the Chomsky hierarchy, regular languages are the languages generated by type-3 grammars.

Theory aside, you can see a regular expression as a pattern to extract information from a string.

## Regex Engine Support

Before getting started with the actual tutorial, we need to point out that every regex engine (i.e. the software that runs regular expressions) works in a slightly different way from the others. In this tutorial all the examples are tested with the Python engine. If you are using a different programming language (or a text editor), you may need to change some syntax elements in order to get the expressions running correctly.

## Match Any Character

Let us get started with the most basic regex operator: the dot metacharacter. This operator matches any single character, with the only exception of line breaks. Suppose that you want to match the words of length three in the following buffer:
                
regular expression
regex
foo bar
guide
bar


The dot operator allows us to write the following regular expression:
                
...


But that matches any three consecutive characters, in any position of the string. What we want is to apply the regular expression at word boundaries, i.e. at the beginning and at the end of a word. To do so, we can use the \b metacharacter. So our regex becomes:
                
\b...\b


The matched strings are:
                
foo bar
bar


as expected. We will soon see that there are other, more efficient ways to accomplish the same result.
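The example above can be reproduced with a short Python sketch. It assumes the buffer is stored as a list of strings, one per line; the variable names are only illustrative.

```python
import re

# The sample buffer from the text, one entry per line.
lines = ["regular expression", "regex", "foo bar", "guide", "bar"]

# \b...\b: exactly three characters sitting between two word boundaries.
three_letter_word = re.compile(r"\b...\b")

# Keep the lines that contain at least one such word.
matching_lines = [line for line in lines if three_letter_word.search(line)]
print(matching_lines)  # ['foo bar', 'bar']

# findall lists every individual match inside a line.
words_in_line = three_letter_word.findall("foo bar")
print(words_in_line)  # ['foo', 'bar']
```

Note that search only tells us whether a line contains a match, while findall returns the matched words themselves.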

## Anchor Metacharacters

In the previous section we saw how the \b operator can be used to apply a pattern at word boundaries, i.e. the start and the end of a word. Similarly, the anchor metacharacters ^ and $ match at the beginning and at the end of a string, respectively. Let us see an example. Given the same text buffer:

regular expression
regex
foo bar
guide
bar

we want to write a regular expression that matches only the lines that both start and end with bar. If we tried the \b operator (\bbar\b), we would match both the third and the fifth line, since both contain the word “bar”; if we use the anchor operators instead, the regex engine matches only the last line, i.e. bar, as expected. Thus, the correct regular expression is:

^bar$
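To compare the two approaches side by side, here is a minimal Python sketch. It assumes each line is tested on its own; the last statement shows the re.MULTILINE flag, which is what makes ^ and $ work line by line when the whole buffer is a single string.

```python
import re

lines = ["regular expression", "regex", "foo bar", "guide", "bar"]

# \bbar\b only requires "bar" to be a whole word somewhere in the line.
word_hits = [line for line in lines if re.search(r"\bbar\b", line)]

# ^bar$ requires the whole line to be exactly "bar".
anchored_hits = [line for line in lines if re.search(r"^bar$", line)]

print(word_hits)      # ['foo bar', 'bar']
print(anchored_hits)  # ['bar']

# On a multi-line string, re.MULTILINE makes ^ and $ match at every line.
buffer = "\n".join(lines)
multiline_hits = re.findall(r"^bar$", buffer, flags=re.MULTILINE)
print(multiline_hits)  # ['bar']
```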



## Letters, Digits and Special Characters

So far, we saw how to deal with any kind of character, but what if we want to match a letter of the alphabet, a number or a special character?

The simplest way to match a character of the English alphabet, a digit or an underscore is to use the \w metacharacter. This operator matches anything in the ranges a-z and A-Z (both lowercase and uppercase), 0-9 and _.

Similarly, to match a digit between 0 and 9 we can use the \d metacharacter.

Special characters, on the other hand, must be escaped. For instance, if we are trying to match a dot (.) in a string, the appropriate regular expression is \. since the dot operator has a different meaning for the regex engine. The same goes for the anchors and for any other reserved character we will encounter during the rest of this guide.

Let us take a look at an example:

Given the following list of passwords:
                
Qwry23
Qtrh23
#fgh32
2xdg16


we want to match a password if and only if:
• It starts with a letter of the alphabet(both lowercase and uppercase) or a digit;
• It ends with two digits;
• Its length is exactly 6.
This means that we want:
• The first character to be between a-zA-Z0-9_(\w);
• To match any character three times(...);
• To match any number between 0 and 9 two times(\d\d).
The complete regular expression is:
                
^\w...\d\d$

which produces:

Qwry23
Qtrh23
2xdg16

## Kleene Star, Kleene Plus and Optionality

One of the missing features in the previous sections was the ability to repeat a matched character an arbitrary number of times. Suppose, for instance, that we want to match a word of arbitrary length containing letters of the alphabet or digits in the range 0-9, for example:

regex
guide
foo
bar

A naive approach could be to repeat the \w metacharacter several times (i.e. \w\w\w...). The problem with this approach is that we do not know how long a word could be. A solution to this problem is to use either the Kleene Star operator (*) or the Kleene Plus operator (+). The star matches the previous character zero or more times, while the plus matches the previous character one or more times. In the previous example, since we want to match the previous character (\w) at least once, we can use the plus operator:

^\w+

This will match all the lines in the original text.

Let us see another example. Suppose that we want to match all the lines but the last one in the following buffer:

xxxyyyyzzz
xzzz
xxxyyz

Since all the lines have at least one x, we can use the plus operator (x+); the y, on the other hand, does not occur everywhere, so we want to match it zero or more times (y*). The z, finally, occurs in all three rows, so we can use the plus operator (z+). The complete regex is:

^x+y*z+$
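A quick way to check the quantifiers is to run the pattern in Python. In the sketch below the first three strings come from the buffer above, while "xyy" is a made-up counter-example (it has no trailing z) added only to show a rejected line.

```python
import re

pattern = re.compile(r"^x+y*z+$")

# First three entries are from the text; "xyy" is a hypothetical non-matching line.
candidates = ["xxxyyyyzzz", "xzzz", "xxxyyz", "xyy"]

# Map each candidate to True (matches) or False (does not match).
results = {s: bool(pattern.search(s)) for s in candidates}
print(results)
# {'xxxyyyyzzz': True, 'xzzz': True, 'xxxyyz': True, 'xyy': False}
```

The second line matches because y* is allowed to match zero characters, while z+ always needs at least one z.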


The other operator is the question mark (?). This operator matches the previous character zero or one time and is used to indicate optionality. Let us see a very common example.

Suppose that we want to match both the words color and colour from a text file. To do that, we need to tell the regex engine that the letter “u” is optional. To do so, we can use this regular expression:
                
colou?r


You may be wondering why we did not use the star operator. The reason is, the star operator would have matched any occurrence of the letter u(e.g. colouur, colouuuuuur and so on) while here we want to match it either zero or one time.
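The difference between the two operators is easy to see in Python. This is a small sketch on a made-up sample string; the \b boundaries are added here only so that each word is tested as a whole.

```python
import re

text = "color colour colouur"

# u? allows zero or one "u".
optional_u = re.findall(r"\bcolou?r\b", text)

# u* allows any number of "u", which is too permissive here.
any_number_of_u = re.findall(r"\bcolou*r\b", text)

print(optional_u)       # ['color', 'colour']
print(any_number_of_u)  # ['color', 'colour', 'colouur']
```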

Another variant of the question mark operator is the double question mark (??). The difference lies in the order in which the two possibilities are tried. The ? operator first tries to match the pattern and, only if the overall match fails, falls back to the empty string on backtracking. The ?? operator, on the other hand, first tries the empty string and, only if that fails, tries to match the pattern.

The single question mark is said to be greedy, while the double question mark is lazy.

If the last part of this section didn't make much sense, just keep reading. In the next section we will introduce the notions of greediness and laziness.

## Greediness vs Laziness

Most modern regular expression engines support both a greedy and a lazy version of each quantifier. A greedy operator tries to match the pattern as many times as possible, while a lazy operator tries to match it as few times as possible. To understand what this actually means, let us look at a practical example.

Let us suppose that we want to parse the class and the ID name of an HTML element, for example from:
                
<div class="container" id="d1">Content</div>


we want to get container and d1. A first approach could be to match one or more characters between the double quotes (""), something like this:
                
".+"


Surprisingly the result is:
                
"container" id="d1"


The problem with our solution is that *, + and ? are all greedy operators. The engine starts matching when it encounters the first quote. Since the + operator is greedy, .+ then consumes every character up to the end of the line, including the quotes of the id attribute. When it reaches the end of the line, the engine must go back, one character at a time, until it finds a quote that can close the pattern; this operation is known as backtracking. The first quote found this way is the last one on the line, so the match spans from the first quote to the last one.

To obtain the expected result, we need to use a lazy operator, i.e. an operator that stops matching as soon as the rest of the pattern (here, the closing quote) can match. The lazy versions of the star, the plus and the question mark operators are, respectively, *?, +? and ??.

Thus, the correct regular expression is:
                
".+?"


which produces "container" and "d1" as results.
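The following Python sketch reproduces both results on the same HTML line. The last two statements are an extra illustration (not part of the original example) of ? versus ?? on a one-character string.

```python
import re

html = '<div class="container" id="d1">Content</div>'

# Greedy: .+ runs to the end of the line and backtracks to the LAST quote.
greedy = re.findall(r'".+"', html)

# Lazy: .+? stops at the first closing quote it can use.
lazy = re.findall(r'".+?"', html)

# A capture group around .+? returns only the text between the quotes.
values = re.findall(r'"(.+?)"', html)

print(greedy)  # ['"container" id="d1"']
print(lazy)    # ['"container"', '"d1"']
print(values)  # ['container', 'd1']

# ? prefers to consume the character, ?? prefers the empty string.
greedy_optional = re.search(r"a?", "a").group()
lazy_optional = re.search(r"a??", "a").group()
print(repr(greedy_optional), repr(lazy_optional))  # 'a' ''
```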

Below is a table of the greedy and lazy quantifiers:

| Greedy Operator | Lazy Operator | Meaning |
|---|---|---|
| `*` | `*?` | Matches zero or more times. |
| `+` | `+?` | Matches one or more times. |
| `?` | `??` | Matches zero or one time. |

## Character Sets

Another useful regex feature is the ability to match a character from a set of symbols. To do that, we can simply wrap the characters we want to search for inside square brackets; the regex engine will match a single element of that list. Take a look at the following example:

Suppose that we want to match all names starting with “J”, “D” or “M” from a list of names:
                
John
Johnathan
David
Dennis
Micheal
Mike

Sean
Charlotte


To do that we can specify that the first letter must be either a J, a D or an M ([JDM]) and that the rest of the word is made of characters between “a” and “z” ([a-z]). The full expression is:
                
[JDM][a-z]+


which will match the following names:
                
John
Johnathan
David
Dennis
Micheal
Mike



## Character Ranges

As you can see, you can also specify ranges inside character sets. For instance, to match every character between B and F (uppercase) we can use the expression [B-F]; to match any digit between 0 and 9 we can use the expression [0-9].

In fact, the metacharacters \w and \d we saw in the previous sections are just aliases for [a-zA-Z0-9_] and [0-9], respectively (in Python 3 this holds exactly only with the re.ASCII flag; by default they also match non-ASCII letters and digits). Another useful character set alias is \s, which matches any whitespace character, i.e. [ \t\n\r\f\v].
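The ranges and aliases above can be checked with a few lines of Python. The sketch also shows the default Unicode behaviour of \w and \d in Python 3 and how the re.ASCII flag restricts them to the ASCII sets; the sample characters are arbitrary choices.

```python
import re

# A range inside a character set.
letters_b_to_f = re.findall(r"[B-F]", "ABCDEFG")
print(letters_b_to_f)  # ['B', 'C', 'D', 'E', 'F']

# \s matches spaces, tabs and newlines alike.
tokens = re.split(r"\s+", "a \t b\nc")
print(tokens)  # ['a', 'b', 'c']

e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# By default (str patterns), \w and \d are Unicode-aware.
unicode_word = bool(re.fullmatch(r"\w", e_acute))
unicode_digit = bool(re.fullmatch(r"\d", arabic_three))

# With re.ASCII they behave like [a-zA-Z0-9_] and [0-9].
ascii_word = bool(re.fullmatch(r"\w", e_acute, flags=re.ASCII))
ascii_digit = bool(re.fullmatch(r"\d", arabic_three, flags=re.ASCII))

print(unicode_word, unicode_digit)  # True True
print(ascii_word, ascii_digit)      # False False
```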

## Excluding Characters

Character sets can also be used to exclude certain characters(or ranges of characters) from the pattern. To do that, we can use the hat symbol(^) inside the square brackets. For instance, to exclude the lines that begin with “S” or “C” from the following list of names:
                
John
Johnathan
David
Dennis
Micheal
Mike

Sean
Charlotte


we can write
                
^[^SC][a-z]+


which will match the following names:
                
John
Johnathan
David
Dennis
Micheal
Mike


Do note that the ^ operator has two different meanings here: one inside the brackets(exclusion) and another outside the brackets(anchor).
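Both the positive set from the previous section and the negated set above can be tried on the same list of names. This is only a sketch: the string "1abc" is a made-up input, added to show that a negated set accepts anything that is not listed, not just letters.

```python
import re

names = ["John", "Johnathan", "David", "Dennis", "Micheal", "Mike",
         "Sean", "Charlotte"]

# Positive set: the first letter must be J, D or M.
starts_with_jdm = [n for n in names if re.search(r"^[JDM][a-z]+", n)]

# Negated set: the first character must NOT be S or C.
not_s_or_c = [n for n in names if re.search(r"^[^SC][a-z]+", n)]

print(starts_with_jdm)
print(not_s_or_c)
# Both print ['John', 'Johnathan', 'David', 'Dennis', 'Micheal', 'Mike']

# Hypothetical input: a digit is "not S or C" too, so it is accepted.
digit_first = bool(re.search(r"^[^SC][a-z]+", "1abc"))
print(digit_first)  # True
```

On this list the two patterns give the same result, but they are not equivalent in general, as the last check shows.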

## Repetition Quantifiers

So far, we saw three different quantifiers: one to match zero or one time(?), one to match zero or more times(*) and one to match one or more times(+). In this section we will see how to match a precise number of times.

Regex engines support repetition through curly braces({}). For example to match a certain character exactly 5 times, you can use the regular expression
                
x{5}


To match it at least 5 times you use:
                
x{5,}


To match it at most 5 times (this shorthand works in Python; some other engines need the explicit form x{0,5}), you can use:
                
x{,5}


And, finally, to match it at least 3 times, but not more than 5 times, you use:
                
x{3,5}


Let us see an example:

Suppose that you want to match all the domain names that have at least 2 characters in the TLD (i.e., example.com, test.it, etc.) from the following buffer:
                
example.com
test.it



To do so, we can use:
                
\b\.[a-z]{2,}\b


Which will match the first three rows as expected.

Like the other quantifiers (*, +, ?), {} is greedy by default. If you want it to be lazy, append a ? after the closing curly bracket.
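The different forms of the curly-brace quantifier can be verified with a small Python sketch. The helper name full is just a convenience introduced here; it uses re.fullmatch so that the whole string has to match the pattern.

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

exactly_five = (full(r"x{5}", "x" * 5), full(r"x{5}", "x" * 4))
at_least_five = full(r"x{5,}", "x" * 8)
at_most_five = (full(r"x{,5}", ""), full(r"x{,5}", "x" * 6))
three_to_five = full(r"x{3,5}", "x" * 4)

print(exactly_five)   # (True, False)
print(at_least_five)  # True
print(at_most_five)   # (True, False)
print(three_to_five)  # True

# Greedy vs lazy: the same bounds, different amount of text consumed.
greedy = re.search(r"x{3,5}", "xxxxxxx").group()
lazy = re.search(r"x{3,5}?", "xxxxxxx").group()
print(greedy, lazy)  # xxxxx xxx

# The TLD example from the text, on the two domains that are shown.
tlds = [re.search(r"\b\.[a-z]{2,}\b", d).group()
        for d in ["example.com", "test.it"]]
print(tlds)  # ['.com', '.it']
```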

## Capture Groups

Up to this point, we have seen how regular expressions can be used to match text inside a text file. While regex is a useful tool for finding and replacing strings, that is not its only purpose. Regular expressions can also be used to extract (or capture) information from a string so that it can be processed, for instance, inside a program. To capture part of a match we wrap it in a pair of parentheses (()).

Suppose that we want to parse a simple configuration file where each line is in the form:
                
<key> = <val>;


and to store each value in a Python dictionary. An example of the previous schema is:
                
width= 640px;
height =   480px;
scale    =   1.5679;
color = #99ff66;
default_status ="inactive";

val == "foo";
val = bar


The regular expression is:

^(\w+)\s*=\s*([\w\.\#\"]+);$

Let us break it down.
• (\w+): captures the name of the attribute(i.e., a word containing letters, digits or underscore);
• \s*: matches zero or more spaces(the user may not add a space, like in the first line);
• =: matches a single equal sign;
• \s*: matches zero or more spaces;
• ([\w\.\#\"]+): captures the value, i.e. one or more letters, digits, underscores, dots, hash symbols (#) or double quotes (");
• ;: matches the terminating semicolon, so only lines that end with a semicolon are accepted.
We can see a working example in Python:
                
import re

entries = [
'width= 640px;',
'height =   480px;',
'scale    =   1.5679;',
'color = #99ff66;',
'default_status ="inactive";',
# Invalid keys
'val == "foo";',
'val = bar'
]

exp = r"^(\w+)\s*=\s*([\w\.\#\"]+);$"
keys = {}

def main():
    matches = [re.search(exp, entry) for entry in entries]
    for match in matches:
        if match:
            keys[match.group(1)] = match.group(2)
    print(keys)

if __name__ == "__main__":
    main()

The output is:

{'width': '640px', 'height': '480px', 'scale': '1.5679', 'color': '#99ff66', 'default_status': '"inactive"'}

As you can see, capture group indexes start from 1. We will soon see how to name groups and how to refer to them.

## Conditional Capture Groups

Capture groups allow us to use an OR operator (|) to match any one of the alternatives listed in the pattern. For example, if we want to match all the domain names with .com or .org as the TLD, we can use the following expression. List of domains:

google.com
kernel.org
songdata.io

Regular expression:

^\w+\.(com|org)$

which filters out the last line.
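In Python, the alternation example can be sketched as follows; the variable names are illustrative only.

```python
import re

domains = ["google.com", "kernel.org", "songdata.io"]

# (com|org) accepts either alternative and captures the one that matched.
pattern = re.compile(r"^\w+\.(com|org)$")

kept = [d for d in domains if pattern.search(d)]
tlds = [pattern.search(d).group(1) for d in kept]

print(kept)  # ['google.com', 'kernel.org']
print(tlds)  # ['com', 'org']
```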

## Nested Capture Groups

On some occasions, it may be useful to capture multiple groups at the same time. For instance, in the following list of file names
                
secret_code.txt
database_backup.sql
regular_expression_article.md
list_of_sales2022.csv


it may be useful to extract the complete filename(database_backup.sql), the name without the extension(database_backup) and the extension(sql). To do that, we need to write a nested capture group:
                
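^((\w+)\.(\w+))$

Here the outer group ((\w+)\.(\w+)) captures the whole filename, the first inner group (\w+) captures the name without the extension, and the second inner group (\w+), after the escaped dot, captures the extension. Groups are numbered by the position of their opening parenthesis. In other words, for each row we get: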
                
["secret_code.txt", "secret_code", "txt"]
["database_backup.sql", "database_backup", "sql"]
["regular_expression_article.md", "regular_expression_article", "md"]
["list_of_sales2022.csv", "list_of_sales2022", "csv"]
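The rows above can be obtained in Python with the groups() method of a match object, which returns all the captured groups in order. A minimal sketch:

```python
import re

filenames = [
    "secret_code.txt",
    "database_backup.sql",
    "regular_expression_article.md",
    "list_of_sales2022.csv",
]

pattern = re.compile(r"^((\w+)\.(\w+))$")

# groups() returns (group 1, group 2, group 3), numbered by opening parenthesis.
rows = [list(pattern.search(name).groups()) for name in filenames]

for row in rows:
    print(row)
```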



## Numbered Backreferences

Capture groups can be referenced in later parts of the regular expression using their index(starting from 1). This action is known as backreferencing. Suppose for example that we want to remove duplicate words in a text file:
                
The quick quick brown fox fox jumps over the lazy dog
The The quick brown fox jumps over the lazy dog


Here the words quick, fox and The are duplicated. We can avoid fixing this typo manually using the following regular expression:
                
\b([a-zA-Z]+)\s\1\b


Where:
• ([a-zA-Z]+)\s: captures a word and then matches the whitespace character that follows it;
• \1: refers to the previous capture group(i.e. it matches the same text).
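To actually remove the duplicates, the pattern can be combined with re.sub, using \1 in the replacement string as well. This is a sketch of one possible approach, not the only one.

```python
import re

duplicate = re.compile(r"\b([a-zA-Z]+)\s\1\b")

lines = [
    "The quick quick brown fox fox jumps over the lazy dog",
    "The The quick brown fox jumps over the lazy dog",
]

# Replace each "word word" pair with a single occurrence of the word.
cleaned = [duplicate.sub(r"\1", line) for line in lines]

for line in cleaned:
    print(line)
# The quick brown fox jumps over the lazy dog
# The quick brown fox jumps over the lazy dog

# A single pass only collapses pairs: a word repeated three times
# still leaves one duplicate behind.
triple = duplicate.sub(r"\1", "fox fox fox")
print(triple)  # fox fox
```

If longer runs of repeated words are expected, the substitution can simply be applied again until the text stops changing.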

## Anonymous Capture Groups

So far we have seen that capture groups can be accessed using backreferences. Sometimes, though, we want to create a group to organize the regular expression, but we do not need to access it later. In such cases we can create an anonymous capture group (also known as a non-capturing group). The syntax for an anonymous capture group is the following:
                
(?:)


To give an example of why anonymous capture groups can be useful, let us go back to the example about nested capture groups:
                
secret_code.txt
database_backup.sql
regular_expression_article.md
list_of_sales2022.csv


Instead of retrieving the whole filename(database_backup.sql) we want to be able to access only the name without extension(database_backup) and the extension(sql) while keeping them inside a nested capture group. To do so, we can refactor the regular expression from this:
                
^((\w+)\.(\w+))$

to this:

^(?:(\w+)\.(\w+))$


Which gives us the following rows:
                
["secret_code", "txt"]
["database_backup", "sql"]
["regular_expression_article", "md"]
["list_of_sales2022", "csv"]
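The effect of the non-capturing group is easy to see by comparing groups() for the two patterns. The sketch below uses one filename from the list; note that the whole match is still available as group 0.

```python
import re

capturing = re.compile(r"^((\w+)\.(\w+))$")
non_capturing = re.compile(r"^(?:(\w+)\.(\w+))$")

name = "database_backup.sql"

with_outer = capturing.search(name).groups()
without_outer = non_capturing.search(name).groups()

print(with_outer)     # ('database_backup.sql', 'database_backup', 'sql')
print(without_outer)  # ('database_backup', 'sql')

# group(0) is always the entire match, captured or not.
whole = non_capturing.search(name).group(0)
print(whole)  # database_backup.sql
```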



## Practical Examples

Now that we have covered pretty much all the basic concepts of regular expressions, we can use them to solve some practical problems. In the next sections we will solve five typical sysadmin/programmer tasks using regex in Python. Keep in mind that the following regular expressions are not suitable for every scenario; you may need to adapt them to your needs before using them in production.

## 1. Quotation Mark Replacement

> Given a string surrounded by single quotation marks, double quotation marks ("") or angle brackets (<<>>), replace them with Polish quotation marks (,,'').
i.e., from this:
                
Your time is limited, so don't waste it living someone else's life.
""Life is what happens when you're busy making other plans.""
<<Tell me and I forget. Teach me and I remember. Involve me and I learn.>>
,,It is during our darkest moments that we must focus to see the light.''


we want this:
                
,,Your time is limited, so don't waste it living someone else's life.''
,,Life is what happens when you're busy making other plans.''
,,Tell me and I forget. Teach me and I remember. Involve me and I learn.''
,,It is during our darkest moments that we must focus to see the light.''


The regular expression is:
                
^[\"<]{1,2}([\w\s'.,]+)[\">]{1,2}$

where:
• ^[\"<]{1,2}: matches the opening quotes either one or two times;
• ([\w\s'.,]+): captures the text inside the quotes;
• [\">]{1,2}$: matches the closing quotes either one or two times.
In Python this is:
                
import re

phrases = [
"Your time is limited, so don't waste it living someone else's life.",
"\"\"Life is what happens when you're busy making other plans.\"\"",
"<<Tell me and I forget. Teach me and I remember. Involve me and I learn.>>",
",,It is during our darkest moments that we must focus to see the light.''"
]

exp = r"^[\"<]{1,2}([\w\s'.,]+)[\">]{1,2}$"

def main():
    res = [re.sub(exp, r",,\1''", phrase) for phrase in phrases]
    print(*res, sep='\n')

if __name__ == "__main__":
    main()

## 2. Log Files

> Given a log file of the following form:
>
> [LEVEL] [YYYY/MM/DD - HH:MM:SS.uuu] | <message>
>
> parse the LEVEL, the date (YYYY/MM/DD - HH:MM:SS.uuu) and the message.

i.e., from this:

[I] [2022/10/22 - 15:57:30.157] | Server up and running
[I] [2022/10/22 - 16:40:23.472] | Server accepted a new connection
[W] [2022/10/22 - 17:45:00.123] | Server has reached memory limit
[E] [2022/10/22 - 17:47:32.100] | Server not responding, trying to restart...
[E] [2022/10/22 - 17:48:00.000] |Server crashed.

we want this:

Level: I
Date: 2022/10/22 - 15:57:30.157
Message: Server up and running

The regular expression is:

^\[(?P<level>[CDEIW])\]\s\[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*(?P<message>[A-Za-z,. ]+)$


where:
• ^\[(?P<level>[CDEIW])\]\s: captures the log level, i.e. a single letter standing for Info, Warning, Error, Critical or Debug;
• \[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*: captures the date;
• (?P<message>[A-Za-z,. ]+): captures the message, i.e. a string with alphabet letters and syntactical elements.
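The (?P<name>...) syntax used above creates named groups, which can be read by name instead of by index. The sketch below assumes that the square brackets around the level and the date are escaped literals (\[ and \]) and uses one line from the sample log.

```python
import re

log_pattern = re.compile(
    r"^\[(?P<level>[CDEIW])\]\s"       # [LEVEL]
    r"\[(?P<date>[\d\/\s\-:\.]+)\]\s"  # [YYYY/MM/DD - HH:MM:SS.uuu]
    r"\|\s*(?P<message>[A-Za-z,. ]+)$" # | message
)

line = "[W] [2022/10/22 - 17:45:00.123] | Server has reached memory limit"
match = log_pattern.search(line)

# groupdict() maps every group name to the text it captured.
fields = match.groupdict()
print(fields["level"])    # W
print(fields["date"])     # 2022/10/22 - 17:45:00.123
print(fields["message"])  # Server has reached memory limit

# Named groups are still numbered, so both access styles work.
same_level = match.group("level") == match.group(1)
print(same_level)  # True
```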
In Python this is:
                
import re

logs = [
"[I] [2022/10/22 - 15:57:30.157] | Server up and running",
"[I] [2022/10/22 - 16:40:23.472] | Server accepted a new connection",
"[W] [2022/10/22 - 17:45:00.123] | Server has reached memory limit",
"[E] [2022/10/22 - 17:47:32.100] | Server not responding, trying to restart...",
"[E] [2022/10/22 - 17:48:00.000] |Server crashed."
]

exp = r"^\[(?P<level>[CDEIW])\]\s\[(?P<date>[\d\/\s\-:\.]+)\]\s\|\s*(?P<message>[A-Za-z,. ]+)$"

def main():
    matches = [re.search(exp, log) for log in logs]
    for match in matches:
        print(f"Level: {match.group(1)}\nDate: {match.group(2)}\nMessage: {match.group(3)}\n")

if __name__ == "__main__":
    main()

## 3. Parse URLs

> Given a URL of the following form:
>
> <PROTOCOL>://<ADDRESS>:<PORT>/<RESOURCE>
>
> extract the protocol, the address, the port (if it exists) and the resource.

For instance, from this:

https://www.google.com/search?q=regex+tutorial

we want this:

Protocol: https
Address: www.google.com
Query: search?q=regex+tutorial

The regular expression to do that is:

^(\w+)://([\w\-\.]+):?(\d+)?\/(.*)$


where:
• ^(\w+)://: captures the protocol;
• ([\w\-\.]+): captures the address;
• :?(\d+)?: captures the port if it exists;
• (.*): captures the resource.
In Python this is:
                
import re

urls = [
"https://marcocetica.com/posts/wireguard_pihole/",
"file://localhost:3000/archive.zip"
]

exp = r"^(\w+)://([\w\-\.]+):?(\d+)?\/(.*)$"

def main():
    for url in urls:
        match = re.search(exp, url)
        # The port is optional: print it only when group 3 captured something.
        port = f"Port: {match.group(3)}\n" if match.group(3) else ""
        print(f"For URL: {url}\n"
              f"Protocol: {match.group(1)}\n"
              f"Address: {match.group(2)}\n"
              f"{port}"
              f"Query: {match.group(4)}\n")

if __name__ == "__main__":
    main()

## 4. Parse Email Addresses

> Given an email address of the form
>
> <address>@<domain>.<tld>
>
> we want to parse valid email addresses only.

i.e., from this list:

johndoe @ hotmail . com
john.doe@gmail.com
johndoe@hotmail.com
email@johndoe.net
doe-john96@yahoo.it
john!#$%&'*+-/=?^_{|}~doe@gmail.com
john @ gmail.co

john.doe@gmail.com;;
john@gmail.c
<<john.doe@gmail.com;;
johndoegmail.com


We want to get the following list:
                
johndoe@hotmail.com
john.doe@gmail.com
johndoe@hotmail.com
email@johndoe.net
doe-john96@yahoo.it
john!#$%&'*+-/=?^_{|}~doe@gmail.com
john@gmail.co

The regular expression to do that is:

^\s*?([\w\.+!#$%&'*+\-\/=?^_{|}~]+)\s*?@\s*?([\w]+)\s*?\.\s*?([\w]{2,})$

Where:
• \s*?([\w\.+!#$%&'*+\-\/=?^_{|}~]+): captures any supported character from the address;
• ([\w]+): captures the domain;
• ([\w]{2,}): captures any TLD of length at least 2.
In Python this is:
                
import re

email_addresses = [
"        johndoe        @      hotmail      .  com",
"john.doe@gmail.com",
"johndoe@hotmail.com",
"email@johndoe.net",
"doe-john96@yahoo.it",
"john!#$%&'*+-/=?^_{|}~doe@gmail.com",
"john @ gmail.co",
# Invalid email addresses
"john.doe@gmail.com;;",
"john@gmail.c",
"<<john.doe@gmail.com;;",
"johndoegmail.com"
]

exp = r"^\s*?([\w\.+!#$%&'*+\-\/=?^_{|}~]+)\s*?@\s*?([\w]+)\s*?\.\s*?([\w]{2,})$"

def main():
    matches = [re.search(exp, address) for address in email_addresses]
    for email, match in zip(email_addresses, matches):
        # Invalid addresses produce no match and are skipped.
        if match:
            print(f"Original email: {email}\n"
                  f"Parsed email:{match.group(1)}@{match.group(2)}.{match.group(3)}\n")

if __name__ == "__main__":
    main()

## 5. Parse IPv4 Addresses

> Given a list containing IPv4 addresses, we want to filter out invalid entries.

For example, from this list:

33.150.44.43
145.158.214.165
74.50.35.148
149.184.69.21
86.120.183.166
91.133.69.226
19.111.62.200
71.241.160.194
169.180.151.86
114.136.215.231
127.0.0.1
192.168.1.64.
320.410.0.1
192.168
127.0.0.1.5
10.01.1.02
10.0.0.02
1...0

we want this list:

33.150.44.43
145.158.214.165
74.50.35.148
149.184.69.21
86.120.183.166
91.133.69.226
19.111.62.200
71.241.160.194
169.180.151.86
114.136.215.231
127.0.0.1

The regular expression to do that is:

^\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\b$


Where:
• (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}: matches the first three octets, each of the form 25X, 2XX, 1XX, XX or X and followed by a dot;
• (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]): matches the last octet, of the form 25X, 2XX, 1XX, XX or X.
In Python this is:
                
import re

ip_addresses = [  # list name assumed: the original assignment line is missing
"33.150.44.43",
"145.158.214.165",
"74.50.35.148",
"149.184.69.21",
"86.120.183.166",
"91.133.69.226",
"19.111.62.200",
"71.241.160.194",
"169.180.151.86",
"114.136.215.231",
"127.0.0.1",
"192.168.1.64.",
"320.410.0.1",
"192.168",
"127.0.0.1.5",
"10.01.1.02",
"10.0.0.02",
"1...0"
]

exp = r"^\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\b$"

def main():

`