Regex Tutorial

1. History of Regex

Regular expressions (Regex) were first introduced in 1951, when mathematician Stephen Cole Kleene described regular languages using a mathematical notation he called regular events. In 1973 Ken Thompson wrote a program called grep, a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p (globally search for a regular expression and print matching lines). Grep was originally developed for the Unix operating system and later became available for all Unix-like systems (e.g. Linux, macOS, BSD, Solaris).

Regular expressions are not a programming language - they are used for matching, searching and replacing text by describing search patterns. They allow you to find text patterns, extract parts of long strings or validate data submitted by users.

Today, Regex is widely supported in programming languages (Java, JavaScript, Perl, Python, C#, etc.), text processing programs, advanced text editors, and some other programs. Regex is an essential skill for any computer programmer to have.

2. Getting started

2.1 Different regular expression engines

There are different implementations (engines) of regular expressions. These are the most popular:

C/C++
Java
JavaScript
.NET
Perl
PHP
Python
Ruby
Unix
Apache
MySQL

2.2 Choose a regular expression engine to get started

RegExr - recommended for beginners, online tool, no installation required, cross-platform, supports JavaScript and PHP engines
Regex101 - online tool, no installation required, cross-platform, supports PHP, JavaScript, Python, Golang and Java engines
RegExPal - online tool, no installation required, cross-platform, supports JavaScript and PHP engines
grep, egrep - Essential command line tools in Unix and Unix-like systems
Visual Studio Code - text editor for Microsoft Windows, Linux and macOS
Atom - text editor for macOS, Linux, Microsoft Windows and Chrome OS
Notepad++ - text editor for Microsoft Windows
TextMate - text editor for macOS

3. Parts of Regex syntax

Below is an example of a regular expression with its sections of syntax marked. This particular expression can be used to check whether an entered email address is valid.

/^([a-z0-9_.-]+)@([a-z0-9_.-]+)\.([a-z.]{2,6})$/gi

Delimiters and Flags, Literal Characters, Escapes,
Character Sets, Groups, Quantifiers, Anchors
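As a minimal sketch, the email pattern above could be tried out as follows. It assumes Python's `re` module, which the tutorial itself does not use, and the helper name `split_email` is hypothetical. The three groups of the pattern come back as separate values.

```python
import re

# The email pattern from the example above; the /i flag becomes re.IGNORECASE.
# The /g flag has no counterpart here: with ^ and $ there is at most one match.
EMAIL_PATTERN = re.compile(
    r"^([a-z0-9_.-]+)@([a-z0-9_.-]+)\.([a-z.]{2,6})$",
    re.IGNORECASE,
)


def split_email(address):
    """Return the three captured groups, or None if the pattern does not match.

    Hypothetical helper, used only for this illustration.
    """
    match = EMAIL_PATTERN.match(address)
    return match.groups() if match else None


print(split_email("John.Doe@example.com"))  # ('John.Doe', 'example', 'com')
print(split_email("not an email"))          # None
```

A pattern this simple accepts some invalid addresses and rejects some valid ones, so it is best treated as a syntax demonstration.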

4. Delimiters

Delimiters indicate the start and end of the regular expression pattern. The delimiter depends on the Regex implementation; most frequently the pattern is delimited with forward slashes, as in /abc/ (JavaScript, PHP, Perl), where “abc” is the regular expression. The forward slashes are not part of the expression; they are just the delimiters that hold it. In some implementations the pattern is written as a string instead, e.g. in double quotes "abc" (Python) or backticks `abc` (Golang).
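The sketch below, which assumes Python's `re` module, illustrates that delimiters are not part of the pattern. In Python the pattern is an ordinary string (a raw string is the usual choice), so slashes written inside it are matched literally.

```python
import re

# A raw string (r"...") keeps backslashes intact; no delimiters are needed.
pattern = re.compile(r"abc")
found = pattern.search("xxabcxx")
print(found.group(), found.start())  # abc 2

# Slashes written inside the string are ordinary literal characters,
# so this pattern looks for the five characters "/abc/" and finds nothing.
with_slashes = re.search(r"/abc/", "xxabcxx")
print(with_slashes)  # None
```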

5. Flags

There are different search modes (flags) which can be used within regular expressions:

Flags	Syntax	Meaning
standard	`/abc/`	Find first match
global	`/abc/g`	Find all matches, look globally
case insensitive	`/abc/i`	Find matches regardless of uppercase or lowercase letters
multiline	`/abc/m`	Make `^` and `$` match at the start and end of every line, not only of the whole text

NOTE: Using case-insensitive mode is not recommended; a safer way is to set case sensitivity within the pattern itself, e.g. with a set such as [A-Za-z].
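The flags have rough counterparts in other engines. The following sketch assumes Python's `re` module, where "global" is a choice of function (`re.findall` instead of `re.search`) and not a flag, while case-insensitive and multiline modes are the constants `re.IGNORECASE` and `re.MULTILINE`. The sample text is made up for illustration.

```python
import re

text = "Seal\nseal SEAL"  # two lines, illustrative sample text

# Standard: only the first match is reported.
first = re.search(r"seal", text)
print(first.start())  # 5

# Global: every match is reported.
print(re.findall(r"seal", text))  # ['seal']

# Case insensitive: upper and lower case are treated alike.
print(re.findall(r"seal", text, re.IGNORECASE))  # ['Seal', 'seal', 'SEAL']

# Multiline: ^ also matches right after each newline.
print(re.findall(r"^seal", text, re.MULTILINE))  # ['seal']
print(re.findall(r"^seal", text))                # []
```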

6. Literal characters

Literal characters, such as the letters a to z, represent themselves.

Example	Text matching
`/al/`	seal, sell, salad, proposal, silk
`/al/g`	seal, sell, salad, proposal, silk
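To see which parts of the sample text the two examples select, here is a small sketch assuming Python's `re` module. `re.search` plays the role of the standard mode and `re.finditer` the role of the global mode.

```python
import re

text = "seal, sell, salad, proposal, silk"

# Standard mode: the first "al" only (inside "seal").
first = re.search(r"al", text)
print(first.span())  # (2, 4)

# Global mode: every "al" (inside "seal", "salad" and "proposal").
spans = [m.span() for m in re.finditer(r"al", text)]
print(spans)  # [(2, 4), (13, 15), (25, 27)]
```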

7. Metacharacters

Metacharacters are non-letter characters such as \.*+-{}[]^$|?():!= that have special meanings.

Metacharacter	Meaning
`.`	Any character except newline character (Wildcard)
`[]`	A set of characters (Character sets)
`^`	Starts with (Anchors)
`$`	Ends with (Anchors)
`*`	Zero or more occurrences (Quantifiers)
`+`	One or more occurrences (Quantifiers)
`?`	Zero or one occurrence (Quantifiers)
`{}`	Exactly the specified number, or range, of occurrences (Quantifiers)
`()`	Group of characters (Groups)
`|`	Either or (Alternation)
`\`	Escape the next metacharacter and treat it as literal character (Escapes)
`\t`	Tab character (Shorthand character classes)
`\n`	New line in Unix and Unix-like systems (Linux, macOS, BSD)
`\r\n`	New line in Microsoft Windows

7.1 Wildcard

The wildcard metacharacter . matches any single character except a newline. If you need to match a literal period (dot), put the escape metacharacter before it: \.

Example	Text matching
`/9.00/g`	9.00 9500 9-00
`/9\.00/g`	9.00 9500 9-00
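The difference between the two patterns can be checked with a short sketch, assuming Python's `re` module:

```python
import re

text = "9.00 9500 9-00"

# Unescaped dot: any single character may stand between 9 and 00.
wildcard = re.findall(r"9.00", text)
print(wildcard)  # ['9.00', '9500', '9-00']

# Escaped dot: only a literal period is accepted.
literal_dot = re.findall(r"9\.00", text)
print(literal_dot)  # ['9.00']
```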

7.2 Escapes

The backslash metacharacter \ escapes the next metacharacter - this means the next metacharacter will be treated as a literal character.

Escaping is only for metacharacters. Literal characters should never be escaped, as escaping gives them a different meaning (see Shorthand character classes). Quotation marks do not need to be escaped, as they are usually not metacharacters.

Example	Text matching
`/9.00/g`	9.00 9500 9-00
`/9\.00/g`	9.00 9500 9-00
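When the text to search for comes from a variable, escaping by hand is easy to get wrong. As a sketch, assuming Python's `re` module, `re.escape` adds the backslashes for you:

```python
import re

needle = "9.00"  # text that should be matched literally

# re.escape puts a backslash in front of the metacharacter ".".
escaped = re.escape(needle)
print(escaped)  # 9\.00

hits = re.findall(escaped, "9.00 9500 9-00")
print(hits)  # ['9.00']
```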

7.3 Character sets

With square brackets [] you match any one of the characters listed inside the brackets.

Metacharacters inside square brackets usually do not need to be escaped - except ]-^\ .

Character set	Meaning
`[abcd]`	character `a`, `b`, `c` or `d`
`[a-d]`	character `a`, `b`, `c` or `d`
`[^ab]`	any character except `a` and `b`
`[123]`	any of the digits `1`, `2` or `3`
`[0-9]`	any digit
`[7-9][0-9]`	any two-digit numbers from `70` to `99`
`[70-99]`	any single digit, not the numbers `70` to `99`: the set contains `7`, the range `0-9` and `9`
`[a-z]`	any lowercase letter
`[A-Za-z]`	any uppercase or lowercase letter

Example	Text matching
`/s[ea]l/g`	seal, sell, salad, proposal, silk
`/s[aei]l/g`	seal, sell, salad, proposal, silk
`/se[a-z]/g`	seal, sell, salad, proposal, silk
`/s[^a]l/g`	seal, sell, salad, proposal, silk
`/s[^ei]l/g`	seal, sell, salad, proposal, silk
`/s[a.]l/g`	seal, sell, salad, proposal, silk
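The character-set examples can be run as the following sketch, which assumes Python's `re` module. It also shows that a range inside brackets works on single characters, not on numbers.

```python
import re

text = "seal, sell, salad, proposal, silk"

print(re.findall(r"s[ea]l", text))   # ['sel', 'sal', 'sal']
print(re.findall(r"s[aei]l", text))  # ['sel', 'sal', 'sal', 'sil']
print(re.findall(r"se[a-z]", text))  # ['sea', 'sel']
print(re.findall(r"s[^a]l", text))   # ['sel', 'sil']
print(re.findall(r"s[^ei]l", text))  # ['sal', 'sal']
# Inside a set the dot is a literal period, so only "sal" is found.
print(re.findall(r"s[a.]l", text))   # ['sal', 'sal']

# Ranges are per character: [7-9][0-9] covers 70 to 99,
# while [70-99] is just a set of single digits.
print(re.findall(r"[7-9][0-9]", "69 70 85 99"))  # ['70', '85', '99']
print(re.findall(r"[70-99]", "70 85"))           # ['7', '0', '8', '5']
```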

7.4 Quantifiers

The metacharacters *, +, ? or {} are used to specify how many times a preceding subpattern can occur. These metacharacters act differently in different situations.

Quantifier	Meaning	Alternative
`*`	Matches zero or more repetitions of the preceding character	`{0,}`
`+`	Matches one or more repetitions of the preceding character	`{1,}`
`?`	Matches zero or one repetition of the preceding character	`{0,1}`
`{n}`	Matches exactly n repetitions of the preceding character
`{min,}`	Matches min or more repetitions of the preceding character
`{0,max}`	Matches max or fewer repetitions of the preceding character
`{min,max}`	Matches at least min repetitions but no more than max repetitions of the preceding character

7.4.1 Greedy

Generally, a greedy quantifier will match the longest possible string.
By default, all quantifiers are greedy.
The regex matches as much as possible before giving control to the next part of the expression.

7.4.2 Lazy

Generally, a lazy quantifier will match the shortest possible string.
To make a quantifier lazy, just append ? to the existing quantifier, e.g. *?, +?, {0,2}?.
The regex matches as little as possible before giving control to the next part of the expression.
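A compact way to compare the two behaviours is to match bracketed tags. The sketch below assumes Python's `re` module, and the sample text is made up for illustration.

```python
import re

text = "<a><b>"  # illustrative sample text

# Greedy: .+ runs to the last ">" it can reach.
greedy = re.findall(r"<.+>", text)
print(greedy)  # ['<a><b>']

# Lazy: .+? stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", text)
print(lazy)  # ['<a>', '<b>']
```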

7.4.3 Examples

Example	Text matching
`/s.?a/g`	seal, sell, salad, proposal, silk
`/s.+a/g`	seal, sell, salad, proposal, silk
`/s.+?a/g`	seal, sell, salad, proposal, silk
`/s.*a/g` (greedy)	seal, sell, salad, proposal, silk
`/s.*?a/g` (lazy)	seal, sell, salad, proposal, silk
`/s.{1}a/g`	seal, sell, salad, proposal, silk
`/s.{1,2}a/g`	seal, sell, salad, proposal, silk
`/s.{1,}l/g` (greedy)	seal, sell, salad, proposal, silk
`/s.{1,}?l/g` (lazy)	seal, sell, salad, proposal, silk
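Some of the examples above are worth running, because the greedy results are easy to misjudge by eye. This is a sketch assuming Python's `re` module.

```python
import re

text = "seal, sell, salad, proposal, silk"

# Optional single character between s and a.
print(re.findall(r"s.?a", text))      # ['sea', 'sa', 'sa']

# Greedy: one long match from the first s to the last a.
print(re.findall(r"s.*a", text))      # ['seal, sell, salad, proposa']

# Lazy: each match ends at the nearest a.
print(re.findall(r"s.*?a", text))     # ['sea', 'sell, sa', 'sa']

# Between one and two characters before the a.
print(re.findall(r"s.{1,2}a", text))  # ['sea', 'sala']

# Lazy {1,}: shortest stretch of at least one character before an l.
print(re.findall(r"s.{1,}?l", text))  # ['seal', 'sel', 'sal', 'sal', 'sil']
```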

7.5 Groups

Groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses ().
Groups can be used to:
Create a group of alternative expressions
Apply repetition operators to a group
Capture a group for use in matching and replacing

Example	Text matching
`/s(a|e|i)l/g` (same as `s[aei]l`)	seal, sell, salad, proposal, silk
`/s(a|e|i){0,1}a/g`	seal, sell, salad, proposal, silk
`/(dark)?blue/g`	lightblue, darkblue, darkdarkblue
`/(dark){0,}blue/g`	lightblue, darkblue, darkdarkblue
`/^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i`	23:32:A5:98:F1:CD
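The three uses of groups can be sketched as follows, assuming Python's `re` module. The helper `whole_matches` is hypothetical. It is there because `re.findall` returns only the captured group when a pattern contains one, whereas the examples are about the whole match.

```python
import re


def whole_matches(pattern, text, flags=0):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]


colors = "lightblue, darkblue, darkdarkblue"

# Repetition applied to a group: zero or one "dark", then zero or more.
print(whole_matches(r"(dark)?blue", colors))     # ['blue', 'darkblue', 'darkblue']
print(whole_matches(r"(dark){0,}blue", colors))  # ['blue', 'darkblue', 'darkdarkblue']

# A repeated group keeps only its last repetition as the captured value.
mac = re.match(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$", "23:32:A5:98:F1:CD", re.IGNORECASE)
print(mac.group(1))  # :CD

# Captured groups can be reused in a replacement as \1, \2, ...
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "San Francisco")
print(swapped)  # Francisco San
```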

7.6 Alternations

You can use the vertical bar | metacharacter to match any one of a series of patterns, where the | character separates each pattern.
Metacharacter | is used as an “OR” operator

Example	Text matching
`/s(a|e|i)l/g`	seal, sell, salad, proposal, silk
`/(dark|light)blue/g`	blue, lightblue, darkblue, darkdarkblue
`/(dark|light)?blue/g`	blue, lightblue, darkblue, darkdarkblue
`/(dark|light){0,}blue/g`	blue, lightblue, darkblue, darkdarkblue
`/dark(blue|green)?/g` (greedy)	dark, darkblue, darkgreen
`/dark(blue|green)??/g` (lazy)	dark, darkblue, darkgreen
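The alternation examples can be checked with the sketch below, which assumes Python's `re` module. The helper `whole_matches` is hypothetical and collects the full matched text, since `re.findall` would return only the group contents here.

```python
import re


def whole_matches(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


colors = "blue, lightblue, darkblue, darkdarkblue"

print(whole_matches(r"(dark|light)blue", colors))
# ['lightblue', 'darkblue', 'darkblue']
print(whole_matches(r"(dark|light)?blue", colors))
# ['blue', 'lightblue', 'darkblue', 'darkblue']
print(whole_matches(r"(dark|light){0,}blue", colors))
# ['blue', 'lightblue', 'darkblue', 'darkdarkblue']

shades = "dark, darkblue, darkgreen"

# Greedy ? takes the optional group whenever it can.
print(whole_matches(r"dark(blue|green)?", shades))   # ['dark', 'darkblue', 'darkgreen']
# Lazy ?? skips the optional group whenever it can.
print(whole_matches(r"dark(blue|green)??", shades))  # ['dark', 'dark', 'dark']
```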

7.7 Anchors

Anchors match a position, not an actual character.

Anchors	Meaning
`^`	Start of string/line
`$`	End of string/line
`\A` (not all engines)	Start of string, never start of line
`\Z` (not all engines)	End of string, never end of line
`\b`	Word boundary (start/end of word)
`\B`	Not a word boundary

Example	Text matching
`/^\w+ \w+$/g`	San Francisco
`/^\w+ \w+$/g`	San Francisco San Francisco
`/\b\w+\b\w+\b/g`	San Francisco San Francisco
`/\b\w+\b \b\w+\b/g`	San Francisco San Francisco
`/\B\w+\B/g`	San Francisco San Francisco
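Because anchors match positions, their effect is easiest to see by running the examples. The sketch below assumes Python's `re` module, which is one of the engines that supports `\A`.

```python
import re

once = "San Francisco"
twice = "San Francisco San Francisco"

# ^ and $ pin the pattern to the whole text: exactly two words.
print(re.findall(r"^\w+ \w+$", once))   # ['San Francisco']
print(re.findall(r"^\w+ \w+$", twice))  # []

# Without the space no match is possible: a word boundary
# cannot sit between two word characters.
print(re.findall(r"\b\w+\b\w+\b", twice))     # []
print(re.findall(r"\b\w+\b \b\w+\b", twice))  # ['San Francisco', 'San Francisco']

# \B keeps both ends of the match away from the edges of a word.
print(re.findall(r"\B\w+\B", twice))  # ['a', 'rancisc', 'a', 'rancisc']

# In multiline mode ^ matches at every line start, \A only at the very start.
lines = "San\nFrancisco"
print(re.findall(r"^\w+", lines, re.MULTILINE))   # ['San', 'Francisco']
print(re.findall(r"\A\w+", lines, re.MULTILINE))  # ['San']
```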

7.8 Shorthand character classes

Other characters with special meaning

Shorthand	Meaning
`\n`	New line
`\r`	Carriage return
`\t`	Tab
`\v`	Vertical tab
`\d`	Any digit: `[0-9]`
`\w`	Any word character including underscore: `[a-zA-Z0-9_]`
`\w\-`	Any word character including underscore and hyphen: `[a-zA-Z0-9_\-]`
`\s`	Whitespace, including the space character: `[ \t\r\n\f\v]`
`\D`	Not digit: `[^\d]` or `[^0-9]`
`\W`	Not word character: `[^\w]` or `[^a-zA-Z0-9_]`
`\S`	Not whitespace: `[^\s]` or `[^ \t\r\n\f\v]`

All letters, digits, underscores and dashes can be matched with [\w\-]

Any character that is neither a digit nor whitespace can be matched with [^\d\s]. Don’t use [\D\S]: it means “either not a digit or not whitespace”, which matches every character.
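The shorthand classes and the two combined sets can be compared in a short sketch. It assumes Python's `re` module and uses made-up sample strings. In Python 3, `\d` and `\w` also match non-ASCII digits and letters unless the `re.ASCII` flag is given, so the sample text is kept to ASCII.

```python
import re

# Digits only.
print(re.findall(r"\d", "Room 42-B"))  # ['4', '2']

# Word characters plus hyphen, as one run.
print(re.findall(r"[\w\-]+", "user_name-01 @ home"))  # ['user_name-01', 'home']

sample = "a1 b2"

# Neither a digit nor whitespace: only the letters remain.
print(re.findall(r"[^\d\s]", sample))  # ['a', 'b']

# "Not a digit OR not whitespace" is true for every character.
print(re.findall(r"[\D\S]", sample))  # ['a', '1', ' ', 'b', '2']
```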

8. Further resources

Learning Regular Expressions by Kevin Skoglund
RegExr - RegExr is an HTML/JS based tool for creating, testing, and learning about Regular Expressions
W3 Schools: Python RegEx