Regex Tutorial
1. History of Regex
Regular expressions (Regex) were first introduced in 1951, when mathematician Stephen Cole Kleene described regular languages using a mathematical notation he called regular events. In 1973, Ken Thompson wrote a program called grep. It is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p (globally search for a regular expression and print the matching lines). Grep was originally developed for the Unix operating system, but later became available for all Unix-like systems (e.g. Linux, macOS, BSD, Solaris).
Regular expressions are not a programming language. They describe search patterns that are used for matching, searching and replacing text. They allow you to find text patterns, extract parts of long strings or validate data submitted by users.
Today, Regex is widely supported in programming languages (Java, JavaScript, Perl, Python, C#, etc.), text processing programs, advanced text editors, and some other programs. Regex is an essential skill for any computer programmer to have.
2. Getting started
2.1 Different regular expression engines
There are different implementations (versions) of regular expressions. These are the most popular:
- C/C++
- Java
- JavaScript
- .NET
- Perl
- PHP
- Python
- Ruby
- Unix
- Apache
- MySQL
2.2 Choose a regular expression engine to get started
- RegExr - recommended for beginners, online tool, no installation required, cross-platform, supports JavaScript and PHP engines
- Regex101 - online tool, no installation required, cross-platform, supports PHP, JavaScript, Python, Golang and Java engines
- RegExPal - online tool, no installation required, cross-platform, supports JavaScript and PHP engines
- grep, egrep - Essential command line tools in Unix and Unix-like systems
- Visual Studio Code - text editor for Microsoft Windows, Linux and macOS
- Atom - text editor for macOS, Linux, Microsoft Windows and Chrome OS
- Notepad++ - text editor for Microsoft Windows
- TextMate - text editor for macOS
3. Parts of Regex syntax
Below is an example of a regular expression with its sections of syntax marked. This particular pattern can be used to check whether an entered email address is valid.
4. Delimiters
Delimiters indicate the start and end of the regular expression pattern. They depend on the Regex implementation. Most frequently the pattern is delimited with forward slashes, /abc/ (JavaScript, PHP, Perl), where “abc” is the regular expression. The forward slashes are not part of the expression; they are just the delimiters that hold it. In some implementations the pattern is written as a string instead, e.g. in double quotes "abc" (Python) or backticks `abc` (Golang).
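As a small illustration of the Python case, here is a minimal sketch using the standard re module. The sample text and variable names are made up for this example.

```python
import re

# Python has no /.../ delimiters: the pattern is passed as an ordinary string.
# A raw string (r"...") is commonly used so backslashes reach the regex engine unchanged.
pattern = re.compile(r"abc")

match = pattern.search("xxabcxx")
print(match.group())  # abc
print(match.start())  # 2
```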
5. Flags
There are different search modes (flags) which can be used within regular expressions:
Flags | Syntax | Meaning |
---|---|---|
standard | /abc/ | Find first match |
global | /abc/g | Find all matches, look globally |
case insensitive | /abc/i | Find matches regardless of uppercase or lowercase letters |
multiline | /abc/m | Let ^ and $ match at the start and end of every line, not only of the whole text |
NOTE: Case insensitive mode is not recommended. A safer way is to handle letter case within the pattern itself, e.g. [Aa]bc
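The flags above can be tried out in Python, which is used here only as an example engine. This is a rough sketch: Python has no global flag, so re.search and re.findall stand in for "first match" and "all matches", and the sample text is invented.

```python
import re

text = "abc\nABC\nabc"  # three lines of sample text

first = re.search(r"abc", text)                      # like /abc/: first match only
all_matches = re.findall(r"abc", text)               # like /abc/g: every match
any_case = re.findall(r"abc", text, re.IGNORECASE)   # like /abc/gi
per_line = re.findall(r"^abc$", text, re.MULTILINE)  # like /^abc$/gm
whole_text = re.findall(r"^abc$", text)              # like /^abc$/g

print(first.start())  # 0
print(all_matches)    # ['abc', 'abc']
print(any_case)       # ['abc', 'ABC', 'abc']
print(per_line)       # ['abc', 'abc']
print(whole_text)     # []
```

Without the multiline flag the last pattern finds nothing, because ^ and $ then refer to the start and end of the whole text.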
6. Literal characters
Literal characters, such as the letters from a to z, represent themselves.
Example | Text matching |
---|---|
/al/ | the first occurrence of al, e.g. the al in seal |
/al/g | every occurrence of al |
7. Metacharacters
Metacharacters are non-letter characters, \.*+-{}[]^$|?():!= , which have special meanings.
Metacharacter | Meaning |
---|---|
. | Any character except newline character (Wildcard) |
[] | A set of characters (Character sets) |
^ | Starts with (Anchors) |
$ | Ends with (Anchors) |
* | Zero or more occurrences (Quantifiers) |
+ | One or more occurrences (Quantifiers) |
? | Zero or one occurrence (Quantifiers) |
{} | Exactly the specified number of occurrences (Quantifiers) |
() | Character group (Groups) |
| | Either or (Alternation) |
\ | Escape the next metacharacter and treat it as literal character (Escapes) |
\t | Tab character (Shorthand character classes) |
\n | New line in Unix and Unix-like systems (Linux, macOS, BSD) |
\r\n | New line in Microsoft Windows |
7.1 Wildcard
The wildcard metacharacter . matches any single character except a newline. If you need to match a literal period (dot), put the escape metacharacter before it: \.
Example | Text matching |
---|---|
/9.00/g | 9, then any single character, then 00, e.g. 9.00, 9500 or 9-00 |
/9\.00/g | only the literal text 9.00 |
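The two patterns in the table can be compared with a short Python sketch. The sample strings are made up for illustration; the engine is Python's standard re module.

```python
import re

text = "9.00 9500 9-00 900"

wildcard = re.findall(r"9.00", text)     # . stands for any single character
literal = re.findall(r"9\.00", text)     # \. stands for a real period
across_lines = re.findall(r"9.00", "9\n00")  # . does not match a newline

print(wildcard)      # ['9.00', '9500', '9-00']
print(literal)       # ['9.00']
print(across_lines)  # []
```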
7.2 Escapes
The backslash metacharacter \ escapes the next metacharacter, which means that the next metacharacter is treated as a literal character.
Escaping is only for metacharacters. Literal characters should never be escaped, as escaping gives them a different meaning (see Shorthand character classes). Quotation marks do not need to be escaped, as they are usually not metacharacters.
Example | Text matching |
---|---|
/9.00/g | the unescaped . is a wildcard, so 9.00, 9500 and 9-00 all match |
/9\.00/g | the escaped \. is a literal period, so only 9.00 matches |
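When the text to search for comes from somewhere else, the escaping can be done by a helper instead of by hand. The sketch below assumes Python, whose standard re module provides re.escape for this purpose; other engines may or may not offer an equivalent.

```python
import re

text = "9.00 9500 9-00"
literal = "9.00"               # text we want to find exactly as written

pattern = re.escape(literal)   # adds the backslash before the period

print(pattern)                    # 9\.00
print(re.findall(pattern, text))  # ['9.00']
print(re.findall(literal, text))  # ['9.00', '9500', '9-00']
```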
7.3 Character sets
With square brackets [] you match one of several characters inside the brackets.
Metacharacters inside square brackets usually do not need to be escaped, except ] - ^ \
Character set | Meaning |
---|---|
[abcd] | character a, b, c or d |
[a-d] | character a, b, c or d |
[^ab] | any character except a and b |
[123] | any of the digits 1, 2 or 3 |
[0-9] | any digit |
[7-9][0-9] | any two-digit number from 70 to 99 |
7[0-9]9 | any three-digit number starting with 7 and ending with 9 (note that [70-99] would not work: it is a single set and matches just one digit) |
[a-z] | any lowercase letter |
[A-Za-z] | any uppercase or lowercase letter |
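Example | Text matching |
---|---|
/s[ea]l/g | sel or sal |
/s[aei]l/g | sal, sel or sil |
/se[a-z]/g | se followed by any lowercase letter, e.g. sea or sel |
/s[^a]l/g | s, any character except a, then l, e.g. sel or sil |
/s[^ei]l/g | s, any character except e and i, then l, e.g. sal |
/s[a.]l/g | sal or s.l (the period is a literal character inside a set) |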
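A few of these sets can be checked with a minimal Python sketch. The sample text is invented for the example and the results are what Python's re module returns; other engines should behave the same for these simple sets.

```python
import re

text = "sal sel sil sol s.l seal"

e_or_a = re.findall(r"s[ea]l", text)
a_e_i = re.findall(r"s[aei]l", text)
not_a = re.findall(r"s[^a]l", text)
a_or_dot = re.findall(r"s[a.]l", text)   # the period is literal inside []
seventies_up = re.findall(r"[7-9][0-9]", "69 70 85 99")

print(e_or_a)        # ['sal', 'sel']
print(a_e_i)         # ['sal', 'sel', 'sil']
print(not_a)         # ['sel', 'sil', 'sol', 's.l']
print(a_or_dot)      # ['sal', 's.l']
print(seventies_up)  # ['70', '85', '99']
```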
7.4 Quantifiers
The metacharacters *, +, ? and {} specify how many times the preceding subpattern can occur. Their effect depends on context: for example, a ? placed directly after another quantifier makes that quantifier lazy instead of matching zero or one occurrence.
Quantifier | Meaning | Alternative |
---|---|---|
* | Matches zero or more repetitions of the preceding character | {0,} |
+ | Matches one or more repetitions of the preceding character | {1,} |
? | Matches zero or one repetition of the preceding character | {0,1} |
{n} | Matches exactly n repetitions of the preceding character | |
{min,} | Matches min or more repetitions of the preceding character | |
{0,max} | Matches max or fewer repetitions of the preceding character | |
{min,max} | Matches at least min repetitions but no more than max repetitions of the preceding character | |
7.4.1 Greedy
- Generally, a greedy quantifier matches the longest possible string.
- By default, all quantifiers are greedy.
- The regex engine matches as much as possible before giving control to the next part of the expression.
7.4.2 Lazy
- Generally, a lazy quantifier matches the shortest possible string.
- To make a quantifier lazy, just append ? to the existing quantifier, e.g. *?, +?, {0,2}?.
- The regex engine matches as little as possible before giving control to the next part of the expression.
7.4.3 Examples
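Example | Text matching |
---|---|
/s.?a/g | s, at most one other character, then a, e.g. sa or sea |
/s.+a/g | from an s to the last a on the line, with at least one character in between |
/s.+?a/g | from an s to the nearest following a, with at least one character in between |
/s.*a/g (greedy) | from an s to the last a on the line |
/s.*?a/g (lazy) | from an s to the nearest following a |
/s.{1}a/g | s, exactly one other character, then a |
/s.{1,2}a/g | s, one or two other characters, then a |
/s.{1,}l/g (greedy) | from an s to the last l on the line, with at least one character in between |
/s.{1,}?l/g (lazy) | from an s to the nearest following l, with at least one character in between |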
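The difference between greedy and lazy quantifiers is easiest to see side by side. The following is a minimal sketch in Python with an invented sample text; the patterns are taken from the table above.

```python
import re

text = "sea salsa"

greedy_plus = re.findall(r"s.+a", text)    # grabs as much as possible
lazy_plus = re.findall(r"s.+?a", text)     # stops at the nearest a
greedy_star = re.findall(r"s.*a", text)
lazy_star = re.findall(r"s.*?a", text)
optional = re.findall(r"s.?a", text)       # zero or one character between s and a

print(greedy_plus)  # ['sea salsa']
print(lazy_plus)    # ['sea', 'salsa']
print(greedy_star)  # ['sea salsa']
print(lazy_star)    # ['sea', 'sa', 'sa']
print(optional)     # ['sea', 'sa', 'sa']
```

Note that the lazy + version still needs at least one character between s and a, which is why it returns salsa where the lazy * version returns sa twice.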
7.5 Groups
Groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses ().
Groups can be used to:
- Create a group of alternative expressions
- Apply repetition operators to a group
- Capture a group for use in matching and replacing
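Example | Text matching |
---|---|
/s(a|e|i)l/g (same as s[aei]l ) | sal, sel or sil |
/s(a|e|i){0,1}a/g | sa, saa, sea or sia |
/(dark)?blue/g | blue or darkblue |
/(dark){0,}blue/g | blue preceded by any number of repetitions of dark, e.g. blue, darkblue or darkdarkblue |
/^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i | a whole string made of six pairs of hexadecimal digits separated by colons (the format of a MAC address) |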
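The three uses of groups can be sketched in Python as follows. The sample strings and variable names are made up for illustration. re.finditer with group(0) is used instead of re.findall, because in Python re.findall returns only the captured group when a pattern contains one.

```python
import re

# Repetition applied to a group: an optional "dark" prefix.
colors = [m.group(0) for m in re.finditer(r"(dark)?blue", "blue darkblue lightblue")]
print(colors)  # ['blue', 'darkblue', 'blue']

# A repeated group used for validation: six colon-separated hex pairs.
mac_pattern = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$", re.IGNORECASE)
is_mac = bool(mac_pattern.search("00:1A:2b:3c:4D:5e"))
too_short = bool(mac_pattern.search("00:1A:2b"))
print(is_mac, too_short)  # True False

# A captured group reused in a replacement: \1 is the text matched by (dark).
replaced = re.sub(r"(dark)blue", r"\1green", "darkblue lightblue")
print(replaced)  # darkgreen lightblue
```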
7.6 Alternations
You can use the vertical bar metacharacter | to match any one of a series of patterns, where the | character separates each pattern.
The metacharacter | is used as an “OR” operator.
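Example | Text matching |
---|---|
/s(a|e|i)l/g | sal, sel or sil |
/(dark|light)blue/g | darkblue or lightblue, but not blue on its own |
/(dark|light)?blue/g | blue, darkblue or lightblue |
/(dark|light){0,}blue/g | blue preceded by any number of repetitions of dark or light, e.g. blue, darkblue or lightdarkblue |
/dark(blue|green)?/g (greedy) | dark together with a directly following blue or green when there is one: dark, darkblue or darkgreen |
/dark(blue|green)??/g (lazy) | only dark, even inside darkblue or darkgreen |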
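The alternation examples can be reproduced with a short Python sketch. The helper name full_matches and the sample texts are hypothetical and exist only for this illustration; the helper returns the whole matched text rather than the captured group.

```python
import re

def full_matches(pattern, text):
    """Return the complete text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

shades = "blue darkblue lightblue"
required = full_matches(r"(dark|light)blue", shades)
optional = full_matches(r"(dark|light)?blue", shades)

darks = "dark darkblue darkgreen"
greedy = full_matches(r"dark(blue|green)?", darks)
lazy = full_matches(r"dark(blue|green)??", darks)

print(required)  # ['darkblue', 'lightblue']
print(optional)  # ['blue', 'darkblue', 'lightblue']
print(greedy)    # ['dark', 'darkblue', 'darkgreen']
print(lazy)      # ['dark', 'dark', 'dark']
```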
7.7 Anchors
Anchors reference a position, not an actual character.
Anchors | Meaning |
---|---|
^ | Start of string/line |
$ | End of string/line |
\A (not all engines) | Start of string, never start of line |
\Z (not all engines) | End of string, never end of line |
\b | Word boundary (start/end of word) |
\B | Not a word boundary |
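Example | Text matching |
---|---|
/^\w+ \w+$/g | San Francisco (the whole text is exactly two words separated by one space) |
/^\w+ \w+$/g | no match in San Francisco San Francisco, because the text as a whole is not exactly two words |
/\b\w+\b\w+\b/g | no match in San Francisco San Francisco, because a word boundary cannot occur between two word characters |
/\b\w+\b \b\w+\b/g | two whole words separated by one space, e.g. San Francisco |
/\B\w+\B/g | the inner characters of a word, without its first and last character |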
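The anchor examples can be verified with a minimal Python sketch. It assumes the two-line sample text shown in the code, which is made up for this example; re.MULTILINE plays the role of the m flag.

```python
import re

one_line = "San Francisco"
two_lines = "San Francisco\nSan Francisco"

whole = re.findall(r"^\w+ \w+$", one_line)
no_flag = re.findall(r"^\w+ \w+$", two_lines)                 # ^ and $ refer to the whole text
per_line = re.findall(r"^\w+ \w+$", two_lines, re.MULTILINE)  # ^ and $ refer to each line

words = re.findall(r"\b\w+\b", one_line)   # whole words
inner = re.findall(r"\B\w+\B", one_line)   # words without first and last character

print(whole)     # ['San Francisco']
print(no_flag)   # []
print(per_line)  # ['San Francisco', 'San Francisco']
print(words)     # ['San', 'Francisco']
print(inner)     # ['a', 'rancisc']
```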
7.8 Shorthand character classes
Other characters with special meaning
Shorthand | Meaning |
---|---|
\n | New line |
\r | Carriage return |
\t | Tab |
\v | Vertical tab |
\d | Any digit: [0-9] |
\w | Any word character including underscore: [a-zA-Z0-9_] |
[\w\-] | Any word character including underscore and hyphen: [a-zA-Z0-9_\-] |
\s | Whitespace (space, tab and line breaks): [ \t\r\n\f\v] |
\D | Not digit: [^\d] or [^0-9] |
\W | Not word character: [^\w] or [^a-zA-Z0-9_] |
\S | Not whitespace: [^\s] or [^ \t\r\n\f\v] |
All letters, digits, underscores and hyphens can be matched with [\w\-].
Characters that are neither digits nor whitespace can be matched with [^\d\s]. Don’t use [\D\S]: it means “either not a digit or not a whitespace character”, which is true of every character.
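The difference between [^\d\s] and [\D\S] shows up clearly in a quick test. This is a minimal Python sketch with invented sample strings.

```python
import re

sample = "a1 b_"

neither = re.findall(r"[^\d\s]", sample)   # not a digit and not whitespace
either = re.findall(r"[\D\S]", sample)     # not a digit OR not whitespace: everything
identifiers = re.findall(r"[\w\-]+", "user-name_01 x")
numbers = re.findall(r"\d+", "room 42, floor 7")

print(neither)      # ['a', 'b', '_']
print(either)       # ['a', '1', ' ', 'b', '_']
print(identifiers)  # ['user-name_01', 'x']
print(numbers)      # ['42', '7']
```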
8. Further resources
- Learning Regular Expressions by Kevin Skoglund
- RegExr - an HTML/JS based tool for creating, testing, and learning about Regular Expressions
- W3 Schools: Python RegEx