1. History of Regex

Regular expressions (Regex) were first introduced in 1951, when the mathematician Stephen Cole Kleene described regular languages using a mathematical notation he called regular events. In 1973 Ken Thompson wrote a program called grep, a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p (globally search for a regular expression and print the matching lines). Grep was originally developed for the Unix operating system and later became available on all Unix-like systems (e.g. Linux, macOS, BSD, Solaris).

Regular expressions are not a programming language - they describe search patterns used for matching, searching and replacing text. They allow you to find text patterns, extract parts of long strings or validate data submitted by users.

Today, Regex is widely supported in programming languages (Java, JavaScript, Perl, Python, C#, etc.), text processing programs, advanced text editors, and some other programs. Regex is an essential skill for any computer programmer to have.

2. Getting started

2.1 Different regular expression engines

There are several implementations (engines) of regular expressions, and their syntax and features differ slightly. These are the most popular:

  • C/C++
  • Java
  • JavaScript
  • .NET
  • Perl
  • PHP
  • Python
  • Ruby
  • Unix
  • Apache
  • MySQL

2.2 Choose a regular expression engine to get started

  • RegExr - recommended for beginners, online tool, no installation required, cross-platform, supports JavaScript and PHP engines
  • Regex101 - online tool, no installation required, cross-platform, supports PHP, JavaScript, Python, Golang and Java engines
  • RegExPal - online tool, no installation required, cross-platform, supports JavaScript and PHP engines
  • grep, egrep - Essential command line tools in Unix and Unix-like systems
  • Visual Studio Code - text editor for Microsoft Windows, Linux and macOS
  • Atom - text editor for macOS, Linux, Microsoft Windows and Chrome OS
  • Notepad++ - text editor for Microsoft Windows
  • TextMate - text editor for macOS

3. Parts of Regex syntax

Below is an example of a regular expression with its sections of syntax marked. This particular pattern can be used to check whether an entered email address has a valid format.

/^([a-z0-9_.-]+)@([a-z0-9_.-]+)\.([a-z.]{2,6})$/gi

Delimiters and Flags, Literal Characters, Escapes,
Character Sets, Groups, Quantifiers, Anchors
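
As a rough illustration, the pattern above can be tried out in Python. This is only a sketch: the function name split_email and the sample addresses are hypothetical, and the i flag is expressed as re.IGNORECASE because Python has no /.../ delimiters.

```python
import re

# Pattern copied from the text; the i flag becomes re.IGNORECASE.
# The g flag has no counterpart here: re.match tests a single string.
EMAIL_RE = re.compile(
    r"^([a-z0-9_.-]+)@([a-z0-9_.-]+)\.([a-z.]{2,6})$",
    re.IGNORECASE,
)

def split_email(address):
    """Return the three captured groups, or None if there is no match."""
    m = EMAIL_RE.match(address)
    return m.groups() if m else None

print(split_email("John.Doe@example.com"))  # ('John.Doe', 'example', 'com')
print(split_email("not-an-email"))          # None
```

The three pairs of parentheses are groups, so a successful match also yields the user name, the domain and the top-level part separately.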

4. Delimiters

Delimiters indicate the start and end of the regular expression pattern. Which delimiters are used depends on the Regex implementation. Most frequently the pattern is delimited with forward slashes, /abc/ (JavaScript, PHP, Perl), where “abc” is the regular expression. The forward slashes are not part of the expression; they are just the delimiters that hold it. In other implementations the pattern is written as a string, e.g. in double quotes "abc" (Python) or backticks `abc` (Golang).

5. Flags

There are different search modes (flags) which can be used within regular expressions:

Flag               Syntax    Meaning
standard           /abc/     Find the first match
global             /abc/g    Find all matches, look globally
case insensitive   /abc/i    Find matches regardless of uppercase or lowercase letters
multiline          /abc/m    Let ^ and $ match at the start and end of every line, not only of the whole text

NOTE: Case insensitive mode is not recommended; a safer way is to set case sensitivity within the syntax, e.g. /[Aa]bc/.
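
The effect of each flag can be sketched in Python, which has no /.../ syntax: the g flag roughly corresponds to choosing re.findall over re.search, while i and m correspond to re.IGNORECASE and re.MULTILINE. The sample text is made up for the illustration.

```python
import re

text = "abc ABC\nabc"

# No flag (/abc/): re.search reports only the first match.
first = re.search(r"abc", text)
print(first.span())                 # (0, 3)

# g flag (/abc/g): re.findall reports every match.
all_matches = re.findall(r"abc", text)
print(all_matches)                  # ['abc', 'abc']

# i flag: upper and lower case are treated alike.
any_case = re.findall(r"abc", text, re.IGNORECASE)
print(any_case)                     # ['abc', 'ABC', 'abc']

# m flag: ^ and $ also match at the start and end of each line.
whole_text_only = re.findall(r"^abc$", text)
per_line = re.findall(r"^abc$", text, re.MULTILINE)
print(whole_text_only, per_line)    # [] ['abc']
```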

6. Literal characters

Literal characters, such as the letters a to z, represent themselves.

Example   Text matching
/al/      seal, sell, salad, proposal, silk
/al/g     seal, sell, salad, proposal, silk
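
To see which parts of the sample text the two patterns pick out, here is a small Python sketch (re.search stands in for the pattern without the g flag, re.finditer for the pattern with it):

```python
import re

text = "seal, sell, salad, proposal, silk"

# /al/ : stop at the first occurrence (inside "seal").
first = re.search(r"al", text)
print(first.group(), first.start())   # al 2

# /al/g : collect the start index of every occurrence
# (inside "seal", "salad" and "proposal").
starts = [m.start() for m in re.finditer(r"al", text)]
print(starts)                         # [2, 13, 25]
```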

7. Metacharacters

Metacharacters are the non-letter characters \.*+-{}[]^$|?():!= which have special meanings.

Metacharacter   Meaning
.               Any character except the newline character (Wildcard)
[]              A set of characters (Character sets)
^               Starts with (Anchors)
$               Ends with (Anchors)
*               Zero or more occurrences (Quantifiers)
+               One or more occurrences (Quantifiers)
?               Zero or one occurrence (Quantifiers)
{}              Exactly the specified number of occurrences (Quantifiers)
()              Character group (Groups)
|               Either or (Alternation)
\               Escape the next metacharacter and treat it as a literal character (Escapes)
\t              Tab character (Shorthand character classes)
\n              New line in Unix and Unix-like systems (Linux, macOS, BSD)
\r\n            New line in Microsoft Windows

7.1 Wildcard

The wildcard metacharacter . matches any single character except a newline. To match a literal period (dot), put the escape metacharacter before it: \.

Example     Text matching
/9.00/g     9.00 9500 9-00
/9\.00/g    9.00 9500 9-00
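
The difference between the two rows can be checked with a short Python sketch, using re.findall in place of the g flag:

```python
import re

text = "9.00 9500 9-00"

# Unescaped dot: any character may stand between 9 and 00.
unescaped = re.findall(r"9.00", text)
print(unescaped)   # ['9.00', '9500', '9-00']

# Escaped dot: only a literal period is accepted.
escaped = re.findall(r"9\.00", text)
print(escaped)     # ['9.00']
```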

7.2 Escapes

The backslash metacharacter \ escapes the next metacharacter - this means the next metacharacter will be treated as a literal character.

Escaping is only for metacharacters. Literal characters should never be escaped, as this gives them a different meaning (see Shorthand character classes). Quotation marks do not need to be escaped, as they are usually not metacharacters.

Example     Text matching
/9.00/g     9.00 9500 9-00
/9\.00/g    9.00 9500 9-00
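
A minimal Python sketch of both directions of escaping follows. The variable user_input is a hypothetical value; re.escape is the standard-library helper that adds the backslashes when the text to match is not known in advance.

```python
import re

# Escaping a metacharacter makes it literal...
literal_dot = re.findall(r"9\.00", "9.00 9500 9-00")
print(literal_dot)          # ['9.00']

# ...while escaping a literal character changes its meaning:
# d is the letter d, but \d is the shorthand for "any digit".
plain_d = re.findall(r"d", "d1")
digit = re.findall(r"\d", "d1")
print(plain_d, digit)       # ['d'] ['1']

# re.escape inserts the needed backslashes automatically.
user_input = "9.00"         # hypothetical value to search for
pattern = re.escape(user_input)
from_input = re.findall(pattern, "9.00 9500 9-00")
print(pattern, from_input)  # 9\.00 ['9.00']
```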

7.3 Character sets

With square brackets [] you match any one of the characters listed inside the brackets.

Metacharacters inside square brackets usually do not need to be escaped - except ], -, ^ and \.

Character set   Meaning
[abcd]          character a, b, c or d
[a-d]           character a, b, c or d
[^ab]           any character except a and b
[123]           any of the digits 1, 2 or 3
[0-9]           any digit
[7-9][0-9]      any two-digit number from 70 to 99
7[0-9]9         any three-digit number starting with 7 and ending with 9 (note that [70-99] is a single set equivalent to [0-9], not a numeric range)
[a-z]           any lowercase letter
[A-Za-z]        any uppercase or lowercase letter

Example       Text matching
/s[ea]l/g     seal, sell, salad, proposal, silk
/s[aei]l/g    seal, sell, salad, proposal, silk
/se[a-z]/g    seal, sell, salad, proposal, silk
/s[^a]l/g     seal, sell, salad, proposal, silk
/s[^ei]l/g    seal, sell, salad, proposal, silk
/s[a.]l/g     seal, sell, salad, proposal, silk
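
The character set examples can be reproduced with the following Python sketch. The helper found is a hypothetical convenience wrapper around re.findall, and the digit strings in the last two calls are made-up test data.

```python
import re

text = "seal, sell, salad, proposal, silk"

def found(pattern, subject=text):
    """Return every non-overlapping match, like the g flag."""
    return re.findall(pattern, subject)

print(found(r"s[ea]l"))    # ['sel', 'sal', 'sal']
print(found(r"s[aei]l"))   # ['sel', 'sal', 'sal', 'sil']
print(found(r"se[a-z]"))   # ['sea', 'sel']
print(found(r"s[^a]l"))    # ['sel', 'sil']
print(found(r"s[^ei]l"))   # ['sal', 'sal']
print(found(r"s[a.]l"))    # ['sal', 'sal']  (the dot is literal inside a set)

# A set always matches ONE character, so ranges are per character:
print(found(r"[7-9][0-9]", "69 70 85 99 100"))   # ['70', '85', '99']
print(found(r"[70-99]", "70 85"))                # ['7', '0', '8', '5']
```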

7.4 Quantifiers

The metacharacters *, +, ? and {} specify how many times the preceding subpattern can occur. Their meaning depends on context: for example, a ? placed after another quantifier makes that quantifier lazy instead of marking something as optional.

Quantifier   Meaning                                                                            Alternative
*            Matches zero or more repetitions of the preceding character                        {0,}
+            Matches one or more repetitions of the preceding character                         {1,}
?            Matches zero or one repetition of the preceding character                          {0,1}
{n}          Matches exactly n repetitions of the preceding character
{min,}       Matches min or more repetitions of the preceding character
{0,max}      Matches max or fewer repetitions of the preceding character
{min,max}    Matches at least min but no more than max repetitions of the preceding character

7.4.1 Greedy

  • Generally, a greedy quantifier will match the longest possible string.
  • By default, all quantifiers are greedy.
  • The regex engine matches as much as possible before giving control to the next part of the expression.

7.4.2 Lazy

  • Generally, a lazy quantifier will match the shortest possible string.
  • To make a quantifier lazy, just append ? to the existing quantifier, e.g. *?, +?, {0,2}?.
  • The regex engine matches as little as possible before giving control to the next part of the expression.

7.4.3 Examples

Example                  Text matching
/s.?a/g                  seal, sell, salad, proposal, silk
/s.+a/g                  seal, sell, salad, proposal, silk
/s.+?a/g                 seal, sell, salad, proposal, silk
/s.*a/g (greedy)         seal, sell, salad, proposal, silk
/s.*?a/g (lazy)          seal, sell, salad, proposal, silk
/s.{1}a/g                seal, sell, salad, proposal, silk
/s.{1,2}a/g              seal, sell, salad, proposal, silk
/s.{1,}l/g (greedy)      seal, sell, salad, proposal, silk
/s.{1,}?l/g (lazy)       seal, sell, salad, proposal, silk
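
The contrast between greedy and lazy matching is easiest to see by printing the matches. The following Python sketch runs a selection of the patterns above; found is a hypothetical helper standing in for the g flag.

```python
import re

text = "seal, sell, salad, proposal, silk"

def found(pattern):
    """Return every non-overlapping match, like the g flag."""
    return re.findall(pattern, text)

# Optional and bounded repetition.
print(found(r"s.?a"))       # ['sea', 'sa', 'sa']
print(found(r"s.{1}a"))     # ['sea']
print(found(r"s.{1,2}a"))   # ['sea', 'sala']

# Greedy: run to the LAST possible "a" or "l".
print(found(r"s.*a"))       # ['seal, sell, salad, proposa']
print(found(r"s.{1,}l"))    # ['seal, sell, salad, proposal, sil']

# Lazy: stop at the FIRST possible "a" or "l".
print(found(r"s.*?a"))      # ['sea', 'sell, sa', 'sa']
print(found(r"s.{1,}?l"))   # ['seal', 'sel', 'sal', 'sal', 'sil']
```

Note that a lazy quantifier still has to reach the next required character, which is why the second lazy match of s.*?a runs from "sell" into "salad".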

7.5 Groups

Groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses ().

Groups can be used to:

  • Create a group of alternative expressions
  • Apply repetition operators to a group
  • Capture a group for use in matching and replacing

Example                                Text matching
/s(a|e|i)l/g (same as s[aei]l)         seal, sell, salad, proposal, silk
/s(a|e|i){0,1}a/g                      seal, sell, salad, proposal, silk
/(dark)?blue/g                         lightblue, darkblue, darkdarkblue
/(dark){0,}blue/g                      lightblue, darkblue, darkdarkblue
/^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i      23:32:A5:98:F1:CD
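
The three uses of groups can be sketched in Python as follows. The helper full_matches is hypothetical; it is needed because re.findall returns only the captured group when a pattern contains one. The strings "darkblue sky" and the shortened address are made-up test data.

```python
import re

def full_matches(pattern, text, flags=0):
    """Return the whole text of each match, ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

colors = "lightblue, darkblue, darkdarkblue"

# Repetition applied to a whole group.
print(full_matches(r"(dark)?blue", colors))     # ['blue', 'darkblue', 'darkblue']
print(full_matches(r"(dark){0,}blue", colors))  # ['blue', 'darkblue', 'darkdarkblue']

# A repeated group validating a MAC address (i flag -> re.IGNORECASE).
MAC = r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$"
print(bool(re.search(MAC, "23:32:A5:98:F1:CD", re.IGNORECASE)))  # True
print(bool(re.search(MAC, "23:32:A5:98:F1", re.IGNORECASE)))     # False

# A captured group reused in a replacement as \1.
print(re.sub(r"(dark)blue", r"\1 blue", "darkblue sky"))         # dark blue sky
```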

7.6 Alternations

You can use the vertical bar | metacharacter to match any one of a series of patterns, where the | character separates each pattern. The | metacharacter acts as an “OR” operator.

Example                            Text matching
/s(a|e|i)l/g                       seal, sell, salad, proposal, silk
/(dark|light)blue/g                blue, lightblue, darkblue, darkdarkblue
/(dark|light)?blue/g               blue, lightblue, darkblue, darkdarkblue
/(dark|light){0,}blue/g            blue, lightblue, darkblue, darkdarkblue
/dark(blue|green)?/g (greedy)      dark, darkblue, darkgreen
/dark(blue|green)??/g (lazy)       dark, darkblue, darkgreen
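
A short Python sketch of the alternation examples follows; full_matches is again a hypothetical helper that returns whole matches rather than captured groups.

```python
import re

def full_matches(pattern, text):
    """Return the whole text of each match, ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

blues = "blue, lightblue, darkblue, darkdarkblue"

# One alternative is required, optional, or repeatable.
print(full_matches(r"(dark|light)blue", blues))
# ['lightblue', 'darkblue', 'darkblue']
print(full_matches(r"(dark|light)?blue", blues))
# ['blue', 'lightblue', 'darkblue', 'darkblue']
print(full_matches(r"(dark|light){0,}blue", blues))
# ['blue', 'lightblue', 'darkblue', 'darkdarkblue']

darks = "dark, darkblue, darkgreen"

# Greedy optional group takes the suffix when it can; lazy never does.
print(full_matches(r"dark(blue|green)?", darks))   # ['dark', 'darkblue', 'darkgreen']
print(full_matches(r"dark(blue|green)??", darks))  # ['dark', 'dark', 'dark']
```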

7.7 Anchors

Anchors match a position, not an actual character.

Anchor                  Meaning
^                       Start of string/line
$                       End of string/line
\A (not all engines)    Start of string, never start of line
\Z (not all engines)    End of string, never end of line
\b                      Word boundary (start/end of word)
\B                      Not a word boundary

Example                 Text matching
/^\w+ \w+$/g            San Francisco
/^\w+ \w+$/g            San Francisco San Francisco
/\b\w+\b\w+\b/g         San Francisco San Francisco
/\b\w+\b \b\w+\b/g      San Francisco San Francisco
/\B\w+\B/g              San Francisco San Francisco
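
The behaviour of the anchors can be explored with the Python sketch below. The two sample strings are assumptions made for the illustration (one single-line, one with a line break); Python's re module supports \A, and re.MULTILINE plays the role of the m flag.

```python
import re

one_line = "San Francisco"
two_lines = "San Francisco\nSan Francisco"

# ^ and $ anchor to the whole string unless the multiline flag is set.
single = re.findall(r"^\w+ \w+$", one_line)
no_flag = re.findall(r"^\w+ \w+$", two_lines)
multiline = re.findall(r"^\w+ \w+$", two_lines, re.MULTILINE)
print(single)      # ['San Francisco']
print(no_flag)     # []
print(multiline)   # ['San Francisco', 'San Francisco']

# \b marks the edges of a word; \B requires being inside one.
words = re.findall(r"\b\w+\b", one_line)
inner = re.findall(r"\B\w+\B", one_line)
print(words)       # ['San', 'Francisco']
print(inner)       # ['a', 'rancisc']

# \A ignores the multiline flag; ^ does not.
a_anchor = re.findall(r"\ASan", two_lines, re.MULTILINE)
caret = re.findall(r"^San", two_lines, re.MULTILINE)
print(a_anchor, caret)   # ['San'] ['San', 'San']
```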

7.8 Shorthand character classes

Other characters with special meaning

Shorthand   Meaning
\n          New line
\r          Carriage return
\t          Tab
\v          Vertical tab
\d          Any digit: [0-9]
\w          Any word character including underscore: [a-zA-Z0-9_]
[\w\-]      Any word character including underscore and hyphen: [a-zA-Z0-9_\-]
\s          Whitespace (space, tab, new line and similar): [ \t\r\n\f\v]
\D          Not digit: [^\d] or [^0-9]
\W          Not word character: [^\w] or [^a-zA-Z0-9_]
\S          Not whitespace: [^\s] or [^ \t\r\n\f\v]

All letters, digits, underscores and dashes can be matched with [\w\-]

Characters that are neither digits nor whitespace can be matched with [^\d\s]. Don’t use [\D\S]: it means “either not a digit or not whitespace”, which matches every character.
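
The shorthand classes and the two notes above can be checked with a small Python sketch. The sample strings are made up; note also that in Python 3 \d and \w match non-ASCII digits and letters as well unless re.ASCII is passed, which does not affect these ASCII-only samples.

```python
import re

sample = "a1 b"

digits = re.findall(r"\d", sample)
spaces = re.findall(r"\s", "a b\tc")
print(digits, spaces)        # ['1'] [' ', '\t']

# Letters, digits, underscores and dashes in one set.
tokens = re.findall(r"[\w\-]+", "user-name_01 x")
print(tokens)                # ['user-name_01', 'x']

# Neither digit nor whitespace: negate both inside ONE set.
neither = re.findall(r"[^\d\s]", sample)
print(neither)               # ['a', 'b']

# [\D\S] is "not a digit OR not whitespace", which is everything.
everything = re.findall(r"[\D\S]", sample)
print(everything)            # ['a', '1', ' ', 'b']
```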

8. Further resources