regex guide

References taken from regular-expressions.info

Special Characters

\ ^ $ . | ? * + ( ) [ {

Inside character classes ([class], e.g. [0-9]) only these characters are special:

\ ^ and additionally - ]

Most regular expression flavors treat the brace { as a literal character unless it is part of a repetition operator like a{1,3}.

Some flavors also support the \Q…\E escape sequence. All the characters between the \Q and the \E are interpreted as literal characters. E.g. \Q*\d+*\E matches the literal text *\d+*.
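Python's re module is one flavor that does not support \Q…\E. A rough equivalent there is re.escape, which backslash-escapes the metacharacters in a string. The following is a minimal sketch; the variable names and sample text are illustrative only.

```python
import re

literal = r"*\d+*"               # text we want to match literally
pattern = re.escape(literal)     # escapes *, \ and + so they lose their special meaning

# The escaped pattern matches the literal text, not "one or more digits".
found = re.search(pattern, r"price: *\d+* each")
matched_text = found.group(0) if found else None

# The same pattern does not match actual digits.
no_match = re.search(pattern, "price: 123 each")
```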

The backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d is a shorthand that matches a single digit from 0 to 9.

Programming Languages

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters are processed by the compiler, before the regex library sees the string.
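As a sketch of this double processing in Python (chosen here only as an example language), a regex that matches one literal backslash can be written as an ordinary string or as a raw string. The sample path is made up for illustration.

```python
import re

# Regex the engine should see: \\ (an escaped backslash, matching one literal backslash).
plain = "\\\\"   # ordinary literal: Python collapses each pair to one backslash first
raw = r"\\"      # raw literal: the backslashes reach the regex engine untouched

same_pattern = plain == raw      # both spell the two-character regex \\
path = "C:\\temp"                # the actual text is C:\temp (a single backslash)
hit = re.search(raw, path)
```

Raw strings are usually the less error-prone choice, because the regex in the source then looks the same as the regex the engine receives.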

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A).

Regex Syntax versus String Syntax

Many programming languages support similar escapes for non-printable characters in their syntax for literal strings in source code. Then such escapes are translated by the compiler into their actual characters before the string is passed to the regex engine. If the regex engine does not support the same escapes, this can cause an apparent difference in behavior when a regex is specified as a literal string in source code compared with a regex that is read from a file or received from user input.
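A small Python sketch of such an apparent difference: \b means "backspace" to the Python string parser but "word boundary" to the regex engine, while \n happens to mean a line feed to both. The sample text is illustrative.

```python
import re

text = "a cat sat"

# Raw string: the regex engine receives \b and treats it as a word boundary.
as_raw = re.search(r"\bcat\b", text)

# Ordinary string: Python turns \b into a backspace character (0x08) before the
# engine sees it, so the pattern looks for backspace + "cat" + backspace.
as_plain = re.search("\bcat\b", text)

# For \n both spellings behave the same, because the re module also
# understands \n as a line feed.
newline_same = (re.search("\n", "a\nb") is not None
                and re.search(r"\n", "a\nb") is not None)
```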

Character Classes or Character Sets

A character class matches only a single character.

[ae] matches a or e
[0-9] matches a single digit between 0 and 9
[0-9a-fA-F] matches a single hexadecimal digit
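A short Python sketch of these classes in use. The pattern gr[ae]y and the sample strings are illustrative and not part of the notes above.

```python
import re

# The class stands for exactly one character, so "graey" and "gry" do not match.
spellings = re.findall(r"gr[ae]y", "gray grey graey gry")

# [0-9a-fA-F] picks out hexadecimal digits one at a time.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1Fg")
```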

Negated Character Classes

Typing a caret after the opening square bracket negates the character class.

[^0-9\r\n] matches any character that is not a digit or a line break.
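A minimal Python sketch of a negated class. One point worth seeing in practice is that a negated class still has to match a character; it does not match "nothing". The q[^u] pattern and the sample strings are illustrative.

```python
import re

not_digit_or_break = re.compile(r"[^0-9\r\n]")
kept = not_digit_or_break.findall("a1\nb2\r\nc")

# q[^u] means "a q followed by a character that is not a u".
end_q = re.search(r"q[^u]", "Iraq")      # no character follows the q
mid_q = re.search(r"q[^u]", "Iraq is")   # the space after the q is matched too
```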

Metacharacters Inside Character Classes

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x.

To include an unescaped caret as a literal, place it anywhere except right after the opening bracket. [x^] matches an x or a caret.

You can generally include an unescaped closing bracket by placing it right after the opening bracket, or right after the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. JavaScript and Ruby are exceptions: both require the closing bracket to be escaped with a backslash.

The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.

Many regex tokens that work outside character classes can also be used inside character classes.
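The placement rules in this section can be checked with a short Python sketch. Python's re module follows the general behavior described here (unlike JavaScript and Ruby); the sample strings are illustrative.

```python
import re

backslash_or_x = re.findall(r"[\\x]", r"a\x")    # escaped backslash inside the class
x_or_caret = re.findall(r"[x^]", "x^y")          # caret not in first position is literal
bracket_or_x = re.findall(r"[]x]", "a]x")        # ] right after [ is literal
not_bracket_or_x = re.findall(r"[^]x]", "a]x")   # ] right after the negating ^ is literal
hyphen_first = re.findall(r"[-x]", "a-x")        # hyphen right after [ is literal
hyphen_last = re.findall(r"[x-]", "a-x")         # hyphen right before ] is literal
```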

Repeating Character Classes

If you repeat a character class by using the ?, * or + operators, you're repeating the entire character class. You're not repeating just the character that it matched. The regex [0-9]+ can match 837 as well as 222.
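To make the distinction concrete, here is a Python sketch that contrasts repeating the class with repeating the matched character. The second pattern uses a backreference, which is not covered in these notes and is shown only for contrast.

```python
import re

any_digits = re.compile(r"[0-9]+")        # repeats the class: any run of digits
same_digit = re.compile(r"([0-9])\1+")    # repeats the character captured by group 1

class_837 = any_digits.fullmatch("837") is not None
class_222 = any_digits.fullmatch("222") is not None
backref_222 = same_digit.fullmatch("222") is not None
backref_837 = same_digit.fullmatch("837") is not None
```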

Shorthand Character Classes

Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9].

\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_].

\s stands for "whitespace character". In all flavors discussed in this tutorial, it includes [ \t\r\n\f].

Which characters these shorthands actually include depends on the regex flavor.

Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s].
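A Python sketch of why the two differ: [\D\S] means "not a digit OR not whitespace", which is true of every character, while [^\d\s] means "neither a digit nor whitespace". The last two lines illustrate flavor dependence in Python 3, where \d on text strings also matches non-ASCII digits unless re.ASCII is set. The sample strings are illustrative.

```python
import re

sample = "a 1"

either = re.findall(r"[\D\S]", sample)     # non-digit OR non-whitespace
neither = re.findall(r"[^\d\s]", sample)   # neither digit nor whitespace

# U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None
```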

The Dot Matches (Almost) Any Character

The dot matches any single character except line break characters.

Example date string verification allowing various field separators...

\d\d.\d\d.\d\d matches a date like 02/12/03, but also 02512703.

\d\d[- /.]\d\d[- /.]\d\d is a better solution.
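The two date patterns can be compared with a short Python sketch. The names loose and strict are illustrative; note that the stricter pattern still accepts mixed separators.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")              # dot accepts any separator, even a digit
strict = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")   # separator limited to - space / .

loose_date = loose.fullmatch("02/12/03") is not None
loose_digits = loose.fullmatch("02512703") is not None
strict_date = strict.fullmatch("02/12/03") is not None
strict_digits = strict.fullmatch("02512703") is not None
strict_mixed = strict.fullmatch("02-12.03") is not None
```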

In Perl, the mode where the dot also matches line breaks is called "single-line mode". You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.

Other languages and regex libraries have adopted Perl's terminology.
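In Python, for example, the equivalent of single-line mode is the re.DOTALL flag (also spelled re.S), or the inline modifier (?s). A minimal sketch with an illustrative sample string:

```python
import re

text = "first\nsecond"

default_dot = re.search(r"first.second", text)             # dot refuses the line feed
with_flag = re.search(r"first.second", text, re.DOTALL)    # dot now matches the line feed
with_inline = re.search(r"(?s)first.second", text)         # same effect, set inside the regex
```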

Anchors

^ and $ match at the start and end of the string and, in multi-line mode, at the start and end of each line. $ generally matches at the zero-width position in front of a \n or at the void after the last character in a file.

\A and \Z only match at the start and the end of the entire file.

Exception: in Python, \Z matches only at the very end of the string, which is the behavior most other flavors give to a lowercase \z.
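A Python sketch of the anchors, using an illustrative two-line string. In Python's re module, ^ and $ anchor to individual lines only when re.MULTILINE is set, $ also matches just before a final line feed, and \Z matches only at the very end of the string.

```python
import re

text = "one\ntwo\n"

# Without MULTILINE, ^ matches only at the very start of the string.
default_lines = re.findall(r"^\w+$", text)
# With MULTILINE, ^ and $ match at the start and end of every line.
multiline_lines = re.findall(r"^\w+$", text, re.MULTILINE)

# $ matches in front of the final line feed; Python's \Z does not.
dollar_before_final_newline = re.search(r"two$", text) is not None
z_before_final_newline = re.search(r"two\Z", text) is not None

# \A is never affected by MULTILINE: "two" is not at the start of the string.
a_ignores_multiline = re.search(r"\Atwo", text, re.MULTILINE) is None
```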
