regex guide
References taken from regular-expressions.info
Special Characters
\ ^ $ . | ? * + ( ) [ {
Inside a character class [class], e.g. [0-9], only these characters are special:
\ ^ and additionally - ]
Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}.
Some flavors also support the \Q…\E escape sequence. All the characters between the \Q and the \E are interpreted as literal characters. E.g. \Q*\d+*\E matches the literal text *\d+*.
The backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d is a shorthand that matches a single digit from 0 to 9.
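A minimal Python sketch of the escaping rules above. Python's re module does not support \Q…\E, so re.escape is used here as the nearest equivalent; the variable names are illustrative only.

```python
import re

literal_text = r"*\d+*"              # five characters: * \ d + *
escaped = re.escape(literal_text)    # backslash-escapes the special characters

# The escaped pattern matches the literal text itself, not "star, digits, star".
matches_literal = re.fullmatch(escaped, literal_text) is not None   # True
matches_digits = re.fullmatch(escaped, "*123*") is not None         # False

# Unescaped, \d is a shorthand token for a single digit.
digit_ok = re.fullmatch(r"\d", "7") is not None     # True
letter_ok = re.fullmatch(r"\d", "d") is not None    # False
```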
Programming Languages
In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters are processed by the compiler, before the regex library sees the string.
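As an illustration in Python (other languages have their own string rules), matching one literal backslash takes the regex \\, which needs four backslashes in an ordinary string literal but only two in a raw string. The sample path is made up for the example.

```python
import re

path = "C:\\temp"          # seven characters: C : \ t e m p

plain_pattern = "\\\\"     # four backslashes in source, two after string processing
raw_pattern = r"\\"        # raw string: both backslashes reach the regex engine as typed

same_pattern = plain_pattern == raw_pattern     # True
found = re.search(raw_pattern, path)
found_at = found.start() if found else None     # 2, the index of the backslash
```

Writing patterns as raw strings keeps the regex readable, because the only escaping layer left is the regex engine's own.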
Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A).
Regex Syntax versus String Syntax
Many programming languages support similar escapes for non-printable characters in their syntax for literal strings in source code. The compiler translates such escapes into their actual characters before the string is passed to the regex engine. If the regex engine does not support the same escapes, this can cause an apparent difference in behavior when a regex is specified as a literal string in source code compared with a regex that is read from a file or received from user input.
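A small Python sketch of the two layers. In Python both forms happen to work, because the re module understands \t itself; in a flavor without that escape, only the first form would match a tab.

```python
import re

from_string_syntax = "\t"    # Python replaces this with one actual tab character
from_regex_syntax = r"\t"    # backslash and t pass through; the regex engine interprets them

lengths = (len(from_string_syntax), len(from_regex_syntax))    # (1, 2)

sample = "a\tb"
both_match = all(
    re.search(pattern, sample) is not None
    for pattern in (from_string_syntax, from_regex_syntax)
)                                                              # True
```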
Character Classes or Character Sets
A character class matches only a single character.
[ae] matches a or e
[0-9] matches a single digit between 0 and 9
[0-9a-fA-F] matches a single hexadecimal digit
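The classes above can be tried out in Python as follows. The gr[ae]y pattern and the sample strings are chosen only for illustration.

```python
import re

# [ae] stands for exactly one character, either a or e.
gray_variants = re.findall(r"gr[ae]y", "grey gray groy")     # ['grey', 'gray']

hex_ok = re.fullmatch(r"[0-9a-fA-F]", "c") is not None       # True
hex_bad = re.fullmatch(r"[0-9a-fA-F]", "g") is not None      # False

# A class never matches two characters at once.
two_chars = re.fullmatch(r"[ae]", "ae") is not None          # False
```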
Negated Character Classes
Typing a caret after the opening square bracket negates the character class.
[^0-9\r\n] matches any character that is not a digit or a line break.
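A quick check of the negated class in Python, on a made-up sample string:

```python
import re

sample = "a1\nb2"

# Keeps every character that is neither a digit nor a line break.
kept = re.findall(r"[^0-9\r\n]", sample)    # ['a', 'b']
```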
Metacharacters Inside Character Classes
To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x.
To include an unescaped caret as a literal, place it anywhere except right after the opening bracket. [x^] matches an x or a caret.
You can generally include an unescaped closing bracket by placing it right after the opening bracket, or right after the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. JavaScript and Ruby are exceptions and require escapes.
The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.
Many regex tokens that work outside character classes can also be used inside character classes.
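The placement rules above, checked in Python's re module, which accepts all of these forms. Other flavors may differ, as noted for JavaScript and Ruby. Sample strings are illustrative.

```python
import re

backslash_or_x = re.findall(r"[\\x]", r"a\x")    # ['\\', 'x']
x_or_caret = re.findall(r"[x^]", "x^y")          # ['x', '^']

bracket_or_x = re.findall(r"[]x]", "a]x")        # [']', 'x']
not_bracket_or_x = re.findall(r"[^]x]", "a]x")   # ['a']

hyphen_first = re.findall(r"[-x]", "x-y")        # ['x', '-']
hyphen_last = re.findall(r"[x-]", "x-y")         # ['x', '-']
```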
Repeating Character Classes
If you repeat a character class by using the ?, * or + operators, you're repeating the entire character class. You're not repeating just the character that it matched. The regex [0-9]+ can match 837 as well as 222.
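A short Python sketch of the difference. The second pattern uses a backreference, which these notes do not cover; it is included only as a contrast, to show what repeating the matched character would look like.

```python
import re

# Repeating the class: each digit may differ.
class_837 = re.fullmatch(r"[0-9]+", "837") is not None        # True
class_222 = re.fullmatch(r"[0-9]+", "222") is not None        # True

# Repeating the matched character: capture one digit, then repeat it with \1.
same_837 = re.fullmatch(r"([0-9])\1+", "837") is not None     # False
same_222 = re.fullmatch(r"([0-9])\1+", "222") is not None     # True
```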
Shorthand Character Classes
Since certain character classes are used often, a series of shorthand character classes are available. \d is short for [0-9].
\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_].
\s stands for "whitespace character". In all flavors discussed in this tutorial, it includes [ \t\r\n\f].
Which characters these shorthands actually include depends on the regex flavor.
Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s].
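A Python sketch of both points. [\D\S] means "not a digit OR not whitespace", which is true of every character, while [^\d\s] means "neither a digit nor whitespace". The second half shows one flavor difference: for str patterns in Python 3, \w is Unicode-aware unless re.ASCII is set.

```python
import re

sample = "a 1"
either_negated = re.findall(r"[\D\S]", sample)    # ['a', ' ', '1']
neither = re.findall(r"[^\d\s]", sample)          # ['a']

e_acute = "\u00e9"                                # the letter e with an acute accent
unicode_word = re.fullmatch(r"\w", e_acute) is not None                  # True
ascii_word = re.fullmatch(r"\w", e_acute, flags=re.ASCII) is not None    # False
```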
The Dot Matches (Almost) Any Character
The dot matches any single character except line break characters.
Example: validating a date string that allows various field separators.
\d\d.\d\d.\d\d matches a date like 02/12/03, but also 02512703, because the dot accepts any character.
\d\d[- /.]\d\d[- /.]\d\d is a better solution, since it accepts only the listed separators.
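The two date patterns can be compared in Python like this (variable names are illustrative):

```python
import re

loose = r"\d\d.\d\d.\d\d"
strict = r"\d\d[- /.]\d\d[- /.]\d\d"

loose_date = re.fullmatch(loose, "02/12/03") is not None      # True
loose_junk = re.fullmatch(loose, "02512703") is not None      # True, the dot accepts 5 and 7

strict_date = re.fullmatch(strict, "02/12/03") is not None    # True
strict_junk = re.fullmatch(strict, "02512703") is not None    # False
```

Neither pattern checks that the numbers form a real date; they only constrain the shape of the string.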
In Perl, the mode where the dot also matches line breaks is called "single-line mode". You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.
Other languages and regex libraries have adopted Perl's terminology.
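In Python, for instance, the same mode is switched on with the re.DOTALL flag (alias re.S) or the inline modifier (?s). A minimal sketch:

```python
import re

sample = "a\nb"

default_dot = re.search(r"a.b", sample) is not None                    # False
dotall_flag = re.search(r"a.b", sample, flags=re.DOTALL) is not None   # True
dotall_inline = re.search(r"(?s)a.b", sample) is not None              # True
```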
Anchors
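^ and $ match at the start and end of the string or, in multi-line mode, of each line. They match positions rather than characters: $ generally matches at the zero-width position in front of a trailing \n, or at the very end after the last character in a file.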
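\A and \Z only match at the start and the end of the entire string or file, never at line breaks inside it.
Exception: in Python, \Z matches only at the very end of the string, which is the behavior most other flavors give to the lowercase \z.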
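A sketch of anchor behavior in Python's re module, using a made-up two-line string. It shows that in this flavor ^ and $ need re.MULTILINE to match at every line, that $ also matches before a trailing line break, and that \Z matches only at the very end of the string.

```python
import re

text = "first\nsecond\n"

# Without MULTILINE, ^ matches only at the start of the string.
default_lines = re.findall(r"^\w+$", text)                          # []
multiline_lines = re.findall(r"^\w+$", text, flags=re.MULTILINE)    # ['first', 'second']

# $ matches just before the trailing \n; Python's \Z does not.
dollar_before_newline = re.search(r"second$", text) is not None     # True
z_before_newline = re.search(r"second\Z", text) is not None         # False

# \A never matches after a line break, even in multi-line mode.
a_mid_string = re.search(r"\Asecond", text, flags=re.MULTILINE) is not None   # False
```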