Useful regex resources

Today I want to post a list of regex resources that have been useful to me. At first I didn't read a regex book; I found all my knowledge online, and there is plenty of regex information out there.

On-line resources

As examples, here are the regex docs for Perl and C#, because those languages provide good documentation and because I know them best. The Perl tutorials in particular can be recommended to regex learners of other languages too; just ignore the advanced parts that your language does not support.

Perl

  • perlrequick, Perl regular expressions quick start, a tutorial that gives a basic regex understanding.
  • perlretut, Perl regular expressions tutorial. A really extensive tutorial going into all details of Perl regexes.
  • perlre, the Perl regular expressions reference documentation.

C#

Online Regex Testers

  • Regexr, my favourite one. Based on Flex, it supports lookahead and fixed-length lookbehind. It also has a community regex library and explains regex constructs. You can also create permanent links to your solutions.
  • Rubular, based on Ruby, also lets you create permanent links.
  • Regular Expression Analyzer: give it a regex and it will analyze it.
  • Refiddle, where you can choose the regex engine (JavaScript, Ruby, .NET).

Applications

You do know Quantifiers. Really?

Basics

Everybody knows the quantifiers + and *. Most will also know ?. But did you know that they are only shortcuts for convenience? No?

The real quantifier in regular expressions is {m,n}, where m is the minimum number of repetitions to match and n is the upper bound. The upper bound can be omitted ({m,}), which means there is no upper bound.

There can also be just a single number between the curly brackets, like this: {4}. That means match exactly 4 times.

So, now back to the quantifiers we all know: +, * and ?. What are they, then?

? makes something optional, meaning match it 0 or 1 times. So the long version is {0,1}
* matches something 0 or more times, so the long version is {0,}
+ matches at least once, so the long version is {1,}
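
To check the equivalences above, here is a minimal sketch in Python (my choice of language for illustration; the helper names matched_text and shortcuts_agree are made up for this example). It compares each shortcut with its long form on a few sample strings.

```python
import re

# Each shortcut paired with its long {m,n} form.
EQUIVALENT_QUANTIFIERS = [("?", "{0,1}"), ("*", "{0,}"), ("+", "{1,}")]

SAMPLES = ["b", "ab", "aab", "aaaab", "xyz"]


def matched_text(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


def shortcuts_agree():
    """Check that a?b, a*b and a+b behave exactly like their brace forms."""
    for short, long_form in EQUIVALENT_QUANTIFIERS:
        for sample in SAMPLES:
            short_result = matched_text("a" + short + "b", sample)
            long_result = matched_text("a" + long_form + "b", sample)
            if short_result != long_result:
                return False
    return True


print(shortcuts_agree())               # True
print(matched_text("a{4}b", "aaaab"))  # aaaab
print(matched_text("a{4}b", "aab"))    # None
```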

Matching Behaviour

Greedy versus Lazy

The default matching behaviour of regex quantifiers is greedy, they match as much as possible. E.g. the regex /a.*a/ would match from the first a to the last a.

Our test string here shall be ababababababa

/a.*a/ would match the whole of ababababababa in our test string.

But sometimes this is too much, and one only wants to match up to the next a. So what to do? There is a modifier that changes the matching behaviour of a quantifier: the question mark ?. Wait, what? Yes, if you see something like /a.*?a/, the ? does not make the * optional (that would make no sense, eh); it modifies the behaviour of the * to be lazy, so it matches as little as possible.

So /a.*?a/ would match only the first aba in our test string.
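
The difference is easy to see when you print what each pattern actually matches. A small sketch in Python (the variable names are only for illustration):

```python
import re

TEST_STRING = "ababababababa"

# Greedy: .* takes as much as possible, so the match runs to the last "a".
greedy = re.search(r"a.*a", TEST_STRING)

# Lazy: .*? takes as little as possible, so the match stops at the next "a".
lazy = re.search(r"a.*?a", TEST_STRING)

print(greedy.group(0))  # ababababababa
print(lazy.group(0))    # aba

# Collecting all non-overlapping lazy matches:
all_lazy = re.findall(r"a.*?a", TEST_STRING)
print(all_lazy)         # ['aba', 'aba', 'aba']
```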

Read more about repetition on regular-expressions.info

Possessive

Some regex engines have an additional modifier for quantifiers. That modifier is the +, and it changes the matching behaviour to be “possessive”, which means that whatever the quantifier has matched is never released again. There is no backtracking within that quantifier. That is sometimes useful to make a regex fail faster.

/a.++/ would match the whole of ababababababa in our test string.

/a.++a/ would not match ababababababa. What!? Why not?

Because the a matched the first “a” (obviously) and the .++ matched the rest of the string till the end. At the end the regex still requires an “a”, so a normal quantifier would give back (backtrack) the last “a”, so that the regex can match it as the final “a”. But the possessive quantifier will not give back any character that it has matched, and therefore the regex can’t find a match and fails.

If you know atomic groups, then it is easier for you. A possessive quantifier is the same as placing an atomic group around that single quantified element. So /a.++a/ is the same as /a(?>.+)a/.
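
Not every engine has possessive quantifiers or atomic groups (Python's re module only accepts them from version 3.11 on). To keep the following sketch runnable on older versions too, it emulates the atomic group with a lookahead plus a backreference: the lookahead captures what .+ would take, and \1 then consumes exactly that text. This emulation is my substitution for illustration, not the syntax described above, but the matching behaviour is the same.

```python
import re

TEST_STRING = "ababababababa"

# Ordinary greedy quantifier: .+ first takes everything up to the end,
# then gives back the final "a" so the pattern can succeed.
backtracking = re.search(r"a.+a", TEST_STRING)

# Emulated a(?>.+)a: the lookahead is not re-entered once it has succeeded,
# so the text captured in group 1 can never be given back.
atomic_like = re.search(r"a(?=(.+))\1a", TEST_STRING)

# Emulated a.++ without the trailing "a": this one can match.
atomic_prefix = re.search(r"a(?=(.+))\1", TEST_STRING)

print(backtracking.group(0))   # ababababababa
print(atomic_like)             # None
print(atomic_prefix.group(0))  # ababababababa
```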

As usual, a link to Jan Goyvaerts' great page regular-expressions.info: he also has an article about Possessive Quantifiers where he explains the matching behaviour and the background in more detail.

Conclusion

There is more to quantifiers than just ?, + and *. If needed, the number of required repetitions and the matching behaviour can be adjusted to your particular needs. Just keep that in mind and remember it when your regex does not match what you expect it to.

I would be happy to read your comments if you haven’t understood a single word (and of course also if you learned something and liked it, or if something in between happened).

Do you write readable regexes?

 Why not?

Imagine you have a regex to validate a password

^(?=.*\p{Lu})(?=.*\P{L})\S{8,}$ 

That's not too complicated, but the readability could be better. The solution here is the option x, or IgnorePatternWhitespace.

Most regular expression flavours allow you to use the option x. This is an important option that everybody who wants to write longer patterns should know. The option x does two things:

  1. It allows whitespace to be used to make the pattern more readable
  2. It allows comments inside the pattern

The whitespace used inside the pattern then does not belong to the pattern itself; it is only there to improve readability. But that also means that if you want to match e.g. a space, you have to escape it with a backslash (\ followed by the space) or use the whitespace class \s.

Example in C#

Regex password = new Regex(@"
        ^ # Match the start of the string
        (?=.*\p{Lu}) # Positive lookahead assertion, is true when there is an uppercase letter
        (?=.*\P{L}) # Positive lookahead assertion, is true when there is a non-letter
        \S{8,} # At least 8 non whitespace characters
        $ # Match the end of the string
    ", RegexOptions.IgnorePatternWhitespace);

Looks a lot better, what do you think?

That way it's much clearer what this regex is doing, and together with meaningful comments, even a regex novice will see quite quickly what this part of the code is doing.

This useful feature is available in the most important flavours, like .NET, Java, Perl, PCRE, Python and Ruby. It is not supported by e.g. ECMA (JavaScript) and POSIX BRE/POSIX ERE (the PHP ereg functions use POSIX ERE, the preg functions use PCRE).
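
Since Python is on that list, here is a rough sketch of the same password pattern using Python's re.VERBOSE flag. Python's built-in re module has no \p{Lu} or \P{L}, so this sketch swaps in the ASCII classes [A-Z] and [^A-Za-z]; that substitution is an assumption made for the example and is narrower than the Unicode version above. The name is_valid_password is also just made up here.

```python
import re

PASSWORD = re.compile(r"""
    ^                  # start of the string
    (?=.*[A-Z])        # lookahead: there is an ASCII uppercase letter
    (?=.*[^A-Za-z])    # lookahead: there is a character that is not an ASCII letter
    \S{8,}             # at least 8 non-whitespace characters
    $                  # end of the string
""", re.VERBOSE)


def is_valid_password(candidate):
    """Return True if the candidate satisfies the sketch pattern above."""
    return PASSWORD.search(candidate) is not None


# In verbose mode a plain space is ignored, an escaped one is matched.
SPACED = re.compile(r"a b", re.VERBOSE)    # same as the pattern "ab"
ESCAPED = re.compile(r"a\ b", re.VERBOSE)  # matches "a", a space, "b"

print(is_valid_password("Secret123"))      # True
print(is_valid_password("secret123"))      # False
print(SPACED.fullmatch("ab") is not None)  # True
```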

So, in future, hopefully everyone is going to write readable regexes. You have now seen that being unreadable is not a feature of regular expressions; it is, as always, the programmer who writes unreadable code.

For more details about this option you can have a look at regular-expressions.info

How to test your regular expression

Of course, the ultimate tool is RegexBuddy by Jan Goyvaerts. I personally don’t have it at the moment, but I surely need to get it soon. It supports various languages, gives you a detailed explanation of each part, lets you step through the matching process …

But if you don't use regexes often and only write small ones, you should be fine with free online regular expression testers. And it is important that you test your regex: extract use cases from your real data, and also add test data that you don’t want to match!
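
The same idea also works directly in your own language. The following is only a sketch in Python: the date pattern and the sample strings are hypothetical stand-ins for your real data, and failures is a helper invented for this example. It reports every input where the pattern behaves differently than expected.

```python
import re

# Hypothetical pattern under test: dates written like 2024-01-31.
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Hypothetical test data: strings that must match and strings that must not.
SHOULD_MATCH = ["2024-01-31", "1999-12-01"]
SHOULD_NOT_MATCH = ["2024-1-31", "31-01-2024", "2024-01-31 extra", ""]


def failures(pattern, should_match, should_not_match):
    """Return the inputs for which the pattern behaves unexpectedly."""
    wrong = [s for s in should_match if not pattern.search(s)]
    wrong += [s for s in should_not_match if pattern.search(s)]
    return wrong


print(failures(DATE, SHOULD_MATCH, SHOULD_NOT_MATCH))  # []

# Without the anchors, the negative test data catches the problem:
LOOSE = re.compile(r"\d{4}-\d{2}-\d{2}")
print(failures(LOOSE, SHOULD_MATCH, SHOULD_NOT_MATCH))  # ['2024-01-31 extra']
```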

Online regex testers are a great help while developing a regular expression. As long as you do not want to use features they don’t support, they visualize your match and show you the content of the capturing groups instantly.

The critical features that you should test in your real language are mainly lookbehind assertions and Unicode features like Unicode properties.

My favourites are

Regexr is based on ActionScript 3, which means it implements regexes according to the ECMA-262 standard. See regular-expressions.info for more info. (I am not sure whether the standard has changed or Regexr has changed, because Regexr supports simple lookbehind assertions, but ECMA-262 does not.)

Matches are highlighted, and the content of the capturing groups is shown when the mouse hovers over the match. It also lets you test replacements and create permalinks to your tested regex (to share it easily on SO ;))

Rubular is based on Ruby. See regular-expressions.info for more info. Matches are highlighted, and the matches and the content of the corresponding capturing groups are shown in a list. The regex is processed on a server, so the result is not shown instantly, but most of the time it is quite quick. It also lets you create permalinks to your tested regex.

Other online testers are (I don't use them often, but try them to find your personal favourite)

And lastly, this is not a tool for testing, but you can give it a regex and it tells you what it is doing:

What absolutely every Programmer should know about regular expressions

No, I am not going into theoretical definitions. I am going to talk about regex in today's computer languages and applications.

I am surprised how many programmers think regexes are complicated and ask e.g. on Stack Overflow for a regex for a specific task. They will most probably get an answer if they ask nicely. But most of the time their specifications are incomplete or wrong, so they end up with a regex that works for some examples but not for all of their real data. If they notice it, they have no clue why it is not working for that case or how to fix it. But it is not difficult to get the basics and to understand at least basic regexes.

I want to explain here the absolutely necessary basics for writing and understanding basic regexes, so that you are able to use them more efficiently, search with the correct vocabulary, or at least read them.

Not all features that I explain here are available in all regex flavours; the only solution then is to check the documentation. A good starting point for regex information is regular-expressions.info. There is also a feature list for a lot of different regex flavours.

Regular Expressions

The first thing to know is that a regex describes a pattern of characters. This enables you to find that pattern inside a text. A very simple pattern would be

/Foo/ will find “Foo”, “Foobar”, “Foooo” and “BarFoo”. It's case sensitive, so it will not find “foo”!

The slashes around it do not belong to the pattern; they are the regex delimiters. That's Perl style; how a regex is denoted correctly depends on the language.
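
In Python, for example, there are no slashes at all; the pattern is an ordinary (raw) string passed to the re module. A minimal sketch, with a word list made up for illustration:

```python
import re

WORDS = ["Foo", "Foobar", "Foooo", "BarFoo", "foo"]

# re.search looks for the pattern anywhere inside the text.
found = [word for word in WORDS if re.search(r"Foo", word)]

print(found)  # ['Foo', 'Foobar', 'Foooo', 'BarFoo']
```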

Metacharacters

Now, there are some characters that have a special meaning in a regex. They are often called “Metacharacters”. Those are

[^$.|()\?*+

If you want to match one of those characters literally, you have to escape it using the special character \

The . is a very special character, it will match every character except newline characters.

/F.o/ will find "Foo", "Fxobar", "F&ooo"

/F\.o/ will find "F.o", "F.oooo",  but not  "Foo",   "Fxobar", "F&ooo"
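
The two patterns can be compared side by side. This is a small sketch in Python; the candidate list is invented for the example, and re.escape is a standard library helper that adds the backslashes for you.

```python
import re

CANDIDATES = ["Foo", "Fxobar", "F&ooo", "F.o", "F.oooo", "F\no"]

# Unescaped dot: any character except a newline.
any_char = [s for s in CANDIDATES if re.search(r"F.o", s)]

# Escaped dot: only a literal ".".
literal_dot = [s for s in CANDIDATES if re.search(r"F\.o", s)]

print(any_char)     # ['Foo', 'Fxobar', 'F&ooo', 'F.o', 'F.oooo']
print(literal_dot)  # ['F.o', 'F.oooo']

# re.escape builds the escaped form from plain text.
print(re.escape("F.o"))  # F\.o
```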

Quantifiers

You can repeat a character or group by using a quantifier. That would be

{x,y} where x is the minimum number of occurrences and y is the maximum. If x==y, just write {x}; if y should be unlimited, leave it empty: {x,}. So

/o{2}/ will find the “oo” in "Fooo"

For convenience there are some shortcuts

? is {0,1}, meaning match 0 or 1; it makes the previous character or group optional

+ is {1,}, meaning match 1 or more

* is {0,}, meaning match 0 or more

/Fo+/ will find "Foo", "Foobar", "Foooo" and "BarFoo"

/Fo+b?/ will find "Foo", "Foobar", "Foooo" and "BarFoo"
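
The examples above say which strings a pattern finds something in; it also helps to look at the exact text that was matched. A short sketch in Python (first_match is a helper made up for this example):

```python
import re


def first_match(pattern, text):
    """Return the exact text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(r"o{2}", "Fooo"))     # oo
print(first_match(r"Fo+", "Foooo"))     # Foooo
print(first_match(r"Fo+b?", "Foobar"))  # Foob
print(first_match(r"Fo+b?", "BarFoo"))  # Foo
print(first_match(r"Fo*", "Fx"))        # F
print(first_match(r"Fo+", "Fx"))        # None
```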

Character Classes

You can also define your own set of characters, for when more than one character is allowed but . would match too many.

/F[ox]o/ will find “Foo”, “Fxobar”, but not “F&ooo”

You can put as many characters inside such a class as you want, but [abcdne] will still match only one character (out of that class). If you want to match more, you need to use a quantifier after the class.

Metacharacters inside a character class lose their special meaning. So

/Fo[+]/ will match “Fo+”

But other characters get a special meaning, or change their meaning, inside a character class. I haven’t told you the meaning of ^ till now, but it is different inside a character class, at least when it is the first character. [^o] is a negated character class; this construct will match every character except “o”.

/F[^o]+/ will match “Fxo”, but not “Foo”

- creates a range in a character class. [a-m] would match every character in the ASCII table from “a” to “m”.

/F[a-q]+/ would match “Foo”, “Foobar”, “Foooo”, “Fabcdefghijklmopqrs” and “BarFoo”

So please, if you want to add a dash “-” to your character class, escape it (or put it as the first or last character in the class); otherwise it will define a range and match much more than you want.
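
Here is a sketch in Python that tries these character classes and prints what they actually match (class_match is a made-up helper name). The last two lines show the difference a dash's position makes.

```python
import re


def class_match(pattern, text):
    """Return the exact text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(class_match(r"F[ox]o", "Fxobar"))   # Fxo
print(class_match(r"F[ox]o", "F&ooo"))    # None
print(class_match(r"Fo[+]", "Fo+"))       # Fo+
print(class_match(r"F[^o]+", "Fxo"))      # Fx
print(class_match(r"F[^o]+", "Foo"))      # None
print(class_match(r"F[a-q]+", "Foobar"))  # Fooba

# A dash in the middle defines a range, a dash at the start is literal.
print(class_match(r"[a-c]+", "a-c-b"))    # a
print(class_match(r"[-ac]+", "a-c-b"))    # a-c-
```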

There are some predefined classes for your convenience:

\w is a word character, which means letters, digits and the underscore. What counts as a letter depends on your language: either only the ASCII letters (the worst case) or Unicode code points with the letter property.

\d is a digit

\s is a whitespace character, e.g. space, tab and newlines.

If the letter is uppercase, it is the negated form of that class, e.g. the negated form of \w is \W.
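
A quick sketch in Python of what these classes pick out of a sample text (the text is invented for the example). In Python 3, \w on a normal string is Unicode-aware by default, and the re.ASCII flag switches to the ASCII-only behaviour.

```python
import re

TEXT = "Order 66, item_7\tshipped"

digits = re.findall(r"\d+", TEXT)
words = re.findall(r"\w+", TEXT)
spaces = re.findall(r"\s", TEXT)
non_words = re.findall(r"\W+", TEXT)

print(digits)     # ['66', '7']
print(words)      # ['Order', '66', 'item_7', 'shipped']
print(spaces)     # [' ', ' ', '\t']
print(non_words)  # [' ', ', ', '\t']

# Unicode versus ASCII-only word characters ("Gr\u00f6\u00dfe" is "Größe").
unicode_words = re.findall(r"\w+", "Gr\u00f6\u00dfe")
ascii_words = re.findall(r"\w+", "Gr\u00f6\u00dfe", re.ASCII)
print(ascii_words)  # ['Gr', 'e']
```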

Groups

You can group stuff together by using parentheses (). By default such a group is a capturing group. That means it stores the text that has been matched by that part of the pattern in a variable that can then be accessed inside the pattern by using backreferences. In my experience, this costs a bit of time, so if you don’t need that partial result, use a non-capturing group. Most groups that start with (? are non-capturing groups with a special meaning. A plain non-capturing group is (?:pattern).

/F(?:oo){2}/ would match “Foooo”, but not “Foo”

/F(oo)\1/ would match “Foooo”, but not “Foo”. \1 is a backreference to the part matched inside the parentheses. So this requires “oo” to be matched inside the parentheses, and then two more “o” are needed because of the backreference.
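
The difference between the two kinds of group shows up in what the match object stores. A small sketch in Python (variable names are only for illustration):

```python
import re

# Non-capturing group: it groups for the quantifier but stores nothing.
non_capturing = re.search(r"F(?:oo){2}", "Foooo")
print(non_capturing.group(0))  # Foooo
print(non_capturing.groups())  # ()
print(re.search(r"F(?:oo){2}", "Foo"))  # None

# Capturing group with a backreference: \1 must repeat what group 1 matched.
capturing = re.search(r"F(oo)\1", "Foooo")
print(capturing.group(0))  # Foooo
print(capturing.group(1))  # oo

# A typical use of a backreference: find a doubled character.
doubled = re.search(r"(\w)\1", "letter")
print(doubled.group(0))    # tt
```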

Alternations

Another important construct is the alternation.

/Foo|Bar/ would match “Foo” or “Bar”
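
One thing worth trying out yourself is how far an alternation reaches: the | splits the whole pattern unless a group limits it. A sketch in Python, which uses the ^ and $ anchors for the second part (the sample strings are made up):

```python
import re

both = re.findall(r"Foo|Bar", "Foo and Bar and Baz")
print(both)  # ['Foo', 'Bar']

# Without a group this reads as "^Foo" or "Bar$".
ungrouped = re.search(r"^Foo|Bar$", "Foobar")
print(ungrouped.group(0))  # Foo

# With a non-capturing group the anchors apply to both alternatives.
grouped = re.search(r"^(?:Foo|Bar)$", "Foobar")
print(grouped)  # None

exact = re.search(r"^(?:Foo|Bar)$", "Bar")
print(exact.group(0))  # Bar
```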

Anchors

As the last part of defining a pattern I want to talk about anchors. Anchors are zero-width assertions. That means they don’t match a character, they match a position. Anchors are important for defining where a pattern should match. There are three important anchors:

^ matches the start of the string

$ matches the end of the string

\b matches a word boundary. A word boundary is a position with a \w character on one side and a \W character (or the start or end of the string) on the other.

/^Foo/ matches “Foo”, “Foo bar text”, but not “This is a Foo text”

/\bFoo\b/ matches “Foo”, “Foo bar text”, “This is a Foo text”, but not “Foobar”

/Foo$/ matches “Foo”, “This is Foo” but not “Foo bar text”
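
The three anchors can be tried on one list of texts. This is only a sketch in Python; the list is put together from the examples above.

```python
import re

TEXTS = ["Foo", "Foo bar text", "This is a Foo text", "Foobar", "This is Foo"]

at_start = [t for t in TEXTS if re.search(r"^Foo", t)]
whole_word = [t for t in TEXTS if re.search(r"\bFoo\b", t)]
at_end = [t for t in TEXTS if re.search(r"Foo$", t)]

print(at_start)    # ['Foo', 'Foo bar text', 'Foobar']
print(whole_word)  # ['Foo', 'Foo bar text', 'This is a Foo text', 'This is Foo']
print(at_end)      # ['Foo', 'This is Foo']
```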

Options

The matching behaviour of the regex can be modified by options or modifiers.

i makes the pattern match letters case-insensitively. /a/i would match “a” and “A”.

m is the multiline modifier. It changes the behaviour of the ^ and $ anchors to match at the start and end of each line instead of only at the start and end of the string. The $ anchor will then also match before a \n character.

s is the singleline modifier. It changes the behaviour of the dot . so that it also matches newline characters.
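
How options are written depends on the language. In Python they are flags passed as an extra argument: re.IGNORECASE for i, re.MULTILINE for m and re.DOTALL for s. A minimal sketch with a made-up two-line text:

```python
import re

TEXT = "first line\nsecond line"

# i: case-insensitive matching
case_insensitive = re.search(r"a", "A", re.IGNORECASE)
print(case_insensitive.group(0))  # A

# m: ^ and $ also match at the start and end of each line
first_words = re.findall(r"^\w+", TEXT)
first_words_multiline = re.findall(r"^\w+", TEXT, re.MULTILINE)
print(first_words)            # ['first']
print(first_words_multiline)  # ['first', 'second']

line_ends = re.findall(r"line$", TEXT)
line_ends_multiline = re.findall(r"line$", TEXT, re.MULTILINE)
print(len(line_ends), len(line_ends_multiline))  # 1 2

# s: the dot also matches newline characters
across_lines = re.search(r"first.*second", TEXT)
across_lines_dotall = re.search(r"first.*second", TEXT, re.DOTALL)
print(across_lines)                   # None
print(across_lines_dotall.group(0) == "first line\nsecond")  # True
```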

 The End

Of course these are really only the very basics and I haven't told you everything, but this leaves me some more things to write about in the future.

Thank you if you have read that far, my first blog post got a bit longer than I expected. I hope this post helped someone, at least a little bit. Please tell me what you think or if you found a mistake somewhere, leave a comment.