You do know Quantifiers. Really?


Basics

Everybody knows the quantifiers + and *. Most will also know ?. But did you know that they are only shortcuts for convenience? No?

The real quantifier in regular expressions is {m,n}, where m is the minimum number of repetitions to match and n is the maximum. The upper bound can be omitted, as in {m,}, which means there is no upper bound.

There can also be just a single number between the curly brackets, like this: {4}. That means match exactly 4 times.
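To see the three forms in action, here is a minimal sketch using Python's `re` module (my choice for illustration; any engine with the {m,n} syntax should behave the same way). The variable names are just made up for this example.

```python
import re

# {2,3}: at least two and at most three "a" characters.
bounded_ok = re.fullmatch(r"a{2,3}", "aaa") is not None
bounded_too_many = re.fullmatch(r"a{2,3}", "aaaa") is not None

# {2,}: upper bound omitted, so any number of "a" from two upwards.
open_ended = re.fullmatch(r"a{2,}", "aaaaaaa") is not None

# {4}: exactly four "a" characters, no more and no fewer.
exact_ok = re.fullmatch(r"a{4}", "aaaa") is not None
exact_too_few = re.fullmatch(r"a{4}", "aaa") is not None

print(bounded_ok, bounded_too_many, open_ended, exact_ok, exact_too_few)
```

`re.fullmatch` is used here so that the whole string has to fit the quantifier, which makes the bounds easy to see.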

So, now back to the quantifiers we all know: +, * and ?. What are they really?

? makes something optional, meaning it matches 0 or 1 times. The long version is {0,1}.
* matches something 0 or more times. The long version is {0,}.
+ matches at least once. The long version is {1,}.
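If you don't believe it, you can check it yourself. The following is a small sketch in Python (again just my choice of language); the helper `same_behaviour` and the sample strings are invented for this example. It compares the short and the long form on a handful of inputs.

```python
import re

# Each shortcut paired with its long {m,n} form.
SHORTCUTS = {"?": "{0,1}", "*": "{0,}", "+": "{1,}"}

# A few sample inputs, including the empty string and strings without a match.
SAMPLES = ["", "a", "aa", "aaa", "b", "ab", "aab"]


def same_behaviour(short, long):
    """Return True if 'a' + short and 'a' + long match the same span on every sample."""
    for sample in SAMPLES:
        m_short = re.match("a" + short, sample)
        m_long = re.match("a" + long, sample)
        span_short = m_short.span() if m_short else None
        span_long = m_long.span() if m_long else None
        if span_short != span_long:
            return False
    return True


results = {short: same_behaviour(short, long) for short, long in SHORTCUTS.items()}
print(results)
```

This is of course no proof, only a spot check on a few strings, but it shows that the shortcuts and their long versions behave identically on them.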

Matching Behaviour

Greedy versus Lazy

The default matching behaviour of regex quantifiers is greedy: they match as much as possible. E.g. the regex /a.*a/ would match from the first a to the last a.

Our test string here shall be ababababababa

/a.*a/ would match the whole of ababababababa in our test string.

But sometimes this is too much and you only want to match up to the next a. So what to do? There is a modifier that changes the matching behaviour of a quantifier: the question mark ?. Wait, what? Yes, if you see something like /a.*?a/, the ? does not make the * optional (that would make no sense, eh). It makes the * lazy, so that it matches as little as possible.

So /a.*?a/ would match only aba, the first three characters of our test string.
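Here is the same comparison as a short Python sketch (the language is my pick for illustration, the two patterns and the test string are the ones from above):

```python
import re

text = "ababababababa"

# Greedy: .* grabs as much as it can, then backtracks just enough to find a final "a".
greedy = re.search(r"a.*a", text).group()

# Lazy: .*? takes as little as it can and only grows until the next "a" is found.
lazy = re.search(r"a.*?a", text).group()

print(greedy)  # ababababababa
print(lazy)    # aba
```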

Read more about repetition on regular-expressions.info

Possessive

Some regex engines have an additional modifier for quantifiers. That modifier is the +, and it changes the matching behaviour to be “possessive”, meaning that whatever the quantifier has matched is not released again. There is no backtracking into that quantifier. That is sometimes useful to make a regex fail faster.

/a.++/ would match the whole of ababababababa in our test string.

/a.++a/ would not match ababababababa. What!? Why not?

Because the a matched the first “a” (obviously) and the .++ matched the rest of the string, right to the end. At that point the regex still needs to match an “a”. A normal quantifier would give back (backtrack) the last “a”, so that the regex can match the final “a” in the string. But the possessive quantifier will not give back any character that it has matched, and therefore the regex can’t find a match and fails.

If you know atomic groups, this will be easier for you. A possessive quantifier is the same as placing an atomic group around that single quantified element. So /a.++a/ is the same as /a(?>.+)a/.
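The following sketch tries the possessive quantifier and the atomic group side by side. It assumes Python 3.11 or later, because as far as I know that is the version in which the `re` module gained possessive quantifiers and atomic groups; on older versions the hypothetical helper `possessive_demo` simply returns None instead of failing.

```python
import re
import sys

TEXT = "ababababababa"


def possessive_demo(text):
    """Return what each pattern matches in text, or None if this Python is too old."""
    # Assumption: possessive quantifiers and atomic groups need Python 3.11+.
    if sys.version_info < (3, 11):
        return None

    def matched(pattern):
        m = re.search(pattern, text)
        return m.group() if m else None

    return {
        "a.+a": matched(r"a.+a"),          # normal quantifier: backtracks one character
        "a.++": matched(r"a.++"),          # possessive: swallows the rest of the string
        "a.++a": matched(r"a.++a"),        # possessive: nothing left for the final "a"
        "a(?>.+)a": matched(r"a(?>.+)a"),  # atomic group: same behaviour as a.++a
    }


print(possessive_demo(TEXT))
```

On a supporting version, the normal /a.+a/ and the possessive /a.++/ both match the whole test string, while /a.++a/ and /a(?>.+)a/ find no match at all.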

As usual, a link to Jan Goyvaerts' great site regular-expressions.info: he also has an article about Possessive Quantifiers, where he explains the matching behaviour and the background in more detail.

Conclusion

There is more to quantifiers than just ?, + and *. If needed, the number of required repetitions and the matching behaviour can be adjusted to your particular needs. Keep that in mind when your regex does not match what you expect it to.

I would be happy to read your comments if you haven’t understood a single word (and of course also if you learned something and liked it, or if it was something in between).

