Matching One or More Characters
To match one or more instances of a character (or set), simply append a + character. + matches one or more characters (at least one; zero would not match). Whereas a matches a, a+ matches one or more as. Similarly, whereas [0-9] matches any single digit, [0-9]+ matches one or more consecutive digits. When you use + with sets, the + should be placed outside the set. Therefore, [0-9]+ is correct, but [0-9+] is probably not what you want: it is legal, but it defines a set containing the digits and a literal +, and matches just one character from that set.
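A minimal sketch of the difference, using Python's re module (the sample strings are made up for illustration and are not from the original text):

```python
import re

# a+ matches runs of one or more "a"; it never matches an empty string.
a_runs = re.findall(r"a+", "a aa baaad")

# [0-9]+ matches runs of consecutive digits.
digit_runs = re.findall(r"[0-9]+", "order 66 shipped 2024")

# With + inside the set, + is only a literal member of the set,
# so each match is a single character.
single_chars = re.findall(r"[0-9+]", "1+23")

print(a_runs)        # ['a', 'aa', 'aaa']
print(digit_runs)    # ['66', '2024']
print(single_chars)  # ['1', '+', '2', '3']
```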
Matching Zero or More Characters
+ matches one or more characters. Zero characters will not match; there has to be at least one. But what if you wanted to match entirely optional characters so that zero characters would be allowed? To do this, you use the * metacharacter. * is used exactly like +; it is placed right after a character or a set and will match zero or more instances of the character or set. Therefore, the pattern B.* Forta would match B Forta, B. Forta, Ben Forta, and other combinations, too.
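The B.* Forta pattern can be tried out with a short Python sketch. The fourth test string, BForta, is an extra case added here to show that the space before Forta is still required:

```python
import re

pattern = re.compile(r"B.* Forta")
names = ["B Forta", "B. Forta", "Ben Forta", "BForta"]

# fullmatch requires the whole string to fit the pattern.
star_results = {name: bool(pattern.fullmatch(name)) for name in names}

for name, matched in star_results.items():
    print(f"{name!r}: {matched}")
```

The first three names match because .* is allowed to consume zero characters; BForta does not, because nothing in the pattern makes the space optional.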
Matching Zero or One Character
One other very useful metacharacter is ?. Like *, ? matches optional text (and so zero instances will match). But unlike *, ? matches only zero or one instance of a character (or set), and not more than one. As such, ? is very useful for matching specific, single optional characters in a block of text.
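As an illustration, a hypothetical pattern https?:// (not taken from the text above) makes the s optional, so it accepts both http:// and https:// but not a doubled s. A small Python sketch:

```python
import re

# The ? applies only to the character right before it (the "s").
pattern = re.compile(r"https?://")

urls = ["http://example.com", "https://example.com", "httpss://example.com"]

# re.match anchors the pattern at the start of each string.
optional_results = [bool(pattern.match(url)) for url in urls]

print(optional_results)  # [True, True, False]
```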
Using Intervals
+, *, and ? are used to solve many problems with regular expressions, but sometimes they are not enough. Consider the following:
• + and * match an unlimited number of characters. They provide no way to set a maximum number of characters to match.
• The only minimums supported by +, *, and ? are zero or one. They provide no way to set an explicit minimum number of matches.
• There is also no way to specify an exact number of matches desired.
To solve these problems, and to provide a greater degree of control over repeating matches, regular expressions allow for the use of intervals. Intervals are specified between the { and } characters.
-> Exact Interval Matching – {3}, where 3 could be any number. It matches exactly 3 occurrences; if there are only 2, nothing is matched. e.g. #[[:xdigit:]]{6} will match #78de58 but not #78de5.
-> Range Interval Matching – {2,4}, where 2 is the minimum number of matches and 4 is the maximum. e.g. \d{1,2} will match 5 and 55, but not 555 as a whole (it can take at most two of the three digits in one match).
-> “At Least” Interval Matching – The final use of intervals is to specify the minimum number of instances to be matched (without any maximum). The syntax for this type of interval is similar to that of a range, but with the maximum omitted. For example, {3,} means match at least 3 instances, or stated differently, match 3 or more instances.
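The three interval forms can be sketched in Python as follows. Two assumptions are worth flagging: Python's re module does not support the POSIX class [[:xdigit:]], so [0-9A-Fa-f] stands in for it, and the \d{3,} pattern and its test strings are made up for illustration.

```python
import re

# Exact: exactly six hex digits after "#".
# [0-9A-Fa-f] replaces [[:xdigit:]], which Python's re does not support.
hex_color = re.compile(r"#[0-9A-Fa-f]{6}")
exact = [bool(hex_color.fullmatch(s)) for s in ["#78de58", "#78de5"]]

# Range: one or two digits.
one_or_two = re.compile(r"\d{1,2}")
ranged = [bool(one_or_two.fullmatch(s)) for s in ["5", "55", "555"]]

# At least: three or more digits, no upper limit.
at_least_three = re.compile(r"\d{3,}")
open_ended = [bool(at_least_three.fullmatch(s)) for s in ["55", "555", "55555"]]

print(exact)       # [True, False]
print(ranged)      # [True, True, False]
print(open_ended)  # [False, True, True]
```

fullmatch is used so that the whole string has to fit the interval. A plain search for \d{1,2} inside 555 would still find 55, since the pattern is free to match part of the text.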
Preventing Over Matching
? matches are limited in scope (zero or one match only), and so are interval matches when using exact amounts or ranges. But the other forms of repetition described in this lesson can match an unlimited number of characters, sometimes too many.
e.g. Test Case – <B>AK</B> and <B>HI</B>.
RegEx – <[bB]>.*</[bB]>
Match – <B>AK</B> and <B>HI</B>.
We were expecting two matches, but the pattern gives only one. The reason for this is that metacharacters such as * and + are greedy; that is, they look for the greatest possible match as opposed to the smallest. It is almost as if the matching starts from the end of the text, working backward until the next match is found, in contrast to starting from the beginning. This is deliberate and by design: quantifiers are greedy.
The solution is to use lazy versions of these quantifiers (they are referred to as being lazy because they match the fewest characters instead of the most). Lazy quantifiers are defined by appending a ? to the quantifier being used, and each of the greedy quantifiers has a lazy equivalent, as listed below:
Greedy | Lazy |
---|---|
* | *? |
+ | +? |
{n,} | {n,}? |
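A short Python sketch comparing the greedy and lazy forms on the test case from this section. The closing tag is written out in full as </[bB]> here, which is an assumption about the intended pattern:

```python
import re

text = "<B>AK</B> and <B>HI</B>."

# Greedy: .* grabs as much as it can, so the match runs to the last </B>.
greedy = re.findall(r"<[bB]>.*</[bB]>", text)

# Lazy: .*? stops at the first </B> that lets the pattern succeed.
lazy = re.findall(r"<[bB]>.*?</[bB]>", text)

print(greedy)  # ['<B>AK</B> and <B>HI</B>']
print(lazy)    # ['<B>AK</B>', '<B>HI</B>']
```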
Hope that gives you a clear idea of how to control the number of repetitions a pattern matches.