Using Metacharacters in RegEx

faltutech

6 months ago

Escaping :

Metacharacters are characters that have special meaning within regular expressions. The period (.) is a metacharacter; it is used to match any single character. Similarly, the left bracket ([) is a metacharacter; it is used to mark the beginning of a set.

Because metacharacters take on special significance when used in regular expressions, these characters cannot be used to refer to themselves. For example, you cannot use [ to match a literal [, or . to match a literal period.

The backslash (\) is used to escape metacharacters: \. matches a literal period and \[ matches a literal left bracket. This means that \ is itself a metacharacter; it is used to escape other characters. To refer to \ itself, you need to escape it as \\.
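As a minimal sketch of escaping, assuming Python's built-in re module (the sample strings and variable names are made up for illustration):

```python
import re

# Hypothetical sample text for illustration.
text = "file1.txt fileA.txt file1xtxt array[0]"

# Unescaped: "." matches any single character, so "file1xtxt" also matches.
unescaped = re.findall(r"file1.txt", text)

# Escaped: "\." matches only a literal period.
escaped = re.findall(r"file1\.txt", text)

# "\[" and "\]" match literal brackets instead of delimiting a set.
brackets = re.findall(r"array\[0\]", text)

# An escaped backslash in the pattern matches one literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\notes")

print(unescaped)         # ['file1.txt', 'file1xtxt']
print(escaped)           # ['file1.txt']
print(brackets)          # ['array[0]']
print(len(backslashes))  # 2
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslashes before the regular expression engine sees them.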

Matching Whitespace Characters

When you are performing regular expression searches, you’ll often need to match nonprinting whitespace characters embedded in your text. For example, you may want to find all tab characters, or you may want to find line breaks. Because typing these characters into your regular expressions directly would be very tricky (to say the least), you can use the special metacharacters listed below:

Metacharacter	Description
[\b]	Backspace (outside a set, \b usually means a word boundary instead)
\f	Form feed
\n	New line
\t	Tab
\r	Carriage return
\v	Vertical tab
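Here is a rough sketch of these metacharacters in use, assuming Python's re module; the sample text and variable names are invented for the example:

```python
import re

# Hypothetical sample: tab-separated records with Windows line endings
# and one blank line in the middle.
text = "id\tname\r\n1\tAda\r\n\r\n2\tGrace\r\n"

# Count the tab characters.
tab_count = len(re.findall(r"\t", text))

# A blank line shows up as two consecutive line breaks (\r\n\r\n).
blank_lines = re.findall(r"\r\n\r\n", text)

# "\r?\n" makes the carriage return optional, so the same pattern
# handles both Windows (\r\n) and Unix (\n) line endings.
lines = re.split(r"\r?\n", text.strip())

print(tab_count)         # 3
print(len(blank_lines))  # 1
print(lines)             # ['id\tname', '1\tAda', '', '2\tGrace']
```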

Matching Specific Character Types :

By now you may have seen how to match specific characters, any character (using .), one of a set of characters (using [ and ]), and how to negate a set (using ^). Matching one of a set of characters is the most common form of matching, and special metacharacters can be used in lieu of commonly used sets. These metacharacters are said to match classes of characters. Class metacharacters are never actually needed (you can always enumerate the characters to match or use ranges), but you will undoubtedly find them to be incredibly useful.

Metacharacter	Description
\d	Any digit (same as [0-9])
\D	Any nondigit (same as [^0-9])
\w	Any alphanumeric character in upper- or lowercase, or underscore (same as [a-zA-Z0-9_])
\W	Any character that is not alphanumeric or underscore (same as [^a-zA-Z0-9_])
\s	Any whitespace character, including space (same as [ \f\n\r\t\v])
\S	Any nonwhitespace character (same as [^ \f\n\r\t\v])
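\x	Prefix for a character given by its hexadecimal value (for example, \x0A is a new line)
\0	Prefix for a character given by its octal value (for example, \011 is a tab)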
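The class metacharacters can be tried out with a short sketch like the one below, assuming Python's re module. The sample text and variable names are made up. Note that Python matches Unicode digits, letters, and whitespace by default, so the sketch passes re.ASCII to get the ASCII sets listed in the table.

```python
import re

# Hypothetical sample text for illustration.
text = "user_42 paid $19.99\ton 2024-01-15"

# re.ASCII limits \d, \w, and \s to ASCII characters; without it,
# Python also matches Unicode digits, letters, and whitespace.
digits = re.findall(r"\d+", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)

# Characters that are neither word characters nor whitespace.
symbols = re.findall(r"[^\w\s]", text, flags=re.ASCII)

# Split on runs of whitespace (the space and the tab here).
pieces = re.split(r"\s+", text, flags=re.ASCII)

# \x41 names a character by its hexadecimal value (0x41 is "A").
hex_matches = re.findall(r"\x41", "ABCA")

print(digits)       # ['42', '19', '99', '2024', '01', '15']
print(words)        # ['user_42', 'paid', '19', '99', 'on', '2024', '01', '15']
print(symbols)      # ['$', '.', '-', '-']
print(pieces)       # ['user_42', 'paid', '$19.99', 'on', '2024-01-15']
print(hex_matches)  # ['A', 'A']
```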

Using POSIX Character Classes

Metacharacters and shortcuts for various character sets would not be complete without a mention of the POSIX character classes. These are yet another form of shortcut that is supported by many (but not all) regular expression implementations.

POSIX Character Classes:

[:alnum:] Any letter or digit (same as [a-zA-Z0-9])
[:alpha:] Any letter (same as [a-zA-Z])
[:blank:] Space or tab (same as [\t ])
[:cntrl:] ASCII control characters (ASCII 0 through 31 and 127)
[:digit:] Any digit (same as [0-9])
[:graph:] Same as [:print:] but excludes space
[:lower:] Any lowercase letter (same as [a-z])
[:print:] Any printable character
[:punct:] Any character that is in [:graph:] but not in [:alnum:]
[:space:] Any whitespace character including space (same as [\f\n\r\t\v ])
[:upper:] Any uppercase letter (same as [A-Z])
[:xdigit:] Any hexadecimal digit (same as [a-fA-F0-9])

Below is an example of a regular expression that locates RGB values in a block of HTML code:

#[[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]]

Text : <BODY BGCOLOR="#336633" TEXT="#FFFFFF"
MARGINWIDTH="0" MARGINHEIGHT="0"
TOPMARGIN="0" LEFTMARGIN="0">

Matches : #336633 and #FFFFFF

Note : Notice that each set in the regular expression starts with [[ and ends with ]] (two sets of brackets). This is important and required when using POSIX classes. POSIX classes are enclosed within [: and :]; the POSIX class we used is [:xdigit:] (not :xdigit:). The outer [ and ] define the set; the inner [ and ] are part of the POSIX class itself.
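Python's built-in re module is one of the implementations that does not support POSIX classes. As a hedged sketch, the same search can be written there by substituting the explicit set that the table gives for [:xdigit:]; the names XDIGIT, pattern, html, and colors are chosen for this example only.

```python
import re

# Explicit set equivalent to [[:xdigit:]], taken from the table above.
XDIGIT = "[a-fA-F0-9]"

# "#" followed by six hexadecimal digits, mirroring the POSIX pattern.
pattern = "#" + XDIGIT * 6

html = '''<BODY BGCOLOR="#336633" TEXT="#FFFFFF"
MARGINWIDTH="0" MARGINHEIGHT="0"
TOPMARGIN="0" LEFTMARGIN="0">'''

colors = re.findall(pattern, html)
print(colors)  # ['#336633', '#FFFFFF']
```

Repeating the set six times keeps the pattern structurally identical to the POSIX version, so the two can be compared side by side.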