Monday, February 7, 2011

Regex are AWESOME

So today in class we saw regular expressions (widely known as regex). They are expressions used to check whether a piece of text follows a certain pattern. A regex is usually written as a string of characters representing the expression. Regexes are largely language agnostic: the same pattern will work in most languages, provided the language has a regex library, though some details of the syntax vary from one library to another.

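A regex looks like this: "^[a-z]\d[a-z][\s-]?\d[a-z]\d$" .
This particular regex checks whether a string represents a Canadian postal code (they look like H2Y 8D7, with a space, a dash or nothing in the middle). Note that [a-z] only covers lowercase letters, so to accept H2Y 8D7 as written the regex has to be run with the case-insensitive option (or the ranges written as [A-Za-z]).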

How does it work? Like this: a regular expression is built from ordinary characters and special symbols. How you arrange them determines how the target string will be evaluated, rather like keywords and expressions in a programming language.

The first set of symbols is the [a-z] range. Square brackets define a character class, and a range inside them stands for any single character between its two endpoints. Here that means any one lowercase letter from a to z.

The second set of symbols is \d . \d is a predefined character class that matches any single digit. It is the same as [0-9] but shorter (some regex libraries also let \d match digits from other scripts). Then comes another a-to-z range. These three expressions check the first three characters of the target string. The last three are checked the same way, only in the opposite order: digit, letter, digit.
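To make the letter, digit, letter idea concrete, here is a minimal sketch in Python's `re` module (my choice for illustration, not the .NET program from class). The name `starts_like_postal_code` is made up for this example.

```python
import re

# First three characters of a postal code: letter, digit, letter.
# No case-insensitive flag here, so [a-z] really means lowercase only.
PREFIX = re.compile(r"[a-z]\d[a-z]")

def starts_like_postal_code(text):
    """Return True if text begins with lowercase letter, digit, lowercase letter."""
    # re.match only looks at the beginning of the string.
    return PREFIX.match(text) is not None

print(starts_like_postal_code("h2y"))  # True
print(starts_like_postal_code("H2Y"))  # False: [a-z] does not cover uppercase
print(starts_like_postal_code("hhy"))  # False: second character is not a digit
```

The uppercase case failing is the reason a case-insensitive option is needed for codes written like H2Y 8D7.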

One small problem we face is the middle separator, or the absence of it: the fourth character may be a space, a dash, or missing entirely. Several operators (called quantifiers) let us say how many times a character may occur.

  • * : The asterisk matches 0, 1 or several occurrences of the previous character. A* will match A, AA, AAA, AAAA, any longer run of A's, and also the empty string.
  • + : The plus sign matches 1 or several occurrences of the previous character (or of the previous "class", an expression between brackets [ ]). H+ will match H, HH, HHH and any longer run of consecutive H's, but not the empty string.
  • ? : The question mark matches 0 or 1 occurrence of the previous character or class. It tests whether something is there or not. This is what we need to test if there is a space, a dash or nothing.
The expression used to test this condition is [\s-]? . It matches one whitespace character or one dash if present, and nothing otherwise. Whitespace is represented by \s (which also covers tabs and line breaks, not only the space character).
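The three quantifiers can be tried out with a small sketch like the one below, again in Python's `re` module as an assumption on my part. The helper `whole_match` is a made-up name; it uses `re.fullmatch` so that the entire string has to fit the pattern.

```python
import re

def whole_match(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# * : zero or more occurrences
print(whole_match(r"A*", ""))       # True
print(whole_match(r"A*", "AAA"))    # True

# + : one or more occurrences
print(whole_match(r"H+", "H"))      # True
print(whole_match(r"H+", "HHH"))    # True
print(whole_match(r"H+", ""))       # False

# ? : zero or one occurrence of the class [\s-]
print(whole_match(r"[\s-]?", " "))  # True
print(whole_match(r"[\s-]?", "-"))  # True
print(whole_match(r"[\s-]?", ""))   # True
print(whole_match(r"[\s-]?", "--")) # False: two separators is one too many
```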

Lastly, you can see symbols in the first and last position of the regex. These make sure the value we check does not contain any other characters before or after the pattern. The ^ symbol at the beginning of an expression checks that the string *starts* with the pattern, and the $ at the end checks that it *ends* with it. Using both, we make sure the whole string matches our pattern and nothing else.
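Here is a rough sketch of what the anchors buy us, using the postal code regex from the top of the post in Python's `re` module. The `re.IGNORECASE` flag is my assumption so that uppercase codes fit the [a-z] ranges, and the names `POSTAL_CODE`, `UNANCHORED` and `is_postal_code` are invented for the example.

```python
import re

# The postal code pattern from the post, anchored with ^ and $.
POSTAL_CODE = re.compile(r"^[a-z]\d[a-z][\s-]?\d[a-z]\d$", re.IGNORECASE)

# The same pattern without the anchors, for comparison.
UNANCHORED = re.compile(r"[a-z]\d[a-z][\s-]?\d[a-z]\d", re.IGNORECASE)

def is_postal_code(text):
    """Return True if the whole string looks like a Canadian postal code."""
    return POSTAL_CODE.search(text) is not None

for candidate in ["H2Y 8D7", "H2Y-8D7", "H2Y8D7", "xxH2Y 8D7xx"]:
    print(candidate, is_postal_code(candidate))
# H2Y 8D7 True
# H2Y-8D7 True
# H2Y8D7 True
# xxH2Y 8D7xx False

# Without ^ and $, the pattern is found inside the longer string.
print(UNANCHORED.search("xxH2Y 8D7xx") is not None)  # True
```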

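An example :
We want to check if a string is a Canadian permanent code. These codes are made of four letters and eight digits, with an optional space or dash after the letters and another between the two groups of four digits. {4} is a quantifier meaning exactly four occurrences of the previous class.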

Regular Expression :
^[a-z]{4}[-\s]?\d{4}[-\s]?\d{4}$

Test strings : 
SIGH03129117
DUG712345673
SIGH 0312 9117
SIGH 03129117
SIGH0312 9117
1234DUGH9154

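Using the expression above (with the case-insensitive option, since the test strings are uppercase), we can see that the second and last strings do NOT fit: DUG712345673 has only three letters before the digits, and 1234DUGH9154 starts with digits. We found that out just by checking each string against a pattern, which is much less hassle than going letter by letter with a loop.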
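For anyone who wants to replay the test, a minimal sketch in Python's `re` module could look like this. It is not the .NET program from class; the `re.IGNORECASE` flag is an assumption needed because the test strings are uppercase, and `is_permanent_code` is a made-up name.

```python
import re

# The permanent code pattern from the post.
PERMANENT_CODE = re.compile(r"^[a-z]{4}[-\s]?\d{4}[-\s]?\d{4}$", re.IGNORECASE)

TEST_STRINGS = [
    "SIGH03129117",
    "DUG712345673",
    "SIGH 0312 9117",
    "SIGH 03129117",
    "SIGH0312 9117",
    "1234DUGH9154",
]

def is_permanent_code(text):
    """Return True if the whole string fits the permanent code pattern."""
    return PERMANENT_CODE.match(text) is not None

for candidate in TEST_STRINGS:
    print(f"{candidate!r}: {is_permanent_code(candidate)}")
# 'SIGH03129117': True
# 'DUG712345673': False
# 'SIGH 0312 9117': True
# 'SIGH 03129117': True
# 'SIGH0312 9117': True
# '1234DUGH9154': False
```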

Here is the program I made in class today; it contains many good examples. It runs on the .NET Framework 4, as does most of what I do anyway. Don't forget to grab the latest version.

EDIT : Testing a regex can be a pain in the ass. This is why I use a tool; it's here if anyone would like to try it. It is an AWESOME regex testing tool and I recommend it to anyone interested.
