Learning .NET Regular Expressions, Regex Part-1

By Pritam
In regex
Feb 5th, 2012
Regex, short for regular expression, is a flexible and concise way of matching strings and recognizing specific patterns of characters or words in a text. Regex proves to be a strong tool whenever you need to do pattern matching in a large pool of text.
Regex is essentially an elaborate extension of the wildcards we normally use in searching (for example, ‘*.docx’ matches all files with the .docx extension).
Usage: While writing programs or web pages that manipulate text, it is frequently necessary to locate strings that match complex patterns. Regular expressions were invented to describe such patterns.
Thus a regular expression is just shorthand code for a pattern. In this tutorial I will walk you through how regex works and how to write regex strings to match a desired pattern.

A regular expression is basically a string of characters, dots, braces, numbers, special characters, etc., written in a fashion to match a specific pattern.

A regex may vary from a single literal character, e.g. ‘a’ (it will match the first occurrence of that character in the string; if the string is ‘Jack is a boy’, it will match the ‘a’ after the ‘J’), to something more complex, e.g. ‘\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b’ (with case ignored, it will match most common email addresses). This regex may scare you for a moment, but regular expressions aren’t as complex as they look.
The best way to learn regex is to start writing and experimenting.
Let’s start with some simple examples:
  • Find a word
    • top (find “top”)
      • This is a perfectly valid regular expression that searches for an exact sequence of characters. In .NET, you can easily set options to ignore the case of characters, so this expression will match “Top”, “TOP”, or “tOp”. Unfortunately, it will also match the last three letters of the word “stop”. We can improve the expression as follows:
    • \btop\b (find “top” as a whole word)
      • The “\b” is a special code that means “match the position at the beginning or end of any word”. With case ignored, this expression will only match complete words spelled “top”, in any combination of lowercase and capital letters.
    • \btop\b.*\bblog\b (find text with “top” followed by “blog”)
      • The regex will find all lines in which the word “top” is followed by the word “blog”. The period or dot “.” is a special code that matches any character other than a newline. The asterisk “*” means “repeat the previous term zero or more times, as many as necessary to get a match”. Thus, “.*” means “match any number of characters other than newline”. It is now a simple matter to build an expression that means “search for the word ‘top’ followed on the same line by the word ‘blog’”.
  • Find a valid phone number
    • \b\d\d\d-\d\d\d\d Find a seven-digit phone number
      • Each “\d” means “match any single digit”. The “-” has no special meaning and is interpreted literally, matching a hyphen. To avoid the annoying repetition, we can use a shorthand notation that means the same thing:
    • \b\d{3}-\d{4} Find a seven-digit phone number, a better way
      • The “{3}” following the “\d” means “repeat the preceding character three times”.
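As a minimal sketch, the word and phone-number patterns can be tried out in a few lines. Python's `re` module is used here purely for illustration (an assumption, since this article targets .NET); for these patterns it behaves the same way, with `re.IGNORECASE` playing the role of .NET's `RegexOptions.IgnoreCase`. The sample sentence is made up.

```python
import re

# Made-up sample text for illustration only.
text = "Stop by the Top blog today, or call 555-1234."

# Plain "top" with case ignored also hits the tail of "Stop".
loose = re.findall(r"top", text, re.IGNORECASE)

# \b pins the match to word boundaries, so only the whole word is found.
whole = re.findall(r"\btop\b", text, re.IGNORECASE)

# "top" followed later on the same line by "blog".
pair = re.search(r"\btop\b.*\bblog\b", text, re.IGNORECASE)

# The long and the short phone patterns find the same number.
long_form = re.findall(r"\b\d\d\d-\d\d\d\d", text)
short_form = re.findall(r"\b\d{3}-\d{4}", text)

print(loose)         # ['top', 'Top']
print(whole)         # ['Top']
print(pair.group())  # Top blog
print(short_form)    # ['555-1234']
```

The raw-string prefix `r"..."` keeps Python from interpreting the backslashes itself; in C# a verbatim string (`@"..."`) serves the same purpose.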

Commonly used special characters

.     Match any character except newline
\w    Match any alphanumeric character
\s    Match any whitespace character
\d    Match any digit
\b    Match the beginning or end of a word
^     Match the beginning of the string
$     Match the end of the string
    • \ba\w*\b Find words that start with the letter a
      • This works by searching for the beginning of a word (\b), then the letter “a”, then any number of repetitions of alphanumeric characters (\w*), then the end of a word (\b).
    • \d+ Find repeated strings of digits
      • Here, the “+” is similar to “*”, except it requires at least one repetition.
    • \b\w{6}\b Find six-letter words
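The three shorthand patterns above can be checked against a sample sentence. This is only a sketch: it uses Python's `re` module for illustration (an assumption; the article itself is about .NET, where `Regex.Matches` would do the equivalent job), and the sentence is invented.

```python
import re

# Made-up sample text for illustration only.
text = "An apple a day keeps 12 doctors away since 1999, almost always."

# Words that start with the letter a (case ignored so "An" counts too).
a_words = re.findall(r"\ba\w*\b", text, re.IGNORECASE)

# Runs of one or more digits.
numbers = re.findall(r"\d+", text)

# Words made of exactly six word characters.
six = re.findall(r"\b\w{6}\b", text)

print(a_words)  # ['An', 'apple', 'a', 'away', 'almost', 'always']
print(numbers)  # ['12', '1999']
print(six)      # ['almost', 'always']
```

Note that “\w” also matches digits and the underscore, so “\b\w{6}\b” would accept a token such as “123456” as well.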
  • Match characters or words at the beginning or end of the text

The special characters “^” and “$” are used when looking for something that must start at the beginning of the text and/or end at the end of the text. This is especially useful for validating input in which the entire text must match a pattern. For example, to validate a seven-digit phone number, you might use:

    • ^\d{3}-\d{4}$ Validate a seven-digit phone number
      • This is the same as the earlier seven-digit phone number pattern, but forced to fill the whole text string, with nothing else before or after the matched text. By setting the “Multiline” option in .NET, “^” and “$” change their meaning to match the beginning and end of a single line of text, rather than the entire text string.
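A short sketch of the anchors in action follows. It is written with Python's `re` module as an illustration (an assumption, not this article's .NET code); `re.MULTILINE` corresponds to the .NET `RegexOptions.Multiline` option. The sample inputs are invented.

```python
import re

pattern = r"^\d{3}-\d{4}$"

# The whole string must be a seven-digit number and nothing more.
exact = bool(re.search(pattern, "555-1234"))
embedded = bool(re.search(pattern, "call 555-1234 now"))

# Made-up three-line input for illustration only.
lines = "555-1234\nnot a number\n867-5309"

# Default mode: ^ and $ refer to the start and end of the entire text.
whole_text = re.findall(pattern, lines)

# Multiline mode: ^ and $ also match at the start and end of each line.
per_line = re.findall(pattern, lines, re.MULTILINE)

print(exact)       # True
print(embedded)    # False
print(whole_text)  # []
print(per_line)    # ['555-1234', '867-5309']
```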
  • Escaped Characters
    • A problem occurs if you actually want to match one of the special characters, like “^” or “$”. Use the backslash to remove the special meaning. Thus, “\^”, “\.”, and “\\” match the literal characters “^”, “.”, and “\”, respectively.
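The effect of escaping is easiest to see side by side. The sketch below uses Python's `re` module for illustration (an assumption; the escaping rules shown are the same in .NET), with invented sample strings.

```python
import re

# Made-up sample text for illustration only.
text = "Price: $5.00 (was 5x00)"

# Unescaped, the dot matches any character, so "5x00" is found as well.
unescaped = re.findall(r"5.00", text)

# Escaped, the dot only matches a literal dot.
escaped = re.findall(r"5\.00", text)

# Escaping $ makes it match a dollar sign instead of the end of the text.
dollar = re.findall(r"\$5\.00", text)

# A literal backslash is written as \\ in the pattern.
slashes = re.findall(r"\\", r"C:\temp\log.txt")

print(unescaped)     # ['5.00', '5x00']
print(escaped)       # ['5.00']
print(dollar)        # ['$5.00']
print(len(slashes))  # 2
```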
  •  Repetitions
    • You’ve seen that “{3}” and “*” can be used to indicate repetition of a single character. Later, you’ll see how the same syntax can be used to repeat entire subexpressions. There are several other ways to specify a repetition, as shown in this table:

*        Repeat any number of times
+        Repeat one or more times
?        Repeat zero or one time
{n}      Repeat n times
{n,m}    Repeat at least n, but no more than m times
{n,}     Repeat at least n times

Let’s try a few more examples:
    • \b\w{5,6}\b Find all five- and six-letter words
    • \b\d{3}\s\d{3}-\d{4} Find ten-digit phone numbers
    • ^\w* The first word in the line or in the text
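These last three patterns can be exercised the same way. Again this is a sketch in Python's `re` module rather than .NET code (an assumption made for illustration), and the sample sentence is invented.

```python
import re

# Made-up sample text for illustration only.
text = "Please call 555 123-4567 before Friday evening"

# Words of five or six word characters.
words = re.findall(r"\b\w{5,6}\b", text)

# Ten-digit phone number: three digits, one whitespace character, then 3-4 digits.
phones = re.findall(r"\b\d{3}\s\d{3}-\d{4}", text)

# The first word of the text.
first = re.search(r"^\w*", text).group()

print(words)   # ['Please', 'before', 'Friday']
print(phones)  # ['555 123-4567']
print(first)   # Please
```

Because “*” allows zero repetitions, “^\w*” still succeeds with an empty match when the text starts with a non-word character; “^\w+” would require at least one.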

By now you should have an idea of how handy and powerful regex is. That is all for now. I will soon be back with the rest. Stay connected for the more complex regexes commonly used in real-world applications. The following is a list of some advanced topics in regex to be covered in the coming article:

  •  Character Classes
  • Negation
  • Alternatives
  • Grouping
  • Positive Look around
  • Negative Look around
  • Greedy and Lazy
  • Some commonly used regex expressions

About the Author

Pritam, co-founder of IdleBrains, is a software engineer by profession with expertise in .NET technologies and data structures. An avid reader and writer, he loves to keep himself well versed in new technologies. When not working, he can be found on the badminton court or chatting with friends. Among other hobbies, he loves listening to old Hindi numbers by Kishore Kumar and Mukesh.
