RegEx In Python
Let’s learn something new today about regular expressions in Python. In this tutorial we look at regular expressions, also called regex, regexp or re.
What is RegEx?
A regular expression (regex, regexp or re) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. Regular expressions are widely used in the world of UNIX.
Now let’s look at a simple regular expression through the following image.
The caret sign (^) serves two purposes. In this figure it appears at the start of a character set, where it negates the set: the pattern matches any character that is not an uppercase letter, lowercase letter, digit, underscore or space. In short, it matches the special characters in the given string. If we use the caret outside the square brackets, it matches only at the start of the string.
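The two roles of the caret can be sketched as follows. The pattern `[^A-Za-z0-9_ ]` and the sample text are assumptions based on the description above, not necessarily what the original figure showed.

```python
import re

text = "Hello, World_1!"

# Inside brackets, a leading ^ negates the set (assumed pattern):
# match anything that is NOT a letter, digit, underscore or space.
special_chars = re.findall(r"[^A-Za-z0-9_ ]", text)
print(special_chars)  # [',', '!']

# Outside brackets, ^ anchors the pattern to the start of the string.
starts_with_hello = re.search(r"^Hello", text) is not None
starts_with_world = re.search(r"^World", text) is not None
print(starts_with_hello, starts_with_world)  # True False
```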
An example of a “proper” email-matching regex (like the one in the exercise) is shown below:
The most common uses of regular expressions are:
- Searching a string (search and match)
- Finding all matches in a string (findall)
- Breaking a string into substrings (split)
- Replacing part of a string (sub)
The ‘re’ module provides full support for Perl-like regular expressions in Python. It raises the exception re.error if an error occurs while compiling or using a regular expression.
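A minimal sketch of catching re.error when a pattern is invalid. The helper name `is_valid_pattern` is hypothetical and used only for illustration.

```python
import re

def is_valid_pattern(pattern):
    """Hypothetical helper: report whether a pattern compiles."""
    try:
        re.compile(pattern)
    except re.error:
        # Raised for malformed patterns, e.g. an unclosed character set.
        return False
    return True

print(is_valid_pattern(r"[a-z]+"))  # True
print(is_valid_pattern(r"[a-z"))    # False
```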
The re module provides an interface to the regular expression engine, which lets you compile REs into objects and then perform matches with them. A regular expression is simply a sequence of characters that defines a search pattern. Python’s built-in “re” module provides excellent support for regular expressions, with a modern and complete regex flavor.
Now, let’s see how regular expressions can be used in Python. The very first step is to import the “re” module, which provides all the necessary functionality. This can be done with the statement import re, in any IDE.
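As a quick check that the module is available, the import can be followed by a trivial search. The sample string is made up for illustration.

```python
import re

# A trivial search to confirm that the module works.
found = re.search(r"Python", "RegEx in Python")
print(found.group())  # Python
```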
Metacharacters are characters, or sequences of characters, that hold a special meaning in a computing application, just like ‘*’ in wildcards. Some sets of characters are used to represent other characters, such as an unprintable character, or a logical operation. They are also known as “operators” and are mostly used to build an expression that can represent a required character in a string or a file.
Below is a list of the metacharacters and how to use them in a regular expression:
The first metacharacters we are going to explain are [ and ]. They are used to specify a character class, which is a set of characters that you wish to match.
Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them with a ‘-’. For instance, [abc] will match any of the characters a, b, or c; the same set can also be written as the range [a-c]. If you wanted to match only lowercase letters, your RE would be [a-z].
Let’s see what these characters mean:
Here, [abc] will match if the string you are trying to match contains any of a, b or c.
You can also specify a range of characters using - inside square brackets.
- [a-e] is the same as [abcde].
- [1-4] is the same as [1234].
- [0-9] is the same as [0123456789].
You can complement (invert) the character set by using the caret ^ symbol at the start of the square brackets.
- [^abc] means any character except a or b or c.
- [^0-9] means any non-digit character.
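The character sets above can be tried out with re.findall(), which is covered later in this tutorial. This is a minimal sketch; the sample string is an assumption chosen for illustration.

```python
import re

text = "cab 2024"

set_abc = re.findall(r"[abc]", text)      # any of a, b or c
range_1_4 = re.findall(r"[1-4]", text)    # same as [1234]
not_abc = re.findall(r"[^abc]", text)     # anything except a, b or c
non_digits = re.findall(r"[^0-9]", text)  # any non-digit character

print(set_abc)     # ['c', 'a', 'b']
print(range_1_4)   # ['2', '2', '4']
print(not_abc)     # [' ', '2', '0', '2', '4']
print(non_digits)  # ['c', 'a', 'b', ' ']
```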
The basic usages of commonly used metacharacters are shown in the following table:
For example, \$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way.
\ is used to match a character that has a special meaning. For example, ‘\.’ matches ‘.’, ‘\+’ matches ‘+’, etc.
We need to use ‘\\’ to match ‘\’. Regex recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode code point, and \Uhhhhhhhh for an 8-digit Unicode code point.
The following code example shows how the regex ‘.’ works:
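Here is a small sketch contrasting the unescaped ‘.’ with escaped metacharacters. The sample strings are assumptions; raw strings (r"...") are used so that Python itself does not interpret the backslashes.

```python
import re

text = "a.b+c"

any_char = re.findall(r".", text)       # unescaped . matches any character
literal_dot = re.findall(r"\.", text)   # escaped . matches only a dot
literal_plus = re.findall(r"\+", text)  # escaped + matches only a plus
dollar_a = re.search(r"\$a", "price: $a").group()
backslash = re.findall(r"\\", r"C:\temp")  # \\ matches one backslash

print(any_char)      # ['a', '.', 'b', '+', 'c']
print(literal_dot)   # ['.']
print(literal_plus)  # ['+']
print(dollar_a)      # $a
print(backslash)     # ['\\']
```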
Other Special Sequences
Some special sequences make commonly used patterns easier to write. Below is a list of such special sequences:
Understanding special sequences with examples
\A – Matches if the specified characters are at the start of a string.
\b – Matches if the specified characters are at the beginning or end of a word.
\B – Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.
\d – Matches any decimal digit. Equivalent to [0-9]
\D – Matches any character that is not a decimal digit. Equivalent to [^0-9]
\s – Matches where a string contains any white space character. Equivalent to [ \t\n\r\f\v].
\S – Matches where a string contains any non-white space character. Equivalent to [^ \t\n\r\f\v].
\w – Matches any alphanumeric character (digits and letters) or underscore. Equivalent to [a-zA-Z0-9_]; the underscore _ is also treated as a word character.
\W – Matches any character that is not alphanumeric or underscore. Equivalent to [^a-zA-Z0-9_]
\Z – Matches if the specified characters are at the end of a string.
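The special sequences above can be sketched with a few short calls. The sample strings are assumptions chosen for illustration.

```python
import re

text = "The year 2024 ends"

digits = re.findall(r"\d", text)             # decimal digits
spaces = re.findall(r"\s", text)             # white space characters
starts = bool(re.search(r"\AThe", text))     # "The" at the start of the string
ends = bool(re.search(r"ends\Z", text))      # "ends" at the end of the string
word_start = re.findall(r"\bye", text)       # "ye" at the beginning of a word
inside_word = re.findall(r"\Bea", text)      # "ea" not at a word boundary
non_digits = re.findall(r"\D", "a1b2")       # everything except digits
word_chars = re.findall(r"\w", "a_b-c!")     # letters, digits, underscore
non_word = re.findall(r"\W", "a_b-c!")       # everything else

print(digits)       # ['2', '0', '2', '4']
print(spaces)       # [' ', ' ', ' ']
print(starts, ends) # True True
print(word_start)   # ['ye']
print(inside_word)  # ['ea']
print(non_digits)   # ['a', 'b']
print(word_chars)   # ['a', '_', 'b', 'c']
print(non_word)     # ['-', '!']
```

Note that the bracketed equivalents listed above hold exactly for ASCII text. With ordinary (str) patterns, Python’s \d, \w and \s also match Unicode digits, letters and white space unless the re.ASCII flag is used.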
Module-Level Functions
The ‘re’ module provides many top-level functions; the most commonly used are match(), search(), findall(), sub(), split() and compile().
These functions take the regular expression pattern as the first argument and, in most cases, the string to apply it to as the second. match() and search() return either None or a match object; the other functions return lists, strings or compiled pattern objects, as described below. The module stores compiled patterns in a cache, so future calls using the same regular expression do not need to parse the pattern again.
We will explain some of these functions in the sections below.
1. re.match() – The match() function matches only at the beginning of the string. In the following example, match() matches the first character of the given string if it is a digit, a lowercase or uppercase letter, or an underscore.
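A minimal sketch of re.match(), assuming the character set [a-zA-Z0-9_] and the sample string used later in this tutorial:

```python
import re

# Matches one letter, digit or underscore at the very start of the string.
first = re.match(r"[a-zA-Z0-9_]", "Welcome to programming")
print(first.group())  # W

# match() returns None when the string does not start with a match.
no_match = re.match(r"[a-zA-Z0-9_]", "!Welcome")
print(no_match)  # None
```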
If we add ‘+’ after the character set, it matches one or more repetitions of the characters in the set. In the following example, ‘+’ matches one or more uppercase letters, lowercase letters, digits and underscores (white space excluded).
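With ‘+’ after the same assumed character set, the match extends until the first character outside the set, which here is the space:

```python
import re

# One or more letters, digits or underscores from the start of the string.
one_or_more = re.match(r"[a-zA-Z0-9_]+", "Welcome to programming")
print(one_or_more.group())  # Welcome
```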
‘*’ is a quantifier that matches the regex preceding it zero or more times. Let’s understand it via the example below. In the given string (‘Welcome to programming’), ‘*’ matches as many of the characters given in the regex as possible.
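A short sketch of ‘*’, again assuming the character set [a-zA-Z0-9_]. Unlike ‘+’, it still succeeds when there are zero repetitions, returning an empty match.

```python
import re

pattern = r"[a-zA-Z0-9_]*"

# Greedy: takes as many characters from the set as possible.
zero_or_more = re.match(pattern, "Welcome to programming").group()
print(zero_or_more)  # Welcome

# Zero repetitions is still a match, so the result is an empty string.
empty_match = re.match(pattern, "!hello").group()
print(repr(empty_match))  # ''
```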
If we put ‘*’ inside the character set, it loses its special meaning and the set simply also accepts a literal ‘*’. Since in the following example the string does not begin with ‘*’, the set matches the first character and the result is ‘W’.
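The following sketch assumes the set [a-zA-Z0-9_*], where ‘*’ is just one more literal character in the set. The second sample string is made up for illustration.

```python
import re

pattern = r"[a-zA-Z0-9_*]"  # * inside brackets is a literal asterisk

star_in_set = re.match(pattern, "Welcome to programming").group()
print(star_in_set)  # W

literal_star = re.match(pattern, "*args").group()
print(literal_star)  # *
```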
The quantifier ‘?’ matches zero or one of whatever precedes it. In the following example, ‘?’ matches at most one uppercase or lowercase letter or underscore at the beginning of the string.
The ‘re’ module also offers a set of functions that let you search a string for a match anywhere in it. Let’s see what these functions do.
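A minimal sketch of ‘?’, assuming the set [a-zA-Z_]. The second sample string is an assumption used to show the zero-occurrence case.

```python
import re

pattern = r"[a-zA-Z_]?"

# At most one letter or underscore is consumed.
optional_letter = re.match(pattern, "Welcome to programming").group()
print(optional_letter)  # W

# No letter at the start: ? allows zero occurrences, giving an empty match.
no_letter = re.match(pattern, "42 is the answer").group()
print(repr(no_letter))  # ''
```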
2. re.search() – It is mainly used to search for a pattern in a text. The function re.search() takes a regex pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object, and None otherwise. The syntax of re.search() is re.search(pattern, string, flags=0).
You can understand it better with the following example.
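Here is a small sketch of re.search(), with a made-up sample string. It also shows how search() differs from match(), which only looks at the start of the string.

```python
import re

text = "Order 66 shipped in 3 days"

found = re.search(r"\d+", text)  # first run of digits anywhere in the text
print(found.group())  # 66
print(found.span())   # (6, 8)

missing = re.search(r"xyz", text)
print(missing)  # None

# match() fails here because the text does not start with a digit.
at_start = re.match(r"\d+", text)
print(at_start)  # None
```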
3. re.findall() – Returns a list containing all matches. Use re.findall() when you want to collect every match in a string, or in each line of a file, in a single step. The string is scanned left to right, and matches are returned in the order found. The syntax of re.findall() is re.findall(pattern, string, flags=0).
Below is an example of the re.findall() function.
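A minimal sketch of re.findall() on made-up sample strings. When the pattern contains groups, each list item is a tuple of the groups rather than the whole match.

```python
import re

numbers = re.findall(r"\d+", "Order 66 shipped in 3 days")
print(numbers)  # ['66', '3']

# With two groups in the pattern, findall() returns a list of tuples.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")
print(pairs)  # [('a', '1'), ('b', '22')]
```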
4. re.split() – Splits the string at each occurrence of the pattern and returns the pieces as a list. The syntax of re.split() is re.split(pattern, string, maxsplit=0, flags=0).
Look at the following example of the re.split() function:
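The sketch below splits on commas with optional surrounding white space. The sample string is an assumption chosen to show uneven spacing.

```python
import re

text = "red , green,blue ,  yellow"

colors = re.split(r"\s*,\s*", text)
print(colors)  # ['red', 'green', 'blue', 'yellow']

# maxsplit limits the number of splits; the rest stays in the last item.
first_split = re.split(r"\s*,\s*", text, maxsplit=1)
print(first_split)  # ['red', 'green,blue ,  yellow']
```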
5. re.sub() – Replaces one or many matches with a string. It is used to replace substrings: each match in the string is replaced with the replacement value. The syntax of re.sub() is re.sub(pattern, repl, string, count=0, flags=0).
The following example replaces all the digits in the given string with an empty string.
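A minimal sketch of removing digits with re.sub(); the input string is made up for illustration.

```python
import re

text = "a1b22c333"

no_digits = re.sub(r"\d", "", text)  # replace every digit with ""
print(no_digits)  # abc

# count limits how many matches are replaced.
first_only = re.sub(r"\d", "", text, count=1)
print(first_only)  # ab22c333
```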
6. re.compile() – We can compile a pattern into a pattern object with the function re.compile(). The pattern object has methods for operations such as searching for pattern matches or performing string substitutions.
In the following example, the compile function compiles the given regex, and then the code asks the user to enter a name. If the user enters any digit or special character, the compiled pattern won’t match and the code asks for input again. It continues until the user enters a name containing letters only.
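The behaviour described above could be sketched like this. The pattern [A-Za-z]+, the function name `ask_for_name` and the `read_line` parameter are assumptions for illustration. To keep the sketch runnable without a keyboard, canned replies stand in for the user; calling ask_for_name() with no argument would read real input instead.

```python
import re

# Assumed pattern: one or more ASCII letters and nothing else.
NAME_PATTERN = re.compile(r"[A-Za-z]+")

def ask_for_name(read_line=input):
    """Keep asking until the reply consists of letters only."""
    while True:
        name = read_line("Enter your name: ")
        # fullmatch() requires the whole reply to fit the pattern.
        if NAME_PATTERN.fullmatch(name):
            return name
        print("Please use letters only.")

# Canned replies in place of keyboard input: two rejected, one accepted.
replies = iter(["R2D2", "john!", "John"])
name = ask_for_name(lambda prompt: next(replies))
print("Hello,", name)  # Hello, John
```

Compiling once and reusing NAME_PATTERN avoids re-parsing the pattern on every loop iteration.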
The output of the code above is as follows:
Now that we have a rough understanding of what RegEx is and how regex works in Python, we can move on to something more technical. It’s time to get a small project up and running.