Python Regex Groups, Match and Search

Hey, Roberts Greibers here. I’ll give you a brief intro about my 7-year experience as a Python developer in just a second. 

But now I want you to scroll down and dive right into Python regex groups and the Python regex example I explain in this blog post. 

The following regex Python example is going to show you: ⚠️

  • What import re actually is in Python and how you can use it!
  • A real-life Python regex example (sent in by a reader of this blog post)

If you’ve been Google searching for Python re examples, you’ve found the RIGHT place!

Most regular expression Python examples you’ll find online are very theoretical, showing you just the concept of a regex capturing group – don’t be afraid if they don’t make sense yet, I’ll explain the best way I know of using regex in Python in this post. 

The way I go about writing these posts is by taking real situations and explaining them in detail 🚀

And if you really think about it…

It’s the ONLY way for you to truly understand regex Python examples and use them for your benefit. Feel free to copy and use the following regex code example.

So, if getting deeper into a regular expression Python example sounds interesting to you – keep on reading!

Down below I’m explaining the whole story of how I ended up in a situation in Python where a regex capture group was part of the solution! 👇🏻

What Are Regular Expressions In Python?

Regular expressions are one of the most important tools you’ll have to learn as a Python developer if you’re dealing with log file parsing, text file parsing, or any other kind of text extraction from existing string content. 

A situation where you want to extract only a certain part of a string (like a specific ID value, number, IBAN, etc.) will appear in a lot of different industries related to Python development. 

🚨 In many Python developer interviews you’ll be asked to at least explain the concept of regex, or even to write a regular expression example in Python. 

So I’d really recommend you pay attention to the regex Python example I’m about to share with you in this post. 

What Is import re In Python?

import re in Python SIMPLY means you’re loading re, the regular expression module from Python’s standard library, into the script you’re currently working on. It ships with Python, so there’s nothing extra to install.

Once you’ve included the following line at the top of your Python script, the regular expression functions will be available to you.

import re
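
To make that a bit more concrete, here’s a minimal sketch of what the re module gives you once it’s imported. The sample line is a shortened, hypothetical piece of the log format used later in this post, and the pattern is only an illustration, not the one used in the final solution.

```python
import re

# Hypothetical sample, shortened from the log format used later in this post
line = 'EndTime: 2010-08-23 08:28:21 Duration: 60 BillableSeconds: 0'

# re.search() scans the whole string and returns a match object (or None)
found = re.search(r'Duration: (\d+)', line)

# re.match() only tries to match at the very start of the string
at_start = re.match(r'Duration: (\d+)', line)

# group(0) is the whole matched text, group(1) is the first (...) group
whole_match = found.group(0) if found else None
duration = found.group(1) if found else None

print(whole_match)
print(duration)
print(at_start)
```

This is also the practical difference between match and search: for values that sit somewhere in the middle of a log line, re.search() is usually the one you want.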

Real Regular Expression Situation In Python

Here’s a quick example of the type of log file you’re going to deal with:

Event: Cdr Privilege: cdr,all AccountCode: Source:491454490 Destination:1545454572877 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup
LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:28:21 Duration: 60 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S

Event: Cdr Privilege: cdr,all AccountCode: Source: Destination: 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup
LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:27:21 Duration: 0 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S

As you can see, the problem here is that one entry (let’s call Event + LastData one entry) is separated into multiple file lines.

Usually, you would extract all the necessary information from one line (as you iterate through the file line by line with a Python for loop), but now you have to come up with a way to collect information for one entry from multiple lines.

And I can see how it could be challenging for someone just starting out with Python.

Read file and remove empty lines

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

Open a log file

As you might already know, there are a couple of ways to open a file in Python, but the simplest way to open a file for reading is to use a with statement together with Python’s built-in open() function. The with statement also closes the file for you once the block ends.

Change the 'path/to/logfile.txt' part if you want to open your own file; the code above is just an example. The second argument, 'r', opens the file in reading mode (it’s also the default mode, so open() without it behaves the same way). And finally, the as file: part names the variable that will hold the file object.

Remember, in Python, everything is an object, so an opened file is an object in memory too, in this case stored in a variable called file. You can use the file variable to work with the actual file: read it, update it, and so on. In this case, of course, you want to read the file.
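
If you’d like to try this without a real log file, here’s a small sketch that writes a throwaway file first and then reads it back. The temporary directory, the file content, and the encoding='utf-8' argument are my assumptions for the demo; they’re not part of the original script.

```python
import os
import tempfile

# Create a throwaway directory so the example does not touch real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'logfile.txt')

    # Write a tiny, made-up log: two text lines separated by an empty line
    with open(path, 'w', encoding='utf-8') as file:
        file.write('Event: Cdr\n\nLastData: StartTime\n')

    # Open the same file in reading mode and read every line into a list
    with open(path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Leaving the with block closes the file automatically
    closed_after_with = file.closed

print(lines)
print(closed_after_with)
```

Notice that every line still ends with \n and that the empty line shows up as a lone '\n' string.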

Remove empty lines from a log file

Next step: before you start parsing, you need to remove empty lines from the log file. The way I’ve decided to do it here is with a list comprehension, which is a shortcut for using a for loop to create a list. Depending on how large your log file is, you might want to take a look at using the yield keyword in combination with a for loop.

If your log file is small, ignore the yield part. Otherwise, I’d recommend checking it out, because it can drastically reduce how much memory the parsing process needs. In short, a function that uses yield is a generator: it hands back one line at a time as the caller asks for it, instead of building the whole list in memory first.
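
A generator version of the empty-line filter might look like the sketch below. The function name iter_non_empty_lines is hypothetical, and I’m using rstrip('\n') instead of replace('\n', ''), which gives the same result for lines read from a file. A plain list stands in for the open file so the example runs on its own.

```python
from typing import Iterable, Iterator


def iter_non_empty_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line without its trailing new-line, skipping empty ones."""
    for line in lines:
        stripped = line.rstrip('\n')
        if stripped:
            # Hand one line back to the caller, then pause until the next one is requested
            yield stripped


# A list stands in for an open file here; a file object can be passed the same way
sample = ['Event: Cdr\n', '\n', 'LastData: StartTime\n']
non_empty = list(iter_non_empty_lines(sample))
print(non_empty)
```

With a real file you would pass the file object itself (for example iter_non_empty_lines(file) inside the with block), so the lines are read one by one instead of all at once.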

Back to a list comprehension, if you take a closer look at the following code part:

not_empty_lines = [
    line.replace('\n', '')
    for line in file.readlines()
    if line.replace('\n', '')
]

It translates to a very simple for loop with a simple if statement (the := operator needs Python 3.8 or newer):

not_empty_lines = []
for line in file.readlines():
    if solid_line := line.replace('\n', ''):
        not_empty_lines.append(solid_line)

file.readlines() returns all lines of the file as a list of strings, each one still ending with its new-line character. line.replace('\n', '') removes that new-line ( \n ) character from the line, so a line that contained nothing else becomes an empty string.

An empty string is falsy in Python (bool('') == False), while any non-empty string is truthy. So if a line still evaluates to True after the new-line character is removed, you can treat it as a solid_line with text in it, one that can be parsed properly. The if condition is what actually drops the empty lines.
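
Here’s a quick sketch of that truthiness check on three made-up lines. One thing to watch out for, and this is my addition rather than part of the original script: a line that contains only spaces is still truthy after replace('\n', ''), so if your log can contain such lines, strip() is the safer check.

```python
# Three made-up lines: text, a truly empty line, and a whitespace-only line
samples = ['Event: Cdr\n', '\n', '   \n']

# Same check as in the list comprehension: remove the new-line, then test truthiness
kept_with_replace = [bool(raw.replace('\n', '')) for raw in samples]

# strip() removes all surrounding whitespace, so whitespace-only lines become empty too
kept_with_strip = [bool(raw.strip()) for raw in samples]

print(kept_with_replace)
print(kept_with_strip)
```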

Group log file lines with zip() function

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

for event, last_data in zip(not_empty_lines[::2], not_empty_lines[1::2]):
    print(f'event: {event}')
    print(f'last_data: {last_data}')

Output when using zip()

Since you’re dealing with two lines and want to extract information from both of those lines just to collect the necessary data for one entry, you need to come up with a way to somehow zip two lines together.

Well, in Python the above can be done using a built-in function called zip(). Take a look at what the zip() function gives you.

Code example for zip() function usage:

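import json

# Even-indexed lines are "Event" lines, odd-indexed lines are "LastData" lines
zipped = zip(not_empty_lines[::2], not_empty_lines[1::2])
zipped_tuple = tuple(zipped)
# json.dumps() turns tuples into JSON arrays, shown as [...] below
zipped_pretty = json.dumps(zipped_tuple, indent=4)
print(zipped_pretty)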

Output:

[
    [
        "Event: Cdr Privilege: cdr,all AccountCode: Source:491454490 Destination:1545454572877 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup",
        "LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:28:21 Duration: 60 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S"
    ],
    [
        "Event: Cdr Privilege: cdr,all AccountCode: Source: Destination: 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup",
        "LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:27:21 Duration: 0 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S"
    ]
]

As you can see, zip() gives you the lines grouped in pairs. This is exactly what you need to parse the whole log file two lines at a time.

  • List #1 represents non-empty lines 1 and 2 of logfile.txt
  • List #2 represents non-empty lines 3 and 4 of logfile.txt
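
If the [::2] and [1::2] slices look cryptic, the sketch below shows what they do on a short made-up list. It also shows one behaviour worth knowing: zip() stops at the shorter input, so a leftover line without a partner is silently dropped.

```python
# Made-up stand-ins for the non-empty log lines
lines = ['event 1', 'last_data 1', 'event 2', 'last_data 2']

events = lines[::2]            # every second item, starting at index 0
last_data_lines = lines[1::2]  # every second item, starting at index 1

pairs = list(zip(events, last_data_lines))
print(pairs)

# With an odd number of lines, the unpaired last line is dropped by zip()
odd_lines = ['event 1', 'last_data 1', 'event 2']
odd_pairs = list(zip(odd_lines[::2], odd_lines[1::2]))
print(odd_pairs)
```

So this approach assumes the log always alternates Event and LastData lines; if an entry could be incomplete, you’d want to check for that before pairing.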

Use regex to parse log file lines

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

for event, last_data in zip(not_empty_lines[::2], not_empty_lines[1::2]):
    print(f'event: {event}')
    print(f'last_data: {last_data}')

    # How to extract a value from an event or last_data line?

Now you have to start to think about how you could work with event and last_data lines.

Each line contains specific information you want to extract with regex. From the logfile.txt above it’s pretty clear you want to extract source, destination, start and end values for each entry.

So, the question here is – how can you extract the above values from each entry?

Well, over the past couple of years, while using regex for a bunch of different situations I’ve come up with a very good “go-to” way of using regex.

It consists of the following steps:

  • Use regex101.com site to come up with a valid regex pattern and check if it works right then and there.
  • For the regex pattern, use a group (meaning, use parentheses) and the simplest pattern that can be applied in almost any situation: (.+). You only change the surrounding text and extract the part matched by (.+)
  • Use the following Regex class with one method for extracting information (you should only change regex pattern and use the same regex method over and over again)
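
To show the (.+) idea from the steps above in isolation, here’s a minimal sketch. The sample line is cut from the end of a LastData line, and the Rate pattern is just an illustration I picked; it isn’t one of the patterns used in the script. The .strip() call is also my addition.

```python
import re

# The tail end of a LastData line from the sample log
line = 'UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S'

# Fixed text on both sides, (.+) in the middle captures whatever sits between them
pattern = r'Rate:(.+) Carrier:'

match = re.search(pattern, line)
raw_rate = match.group(1) if match else None

# The captured text includes the space after "Rate:", so strip it if you need a clean value
rate = raw_rate.strip() if raw_rate is not None else None

print(repr(raw_rate))
print(repr(rate))
```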

How to use regex101.com

import re
import json

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

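class Regex:

    # Each pattern captures whatever sits between two known labels
    _source = r'Source:(.+) Destination:'
    _destination = r'Destination:(.+) DestinationContext:'
    _start = r'LastData: StartTime:(.+) AnswerTime:'
    _end = r'EndTime:(.+) Duration:'

    def xtract_with_regex(self, line: str, regex: str):
        """Return the first captured group, or None when nothing matches."""
        rg = re.compile(regex, re.IGNORECASE | re.DOTALL)
        match = rg.search(line)

        try:
            return match.group(1)
        # match is None (AttributeError) or the pattern has no group (IndexError)
        except (AttributeError, IndexError):
            return None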

regex = Regex()

entries = []

for event, last_data in zip(not_empty_lines[::2], not_empty_lines[1::2]):
    source = regex.xtract_with_regex(
        line=event,
        regex=regex._source
    )
    destination = regex.xtract_with_regex(
        line=event,
        regex=regex._destination
    )
    start_time = regex.xtract_with_regex(
        line=last_data,
        regex=regex._start
    )
    end_time = regex.xtract_with_regex(
        line=last_data,
        regex=regex._end
    )
    entries.append({
        'source': source,
        'destination': destination,
        'start_time': start_time,
        'end_time': end_time,
    })

for entry in entries:
    print(' ')
    print('Entry:')
    print(' ')
    print(json.dumps(entry, indent=4))
    print(' ')

Inside the loop, each of the four values is extracted with the same method and a different pattern, and .append() adds the resulting dictionary to the entries list. Once everything is collected, simply print out the results:

Entry:
 
{
    "source": "491454490",
    "destination": "1545454572877 110",
    "start_time": " 2010-08-23 08:27:21",
    "end_time": " 2010-08-23 08:28:21"
}
 
 
Entry:
 
{
    "source": null,
    "destination": " 110",
    "start_time": " 2010-08-23 08:27:21",
    "end_time": " 2010-08-23 08:27:21"
}
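
Two details in the output above are worth a closer look. Some values start with a space, because (.+) also captures the space that follows the colon. And "source" is null in the second entry, because (.+) needs at least one character and that entry has nothing between Source: and Destination:. The sketch below shows one possible way to handle both; the extract_clean helper is hypothetical, and the sample lines are shortened versions of the second log entry.

```python
import re
from typing import Optional


def extract_clean(line: str, pattern: str) -> Optional[str]:
    """Hypothetical helper: return group 1 without surrounding whitespace, or None."""
    match = re.search(pattern, line)
    if match is None:
        return None
    return match.group(1).strip()


# Shortened versions of the second entry from the sample log
event = 'AccountCode: Source: Destination: 110 DestinationContext: testing'
last_data = 'LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21'

start_time = extract_clean(last_data, r'StartTime:(.+) AnswerTime:')
destination = extract_clean(event, r'Destination:(.+) DestinationContext:')

# (.+) requires at least one character, so an empty Source gives no match at all
source = extract_clean(event, r'Source:(.+) Destination:')

# (.*) also accepts zero characters, so an empty Source becomes an empty string
source_or_empty = extract_clean(event, r'Source:(.*) Destination:')

print(repr(start_time), repr(destination), repr(source), repr(source_or_empty))
```

Whether a missing value should be None or an empty string depends on what you do with the entries afterwards, so treat this as one option rather than the right answer.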

