Python Regex Match, Group and Search

This is a follow-up post to a blog post I made 5 years ago. I’ve received quite a few emails from people having problems with Python regex match, group, and search methods. It seems like a lot of people are dealing with a situation where they need to open and parse a log file in Python – extracting information about database dumps, database errors, or any other kind of information, as long as it’s a log or error file with text in it.

As I have mentioned in my first blog post (Log File Parsing In Python), Python and Regex are a perfect combination of tools to read log files line by line. It’s easy to use and very easy to learn. 

The last time I did a post about this topic, I gave you an example of how to parse Skype for Business iOS application log files. This time I’m going to take a look at a log file sent in by a reader of this blog and show you how I would deal with such a situation.

Here’s a quick example of the type of log file you’re going to deal with:

Event: Cdr Privilege: cdr,all AccountCode: Source:491454490 Destination:1545454572877 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup
LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:28:21 Duration: 60 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S

Event: Cdr Privilege: cdr,all AccountCode: Source: Destination: 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup
LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:27:21 Duration: 0 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S

As you can see, the problem here is that one entry (let’s call an Event line plus a LastData line one entry) is split across multiple file lines.

Usually, you would extract all the necessary information from one line (as you iterate through the file line by line with a Python for loop), but now you have to come up with a way to collect the information for one entry from multiple lines.

And I can see how it could be challenging for someone just starting out with Python.

Read file and remove empty lines

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

Open a log file

As you already might know, there are a couple of ways to open a file in Python, but the easiest and simplest way to open a file for reading is to use the with statement together with Python’s built-in open() function. Change the 'path/to/logfile.txt' part if you want to open your own file; the code above is just an example. For reading a file, use mode='r', which means you’re opening the file in read mode. And finally, the as file: part defines the name of the variable that will hold the file object.

Remember, in Python everything is an object, so when you open a file you get a file object in memory – in this case it’s stored in a variable called file. You can then use the file variable to do whatever you need with the actual file – read it, update information, etc. In this case, of course, you want to read the file.
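
Here’s a minimal sketch of what that means in practice (assuming logfile.txt actually exists at the given path) – the file object is only usable inside the with block and gets closed automatically once the block ends:

with open('path/to/logfile.txt', 'r') as file:
    print(type(file))    # <class '_io.TextIOWrapper'> - the file object returned by open()
    print(file.closed)   # False - the file is still open inside the with block

print(file.closed)       # True - the with statement closed the file automatically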

Remove empty lines from a log file

The next step, before you go into parsing the file, is to remove empty lines from the log file. The way I’ve decided to do it here is with a list comprehension, which is a shortcut for using a for loop to create a list. Depending on how large your log file is, you might also want to take a look at using the yield keyword in combination with a for loop.

If you don’t care about the performance of the parsing process, ignore the yield part; otherwise, I’d recommend checking it out, it really pays off on large files once you learn to use it properly. In short, a function that uses yield returns a generator that produces one line at a time instead of loading the whole file into a list, which keeps memory usage low.
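
Here’s a minimal sketch of what that could look like (the function name read_non_empty_lines is just an example name, not something from the original code):

def read_non_empty_lines(path):
    # a generator: yields one non-empty line at a time
    # instead of building the whole list in memory
    with open(path, 'r') as file:
        for line in file:
            solid_line = line.replace('\n', '')
            if solid_line:
                yield solid_line

for line in read_non_empty_lines('path/to/logfile.txt'):
    print(line)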

Back to the list comprehension – if you take a closer look at the following code part:

not_empty_lines = [
    line.replace('\n', '')
    for line in file.readlines()
    if line.replace('\n', '')
]

It translates to a very simple for loop with a simple if statement (the := walrus operator used below requires Python 3.8 or newer):

not_empty_lines = []
for line in file.readlines():
    if solid_line := line.replace('\n', ''):
        not_empty_lines.append(solid_line)
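
If you’re on a Python version older than 3.8 and the walrus operator isn’t available, the same loop without := would look like this:

not_empty_lines = []
for line in file.readlines():
    solid_line = line.replace('\n', '')
    if solid_line:
        not_empty_lines.append(solid_line)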

file.readlines() gives you all the file lines as strings in a list. line.replace('\n', '') replaces the new-line character ( \n ) with an empty string, so a line that contained nothing but a new-line becomes an empty string and gets filtered out by the if condition.

Meaning, if you remove the new-line character from a line and the remaining string is still truthy ( bool(your_string_here) == True ), you can consider it a solid_line with text in it – one that can be parsed properly.
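
A quick illustration of how string truthiness drives that filter (just an example, not part of the parser):

print(bool(''))                       # False - an empty string is falsy, the line gets filtered out
print(bool('Event: Cdr'))             # True - a string with text in it is truthy, the line is kept
print(bool('\n'.replace('\n', '')))   # False - a line that was only a new-line reduces to ''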

Group log file lines with zip() function

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

for event, last_data in zip(not_empty_lines[::2], not_empty_lines[1::2]):
    print(f'event: {event}')
    print(f'last_data: {last_data}')


Since you’re dealing with two lines and want to extract information from both of those lines just to collect the necessary data for one entry, you need to come up with a way to somehow zip two lines together.

Well, in Python the above can be done using a built-in function called zip(). Take a look at what the zip() function gives you.

Code example for zip() function usage:

import json

zipped = zip(not_empty_lines[::2], not_empty_lines[1::2])
zipped_tuple = tuple(zipped)
zipped_pretty = json.dumps(zipped_tuple, indent=4)
print(zipped_pretty)

Output:

[
    [
        "Event: Cdr Privilege: cdr,all AccountCode: Source:491454490 Destination:1545454572877 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup",
        "LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:28:21 Duration: 60 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S"
    ],
    [
        "Event: Cdr Privilege: cdr,all AccountCode: Source: Destination: 110 DestinationContext: testing CallerID: Channel: Console/dsp DestinationChannel: LastApplication: Hangup",
        "LastData: StartTime: 2010-08-23 08:27:21 AnswerTime: 2010-08-23 08:27:21 EndTime: 2010-08-23 08:27:21 Duration: 0 BillableSeconds: 0 Disposition: ANSWERED AMAFlags: DOCUMENTATION UniqueID: 1282570041.3 UserField: Rate: 0.02 Carrier: BS&S"
    ]
]

As you can see, zip() gives you the lines grouped in pairs. This is exactly what you need to be able to parse the whole log file two lines at a time.

  • List #1 represents lines 1 and 2 of a logfile.txt
  • List #2 represents lines 3 and 4 of a logfile.txt
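
In case the slicing looks cryptic: not_empty_lines[::2] takes every second line starting from the first one (the Event lines), and not_empty_lines[1::2] takes every second line starting from the second one (the LastData lines). A quick sketch with made-up placeholder strings:

lines = ['event 1', 'last_data 1', 'event 2', 'last_data 2']

print(lines[::2])   # ['event 1', 'event 2'] - every second item, starting at index 0
print(lines[1::2])  # ['last_data 1', 'last_data 2'] - every second item, starting at index 1

print(list(zip(lines[::2], lines[1::2])))
# [('event 1', 'last_data 1'), ('event 2', 'last_data 2')]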

Use regex to parse log file lines

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

for event, last_data in zip(not_empty_lines[::2], not_empty_lines[1::2]):
    print(f'event: {event}')
    print(f'last_data: {last_data}')

    # How to extract a value from an event or last_data line?

Now you have to start thinking about how you could work with the event and last_data lines.

Each line contains specific information you want to extract with regex. From the logfile.txt above it’s pretty clear you want to extract source, destination, start and end values for each entry.

So, the question here is – how can you extract the above values from each entry?

Well, over the past couple of years, while using regex in a bunch of different situations, I’ve come up with a very good “go-to” way of using it.

It consists of the following steps:

  • Use the regex101.com site to come up with a valid regex pattern and check that it works right then and there.
  • For the regex pattern, use a group (meaning, use parentheses) and the simplest pattern that can be applied in almost any situation: (.+) – you only change the surrounding text and extract the part that’s captured by (.+)
  • Use the following Regex class with one method for extracting information (you only change the regex pattern and use the same regex method over and over again)
import re
import json

with open('path/to/logfile.txt', 'r') as file:
    not_empty_lines = [
        line.replace('\n', '')
        for line in file.readlines()
        if line.replace('\n', '')
    ]

class Regex:

    _source = r'Source:(.+) Destination:'
    _destination = r'Destination:(.+) DestinationContext:'
    _start = r'LastData: StartTime:(.+) AnswerTime:'
    _end = r'EndTime:(.+) Duration:'

    def xtract_with_regex(self, line: str, regex: str):
        # compile the pattern and search the line for the first match
        rg = re.compile(regex, re.IGNORECASE | re.DOTALL)
        match = rg.search(line)

        try:
            # return the text captured by the (.+) group
            return match.group(1)
        except (AttributeError, IndexError):
            # AttributeError: no match at all, IndexError: no such group in the pattern
            return None

regex = Regex()

entries = []

for event, last_data in zip(not_empty_lines[::2], not_empty_lines[1::2]):
    source = regex.xtract_with_regex(
        line=event,
        regex=regex._source
    )
    destination = regex.xtract_with_regex(
        line=event,
        regex=regex._destination
    )
    start_time = regex.xtract_with_regex(
        line=last_data,
        regex=regex._start
    )
    end_time = regex.xtract_with_regex(
        line=last_data,
        regex=regex._end
    )
    entries.append({
        'source': source,
        'destination': destination,
        'start_time': start_time,
        'end_time': end_time,
    })

for entry in entries:
    print(' ')
    print('Entry:')
    print(' ')
    print(json.dumps(entry, indent=4))
    print(' ')

And in the end, you use .append() to add each new entry to the entries list. Once everything is collected, simply print out the results:

Entry:
 
{
    "source": "491454490",
    "destination": "1545454572877 110",
    "start_time": " 2010-08-23 08:27:21",
    "end_time": " 2010-08-23 08:28:21"
}
 
 
Entry:
 
{
    "source": null,
    "destination": " 110",
    "start_time": " 2010-08-23 08:27:21",
    "end_time": " 2010-08-23 08:27:21"
}
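
Note the leading spaces in some of the captured values – that’s because the log file has a space after each colon and the (.+) group captures it. If you want clean values, one option (just a sketch, not part of the class above) is to strip each result before storing it:

source = regex.xtract_with_regex(line=event, regex=regex._source)

# xtract_with_regex returns None when there is no match, so guard before stripping
if source is not None:
    source = source.strip()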

I'll help you become a Python developer!

If you're interested in learning Python and getting a job as a Python developer, send me an email to roberts.greibers@gmail.com and I'll see if I can help you.

