How to create Python regex search to extract specific values from the server logs

Question

I am looking for suggestions on creating a regular-expression-based search in Python. My server log file contains lines like the following:

2017-03-18 13:24:05,791 INFO [STDOUT] SUB Request Status :Resubmitted INBIOS_ABZ824
2017-03-12 13:24:05,796 INFO [STDOUT] SUB Submit Status :Resubmitted INDROS_MSR656
2017-04-12 13:24:05,991 INFO [STDOUT] SUB Request Status :Resubmitted INHP_GSN848

I need to search the log and extract values like the following:

2017-03-18 13:24:05,791 INBIOS_ABZ824
2017-03-12 13:24:05,796 INDROS_MSR656
2017-04-12 13:24:05,991 INHP_GSN848

I am using the following code, but it extracts the complete line wherever a string such as INBIOS_ABZ824 is present. How can I extract only the specified values from the log, as shown above? Please share your thoughts.

What I have tried:

Python

import os
import re

# Regex used to match relevant loglines (in this case)

line_regex = re.compile(r"[A-Z]+TECH_[A-Z]+[0-9]+", re.IGNORECASE)


# Output file, where the matched loglines will be copied to
output_filename = os.path.normpath("output.log")
# Overwrites the file, ensure we're starting out with a blank file
with open(output_filename, "w") as out_file:
    out_file.write("")

# Open output file in 'append' mode
with open(output_filename, "a") as out_file:
    # Open input file in 'read' mode
    with open("ServerError.txt", "r") as in_file:
        # Loop over each log line
        for line in in_file:
            # If log line matches our regex, print to console, and output file
            if (line_regex.search(line)):
                print(line)
                out_file.write(line)

Posted 14-Jun-18 0:19am

py.Net.JS

Updated 14-Jun-18 2:15am

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Jochen Arndt · Accepted Answer · 2018-06-14T00:54:00

Solution 1

So you want to get the time stamp and the part at the end?

While this can be done with regular expressions, it would be much simpler (and faster) with classic string operations.

The time stamp is at the beginning of the line with a fixed length of 23 characters, and the end part follows the last space in the string (untested):

Python

# Get index of last space
last_ndx = line.rfind(' ')
# line[:23]: The time stamp (first 23 characters)
# line[last_ndx:]: Last space and following characters
out_file.write(line[:23] + line[last_ndx:])

If you also have other log entries that should not be matched, you can still apply a regex to the last part, line[last_ndx:], and check whether it matches (e.g. " IN[_A-Z]+?[0-9]+$").
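Put together, the slicing and the optional regex check could look like the sketch below. The names `extract_entry` and `ID_PATTERN` are illustrative and not part of the original code. The sketch assumes every time stamp is exactly 23 characters long, as in the sample log lines, and it applies the pattern without the leading space because the space is skipped before matching.

```python
import re

# Hypothetical pattern for the trailing identifier, based on the regex suggested above
ID_PATTERN = re.compile(r"IN[_A-Z]+?[0-9]+$")


def extract_entry(line):
    """Return 'timestamp identifier' for a matching log line, else None."""
    line = line.rstrip("\n")
    last_ndx = line.rfind(" ")
    if last_ndx < 23:
        # No space after the time stamp (short or malformed line): nothing to extract
        return None
    tail = line[last_ndx + 1:]
    if ID_PATTERN.match(tail) is None:
        # The last field does not look like an identifier
        return None
    return line[:23] + " " + tail
```

Inside the read loop, a line would then be written only when `extract_entry(line)` returns something other than None.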

Posted 14-Jun-18 0:54am

Jochen Arndt

Comments

py.Net.JS 14-Jun-18 7:32am

This is perfect, mate... Thank you so much.
How can I add a regex check on the last part to make sure it matches the pattern?

Jochen Arndt 14-Jun-18 7:45am

Use something like
matchObj = re.match(pattern, line[last_ndx:])
If matchObj is not None, the pattern has been found in the string.

Note that match() checks at the beginning of the string (at the space). You might also use search(), where the pattern can be located anywhere in the string, and/or use line[last_ndx+1:] because you already know that there is a space.
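The difference can be seen in a small sketch. The pattern here deliberately omits the leading space (an assumption made for illustration), so that match() on the slice that still contains the space fails while search() succeeds.

```python
import re

# Identifier pattern without a leading space (illustrative choice)
pattern = re.compile(r"IN[_A-Z]+?[0-9]+$")
line = "2017-03-18 13:24:05,791 INFO [STDOUT] SUB Request Status :Resubmitted INBIOS_ABZ824"
last_ndx = line.rfind(" ")

# match() anchors at the start of the string, which here is the space itself
match_with_space = pattern.match(line[last_ndx:])
# search() scans forward, so the leading space does not matter
search_with_space = pattern.search(line[last_ndx:])
# Skipping the space lets match() succeed as well
match_without_space = pattern.match(line[last_ndx + 1:])

print(match_with_space)             # None
print(search_with_space.group())    # INBIOS_ABZ824
print(match_without_space.group())  # INBIOS_ABZ824
```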

py.Net.JS 14-Jun-18 8:05am

Absolutely Stunning ....
That is the perfect solution for my first python project.
thanks a lot mate

Jochen Arndt 14-Jun-18 8:31am

You are welcome and thank you for accepting my solution.

One note on your solution:
There is no need to check for the match twice. The line
if (line_regex.search(line)):
can be removed (and the block outdented).

py.Net.JS 19-Jun-18 10:59am

Now my code returns the following output:

2017-03-18 , INBIOS_ABZ824
2017-03-19 , INBIOS_ABZ824
2017-03-12 , INDROS_MSR656
2017-03-17 , INDROS_MSR656
2017-04-12 , INHP_GSN848
2017-04-19 , INHP_GSN848

Each identifier appears several times with different dates. Of these, I want to keep only the entry with the oldest date and eliminate the others. What would be the best approach? Could you please suggest one?

The final output needs to look like this:

2017-03-18 , INBIOS_ABZ824
2017-03-12 , INDROS_MSR656
2017-04-12 , INHP_GSN848

Jochen Arndt 20-Jun-18 2:51am

That can't be done with regular expressions.
You might store the results in a list and process that later to filter out the required items.

py.Net.JS 20-Jun-18 4:10am

Thank you, Jochen.

I have formed a list from the output file:

text_file = open("dataoutput.txt", "r")
lines = text_file.read().split('^')

How can I extract the oldest dates and their corresponding values from that list?

Jochen Arndt 20-Jun-18 4:37am

There are multiple solutions.

An important part of programming is to think about a problem and find possible solutions before writing a single line of code.

Here the oldest entry for each identifier comes first in the list, which makes it quite simple. You can, for example, create a new list to hold the result, then step through the input list and add a record only if its identifier is not already present.

However, this should use a list held in memory instead of a file, and the same in-memory list should also be used for the previous task.
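A minimal sketch of that idea with the data held in memory. The `records` list is sample data copied from the output shown above, and `oldest_per_id` is a hypothetical helper name. Instead of relying on the input order, this variant compares the dates explicitly, which works because dates in YYYY-MM-DD form sort chronologically as plain strings.

```python
def oldest_per_id(records):
    """Map each identifier to its oldest date string (YYYY-MM-DD)."""
    oldest = {}
    for date, ident in records:
        # Keep the date if the identifier is new or this date is earlier
        if ident not in oldest or date < oldest[ident]:
            oldest[ident] = date
    return oldest


# Sample (date, identifier) pairs taken from the output above
records = [
    ("2017-03-18", "INBIOS_ABZ824"),
    ("2017-03-19", "INBIOS_ABZ824"),
    ("2017-03-12", "INDROS_MSR656"),
    ("2017-03-17", "INDROS_MSR656"),
    ("2017-04-12", "INHP_GSN848"),
    ("2017-04-19", "INHP_GSN848"),
]

for ident, date in oldest_per_id(records).items():
    print(f"{date} , {ident}")
```

In the real script, the pairs would be collected while reading the log rather than re-read from the output file.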

py.Net.JS · Accepted Answer · 2018-06-14T02:10:00

So here is the final code, which gives me the perfect results:

Python

import os
import re

# Regex used to match relevant log lines (in this case, the request ID)
line_regex = re.compile(r"[A-Z]+OS_[A-Z]+[0-9]+", re.IGNORECASE)


# Output file, where the matched loglines will be copied to
output_filename = os.path.normpath("output.log")
# Overwrites the file, ensure we're starting out with a blank file
with open(output_filename, "w") as out_file:
    out_file.write("")

# Open output file in 'append' mode
with open(output_filename, "a") as out_file:
    # Open input file in 'read' mode
    with open("ServerError.txt", "r") as in_file:
        # Loop over each log line
        for line in in_file:
            # If log line matches our regex, print to console, and output file
            if (line_regex.search(line)):

                # Get index of last space
                last_ndx = line.rfind(' ')
                # line[:23]: The time stamp (first 23 characters)
                # line[last_ndx:]: Last space and following characters

                # Use a match object to reject lines where the pattern only appears
                # earlier in the line; the request ID must be the last field
                matchObj = line_regex.match(line[last_ndx+1:])
                # Proceed only if matchObj is not None
                if matchObj:
                    print(line[:23] + line[last_ndx:])
                    out_file.write(line[:23] + line[last_ndx:])
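As Solution 1 mentions, the same extraction can also be done with a regular expression alone. The sketch below is one possible way, not the code used above: `LINE_PATTERN` and `extract_with_regex` are hypothetical names, the identifier part reuses the pattern suggested in Solution 1, and the sketch assumes there is always some text between the time stamp and the identifier.

```python
import re

# Group 1: the 23-character time stamp; group 2: the trailing identifier
LINE_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .* (IN[_A-Z]+?[0-9]+)\s*$"
)


def extract_with_regex(line):
    """Return 'timestamp identifier' if the line matches, else None."""
    m = LINE_PATTERN.match(line)
    if m is None:
        return None
    return m.group(1) + " " + m.group(2)
```

The capture groups replace both the fixed-length slice and the rfind() call, at the cost of a pattern that is harder to read than the string operations.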