I am looking for suggestions on how to build a regular-expression-based search in Python. A server log file contains lines like the following:

2017-03-18 13:24:05,791 INFO [STDOUT] SUB Request Status :Resubmitted INBIOS_ABZ824
2017-03-12 13:24:05,796 INFO [STDOUT] SUB Submit Status :Resubmitted INDROS_MSR656
2017-04-12 13:24:05,991 INFO [STDOUT] SUB Request Status :Resubmitted INHP_GSN848

I need to search the log and extract values like the following:

2017-03-18 13:24:05,791 INBIOS_ABZ824
2017-03-12 13:24:05,796 INDROS_MSR656
2017-04-12 13:24:05,991 INHP_GSN848

I am using the following code, but it extracts the complete line wherever a string such as INBIOS_ABZ824 is present. How can I extract only the values shown above from the log? Please share your thoughts.

What I have tried:

Python
import os
import re

# Regex used to match relevant loglines (in this case)

line_regex = re.compile(r"[A-Z]+TECH_[A-Z]+[0-9]+", re.IGNORECASE)


# Output file, where the matched loglines will be copied to
output_filename = os.path.normpath("output.log")
# Overwrites the file, ensure we're starting out with a blank file
with open(output_filename, "w") as out_file:
    out_file.write("")

# Open output file in 'append' mode
with open(output_filename, "a") as out_file:
    # Open input file in 'read' mode
    with open("ServerError.txt", "r") as in_file:
        # Loop over each log line
        for line in in_file:
            # If log line matches our regex, print to console, and output file
            if (line_regex.search(line)):
                print(line)
                out_file.write(line)
Updated 14-Jun-18 2:15am

So you want to get the time stamp and the part at the end?

While this can be done with regular expressions, it would be much simpler (and faster) with classic string operations.
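For comparison, the regular-expression route could look roughly like the sketch below. The pattern and the function name `extract_with_regex` are assumptions for illustration; they are based only on the three sample lines in the question (a timestamp at the start and an ID of the form IN..._... as the last token).

```python
import re

# Assumed layout: timestamp at the start of the line, request ID as the last token.
LOG_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"  # group 1: timestamp
    r".*\s"                                           # anything, then whitespace
    r"(IN[A-Z]+_[A-Z]+\d+)\s*$"                       # group 2: request ID at end of line
)


def extract_with_regex(line):
    """Return (timestamp, request_id), or None if the line does not fit the layout."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    return m.group(1), m.group(2)
```

Calling `extract_with_regex(line)` inside the read loop returns both parts in one step, at the cost of a longer pattern to maintain.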

The time stamp is at the beginning of the line with a fixed length of 23 characters, and the end part follows the last space in the string (untested):
Python
# Get index of last space
last_ndx = line.rfind(' ')
# line[:23]: The time stamp (first 23 characters)
# line[last_ndx:]: Last space and following characters
out_file.write(line[:23] + line[last_ndx:]) 
If you also have other log entries that should not be matched, you can still apply a regex to the last part, line[last_ndx:], and check whether it matches (e.g. " IN[_A-Z]+?[0-9]+$").
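Put together, the slicing and the optional regex check might look like the following minimal sketch. The helper name `extract_by_slicing` and the constant `TIMESTAMP_LENGTH` are made up for illustration, and the pattern is the example pattern from above without the leading space, because the sketch slices from last_ndx + 1.

```python
import re

# Example pattern from the answer, minus the leading space (we slice past it).
ID_PATTERN = re.compile(r"IN[_A-Z]+?[0-9]+$")
TIMESTAMP_LENGTH = 23  # assumed fixed width of "YYYY-MM-DD HH:MM:SS,mmm"


def extract_by_slicing(line):
    """Return 'timestamp ID' for a matching log line, otherwise None."""
    line = line.rstrip()           # drop the trailing newline
    last_ndx = line.rfind(" ")     # index of the last space
    if last_ndx < TIMESTAMP_LENGTH:
        return None                # too short to hold a timestamp and an ID
    last_part = line[last_ndx + 1:]
    if ID_PATTERN.match(last_part) is None:
        return None                # last token is not a request ID
    return line[:TIMESTAMP_LENGTH] + " " + last_part
```

In the read loop, a non-None result can be written to the output file followed by "\n", since the sketch strips the original line ending.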
 
 
Comments
py.Net.JS 14-Jun-18 7:32am    
This is perfect, mate... Thank you so much.
How can I add a regex on the last part to make sure it matches the pattern?
Jochen Arndt 14-Jun-18 7:45am    
Use something like
matchObj = re.match(pattern, line[last_ndx:])
If matchObj is not None, the pattern has been found in the string.

Note that match() checks only at the beginning of the string (here, at the space). You could also use search(), which finds the pattern anywhere in the string, and/or pass line[last_ndx+1:], because you already know there is a space at that position.
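A small sketch of that difference, using one of the IDs from the question. The pattern here is only an example without the leading space, and the variable names are illustrative.

```python
import re

pattern = re.compile(r"IN[_A-Z]+?[0-9]+")
tail = " INBIOS_ABZ824"            # what line[last_ndx:] yields: space plus ID

at_start = pattern.match(tail)     # match() anchors at index 0, which is the space
anywhere = pattern.search(tail)    # search() scans forward and finds the ID
skipped = pattern.match(tail[1:])  # skipping the space lets match() succeed

print(at_start)                    # None
print(anywhere.group(0))           # INBIOS_ABZ824
print(skipped.group(0))            # INBIOS_ABZ824
```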
py.Net.JS 14-Jun-18 8:05am    
Absolutely stunning...
That is the perfect solution for my first Python project.
Thanks a lot, mate.
Jochen Arndt 14-Jun-18 8:31am    
You are welcome and thank you for accepting my solution.

One note on your solution:
There is no need to check for the match twice. The line
if (line_regex.search(line)):
can be removed (and the block outdented).
py.Net.JS 19-Jun-18 10:59am    
Now my code is returning the following output:

2017-03-18 , INBIOS_ABZ824
2017-03-19 , INBIOS_ABZ824
2017-03-12 , INDROS_MSR656
2017-03-17 , INDROS_MSR656
2017-04-12 , INHP_GSN848
2017-04-19 , INHP_GSN848

Several IDs appear more than once with different dates. Of these, I want to keep only the oldest date for each ID and eliminate the others. What would be the best approach? Could you please suggest one?

The final output needs to look like this:

2017-03-18 , INBIOS_ABZ824
2017-03-12 , INDROS_MSR656
2017-04-12 , INHP_GSN848
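One possible approach, offered as a sketch rather than a tested solution: keep a dictionary keyed by ID and overwrite the stored date only when an earlier one turns up. The function name `oldest_per_id` is hypothetical, and the sketch assumes every row has the "date , ID" shape shown above. Dates in YYYY-MM-DD form sort correctly as plain strings, so no date parsing is needed.

```python
def oldest_per_id(lines):
    """Map each request ID to the earliest date seen for it."""
    oldest = {}
    for line in lines:
        date, sep, request_id = line.partition(",")
        if not sep:
            continue  # skip rows without a comma separator
        date, request_id = date.strip(), request_id.strip()
        # ISO dates (YYYY-MM-DD) compare correctly as strings
        if request_id not in oldest or date < oldest[request_id]:
            oldest[request_id] = date
    return oldest


rows = [
    "2017-03-18 , INBIOS_ABZ824",
    "2017-03-19 , INBIOS_ABZ824",
    "2017-03-12 , INDROS_MSR656",
    "2017-03-17 , INDROS_MSR656",
    "2017-04-12 , INHP_GSN848",
    "2017-04-19 , INHP_GSN848",
]
for request_id, date in oldest_per_id(rows).items():
    print(date + " , " + request_id)
```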
So here is the final code, which gives me the perfect results:

Python
import os
import re

# Regex used to match relevant log lines (in this case, request IDs such as INBIOS_ABZ824)
line_regex = re.compile(r"[A-Z]+OS_[A-Z]+[0-9]+", re.IGNORECASE)


# Output file, where the matched loglines will be copied to
output_filename = os.path.normpath("output.log")
# Overwrites the file, ensure we're starting out with a blank file
with open(output_filename, "w") as out_file:
    out_file.write("")

# Open output file in 'append' mode
with open(output_filename, "a") as out_file:
    # Open input file in 'read' mode
    with open("ServerError.txt", "r") as in_file:
        # Loop over each log line
        for line in in_file:
            # If log line matches our regex, print to console, and output file
            if (line_regex.search(line)):

                # Get index of last space
                last_ndx = line.rfind(' ')
                # line[:23]: The time stamp (first 23 characters)
                # line[last_ndx:]: Last space and following characters

                # Use a match object to skip lines where the pattern occurs elsewhere;
                # the request ID must be the last token on the line
                matchObj = re.match(line_regex, line[last_ndx+1:])
                # Check that matchObj is not None
                if matchObj:
                    print(line[:23] + line[last_ndx:])
                    out_file.write(line[:23] + line[last_ndx:])
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


