I want to extract an email ID from a text file using regular expressions.


Is there a way to take a file as input, apply the relevant regular expression, and get the email ID from it?

What I have tried:

import re

# Extracting email IDs
email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'
email_match = re.findall(email_pattern, e_text)
print(email_match)


This code prints the email ID only when 'e_text' holds the sentences directly. When I pass the path of a text file in place of 'e_text', the output is an empty list: []
Updated 8-Sep-22 3:26am

You cannot use a filename as the second parameter of the re.findall function (see "re: Regular expression operations" in the Python 3.10.7 documentation). findall searches the string it is given, so passing a path searches the path text itself, not the file. You need to read the contents of the file and use the resulting text. See "7. Input and Output" in the Python 3.10.7 documentation.
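As a minimal sketch of that idea: read the file into a string first, then hand the string to findall. The helper name find_emails_in_file, the sample file name, and the UTF-8 encoding are assumptions made for illustration; the pattern is the one from the question, unchanged.

```python
import re
import tempfile
from pathlib import Path

# Pattern copied from the question (written as a raw string).
email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'


def find_emails_in_file(path, encoding='utf-8'):
    """Hypothetical helper: read the whole file as text, then search that text."""
    text = Path(path).read_text(encoding=encoding)
    return re.findall(email_pattern, text)


# Demonstration with a throwaway file; 'sample.txt' is a made-up name.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / 'sample.txt'
    sample.write_text('Contact: me@mycompany.com\n', encoding='utf-8')

    # Searching the file name only looks at the characters of the name.
    print(re.findall(email_pattern, 'sample.txt'))  # []

    # Searching the file's contents finds the address.
    print(find_emails_in_file(sample))  # ['me@mycompany.com']
```

For very large files it may be better to loop over the lines and search each one, but for a typical text file reading everything at once is probably fine.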
 
Comments
Apoorva 2022 10-Sep-22 13:57pm    
Thanks. The documentation is very helpful.
To add to what Richard has said ... Email validation is pretty complicated, and your regex is far too simplistic - it allows illegal addresses such as "@." and doesn't find legal addresses such as "a.b@x.com", "a@x.co.uk", "a@x-y.com", and so forth.
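A quick way to see this is to run the pattern from the question over those sample strings. This is only an illustrative check, not a validation suite; the variable names are made up for the example.

```python
import re

# Pattern copied from the question (written as a raw string).
email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'

samples = ['@.', 'a.b@x.com', 'a@x.co.uk', 'a@x-y.com']

# Map each sample string to whatever the pattern finds in it.
results = {s: re.findall(email_pattern, s) for s in samples}

for sample, found in results.items():
    print(f'{sample!r:14} -> {found}')

# '@.'           -> ['@.']        (accepted although it is not an address)
# 'a.b@x.com'    -> ['b@x.com']   (the "a." part is lost)
# 'a@x.co.uk'    -> ['a@x.co']    (the ".uk" part is lost)
# 'a@x-y.com'    -> []            (the hyphen stops the match entirely)
```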

To "find an email address" somewhere in a bunch of text is complicated, even if you allow all the valid characters in such an address, which can include spaces, single and double quotes, backslashes, ... and even '@' can be part of your local or domain section if quoted! See Email address - Wikipedia.
Why complicated? Because email addresses are text, and even if you stick to a subset of valid characters (and that'll annoy a lot of people) it's next to impossible to tell reliably whether any given chunk of text that includes an '@' somewhere in it is actually intended to be an email address.

I'd strongly suggest that you take a close look at your input file and see if you can identify anything which identifies "this is an email address" before the address and "that was an email address" after it:
Email: me@mycompany.com
For example, that makes it easier: you have a lead-in "Email: " and a newline to terminate.
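A rough sketch of that approach, assuming every address in the file sits on its own line after the literal label "Email:" (the label, the sample text, and the name labelled_email are all assumptions for illustration):

```python
import re

# Anchor on the lead-in "Email:" at the start of a line and on the end of
# that line; capture the non-space run containing an '@' in between.
labelled_email = re.compile(r'^Email:[ \t]*(\S+@\S+)[ \t]*$', re.MULTILINE)

# Made-up sample text; only the second line carries the "Email:" label.
text = """Name: Some Person
Email: me@mycompany.com
Twitter: @mycompany
"""

print(labelled_email.findall(text))  # ['me@mycompany.com']
```

This deliberately does not try to validate the address. It only relies on the surrounding markers to decide where the address starts and stops, so the Twitter handle is ignored even though it contains an '@'.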

Without that, you are going to get a lot of false positives, as well as bad detections of partial email addresses.
 
I found the solution through the 'tika' library. Here's the code:

Extracting text from '.docx' file (Works for pdf as well)

Python
from tika import parser

file = r'../content/drive/My Drive/foldy folder/resume.docx'
file_data = parser.from_file(file)
text = file_data['content']
print(text)


Extracting email IDs

Python
import re

# Search the text extracted by tika above, not the file path.
email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'
email_match = re.findall(email_pattern, text)
print(email_match)


NOTE: Coded in Google Colab
 
Comments
Apoorva 2022 13-Sep-22 1:59am    
Hi. I understand the email validation isn't very accurate. Since this is a basic project I'm working on, I'll include more patterns once I get a good grip on the coding part.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


