How do I extract an email id from a file using 'regular expressions'?

Question

I want to extract an email id from a text 'file' using 'Regular Expressions'.

Is there a way to feed a file as input, apply the relevant regular expression, and get an email id from it?

What I have tried:

# Extracting email ids
import re

email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'
# e_text must be a string holding the text to search
email_match = re.findall(email_pattern, e_text)
print(email_match)

This code prints the email id only when 'e_text' holds the sentences themselves. When I pass the path of a text file in place of 'e_text', the output is an empty list: []

Posted 8-Sep-22 1:12am

Apoorva 2022

Updated 8-Sep-22 2:26am

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Richard MacCutchan · Answer 1 · 2022-09-08T01:47:00

Solution 1

You cannot use a filename as the second parameter to the re.findall function (see the Python 3.10.7 documentation for the re module, "Regular expression operations"). You need to read the contents of the file and pass the resulting text instead. See chapter 7, "Input and Output", of the Python 3.10.7 tutorial.
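As a minimal sketch of that advice, assuming a plain UTF-8 text file: the helper name find_emails_in_file and the sample file are hypothetical, and the pattern is the one from the question, unchanged. Passing the path itself to findall returns [] simply because the path string contains no '@'.

```python
import os
import re
import tempfile
from pathlib import Path

# Pattern copied from the question, written as a raw string.
email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'


def find_emails_in_file(path):
    """Read the file's contents, then search the text (not the path)."""
    text = Path(path).read_text(encoding='utf-8')
    return re.findall(email_pattern, text)


# Hypothetical sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    Path(sample_path).write_text('Contact: someone@example.com\n',
                                 encoding='utf-8')
    found = find_emails_in_file(sample_path)            # searches the contents
    from_path = re.findall(email_pattern, sample_path)  # searches the path string

print(found)
print(from_path)
```

This only works for files that are already plain text; formats such as .docx or .pdf need their text extracted first.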

Posted 8-Sep-22 1:47am

Richard MacCutchan

Comments

Apoorva 2022 10-Sep-22 13:57pm

Thanks. The documentation is very helpful.

OriginalGriff · Answer 2 · 2022-09-08T02:26:00

To add to what Richard has said ... Email validation is pretty complicated, and your regex is far too simplistic - it allows illegal addresses such as "@." and doesn't find legal addresses such as "a.b@x.com", "a@x.co.uk", "a@x-y.com", and so forth.
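A quick way to see this is to run the pattern from the question against those strings. This is only an illustrative check; the sample list is made up from the addresses mentioned above. Note that some addresses are missed entirely, while others are only matched in part.

```python
import re

# Pattern copied from the question, written as a raw string.
email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'

# Sample strings taken from the addresses discussed in this answer.
samples = ['@.', 'a.b@x.com', 'a@x.co.uk', 'a@x-y.com']

# Map each sample to whatever findall extracts from it.
results = {s: re.findall(email_pattern, s) for s in samples}

for sample, matches in results.items():
    print(repr(sample), '->', matches)
```

Comparing each sample with its matches shows where the local part or the domain gets cut off.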

To "find an email address" somewhere in a bunch of text is complicated, even if you allow all the valid characters in such an address, which can include spaces, single and double quotes, backslashes, ... and even '@' can be part of your local or domain section if quoted! See Email address - Wikipedia.
Why complicated? Because email addresses are text, and even if you stick to a subset of valid characters (and that will annoy a lot of people), it's next to impossible to tell reliably whether any given chunk of text that includes an '@' somewhere in it is actually intended to be an email address.

I'd strongly suggest that you take a close look at your input file and see if there is anything that marks "this is an email address" before the address and "that was an email address" after it:

Email: me@mycompany.com

That, for example, makes it easier: you have a lead-in "Email: " and a newline to terminate.

Without that, you are going to get a lot of false positives, as well as bad detections of partial email addresses.
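A rough sketch of the lead-in idea, assuming every address in the input sits on its own line starting with "Email:". The sample text and the names labelled and labelled_emails are made up for illustration, and the pattern makes no attempt to validate the address itself.

```python
import re

# Hypothetical input; assumes one "Email: ..." line per address.
sample = (
    "Name: A. Person\n"
    "Email: me@mycompany.com\n"
    "Notes: reach me @ the office.\n"
)

# Anchor on the "Email:" lead-in and stop at the end of the line.
labelled = re.compile(r'^Email:[ \t]*(\S+@\S+)[ \t]*$', re.MULTILINE)

labelled_emails = labelled.findall(sample)
print(labelled_emails)
```

Because the match is tied to the label and the line end, the stray '@' in the "Notes" line is ignored.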

Apoorva 2022 · Answer 3 · 2022-09-10T08:05:00

I found the solution through the 'tika' library. Here's the code:

Extracting text from a '.docx' file (works for PDF as well)

Python

from tika import parser

file = r'../content/drive/My Drive/foldy folder/resume.docx'
file_data = parser.from_file(file)
text = file_data['content']
print(text)

Extracting email Ids

Python

import re

email_pattern = r'[a-z0-9A-Z_]*@[a-z0-9A-Z]*\.[a-zA-Z]*'
email_match = re.findall(email_pattern, text)
print(email_match)

NOTE: Coded in Google Colab
