Trying to save an email as HTML and PDF - encoding problem, keep having � , Â and \u2020

Question

I'm trying to write a program that will download my emails and save them as PDF.

I've encountered a problem with encoding.

I'm using the email and imaplib modules. When I write the file using part.get_payload(decode=True), I get an HTML file with \u2013 and � in it.

Writing the raw email as HTML works and doesn't show any �, but it also includes the headers of the email message. When I try to get rid of the headers, the � comes back. I've tried changing the encoding to ISO-8859-1, which removes the �, but then I get \u2020 and \u2013 instead.

Removing this line from the html solved the problem, until I converted it to PDF: <html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:asp="remove"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"></meta><meta name="format-detection" content="telephone=no, date=no, address=no, email=no, url=no"></meta><style type="text/css">

When I converted it to PDF, Â and â started appearing in the document.

This is the code I wrote:

What I have tried:

import email
import imaplib
import mimetypes
import os

import pdfkit

m = imaplib.IMAP4_SSL('imap.mail.yahoo.com')
m.login('xxxx@yahoo.com', 'xxxxx')

m.select('IL', readonly=True)
resp, data = m.search(None, '(SINCE "01-Jul-2019" BEFORE "29-Oct-2020" SUBJECT "Your order")')

messages = data[0].split()

for item in messages:
    typ, data = m.fetch(item, '(RFC822)')
    # Parse the raw bytes directly. Decoding the whole message as UTF-8 first
    # can fail or corrupt parts that use a different charset.
    email_message = email.message_from_bytes(data[0][1])
    to_ = email_message['To']
    from_ = email_message['From']
    subject_ = email_message['Subject']
    date_ = email_message['date']
    # Note: date_ and subject_ can contain characters that are not valid in
    # directory names (':' on Windows, '/' anywhere) and may need sanitizing.
    save_path = os.path.join(os.getcwd(), "emails", date_, subject_)
    # Use the variable, not the literal string 'save_path'.
    if not os.path.exists(save_path):
        print(save_path)
        os.makedirs(save_path)
    counter = 1
    for part in email_message.walk():
        if part.get_content_maintype() == "multipart":
            continue
        filename = part.get_filename()
        content_type = part.get_content_type()
        if not filename:
            ext = mimetypes.guess_extension(content_type)
            if not ext:
                ext = '.bin'
            filename = 'msg-part-%08d%s' % (counter, ext)
        counter += 1
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Write every part inside the loop; previously only the last part was saved.
        with open(os.path.join(save_path, filename), 'wb') as fp:
            fp.write(payload)
    # This path has to match where the HTML part was actually written.
    pdfkit.from_file('msg-part-00000001.htm', 'test.pdf')
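
For reference, part.get_payload(decode=True) only undoes the transfer encoding (base64 or quoted-printable) and returns bytes; it does not convert them to text. A minimal sketch of decoding those bytes with the charset the part declares is shown here. The helper name part_to_text and the sample message are made up for illustration, and falling back to UTF-8 when no charset is declared is an assumption, not something the email module does for you.

```python
from email import message_from_bytes
from email.message import Message


def part_to_text(part: Message, fallback: str = "utf-8") -> str:
    """Decode a non-multipart MIME part to str using its declared charset."""
    # Bytes after base64 / quoted-printable decoding (None for multipart parts).
    payload = part.get_payload(decode=True)
    if payload is None:
        return ""
    # Charset from the part's Content-Type header, or the assumed fallback.
    charset = part.get_content_charset() or fallback
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # The header named a charset Python does not know.
        return payload.decode(fallback, errors="replace")


# Hypothetical single-part message declaring ISO-8859-1.
raw = (
    b'Content-Type: text/html; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>caf=E9</p>\r\n"
)
html = part_to_text(message_from_bytes(raw))
print(html)
```

The resulting str can then be written with open(path, 'w', encoding='utf-8'), so the bytes on disk agree with a charset=UTF-8 meta tag in the HTML.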

Posted 29-Oct-20 5:16am

Member 14978731

Comments

[no name] 29-Oct-20 14:27pm

It's punctuation; check your character set.
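
To illustrate the character set point: symptoms like Â and â typically show up when UTF-8 bytes are read with a single-byte charset such as Windows-1252, and � shows up when bytes cannot be decoded at all. The sketch below reproduces both with a made-up string; it is not taken from the poster's email, and whether this is the exact cause in their case is an assumption.

```python
# Sample text with an en dash (U+2013) and a non-breaking space (U+00A0).
text = "price \u2013 10\u00a0EUR"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as Windows-1252 turns each multi-byte character into
# several wrong ones: C2 A0 starts with 'Â', E2 80 93 starts with 'â'.
wrong = utf8_bytes.decode("cp1252")

# Reading them with a codec that cannot represent them yields U+FFFD, the
# replacement character that is displayed as '�'.
replaced = utf8_bytes.decode("ascii", errors="replace")

# Reading them with the charset they were written in round-trips cleanly.
right = utf8_bytes.decode("utf-8")

print(repr(wrong))
print(repr(replaced))
print(right == text)
```

If the HTML itself is correct UTF-8 and the stray characters appear only in the PDF, the converter may be guessing the charset. The pdfkit README lists an encoding option that is passed through to wkhtmltopdf, for example options={'encoding': 'UTF-8'}, which may be worth trying.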

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)