Hi folks - I need your help. I have been trying for several days to get rid of this problem, but somehow I have not managed it: I would like to use some .NET code to read and interpret my e-mails automatically.

Basically it works fine, but some UTF characters are getting in the way. This is what happens: the e-mail header says the mail is encoded in UTF-8. To read my mails I use ReadLine() from the StreamReader class. I store the return values in a String object.

As far as I know, StreamReader is set to UTF-8 by default. I have also read that String objects are Unicode. Because UTF-8 is also Unicode, I do not understand why I get return values such as "=C3=A4" or "=E2=80=9C" within the normal text.

Besides:
StreamReader^ reader = gcnew StreamReader(sslstream);

I have tried:
StreamReader^ reader = gcnew StreamReader(sslstream, Encoding::UTF8, false);

and
Encoding ^enc = Encoding::GetEncoding("utf-8");
StreamReader^ reader = gcnew StreamReader(sslstream, enc, false);

(where false disables the automatic search for byte order marks that identify the encoding)


Nothing changes and I don't know why...

What I find strange (when debugging the StreamReader object) is that StreamReader's "CurrentEncoding" value is set to
CurrentEncoding = 0x00c6bfa4 { CodePageASCII=20127 ISO_8859_1=28591 ...}

I think the encoding mode is the problem. If StreamReader tries to read the mail in ASCII mode, it must have a problem with special characters. The only question is how I can force it to switch to Unicode/UTF-8. Whatever I do when creating the StreamReader object seems to have no effect.

Can you help? Thanks a lot!
Updated 23-Feb-11 7:53am

1 solution

First of all, you usually should not assume the encoding in StreamReader. Use the other constructors, those accepting the Boolean parameter detectEncodingFromByteOrderMarks. This API recognizes a BOM at the beginning of the stream.

For more information, see http://en.wikipedia.org/wiki/Byte-order_mark and http://www.unicode.org/faq/utf_bom.html#BOM.
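
To make the BOM idea concrete, here is a minimal C++ sketch of what detection from byte order marks amounts to: comparing the first few bytes of the data against the known signatures. The function name detect_bom is hypothetical, and the sketch uses only the standard library; it is not how StreamReader is implemented, only an illustration of the check.

```cpp
#include <cstddef>
#include <string>

// Returns the name of the encoding announced by a leading byte order mark,
// or an empty string when the data does not start with a known BOM.
// (detect_bom is a hypothetical helper name used for illustration.)
std::string detect_bom(const std::string& bytes) {
    auto starts_with = [&](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3)) return "UTF-8";
    // UTF-32LE must be tested before UTF-16LE: both begin with FF FE.
    if (starts_with("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (starts_with("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (starts_with("\xFF\xFE", 2)) return "UTF-16LE";
    if (starts_with("\xFE\xFF", 2)) return "UTF-16BE";
    return "";
}
```

If the function returns an empty string, there is no BOM to detect, and the encoding has to come from somewhere else.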

—SA
 
Comments
Widder29 21-Feb-11 13:23pm    
Hi, thank you very much for your answer!

First, let me tell you that in the meantime I have tried the other constructor: StreamReader^ reader = gcnew StreamReader(sslstream, true); (where true enables detectEncodingFromByteOrderMarks)

I am very sorry to tell you that this change had no effect.

I still get special characters.

Second, I already know about the BOM - and your suggestion to use the constructor above surely might be a good idea. But I wonder why StreamReader does not like to be forced into UTF-8 mode.

Anyhow - do you have another idea I could try?
Sergey Alexandrovich Kryukov 21-Feb-11 16:42pm    
Look, you know about the BOM, but are you sure you really have it in your text? Just answer to close this part. You could check it in, say, Notepad via "Save As".

What do you mean by "StreamReader does not like to be forced into UTF-8 mode"? It will use the mode you tell it to. Half of the problem is reading, the other half is writing: what is really in the text and why. Unicode will ultimately be read one-to-one; you can test this by reading the text, writing back a copy and comparing the two.

So, as soon as we sort out encoding and reading, the remaining part is: what's in your file? You can always read the file as binary and compare what's expected with what you read. Essentially, in Unicode there are no "special characters" except BOMs and surrogates. What you see is probably something else. How would I know if it's wrong or not? This is something in the file, that's it. Why? OK, do you have the code which wrote the file?
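
A minimal C++ sketch of the "read it as binary and compare" check. The helper name to_hex is hypothetical, and the sketch assumes the raw bytes have already been read from the stream into a std::string; it only formats them so they can be compared with what is expected.

```cpp
#include <cstddef>
#include <string>

// Formats raw bytes as space-separated upper-case hex pairs, e.g. "C3 A4".
// (to_hex is a hypothetical helper name used for illustration.)
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

If the dump of a suspicious spot reads "3D 43 33 3D 41 34", the data literally contains the six ASCII characters "=C3=A4", and the reader has decoded them faithfully. If it reads "C3 A4", the data contains the two UTF-8 bytes of a single character.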
--SA

Widder29 22-Feb-11 16:40pm    
Hi again. No, I cannot say whether there is a BOM in the emails. I also have no "SAVE/SAVE AS" functionality in my webmail, so I cannot save a mail to a file and look for it.

By "StreamReader does not like to be forced into UTF-8 mode" I mean that (while debugging the StreamReader object) I always find StreamReader's "CurrentEncoding" value set to: CurrentEncoding = 0x00c6bfa4 { CodePageASCII=20127 ISO_8859_1=28591 ...}

This happens every time, no matter which constructor and encoding I use to create the StreamReader object. I expected to see something else there, like, say, "CodePageUTF8=65001" or similar. It seems clear to me that with an ASCII code page there will never be a correct decoding.

OK, "special characters" was maybe the wrong phrase in this context. I will change it to "characters that are specific to certain languages". EXAMPLE: In this case, instead of the German letter 'ä' I get "=C3=A4", and instead of the German letter 'ö' I get "=C3=B6" within the mail body. I see the same behaviour with specific French and Norwegian letters.
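
The sequences "=C3=A4" and "=C3=B6" have the shape of quoted-printable escapes: an '=' followed by two hex digits stands for one byte, and C3 A4 and C3 B6 are the UTF-8 byte pairs for 'ä' and 'ö'. The sample mail further down also declares "Content-Transfer-Encoding: quoted-printable". Under that assumption, a minimal C++ sketch of a decoder could look like this. The names hex_value and decode_quoted_printable are hypothetical, the sketch uses only the standard library, and it handles just the basic rules (escapes and soft line breaks), not every corner of the MIME specification.

```cpp
#include <cstddef>
#include <string>

// Value of one hex digit, or -1 if c is not a hex digit.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decodes quoted-printable text into raw bytes. The result is still encoded
// in the charset named by the mail part's Content-Type header.
std::string decode_quoted_printable(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '=') { out += in[i]; continue; }
        // Soft line break: '=' at the very end (the line terminator was
        // already stripped, as ReadLine() does), or '=' before LF or CRLF.
        if (i + 1 == in.size()) break;
        if (in[i + 1] == '\n') { i += 1; continue; }
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n') {
            i += 2;
            continue;
        }
        // Escape: '=' followed by two hex digits is one byte.
        if (i + 2 < in.size()) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += '=';  // Malformed escape: keep the '=' literally.
    }
    return out;
}
```

The decoded bytes still have to be converted to characters with the charset of that mail part. In .NET this would be a call such as Encoding::GetEncoding(name)->GetString(bytes) on the decoded byte array, with name taken from the mail's "charset" parameter.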

No, sorry - there is no code of mine that has written any of the mails. In total there are some thousands of (more or less) regular emails from all over the world, and the number is increasing every day. I do not know which mail software generated them.

Hope this all helps you understand what I wanted to say initially. Thanks a lot for trying to help!
Sergey Alexandrovich Kryukov 22-Feb-11 18:26pm    
What do you mean by "I cannot say if there is a BOM in the emails."? You can use a binary editor to see, or check the StreamReader under the debugger.
E-mails usually (or maybe by standard) go without BOMs. In this case, you should enforce the encoding given in the "charset" parameter of the e-mail, if any.
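
A minimal C++ sketch of picking the "charset" parameter out of a Content-Type header line, so that its value can be used to choose the encoding. The function name charset_of is hypothetical, and the search is deliberately naive: it looks for the text "charset=" anywhere in the line and does not handle headers folded across several lines.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the charset parameter of a Content-Type header line in lower case,
// or an empty string if the line has none.
// (charset_of is a hypothetical helper name used for illustration.)
std::string charset_of(const std::string& header) {
    std::string lower(header);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    if (pos < lower.size() && lower[pos] == '"') {
        // Quoted value: take everything up to the closing quote.
        std::size_t end = lower.find('"', pos + 1);
        return lower.substr(pos + 1,
                            end == std::string::npos ? end : end - pos - 1);
    }
    // Unquoted value: ends at ';', whitespace, or the end of the line.
    std::size_t end = lower.find_first_of("; \t\r\n", pos);
    return lower.substr(pos, end == std::string::npos ? end : end - pos);
}
```

For a line such as "Content-Type: text/plain; charset=windows-1252" the function returns "windows-1252", which is the kind of name Encoding::GetEncoding accepts.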
It may so happen that the e-mail is sub-standard. Your "codes" look familiar. Can you post or reference a sample of your e-mail with an explanation of what is supposed to be there? Also, please tell me, can you open your e-mail with "Microsoft Outlook Express" (Important: Express) and see if you can see what's expected? This application also has a View/Encoding option; you can select what looks right and so figure out the encoding.

--SA
Widder29 23-Feb-11 5:56am    
"You can use binary editor to see, or check up StreamReader under debugger."

While debugging I did not see anything that looked like a BOM. I am afraid I cannot use a binary editor, because I cannot save the mails to disk and look into them with the editor. Or do you know a binary editor that is able to connect to a webmail service? Then I would check for it, of course.

No, I have no Outlook/Outlook Express, and if I can avoid it, I won't install any client-based mail software.

I will post an email here. It is a common PayPal e-mail and therefore, I believe, not sub-standard. But who knows for sure...

###########################################################
E-MAIL FOLLOWS - All "should be"-Text is marked with ***
All blocks with "should be"-Text are repeated by me in ()-brackets
###########################################################

* OK Gimap ready for requests from ##.###.###.### i12if1339990bkh.43
* CAPABILITY IMAP4rev1 UNSELECT LITERAL+ IDLE NAMESPACE QUOTA ID XLIST CHILDREN X-GM-EXT-1 UIDPLUS COMPRESS=DEFLATE
* FLAGS (\Answered \Flagged \Draft \Deleted \Seen)
* OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted \Seen \*)]
* OK [UIDVALIDITY 598686366]
* 24310 EXISTS
* 0 RECENT
* OK [UIDNEXT 37895]
A0002 OK [READ-WRITE] INBOX selected. (Success)
* 22916 FETCH (BODY[] {14521}
Delivered-To: USERNAME@googlemail.com
Received: by ##.###.###.### with SMTP id ##############;
Wed, 15 Dec 2010 08:22:03 -0800 (PST)
Received: by ##.###.###.### with SMTP id ###############.############;
Wed, 15 Dec 2010 08:22:01 -0800 (PST)
Return-Path: <payment@paypal.com>
Received: from someserver.com (someserver.com [##.##.###.###])
by mx.google.com with ESMTP id ######################;
Wed, 15 Dec 2010 08:22:00 -0800 (PST)
Received-SPF: softfail (google.com: domain of transitioning payment@paypal.com does not designate ##.##.###.### as permitted sender) client-ip=##.##.###.###;
DomainKey-Status: good
Authentication-Results: mx.google.com; spf=softfail (google.com: domain of transitioning payment@paypal.com does not designate ##.##.###.### as permitted sender) smtp.mail=payment@paypal.com; domainkeys=pass header.From=sendmail@paypal.com
Received: from mx0.phx.paypal.com (mx0.phx.paypal.com [##.###.###.###])
by someserver.com (Postfix) with ESMTP id ###############
for <info@username.de>; Wed, 15 Dec 2010 17:21:58 +0100 (CET)
DomainKey-Signature: s=dkim; d=paypal.com; c=nofws; q=dns;
h=Received:Date:Message-Id:Subject:X-MaxCode-Template:To:
From:Sender:X-Email-Type-Id:X-XPT-XSL-Name:Content-Type:
MIME-Version;
b=iIo9Uhm+7eu7KDz6w1S/YSRLwjpr0x///rdj18ZudQDh8B7CGzpyzRFR
pnr+5ct6/T4gw/un81kwRohizSwj7PFhxfRcbNjF1zY691gbUarkSHsX8
cOt0e07llFWdKD73+Xmvsk6qCYbAqJ2I92YQ5/fJ97D19tuj3OCMpIwnZ
c=;
Received: (qmail 4368 invoked by uid 993); 15 Dec 2010 16:21:57 -0000
Date: Wed, 15 Dec 2010 08:21:57 -0800
Message-Id: <#########.####@paypal.com>
Subject: PayPal-Zahlungsanforderung von ### & ###
X-MaxCode-Template: email-transaction-counterparty
To: ### & ### <info@username.de>
From: "######@web.de" <###@web.de>
Sender: sendmail@paypal.com
X-Email-Type-Id: PP274
X-XPT-XSL-Name:
email_pimp/default/de_DE/transaction/seller/TransactionCounterparty.xsl
Content-Type: multipart/alternative;
boundary=--NextPart_048F8BC8A2197DE2036A
MIME-Version: 1.0

----NextPart_048F8BC8A2197DE2036A
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=windows-1252

Guten Tag, ### & ###=21

#### hat Ihnen eine Zahlung gesendet.


----------------------------------------------------------------


-----------------------------------
Zahlungsdetails
-----------------------------------

Betrag: =##,## EUR
=20
Transaktionsdatum: 15. Dezember 2010
=20
Transaktionscode: ##################
=20
Betreff: PayPal-Zahlungsanforderung von ### & ###

Loggen Sie sich in Ihr Konto ein, und =F6ffnen Sie die Registerkarte =
=22Kontoauszug=22, um die Details zu dieser Transaktion ei

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


