I have an ASCII text file that contains valid 8-bit character codes.
How do I read this file and have the 8-bit chars translated into valid
Unicode? I know I could UTF-8 encode the file, or I could read the bytes
and then encode them. But this all assumes that I know about the 8-bit
codes beforehand.

Is there any method that will read the file and automatically do the
conversion?

James Johnson

Simply read with

C#
System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName,  System.Text.Encoding.ASCII);

or, more universally, let the reader auto-detect the encoding from the byte order mark (BOM), if the file has one:
C#
System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName,  true);

It will give you Unicode string(s) based on your ASCII data. In principle, this is all you need. You can write it back with
C#
bool appendOrNot = false; // true to append to an existing file, false to overwrite it
System.IO.StreamWriter writer =
   new System.IO.StreamWriter(fileName,  appendOrNot, System.Text.Encoding.UTF8);


As your text data in memory is, generally speaking, always Unicode, prefer writing output in one of the Unicode UTFs only. The only text encoding supported internally for character and string data is UTF-16. All other encodings are supported only for persistence; they are represented in memory as arrays of bytes, with no regard to character boundaries, which can vary: in UTF-8, a character takes 1 to 4 bytes; in UTF-16, one or two 16-bit words (two words are called a surrogate pair); in UTF-32, always one 32-bit word. Please see the two Unicode links below.
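To make the varying character sizes concrete, here is a minimal sketch that counts the bytes one string needs in each UTF. The class and method names (UtfSizeDemo, ByteCounts) are made up for illustration; only the System.Text.Encoding calls are part of .NET.

```csharp
using System.Text;

static class UtfSizeDemo
{
    // Returns the encoded size in bytes of s as { UTF-8, UTF-16, UTF-32 }.
    // GetByteCount counts the character data only, not a BOM.
    public static int[] ByteCounts(string s)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(s),    // 1 to 4 bytes per character
            Encoding.Unicode.GetByteCount(s), // UTF-16: 2 bytes, or 4 for a surrogate pair
            Encoding.UTF32.GetByteCount(s),   // always 4 bytes per character
        };
    }
}
```

For example, "A" should give 1, 2 and 4 bytes, the euro sign "\u20AC" 3, 2 and 4, and a character outside the Basic Multilingual Plane such as "\U0001F600" 4 bytes in every UTF, while its string Length is 2 because it is stored as a surrogate pair.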

Please see:

http://msdn.microsoft.com/en-us/library/system.io.streamreader.aspx,
http://msdn.microsoft.com/en-us/library/system.io.streamwriter.aspx,
http://msdn.microsoft.com/en-us/library/f5f5x7kt.aspx,
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx;

You also need to understand how Unicode and BOM work:
http://unicode.org/,
http://unicode.org/faq/utf_bom.html.

BOM (or its absence) is used for auto-detection of encoding mentioned above.

[EDIT]

Apparently, among the Unicode UTFs, auto-detection of the encoding by BOM falls short in only one case: the encoding is some Unicode UTF and you know which one it is, but the BOM is not present. Such things happen; in that case, specify the encoding explicitly. This is also explained in the last Unicode article referenced above.
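As a rough sketch of how to check what the reader actually detected, StreamReader.CurrentEncoding can be inspected after the first read. The helper name DetectByBom is hypothetical; when no BOM is found, this constructor overload falls back to UTF-8, so the result does not prove that the file really is UTF-8.

```csharp
using System.IO;
using System.Text;

static class BomDemo
{
    // Returns the encoding the reader settles on for the given file.
    // Only Unicode UTFs announced by a BOM can be recognized this way.
    public static Encoding DetectByBom(string path)
    {
        using (var reader = new StreamReader(path, true))
        {
            reader.Read(); // detection is not done until the first read
            return reader.CurrentEncoding;
        }
    }
}
```

A file written with System.Text.Encoding.Unicode (which emits a BOM) should come back as UTF-16, code page 1200; a BOM-less 8-bit file comes back as UTF-8, code page 65001, which is only the fallback and not a detection.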

—SA
 
Comments
Andreas Gieriet 29-Jan-12 22:42pm    
You assume 7-bit ASCII here. As WBurgMo refers to "8-bit ASCII" (which does not exist), he must give the code page of the encoding. E.g. in your code, a small modification is needed for that:

int codepage = ...; // e.g. 1250 for Windows-1250, or 28592 for ISO-8859-2
System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName, System.Text.Encoding.GetEncoding(codepage));


See also my solution.

Cheers

Andi
Hello WBurgMo,

there is no such thing as a "valid 8-bit ASCII code" (see
http://en.wikipedia.org/wiki/ASCII).

If you have plain 7-bit ASCII text, you may use the ASCIIEncoding to read the data. If you have some 8-bit extension of the 7-bit ASCII encoding, you must specify the code page as described in http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (see the static GetEncoding method that takes the code page as argument; the Encoding constructor itself is protected).

Note: you must supply the code page information from outside, i.e. there is no way to deduce from the 8-bit ASCII-extended text alone which code page it uses.
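Putting this together, a minimal sketch of the whole conversion, assuming you already know the code page: read the 8-bit file with that encoding and write it back as UTF-8. The class, method and parameter names are made up for illustration. On .NET Core and later, most Windows code pages (for example 1250 or 1252) are available only after registering CodePagesEncodingProvider; ISO-8859-1 (28591) works without it.

```csharp
using System.IO;
using System.Text;

static class CodePageDemo
{
    // Reads sourcePath as text in the given code page and rewrites it as UTF-8.
    // The caller must know the code page; it cannot be deduced from the bytes.
    public static void ConvertToUtf8(string sourcePath, string targetPath, int codePage)
    {
        // On .NET Core / .NET 5+ you may first need (assumption about your runtime):
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding source = Encoding.GetEncoding(codePage);
        string text = File.ReadAllText(sourcePath, source); // now UTF-16 in memory
        File.WriteAllText(targetPath, text, Encoding.UTF8);
    }
}
```

Calling ConvertToUtf8(inputPath, outputPath, 28591) on a file holding the single byte 0xE9 should produce a UTF-8 file containing the character U+00E9.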

Cheers

Andi
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


