Reading a file with 8-bit ASCII characters

Question

I have an ASCII text file that contains valid 8-bit character codes.
How do I read this file and have the 8-bit characters translated into valid
Unicode? I know I could UTF-8 encode the file, or read the bytes
and then encode them. But this all assumes that I know about the 8-bit
codes beforehand.

Is there any method that will read the file and automatically do the
conversion?

James Johnson

Posted 29-Jan-12 10:31am

WBurgMo

2 solutions

Solution 1

Simply read with

C#

System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName,  System.Text.Encoding.ASCII);

or, more universally, auto-detect the encoding:

C#

System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName,  true);

It will give you Unicode string(s) based on your ASCII data. In principle, this is all you need. You can write it back with

C#

bool appendOrNot = false; // placeholder: true to append, false to overwrite
System.IO.StreamWriter writer =
   new System.IO.StreamWriter(fileName,  appendOrNot, System.Text.Encoding.UTF8);

As your text data in memory is always Unicode, prefer one of the Unicode UTFs on output. The only encoding used internally for character and string data is UTF-16. All other encodings are supported only for persistence; they are represented in memory as arrays of bytes, with no regard to character boundaries, which can vary: in UTF-8 a character takes 1 to 4 bytes; in UTF-16, one or two 16-bit words (two words are called a surrogate pair); in UTF-32, always one 32-bit word. Please see the two Unicode links below.
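To make the size remarks concrete, here is a minimal sketch that counts how many bytes one string needs in each UTF. The class and method names (EncodingSizes, Sizes) are made up for illustration; the only library calls are Encoding.GetByteCount and String.Length.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Returns { UTF-8 bytes, UTF-16 bytes, UTF-32 bytes, UTF-16 code units }
    // for the given text. GetByteCount does not include a BOM.
    public static int[] Sizes(string text)
    {
        return new int[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text), // UTF-16, little endian
            Encoding.UTF32.GetByteCount(text),
            text.Length                          // number of 16-bit chars in memory
        };
    }

    // Prints the sizes for a few sample characters.
    public static void Demo()
    {
        // U+0041, U+00E9, U+20AC and U+1F600 (outside the BMP, so a surrogate pair).
        string[] samples = { "A", "\u00E9", "\u20AC", "\U0001F600" };
        foreach (string s in samples)
        {
            int[] n = Sizes(s);
            Console.WriteLine("UTF-8: {0}, UTF-16: {1}, UTF-32: {2}, chars: {3}",
                n[0], n[1], n[2], n[3]);
        }
    }
}
```

Calling EncodingSizes.Demo() from your Main should show UTF-8 growing from 1 to 4 bytes across the four samples, while the last sample is the only one that needs two 16-bit chars in a .NET string.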

Please see:

http://msdn.microsoft.com/en-us/library/system.io.streamreader.aspx
http://msdn.microsoft.com/en-us/library/system.io.streamwriter.aspx
http://msdn.microsoft.com/en-us/library/f5f5x7kt.aspx
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

You also need to understand how Unicode and BOM work:
http://unicode.org/
http://unicode.org/faq/utf_bom.html

BOM (or its absence) is used for auto-detection of encoding mentioned above.
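A rough sketch of what BOM-based detection does and does not give you: it writes a temporary file in some encoding, reads it back with the auto-detecting StreamReader, and reports which encoding the reader settled on. BomDetection and RoundTrip are hypothetical names; code page 28591 (ISO-8859-1) is used only as an example of an 8-bit encoding without a BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetection
{
    // Writes text to a temp file in writeEncoding, then reads it back while
    // letting StreamReader look for a BOM. Returns the decoded text;
    // detectedName receives the WebName of the encoding the reader used.
    public static string RoundTrip(string text, Encoding writeEncoding, out string detectedName)
    {
        string path = Path.GetTempFileName();
        try
        {
            // StreamWriter emits the encoding's preamble (BOM), if it has one,
            // at the start of a new file.
            using (var writer = new StreamWriter(path, false, writeEncoding))
            {
                writer.Write(text);
            }

            // detectEncodingFromByteOrderMarks = true; without a BOM this
            // overload falls back to UTF-8.
            using (var reader = new StreamReader(path, true))
            {
                string result = reader.ReadToEnd();
                // CurrentEncoding is only meaningful after the first read.
                detectedName = reader.CurrentEncoding.WebName;
                return result;
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

For example, RoundTrip("h\u00E9llo", Encoding.Unicode, out name) should return the text unchanged with name set to "utf-16", because the file starts with a BOM. Passing Encoding.GetEncoding(28591) instead writes no BOM, the reader falls back to UTF-8, and the accented character does not survive. That is the case the question is about.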

[EDIT]

Auto-detection by BOM only helps when the file is in some Unicode UTF and actually starts with a BOM. If the file is in a Unicode UTF but the BOM is not present, you have to know which encoding it is and specify it yourself. Such things happen. This is also explained in the last Unicode article referenced above.

—SA

Posted 29-Jan-12 11:13am

Sergey Alexandrovich Kryukov

Updated 29-Jan-12 13:52pm

v4

Comments

Andreas Gieriet 29-Jan-12 22:42pm

You assume here 7-bit ASCII. As WBurgMo refers to "8-bit ASCII" (which does not exist), he must give the code page of the encoding. E.g. in your code, a small modification is needed for that:

int codepage = ...; // e.g. 1250 for Windows-1250, or 28592 for ISO-8859-2
System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName, System.Text.Encoding.GetEncoding(codepage));


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Andreas Gieriet · Accepted Answer · 2012-01-29T16:35:00

Hello WBurgMo,

There is no such thing as a "valid 8-bit ASCII code" (see
http://en.wikipedia.org/wiki/ASCII).

If you have plain 7-bit ASCII text, you may use ASCIIEncoding to read the data. If you have some 8-bit extension of the 7-bit ASCII encoding, you must specify the code page as described in http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (see the GetEncoding method that takes the code page as argument).

Note: you must supply the code page from outside, i.e. there is no reliable way to deduce from the 8-bit ASCII-extended text which code page it is.
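As a sketch of the code-page approach, assuming you already know the code page: the helper below reads a file with an explicitly chosen encoding. CodePageReader and ReadAll are names invented for this example; Encoding.GetEncoding(int) and the StreamReader(string, Encoding) constructor are the actual framework calls.

```csharp
using System;
using System.IO;
using System.Text;

public static class CodePageReader
{
    // Reads the whole file, decoding its bytes with the given code page.
    // The caller must know the code page; it cannot be read from the file.
    public static string ReadAll(string path, int codePage)
    {
        // Note: on .NET Core / .NET 5+, most legacy code pages (1250, 1252, ...)
        // are only available after calling
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        // 28591 (ISO-8859-1) and 20127 (US-ASCII) are built in.
        Encoding encoding = Encoding.GetEncoding(codePage);
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For instance, a file holding the four bytes 63 61 66 E9 (hex) comes back as "café" with ReadAll(path, 28591), whereas ReadAll(path, 20127) loses the last character, because byte E9 is not valid 7-bit ASCII.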

Cheers

Andi