Hi All,

I created a “Web Crawler” application that downloads pages from websites.

What I need now is this: if a site, say a wiki, has a page in Hindi or Arabic, the text should appear in my saved file with the same encoding.

I download the pages and save them to files, but the saved text does not have the same encoding. If I copy and paste a page into Notepad, however, the encoding is preserved and the text looks exactly the same.

Can anyone provide any information on this?

Thanks

The charset parameter of the Content-Type response header should tell you what text encoding is used (Content-Encoding only describes compression, such as gzip). You should then extract the text content from the response using that encoding.
 
Comments
sunder.tinwar 1-Jun-11 7:26am    
Hi, I think I did not make myself clear.
public string DownloadWebPage(string Url)
{
    // Open a connection
    HttpWebRequest WebRequestObject = (HttpWebRequest)HttpWebRequest.Create(Url);

    // You can also specify additional header values like
    // the user agent or the referer:
    WebRequestObject.UserAgent = ".NET Framework/2.0";
    WebRequestObject.Referer = "http://www.example.com/";

    // Request response:
    WebResponse Response = WebRequestObject.GetResponse();
    // Open data stream:
    Stream WebStream = Response.GetResponseStream();

    // Create reader object:
    StreamReader Reader = new StreamReader(WebStream);

    // Read the entire stream content:
    string PageContent = Reader.ReadToEnd();

    // Cleanup
    Reader.Close();
    WebStream.Close();
    Response.Close();

    return PageContent;
}
This is the code I am using to download a web page. If a page is in Arabic, then when it is stored in a file the text should look the same as it does on the site.
For example, an Arabic page may use UTF-8, UTF-16, or something else, and I have to use the same encoding when writing the file.
How can I get the text in the same Unicode format?
BobJanova 1-Jun-11 18:13pm    
You need to set the StreamReader to use the same encoding as the page.
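One way to do this is sketched below; it is a rough illustration, not a definitive implementation. It reads the charset that the server declares through `HttpWebResponse.CharacterSet` and passes the matching `Encoding` to the `StreamReader`. The class name `PageEncoding`, the helper `ResolveEncoding`, the `SavePage` method, and the fallback to UTF-8 for a missing or unknown charset are all assumptions made for this example. Pages that declare their charset only in an HTML meta tag are not handled here.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageEncoding
{
    // Hypothetical helper: map a charset name to an Encoding.
    // Assumption: fall back to UTF-8 when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers wrap the charset value in quotes.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    public static string DownloadWebPage(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // The using blocks close the response, stream, and reader
        // even if an exception is thrown.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, ResolveEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }

    // Once decoded, the text is a .NET string. Writing it as UTF-8 with a
    // byte order mark lets editors such as Notepad detect the encoding.
    public static void SavePage(string path, string content)
    {
        File.WriteAllText(path, content, new UTF8Encoding(true));
    }
}
```

Note that the saved file does not have to use the same encoding as the page. Once the bytes are decoded correctly, any Unicode encoding such as UTF-8 can represent the Arabic or Hindi text when writing the file.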
sunder.tinwar 1-Jun-11 7:29am    
How does Windows determine what kind of text it is? When I press Ctrl+C on a site page, say one in Arabic, and press Ctrl+V in Notepad, the text looks exactly the same as on the site page.

Can anyone shed some light on this?
Sergey Alexandrovich Kryukov 1-Jun-11 15:54pm    
Arabic text is not an encoding. It is all Unicode.
--SA
Sergey Alexandrovich Kryukov 1-Jun-11 15:54pm    
Correct. My 5.
--SA
There is probably something you are missing. Both .NET and the Web use Unicode, but the Web also still carries many obsolete encodings. These days, the only sound way to encode text that mixes different languages (depending on which languages) is Unicode.

There is no single Unicode encoding. Unicode is just a table of one-to-one correspondence between characters (as cultural entities, abstracted from fonts, styles, encodings and other details) and integer values called code points. Code points are understood as abstract mathematical integers, abstracted from binary representation, size, little- or big-endian byte order and other technical details; all of that is covered by encodings. Unicode also defines several UTFs, which are the physical encodings of code points. No, Unicode is not a 16-bit encoding! The full set of code points needs more than 16 bits. All UTFs support code points beyond 16 bits.
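To make the distinction between code points and encodings concrete, here is a small sketch. The class name `CodePointDemo` and the helper `Hex` are made up for this illustration. It encodes the same Devanagari code point with three UTFs and shows that a code point above 16 bits is still representable in each of them.

```csharp
using System;
using System.Text;

public static class CodePointDemo
{
    // Hypothetical helper: encode the text and return the bytes as hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Show()
    {
        // U+0915 DEVANAGARI LETTER KA: one code point, three byte layouts.
        string ka = "\u0915";
        Console.WriteLine(Hex(Encoding.UTF8, ka));    // E0-A4-95
        Console.WriteLine(Hex(Encoding.Unicode, ka)); // 15-09 (UTF-16, little-endian)
        Console.WriteLine(Hex(Encoding.UTF32, ka));   // 15-09-00-00

        // U+1D11E MUSICAL SYMBOL G CLEF lies beyond 16 bits.
        string clef = char.ConvertFromUtf32(0x1D11E);
        Console.WriteLine(clef.Length);                 // 2 (a UTF-16 surrogate pair)
        Console.WriteLine(Hex(Encoding.UTF8, clef));    // F0-9D-84-9E
        Console.WriteLine(Hex(Encoding.Unicode, clef)); // 34-D8-1E-DD
    }
}
```

The code point stays the same in every case; only the bytes chosen by the encoding differ.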

Also, Unicode has no notion of languages. There are scripts, which are subsets of the code points. For example, Hindi is written in the Devanagari script, which also serves several of the most widely used languages of India, including Sanskrit.

See http://unicode.org/ and http://unicode.org/faq/utf_bom.html.

--SA
 
Comments
Monjurul Habib 1-Jun-11 18:20pm    
Nice links and description, my 5.
Sergey Alexandrovich Kryukov 1-Jun-11 18:32pm    
Thank you, Monjurul.
--SA
Espen Harlinn 1-Jun-11 18:39pm    
Right, my 5
Sergey Alexandrovich Kryukov 1-Jun-11 18:43pm    
Thank you, Espen.
--SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


