Reading non-English HTML pages with C#

https://stackoverflow.com/questions/3008482

26-09-2019

Question

I am trying to find a Hebrew string on a website. The code I use to download the page is below.

Afterward I read the dumped file with a StreamReader, but I can't match strings in languages other than English. What am I supposed to do?

    // requires: using System.IO; using System.Net;

    // buffer reused on each read operation
    byte[] buf = new byte[8192];

    // prepare the web page we will be asking for
    HttpWebRequest request = (HttpWebRequest)
        WebRequest.Create("http://www.webPage.co.il");

    // execute the request
    HttpWebResponse response = (HttpWebResponse)
        request.GetResponse();

    // we will read data via the response stream
    Stream resStream = response.GetResponseStream();

    int count = 0;
    FileStream fileDump = new FileStream(@"c:\dump.txt", FileMode.Create);
    do
    {
        count = resStream.Read(buf, 0, buf.Length);
        // write only the bytes actually read, not the whole buffer
        fileDump.Write(buf, 0, count);
    }
    while (count > 0); // any more data to read?

    fileDump.Close();

Solution

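You are missing the appropriate encoding. A StreamReader assumes UTF-8 unless you pass it a different encoding, so text served in another code page is decoded incorrectly. See the documentation for the WebResponse.GetResponseStream method for details.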

Update: use the Hebrew (Windows) encoding, which is code page 1255.

Encoding encode = System.Text.Encoding.GetEncoding(1255); // Hebrew (Windows)

// Pipe the stream to a higher-level stream reader with the required encoding.
StreamReader readStream = new StreamReader(resStream, encode);
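
To see why the encoding matters, here is a minimal sketch that runs without a network connection. It assumes .NET Core or .NET 5+, where code page 1255 has to be registered through CodePagesEncodingProvider before GetEncoding can find it. The EncodingDemo class and its method names are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Returns the Hebrew (Windows) encoding, code page 1255.
    public static Encoding Hebrew()
    {
        // On .NET Core / .NET 5+ legacy code pages are not built in,
        // so the provider must be registered first (safe to call repeatedly).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1255);
    }

    // Decodes the raw bytes with the given encoding and searches for the needle.
    public static bool ContainsAfterDecoding(byte[] raw, Encoding encoding, string needle)
    {
        return encoding.GetString(raw).Contains(needle);
    }

    public static void Main()
    {
        Encoding hebrew = Hebrew();

        // Simulate the bytes a windows-1255 page would send over the wire.
        byte[] raw = hebrew.GetBytes("<p>שלום</p>");

        Console.WriteLine(ContainsAfterDecoding(raw, hebrew, "שלום"));        // decoded with 1255
        Console.WriteLine(ContainsAfterDecoding(raw, Encoding.UTF8, "שלום")); // decoded with UTF-8
    }
}
```

The same bytes match when decoded as windows-1255 and fail to match when decoded as UTF-8, because the single-byte Hebrew letters are not valid UTF-8 sequences and get replaced during decoding.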

OTHER TIPS

Solved it.

The problem was choosing the wrong encoding: I chose UTF-8, which isn't always the right answer :)

Key lines:

Encoding encode = System.Text.Encoding.GetEncoding("windows-1255");
StreamReader readStream = new StreamReader(ReceiveStream, encode);
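
If you would rather not hard-code the code page, one option is to take the charset name the server reports (HttpWebResponse exposes it as the CharacterSet property) and fall back to a default when it is missing or unknown. This is only a sketch under that assumption: CharsetResolver and Resolve are hypothetical names, and a server can omit or misreport its charset, so a fixed fallback such as windows-1255 may still be needed for a particular site.

```csharp
using System;
using System.Text;

public static class CharsetResolver
{
    // Hypothetical helper: maps a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding, or returns the fallback.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        // Needed on .NET Core / .NET 5+ so names like "windows-1255" resolve.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }

        try
        {
            // Some servers wrap the charset value in quotes; strip them.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

With the response object from the question, usage might look like `new StreamReader(resStream, CharsetResolver.Resolve(response.CharacterSet, Encoding.UTF8))`.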

Licensed under: CC-BY-SA with attribution
