Question

I am parsing some web content in a response from a HttpWebRequest.

This web content is using charset ISO-8859-1 and when parsing it and finally getting the word needed from the response, I am receiving a string with a question mark like this and I want to know which is the right way to transform it back into a readable string.

So, what I've tried is to convert the current word encoding into UTF-8 like this:

(I am wondering if UTF-8 could solve my problem)

string word = "ESPA�OL";

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");

byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);

string utfWord = utf.GetString(utfBytes);

Console.WriteLine(utfWord);

However, utfWord variable outputs ESPA?OL which is still wrong. The correct output is supposed to be ESPAÑOL.

Can someone please give me the right directions to solve this, if possible?

Was it helpful?

Solution

The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.

You can see this for yourself using the following simple program:

using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}

What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time that you have a � character, it is too late. The information has been lost. The presence of the � character indicates that, at some point, a conversion was performed into a character set that did not contain the character Ñ.

A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.

The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top