Two byte character in a single byte character encoded (ISO-8859-1) HTML document

https://stackoverflow.com/questions/19421557

01-07-2022

Question

I learned that ISO-8859-1 is a single-byte charset.

See the page http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=@@@&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News. It is written in Malayalam.

The HTTP header and the meta tag both declare ISO-8859-1 as the character encoding.

But this page uses a two-byte character, U+201A (http://unicodelookup.com/#%E2%80%9A).


(copy the character and look it up at http://unicodelookup.com)

<div id="articleTitleMal" style="padding-top:10px;">
    <font face= "Manorama" >
         ¼ÈØOVA¢: ÜÍß‚Äí 1.28 ...
    </font>
 </div>

How is it possible to use a two-byte character in a single-byte encoding?

I am not asking out of mere curiosity. One of my tasks is stuck because I do not understand this issue.

Update: They are using the font www.manoramaonline.com/portal/mmcss/Manorama.ttf, and I think some of the characters in the Manorama font use two bytes.

UPDATE2: I tried to convert the document from ISO-8859-1 to UTF-8 using the code below.

<?php
$t = file_get_contents('http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/malayalamContentView.do?tabId=11&programId=1073753760&BV_ID=@@@&contentId=15238737&contentType=EDITORIAL&articleType=Malayalam%20News');

// Change the charset info in meta-tag
$t  = str_replace('ISO-8859-1', 'UTF-8', $t);

file_put_contents('t.html', utf8_encode($t));

This time the character selected above is missing.


Solution

Even though the page is declared as ISO-8859-1 encoded in HTTP headers, browsers interpret it as Windows-1252 encoded. This is a longstanding tradition, now formalized in the WHATWG Encoding Standard, which treats the label ISO-8859-1 as a name for Windows-1252.

Thus, when the data contains the byte 82 (hex), it is not taken as a control character (as per ISO 8859-1) but as U+201A “‚” (as per Windows-1252).
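To make the difference concrete, here is a minimal Python sketch (Python is used only for illustration; the variable names are mine) that decodes the same single byte under both encodings:

```python
# The single byte as it appears in the served page.
raw = b"\x82"

# Strict ISO 8859-1: every byte maps to the code point with the same value,
# so 0x82 becomes the C1 control character U+0082.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 (what browsers actually apply): 0x82 is U+201A.
as_cp1252 = raw.decode("cp1252")

print(f"ISO-8859-1:   U+{ord(as_latin1):04X}")   # U+0082
print(f"Windows-1252: U+{ord(as_cp1252):04X}")   # U+201A
```

So the document never contains a two-byte character. It contains one byte, and the "two-byte" value is just the Unicode code point the browser assigns to that byte.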

However, the page uses font trickery that maps code positions to Malayalam characters according to a special internal, nonstandard encoding. (You can see this if you disable style sheets on the page. All texts become gibberish.) The page is not really meant to contain U+201A “‚” but the byte 82 to which a Malayalam character is assigned in the font.

So you need to preserve the byte as-is to get the same results. A conversion to UTF-8 would break this.
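As a rough illustration of what a conversion does to that byte, the sketch below compares two routes to UTF-8. The first mirrors what PHP's utf8_encode() does (it treats the input as ISO-8859-1); the second treats the input as Windows-1252, the way browsers do. The variable names are mine.

```python
raw = b"\x82"  # the byte as served

# Same assumption as PHP's utf8_encode(): input is ISO-8859-1.
# 0x82 becomes U+0082, a control character with no visible glyph.
via_latin1 = raw.decode("iso-8859-1").encode("utf-8")

# Browser-style assumption: input is Windows-1252.
# 0x82 becomes U+201A.
via_cp1252 = raw.decode("cp1252").encode("utf-8")

print(via_latin1)  # b'\xc2\x82'
print(via_cp1252)  # b'\xe2\x80\x9a'
```

In neither case does the original byte 82 survive, and the ISO-8859-1 route turns it into an invisible control character, which is probably why the character went missing in the converted file. Whether the Windows-1252 route would still render correctly with the font applied is something I have not tested.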

If you wanted to convert the data to Unicode, you would need to find out the internal encoding of the font being used and perform that mapping at the character level.
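Such a mapping could be sketched roughly as follows. Note that the table entries are placeholders I made up to show the shape of the code. They are NOT the real Manorama font mapping, which you would have to work out from the font itself. The function and table names are hypothetical too.

```python
# HYPOTHETICAL table: font byte value -> Unicode string.
# Placeholder entries only; the real font's mapping is unknown here.
FONT_TO_UNICODE = {
    0x82: "\u0D15",  # placeholder: MALAYALAM LETTER KA
    0xBC: "\u0D1C",  # placeholder: MALAYALAM LETTER JA
}

def font_bytes_to_unicode(data: bytes, table: dict[int, str]) -> str:
    """Map each raw byte of font-encoded text to Unicode via the table."""
    out = []
    for b in data:
        if b in table:
            out.append(table[b])
        elif b < 0x80:
            # Assumption: unmapped ASCII (digits, spaces, punctuation)
            # passes through. ASCII letters that the font also remaps
            # need their own table entries.
            out.append(chr(b))
        else:
            # Unmapped high byte: flag it rather than guess.
            out.append("\ufffd")
    return "".join(out)

print(font_bytes_to_unicode(b"\x82 1.28", FONT_TO_UNICODE))
```

The function works on the raw bytes of the text inside the font-styled elements, so it has to be applied before any charset conversion touches the data.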

Licensed under: CC-BY-SA with attribution
