Question

International HTML files archived by wget should contain characters like these

(example Hebrew and Thai): אב הם and ยคน

Instead, they are saved like this: íäáåãéú and ÃÒ¡à§é

How can I get these displayed properly?

I tried running iconv on the file, but it fails:

iconv filename.html
iconv: illegal input sequence at position 1254

SOLVED: There was nothing wrong with the files. I just hadn't noticed that the default php.ini sets a charset in the HTTP header, which overrides whatever the page itself declares. To serve pages in various charsets declared like this: meta http-equiv="Content-Type" content="text/html; charset=windows-874", you need to set default_charset to an empty value in php.ini: default_charset = ""; ....


Solution

The pages aren't "saved like this"; whatever you're using to view the file is simply interpreting the encoding incorrectly. To know what encoding the file is in, you would have had to note the HTTP Content-Type header during the download; that information is gone now.
Your only other chance is to parse the equivalent HTML meta tag in the <head>, if the document has one.
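As a rough illustration of reading that meta tag, here is a minimal Python sketch. The function names (sniff_meta_charset, decode_html), the 4096-byte scan limit, and the UTF-8 fallback are all assumptions made for this example, not part of wget or any library. The regular expression is only a heuristic, not a full HTML parser.

```python
import re

# Heuristic pattern covering both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw: bytes, scan_limit: int = 4096):
    """Return the charset named in a <meta> tag near the top of the file, or None."""
    match = META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw HTML bytes using the declared charset, else the fallback."""
    encoding = sniff_meta_charset(raw) or fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows; use the fallback.
        return raw.decode(fallback, errors="replace")
```

Usage would look like decode_html(open("filename.html", "rb").read()). The file has to be opened in binary mode so that Python does not apply its own default encoding first.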

Otherwise, you can only guess the encoding of the document.
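Guessing can be sketched as trying a short list of candidate encodings and keeping the ones that decode without error. The function name guess_decodings and the candidate list below are assumptions chosen for this example (Hebrew, Thai, and Western European code pages). A clean decode does not prove the guess is right, so a human still has to look at the results.

```python
def guess_decodings(raw: bytes, candidates=("utf-8", "cp1255", "cp874", "cp1252")):
    """Return {encoding: text} for every candidate that decodes raw without error."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            continue
    return results


# The garbled Hebrew sample from the question, as the bytes a
# Western-European viewer would have read it from.
raw = "íäáåãéú".encode("cp1252")
for encoding, text in guess_decodings(raw).items():
    print(encoding, text)
```

For these bytes UTF-8 is rejected outright, while several single-byte code pages decode cleanly. Only the Hebrew one produces Hebrew letters, which is why the final choice has to be made by reading the output.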

See "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" for more required background knowledge.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow