Question

I have the Wikipedia data dump and I'm trying to decode special characters in the page titles, except a lot of characters don't match up with the "standard" ASCII encoding (referenced from here).

As an example, in wikipedia ë and ã are given as:

ë = %C3%AB

ã = %C3%A3

Is there a defined key anywhere I can pull from?


Solution

It's UTF-8.
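
Each `%XX` pair in the title is one UTF-8 byte. As a minimal sketch in Python (the surrounding dump-parsing code is assumed), the standard library decodes these directly, since `unquote()` defaults to UTF-8:

```python
from urllib.parse import unquote

# Percent-encoded page titles: each %XX pair is one raw byte,
# and unquote() decodes the resulting byte sequence as UTF-8.
print(unquote("%C3%AB"))        # ë
print(unquote("%C3%A3"))        # ã
print(unquote("Citro%C3%ABn"))  # Citroën
```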

Also, neither character is in ASCII at all. They appear in various "extended ASCII" character sets, but those encodings are not ASCII; they're remnants of a wild-west age of character encodings. Treat them as legacy encodings that civilized people like us may have to decode but should ideally never produce. At least for ASCII there is a single table that almost the entire Western world can agree on (and the rest of the world too, if they use UTF-8), while "extended" character sets are so numerous that it's anyone's guess what any given byte above 127 means.

The page you're linking to tacitly assumes one of these many "extended" character sets and (if a quick search didn't betray me) fails to say which. Granted, in English texts it's often safe to assume some variant of Latin-1 (or ISO-whatsthenumber, etc.) is implied, but it's still sloppy. Furthermore, as far as I am aware, there is no universal standard for which encoding percent-encoded bytes should be interpreted in. Again, Latin-1 etc. are common but far from universal, even in English-language text. You should really get better sources.
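
To see why the assumed encoding matters, the very same percent-encoded bytes produce different text depending on which character set you decode them with (a small illustration in Python):

```python
from urllib.parse import unquote

raw = "%C3%AB"  # the two bytes 0xC3 0xAB

# Decoded as UTF-8 (the default), the pair is one character:
print(unquote(raw))                      # ë

# Decoded as Latin-1, each byte is its own character (mojibake):
print(unquote(raw, encoding="latin-1"))  # Ã«
```

This is exactly the kind of mismatch you see when a reference table silently assumes a legacy single-byte encoding.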

Licensed under: CC-BY-SA with attribution