Question

I'm facing a very strange encoding issue when processing some HTML source code. I have the following line:

"requête présentée par..."

When an external library calls utf8_decode on it, I get:

"reque^te présente´e par..."

So the accents are placed to the right of the accented characters. If I run utf8_encode on that result, I don't get the original "requête présentée par..." back; I still have "reque^te présente´e par..."

Even stranger: if I open the original HTML in Notepad++, the encoding is UTF-8 without BOM (so far, so good), but I can select half of a character with the text selection (keyboard or mouse). Yes, half of it, as if the real code were "e^" but displayed as "ê". When I copy it to my IDE, it copies "ê" but pastes "e^".

I have come up with a basic replacement function:

"e^" => "ê", "e´" => "é", ...

and some other French cases, and it's working properly for now. But as the HTML comes in different languages, I'm pretty sure I won't be able to replace every character affected by this encoding issue.

Has anybody faced this issue before and (hopefully) found a more general solution?

Thanks in advance.

Solution

It sounds like your HTML source is using combining characters. That is, instead of using a single Unicode character to represent the ê, it uses a regular e followed by a combining character that adds the circumflex. You can verify this with a hex editor: the combining circumflex is code point U+0302, which appears as the bytes CC 82 in UTF-8.
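
If a hex editor isn't handy, the same check can be done in a few lines of code. The sketch below uses Python purely for illustration (the question appears to involve PHP); the helper name `describe` and the sample strings are made up for this example, not taken from the original HTML.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Assumed samples: the decomposed and precomposed spellings of "ê".
decomposed = "e\u0302"   # 'e' followed by the combining circumflex
precomposed = "\u00ea"   # the single character 'ê'

for label, sample in (("decomposed", decomposed), ("precomposed", precomposed)):
    print(label, describe(sample), sample.encode("utf-8").hex(" "))
```

Both strings render identically, but the first is two code points and three UTF-8 bytes, while the second is one code point and two bytes. That is consistent with being able to select "half" of the character in an editor.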

See also Unicode equivalence.
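
A general fix, rather than a hand-written replacement table, is to normalize the text to the composed form (NFC) before passing it to anything that expects single-character accents. Here is a minimal sketch in Python, assuming the source really contains the combining characters U+0301 and U+0302; the function name `to_nfc` and the sample string are hypothetical. In PHP, the intl extension's `Normalizer::normalize` with `Normalizer::FORM_C` should play the same role.

```python
import unicodedata

def to_nfc(text):
    """Compose base letters with following combining marks where possible."""
    return unicodedata.normalize("NFC", text)

# Assumed input: the question's sample line in decomposed form.
decomposed = "reque\u0302te pre\u0301sente\u0301e par..."
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))  # the composed string is shorter
print(composed)
```

After normalization each accented letter is a single code point, so a later conversion to ISO-8859-1 (which is what utf8_decode does) has a matching character to map to. This covers any language whose accented letters have precomposed forms in Unicode, not just the French cases.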

Licensed under: CC-BY-SA with attribution