How do I fix invalid HTML characters in pages served with different encoding?

https://stackoverflow.com/questions/3833300

26-09-2019

Question

I have a number of websites that are rendering invalid characters. The pages' meta tags specify UTF-8 encoding. However, some pages contain bytes that are not valid UTF-8, probably because the files were saved with another encoding (such as ANSI, which on Western Windows systems typically means Windows-1252). The one I'm concerned about right now is a fancy apostrophe (as in "Bob’s"...sorry if that doesn't show up correctly). The W3C validator reports the character as "\x92", but it won't validate the file because that byte doesn't map to Unicode. And, of course, if I open the file in Notepad++ and change the encoding to UTF-8, the character is replaced by a 92 in a black box.

Here's my question: what's the easiest way to fix this? Do I have to open all the pages and replace that character with a conventional apostrophe? Or is there a quick fix I could add (say, to IIS) that might override or fix the encoding issue? Or do I have to brute-force find/replace? I have hundreds of pages on these websites and I have no idea how many of them I'd have to change, so if anyone knows a way I could either circumvent this problem or fix it quickly I would appreciate it.

Solution

Are you serving the pages as straight HTML, or do you have a script serving the content? If a script is serving the content, that script could just look for any instance of \x92 and replace it with an apostrophe. In PHP this would be a simple str_replace().
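As a rough illustration of that on-the-fly replacement, here is the same idea in Python rather than PHP. The function name `replace_cp1252_apostrophe` is made up for this sketch, and it assumes the body contains no valid multi-byte UTF-8 characters, because a 0x92 byte can legitimately occur inside one (for example in "→", which is E2 86 92).

```python
def replace_cp1252_apostrophe(body: bytes, replacement: bytes = b"'") -> bytes:
    """Swap every stray 0x92 byte for a plain ASCII apostrophe.

    Assumption: body holds no valid multi-byte UTF-8 sequences; a blind
    replace would corrupt any that happen to contain 0x92.
    """
    return body.replace(b"\x92", replacement)


page = b"<p>Bob\x92s page</p>"
print(replace_cp1252_apostrophe(page))  # b"<p>Bob's page</p>"
```

Working on bytes rather than decoded text matters here: the page cannot be decoded as UTF-8 until the stray byte is gone.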

If you're serving straight HTML, then you'll have to modify the files themselves. This can be automated, however (and probably should be if you have hundreds of files), depending on what tools you have available and what operating system you're on. Since you said you're using Notepad++, I suppose it's safe to assume you're on MS Windows (therefore no fun Unix commands to speed things up).

It may be possible to write a batch script that does this, however; there are some very simple ASCII text-editing tools built into Command Prompt. If that's not possible, you could write a C or C++ program to do it, provided you have a compiler on your system and moderate knowledge of C. If you have the former and not the latter, ask and I'll whip up some source for you.

OTHER TIPS

I'm not sure about the encoding part of it myself, but if you wind up having to do it by brute force, you could always write a short program that iterates through all of your web pages, loads each file into memory, runs a regex replace to fix the problem character, and saves the file back to disk. Obviously not ideal, but better than opening each file by hand.

Good Luck
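A minimal Python sketch of the brute-force pass described above, working on raw bytes. The name `fix_tree` and the `*.html` glob are hypothetical. The lookbehind in the pattern is my own heuristic, not part of the original suggestion: it skips a 0x92 that directly follows another high byte, so valid UTF-8 sequences such as E2 86 92 are left alone. Back up the files before running anything like this.

```python
import re
from pathlib import Path

# A 0x92 byte that does NOT directly follow another byte >= 0x80.
# Heuristic: inside valid UTF-8, 0x92 only appears as a continuation byte,
# which always follows a high byte, so those occurrences are skipped.
STRAY_APOSTROPHE = re.compile(rb"(?<![\x80-\xff])\x92")


def fix_tree(root, pattern="*.html"):
    """Rewrite matching files under root in place; return the paths changed."""
    changed = []
    for path in Path(root).rglob(pattern):
        data = path.read_bytes()
        fixed = STRAY_APOSTROPHE.sub(b"'", data)
        if fixed != data:
            path.write_bytes(fixed)
            changed.append(path)
    return changed
```

Calling `fix_tree` on the site's root folder returns the list of files it touched, which is handy for a quick review afterwards.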

I just ran into a similar issue where some non-breaking spaces ("\xA0") got into a supposedly UTF-8 document. In Notepad++ these are displayed in a black box with "xA0" written in it. However, Notepad++ doesn't allow them to be copied or pasted.

I did a little research and found out what is going on. A hex editor reveals that each of these is stored as the single byte "A0", which is invalid UTF-8. Any non-ASCII character takes at least two bytes in UTF-8, so the proper encoding is "C2 A0" in hexadecimal.

Your fancy apostrophe is the same kind of problem, with one extra complication. In Windows-1252 (often loosely called "extended ASCII"), byte \x92 (decimal 146) is the right single quotation mark, but in Unicode, U+0092 is a control character; the right single quotation mark is U+2019 (decimal 8217). Adding this symbol in Notepad++ (via Edit->Character panel) and inspecting the file in a hex editor reveals that the proper UTF-8 encoding is "E2 80 99", which in binary is 11100010 10000000 10011001. When you remove the UTF-8 marker bits (the leading 1110 of the first byte and the leading 10 of each of the other two bytes), the remaining bits are 0010 000000 011001, i.e. 0010 0000 0001 1001, which equals decimal 8217.
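The byte arithmetic above can be checked with a few lines of Python. This is only a sketch; `utf8_payload_bits` is a helper name invented here, and it relies on the standard rule that a leading byte of an n-byte sequence carries n one-bits and a zero as its marker, while each continuation byte carries the marker 10.

```python
def utf8_payload_bits(ch: str) -> str:
    """Return the code point bits left after stripping UTF-8 marker bits."""
    bits = [format(b, "08b") for b in ch.encode("utf-8")]
    if len(bits) == 1:
        return bits[0][1:]              # ASCII: marker is a single leading 0
    lead = bits[0][len(bits) + 1:]      # skip n ones plus the terminating zero
    return lead + "".join(b[2:] for b in bits[1:])  # drop "10" from the rest


quote = "\u2019"                        # right single quotation mark
print(quote.encode("utf-8").hex(" "))   # e2 80 99
print(utf8_payload_bits(quote))         # 0010000000011001
print(int(utf8_payload_bits(quote), 2)) # 8217

# The Windows-1252 byte 0x92 maps to the same character:
print(b"\x92".decode("cp1252") == quote)  # True
```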

The proper way of handling this would be to open your file as a byte stream (unsigned char * in C) and search for invalid UTF-8 sequences. Then you can either replace them with � (see https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences) or handle them case by case, making replacements like A0 -> C2 A0 (improperly encoded non-breaking space) and 92 -> E2 80 99 (improperly encoded right single quotation mark).
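One way to sketch this in Python, rather than C, is a custom decode error handler: valid UTF-8 passes through untouched, and only the bytes the decoder rejects are reinterpreted. The handler name `cp1252_fallback` and the function `repair_utf8` are invented for this example, and treating every invalid byte as Windows-1252 is an assumption about where the files came from.

```python
import codecs


def _cp1252_fallback(err):
    """Decode bytes that are invalid UTF-8 as Windows-1252 instead."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    try:
        text = bad.decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) are undefined in cp1252: use U+FFFD.
        text = "\ufffd" * len(bad)
    return text, err.end  # resume decoding right after the bad bytes


# Registration is global for the process; the name is hypothetical.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_utf8(data: bytes) -> bytes:
    """Return data re-encoded as valid UTF-8."""
    return data.decode("utf-8", errors="cp1252_fallback").encode("utf-8")


print(repair_utf8(b"Bob\x92s"))   # b'Bob\xe2\x80\x99s'
print(repair_utf8(b"a\xa0b"))     # b'a\xc2\xa0b'
```

Because the UTF-8 decoder only calls the handler for sequences it cannot decode, a correctly encoded character that happens to contain 0x92 or 0xA0 as a continuation byte is left as it is.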

All special characters should be HTML-encoded; for example, a copyright symbol should appear in your HTML as

&copy;

HTML entity list:

http://www.w3schools.com/HTML/html_entities.asp

How you implement this depends largely on how you are generating the pages in the first place, but something like ASP.NET has server-side functions such as:

Server.HtmlEncode("string with special chars")
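For comparison, a similar effect can be sketched in Python with the standard library. This is not a claim about what the ASP.NET function emits; `to_ascii_html` is a made-up name, and the sketch assumes the input is a text fragment (not a whole HTML document, since markup characters get escaped too) that has already been decoded correctly.

```python
import html


def to_ascii_html(text: str) -> str:
    """Escape &, <, > and turn every non-ASCII character into &#NNNN;."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_html("Bob\u2019s \u00a9 <b>"))  # Bob&#8217;s &#169; &lt;b&gt;
```

The output is pure ASCII, so it renders the same whatever encoding the page is served with.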

Licensed under: CC-BY-SA with attribution
