Since you are loading an HTML file, you need to pre-parse the HTML and check whether its <head> tag contains a <meta> tag that specifies a charset. If it does, you must load the HTML using that charset, or else it will not decode to Unicode correctly.
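As a rough illustration of that pre-parse step, here is a minimal Python sketch. The function names `sniff_meta_charset` and `decode_html` and the 1024-byte scan window are my own assumptions, not part of any library. It also assumes the file uses an ASCII-compatible encoding, so the `<meta>` tag can be found in the raw bytes before the charset is known (this would not work for UTF-16, for example).

```python
import codecs
import re

# Matches both <meta charset="..."> and the older form
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, scan_limit=1024):
    """Return the charset named in a <meta> tag, or None if none is usable.

    Only the first `scan_limit` bytes are searched (an arbitrary choice for
    this sketch), since the declaration is expected near the top of <head>.
    """
    match = _META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return None
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return None
    return name


def decode_html(raw, fallback):
    """Decode raw HTML bytes with the declared charset, else with `fallback`."""
    charset = sniff_meta_charset(raw) or fallback
    # errors="replace" keeps the sketch from raising on stray bytes.
    return raw.decode(charset, errors="replace")
```

The caller still has to supply `fallback` explicitly, which keeps the decision about what to do with undeclared HTML in your hands rather than hiding it behind a silent UTF-8 assumption.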
If there is no charset specified in the HTML, you have to choose an appropriate charset, or ask the user. For instance, if you are downloading the HTML from a webserver, you can check whether a charset is specified in the HTTP Content-Type header, and if so, save that charset with (or even in) the HTML so you can use it later. Otherwise, HTML downloaded over HTTP has historically defaulted to ISO-8859-1 when nothing else is known.
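If you happen to be doing the download yourself, the header check might look roughly like the following Python sketch. The helper names `charset_from_content_type` and `charset_or_default` are hypothetical. The parsing itself relies on the standard library's `email.message.Message.get_content_charset()`, which understands the `type/subtype; charset=...` syntax that the HTTP header uses.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or None if the header is missing
    or carries no charset parameter.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def charset_or_default(header_value, default="iso-8859-1"):
    """Use the header's charset if present, else fall back to `default`.

    The ISO-8859-1 default mirrors the historical HTTP behaviour described
    above; pass a different default if you know better for your data.
    """
    return charset_from_content_type(header_value) or default
```

Whatever value this yields is worth storing next to the saved file, because the header is gone once the HTML is on disk.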
The only time you should ever load HTML as UTF-8 is if you know for a fact that the HTML is actually UTF-8 encoded. You cannot just blindly assume the HTML is UTF-8 encoded, unless you are the one who created the HTML in the first place.
From what you have described, it sounds like your HTML is not UTF-8. But it is hard to know for sure since you did not provide the file that you are trying to load.