Can't work with UTF-8 encoding

https://stackoverflow.com/questions/17590914

02-06-2022

Question

I load a text file with the code below. The file is encoded as UTF-8. (The code is taken from "How to read a text file that contains 'NULL CHARACTER' in Delphi?"):

uses
  IOUtils;

var
  s: string;
  ss: TStringStream;
begin
  s := TFile.ReadAllText('c:\MyFile.txt');
  s := StringReplace(s, #0, '', [rfReplaceAll]);  //Removes NULL CHARS
  ss := TStringStream.Create(s);

  try
    RichEdit1.Lines.LoadFromStream(ss, TEncoding.UTF8); //UTF8
  finally
    ss.Free;
  end;

end;

My problem is that RichEdit1 doesn't load the whole text. This is not because of the null characters; it is because of the encoding. When I run the application with the following code instead, it loads the whole text:

uses
  IOUtils;

var
  s: string;
  ss: TStringStream;
begin
  s := TFile.ReadAllText('c:\MyFile.txt');
  s := StringReplace(s, #0, '', [rfReplaceAll]);  //Removes NULL CHARS
  ss := TStringStream.Create(s);

  try
    RichEdit1.Lines.LoadFromStream(ss, TEncoding.Default);
  finally
    ss.Free;
  end;

end;

The only change is TEncoding.UTF8 to TEncoding.Default. Now the whole text loads, but the characters are wrong and the text is not readable.

My guess is that the file contains some characters that UTF-8 doesn't support, so loading stops when it reaches one of them.

Please help. Are there any workarounds?

EDIT:

I'm sure the file is UTF-8 and that it is plain text. It's an HTML source file. I'm sure it contains null characters, because I saw them in Notepad++. The value of RichEdit1.PlainText is True.

Solution

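You should pass the encoding to TFile.ReadAllText. Without an encoding argument, ReadAllText looks for a BOM and otherwise falls back to the default ANSI encoding, so a UTF-8 file without a BOM is decoded incorrectly before it ever reaches the RichEdit. Once the file has been decoded correctly you are working with Unicode strings only and don't have to bother with UTF-8 in the RichEdit.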

var
  s: string;
begin
  s := TFile.ReadAllText('c:\MyFile.txt', TEncoding.UTF8);
  // normally this shouldn't be necessary 
  s := StringReplace(s, #0, '', [rfReplaceAll]);  //Removes NULL CHARS
  RichEdit1.Lines.Text := s;

end;
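If it is not certain whether the file starts with a BOM, the same idea can be applied to the raw bytes. The following is only a sketch: `ReadTextBomOrUtf8` is a hypothetical helper name, and it assumes a Delphi version that has the three-argument overload of `TEncoding.GetBufferEncoding` (with a default encoding parameter). It decodes the file exactly once, using the BOM if there is one and UTF-8 otherwise.

```delphi
uses
  SysUtils, IOUtils;

// Hypothetical helper, not part of the RTL: decodes a file exactly once,
// honouring a BOM if present and assuming UTF-8 otherwise.
function ReadTextBomOrUtf8(const FileName: string): string;
var
  Raw: TBytes;
  Enc: TEncoding;
  BomLen: Integer;
begin
  Raw := TFile.ReadAllBytes(FileName);

  // Enc must be nil so that GetBufferEncoding detects the encoding from
  // the BOM; the third argument is the fallback when no BOM is found.
  Enc := nil;
  BomLen := TEncoding.GetBufferEncoding(Raw, Enc, TEncoding.UTF8);

  // Skip the BOM bytes and decode the rest into a Unicode string.
  Result := Enc.GetString(Raw, BomLen, Length(Raw) - BomLen);

  // Remove null characters after decoding, as in the code above.
  Result := StringReplace(Result, #0, '', [rfReplaceAll]);
end;
```

It could then be used as `RichEdit1.Lines.Text := ReadTextBomOrUtf8('c:\MyFile.txt');`, so the text is never re-encoded through a stream.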

Other answers

Since you are loading an HTML file, you need to pre-parse the HTML and check whether its <head> tag contains a <meta> tag that specifies a charset. If it does, you must load the HTML using that charset, or else it will not decode to Unicode correctly.

If there is no charset specified in the HTML, you have to choose an appropriate charset, or ask the user. For instance, if you are downloading the HTML from a webserver, you can check if a charset is specified in the HTTP Content-Type header, and if so then save that charset with (or even in) the HTML so you can use it later. Otherwise, the default charset for downloaded HTML is usually ISO-8859-1 unless known otherwise.

The only time you should ever load HTML as UTF-8 is if you know for a fact that the HTML is actually UTF-8 encoded. You cannot just blindly assume the HTML is UTF-8 encoded, unless you are the one who created the HTML in the first place.

From what you have described, it sounds like your HTML is not UTF-8. But it is hard to know for sure since you did not provide the file that you are trying to load.
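The charset check described in this answer could be sketched roughly as follows. This is an illustration, not a real HTML parser: `SniffMetaCharset` and `ReadHtmlWithDeclaredCharset` are hypothetical names, the search is a naive text scan of the first 1024 bytes, and it assumes a Delphi version that provides `TEncoding.GetEncoding` with an encoding-name argument. The fallback to code page 28591 (ISO-8859-1) follows the default mentioned above. Files stored as UTF-16 are not handled by this scan.

```delphi
uses
  SysUtils, Math, IOUtils;

// Hypothetical helper: returns the charset name found after 'charset=' near
// the start of the file, in lower case, or '' if none is found.
function SniffMetaCharset(const FileName: string): string;
var
  Raw: TBytes;
  Head: string;
  P, Q: Integer;
begin
  Result := '';
  Raw := TFile.ReadAllBytes(FileName);
  if Length(Raw) = 0 then
    Exit;

  // The markup itself is ASCII, so an ASCII decode is enough for the search.
  Head := LowerCase(TEncoding.ASCII.GetString(Raw, 0, Min(Length(Raw), 1024)));
  P := Pos('charset=', Head);
  if P = 0 then
    Exit;
  Inc(P, Length('charset='));

  // Skip an optional opening quote.
  if (P <= Length(Head)) and CharInSet(Head[P], ['"', '''']) then
    Inc(P);

  // The name ends at a quote, a space, or the end of the attribute or tag.
  Q := P;
  while (Q <= Length(Head)) and
        not CharInSet(Head[Q], ['"', '''', ' ', ';', '>', '/']) do
    Inc(Q);
  Result := Copy(Head, P, Q - P);
end;

// Hypothetical helper: loads the file with the declared charset, or with
// ISO-8859-1 (code page 28591) when nothing is declared.
function ReadHtmlWithDeclaredCharset(const FileName: string): string;
var
  CharsetName: string;
  Enc: TEncoding;
begin
  CharsetName := SniffMetaCharset(FileName);
  if CharsetName = '' then
    Enc := TEncoding.GetEncoding(28591)
  else
    // May raise an exception for names the RTL does not recognise.
    Enc := TEncoding.GetEncoding(CharsetName);
  try
    Result := TFile.ReadAllText(FileName, Enc);
  finally
    // Only free instances that are not the shared standard encodings.
    if not TEncoding.IsStandardEncoding(Enc) then
      Enc.Free;
  end;
end;
```

A charset taken from an HTTP Content-Type header, if one was saved with the file, should take precedence over this scan.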

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow