HttpGetText(), autodetect charset, and convert source to UTF8

https://stackoverflow.com/questions/5500293

14-11-2019
|

Question

I'm using HttpGetText with Synapse for Delphi 7 Professional to get the source of a web page - but feel free to recommend any component and code.

The goal is to save some time by 'unifying' non-ASCII characters to a single charset, so I can process it with the same Delphi code.

So I'm looking for something similar to "Select All and Convert To UTF without BOM in Notepad++", if you know what I mean. ANSI instead of UTF8 would also be okay.

Webpages are encoded in 3 charsets: UTF8, "ISO-8859-1=Win 1252=ANSI" and straight up the alley HTML4 without charset spec, ie. htmlencoded Å type characters in the content.

If I need to code a PHP page that does the conversion, that's fine too. Whatever is the least code / time.

Solution 2

I instead did the reverse conversion directly after retrieving the HTML using GpTextStream. Making the documents conform to ISO-8859-1 made them processable using straight up Delphi, which saved quite a bit of code changes. On output all the data was converted to UTF-8 :)

Here's some code. Perhaps not the prettiest solution but it certainly got the job done in less time. Note that this is for the reverse conversion.

procedure UTF8FileTo88591(fileName: string);
const bufsize=1024*1024;
var
fs1,fs2: TFileStream;
ts1,ts2: TGpTextStream;
buf:PChar;
siz:integer;
    procedure LG2(ss:string);
    begin
        //dont log for now.
    end;

begin
    fs1 := TFileStream.Create(fileName,fmOpenRead);
    fs2 := TFileStream.Create(fileName+'_ISO88591.txt',fmCreate);
    //compatible enough for my purposes with default 'Windows/Notepad' CP 1252 ANSI and Swe ANSI codepage, Latin1 etc.
    //also works for ASCII sources with htmlencoded accent chars, naturally
    try
      LG2('Files opened OK.');
      GetMem(buf,bufsize);
      ts1 := TGpTextStream.Create(fs1,tsaccRead,[],CP_UTF8);
      ts2 := TGpTextStream.Create(fs2,tsaccWrite,[],ISO_8859_1);
      try
        siz:=ts1.Read(buf^,bufsize);
        LG2(inttostr(siz)+' bytes read.');
        if siz>0 then ts2.Write(buf^,siz);
      finally
        LG2('Bytes read and written OK.');
      FreeAndNil(ts1);FreeAndNil(ts2);end;
    finally FreeAndNil(fs1);FreeAndNil(fs2);FreeMem(buf);
        LG2('Everything freed OK.');
    end;
end; // UTF8FileTo88591

OTHER TIPS

When you retreive a webpage, its Content-Type header (or sometimes a <meta> tag inside the HTML itself) tells you which charset is being used for the data. You would decode the data to Unicode using that charset, then you can encode the Unicode to whatever you need for your processing.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow