Question

I use TIdHTTP to fetch web content. The response header declares the charset as UTF-8. I want to print the content in a console that uses CP936 (Simplified Chinese), but the decoded text is not readable.

Result := TEncoding.Utf8.GetString(ResponseBuffer);

I do the same thing in Python (using httplib2) without any problems.

import httplib2

def python_try():
    conn = httplib2.Http()
    # request() returns (response headers, body bytes)
    response, content = conn.request(...)
    print(content.decode('utf8'))  # readable in console

UPDATE 1

I debugged the raw response and noticed that the content is gzipped.

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Content-Encoding: gzip
Vary: Accept-Encoding
Date: Mon, 24 Dec 2012 15:27:44 GMT
Connection: Keep-Alive

I tried assigning a TIdCompressorZLib instance to the TIdHTTP instance. Unfortunately, the application crashes while decompressing the gzipped content. The test address is "http://www.baidu.com" (encoding=gb2312).


UPDATE 2

I also tried downloading a gzipped jQuery script file, which contains only ASCII characters. This time it works, which suggests a problem in the Indy library. If I am not mistaken, I should close the question.


Solution

TIdHTTP handles the gzip decompression for you, if you have a TIdCompressorZLib component assigned to the TIdHTTP.Compressor property. Otherwise, you will have to decompress it manually (TIdHTTP will not send an Accept-Encoding header by default if the Compressor property is not assigned).
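For comparison, manual decompression can be sketched in Python, the language of the working example in the question. This is only an illustration of the idea, not Indy's implementation: `decode_body` is a hypothetical helper, and it assumes the HTTP library has already removed the chunked transfer encoding, so `raw` holds the complete gzip stream.

```python
import gzip


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Return the body bytes, un-gzipping them only when the header says so.

    `content_encoding` is the value of the Content-Encoding response header
    (hypothetical parameter name; pass "" when the header is absent).
    """
    if content_encoding.strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


# Simulated response: UTF-8 text compressed the way a server would send it.
original = "你好, world".encode("utf-8")
compressed = gzip.compress(original)

assert decode_body(compressed, "gzip") == original
assert decode_body(original) == original
```

Decompression has to happen before any charset decoding: the gzip stream is binary data, so decoding it as UTF-8 directly yields unreadable text.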

As for the UTF-8 encoding, TIdHTTP handles that for you as well, if you call the overloaded version of the TIdHTTP.Get() or TIdHTTP.Post() method that returns a String value instead of filling a TStream object. It will decode the UTF-8 to UTF-16 for you. To convert that to CP936, you can let the RTL do the conversion:

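type
  // AnsiString bound to code page 936; the RTL converts on assignment
  Cp936String = type AnsiString(936);
var
  S: Cp936String;
begin
  // Get() returns a UTF-16 String; the cast converts it to CP936
  S := Cp936String(IdHTTP1.Get(...));
end;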
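The same two-step conversion (UTF-8 bytes to Unicode text, then Unicode text to CP936 bytes) can be sketched in Python. The helper name `utf8_to_cp936` is hypothetical, and the choice of `errors="replace"` is an assumption: characters that CP936 cannot represent become `?` instead of raising an exception.

```python
def utf8_to_cp936(body: bytes) -> bytes:
    """Transcode a UTF-8 encoded body to CP936 (GBK) bytes."""
    text = body.decode("utf-8")  # bytes -> Unicode text
    # Unicode text -> CP936 bytes; unmappable characters become "?"
    return text.encode("cp936", errors="replace")


utf8_bytes = "你好".encode("utf-8")   # 6 bytes in UTF-8
cp936_bytes = utf8_to_cp936(utf8_bytes)

assert cp936_bytes == b"\xc4\xe3\xba\xc3"   # 4 bytes in CP936
assert cp936_bytes.decode("cp936") == "你好"
```

Going through Unicode text in the middle is the important part. Reinterpreting UTF-8 bytes as CP936 without decoding first is what produces unreadable console output.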

OTHER TIPS

Do not rely on automatic encoding detection; it cannot be done reliably. Simply trust the Content-Type header.

Result := TEncoding.Utf8.GetString(ResponseBuffer);

If the Content-Type header is missing or lying, then you need to detect encoding. Although I would not use any algorithm that would misdetect UTF-8 as CP936...
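Trusting the header can be sketched as follows, again in Python. `charset_from_content_type` is a hypothetical helper, and falling back to UTF-8 when no charset parameter is present is an assumption made for this example, not a rule taken from the answers above.

```python
def charset_from_content_type(value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    # Skip the media type itself and look at the ";"-separated parameters.
    for part in value.split(";")[1:]:
        name, _, val = part.partition("=")
        val = val.strip().strip('"')
        if name.strip().lower() == "charset" and val:
            return val.lower()
    return default  # assumed fallback when the header names no charset


# Header value taken from the raw response shown in the question.
charset = charset_from_content_type("text/html;charset=UTF-8")
assert charset == "utf-8"

body = "你好".encode("utf-8")
assert body.decode(charset) == "你好"
```

Detection would only enter the picture where this function returns the default, or where decoding with the declared charset fails.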

Licensed under: CC-BY-SA with attribution