Question

Based on this question: How can I get HTML source code from TWebBrowser

If I run this code on an HTML page that uses a Unicode code page, the result is gibberish because TStringStream is not Unicode-aware in D7. The page might be UTF-8 encoded or encoded in some other (ANSI) code page.

How can I detect if a TStream/IPersistStreamInit is Unicode/UTF8/Ansi?

How do I always return the correct result as a WideString from this function?

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;

If I replace TStringStream with TMemoryStream and save the TMemoryStream to a file, the file is correct, whether the page is Unicode, UTF-8 or ANSI. But I always want to return the stream contents as a WideString:

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
var
  // LStream: TStringStream;
  LStream: TMemoryStream;
  Stream : IStream;
  LPersistStreamInit : IPersistStreamInit;
begin
  if not Assigned(WebBrowser.Document) then exit;
  // LStream := TStringStream.Create('');
  LStream := TMemoryStream.Create;
  try
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream,soReference);
    LPersistStreamInit.Save(Stream,true);
    // result := LStream.DataString;
    LStream.SaveToFile('c:\test\test.txt'); // test only - file is ok
    Result := ??? // WideString
  finally
    LStream.Free();
  end;
end;

EDIT: I found this article - How to load and save documents in TWebBrowser in a Delphi-like way

It does exactly what I need, but it works correctly only with the Unicode Delphi compilers (D2009+). See its Conclusion section:

There is obviously a lot more we could do. A couple of things immediately spring to mind. We retro-fit some of the Unicode functionality and support for non-ANSI encodings to the pre-Unicode compiler code. The present code when compiled with anything earlier than Delphi 2009 will not save document content to strings correctly if the document character set is not ANSI.

The magic is obviously in the TEncoding class (TEncoding.GetBufferEncoding), but D7 does not have TEncoding. Any ideas?


Solution

I used GpTextStream to handle the conversion (this should work in all Delphi versions):

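// Maps an HTML charset name to a Windows code page; returns 0 if unknown.
function GetCodePageFromHTMLCharSet(Charset: WideString): Word;
const
  WIN_CHARSET = 'windows-';
  ISO_CHARSET = 'iso-';
var
  S: string;
  Part: Integer;
begin
  Result := 0;
  if Charset = 'unicode' then
    Result := CP_UNICODE else
  if Charset = 'utf-8' then
    Result := CP_UTF8 else
  if Pos(WIN_CHARSET, Charset) <> 0 then // e.g. windows-1252 => 1252
  begin
    S := Copy(Charset, Length(WIN_CHARSET) + 1, Maxint);
    Result := StrToIntDef(S, 0);
  end else
  if Pos(ISO_CHARSET, Charset) <> 0 then // ISO-8859 (e.g. iso-8859-1 => 28591)
  begin
    S := Copy(Charset, Length(ISO_CHARSET) + 1, Maxint); // e.g. '8859-1'
    S := Copy(S, Pos('-', S) + 1, 2);                    // part number, e.g. '1'
    Part := StrToIntDef(S, 0);
    // Windows code pages 28591..28599, 28603 and 28605 cover
    // ISO-8859-1..9, ISO-8859-13 and ISO-8859-15 (Latin 9).
    // Any other part number leaves Result at 0 instead of overflowing Word.
    if ((Part >= 1) and (Part <= 9)) or (Part = 13) or (Part = 15) then
      Result := 28590 + Part;
  end;
end;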

function GetWebBrowserHTML(WebBrowser: TWebBrowser): WideString;
var
  LStream: TMemoryStream;
  Stream: IStream;
  LPersistStreamInit: IPersistStreamInit;
  TextStream: TGpTextStream;
  Charset: WideString;
  Buf: WideString;
  CodePage: Word;
  N: Integer;
begin
  Result := '';
  if not Assigned(WebBrowser.Document) then Exit;
  LStream := TMemoryStream.Create;
  try
    // Save the document's raw bytes, in its own encoding, to the memory stream.
    LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
    Stream := TStreamAdapter.Create(LStream, soReference);
    if Failed(LPersistStreamInit.Save(Stream, True)) then Exit;
    // Ask the document for its charset and map it to a code page.
    Charset := (WebBrowser.Document as IHTMLDocument2).charset;
    CodePage := GetCodePageFromHTMLCharSet(Charset);
    N := LStream.Size;
    if N = 0 then Exit; // nothing was saved; Buf[1] below would be invalid
    // One WideChar per source byte is always enough room for the decoded text.
    SetLength(Buf, N);
    TextStream := TGpTextStream.Create(LStream, tsaccRead, [], CodePage);
    try
      // Read returns a byte count; convert it to a WideChar count.
      N := TextStream.Read(Buf[1], N * SizeOf(WideChar)) div SizeOf(WideChar);
      SetLength(Buf, N);
      Result := Buf;
    finally
      TextStream.Free;
    end;
  finally
    LStream.Free();
  end;
end;
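
The question also asks how the encoding could be detected without TEncoding. As a rough sketch only, and not part of the answer above, the program below checks the saved bytes for a byte-order mark and converts them with the Win32 MultiByteToWideChar call instead of GpTextStream. The names DetectBOM, BufferToWideString, MemoryStreamToWideString and the two code page constants are made up for this illustration. It assumes a Windows target, handles only the UTF-8 and UTF-16 little-endian marks, and falls back to a caller-supplied code page when no mark is found.

```pascal
program BomSniffDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Classes;

const
  // Hypothetical names; the values are the Windows code page identifiers.
  UTF8_CODEPAGE    = 65001;
  UTF16LE_CODEPAGE = 1200;

// Returns the code page implied by a leading byte-order mark, or 0 if there
// is none. BOMSize receives the number of bytes the mark occupies.
// UTF-16 big-endian (FE FF) is deliberately not handled in this sketch.
function DetectBOM(Data: Pointer; Size: Integer; out BOMSize: Integer): Word;
var
  B: PByteArray;
begin
  Result := 0;
  BOMSize := 0;
  B := PByteArray(Data);
  if (Size >= 3) and (B^[0] = $EF) and (B^[1] = $BB) and (B^[2] = $BF) then
  begin
    Result := UTF8_CODEPAGE;
    BOMSize := 3;
  end
  else if (Size >= 2) and (B^[0] = $FF) and (B^[1] = $FE) then
  begin
    Result := UTF16LE_CODEPAGE;
    BOMSize := 2;
  end;
end;

// Converts Size bytes at Data from CodePage to a WideString.
// CodePage 0 means CP_ACP, the system ANSI code page.
function BufferToWideString(Data: Pointer; Size: Integer; CodePage: Word): WideString;
var
  Len: Integer;
begin
  Result := '';
  if (Data = nil) or (Size <= 0) then Exit;
  if CodePage = UTF16LE_CODEPAGE then
  begin
    // Already UTF-16: copy whole code units straight into the result.
    SetLength(Result, Size div SizeOf(WideChar));
    if Length(Result) > 0 then
      Move(Data^, Result[1], Length(Result) * SizeOf(WideChar));
  end
  else
  begin
    // First call asks for the required length, second call converts.
    Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(Data), Size, nil, 0);
    if Len <= 0 then Exit;
    SetLength(Result, Len);
    MultiByteToWideChar(CodePage, 0, PAnsiChar(Data), Size, PWideChar(Result), Len);
  end;
end;

// Prefers the byte-order mark; uses FallbackCodePage when there is none.
function MemoryStreamToWideString(Stream: TMemoryStream; FallbackCodePage: Word): WideString;
var
  CodePage: Word;
  Size, BOMSize: Integer;
begin
  Size := Integer(Stream.Size);
  CodePage := DetectBOM(Stream.Memory, Size, BOMSize);
  if CodePage = 0 then
    CodePage := FallbackCodePage;
  Result := BufferToWideString(PAnsiChar(Stream.Memory) + BOMSize, Size - BOMSize, CodePage);
end;

const
  // UTF-8 BOM, then 'caf', then U+00E9 encoded as C3 A9.
  Sample: array[0..7] of Byte = ($EF, $BB, $BF, $63, $61, $66, $C3, $A9);
var
  MS: TMemoryStream;
  W: WideString;
begin
  MS := TMemoryStream.Create;
  try
    MS.WriteBuffer(Sample, SizeOf(Sample));
    W := MemoryStreamToWideString(MS, 0);
    Writeln(Length(W));  // 4
    Writeln(Ord(W[4]));  // 233, i.e. U+00E9
  finally
    MS.Free;
  end;
end.
```

In GetWebBrowserHTML, a call such as MemoryStreamToWideString(LStream, CodePage) could in principle stand in for the TGpTextStream read, with CodePage still coming from GetCodePageFromHTMLCharSet. Whether the saved stream actually starts with a byte-order mark depends on the document, so the charset lookup remains the primary source of the encoding.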
Licensed under: CC-BY-SA with attribution