Question

What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a frienly WideString "Fürst"?

I am aware of the fact that this is not always possible for all combinations, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Characters unit, but I don't get it.

Was it helpful?

Solution

I think you need to perform Unicode Normalization. on your string.

I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:

NormalizationKC

Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature fi becomes f + i; similarly, A + ¨ + fi + n becomes Ä + f + i + n.

OTHER TIPS

Here is the complete code that solved my problem:

function Unescape(const s: AnsiString): string;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1;
  j := 1;
  while i <= Length(s) do begin
    if s[i] = '\' then begin
      if i < Length(s) then begin
        // escaped backslash?
        if s[i + 1] = '\' then begin
          Result[j] := '\';
          inc(i, 2);
        end
        // convert hex number to WideChar
        else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s)) 
                and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then begin
          inc(i, 6);
          Result[j] := WideChar(c);
        end else begin
          raise Exception.CreateFmt('Invalid code at position %d', [i]);
        end;
      end else begin
        raise Exception.Create('Unexpected end of string');
      end;
    end else begin
      Result[j] := WideChar(s[i]);
      inc(i);
    end;
    inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j - 1);
end;

const
  NormalizationC = 1;

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
 lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

function Normalize(const s: string): string;
var
  newLength: integer;
begin
  // in NormalizationC mode the result string won't grow longer than the input string
  SetLength(Result, Length(s));
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
  SetLength(Result, newLength);
end;

function UnescapeAndNormalize(const s: AnsiString): string;
begin
  Result := Normalize(Unescape(s));
end;

Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)

Are they always escaped like this? Always in a number of 4 digits?

How is the \ character itself escaped?

Assuming the \character is escaped by \xxxx where xxxx is the code for the \ character, you can easily loop through the string:

function Unescape(s: AnsiString): WideString;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1; j := 1;
  while i <= Length(s) do
  begin
     // If a '\' is found, typecast the following 4 digit integer to widechar
     if s[i] = '\' then
     begin
       if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
         raise Exception.CreateFmt('Invalid code at position %d', [i]);

       Inc(i, 6);
       Result[j] := WideChar(c);
     end
     else
     begin
       Result[j] := WideChar(s[i]);
       Inc(i);
     end;
     Inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j-1);
end;

Use like this

  MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);

This code is tested in Delphi 2007, but should work in XE as well due to the explicit use of Ansistring and Widestring.

[edit] Code is ok. Highlighter fails.

If I'm not mistaken, Delphi XE now supports regular expressions. I don't use them that often, though, but it seems a good way to parse the string and then replace all escaped values. Maybe someone has a good example of how to do this in Delphi with regular expressions?

GolezTrol, you forget '$'

if (s[i+1] <> 'u') or not TryStrToInt('$'+Copy(s, i+2, 4), c) then
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top