I can reproduce the bug in Delphi XE4. I get the correct behavior in Delphi XE5.
The bug was in TPerlRegEx.ComputeReplacement. The code that I contributed to Embarcadero for inclusion with Delphi XE3 used UTF8String. With Delphi XE4 Embarcadero eliminated UTF8String from the RegularExpressionsCore unit and replaced it with TBytes. The developer that made this change seems to have missed a crucial difference between strings and dynamic arrays in Delphi. Strings use a copy-on-write mechanism, while dynamic arrays do not.
So in my original code, TPerlRegEx.ComputeReplacement could do S := FReplacement and then modify the temporary variable S to substitute backreferences without affecting the FReplacement field because both were strings. In the modified code, S := FReplacement makes S point to the same array as FReplacement and when backreferences in S are substituted, FReplacement is also modified. Hence the first replacement is made correctly, while following replacements are wrong because FReplacement was crippled.
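The difference described above can be seen in isolation with a small console program. This is only a sketch, not the code from RegularExpressionsCore: the variable names are made up for illustration, it uses the ordinary string type rather than UTF8String, and it assumes a desktop Delphi compiler (XE2 or later) where strings are indexed from 1.

```delphi
program CopyOnWriteDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  // Hypothetical names, chosen only for this illustration.
  OriginalStr, TempStr: string;
  OriginalBytes, TempBytes: TBytes;
begin
  // Strings: the assignment shares the data, but the first write to
  // TempStr makes a private copy, so OriginalStr is left untouched.
  OriginalStr := 'abc';
  TempStr := OriginalStr;
  TempStr[1] := 'X';
  Writeln(OriginalStr);       // prints abc
  Writeln(TempStr);           // prints Xbc

  // Dynamic arrays: the assignment copies only the reference, so a
  // write through TempBytes is also visible through OriginalBytes.
  SetLength(OriginalBytes, 3);
  OriginalBytes[0] := 1;
  OriginalBytes[1] := 2;
  OriginalBytes[2] := 3;
  TempBytes := OriginalBytes;
  TempBytes[0] := 99;
  Writeln(OriginalBytes[0]);  // prints 99
end.
```

The last line is the XE4 bug in miniature: the "temporary" array and the field are the same array, so editing one edits the other.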
In Delphi XE5 this was fixed by replacing S := FReplacement with this to make a proper temporary copy:
SetLength(S, Length(FReplacement));
Move(FReplacement[0], S[0], Length(FReplacement));
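The same copy pattern can be wrapped in a small helper to try it outside the unit. This is a sketch under assumptions: CloneBytes is a hypothetical name, not something in RegularExpressionsCore, and the guard for the empty array is my addition rather than part of the XE5 fix.

```delphi
program CopyBytesDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of RegularExpressionsCore): returns an
// independent copy of Source using the SetLength + Move pattern.
function CloneBytes(const Source: TBytes): TBytes;
begin
  SetLength(Result, Length(Source));
  // Guard the Move: indexing element 0 of an empty array raises a
  // range-check error when range checking is enabled.
  if Length(Source) > 0 then
    Move(Source[0], Result[0], Length(Source));
end;

var
  Replacement, Temp: TBytes;
begin
  SetLength(Replacement, 3);
  Replacement[0] := 1;
  Replacement[1] := 2;
  Replacement[2] := 3;

  Temp := CloneBytes(Replacement);
  Temp[0] := 99;              // modifies the copy only

  Writeln(Replacement[0]);    // prints 1
  Writeln(Temp[0]);           // prints 99
end.
```

As far as I know, calling the built-in Copy with only the dynamic array as argument, as in Temp := Copy(Replacement), also returns a full independent copy and would be a shorter way to get the same result.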
When Delphi 2009 was released there was a lot of talk from Embarcadero that one shouldn't use string types to represent sequences of bytes. It seems they are now making the opposite mistake of using TBytes to represent strings.
The solution to this whole mess, which I have previously recommended to Embarcadero, is to switch to the new pcre16 functions, which use UTF-16LE just like Delphi strings. These functions did not exist when Delphi XE was released, but they do now, and they should be used.