Question

I'd like to find two non-identical Unicode words separated by a colon using a PCRE regex.

Take, for example, this string:

Lôrem:ipsüm dõlör:sït amêt:amêt cønsectetûr:cønsectetûr âdipiscing:elït

I can easily find identical words separated by a colon using:

(\p{L}+):(\1)

which will match: cønsectetûr:cønsectetûr and amêt:amêt

However, I want to negate the backreference to find only non-identical Unicode words separated by a colon.

What's the proper way to negate a backreference in PCRE?

(\p{L}+):(^\1) obviously does not work.


Solution

You start by using a negative lookahead to prevent a match if the captured part repeats after the colon:

(\p{L}+):(?!\1)

Because a lookahead consumes no characters, you then need to match the second Unicode word with another \p{L}+:

(\p{L}+):(?!\1)\p{L}+
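To see what this intermediate pattern does on the sample string, here is a minimal Python sketch. It rests on some assumptions: Python's built-in re module does not support \p{L}, so the character class [^\W\d_] is used as an approximate stand-in for "any Unicode letter", and the names LETTERS and no_boundaries are made up for illustration. The input is normalized to NFC so that each accented letter is a single code point.

```python
import re
import unicodedata

# Assumption: [^\W\d_] approximates \p{L}, which Python's re module lacks.
LETTERS = r"[^\W\d_]+"

# Intermediate pattern: negative lookahead on the backreference, no boundaries yet.
no_boundaries = re.compile(rf"({LETTERS}):(?!\1){LETTERS}")

text = unicodedata.normalize(
    "NFC",
    "Lôrem:ipsüm dõlör:sït amêt:amêt cønsectetûr:cønsectetûr âdipiscing:elït",
)

partial_matches = [m.group(0) for m in no_boundaries.finditer(text)]
print(partial_matches)
# The identical pairs still show up, minus their first letter:
# ['Lôrem:ipsüm', 'dõlör:sït', 'mêt:amêt', 'ønsectetûr:cønsectetûr', 'âdipiscing:elït']
```

The lookahead rejects amêt:amêt when the match starts at the a, but the engine then retries one character later, where the captured group is only mêt and the text after the colon no longer begins with it.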

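And last, to prevent false matches, use word boundaries. Without the leading \b, the engine can start partway through the first word and match mêt:amêt. With only the outer boundaries, a plain (?!\1) would also reject a pair such as amêt:amêts even though the two words differ, which is why the lookahead gets its own \b. (In PCRE, \b is only Unicode-aware in UCP mode, which can be enabled with (*UCP) at the start of the pattern.) The pattern becomes: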

\b(\p{L}+):(?!\1\b)\p{L}+\b
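As a quick check of the final pattern outside PCRE, the following Python sketch applies the same structure to the sample string. The assumptions: Python's re module has no \p{L}, so [^\W\d_] stands in for it as an approximation, and the helper find_different_pairs is a hypothetical name used only here. Python's \b is Unicode-aware for str patterns, so no extra flag is needed.

```python
import re
import unicodedata

# Assumption: [^\W\d_] approximates \p{L}, which Python's re module lacks.
LETTERS = r"[^\W\d_]+"

# Same structure as the final PCRE pattern: boundaries outside, and one
# inside the lookahead so that only a whole repeated word is rejected.
different_words = re.compile(rf"\b({LETTERS}):(?!\1\b){LETTERS}\b")


def find_different_pairs(text):
    """Return every word:word pair whose two words are not identical."""
    # Normalize so each accented letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    return [m.group(0) for m in different_words.finditer(text)]


sample = "Lôrem:ipsüm dõlör:sït amêt:amêt cønsectetûr:cønsectetûr âdipiscing:elït"
print(find_different_pairs(sample))
# ['Lôrem:ipsüm', 'dõlör:sït', 'âdipiscing:elït']

# A second word that merely starts with the first one is still a match.
print(find_different_pairs("amêt:amêts"))
# ['amêt:amêts']
```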


Licensed under: CC-BY-SA with attribution