Question

I am trying in vain to write a regex that returns a match only if a given substring, /oem/en (followed by a forward slash or a double quote), is found in a <target> XML node of single-line XLIFF code. Here is what I have so far, but it still matches occurrences in <source> nodes:

/oem/en(?=/|\")(?=.*?</target>)

XLIFF code sample with <source> and <target> nodes:

<?xml version="1.0" encoding="utf-8"?><xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"><file product-version="1.4" original="" source-language="default" target-language="default" datatype="plaintext" category="page_content"><header></header><body><trans-unit id="page_description" datatype="x-System.String" maxwidth="0" minwidth="0"><source><![CDATA[bla bla]]></source><target><![CDATA[translated bla bla]]></target></trans-unit><trans-unit id="f55c4d88-1f2e-4ad9-aaa8-819af4ee7ee8" datatype="x-System.String" resname="PublishingPageContent" maxwidth="0" minwidth="0"><source><![CDATA[<a href="/oem/en/link1">bla bla</a>]]></source><target><![CDATA[<a href="/oem/en/link1/>translated bla bla</a>]]></target></trans-unit><trans-unit id="f55c4d88-1f2e-4ad9-aaa8-819af4ee7ee8" datatype="x-System.String" resname="PublishingPageContent" maxwidth="0" minwidth="0"><source><![CDATA[<a href="/oem/en/link2">bla bla</a>]]></source><target><![CDATA[<a href="/oem/en/link2/>translated bla bla</a>]]></target></trans-unit></body></file></xliff>

My approach was to craft an expression that looks ahead until it reaches either a </source> or a </target>. If it finds the former first, we are in a <source> node and it should not be a match.

Your help on this is greatly appreciated!


Solution

Description

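This expression matches /oem/en only when it appears inside a <target> element. Group 1 captures everything from <target> up to the match without crossing </target>, and group 2 captures /oem/en when it is followed by a forward slash or a double quote.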

regex: (<target>(?:(?!<\/target>).)*?)(\/oem\/en(?=\/|\"))

replace with: $1~~~~~new value~~~~~
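To make the structure of that pattern easier to follow, here is a minimal sketch of the same expression spread over several lines with RegexOptions.IgnorePatternWhitespace. The class name AnnotatedPattern and the short sample string are made up for illustration; [/""] is just another way of writing the \/|\" alternation.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnnotatedPattern
{
    // Same logic as the one-line pattern, with inline comments (hypothetical helper class).
    public static readonly Regex TargetOem = new Regex(@"
        (                          # group 1: text from <target> up to the match
          <target>
          (?: (?!</target>) . )*?  # any character, unless </target> starts here
        )
        ( /oem/en (?=[/""]) )      # group 2: /oem/en followed by / or a double quote
        ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    static void Main()
    {
        // Shortened, made-up sample: the <source> occurrence must be skipped.
        string sample = "<source>/oem/en/a</source><target>/oem/en/a</target>";
        Match m = TargetOem.Match(sample);
        Console.WriteLine(m.Success + " at index " + m.Groups[2].Index);
    }
}
```

The (?!</target>) check in front of every consumed character is what keeps group 1 from running past the end of the current target element.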


C# code example

Input Text

<?xml version="1.0" encoding="utf-8"?><xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"><file product-version="1.4" original="" source-language="default" target-language="default" datatype="plaintext" category="page_content"><header></header><body><trans-unit id="page_description" datatype="x-System.String" maxwidth="0" minwidth="0"><source><![CDATA[bla bla]]></source><target><![CDATA[translated bla bla]]></target></trans-unit><trans-unit id="f55c4d88-1f2e-4ad9-aaa8-819af4ee7ee8" datatype="x-System.String" resname="PublishingPageContent" maxwidth="0" minwidth="0"><source><![CDATA[<a href="/oem/en/link1">bla bla</a>]]></source><target><![CDATA[<a href="/oem/en/link1/>translated bla bla</a>]]></target></trans-unit><trans-unit id="f55c4d88-1f2e-4ad9-aaa8-819af4ee7ee8" datatype="x-System.String" resname="PublishingPageContent" maxwidth="0" minwidth="0"><source><![CDATA[<a href="/oem/en/link2">bla bla</a>]]></source><target><![CDATA[<a href="/oem/en/link2/>translated bla bla</a>]]></target></trans-unit></body></file></xliff>

Code

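using System;
using System.Text.RegularExpressions;

namespace myapp
{
    class Class1
    {
        static void Main(string[] args)
        {
            // Placeholder: replace with the Input Text shown above.
            string sourcestring = "source string to match with pattern";
            // Group 1: from <target> up to the match, never crossing </target>.
            // Group 2: /oem/en followed by / or a double quote.
            string matchpattern = @"(<target>(?:(?!<\/target>).)*?)(\/oem\/en(?=\/|\""))";
            // $1 puts back the text captured before /oem/en.
            string replacementpattern = @"$1~~~~~new value~~~~~";
            Console.WriteLine(Regex.Replace(sourcestring, matchpattern, replacementpattern, RegexOptions.IgnoreCase));
        }
    }
}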

Yields

<?xml version="1.0" encoding="utf-8"?><xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"><file product-version="1.4" original="" source-language="default" target-language="default" datatype="plaintext" category="page_content"><header></header><body><trans-unit id="page_description" datatype="x-System.String" maxwidth="0" minwidth="0"><source><![CDATA[bla bla]]></source><target><![CDATA[translated bla bla]]></target></trans-unit><trans-unit id="f55c4d88-1f2e-4ad9-aaa8-819af4ee7ee8" datatype="x-System.String" resname="PublishingPageContent" maxwidth="0" minwidth="0"><source><![CDATA[<a href="/oem/en/link1">bla bla</a>]]></source><target><![CDATA[<a href="~~~~~new value~~~~~/link1/>translated bla bla</a>]]></target></trans-unit><trans-unit id="f55c4d88-1f2e-4ad9-aaa8-819af4ee7ee8" datatype="x-System.String" resname="PublishingPageContent" maxwidth="0" minwidth="0"><source><![CDATA[<a href="/oem/en/link2">bla bla</a>]]></source><target><![CDATA[<a href="~~~~~new value~~~~~/link2/>translated bla bla</a>]]></target></trans-unit></body></file></xliff>
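One thing to watch with the pattern above: every match has to start at <target>, so Regex.Replace rewrites only the first /oem/en in each target element. The sample has one link per target, so it does not show. If a target can hold several links, a two-step replace is one possible way around it. This is only a sketch: the class and method names (TargetRewriter, ReplaceInTargets) are made up, and it assumes <target> has no attributes, as in the sample.

```csharp
using System;
using System.Text.RegularExpressions;

static class TargetRewriter
{
    // Hypothetical helper: rewrites every /oem/en (followed by / or ") inside each <target> element.
    public static string ReplaceInTargets(string xliff, string newValue)
    {
        // Step 1: isolate each <target>...</target> element (lazy, stops at the first closing tag).
        return Regex.Replace(
            xliff,
            @"<target>.*?</target>",
            // Step 2: within that element only, replace every qualifying occurrence.
            // A MatchEvaluator is used so that newValue is inserted literally.
            m => Regex.Replace(m.Value, @"/oem/en(?=[/""])", _ => newValue, RegexOptions.IgnoreCase),
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    static void Main()
    {
        // Made-up sample with two links in one target.
        string sample = "<source>/oem/en/a</source><target>/oem/en/a and /oem/en/b</target>";
        Console.WriteLine(ReplaceInTargets(sample, "~~~~~new value~~~~~"));
    }
}
```

Splitting the work this way keeps both expressions simple: the outer one only decides where a target element starts and ends, the inner one only decides which substrings to rewrite.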

OTHER TIPS

I don't think this would be reliable without requiring the CDATA section.

Below is the regex. It finds the <target> tag followed by the CDATA opener, then consumes any character that is not followed by the CDATA close ]], so /oem/en is matched only if it occurs before the ]]. C# supports the negative lookahead and the non-capturing group this relies on; the negative lookahead is essential.

<target><!\[CDATA\[(?:.(?!\]\]))*(/oem/en)

If you need to accommodate <target> having attributes, you can use something like <target[^>]*>. If there may be whitespace between the <target> tag and the CDATA section, add \s* between them.
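Here is a rough C# sketch of the CDATA-anchored pattern with both tweaks applied (optional attributes on <target>, optional whitespace written as \s* before the CDATA section). The class name CdataTargetFinder, the state="new" attribute and the sample string are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CdataTargetFinder
{
    // <target ...>, optional whitespace, the CDATA opener, then characters
    // not followed by ]], then the captured /oem/en (group 1).
    public static readonly Regex Pattern = new Regex(
        @"<target[^>]*>\s*<!\[CDATA\[(?:.(?!\]\]))*(/oem/en)",
        RegexOptions.Singleline);

    static void Main()
    {
        // Made-up sample: same link in <source> and <target>.
        string sample =
            "<source><![CDATA[<a href=\"/oem/en/x\">]]></source>" +
            "<target state=\"new\"> <![CDATA[<a href=\"/oem/en/x\">]]></target>";
        foreach (Match m in Pattern.Matches(sample))
            Console.WriteLine("found at index " + m.Groups[1].Index);
    }
}
```

Because the loop before the group is greedy, group 1 ends up on the last /oem/en in each CDATA section, so this form is better suited to testing whether a target contains the string than to rewriting every occurrence.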

I have a regex editor that's very alpha, but you can test it on: http://rey.gimenez.biz/

Licensed under: CC-BY-SA with attribution