Question

I'm trying to parse the following XML from a file using PowerShell, without loading it as an XML document via [xml], because the document contains errors.

<data>
  <company>Walter & Cooper</company>
  <contact_name>Patrick O'Brian</contact_name>
</data>

To load the document successfully, I need to fix the errors by replacing special characters as follows:

& with &amp;
< with &lt;
' with &apos; etc.

I know I could do something like this to find and replace characters in a document:

(Get-Content $fileName) | ForEach-Object {
  $_ -replace '&', '&amp;' `
     -replace "'", '&apos;' `
     -replace '"', '&quot;'
} | Set-Content $fileName

But this will replace characters everywhere in the file. I'm only interested in checking for characters inside XML elements like <company> and replacing them with XML-safe entities, so that the resulting text is a valid document which I can load using [xml].


Solution

Something like this should work for each character you need to replace:

(Get-Content $fileName) | ForEach-Object {
  $_ -replace '(?<=\W)(&)(?=.*<\/.*>)', '&amp;' `
     -replace '(?<=\W)('')(?=.*<\/.*>)', '&apos;' `
     -replace '(?<=\W)(")(?=.*<\/.*>)', '&quot;' `
     -replace '(?<=\W)(>)(?=.*<\/.*>)', '&gt;' `
     -replace '(?<=\W)(\*)(?=.*<\/.*>)', '&lowast;'
} | Set-Content $fileName

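Each pattern uses a positive look-behind for a non-word character, then a capturing group for the character to replace, followed by a positive look-ahead that requires a closing tag later on the same line. Because of the look-behind, a character directly preceded by a letter or digit (such as the apostrophe in O'Brian) is not matched. Also note that &lowast; is an HTML entity rather than one of XML's five predefined entities, so [xml] will reject it unless a DTD declares it.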

examples:

updated: http://regex101.com/r/aY8iV3 | original: http://regex101.com/r/yO7wB1
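
If the element names are not known in advance, one alternative to pure look-arounds is a match evaluator that captures each element and escapes only its text. This is only a sketch under some assumptions: Repair-XmlText is a hypothetical helper name, and each leaf element is assumed to sit on a single line, with no attributes and no child elements.

```powershell
# Hypothetical helper: escapes XML special characters inside the text of
# single-line leaf elements such as <company>...</company>.
function Repair-XmlText {
    param([string]$Text)

    # '.' does not match a newline by default, so container elements that
    # span several lines (like <data>) are skipped by this pattern.
    $pattern = '<(?<tag>[\w.-]+)>(?<body>.*?)</\k<tag>>'

    [regex]::Replace($Text, $pattern, [System.Text.RegularExpressions.MatchEvaluator]{
        param($m)
        $tag  = $m.Groups['tag'].Value
        $body = $m.Groups['body'].Value

        # Escape '&' first, leaving references that already look valid alone.
        $body = $body -creplace '&(?!(?:amp|lt|gt|apos|quot|#\d+|#x[0-9a-fA-F]+);)', '&amp;'
        $body = $body -replace '<', '&lt;' -replace '>', '&gt;'
        $body = $body -replace "'", '&apos;' -replace '"', '&quot;'

        "<$tag>$body</$tag>"
    })
}

$raw = @'
<data>
  <company>Walter & Cooper</company>
  <contact_name>Patrick O'Brian</contact_name>
</data>
'@

$doc = [xml](Repair-XmlText $raw)
$doc.data.company
```

With a file, the same helper could be applied to (Get-Content $fileName -Raw) before casting to [xml]. Elements that carry attributes or nested children would need a different pattern.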

OTHER TIPS

A little bit of regex look-behind and look-ahead should do the trick:

$str = @'
<data>
  <company>Walter & Cooper & Brannigan</company>
  <contact_name>Patrick & O'Brian</contact_name>
</data>
'@

$str -replace '(?is)(?<=<company>.*?)&(?=.*?</company>)', '&amp;'
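
The pattern above only covers <company>. As a rough sketch, the same idea can be repeated for a list of element names; the $tags list here is an assumption about which elements need repair, and [^<]* is used instead of .*? so that each look-around stays inside a single element.

```powershell
$str = @'
<data>
  <company>Walter & Cooper & Brannigan</company>
  <contact_name>Patrick & O'Brian</contact_name>
</data>
'@

# Assumed list of element names whose text may contain a bare '&'.
$tags = 'company', 'contact_name'

foreach ($tag in $tags) {
    $t = [regex]::Escape($tag)
    # Match '&' between <tag> and </tag>, skipping text that already
    # looks like an entity or numeric character reference.
    $str = $str -replace "(?<=<$t>[^<]*)&(?![a-z]+;|#\d+;)(?=[^<]*</$t>)", '&amp;'
}

# The repaired text should now load as XML.
$doc = [xml]$str
$doc.data.company
```

Variable-length look-behind as used here is supported by the .NET regex engine that PowerShell relies on, but not by every regex flavor.
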
Licensed under: CC-BY-SA with attribution