MS Word splits words in its XML format

https://stackoverflow.com/questions/1729969

20-09-2019

Question

I have a Word 2003 document saved as XML in WordprocessingML format. It contains a few placeholders that will be replaced dynamically with the appropriate content. The problem is that Word, seemingly at random, splits them into separate pieces. For example, instead of this:

<w:t>${dl.d.out.ecs_rev}</w:t>

I have this:

...
<w:t>${</w:t>
 </w:r>
 <w:r wsp:rsidR="005D11C0">
  <w:rPr>
   <w:sz w:val="20" />
   <w:sz-cs w:val="20" />
  </w:rPr>
  <w:t>dl.</w:t>
 </w:r>
<w:r wsp:rsidRPr="00696324">
 <w:rPr>
  <w:sz w:val="20" />
  <w:sz-cs w:val="20" />
 </w:rPr>
<w:t>d.out.ecs_rev}</w:t>
...

Is there any way to save a "clean" XML document using Word 2003, or is there an existing solution that can do the cleaning?

I tried to write a Java method that concatenates the separated parts of the placeholders, but because the number of possible split combinations is fairly large, the algorithm is far more complex than the original task I have to do, so it has become a problem in its own right.

Solution

You can use Aspose.Words and call this method, which joins adjacent runs that share the same formatting:

Document.JoinRunsWithSameFormatting

http://www.aspose.com/documentation/.net-components/aspose.words-for-.net-and-java/aspose.words.document.joinrunswithsameformatting.html

OTHER TIPS

If you have control over the original Word documents, you can stop Word from inserting rsid attributes and from marking grammar/spelling errors, both of which split text into extra runs.

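            // C# with Microsoft.Office.Interop.Word; `wordApp` is assumed to be
            // an existing Word.Application instance.
            Word.Options opts = wordApp.Options;
            opts.CheckGrammarAsYouType = false;
            opts.CheckGrammarWithSpelling = false;
            opts.CheckSpellingAsYouType = false;
            opts.StoreRSIDOnSave = false;   // do not write rsid attributes on save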

Words will still be split if, for example, you change the font partway through a word.

Hmmm, I have a simple but ugly bit of XSLT which I've used to clean WordML like the example you posted. I could commit it to docx4j if you want it but, as you say, there are various combinations which it wouldn't cover. Anyway, if you want it, please post to the docx4j forum.
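For illustration only, here is roughly what that kind of cleanup looks like when done with the plain Java DOM API instead of XSLT. This is a sketch, not the XSLT mentioned above and not docx4j code: the class RunMerger and its methods are made-up names, it only looks at runs that are direct children of w:p, it only merges runs that contain a single w:t plus an optional w:rPr, and it treats two runs as "same formatting" only when their w:rPr elements are exactly equal. rsid attributes on w:r are ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: merges adjacent w:r runs whose w:rPr is identical.
final class RunMerger {
    // Namespace bound to the w: prefix in Word 2003 WordprocessingML.
    static final String W_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Returns the single w:<localName> child, or null if it is absent or repeated. */
    static Element onlyChild(Element parent, String localName) {
        Element found = null;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                if (found != null) return null;
                found = (Element) n;
            }
        }
        return found;
    }

    /** True for a w:r that holds exactly one w:t and, optionally, one w:rPr. */
    static boolean isSimpleRun(Node n) {
        if (!(n instanceof Element) || !W_NS.equals(n.getNamespaceURI())
                || !"r".equals(n.getLocalName())) return false;
        int elements = 0;
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) elements++;
        }
        Element r = (Element) n;
        return onlyChild(r, "t") != null && elements == (onlyChild(r, "rPr") == null ? 1 : 2);
    }

    static boolean sameFormatting(Element a, Element b) {
        Element pa = onlyChild(a, "rPr");
        Element pb = onlyChild(b, "rPr");
        return pa == null ? pb == null : (pb != null && pa.isEqualNode(pb));
    }

    static Node nextElement(Node n) {
        Node s = n.getNextSibling();
        while (s != null && !(s instanceof Element)) s = s.getNextSibling();
        return s;
    }

    /** Merges mergeable neighbouring runs in every paragraph; returns the number of merges. */
    static int mergeRuns(Document doc) {
        int merged = 0;
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Node cur = paragraphs.item(i).getFirstChild();
            while (cur != null) {
                Node next = nextElement(cur);
                if (isSimpleRun(cur) && isSimpleRun(next)
                        && sameFormatting((Element) cur, (Element) next)) {
                    Element t = onlyChild((Element) cur, "t");
                    t.setTextContent(t.getTextContent()
                            + onlyChild((Element) next, "t").getTextContent());
                    cur.getParentNode().removeChild(next);
                    merged++;          // stay on cur: it may merge with the following run too
                } else {
                    cur = next;
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String rPr = "<w:rPr><w:sz w:val=\"20\"/></w:rPr>";
        Document doc = parse("<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r>" + rPr + "<w:t>${</w:t></w:r>"
                + "<w:r>" + rPr + "<w:t>dl.</w:t></w:r>"
                + "<w:r>" + rPr + "<w:t>d.out.ecs_rev}</w:t></w:r></w:p>");
        System.out.println(mergeRuns(doc) + " merges");
    }
}
```

As noted, this does not cover every combination: a placeholder that really does change formatting partway through, or one interrupted by a non-run element, stays split.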

A more robust approach would be to extract the plain text and keep track of which XML node each stretch of it came from, so you can search the plain text and go from a match back to the XML.
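A minimal sketch of that plain-text approach, again in plain Java DOM and under some assumptions: the class SplitPlaceholderReplacer and its methods are hypothetical names, text is taken only from w:t elements, each w:p is searched on its own (so a placeholder cannot span paragraphs), and nested paragraphs such as text boxes are not handled. The replacement value is written into the first w:t that the match touches, so it takes on that run's formatting; the other pieces of the placeholder are cut out of their w:t elements.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: finds placeholders in the concatenated paragraph text
// and maps each match back to the w:t elements that contain its pieces.
final class SplitPlaceholderReplacer {
    // Namespace bound to the w: prefix in Word 2003 WordprocessingML.
    static final String W_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Concatenates the text of all w:t descendants in document order. */
    static String plainText(Element scope) {
        StringBuilder sb = new StringBuilder();
        NodeList ts = scope.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < ts.getLength(); i++) sb.append(ts.item(i).getTextContent());
        return sb.toString();
    }

    /** Replaces plain-text offsets [start, end) of a paragraph with value. */
    static void replaceRange(Element paragraph, int start, int end, String value) {
        NodeList ts = paragraph.getElementsByTagNameNS(W_NS, "t");
        int pos = 0;                 // plain-text offset where the current w:t begins
        boolean inserted = false;    // value goes into the first overlapping w:t only
        for (int i = 0; i < ts.getLength(); i++) {
            String s = ts.item(i).getTextContent();
            int lo = Math.max(start, pos) - pos;
            int hi = Math.min(end, pos + s.length()) - pos;
            if (lo < hi) {           // this w:t holds part of the match
                ts.item(i).setTextContent(
                        s.substring(0, lo) + (inserted ? "" : value) + s.substring(hi));
                inserted = true;
            }
            pos += s.length();
        }
    }

    /** Replaces every occurrence of placeholder, split or not; returns the count. */
    static int replaceAll(Document doc, String placeholder, String value) {
        if (placeholder.isEmpty()) throw new IllegalArgumentException("empty placeholder");
        int count = 0;
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            int from = 0;
            int start;
            while ((start = plainText(p).indexOf(placeholder, from)) >= 0) {
                replaceRange(p, start, start + placeholder.length(), value);
                from = start + value.length();   // continue after the inserted value
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r><w:t>Rev: ${</w:t></w:r>"
                + "<w:r><w:t>dl.</w:t></w:r>"
                + "<w:r><w:t>d.out.ecs_rev} end</w:t></w:r></w:p>");
        int n = replaceAll(doc, "${dl.d.out.ecs_rev}", "A7");
        // prints "1 replaced: Rev: A7 end"
        System.out.println(n + " replaced: " + plainText(doc.getDocumentElement()));
    }
}
```

Because the search runs on the concatenated text, it does not matter where or how often Word cut the placeholder, which is what makes this less fragile than enumerating split combinations.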

Word 2003 XML is unusually complex and hard to decode. The reason you are getting multiple tags is that WordML stores text in elements called runs (the w:r tag), and a single word can end up spread over several of them. As far as I know, there is no easy way to clean the XML above. I would recommend using HTML instead of WordML. It is much easier to manipulate when replacing your placeholders with the appropriate content. If cost is not an issue, use a product like Aspose. It does this for you and is simple to use.

Licensed under: CC-BY-SA with attribution
