Extract fields from ODT document using Java library

https://stackoverflow.com/questions/9976581

28-05-2021

Question

I need a Java library, or some code, to extract field tags from the content of an ODT document. I know an ODT file is a ZIP archive that keeps its contents in a content.xml file. Of course I could just unzip it, open content.xml and parse it myself, but I believe some higher-level code exists for this. As an example, the content looks like this:

<text:p text:style-name="Standard">Hi ${name}!</text:p>
<text:p text:style-name="Standard">
  <text:text-input text:description="JOOScript">$nome</text:text-input>
</text:p>

I would like to extract the fields as ${name} and $nome.

I know Apache Tika could be used for that, but I haven't spotted an example that actually shows field extraction. I believe this is because the fields I am using are unstructured text instead of input field tags.

Thanks in advance, Daniel

Solution

In case anyone is interested: we ended up using Apache Tika to obtain the text content of the ODT file, and then matched the fields in that text with the following regular expression:

\$\{[\w\-\.]*\}
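Note that this pattern only matches the braced form (${name}), not the bare form ($nome) from the question. As a minimal sketch of the matching step, assuming the document text has already been extracted into a String (for example with Tika, which is not shown here), something like the following could be used. The class name, the method names and the second pattern for bare $name fields are illustrative assumptions, not part of the original answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OdtFieldExtractor {
    // Pattern from the answer: "${", any run of word characters, '-' or '.', then "}".
    static final Pattern BRACED = Pattern.compile("\\$\\{[\\w\\-\\.]*\\}");

    // Assumption (not in the answer): also accept bare fields such as $nome.
    static final Pattern BRACED_OR_BARE = Pattern.compile("\\$\\{[\\w\\-\\.]*\\}|\\$\\w+");

    /** Returns every match of the given pattern in the text, in order of appearance. */
    static List<String> extractFields(String text, Pattern pattern) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            fields.add(matcher.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for the plain text extracted from the ODT document.
        String text = "Hi ${name}!\n$nome";
        System.out.println(extractFields(text, BRACED));         // [${name}]
        System.out.println(extractFields(text, BRACED_OR_BARE)); // [${name}, $nome]
    }
}
```

Matching on extracted plain text keeps the code independent of the ODT XML structure, at the cost of losing the distinction between free text and text-input elements.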

Licensed under: CC-BY-SA with attribution
