Question

I am working on a project where we have a few hundred old XML documents. We think these documents use about 60 different schemas between them, but we don't know what those schemas are.

Is there a tool that does this kind of job? If not, what would be the best way to go about comparing them programmatically?

Solution

I would start by doing some ad-hoc queries. Assuming that you have all the documents in one directory, and that you have an XSLT 2.0 or XQuery processor such as Saxon that can read all the documents in a directory using the collection() function, you could start with

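<xsl:for-each-group select="collection('dir?select=*.xml')" group-by="node-name(*)">
  <!-- Curly braces make these attribute value templates, so the expressions are evaluated rather than copied as literal text -->
  <e name="{name(*)}" count="{count(current-group())}"/>
</xsl:for-each-group>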

to see whether it's useful to group them by top-level element name.
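If no XSLT 2.0 processor is at hand, the same first pass can be approximated in Python with only the standard library. This is a minimal sketch, not a replacement for the query above: the function names `root_name` and `count_by_root` are made up for illustration, and it assumes all the documents sit directly in one directory with a `.xml` extension. It reads each file only as far as the root start tag, so it reports files that fail before that point but does not check the rest of the document for well-formedness.

```python
import sys
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def root_name(path):
    """Return the root element name in Clark notation, e.g. '{urn:x}invoice'."""
    with open(path, "rb") as f:
        # The first "start" event is always the root element, so we can
        # return immediately without parsing the rest of the file.
        for _event, elem in ET.iterparse(f, events=("start",)):
            return elem.tag


def count_by_root(directory):
    """Count the *.xml files in `directory` by root element name."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            counts[root_name(path)] += 1
        except ET.ParseError:
            # Raised when the parser fails before reaching a root element
            # (for example an empty file or a file that is not XML at all).
            counts["<not well-formed>"] += 1
    return counts


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, n in count_by_root(directory).most_common():
        print(f"{n:5d}  {name}")
```

Like node-name() in the query, the Clark-notation name includes the namespace URI, so two documents whose root elements share a local name but live in different namespaces land in different groups.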

You could then perhaps select one representative document for each top-level element name and use a tool to generate a schema for that document, then run a similar query to validate all the documents in that group against that schema (for this you will need a schema-aware XSLT or XQuery processor).
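Before a generated schema is available, a rough way to see how uniform a group is would be to compare the element paths used in each document with those in the representative. The sketch below is only a crude stand-in for that step and is not schema validation: it ignores attributes, cardinality, ordering and data types, and the names `element_paths` and `check_against_representative` are hypothetical. It assumes the first file in the list is the chosen representative.

```python
import xml.etree.ElementTree as ET


def element_paths(path):
    """Return the set of distinct element paths in a document, e.g. 'order/item'."""
    paths = set()

    def walk(elem, prefix):
        here = f"{prefix}/{elem.tag}" if prefix else elem.tag
        paths.add(here)
        for child in elem:
            walk(child, here)

    walk(ET.parse(path).getroot(), "")
    return frozenset(paths)


def check_against_representative(files):
    """Compare every file with the first one.

    Returns a dict mapping each file that uses element paths absent from
    the representative to the sorted list of those unexpected paths.
    """
    files = list(files)
    allowed = element_paths(files[0])
    outliers = {}
    for f in files[1:]:
        extra = element_paths(f) - allowed
        if extra:
            outliers[str(f)] = sorted(extra)
    return outliers
```

Documents that show up as outliers are candidates for a separate schema, or a sign that the representative was not a very complete instance. Real validation against the generated schema still needs a schema-aware processor, as noted above.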

(Most of the IDEs, such as oXygen, include a tool to generate a schema from an instance document, but I'm not aware of one that can be invoked programmatically.)

After this it depends a little on what you discover...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow