Question

How do we build/assemble test data for large XML schema?

We need to build test data for a complex, nested XML Schema (XSD). We would like to build the test data for each embedded XSD separately and then assemble it in the sequence defined by the main XSD. We are using XMLSpy, which has Text/Grid/Schema views.

Is there a way to get a list view of the XML Schema that shows the defined sequence (sub-XSD names only) in a form like the following:

Main XSD
    Sub XSD1
    Sub XSD2
        Sub XSD21
        Sub XSD22
    Sub XSD3

With such a view, we can plan how to build and assemble the test data for the large XML schema.

Are there any other approaches to building test data in this situation?


Solution

I am focusing on your last (and bolded) question, and will only quickly skim through the others.

I am not an XMLSpy user. I've asked some people, though, and as far as I was told, there is no out-of-the-box report that would give you exactly what you want. I do know that XMLSpy has an automation API which I am sure could be put to good use to get exactly what you need.
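As a tool-independent alternative, the xs:include/xs:import references can also be followed with a small script to produce roughly the indented list you described. Below is a minimal sketch in Python; it is only an illustration (the starting file name Main.xsd is a placeholder), and it deliberately ignores the caveats about file layout discussed further down.

import xml.etree.ElementTree as ET
from pathlib import Path

XS = "{http://www.w3.org/2001/XMLSchema}"

def print_xsd_tree(xsd_path, indent=0, seen=None):
    # Print the schema file name, then recurse into every xs:include/xs:import
    # that carries a schemaLocation, indenting one level per nesting depth.
    seen = set() if seen is None else seen
    path = Path(xsd_path).resolve()
    print("    " * indent + path.name)
    if path in seen:  # guard against circular includes
        return
    seen.add(path)
    root = ET.parse(path).getroot()
    for ref in root.findall(XS + "include") + root.findall(XS + "import"):
        location = ref.get("schemaLocation")
        if location:  # xs:import is allowed to omit schemaLocation
            print_xsd_tree(path.parent / location, indent + 1, seen)

print_xsd_tree("Main.xsd")  # placeholder for your main XSD file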

From a generic tooling perspective, QTAssistant (I am associated with it) provides an out-of-the-box dependency report (which can be exported to Excel). Below is a screenshot showing the UBL 2.1.0 XSD file dependencies.

[Screenshot: UBL 2.1.0 XSD file dependency report in QTAssistant]

Thinking about your request, I believe you're on the right path (componentizing your approach); I would, though, propose a better angle.

From my experience, one of the many mistakes people make in your kind of scenario is to focus on the XSD file layout when defining the model of the test data. Generally speaking, the relations between XSD files as described by an XSD author (through xsd:include/xsd:import) are not always relevant to the XSD processor. In fact, a reference may be missing without affecting the integrity of the XSD set, and superfluous references only cause unnecessary overhead in your layout while the XSD processor safely "ignores" them. This ultimately means that the relationships between XSD components are not necessarily the same as those described by the XSD file layout.

Another common mistake is to put the test data in a model that uses the XSD components verbatim, through names and/or structure. At least in my experience, test data and XSD models are managed by two different teams, with vastly different priorities, timelines, and understandings of technology in general, and of XML and XSD in particular. This ultimately means that the relationships between XSD components are not necessarily the same as those described in a project requirements document, or even in the business domain: the granularity may be different, relationships may be flattened or super-normalized, etc.

There is also a downside to coupling test data modeling to an XSD: whoever creates the test data needs the XSD as early as possible, more often than not unreasonably early (from the XSD designers' point of view); not to mention that changes to the XSD (due to new requirements, compliance issues, bug fixes, etc.) wreak havoc on the test data side.

If the XSD already exists, or the XSD is "the" model (instead of UML or some other kind of modeling language), it can be used as a source of inspiration. Bringing the two together, though, should in my opinion happen through a mapping layer, which keeps them decoupled: your test data model on one side, and the model described by the XSD on the other.

Which brings me to the approach I would recommend. There is always something behind the XSD (requirements, etc.); alternatively, take the model described by the XSD (not its components), or the test data requirements themselves. Use that to build a "normalized" data set; it could be, for example, a bunch of Excel worksheets, or some tables in your favourite/available database.

Then employ some combination of tools that uses mapping information to convert the data in the normalized data set to XML, either directly based on the XSD or indirectly (e.g. through ORM mapping). You have to check what works for you, depending on the platform you run on and the technologies you can bring in-house (we use XML Builder for this).
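To make that step a little more concrete, here is a rough sketch (not the actual tooling we use) of what a mapping from a normalized data set to XML could look like, again in Python; the element names, field names, and sample rows are all made up.

import xml.etree.ElementTree as ET

# Normalized test data: flat rows, shaped the way the test team thinks,
# not the way the XSD happens to be laid out. In practice these rows could
# come from Excel worksheets or database tables.
customers = [{"id": "C1", "name": "Acme Corp"}]
invoices = [{"id": "I1", "customer": "C1", "amount": "120.50"}]

def build_document(customers, invoices):
    # The mapping layer: decides how the flat rows are turned into the
    # nested XML structure the XSD expects.
    root = ET.Element("TestData")
    for c in customers:
        cust = ET.SubElement(root, "Customer", id=c["id"])
        ET.SubElement(cust, "Name").text = c["name"]
        # nest each customer's invoices under the customer element
        for inv in (i for i in invoices if i["customer"] == c["id"]):
            invoice = ET.SubElement(cust, "Invoice", id=inv["id"])
            ET.SubElement(invoice, "Amount").text = inv["amount"]
    return ET.ElementTree(root)

build_document(customers, invoices).write("testdata.xml", encoding="utf-8", xml_declaration=True)

The generated file can then be validated against the XSD as a sanity check on the mapping itself.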

For example, say you work with objects such as customers, accounts, invoices, etc. Your test data should describe exactly that. While the XSD should also line up with it, it may use all sorts of other constructs, for different reasons (substitution groups, type hierarchies, groups, etc. for extensibility, reuse, accommodating XSD-to-code binding, and so on). The point is that many things one puts in an XSD to support requirements that have nothing to do with the business domain model could prove a burden to your test data model.

Not to mention that modeling your test data could also provide the seed for a test model in general, which diverges even further from what a (good) XSD is meant to describe.

Licensed under: CC-BY-SA with attribution