Good conventions for embedding schema of a flat file

https://stackoverflow.com/questions/2488065

21-09-2019

Question

We receive lots of data as flat files: delimited or just fixed-length records. It's sometimes hard to find out what the files actually contain.

Are there any well-established practices for embedding the schema at the beginning or the end of a file to make it self-explanatory?

Just to get an idea, imagine something like this:

<data name=test records=2 type=fixed>
   <field name=foo start=0 length=2 type=numeric>
   <field name=bar start=2 length=4 type=text>
</data>
11test
12ing

We would parse the XML at the beginning and use it to read the records.
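A minimal sketch of how such a reader might look, assuming the header always ends with a `</data>` line and that every record is one line. The header above uses unquoted attributes, so it is not well-formed XML and a standard XML parser would reject it; this sketch pulls the attributes out with a regular expression instead. The function name `parse_self_describing` is hypothetical.

```python
import re

# Matches name=value pairs with unquoted values, as in the header above.
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def parse_self_describing(text):
    """Read the embedded field definitions, then slice each record with them."""
    lines = text.splitlines()
    fields = []
    body_start = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("<field"):
            fields.append(dict(ATTR_RE.findall(stripped)))
        elif stripped == "</data>":
            body_start = i + 1  # records begin on the next line
            break
    if body_start is None:
        raise ValueError("no </data> terminator found")

    records = []
    for line in lines[body_start:]:
        if not line:
            continue  # skip blank lines
        row = {}
        for field in fields:
            start, length = int(field["start"]), int(field["length"])
            raw = line[start:start + length]
            # Assumption: only "numeric" and "text" types exist.
            row[field["name"]] = int(raw) if field["type"] == "numeric" else raw.rstrip()
        records.append(row)
    return records


SAMPLE = """<data name=test records=2 type=fixed>
   <field name=foo start=0 length=2 type=numeric>
   <field name=bar start=2 length=4 type=text>
</data>
11test
12ing
"""

print(parse_self_describing(SAMPLE))
```

Quoting the attribute values in the header would let an ordinary XML parser read it, at the cost of a slightly noisier header.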

Solution

As far as I'm aware, no, or at least nothing widely adopted.

The only widely accepted convention I'm aware of is for the first row of the data file to hold the column names, and that only really works for delimited records. For fixed-length records it's harder, especially if your data can contain multiple record types (which I've found to be far more likely with fixed-length than with delimited files).
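For the delimited case, the header-row convention is directly supported by most tooling. A small sketch using Python's standard `csv` module, with made-up sample data, shows what the convention gives you and what it leaves out: names only, no types or lengths.

```python
import csv
import io

# Hypothetical sample: the first row carries the column names.
SAMPLE = "foo,bar\n11,test\n12,ing\n"


def read_with_header(text):
    """Return one dict per record, keyed by the names in the first row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_with_header(SAMPLE)
print(rows)
# Every value comes back as a string; the header says nothing about types.
```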

From where I sit, I'd suggest that you can't really embed the definition in the file. I'm assuming you're getting the data from external sources, so you're unlikely to get help from them. Even if you do, you immediately create new challenges: you can no longer, for example, easily open the files in Excel if necessary.

Thinking a bit laterally, you could, if using XML, embed the file in the definition instead (as a big lump of CDATA). This is slightly more practical because it puts a wrapper around your external data rather than asking for the data itself to be modified. I'm not sure how practical this is, but it feels better to me than the other way round.
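A rough sketch of that wrapper idea, using only the Python standard library. The element names (`data`, `field`, `payload`) and the functions `wrap` and `unwrap` are assumptions made up for illustration, not an established format.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def wrap(raw, fields):
    """Build an XML document holding the field definitions and the raw file."""
    doc = minidom.Document()
    root = doc.createElement("data")
    root.setAttribute("type", "fixed")
    doc.appendChild(root)
    for name, start, length, ftype in fields:
        field = doc.createElement("field")
        field.setAttribute("name", name)
        field.setAttribute("start", str(start))
        field.setAttribute("length", str(length))
        field.setAttribute("type", ftype)
        root.appendChild(field)
    payload = doc.createElement("payload")
    # The untouched external file goes in as a single CDATA section.
    payload.appendChild(doc.createCDATASection(raw))
    root.appendChild(payload)
    return doc.toxml()


def unwrap(xml_text):
    """Return (field definitions, original raw data) from a wrapped document."""
    root = ET.fromstring(xml_text)
    fields = [dict(f.attrib) for f in root.findall("field")]
    return fields, root.findtext("payload")


xml_text = wrap("11test\n12ing\n", [("foo", 0, 2, "numeric"), ("bar", 2, 4, "text")])
fields, raw = unwrap(xml_text)
print(raw)
```

Two caveats apply to any CDATA wrapper: data containing the sequence `]]>` cannot sit in a single CDATA section, and an XML parser normalizes CR LF line endings to LF, so the round trip is not byte-exact for every input.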

OTHER TIPS

Have you looked at Protocol Buffers for inspiration?

I don't know of any established practice, but your idea of just prepending the schema to the data seems fine. Apache Avro is a data serialization system similar to Protocol Buffers and Thrift. Typical Avro usage stores the schema with the data: an Avro object container file carries the schema in its header, ahead of the records.

I also wanted to mention the PADS project. It has a schema language designed to let you describe "ad hoc" data formats. Currently, I believe there are only C and ML implementations, which may be a problem. On the other hand, the schema language was designed to handle a wide variety of formats, so it might still be worth using over your own XML-based scheme.

Licensed under: CC-BY-SA with attribution
