Pregunta

I'm writing a Python tool to convert/store the data from a commonly used plaintext file format for volumetric data generated by computational chemistry calculations into the binary HDF5 format using h5py. The plaintext files are often huge, frequently 100MB or more, and in cases where certain types of semi-lossy compression is acceptable, the filter pipeline built into HDF5/h5py enables storage at a compression factor of anywhere from 50-300x (benchmarks here, for the curious). Thus, whereas exchange of the uncompressed files is terribly inconvenient, if not completely impractical, my hope is that these new compressed files will eventually become the lingua franca of the medium.

A further advantage of the HDF5 format is that it's cross-platform and readable by any application that can tie into its C/C++/FORTRAN interfaces, or that has support for Python scripting (via h5py). Thus, I also hope that various software applications that currently support read/write of the plaintext file format will eventually implement support for this HDF5 flavor.

At a high level, HDF5 stores data in key/value pairs. In my Python tool, I've defined standard keys for the various data values I'm using at the moment. However, I anticipate that the set of keys used in the HDF5 files will change over time. Given that (1) the structure of the file format will likely change and that (2) my goal is for other software to interoperate directly with these files, I've started putting together an independent, versioned file specification for the format.

Where I'm a bit confounded is the decision on the versioning paradigm for this file specification. I'm fully on board the SemVer train for code (or at least my understanding of it: {API breaking change}.{API extending change}.{bugfix}). But, I can't decide if this paradigm makes sense for versioning a file specification.

On one hand, I like the approach of just having a single version number. Each version would be its own standalone entity, and regardless of how similar a given version is to another it would be sharply distinct. Further, it wouldn't matter much if the specification changes in a "nonlinear" way (e.g., v3 and v6 are compatible by some metric, but v3/v4 and v3/v5 aren't) because inter-version compatibility for a given application would (probably?) always just be a set of one-to-one mappings.

On the other hand, in situations where, say, the only change in going from version x to version x+1 is the addition of a new key, then there would be some value to a SemVer-like structure, say an X.Y format, because then an application could define ranges of compatibility (e.g., >=2.3,<3.0, like with Python packages). But, nonlinear 'jumps' in the specification would tend to lead to rapid increment of the 'major version' anyways, diminishing the value of range-wise compatibility specification.

Thus, the title question: Is it appropriate to use Semantic Versioning for the specification for a key/value file format? Why or why not?

¿Fue útil?

Solución

Semantic versioning's main advantage is to other developers who rely on your interface.

It would be a great way to signal to other developers whether or not the changes to the file format might break their tools - e.g. if you removed a key, or changed its name. However, some people would argue that you should never change or remove keys anyway.

Minor version number changes would signal backwards compatible changes (e.g. adding keys).

There is a discussion of using semantic versioning for JSON files here: https://stackoverflow.com/questions/10042742/what-is-the-best-way-to-handle-versioning-using-json-protocol

Licenciado bajo: CC-BY-SA con atribución
scroll top