Question

I've used a software suite that is installed in offices and on remote vessels. The installations communicate back and forth, and they do that by using a simple proprietary file format that looks something like this:

/SHIP:16
MILES=45213

/ORDER:22943
STATUS=OPEN
TOTAL=447.84
URGENCY=HIGH

/ORDERLINES:22943
ITEM=3544
QUANTITY=1
PRICE=299.99
ITEM=11269
QUANTITY=5
PRICE=29.57

Recently, I've been writing a piece of software for a customer that saves information in the same kind of flat file format.

When the file is opened, the lines are iterated over, and "stuff happens" to them (e.g. they're inserted into a database, or whatever).
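For reference, a minimal sketch of that "iterate over the lines" step, assuming section headers start with "/" and every other non-blank line is a KEY=VALUE pair (the function name and the ASCII encoding are my assumptions, not part of the actual software):

```python
import sys

def parse_flat_file(path):
    """Parse the /SECTION:ID + KEY=VALUE format into a list of records."""
    records = []
    current = None
    with open(path, encoding="ascii") as f:  # encoding is an assumption
        for raw in f:
            line = raw.strip()
            if not line:
                continue  # blank lines separate sections
            if line.startswith("/"):
                # e.g. "/ORDER:22943" -> section "ORDER", id "22943"
                section, _, ident = line[1:].partition(":")
                current = {"section": section, "id": ident, "fields": []}
                records.append(current)
            elif current is not None:
                # keys such as ITEM repeat, so keep fields as a list of pairs
                key, _, value = line.partition("=")
                current["fields"].append((key, value))
    return records

if __name__ == "__main__":
    for record in parse_flat_file(sys.argv[1]):
        print(record["section"], record["id"], record["fields"])
```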

But it got me thinking: how would this kind of file format scale? (I like things to be able to scale.)

I could of course gzip it; but how does a file format evolve from something basic like this into something monolithic? What typical practices are employed when creating a file format for a new piece of software? How are they typically built?

Related: Is there a proper way to create a file format? and Should I encrypt files saved by my program


Solution

The ability to scale will depend on the specific usage.

  • If I take your example of lines inserted into a database, the closest model is a log. An application, such as a web server, writes some data to a log. Daily (or once per hour, or at any other interval), the log is rotated, i.e. the application releases the current file and starts writing to a new one. Once the file is released, an ETL process can pick it up and load the transformed data into the database (a minimal sketch of this rotate-then-load cycle follows this list).

  • If I take a different example, such as a large file (and by large, I mean several gigabytes or terabytes) which should be read in a context where any piece of information in it must be accessible quickly, then the format would be different and would probably use pages and indexes to point to the right content; additionally, fragmentation will be a concern if the data in the file is modified. You can find more information about this sort of usage by reading about the PST file format used by Microsoft Outlook (PST files can easily grow to gigabytes) or the file formats used by database engines.
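A minimal sketch of that rotate-then-load cycle, in Python; the file names, the raw-line table and the use of SQLite are illustrative assumptions, not a prescription:

```python
import os
import sqlite3

LIVE_FILE = "outbox.dat"       # file the writer is currently appending to (hypothetical name)
ROTATED_FILE = "outbox.1.dat"  # frozen copy handed over to the ETL step

def rotate():
    """Release the current file so the writer can start a fresh one."""
    os.replace(LIVE_FILE, ROTATED_FILE)

def etl(db_path="orders.db"):
    """Load the rotated file into a database, then discard it."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS lines (raw TEXT)")  # raw lines only, for brevity
    with open(ROTATED_FILE, encoding="ascii") as f:
        con.executemany("INSERT INTO lines VALUES (?)",
                        ((line.rstrip("\n"),) for line in f))
    con.commit()
    con.close()
    os.remove(ROTATED_FILE)
```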

This means that the format you are currently using may already be extremely scalable in the context in which it is used.

How are they typically built?

Like any data structure and any piece of software in general.

Ideally, during the architecture and design phase, developers think about how they can store information in a file, given the different requirements, priorities and constraints. The file format can then evolve to take into account new requirements, priorities and constraints, while remaining, if needed, backwards compatible.

Examples:

  • If a requirement for the format you've shown in your question is that values can be multiline and contain “=”, this creates a specific parsing issue for a value such as “12345¶=PRICE=123”.

    If a requirement is to follow the standards, then something like EDIFACT can be used instead of the current format (maybe with some metadata if needed).

  • If the priority is to make the file readable, “item” and “price” are fine or may even be expanded to be more explicit. If the priority is to shorten the size of the file, “item” could become “i”, “quantity” could become “q”, etc. Even better, the file can become the compact form shown below (a sketch of serializing to it follows this list):

    > 22943:3544,1,299.99;11269,5,29.57…
    

    or be transformed to a binary format.

  • If a constraint is to keep the data secure, cryptography will be used. If another constraint states that some of the involved systems don't support Unicode, that is an additional problem to solve.
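As an illustration of the size-oriented variant, here is a sketch that produces the compact line from the second bullet; the function name is mine, and it assumes the separators “:”, “,” and “;” never occur inside the values:

```python
def compact_order_lines(order_id, lines):
    """Serialize order lines as '<order>:<item>,<qty>,<price>;...'."""
    body = ";".join(f"{item},{qty},{price}" for item, qty, price in lines)
    return f"{order_id}:{body}"

# Reproduces the example above: 22943:3544,1,299.99;11269,5,29.57
print(compact_order_lines(22943, [(3544, 1, 299.99), (11269, 5, 29.57)]))
```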

Other tips

How does a file format evolve from being something basic like this?

By not thinking ahead and refusing to use existing standards because it's cool to reinvent the wheel.

There are various industry standards, each with its own quirks, and all of them went through the same drama when they were 'scaled up' (i.e. used outside the company that made them up). Character encodings, line endings, repetition, parsers: all of it has to be reinvented as soon as an organization uses its in-house format to communicate with the outside world.

What once started as a 'quick and dirty' way to exchange messages between two machines becomes a legacy you'll never get rid of.

Sometimes, though, thought is put into the structure of such formats. When you are looking to create a new format to store or transmit data from or to your application, please make absolutely sure that no existing format fits your needs.

YAGNI

There are many different ways to "scale". If you try to design a future-proof file format without knowing, with a high degree of certainty, what the future is going to look like, you are bound to fail.

Formats readable with a plain text editor have a huge advantage for debugging. You can always open them and inspect them with your own eyes and with makeshift tools built on simple text search and replacement. The development time saved, compared to a binary format for which you need to write debugging tools, is significant. As long as your simple text format works, just stick with it.

A file of records that are processed sequentially will scale linearly with the amount of data no matter what the format is. If you change it to a binary format, it will probably be smaller, but it will still scale linearly. The same size reduction can be achieved by compression, which keeps most of the advantages of a text format.
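A quick sketch of that point: reading a gzip-compressed text file is still a plain sequential, line-by-line scan, so processing keeps scaling linearly while the file on disk shrinks (the file name here is hypothetical):

```python
import gzip

count = 0
with gzip.open("orders.dat.gz", "rt", encoding="ascii") as f:  # "rt" = decompress as text
    for line in f:
        if line.strip():
            count += 1  # placeholder for the real per-line handling
print(count, "non-blank lines")
```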

You only need an "advanced" format when you need random access. Usually you'd then just take some existing container. If you only need to bundle resources together, the most popular choice is the plain old zip archive (it has an index at the end, so you can read any member directly). If you need random access to small elements, you want either a "*dbm" library (Berkeley DB, ndbm, gdbm, odbm) or SQLite. Or a database server, of course (SQLite is faster than any RDBMS server, but allows only limited concurrent access, no clustering, limited triggers, etc.).
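For the random-access case, a small sketch using the SQLite option mentioned above; the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect("orders.sqlite")
con.execute("""CREATE TABLE IF NOT EXISTS order_line (
                   order_id INTEGER, item INTEGER, quantity INTEGER, price REAL)""")
con.execute("CREATE INDEX IF NOT EXISTS idx_order ON order_line(order_id)")
con.executemany("INSERT INTO order_line VALUES (?, ?, ?, ?)",
                [(22943, 3544, 1, 299.99), (22943, 11269, 5, 29.57)])
con.commit()

# Random access: fetch one order's lines without scanning the whole file.
for row in con.execute("SELECT item, quantity, price FROM order_line "
                       "WHERE order_id = ?", (22943,)):
    print(row)
con.close()
```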

It's not clear what "scale" means in this context, but if you're talking about the file becoming large, I'd suggest breaking it into multiple files that can be processed in parallel, and having some kind of association keyword (e.g., include 'file2') that permits multiple files to be grouped into a single unit. Then you have the option of spawning another thread or process to handle each file, possibly merging all the results at the end. If there's no way to perform any processing in parallel, then you'll never truly scale out.
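A sketch of that split-and-merge approach, assuming the large file has already been broken into chunk files (all names here are hypothetical):

```python
from multiprocessing import Pool

def process_file(path):
    """Handle one chunk file and return a summary the caller can merge."""
    count = 0
    with open(path, encoding="ascii") as f:
        for line in f:
            if line.strip():
                count += 1  # placeholder for the real per-line work
    return count

if __name__ == "__main__":
    chunks = ["orders_part1.dat", "orders_part2.dat", "orders_part3.dat"]
    with Pool() as pool:
        results = pool.map(process_file, chunks)
    print("merged result:", sum(results))
```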

It's good to think about such things, though. The last large data files I worked with were from an engineering package, and they were an evil mishmash of fixed-field data embedded inside SGML-style markup tags...


As per my point of view, as long as one can "save" files in the non-x format, things will be fine. But since one can never be sure as to which version a recipient has, saving in the "non-x" format is the safest.

Licensed under: CC-BY-SA with attribution