Question

I am working on a project for which I want to create my own file format to store large amounts of data. I am trying to decide how that file format should be built to be as efficient as possible.

The data I want to store is basically a big data structure.

The idea I am currently examining is to store the data in a way similar to a combination of XML and Python. For example:

<Object1>
    <InnerObject1>
        <InnerInnerObject1>
            variable1 = 31415
            variable2 = "Hello World!!!"
        </InnerInnerObject1>
    </InnerObject1>
    <InnerObject2>
        variable1 = "abcd"
        variable2 = 17
    </InnerObject2>
</Object1>

where the tags correspond to class names and the variables to member variables.

Considering the time requirements of XML parsers, I am not sure if storing the data this way would allow for fast enough reading.

My question is essentially the following: how exactly do other file formats that store significant amounts of data, for example MP4 or OBJ, work? I am not talking about compression or anything like that, but about the exact way the data is laid out so that the reading program knows which data to put where in memory.

Thanks for any help in advance!


Solution

The reason there are many different file formats is that there are many different goals for the way data is formatted. Some of these are in opposition to each other and some are orthogonal to each other. Before you can embark on this, you need to determine what goals you wish to achieve.

I would say the first and most prominent decision is human readability vs. file size. These two goals are roughly in opposition to each other. By human readability, I mean that you can open the file in a basic text editor and understand the data. What you've shown above would fall into the highly readable and very bulky category. An example at the other end of this spectrum is something like Avro.
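As a rough, hypothetical illustration of the size difference (the exact byte counts depend on encoding and framing), here is the same value stored as readable text versus a packed binary integer in Python:

import struct

# Human-readable: the value as text, plus a name and delimiters (18 bytes)
text_form = b"variable1 = 31415\n"

# Compact: the same value packed as a 4-byte little-endian integer
binary_form = struct.pack("<i", 31415)

print(len(text_form), len(binary_form))  # 18 vs 4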

When you talk about things like MP4, that is a specialized format that is completely non-human-readable and designed for extremely small size relative to the amount of data it contains. It is very specialized in that it depends on the reality that, in video, almost every frame is very similar to the one before it. For text, like you are presenting in your example, this is generally not the case.
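To make the "how does the reader know where things go" part concrete: MP4 files are built from length-prefixed chunks (the spec calls them boxes), each starting with a 4-byte big-endian size and a 4-byte type code, followed by the payload. The reader never guesses; it reads the size, dispatches on the type, and can skip anything it doesn't understand. Here is a simplified sketch in Python (it ignores MP4's extended-size and run-to-end-of-file special cases):

import struct

def iter_boxes(f):
    # Yield (type, payload) for each top-level box in an MP4-like stream.
    # Simplified: does not handle size == 1 (64-bit extended size)
    # or size == 0 (box extends to the end of the file).
    while True:
        header = f.read(8)
        if len(header) < 8:
            break
        size, box_type = struct.unpack(">I4s", header)
        payload = f.read(size - 8)  # the size field includes the 8-byte header
        yield box_type, payload

# Usage:
# with open("movie.mp4", "rb") as f:
#     for box_type, payload in iter_boxes(f):
#         print(box_type, len(payload))  # e.g. b'ftyp', b'moov', b'mdat'

OBJ, by contrast, is a plain-text format: each line starts with a keyword (v, vn, f, ...) that tells the reader how to interpret the rest of that line.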

IMO, XML is a great format for documents (e.g. XHTML) but a very poor one for data transfer. This seems to be the more or less general consensus, as people are moving to things like JSON. It's not the only reason for that shift, but it's a factor.

If you want to handle the entire XML spec, it's pretty tough to write a parser due to its lineage from SGML. Most people will never use or encounter the more exotic features you would need to support to be able to parse any valid XML file; the common stuff would be easy. JSON is very easy to parse, relatively speaking.
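For comparison, your example maps naturally onto JSON, and Python's standard library turns it into nested dictionaries in one call:

import json

text = '''
{
  "Object1": {
    "InnerObject1": {
      "InnerInnerObject1": {"variable1": 31415, "variable2": "Hello World!!!"}
    },
    "InnerObject2": {"variable1": "abcd", "variable2": 17}
  }
}
'''

parsed = json.loads(text)
print(parsed["Object1"]["InnerObject2"]["variable2"])  # 17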

Writing a spec and a parser for a non-human-readable format isn't necessarily harder. It might also be a useful exercise in learning more about data structures, etc.
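For example, here is a minimal sketch of a made-up binary format (hypothetical layout: a 4-byte magic string, a record count, then length-prefixed UTF-8 name/value pairs); the parser is barely longer than the writer:

import struct

MAGIC = b"MYF1"  # hypothetical 4-byte file signature

def write_records(f, records):
    f.write(MAGIC)
    f.write(struct.pack("<I", len(records)))
    for name, value in records:
        for field in (name.encode("utf-8"), value.encode("utf-8")):
            f.write(struct.pack("<I", len(field)))
            f.write(field)

def read_records(f):
    if f.read(4) != MAGIC:
        raise ValueError("not a MYF1 file")
    (count,) = struct.unpack("<I", f.read(4))
    records = []
    for _ in range(count):
        fields = []
        for _ in range(2):
            (length,) = struct.unpack("<I", f.read(4))
            fields.append(f.read(length).decode("utf-8"))
        records.append(tuple(fields))
    return records

# Usage:
# import io
# buf = io.BytesIO()
# write_records(buf, [("variable1", "31415"), ("variable2", "Hello World!!!")])
# buf.seek(0)
# print(read_records(buf))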
