Question

Suppose I want to design and implement a large and complex file format, like pdf or docx; how do I have to structure it? How can they contain so many different data types like images, macros or graphics?

Edit: by structures I mean the ways of storing different data types; I don't think a plain binary file is viable. The file format that I want to design is like a Word page with various multimedia contents. It's a word processor application for mobile platforms which needs a custom file format for its particular structure. I know this is a time-consuming task, but what I need is an extensible base structure that I can expand in the future.

Solution

Suppose I want to design and implement a large and complex file format, like pdf or docx

If you are working alone, I would try hard to avoid such a large specification effort. In particular, consider instead:

  • using existing textual formats such as JSON, YAML or maybe XML or S-expressions; you still need to specify how you use them, e.g. define the names of attributes or tags, their roles, and the rules for using them.

  • using existing database engines, perhaps SQLite or genuine RDBMS servers like PostgreSQL (or non-relational database servers à la MongoDB). BTW, it is often worthwhile to store structured textual data (e.g. JSON) inside databases; of course you still need to specify a database schema and the set of queries used on it. In some cases key-value indexed files (à la GDBM or TokyoCabinet) could be enough.

  • using and extending an existing embeddable interpreter (à la Lua or Guile) and making your file a script for that interpreter.

  • defining some (preferably textual) domain-specific language, inspired by existing ones (which is quite close to extending an interpreter).
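As a rough illustration of the first two options combined, here is a minimal sketch that uses SQLite as the document container and JSON for per-part metadata. The table name `parts`, its columns, and the helper functions are assumptions made up for this example, not an established schema.

```python
import json
import sqlite3


def create_document(path):
    """Open (or create) a document container. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parts ("
        " name TEXT PRIMARY KEY,"
        " content_type TEXT NOT NULL,"
        " meta TEXT NOT NULL,"  # metadata stored as JSON text
        " data BLOB NOT NULL)"
    )
    return conn


def put_part(conn, name, content_type, data, **meta):
    """Store one part (text, image, ...) as raw bytes plus JSON metadata."""
    conn.execute(
        "INSERT OR REPLACE INTO parts VALUES (?, ?, ?, ?)",
        (name, content_type, json.dumps(meta), data),
    )
    conn.commit()


def get_part(conn, name):
    """Return (content_type, metadata dict, payload bytes) for a part."""
    row = conn.execute(
        "SELECT content_type, meta, data FROM parts WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    content_type, meta, data = row
    return content_type, json.loads(meta), bytes(data)
```

With this layout, adding a new kind of content later means adding rows with a new content type rather than changing the container itself.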

How can they contain so many different data types like images, macros or graphics?

To the container, these are only sequences of bytes accompanied by some metadata (perhaps a content type) that tells the application how to interpret them.
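To make that concrete, here is a small sketch in which an embedded image is nothing more than a byte payload tagged with a content type. The field names (`content_type`, `size`, `data`) and the base64-in-JSON encoding are assumptions chosen for the example; a real format might store the bytes differently.

```python
import base64
import json


def wrap(content_type, payload):
    """Describe an arbitrary byte payload with a little metadata."""
    return {
        "content_type": content_type,
        "size": len(payload),
        # base64 lets raw bytes travel inside a textual format
        "data": base64.b64encode(payload).decode("ascii"),
    }


def unwrap(part):
    """Recover (content_type, payload bytes) and check the declared size."""
    payload = base64.b64decode(part["data"])
    if len(payload) != part["size"]:
        raise ValueError("payload size does not match metadata")
    return part["content_type"], payload


# The container never needs to understand the image; it only carries it.
png_signature = b"\x89PNG\r\n\x1a\n"
text = json.dumps(wrap("image/png", png_signature))
content_type, payload = unwrap(json.loads(text))
```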

If you really want to design a large and complex file format, think first about portability (between machine architectures: word size, endianness) and extensibility. Specify your format on paper (e.g. using some EBNF notation) and have it reviewed by others. Write a sample implementation library to parse and generate that format, incrementally, while you are specifying it.
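One common way to address both concerns in a binary format is a versioned header with explicit byte order, followed by tagged, length-prefixed chunks that older readers can skip. The sketch below assumes a made-up magic number `MYDF` and made-up chunk tags; it is one possible layout, not a recommendation of specific field sizes.

```python
import io
import struct

MAGIC = b"MYDF"                  # hypothetical magic number
HEADER = struct.Struct("<4sHH")  # magic, major, minor: little-endian, fixed widths
CHUNK = struct.Struct("<4sI")    # 4-byte tag, 32-bit payload length


def write_file(stream, chunks, version=(1, 0)):
    """Write the header, then each (tag, payload) pair as a chunk."""
    stream.write(HEADER.pack(MAGIC, *version))
    for tag, payload in chunks:
        stream.write(CHUNK.pack(tag, len(payload)))
        stream.write(payload)


def read_file(stream, known_tags):
    """Read all chunks, silently skipping tags this reader does not know."""
    head = stream.read(HEADER.size)
    if len(head) != HEADER.size:
        raise ValueError("truncated header")
    magic, major, minor = HEADER.unpack(head)
    if magic != MAGIC:
        raise ValueError("not a MYDF file")
    if major != 1:
        raise ValueError("unsupported major version")
    chunks = []
    while True:
        head = stream.read(CHUNK.size)
        if not head:
            break  # clean end of file
        if len(head) != CHUNK.size:
            raise ValueError("truncated chunk header")
        tag, length = CHUNK.unpack(head)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated chunk payload")
        if tag in known_tags:
            chunks.append((tag, payload))
    return (major, minor), chunks
```

Because every chunk carries its own length, a reader written today can step over chunk types that are only defined in a later minor version.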

Make your specification publicly available to enable feedback from outside. Make your sample implementation library free software or open source.

Data often outlive software, so work hard to get a reasonable format right.

If you still do define your format, be aware that it could take you years of work to get it right.

Study existing file formats before inventing your own. Notice that a file format is successful only when several applications are using it, so there is an important social issue (convincing others to use your format), hence you should try to specify it with some other people.

Edit: by structures I mean the ways of storing different data types; I don't think a plain binary file is viable

Read also about serialization, i.e. converting in-memory data structures to a storable sequence of bytes or text and back.
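As a minimal sketch of what serialization means here, the following round-trips a tiny in-memory document model through JSON. The `Document` and `Paragraph` classes and the `format_version` field are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Paragraph:
    text: str
    style: str = "body"


@dataclass
class Document:
    title: str
    paragraphs: list = field(default_factory=list)


def dumps(doc):
    """Serialize a Document to JSON text, tagging it with a format version."""
    return json.dumps({"format_version": 1, **asdict(doc)}, sort_keys=True)


def loads(text):
    """Rebuild a Document from JSON text produced by dumps()."""
    raw = json.loads(text)
    if raw.get("format_version") != 1:
        raise ValueError("unsupported format version")
    return Document(
        title=raw["title"],
        paragraphs=[Paragraph(**p) for p in raw["paragraphs"]],
    )
```

Storing a version number alongside the data is a cheap way to leave room for later changes to the model.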

The file format that I want to design is like a Word page with various multimedia contents

Study existing formats, e.g. OpenDocument. If possible, adapt one of them. Otherwise, budget a dozen years of full-time effort, and try to find several senior engineers to work with you.

(Very probably your format and your software will be ignored; consider that possibility seriously.)

Remember that on current computers, I/O (network or disk, even SSD) is much slower than the CPU (more than a thousand times slower), so the CPU time spent parsing and writing textual formats is generally much less than the I/O time. In other words, the network, disk or SSD is usually the bottleneck. And textual formats (à la JSON, etc.) are much easier to debug.

It's a word processor application for mobile platforms which needs a custom file format for its particular structure,

I would still recommend using some existing format, or at least a strict, well-specified subset of one. Why can't you use EPUB, some subset of OpenDocument, or a subset of HTML? Look also into the HTML-like formats used by GTK and by Qt, at least for inspiration (and perhaps use such a library).

The advantage of such an approach is that you won't need to code a lot of converters (since you would be able to reuse some of the existing ones).
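To show what "a strict subset of HTML" could look like in practice, here is a small sketch that checks markup against a whitelist using Python's standard `html.parser`. The particular sets of allowed tags and attributes are assumptions picked for the example; a real specification would define its own.

```python
from html.parser import HTMLParser

# Hypothetical whitelist defining the subset.
ALLOWED_TAGS = {"h1", "h2", "p", "em", "strong", "ul", "ol", "li", "img"}
ALLOWED_ATTRS = {"img": {"src", "alt"}}


class SubsetChecker(HTMLParser):
    """Collect every tag or attribute that falls outside the subset."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.errors.append(f"tag <{tag}> is not allowed")
            return
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                self.errors.append(f"attribute '{name}' is not allowed on <{tag}>")


def check_subset(markup):
    """Return a list of violations; an empty list means the markup conforms."""
    checker = SubsetChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors
```

A document that passes such a check remains ordinary HTML, so existing viewers and converters can still read it.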

PS. If you just want to produce a nice-looking document (e.g. in PDF) from some code, consider instead generating a textual file and feeding it to a typesetting system like LaTeX or Lout, or find and use a library that emits PDF files.
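For instance, generating LaTeX source from code can be as simple as the sketch below. The function names are made up, and the escape table covers only the usual LaTeX special characters; the resulting text would then be passed to a LaTeX installation to obtain a PDF.

```python
# Replacements for characters that have special meaning in LaTeX.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def to_latex(title, paragraphs):
    """Emit a minimal LaTeX article with one unnumbered section."""
    lines = [
        r"\documentclass{article}",
        r"\begin{document}",
        r"\section*{" + escape(title) + "}",
    ]
    for paragraph in paragraphs:
        lines.append("")  # a blank line separates paragraphs in LaTeX
        lines.append(escape(paragraph))
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```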

Licensed under: CC-BY-SA with attribution