Question

I am starting work on a new piece of software that will need some robust and expandable file I/O. There are a lot of formats out there: XML, JSON, INI, etc. Each has its pluses and minuses, so I thought I would ask for some community input.

Here are some rough requirements:

  1. The format is a "standard". I don't want to reinvent the wheel if I don't have to. It doesn't have to be a formal IEEE standard, but something a new user could Google and get some information on, ideally with some support tools (editors) beyond vi. (Though the software's users will generally be computer savvy and happy to use vi.)
  2. Easily integrates with C++. I don't want to have to pull in a 100 MB library and three different compilers to get it up and running.
  3. Supports tabular input (2d, n-dimensional)
  4. Supports POD types
  5. Can expand as more inputs are required, binds well to variables, etc.
  6. Parsing speed is not terribly important
  7. Ideally, as easy to write (reflect) as it is to read
  8. Works well on Windows and Linux
  9. Supports compositing (one file referencing another file to read, and so on.)
  10. Human Readable

In a perfect world, I would use a header-only library or some clean STL implementation, but I'm fine with leveraging Boost or some small external library if it works well.

So, what are your thoughts on various formats? Drawbacks? Advantages?

Edit

Options to consider? Anything else to add?

  • XML
  • YAML
  • SQLite
  • Google Protocol Buffers
  • Boost Serialization
  • INI
  • JSON

Solution 4

For my purposes, I think the way to go is XML.

  1. The format is a standard, but allows for modification and flexibility for the schema to change as the program requirements evolve.
  2. There are several library options. Some are larger (Xerces-C) some are smaller (ezxml), but there are many options, so we won't be locked in to a single provider or very specific solution.
  3. It can support tabular input (2d, n-dimensional). This requires more parsing work on "our" end and is likely XML's weakest point.
  4. Supports POD types: Absolutely.
  5. Can expand as more inputs are required, binds well to variables, etc. through schema modifications and parser modifications.
  6. Parsing speed is not terribly important, so processing a text file or files is not an issue.
  7. XML can be programmatically written just as easily as read.
  8. Works well on Windows and Linux or any other OS that supports C and text files.
  9. Supports compositing (one file referencing another file to read, and so on.)
  10. Human Readable with many text editors (Sublime, vi, etc.) supporting syntax highlighting out of the box. Many web browsers display the data well.

Thanks for all the great feedback! I think if we wanted a purely binary solution, Protocol Buffers or boost::serialization is likely the way that we would go.

OTHER TIPS

There is one excellent format that meets all your criteria:

SQLite!

Please read the article on using SQLite as an application file format, and watch the Google Tech Talk by D. Richard Hipp (SQLite's author) on this very topic.

Now, let's see how SQLite meets your requirements:

The format is a "standard"

SQLite has become the format of choice in most mobile environments and for many desktop apps (Firefox, Thunderbird, Google Chrome, Adobe Reader, you name it).

Easily integrates with C++

SQLite has a standard C interface, which is only one source file and one header file. There are C++ wrappers too.

Supports tabular input (2d, n-dimensional)

An SQLite table is as tabular as you could possibly imagine. To represent, say, three-dimensional data, create a table with columns x, y, z, value and store your data as a set of rows like this:

x1,y1,z1,value1
x2,y2,z2,value2
...
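In SQL terms, that layout might look like this (the table and column names are just illustrative):

```sql
CREATE TABLE grid (
    x     REAL,
    y     REAL,
    z     REAL,
    value REAL,
    PRIMARY KEY (x, y, z)
);
-- One row per grid point:
INSERT INTO grid VALUES (0.0, 0.0, 0.0, 1.5);
```

Adding a fourth dimension is just another column, so the same pattern extends to n-dimensional data.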

Supports POD types

I assume by POD you mean Plain Old Data. SQLite stores the usual scalar types (INTEGER, REAL, TEXT) natively, and lets you store raw binary data as is in BLOB fields.

Can expand as more inputs are required, binds well to variables

This is where it really shines: you can add new columns or tables at any time with ALTER TABLE or CREATE TABLE, and prepared statements bind query parameters and results directly to your program's variables.

Parsing speed is not terribly important

But SQLite's speed is superb anyway. In fact, there is no text parsing step at all: data is read directly from the file's pages.

Ideally, as easy to write (reflect) as it is to read

Just use INSERT to write and SELECT to read - what could be easier?

Works well on Windows and Linux

You bet, and all other platforms as well.

Supports compositing (one file referencing another file to read)

You can ATTACH one database to another.
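For example (the file and table names are hypothetical):

```sql
ATTACH DATABASE 'shared.db' AS shared;
-- Tables in the attached file are queried directly:
SELECT * FROM shared.lookup_table;
```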

Human Readable

Not in its binary form, but there are many excellent SQLite browsers/editors out there. I like SQLite Expert Personal on Windows and sqliteman on Linux. There is also an SQLite editor plugin for Firefox.


There are other advantages that SQLite gives you for free:

  • Data is indexable, which makes searching very fast. You simply cannot do this with XML, JSON, or any other text-only format.

  • Data can be edited partially, even when the amount of data is very large. You do not have to rewrite a few gigabytes just to change one value.

  • SQLite is fully transactional: it guarantees that your data is consistent at all times. Even if your application (or the whole computer) crashes, your data is automatically rolled back to the last known consistent state the next time you connect to the database.

  • SQLite stores your data verbatim: you do not need to worry about escaping special characters in your data (including zero bytes embedded in your strings). Simply always use prepared statements; that's all it takes. This can be a big and annoying problem when dealing with text data formats, XML in particular.

  • SQLite stores all strings as Unicode: UTF-8 (the default) or UTF-16. In other words, you do not need to worry about text encodings or international support in your data format.

  • SQLite allows you to process data in small chunks (row by row, in fact), so it works well in low-memory conditions. This can be a problem for any text-based format, because such formats often need to load all the text into memory to parse it. Granted, there are a few efficient stream-based XML parsers out there, but in general any XML parser will be quite memory-hungry compared to SQLite.

Having worked quite a bit with both XML and json, here's my rather subjective opinion of both as extendable serialization formats:

  • The format is a "standard": Yes for both
  • Easily integrates with C++: Yes for both. In each case you'll probably wind up with some kind of library to handle it. On Linux, libxml2 is a standard, and libxml++ is a C++ wrapper for it; you should be able to get both of those from your distro's package manager. It will take some small effort to get those working on Windows. There appears to be some support in Boost for json, but I haven't used it; I've always dealt with json using libraries. Really, the library route is not very onerous for either.
  • Supports tabular input (2d, n-dimensional): Yes for both
  • Supports POD types: Yes for both
  • Can expand as more inputs are required: Yes for both - that's one big advantage to both of them.
  • Binds well to variables: If what you mean is some way inside the file itself to say "This piece of data must be automatically deserialized into this variable in my program", then no for both.
  • As easy to write (reflect) as it is to read: Depends on the library you use, but in my experience yes for both. (You can actually do a tolerable job of writing json using printf().)
  • Works well on Windows and Linux: Yes for both, and ditto Mac OS X for that matter.
  • Supports one file referencing another file to read: If you mean something akin to C's #include, then XML has some ability to do this (e.g. external entities or XInclude), while json doesn't.
  • Human readable: Both are typically written in UTF-8, and permit line breaks and indentation, and thus can be human-readable. However, I've just been working with a 479 KB XML file that's all on one line, so I had to run it through a prettyprinter to make sense of it. json can also be pretty unreadable, but in my experience is often formatted better than XML.

When starting new projects, I generally prefer json; it's more compact and more human-readable. The main reason I might select XML over json would be if I were worried about receiving badly-formed documents, since XML supports automated document format validation, while you have to write your own validation code with json.

Check out Google Protocol Buffers. They handle most of your requirements.

From their documentation, the high level steps are:

  1. Define message formats in a .proto file.
  2. Use the protocol buffer compiler to generate C++ classes.
  3. Use the C++ protocol buffer API to write and read messages.
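For step 1, a hypothetical .proto definition for a tabular input might look like this (the message and field names are made up for illustration):

```proto
syntax = "proto3";

// One row of a table of doubles.
message Row {
  repeated double cell = 1;
}

// A named table made of rows.
message Table {
  string name = 1;
  repeated Row rows = 2;
}
```

Note that the wire format is binary, so it trades away the human-readability requirement for compactness and speed.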

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow