I'm currently quite curious about how other programmers organise data into files. Can anyone recommend any good articles or books on best practices for creating file structures?
For example, if you've created your own piece of software for whatever purpose, do you leave the saved data as plain text, serialize it, or encode it as XML, and why?
Are there any secrets I've missed?
Generally, go with the simplest thing that can possibly work, at least at first. Consider, e.g., UNIX, where most of the configuration files are nothing but whitespace-delimited fields, or fields delimited with another character (like /etc/passwd, which uses ":" delimiters because the GCOS field can contain blanks).
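For illustration, a minimal sketch in Python of parsing such a colon-delimited file (the field layout follows the traditional passwd format; the helper name is my own):

```python
# Sketch: parsing a colon-delimited UNIX-style file such as /etc/passwd.
def parse_passwd(text):
    entries = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _pw, uid, gid, gecos, home, shell = line.split(":")
        entries.append({"name": name, "uid": int(uid), "gid": int(gid),
                        "gecos": gecos, "home": home, "shell": shell})
    return entries

sample = "root:x:0:0:System Administrator:/root:/bin/sh"
print(parse_passwd(sample)[0]["gecos"])  # the GCOS field may contain blanks
```

Note that the delimiter choice matters precisely because of cases like the GCOS field: splitting on whitespace would break "System Administrator" in two.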
If your data needs a lot more structure, then ask yourself "what tools can I use easily?" Python and Ruby have JSON and YAML, for example.
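In Python, for instance, the whole round trip is a few lines with the standard-library json module (the config structure here is just an illustration):

```python
import json

# Serialize a nested structure to human-readable text and parse it back.
config = {"window": {"width": 800, "height": 600}, "recent": ["notes.txt"]}
text = json.dumps(config, indent=2)   # human-readable, diff-friendly
restored = json.loads(text)
assert restored == config  # lossless round trip for basic types
```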
XML is basically useful if you have lots of XML-based stuff already, OR you expect to transform the XML to a displayable form in a browser. Otherwise, it's usually very heavyweight (code size, complexity) for what you get from it.
No matter which format you choose, remember to store some kind of version number inside (I'm pretty sure you'll have to introduce changes sooner or later).
Format depends heavily on the application and the amount of data. For some applications XML is appropriate; for others, fixed-size structs stored in a binary file are a better fit.
I use many different formats, depending on situation, for example:
- plain text file (delimited) for storing datasets for Matlab and R analysis
- binary files - for storing fixed-size structures (with dynamically sized records, random access gets difficult without maintaining a separate array of offsets to the elements). On the positive side you get performance and space efficiency (why do most databases store data in binary format?), but it is not very friendly for humans to work with. Remember endianness.
- XML - usually for configuration data, or data that I want to give to other users' applications (along with an XSD). The other side can write a nice XSLT transformation or consume the data in some other manner (of course they could do the same with plain text or binary data, given a format description)
Unless you have unique requirements, use something for which there is already a mature library, so you can avoid writing your own parsing code. That means XML, JSON, etc., as people have said.
One other nice one is Google's protocol buffers (http://code.google.com/p/protobuf). There you write a common message definition and the protocol buffer compiler generates objects for filling out, serializing, and deserializing the data for you. Typically the format is binary, but you can use their TextFormat class to write JSON-like plain text too. The nice thing about protobufs is that the versioning code is generated for you. In version 2 of your file format, all you have to do is add fields to the .proto definition file. The new version can read the old file format, and just leaves the new fields blank. It's not exactly what protobufs were designed for, but they make an easy, efficient binary file format for custom messages, and the code is generated for you.
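A hedged sketch of what such a .proto definition might look like for the versioning scenario described above (proto2 syntax; the message and field names are purely illustrative):

```proto
// Hypothetical save-file message; protoc generates the (de)serialization code.
message SaveFile {
  required int32 format_version = 1;
  repeated string recent_files  = 2;
  // Added in version 2 of the format; readers of old files
  // simply see this field as unset.
  optional string last_user     = 3;
}
```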
Also see Facebook's Thrift, now in the Apache incubator.
As the years have gone by I've found myself more and more favoring text unless it's simply out of the question. CPUs are fast enough now that we can decode it quickly.
Obviously, when you have to frequently update little pieces of information inside a big file this isn't an option, but that workload most likely calls for a database anyway.
It would take an unusual situation at this point to make me go with something other than one of these two options.
+1 for XML. It has a little overhead, but it's easy to parse, read, and debug. It can be strict if you use a schema. It's easy to transform with XSLT, and very portable (on the wire or just on a pen drive :)
This really depends upon the particular situation. You would need to consider your options against the answers to various questions:
- How much data do you need to store? Do you need to optimise for compact representation?
- Is the performance of reads/writes critical? Do you need to optimise for disk access and low-impact serialisation and deserialisation?
- Do you need random access within the file? Do you need to optimise the structure for seeking within the data?
- Is this data going to be used across different systems, possibly with different character encodings? Do you need to optimise for portability?
The nature of the data itself will have an impact. Is it a flat list structure? Is it a tree? Is it a cyclic graph? Are the records of fixed or variable width?
Once the answers to these questions are known, you can select amongst your options, keeping it as simple as possible. Often the popular options (XML, CSV, YAML) will suit your purposes. If not, then you'll have to develop your own formatting and your own writing and reading procedures.
There are so many possibilities, but the most pragmatic has to be XML:
- There are decent XML libraries for nearly every development platform
- Most platforms allow object graph serialisation with a couple of lines of code, so XML is painless to implement
- Most platforms have an in-memory and/or streaming reader, so you can handle really large files without too much memory usage
- Most platforms provide an XSLT transformer, so you can move files from one format to another, even from XML to non-XML
- There are indexing extensions for XML to handle really large files too
- XML has XSDs to validate the format before you attempt to read it
- XML is capable of representing any simple or complex object
- If you are worried about file size, just zip the final XML. This technique is used in Microsoft Office, etc.
- XML is still human readable
- XML is a common standard
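To make the "painless to implement" point concrete, here is a minimal sketch using Python's standard-library ElementTree (the element names and attribute are illustrative):

```python
import xml.etree.ElementTree as ET

# Build a small document, serialize it to text, and parse it back.
root = ET.Element("library")
book = ET.SubElement(root, "book", attrib={"id": "b1"})
ET.SubElement(book, "title").text = "Example Title"
xml_text = ET.tostring(root, encoding="unicode")

parsed = ET.fromstring(xml_text)
print(parsed.find("book/title").text)
```

For truly large files you would switch to the streaming `ET.iterparse` interface instead of loading the whole tree, which is the in-memory vs. streaming distinction mentioned above.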