Question

I'm wondering if there are any general guidelines or best practices regarding when to split data out into a separate metadata format, as opposed to embedding it directly within the data (specific example below).

My understanding of metadata is that it describes data (without the need to actually look at the data), allowing the data to be quickly searched or filtered for easy access.

Let's take, for example, a simple 3D model format. The actual data file is a binary file containing vertices and colors. Things like creation date, modification date and author name describe the binary data, so I would say these belong in metadata (outside of the binary file).

But the following questions arise:

  • What if the application had no need to search or filter by these fields?
  • Would it be acceptable to embed these fields directly into the binary data itself?
  • Could they be duplicated in both the binary data and the metadata, or would this be considered bad practice?
  • What about more ambiguous fields such as the model name, which could be considered part of the data itself, but also as data describing the binary data?
  • How do you decide which data to embed in the actual binary file, as opposed to separating into a more flexible metadata format?

Thanks!


Solution

I think saving the metadata inside the binary file and providing a specification so anyone can program an API has its advantages.

Many binary formats include the metadata inside the file itself, with a public specification or API describing how to access it. Examples include the ID3 tags of MP3 files, the metadata of PDF files, the EXIF data of images, etc.

That ensures the metadata travels with the file wherever it goes.

Applications have no problem reading that metadata to populate a database, or even updating the metadata in the file itself, as iTunes or Rhythmbox do with audio files.

OTHER TIPS

You have the distinction between data and metadata spot on.

Two aspects come to mind for this. One is programming and the other is manageability. On the programming side I agree that "one file, one data" has its appeal. You can write a nice, clean handler for the binary data without cluttering it with messy meta-shenanigans. For manageability, anyone who uses the binary is going to want to know its provenance, quality, timeliness, etc. Separating these is a problem waiting to happen. One day you copy the binary and not the meta, and two weeks later you can't remember which is which or where it came from.

On balance I opt for an all-in-one-file approach when I have the choice. The file handler can be a wrapper around the binary-parsing and meta-parsing bits. If an application never calls for the meta then the relevant part of code never gets called. The size of metadata is rarely going to be a concern compared to the binary portion, either.
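To make the wrapper idea concrete, here is a minimal sketch of an all-in-one file. The layout is purely an assumption for illustration (a made-up `MDL1` signature, a 4-byte length, a JSON metadata block, then the raw binary payload), and all function names are hypothetical. The point is only that the payload reader can skip the metadata without parsing it, and vice versa.

```python
import io
import json
import struct

MAGIC = b"MDL1"  # hypothetical 4-byte signature, not a real format


def write_model(stream, metadata, payload):
    """Write signature, metadata length, JSON metadata, then the binary payload."""
    meta_bytes = json.dumps(metadata).encode("utf-8")
    stream.write(MAGIC)
    stream.write(struct.pack("<I", len(meta_bytes)))
    stream.write(meta_bytes)
    stream.write(payload)


def _read_meta_length(stream):
    """Validate the signature and return the size of the metadata block."""
    if stream.read(4) != MAGIC:
        raise ValueError("not a model file")
    (meta_len,) = struct.unpack("<I", stream.read(4))
    return meta_len


def read_model_metadata(stream):
    """Parse only the metadata; the payload is never read."""
    meta_len = _read_meta_length(stream)
    return json.loads(stream.read(meta_len).decode("utf-8"))


def read_model_payload(stream):
    """Skip over the metadata block and return the raw binary payload."""
    meta_len = _read_meta_length(stream)
    stream.seek(meta_len, io.SEEK_CUR)
    return stream.read()


buf = io.BytesIO()
write_model(buf, {"author": "alice"}, b"\x00\x01\x02")
buf.seek(0)
print(read_model_metadata(buf))
buf.seek(0)
print(read_model_payload(buf))
```

An application that never needs the metadata only ever calls the payload reader, so the metadata-parsing code simply goes unused.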

As the name implies, "metadata" goes "beyond the data". There are no general rules about what it should contain, how it should be structured, or where it should live.

Metadata inside the same file

There are a lot of examples: image files may contain EXIF data, MP3 files may contain ID3 tags, etc.

Sometimes it's at the beginning of the file (harder to edit), and sometimes it is appended at the end (easier to add or edit).

You don't have much freedom to change the metadata structure and should adhere to a predefined format.

Metadata as a "side-car" file

If you prefer not to touch the original file, you can use a side-car file. You have much more freedom to change the structure of the metadata, as you don't tamper with the original file in any way.
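A side-car can be as simple as a JSON file sitting next to the original. The sketch below assumes a naming convention of appending `.meta.json` to the data file's name; that convention and the function names are hypothetical choices for this example.

```python
import json
from pathlib import Path


def sidecar_path(data_path):
    """Return the side-car location: the data file's name plus '.meta.json'."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".meta.json")


def write_sidecar(data_path, metadata):
    """Store metadata next to the data file without opening the data file."""
    sidecar_path(data_path).write_text(json.dumps(metadata, indent=2), encoding="utf-8")


def read_sidecar(data_path):
    """Load the side-car metadata, or return None if no side-car exists."""
    path = sidecar_path(data_path)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

The trade-off is that whoever copies or moves the data file has to remember to bring the side-car along, because nothing ties the two together except the file name.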

Metadata in a database

There are certain applications where the metadata isn't even a physical file, for example many document management applications. The advantages are fast searching and having all the metadata in one central place.
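For the database option, a small sketch using SQLite from the Python standard library could look like the following. The table name, columns, and function names are assumptions for illustration; a real document management system would have its own schema.

```python
import sqlite3


def open_index(db_path=":memory:"):
    """Open (or create) a metadata index; the schema here is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS model_metadata ("
        " path TEXT PRIMARY KEY,"
        " author TEXT,"
        " created TEXT)"
    )
    # An index on author lets searches avoid scanning every row.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_author ON model_metadata(author)")
    return conn


def upsert_metadata(conn, path, author, created):
    """Insert or overwrite the metadata row for one file."""
    conn.execute(
        "INSERT OR REPLACE INTO model_metadata (path, author, created) VALUES (?, ?, ?)",
        (path, author, created),
    )
    conn.commit()


def find_by_author(conn, author):
    """Return the paths of all files by a given author, without opening any file."""
    rows = conn.execute(
        "SELECT path FROM model_metadata WHERE author = ? ORDER BY path",
        (author,),
    )
    return [row[0] for row in rows]
```

Searching and filtering then never touch the binary files at all, at the cost of having to keep the database in sync when files are moved or deleted.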

Conclusion:

Each has its pros and cons.

  • If you don't want to, or cannot, touch the file itself, then go for the side-car or the database.
  • If you want to do fast searches or have the metadata more centralized, then go for the database.
  • If you want to keep it compact, then put the metadata inside the file, and document your metadata structure well, or you may break the file's integrity.
Licensed under: CC-BY-SA with attribution