Question

I'm designing an append-only ("journaled") file format, and I'd like it to be friendly towards being stored in VCS (git/Mercurial/...).

On one hand, I'd like each change to add only the delta (the difference from the previous file state, i.e. some trailing bytes) to the repository, not the whole contents of the file again. For this, I'm considering making the format "pseudo-text", that is, with no NUL (0x00) byte in the contents (or maybe even some stricter subset of UTF-8), to make it easily diffable by git/Mercurial.

On the other hand, the "pseudo-text" file format would not lend itself to merging; a merge would totally cripple the contents. To avoid that possibility, I'm inclined to make it "binary", even if only by putting a NUL byte at offset 0 in the file. But then, merging is not always really possible even for structured "text" file formats, even in "typical" cases like source code, so maybe there is no need to worry here? I'm quite certain there will be conflicts all over the place if anyone tries to merge such files, so that could be enough of a warning sign.

Have you had experience with a similar choice in the past? Which choice should I make, and why?


Solution

I don't know about Mercurial, but git always initially stores the whole content, and the repack operation (part of gc; by default run automatically when there are too many "loose" objects) then finds binary deltas. In git these may be against an older version than the immediately previous one, but in your case the deltas against the previous version will be the smallest, so git will most likely choose them. Both the initial full copy and the delta are stored deflated.

So the choice between text and binary has negligible effect on the storage size.
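
To make this concrete, here is a minimal sketch of why an append-only change yields a tiny delta whether the content is text or binary. It is not git's actual delta algorithm (git uses more general copy/insert instructions against a base object); `append_delta`, `apply_append_delta` and the two sample journal layouts are hypothetical and exist only for illustration.

```python
import zlib


def append_delta(old: bytes, new: bytes) -> bytes:
    """Return the bytes that were appended to `old` to obtain `new`.

    Illustrative only: this models the special case of an append-only
    file, not git's real delta format.
    """
    if not new.startswith(old):
        raise ValueError("not an append-only change")
    return new[len(old):]


def apply_append_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild the new version from the old version plus the delta."""
    return old + delta


# Hypothetical "pseudo-text" journal: one line per entry, no NUL bytes.
text_old = b"".join(b"entry %d\n" % i for i in range(1000))
text_new = text_old + b"entry 1000\n"

# Hypothetical binary journal: NUL marker, then 4-byte big-endian entries.
binary_old = b"\x00" + b"".join(i.to_bytes(4, "big") for i in range(1000))
binary_new = binary_old + (1000).to_bytes(4, "big")

for name, old, new in (("text", text_old, text_new),
                       ("binary", binary_old, binary_new)):
    delta = append_delta(old, new)
    # Compare the full size with the raw and deflated delta sizes.
    print(name, len(new), len(delta), len(zlib.compress(delta)))
```

In both cases the delta is just the new trailing entry, which is the point made above: the encoding of the entries does not change how much has to be stored per revision.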

The big advantage of a text file is that you can debug it with a simple text viewer, and the diff will show reasonable information. Merging will always cause a conflict, because all changes are always at the end. Whether the conflict is resolvable, and whether resolving it makes sense, depends on the actual format and the dependencies between its entries, but you'll always have control over that.
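
As a rough illustration of what resolving such a conflict could look like, the sketch below merges two versions that both only appended to a common base. It assumes that entries are independent of each other and that the relative order of the two sides does not matter; if your format has dependencies between entries, this is not a valid resolution. The function name `merge_appends` is hypothetical.

```python
def merge_appends(base: bytes, ours: bytes, theirs: bytes) -> bytes:
    """Merge two journals that both only appended to `base`.

    Assumption: entries are independent, so placing our new entries
    before theirs is acceptable.
    """
    if not (ours.startswith(base) and theirs.startswith(base)):
        # Someone changed existing entries; this needs a human decision.
        raise ValueError("a side modified existing entries; resolve manually")
    # Keep everything we have, then add only what the other side appended.
    return ours + theirs[len(base):]


base = b"entry 1\n"
ours = base + b"entry 2 (ours)\n"
theirs = base + b"entry 2 (theirs)\n"
merged = merge_appends(base, ours, theirs)
print(merged.decode())
```

The same idea works for a binary layout as long as entries are self-delimiting, since only the common prefix and the two appended tails are involved.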

Note, however, that you can get diffs to work even with a binary format, because in git you can specify a custom diff program for specific files (via gitattributes).
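
One way to do this is a `textconv` filter: a program that git runs with the file name as its single argument and whose standard output is diffed as text. Below is a minimal sketch of such a filter. The binary layout (a NUL byte at offset 0 followed by length-prefixed records), the `*.journal` pattern and the script name `journal2text.py` are all assumptions made for the example, not part of the original question.

```python
# Wiring, done once per repository (names are hypothetical):
#   .gitattributes:  *.journal diff=journal
#   git config diff.journal.textconv "python3 journal2text.py"
import struct
import sys


def dump_journal(data: bytes) -> str:
    """Render a hypothetical binary journal as one hex line per record.

    Assumed layout: a leading NUL byte, then records consisting of a
    4-byte big-endian length followed by that many payload bytes.
    """
    if data[:1] != b"\x00":
        raise ValueError("missing NUL marker at offset 0")
    lines = []
    pos = 1
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated record length")
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        payload = data[pos:pos + length]
        if len(payload) != length:
            raise ValueError("truncated record payload")
        lines.append(payload.hex())
        pos += length
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__" and len(sys.argv) == 2:
    # git passes the path of the file to convert as the only argument.
    with open(sys.argv[1], "rb") as f:
        sys.stdout.write(dump_journal(f.read()))
```

With one line per record, appending an entry to the binary file shows up in `git diff` as a single added line, much like it would for a text format.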

Licensed under: CC-BY-SA with attribution