Question

Is there a distributed version control system (git, bazaar, mercurial, darcs etc.) that can handle files larger than available RAM?

I need to be able to commit large binary files (e.g. datasets, source video/images, archives), but I don't need to be able to diff them, just to commit and then update when the file changes.

I last looked at this about a year ago, and none of the obvious candidates allowed this, since they're all designed to diff in memory for speed. That left me with a VCS for managing code and something else ("asset management" software or just rsync and scripts) for large files, which is pretty ugly when the directory structures of the two overlap.

Solution

It's been 3 years since I asked this question, but as of version 2.0, Mercurial includes the largefiles extension, which accomplishes what I was originally looking for:

The largefiles extension allows for tracking large, incompressible binary files in Mercurial without requiring excessive bandwidth for clones and pulls. Files added as largefiles are not tracked directly by Mercurial; rather, their revisions are identified by a checksum, and Mercurial tracks these checksums. This way, when you clone a repository or pull in changesets, the large files in older revisions of the repository are not needed, and only the ones needed to update to the current version are downloaded. This saves both disk space and bandwidth.
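For illustration, here is a minimal Python sketch of the idea (not Mercurial's actual implementation): hash the big file in fixed-size chunks so it never has to fit in RAM, and hand the version control system only a small "standin" file containing the checksum. The function names and the .standin suffix are made up for this example.

    import hashlib

    def streaming_sha1(path, chunk_size=8 * 1024 * 1024):
        """Hash a file of arbitrary size without ever loading it fully into RAM."""
        digest = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def write_standin(path):
        """Write a tiny text file next to the big one; the VCS tracks this instead."""
        checksum = streaming_sha1(path)
        with open(path + ".standin", "w") as f:
            f.write(checksum + "\n")
        return checksum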

OTHER TIPS

No free distributed version control system supports this. If you want this feature, you will have to implement it.

You can write off git: they are interested in raw performance for the Linux kernel development use case. It is improbable they would ever accept the performance trade-off of scaling to huge binary files. I do not know about Mercurial, but they seem to have made choices similar to git's in coupling their operating model to their storage model for performance.

In principle, Bazaar should be able to support your use case with a plugin that implements tree/branch/repository formats whose on-disk storage and implementation strategy are optimized for huge files. If the internal architecture blocks you, and you release useful code, I expect the core developers will help fix the internal architecture. Also, you could set up a feature development contract with Canonical.

Probably the most pragmatic approach, irrespective of the specific DVCS, would be to build a hybrid system: implement a huge-file store, and commit references to blobs in that store to the DVCS of your choice.
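A rough sketch of that hybrid approach, assuming a plain directory addressed by content hash (the store path and function names here are hypothetical):

    import hashlib, os, shutil

    STORE = "/var/bigstore"          # hypothetical content-addressed huge-file store

    def _digest(path, chunk_size=8 * 1024 * 1024):
        """Hash in chunks so files larger than RAM are no problem."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def put(path):
        """Copy a huge file into the store; return the key to commit to the DVCS."""
        key = _digest(path)
        dest = os.path.join(STORE, key[:2], key)
        if not os.path.exists(dest):              # identical content is stored once
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copyfile(path, dest)
        return key

    def fetch(key, dest):
        """Materialize a stored blob back into the working tree on update."""
        shutil.copyfile(os.path.join(STORE, key[:2], key), dest)

Only the short key ever needs to live in the DVCS history.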

Full disclosure: I am a former employee of Canonical and worked closely with the Bazaar developers.

Yes: Plastic SCM. It's distributed and it manages huge files in 4 MB blocks, so it is never forced to load them entirely into memory. Find a tutorial on DVCS here: http://codicesoftware.blogspot.com/2010/03/distributed-development-for-windows.html

bup might be what you're looking for. It was built as an extension of git's functionality for doing backups, but that's effectively the same thing. It breaks files into chunks and uses a rolling hash to make the file content-addressable and to store it efficiently.
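To illustrate the chunking idea, here is a toy sketch (not bup's actual algorithm or parameters): a rolling hash over a small sliding window decides where chunk boundaries fall, so boundaries depend only on nearby content and an edit early in a file does not shift every later chunk.

    WINDOW = 64        # bytes in the sliding window
    MASK = 0x0FFF      # more mask bits => rarer boundaries => larger average chunks

    def chunks(stream, window=WINDOW, mask=MASK):
        """Yield content-defined chunks using a simple additive rolling hash.
        Byte-at-a-time for clarity, not speed."""
        buf, win, rolling = bytearray(), bytearray(), 0
        while True:
            byte = stream.read(1)
            if not byte:
                break
            buf += byte
            win += byte
            rolling += byte[0]
            if len(win) > window:
                rolling -= win.pop(0)                # slide the window forward
            if len(buf) >= window and (rolling & mask) == mask:
                yield bytes(buf)                     # boundary found
                buf = bytearray()
        if buf:
            yield bytes(buf)

Each chunk can then be stored under its own hash, so chunks that are identical across revisions are stored only once.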

I think it would be inefficient to store binary files in any form of version control system.

A better idea would be to store metadata text files in the repository that reference the binary objects.
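For example, such a metadata file might hold just enough to locate and verify the real object (a hypothetical format, not any particular tool's, paired here with a content-addressed store like the one sketched above):

    import os

    STORE = "/var/bigstore"                    # hypothetical huge-file store

    def write_pointer(pointer_path, key, size):
        """The small text file that actually gets committed to the repository."""
        with open(pointer_path, "w") as f:
            f.write(f"oid sha256:{key}\nsize {size}\n")

    def resolve(pointer_path):
        """Map a committed pointer file back to the real binary in the store."""
        with open(pointer_path) as f:
            fields = dict(line.split(None, 1) for line in f if line.strip())
        key = fields["oid"].strip().split(":", 1)[1]
        return os.path.join(STORE, key[:2], key)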

Does it have to be distributed? Supposedly the one big benefit Subversion has over the newer, distributed VCSes is its superior ability to deal with binary files.

I came to the conclusion that the best solution in this case would be to use ZFS.

Yes, ZFS is not a DVCS, but it gives you the basic operations (a sketch follows the list):

  • You can allocate space for a repository by creating a new filesystem (dataset)
  • You can track changes by creating snapshots
  • You can send snapshots (commits) to another ZFS dataset
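A minimal sketch of that workflow driven from Python, assuming the standard zfs command-line tools (snapshot, send, receive) are available; the dataset names are made up:

    import subprocess

    DATASET = "tank/repo"            # hypothetical ZFS dataset holding the working tree

    def zfs(*args):
        """Run a zfs command and fail loudly if it errors (usually needs privileges)."""
        subprocess.run(["zfs", *args], check=True)

    def commit(tag):
        """'Commit' the current state of the dataset as a snapshot."""
        zfs("snapshot", f"{DATASET}@{tag}")

    def push(prev_tag, tag, remote_dataset="backup/repo"):
        """Send only the changes between two snapshots to another dataset."""
        send = subprocess.Popen(
            ["zfs", "send", "-i", f"{DATASET}@{prev_tag}", f"{DATASET}@{tag}"],
            stdout=subprocess.PIPE,
        )
        subprocess.run(["zfs", "receive", remote_dataset], stdin=send.stdout, check=True)
        send.stdout.close()
        send.wait()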