Question

We have a project which has data and code, bundled into a single Mercurial repository. The data is just as important as the code (it contains parameters for business logic, some inputs, etc.). However, the format of the data files changes rarely, and it's quite natural to change the data files independently from the code.

One advantage of the unified repository is that we don't have to keep track of multiple revisions: if we ever need to recreate output from a previous run, we only need to update the system to the single revision number stored in the output log.

One disadvantage is that if we modify the data while multiple heads are active, we may lose the data changes unless we manually copy those changes to each head.

Are there any other pros/cons to splitting the code and the data into separate repositories?


Solution

Multiple repos

  • pros

    • component-based approach (you identify groups of files that can evolve independently one from another)
    • configuration specification: you list the references (here "revisions") you need for your system to work. If you want to modify one part without changing the other, you update that list.
    • partial clones: if you don't need all components, you can clone only the ones you want (doesn't apply in your case)
  • cons

    • configuration management: you need to track that configuration (usually through a parent repo, registering subrepos)
    • in your case, data is quite dependent on certain versions of the project (you can have new data which doesn't make sense for old versions of the project)

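With multiple repos, the "configuration specification" above means the output log has to store one revision per component instead of a single revision number. A minimal sketch of that, assuming the `hg` command is on the PATH and that the component names and paths (`code`, `data`) are hypothetical:

```python
import json
import subprocess


def hg_revision(repo_path, run=subprocess.check_output):
    """Return the working-directory revision id of the repo at repo_path.

    Mercurial appends '+' to the id when there are uncommitted changes.
    `run` is injectable so the function can be exercised without hg installed.
    """
    out = run(["hg", "identify", "--id", "-R", repo_path])
    return out.decode("ascii").strip()


def revision_manifest(repos, run=subprocess.check_output):
    """Map each component name to the current revision id of its repo."""
    return {name: hg_revision(path, run) for name, path in repos.items()}


def log_line(manifest):
    """Serialize the manifest as one JSON line suitable for an output log."""
    return json.dumps(manifest, sort_keys=True)
```

Calling `log_line(revision_manifest({"code": "path/to/code", "data": "path/to/data"}))` at the start of a run would record what is needed to recreate it later. A parent repo with subrepos gives you the same information as a single revision, at the cost of maintaining that parent.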
One repo

  • pros
    • system-based approach: you see your modules as one system (project and data).
    • repo management: all in one
    • tight link between modules (which can make sense for data)
  • cons
    • data propagation (when, as you mention, several heads are active)
    • intermediate revisions (not to reflect a new feature, but just because some data changes)
    • larger clone (not relevant here, unless your data includes large binaries)

For non-binary data, with infrequent changes, I would still keep them in the same repo.

OTHER TIPS

Yes, you should separate code and data. Keep your code in version control and your data in a database.

I love version control: I have been a programmer for more than ten years, and I like this job.

But over the last few months I realized: data must not be in version control. Sometimes it is hard for a person who is familiar with git (or another version control system) to "let it go".

You need a good ORM which supports database schema migrations. The migrations (both schema migrations and data migrations) are kept in version control, but the data is not.
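To illustrate the idea of versioned migrations without tying it to a particular ORM, here is a minimal hand-rolled sketch using Python's standard `sqlite3` module. It is a stand-in for what an ORM's migration tool does, not a replacement for one; the `parameter` table and its columns are hypothetical.

```python
import sqlite3

# Hypothetical migrations. In a real project each one would live in a file
# under version control, next to the code that depends on it.
MIGRATIONS = [
    (1, "CREATE TABLE parameter (name TEXT PRIMARY KEY, value REAL NOT NULL)"),
    (2, "ALTER TABLE parameter ADD COLUMN unit TEXT NOT NULL DEFAULT ''"),
]


def migrate(conn):
    """Apply every migration newer than the stored schema version.

    The version is kept in SQLite's built-in `user_version` pragma.
    Returns the schema version after migrating.
    """
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in MIGRATIONS:
        if version > current:
            conn.execute(statement)
            # PRAGMA values cannot be bound as parameters; `version` is an
            # int from the list above, never user input.
            conn.execute(f"PRAGMA user_version = {version}")
            current = version
    conn.commit()
    return current
```

Running `migrate` a second time on the same database is a no-op, so the code revision alone determines which schema the database ends up with, while the rows themselves stay outside version control.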

I know your question was about using one or two repositories, but maybe my answer helps you to get a different view point.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow