Using multiple Git repositories instead of a single one containing many apps from different teams? [duplicate]

https://softwareengineering.stackexchange.com/questions/206668

29-09-2020
|

Question

I am migrating a 10-years-old big CVS repository to Git. It seemed obvious to split this multiple-projects repository into several Git ones. But the decision-makers are used to CVS, therefore their point of view is influenced by CVS philosophy.

To convince them to migrate from one CVS repo to different Git repositories I need to give them some arguments.

When I speak with mates working on Git repo for years, they say that using multiple Git repo is the way to use Git. I do not know really why (they give me some ideas). I am a newbie in this field so I ask here my question.

What are the arguments to use multiple Git repositories instead of a single one containing different applications and libraries from different teams?

I have already listed:

branches/tags impact the whole Git repository files => pollutes other team projects
4GB limit Git repo size but this is wrong
git annotate may be slower on bloat Git repo...

Eamon Nerbonne has noticed the related question:
Choosing between Single or multiple projects in a git repository?
The reason the team managers finally have accepted the split: the single Git repo (550 MB) was requiring 13 minutes to be cloned on Windows _{(one minute on Linux)}.
The bloat CVS repo split in 100 Git repositories:
- each dead apps in one repo
- each stabilized library in one repo (source code almost never changed any longer)
- related apps/libs kept together in one repo
- moved large files not used for compilation (config...) to other repos (Git does not like large files)
- skipped other unrelevant files (*.jar, *.pcb, *.dll, *.so, *.backup ...)
Successfully installed the repo tool used by Android Open Source Project in order to handle all these Git repos:
- easy installation on Linux
- more difficult on Windows because of Cygwin and NTFS native symlinks requirements

Solution

You're dealing with multiple teams and multiple projects. Likely decades of work went into the codebase.

The short answer is that your teams and projects have varying needs and varying dependencies.

The monolithic repository approach reduces commits to "Everything is stable in this configuration!!!" (i.e. unrealistic, huge commits sourced from many teams). That, or many intermediate points of incompatibilities for many projects. Either way, there's a lot of wasted energy invested in supporting configurations which were simply never meant to be.

Your repositories should be structured independently instead, and should have multiple repositories which represent their dependencies. The dependencies should be configured, updated, and tested by the project maintainers at appropriate points in development.

ProjectA saw its last major release 3 years ago. It is in maintenance mode and has "older" system requirements. It should refer to an appropriate set of dependencies. It has 20 dependencies.
ProjectB was just released. It has more modern System Requirements, and was developed and tested by another team. It has 15 dependent libraries (=repos), 10 of which are shared with ProjectA. These projects generally refer to different commits of their dependent libraries. Dependencies are updated at appropriate points in development.
ProjectC is yet to be released. It's very similar to ProjectB, but includes significant changes and improvements to its dependencies. Developers of ProjectB are only interested in taking the stable releases of the dependencies they share with ProjectC. ProjectB's team makes some commits to the shared dependencies, although they are mostly bugfixes and optimizations at this time. A monolithic repository would either hold development of ProjectC back in order to maintain support for ProjectA, or ProjectC's changes would break A and B, or developers would just end up not sharing/reusing code.

With multiple (distributed) repositories, each team can work independently and minimize impacting the other projects while reusing and constantly improving the codebases. This also keeps teams from shifting focus/speed when changes come in from other teams. The centralized monolithic repository makes each team dependent on every team's move, and that would have to be synchronized.

OTHER TIPS

There doesn't seem to be an argument in favor of the big repo in this thread, so here's one:

The advantage of a big repo with all your code in it, is that you have a reliable source of truth. All the state in your overarching project is represented in that repo's history. You don't have to worry about questions like "What version of libA do I need to build libB from 3 months ago?" or "Did the integration tests start failing because of Susan's change in libC or Bob's change in libD?" or "Are there any callers left for evilMethod()?" It's all in the history.

When related projects are split into separate repos, git isn't keeping track of their relationships for you. Your build system needs to know where to find the code for all its dependencies, and more importantly what version of the code to build. You can "just build everything from master", but that makes it hard to reproduce past builds, hard to make changes (or rollbacks) that need to be synced across repos, and hard to keep branches in a stable state.

So the question isn't "One big repo or many small repos?" It's actually "One big repo or many small repos with tooling." What tool are you going to use? Repo (Android) and gclient (Chromium) from Google are two examples. Git submodules is another. All of those have major downsides that you have to weigh against the downsides of a big repo.

Edit: Here are some more answers Choosing between Single or multiple projects in a git repository?

PS: All that said, I'm working on a tool to hopefully make things better, for when you have to split repos or use other people's code: https://github.com/buildinspace/peru

Git tends to experience performance problems when used with large repositories.

To quote Linus:

And git obviously doesn't have that kind of model at all. Git
fundamnetally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

Emphasis mine. That's not to say your company's version control repository is "large," but this is one reason people tend to avoid large repositories within Git.

They want [something that can] show their changes across all projects instead of trying to remember what project they made a change [to].

Sourcetree (a free-as-in-beer GUI Git frontend) allows you to register multiple repositories, organise them into logical groups, and then view status across all of them at once: screenshot

I am not affiliated with them in any way.

TL;DR; the equivalent of a git repository is a CVS module, not a CVS repository.

CVS is designed with a notion of modules being a subdivision of a repository, and it is common to use CVS repositories with several modules having a quite independent life. As an example, it is easy to have branches specific to one module and not present in another.

git has not be designed with a notion of module, each git repository is limited to one module in CVS term. When you create a branch, it is valid for the whole repository.

Thus, if you want to import a CVS repository with several modules in git, you better create a repository per module, especially if the modules have a more or less independent life and don't shares things like branch and labels. (Due to different usage patterns of branches in CVS and git, you may even investigate the usefulness to have one repository per CVS branch; but for a migration from CVS to git, it is probable that your workflow will at the start be similar enough to a CVS workflow that it doesn't worth the pain).

If you're willing to play ball with them in order to appease, you could set it up this way. Or this method. Other than that, I think they're just expecting a single point of entry into the system to access assets.

Depending on access needs, the separated GIT repos might still be the better way to go, as "John Smith" may need access to certain data, but not other data. While "Suzy Que" may be a sys admin needing access to everything.

If you go with a single repo, you might run into issues with your internal access requirements. If it's an "everybody gets full access" type of thing, then I could possibly see their point of view.

Git migration help page from Eclipse suggests to reorganize the CVS/SVN directory tree into multiple Git repositories:

Now is a great time to refactor your code structure. Map the current CVS/SVN directories, modules, plugins, etc. to their new home in Git. Typically, one Git repository (.git) is created for each logical grouping of code -- a project, a component, and so on.

The arguments:

The trade-off here is that each additional Git repository adds extra overhead to your development process - all Git commands and operations happen at the level of a single Git repository. On the flip side, each repository user will have an entire copy of the repository history, making very large repositories cumbersome to work with for casual contributors.

Git operates on a whole tree at once, not on just the subdirectory you're on.

Let's say you have your project at

C:\MyCode\ProjectABC

And let's say these two files have changed:

C:\MyCode\ProjectABC\stuff.txt
C:\MyCode\ProjectABC\Stuff\MoreStuff\morestuff.txt

When you're in the project root and you do a git status, you'll see these files as having changed:

stuff.txt
Stuff\MoreStuff\morestuff.txt

If you cd to the MoreStuff directory, though, will you see only the morestuff.txt file? No. You'll still see both files, relative to your position:

..\..\stuff.txt
morestuff.txt

As a result, if you lump all your projects together in one big ol' Git repository, then every time you go to check in, you'll have to pick from among changes in every project.

Now there could be ways to mitigate that; for instance, you can try to make sure you at least temporarily commit your changes before switching to work on a different project. But that's a lot of overhead that each person on your team would have to deal with, compared to simply doing it the right way: one Git repository per project.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange