When to break up a large Git repository into smaller ones?

Question 1

That process can be guided by a component approach, where you identify coherent sets of files (an application, a project, a library).

In terms of history (in a source control tool), a coherent set is one that will be labelled, branched, or merged as a whole, independently of the other sets of files.

For a distributed version control system (like Git), each of those sets of files is a good candidate for a Git repo of its own, and you can then group the ones you need for a specific project in a parent repo with submodules.
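As a rough illustration of grouping component repos under a parent, here is a minimal sketch. The names "server", "client" and "parent" are made up, and the component repos are local stand-ins created on the spot. The protocol.file.allow setting is only there because the demo clones from local paths, which recent Git versions restrict for submodules by default.

```shell
#!/bin/sh
# Sketch only: "server", "client" and "parent" are hypothetical names,
# and the component repos are local stand-ins created on the spot.
set -e
work=$(mktemp -d)
cd "$work"

# Identity and file-protocol settings are passed inline so the demo
# does not depend on (or change) your global Git configuration.
demo_git() {
  git -c user.name=demo -c user.email=demo@example.com \
      -c protocol.file.allow=always "$@"
}

# One small repository per component.
for name in server client; do
  git init -q "$name"
  echo "$name component" > "$name/README"
  demo_git -C "$name" add README
  demo_git -C "$name" commit -q -m "Initial $name commit"
done

# The parent repo groups the components as submodules.
git init -q parent
cd parent
for name in server client; do
  demo_git submodule add "$work/$name" "$name"
done
demo_git commit -q -m "Group server and client as submodules"

# Lists one line per submodule with the commit the parent pins.
git submodule status
```

With real projects you would pass the hosted URL of each component to git submodule add instead of a local path.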

I describe this approach, for instance, in:

"Git repository setup for a project that has a server and client" (server and client being two obvious coherent separate sets which benefit from having their own repo)
"What is Component-Driven Development?"

The opposite (keeping everything in one repo) is called the "system-based approach", but it can lead to a huge Git repo, which, as I mentioned in "Performance for Git", isn't compatible with how Git is implemented.

The OP onionjake asks in the comments:

Could you please include more information on the subtleties of identifying components?

This process (of identifying "components", which in turn become Git repos) is guided by the software architecture of your system.
Any subset which acts as an independent set of files is a good candidate for its own repo. It can be a library or a DLL, but also part of an application (a GUI, a client vs. a server, a dispatcher, ...)

Each time you identify a group of tightly linked files (meaning that modifying one will likely affect the others), they should be part of the same component or, in Git terms, the same repo.

Question 2

Personally I like small repos - they work well when you have a good dependency management system like Composer for PHP.

It takes the pain out of managing the checkout process and also tracks versions.

It also permits repos to be hosted by different providers. We use a combination of bespoke code and open source repos.

Question 3

I would say go with subtrees most of the time, if not all the time, and feel free to create subtrees wherever you see fit.

With lots and lots of dependencies, submodules start to become painful. If you have any hand in the development of those dependencies, that goes double. A submodule might be OK if you have a completely third-party library that doesn't change versions very often, and that you would never actively develop as part of your larger project.

Submodules are too separated from the super-repo for dependencies you actually work on.

Example: If you make a change to a submodule, you have to commit in the submodule, push up, cd up to the super repo, add the submodule to the index/stage, commit it, and push up again. It's a hassle of a workflow. Not to mention the hassle of removing, moving, or renaming a submodule.
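To make that sequence concrete, here is a sketch of it in a throwaway directory. Everything named here ("lib", "super", the bare repos standing in for remotes) is made up for the demo, and the protocol.file.allow setting is only needed because the submodule is cloned from a local path. The numbered comments map to the steps listed above.

```shell
#!/bin/sh
# Sketch only: "lib" and "super" are hypothetical names; the bare repos
# below stand in for remotes you would normally host elsewhere.
set -e
work=$(mktemp -d)
cd "$work"
demo_git() {
  git -c user.name=demo -c user.email=demo@example.com \
      -c protocol.file.allow=always "$@"
}

# --- Setup: a library remote, a super-repo remote, and a working clone ---
git init -q --bare lib.git
git init -q --bare super.git
git clone -q lib.git lib-seed
echo "v1" > lib-seed/lib.txt
demo_git -C lib-seed add lib.txt
demo_git -C lib-seed commit -q -m "lib v1"
demo_git -C lib-seed push -q origin HEAD
git clone -q super.git super
cd super
demo_git submodule add "$work/lib.git" lib
demo_git commit -q -m "Add lib as a submodule"
demo_git push -q origin HEAD

# --- The workflow described above, step by step ---
cd lib                                   # 1. work inside the submodule
echo "v2" > lib.txt
demo_git commit -q -am "lib v2"          # 2. commit in the submodule
demo_git push -q origin HEAD             # 3. push the submodule
cd ..                                    # 4. back up to the super repo
demo_git add lib                         # 5. stage the new submodule commit
demo_git commit -q -m "Bump lib to v2"   # 6. commit the pointer change
demo_git push -q origin HEAD             # 7. push the super repo
```

Seven steps for a one-line change, and skipping step 3 leaves the super repo pointing at a commit nobody else can fetch.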

Git subtrees are much better. The histories are intertwined, but you can split out a directory as a subtree on a whim. If you decide you don't want something to be a subtree anymore, just stop performing subtree splits or pushes.
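A minimal sketch of splitting a directory out, assuming the git subtree command is installed alongside Git. The names "project", "lib" and "lib-only" are made up for the demo.

```shell
#!/bin/sh
# Sketch only: "project", "lib" and "lib-only" are hypothetical names.
# Assumes the git-subtree command is installed alongside Git.
set -e
work=$(mktemp -d)
cd "$work"
demo_git() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# A project whose history mixes application and library changes.
git init -q project
cd project
mkdir lib
echo "app" > app.txt
echo "lib" > lib/lib.txt
demo_git add app.txt lib/lib.txt
demo_git commit -q -m "Add app and lib"

# Extract the history of lib/ onto its own branch, with lib/ as the root.
demo_git subtree split --prefix=lib --branch=lib-only

# The new branch contains only the library's files.
git ls-tree --name-only lib-only

# From here the branch could be pushed to a separate repository, e.g.
#   git push <lib-remote-url> lib-only:main
# (the remote URL and target branch are placeholders).
```

The original project is left untouched by the split; lib/ stays where it is and the new branch is just an extra view of its history.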

The downside to subtrees is that they aren't tracked at all, so you have to remember all the paths and how they map to their repositories, and anyone else working on the project also has to know that if they want to perform subtree operations. The good news is that most developers can just work on the code of any dependency without worrying about how it will be pushed out to those repos. Also, as you said, some bash scripts can help automate the manual steps.
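One way such a script might look, as a sketch: keep the path-to-repository mapping in a small manifest file and generate the subtree commands from it. The manifest format, the paths and the example.com URLs are all invented for illustration, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: the manifest format, prefixes and URLs are hypothetical.
# Each manifest line is "<prefix> <repository> <branch>".
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
vendor/libfoo https://example.com/libfoo.git main
vendor/libbar https://example.com/libbar.git main
EOF

# Print (not run) one git subtree command per manifest entry.
# The action is "pull" or "push"; both take a repository and a ref
# after the --prefix option.
subtree_commands() {
  action=$1
  while read -r prefix remote branch; do
    [ -n "$prefix" ] || continue   # skip blank lines
    echo "git subtree $action --prefix=$prefix $remote $branch"
  done < "$manifest"
}

subtree_commands push
```

Committing a manifest like this next to the code gives everyone on the project the same record of which directory belongs to which repository.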

Question 4

When you have code with a good reuse case across multiple projects, consider splitting it out into a sub-project. I would avoid creating a shared project before you have two projects that use it.

Criteria I would use to consider making a sub-project repo:

Is it used by multiple projects?
Is it self contained?
Does it change frequently?

I find subtrees the easiest to manage in that I can develop the library as part of a project and then split it off when the need arises.

I'd also just like to point out, it's perfectly okay for 2 projects to diverge on common libraries, and often preferred in order to keep them in a stable state. So long as it's easy to converge common code, I see no harm in taking a lazy approach to sharing libraries.

In any case, it's a good sign to have this problem; it means you have done a good job of making re-usable code. :)

Question 5

When you're working in a distributed environment, given the features of Git, you should avoid grouping different components directly into a single repository if those components are used by other projects, if you plan for them to be, or if it is probable or desirable that this will happen in the future.

This is because developers/contributors will be able to focus on their part without needing to download the full history of every other component they are not going to use or change. Keep in mind that this is also crucial if you're working with contributors from countries or areas where internet speed is slower than what we're used to.

Since you have tried and understood various methods, you are not starting from limited knowledge, so it shouldn't be a hard task. As far as I know, you have covered all the possible alternatives.

I wouldn't worry about having dozens or potentially hundreds of smaller repositories if they are reasonably independent from the main repository. Having so many repositories will only increase the initial setup time of your new main repository.

You should favor the big repository solution only if you need to migrate "immediately" from Subversion, or if the people involved have little or no knowledge of the alternatives.

I would use git subtree because it ships with Git in most installations: users will usually not need to install anything beyond Git, and it will stay around as long as Git does.