Question

BRIEF:

How do you ensure that there is no unsaved work in any of the distributed repository clones of a DVCS?

I am thinking about this specifically for Mercurial, but it also applies to git, bzr, etc.

DETAIL:

Back in the bad old days I used to run cron jobs that did the equivalent of the following - pseudocode, because I may not remember the CVS commands:

find all checked-out CVS trees
   do a cvs status command (or a dry run, something like cvs -n update)
   | grep '^M' to find all modified files not yet committed to the central repo

(Those days were bad (1) because we were using CVS, and (2) because from time to time I was the guy in charge of making sure nothing got lost. OK, that last part was not so bad, but it was ulcer-inducing.)

Q: how do I do the equivalent for a modern DVCS like Mercurial? I thought it was easy, but on closer inspection there are pieces missing:

I started off by doing something like

find all ...path/.hg directories, and then look at ...path
    do hg status - look at the output  // this is easy enough
    do hg outgoing // this is where it gets interesting

You might think that doing an hg outgoing is good enough. But it isn't necessarily.

Consider:

cd workspace-area
hg clone master repo1
hg clone repo1 repo2
rm -rf repo1
hg clone repo2 repo1

Now repo1's default path is repo2, and vice versa.
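
You can see the cycle with hg paths, since a clone records its source as the new repository's default path. The output would look something like this (the paths are illustrative):

$ hg paths -R repo1
default = /path/to/workspace-area/repo2
$ hg paths -R repo2
default = /path/to/workspace-area/repo1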

Of course, this won't happen if you have the right workflow - if you only ever clone from something upstream of you, never from a peer. But... lightweight cloning is part of the reason to use a DVCS. Plus, it has already happened to me.

To handle this problem, I usually have an hg path somewhere in my ~/.hgrc, set to some project-master URL. This works fine - for that one project. Not so fine if you have many, many projects. Even if you call them project1-master, project2-master, etc., there just get to be a lot of them. Worse still if subrepos are proliferating because of libraries that want to be shared between projects.

Also, this has to be in the user's .hgrc. Or a site .hgrc. Not so good for somebody who may not have that .hgrc set up - like an admin who doesn't know the ins and outs of each of several dozen (or hundreds) of projects on his systems - but who still wishes to do his users the favor of finding stale work. (They may have come to expect it.) Or if you simply want to give standard instructions as to how to do this.

I have considered putting the name of some standard master repo for the project (or a list of them) in a text file checked into the repo - say repo/.hg_master_repos. This looks like it may work, although it has some issues (you may only see the global project master, not an additional local project master; I don't want to explain more than that).
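
To make that concrete, the file might hold one URL per line, and checking it would be a loop; the file name and URLs below are my own invention, nothing standard:

$ cat .hg_master_repos
https://hg.example.com/project1-master
https://hg.example.com/project1-local-master

$ for url in $(cat .hg_master_repos); do hg outgoing "$url"; done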

But... before I do this, is there any standard way of doing this?


By the way, here is what I have so far:

#!/usr/bin/perl
use strict;
use warnings;

# check to see if there is any unsaved stuff in the hg repo(s) on the command line

# -> hg status, looking for Ms, etc.
#        for now, just send it all to stdout, let the user sort it out

# -> hg outgoing
# issue: whom to check outgoing against?
#   generic
#      a) hg outgoing
#           but note that I often make default-push disabled
#           also, it may not point anywhere useful, e.g.
#               hg clone master r1
#               hg clone r1 r2
#               rm -rf r1
#               hg clone r2 r1
#           plus, repos that are not clones, masters...
#      b) hg outgoing default-push
#      c) hg outgoing default
#   various repos specific to me or my company


# note: $repo rather than $a, which Perl reserves for sort()
foreach my $repo ( @ARGV ) {
    print "**********  $repo\n";
    $repo =~ s|/\.hg$||;    # accept either a repo root or its .hg subdirectory
    if( ! -d "$repo/.hg" ) {
        print STDERR "Warning: $repo/.hg does not exist, probably not a Mercurial repository\n";
    }
    else {
        foreach my $cmd (
                 "hg status",
                 # generic
                 "hg outgoing",
                 "hg outgoing default-push",
                 "hg outgoing default",
                 # specific
                 "hg outgoing PROJECT1-MASTER",
                 "hg outgoing MY-LOCAL-PROJECT1-MASTER",
                 "hg outgoing PROJECT2-MASTER",
                 # maybe go through all paths?
                 # maybe have a file that contains some sort of reference master?
                )
          {
              my $cmd_args = "$cmd -R $repo";
              print "=======  $cmd_args\n";
              system($cmd_args);
          }
    }
}

As you can see, I haven't adorned it with anything to parse what it gets - just letting the user, me, eyeball it.

But just doing

find ~ -name '*.hg' | xargs ~/bin/hg-any-unsaved-stuff.pl

found a lot of suspiciously unsaved stuff that I did not know about.

Old unsaved changes reported by hg status are highly suspicious. Unpushed work reported by hg outgoing is suspect, but perhaps not so bad for somebody who thinks of a clone as a branch. However, I prefer not to have a diverged clone live forever, but to put things onto branches so that somebody can see all the history by cloning from one place.

BOTTOM LINE:

Is there a standard way of finding unsaved work, un-checked-in and/or unpushed, that is not vulnerable to the sorts of cycles I mention above?

Is there some convention for recording the "true" project master repo in a file somewhere?

Hmm... I suppose if the repos involved in pushes, clones, and checkins were recorded somewhere, I could make some guesses as to what the proper project masters might be.
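
One way to get such a record - assuming I have Mercurial's hook machinery right - would be a client-side post-push hook in ~/.hgrc that logs the arguments of every push for later mining (the log file name is my own invention):

[hooks]
# append the working directory and arguments of each successful push to a log
post-push = echo "$(date) $PWD $HG_ARGS" >> ~/.hg_push_log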


Solution

Here's what you can do:

  1. Identify the possible central repositories on your server.

  2. Iterate over repositories on the client to match them up with central repositories.

  3. Run hg outgoing against the central repository you found.

A bit more detail:

  1. I assume you have a central place for your repositories, since otherwise your question becomes moot. Now, a repository can be identified by its root changeset. This changeset will be revision zero, and you can get its full changeset hash like this:

    $ hg log -r 0 --template "{node}"
    

    Run a script on the server that dumps a list of (node, URL) pairs into a file that is accessible by the clients. The URLs will be the push targets (see the sketch after this list).

  2. Run a script on the clients that first downloads the (node, URL) list from the server and then identifies all local repositories and the corresponding push URL on the server.

  3. Run hg outgoing URL with the URL you found in the previous step. You can (and should!) use a full URL with hg outgoing so that you avoid depending on any local configuration done on the client. That way you avoid dealing with default and default-push paths and since the URL points back to the server you know that it's a good URL to compare with.

    If the server has multiple clones of the same repository, then there will be several different URLs to choose from. You can then either try them all and use the one with fewest outgoing changesets for your report or you can side-step the issue by combining the clones on the server-side (by pulling changesets from all the clones into a single repository) and then compare against this combined repository.
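
Here is a minimal sketch of those three steps as shell scripts; the /srv/hg layout, the https://hg.example.com URL prefix, and the repo-map.txt file name are all assumptions for illustration:

#!/bin/sh
# server side: dump "root-node push-URL" pairs for every central repo
for repo in /srv/hg/*; do
    node=$(hg log -R "$repo" -r 0 --template '{node}')
    echo "$node https://hg.example.com/$(basename "$repo")"
done > repo-map.txt   # publish this file where the clients can fetch it

The client side then walks the local repositories, looks up each root node in the map, and runs hg outgoing against the full server URL:

#!/bin/sh
# client side: match local repos to the map and check for unsaved work
find "$HOME" -type d -name .hg | while read -r hgdir; do
    repo=${hgdir%/.hg}
    node=$(hg log -R "$repo" -r 0 --template '{node}')
    url=$(awk -v n="$node" '$1 == n { print $2 }' repo-map.txt)
    if [ -n "$url" ]; then
        hg status -R "$repo"            # uncommitted changes
        hg outgoing -R "$repo" "$url"   # unpushed changesets, checked against the server
    else
        echo "$repo: no matching central repository"
    fi
done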

When you run the script on the client you might have some repositories that are local and don't exist on the server. Your script should handle those: it should probably fire off an email to the developer asking him to create the repository on the server.

Finally, a repository might have more than one root changeset. The above will still work pretty well: all clones done the normal way will keep revision zero the same on both server and client. The script will therefore correctly match up the client repo with the server repo, even with multiple roots.

It is only if a developer runs something like hg clone -r the-other-root ... that the above fails since the other root now becomes revision zero. The repository will thus be seen as a local repo. Your script should handle that anyway, so it's no big deal.
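
If you want to catch that case too, the client script can compare every root rather than only revision zero; Mercurial's revset language will list them all:

hg log -R "$repo" -r 'roots(all())' --template '{node}\n'

Matching any of these nodes against the server's list would re-attach such a clone to its proper central repository.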

OTHER TIPS

If all you are concerned about is data loss and you are using git, then just create a repository, add all the other repositories as remotes to it, and run

git fetch --all

This will efficiently back up all the data in all the repositories. It also backs up the current snapshot of all their references.
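
A minimal sketch of that setup, with a hypothetical backup path and remote names:

git init --bare /backup/all-work.git
cd /backup/all-work.git
git remote add alice /home/alice/src/project
git remote add bob ssh://devbox/home/bob/src/project
git fetch --all   # fetch every remote's branches into refs/remotes/<name>/

Note that, like hg outgoing, this only protects committed work; uncommitted changes in working trees still need something like the hg status sweep described above.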

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow