Question

Some time ago I learned that rsync deletes files much faster than many other tools.

A few days ago I came across this wonderful answer on Serverfault which explains why rsync is so good at deleting files.

Quotation from that answer:

I revisited this today, because most filesystems store their directory structures in a btree format, the order of which you delete files is also important. One needs to avoid rebalancing the btree when you perform the unlink. As such I added a sort before deletes occur.

Could you explain how removing files in order prevents or reduces the number of B-tree rebalancings?


I expect the answer to show how deleting in order increases deletion speed, with details of what happens at the B-tree level. The people who wrote rsync and other programs (see links in the question) used this knowledge to create better programs. I think it's important for other programmers to have this understanding so they can write better software.


Solution

It is not important, nor is it a B-tree issue. It is just a coincidence.

First of all, this is very much implementation-dependent and very much ext3-specific. That's why I said it's not important (for general use); otherwise, add the ext3 tag or edit the summary line.

Second of all, ext3 does not use a B-tree for the directory entry index. It uses an HTree. The HTree is similar to a B-tree but different, and it does not require rebalancing. Search for "htree" in fs/ext3/dir.c.

Because of the HTree-based index, (a) ext3 has faster lookups compared to ext2, but (b) readdir() returns entries in hash-value order. The hash-value order is random relative to file creation time or the physical layout of the data, and, as we all know, random access is much slower than sequential access on rotating media.

A paper on ext3 presented at OLS 2005 by Mingming Cao et al. suggests (emphasis mine):

to sort the directory entries returned by readdir() by inode number.
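The paper's suggestion is easy to prototype from user space. Here is a minimal Python sketch (the function name is mine; this is not kernel or rsync code) that lists a directory with os.scandir(), which wraps readdir(), sorts the entries by inode number, and only then issues the unlink() calls:

    import os

    def unlink_in_inode_order(path):
        """Delete every regular file in `path`, unlinking in inode-number
        order instead of the hash order that readdir() returns on ext3."""
        entries = []
        with os.scandir(path) as it:              # scandir() wraps readdir()
            for entry in it:
                if entry.is_file(follow_symlinks=False):
                    # DirEntry.inode() avoids an extra stat() per file
                    entries.append((entry.inode(), entry.path))
        for _inode, filepath in sorted(entries):  # sort by inode number
            os.unlink(filepath)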

Now, onto rsync. Rsync sorts files by file name. See flist.c::fsort(), flist.c::file_compare(), and flist.c::f_name_cmp().

I did not test the following hypothesis, because I do not have the data sets from which @MIfe got 43 seconds. But I assume that sorted-by-name order was much closer to the optimal order than the random order returned by readdir(). That is why you saw much faster results with rsync on ext3. What if you generate 1,000,000 files with random file names and then delete them with rsync? Do you see the same result?
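If you want to try that experiment, a sketch along these lines (the directory and function names are my own) creates a directory full of randomly named empty files; you can then time the empty-directory rsync invocation described in the linked answer against a plain rm -rf:

    import os
    import uuid

    def make_random_files(path, count=1_000_000):
        """Create `count` empty files with random names, so that name order
        no longer correlates with creation order or on-disk layout."""
        os.makedirs(path, exist_ok=True)
        for _ in range(count):
            # uuid4().hex yields an effectively random, collision-free name
            open(os.path.join(path, uuid.uuid4().hex), "w").close()

    if __name__ == "__main__":
        make_random_files("testdir")
        # Then compare, for example:
        #   mkdir empty; time rsync -a --delete empty/ testdir/
        # against:
        #   time rm -rf testdir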

OTHER TIPS

Let's assume that the answer you posted is correct, and that the given file system does indeed store things in a balanced tree. Balancing a tree is a very expensive operation. Keeping a tree "partially" balanced is pretty simple: when you allow a tree to be slightly imbalanced, you only worry about moving things around the point of insertion/deletion. When talking about completely balanced trees, however, removing a given node may mean that suddenly the children of that node belong on the complete opposite side of the tree, or that a child node on the opposite side has become the root node and all of its children need to be rotated up the tree. This requires you to do either a long series of rotations, or to place all the items into an array and re-create the tree.

            5
    3               7
2       4       6       8

Now remove the 7. Easy, right?

            5
    3               8
2       4       6       

Now remove the 6, still easy, yes...?

            5
    3               8
2       4       

Now remove the 8, uh oh

            5
    3               
2       4

Getting this tree back into the proper balanced form, like:

        4
    3       5
2

Is quite expensive, at least compared to the other removals we have done, and it gets much worse as the depth of our tree increases. We could make this go much faster by removing the 2 and the 4 before removing the 8, particularly if our tree were more than 3 levels deep.

Without sorting, removal is on average O(K * log_I(N)^2), where N is the total number of elements, K is the number to be removed, I is the number of children a given node is permitted, and log_I(N) is therefore the depth; for each level of depth we increase the number of operations quadratically.

Removal with some help from ordering is on average O(K * log_I(N)), though sometimes the ordering cannot help you and you are stuck removing something that requires a rebalance. Still, minimizing those cases is the best you can do.

EDIT:

Another possible tree ordering scheme:

            8
    6               7   
1       2       3       4

Accomplishing optimal removal under this scheme would be easier, because we can take advantage of our knowledge of how things are sorted. It is possible under either scheme, and in fact both are identical; under this one the logic is just a little simpler to follow, because the ordering is more human-friendly for the given scenario. In either case, "in order" means "remove the farthest leaf first". In this case it just so happens that the farthest leaves are also the smallest numbers, a fact we could take advantage of to make it even a little more optimal, but this fact is not necessarily true for the file system example presented (though it may be).
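If you want to measure how much deletion order actually matters for rebalancing, a toy self-balancing tree with a rotation counter is enough. The sketch below uses an AVL tree (a balanced binary tree, not an on-disk B-tree), and all the names in it are my own; whether in-order deletion produces fewer rotations than random deletion is exactly the kind of empirical question it lets you answer:

    import random

    # Toy AVL tree with a rotation counter: every rotation performed while
    # deleting is counted, so different deletion orders can be compared.

    rotations = 0

    class Node:
        __slots__ = ("key", "left", "right", "height")
        def __init__(self, key):
            self.key, self.left, self.right, self.height = key, None, None, 1

    def height(n):
        return n.height if n else 0

    def update(n):
        n.height = 1 + max(height(n.left), height(n.right))

    def balance(n):
        return height(n.left) - height(n.right)

    def rotate_right(y):
        global rotations
        rotations += 1
        x = y.left
        y.left, x.right = x.right, y
        update(y)
        update(x)
        return x

    def rotate_left(x):
        global rotations
        rotations += 1
        y = x.right
        x.right, y.left = y.left, x
        update(x)
        update(y)
        return y

    def rebalance(n):
        update(n)
        if balance(n) > 1:                 # left-heavy
            if balance(n.left) < 0:
                n.left = rotate_left(n.left)
            return rotate_right(n)
        if balance(n) < -1:                # right-heavy
            if balance(n.right) > 0:
                n.right = rotate_right(n.right)
            return rotate_left(n)
        return n

    def insert(node, key):
        if node is None:
            return Node(key)
        if key < node.key:
            node.left = insert(node.left, key)
        else:
            node.right = insert(node.right, key)
        return rebalance(node)

    def delete(node, key):
        if node is None:
            return None
        if key < node.key:
            node.left = delete(node.left, key)
        elif key > node.key:
            node.right = delete(node.right, key)
        else:
            if node.left is None:
                return node.right
            if node.right is None:
                return node.left
            succ = node.right              # in-order successor
            while succ.left:
                succ = succ.left
            node.key = succ.key
            node.right = delete(node.right, succ.key)
        return rebalance(node)

    def rotations_for(order_fn, n=10_000):
        """Insert n keys in random order, then delete all of them in the
        order produced by order_fn; return the delete-phase rotation count."""
        global rotations
        keys = list(range(n))
        random.shuffle(keys)
        root = None
        for k in keys:
            root = insert(root, k)
        rotations = 0                      # count only the deletions
        for k in order_fn(list(range(n))):
            root = delete(root, k)
        return rotations

    print("in-order deletes:", rotations_for(lambda ks: ks))
    print("random deletes  :", rotations_for(lambda ks: random.sample(ks, len(ks))))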

I am not convinced that the number of B-tree rebalancings changes significantly if you delete the files in order. However, I do believe that the number of distinct seeks to external storage will be significantly smaller if you do so. At any time, the only nodes in the B-tree that need to be visited will be the far-right boundary of the tree, whereas with a random order each leaf block in the B-tree is visited with equal probability for each file.
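One cheap way to see this locality effect outside a filesystem is SQLite, which stores each table as a B-tree keyed by its rowid. The sketch below (the file and table names are mine, and how large the measured gap is will depend heavily on cache sizes) fills a table and then deletes every row, once in key order and once in random order:

    import os
    import random
    import sqlite3
    import time

    def timed_deletes(keys, label):
        """Create a fresh on-disk table, fill it with len(keys) rows, then
        delete the rows one by one in the order given by `keys`."""
        db = f"delete_order_{label}.db"
        if os.path.exists(db):
            os.remove(db)
        conn = sqlite3.connect(db)
        conn.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, payload BLOB)")
        conn.executemany("INSERT INTO t VALUES (?, ?)",
                         ((k, b"x" * 100) for k in range(len(keys))))
        conn.commit()
        start = time.perf_counter()
        for k in keys:                       # one DELETE per row, like unlink()
            conn.execute("DELETE FROM t WHERE k = ?", (k,))
        conn.commit()
        print(f"{label:>7}: {time.perf_counter() - start:.3f}s")
        conn.close()

    n = 200_000
    ordered = list(range(n))
    shuffled = ordered[:]
    random.shuffle(shuffled)

    timed_deletes(ordered, "sorted")         # walk the B-tree leaves in order
    timed_deletes(shuffled, "random")        # jump around the leaf pages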

Rebalancing B-trees is cheaper than in B+-tree implementations, which is why most filesystem and database index implementations use them.

There are many approaches to deletion, and depending on the approach, it can be more efficient in terms of time and in the need to rebalance the tree. You also have to consider the node size, since the number of keys a node can store affects how often the tree needs rebalancing. A large node size will usually just reorder keys inside the node, but a small one will probably make the tree rebalance many times.

A great resource for understanding this is the famous CLRS book, "Introduction to Algorithms" by Thomas Cormen et al.

On storage systems hosting huge directories, the buffer cache will be under stress and buffers may get recycled. So if the deletes are spaced apart in time, the number of disk reads needed to bring the B-tree back into the buffer cache between deletes may be high.

If you sort the files to be deleted, you are effectively delaying the deletes and bunching them together. This may have the side effect of performing more deletes per B-tree block paged in. If there are statistics showing the buffer cache hit rates for the two experiments, they may tell whether this hypothesis is wrong or not.

But if there is no stress on the buffer cache during the deletes, then the B-tree blocks could stay in core, and my hypothesis would not be valid.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow