Question

I'm working on a file-synchronization client that currently produces a stream of changes to the underlying filesystem. That is, a stream of create/update/move/delete events is produced for each synchronization "target".

Each event includes a sequence ID, which establishes the ordering of events, along with information such as:

  • source path (& destination path for move events)
  • md5 (for files)
  • mtime timestamp
  • event type : {create, move, update, delete}

This works reasonably well, but there are often redundancies. For example, a move from path X to path Y and then back to path X will be reported as two events:

  1. X -> Y
  2. Y -> X

This is clearly redundant, and can be removed altogether from the stream.

Are there any well-documented techniques for detecting and removing such redundancies?

Similarly, is there a well-understood and efficient way to detect redundancies stemming from deletions? For example:

  1. Update A
  2. A -> B
  3. B -> C
  4. Delete C

Here, changes 1 - 4 could be reduced to Delete A.

Informal approaches, along with academic papers, would be most welcome, bearing in mind that it's not uncommon to encounter streams of tens of thousands of events in this case.

Many thanks!

EDIT

Does this problem have a name? The reason I'm here is that Google has failed me, and I'm guessing that's because I'm missing some vocabulary!


Solution

What you're describing is very similar to a problem in code generation & optimization.

One difference is that in code generation copies are not destructive, whereas in your situation the moves destroy the originals. However, I believe many of the same approaches can be adapted to your situation, so I'll point some of them out. Optimization operates on a sequence or stream of instructions.

The code stream being optimized contains instructions like

Copy X -> Y
...
Copy Y -> X

We want to eliminate unnecessary copies in the final optimized code stream. One approach is copy propagation. Upon seeing the first copy, the optimization equates X & Y and then substitutes X for later uses of Y, eliminating Y from the stream. Thus the code becomes:

//Copy X -> Y  eliminated by copy elimination, by equating X & Y
...             // here verify that X is not changed so X still == Y 
Copy X -> X     // later eliminated as redundant
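Adapted to your stream, the same idea becomes a forward pass that tracks where each original path currently lives, so a chain of moves collapses and a move back to the origin cancels out. A minimal Python sketch, assuming events are simple `(kind, src, dst)` tuples (the representation is hypothetical):

```python
def propagate_moves(events):
    """Forward pass over the stream, analogous to copy propagation:
    track where each original path currently lives, so chains of moves
    collapse and a move back to the origin cancels out entirely.
    Only moves are collapsed here, and surviving moves are emitted at
    the end of the stream (a simplification of ordering)."""
    current = {}  # original path -> path it currently lives at
    out = []
    for kind, src, dst in events:
        if kind != "move":
            out.append((kind, src, dst))
            continue
        # Find which original path now lives at src (linear scan for
        # clarity; a reverse index would make this O(1)).
        origin = next((o for o, c in current.items() if c == src), src)
        if origin == dst:
            current.pop(origin, None)  # X -> Y then Y -> X: both cancel
        else:
            current[origin] = dst      # X -> Y then Y -> Z becomes X -> Z
    out.extend(("move", o, c) for o, c in current.items())
    return out
```

With this, `move X -> Y` followed by `move Y -> X` produces no output at all, and `move A -> B` followed by `move B -> C` collapses to a single `move A -> C`.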

Another optimization is called dead code elimination. This works backward through the stream. First, some late instruction in the stream has a target that isn't subsequently used,

add P + Q -> X
add X + Y -> Z

(here Z is not used), so that instruction is eliminated. That in turn affects the preceding instructions that generate the targets (X, Y) sourced by the eliminated instruction, as those targets may no longer be used. So some preceding instructions can also be eliminated, and the process continues backward through the stream.
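Translated to your event stream, the backward pass means: walk from the end, propagate "this path will be deleted" backward across moves, and drop creates and updates whose effect is destroyed. A hedged sketch, again assuming `(kind, src, dst)` tuples and simplifying where the rewritten delete is placed:

```python
def eliminate_dead_events(events):
    """Backward pass, analogous to dead code elimination: a create or
    update is dead if its path is later deleted, and a move whose
    destination is doomed transfers the doom to its source.  Surviving
    deletes are emitted at the end of the stream under their earliest
    known name (a simplification of ordering)."""
    doomed = set()  # paths whose contents a later delete destroys
    kept = []
    for kind, src, dst in reversed(events):
        if kind == "delete":
            doomed.add(src)        # defer emitting: the name may change
        elif kind == "move" and dst in doomed:
            doomed.discard(dst)    # the delete really targets the source
            doomed.add(src)
        elif kind == "create" and src in doomed:
            doomed.discard(src)    # created then deleted: both vanish
        elif kind == "update" and src in doomed:
            pass                   # dead: destroyed by the later delete
        else:
            kept.append((kind, src, dst))
    kept.reverse()
    kept.extend(("delete", p, None) for p in sorted(doomed))
    return kept
```

Run on the question's example (`Update A; A -> B; B -> C; Delete C`), this reduces the four events to the single `Delete A`.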

These analyses involve detecting uses (sources) and definitions (targets), in an algorithm or field called data flow analysis. It is probably more complex in code generation because of the control flow within the instruction stream, which needs to be factored into the algorithm. Control flow in code generation is usually broken down into units called basic blocks: straight-line sequences of instructions with a single entry and exit; the basic blocks are then interconnected by control flow operations (fork, join, loop). Your stream probably doesn't have control flow to deal with, so it would be like doing the first part of data flow analysis, which happens on a single basic block, and skipping the iteration that computes a larger procedure or program from those basic blocks. See also def-use chain.

OTHER TIPS

All you have to do is buffer your structured events such that you can look them up by affected path (source and destination, probably).

The actual mechanism for combining any given pair of events will depend on what those events are, so you just need to manually code the optimisations you want.

It's much simpler to do this pairwise than to somehow recognise chains of 4 events. For example:

  1. copy and delete of source -> move

    • get delete event, look up other events indexed by deleted path
    • find one, a copy
    • matching path is the source of the copy
    • so replace it by a move
  2. copy and delete of dest -> no-op

    • as above, but matching path is the destination
    • remove original copy event and you're done
  3. move and another move

    • get a move event, look up other events indexed by source path
    • find a move whose dest path matches
    • replace move(source1,dest1)+move(dest1,dest2) with move(source1,dest2)

I assume you're processing the events in sequence order, so your chain of 4 events will be collapsed pairwise.
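The buffered pairwise strategy might be sketched like this, assuming in-order `(kind, src, dst)` events and handling only the move+move rule (the class and index names are illustrative; the other pair rules slot in the same way):

```python
class EventCollapser:
    """Buffers events indexed by affected path and collapses adjacent
    pairs as they arrive: move+move -> one move, and a move back to the
    origin -> no-op.  A sketch of the pairwise strategy; further rules
    (e.g. copy+delete -> move) can be added in the same style."""

    def __init__(self):
        self.by_dest = {}  # current path -> buffered move ending there
        self.buffer = []   # events in sequence order

    def push(self, event):
        kind, src, dst = event
        if kind == "move":
            prior = self.by_dest.pop(src, None)
            if prior is not None:
                # An earlier buffered move ends where this one starts.
                _, orig_src, _ = prior
                self.buffer.remove(prior)
                if orig_src == dst:
                    return  # X -> Y then Y -> X: drop both events
                event = ("move", orig_src, dst)  # collapse the chain
            self.by_dest[dst] = event
        self.buffer.append(event)

    def events(self):
        return list(self.buffer)
```

Because each incoming move is collapsed against the buffer immediately, longer chains reduce pairwise exactly as described: each new link merges into the already-collapsed move.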

You can probably find a nice way to structure and express it as a rules engine or something, but it might take more effort than just coding the transformations you want directly.

Licensed under: CC-BY-SA with attribution