質問

I have the following problem:

There is a "clean" sequence of sequences, say:

clean = [
    [1, 2],
    [3],
    [4, 5]
]

And a "noisy" sequence which is not segmented:

segmented = [1, 100, 2, 3, 3, 101, 4, 5]

I would like to partition segmented into exactly len(clean) parts such that the sum of the edit distances of each part and its clean counterpart is minimized.

For this example, one optimal solution would be:

optimal = [
    [1, 100, 2], # edit distance 1
    [3, 3],      # edit distance 1
    [101, 4, 5]  # edit distance 1
]                # total 3

It seems like a typical dynamic programming problem. My first thought was to use a similar algorithm to what TeX uses for line breaking, which brought me to SMAWK.

This is where I am stuck, because I cannot figure out the cost function. For regular line breaking, the function cost(i,j) is the cost of having a line go from index i to index j, like here. But for this problem, we would need a third parameter, namely which line in the clean reference to compare to.

Is this problem not conditioned correctly for SMAWK? Or is there a different cost function definition that I'm missing? Is there another, similar problem which has a more applicable solution?

正しい解決策はありません

ライセンス: CC-BY-SA帰属
所属していません cs.stackexchange
scroll top