Question

Consider the following data.

Groundtruth      |  Dataset1         |  Dataset2         |  Dataset3
Datapoints|Time  |  Datapoints|Time  |  Datapoints|Time  |  Datapoints|Time
    A     |0     |      a     |0     |      a     |0     |      a     |0
    B     |10    |      b     |5     |      b     |5     |      b     |13
    C     |15    |      c     |12    |      c     |12    |      c     |21
    D     |25    |      d     |22    |      d     |14    |      d     |30
    E     |30    |      e     |30    |      e     |17    |
                 |                   |      f     |27    |
                 |                   |      g     |30    |

Visualized like this (the number of - characters between identifiers is proportional to the time gap):

Time ->         
Groundtruth: A|----------|B|-----|C|----------|D|-----|E
Dataset1:    a|-----|b|-------|c|----------|d|--------|e
Dataset2:    a|-----|b|-------|c|--|d|---|e|----------|f|---|g
Dataset3:    a|-------------|b|--------|c|---------|d

My goal is to compare the datasets with the ground truth. I want to create a function that generates a similarity measurement between one of the datasets and the ground truth, in order to evaluate how good my segmentation algorithm is. Ideally the segmentation algorithm would produce the same number of datapoints (segments) as the ground truth, but as the datasets illustrate, this is not guaranteed, and the number of datapoints is not known ahead of time.

I've already created a Jaccard index to generate a basic evaluation score. But I am now looking for an evaluation method that penalizes the abundance/absence of datapoints as well as limits the distance to a correct datapoint. That is, b doesn't have to coincide exactly with B; it just has to be close to a correct datapoint.
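For reference, a Jaccard index over the two sets of boundary times could be sketched like this (my own minimal sketch; the function name is an assumption). It only rewards exact time matches, which is exactly the limitation described above:

```python
def jaccard_index(gt_times, ds_times):
    # Jaccard similarity between the two sets of boundary timestamps:
    # |intersection| / |union|. Only exact matches count, so a boundary
    # that is merely *close* to a ground-truth one contributes nothing.
    a, b = set(gt_times), set(ds_times)
    return len(a & b) / len(a | b)

# Ground truth vs. Dataset1 from the tables above:
# only the times 0 and 30 coincide, out of 8 distinct times -> 2/8
jaccard_index([0, 10, 15, 25, 30], [0, 5, 12, 22, 30])  # 0.25
```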

I've tried a dynamic programming approach in which I introduced a penalty for removing or adding a datapoint, as well as a distance penalty for moving to the closest datapoint. I'm struggling, though, because:

1. I need to limit each datapoint to one correct datapoint.
2. I need to figure out which datapoint to delete, if any.
3. I have a general lack of understanding of how to implement DP algorithms.

Anyone have ideas how to do this? If dynamic programming is the way to go, I'd love some link recommendation as well as some pointers in how to go about it.


Solution

Basically, you can modify the DP for Levenshtein edit distance to compute distances for your problem. The Levenshtein DP amounts to finding shortest paths in a directed acyclic graph that looks like this:

*-*-*-*-*
|\|\|\|\|
*-*-*-*-*
|\|\|\|\|
*-*-*-*-*

where the arcs are oriented left-to-right and top-to-bottom. The DAG has rows numbered 0 to m and columns numbered 0 to n, where m is the length of the first sequence and n is the length of the second. Lists of instructions for changing the first sequence into the second correspond one-to-one (cost and all) to paths from the upper left to the lower right. The arc from (i, j) to (i + 1, j) corresponds to the instruction of deleting the ith element from the first sequence. The arc from (i, j) to (i, j + 1) corresponds to the instruction of adding the jth element from the second sequence. The arc from (i, j) to (i + 1, j + 1) corresponds to modifying the ith element of the first sequence to become the jth element of the second sequence.

All you have to do to get a quadratic-time algorithm for your problem is to define the cost of (i) adding a datapoint (ii) deleting a datapoint (iii) modifying a datapoint to become another datapoint and then compute shortest paths on the DAG in one of the ways described by Wikipedia.
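A sketch of what that could look like in Python, under my own choice of costs: deleting a ground-truth boundary or adding a spurious one costs a fixed penalty, and matching a ground-truth boundary to a dataset boundary costs the absolute time difference. The function name and the penalty values are assumptions to be tuned, not part of the answer above:

```python
def segmentation_distance(gt, ds, add_cost=10.0, del_cost=10.0):
    # gt, ds: sorted lists of segment-boundary times.
    # D[i][j] = min cost of explaining gt[:i] with ds[:j]
    # (shortest path from the DAG's upper-left corner to node (i, j)).
    m, n = len(gt), len(ds)
    INF = float("inf")
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i < m:   # delete: ground-truth boundary gt[i] left unmatched
                D[i + 1][j] = min(D[i + 1][j], D[i][j] + del_cost)
            if j < n:   # add: dataset boundary ds[j] is spurious
                D[i][j + 1] = min(D[i][j + 1], D[i][j] + add_cost)
            if i < m and j < n:  # match gt[i] to ds[j], pay the time offset
                D[i + 1][j + 1] = min(D[i + 1][j + 1],
                                      D[i][j] + abs(gt[i] - ds[j]))
    return D[m][n]
```

On the question's data, Dataset1 matches point-for-point (A-a, B-b, ...) for a total offset of 0+5+3+3+0 = 11, while Dataset3 is cheapest when B is deleted (penalty 10) and the rest matched (0+2+4+0 = 6), giving 16. Because each matched pair consumes one element of each sequence, every dataset boundary is paired with at most one ground-truth boundary, which addresses point 1 of the question; the DP itself decides which points to delete (point 2).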

(As an aside, this algorithm assumes that it is never profitable to make modifications that "cross over" one another. Under a fairly mild assumption about the modification costs, this assumption is superfluous. If you're interested in more details, see this answer of mine: Approximate matching of two lists of events (with duration).)
