Best matching in a bipartite graph (e.g.associating labels with points on a plot)

https://stackoverflow.com/questions/10152710

31-05-2021
|

Вопрос

I am trying to extract semantics from graphical xy plots where the points are plotted and some or all have a label. The label is plotted "near the point" so that a human can normally understand which label goes with which point. For example in this plot it is clear which label(number) belongs to which point(*) and an algorithm based on Euclidian distance would work. (The labels and points have no semantic ordering - e.g. a scatterplot)

In congested plots the authoring software/human may place the label in different directions to avoid overlap. For example in

1**2
 **4
 3

A human reader can normally work out which label is associated with which label.

One solution I'd accept would be to create a Euclidean distance matrix and shuffle the rows to get the minimum of a function (e.g. the summed squares of the distances on the diagonal or other heuristic). In the second example (with the points labelled a,b,c,d clockwise from the NW corner) we have a distance matrix (to 1 d.p.)

             a   b   c   d
 1ab2    1  1.0 2.0 2.2 1.4    
  dc4    2  2.0 1.0 1.4 2.2
  3      3  2.0 2.2 1.4 1.0
         4  2.2 1.4 1.0 2.0

and we need to label a1 b2 c4 d3. Swapping rows 3 and 4 gives the minimum sum of the diagonal. Here's a more complex example where simply picking the nearest may fail

 *1*2*5
  **4
  3 *6

If this is solved then I shall need to go to cases where the number of labels may be smaller or larger than the number of points.

If the algorithm is standard than I would appreciate a pointer to Open Source Java (e.g. JAMA or Apache maths)

NOTE: This SO answer Associating nearby points with a path doesn't quite work as an answer because the path through the points is given.

Решение

You have a complete bipartite graph that one part is numbers and other one is points. Weight's of edge in this graph is euclidean distance between numbers and points. And you're task is finding matching with minimal weight.

This is known problem and has a well known algorithm named as Hungarian Algorithm:

From Wiki:

We are given a nonnegative n×n matrix, where the element in the i-th row and j-th column represents the cost of assigning the j-th point to the i-th number. We have to find an assignment of the point to the numbers that has minimum cost. If the goal is to find the assignment that yields the maximum cost, the problem can be altered to fit the setting by replacing each cost with the maximum cost subtracted by the cost.

The algorithm is easier to describe if we formulate the problem using a bipartite graph. We have a complete bipartite graph G=(S, T; E) with n number vertices (S) and n point vertices (T), and each edge has a nonnegative cost c(i,j). We want to find a perfect matching with minimum cost. The Hungarian method is a combinatorial optimization algorithm which solves the assignment problem in polynomial time and which anticipated later primal-dual methods. f

For detailed algorithm and code you can take a look at topcoder article and this pdf maybe to use

there is a media file to describe it. (This video explains why the Hungarian algorithm works)

algorithm : step 1:- prepare a cost matrix.if the cost matrix is not a square matrix then add a dummy row(column) with zero cost element.

step 2:- subtract the minimum element in each row from all the elements of the respective rows.

step 3:- further modify the resulting matrix by subtracting the minimum elememnt of each column from all the elements of the respective columns.thus obtain the modified matrix.

step 4:- then,draw minimum no of horizontal and vertical lines to cover all zeros in the resulting matrix.let the minimum no of lines be N.now there are 2 possible cases.

case 1 - if N=n,where n is the order of matrix,then an optimal assignment can be made.so make the assignment to get the required solution.

case 2 - if N less than n then proceed to step 5

step 5: determine the smallest uncovered element in the matrix(element not covered by N lines).subtract this minimum element from all uncovered elements and add the same elements at the intersection of horizontal and vertical lines.thus the second modified matrix is obtained.

step 6:- repeat step(3) and (4) untill we get the case (1) of step 4.

step 7:- (to make zero assignments) examine the rows successively untill a row-wise exactly single zero is found.circle(o) this zero to make the assignment.then mark a cross(x) over all zeros if lying in the column of the circled zero,showing that they can't be considered for future assignment.continue in this manner untill all the zeros have been examined. repeat the same procedure for column also.

step 8:- repeat the step 6 succeccively until one of the following situation arises- (i)if no unmarked zeros is left,then the process ends or (ii) if there lies more than one of the unmarked zero in any column or row then,circle one of the unmarked zeros arbitrarily and mark a cross in the cell of remaining zeros in its row or column.repeat the process untill no unmarked zero is left in the matrix.

step 9:- thus exactly one marked circled zero in each row and each column of the matrix is obtained. the assignment corresponding to these marked circle zeros will give the optimal assignment.

For details see wiki and http://www.ams.jhu.edu/~castello/362/Handouts/hungarian.pdf

Другие советы

If I have understood your question, each of the examples you show has a unique best solution that minimizes the sum of the squares of the distances between points and labels. There is an exponential number of mappings between points and labels, but perhaps you can try the following:

In polynomial time, compute the distance from each label to each point. In a general graph you would have to solve the all-pairs shortest-path problem. Here, as Mikola points out, you can just do it with a doubly nested and use the coordinate geometry: pick either the Euclidean distance or the Manhattan distance.
In polynomial time, find the minimum-cost bipartite matching between points and labels. The solution to this problem will give you a matching between points and labels that minimizes the total distance.

All algorithms (shortest paths, Euclidean distance, min-cost bipartite matching) are standard and can be found on Wikipedia.

What is slightly nonstandard is if you find more than one bipartite matching with minimum cost. If that happens, you can try them all and see if one matching minimizes the sum of distances squared. If there are still ties, I recommend you treat horizontal distance as slightly shorter than vertical distance, and run the algorithm again. If you still have ties, there may not be a unique solution, or you may want to treat "label to the right" as slightly "closer" than label to the left.

But when there is a unique solution, all-pairs shortest paths followed by miniumum-cost bipartite matching should find it.

I haven't seen a common algorithm for this case. Therefore I would solve this problem pragmantically:

Assuming that a label belonging to a point is always the nearest (maybe with others), you could orient on a region growing algorithm (see animated gif). Iterate through every point (red) for each growing step (circle around number label). The growing step is determined by the minimal distance between a point and a label.

region growing

Use temporary lists for points and labels. Each time you find a definite pair, remove the corresponding point and label. Just skip the label if there is more than one nearest point (here: label 2). By solving other point-label combinations (here: label 3), they should be mapped in another iteration.

After iterations without any progress due to overall ambiguous situations, you can define a selection method to solve them (e.g. prefer top over bottom, left over right).

Лицензировано под: CC-BY-SA с атрибуция

Не связан с StackOverflow