Memory-efficient line stitching in very large images

https://stackoverflow.com/questions/4837465

27-10-2019
|

Question

Background

I work with very large datasets from Synthetic Aperture Radar satellites. These can be thought of as high dynamic range greyscale images of the order of 10k pixels on a side.

Recently, I've been developing applications of a single-scale variant of Lindeberg's scale-space ridge detection algorithm method for detecting linear features in a SAR image. This is an improvement on using directional filters or using the Hough Transform, methods that have both previously been used, because it is less computationally expensive than either. (I will be presenting some recent results at JURSE 2011 in April, and I can upload a preprint if that would be helpful).

The code I currently use generates an array of records, one per pixel, each of which describes a ridge segment in the rectangle to bottom right of the pixel and bounded by adjacent pixels.

struct ridge_t { unsigned char top, left, bottom, right };
int rows, cols;
struct ridge_t *ridges;  /* An array of rows*cols ridge entries */

An entry in ridges contains a ridge segment if exactly two of top, left, right and bottom have values in the range 0 - 128. Suppose I have:

ridge_t entry;
entry.top = 25; entry.left = 255; entry.bottom = 255; entry.right = 76;

Then I can find the ridge segment's start (x1,y1) and end (x2,y2):

float x1, y1, x2, y2;
x1 = (float) col + (float) entry.top / 128.0;
y1 = (float) row;
x2 = (float) col + 1;
y2 = (float) row + (float) entry.right / 128.0;

When these individual ridge segments are rendered, I get an image something like this (a very small corner of a far larger image):

Rendered ridge segments

Each of those long curves are rendered from a series of tiny ridge segments.

It's trivial to determine whether two adjacent locations which contain ridge segments are connected. If I have ridge1 at (x, y) and ridge2 at (x+1, y), then they are parts of the same line if 0 <= ridge1.right <= 128 and ridge2.left = ridge1.right.

Problem

Ideally, I would like to stitch together all of the ridge segments into lines, so that I can then iterate over each line found in the image to apply further computations. Unfortunately, I'm finding it hard to find an algorithm for doing this which is both low complexity and memory-efficient and suitable for multiprocessing (all important consideration when dealing with really huge images!)

One approach that I have considered is scanning through the image until I find a ridge which only has one linked ridge segment, and then walking the resulting line, flagging any ridges in the line as visited. However, this is unsuitable for multiprocessing, because there's no way to tell if there isn't another thread walking the same line from the other direction (say) without expensive locking.

What do readers suggest as a possible approach? It seems like the sort of thing that someone would have figured out an efficient way to do in the past...

Solution

I'm not entirely sure this is correct, but I thought I'd throw it out for comment. First, let me introduce a lockless disjoint set algorithm, which will form an important part of my proposed algorithm.

Lockless disjoint set algorithm

I assume the presence of a two-pointer-sized compare-and-swap operation on your choice of CPU architecture. This is available on x86 and x64 architectures at the least.

The algorithm is largely the same as described on the Wikipedia page for the single threaded case, with some modifications for safe lockless operation. First, we require that the rank and parent elements to both be pointer-sized, and aligned to 2*sizeof(pointer) in memory, for atomic CAS later on.

Find() need not change; the worst case is that the path compression optimization will fail to have full effect in the presence of simultaneous writers.

Union() however, must change:

function Union(x, y)
  redo:
    x = Find(x)
    y = Find(y)
    if x == y
        return
    xSnap = AtomicRead(x) -- read both rank and pointer atomically
    ySnap = AtomicRead(y) -- this operation may be done using a CAS
    if (xSnap.parent != x || ySnap.parent != y)
        goto redo
    -- Ensure x has lower rank (meaning y will be the new root)
    if (xSnap.rank > ySnap.rank)
        swap(xSnap, ySnap)
        swap(x, y)
    -- if same rank, use pointer value as a fallback sort
    else if (xSnap.rank == ySnap.rank && x > y)
        swap(xSnap, ySnap)
        swap(x, y)
    yNew = ySnap
    yNew.rank = max(yNew.rank, xSnap.rank + 1)
    xNew = xSnap
    xNew.parent = y
    if (!CAS(y, ySnap, yNew))
      goto redo
    if (!CAS(x, xSnap, xNew))
      goto redo
    return

This should be safe in that it will never form loops, and will always result in a proper union. We can confirm this by observing that:

First, prior to termination, one of the two roots will always end up with a parent pointing to the other. Therefore, as long as there is no loop, the merge succeeds.
Second, rank always increases. After comparing the order of x and y, we know x has lower rank than y at the time of the snapshot. In order for a loop to form, another thread would need to have increased x's rank first, then merged x and y. However in the CAS that writes x's parent pointer, we check that rank has not changed; therefore, y's rank must remain greater than x.

In the event of simultaneous mutation, it is possible that y's rank may be increased, then return to redo due to a conflict. However, this implies that either y is no longer a root (in which case rank is irrelevant) or that y's rank has been increased by another process (in which case the second go around will have no effect and y will have correct rank).

Therefore, there should be no chance of loops forming, and this lockless disjoint-set algorithm should be safe.

And now on to the application to your problem...

Assumptions

I make the assumption that ridge segments can only intersect at their endpoints. If this is not the case, you will need to alter phase 1 in some manner.

I also make the assumption that co-habitation of a single integer pixel location is sufficient for ridge segments can be connected. If not, you will need to change the array in phase 1 to hold multiple candidate ridge segments+disjoint-set pairs, and filter through to find ones that are actually connected.

The disjoint set structures used in this algorithm shall carry a reference to a line segment in their structures. In the event of a merge, we choose one of the two recorded segments arbitrarily to represent the set.

Phase 1: Local line identification

We start by dividing the map into sectors, each of which will be processed as a seperate job. Multiple jobs may be processed in different threads, but each job will be processed by only one thread. If a ridge segment crosses a sector, it is split into two segments, one for each sector.

For each sector, an array mapping pixel position to a disjoint-set structure is established. Most of this array will be discarded later, so its memory requirements should not be too much of a burden.

We then proceed over each line segment in the sector. We first choose a disjoint set representing the entire line the segment forms a part of. We first look up each endpoint in the pixel-position array to see if a disjoint set structure has already been assigned. If one of the endpoints is already in this array, we use the assigned disjoint set. If both are in the array, we perform a merge on the disjoint sets, and use the new root as our set. Otherwise, we create a new disjoint-set, and associate with the disjoint-set structure a reference to the current line segment. We then write back into the pixel-position array our new disjoint set's root for each of our endpoints.

This process is repeated for each line segment in the sector; by the end, we will have identified all lines completely within the sector by a disjoint set.

Note that since the disjoint sets are not yet shared between threads, there's no need to use compare-and-swap operations yet; simply use the normal single-threaded union-merge algorithm. Since we do not free any of the disjoint set structures until the algorithm completes, allocation can also be made from a per-thread bump allocator, making memory allocation (virtually) lockless and O(1).

Once a sector is completely processed, all data in the pixel-position array is discarded; however data corresponding to pixels on the edge of the sector is copied to a new array and kept for the next phase.

Since iterating over the entire image is O(x*y), and disjoint-merge is effectively O(1), this operation is O(x*y) and requires working memory O(m+2*x*y/k+k^2) = O(x*y/k+k^2), where t is the number of sectors, k is the width of a sector, and m is the number of partial line segments in the sector (depending on how often lines cross borders, m may vary significantly, but it will never exceed the number of line segments). The memory carried over to the next operation is O(m + 2*x*y/k) = O(x*y/k)

Phase 2: Cross-sector merges

Once all sectors have been processed, we then move to merging lines that cross sectors. For each border between sectors, we perform lockless merge operations on lines that cross the border (ie, where adjacent pixels on each side of the border have been assigned to line sets).

This operation has running time O(x+y) and consumes O(1) memory (we must retain the memory from phase 1 however). Upon completion, the edge arrays may be discarded.

Phase 3: Collecting lines

We now perform a multi-threaded map operation over all allocated disjoint-set structure objects. We first skip any object which is not a root (ie, where obj.parent != obj). Then, starting from the representative line segment, we move out from there and collect and record any information desired about the line in question. We are assured that only one thread is looking at any given line at a time, as intersecting lines would have ended up in the same disjoint-set structure.

This has O(m) running time, and memory usage dependent on what information you need to collect about these line segments.

Summary

Overall, this algorithm should have O(x*y) running time, and O(x*y/k + k^2) memory usage. Adjusting k gives a tradeoff between transient memory usage on the phase 1 processes, and the longer-term memory usage for the adjacency arrays and disjoint-set structures carried over into phase 2.

Note that I have not actually tested this algorithm's performance in the real world; it is also possible that I have overlooked concurrency issues in the lockless disjoint-set union-merge algorithm above. Comments welcome :)

OTHER TIPS

You could use a non-generalized form of the Hough Transform. It appears that it reaches an impressive O(N) time complexity on N x N mesh arrays (if you've got access to ~10000x10000 SIMD arrays and your mesh is N x N - note: in your case, N would be a ridge struct, or cluster of A x B ridges, NOT a pixel). Click for Source. More conservative (non-kernel) solutions list the complexity as O(kN^2) where k = [-π/2, π]. Source.

However, the Hough Transform does have some steep-ish memory requirements, and the space complexity will be O(kN) but if you precompute sin() and cos() and provide appropriate lookup tables, it goes down to O(k + N), which may still be too much, depending on how big your N is... but I don't see you getting it any lower.

Edit: The problem of cross-thread/kernel/SIMD/process line elements is non-trivial. My first impulse tells me to subdivide the mesh into recursive quad-trees (dependent on a certain tolerance), check immediate edges and ignore all edge ridge structs (you can actually flag these as "potential long lines" and share them throughout your distributed system); just do the work on everything INSIDE that particular quad and progressively move outward. Here's a graphical representation (green is the first pass, red is the second, etc). However, my intuition tells me that this is computationally-expensive..

Passes

If the ridges are resolved enough that the breaks are only a few pixels then the standard dilate - find neighbours - erode steps you would do for finding lines / OCR should work.

Joining longer contours from many segments and knowing when to create a neck or when to make a separate island is much more complex

Okay, so having thought about this a bit longer, I've got a suggestion that seems like it's too simple to be efficient... I'd appreciate some feedback on whether it seems sensible!

1) Since I can easily determine whether each ridge_t ridge segment at is connected to zero, one or two adjacent segments, I could colour each one appropriately (LINE_NONE, LINE_END or LINE_MID). This can easily be done in parallel, since there is no chance of a race condition.

2) Once colouring is complete:

for each `LINE_END` ridge segment X found:
    traverse line until another `LINE_END` ridge segment Y found
    if X is earlier in memory than Y:
        change X to `LINE_START`
    else:
        change Y to `LINE_START`

This is also free of race conditions, since even if two threads are simultaneously traversing the same line, they will make the same change.

3) Now every line in the image will have exactly one end flagged as LINE_START. The lines can be located and packed into a more convenient structure in a single thread, without having to do any look-ups to see if the line has already been visited.

It's possible that I should consider whether statistics such as line length should be gathered in step 2), to help with the final re-packing...

Are there any pitfalls that I've missed?

Edit: The obvious problem is that I end up walking the lines twice, once to locate RIDGE_STARTs and once to do the final re-packing, leading to a computational inefficiency. It's still appears to be O(N) in terms of storage and computation time, though, which is a good sign...

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow