Assembling random indexed packages into an ordered sequence(s)?

https://softwareengineering.stackexchange.com/questions/370808

05-02-2021
|

题

I am pretty confident that this is well-known and solved problem.

Suppose there are n uniquely indexed "packages" randomly getting into single entry point like .AcceptPackage(Package package). Packages might be different in term of length, so I can't deterministically compute one's output-file position knowing it's index. Hence, I have to assemble those sequentially starting from the 1st (0st actually) and all way down till the last one (last one guaranteed to exist).

Then there is extremely costly operation .Write(Package[] packages). I'd like to call it, say, as long as k packages-successors gathered into a sequence.

Thus, I'm interested in the most efficient way of tracking those sequences.

My idea (not tested code, just a sketch):

lastVerifiedIndex = 0;
if (acceptedPackage.Index == lastVerifiedIndex + 1) {
    do {
        lastVerifiedIndex += 1;
    } while (_packages[lastVerifiedIndex + 1] != null);

if (lastVerifiedIndex == magicConstant) {
    Write(_packages.Take(magicConstant)); // will remove those
    lastVerifiedIndex = _packages[0]; // need to check array length here
}

Brief explanation: I'm keeping an eye on last sequential index. As long as I observe it's successor, I go further and hope to find some more successors. Then I check, whether sequence grew long enough and, if it did, I trigger expensive operation. Seems like complexity is liner: O(n), however someway I have a strange feeling like there might be hidden logariphmic solution...

解决方案

When the total number of packets is bounded, using a simple array similar to your suggestion is sufficient. You don't have to scan the array after each received packet to see whether you can flush part of the output. Instead, keep the last verified index as persistent state alongside the array, and advance it only as far as possible when the next index is received. This solution is optimal: you move this index past each packet exactly once. A visual example:

| 1 | 2 | _ | 4 | 5 | _ | _ | 8
        ^
        last verified

Only if packet 3 is received do we have to advance the last verified index:

| 1 | 2 | 3 | 4 | 5 | _ | _ | 8
                    ^
                    last verified

If the total number of packets is not bounded but the packets are received in roughly the correct order, use a circular buffer. These allow the index to wrap around at the end, which lets you reuse storage from items that were already flushed. Many circular buffer implementations have a fixed size, however they can also be implemented to be resizable. A circular buffer contains the following fields:

a storage array
the storage capacity
the start of the current items
the end of the current items

Example:

| 3 | 4 | _ | _ | 1 | 2 |
        ^       ^
       end     start

In your case, you do not need the end pointer as you are not using the circular buffer as a queue, but as an indexable array. Instead, you need the last verified pointer. You would grow the buffer if a packet index is out of bounds. To determine the correct index, you need to substract the current start index offset. This needs to be stored in another external variable.

For example, consider this buffer (V = last verified, S = start):

                        V
| _ | 4 | 5 | _ | 1 | 2 |  offset=1
                S

Upon receiving packet 8, we see the array does not have sufficient storage, and we resize (different layouts after resizing are possible):

                                V
| _ | 4 | 5 | _ | _ | 8 | 1 | 2 |  offset=1
                        S

Upon receiving packet 3, we can advance the last-verified pointer:

            V
| 3 | 4 | 5 | _ | _ | 8 | 1 | 2 |  offset=1
                        S

If we now dequeue the next 4 items, we would end up with:

            V
| _ | _ | 5 | _ | _ | 8 | _ | _ |  offset=5
        S

Upon receiving 7, 12, 10, 6 we would get this state in which we can dequeue the next 4 items:

                         V
| _ | 12 | 5 | 6 | 7 | 8 | _ | 10 |  offset=5
         S

                         V
| _ | 12 | _ | _ | _ | _ | _ | 10 |  offset=9
                         S

Upon receiving 9:

                                  V
| _ | 12 | _ | _ | _ | _ | 9 | 10 |  offset=9
                         S

So these offsets and indices can get quite tricky, but using circular buffers like this allow you to reorder packets with as little temporary storage as possible, without expensive bookkeeping operations. The maximum size of the buffer is determined by the largest difference packet index - start index offset. Processing all n packets will take asymptotically O(n) time, i.e. optimal.

This is OK if the packets are only mixed up a little (so are received in roughly the correct order). If they are shuffled truly randomly, the buffer could grow to the number of all packets (if the last packet is received first).

If the packets are shuffled randomly and storage is a more limiting factor than runtime, use a min-heap (priority queue). Heap data structures allow you to insert elements with a priority (here, the packet index), and retrieve them in correct order. Entering an element into a heap takes O(log n) or O(1) time for n elements in the heap. The next element can be peeked at in constant time. Removing the next element takes O(log n) time.

To combine this with your output buffer, keep an array of k items ready (if you want to write the output every k items), and keep track of the last verified number. After inserting an item into the heap, check the top element. If it is the next packet to be verified, remove it from the heap and store it in the output buffer.

Visual example (using k=4, and a binary heap):

Let's start with the state from the circular buffer:

     Output                Heap
v=2  | 1  | 2 | _ | _ |    | 4 | 5 | _ | _ |

Upon receiving 8, no resize is necessary:

     Output                Heap
v=2  | 1  | 2 | _ | _ |    | 4 | 5 | 8 | _ |

Upon receiving 3 into the heap, it ends in the top position:

     Output                Heap
v=2  | 1  | 2 | _ | _ |    | 3 | 4 | 8 | 5 |

We can now move 3 to the output buffer:

     Output                Heap
v=3  | 1  | 2 | 3 | _ |    | 4 | 5 | 8 | _ |

And move 4 to the output, and flush the buffer:

     Output                Heap
v=4  | _  | _ | _ | _ |    | 5 | 8 | _ | _ |

And move 5:

     Output                Heap
v=5  | 5  | _ | _ | _ |    | 8 | _ | _ | _ |

Quick examples of insertion for 7, 12, 10, 6:

insert 7
v=5  | 5  | _ | _ | _ |    | 7 | 8 | _ | _ |
insert 12
v=5  | 5  | _ | _ | _ |    | 7 | 8 | 12 | _ |
insert 10
v=5  | 5  | _ | _ | _ |    | 7 | 8 | 12 | 10 |
insert 6
v=5  | 5  | _ | _ | _ |    | 6 | 7 | 12 | 10 | 8 | _ | _ | _ |
v=6  | 5  | 6 | _ | _ |    | 7 | 8 | 12 | 10 | _ | _ | _ | _ |
v=7  | 5  | 6 | 7 | _ |    | 8 | 10 | 12 | _ | _ | _ | _ | _ |
v=8  | 5  | 6 | 7 | 8 |    | 10 | 12 | _ | _ | _ | _ | _ | _ |
flush
v=8  | _  | _ | _ | _ |    | 10 | 12 | _ | _ | _ | _ | _ | _ |

As a shortcut, it would of course be possible to avoid heap insertion if the packet is the next to be verified.

So while heaps could save you storage (they didn't in this scenario), you pay with increased time complexity. To process all n packets in order, you would now need up to O(n log n) time.

Note that the heap approach has the same time O(n log n) complexity as gathering all packets into an array and then sorting them. Like most sorting algorithms, it is a comparison-based algorithm. We do however use our knowledge that the packet indices are consecutive in order to remove items early, which keeps the heap size small. The array- and circular-buffer based approaches have better O(n) time complexity, because we use the packet index to determine the absolute output position (instead of only determining a relative order between packets).

The “best” approach depends on the total number of packets (for only a few packets, don't use complicated algorithms!) and on the probability distribution of the packet order. The circular buffer is optimal when the max packet index - verified packet index difference is bounded, otherwise the heap can have storage savings. For the worst case that all packets are received exactly in reverse, both will use O(n) storage.

许可以下： CC-BY-SA 和归因

不隶属于 softwareengineering.stackexchange