Question

I'll use Python syntax and objects to represent the problem, but in reality it models rows in a SQL database, accessed through a Python API and ORM.

I have a list of numbers like this:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

From time to time, some numbers get removed and null spaces remain:

[0, 1, 2, None, None, 5, 6, None, None, None, 10]

What I need is to pack this set of numbers efficiently during a periodic maintenance step, both in an ordered and an unordered fashion, so that no null spaces remain between numbers.

So, in ordered fashion I need that list to become:

[0, 1, 2, 5, 6, 10, None, None, None, None, None]

And when unordered it doesn't really matter where each number goes, as long as there are no null spaces between them.

Numbers can be moved in contiguous blocks, and moving a block any number of places to the left or right costs the same; but there is a setup and teardown cost per update, which makes it much more efficient to move larger blocks and finish in as few updates as possible.

Right now I'm using the simplest solution: finding blocks of contiguous numbers and moving each one to the nearest free position on its left, one block at a time, until the list is packed. So, in the example, 5, 6 is moved 2 places left in a single update, and then 10 is moved 5 places left in another update.

[0, 1, 2, None, None, 5, 6, None, None, None, 10]

[0, 1, 2, 5, 6, None, None, None, None, None, 10]

[0, 1, 2, 5, 6, 10, None, None, None, None, None]
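In Python terms, that trivial block-at-a-time scheme can be sketched like this (`pack_ordered` is just an illustrative name; gaps are `None`):

```python
def pack_ordered(cells):
    """Greedy ordered compaction: slide each contiguous block of values
    left onto the first free slot, one update per block.
    Returns the packed list and the number of block moves performed."""
    out = list(cells)
    moves = 0
    dst = 0                      # first free position so far
    i = 0
    n = len(out)
    while i < n:
        if out[i] is None:
            i += 1
            continue
        # find the extent of this contiguous block
        j = i
        while j < n and out[j] is not None:
            j += 1
        if i != dst:             # block needs to move: one update
            out[dst:dst + (j - i)] = out[i:j]
            out[dst + (j - i):j] = [None] * (i - dst)
            moves += 1
        dst += j - i
        i = j
    return out, moves
```

On the example above this performs exactly the two updates described: first the 5, 6 block, then the 10.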

This trivial approach seems to be the most efficient when order matters, but in reality most of my operations will be unordered, and I think there should be a better approach. For instance, in this case the list can be packed in a single update by moving the 0, 1, 2 block between 6 and 10:

[None, None, None, None, None, 5, 6, 0, 1, 2, 10]

In reality there will be thousands of blocks, but I know the size of each block and each gap beforehand. Moving blocks is also very expensive compared to the computation needed for the combinatorics of block sizes and gaps, so finding the optimal solution is the goal.

This looks like a kind of bin-packing problem, but I really don't know how to approach finding the best solution. Any ideas?


Solution

For the unordered case, suppose that somebody tells you which cells the final contiguous block should occupy. One heuristic is then to assume that if you move the largest blocks outside this region into it first, everything will fit and you never have to break a block up. As suggested in the comment, you could run A* (or branch and bound) with this heuristic. Your first decision is where the final contiguous region should be, but that is just another level of A*/branch and bound: under this heuristic, the most promising target region is the one that currently holds the most filled-in sub-regions, because you are assuming that you only have to move in the sub-regions outside it.
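A small sketch of the "where should the target region go" part of that heuristic, counting filled cells (as a proxy for filled-in sub-regions) in every window as wide as the packed result (names are illustrative):

```python
def best_target_window(cells):
    """Score every candidate target window of width k (k = number of
    filled cells) by how many filled cells it already contains; the
    best-scoring window needs the fewest cells moved into it.
    Returns (start_index, filled_cells_inside)."""
    filled = [c is not None for c in cells]
    k = sum(filled)
    # sliding-window count of filled cells in each window of width k
    best_start, best_count = 0, sum(filled[:k])
    count = best_count
    for s in range(1, len(cells) - k + 1):
        count += filled[s + k - 1] - filled[s - 1]
        if count > best_count:
            best_start, best_count = s, count
    return best_start, best_count
```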

If you do find this is too expensive, one way to speed up branch and bound, at the cost of getting poorer answers, is to discard possible answers that could improve the best answer found so far by only X%, for some X.

Actually, I think you can get a slightly better lower bound: max(number of separate contiguous gaps in the target area, number of separate contiguous blocks to be moved in from the source area). This is slightly better because one move can at best shift in a single contiguous run of numbers and fill a single gap in the target area.
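That bound is cheap to compute directly; here is a Python sketch, where `start` and `width` describe a hypothetically chosen target window:

```python
def runs(bits):
    """Count maximal runs of True in a boolean sequence."""
    count, prev = 0, False
    for b in bits:
        if b and not prev:
            count += 1
        prev = b
    return count

def move_lower_bound(cells, start, width):
    """Lower bound on updates: one update fills at most one contiguous
    gap inside the target window and consumes at most one contiguous
    block outside it, so we need at least the larger of the two counts."""
    inside = cells[start:start + width]
    outside = cells[:start] + cells[start + width:]
    gaps_in_target = runs(c is None for c in inside)
    blocks_outside = runs(c is not None for c in outside)
    return max(gaps_in_target, blocks_outside)
```

On the example list, the window covering cells 5 through 10 gives a bound of 1, matching the single-update packing shown in the question.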

One easy way to get a lower bound is to ignore enough constraints to make the problem easy. As long as the unknown correct answer remains a feasible solution of the weakened problem, this gives a valid lower bound, because the best solution of the weakened problem must be at least as good as that answer. You can apply this to your problem with gappy updates by pretending that two updates never collide with each other. Given a specified target area, computing this heuristic amounts to finding the best way to cut the source area into chunks, each of which fits into the target area. You can solve that with a dynamic program: work out the best answer for the first n+1 cells of the source area by considering every possible way of copying in the last k cells, and adding on the cost of copying in the first n+1-k cells, which you have already worked out. Unfortunately, I have no idea whether this heuristic is strong enough to be useful.
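A minimal sketch of that dynamic program, under the stated no-collision assumption plus a further simplification: since colliding updates are ignored, a chunk is admissible whenever the largest target gap can hold it, and each chunk costs one update:

```python
def min_chunk_moves(source_len, gap_sizes):
    """dp[n] = fewest updates to place the first n source cells,
    cutting the source into chunks no larger than the biggest gap.
    Returns the lower bound on updates (inf if nothing fits)."""
    INF = float("inf")
    max_gap = max(gap_sizes, default=0)
    dp = [INF] * (source_len + 1)
    dp[0] = 0
    for n in range(1, source_len + 1):
        # last chunk covers source cells n-k .. n-1, for each legal size k
        for k in range(1, min(n, max_gap) + 1):
            if dp[n - k] + 1 < dp[n]:
                dp[n] = dp[n - k] + 1
    return dp[source_len]
```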

Other tips

The problem you describe is called the compaction problem. In the classical compaction problem (both the ordered and unordered variants), the cost of data movement is not considered prohibitive, so it can be trivially solved by copying the non-empty entries into auxiliary storage in a single linear scan. The new compacted storage can then replace the original, or be copied back over it, depending on the context. All of this takes linear time and only linear additional storage, so it is not considered a hard problem in the sense that bin packing is: for bin packing there is no easy solution whether or not you allow a linear amount of extra storage. So what we are dealing with here is clearly not bin packing.
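For concreteness, the auxiliary-storage linear scan is nearly a one-liner in Python (a sketch, not tied to any particular API):

```python
def compact_with_aux(cells):
    """Classical linear-time compaction: copy non-empty entries into
    auxiliary storage in one scan, then pad with None to the old length."""
    aux = [c for c in cells if c is not None]
    return aux + [None] * (len(cells) - len(aux))
```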

When data movement is costly, there is the additional constraint of minimizing the number of movements of non-contiguous data blocks. One can look at this problem as an instance of either of two problems:

  1. In-place sorting of a binary array. Here, you model your array as containing only two kinds of data -- 0s and 1s. This is easily achieved in your case with a predicate isNull(a) that returns 1 for an empty entry and 0 for a non-empty one. The simplest solution I can think of is Selection Sort on that binary array. It never does more than O(n) data movements in the worst case, even though it can make O(n²) comparisons -- but you don't mind that, since you only want to minimize the number of data movements. If there's no data to move, it moves nothing! Some improvements that complicate things a bit:

    • Swap blocks rather than individual entries. By this I mean that two blocks (one of zeros, the other of ones) can be swapped only if the zero block is bigger. You could also use the greedy heuristic that the next swap is always the one that minimizes the absolute difference of their lengths, i.e. abs(len(zeroBlock) - len(oneBlock)). This only works for the unordered instance of your problem.
    • Another optimization is a preprocessing pass to decide whether to sort ascending or descending.
    • Also, you might want to exclude the already-contiguous ends of the list.
  2. Garbage compaction. Essentially, the idea is to treat the free spaces as deallocated memory that needs to be garbage collected. For this, let me refer you to this interesting SO discussion thread and also this one. You might also find this research paper or this one useful.
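The plain Selection-Sort idea from item 1, with `isNull` as a predicate, might look like this in Python (an illustrative sketch; it only counts swaps that actually move data):

```python
def selection_compact(cells, is_null=lambda c: c is None):
    """Treat the array as binary via an is_null predicate and sort it in
    place (non-null entries first), counting data movements. At most
    O(n) swaps are performed, though up to O(n^2) comparisons."""
    out = list(cells)
    moves = 0
    for i in range(len(out)):
        if not is_null(out[i]):
            continue
        # find the next non-null entry to pull into this null slot
        j = next((j for j in range(i + 1, len(out))
                  if not is_null(out[j])), None)
        if j is None:
            break                # only nulls remain: done
        out[i], out[j] = out[j], out[i]
        moves += 1
    return out, moves
```

Note that on the question's example this costs 3 swaps where the block-at-a-time approach needs only 2 updates, which is exactly why the block-swapping refinements above are worth considering.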

Good luck!

The following C snippet illustrates the basic compaction pass: non-empty entries are packed to the front of the buffer with one memmove per contiguous block.

#include <stdio.h>
#include <string.h>

/* anything at or below '@' in ASCII counts as an empty cell */
#define IS_EMPTY(c) ((c) <= '@')

unsigned moverup(char buff[], unsigned size)
{
    unsigned src, dst, cnt;

    for (src = dst = cnt = 0; src < size; src++) {
        if (!IS_EMPTY(buff[src])) { cnt++; continue; }
        if (!cnt) continue;
        /* an empty cell ends the current block: slide the block down */
        memmove(buff + dst, buff + src - cnt, cnt);
        dst += cnt;
        cnt = 0;
    }
    if (cnt) {
        /* the buffer ended inside a block: slide the tail down too */
        memmove(buff + dst, buff + size - cnt, cnt);
        dst += cnt;
    }
    /* dst is the number of non-empty entries, now packed at the front */
    return dst;
}

int main(void)
{
unsigned result;
char array[] = "qwe@rty@ui#op";

printf("Before:%s\n", array );

result = moverup (array, strlen (array) );

printf("result:%u\n", result );
// entries beyond result still hold their old characters;
// uncomment the next line to truncate the string there:
// array[result] = 0;
printf("After:%s\n", array );

return 0;
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow