Minimize sequence storage by overlapping prefixes

https://cs.stackexchange.com/questions/110499

05-11-2019
|

Pergunta

I bumped into this problem today, and after a bit of pondering, I think I have a solution in $O(n^3)$, which is better than no solution or an $O(n!)$ solution, but my answer still isn't great. Can you beat my best answer?

The Real-World Impetus

Given a set of data $C$ that consists of nonoverlapping, nonempty, unique sequences $S_1, S_2, S_3, ...$ of atomic values. Each sequence tends to have a lot of repeated values.

For example, in the code where I bumped into the problem, the sequences were initially something like this:

    9, 1, 1, 1, 1, 1;
    5, 5, 5, 3, 2;
    1, 1, 1, 1, 4;
    1, 7, 5, 5;
    4, 4, 4;
    ...

To access the data, there's a secondary table of offsets into the overall data to show where each sequence starts: i.e., sequence $S_1$ starts at 1, and $S_2$ at 7, and $S_3$ at 13, and so on. But this means that the sequences may occur in any order, and they may even overlap.

Given that the data could overlap, it occurred to me that by reordering the sequences, I could reduce the storage required. For example, in the above data, I can reduce the size of the required storage by reordering the sequences as $S_1, S_3, S_5; S_4, S_2$:

    9, 1, 1, 1, 1, 1, 4, 4, 4;
    1, 7, 5, 5, 5, 3, 2;
    ...

By overlapping the prefixes and suffixes, we can store 16 values instead of the original 23, a considerable savings.

The Problem Statement

Generalizing this, the problem, then, is to find an algorithm for ordering an arbitrary set of sequences, so that the resulting ordering will have the maximum overall overlap of those sequences' prefixes and suffixes.

(And for what it's worth, I don't actually have to solve this for the code I was working on: This is purely a question of academic interest at this point.)

My Best Answer

It's easy to "see" an optimal answer for trivial examples like the one above, but in the general case of a large number of sequences of arbitrary values, it gets hard quickly. A greedy algorithm could likely at least give an acceptable answer in polynomial time, but that wouldn't be the true minimal answer. And obviously, you can simply try every possible ordering, but $O(n!)$ is not a very nice solution either.

It occurred to me a possible solution might exist as a variation on a longest-path-through-a-weighted-graph problem. You consider each sequence $S_1, S_2, S_3, ...$ as a node in a graph, and construct a directed edge anywhere $S_n$ has a prefix that matches a suffix of $S_m$ for all $n$ and $m$, weighted by the match length. This results in a two-step solution:

Calculate all the edge weights. The edge weights can be calculated trivially in $O(n^2)$, or maybe even in $O(n)$ using a trie constructed via dynamic programming.
Then for each vertex, use Dijkstra's shortest-path algorithm (or similar), inverted, to find the longest path for that vertex. Of all the possible longest paths, then take the overall longest. This is $O(n^3)$, or maybe $O(n^2\ log\ n)$ using priority queues.

I think this might work, with an overall time of $O(n^3)$ or maybe $O(n^2\ log\ n)$, but I haven't attempted to prove it for real (or implement it). There might also be a better technique than Dijkstra's algorithm for "longest path in the graph when you don't care what node it starts or ends on."

Can you do better?

Here's the problem statement again:

Find an algorithm that can efficiently order a set of sequences $S_1, S_2, S_3, ...$ such that there is maximal global overlap of their prefixes and suffixes.

Good luck; I'm curious to see if you can solve it better than I could!

Nenhuma solução correta

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange