What is the best data structure and algorithm for comparing a list of strings?

https://stackoverflow.com/questions/4529872

13-10-2019
|

Question

I want to find the longest possible sequence of words that match the following rules:

Each word can be used at most once
All words are Strings
Two strings sa and sb can be concatenated if the LAST two characters of sa matches the first two characters of sb.

In the case of concatenation, it is performed by overlapping those characters. For example:

sa = "torino"
sb = "novara"
sa concat sb = "torinovara"

For example, I have the following input file, "input.txt":

novara

torino

vercelli

ravenna

napoli

liverno

messania

noviligure

roma

And, the output of the above file according to the above rules should be:

torino

novara

ravenna

napoli

livorno

noviligure

since the longest possible concatenation is:

torinovaravennapolivornovilligure

Can anyone please help me out with this? What would be the best data structure for this?

Solution

This can be presented as a directed graph problem -- the nodes are words, and they are connected by an edge if they overlap (and the smallest overlap is chosen to get the longest length), and then find the highest weight non-intersecting path.

(Well, actually, you want to expand the graph a bit to handle beginning and ending at a word. Adjoin a "starting node" with with an edge to each word of weight length word / 2. Between words, -overlap + length start + length finish / 2, and between each word and an "ending node" "length word / 2". Might be easier to double it.)

https://cstheory.stackexchange.com/questions/3684/max-non-overlapping-path-in-weighted-graph

OTHER TIPS

I would start really simple. Make 2 vectors of strings, one sorted normally, one sorted by the last 2 letters. Create an index (vector of ints) for the second vector that points out it's position in the first.

To find the longest.. first remove the orphans. words that don't match at either end to anything. Then you want to build a neighbor joining tree, this is where you determine which words can possibly reach each other. If you have 2 or more trees you should try the largest tree first.

Now with a tree your job is to find ends that are rare, and bind them to other ends, and repeat. This should get you a pretty nice solution, if it uses all the words your golden, skip the other trees. If it doesn't then your into a whole slew of algorithms to make this efficient.

Some items to consider: If you have 3+ unique ends you are guaranteed to drop 1+ words. This can be used to prune your tries down while hunting for an answer. recalculate unique ends often. Odd numbers of a given end ensure that one must be dropped(you get 2 freebies at the ends). Segregate words that can self match , you can always throw them in last, and they muck up the math otherwise. You may be able to create small self matching rings, you can treat these like self matching words, as long as you don't orphan them when you create them. This can make the performance fantastic, but no guarantees on a perfect solution.

The search space is order(N!) a list of millions of elements may be hard to prove an exact answer. Of course I could be overlooking something.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow