Question

My friend came across this problem in an interview . We are given a file having some sentences. The sentences have only 0-9, a-z, A-Z and full-stop(.) , We need to read the file and store it in a manner that querying is faster. Time taken by this phase is not a concern. Here query will consist of some words, and we need to return the smallest sub-string having all these words. Order is not important. (Note: Assuming whole file can fit in main memory)

For example if the file is: "Ram was doing a computer science degree. Ram has a computer at home. Ram is now at home."

Query 1 :"Ram computer a" Output: "Ram has a computer" Query 2: "Ram home" Response: "home. Ram"

I thought of storing the file as a Link-List where each node consists of a word. If is a last word then word+fullstop is stored in node. At query time, we need to traverse the LL and find minimum string having all the words.

How can we optimize it further ? Can we store the file in a better way ?

Was it helpful?

Solution

You can store the file in the form of Suffix array. It could be constructed in O(N) time, where N is length of the file. Each query word could be found with binary search in O(M + log N) time, where M is the length of query word. As shown in this paper: "Replacing suffix trees with enhanced suffix arrays" by Mohamed Ibrahim Abouelhodaa, Stefan Kurtzb, Enno Ohlebusch, we may improve this to O(M).

Since time taken by pre-processing phase is not a concern, instead of suffix array, you can use a Trie. Just add each word of the input file into the trie and add position of this word in the file to the list of this word's positions (one such list is needed for each trie node).

After positions of all the query words are found in suffix array or in the trie, you have to sort them (for suffix array only, because they are already sorted in case of trie), then find a set of positions that is closest to each other:

  1. add smallest positions of all the query words to priority queue (may be implemented as min-heap),
  2. while the list of not-yet-processed positions for the top word in this priority queue is not empty, substitute topmost queue entry with next position of the same word. Every time some entry is added/removed from priority queue, add/remove position of corresponding word's end to some ordered collection (like binary search tree). Difference between the largest entry in this collection and the smallest entry in the priority queue allows to determine the smallest sub-string having all the query words.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top