Suffix Range c++

https://stackoverflow.com/questions/7165964

09-01-2021
|

Question

I am trying to build a Suffix range that is

if I have strings "catalog" "catalyst" "ban" "bany"

then suffix tree will be like

                            .
                           / \
                          c   b
                         /     \
                        a       a
                       /         \
                      t           n
                     / \         / \        
                    a   a       $   y 
                   /     \         / \
                  l       l       $    $
                 /         \
                o           y         
               /             \
              g               s
             / \               \
            $   $               t
                                /\
                               $   $

I want to find Suffix range of each string now .. that if I take string "Cat" then it should give me a range enclosing all its suffixes to which "cat" is a prefix. I need to use sentinels to separate each string.. may be a "$"

Can any one suggest me a best way to find out this using c++ . Any references will be helpfull. thank you

Solution

Here is, I guess, the most concise answer. :)

set<string> s;
string word = "ABC"
//Inserts.
// e.g. s.insert("ABCD");

for(set<string>::iterator it=s.begin();it!=s.end();++it)
    if(!(*it).compare(0,word.size(),word))
        cout<<*it<<endl;

Tested code! :P

OTHER TIPS

Much simpler answer than my first. You have a std::set of strings:

typedef std::set<std::string>::iterator iter_type;
std::set<std::string> data;

and a function named find() which returns a pair of iterators. The first iterator points at the beginning of the strings that match the prefix, and the last iterator is one past the last string that matches the prefix. If you have 10000 strings, this needs to only check about 26 of them.

std::pair<iter_type, iter_type> find(std::string substr) {
   std::pair<iter_type, iter_type> r;
   r.first = data.lower_bound(substr);
   substr[substr.size()-1]++; //I'm assuming substr is at least one character
   r.second = data.upper_bound(substr);
   return r;
}

Then, after the data has been loaded, you merely call the find(...) function, and it returns a pair of iterators pointing at the strings you want. You can use these as inputs to any standard algorithm, or do whatever.

int main() {
    data.insert("catalog");
    data.insert("catalyst");
    data.insert("ban");
    data.insert("bany");
    //find the region of strings beginning with "cat"
    std::pair<iter_type, iter_type> range = find("cat");
    //display them all
    for(iter_type i=range.first; i!=range.second; ++i)
        std::cout << *i << '\n';
}

Solution 1 : Space efficient Use Trie data structure (one-char is one node, one node can point to 26 different nodes) Find the last-node for given prefix. Print prefix+'path to all terminal nodes'

Solution 2 : Time efficient say you are interested in only first 3 prefix chars. Create a 3d array

 vector<string> arr[27][27][27]

Insert . if you want to insert
word : ABCD arr[A][B][C].push_back("D") word : BBBX arr[B][B][B].push_back("X")

Print : vector & a = arr[char1][char2][char3] for( string s in a) char1-char2-char3+ s

I posted an algorithm to solve a remarkably similar problem at Is there a suitable data structure for solving this question?. First, we create a suffix tree of nodes similar to

class node { //create a prefix node type
    node & operator=(const node & b); //UNDEFINED, NO COPY
    node & operator=(const node && b); //UNDEFINED, NO COPY
    node * next[27];  // pointers to nodes of the next letter (27th letter is $)
public:
    node(); 
    ~node();
    void add(char* mystring);
    void find(char* mystring, 
        std::vector<std::pair<int, std::string>>& out, 
        std::string sofar="");
}root;

And fill it. Then, to find all the substrings of "cata", we iterate through the tree according to the letters in "cata" (root[3]->[0]->['t'-'a'?]->[0]), and keep track of the string sofar. When we reach the end of mystring, we recursively try going down each child, instead of just the ones that match the substring, and wherever we find 'end' (letter 27), we push sofar onto out. Then we simply return, and out holds all the full strings beginning with "cata".

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow