Question

#include<iostream>
#include<string.h>
#include<utility>
#include<algorithm>

using namespace std;

struct xx
{
    string x;
    short int d;
    int lcp;
};

bool compare(const xx a,const xx b)
{
    return a.x<b.x;
}

int findlcp(string a,string b)
{
    int i=0,j=0,k=0;
    while(i<a.length() && j<b.length())
    {
        if(a[i]==b[j])
        {
            k++;
            i++;
            j++;
        }
        else
        {
            break;
        }
    }
    return k;
}

int main()
{   
    string a="banana";
    xx b[100];
    a=a+'$';
    int len=a.length();
    for(int i=0;i<len;i++)
    {
        b[i].x=a.substr(i);
        b[i].d=i+1;
    }
    sort(b,b+len,compare);
    for(int i=0;i<len;i++)
        cout<<b[i].x<<" "<<b[i].d<<endl;
    b[0].lcp=0;
    b[1].lcp=0;
    for(int i=2;i<len;i++)
    {
        b[i].lcp=findlcp(b[i].x,b[i-1].x);
    }
    for(int i=0;i<len;i++)
        cout<<b[i].d<<" "<<b[i].lcp<<endl;      
}

This is a implementation of Suffix Array. What my question is in the wikipedia article construction is given as o(n) in worst case

So in my construction:

  1. I am sorting all the suffixes of the string using stl sort .This may at least a O(nlogn) in worst case.So here i am violating O(n) construction.
  2. Second one is in constructing a longest common prefix array construction is given O(n).But i think my implementation in O(n^2)

So for the 1st one i.e for the sorting

  • If i use count sort i may decrease to O(n).If i use Count sort is it correct?Is my understanding is correct?let me know if my understanding is wrong

  • And is there any way to find LCP in O(n) time?

Was it helpful?

Solution

First, regarding your two statements:

1) I am sorting all the suffixes of the string using stl sort. This may at least a O(nlogn) in worst case. So here i am violating O(n) construction.

The complexity of std::sort here is worse than O(n log n). The reason is that O(n log n) assumes that there are O(n log n) individual comparisons, and that each comparison is performed in O(1) time. The latter assumption is wrong, because you are sorting strings, not atomic items (like characters or integers).

Since the length of the string items, being substrings of the main string, is O(n), it would be safe to say that the worst-case complexity of your sorting algorithm is O(n2 log n).

2) Second one is in constructing a longest common prefix array construction is given O(n).But i think my implementation in O(n^2)

Yes, your construction of the LCP array is O(n2) because you are running your lcp function n == len times, and your lcp function requires O(min(len(x),len(y))) time for a pair of strings x, y.

Next, regarding your questions:

If I use count sort I may decrease to O(n). If I use Count sort is it correct? Is my understanding is correct? Let me know if my understanding is wrong.

Unfortunately, your understanding is incorrect. Counting sort is only linear if you can, in O(1) time, get access to an atomic key for each item you want to sort. Again, the items are strings O(n) characters in length, so this won't work.

And is there any way to find LCP in O(n) time?

Yes. Recent algorithms for suffix array computation, including the DC algorithm (aka Skew algorithm), provide for methods to calculate the LCP array along with the suffix array, and do so in O(n) time.

The reference for the DC algorithm is Juha Kärkkäinen, Peter Sanders: Simple linear work suffix array construction, Automata, Languages and Programming Lecture Notes in Computer Science Volume 2719, 2003, pp 943-955 (DOI 10.1007/3-540-45061-0_73). (But this is not the only algorithm that allows you to do this in linear time.)

You may also want to take a look at the open-source implementations mentioned in this SO post: What's the current state-of-the-art suffix array construction algorithm?. Many of the algorithms used there enable linear-time LCP array construction in addition to the suffix-array construction (but not all of the implementations there may actually include an implementation of that; I am not sure).

If you are ok with examples in Java, you may also want to look at the code for jSuffixArrays. It includes, among other algorithms, an implementation of the DC algorithm along with LCP array construction in linear time.

OTHER TIPS

jogojapan has comprehensively answered your question. Just to mention an optimized cpp implementation, you might want to take a look at here.

Posting the code here in case GitHub goes down.

const int N = 1000 * 100 + 5; //max string length

namespace Suffix{
    int sa[N], rank[N], lcp[N], gap, S;
    bool cmp(int x, int y) {
        if(rank[x] != rank[y])
            return rank[x] < rank[y];
        x += gap, y += gap;
        return (x < S && y < S)? rank[x] < rank[y]: x > y;
    }
    void Sa_build(const string &s) {
        S = s.size();
        int tmp[N] = {0};
        for(int i = 0;i < S;++i)
            rank[i] = s[i],
            sa[i] = i;
        for(gap = 1;;gap <<= 1) {
            sort(sa, sa + S, cmp);
            for(int i = 1;i < S;++i)
                tmp[i] = tmp[i - 1] + cmp(sa[i - 1], sa[i]);
            for(int i = 0;i < S;++i)
                rank[sa[i]] = tmp[i];
            if(tmp[S - 1] == S - 1)
                break;
        }
    }
    void Lcp_build() {
        for(int i = 0, k = 0;i < S;++i, --k)
            if(rank[i] != S - 1) {
                k = max(k, 0);
                while(s[i + k] == s[sa[rank[i] + 1] + k])
                    ++k;
                lcp[rank[i]] = k;
            }
            else
                k = 0;
    }
};
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top