How to find all combinations of a multiset in a string in linear time?

https://stackoverflow.com/questions/8022925

22-02-2021

Question

I am given a bag B (multiset) of characters of size m and a text string S of size n. Is it possible to find all substrings of S that are permutations of B (at most m! orderings; for the example below, 4!/2! = 12 distinct ones, since c appears twice) in linear time O(n)?

Example:

S = abdcdbcdadcdcbbcadc (n=19)
B = {b, c, c, d} (m=4)
Result: {cdbc (position 3), cdcb (position 10)}, with positions counted from 0

The fastest solution I found is to keep a counter for each character and compare it with the bag at each step, so the runtime is O(n*m). I can show the algorithm if needed.
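
The question does not show its O(n*m) algorithm, so the following Python sketch is only one plausible reading of it: slide a window of length m over S, keep per-character counters, and compare them with the bag at every step. The function name `find_bag_windows_naive` is hypothetical.

```python
from collections import Counter


def find_bag_windows_naive(text, bag):
    """Return 0-based start positions of windows whose characters equal the bag."""
    m = len(bag)
    target = Counter(bag)
    if m == 0 or m > len(text):
        return []
    window = Counter(text[:m])
    positions = []
    for start in range(len(text) - m + 1):
        if start > 0:
            # Slide: drop the character that left, add the one that entered.
            window[text[start - 1]] -= 1
            window[text[start + m - 1]] += 1
        # Comparing against every distinct bag character costs up to O(m) per step.
        if all(window[c] == count for c, count in target.items()):
            positions.append(start)
    return positions


print(find_bag_windows_naive("abdcdbcdadcdcbbcadc", "bccd"))  # [3, 10]
```

Because the window always holds exactly m characters, matching the count of every bag character is enough to rule out any foreign character in the window.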

Solution 2

Thanks for the answer. The add() and remove() methods have to be changed to make the algorithm work correctly.

add(c):
    if hist[c] > 0 and histrun[c] < hist[c] then
        histrunsum++
    else
        histrunsum--

    histrun[c] = histrun[c] + 1


remove(c):
    if histrun[c] > hist[c] then
        histrunsum++
    else
        histrunsum--

    histrun[c] = histrun[c] - 1

Explanation: histrunsum can be seen as a score of how identical both multisets are.

add(c): when there are fewer occurrences of a char in the histrun multiset than in the hist multiset, the additional occurrence of that char has to be "rewarded", since the histrun multiset is getting closer to the hist multiset. If the histrun multiset already holds as many or more occurrences of that char, an additional one is penalized.

remove(c): like add(c), except that the removal of a char is weighted positively when its count in the histrun multiset is greater than its count in the hist multiset.

Sample Code (PHP):

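function multisetSubstrings($sequence, $mset)
{
    // Histogram of the target multiset.
    $multiSet = array();
    $substringLength = 0;
    foreach ($mset as $char)
    {
        $multiSet[$char] = ($multiSet[$char] ?? 0) + 1;
        $substringLength++;
    }

    $sum = 0;               // similarity score; equals $substringLength on a match
    $currentSet = array();  // histogram of the current window
    $result = array();
    $length = strlen($sequence);

    for ($i = 0; $i < $length; $i++)
    {
        if ($i >= $substringLength)
        {
            // Remove the character that leaves the window.
            $c = $sequence[$i - $substringLength];

            if (($currentSet[$c] ?? 0) > ($multiSet[$c] ?? 0))
                $sum++;
            else
                $sum--;

            $currentSet[$c]--;
        }

        // Add the character that enters the window.
        $c = $sequence[$i];

        if (($currentSet[$c] ?? 0) < ($multiSet[$c] ?? 0))
            $sum++;
        else
            $sum--;

        $currentSet[$c] = ($currentSet[$c] ?? 0) + 1;

        // A full score means the window is a permutation of the multiset.
        if ($sum == $substringLength)
            $result[] = $i + 1 - $substringLength;
    }

    return $result;
}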
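
For readers who want to try the scoring idea outside PHP, here is a minimal Python sketch of the same add/remove rules, run on the example from the question. The name `multiset_substrings` and the guard for an empty bag are assumptions added for illustration, not part of the original answer.

```python
from collections import Counter, defaultdict


def multiset_substrings(sequence, bag):
    """Return 0-based start positions of length-m windows matching the bag."""
    target = Counter(bag)
    m = len(bag)
    if m == 0:
        return []  # assumption: an empty bag yields no matches
    current = defaultdict(int)  # histogram of the current window
    score = 0                   # reaches m only when current equals target
    result = []
    for i, ch in enumerate(sequence):
        if i >= m:
            out = sequence[i - m]
            # Removing a surplus character is rewarded, anything else penalized.
            score += 1 if current[out] > target[out] else -1
            current[out] -= 1
        # Adding a still-missing character is rewarded, a surplus one penalized.
        score += 1 if current[ch] < target[ch] else -1
        current[ch] += 1
        if score == m:
            result.append(i + 1 - m)
    return result


print(multiset_substrings("abdcdbcdadcdcbbcadc", "bccd"))  # [3, 10]
```

Each character contributes at most its bag count to the score, and contributes exactly that only when its window count equals its bag count, which is why a score of m identifies a match.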

Other tips

There is a way to do it in O(n), assuming we're only interested in substrings of length m (otherwise it's impossible: for a bag that contains every character of the string, you'd have to return all substrings of S, which is an O(n^2)-sized result that can't be produced in O(n)).

The algorithm is as follows:

Convert the bag to a histogram:

hist = []
for c in B do:
    hist[c] = hist[c] + 1

Initialize a running histogram that we're going to modify (histrunsum is the total count of characters in histrun):
```
histrun = []
histrunsum = 0
```

We need two operations: add a character to the histogram and remove it. They operate as follows:

add(c):
    if hist[c] > 0 and histrun[c] < hist[c] then:
        histrun[c] = histrun[c] + 1
        histrunsum = histrunsum + 1

remove(c):
    if histrun[c] > 0 then:
        histrun[c] = histrun[c] - 1
        histrunsum = histrunsum - 1

Essentially, histrun captures the number of characters of B that are present in the current substring. If histrun is equal to hist, our substring has the same characters as B, and histrun is equal to hist iff histrunsum is equal to the length of B.

Now add the first m characters to histrun; if histrunsum is equal to the length of B, emit the first substring. Then, until we reach the end of the string, remove the first character of the current substring and add the next character.

add and remove are O(1) since hist and histrun are arrays; checking whether hist is equal to histrun is done by comparing histrunsum to length(B), so it's also O(1). The loop runs O(n) times, so the resulting running time is O(n).

Note that add() as written ignores surplus occurrences of a character while remove() still decrements for them, so histrun can drift away from the actual window contents; the solution above adjusts both operations to handle this.
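
To see why the solution above changes add() and remove(), the following Python sketch implements the capped operations as described in this answer (assuming remove() is meant to decrement histrunsum) and runs them on a small hypothetical input, ccbcd with B = {b, c, c, d}. The function name and the test string are illustrative and not from the original answer.

```python
from collections import Counter, defaultdict


def capped_window_matches(text, bag):
    """Sliding window using the capped add/remove described in this answer."""
    hist = Counter(bag)
    m = len(bag)
    histrun = defaultdict(int)
    histrunsum = 0
    found = []
    for i, ch in enumerate(text):
        if i >= m:
            out = text[i - m]
            # remove(out): assumed to decrement the running sum.
            if histrun[out] > 0:
                histrun[out] -= 1
                histrunsum -= 1
        # add(ch): only counted while the bag still needs this character.
        if hist[ch] > 0 and histrun[ch] < hist[ch]:
            histrun[ch] += 1
            histrunsum += 1
        if histrunsum == m:
            found.append(i + 1 - m)
    return found


# The window "cbcd" at position 1 is a permutation of the bag, but the third
# "c" was ignored on entry, so removing the first "c" leaves histrun[c] at 1.
print(capped_window_matches("ccbcd", "bccd"))  # []
```

The expected answer for this input would be position 1, so the capped bookkeeping misses a match whenever a surplus character enters the window before one of its earlier occurrences leaves.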

Use hashing. Assign a unique prime number to each character. Compute the hash of any string by multiplying together the primes of its characters, so that each prime is used as many times as its character occurs.

Example : CATTA. Let C = 2, A=3, T = 5. Hash = 2*3*5*5*3 = 450

Hash the multiset (treat it as a string). Now go through the input string and compute the hash of each substring of length k (where k is the number of characters in the multiset). Check whether this hash matches the multiset hash. If it does, that substring is one such occurrence.

The hashes can be computed very easily in linear time as follows :

Let multiset = { A, A, B, C }, A=2, B=3, C=5.

Multiset hash = 2*2*3*5 = 60

Let text = CABBAACCA

(i) CABB = 5*2*3*3 = 90

(ii) Now, the next letter is A, and the letter discarded is the first one, C. So the new hash = ( 90/5 )*2 = 36

(iii) Now, A is discarded, and A is also added, so new hash = ( 36/2 ) * 2= 36

(iv) Now B is discarded and C is added, so the hash = ( 36/3 ) * 5 = 60 = multiset hash. Thus we have found one such required occurrence: BAAC

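This procedure takes O(n) multiplications and divisions. By unique prime factorization, two hashes are equal exactly when the two multisets are equal, so there are no false positives. However, the product of k primes can exceed a machine word, so the O(n) bound assumes that each arithmetic operation takes O(1) time.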
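
The worked example above can be reproduced with a short Python sketch. It assumes primes are assigned to the characters of the text and the multiset in sorted order (which happens to give A=2, B=3, C=5 here); the helper names `first_primes` and `prime_hash_matches` are hypothetical.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for small alphabets)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def prime_hash_matches(text, bag):
    """Return 0-based starts of windows whose prime product equals the bag's."""
    k = len(bag)
    if k == 0 or k > len(text):
        return []
    # Every character that can appear needs its own prime, not only bag members.
    alphabet = sorted(set(text) | set(bag))
    prime_of = dict(zip(alphabet, first_primes(len(alphabet))))

    target = 1
    for ch in bag:
        target *= prime_of[ch]

    window = 1
    for ch in text[:k]:
        window *= prime_of[ch]

    matches = [0] if window == target else []
    for start in range(1, len(text) - k + 1):
        # Divide out the character that left, multiply in the one that entered.
        window = window // prime_of[text[start - 1]] * prime_of[text[start + k - 1]]
        if window == target:
            matches.append(start)
    return matches


print(prime_hash_matches("CABBAACCA", "AABC"))  # [3]
```

Python integers have arbitrary precision, so the product never overflows here; in a language with fixed-width integers the same code would need a big-integer type or a different hashing scheme.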

License: CC-BY-SA with attribution

Not affiliated with StackOverflow