Question

learning python as I go for school work. Essentially, I need to find the longest repeated substring in a list of string as shown here. I have been reading through this article and have an understanding of what I should do.

so far my implementation is as follows:

def long_rptr_subString(testList):
    longSubstring = ''

    if len(testList) > 1 and len(testList[0]) > 0:
        for i in range(len(testList[0])):
            for j in range(len(testList[0])-i+1):
                if j > len(longSubstring) and all(testList[0][i:i+j] in x for x in testList):
                    longSubstring = testList[0][i:i+j]
    return longSubstring

now when I call my function with let's say ['slide', 'glidb', 'flidt', 'cridz', 'bidr'] I get the correct result of 'id' as being my longest substring.

However, when I pass the list ['slide', 'glidb', 'flidt', 'cridz', 'bidr', 'balh', 'tejka', 'djakljskdl', 'blah', 'blah', 'blah'] I don't get any return results. I should be getting back 'blah' as my longest substring, but I haven't figured out a way to accomplish this. It seems I am only able to match substrings when they are in ALL items of the list. How might I modify my code/logic to get the longest substring that occurs more than once?

Thank you.

Was it helpful?

Solution

You can only match substrings in all items because that's exactly what you ask for:

all(testList[0][i:i+j] in x for x in testList)

Even if you change that, you can only find the longest substring that is in the first substring, because you only check through testlist[0] .

Instead, try something like:

def longest_substr(lst):
    longest = None
    for word in lst:
        for i in range(len(word)):
            for j in range(i+1, len(word)+1):
                if ((longest is None or (j - i > len(longest))) and
                    sum(word[i:j] in w for w in lst) > 1):
                    longest = word[i:j]
    return longest

This finds the longest substring that's in at least two (> 1) of the words (or will return None for an empty list or list of empty strings) and gives me the following results:

>>> longest_substr(['slide', 'glidb', 'flidt', 'cridz', 'bidr'])
'lid'
>>> longest_substr(['slide', 'glidb', 'flidt', 'cridz', 'bidr', 'balh', 'tejka', 'djakljskdl', 'blah', 'blah', 'blah'])
'blah'

Note that, once you remove the requirement that the substring must be in all strings, the longest in your first list of words is actually 'lid'.

OTHER TIPS

If you really want blah, then you need to remove the condition that your longest common subsequence must be in all strings. Thus, you could do this (note, you get lid for the first example):

def allCommonSubstrings(s1, s2):
    answer = set()
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    for i in range(len(s1)):
        for j in range(i+1, len(s1)+1):
            if s1[i:j] in s2:
                answer.add(s1[i:j])
    return answer

def longestCommonSubstring(strings):
    common = set()
    for s1,s2 in itertools.combinations(strings, 2):
        common = common.union(allCommonSubstrings(s1, s2))
    return max(common, key=len)

In [288]: longestCommonSubstring(['slide', 'glidb', 'flidt', 'cridz', 'bidr', 'balh', 'tejka', 'djakljskdl', 'blah', 'blah', 'blah'])
Out[288]: 'blah'

In [289]: longestCommonSubstring(['slide', 'glidb', 'flidt', 'cridz', 'bidr'])
Out[289]: 'lid'
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top