How to optimize this python script further?

https://stackoverflow.com/questions/11577505

22-06-2021
|

Question

I've created this script to compute the string similarity in python. Is there any way I can make it run any faster?

tries = input()
while tries > 0:
    mainstr = raw_input()
    tot = 0
    ml = len(mainstr)
    for i in xrange(ml):
        j = 0
        substr = mainstr[i:]
        ll = len(substr)
        for j in xrange(ll):
            if substr[j] != mainstr[j]:
                break
            j = j + 1
        tot = tot + j
    print tot
    tries = tries - 1

EDIT: After applying some optimization this is the code, but it's not enough!

tries = int(raw_input())
while tries > 0:
    mainstr = raw_input()
    tot = 0
    ml = len(mainstr)
    for i in xrange(ml):
        for j in xrange(ml-i):
            if mainstr[i+j] != mainstr[j]:
                break
            j += 1
        tot += j
    print tot
    tries = tries - 1

EDIT 2: The third version of the code. It's still no go!

def mf():
    tries = int(raw_input())
    for _ in xrange(tries):
        mainstr = raw_input()
        tot = 0
        ml = len(mainstr)
        for i in xrange(ml):
            for j in xrange(ml-i):
                if mainstr[i+j] != mainstr[j]:
                    break
                j += 1
            tot += j
        print tot
mf()

Solution

You could improve it by a constant factor if you use i = mainstr.find(mainstr[0], i+1) instead of checking all i. Special case for i==0 also could help.

Put the code inside a function. It also might speed up things by a constant factor.

Use for ... else: j += 1 to avoid incrementing j at each step.

Try to find a better than O(n**2) algorithm that exploits the fact that you compare all suffixes of the string.

The most straight-forward C implementation is 100 times faster than CPython (Pypy is 10-30 times faster) and passes the challenge:

import os

def string_similarity(string, _cp=os.path.commonprefix):
    return sum(len(_cp([string, string[i:]])) for i in xrange(len(string)))

for _ in xrange(int(raw_input())):
    print string_similarity(raw_input())

The above optimizations give only several percents improvement and they are not enough to pass the challenge in CPython (Python time limit is only 8 time larger).

There is almost no difference (in CPython) between:

def string_similarity(string):
    len_string = len(string)
    total = len_string # similarity with itself
    for i in xrange(1, len_string):
        for n, c in enumerate(string[i:]):
            if c != string[n]:
                break
        else:
            n += 1

        total += n
    return total

And:

def string_similarity(string):
    len_string = len(string)
    total = len_string # similarity with itself
    i = 0
    while True:
        i = string.find(string[0], i+1)
        if i == -1:
            break
        n = 0
        for n in xrange(1, len_string-i):
            if string[i+n] != string[n]:
                break
        else:
            n += 1

        total += n
    return total

OTHER TIPS

You can skip the memory allocation inside the loop. substr = mainstr[i:] allocates a new string unnecessarily. You only use it in substr[j] != mainstr[j], which is equivalent to mainstr[i + j] != mainstr[j], so you don't need to build substr.

Memory allocations are expensive, so you'll want to avoid them in tight loops.

For such simple numeric scripts there are just two things you have to do:

Use PyPy (it does not have complex dependencies and will be massively faster)
Put most of the code in a function. That speeds up stuff for both CPython and PyPy quite drastically. Instead of:

some_code

do:

def main():
    some_code

if __name__ == '__main__':
    main()

That's pretty much it.

Cheers, fijal

Here's mine. It passes the test case, but may not be the absolute fastest.

import sys

def simstring(string, other):
    val = 0
    for l, r in zip(string, other):
        if l != r:
            return val
        val += 1
    return val


dsize = sys.stdin.readline()

for i in range(int(dsize)):
    ss = 0
    string = sys.stdin.readline().strip()
    suffix = string
    while suffix:
        ss += simstring(string, suffix)
        suffix = suffix[1:]
    sys.stdout.write(str(ss)+"\n")

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow