python snippet optimization explanation

https://softwareengineering.stackexchange.com/questions/313163

13-12-2020
|

Pergunta

I'm attempting optimize and figure out why for input n a function I wrote f(n) appears to never finish executing when n > 26.

TL;DR I need to figure out how to find the complexity of various python instructions that involve concatenating an empty string and dictionary lookups all within a while loop. The rest appears to be O(1) or negligible. Read below for proof that I've RTFM'd extensively and need help grasping this

f(n) returns a random alphabetical string of length n:

def func(n, prefix=None):
    if prefix is None:
        empty = ""
    else:
        empty = prefix
    while (len(empty) < n):
        c = padder[random.randint(0,len(padder) -1)]
        if empty.find(c) == -1:
            str = empty+c
            empty = str
    return empty

I disassemble the Python bytecode

>>>dis.dis(func(8))
        0 BREAK_LOOP
        1 POP_BLOCK
        2 PRINT_NEWLINE
        3 INPLACE_POWER
        4 RETURN_VALUE
        5 PRINT_ITEM
        6 <69>
        7 IMPORT_STAR

However, when n > 26 it hangs. Why is the following is faster? (note: replace x with desired length of string)

print(''.join(random.choice(string.ascii_uppercase) for i in range(x)))

When I disassemble it like above, it appears to execute in O(n).

So far, I came to the conclusion that I should break down each operation, find their complexities, and that would determine why it hangs at n > 26. I found a cheatsheet for the time complexity of common python operations

However, it doesnt cover SETUP_LOOP, LOAD_CONST vs LOAD_FAST, etc. and other python instructions

Can someone please explain why my first script is slower than the second and at least point me to some good reference material? I google dork'd

intext:"python" intext:"time complexity" intext:"instruction"

Which was light on a cheatsheet. I found a good explanation but it was somewhat hard to follow

Solução

There is a significant difference in your two code snippets that causes them to perform differently.

Your first version create a string containing random unique letters from the source set. Your second version merely creates a string containing random letters from the set. So 'AABDEEE' would be generated for the second, but not the first.

So the second version's algorithm is:

while number of letters less than goal
    select a letter at random
    add it to the result

Your second is:

while number of letters less than goal
    do    
        select a letter at random
    while letter is already in goal
    add it to the result

Think about what happens as you select letters, assuming the source set is 26 letters.

First choice, you will always get one that works. Second choice, there is a 1/26 chance you have to re-pick. Third choice, there is a 2/26 chance you have to re-pick. . . Twenty-fifth choice, there is a 24/26 chance you have to re-pick.

Thus if you ask for a string that is 25 characters long, with 26 choices, then it will take 13 picks to find one that works. If you do the math, that means it has to do 71 calls to random.randint to fill a 25 character string. In contrast, the second version requires only 25 calls to random.choice to generate a 25 character string (with duplicates.)

if uniqueness is a requirement, then what you really want is an algorithm that uses some sort of shuffle. And in fact, the python random module has something that does exactly what I think you are trying to do:

 ''.join(random.sample(string.ascii_uppercase, 25))

If you want to do it by hand, the algorithm is likely something like:

 s = ascii letters
 result = ""
 for i in 1 to 25
     i = random int
     c = s[i]
     result += c
     delete s[i]

(Translation to python left as an exercise for the reader.)

Licenciado em: CC-BY-SA com atribuição

Não afiliado a softwareengineering.stackexchange