Space complexity for storing integers in Python

https://cs.stackexchange.com//questions/116923

29-11-2019
|

Pregunta

So I was watching this mock interview by an Airbnb engineer on interviewing.io (https://youtu.be/cdCeU8DJvPM?t=1224) and around ~20:11 seconds he raises an interesting point.

The question that the interviewee was trying to solve was -

Array 1 has all unique integers.
Array 2 has the same integers as array 1, except one integer missing.
The code has to find that missing integer.

The interviewee suggested we can just subtract the sum of the second array from that of the first array and that'd give us the missing integer, and since these are just sums (integers), they'd take O(1) space. But the interviewer then said it's not actually constant since a sufficiently large sum can take more space than a smaller sum.

While that is true, and I double-checked it, via -

In [1]: import sys

In [2]: sys.getsizeof(90)
Out[2]: 24

In [3]: sys.getsizeof(9000000000000000000000000000000000)
Out[3]: 40

what complexity is it then?

Is saying constant space complexity for these algorithms wrong then?

Solución

It depends on the model of computation. In the transdichotomous model, which is the standard model in the analysis of algorithms, we assume that the word size is $w=O(\log n)$ bits, where $n$ is the size of input in bits. In this assumption, the sum of the input can be represented with 1 word, so the space complexity is $O(1)$ words. Measured in bits, the space complexity is $O(\log N + \log M)$, where $N$ is the number of integers in input and $M$ is maximum value in input. However note that normally we don't measure space complexity in bits: if we did, then the space complexity of most linear time algorithms, like DFS and BFS would have an extra $O(\log n)$ factor.

In my opinion, there is no single right answer when talking about space complexities like this in an informal setting like job interview. However you should keep in mind that having a single right answer is not the goal of job interviews.

Otros consejos

Resource usage always depends on your model of computation. If you're in a situation where integers can grow arbitrarily large then, yes, you need to take that into account.

One way of doing this is by assuming that integer variables take an amount of space that depends on the value stored. Another way is to use something like the word RAM model, which more or less assumes that integer variables are big enough to hold whatever numbers come up in the instances you care about. Another is to explicitly assume that whatever size of integer variable you fix is big enough. What solutions might be appropriate depends on what you're doing.

No, breaking the algorithm by increasing n (the amount of numbers) makes this 100% not working in constant space.

To the question nobody asked:

You could easily modify the algorithm to get the space to be constant. (Only the solution given in the interview has this problem. The "best" solution does not have this problem.)

Basically the size of the sum should not be a problem either. Assuming that each number can be stored in constant space of b bits you could easily modify the algorithm to use less space by doing this:

calculating the sum for each list/array mod $2^b$ ($r_1$ and $r_2$, with $r_1 ≥ r_2$)
- if $r_1 = r_2$ -> either the result is $0$ or the input is broken
getting the difference ($d = r_1 - r_2$)

Thanks to a possible overflow caused by the number missing:

calculate the inverted result $d' = 2^b - d = r_2 - r_1 + 2^b$ (mod $2^b$)
search for both numbers $d$ and $d'$ in both lists - the one found exactly once is the result

If space was limited (for any reason) this would be an option. It comes at a price: looping again over the input - which might not be possible if the data comes from a server and you do not have the storage to save it.

So almost constant space that only scales with size of the input - I would ignore it in this case because the size of numbers is often ignored. (-> 2 ints and 2 bools). For doubling the steps in the loops. (And 2*O(n) = O(n).)

Example:

b = 3 -> mod 8
list1 = 3,5,7 -> 15 (mod 8) = 7
list2 = 3,5 -> 8 (mod 8) = 0
d = 7 - 0 = 7
d' = 0 - 7 + 8 = 1

Example 2:

b = 3 -> mod 8
list1 = 1,3,5,7 = 16 (mod 8) = 0
list2 = 1,3,5 = 9 (mod 8) = 1
d = 1 - 0 = 1
d' = 0 - 1 + 8 = 7

For both: So either 7 or 1 should be exactly once in the 2 lists. Here 7 is both times the result and the 1 can be found only in amounts that are even.

(Writing a function to flip a bit when you find a given number while looping over those 2 lists should be easy.)

Optimization:

By using the length of the 2 arrays and comparing them you could easily get some information about which value ($d$ or $d'$) is not important. Calculate sum(longer array) - sum(shorter array) and you will have the number.

-> Single loop over both "arrays" in parallel and $2$ numbers of space would be the best case scenario. (And this would even work with some data stream that can't be repeated for some reason.)

In case you cannot use both arrays in parallel you would need one extra number ($≥2$ bits). Then you would add $1$ for each number in array1 and substract $1$ for each number in array2. The result is either $1$ or $-1$ which gives you information which list is longer. (Other results: something is wrong.)

As spitemaster suggested: Using XOR can be used to improve this even more. His idea reduces the amount of work a bit on older machines (or slower machines you can buy today - like µC) and allows you to save about 50% of the RAM (or 66% of the registers) needed above.

(If you implement the algorithm in hardware this would be really interesting, because XOR is much smaller/faster than an adder.)

The algorithm would simply run over both arrays using XOR on all values. The result is a mix of all numbers that appear 1 + 2*k times. So if there is only one such number you have the number.

Licenciado bajo: CC-BY-SA con atribución

No afiliado a cs.stackexchange