Question

The code below (to compute cosine similarity), when run repeatedly on my computer, will output 1.0, 0.9999999999999998, or 1.0000000000000002. When I take out the normalize function, it will only return 1.0. I thought floating point operations were supposed to be deterministic. What would be causing this in my program if the same operations are being applied on the same data on the same computer each time? Is it maybe something to do with where on the stack the normalize function is being called? How can I prevent this?

#! /usr/bin/env python3

import math

def normalize(vector):
    sum = 0
    for key in vector.keys():
        sum += vector[key]**2
    sum = math.sqrt(sum)
    for key in vector.keys():
        vector[key] = vector[key]/sum
    return vector

dict1 = normalize({"a":3, "b":4, "c":42})
dict2 = dict1

n_grams = list(list(dict1.keys()) + list(dict2.keys()))
numerator = 0
denom1 = 0
denom2 = 0

for n_gram in n_grams:
    numerator += dict1[n_gram] * dict2[n_gram]
    denom1 += dict1[n_gram]**2
    denom2 += dict2[n_gram]**2

print(numerator/(math.sqrt(denom1)*math.sqrt(denom2)))

Solution

Floating-point math may be deterministic, but the ordering of dictionary keys is not.

When you call .keys() on a dict with string keys, the order in which the keys come back is not guaranteed to be the same from one run of the program to the next: older Python 3 releases randomize string hashing by default, and dict iteration order there depends on the hashes. (In CPython 3.6+, and as a language guarantee from 3.7, dicts preserve insertion order, so this particular source of run-to-run variation goes away.)

Thus the order of the additions inside your loops can also vary from run to run, and the result is not deterministic: while any single floating-point operation is deterministic, floating-point addition is not associative, so summing the same values in a different order can round differently at each step and produce a result that differs in the last bit or two.
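As a quick, self-contained illustration of that non-associativity (unrelated to dictionaries), adding the same three literals in two different groupings already disagrees:

a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
print(a == b)           # False: each intermediate sum rounds differently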

You could enforce a consistent order by sorting your key lists.
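A minimal sketch of that fix, applied to the code from the question (the only behavioral change is that every loop now walks the keys in sorted order; renaming the shadowed built-in sum to total is just a style cleanup):

#! /usr/bin/env python3

import math

def normalize(vector):
    total = 0
    for key in sorted(vector):            # fixed, reproducible key order
        total += vector[key]**2
    total = math.sqrt(total)
    for key in sorted(vector):
        vector[key] = vector[key]/total
    return vector

dict1 = normalize({"a": 3, "b": 4, "c": 42})
dict2 = dict1

# Same concatenation as in the question, just sorted so the
# summation order is identical on every run.
n_grams = sorted(list(dict1.keys()) + list(dict2.keys()))

numerator = 0
denom1 = 0
denom2 = 0

for n_gram in n_grams:
    numerator += dict1[n_gram] * dict2[n_gram]
    denom1 += dict1[n_gram]**2
    denom2 += dict2[n_gram]**2

print(numerator/(math.sqrt(denom1)*math.sqrt(denom2)))

Sorting pins down the summation order, so the rounding happens the same way every time; it does not force the answer to be exactly 1.0, it just makes whichever value you get reproducible.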
