In python/numpy, I have a 10,000x10,000 array named random_matrix. I use md5 to compute the hash of str(random_matrix) and of random_matrix itself. It takes 0.00754404067993 seconds on the string version and 1.6968960762 seconds on the numpy array version. When I grow it to a 20,000x20,000 array, it takes 0.0778470039368 seconds on the string version and 60.641119957 seconds on the numpy array version. Why is this? Do numpy arrays take up a lot more memory than strings? Also, if I want to generate filenames that identify these matrices, is converting to a string before hashing a good idea, or are there drawbacks?
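
For reference, the timings might have come from something like the sketch below (a hypothetical reconstruction, since the original code isn't shown; the matrix is shrunk to 1,000x1,000 so the demo runs quickly):

import hashlib
import time

import numpy as np

random_matrix = np.random.rand(1000, 1000)

# Hash the (elided) string representation.
start = time.time()
hashlib.md5(str(random_matrix).encode()).hexdigest()
print("str version:   %.6f s" % (time.time() - start))

# Hash the array's raw bytes.
start = time.time()
hashlib.md5(random_matrix.tobytes()).hexdigest()
print("array version: %.6f s" % (time.time() - start))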

Solution

str(random_matrix) will not include all of the matrix, because numpy elides the middle of large arrays with "...":

>>> import numpy as np
>>> x = np.ones((1000, 1000))
>>> print(str(x))
[[ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]
 [ 1.  1.  1. ...,  1.  1.  1.]]

So when you hash str(random_matrix), you aren't really hashing all the data. The elided string is only a few hundred characters, while the full 10,000x10,000 float64 array is 800 MB of raw bytes, which is why the string version is so much faster. It also makes the string hash a poor identifier: two different matrices that agree only in the displayed corner elements would produce the same hash.
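
A quick way to see the size mismatch directly (a rough sketch; the exact string length varies by numpy version):

import numpy as np

x = np.ones((10000, 10000))
print(x.nbytes)     # 800000000 -- every float64 byte that md5 must consume
print(len(str(x)))  # only a few hundred characters in the elided preview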

See earlier Stack Overflow questions about how to hash numpy arrays.
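
In short, hashing the array's raw bytes covers every element. A minimal sketch (array_md5 is a hypothetical helper, not a library function); it folds the shape and dtype into the digest so that arrays with identical bytes but different layouts don't collide:

import hashlib

import numpy as np

def array_md5(a):
    # ascontiguousarray handles non-contiguous views (e.g. transposes).
    m = hashlib.md5()
    m.update(str(a.shape).encode())
    m.update(str(a.dtype).encode())
    m.update(np.ascontiguousarray(a).tobytes())
    return m.hexdigest()

x = np.random.rand(100, 100)
print(array_md5(x))  # stable hex digest, usable as a filename

Unlike hashing str(x), this is deterministic over all of the data, so two matrices that merely share their corner elements will get different digests.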
