How to speed up K-Means used in a 'for loop'
09-12-2020
Question
I'm trying to solve an interesting problem. One solution that seems to work well involves using K-Means in a 'for loop'. The dataset in each iteration is independent and fairly small (so MiniBatchKMeans is not required).
e.g.

from sklearn.cluster import KMeans

for i in range(100):
    y = KMeans(n_clusters=k).fit_predict(x)  # x is new data per loop
Results are good, but execution is quite slow. Profiling shows that most of the time is spent inside scikit-learn's K-Means implementation:
ncalls tottime percall cumtime percall filename:lineno(function)
40960 13.402 0.000 92.785 0.002 k_means_.py:426(_kmeans_single_elkan)
243337 8.506 0.000 51.421 0.000 pairwise.py:165(euclidean_distances)
931529 6.950 0.000 6.950 0.000 {method 'reduce' of 'numpy.ufunc' objects}
341357 5.178 0.000 32.424 0.000 validation.py:332(check_array)
2157218/1952418 5.010 0.000 28.877 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
341357 4.617 0.000 15.927 0.000 validation.py:36(_assert_all_finite)
117456 3.356 0.000 3.849 0.000 _k_means.pyx:258(__pyx_fuse_1_centers_dense)
789264 2.617 0.000 9.859 0.000 fromnumeric.py:73(_wrapreduction)
40960 2.584 0.000 37.103 0.001 k_means_.py:43(_k_init)
1265786 2.053 0.000 2.053 0.000 {built-in method numpy.array}
253435 2.015 0.000 2.015 0.000 {built-in method numpy.core._multiarray_umath.c_einsum}
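A profile like the one above can be collected with the standard-library cProfile and pstats modules. A minimal, self-contained sketch (the workload function is a stand-in for the clustering loop):

```python
import cProfile
import io
import pstats

def run_clustering_loop():
    # Stand-in for the K-Means 'for loop'; any workload can be profiled this way.
    total = 0
    for _ in range(100):
        total += sum(j * j for j in range(1000))
    return total

pr = cProfile.Profile()
pr.enable()
run_clustering_loop()
pr.disable()

# Sort by cumulative time to surface the functions where most time is spent.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumtime").print_stats(5)
print(buf.getvalue())
```

Sorting by `cumtime` (as in the table above) highlights which calls dominate end-to-end time, including time spent in their callees.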
I've already reduced the number of iterations and initialisations. Is there a way to speed up this execution?
Solution
Using the multiprocessing library, I was able to parallelize the loop and speed up execution.
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange