Question

I'm trying to solve an interesting problem. One solution that seems to work well involves running K-Means inside a 'for loop'. The dataset in each iteration is independent and fairly small (mini-batch K-Means is not required).

e.g.

from sklearn.cluster import KMeans

for i in range(100):
    y = KMeans(n_clusters=k).fit_predict(x)  # x is new data per loop

Results are good, but execution is quite slow. Profiling shows that most of the delay is inside the K-Means algorithm (from scikit-learn).

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    40960   13.402    0.000   92.785    0.002 k_means_.py:426(_kmeans_single_elkan)
   243337    8.506    0.000   51.421    0.000 pairwise.py:165(euclidean_distances)
   931529    6.950    0.000    6.950    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   341357    5.178    0.000   32.424    0.000 validation.py:332(check_array)
2157218/1952418    5.010    0.000   28.877    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
   341357    4.617    0.000   15.927    0.000 validation.py:36(_assert_all_finite)
   117456    3.356    0.000    3.849    0.000 _k_means.pyx:258(__pyx_fuse_1_centers_dense)
   789264    2.617    0.000    9.859    0.000 fromnumeric.py:73(_wrapreduction)
    40960    2.584    0.000   37.103    0.001 k_means_.py:43(_k_init)
  1265786    2.053    0.000    2.053    0.000 {built-in method numpy.array}
   253435    2.015    0.000    2.015    0.000 {built-in method numpy.core._multiarray_umath.c_einsum}

I've already reduced the number of iterations (max_iter) and initialisations (n_init). Is there a way to speed up this execution?
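For reference, those settings map onto the KMeans constructor roughly as follows; the exact values below are illustrative assumptions, not the ones actually used:

from sklearn.cluster import KMeans

# Fewer random initialisations (n_init) and a lower iteration cap (max_iter)
# than the scikit-learn defaults; both trade accuracy for speed.
km = KMeans(n_clusters=8, n_init=3, max_iter=100)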


Solution

Using the multiprocessing library, I was able to parallelize the loop and speed up execution. Since each iteration works on an independent dataset, the K-Means fits can run in separate processes.
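A minimal sketch of this approach, assuming each iteration only needs its own small dataset; the synthetic data, cluster count and worker count below are placeholders:

import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans

def cluster_one(x):
    # Fit K-Means on one small, independent dataset and return its labels
    return KMeans(n_clusters=8, n_init=3).fit_predict(x)

if __name__ == "__main__":
    # Stand-in for the 100 independent datasets produced inside the loop
    rng = np.random.RandomState(0)
    datasets = [rng.rand(500, 4) for _ in range(100)]

    # One process per worker; each worker runs a full K-Means fit per dataset
    with Pool(processes=4) as pool:
        labels = pool.map(cluster_one, datasets)

Because scikit-learn and its BLAS backend may spawn threads of their own, it can also help to limit per-process threading (e.g. setting OMP_NUM_THREADS=1 for the workers) so the processes don't oversubscribe the CPU.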

Licensed under: CC-BY-SA with attribution