Question

The constructor of sklearn.linear_model.ElasticNetCV takes n_jobs as an argument. Quoting the documentation here:

n_jobs: int, default=None

Number of CPUs to use during the cross validation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

However, running the simple program below on my 4 core machine (spec details below) shows that performance is best when n_jobs = None, and progressively deteriorates as you increase n_jobs, all the way to n_jobs = -1 (supposedly requesting all cores)

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    # ------------- Setup  X, y -------------
    IMPORTANT_K = 10
    TOTAL_K = 30
    SAMPLE_N = 1000
    ERROR_VOL = 1

    np.random.seed(0)
    important_X = np.random.rand(SAMPLE_N, IMPORTANT_K)
    other_X = np.random.rand(SAMPLE_N, TOTAL_K - IMPORTANT_K)
    actual_coefs = np.linspace(0.1, 1, IMPORTANT_K)
    noise = np.random.rand(SAMPLE_N)
    y = important_X @ actual_coefs + noise * ERROR_VOL
    total_X = np.concatenate((important_X, other_X), axis=1)

    # ---------------- Setup ElasticNetCV -----------------
    LASSO_RATIOS = np.linspace(0.01, 1, 10)
    CV = 10

    def enet_fit(X, y, n_jobs: int):
        enet_cv = ElasticNetCV(l1_ratio=LASSO_RATIOS, fit_intercept=True,
                               cv=CV, n_jobs=n_jobs)
        enet_cv.fit(X=X, y=y)
        return enet_cv

    # ------------------- n_jobs test --------------
    N_JOBS = [None, 1, 2, 3, 4, -1]
    import time
    for n_jobs in N_JOBS:
        start = time.perf_counter()
        enet_cv = enet_fit(X=total_X, y=y, n_jobs=n_jobs)
        print('n_jobs = {}, perf_counter = {}'.format(n_jobs, time.perf_counter() - start))

What needs to be done to make this work as expected?

Some people seem to think this is broken on Windows, as stated here. Indeed I am running Windows 10 on an Intel i7-7700 4.2GHz machine, but unfortunately the question the above link points to does not have any commentary or answers, nor unfortunately do I have access to a Unix machine to test whether this works on Unix.

Some people here say running from an interactive session is a problem. I observe the above program performing the same way whether it is run in Jupyter Lab or as a script from the terminal.

I will also add that I have run this on several version combinations of sklearn and joblib, but have not been able to resolve the problem in that way.

One of those combinations is below, put together by conda env create when only specifying the python version python==3.6.10, thereby allowing conda to install the versions it deems most appropriate. I also include the numpy, mkl and scipy dependencies that were installed.

intel-openmp              2020.1                      216
joblib                    0.14.1                     py_0
mkl                       2020.1                      216
mkl-service               2.3.0            py36hb782905_0
mkl_fft                   1.0.15           py36h14836fe_0
mkl_random                1.1.0            py36h675688f_0
numpy                     1.18.1           py36h93ca92e_0
numpy-base                1.18.1           py36hc3f5095_1
python                    3.6.10               h9f7ef89_2
scikit-learn              0.22.1           py36h6288b17_0
scipy                     1.4.1            py36h9439919_0

The questions

Does the simple program above give you better performance with increasing n_jobs when you run it?

On what OS / setup?

Is there something that must be tweaked for this to work properly?

All help very much appreciated


Solution 3

Having profiled and stepped through sklearn's code, I've got some answers.

The summary:

Contrary to what has been suggested, sklearn's ElasticNetCV()'s poor scalability to n_jobs is not due to:

  • the overhead of launching threads or processes.
  • SequentialBackend always being used irrespective of n_jobs. (I cannot reproduce this problem as stated in n1tk's answer. However, I can confirm that SequentialBackend is used whenever n_jobs = 1, irrespective of the actual context backend. Rather than a bug, this seems to be reasonable behavior.)

The actual problem is sklearn's ElasticNetCV's default use of the threading parallel_backend, while the tasks it sends into joblib's Parallel() for parallelization spend most of their time holding onto the GIL.

Clearly this is counterproductive.

The above is true of the OP's problem size of n=1000 and k=30. Once n or k grow very large, then the coordinate descent algorithm, the only code running outside the GIL, begins to crunch enough cycles for threading to bring execution time down relative to sequential.

This means that for problems taking about a second to solve, sklearn's out of the box parallelism doesn't help. And if you have 10,000 of these problems to solve at any one time, you are going to have to find another parallelism solution or wait for the best part of three hours between runs.

This situation can be significantly improved in various ways with varying degrees of effort vs returns.

The detail

The ElasticNetCV call stack goes something like this:

ElasticNetCV.fit()
    Parallel(...)
        _path_residuals(...)
            enet_path(...)
                cd_fast.enet_coordinate_descent_gram(...)

It's not until cd_fast.enet_coordinate_descent_gram() that the GIL is released. You can see the with nogil: statement here

%prun shows that 99.3% of the time in this stack is spent in Parallel(...)

%lprun shows that only 40% of this time is spent in cd_fast.enet_coordinate_descent_gram(...)

The remaining 60% of the time spent in Parallel(...), is spent under the GIL, and is hence un-parallelizable.

A back of the envelope calculation suggests that the 40% of parallelizable code should allow 2 threads (more than 2 would not help) to complete the 100 tasks into which this problem is broken down in roughly 99 × 0.6% + 1 × 1% = 60.4% of sequential time: the GIL-held 60% of each of the first 99 tasks must run serially, and the last task then runs in full.

So why is the actual performance of n_jobs = 2 worse than that of n_jobs = 1?

I'm not sure, but on my 4 core machine (8 logical), n_jobs = 2 spawns 8 extra threads rather than just one extra thread (or two workers). With 60% of the execution un-parallelizable, this could simply cause an oversubscription problem. I do see a very large jump in kernel activity to around 50% of CPU usage in task manager when n_jobs = 2. This suggests oversubscription.

I don't yet know why setting n_jobs = 2, results in 8 additional threads, rather than just one (or two workers).

The simplest solution

The problem can be alleviated to some extent by instructing joblib to switch to the loky multi-processing backend.

This is done with a context manager:

from joblib import parallel_backend

with parallel_backend('loky'):
    ElasticNetCV.fit(...)
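For instance, applied to the question's setup (a minimal sketch assuming the enet_fit() helper and the total_X, y arrays defined there):

    from joblib import parallel_backend

    # enet_fit, total_X and y come from the question's script
    with parallel_backend('loky'):
        enet_cv = enet_fit(X=total_X, y=y, n_jobs=-1)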

On my 4 core (8 logical) machine, the most improvement using loky is achieved with n_jobs = -1 (equivalent to n_jobs = 8). This achieves a x2.7 performance improvement. That is some way off the x4 improvement I'd expect from fully parallelizable cython code on 4 cores (8 logical), and which I do get on this machine with such code.

Perhaps this is because loky is not entirely immune to oversubscription. Its spawned processes themselves spawn threads (2 additional threads in the original process and 4 threads on each sub-process on my 4 core machine). These threads compete for their process's GIL. Nevertheless, as there are more GILs to go around (one per spawned process) the performance is better than with the default threading backend.

It is also certain that process spawning and communication overhead contribute to the gap in performance between x2.7 and x4.

Finally, I am not sure how well this solution scales to 32 cores compared with a properly functioning multithreaded solution. I'd be grateful for commentary from someone able to run the OP's script on a 32 core machine using the simple with parallel_backend('loky'): fix.

For those who want more performance, or are irked by parallel infrastructure being sent un-parallelizable tasks, read on.

Fixing sklearn

There is a bug in sklearn's coordinate_descent.py whereby ElasticNetCV.fit(...) ultimately calls enet_path() without setting check_input = False. The checks are redundant once execution gets to enet_path(), and indeed the docs suggest setting check_input = False if the caller has already ensured what is being checked. It seems that in this instance sklearn has not followed its own advice, and has failed to implement this in ElasticNetCV's case.
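To illustrate what check_input controls, here is a minimal sketch of my own using the public enet_path() API (not ElasticNetCV's internal call path): the caller validates the arrays once, then skips the now-redundant checks.

    import numpy as np
    from sklearn.linear_model import enet_path
    from sklearn.utils import check_array

    # Validate/convert once: enet_path with check_input=False expects
    # Fortran-ordered float64 arrays.
    X = check_array(np.random.rand(1000, 30), dtype='float64', order='F')
    y = check_array(np.random.rand(1000), ensure_2d=False, dtype='float64')

    # The checks were already done above, so skip them inside enet_path
    alphas, coefs, dual_gaps = enet_path(X, y, l1_ratio=0.5, check_input=False)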

Fixing this bug cuts the non-parallelizable time from 60% to 30% and the total running time by 40% when running sequentially.

Yet the threading backend continues to be incapable of improving performance despite now only 30% of the execution time being un-parallelizable. Again, a back of the envelope calculation suggests 4 cores could be put to use here on our 100 subtasks, and complete in around 30% of sequential execution time. In practice however, 8 new threads are spawned when n_jobs = 2, and 4 additional threads for every further n_jobs unit increment. The end result is slower calculation times for n_jobs > 1, and the accompanying symptom of kernel dominating CPU usage (between 50% to 80%).

Once again, this suggests oversubscription as a result of too many threads competing for the same GIL. Once again I don't yet know why the threading backend spawns this seemingly over-inflated number of threads. Anyone know?

Performance using loky is also enhanced by this fix, with diminishing returns as you increase n_jobs. On my 4 core (8 logical) machine I got to x3.7 of the out of the box sklearn sequential time by setting n_jobs = 8 with the check_input = False fix. Note that this is only x2 of the check_input = False sequential time, and that a properly working multi-threaded solution would give us approx x4.

A caveat, and perhaps the reason why check_input = False has not been implemented in the official sklearn release: the input checks for MultiTaskElasticNet are only performed inside enet_path(), and they would need to be re-factored elsewhere for this fix not to potentially break the MultiTask version of the algorithm. However, as far as I can tell, the single output version will work fine without having to worry about this.

Other issues:

The tasks passed into Parallel(...) call np.dot(X.T, X) inside the GIL for each fold and (unnecessarily) for each l1_ratio. The unnecessary calculations could be removed and the necessary ones re-factored out of Parallel(...). According to my profiler, this would further reduce the GIL time of the parallel tasks from 30% to 20%.

Ultimately, you don't want code subject to the GIL going into Parallel(...) with a threading backend. You can either take things out, or cythonize (with nogil:) the things you put in.
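To see this effect in isolation, here is a toy sketch of my own (unrelated to sklearn's code): with the threading backend, a pure-Python task that holds the GIL gains nothing from extra workers, while a NumPy task that releases the GIL does.

    import time
    import numpy as np
    from joblib import Parallel, delayed, parallel_backend

    def gil_bound(n=200_000):
        # pure-Python loop: holds the GIL the whole time
        return sum(i * i for i in range(n))

    def gil_free(n=800):
        # BLAS matmul: releases the GIL while it runs
        a = np.random.rand(n, n)
        return (a @ a).sum()

    for func in (gil_bound, gil_free):
        for n_jobs in (1, 4):
            start = time.perf_counter()
            with parallel_backend('threading'):
                Parallel(n_jobs=n_jobs)(delayed(func)() for _ in range(20))
            print(func.__name__, 'n_jobs =', n_jobs,
                  'time =', round(time.perf_counter() - start, 2))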

OTHER TIPS

When I ran your script, I got the same impression, that n_jobs was hurting your performance. However, you have to consider that parallelizing the cross-validation only pays off if you have more data samples. With little data, the communication overhead is indeed more expensive than the processing cost of the task.

I tried your script with more samples, SAMPLE_N = 100000, and got the following results. Setup: macOS, i5, 8 GB.

n_jobs = None, perf_counter = 21.605680438
n_jobs = 1, perf_counter = 22.555127251
n_jobs = 2, perf_counter = 15.129894496000006
n_jobs = 3, perf_counter = 11.280528144999998
n_jobs = 4, perf_counter = 13.050180971000003
n_jobs = -1, perf_counter = 20.031103849000004

I will try to answer based on experience and an understanding of parallel computing in production for DS/ML models:

High-level answers to your questions:

  1. Does the simple program above give you better performance with increasing n_jobs when you run it? Answer: yes, as can be seen in the results below.
  2. On what OS / setup? Answer: OS: Ubuntu, 2 CPUs x 16 cores + 512 GB RAM, with python=3.7, joblib>=0.14.1 and sklearn>=0.22.1.
  3. Is there something that must be tweaked for this to work properly? Yes: change/force the parallel_backend to something other than sequential (this requires joblib's registered parallel_backend mechanism; you can also use sklearn.utils.parallel_backend). I tried putting the sequential sklearn model you have, with n_jobs=-1, into joblib's Parallel and it scaled hugely; I still need to check correctness, but I did see a huge improvement when scaling to 100 million samples on my machine, so it is worth testing, since I was amazed by the performance with a predefined backend.

My conda setup:

scikit-image              0.16.2           py37h0573a6f_0  
scikit-learn              0.22.1           py37hd81dba3_0  
ipython                   7.12.0           py37h5ca1d4c_0  
ipython_genutils          0.2.0                    py37_0  
msgpack-python            0.6.1            py37hfd86e86_1  
python                    3.7.6                h357f687_2    conda-forge
python-dateutil           2.8.1                      py_0  
python_abi                3.7                     1_cp37m    conda-forge
joblib                    0.14.1                     py_0  

Try to leave one core free for your machine by using n_jobs=-2 if you are on a personal machine or workstation. You can also increase your data size, since optimization on large data is what joblib is for (not all algorithms support this approach, but that is out of scope here), and change the backend, because by default it won't perform parallel tasks and will only use the sequential backend. Maybe with more data it switches to an "auto" mode, but I am not sure: I tested with 1k, 10k, 100k, 1 million and 10 million samples, and without the loky backend ElasticNetCV never left the sequential backend.


Joblib is optimized to be fast and robust on large data in particular and has specific optimizations for numpy arrays.

As an explanation, look at how the resources are calculated (quoting the joblib documentation):

For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. None is a marker for ‘unset’ that will be interpreted as n_jobs=1 (sequential execution)
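A quick way to check how a negative n_jobs resolves on your machine (a small sketch of my own; cpu_count and effective_n_jobs are joblib's public helpers):

    from joblib import cpu_count, effective_n_jobs

    # On a 4-CPU machine (my assumption here), this prints 4 and then 3,
    # since n_cpus + 1 + n_jobs = 4 + 1 - 2 = 3, i.e. all CPUs but one.
    print(cpu_count())
    print(effective_n_jobs(-2))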

Your code performs badly with n_jobs=-1, so try n_jobs=-2, due to the following facts:

  1. n_jobs=-1 does use all your CPU cores (based on the documentation), but you can switch to threads by registering a joblib parallel_backend on your machine. Using every core will be slow and will decrease performance if other processes also need CPU threads/cores; in your case this is happening (you have the OS and other processes that need CPU power to run). It is also not taking full advantage of "threading", so it uses "cores", which matches your performance issue.

As an example, you would use n_jobs=-1 in cluster mode, where each worker (as a container) has cores allocated to it and can take advantage of the parallel approach to distribute the optimization or computation part.

  2. You run out of CPU resources in this case, and also don't forget that parallelism isn't "cheap", because the SAME data is copied for each "job", so you get all that "allocation" at the same time.
  3. sklearn's parallel implementation isn't perfect, so in your case try n_jobs=-2, or, if you want to use joblib directly, you have more room to optimize the algorithm. Your CV part is where all the performance degradation happens, because that is what gets parallelized.

I will add the following from the joblib documentation to better understand how it works in your case, plus the limitations and differences:

backend: str, ParallelBackendBase instance or None, default: ‘loky’

    Specify the parallelization backend implementation. Supported backends are:

        “loky” used by default, can induce some communication and memory overhead when exchanging input and output data with the worker Python processes.
        “multiprocessing” previous process-based backend based on multiprocessing.Pool. Less robust than loky.
        “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy).
        finally, you can register backends by calling register_parallel_backend. This will allow you to implement a backend of your liking.

From the source code (_joblib.py) I can see that sklearn uses cores, and threads are only preferred, not the default, for some algorithms:

    import warnings as _warnings

    with _warnings.catch_warnings():
        _warnings.simplefilter("ignore")
        # joblib imports may raise DeprecationWarning on certain Python
        # versions
        import joblib
        from joblib import logger
        from joblib import dump, load
        from joblib import __version__
        from joblib import effective_n_jobs
        from joblib import hash
        from joblib import cpu_count, Parallel, Memory, delayed
        from joblib import parallel_backend, register_parallel_backend


    __all__ = ["parallel_backend", "register_parallel_backend", "cpu_count",
               "Parallel", "Memory", "delayed", "effective_n_jobs", "hash",
               "logger", "dump", "load", "joblib", "__version__"]

But your Elastic Net model does use "threads" as the preferred option on the CV part (_joblib_parallel_args(prefer="threads")), and there seems to be a bug on Windows where only cores are considered:

    mse_paths = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
                         **_joblib_parallel_args(prefer="threads"))(jobs)

Note: this answer comes from daily experience working to take advantage of Spark + joblib, with parallel_backend('spark') and parallel_backend('dask'). It scales and runs fast as expected, but don't forget that each executor I have has basically 4 cores and 4-32 GB RAM, so when doing n_jobs=-1 the parallel part of the joblib tasks runs inside each executor, and the copying of the SAME data isn't noticed since it is distributed.

It runs the CV and fit parts perfectly, and I do use n_jobs=-1 when performing the fit or CV parts.


My results with the OP's default setup:

# Without tracking/progress the execution is faster, but I needed to add progress for clarity:

    n_jobs = None, perf_counter = 1.4849148329813033 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 1, perf_counter = 1.4728297910187393 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 2, perf_counter = 1.470994730014354 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 4, perf_counter = 1.490676686167717 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 8, perf_counter = 1.465600558090955 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 12, perf_counter = 1.463360101915896 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 16, perf_counter = 1.4638906640466303 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 20, perf_counter = 1.4602260519750416 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 24, perf_counter = 1.4646347570233047 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = 28, perf_counter = 1.4710926250554621 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = -1, perf_counter = 1.468439529882744 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)
    n_jobs = -2, perf_counter = 1.4649679311551154 ([Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.)

# With tracking/progress: I needed to add progress + verbose output for clarity:

0%|          | 0/12 [00:00<?, ?it/s][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
  8%|▊         | 1/12 [00:02<00:31,  2.88s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = None, perf_counter = 2.8790326060261577
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 17%|█▋        | 2/12 [00:05<00:28,  2.87s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 1, perf_counter = 2.83856769092381
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 25%|██▌       | 3/12 [00:08<00:25,  2.85s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 2, perf_counter = 2.8207667160313576
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 33%|███▎      | 4/12 [00:11<00:22,  2.84s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 4, perf_counter = 2.8043343869503587
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.7s finished
 42%|████▏     | 5/12 [00:14<00:19,  2.81s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 8, perf_counter = 2.730375789105892
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
 50%|█████     | 6/12 [00:16<00:16,  2.82s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 12, perf_counter = 2.8604282720480114
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 58%|█████▊    | 7/12 [00:19<00:14,  2.83s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 16, perf_counter = 2.847634136909619
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.8s finished
 67%|██████▋   | 8/12 [00:22<00:11,  2.84s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 20, perf_counter = 2.8461739809717983
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
 75%|███████▌  | 9/12 [00:25<00:08,  2.85s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 24, perf_counter = 2.8684673600364476
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.9s finished
 83%|████████▎ | 10/12 [00:28<00:05,  2.87s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = 28, perf_counter = 2.9122865139506757
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.1s finished
 92%|█████████▏| 11/12 [00:31<00:02,  2.94s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
n_jobs = -1, perf_counter = 3.1204342890996486
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.3s finished
100%|██████████| 12/12 [00:34<00:00,  2.91s/it]
n_jobs = -2, perf_counter = 3.347235122928396

HERE the MAGIC starts:

So I would say this is actually the bug: even if n_jobs is specified, it won't take effect and the run still behaves as 'None' or '1'. The small differences in time are probably due to caching of results using joblib.Memory and checkpointing, but I need to look more at this part of the source code (I bet it is implemented, otherwise performing the CV would be expensive).

As a reference: these are the results using joblib and doing the parallel part with parallel_backend('loky'), in order to be able to specify the default backend used by Parallel inside the with block instead of the 'auto' mode:
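A minimal sketch of how such a run could be set up (my own reconstruction, assuming the enet_fit() helper and the total_X, y data from the question; the progress bar and joblib verbosity shown in the logs below are omitted):

    import time
    from joblib import parallel_backend

    N_JOBS = [None, 1, 2, 4, 8, 12, 16, 20, 24, 28, -1, -2]
    for n_jobs in N_JOBS:
        start = time.perf_counter()
        with parallel_backend('loky'):               # force loky instead of the sequential default
            enet_fit(X=total_X, y=y, n_jobs=n_jobs)  # enet_fit, total_X, y from the question
        print('n_jobs = {}, perf_counter = {}, sec'.format(n_jobs, time.perf_counter() - start))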

# Without tracking/progress the execution is faster, but I needed to add progress for clarity:

n_jobs = None, perf_counter = 1.7306506633758545, sec
n_jobs = 1, perf_counter = 1.7046034336090088, sec
n_jobs = 2, perf_counter = 2.1097865104675293, sec
n_jobs = 4, perf_counter = 1.4976494312286377, sec
n_jobs = 8, perf_counter = 1.380277156829834, sec
n_jobs = 12, perf_counter = 1.3992164134979248, sec
n_jobs = 16, perf_counter = 0.7542541027069092, sec
n_jobs = 20, perf_counter = 1.9196209907531738, sec
n_jobs = 24, perf_counter = 0.6940577030181885, sec
n_jobs = 28, perf_counter = 0.780998945236206, sec
n_jobs = -1, perf_counter = 0.7055854797363281, sec
n_jobs = -2, perf_counter = 0.4049191474914551, sec
Completed

The output below explains EVERYTHING about the limitations you have, the "IMPRESSION of the parallelism expected vs the parallelism actually done in the sklearn algorithm you have", and in general what is executing and how many workers are assigned:

# With tracking/progress: I needed to add progress + verbose output for clarity:

0%|          | 0/12 [00:00<?, ?it/s][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
    [Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.4s finished
8%|▊         | 1/12 [00:03<00:37,  3.44s/it][Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
        ......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

n_jobs = None, perf_counter = 3.4446191787719727, sec

 [Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.5s finished
 17%|█▋        | 2/12 [00:06<00:34,  3.45s/it]

n_jobs = 1, perf_counter = 3.460832357406616, sec

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:    2.0s finished
 25%|██▌       | 3/12 [00:09<00:27,  3.09s/it]

n_jobs = 2, perf_counter = 2.2389445304870605, sec

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    1.7s finished
 33%|███▎      | 4/12 [00:10<00:21,  2.71s/it]

n_jobs = 4, perf_counter = 1.8393192291259766, sec

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    1.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    1.3s finished
 42%|████▏     | 5/12 [00:12<00:16,  2.36s/it]

n_jobs = 8, perf_counter = 1.517085075378418, sec

[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=12)]: Done  77 out of 100 | elapsed:    1.5s remaining:    0.4s
[Parallel(n_jobs=12)]: Done 100 out of 100 | elapsed:    1.6s finished
 50%|█████     | 6/12 [00:14<00:13,  2.17s/it]

n_jobs = 12, perf_counter = 1.7410166263580322, sec

[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.1s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.7s finished
 58%|█████▊    | 7/12 [00:15<00:09,  1.81s/it]

n_jobs = 16, perf_counter = 0.9577205181121826, sec

[Parallel(n_jobs=20)]: Using backend LokyBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    1.6s
[Parallel(n_jobs=20)]: Done 100 out of 100 | elapsed:    1.9s finished
 67%|██████▋   | 8/12 [00:17<00:07,  1.88s/it]

n_jobs = 20, perf_counter = 2.0630648136138916, sec

[Parallel(n_jobs=24)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=24)]: Done   2 tasks      | elapsed:    0.0s
[Parallel(n_jobs=24)]: Done 100 out of 100 | elapsed:    0.5s finished
 75%|███████▌  | 9/12 [00:18<00:04,  1.55s/it]

n_jobs = 24, perf_counter = 0.7588121891021729, sec

[Parallel(n_jobs=28)]: Using backend LokyBackend with 28 concurrent workers.
[Parallel(n_jobs=28)]: Done 100 out of 100 | elapsed:    0.6s finished
 83%|████████▎ | 10/12 [00:18<00:02,  1.34s/it]

n_jobs = 28, perf_counter = 0.8542406558990479, sec

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.7s finished
 92%|█████████▏| 11/12 [00:19<00:01,  1.21s/it][Parallel(n_jobs=-2)]: Using backend LokyBackend with 31 concurrent workers.

n_jobs = -1, perf_counter = 0.8903687000274658, sec

[Parallel(n_jobs=-2)]: Done 100 out of 100 | elapsed:    0.5s finished
100%|██████████| 12/12 [00:20<00:00,  1.69s/it]

n_jobs = -2, perf_counter = 0.544947624206543, sec

# Here I show what is going on behind the scenes, to understand the differences in times and to explain the 'None' vs '1' execution time (it is all about the pickling process and the Memory/caching implementation for parallelism).
[Parallel(n_jobs=-2)]: Done  71 out of 100 | elapsed:    0.9s remaining:    0.4s
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
Pickling array (shape=(1000, 30), dtype=float64).
Pickling array (shape=(1000,), dtype=float64).
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
Pickling array (shape=(1000, 30), dtype=float64).
Pickling array (shape=(1000,), dtype=float64).
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
Pickling array (shape=(1000, 30), dtype=float64).
Pickling array (shape=(1000,), dtype=float64).
Pickling array (shape=(900,), dtype=int64).
Pickling array (shape=(100,), dtype=int64).
Pickling array (shape=(100,), dtype=float64).
[Parallel(n_jobs=-2)]: Done  73 out of 100 | elapsed:    0.9s remaining:    0.3s
Completed

Further reading:

  • n_jobs in sklearn
  • Parallelism in sklearn
  • Thread-based parallelism vs process-based parallelism

Licensed under: CC-BY-SA with attribution