What is an efficient way to use the apply method on a column of a pandas DataFrame for a large dataset?
10-12-2020
Question
I have a dataset of approximately 100,000 records. I want to apply a function to each record for further data processing, but it takes a very long time (apply processes rows sequentially). I tried this in Google Colab with the GPU runtime selected, but it is still very slow. I also tried swifter.apply, but it was not efficient enough either. Is there any way to speed this up?
Solution
You can try pandarallel, which parallelizes pandas operations across CPU cores very efficiently. You can find more information about it in its documentation. Note that you should not use it if your apply function is a lambda. Assuming you're trying to apply a function to a DataFrame called df:
from pandarallel import pandarallel

# n is the number of workers used for parallelization; leave nb_workers
# out entirely and it will use all available cores
pandarallel.initialize(nb_workers=n)

def foo(x):
    return ...  # whatever you're trying to compute

df.parallel_apply(foo, axis=1)  # if you're applying row-wise across multiple columns
df[column].parallel_apply(foo)  # if it's just one column
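Since parallel_apply is a drop-in replacement for apply, you can first validate your function with plain pandas on a small sample, then switch to the parallel version. A minimal sketch using only pandas (the column names and foo here are hypothetical examples, not from the original question):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

def foo(row):
    # example row-wise computation: sum of the two columns
    return row["a"] + row["b"]

# plain apply; after pandarallel.initialize() this same call
# becomes df.parallel_apply(foo, axis=1)
result = df.apply(foo, axis=1)
print(result.tolist())  # [5, 7, 9]
```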
Another option is the Python multiprocessing library: break your DataFrame into smaller chunks and process them in parallel.
import numpy as np
import pandas as pd
from multiprocessing import cpu_count, Pool

cores = cpu_count()   # number of CPU cores on your machine
partitions = cores    # number of partitions to split the DataFrame into

def parallelize(df, func):
    df_split = np.array_split(df, partitions)  # split into roughly equal chunks
    pool = Pool(cores)
    df = pd.concat(pool.map(func, df_split))   # process chunks in parallel, then reassemble
    pool.close()
    pool.join()
    return df
Now you can run this parallelize function on your df. Note that here func receives a whole chunk of the DataFrame rather than a single row, so foo should call apply on the chunk internally:
df = parallelize(df, foo)
The more cores you have, the faster this will be!
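To see what the split-and-reassemble step does without spinning up a worker pool, here is a small sketch with a toy DataFrame (sizes and column name are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})
partitions = 4

# np.array_split tolerates sizes that don't divide evenly,
# producing chunks whose lengths differ by at most one
chunks = np.array_split(df, partitions)
print([len(c) for c in chunks])  # [3, 3, 2, 2]

# pd.concat reassembles the chunks in order, preserving the index
reassembled = pd.concat(chunks)
print(reassembled.equals(df))  # True
```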