What is an efficient way to use the apply method on a column of a pandas DataFrame for a large dataset?
10-12-2020
Question
I have a dataset of approximately 100,000 records. I want to apply a function to each record for further data processing, but it takes a very long time (apply processes rows sequentially). I tried this in Google Colab with the GPU runtime selected, but it is still very slow. I also tried swifter.apply, but it was not efficient enough either. Is there any way to speed this up?
Solution
You can try pandarallel, which parallelizes pandas operations across CPU cores very efficiently. You can find more information about it in its documentation. Note that you should not use it if your apply function is a lambda. Assuming you're trying to apply a function to a DataFrame called df:
from pandarallel import pandarallel

# n is the number of workers used for parallelization; leave nb_workers
# out entirely and it will use all available cores
pandarallel.initialize(nb_workers=n)

def foo(x):
    return ...  # whatever you're trying to compute

df.parallel_apply(foo, axis=1)  # if you're applying row-wise across multiple columns
df[column].parallel_apply(foo)  # if it's just one column
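Since parallel_apply is a drop-in replacement for apply, you can first validate your function with plain pandas on a small sample, then switch to the parallel version. A minimal sketch using only pandas (the column names and foo here are hypothetical examples, not from the original question):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

def foo(row):
    # example row-wise computation: sum of the two columns
    return row["a"] + row["b"]

# plain apply; after pandarallel.initialize() this same call
# becomes df.parallel_apply(foo, axis=1)
result = df.apply(foo, axis=1)
print(result.tolist())  # [5, 7, 9]
```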
Another option is the Python multiprocessing library: break your DataFrame into smaller chunks and process them in parallel.
import numpy as np
import pandas as pd
from multiprocessing import cpu_count, Pool

cores = cpu_count()   # number of CPU cores on your machine
partitions = cores    # number of partitions to split the DataFrame into

def parallelize(df, func):
    df_split = np.array_split(df, partitions)  # split into roughly equal chunks
    pool = Pool(cores)
    df = pd.concat(pool.map(func, df_split))   # process chunks in parallel, then reassemble
    pool.close()
    pool.join()
    return df
Now you can run this parallelize function on your df. Note that here func receives a whole chunk of the DataFrame rather than a single row, so foo should call apply on the chunk internally:
df = parallelize(df, foo)
The more cores you have, the faster this will be!
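To see what the split-and-reassemble step does without spinning up a worker pool, here is a small sketch with a toy DataFrame (sizes and column name are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})
partitions = 4

# np.array_split tolerates sizes that don't divide evenly,
# producing chunks whose lengths differ by at most one
chunks = np.array_split(df, partitions)
print([len(c) for c in chunks])  # [3, 3, 2, 2]

# pd.concat reassembles the chunks in order, preserving the index
reassembled = pd.concat(chunks)
print(reassembled.equals(df))  # True
```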