linear fit by group in apply takes too long using pandas

https://stackoverflow.com/questions/23331057

10-07-2023

Question

I have a pandas DataFrame imported from a CSV file. I need to do a linear fit using two or more columns that the user specifies as a ';'-separated string; another DataFrame column defines the grouping. The code is straightforward:

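    from numpy import inf, nan
    from pandas import DataFrame
    from sklearn import linear_model

    def fit(data, x_names, y_name, fit_by):
        # x_names arrives as a ';'-separated string of column names
        x_names = x_names.split(sep=';')
        data['_out_'] = data[y_name]  #may need to create equation later
        # Treat infinities as missing, then drop rows with missing predictors
        data.replace([inf, -inf], nan, inplace=True)
        data.dropna(subset=x_names, inplace=True)
        # One fit per group; each group returns its coefficients as a column
        phi = data.groupby(fit_by).apply(lambda x: fit_group_func(x, x_names))
        phi.reset_index(inplace=True)
        # Reshape to one row per group and one column per coefficient
        phi = phi.pivot(index=fit_by, columns='level_1', values=0)
        phi.reset_index(inplace=True)
        x_names.insert(0, fit_by)
        phi.columns = x_names
        return phi

    def fit_group_func(df, x_names):
        # Fit a Bayesian ridge regression on one group's rows
        model = linear_model.BayesianRidge()
        return DataFrame(model.fit(df[x_names], df['_out_']).coef_.tolist())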

This code works well when the data has 147830 rows; the run time is not a problem. With 1881201 rows, however, it is very slow: nothing had been returned after 2 hours, so I killed the task.

I also noticed that CPU usage was as expected (15%, one core) until execution reached fit_group_func, at which point it dropped to zero, occasionally rose to 1%, and dropped again.

Note: I moved the fitting into a separate function, but that did not help. Previously the line read:

        phi = data.groupby(fit_by).apply(lambda x: DataFrame(model.fit(x[x_names], x['_out_']).coef_.tolist()))

Can somebody help me figure out how to optimize this code and make it faster? I'm currently running it on a Windows PC with 32 GB RAM and an 8-core processor. I also have access to a machine with 96 GB RAM and a 20-core processor, but I don't think the number of cores or the amount of RAM is the problem, unless I can run the code in multiprocessing mode or something similar.
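Before optimizing, it can help to check whether the per-group fit is slow or actually stalled. The following is a minimal sketch, not the code from the question: it replaces `groupby(...).apply(...)` with an explicit loop that times each fit and prints progress as it goes. The function name `timed_group_fits` and the synthetic `demo` data are hypothetical and exist only for illustration.

```python
import time

import numpy as np
import pandas as pd
from sklearn import linear_model


def timed_group_fits(data, x_names, y_name, fit_by):
    """Fit one BayesianRidge per group in an explicit loop, timing each fit."""
    rows = []
    for key, group in data.groupby(fit_by):
        start = time.perf_counter()
        model = linear_model.BayesianRidge()
        model.fit(group[x_names].to_numpy(), group[y_name].to_numpy())
        elapsed = time.perf_counter() - start
        # flush=True so progress is visible even if a later group stalls
        print(f"group={key!r} rows={len(group)} fit_seconds={elapsed:.4f}", flush=True)
        rows.append([key, *model.coef_])
    # One row per group, one column per fitted coefficient
    return pd.DataFrame(rows, columns=[fit_by, *x_names])


# Synthetic data, made up purely for illustration (not from the question).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "g": np.repeat(["a", "b", "c"], 50),
    "x1": rng.normal(size=150),
    "x2": rng.normal(size=150),
})
demo["y"] = 2.0 * demo["x1"] - 1.0 * demo["x2"] + rng.normal(scale=0.01, size=150)

phi = timed_group_fits(demo, ["x1", "x2"], "y", "g")
```

If the printed lines stop appearing while CPU usage sits near zero, the process is probably blocked rather than computing, which points toward the environment rather than the algorithm.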

Solution

I figured out what the problem was. I ran the script on a Linux box with a fresh Anaconda Python installation and it ran without errors, so I upgraded many packages and the problem went away.

For reference, IPython hung when using WinPython-64bit-3.3.3.3; after upgrading packages it worked (I am not sure which package was responsible, as I upgraded many of them). Finally, I installed WinPython-64bit-3.3.5.0 and the problem does not occur there either.
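Since it was not clear which upgrade fixed the hang, recording package versions before and after upgrading would make the two environments comparable. A minimal sketch, assuming Python 3.8 or newer (`importlib.metadata` is not available in the Python 3.3 builds mentioned above); the helper name `report_versions` and the list of packages are illustrative choices, not something the original setup used.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return {distribution name: installed version, or None if missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("python", sys.version.split()[0])
# Package list is an assumption: the libraries most relevant to this question.
for name, version in report_versions(
    ["pandas", "numpy", "scikit-learn", "ipython"]
).items():
    print(name, version or "not installed")
```

Running this in both the failing and the working environment and comparing the output narrows down which package changed.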

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow