sklearn linear regression for large data

https://stackoverflow.com/questions/22668489

21-06-2023
|

题

Does sklearn.LinearRegression support online/incremental learning?

I have 100 groups of data, and I am trying to implement them altogether. For each group, there are over 10000 instances and ~ 10 features, so it will lead to memory error with sklearn if I construct a huge matrix (10^6 by 10). It will be nice if I can update the regressor each time with batch samples of new group.

I found this post relevant, but the accepted solution works for online learning with single new data (only one instance) rather than batch samples.

解决方案

Take a look at linear_model.SGDRegressor, it learns a a linear model using stochastic gradient.

In general, sklearn has many models that admit "partial_fit", they are all pretty useful on medium to large datasets that don't fit in the RAM.

其他提示

Not all algorithms can learn incrementally, without seeing all of the instances at once that is. That said, all estimators implementing the partial_fit API are candidates for the mini-batch learning, also known as "online learning".

Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online so the memory and convergence rate are not affected by the batch size.

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow