sklearn linear regression for large data

https://stackoverflow.com/questions/22668489

21-06-2023
|

문제

Does sklearn.LinearRegression support online/incremental learning?

I have 100 groups of data, and I am trying to implement them altogether. For each group, there are over 10000 instances and ~ 10 features, so it will lead to memory error with sklearn if I construct a huge matrix (10^6 by 10). It will be nice if I can update the regressor each time with batch samples of new group.

I found this post relevant, but the accepted solution works for online learning with single new data (only one instance) rather than batch samples.

해결책

Take a look at linear_model.SGDRegressor, it learns a a linear model using stochastic gradient.

In general, sklearn has many models that admit "partial_fit", they are all pretty useful on medium to large datasets that don't fit in the RAM.

다른 팁

Not all algorithms can learn incrementally, without seeing all of the instances at once that is. That said, all estimators implementing the partial_fit API are candidates for the mini-batch learning, also known as "online learning".

Here is an article that goes over scaling strategies for incremental learning. For your purposes, have a look at the sklearn.linear_model.SGDRegressor class. It is truly online so the memory and convergence rate are not affected by the batch size.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow