Question

I'm trying to run a simple scikit-learn Ridge regression with an array of sample weights. X_train is a ~200k by 100 2D NumPy array. I get a MemoryError when I use the sample_weight option; it works just fine without it. For the sake of simplicity I reduced the features to 2, and sklearn still throws a MemoryError. Any ideas?

from sklearn import linear_model

model = linear_model.Ridge()
model.fit(X_train, y_train, sample_weight=w_tr)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 449, in fit
    return super(Ridge, self).fit(X, y, sample_weight=sample_weight)
  File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 338, in fit
    solver=self.solver)
  File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 286, in ridge_regression
    K = safe_sparse_dot(X, X.T, dense_output=True)
  File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot
    return np.dot(a, b)
MemoryError

Solution

Setting sample weights changes the way the sklearn linear_model Ridge object processes your data, especially when the matrix is tall (n_samples > n_features), as in your case. Without sample weights it exploits the fact that X.T.dot(X) is a relatively small matrix (100 x 100 in your case) and therefore inverts a matrix in feature space. With sample weights given, the Ridge object decides to stay in sample space (in order to be able to weight the samples individually; see the branching to _solve_dense_cholesky_kernel in ridge.py, which works in sample space) and thus needs to invert a matrix of the same size as X.dot(X.T) (which in your case is n_samples x n_samples = 200000 x 200000 and causes a MemoryError before it is even created). This is really an implementation issue; please see the manual workaround below.
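To see why the sample-space path blows up, you can compare the two Gram matrices directly. A minimal illustration with a small random stand-in for your data (the shapes are the point, not the values):

import numpy as np

# Small stand-in for X_train so this actually runs; at the original
# 200000 x 100 size, the sample-space Gram matrix alone would need
# 200000**2 * 8 bytes, i.e. about 320 GB as float64.
X = np.random.randn(1000, 100)

K_feature = X.T.dot(X)  # feature space: (n_features, n_features) = (100, 100)
K_sample = X.dot(X.T)   # sample space: (n_samples, n_samples) = (1000, 1000)

print(K_feature.shape, K_sample.shape)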

TL;DR: The Ridge object is unable to treat sample weights in feature space, and will generate an n_samples x n_samples matrix, which causes your memory error.

While waiting for a possible remedy within scikit-learn, you could try to solve the problem in feature space explicitly, like so:

import numpy as np

X = X_train
alpha = 1.  # you did not specify this, but it is the Ridge object's default penalty
sample_weights = w_tr.ravel()  # make sure this is 1D
target = y_train.ravel()       # make sure this is 1D as well
n_samples, n_features = X.shape

# Solve the weighted normal equations (X^T W X + alpha * I) coef = X^T W y
coef = np.linalg.inv((X.T * sample_weights).dot(X) +
                     alpha * np.eye(n_features)).dot(X.T.dot(sample_weights * target))
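As a side note, explicit matrix inversion is fine for illustration, but solving the linear system directly with np.linalg.solve is numerically preferable and computes the same coefficients:

coef = np.linalg.solve((X.T * sample_weights).dot(X) + alpha * np.eye(n_features),
                       X.T.dot(sample_weights * target))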

For a new sample X_new, your prediction would be

prediction = np.dot(X_new, coef)

In order to confirm the validity of this approach, you can compare this coef to model.coef_ (after fitting the model) on a smaller number of samples (e.g. 300) that does not cause the memory error when used with the Ridge object.
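A minimal sketch of that sanity check, using the names from the question (fit_intercept=False keeps Ridge on exactly the penalized system solved above; alternatively, center the data first as described below):

from sklearn import linear_model
import numpy as np

# Subset small enough for Ridge's sample-space path to fit in memory
n_check = 300
Xs, ys, ws = X_train[:n_check], y_train[:n_check], w_tr[:n_check]

model = linear_model.Ridge(alpha=1., fit_intercept=False)
model.fit(Xs, ys, sample_weight=ws)

coef_check = np.linalg.solve((Xs.T * ws).dot(Xs) + 1. * np.eye(Xs.shape[1]),
                             Xs.T.dot(ws * ys))

print(np.allclose(model.coef_, coef_check))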

IMPORTANT: The code above only coincides with the sklearn implementation if your data are already centered, i.e. your data must have mean 0. Implementing a full ridge regression with intercept fitting here would amount to a contribution to scikit-learn, so it would be better to post it there. The way to center your data is as follows:

X_mean = X.mean(axis=0)
target_mean = target.mean()   # Assuming target is 1d as forced above

You then use the provided code on

X_centered = X - X_mean
target_centered = target - target_mean

For predictions on new data, you need

prediction = np.dot(X_new - X_mean, coef) + target_mean
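Putting the pieces together, here is a self-contained sketch of the whole workaround (weighted_ridge is a helper name introduced only for this example; data names follow the question):

import numpy as np

def weighted_ridge(X, y, w, alpha=1.):
    # Feature-space solve of sample-weighted ridge, with the intercept
    # handled by (unweighted) centering exactly as in the snippets above.
    w = w.ravel()
    y = y.ravel()
    X_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - X_mean
    yc = y - y_mean
    n_features = X.shape[1]
    coef = np.linalg.solve((Xc.T * w).dot(Xc) + alpha * np.eye(n_features),
                           Xc.T.dot(w * yc))
    intercept = y_mean - np.dot(X_mean, coef)
    return coef, intercept

coef, intercept = weighted_ridge(X_train, y_train, w_tr)
prediction = np.dot(X_new, coef) + intercept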

EDIT: As of April 15th, 2014, scikit-learn's ridge regression can deal with this problem (in bleeding-edge code). The fix will be available in the 0.15 release.

OTHER TIPS

What NumPy version do you have installed?

Looks like the ultimate method call that does it is numpy.dot(X, X.T), which with X.shape = (200000, 2) would generate a 200000 x 200000 matrix, roughly 200000**2 * 8 bytes, or about 320 GB as float64, no matter how few features you have.

Try converting your observations to a sparse matrix type, or reduce the number of observations you use. A variant of ridge regression that processes a few observations at a time may also help; see the sketch below.
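One batch-style option along those lines, assuming an L2-penalized linear model (rather than the exact closed-form ridge solution) is acceptable, is scikit-learn's SGDRegressor, which processes samples incrementally, never forms a Gram matrix, and accepts sample weights:

from sklearn import linear_model

# Approximate, incremental alternative to the closed-form solution.
# Note that SGDRegressor's alpha is not on the same scale as Ridge's.
sgd = linear_model.SGDRegressor(penalty='l2', alpha=1e-4)
sgd.fit(X_train, y_train, sample_weight=w_tr)
prediction = sgd.predict(X_new)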

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow