Question

I have a dataset of 5K records and 60 features, focused on binary classification. Please find my code below for SVM parameter tuning. It is running much longer than XGBoost, logistic regression, and random forest: those algorithms returned results within 10-15 minutes, whereas SVM has been running for more than 45 minutes.

Questions

1) Is SVM usually slower and takes longer time?

2) Is there any issue with my code below?

3) How can I make the gridsearch faster?

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['linear', 'rbf', 'poly'],
              'class_weight': ['balanced']}
svm = SVC()
svm_cv = GridSearchCV(svm, param_grid, cv=5)
svm_cv.fit(X_train_std, y_train)

Solution

Simple: the optimization problem behind SVM is of quadratic order. Just check the first line of the SVC documentation:

"The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples."
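A practical consequence of that quadratic scaling (not part of the original answer, just an illustration): if the linear kernel turns out to be competitive, `LinearSVC` uses the liblinear solver, which scales far better with the number of samples than `SVC`'s libsvm solver. A minimal sketch, using synthetic data of the same shape as the question's dataset in place of the real one:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy stand-in for the 5K x 60 dataset from the question.
X, y = make_classification(n_samples=5000, n_features=60, random_state=0)

# LinearSVC solves only the linear-kernel problem, but with a solver whose
# cost grows roughly linearly in n_samples rather than quadratically.
clf = LinearSVC(C=1.0, class_weight='balanced', dual=False, max_iter=10_000)
clf.fit(X, y)
print(clf.score(X, y))
```

This only helps for `kernel='linear'`; the rbf and poly kernels still need `SVC`.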

OTHER TIPS

1) Yes. As Noah Weber's answer already cites from the documentation:

"The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples."

2) There is nothing wrong with the code, but you are exhaustively searching a space of

7 (C) * 5 (gamma) * 3 (kernel) * 1 (class_weight) = 105 candidates, each fitted 5 times (once per fold) = 525 fits

which is pretty big for an SVM.
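One easy reduction, not mentioned in the original answer: the linear kernel ignores `gamma`, so a flat grid wastes fits on it. `GridSearchCV` accepts a list of dicts, which lets you search `gamma` only where it matters, and `n_jobs=-1` runs fits in parallel. A sketch on toy data standing in for the question's `X_train_std` / `y_train`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for the real training data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# gamma is only meaningful for rbf/poly, so search it only there.
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10],
     'class_weight': ['balanced']},
    {'kernel': ['rbf', 'poly'], 'C': [0.1, 1, 10],
     'gamma': [0.1, 0.01, 0.001], 'class_weight': ['balanced']},
]
svm_cv = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)  # parallel fits
svm_cv.fit(X, y)
print(svm_cv.best_params_)
```

Here the grid shrinks to 3 + 2*3*3 = 21 candidates instead of the full cross-product.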

3) To speed up training you can fit on a subsample, or simply reduce the search space. I normally use the following function:

def fit_cv_subsample(pipe_cv, X, y, n_max=10_000):
    '''
    Fit a CV search on a subsample of the first n_max rows
    (assumes the rows are in random order).
    Returns the fitted search object and its best estimator.
    '''
    X_sub = X[:n_max]
    y_sub = y[:n_max]
    pipe_cv.fit(X_sub, y_sub)
    # Optionally refit the best estimator on the full data:
    # pipe_cv.best_estimator_.fit(X, y)
    return pipe_cv, pipe_cv.best_estimator_

results, best_model = fit_cv_subsample(svm_cv, X, y)
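Beyond subsampling (my addition, not part of the original answer): `RandomizedSearchCV` caps the number of fits by sampling a fixed number of candidates from the space instead of enumerating the full grid, which often finds a comparable optimum at a fraction of the cost. A sketch with a continuous log-uniform prior over C and gamma:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Toy stand-in for the real training data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Sample 10 candidates instead of enumerating the full 105-point grid.
param_dist = {'C': loguniform(1e-3, 1e3),
              'gamma': loguniform(1e-4, 1e0),
              'kernel': ['linear', 'rbf', 'poly'],
              'class_weight': ['balanced']}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5,
                            random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

With `n_iter=10` and `cv=5` this is 50 fits instead of 525, and `n_iter` gives you direct control over the budget.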
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange