Question

So, I was learning the KNN algorithm, where I learnt about cross-validation to find an optimal value of k. Now I want to apply grid search to get the optimal value. I found an answer on Stack Overflow where both StandardScaler and KNN are passed as the estimator.

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('knn', KNeighborsClassifier(algorithm='brute'))
    ])
    params = {
        'knn__n_neighbors': [3, 5, 7, 9, 11]  # usually odd numbers
    }
    clf = GridSearchCV(estimator=pipe,
                       param_grid=params,
                       cv=5,
                       return_train_score=True)  # turn on CV train scores
    clf.fit(X, y)

My questions

  1. I am already applying StandardScaler to standardize the data before passing it to KNN. So here, do I still need to include StandardScaler in the estimator?

  2. Why are X and y passed instead of x_train and y_train, assuming x and y are the independent and dependent variables and x_train, y_train are produced by a train_test_split operation?

Any example of such code will be appreciated.

Solution

Looking into the linked answer, it appears that they are training directly on X and y because GridSearchCV already includes a k-fold cross-validation (5-fold by default). So you will already have a score for the classifier just by calling GridSearchCV with the defined pipeline.

That being said, I'd argue that it is never really the recommended approach to do this directly without a final test step to assess the performance of the trained model on unseen data. So even if you do a k-fold cross-validation, it is advisable to hold out a test set to get a final score, especially when the k-fold process involves hyper-parameter tuning, as in this case. In such cases you need another validation step that is independent of the tuning.
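A minimal sketch of that workflow, using a synthetic dataset (`make_classification` here just stands in for your own X and y): tune with GridSearchCV on a training split, then score once on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set BEFORE any tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ('sc', StandardScaler()),
    ('knn', KNeighborsClassifier(algorithm='brute'))
])
clf = GridSearchCV(pipe, {'knn__n_neighbors': [3, 5, 7, 9, 11]}, cv=5)

# The 5-fold cross-validation runs only on the training split
clf.fit(X_train, y_train)

print(clf.best_params_)           # k chosen by cross-validation
print(clf.score(X_test, y_test))  # final, independent estimate on unseen data
```

The test-set score at the end is the number you should report, since the cross-validation scores were used to pick k and are therefore optimistically biased.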

And in relation to your first question: no, you don't need to include a StandardScaler if the data is already standardised. Though, since you're using a pipeline, you might as well include all transformation logic in the pipeline, for the sake of simplicity.
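There is also a correctness reason to keep the scaler inside the pipeline: during cross-validation, the pipeline refits the scaler on each fold's training portion only, so no statistics leak from the validation fold. A small sketch (again with synthetic data in place of your own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is refit on each CV fold's training data automatically
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Pre-scaling the full X before cross-validation would instead compute the mean and standard deviation from all rows, including those later used for validation, which slightly inflates the CV scores.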

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange