Question

I am trying to tune "n_neighbors" for a kNN model and I have the following problem:

Based on the mean cross-validation score, the optimal kNN model should be the one with 10 neighbors.

On the other hand, when I plot the "scores vs. neighbors" graphs, I see that there are models where the gap between the training and test scores is much smaller (for instance, the model with 20 neighbors).

I am new to ML and this is still very confusing to me, but should I stick with the 10-neighbor model, or is the 20-neighbor model better? How do I decide? Any help is much appreciated.

Here is my code and the graphs:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

best_score = 0
neighbors = np.arange(1, 31)
all_train_scores = []
all_test_scores = []

for n_neighbors in neighbors:
    reg = KNeighborsRegressor(n_neighbors=n_neighbors, metric='manhattan')

    # Mean 5-fold cross-validation score (R squared) on the training set
    score = cross_val_score(reg, X_train, y_train, cv=5)
    score = np.mean(score)

    if score > best_score:
        best_score = score
        optimal_choice = {'n_neighbors': n_neighbors}

    # Refit on the full training set and score on train and test
    reg.fit(X_train, y_train)
    train_score = reg.score(X_train, y_train)
    test_score = reg.score(X_test, y_test)

    all_train_scores = np.append(all_train_scores, train_score)
    all_test_scores = np.append(all_test_scores, test_score)

[Plot: training and test scores vs. number of neighbors]
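For reference, the "scores vs. neighbors" plot can be reproduced from the arrays collected above with a minimal sketch like the following (matplotlib is an assumption here, not part of the original code):

import matplotlib.pyplot as plt

# Plot the train/test scores gathered in the tuning loop above
plt.plot(neighbors, all_train_scores, label='train score')
plt.plot(neighbors, all_test_scores, label='test score')
plt.xlabel('n_neighbors')
plt.ylabel('R squared score')
plt.legend()
plt.show()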


Solution

It depends on a couple of things, but one of the important ones is how big your data set is.

Note that the difference is only about 0.01 in R squared, so if the data set is small, taking 20 samples to determine the prediction might be costly; on the other hand, if you can afford it, 10 samples might be too few to ensure a stable prediction.

Depending on your intended usage, I would factor in the data size.
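To make that trade-off concrete, here is a minimal sketch of one possible decision rule (a hypothetical heuristic with an assumed tolerance, not an established method): among the models whose mean CV score is within a small tolerance of the best one, pick the one with the smallest train-test gap.

import numpy as np

# Hypothetical decision rule, not a standard recipe. `tol` is an assumed
# value; `cv_scores` would be the per-candidate mean CV scores (the loop
# in the question keeps only the best one, so they would need to be
# collected as well).
def pick_n_neighbors(neighbors, cv_scores, train_scores, test_scores, tol=0.01):
    cv_scores = np.asarray(cv_scores)
    gaps = np.abs(np.asarray(train_scores) - np.asarray(test_scores))
    near_best = cv_scores >= cv_scores.max() - tol   # near-optimal candidates
    candidates = np.asarray(neighbors)[near_best]
    return int(candidates[np.argmin(gaps[near_best])])

# e.g. pick_n_neighbors(neighbors, all_cv_scores, all_train_scores,
#                       all_test_scores)  # may return 20 rather than 10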

Licensed under: CC-BY-SA with attribution