Question

I am working on a classification problem using RandomForestClassifier. In the code, I split the dataset into training and test sets before making predictions.

Here's the code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
import numpy as np
from numpy import genfromtxt, savetxt

a = (np.genfromtxt(open('filepath.csv','r'), delimiter=',', dtype='int')[1:])
a_train, a_test = train_test_split(a, test_size=0.33, random_state=0)


def main():
    target = [x[0] for x in a_train]
    train = [x[1:] for x in a_train]

    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(a_test))]

    savetxt('filepath.csv', predicted_probs, delimiter=',', fmt='%d,%f', 
            header='Id,PredictedProbability', comments = '')

if __name__=="__main__":
    main()

On execution, however, I get the following error:

ValueError: Number of features of the model must match the input. Model n_features is 1434 and input n_features is 1435

Any suggestions as to how I should proceed? Thanks.

Solution

It looks like you are using a_test directly, without stripping out the output feature.

The model was fit on 1434 feature columns, but each row of a_test still has 1435 columns: the 1434 features plus the label in column 0. predict_proba rejects the input because the two widths differ.
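
One quick way to confirm this kind of mismatch is to compare row widths before calling predict_proba. The sketch below uses small made-up rows (4 columns rather than 1435), with the label in column 0 as in the question; the helper name n_columns is hypothetical and only there for illustration.

```python
# Toy rows standing in for a_train / a_test: column 0 is the label,
# the remaining columns are features (3 here instead of 1434).
a_train = [[1, 10, 20, 30], [0, 11, 21, 31]]
a_test = [[1, 12, 22, 32]]


def n_columns(rows):
    """Return the number of columns in the first row (hypothetical helper)."""
    return len(rows[0])


train = [row[1:] for row in a_train]  # label stripped: what the model is fit on
test = [row[1:] for row in a_test]    # label stripped: what the model should see

print(n_columns(train))   # width the model was fit on
print(n_columns(a_test))  # one column wider: the label is still present
print(n_columns(test))    # same width as the training features
```

If the first and second numbers differ by one, the label column is most likely still attached to the test rows.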

You can fix this by slicing a_test the same way you sliced a_train, so that the label column is dropped:

test = [x[1:] for x in a_test]

Then use test on the following line:

predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(test))]
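
For reference, here is a minimal end-to-end sketch of the corrected flow. It is not the asker's exact script: it assumes a recent scikit-learn, where train_test_split is imported from sklearn.model_selection (the sklearn.cross_validation module was removed in version 0.20), and it replaces the CSV with small synthetic data (5 feature columns, binary label in column 0). The names X_train, X_test, y_train and y_test are introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV: column 0 is a binary label,
# columns 1..5 are integer features.
rng = np.random.RandomState(0)
features = rng.randint(0, 10, size=(60, 5))
labels = np.arange(60) % 2
data = np.column_stack([labels, features])

a_train, a_test = train_test_split(data, test_size=0.33, random_state=0)

# Separate label and features the same way for both subsets.
y_train, X_train = a_train[:, 0], a_train[:, 1:]
y_test, X_test = a_test[:, 0], a_test[:, 1:]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Column 1 of predict_proba corresponds to rf.classes_[1], here the label 1.
probs = rf.predict_proba(X_test)[:, 1]
predicted_probs = [[index + 1, p] for index, p in enumerate(probs)]
```

Slicing with NumPy (a_test[:, 1:]) does the same job as the list comprehension above and keeps the data as an array, which may be more convenient for large files.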
Licensed under: CC-BY-SA with attribution