Question

I have data with multiple labels, for example

enter image description here

My X set is fromt second to third column, and I want to classify either first column or the last column, so I made my Y the last column.

The goal is so that if I would classify Vios it would return me Car or 0 in other words it can find its way to the first row.

Classification use case:

classify("poodle") #just pretend this is a working function

returns: Pets

How I did it in attempt to train my model:

from sklearn.feature_extraction.text import TfidfVectorizer
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 72)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf3 = RandomForestClassifier().fit(X_train_tfidf, y_train)

I'm using a guide from somewhere on the net that works a bit the same with it, but at the end im getting returned:

ValueError: Found input variables with inconsistent numbers of samples: [5, 4156]

I knew immediately i was doing it wrong. How do I train a model so that it achieves my goal? Any relevant guides or techniques I should be following intead of this? I dont even know the correct way to use vectors on this case.

Was it helpful?

Solution

Couple of things: Cant replicate the problem exactly but if you follow these steps you are not exposing yourself

TfidfVectorizer is CountVectorizer + TfidfTransformer, you are exposing yourself to unnecessary complexity and potential errors

Use pipelines, cant stress this enough, its a compact way to pack all the sklearn transformers together AND THEN use fit, predict methods...

I would advise you to follow something like that, or find similiar problem here

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange
scroll top