Error encoding categorical features using sklearn pipelines

https://datascience.stackexchange.com/questions/61323

02-11-2019

Question

I am new to sklearn pipelines and am using this post as a guide for my code:

https://www.codementor.io/bruce3557/beautiful-machine-learning-pipeline-with-scikit-learn-uiqapbxuj

I am trying to encode a categorical feature using a transformation pipeline. As far as I can tell from reading other posts, scikit-learn 0.20 and later should be able to handle categorical variables stored as strings (namely with OneHotEncoder). However, no matter which encoder I use, I get the same error:

ValueError: could not convert string to float: 'Male'
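
The claim about string support can be checked in isolation. The following is a minimal sketch, assuming scikit-learn 0.20 or later, that fits OneHotEncoder directly on the string column with no other step in front of it. The variable names `X`, `encoder`, and `encoded` are illustrative and not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy column as in the question.
X = pd.DataFrame({"Sex": ["Male", "Female", "Female"]})

# No imputer in front of the encoder: the string column goes in directly.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X).toarray()  # sparse by default, so densify

print(encoder.categories_)  # learned categories, sorted alphabetically
print(encoded)              # one column per category
```

If this runs cleanly, the encoder itself accepts strings, which suggests the error is raised by an earlier step in the pipeline.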

In the code below, replace xxxxxxxxxxx with any one of the following encoders:

ce.OneHotEncoder
ce.TargetEncoder
OneHotEncoder
OrdinalEncoder

from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
import category_encoders as ce
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

# create example data
example_df = pd.DataFrame({'Sex':['Male','Female','Female'],'Survived':[1,1,0]})
X_train = example_df.drop('Survived', axis=1)
y_train = pd.DataFrame(example_df['Survived'])

# build example pipeline
cat_pipe = ("categorical_features", ColumnTransformer([
            ("categorical", Pipeline(steps=[
                ("impute_stage", Imputer(missing_values=np.nan, strategy="median")),
                ("label_encoder", xxxxxxxxxxx())]), ["Sex"]
            )])
          )

example_pipeline = Pipeline(steps=[cat_pipe])

# fit pipeline
example_pipeline.fit(X_train, y_train)

# Name                    Version 
scikit-learn              0.20.3 
category_encoders         1.3.0
numpy                     1.16.2
pandas                    0.24.2
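
One likely explanation, offered as a sketch rather than a confirmed diagnosis: the error does not depend on the encoder because it may be raised one step earlier. The imputation stage uses strategy="median", which needs numeric input, so the string 'Male' cannot be converted to float before any encoder runs. The version below assumes that replacing the imputer with `SimpleImputer(strategy="most_frequent")` from `sklearn.impute` (available since 0.20) is an acceptable substitute for this column. The step name "encoder" and the choice of OneHotEncoder are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same example data as in the question.
example_df = pd.DataFrame({"Sex": ["Male", "Female", "Female"], "Survived": [1, 1, 0]})
X_train = example_df.drop("Survived", axis=1)
y_train = example_df["Survived"]

cat_pipe = ("categorical_features", ColumnTransformer([
    ("categorical", Pipeline(steps=[
        # Assumption: fill gaps with the most common label, since a median
        # is undefined for strings.
        ("impute_stage", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Sex"]),
]))

example_pipeline = Pipeline(steps=[cat_pipe])
encoded = example_pipeline.fit_transform(X_train, y_train)

# ColumnTransformer may return a sparse matrix; densify only if needed.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()
print(encoded)
```

The same structure should work with the other encoders listed above, provided the imputer in front of them can handle string data.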

No accepted answer

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange