Question

I have a dataset that consists of 85 feature columns and 13195 rows. Approximately 50 of these features are categorical, which I encoded using OneHotEncoder. I was reading this article about sparse data sets and was intrigued to see how changing the value of the sparse parameter when defining a OneHotEncoder object might reduce memory usage for my dataset.
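As a quick illustration of what the parameter controls (a minimal sketch on hypothetical toy data, not my actual dataset): with sparse=True the encoder returns a SciPy sparse matrix that only stores the non-zero entries, while sparse=False returns a dense NumPy array.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Toy data for illustration only
    X_toy = np.array([['a'], ['b'], ['a'], ['c']])

    enc_sparse = OneHotEncoder(sparse=True).fit(X_toy)
    enc_dense = OneHotEncoder(sparse=False).fit(X_toy)

    print(type(enc_sparse.transform(X_toy)))  # scipy.sparse matrix (CSR)
    print(type(enc_dense.transform(X_toy)))   # numpy.ndarray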

Before applying one-hot encoding to the categorical features, my dataset has a memory usage of 9.394 MB, which I found by running this code:

    BYTES_TO_MB_DIV = 0.000001
    def print_memory_usage_of_data_frame(df):
        mem = round(df.memory_usage().sum() * BYTES_TO_MB_DIV, 3) 
        print("Memory usage is " + str(mem) + " MB")
        
    print_memory_usage_of_data_frame(dataset)
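(Side note: by default, DataFrame.memory_usage() only counts the pointer size for object columns; passing deep=True introspects the actual string payloads, so for a frame with many string-typed categorical columns the deep number can be noticeably larger.)

    # Deep introspection of object (string) columns; assumes the same
    # BYTES_TO_MB_DIV constant and dataset frame as above
    mem_deep = round(dataset.memory_usage(deep=True).sum() * BYTES_TO_MB_DIV, 3)
    print("Deep memory usage is " + str(mem_deep) + " MB")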

Setting OneHotEncoder sparse=True:

    from sklearn.compose import ColumnTransformer, make_column_selector as selector
    from sklearn.impute import SimpleImputer, KNNImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    import pandas as pd
    
    
    numeric_transformer = Pipeline(steps=[
        ('knnImputer', KNNImputer(n_neighbors=2, weights="uniform")),
        ('scaler', StandardScaler())])
    
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=True))])
    
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, selector(dtype_exclude="object")),
        ('cat', categorical_transformer, selector(dtype_include="object"))
    ])
    
      
    Z = pd.DataFrame(preprocessor.fit_transform(X))
    print_memory_usage_of_data_frame(Z)

Memory usage is 25.755 MB

Then running the same code as above but setting sparse=False like so:

    OneHotEncoder(handle_unknown='ignore', sparse=False)

Memory usage is 25.755 MB
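One way to check whether the encoder's sparse output actually survives the ColumnTransformer is to inspect the raw result before wrapping it in a DataFrame (a quick diagnostic sketch, assuming X is the feature frame from above):

    from scipy import sparse

    out = preprocessor.fit_transform(X)
    print(type(out))             # presumably numpy.ndarray here, not a sparse matrix
    print(sparse.issparse(out))  # if False, that would explain the identical numbers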

According to the linked article, which used the sparse option in pandas get_dummies, this should result in reduced memory usage. Is the same not true for scikit-learn's OneHotEncoder?


Solution

Based on @BenReiniger's comment, I removed the numeric portion from the ColumnTransformer and ran the following code:


    from sklearn.compose import ColumnTransformer, make_column_selector as selector
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    import pandas as pd
    
    
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=True))])
    
    preprocessor = ColumnTransformer(transformers=[
        ('cat', categorical_transformer, selector(dtype_include="object"))
    ])
    
      
    X = pd.DataFrame(preprocessor.fit_transform(X))
    
    print_memory_usage_of_data_frame(X)

The result was a memory usage of 0.106 MB.

Running the same code as above but with the sparse option set to False:

    OneHotEncoder(handle_unknown='ignore', sparse=False)

resulted in a memory usage of 20.688 MB.
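If you want to keep the sparse result in pandas without densifying it, recent pandas versions (0.25+) can wrap a SciPy sparse matrix in a sparse-backed DataFrame; a sketch reusing the pipeline above:

    from scipy import sparse

    out = preprocessor.fit_transform(X)  # scipy sparse matrix when sparse=True
    if sparse.issparse(out):
        X_df = pd.DataFrame.sparse.from_spmatrix(out)
        print_memory_usage_of_data_frame(X_df)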

So setting sparse=True in OneHotEncoder does indeed reduce memory usage, provided the output actually stays sparse. In the original pipeline it did not: ColumnTransformer stacks the outputs of its transformers and only returns a sparse matrix when the overall density is below its sparse_threshold (0.3 by default), so with the dense numeric block included, the combined result came back as a dense array regardless of the encoder's setting.
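If you need to keep the numeric transformer in the pipeline and still get a sparse result, ColumnTransformer's sparse_threshold parameter controls when the stacked output is densified; a sketch reusing the transformers from the question:

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, selector(dtype_exclude="object")),
            ('cat', categorical_transformer, selector(dtype_include="object")),
        ],
        # The stacked output is returned sparse when its overall density is
        # below this threshold; the default of 0.3 is what densified the
        # combined output in the question.
        sparse_threshold=1.0)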

Licensed under: CC-BY-SA with attribution