Setting sparse=True in Scikit Learn OneHotEncoder does not reduce memory usage
13-12-2020
Question
I have a dataset of 85 feature columns and 13195 rows. Approximately 50 of these features are categorical, and I encoded them using OneHotEncoder. I was reading this article about sparse data sets and was intrigued to see whether changing the value of the sparse parameter when defining a OneHotEncoder object might reduce memory usage for my dataset.
Before one-hot encoding the categorical features, the dataset's memory usage is 9.394 MB. I found this by running this code:
BYTES_TO_MB_DIV = 0.000001

def print_memory_usage_of_data_frame(df):
    mem = round(df.memory_usage().sum() * BYTES_TO_MB_DIV, 3)
    print("Memory usage is " + str(mem) + " MB")

print_memory_usage_of_data_frame(dataset)
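A side note of mine, not from the original post: `DataFrame.memory_usage()` without `deep=True` only counts the 8-byte pointers for object (string) columns, so a figure measured this way can understate the true footprint of a string-heavy frame. A minimal sketch with made-up data:

```python
# Sketch: shallow vs deep memory accounting for an object (string) column.
import pandas as pd

df = pd.DataFrame({"cat": ["some_long_category_name"] * 10_000})

shallow = df.memory_usage().sum()        # counts only 8-byte object pointers
deep = df.memory_usage(deep=True).sum()  # also counts the string payloads

print(f"shallow: {shallow / 1e6:.3f} MB, deep: {deep / 1e6:.3f} MB")
```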
Setting OneHotEncoder sparse=True:
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline(steps=[
    ('knnImputer', KNNImputer(n_neighbors=2, weights="uniform")),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=True))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, selector(dtype_exclude="object")),
    ('cat', categorical_transformer, selector(dtype_include="object"))
])

Z = pd.DataFrame(preprocessor.fit_transform(X))
print_memory_usage_of_data_frame(Z)
Memory usage is 25.755 MB
Then running the same code as above but setting sparse=False, like so:
OneHotEncoder(handle_unknown='ignore', sparse=False)
Memory usage is 25.755 MB
According to the linked article, which used the sparse option in pandas get_dummies, this should result in reduced memory usage. Is this not the case for scikit-learn's OneHotEncoder?
Solution
Based on @BenReiniger's comment, I removed the numeric portion from the ColumnTransformer and ran the following code:
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=True))])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, selector(dtype_include="object"))
])

X = pd.DataFrame(preprocessor.fit_transform(X))
print_memory_usage_of_data_frame(X)
The result was Memory usage is 0.106 MB.
Running the same code above but with sparse option set to False:
OneHotEncoder(handle_unknown='ignore', sparse=False)
resulted in Memory usage is 20.688 MB.
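The size of that gap can be reproduced without pandas at all. Here is a standalone sketch using synthetic data of roughly the question's shape (13195 rows, 50 categories; the category names are made up), comparing the CSR matrix that OneHotEncoder returns by default with its dense equivalent:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Synthetic stand-in for one categorical column with 50 levels.
X = rng.choice([f"cat_{i}" for i in range(50)], size=(13195, 1))

enc = OneHotEncoder(handle_unknown="ignore")  # sparse output is the default
encoded = enc.fit_transform(X)                # scipy CSR matrix

# CSR stores only the non-zero entries plus two index arrays.
sparse_bytes = (encoded.data.nbytes
                + encoded.indices.nbytes
                + encoded.indptr.nbytes)
dense_bytes = encoded.toarray().nbytes        # full 13195 x 50 float64 grid

print(f"sparse: {sparse_bytes / 1e6:.3f} MB")
print(f"dense:  {dense_bytes / 1e6:.3f} MB")
```

With one non-zero per row versus a full 50-column float64 grid, the dense copy is over an order of magnitude larger.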
So it is clear that setting sparse=True in OneHotEncoder does indeed reduce memory usage. (In scikit-learn 1.2 and later the parameter is named sparse_output.)
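As for why the original pipeline showed no saving at all, my reading (an assumption; the thread does not spell it out) is ColumnTransformer's sparse_threshold parameter: the stacked output is only kept sparse when its overall density is below the threshold (0.3 by default), and dense scaled numeric columns can push it above that. A sketch with made-up data:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)),
                  columns=[f"n{i}" for i in range(5)])
df["cat"] = rng.choice([f"c{i}" for i in range(10)], size=1000)

# Categorical only: 1000 non-zeros in a 1000 x 10 one-hot block gives
# density 0.1 < 0.3, so the stacked output stays sparse.
cat_only = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"),
      selector(dtype_include="object"))])
cat_is_sparse = sparse.issparse(cat_only.fit_transform(df))

# Numeric + categorical: the 5 dense scaled columns raise the overall
# density to 0.4 > 0.3, so ColumnTransformer returns a dense ndarray.
mixed = ColumnTransformer([
    ("num", StandardScaler(), selector(dtype_exclude="object")),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     selector(dtype_include="object")),
])
mixed_is_sparse = sparse.issparse(mixed.fit_transform(df))

print(cat_is_sparse, mixed_is_sparse)
```

This would also explain why removing the numeric portion from the ColumnTransformer was what exposed the saving.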