Question

I have a dataset with some numerical and categorical features and I am trying to apply CatBoost for categorical encoding and classification.

Since my dataset is highly imbalanced, with a large number of data samples with label 0 compared to those with label 1, I'm also trying to use SMOTE to synthesize label 1 data samples before CatBoost classification.

My code -

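from collections import Counter

from catboost import CatBoostClassifier, Pool
from imblearn.over_sampling import SMOTE

# train_categorical_cols is a list of columns that have categorical values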
train_pool = Pool(data = X,
                  label = y,
                  cat_features = train_categorical_cols)

X_enc = train_pool.get_features()
print(X_enc)
y_enc = train_pool.get_label()
print(y_enc)

smote = SMOTE()
X_res, y_res = smote.fit_resample(X_enc, y_enc)
print('Resampled dataset samples per class {}'.format(Counter(y_res)))

predictions = []
for i in range(10):
    clf = CatBoostClassifier(learning_rate=0.08,
                             depth=10,
                             loss_function='Logloss',
                             l2_leaf_reg=4,
                             iterations=1000,
                             task_type="GPU",
                             random_seed=i,
                             logging_level='Silent')
    clf.fit(train_pool, plot=True, silent=True)
    predictions.append(clf.predict_proba(test[inputcols])[:,1])
    print(clf.get_best_score())

I get an error on X_enc = train_pool.get_features() that says -

CatBoostError: Pool has non-numeric features, get_features supports only numeric features

My questions are -

  1. Is my approach towards applying SMOTE with CatBoost correct?
  2. I've diligently followed the catboost documentation, and I am not able to understand or fix the error I've mentioned above. Would appreciate any help.

Solution

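You get the error because get_features() only works on a pool whose features are all numeric, and your pool declares categorical features through cat_features. If all of your features were numerical, the call would work fine.

SMOTE is irrelevant to this error: the exception is raised on the get_features() line, before SMOTE is ever called.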
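
On the first question, one possible way to combine oversampling with CatBoost is to resample the raw data first and build the Pool afterwards, so that get_features() is never needed. The sketch below rests on some assumptions: it swaps plain SMOTE for imbalanced-learn's SMOTENC (a SMOTE variant that accepts categorical columns), it assumes X is a pandas DataFrame containing at least one numeric column, and categorical_indices and resample_then_pool are hypothetical helper names, not part of either library.

```python
def categorical_indices(columns, categorical_cols):
    """Map categorical column names to their positional indices."""
    columns = list(columns)
    return [columns.index(name) for name in categorical_cols]


def resample_then_pool(X, y, categorical_cols, seed=0):
    """Oversample the minority class on raw data, then wrap the result in a Pool.

    Assumes X is a pandas DataFrame and categorical_cols lists its
    categorical column names (hypothetical helper, not a library API).
    """
    # Local imports keep categorical_indices usable without these packages.
    from catboost import Pool
    from imblearn.over_sampling import SMOTENC

    # SMOTENC needs to know which columns are categorical; indices are used here.
    smote = SMOTENC(
        categorical_features=categorical_indices(X.columns, categorical_cols),
        random_state=seed,
    )
    X_res, y_res = smote.fit_resample(X, y)

    # CatBoost handles the categorical columns itself during training,
    # so the resampled data is passed through unencoded.
    return Pool(data=X_res, label=y_res, cat_features=categorical_cols)
```

The classifier would then be fit on the returned pool, for example clf.fit(resample_then_pool(X, y, train_categorical_cols)). Note that the code in the question fits on the original train_pool, so its resampled X_res and y_res would never reach the model even without the error.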

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange