Question

I have a dataset with some numerical and categorical features and I am trying to apply CatBoost for categorical encoding and classification.

Since my dataset is highly imbalanced, with a large number of data samples with label 0 compared to those with label 1, I'm also trying to use SMOTE to synthesize label 1 data samples before CatBoost classification.

My code -

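from collections import Counter

from catboost import CatBoostClassifier, Pool
from imblearn.over_sampling import SMOTE

# train_categorical_cols is a list of columns that have categorical values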
train_pool = Pool(data = X,
                  label = y,
                  cat_features = train_categorical_cols)

X_enc = train_pool.get_features()
print(X_enc)
y_enc = train_pool.get_label()
print(y_enc)

smote = SMOTE()
X_res, y_res = smote.fit_resample(X_enc, y_enc)
print('Resampled dataset samples per class {}'.format(Counter(y_res)))

predictions = []
for i in range(10):
    clf = CatBoostClassifier(learning_rate=0.08,
                         depth = 10,
                         loss_function='Logloss',
                         l2_leaf_reg = 4,
                         iterations=1000,
                         task_type="GPU",
                         random_seed=i,
                         logging_level='Silent')
    clf.fit(train_pool, plot=True,silent=True)
    predictions.append(clf.predict_proba(test[inputcols])[:,1])
    print(clf.get_best_score())

I get an error on X_enc = train_pool.get_features() that says -

CatBoostError: Pool has non-numeric features, get_features supports only numeric features

My questions are -

  1. Is my approach towards applying SMOTE with CatBoost correct?
  2. I've diligently followed the catboost documentation, and I am not able to understand or fix the error I've mentioned above. Would appreciate any help.

Solution

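You get the error because your Pool contains categorical features, and get_features() supports only numeric features, exactly as the message says. If all of your features were numeric, the call would work fine.

SMOTE is irrelevant to this error: the exception is raised by get_features(), before SMOTE is ever called.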

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange