SMOTE vs SMOTENC for binary classifier with categorical and numeric data

https://datascience.stackexchange.com/questions/60684

02-11-2019

Question

I have a problem that I do not fully understand.

I am using XGBoost for binary classification; my target y is 0 or 1 (true or false). I have both categorical and numeric features, so in theory I should use SMOTENC instead of SMOTE. However, I get better results with SMOTE.

Could anyone explain why this is happening?

Also, if I encode the categorical data (BinaryEncoder, one-hot, etc.), should I apply SMOTENC before or after the encoding?
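To make the difference behind the question concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not imblearn's implementation: the function names, the column layout, and the sample values are all hypothetical. It shows that plain SMOTE-style interpolation on already-encoded columns yields fractional "categories", while a SMOTENC-style step interpolates only the numeric columns and takes the most frequent category among the neighbours, which is why it needs the raw (unencoded) categorical column.

```python
import random


def smote_interpolate(a, b, rng):
    """Plain SMOTE step: a random point on the segment between samples a and b."""
    gap = rng.random()
    return [x + gap * (y - x) for x, y in zip(a, b)]


def smotenc_like(a, neighbours, cat_idx, rng):
    """SMOTENC-style step (simplified): interpolate numeric columns,
    take the most frequent neighbour value for categorical columns."""
    b = rng.choice(neighbours)
    gap = rng.random()
    new = []
    for i, (x, y) in enumerate(zip(a, b)):
        if i in cat_idx:
            values = [n[i] for n in neighbours]
            # Majority vote; sorting makes tie-breaking deterministic.
            new.append(max(sorted(set(values)), key=values.count))
        else:
            new.append(x + gap * (y - x))
    return new


rng = random.Random(0)

# Hypothetical encoded rows: [age, income, is_red, is_blue] (last two one-hot).
a_enc = [30.0, 50.0, 1.0, 0.0]
b_enc = [40.0, 70.0, 0.0, 1.0]
synthetic = smote_interpolate(a_enc, b_enc, rng)
print(synthetic)  # the one-hot columns are now fractions, not 0/1

# Hypothetical raw rows: [age, income, colour] with the category left unencoded.
a_raw = [30.0, 50.0, "red"]
neighbours = [[40.0, 70.0, "blue"], [35.0, 60.0, "blue"], [32.0, 55.0, "red"]]
mixed = smotenc_like(a_raw, neighbours, cat_idx={2}, rng=rng)
print(mixed)  # the colour is a real category chosen by majority vote
```

Tree-based models such as XGBoost can still split on the fractional values, which may be part of why plain SMOTE on encoded data does not necessarily hurt in practice; that is a guess, not something the sketch demonstrates.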

My example code is below (x and y are the data after cleaning, including BinaryEncoder).

Thanks for any help.

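    import joblib
    from imblearn.over_sampling import SMOTE
    from sklearn.metrics import confusion_matrix, f1_score
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
    from xgboost import XGBClassifier

    # Hold out 20% for validation; only the training split is resampled below.
    X_train, X_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=1)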

    smt = SMOTE()
    X_resampled, y_resampled = smt.fit_resample(X_train, y_train)


    params_model1 = {
        'booster': ['dart', 'gbtree', 'gblinear'],
        'learning_rate': [0.001, 0.01, 0.05, 0.1],
        'min_child_weight': [1, 5, 10, 15, 20],
        'gamma': [0, 0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5, 6, 7, 8],
        'max_delta_step': [0, 1, 2, 3, 5, 10],
        'base_score': [0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65],
        'reg_alpha': [0, 0.5, 1, 1.5, 2],
        'reg_lambda': [0, 0.5, 1, 1.5, 2],
        'n_estimators': [100, 200, 500]
    }

    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1001)

    xgb = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                        colsample_bynode=1, colsample_bytree=0.3, gamma=1,
                        learning_rate=0.1, max_delta_step=0, max_depth=10,
                        min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
                        nthread=None, objective='binary:logistic', random_state=0,
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                        silent=None, subsample=0.8, verbosity=1)

    scoring = 'f1'
    rs_xgb = RandomizedSearchCV(xgb, param_distributions=params_model1, n_iter=1,
                                scoring=scoring, n_jobs=4,
                                cv=skf.split(X_resampled, y_resampled), verbose=3,
                                random_state=1001)

    rs_xgb.fit(X_resampled, y_resampled)

    refit = rs_xgb.best_estimator_

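    # Persist the best estimator and reload it to check the saved model.
    joblib.dump(refit, 'validator1.pkl')
    loaded_xgb = joblib.load('validator1.pkl')

    # DataFrame.as_matrix() was removed in pandas 1.0; .values is the equivalent.
    y_predict = loaded_xgb.predict(X_val.values)

    print(confusion_matrix(y_val, y_predict))
    print("Final result " + str(f1_score(y_val, y_predict)))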

No correct solution

Licensed under: CC-BY-SA with attribution
