Resampling with Python SMOTE

https://datascience.stackexchange.com/questions/66302

20-10-2020

Question

I am trying to apply a simple ML resampling step after the train-test split. However, when I do this, it throws the error below. Can you please help me understand what this error means?

KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'

The code is given below:

# split into training and testing datasets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 2, shuffle = True, stratify = y)
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())   # error is thrown here

print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

Here is the full error message:

KeyError                                  Traceback (most recent call last)
<ipython-input-216-af83b63865ac> in <module>
      3 
      4 sm = SMOTE(random_state=2)
----> 5 X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())
      6 
      7 print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     86         if self._X_columns is not None:
     87             X_ = pd.DataFrame(output[0], columns=self._X_columns)
---> 88             X_ = X_.astype(self._X_dtypes)
     89         else:
     90             X_ = output[0]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5863                     results.append(
   5864                         col.astype(
-> 5865                             dtype=dtype[col_name], copy=copy, errors=errors, **kwargs
   5866                         )
   5867                     )

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5846                 if len(dtype) > 1 or self.name not in dtype:
   5847                     raise KeyError(
-> 5848                         "Only the Series name can be used for "
   5849                         "the key in Series dtype mappings."
   5850                     )

KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'

Solution

Do it without ravel (or reshaping of any kind): pass y_train to fit_sample as it is.

Or, if you are going to ravel y_train, then also convert the DataFrame X_train into a matrix, so that both inputs are plain arrays. That is a format fit_sample accepts.
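Why converting X_train away from a DataFrame helps can be read off the traceback: the error is raised while imbalanced-learn rebuilds a DataFrame and calls X_.astype(self._X_dtypes), and the check that fails is len(dtype) > 1, meaning the per-column dtype lookup returned more than one entry. One situation that produces this is a DataFrame with repeated column names. That is only a guess about the asker's data, not something the thread confirms. A minimal sketch to check for it, where the helper name and the demo frame are hypothetical:

```python
import pandas as pd


def duplicated_column_names(df: pd.DataFrame) -> list:
    """Return the column labels that occur more than once in df."""
    return df.columns[df.columns.duplicated()].unique().tolist()


# Hypothetical stand-in for X_train with one repeated column label.
demo = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    columns=["amount", "time", "amount"],
)

# Labels that appear more than once.
dupes = duplicated_column_names(demo)
print(dupes)

# Looking up the dtype of a repeated label yields a Series with several
# entries instead of a single dtype, which matches the failing check.
print(len(demo.dtypes["amount"]))
```

If the list comes back empty for your own X_train, duplicate labels are not the cause and the suggestions above still apply.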

Other tips

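Convert your DataFrame into a matrix (a NumPy array) before resampling. Note that as_matrix() was removed in pandas 1.0; on newer versions use X_train.to_numpy() in its place: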

sm.fit_sample(X_train.as_matrix(), y_train.ravel())
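The conversion above can be sketched as a small helper built on DataFrame.to_numpy(), the current pandas spelling of this conversion. The function name to_arrays and the demo data are assumptions for illustration and do not come from the thread:

```python
import numpy as np
import pandas as pd


def to_arrays(X, y):
    """Convert features and labels to plain NumPy arrays.

    X becomes a 2-D array, y a flat 1-D array, so the sampler never
    sees pandas objects and does not try to restore column dtypes.
    """
    X_arr = X.to_numpy() if isinstance(X, pd.DataFrame) else np.asarray(X)
    y_arr = np.asarray(y).ravel()
    return X_arr, y_arr


# Hypothetical stand-ins for X_train and y_train.
X_demo = pd.DataFrame({"f1": [0.1, 0.2, 0.3, 0.4], "f2": [1.0, 2.0, 3.0, 4.0]})
y_demo = pd.Series([0, 0, 0, 1], name="label")

X_arr, y_arr = to_arrays(X_demo, y_demo)
print(type(X_arr).__name__, X_arr.shape)
print(type(y_arr).__name__, y_arr.shape)
```

The resulting arrays can then be handed to the sampler as in the question, for example sm.fit_sample(X_arr, y_arr). Newer imbalanced-learn releases expose this method as fit_resample, the name that already appears in the traceback.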

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange