Passing data to SMOTE after applying train/test split

https://datascience.stackexchange.com/questions/67141

08-12-2020

Question

I'm trying to resample my dataset after splitting it into train and test partitions using SMOTE. Here's my code:

smote_X = df[cols]
smote_Y = df[target_col]

#Split train and test data
smote_train_X,smote_test_X,smote_train_Y,smote_test_Y = train_test_split(smote_X,smote_Y,test_size = .25,random_state = 111)

smote_train_Y_series = smote_train_Y.iloc[:,0]

#oversampling minority class using smote
os = SMOTE(random_state = 0)
os_smote_X,os_smote_Y = os.fit_sample(smote_train_X,smote_train_Y_series)

I added line #5 (the smote_train_Y.iloc[:,0] call) to convert the DataFrame returned by train_test_split to a Series, because the newer version of SMOTE's fit_sample (docs) expects that data type. However, it now throws the following error.

Any ideas how to fix it?

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
     16 #oversampling minority class using smote
     17 os = SMOTE(random_state = 0)
---> 18 os_smote_X,os_smote_Y = os.fit_sample(smote_train_X,smote_train_Y_series)
     19 os_smote_X = pd.DataFrame(data = os_smote_X,columns=cols)
     20 os_smote_Y = pd.DataFrame(data = os_smote_Y,columns=target_col)

/opt/conda/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
     86         if self._X_columns is not None:
     87             X_ = pd.DataFrame(output[0], columns=self._X_columns)
---> 88             X_ = X_.astype(self._X_dtypes)
     89         else:
     90             X_ = output[0]

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5863                     results.append(
   5864                         col.astype(
-> 5865                             dtype=dtype[col_name], copy=copy, errors=errors, **kwargs
   5866                         )
   5867                     )

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5846             if len(dtype) > 1 or self.name not in dtype:
   5847                 raise KeyError(
-> 5848                     "Only the Series name can be used for "
   5849                     "the key in Series dtype mappings."
   5850                 )

KeyError: 'Only the Series name can be used for the key in Series dtype mappings.'

Update 1/28/2020: I tried two more options, with no luck so far. Still looking for help.

A. Passing the raw outputs of train_test_split:

#oversampling minority class using smote
os = SMOTE(random_state = 0)
os_smote_X,os_smote_Y = os.fit_sample(smote_train_X,smote_train_Y)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
      1 #oversampling minority class using smote
      2 os = SMOTE(random_state = 0)
----> 3 os_smote_X,os_smote_Y = os.fit_resample(smote_train_X,smote_train_Y)
      4 os_smote_X = pd.DataFrame(data = os_smote_X,columns=cols)
      5 os_smote_Y = pd.DataFrame(data = os_smote_Y,columns=target_col)

/opt/conda/lib/python3.6/site-packages/imblearn/base.py in fit_resample(self, X, y)
     73         """
     74         check_classification_targets(y)
---> 75         X, y, binarize_y = self._check_X_y(X, y)
     76
     77         self.sampling_strategy_ = check_sampling_strategy(

/opt/conda/lib/python3.6/site-packages/imblearn/base.py in _check_X_y(self, X, y, accept_sparse)
    148         if hasattr(y, "loc"):
    149             # store information to build a series
--> 150             self._y_name = y.name
    151             self._y_dtype = y.dtype
    152         else:

/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180
   5181     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'name'

B. Converting smote_train_X to a matrix and smote_train_Y to a Series before passing both to SMOTE:

smote_train_X_matrix = smote_train_X.as_matrix()
smote_train_Y_series = smote_train_Y.iloc[:,0]

#oversampling minority class using smote
os = SMOTE(random_state = 0)
os_smote_X,os_smote_Y = os.fit_resample(smote_train_X_matrix,smote_train_Y_series)

Note that the resulting matrix and series show a shape of (4633, 46) and (4633,) respectively.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
   1677                 blocks = [
-> 1678                     make_block(values=blocks[0], placement=slice(0, len(axes[0])))
   1679                 ]

/opt/conda/lib/python3.6/site-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath)
   3283
-> 3284     return klass(values, ndim=ndim, placement=placement)
   3285

/opt/conda/lib/python3.6/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    127                 "Wrong number of items passed {val}, placement implies "
--> 128                 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
    129             )

ValueError: Wrong number of items passed 46, placement implies 44

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
      2 os = SMOTE(random_state = 0)
      3 os_smote_X,os_smote_Y = os.fit_resample(smote_train_X_matrix,smote_train_Y_series)
----> 4 os_smote_X = pd.DataFrame(data = os_smote_X,columns=cols)
      5 os_smote_Y = pd.DataFrame(data = os_smote_Y,columns=target_col)
      6 ###

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    438                 mgr = init_dict({data.name: data}, index, columns, dtype=dtype)
    439             else:
--> 440                 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    441
    442         # For data is list-like, or Iterable (will consume into list)

/opt/conda/lib/python3.6/site-packages/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    211         block_values = [values]
    212
--> 213     return create_block_manager_from_blocks(block_values, [columns, index])
    214
    215

/opt/conda/lib/python3.6/site-packages/pandas/core/internals/managers.py in create_block_manager_from_blocks(blocks, axes)
   1686         blocks = [getattr(b, "values", b) for b in blocks]
   1687         tot_items = sum(b.shape[0] for b in blocks)
-> 1688         construction_error(tot_items, blocks[0].shape[1:], axes, e)
   1689
   1690

/opt/conda/lib/python3.6/site-packages/pandas/core/internals/managers.py in construction_error(tot_items, block_shape, axes, e)
   1717         raise ValueError("Empty data passed with indices specified.")
   1718     raise ValueError(
-> 1719         "Shape of passed values is {0}, indices imply {1}".format(passed, implied)
   1720     )
   1721

ValueError: Shape of passed values is (8410, 46), indices imply (8410, 44)

Solution

Your smote_train_Y is already a Series, so there is no need to use iloc[:,0]. Just pass it to the fit_sample function directly:

#oversampling minority class using smote
os = SMOTE(random_state = 0)
os_smote_X,os_smote_Y = os.fit_sample(smote_train_X_matrix, smote_train_Y)
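
Whether the target coming out of the split is a Series or a one-column DataFrame depends on how it was selected from df: a string selector gives a Series, a list selector gives a DataFrame. The AttributeError under option A suggests the asker's target was a DataFrame. The snippet below is a minimal sketch of checking and normalizing the target; the toy frame, the "churn" column name, and the as_series helper are hypothetical and not part of the original post.

```python
import pandas as pd

# Hypothetical toy frame; "churn" stands in for the real target column.
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "churn": [0, 1, 0, 1]})

y_frame = df[["churn"]]   # list selector -> one-column DataFrame
y_series = df["churn"]    # string selector -> Series


def as_series(y):
    """Return y as a Series, taking the only column of a one-column DataFrame."""
    if isinstance(y, pd.DataFrame):
        if y.shape[1] != 1:
            raise ValueError(f"expected one target column, got {y.shape[1]}")
        return y.iloc[:, 0]
    return y


y_from_frame = as_series(y_frame)
y_from_series = as_series(y_series)
print(type(y_from_frame).__name__, type(y_from_series).__name__)
```

With a helper like this, iloc[:,0] is applied only when the target really is a DataFrame, so the same call works for either way of selecting the column.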

OTHER TIPS

Found the problem - my initial dataset contained duplicate columns created after one-hot encoding of my categorical variables. The original code worked for me upon cleaning the dataset.
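
Duplicate column names would be consistent with the errors in the question: selecting a list of names from a frame that contains a repeated name returns every matching column, so the data can end up wider than the list of names (46 columns against 44 names in the traceback). Here is a small sketch of how such duplicates might be detected and dropped with pandas. The frame and the "city_NY" name are made up for illustration, and keeping only the first copy assumes the duplicated columns carry the same data.

```python
import pandas as pd

# Hypothetical frame imitating a one-hot encoding step that emitted "city_NY" twice.
df = pd.DataFrame(
    [[1.0, 0, 0, 1], [2.0, 1, 1, 0], [3.0, 0, 0, 1]],
    columns=["age", "city_NY", "city_NY", "label"],
)
cols = ["age", "city_NY"]

# True for every column whose name already appeared further to the left.
dup_mask = df.columns.duplicated()
duplicate_names = df.columns[dup_mask].tolist()

# Selecting by name returns all matching columns, so the result is wider than cols.
n_selected = df[cols].shape[1]

# Keep only the first occurrence of each column name.
deduped = df.loc[:, ~dup_mask]

print(duplicate_names, n_selected, deduped[cols].shape[1])
```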

Bottom line: make sure your dataset is sound (for example, that it has no duplicate column names), and pass the target to SMOTE's fit_sample as a Series rather than a DataFrame.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange