Question

I have a large dataset with more than 100 columns covering nearly all data types. I want to remove outliers, and for that purpose I've decided to use the IQR method. The problem is that even with the 0.25/0.75 quantiles, columns like ClientTotalIncome still contain a significant number of outliers, and applying the filter eliminates more than 90% of my data. My Python code for outlier removal is as follows:

import pandas as pd

# Split numeric and non-numeric columns
num_train = train.select_dtypes(include=['number'])
cat_train = train.select_dtypes(exclude=['number'])

# Per-column IQR fences
Q1 = num_train.quantile(0.25)
Q3 = num_train.quantile(0.75)
IQR = Q3 - Q1

# Keep only rows that are not an outlier in ANY numeric column
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
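For what it's worth, the heavy data loss follows from the row-wise `.any(axis=1)`: a row is dropped if it is an outlier in even one of ~100 columns, so per-column outlier rates compound. A minimal sketch on synthetic lognormal (skewed, income-like) data, which is only an assumption standing in for the real dataset, shows the effect:

```python
# Sketch: why "outlier in ANY of ~100 columns" drops almost all rows.
# The lognormal columns are an assumption for illustration, not the real data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.lognormal(size=(10_000, 100)))

Q1, Q3 = df.quantile(0.25), df.quantile(0.75)
IQR = Q3 - Q1
flagged = (df < Q1 - 1.5 * IQR) | (df > Q3 + 1.5 * IQR)

per_col = flagged.mean().mean()        # average outlier rate in one column
keep = ~flagged.any(axis=1)            # the row-wise filter from the question

print(f"avg outlier rate per column: {per_col:.1%}")
print(f"rows kept by row-wise filter: {keep.mean():.1%}")
```

Even a modest single-column outlier rate, raised to the power of ~100 columns, leaves almost no surviving rows.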

Any ideas?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange