Question

I have a dataset with only 8 columns:

  • id
  • created_time
  • employee_id
  • rank position
  • hourly price
  • num_work_completed
  • work_category
  • hired

Hired is the target variable, with 1 representing hired and 0 representing not hired. It is imbalanced, with 5.7% hired (1), which makes the baseline accuracy 94.3%. I am trying to build a model that predicts whether an employee will be hired. After finishing the EDA and feature engineering (dealing with NAs, encoding categorical variables, normalizing numeric variables), I used an 80/20 train/test split and built a random forest on rank_position, hourly_price, num_work_completed, and work_category_dummy.
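
For reference, a minimal sketch of how that split might look, assuming the cleaned data sits in a pandas DataFrame named df (the name df, the exact column names, and random_state are assumptions, not the original code); stratify=y keeps the 5.7% hired rate the same in both sets:

from sklearn.model_selection import train_test_split

# Assumed feature and target columns, matching the names used elsewhere in the question
features = ['rank_position', 'hourly_price', 'num_work_completed', 'work_category_dummy']
X = df[features]
y = df['hired']

# 80/20 split; stratify=y preserves the 5.7% positive rate in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

The random forest itself was trained as follows: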

from sklearn.ensemble import RandomForestClassifier

# class_weight must be the string 'balanced' to reweight the minority class
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

However, the model's test accuracy turned out to be 93%, while the baseline is 94.3%.
The training accuracy is 99%; compared to the 93% test accuracy, I don't think there is an overfitting problem. Logistic regression has the same problem. Based on the correlation plot, most independent variables have a pretty weak relationship with the target variable, with absolute correlations below 0.3. What should I do next to improve my model's accuracy? I tried parameter tuning, but it doesn't help much.
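
For comparison, a hedged sketch of how the numbers above might be reproduced and inspected per class; the variable names follow the code above, and the all-zeros baseline and the per-class report are illustrative additions rather than part of the original workflow:

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Majority-class baseline: always predict "not hired" (0); should land near the 94.3% quoted above
baseline_pred = np.zeros_like(y_test)
print('Baseline accuracy:', accuracy_score(y_test, baseline_pred))

# Train vs. test accuracy of the fitted random forest
print('Train accuracy:', clf.score(X_train, y_train))
print('Test accuracy:', clf.score(X_test, y_test))

# Per-class precision/recall/F1 and the confusion matrix show how the 5.7% minority class is handled
print(classification_report(y_test, y_pred, digits=3))
print(confusion_matrix(y_test, y_pred))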

