Machine learning goal: given a population of 100,000 students, predict a group of 3,000, and minimize the median grade of that group

https://datascience.stackexchange.com/questions/76179

12-12-2020

Question

In other words, I am looking to predict students that will fail out of school before it happens. The data includes socioeconomic status and other related variables.

I have tried an XGB binary classification (both tree and forest), but the problem is that it doesn't penalize severely wrong answers, such as predicting that a student will be in the bottom 3% in terms of grades when they are actually an A+ student. The result is that the average grade of the predicted students is quite low, but the median grade isn't actually that bad: a few extremely bad students pull down the average but not the median.

I have also tried an XGB regression (both tree and forest), but the problem is that I can't get the model to focus on the bottom 3%. It seeks to reduce error for all predictions. I couldn't care less about telling the difference between an A student and a B student; I only need to consistently identify the bottom 3%.

I was thinking that perhaps this could lend itself to reinforcement learning instead of supervised learning, but I know nothing about reinforcement learning. Would it be possible to make a reinforcement model where the goal is to minimize the median grade of the 3% of students predicted? Or are there any other machine learning techniques that would work?

Solution

Try writing a custom loss function for a regression model!

Keras' neural networks support this, for example. See https://stackoverflow.com/q/43818584/745868

(Many other libraries support this as well.)

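The only special thing about your custom loss function is that it doesn't add up the error of a datapoint if min(pred_y, actual_y) >= THRESHOLD, that is, if both the predicted and the actual grade are at or above the cutoff. Errors among students who are doing fine, and whom the model also rates as fine, then stop influencing training.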
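To make the rule concrete, here is a minimal NumPy sketch of such a loss. It is library-agnostic and only illustrates the masking. The function name, the value of THRESHOLD, and the sample grades are assumptions for illustration. Porting it to a Keras loss or another library's custom objective would need that library's tensor operations instead.

```python
import numpy as np

THRESHOLD = 60.0  # hypothetical cutoff grade, not taken from the question


def thresholded_squared_error(actual_y, pred_y, threshold=THRESHOLD):
    """Mean squared error that skips points where both values clear the threshold."""
    actual_y = np.asarray(actual_y, dtype=float)
    pred_y = np.asarray(pred_y, dtype=float)
    # A point is ignored when min(pred_y, actual_y) >= threshold,
    # i.e. the student is doing fine and the model agrees.
    counted = np.minimum(pred_y, actual_y) < threshold
    squared_error = (pred_y - actual_y) ** 2
    # Average over all points so that ignored ones contribute zero.
    return float(np.mean(np.where(counted, squared_error, 0.0)))


# Made-up grades: only the 2nd and 3rd students have a value below the cutoff.
actual = [95.0, 90.0, 40.0, 85.0]
pred = [80.0, 50.0, 45.0, 88.0]
loss = thresholded_squared_error(actual, pred)
```

Note that the second student (actual 90, predicted 50) is still penalized heavily. That is the "severely wrong" case the question wants punished, while the A-versus-B errors of the first and last students are ignored.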

Other tips

Writing a custom loss function could be handy, but it may be simpler to treat this as a class imbalance problem for your regression model. For starters, try undersampling the higher and medium grades until they're close to balanced with your failing students. Given your number of data points and features, you can probably still just throw a random forest at it and do fine.
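A rough sketch of that undersampling step, using NumPy only. The helper name, the `fail_threshold` and `ratio` parameters, and the synthetic data are all assumptions for illustration. The real cutoff would come from your own grade distribution.

```python
import numpy as np


def undersample_passing(X, y, fail_threshold, ratio=1.0, seed=0):
    """Keep every failing row and a random subset of the remaining rows.

    Rows with y < fail_threshold count as failing; about
    ratio * n_failing of the other rows are kept (hypothetical parameters).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fail_idx = np.flatnonzero(y < fail_threshold)
    pass_idx = np.flatnonzero(y >= fail_threshold)
    # Never ask for more passing rows than exist.
    n_keep = min(len(pass_idx), int(round(ratio * len(fail_idx))))
    kept_pass = rng.choice(pass_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([fail_idx, kept_pass]))
    return X[keep], y[keep]


# Synthetic stand-in data: 1,000 students, with the bottom 3% treated as failing.
data_rng = np.random.default_rng(42)
grades = data_rng.normal(loc=75.0, scale=10.0, size=1000)
features = data_rng.normal(size=(1000, 5))
cutoff = np.quantile(grades, 0.03)
X_bal, y_bal = undersample_passing(features, grades, fail_threshold=cutoff)
```

The balanced `X_bal` and `y_bal` can then be passed to whatever regressor you like, a random forest included. Evaluate on data that has not been undersampled, so that the 3% base rate is preserved when you measure results.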

License: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange