Question

I'm trying to create a regression model that predicts the duration of a task. The training data I have consists of roughly 40 thousand completed tasks with these variables:

  • Who performed the task (~250 different people)
  • What part (subproject) of the project the task was performed on (~20 different parts)
  • The type of task
  • The start date of the task (10 years worth of data)
  • How long the person who has to do the task estimates it will take
  • The actual duration this task took to finish

The duration can vary from half an hour to a couple of hundred hours, but it is heavily right-skewed (most tasks are completed within 10 hours). On a log scale the distribution is still slightly right-skewed.
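For concreteness, this is roughly how I looked at the skew; the file name and the 'actual_hours' column are placeholders for my real data:

    import numpy as np
    import pandas as pd
    from scipy.stats import skew

    df = pd.read_csv("tasks.csv")  # placeholder file name
    print("skew (hours):    ", skew(df["actual_hours"]))
    print("skew (log hours):", skew(np.log(df["actual_hours"])))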

The prediction doesn't have to be perfect; I'm just trying to improve on the people's own estimates. One question to ask is "What measure do we use to define 'better'?" I think the best measure would be the Mean Squared Error (MSE), since it penalizes large errors much more heavily than small ones.
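The baseline I'm trying to beat is the MSE of the people's own estimates against the actual durations, which I compute roughly like this (same placeholder column names as above):

    from sklearn.metrics import mean_squared_error

    # people's own estimates vs. what the tasks actually took
    baseline_mse = mean_squared_error(df["actual_hours"], df["estimated_hours"])
    print(f"MSE of people's estimates: {baseline_mse:.1f}")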

Before turning to machine learning I tried some simple approaches, such as adjusting the estimate by the overall average or median error, and by the average/median error grouped by person or by subproject, but each of these turned out to perform worse.
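As an example, the per-person median adjustment looked roughly like this (placeholder column names again, and done in-sample just to see whether it helps at all):

    from sklearn.metrics import mean_squared_error

    # shift each estimate by the median error of the person who made it
    df["error"] = df["actual_hours"] - df["estimated_hours"]
    median_error_per_person = df.groupby("person")["error"].median()
    df["adjusted_estimate"] = df["estimated_hours"] + df["person"].map(median_error_per_person)
    print(mean_squared_error(df["actual_hours"], df["adjusted_estimate"]))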

With machine learning, one of the first problems I encountered was the number of categorical variables, since for most models these have to be encoded somehow (e.g. one-hot). Anyway, I tried some linear models; for example, with Stochastic Gradient Descent my approach would be (a sketch follows the list below):

  1. One-hot encode the categorical features
  2. Convert the start date to a Unix timestamp
  3. Normalize all the features that are not already between 0 and 1
  4. Split the data into 80/20 train and test sets
  5. Use grid-search cross-validation on the training set to find the best hyperparameters and fit the model
  6. Predict on the test set
  7. Calculate the error/score
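
In code the pipeline looks roughly like this; the file name, column names and parameter grid are placeholders rather than my exact setup:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import SGDRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

    df = pd.read_csv("tasks.csv")  # placeholder file name

    # 2. convert the start date to a Unix timestamp (seconds since the epoch)
    df["start_ts"] = (
        pd.to_datetime(df["start_date"]) - pd.Timestamp("1970-01-01")
    ) // pd.Timedelta("1s")

    X = df[["person", "subproject", "task_type", "start_ts", "estimated_hours"]]
    y = df["actual_hours"]

    # 1. one-hot encode the categoricals, 3. scale the numeric features to [0, 1]
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["person", "subproject", "task_type"]),
        ("num", MinMaxScaler(), ["start_ts", "estimated_hours"]),
    ])
    pipe = Pipeline([("prep", preprocess), ("sgd", SGDRegressor())])

    # 4. 80/20 train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # 5. grid-search cross-validation on the training set
    grid = GridSearchCV(
        pipe,
        param_grid={"sgd__alpha": [1e-5, 1e-4, 1e-3],
                    "sgd__penalty": ["l2", "l1", "elasticnet"]},
        scoring="neg_mean_squared_error",
        cv=5,
    )
    grid.fit(X_train, y_train)

    # 6. predict on the test set, 7. score
    print("test MSE:", mean_squared_error(y_test, grid.predict(X_test)))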

Now one thing I noticed was that the results varied quite a bit: on one run the MSE was close to double that of another (150 vs. 280). Another thing is that the MSE of the people's own estimates is about 80, so my model performs noticeably worse.
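To get a feel for that spread I repeated the split/fit/score with different seeds, something like this (reusing `pipe`, `X` and `y` from the sketch above):

    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    scores = []
    for seed in range(5):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        pipe.fit(X_tr, y_tr)
        scores.append(mean_squared_error(y_te, pipe.predict(X_te)))
    print(scores)  # the spread across seeds is what surprised me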

While trying to improve the performance I stumbled across this question, where someone suggests using survival models. I'm unfamiliar with these kinds of models and they sounded promising, but in my initial tests they turned out to be far too slow for my purposes (the dataset is too large).

The same Data Science answer that suggested survival models (and the Wikipedia page) also mentioned Poisson regression, but I'm not sure how I would apply it to my case.
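My rough guess at what that would look like, reusing the placeholder preprocessing and split from the SGD sketch above, is something like the following, but I don't know whether the Poisson assumptions (a count-like, non-negative target with variance tied to the mean) really fit durations in hours:

    from sklearn.linear_model import PoissonRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import Pipeline

    # same one-hot/scaling preprocessing as before, with a Poisson GLM on top
    poisson_pipe = Pipeline([("prep", preprocess),
                             ("glm", PoissonRegressor(alpha=1e-3, max_iter=300))])
    poisson_pipe.fit(X_train, y_train)
    print("Poisson test MSE:", mean_squared_error(y_test, poisson_pipe.predict(X_test)))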

So, long story short, I have two questions: 1. Was my approach of using SGD 'correct', and do you think I can improve the results with it? 2. Are other models better suited to this kind of prediction, and if so, can you explain a bit how I would use them?
