Question

I have a system that manages equipment. When a piece of equipment is faulty, it is serviced. Imagine my dataset looks like this:

ID
Type
# of times serviced

Example Data:

|ID| Type       | #serviced |
|1 | iphone     | 1         |
|2 | iphone     | 0         |
|3 | android    | 1         |
|4 | android    | 0         |
|5 | blackberry | 0         |

What I want to predict is: "of all the equipment that has not been serviced, which units are likely to be serviced?", i.e. identify "at risk" equipment.

The problem is that my training data will be the rows with #serviced > 0. Rows with #serviced = 0 are not final (when a unit fails it will be serviced, so its count will go up) and don't seem to be valid candidates to include in training.

  1. Is this a supervised or an unsupervised problem? (Supervised because I have serviced and not-serviced labels; unsupervised because I want to cluster not-serviced units with serviced ones and thereby identify at-risk equipment.)

  2. What data should I include in training?

Note:

The example is obviously simplified. In reality I have more features that describe the equipment.

Solution

You should include the time at which each phone was serviced, so that you can build a survival model. These models are commonly used in reliability engineering as well as in studies of treatment efficacy. In reliability engineering it is very common to fit the data to a Weibull distribution. Even aircraft manufacturers consider the model to be reliable after calibrating it with three to five data points. I can highly recommend the R package 'flexsurv'.
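As a rough illustration of how a fitted Weibull model could be used to rank unserviced units, here is a minimal sketch in Python (standard library only, rather than R). The shape and scale values and the unit ages are hypothetical and not fitted to any real data; in practice they would come from a fitted model such as one produced by 'flexsurv'.

```python
import math

def weibull_survival(t, shape, scale):
    """Probability that a unit is still unserviced at age t under a Weibull model."""
    return math.exp(-((t / scale) ** shape))

def conditional_failure_prob(age, horizon, shape, scale):
    """P(first service within `horizon` more months | no service up to `age`)."""
    return 1.0 - weibull_survival(age + horizon, shape, scale) / weibull_survival(age, shape, scale)

# Hypothetical parameters (assumption): shape > 1 means the failure rate grows with age.
SHAPE, SCALE = 1.5, 24.0

# Hypothetical unserviced units: ID -> age in months.
unserviced_ages = {2: 15.0, 4: 10.0, 5: 5.5}

# Risk of needing service within the next 6 months, given survival so far.
risk = {
    unit_id: conditional_failure_prob(age, 6.0, SHAPE, SCALE)
    for unit_id, age in unserviced_ages.items()
}

# Highest risk first.
for unit_id, p in sorted(risk.items(), key=lambda kv: kv[1], reverse=True):
    print(f"ID {unit_id}: {p:.3f}")
```

With a shape above 1 the oldest unserviced unit ranks as the most at risk; with a shape of exactly 1 the risk would not depend on age at all.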

You cannot use typical linear or logistic regressions since some phones in your population will leave your observation period without ever being serviced. Survival models allow for this sort of missing information (this is called censoring).

Typically you would have the following data:

|ID| Type       | serviced  | months_since_purchase |
|1 | iphone     | 1         | 12                    |
|2 | iphone     | 0         | 15                    |
|3 | android    | 1         | 2                     |
|4 | android    | 0         | 10                    |
|5 | blackberry | 0         | 5.5                   |
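A table of this shape can be derived from raw records. The following is a minimal sketch in Python (standard library only), assuming each unit has a purchase date, an optional first-service date, and a common end of the observation period. The function name `to_survival_row` and all dates are hypothetical, chosen only for illustration.

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # average month length, an approximation

def to_survival_row(purchased, first_serviced, observed_until):
    """Return (months_since_purchase, serviced) for one unit.

    A unit with no service on or before `observed_until` is censored:
    its clock stops at the end of the observation period with serviced = 0.
    """
    if first_serviced is not None and first_serviced <= observed_until:
        end, serviced = first_serviced, 1
    else:
        end, serviced = observed_until, 0
    months = (end - purchased).days / DAYS_PER_MONTH
    return round(months, 1), serviced

# Hypothetical records: one serviced unit, one never serviced so far.
serviced_row = to_survival_row(date(2023, 1, 1), date(2024, 1, 1), date(2024, 6, 1))
censored_row = to_survival_row(date(2023, 1, 1), None, date(2024, 4, 1))
print(serviced_row, censored_row)
```

Unserviced units stay in the training data this way; they simply carry serviced = 0 together with the time they have been observed.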

With that data you could create the following model in R

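library(survival)  # provides Surv() and survfit()
# phone_repairs is assumed to be a data frame holding the table above
model <- survfit(Surv(months_since_purchase, serviced) ~ strata(Type) +
 cluster(ID), data = phone_repairs)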

The formula Surv(months_since_purchase, serviced) ~ strata(Type) + cluster(ID) passed to survfit reads as follows: months_since_purchase is the time at which the observation was made; serviced is 1 if the phone was serviced and 0 otherwise; strata(Type) makes sure that a separate survival curve is fitted for each phone type; and cluster(ID) makes sure that events relating to the same ID are treated as a cluster.
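To make the handling of censored units concrete, here is a minimal Python sketch (standard library only, rather than R) of the Kaplan-Meier estimator, which is what survfit computes by default. It fits one curve per Type on the example table, plus a pooled curve. The sketch ignores the cluster(ID) term, and the function name `kaplan_meier` is only illustrative.

```python
from collections import Counter, defaultdict

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed service time.

    durations: months under observation for each unit
    events:    1 if the unit was serviced at that time, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(event_counts):
        # Censored units still count as "at risk" until they leave observation.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - event_counts[t] / at_risk
        curve.append((t, survival))
    return curve

# (ID, Type, serviced, months_since_purchase) from the example table.
rows = [
    (1, "iphone", 1, 12),
    (2, "iphone", 0, 15),
    (3, "android", 1, 2),
    (4, "android", 0, 10),
    (5, "blackberry", 0, 5.5),
]

# Group durations and event flags by Type, mirroring strata(Type).
by_type = defaultdict(lambda: ([], []))
for _unit_id, phone_type, serviced, months in rows:
    by_type[phone_type][0].append(months)
    by_type[phone_type][1].append(serviced)

curves = {t: kaplan_meier(d, e) for t, (d, e) in by_type.items()}
pooled = kaplan_meier([r[3] for r in rows], [r[2] for r in rows])
print(curves)
print(pooled)
```

The blackberry stratum has no service events, so its curve is empty and estimated survival stays at 1 throughout the observed period. With so few rows per type the estimates are only illustrative.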

You could extend this model with joint models, for example using the R package JM.

Other tips

This is a supervised learning problem. Type is a predictor and #serviced is the target variable. The model is trained on the sample set you already have. My best guess is that no model will have substantial predictive ability with this data: Type alone is not enough.

Try including more factors (predictors) in the model, such as years_being_in_usage, equipment_model, have_been_in_service_before and so on. The more informative predictors you have, the better the model you can train.

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange