Question

This is a very general question, as I'm still very much in the learning phase with machine learning. I have some utility data about problematic meters. Even though the data is a "time series", I believe I can treat it as a multi-class classification problem (with 3 labels), but I would like some opinions before I go down that road.

I have been doing some feature engineering to derive other data points to help with the classification process (examples below are columns "Error1" and "Error2").

The meters come in 2 classes: those that are estimated issues ("1") and those that are non-estimated issues ("0").

My dataset roughly looks like below (I have several other Error features):

 Estimated     Meter ID          Date             DaysInDuration    Error1  Error2
     0            BBA         11/19/2019               31              0       0
     0            BBA         12/19/2019               62              1       0
     0            BBA         12/19/2019               92              1       0
     1            JJL         11/2/2019               120              1       0
     1            JJL         12/2/2019               150              1       1
     1            JJL         1/20/2020               180              2       2    

What I would like to attempt is to use a model that can handle multi-class classification (possibly a decision tree) and produce an output such as the one below:

 Estimated     Meter ID          Date             DaysInDuration    Error1  Error2   Classification Label   
     0            BBA         11/19/2019               31              0       0            1
     0            BBA         12/19/2019               62              1       0            1
     0            BBA         12/19/2019               92              1       0            2
     1            BBA         11/2/2019               120              1       0            3
     1            JJL         12/2/2020                30              1       1            1
     1            JJL         1/20/2020                60              2       2            1

Label meanings: "1" = low-risk issue, "2" = medium-risk issue, "3" = high-risk issue.

The model would assign "1", "2", or "3" depending on the number of days in the "DaysInDuration" column and the error counts in the "Error1" and "Error2" columns.

My feeling is that classification would still work, including with train/test splits, since the label depends more on the other data points than on the order dependency of a typical time series problem.

Solution

Your time intervals seem irregular. In most time series examples, we have observations at fixed intervals, such as GDP every quarter. You can instead just build a regular (non-time-series) machine learning model. If you think there is some sort of time component, where you would not want January data to be used to make predictions for the preceding November, then you can create your cross-validation partitions manually instead of randomly.
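As a rough illustration of manual partitions, here is a minimal sketch that splits rows by date so that training rows always come before validation rows. The row layout, the `time_based_folds` helper, and the cutoff dates are all assumptions made up for this example, not part of the original data.

```python
from datetime import date

# Toy rows loosely mirroring the question's columns:
# (meter_id, reading_date, error1, error2). Values are illustrative only.
rows = [
    ("BBA", date(2019, 11, 19), 0, 0),
    ("BBA", date(2019, 12, 19), 1, 0),
    ("JJL", date(2019, 11, 2), 1, 0),
    ("JJL", date(2019, 12, 2), 1, 1),
    ("JJL", date(2020, 1, 20), 2, 2),
]


def time_based_folds(rows, cutoffs):
    """Yield (train, validation) pairs; training rows always predate validation rows."""
    ordered = sorted(rows, key=lambda r: r[1])
    # Each fold validates on [start, end) and trains on everything before start.
    bounds = list(cutoffs) + [date.max]
    for start, end in zip(bounds, bounds[1:]):
        train = [r for r in ordered if r[1] < start]
        valid = [r for r in ordered if start <= r[1] < end]
        if train and valid:  # skip folds with an empty side
            yield train, valid


# Hypothetical cutoffs: validate on December 2019, then on January 2020 onward.
folds = list(time_based_folds(rows, [date(2019, 12, 1), date(2020, 1, 1)]))

for train, valid in folds:
    print(len(train), "training rows ->", len(valid), "validation rows")
```

If you also want to keep all rows of one meter on the same side of a split, you could group by meter ID in a similar way.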

However, you are predicting 1, 2, or 3 based on low, medium, or high DaysInDuration. A classifier would not know that low and high are farther apart than low and medium. You can instead create a regression model with DaysInDuration as the target. Once you have a prediction (e.g. 31), you can act according to whether it falls in the low, medium, or high range.
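To make the "regress first, bucket afterwards" idea concrete, here is a small sketch. The single feature (total error count), the hand-written least-squares fit, and the thresholds of 90 and 120 days are assumptions chosen for illustration; in practice you would use whatever regressor and cutoffs fit your data.

```python
# Toy data taken from the question's first table:
# x = Error1 + Error2, y = DaysInDuration.
total_errors = [0, 1, 1, 1, 2, 4]
days = [31, 62, 92, 120, 150, 180]


def fit_line(x, y):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(total_errors, days)


def predict_days(n_errors):
    """Predicted DaysInDuration for a given total error count."""
    return slope * n_errors + intercept


def risk_label(predicted_days, medium_from=90, high_from=120):
    """Map a numeric prediction to 1 (low), 2 (medium), or 3 (high).

    The thresholds are hypothetical; pick them from your business rules.
    """
    if predicted_days >= high_from:
        return 3
    if predicted_days >= medium_from:
        return 2
    return 1


for n_errors in (0, 2, 4):
    pred = predict_days(n_errors)
    print(n_errors, round(pred, 1), risk_label(pred))
```

Because the model outputs a number, the ordering low < medium < high is preserved, and you can move the thresholds later without retraining.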

Finally, you wouldn't be able to use DaysInDuration as a feature to predict the Classification Label when that label is itself derived from DaysInDuration, because the feature would give you the perfect answer every time. That would be target leakage.
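Here is a small sketch of why that is leakage and how to avoid it. It assumes, purely for illustration, that the label is produced by a threshold rule on DaysInDuration (the `label_from_days` function and its cutoffs are hypothetical). If the real label also depends on the error counts, the leak would be partial rather than total, but the column should still be handled with care.

```python
# Toy rows using the question's column names; values are illustrative.
rows = [
    {"DaysInDuration": 31, "Error1": 0, "Error2": 0},
    {"DaysInDuration": 62, "Error1": 1, "Error2": 0},
    {"DaysInDuration": 92, "Error1": 1, "Error2": 0},
    {"DaysInDuration": 120, "Error1": 1, "Error2": 0},
]


def label_from_days(days):
    """Hypothetical labelling rule: 1 = low, 2 = medium, 3 = high."""
    if days >= 120:
        return 3
    if days >= 90:
        return 2
    return 1


labels = [label_from_days(r["DaysInDuration"]) for r in rows]

# A "model" that is allowed to see DaysInDuration just replays the rule,
# so it scores perfectly without learning anything useful.
leaky_predictions = [label_from_days(r["DaysInDuration"]) for r in rows]
leaky_accuracy = sum(p == y for p, y in zip(leaky_predictions, labels)) / len(labels)
print("accuracy with the leaky column:", leaky_accuracy)

# Safer setup: drop the column the label was derived from before training.
LEAKY_COLUMNS = {"DaysInDuration"}
features = [{k: v for k, v in r.items() if k not in LEAKY_COLUMNS} for r in rows]
print(features[0])
```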

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange