Time Series Data Multi-Class Classification
Question
This is a very general question, as I'm still very much in the learning phase with machine learning. I have some utility data on problematic meters. Even though the data is a "time series", I believe I can treat it as a multi-class classification problem (with 3 labels), but I would like some opinions before I go down that road.
I have been doing some feature engineering to derive other data points to help with the classification process (examples below are columns "Error1" and "Error2").
The meters come in 2 classes: those that are estimated issues = "1", and those that are non-estimated issues = "0".
My dataset looks roughly like the following (I have several other Error features):
Estimated  Meter ID  Date        DaysInDuration  Error1  Error2
0          BBA       11/19/2019  31              0       0
0          BBA       12/19/2019  62              1       0
0          BBA       12/19/2019  92              1       0
1          JJL       11/2/2019   120             1       0
1          JJL       12/2/2019   150             1       1
1          JJL       1/20/2020   180             2       2
What I would like to attempt is to use a classification model that can handle multi-class classification (possibly a decision tree) and produce an output such as the one below:
Estimated  Meter ID  Date        DaysInDuration  Error1  Error2  Classification Label
0          BBA       11/19/2019  31              0       0       1
0          BBA       12/19/2019  62              1       0       1
0          BBA       12/19/2019  92              1       0       2
1          BBA       11/2/2019   120             1       0       3
1          JJL       12/2/2020   30              1       1       1
1          JJL       1/20/2020   60              2       2       1
Label meanings: "1" = low risk issue, "2" = medium risk issue, "3" = high risk issue.
The model would assign either "1", "2", or "3" depending on the number of days recorded in the "DaysInDuration" column and the number of errors counted in the "Error1" and "Error2" columns.
My feeling is that classification would still work, including with train/test splits, because the label depends more on the other data points than on the ordering of observations that matters in a typical time series problem.
Solution
Your time intervals seem irregular. In most time series examples, observations arrive at fixed intervals, such as GDP every quarter. You can instead just build a regular machine learning model. If you think there is a time component, such that you would not want January data used to make predictions for November data, you can create your cross-validation partitions manually (for example, by splitting on date) instead of randomly.
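As a rough illustration of building cross-validation partitions by hand, here is a minimal sketch in plain Python. The rows, the tuple layout, and the helper name time_ordered_folds are made up for this example; the idea is only that each validation block comes after the rows used for training.

```python
from datetime import date

# Hypothetical rows: (meter_id, reading_date, error1, error2).
rows = [
    ("BBA", date(2019, 11, 19), 0, 0),
    ("BBA", date(2019, 12, 19), 1, 0),
    ("JJL", date(2019, 11, 2), 1, 0),
    ("JJL", date(2019, 12, 2), 1, 1),
    ("JJL", date(2020, 1, 20), 2, 2),
    ("BBA", date(2020, 1, 19), 1, 0),
]


def time_ordered_folds(rows, n_folds):
    """Yield (train, validation) pairs in which no training row
    is dated later than any validation row."""
    ordered = sorted(rows, key=lambda r: r[1])
    fold_size = len(ordered) // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough rows for the requested number of folds")
    for k in range(1, n_folds + 1):
        split = k * fold_size
        # The last fold absorbs any leftover rows.
        end = split + fold_size if k < n_folds else len(ordered)
        yield ordered[:split], ordered[split:end]


folds = list(time_ordered_folds(rows, n_folds=2))
for train, valid in folds:
    print(len(train), "train rows up to", train[-1][1],
          "|", len(valid), "validation rows from", valid[0][1])
```

If you already use scikit-learn, its TimeSeriesSplit class produces expanding-window splits along the same lines, provided the rows are sorted by date first.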
However, you are predicting 1, 2, or 3 based on low, medium, or high DaysInDuration. A classifier would not know that low and high are farther apart than low and medium. You can instead build a regression model with DaysInDuration as the target. Once you have a prediction (e.g. 31), you can act according to whether it falls in the low, medium, or high range.
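Turning a predicted number of days into a risk level could look something like the sketch below. The cut-offs of 90 and 120 days are an assumption, picked only because they agree with the sample rows in the question (62 is low, 92 is medium, 120 is high); the real thresholds would come from your business rules.

```python
from bisect import bisect_right

# Hypothetical cut-offs in days, not taken from the question.
THRESHOLDS = [90, 120]
LABELS = [1, 2, 3]  # 1 = low risk, 2 = medium risk, 3 = high risk


def risk_label(predicted_days):
    """Map a predicted DaysInDuration value to a risk label."""
    # bisect_right counts how many cut-offs are <= the prediction.
    return LABELS[bisect_right(THRESHOLDS, predicted_days)]


for days in (31, 62, 92, 120):
    print(days, "->", risk_label(days))
```

Because the regression target is a number, a prediction that is off by 10 days is penalized less than one that is off by 100 days, which is the ordering information a plain classifier does not see.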
Finally, you would not be able to use DaysInDuration as a feature to predict Classification_Label, because the label is derived from it, so it would give you the perfect answer every time. That would be target leakage.
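To make the leakage point concrete, here is a small sketch in plain Python. It assumes, purely for illustration, that the label is a fixed function of DaysInDuration alone (the true_label rule and its 90 and 120 day cut-offs are hypothetical). Under that assumption, even a trivial model that only learns cut-offs on that one column reproduces the label exactly.

```python
def true_label(days):
    """Hypothetical labelling rule based only on DaysInDuration."""
    if days < 90:
        return 1
    if days < 120:
        return 2
    return 3


# Synthetic training data: one row for every day count from 1 to 200.
train_days = list(range(1, 201))
train_labels = [true_label(d) for d in train_days]

# Separate rows to score on.
test_days = [15, 45, 89, 90, 119, 120, 150]
test_labels = [true_label(d) for d in test_days]


def fit_cutoffs(days, labels):
    """Learn each cut-off as the midpoint between the largest value
    of one class and the smallest value of the next class."""
    cutoffs = []
    for lower, upper in [(1, 2), (2, 3)]:
        highest = max(d for d, y in zip(days, labels) if y == lower)
        lowest = min(d for d, y in zip(days, labels) if y == upper)
        cutoffs.append((highest + lowest) / 2)
    return cutoffs


def predict(cutoffs, days):
    """Label is 1 plus the number of cut-offs the value exceeds."""
    return 1 + sum(days > c for c in cutoffs)


cutoffs = fit_cutoffs(train_days, train_labels)
hits = sum(predict(cutoffs, d) == y for d, y in zip(test_days, test_labels))
accuracy = hits / len(test_days)
print("learned cut-offs:", cutoffs, "accuracy:", accuracy)
```

A perfect score here says nothing about the meters; the model has simply rediscovered the rule that produced the label. If the real label also depends on the error counts, the leakage is partial rather than total, but the same caution applies.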