Device behavior on one continuous variable vs. event rate

https://datascience.stackexchange.com/questions/8245

16-10-2019

Question

I have devices, each with time series data for one continuous variable. I need to evaluate the relationship between that variable's profile on each device and "events".

Those events are given as a number of occurrences over a time period.

My first idea is to cluster devices by similar behavior of that variable and compare those clusters with the low/middle/high event rates.

I was thinking about running K-means with the min, max, quartiles, mean, normal Q-Q p-value, kurtosis, etc. as dimensions, but I don't think it's a good idea because:

Those dimensions are not independent
It "loses" data, and so potentially loses classification power

Do you have some suggestions to group similar devices together?

Also, do you have other ideas to establish that relationship?

Context:

Python 3 with the SciPy stack
~3000 devices and hundreds of thousands of data points per day; 5 months to consider

Solution

Done with K-means clustering with descriptive statistics as features:

In short, I tried the idea described in the question, even though I thought it wouldn't work. Let experience speak...

I initially had a list of per-device data. Each element of the list was a 2-column, R-row matrix, where R differed from device to device. So, per device:

[
    [measureValue, timestamp],
    ...,
    [measureValue, timestamp],
]

Since I'm only interested in the distribution of measureValue, I transformed the initial data into an 8-column, N-row matrix, where N = number of devices.

The columns, computed on the corresponding device's measure values, are:

Arithmetic mean
Median
First quartile
Third quartile
Minimum
Maximum
Range
Standard deviation
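
The transformation above can be sketched roughly as follows. This is only an illustration, not the original code: the `devices` dictionary, the serial numbers, the `describe_device` helper and the column names are all hypothetical, and the sample standard deviation (ddof=1) is an assumption, since the answer does not say which variant was used.

```python
import numpy as np
import pandas as pd


def describe_device(values):
    """Return the eight summary statistics for one device's measure values."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    return {
        "mean": v.mean(),
        "median": median,
        "q1": q1,
        "q3": q3,
        "min": v.min(),
        "max": v.max(),
        "range": v.max() - v.min(),
        "std": v.std(ddof=1),  # assumption: sample standard deviation
    }


# Hypothetical input: serial number -> list of [measureValue, timestamp] rows.
# The number of rows may differ from one device to the next.
devices = {
    "SN001": [[1.0, 0], [2.0, 1], [3.0, 2], [4.0, 3]],
    "SN002": [[10.0, 0], [10.0, 1], [14.0, 2]],
}

# One row per device, one column per statistic; timestamps are dropped.
features = pd.DataFrame.from_dict(
    {
        serial: describe_device([row[0] for row in rows])
        for serial, rows in devices.items()
    },
    orient="index",
)
print(features)
```

Each device collapses to a single fixed-length row regardless of how many measurements it has, which is what makes the matrix usable by K-means.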

With this matrix, I applied K-means clustering using scikit-learn (Python).

I linked each matrix row to its physical device by using pandas DataFrames (Python) whose row index is the device's serial number.
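
A minimal sketch of this step, using synthetic data in place of the real feature matrix. The serial numbers, column names and random values are made up for illustration. Standardizing the columns with `StandardScaler` is an assumption added here, because K-means is distance-based and the statistics are on different scales; the answer does not say whether any scaling was applied.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the N x 8 feature matrix, indexed by serial number.
rng = np.random.default_rng(0)
serials = [f"SN{i:04d}" for i in range(60)]
columns = ["mean", "median", "q1", "q3", "min", "max", "range", "std"]
features = pd.DataFrame(rng.normal(size=(60, 8)), index=serials, columns=columns)

# Assumption: put every statistic on a comparable scale before clustering.
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(scaled)

# Re-attach the serial numbers so each label maps back to a physical device.
labels = pd.Series(cluster_ids, index=features.index, name="cluster")
print(labels.head())
```

Because the labels reuse the DataFrame index, `labels.loc["SN0003"]` returns the cluster of that device directly.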

I've tried with 5 clusters, and it works.
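
One possible way to compare the resulting clusters with the low/middle/high event rates mentioned in the question is a contingency table. The sketch below uses random data; the `clusters` and `event_rate` series are hypothetical, and splitting the event rate into terciles with `pd.qcut` is an assumption, not something stated in the question or the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical per-device cluster labels and event rates, indexed by serial number.
rng = np.random.default_rng(1)
serials = [f"SN{i:04d}" for i in range(90)]
clusters = pd.Series(rng.integers(0, 5, size=90), index=serials, name="cluster")
event_rate = pd.Series(rng.gamma(2.0, 1.5, size=90), index=serials, name="event_rate")

# Assumption: low/middle/high are defined as terciles of the event rate.
rate_level = pd.qcut(event_rate, q=3, labels=["low", "middle", "high"]).rename("rate_level")

# Rows are clusters, columns are event-rate levels, cells are device counts.
table = pd.crosstab(clusters, rate_level)
print(table)
```

If the clusters carry information about the events, the counts should concentrate in particular cells rather than spread evenly across each row.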

In case I need improvements in the future, I'm planning to add other statistics as columns, especially ones measuring deviation from normality: for example, kurtosis and the normal Q-Q plot p-value.
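
As a rough sketch of what those extra columns could look like with SciPy: a Q-Q plot by itself does not produce a p-value, so the code below substitutes two related quantities as an assumption. These are the p-value of D'Agostino's normality test (`scipy.stats.normaltest`) and the correlation coefficient of the normal probability plot (`scipy.stats.probplot`). The `normality_features` helper and its key names are hypothetical.

```python
import numpy as np
from scipy import stats


def normality_features(values):
    """Return statistics describing how far a sample deviates from normality."""
    v = np.asarray(values, dtype=float)
    _, p_value = stats.normaltest(v)  # D'Agostino-Pearson test
    _, (_, _, r) = stats.probplot(v, dist="norm")  # fit of the Q-Q line
    return {
        "kurtosis": stats.kurtosis(v),  # excess kurtosis, 0 for a normal law
        "normaltest_p": p_value,
        "qq_r": r,
    }


# Synthetic samples: one close to normal, one clearly not.
rng = np.random.default_rng(2)
normal_like = normality_features(rng.normal(size=2000))
uniform_like = normality_features(rng.uniform(size=2000))
print(normal_like)
print(uniform_like)
```

With very large samples, as in the question, normality tests tend to reject even for small deviations, so the kurtosis and the Q-Q correlation may be the more informative features.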

Best regards.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange