Using a K-NN Classification Approach for Time Series Data?
-
16-10-2019 - |
Question
I have a dataset which contains time-series data of water flow over time. I have a flow meter connected to a kitchen faucet, and I am trying to cluster or classify specific water usage events.
The data is collected every second, and in each row I am given a value for the amount of gallons which are flowing through my flow meter.
For example, I am trying to classify someone washing their hands, filling a teapot, cleaning dishes, etc...
Is this something that I can use a k-NN Classification Approach to cluster these events? If a clustering based approach isn't good, what other method of classified would be good for this type of data?
If I run some experiments, I can classify each event and turn it into a supervised learning problem. But at the moment, none of the water events are classified.
A very abridged version of my dataset looks like the following:
EDIT
water = pd.DataFrame(shower1)
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='S')
water = water.reindex(rng,fill_value=0.0)
water = water['shower1']
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})
starts = (df['water_amount']>0)&(df['water_amount'].shift(1)==0) #find all starts of events
n_events = sum(starts) #total number of events
df.loc[starts,'event_number'] = range(1,n_events+1) #numerate starts from 1 to n
df['event_number'] = df['event_number'].fillna(method='pad').fillna(-1) #forward fill all the values
df.loc[df['water_amount']==0,'event_number']=-1 #set all event numbers to -1 where the water amount is 0
df.groupby('event_number').agg({'time_stamp':'first',
'water_amount':'sum'}) #feature matrix
Solution
It seems pretty clear from looking at the data when an event starts and ends(basically whenever there is a sequence of positive values). So, instead of starting with some complicated models, I'd suggest calculating a few simple features (like length of the event, total amount of water, amount/seconds, time to previous event, time of day in seconds from start of recording) for every event and then try some clustering algorithm on that new data. k-NN might even produce something meaningful. But a statistical summary of the features can probably already give you a better idea of how to further approach this.
EDIT1
import pandas as pd
import numpy as np
rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='S')
water = [0,0,0.2,0.3,0.4,0,0,0.3,0.2,0.5]*6+[0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water,'event_number':np.zeros(len(water))})
j = 1
for k in range(len(df)):
if df.ix[k,'water_amount']== 0:
df.ix[k,'event_number'] = -1
else:
if df.ix[k-1,'water_amount'] > 0:
df.ix[k,'event_number'] = df.loc[k-1,'event_number']
else:
df.ix[k,'event_number'] = j
j = j+1
df.groupby('event_number').agg({'time_stamp':'first',
'water_amount':'sum'}) #feature matrix
EDIT2
rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='S')
water = [0,0,0.2,0.3,0.4,0,0,0.3,0.2,0.5]*6+[0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})
starts = (df['water_amount']>0)&(df['water_amount'].shift(1)==0) #find all starts of events
n_events = sum(starts) #total number of events
df.loc[starts,'event_number'] = range(1,n_events+1) #numerate starts from 1 to n
df['event_number'] = df['event_number'].fillna(method='pad').fillna(-1) #forward fill all the values
df.loc[df['water_amount']==0,'event_number']=-1 #set all event numbers to -1 where the water amount is 0
df.groupby('event_number').agg({'time_stamp':'first',
'water_amount':'sum'}) #feature matrix
OTHER TIPS
Don't forget preprocessing your data.
For example, do feature extractionn
- total amount of water
- duration
- variance