Question

I have a dataset which contains time-series data of water flow over time. I have a flow meter connected to a kitchen faucet, and I am trying to cluster or classify specific water usage events.

The data is collected every second, and each row gives the number of gallons flowing through my flow meter during that second.

For example, I am trying to classify someone washing their hands, filling a teapot, cleaning dishes, etc.

Can I use something like a k-NN classification approach to group these events? If a clustering-based approach isn't suitable, what other classification method would work well for this type of data?

If I run some experiments, I can classify each event and turn it into a supervised learning problem. But at the moment, none of the water events are classified.

A very abridged version of my dataset looks like the following:

[Figure: time series of water usage]

EDIT

import pandas as pd

water = pd.DataFrame(shower1) #shower1 holds my raw readings
rng = pd.date_range('2016-09-01 00:00:00', '2016-09-30 23:59:58', freq='s')
water = water.reindex(rng,fill_value=0.0)
water = water['shower1']
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})

#find all starts of events; fill_value=0 treats the second before the recording as "no flow"
starts = (df['water_amount']>0)&(df['water_amount'].shift(1,fill_value=0)==0)
n_events = int(starts.sum()) #total number of events
df.loc[starts,'event_number'] = range(1,n_events+1) #numerate starts from 1 to n
df['event_number'] = df['event_number'].ffill().fillna(-1) #forward fill all the values
df.loc[df['water_amount']==0,'event_number']=-1 #set all event numbers to -1 where the water amount is 0

df.groupby('event_number').agg({'time_stamp':'first',
                                'water_amount':'sum'}) #feature matrix (row -1 collects the idle seconds)


Solution

It seems pretty clear from looking at the data when an event starts and ends (basically, an event is any run of consecutive positive values). So, instead of starting with a complicated model, I'd suggest calculating a few simple features for every event (duration of the event, total amount of water, average flow rate in amount per second, time since the previous event, time of day in seconds from the start of the recording) and then trying a clustering algorithm on that new data. Note that k-NN is a supervised classifier and needs labelled events; for unlabelled events, something like k-means might already produce meaningful groups. A statistical summary of the features will probably also give you a better idea of how to proceed.
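To make the feature idea concrete, here is a minimal sketch on toy data shaped like the examples in this thread. The feature column names (`total_water`, `duration_s`, `rate_per_s`, `gap_to_previous_s`, `time_of_day_s`) are made up for illustration, and "time of day" is computed as seconds since midnight, which is only one possible reading of that feature.

```python
import pandas as pd

# Toy data: one reading per second, same shape as the examples in this thread.
rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='s')
water = [0, 0, 0.2, 0.3, 0.4, 0, 0, 0.3, 0.2, 0.5] * 6 + [0]
df = pd.DataFrame({'time_stamp': rng, 'water_amount': water})

# An event is a run of consecutive positive readings.
flowing = df['water_amount'] > 0
previous = df['water_amount'].shift(1, fill_value=0)
starts = flowing & ~(previous > 0)
df['event_number'] = starts.cumsum().where(flowing, -1)

# One row per event; idle seconds are left out.
events = df[flowing].groupby('event_number').agg(
    start=('time_stamp', 'first'),
    end=('time_stamp', 'last'),
    total_water=('water_amount', 'sum'),
    duration_s=('water_amount', 'count'),
)
events['rate_per_s'] = events['total_water'] / events['duration_s']
# Seconds between the last reading of the previous event and the first of this one.
events['gap_to_previous_s'] = (events['start'] - events['end'].shift(1)).dt.total_seconds()
# Assumption: "time of day" taken as seconds since midnight.
events['time_of_day_s'] = (events['start'] - events['start'].dt.normalize()).dt.total_seconds()

# Statistical summary of the numeric features.
print(events.drop(columns=['start', 'end']).describe())
```

The first event has no predecessor, so its `gap_to_previous_s` is NaN and needs to be filled or dropped before clustering.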

EDIT1

import pandas as pd
import numpy as np

rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='s')
water = [0,0,0.2,0.3,0.4,0,0,0.3,0.2,0.5]*6+[0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water,'event_number':np.zeros(len(water))})

#default integer index, so df.loc[k, ...] addresses row k
j = 1 #number of the next new event
for k in range(len(df)):
    if df.loc[k,'water_amount'] == 0:
        df.loc[k,'event_number'] = -1 #no water flowing
    else:
        if k > 0 and df.loc[k-1,'water_amount'] > 0:
            df.loc[k,'event_number'] = df.loc[k-1,'event_number'] #same event as the previous second
        else:
            df.loc[k,'event_number'] = j #a new event starts here
            j = j+1


df.groupby('event_number').agg({'time_stamp':'first',
                                'water_amount':'sum'}) #feature matrix

EDIT2

rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='s')
water = [0,0,0.2,0.3,0.4,0,0,0.3,0.2,0.5]*6+[0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})

#find all starts of events; fill_value=0 treats the second before the recording as "no flow"
starts = (df['water_amount']>0)&(df['water_amount'].shift(1,fill_value=0)==0)
n_events = int(starts.sum()) #total number of events
df.loc[starts,'event_number'] = range(1,n_events+1) #numerate starts from 1 to n
df['event_number'] = df['event_number'].ffill().fillna(-1) #forward fill all the values
df.loc[df['water_amount']==0,'event_number']=-1 #set all event numbers to -1 where the water amount is 0

df.groupby('event_number').agg({'time_stamp':'first',
                                'water_amount':'sum'}) #feature matrix (row -1 collects the idle seconds)
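Once a per-event feature matrix like the one above exists, a clustering step could look roughly like the following. This is only a sketch: it uses a plain NumPy implementation of Lloyd's k-means rather than any particular library, the feature values are invented for illustration, and k=2 is an arbitrary choice, not something the data in this thread establishes.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical per-event features: [duration in seconds, total gallons].
X = np.array([
    [5.0, 0.4], [6.0, 0.5], [4.0, 0.3],      # short, small events
    [60.0, 6.0], [55.0, 5.5], [65.0, 6.4],   # long, large events
])

# Put both features on the same scale so neither dominates the distance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

labels, centroids = kmeans(X_std, k=2)
print(labels)
```

The cluster ids themselves are arbitrary; what matters is which events end up together. Whether a cluster corresponds to "washing hands" or "filling a teapot" still has to be checked against a few labelled experiments.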

OTHER TIPS

Don't forget to preprocess your data.

For example, extract features for each event, such as:

  • total amount of water
  • duration
  • variance
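As a rough sketch, the three features listed above can be computed per event with a pandas groupby. The readings and the `event_number` labels below are invented for illustration; in practice they would come from whatever event detection you use.

```python
import pandas as pd

# Hypothetical per-second readings that already carry an event label.
readings = pd.DataFrame({
    'event_number': [1, 1, 1, 2, 2, 2, 2, 3],
    'water_amount': [0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 0.5, 0.1],
})

features = readings.groupby('event_number')['water_amount'].agg(
    total='sum',        # total amount of water
    duration='count',   # number of one-second readings
    variance='var',     # sample variance of the flow within the event
)

# An event with a single reading has no sample variance (NaN); treat it as 0 here.
features['variance'] = features['variance'].fillna(0.0)

print(features)
```

Filling the missing variance with 0 is a judgment call; dropping one-second events is another reasonable option.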
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange