Question

I have lots of sensor data from which I need to be able to detect changes reliably. Basically it comes from a water level sensor in a remote client, which uses an accelerometer and a float to get the water level. My problem is that the data can be noisy at times (it varies by 2-5 units per measurement), and sometimes I need to detect changes as small as 7-9 units.

When I graph the data, it's quite obvious to the human eye that there's a change, but how would I go about it programmatically? Right now I'm just trying to detect changes bigger than some threshold x, but it's not reliable enough. I've attached a sample graph and marked the changes with arrows. The huge changes at the beginning are just testing, so they're not normal behaviour for the data.

The data is in a MySQL database and the code is in PHP, so if you could point me in the right direction I'd highly appreciate it!

EDIT: There can also be spikes in the data which are not valid readings but rather mistakes in the measurement.

EDIT: Example data can be found at http://pastebin.com/x8C9AtAk. The algorithm would need to run every 30 minutes or so and should be able to detect changes within the last 2-4 pings. Pings arrive at 3-5 minute intervals.

[Graph: sample water level data, with the changes marked by arrows]


Solution

I made some awk that you, or someone else, might like to experiment with. I average the last 10 samples (m) excluding the current one, average the last 2 samples (n), then calculate the difference between the two averages and output a message if the absolute difference exceeds a threshold.

#!/bin/bash
awk -F, '
                                    # j will count number of samples
                                    # we will average last m samples and last n samples
   BEGIN {j=0;m=10;n=2}

   {d[j]=$3;id[j++]=$1" "$2}        # Store this point in array d[]

   END {                            # Do this at end after reading all samples
      for(i=m;i<j;i++){             # Iterate over all samples except the first m, which lack full history for the average

         totlastm=0                 # Calculate average over last m not incl current
         for(k=m;k>0;k--)totlastm+=d[i-k]
         avelastm=totlastm/m        # Average = total/m

         totlastn=0                 # Calculate average over last n
         for(k=n-1;k>=0;k--)totlastn+=d[i-k]
         avelastn=totlastn/n        # Average = total/n

         dif=avelastm-avelastn      # Calculate difference between ave last m and ave last n
         if(dif<0)dif=-dif          # Make absolute

         mesg="";
         if(dif>4)mesg="<-Change detected"; # Make message if change large
         printf "%s: Sample[%d]=%d,ave(%d)=%.2f,ave(%d)=%.2f,dif=%.2f%s\n",id[i],i,d[i],m,avelastm,n,avelastn,dif,mesg;
      }
   }
   ' <(tr -d '"' < levels.txt)

The last bit <(tr...) just removes the double quotes before sending the file levels.txt to awk.

Here is an excerpt from the output:

18393344 2014-03-01 14:08:34: Sample[1319]=343,ave(10)=342.00,ave(2)=342.00,dif=0.00
18393576 2014-03-01 14:13:37: Sample[1320]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18393808 2014-03-01 14:18:39: Sample[1321]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18394036 2014-03-01 14:23:45: Sample[1322]=342,ave(10)=342.30,ave(2)=342.50,dif=0.20
18394266 2014-03-01 14:28:47: Sample[1323]=341,ave(10)=342.20,ave(2)=341.50,dif=0.70
18394683 2014-03-01 14:38:16: Sample[1324]=346,ave(10)=342.20,ave(2)=343.50,dif=1.30
18394923 2014-03-01 14:43:17: Sample[1325]=348,ave(10)=342.70,ave(2)=347.00,dif=4.30<-Change detected
18395167 2014-03-01 14:48:25: Sample[1326]=345,ave(10)=343.20,ave(2)=346.50,dif=3.30
18395409 2014-03-01 14:53:28: Sample[1327]=347,ave(10)=343.60,ave(2)=346.00,dif=2.40
18395645 2014-03-01 14:58:30: Sample[1328]=347,ave(10)=343.90,ave(2)=347.00,dif=3.10
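Since the question mentions PHP and MySQL, here is a minimal sketch of the same two-window comparison in PHP, run against the most recent readings. The table and column names (readings, ts, level) and the connection details are illustrative assumptions, not taken from the question:

    <?php
    // Sketch of the same idea in PHP: compare the average of the m samples
    // before the newest reading with the average of the last n samples.
    // NOTE: the table "readings" and columns "ts"/"level" are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=sensors', 'user', 'pass');

    $m = 10;        // long window (excludes the newest sample)
    $n = 2;         // short window (includes the newest sample)
    $threshold = 4; // same threshold as the awk script

    // Fetch the newest m+1 readings, then restore chronological order.
    $stmt = $pdo->query('SELECT level FROM readings ORDER BY ts DESC LIMIT ' . ($m + 1));
    $levels = array_reverse($stmt->fetchAll(PDO::FETCH_COLUMN));

    if (count($levels) === $m + 1) {
        $older  = array_slice($levels, 0, $m);  // m samples, excluding the newest
        $recent = array_slice($levels, -$n);    // last n samples
        $dif = abs(array_sum($older) / $m - array_sum($recent) / $n);
        if ($dif > $threshold) {
            echo "Change detected (dif = $dif)\n";
        }
    }

Run from cron every 30 minutes, this would match the schedule described in the question; like the awk version, the two windows overlap by n-1 samples, so a sustained shift registers rather than a single spike.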

OTHER TIPS

The right way to go about problems of this kind is to build a model of the phenomenon of interest and also a model of the noise process, and then make inferences about the phenomenon given some data. These inferences are necessarily probabilistic. The general computation you need to carry out is

    P(H_k | data) = P(data | H_k) P(H_k) / sum_j (P(data | H_j) P(H_j))

(a generalized form of Bayes' rule), where the H_k are all the hypotheses of interest, such as "step of magnitude s at time t" or "noise of magnitude m". In this case there might be a large number of plausible hypotheses, covering all possible magnitudes and times. You might need to limit the range of hypotheses considered in order to make the problem tractable, e.g. only looking back a certain number of time steps.
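To make that concrete, here is a hedged PHP sketch of the simplest instantiation of the formula: a uniform prior over "step at index t" hypotheses plus a "no step" hypothesis, Gaussian noise with an assumed standard deviation, and segment means standing in for the step magnitudes. None of this comes from the answer above; it is just one way to fill in the model.

    <?php
    // Sketch of the Bayesian change-point idea, assuming Gaussian noise with
    // a known standard deviation $sigma and a uniform prior over hypotheses.

    function logLikelihoodSegment(array $xs, float $mu, float $sigma): float {
        $ll = 0.0;
        foreach ($xs as $x) {
            $ll += -0.5 * log(2 * M_PI * $sigma * $sigma)
                 - (($x - $mu) ** 2) / (2 * $sigma * $sigma);
        }
        return $ll;
    }

    function mean(array $xs): float {
        return array_sum($xs) / count($xs);
    }

    // $levels: recent window of readings, oldest first.
    // Returns the posterior over hypotheses: key 0 = "no step", key t = "step at t".
    function changePointPosterior(array $levels, float $sigma = 3.0): array {
        $n = count($levels);
        $logPost = [];
        // H_0: no step, one mean for the whole window.
        $logPost[0] = logLikelihoodSegment($levels, mean($levels), $sigma);
        // H_t: step between index t-1 and t (keep >= 2 points per segment).
        for ($t = 2; $t <= $n - 2; $t++) {
            $left  = array_slice($levels, 0, $t);
            $right = array_slice($levels, $t);
            $logPost[$t] = logLikelihoodSegment($left,  mean($left),  $sigma)
                         + logLikelihoodSegment($right, mean($right), $sigma);
        }
        // Normalize with log-sum-exp; the uniform prior cancels out.
        $max = max($logPost);
        $norm = 0.0;
        foreach ($logPost as $lp) { $norm += exp($lp - $max); }
        $post = [];
        foreach ($logPost as $t => $lp) { $post[$t] = exp($lp - $max) / $norm; }
        return $post;
    }

A change would then be reported when the total posterior mass on the step hypotheses (everything except key 0) exceeds some chosen confidence level, which handles the noise and the occasional invalid spike more gracefully than a fixed threshold.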

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow