Question

I need to be able to detect regions of a list of (x, y) data based on the features of the data. Some example data is shown in the first image. Right now, I need to be able to find the region between the black marks (sorry for the poor quality, imgur's editor isn't very accurate). Unfortunately, the problem is complicated by the fact that the data has a different length and shape each time it is collected, as seen in the second image. The sharp drop from ~98 to ~85 is consistent, and the two-dip/two-peak feature between ~1e-9 and ~1.5e-9 should be fairly consistent.

My question is, what is the best approach for detecting events in a signal, based on features of the signal? If I can get this sliced into the three regions marked (beginning to first mark, first to second mark, second mark to end), then I believe I can extend the method to handle my more complex situations.

I've solved similar problems before, but this one is unique in the amount of variation that occurs from one set of data to another. Last time I simply wrote a hand-crafted algorithm to find a local extremum and use it to locate the edge, but I feel like it's a rather ugly and inefficient solution that can't be easily reused.

I'm using Python 2.7.5, but ideally this should be a language agnostic solution so that I can implement it in other environments like VB.NET.

[image 1]

[image 2]


Solution

Just based on the two examples that you posted, I have a couple of different suggestions: thresholding or template matching.

Thresholds

Because you mentioned that the vertical drop in the signal is relatively constant, especially for the first event you're detecting, it seems like you could use a thresholding method, where you place the event at the first occurrence of the signal crossing some threshold of interest. For instance, to place the first event (in Python, and assuming that your measurement data live in a sequence of tuples containing x-y pairs):

def detect_onset_event(measurements):
    armed = False
    for offset, (timestamp, value) in enumerate(measurements):
        if value > 90:
            armed = True
        if armed and value < 85:
            return offset
    return -1  # failure condition, might want to raise ValueError()

So here we trigger at the first sample offset that drops below 85 after the signal has gone above 90.

You could do something similar for the second event, but it looks like the signal levels that are significant for that event might be a little less clear-cut; that depends on your application and measurement data. This is a good example of what makes thresholding approaches not so great -- they can be brittle and rely on hard-coded values. But if your measurements are quite regular, then this can work well with little coding effort.
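One way to soften the hard-coded values is to pull the levels out into parameters, so each event gets its own arm/trigger pair that you can tune or fit from labeled recordings. A minimal sketch along those lines -- the function name and the second event's levels (88/82) here are placeholders, not values taken from your data:

def detect_threshold_event(measurements, arm_level, trigger_level):
    # return the first offset where the value drops below trigger_level
    # after having risen above arm_level; -1 means no event was found
    armed = False
    for offset, (timestamp, value) in enumerate(measurements):
        if value > arm_level:
            armed = True
        if armed and value < trigger_level:
            return offset
    return -1

# first event, same levels as before; the second event's levels would need
# tuning against your own data
first = detect_threshold_event(measurements, arm_level=90, trigger_level=85)
second = detect_threshold_event(measurements[first:], arm_level=88, trigger_level=82)
# note: `second` is an offset into measurements[first:], not into the full list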

Templates

In this method, you can create a template for each signal event of interest, and then convolve the templates over your signal to identify similar regions of the signal.

import numpy

def detect_twopeak_event(measurements, template):
    data = numpy.asarray(measurements)  # convert to a numpy array of (x, y) rows
    activations = numpy.convolve(
        data[:, 1],  # convolve over all "value" elements
        template)
    # note: convolution reverses the template; for a matched-filter-style
    # comparison you may prefer numpy.correlate, or pass template[::-1] here
    return activations.argmax()

Here you'll need to create a list of the sample measurements that constitute the event you're trying to detect -- for example, you might extract the measurements from the two-peak area of an example signal to use as your template. Then by convolving this template over the measurement data, you'll get a metric for how similar the measurements are to your template. You can just return the index of the best match (as in the code above) or pass these similarity estimates to some other process to pick a "best."
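For instance, a usage sketch (the variable names and slice bounds below are made up; you would pick them by inspecting one of your own recordings):

# hypothetical: suppose the two-peak feature spans roughly samples 400-520
# of an example recording you have already looked at
example = numpy.asarray(example_measurements)
template = example[400:520, 1]  # the "value" column of that region

event_index = detect_twopeak_event(new_measurements, template)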

There are many ways to create templates, but I think one of the most promising approaches is to use an average of a bunch of neighborhoods from labeled training events. That is, suppose you have a database of signals paired with the sample offset where a given event happens. You could create a template by averaging a windowed region around these labeled events:

def create_mean_template(signals, offsets, radius=20):
    # taper each window with a Hann window so the edges contribute less, then
    # average; assumes every offset is at least `radius` samples from either
    # end of its signal so that all slices have the same length
    w = numpy.hanning(2 * radius)
    return numpy.mean(
        [s[o-radius:o+radius] * w for s, o in zip(signals, offsets)],
        axis=0)
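A usage sketch under the same assumptions, where `signals` is a list of 1D value arrays and `offsets` holds the hand-labeled sample index of the event in each one:

template = create_mean_template(signals, offsets, radius=20)

# reuse the convolution-based detector from above on a fresh recording
event_index = detect_twopeak_event(new_measurements, template)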

This has been used successfully in many signal processing domains like face recognition (e.g., you can create a template for an eye by averaging the pixels around a bunch of labeled eyes).

One place where the template approach will start to fail is if your signal has a lot of areas that look like the template, but these areas don't correspond to events you want to detect. It's tricky to deal with this, so the template method works best if there's a distinctive signal pattern that happens near your event.

Another way the template method will fail is if your measurement data contain, say, a two-peak area that's interesting but occurs at a different frequency than the samples you use as your template. In this case, you might be able to make your templates more robust to slight frequency changes by working in the time-frequency domain rather than the time-amplitude domain. There, instead of making 1D templates that correspond to the temporal pattern of amplitude changes you're interested in, you can run a windowed FFT on your measurements and then come up with 2D (time-frequency) templates that capture how the frequency content changes over a small region surrounding the event you're interested in.
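A rough sketch of that idea, assuming scipy is available (everything here -- the window length, the use of scipy.signal.spectrogram, and the way the template is sliced out -- is a choice you would have to tune):

import numpy
from scipy import signal

def detect_event_timefreq(values, template_spec, fs=1.0, nperseg=64):
    # spectrogram: rows are frequency bins, columns are time windows
    freqs, times, spec = signal.spectrogram(values, fs=fs, nperseg=nperseg)
    # template_spec must have the same number of frequency rows as `spec`,
    # so the 'valid' correlation only slides along the time axis
    response = signal.correlate2d(spec, template_spec, mode='valid')
    return response.argmax()  # column offset of the best-matching window

# building the template is the same idea as before: take the spectrogram of a
# labeled example and slice out the columns around the event (hypothetical bounds)
_, _, example_spec = signal.spectrogram(example_values, fs=1.0, nperseg=64)
template_spec = example_spec[:, 30:40]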

Hope some of these suggestions are helpful!

OTHER TIPS

You could probably use a Hidden Markov Model (HMM) with six or more states. I am no math genius, so I would use one with discrete states and round your data to the nearest integer. My model would look something like this:

state 1: start blob (emissions around 97)
state 2: 'fall' (emissions between 83 and 100)
state 3: interesting stuff (emissions between 82 and 86)
state 4: peak (emissions between 80 and 88)
state 5: last peak (emissions between 80 and 94)
state 6: baseline (emissions between 85 and 87)

HMMs are not the perfect tool, because they mostly capture ranges of emissions in each state, but they are good at tolerating features that show up much earlier or later than expected, because they only care about the transition probabilities between states.

I hope this helps and makes sense.

If you are super lazy, you could probably just label six spectra by hand, cut the data accordingly, and calculate the emission probabilities for each value in each state.

# pseudo code
from collections import defaultdict
emissions = defaultdict(lambda: defaultdict(int))  # state -> {value: count}
for state_label, value in data:  # data: hand-labeled (state, value) pairs
    emissions[state_label][value] += 1
# then normalize each state's counts to 1 and voila, you have the emission probabilities
for state_label, counts in emissions.items():
    total = float(sum(counts.values()))
    emissions[state_label] = {v: c / total for v, c in counts.items()}

The above is super oversimplified, but it should be much better and more robust than the if-statement stuff you usually do :)... HMMs usually also have a transition matrix, but because the signal in your data is so strong, you could 'skip' that one and go for my pragmatic solution :)

Then you can subsequently use the Viterbi path to label all your future experiments, as sketched below.
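For that Viterbi step, a minimal pure-numpy sketch -- the `viterbi` function, the left-to-right transition matrix, and its probabilities are my own assumptions standing in for the 'skipped' learned one, and `log_emit` would come from the normalized emission counts above:

import numpy

def viterbi(observations, log_start, log_trans, log_emit):
    # observations: integer-coded values (indices into the columns of log_emit)
    # returns the most likely state index for every observation
    best = log_start + log_emit[:, observations[0]]
    back = []
    for obs in observations[1:]:
        scores = best[:, None] + log_trans      # scores[i, j]: arrive in j from i
        back.append(scores.argmax(axis=0))
        best = scores.max(axis=0) + log_emit[:, obs]
    path = [best.argmax()]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# hypothetical left-to-right transitions for the six states above: strongly
# favor staying put, allow stepping to the next state, discourage jumps
trans = numpy.full((6, 6), 1e-6)
for i in range(6):
    trans[i, i] = 0.95
    if i + 1 < 6:
        trans[i, i + 1] = 0.05
trans /= trans.sum(axis=1, keepdims=True)  # normalize rows into proper probabilities
log_trans = numpy.log(trans)
log_start = numpy.log(numpy.array([0.99] + [0.002] * 5))  # almost surely start in state 1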

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow