How to determine sample rate of a time series dataset?

https://datascience.stackexchange.com/questions/77295

12-12-2020

Question

I have a dataset of magnetometer sensor readings which looks like:

TimeStamp      X            Y            Z
1.59408E+12    -22.619999   28.8         -22.14
1.59408E+12    -22.5        29.039999    -22.08
1.59408E+12    -22.32       29.039999    -21.779999
1.59408E+12    -22.38       29.16        -21.6
.
.
.
And so on

The timestamps are in milliseconds; the values displayed as 1.59408E+12, 1.59408E+12, 1.59408E+12 are actually 1594076006983, 1594076006994, 1594076007004, and so on. So what is the sampling rate/frequency of the data?

Solution

General Explanation

Generally speaking, the difference between consecutive timestamps is the sampling period (interval) of the data, and the sampling rate (frequency) is its reciprocal.

If all is well, that difference will be constant across your time series; in this case, this difference is the sampling period of your data.
In other cases, it might be a bit more complicated: for example, you can have missing data, or data where some timestamps are shifted a bit forward or backward.

In that case, you might want to 'smooth out' those issues, for example by taking the median of the differences (if some samples come a bit too early and some come a bit too late), or the mode (if some samples are missing, meaning that most will come at the right time and some will be too late by an exact multiple of the real sampling period).
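
Applied to the three full-precision timestamps quoted in the question, this might look like the minimal sketch below. The function name estimate_sample_rate_hz is made up for illustration, and three samples are far too few for a reliable estimate; on the real data you would pass the whole timestamp column, read with full precision rather than as the rounded 1.59408E+12.

```python
from statistics import median

def estimate_sample_rate_hz(timestamps_ms):
    """Estimate the sampling rate in Hz from timestamps given in milliseconds."""
    # Differences between consecutive timestamps (sampling periods, in ms)
    periods_ms = [later - earlier
                  for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
    # The median is robust to a few samples arriving slightly early or late
    median_period_ms = median(periods_ms)
    # 1000 ms per second, so rate in Hz = 1000 / period in ms
    return 1000.0 / median_period_ms

# The three timestamps quoted in the question
timestamps_ms = [1594076006983, 1594076006994, 1594076007004]
rate_hz = estimate_sample_rate_hz(timestamps_ms)
print(f"{rate_hz:.1f} Hz")  # prints "95.2 Hz"
```

The two differences here are 11 ms and 10 ms, so the median period is 10.5 ms, which suggests a nominal rate of roughly 100 Hz.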

Python (pandas) Example

Here's an example in Python, using the pandas library:

Let's make up a time series with missing values. Its sampling interval is 5 minutes, but about 5% of the samples are missing (the data are random integers between 1 and 20, and the 1's have been dropped):

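import random
import pandas as pd

# Regular 5-minute grid (pd.DatetimeIndex(start=..., end=...) no longer works in pandas 1.0+)
index = pd.date_range(start='2020-01-01 00:00:00', end='2020-04-01 00:00:00', freq='5min')
# One random integer between 1 and 20 per timestamp
s = pd.Series([random.randint(1, 20) for _ in index], index=index)
# Drop the 1's to simulate missing samples (about 5% of the data)
s = s[s.ne(1)]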

Now let's look at the time-difference between existing (not dropped) samples:

s.index.to_series().diff().value_counts()

yields something like this (the exact counts vary between runs, since the data is random):

00:05:00    23629
00:10:00     1191
00:15:00       63
00:20:00        2

i.e. the vast majority of samples really are 5 minutes apart, but not all of them. We can estimate the actual sampling interval by taking the median of those differences:

s.index.to_series().diff().median()

yields

Timedelta('0 days 00:05:00')
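
To report the result as a rate rather than an interval, take the reciprocal of the interval in seconds. The sketch below is one way to do that on a small made-up index (12 timestamps with two removed, not the series above). It uses the mode of the differences, as suggested earlier for data with missing samples; the variable names are illustrative.

```python
import pandas as pd

# Small illustrative index: 12 timestamps, 5 minutes apart
index = pd.date_range('2020-01-01 00:00:00', periods=12, freq='5min')
# Remove two timestamps (by position) to simulate missing samples
index = index.delete([3, 7])

# Differences between consecutive timestamps; the first one is NaT, so drop it
deltas = index.to_series().diff().dropna()

# Most common difference: gaps left by missing samples are outvoted
period = deltas.mode().iloc[0]

# Sampling rate in Hz is the reciprocal of the period in seconds
rate_hz = 1.0 / period.total_seconds()
print(period, rate_hz)
```

Here seven of the nine differences are 5 minutes and two are 10 minutes, so the mode recovers the 5-minute interval, i.e. one sample every 300 seconds.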

Licensed under: CC-BY-SA with attribution
