I found a related discussion on sampling on the Splunk Answers page below.
An alternative to filtering by date_minute or date_second is to filter events in a where clause using the _serial property or the random() function. For example,
* | where (_serial % 60) = 0 | ...
or
* | where (random() % 60) = 0 | ...
However, in both cases the search appears to do a full scan of the data. This may still be desirable if you need the flexibility and the result is feeding into a more expensive query. Otherwise, the date_second approach is significantly faster, apparently because events are indexed by that field. For example, the two queries above ran in 3m 20s on a subset of data, whereas the query below ran in 11s on the same data.
* date_second=0 | ...
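To make the three filters concrete, here is a rough Python sketch of what each one keeps. It is only a simulation, not Splunk code: the event dictionaries, the make_events helper, and the three sample_* functions are hypothetical names I made up for illustration, and date_second is modeled simply as the seconds component of each event's timestamp. It says nothing about the performance difference, only about which events pass each filter.

```python
import random
from datetime import datetime, timedelta


def make_events(n, start=datetime(2024, 1, 1)):
    """Build n hypothetical events, one per second, numbered by _serial."""
    return [
        {"_serial": i, "_time": start + timedelta(seconds=i)}
        for i in range(n)
    ]


def sample_by_serial(events, n=60):
    """Mimics: where (_serial % 60) = 0 (keeps every n-th event)."""
    return [e for e in events if e["_serial"] % n == 0]


def sample_by_random(events, n=60, seed=1):
    """Mimics: where (random() % 60) = 0 (keeps roughly 1 in n events)."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [e for e in events if rng.randrange(2**31) % n == 0]


def sample_by_second(events, second=0):
    """Mimics: date_second=0 (keeps events stamped at that second)."""
    return [e for e in events if e["_time"].second == second]


events = make_events(3600)  # one hour of evenly spaced events
by_serial = sample_by_serial(events)
by_random = sample_by_random(events)
by_second = sample_by_second(events)

print(len(by_serial), len(by_random), len(by_second))
```

With perfectly even one-per-second events, the _serial and date_second filters happen to keep the same 60 events. On real data they would differ: the _serial and random() filters keep about 1 in 60 events regardless of timing, while date_second=0 keeps whatever happened to arrive in the first second of each minute.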