Question

I have a table of activity events in BigQuery, currently about 40 MB of data. The activity date is stored in one of the fields as a string in the format YYYY-MM-DD HH:MM:SS. I need a way to find periods of inactivity (longer than some predefined threshold) that runs in a reasonable amount of time.

The query I built has already been running for an hour. Here it is:

SELECT t1.date, MIN(PARSE_UTC_USEC(t1.date) - PARSE_UTC_USEC(t2.date)) AS mintime 
FROM logs t1
JOIN (SELECT date, http_error FROM logs) t2 ON t1.http_error = t2.http_error
WHERE PARSE_UTC_USEC(t1.date) > PARSE_UTC_USEC(t2.date)
GROUP BY t1.date
HAVING mintime > 1000;

The idea is:

1. Take the Cartesian product of the table with itself (http_error is a field that almost never changes value, so joining on it does the trick).
2. Keep only the pairs where date1 > date2.
3. For every date1, take the date2 with the minimal difference.
4. Keep only the cases where that minimal difference is greater than the threshold.
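For comparison, a minimal sketch of the same gap logic without the self-join, still in BigQuery legacy SQL and assuming the logs table and string date column from the query above: LAG over the timestamp-sorted rows pairs each event with its predecessor directly (replacing steps 1-3), and the outer WHERE applies the threshold from step 4 (the one-hour value, in microseconds since PARSE_UTC_USEC returns microseconds, is a placeholder):

SELECT
  prev_date,
  date,
  (ts - prev_ts) / 1000000 AS gap_seconds
FROM (
  SELECT
    date,
    ts,
    LAG(date, 1) OVER (ORDER BY ts) AS prev_date,
    LAG(ts, 1) OVER (ORDER BY ts) AS prev_ts
  FROM (SELECT date, PARSE_UTC_USEC(date) AS ts FROM logs)
)
WHERE ts - prev_ts > 3600 * 1000000;

This scans the table once instead of joining it with itself, so the cost grows linearly with the number of rows rather than quadratically.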

I admit that the real query I use is burdened a bit by fixes for invalid data (which adds extra operations). But I really need a better approach to this, and I'd be glad to hear other ideas.


Solution

I don't know the granularity of inactivity you are looking for, but why not try bucketing by your timestamp, then counting the relative frequency of activities in each bucket:

SELECT
  UTC_USEC_TO_HOUR(PARSE_UTC_USEC(date)) AS hour_bucket,
  COUNT(*) AS activity_count
FROM logs
GROUP BY
  hour_bucket
ORDER BY
  activity_count ASC;
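A follow-up sketch under the same assumptions (legacy SQL, the logs table, the string date column): hours with no rows at all simply won't appear in the grouped output, but a HAVING clause surfaces the sparse buckets directly. The cutoff of 10 events per hour is an arbitrary placeholder.

SELECT
  UTC_USEC_TO_HOUR(PARSE_UTC_USEC(date)) AS hour_bucket,
  COUNT(*) AS activity_count
FROM logs
GROUP BY hour_bucket
HAVING activity_count < 10
ORDER BY hour_bucket;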