Problem

I have an activity log table which tracks the timestamp of each user action. I need to be able to identify users who perform more actions in a given time period than there are minutes in that period, with at least 10 actions required for a "block of actions" to be identified.

E.g.

Flagged

  • 11 activities in 10 minutes
  • 25 activities in 20 minutes

Not flagged

  • 9 activities in 5 minutes
  • 23 activities in 30 minutes

I need to be able to identify the largest block which satisfies these conditions, so for example if a user performs the following actions:

Minutes    Action_Count
   1         3
   2         0
   3         0
   4         3
   5         0
   6         3
   7         0
   8         1
   9         0
   10        0
   11        0
   12        1
   13        3

Even though the actions from minutes 1-8 will already be flagged (10 actions in less than a 10-minute period), the actions at minutes 12 and 13 should also be included, since the full run amounts to 14 actions in 13 minutes, even though the action at minute 12, taken on its own, is back inside the threshold.

The data set per user is likely to be approximately 100 timestamps over a 7 day period, so there will likely be several blocks of actions per user, corresponding to when they are active.

I can identify blocks of actions which occur within the same minute with the following:

$timestamps = array(1392700382,1392700458,1392700486,1392700612,1392700619,1392700636,1392700648,1392700671,1392700679,1392700701,1392700860,1392815451,1392815486,1392815499,1392815532,1392815539,1392815680,1392815699,1392815763,1392815851,1392815972,1392816075,1392903882,1392903950,1392904029,1392904181,1392904259,1392904377,1392904402,1392904411,1392904437,1392904445,1392904469,1392904638,1392904735,1392988830,1392988858,1392988889,1392988917,1392988980,1392989016,1392989078,1392989108,1392989140,1392989167,1392989203,1392989251,1392989393,1392989401,1392989408,1392989415,1392989511,1393065019,1393065352,1393065448,1393066105,1393066110,1393066114,1393066136,1393066139,1393066144,1393066148,1393066203,1393114548,1393114563,1393114696,1393114697,1393114717,1393114723,1393114742,1393114748,1393114753,1393114785,1393114824,1393204378,1393204383,1393204387,1393204391,1393204408,1393204414,1393204419,1393204424,1393204474);
$elements = array();

foreach ($timestamps as $timestamp) {

    $oneMinuteAgo = $timestamp - 60;

    // Collect every timestamp seen so far
    $elements[] = $timestamp;

    // Keep only the timestamps that fall within the minute before the current one
    $postsInLastMinute = array_filter($elements, function ($value) use ($oneMinuteAgo) {
        return $value > $oneMinuteAgo;
    });

    echo implode(', ', $postsInLastMinute)."\n";
}

(see output here: http://pastebin.com/LDEQWMxn)

Although I'm not sure how this information helps me, or even whether this is the right way to approach the problem.


Solution

Geobits has already mentioned the naive approach, and if you are dealing with about 100 timestamps, that seems reasonably fast. I wouldn't bother putting the stamps into one-minute buckets first. Calculate the rate activities / time_span and find the longest stretch, by number of activities or by time span, where the rate exceeds 1/60. (Or just find out whether such a time span exists.)
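
For concreteness, here is a minimal PHP sketch of that naive check; it corresponds to the flag_range step used in the pseudocode below. The parameter name and the exact shape of the function are assumptions, not code from the answer, and it expects an ascending array of Unix timestamps:

function flag_range(array $times)
{
    $n = count($times);

    // Try every window of at least 10 consecutive timestamps and compare
    // its activity count against its length in minutes.
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 9; $j < $n; $j++) {
            $count   = $j - $i + 1;              // activities in the window
            $seconds = $times[$j] - $times[$i];  // window length in seconds
            if ($count * 60 > $seconds) {        // i.e. rate above 1 per minute
                return true;
            }
        }
    }

    return false;
}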

In your example data (the timestamps in the code example), activities happen in chunks of at most 20 minutes over a period of a week, and these chunks have large gaps between them. You can use this property of the data to improve the naive algorithm by looking at your timestamps chunk-wise. This has the benefit that you can rule out chunks with fewer than 10 timestamps right away. It also means the cost of the naive approach matters less, because your inner loop has to do only n - 10 iterations per chunk, where n rarely exceeds 20.

Here's the approach in pseudocode. (I'm not familiar with PHP.)

flag_piecewise(time[])
    gap = 60 * 60           # choose an appropriate minimum gap, e.g. 1h
    start = 0
    curr = 1

    while true
        if (curr == len(time) || time[curr] - time[curr - 1] > gap)
            if (flag_range(time[start:curr])) return true
            if (curr == len(time)) break
            start = curr
        curr++

    return false

Here flag_range is the implementation of the naive approach. In a quick test on your sample data, I got a speed-up of about a factor of 20. (The naive approach was fast enough, I think, but you get the speed-up without adding much complexity.)
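
Translated into PHP, the pseudocode might look roughly like this; again a sketch, assuming sorted Unix timestamps and the flag_range() sketch from above, with the one-hour gap taken from the pseudocode:

function flag_piecewise(array $times, $gap = 3600)    // gap = 1h, as above
{
    $n = count($times);
    $start = 0;

    for ($curr = 1; $curr <= $n; $curr++) {
        // A chunk ends at the end of the array or before a gap longer than $gap
        if ($curr === $n || $times[$curr] - $times[$curr - 1] > $gap) {
            $chunk = array_slice($times, $start, $curr - $start);

            // Chunks with fewer than 10 timestamps can never be flagged
            if (count($chunk) >= 10 && flag_range($chunk)) {
                return true;
            }
            $start = $curr;
        }
    }

    return false;
}

// e.g. var_dump(flag_piecewise($timestamps));   // using the question's array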

License: CC-BY-SA with attribution