Question

I have a site where I record client metrics in a SQL Server 2008 db on every link clicked. I have already written the query to get the daily total clicks, however I want to find out how many times the user clicked within a given timespan (ie. within 5 seconds).

The idea here is to lock out incoming IP addresses that are trying to scrape content. It would be assumed that if more than 5 "clicks" is detected within 5 seconds or the number of daily clicks from a given IP address exceeds some value, that this is a scraping attempt.

I have tried a few variations of the following:

-- when a user clicked more than 5 times in 5 seconds
SELECT DATEADD(SECOND, DATEDIFF(SECOND, 0, ClickTimeStamp), 0) as ClickTimeStamp, COUNT(UserClickID) as [Count]
FROM UserClicks
WHERE DATEDIFF(SECOND, 0, ClickTimeStamp) = 5
GROUP BY IPAddress, ClickTimeStamp

This one in particular returns the following error:

Msg 535, Level 16, State 0, Line 3 The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

So once again, I want to use the seconds datepart, which I believe I'm on the right track, but not quite getting it.

Help appreciated. Thanks.

-- UPDATE --

Great suggestions and helped me think that the approach is wrong. The check is going to be made on every click. What I should do is for a given timestamp, check to see if in the last 5 seconds 5 clicks have been recorded from the same IP address. So it would be something like, count the number of clicks for > GetDate() - 5 seconds

Trying the following still isn't giving me an accurate figure.

SELECT COUNT(*)
FROM UserClicks
WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate())
Was it helpful?

Solution

An answer for your UPDATE: the problem is in the third line of

SELECT COUNT(*)
 FROM UserClicks
 WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate()) 

GetDate() - DATEADD(SECOND, -5, GetDate()) is saying "take the current date time and subtract (the current date time minus five seconds)". I'm not entirely sure what kind of value this produces, but it won't be the one you want.

You still want some kind of time-period, perahps like so:

SELECT count(*)
 from UserClicks
 where IPAddress = @IPAddress
  and ClickTimeStamp between getdate() and dateadd(second, -5, getdate())

I'm a bit uncomfortable using getdate() there--if you have a specific datetime value (accurate to the second), you should probably use it.

OTHER TIPS

Hoping my syntax is good, I only have oracle to test this on. I'm going to assume you have an ID column called user_id that is unique to that user (is it user_click_id? helpful to include table create statements in these questions when you can)

You'll have to preform a self join on this one. Logic will be take the userclick and join onto userclick on userId = userId and difference on clicktimestamp is between 0-5 seconds. Then it's counting from the subselect.

select u1.user_id, u1.clicktimestamp, u2.clicktimestamp
from userclicks uc1
left join user_clicks uc2  
    on u2.userk_id = u1.user_id
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) <= 5
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) > 0

This select statement should give you the user_id/clicktimestampe and 1 row for every record that is between 0 and 5 seconds apart from that clicktimestamp from the same user. Now it's just a matter of counting all user_id,u1.clicktimestamp combinations and highlighting the ones with 5 or more. Take the above query and turn it into a subselect and pull counts from it:

select u1.user_id, u1.clicktimestamp, count(1)
from 
(select u1.user_id, u1.clicktimestamp
from userclicks uc1
left join user_clicks uc2  
    on u2.userk_id = u1.user_id
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) <= 5
    and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) > 0) a
group by u1.user_id, u1.clicktimestamp
having count(1) >= 5

Wish I could verify my syntax on a MS machine....there might be some typo's in there, but the logic should be good.

Assuming log entries are only entered for current activity -- that is, whenever a new row is inserted, the logged time is for that point in time and never for any prior point in time -- then you should only need to review data for a set period of time, and not have to review "all data" as you are doing now.

Next question is: how frequently do you make this check? If you are concerned with clicks per second, then something between "once per hour" and "once every 24 hours" seems reasonable.

Next up: define your interval. "All clicks per IPAddress within 5 seconds" could go two ways: set window (00-04, 05-09, 10-14, etc), or sliding window(00-04, 01-05, 02-06, etc.) Probably irrelevant with a 5 second window, but perhaps more relevant for longer periods (clicks per "day").

With that, the general approach I'd take is:

  • Start with earliest point in time you care about (1 hour ago, 24 hours ago)
  • Set up "buckets", means by which time windows can be identified (00:00:00 - 00:00:04, 00:00:05 - 00:00:09, etc.). This could be done as a temp table.
  • For all events, calculate number of elapsed seconds since your earliest point
  • For each bucket, count number of events that hit that bucket, grouped by IPAddress (inner join on the temp table on seconds between lowValue and highValue)
  • Identify those that exceed your threshold (having count(*) > X), and defenestrate them.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top