MySQL: get more than 3 records in an interval of 1 minute, return GROUP_CONCAT of IDs
14-03-2021
Question
Currently I have this dataset; I need to return grouped IDs that fall within a range of 60 seconds and have more than 3 records.
CREATE TABLE test
(
`id` bigint NOT NULL AUTO_INCREMENT,
created_date TIMESTAMP(1) NOT NULL,
origin_url VARCHAR (200) NOT NULL,
client_session_id VARCHAR (50) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UC_PRE_CHAT_TALKID_COL` (`id`)
);
INSERT INTO test VALUES
(1,'2021-01-18 11:02:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(2,'2021-01-18 11:02:35.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(3,'2021-01-18 11:02:03.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(4,'2021-01-18 11:11:28.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii'),
(5,'2021-01-18 11:11:36.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii'),
(6,'2021-01-18 11:11:05.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii');
db<>fiddle here
Something like this:
ids    origin_url           client_session_id
1,2,3  https://zendes.com/  znkjoc3gfth2c3m0t1klii
4,5,6  https://rarara.com/  znkjoc3gfth2c3m0t1klii
Edit: some context.
I currently run a cron job every minute that analyzes the last 60 seconds of bot records in the database; I need to group the conversation IDs that have more than 3 records within 60 seconds for the same URL and client_session_id.
Here is the SQL I'm running:
select
count(session_id),
client_session_id,
GROUP_CONCAT(id) as talkIds,
origin_url
from
bot_talk
where
created_date > now() - interval 60 second
group by
client_session_id, origin_url
having
count(session_id) >= 3
This query works as I expect, but sometimes my cron service is down, and I lose those repeated records.
I thought about running a SQL job (cron) at the end of the day that analyzes the last 24 hours and looks for the records that repeat according to the rule I mentioned above.
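For reference, a minimal sketch of that end-of-day idea is simply the same query with a widened lookback window (the bot_talk table and its columns are assumed from the query above; the comment notes why this alone is not enough):

```sql
-- Sketch: the same query, with the lookback widened from 60 seconds
-- to 24 hours for an end-of-day run.
select
    count(session_id),
    client_session_id,
    GROUP_CONCAT(id) as talkIds,
    origin_url
from
    bot_talk
where
    created_date > now() - interval 24 hour
group by
    client_session_id, origin_url
having
    count(session_id) >= 3;
-- Caveat: over a 24-hour window this GROUP BY counts ALL records per
-- (client_session_id, origin_url), not only those falling within 60
-- seconds of each other, so it over-matches; a window-function
-- approach is needed to enforce the 60-second rule.
```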
Solution
Here is the answer - see the fiddle. Another answer I wrote to a similar question may provide some clearer background and is a bit simpler - see here.
All I will say is that it gives some idea of the power of window functions.
I noticed in the comments that there was some debate about what constituted a group. In this example, I have constructed the SQL so that it picks out as a group all records where each one started within 180 seconds (i.e. 3 minutes) of the one before it; you can change the 180 to 60 (or whatever) yourself.
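This grouping rule is the classic gaps-and-islands pattern. A stripped-down sketch of just that core (it mirrors the testy flag and the running SUM in the full query further down, but is not the full solution):

```sql
-- Gaps-and-islands core: flag each row that starts a new group
-- (first row of its partition, or gap from the previous row > 180s),
-- then number the groups with a running SUM of the flags.
SELECT id,
       created_date,
       origin_url,
       client_session_id,
       SUM(new_grp) OVER (PARTITION BY origin_url, client_session_id
                          ORDER BY created_date) AS grp
FROM (
    SELECT id,
           created_date,
           origin_url,
           client_session_id,
           CASE
             WHEN LAG(created_date) OVER w IS NULL
               OR UNIX_TIMESTAMP(created_date)
                - UNIX_TIMESTAMP(LAG(created_date) OVER w) > 180
             THEN 1 ELSE 0
           END AS new_grp
    FROM test
    WINDOW w AS (PARTITION BY origin_url, client_session_id
                 ORDER BY created_date)
) AS flagged;
```

Every row then carries a grp number shared by all rows of its island, which is what makes the later GROUP BY and counting possible.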
I've added some records for the purpose of testing and also added CONSTRAINTs to the table definition. It's always best to put as much as possible into the DDL: your database is your final bastion of defence for your data!
CREATE TABLE test
(
id bigint NOT NULL,
created_date TIMESTAMP(2) NOT NULL,
origin_url VARCHAR (200) NOT NULL,
client_session_id VARCHAR (50) NOT NULL,
CONSTRAINT test_id_pk PRIMARY KEY (id),
CONSTRAINT test_cd_url_sess_id_uq UNIQUE (created_date, origin_url, client_session_id)
);
Always use named constraints - these provide more meaningful messages than: ... failed... CONSTRAINT xyz_000_43abc has been violated...
I populated it as follows:
INSERT INTO test VALUES
--
-- 1 lone record...
--
(1,'2021-01-18 10:30:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'), -- XX
--
-- 4 records within 180 seconds of the first one
--
(2,'2021-01-18 11:02:24.0', 'https://zes.com/', 'znkjoc'),
(3,'2021-01-18 11:02:35.0', 'https://zes.com/', 'znkjoc'),
(4,'2021-01-18 11:03:03.0', 'https://zes.com/', 'znkjoc'), -- **
(5,'2021-01-18 11:04:15.0', 'https://zes.com/', 'znkjoc'), -- YY
--
-- 3 records within 180s of the first one
--
(6,'2021-01-18 11:49:28.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'),
(7,'2021-01-18 11:49:48.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'), -- **
(8,'2021-01-18 11:50:13.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'), -- **
-- 1 lone record
(9,'2021-01-18 12:57:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'), -- XX
(10,'2021-02-18 09:02:24.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(11,'2021-02-18 09:02:35.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(12,'2021-02-18 09:03:03.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(13,'2021-02-18 09:04:15.0', 'https://rar.com/', 'znkjoc3'); -- ZZ
-- XX - Added record > 1 minute from next or previous.
-- ** - Changed created_date to get groups within 180 seconds.
-- YY - Added record < 3 minutes from previous to give 4 records.
-- ZZ - Added group of 4 records at the end.
It's always worth checking for edge cases: single records at the beginning/end of your dataset, plus groups that you want to capture at the beginning and end too! I leave it to you to do more exhaustive testing!
I'll give the results first:
rn st sids URL: Session id: Session start time: Session end time:
1 2 2,3,4,5 https://zes.com/ znkjoc 2021-01-18 11:02:24.00 2021-01-18 11:04:15.00
2 5 10,11,12,13 https://rar.com/ znkjoc3 2021-02-18 09:02:24.00 2021-02-18 09:04:15.00
There's a bonus - you've got the starts and ends of the multiple close-together sessions thrown in for free!
I do have one word of advice - you really shouldn't be gathering or storing data as comma-separated lists. SQL wasn't designed for string manipulation, and extracting meaningful information and knowledge from such lists is painful - better to have a single atomic datum in a single field - see First Normal Form!
I've left in the various "sub-fiddles" that I used to arrive at the final result - hopefully they'll help you to learn about window functions &c... My own preference for the results would be in this format (see the fiddle - with one record/session - you can prune as you see fit):
sid created_date st min_cd max_cd f_ts l_ts c_ts_asc c_ts_desc o_url c_sess_id
1 2021-01-18 10:30:24.00 1 2021-01-18 10:30:24.00 2021-01-18 10:30:24.00 1 1 1 1 https://zendes.com/ znkjoc3gfth2c3m0t1klii
2 2021-01-18 11:02:24.00 2 2021-01-18 11:02:24.00 2021-01-18 11:04:15.00 1 4 1 4 https://zes.com/ znkjoc
3 2021-01-18 11:02:35.00 2 2021-01-18 11:02:24.00 2021-01-18 11:04:15.00 2 3 2 3 https://zes.com/ znkjoc
4 2021-01-18 11:03:03.00 2 2021-01-18 11:02:24.00 2021-01-18 11:04:15.00 3 2 3 2 https://zes.com/ znkjoc
5 2021-01-18 11:04:15.00 2 2021-01-18 11:02:24.00 2021-01-18 11:04:15.00 4 1 4 1 https://zes.com/ znkjoc
6 2021-01-18 11:49:28.00 3 2021-01-18 11:49:28.00 2021-01-18 11:50:13.00 1 3 1 3 https://rararar.com/ znkjoc3gfth2c3m0t1klii
7 2021-01-18 11:49:48.00 3 2021-01-18 11:49:28.00 2021-01-18 11:50:13.00 2 2 2 2 https://rararar.com/ znkjoc3gfth2c3m0t1klii
8 2021-01-18 11:50:13.00 3 2021-01-18 11:49:28.00 2021-01-18 11:50:13.00 3 1 3 1 https://rararar.com/ znkjoc3gfth2c3m0t1klii
9 2021-01-18 12:57:24.00 4 2021-01-18 12:57:24.00 2021-01-18 12:57:24.00 1 1 1 1 https://zendes.com/ znkjoc3gfth2c3m0t1klii
10 2021-02-18 09:02:24.00 5 2021-02-18 09:02:24.00 2021-02-18 09:04:15.00 1 4 1 4 https://rar.com/ znkjoc3
11 2021-02-18 09:02:35.00 5 2021-02-18 09:02:24.00 2021-02-18 09:04:15.00 2 3 2 3 https://rar.com/ znkjoc3
12 2021-02-18 09:03:03.00 5 2021-02-18 09:02:24.00 2021-02-18 09:04:15.00 3 2 3 2 https://rar.com/ znkjoc3
13 2021-02-18 09:04:15.00 5 2021-02-18 09:02:24.00 2021-02-18 09:04:15.00 4 1 4 1 https://rar.com/ znkjoc3
Or another way might be (again, see the fiddle):
sid Session no. Start of sessions End of sessions Session count o_url c_sess_id
2 2 2021-01-18 11:02:24.00 2021-01-18 11:04:15.00 4 https://zes.com/ znkjoc
10 5 2021-02-18 09:02:24.00 2021-02-18 09:04:15.00 4 https://rar.com/ znkjoc3
You have the starts and ends and the number of sessions... anyway, that's up to you. Take a look at this bit (easily missed):
MIN(created_date) OVER (PARTITION BY st ORDER BY created_date ASC) AS min_cd,
MAX(created_date) OVER (PARTITION BY st ORDER BY created_date DESC) AS max_cd,
This allows you to have an ASCending and a DESCending list.
This means that you can then do this:
WHERE (v.f_ts + v.l_ts) >= 5
Why this works: within a group of n rows, f_ts runs 1..n ascending while l_ts runs n..1 descending, so f_ts + l_ts = n + 1 on every row. The condition n + 1 >= 5 therefore keeps exactly the groups with 4 or more records - the bunched sessions of interest.
So now, the SQL, here it is (drum roll, trumpets sound in the distance...)! Go get a cup of coffee to read it - it's a monster (needs pruning - but that's an exercise for the OP and anybody else who's got this far):
SELECT
ROW_NUMBER() OVER () AS rn,
v.st,
GROUP_CONCAT(v.sid SEPARATOR ',') as sids,
o_url AS "URL:",
c_sess_id AS "Session id:",
min_cd AS "Session start time:",
max_cd AS "Session end time:"
FROM
(
SELECT
sid,
created_date, st,
MIN(created_date) OVER (PARTITION BY st ORDER BY created_date ASC) AS min_cd,
MAX(created_date) OVER (PARTITION BY st ORDER BY created_date DESC) AS max_cd,
ROW_NUMBER() OVER (PARTITION BY st ORDER BY created_date ASC) AS f_ts,
ROW_NUMBER() OVER (PARTITION BY st ORDER BY created_date DESC) AS l_ts,
COUNT(st) OVER (PARTITION BY st ORDER BY created_date ASC) AS c_ts_asc,
COUNT(st) OVER (PARTITION BY st ORDER BY created_date DESC) AS c_ts_desc,
o_url,
c_sess_id
FROM
(
SELECT
sid,
created_date, YYY,
SUM(testy) OVER (PARTITION BY testy ORDER BY created_date ASC) AS s,
ROW_NUMBER()
OVER (PARTITION BY testy ORDER BY created_date ASC) AS rn,
SUM(testy) OVER y AS st,
FIRST_VALUE(created_date) OVER (PARTITION BY testy ORDER BY created_date) AS fv,
LAST_VALUE(created_date) OVER y AS lv,
o_url,
c_sess_id
FROM
(
SELECT
id AS sid,
LAG(created_date, 1) OVER x AS lag_1,
created_date,
LEAD(created_date, 1) OVER x AS lead_1,
UNIX_TIMESTAMP(LEAD(created_date, 1) OVER x) - UNIX_TIMESTAMP(created_date) AS XXX,
UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) AS YYY,
--
-- The IMPORTANT one!
--
CASE
WHEN (UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) > 180
OR UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) IS NULL)
THEN 1
ELSE 0
END AS testy,
FIRST_VALUE(created_date) OVER x AS f_val,
UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(FIRST_VALUE(created_date) OVER x) AS c_f_diff,
ABS(TIMESTAMPDIFF(MINUTE, created_date, LEAD(created_date, 1) OVER x)) AS min_diff,
UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LEAD(created_date, 1) OVER x) AS ut_d,
origin_url AS o_url,
client_session_id AS c_sess_id
FROM test
WINDOW x AS (PARTITION BY origin_url, client_session_id
ORDER BY created_date ASC, origin_url, client_session_id)
ORDER BY created_date ASC
) AS t
WINDOW y AS (ORDER BY created_date ASC ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW)
ORDER BY created_date ASC
) AS u
ORDER BY created_date ASC
) AS v
WHERE (v.f_ts + v.l_ts) >= 5
GROUP BY v.st, o_url, c_sess_id, min_cd, max_cd;
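As a starting point for that pruning, a more compact query in the same spirit might look like the sketch below. It is not row-for-row identical to the fiddle (the group number grp is computed per URL/session partition rather than globally), but it applies the same 180-second gap rule and the same minimum group size of 4:

```sql
-- Sketch: sessionize with LAG + flag + running SUM, then aggregate.
-- HAVING COUNT(*) >= 4 is equivalent to (f_ts + l_ts) >= 5 above.
SELECT GROUP_CONCAT(sid ORDER BY created_date) AS sids,
       o_url,
       c_sess_id,
       MIN(created_date) AS session_start,
       MAX(created_date) AS session_end
FROM (
    SELECT id AS sid,
           created_date,
           origin_url AS o_url,
           client_session_id AS c_sess_id,
           SUM(new_grp) OVER (PARTITION BY origin_url, client_session_id
                              ORDER BY created_date) AS grp
    FROM (
        SELECT id,
               created_date,
               origin_url,
               client_session_id,
               CASE
                 WHEN LAG(created_date) OVER w IS NULL
                   OR UNIX_TIMESTAMP(created_date)
                    - UNIX_TIMESTAMP(LAG(created_date) OVER w) > 180
                 THEN 1 ELSE 0
               END AS new_grp
        FROM test
        WINDOW w AS (PARTITION BY origin_url, client_session_id
                     ORDER BY created_date)
    ) AS flagged
) AS grouped
GROUP BY o_url, c_sess_id, grp
HAVING COUNT(*) >= 4;
```

Against the test data above, this should return the same two bunched groups (ids 2,3,4,5 and 10,11,12,13) - but do verify it in the fiddle before relying on it.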