Currently I have this dataset; I need to return grouped ids that fall within a range of 60 seconds and contain more than 3 records.

CREATE TABLE test 
(
  `id` bigint NOT NULL AUTO_INCREMENT,
  created_date TIMESTAMP(1) NOT NULL,
  origin_url   VARCHAR (200) NOT NULL,
  client_session_id VARCHAR (50) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `UC_PRE_CHAT_TALKID_COL` (`id`)
);
INSERT INTO test VALUES
(1,'2021-01-18 11:02:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(2,'2021-01-18 11:02:35.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(3,'2021-01-18 11:02:03.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),
(4,'2021-01-18 11:11:28.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii'),
(5,'2021-01-18 11:11:36.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii'),
(6,'2021-01-18 11:11:05.0', 'https://rarara.com/', 'znkjoc3gfth2c3m0t1klii');

db<>fiddle here

something like this:

ids     origin_url              client_session_id
1,2,3   https://zendes.com/     znkjoc3gfth2c3m0t1klii
4,5,6   https://rarara.com/     znkjoc3gfth2c3m0t1klii

Edit - some context:

I developed a cron job that runs every minute to analyze the last 60 seconds of bot records in the database. I need to group the conversation IDs that have more than 3 records within 60 seconds for the same URL and client_session_id.

Here is the SQL I'm running:

select
    count(session_id),
    client_session_id,
    GROUP_CONCAT(id) as talkIds,
    origin_url 
from
    bot_talk
where
    created_date > now() - interval 60 second
group by
    client_session_id, origin_url 
having
    count(session_id) >= 3

This query works as I expect, but my cron service is sometimes down, and I lose those repeated records.

I thought about running a SQL cron job at the end of the day to analyze the last 24 hours and look for the records that repeat according to the rule I mentioned above.


Solution

Here is the answer - see the fiddle. Another answer I wrote to a similar question may provide some clearer background and is a bit simpler - see here.

All I will say is that it gives some idea of the power of window functions.

I noticed in the comments that there was some debate about what constituted a group - in this example, I have constructed the SQL so that it starts a new group whenever the gap from the previous record exceeds 180 seconds (i.e. 3 mins), so every record in a group is within 180 seconds of its predecessor - you can change the 180 to 60 (or whatever) yourself.

I've added in some records for the purpose of testing and also added CONSTRAINTs to the table definition. It's always best to put as much as possible into the DDL - your database is your final bastion of defence for your data!

CREATE TABLE test 
(
  id bigint NOT NULL,
  created_date TIMESTAMP(2) NOT NULL,
  origin_url   VARCHAR (200) NOT NULL,
  client_session_id VARCHAR (50) NOT NULL,
  CONSTRAINT test_id_pk PRIMARY KEY (id),
  CONSTRAINT test_cd_url_sess_id_uq UNIQUE (created_date, origin_url, client_session_id)
);

Always use named constraints - these provide more meaningful messages than: ... failed... CONSTRAINT xyz_000_43abc has been violated....

I populated it as follows:

INSERT INTO test VALUES

--
-- 1 lone record...
--

(1,'2021-01-18 10:30:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'), -- XX

--
-- 4 records within 180 seconds of the first one
--

(2,'2021-01-18 11:02:24.0', 'https://zes.com/', 'znkjoc'),
(3,'2021-01-18 11:02:35.0', 'https://zes.com/', 'znkjoc'),
(4,'2021-01-18 11:03:03.0', 'https://zes.com/', 'znkjoc'),  -- **
(5,'2021-01-18 11:04:15.0', 'https://zes.com/', 'znkjoc'),  -- YY

-- 
-- 3 records within 180s of the first one
--

(6,'2021-01-18 11:49:28.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'),
(7,'2021-01-18 11:49:48.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'), -- **
(8,'2021-01-18 11:50:13.0', 'https://rararar.com/', 'znkjoc3gfth2c3m0t1klii'), -- **

-- 1 lone record

(9,'2021-01-18 12:57:24.0', 'https://zendes.com/', 'znkjoc3gfth2c3m0t1klii'),  -- XX


(10,'2021-02-18 09:02:24.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(11,'2021-02-18 09:02:35.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(12,'2021-02-18 09:03:03.0', 'https://rar.com/', 'znkjoc3'), -- ZZ
(13,'2021-02-18 09:04:15.0', 'https://rar.com/', 'znkjoc3'); -- ZZ


-- -- XX - Added record > 1 minute from next or previous.
-- -- ** - Changed created_date to get groups within 180 seconds.
-- -- YY - Added record < 3 minutes from previous to give 4 records
-- -- ZZ - Added group of 4 records at the end.

It's always worth checking for edge cases - single records at the beginning/end of your dataset, plus groups that you want to capture at the beginning and end too! I leave it to you to do more exhaustive testing!

I'll give the results first:

rn  st  sids                URL:                Session id: Session start time:            Session end time:
1   2   2,3,4,5             https://zes.com/    znkjoc      2021-01-18 11:02:24.00  2021-01-18 11:04:15.00
2   5   10,11,12,13         https://rar.com/    znkjoc3     2021-02-18 09:02:24.00  2021-02-18 09:04:15.00

There's a bonus - you've got the starts and ends of the multiple close-together sessions thrown in for free!

I do have one word of advice - you really shouldn't be storing or returning data as comma-separated lists. SQL wasn't designed for string manipulation, and extracting meaningful information from such lists is painful - better to have a single atomic datum in each field - see 1st Normal Form!

I've left in the various "sub-fiddles" that I used to arrive at the final result - hopefully they'll help you to learn about window functions &c... My own preference for the results would be in this format (see the fiddle - with one record/session - you can prune as you see fit):

sid created_date    st  min_cd  max_cd  f_ts    l_ts    c_ts_asc    c_ts_desc   o_url   c_sess_id
1   2021-01-18 10:30:24.00  1   2021-01-18 10:30:24.00  2021-01-18 10:30:24.00  1   1   1   1   https://zendes.com/ znkjoc3gfth2c3m0t1klii
2   2021-01-18 11:02:24.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  1   4   1   4   https://zes.com/    znkjoc
3   2021-01-18 11:02:35.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  2   3   2   3   https://zes.com/    znkjoc
4   2021-01-18 11:03:03.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  3   2   3   2   https://zes.com/    znkjoc
5   2021-01-18 11:04:15.00  2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  4   1   4   1   https://zes.com/    znkjoc
6   2021-01-18 11:49:28.00  3   2021-01-18 11:49:28.00  2021-01-18 11:50:13.00  1   3   1   3   https://rararar.com/    znkjoc3gfth2c3m0t1klii
7   2021-01-18 11:49:48.00  3   2021-01-18 11:49:28.00  2021-01-18 11:50:13.00  2   2   2   2   https://rararar.com/    znkjoc3gfth2c3m0t1klii
8   2021-01-18 11:50:13.00  3   2021-01-18 11:49:28.00  2021-01-18 11:50:13.00  3   1   3   1   https://rararar.com/    znkjoc3gfth2c3m0t1klii
9   2021-01-18 12:57:24.00  4   2021-01-18 12:57:24.00  2021-01-18 12:57:24.00  1   1   1   1   https://zendes.com/ znkjoc3gfth2c3m0t1klii
10  2021-02-18 09:02:24.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  1   4   1   4   https://rar.com/    znkjoc3
11  2021-02-18 09:02:35.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  2   3   2   3   https://rar.com/    znkjoc3
12  2021-02-18 09:03:03.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  3   2   3   2   https://rar.com/    znkjoc3
13  2021-02-18 09:04:15.00  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  4   1   4   1   https://rar.com/    znkjoc3

Or another way might be (again, see the fiddle):

sid Session no. Start of sessions   End of sessions Session count   o_url   c_sess_id
2   2   2021-01-18 11:02:24.00  2021-01-18 11:04:15.00  4   https://zes.com/    znkjoc
10  5   2021-02-18 09:02:24.00  2021-02-18 09:04:15.00  4   https://rar.com/    znkjoc3

You have the starts and ends and the number of sessions... anyway, that's up to you. Take a look at this bit (easily missed):

MIN(created_date) OVER (PARTITION BY st ORDER BY created_date ASC) AS min_cd,
MAX(created_date) OVER (PARTITION BY st ORDER BY created_date DESC) AS max_cd,

This allows you to have an ASCending and a DESCending list.

This means that you can then do this:

WHERE (v.f_ts + v.l_ts) >= 5

The rows for which this condition holds are exactly the ones belonging to the bunched sessions of interest.
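Why 5? For any row in a partition of n rows, the ascending row number f_ts and the descending row number l_ts always satisfy f_ts + l_ts = n + 1, so the filter keeps exactly the partitions with n >= 4 rows. A quick sketch of that identity (hypothetical helper, just to show the arithmetic):

```python
# Each row of an n-row partition gets ascending row number f (1..n)
# and descending row number l (n..1), so l = n + 1 - f.
def row_number_pairs(n):
    return [(f, n + 1 - f) for f in range(1, n + 1)]

for n in (1, 3, 4, 7):
    pairs = row_number_pairs(n)
    # every row in the partition gives the same sum: n + 1
    assert all(f + l == n + 1 for f, l in pairs)
    # so (f + l) >= 5 keeps exactly the partitions with n >= 4
    assert all((f + l >= 5) == (n >= 4) for f, l in pairs)
```

Set the threshold to 4 instead of 5 if you want groups of 3 or more (which would also capture ids 6-8 above).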

So now, the SQL, here it is (drum roll, trumpets sound in the distance...)! Go get a cup of coffee to read it - it's a monster (needs pruning - but that's an exercise for the OP and anybody else who's got this far):

SELECT 
  ROW_NUMBER() OVER () AS rn,
  v.st,
  GROUP_CONCAT(v.sid SEPARATOR ',') as sids,
  o_url AS "URL:",
  c_sess_id AS "Session id:",
  min_cd AS "Session start time:",
  max_cd AS "Session end time:"
FROM
(
  SELECT 
    sid,
    created_date, st, 
    MIN(created_date) OVER (PARTITION BY st ORDER BY created_date ASC) AS min_cd,
    MAX(created_date) OVER (PARTITION BY st ORDER BY created_date DESC) AS max_cd,
    ROW_NUMBER() OVER (PARTITION BY st ORDER BY created_date ASC)  AS f_ts,
    ROW_NUMBER() OVER (PARTITION BY st ORDER BY created_date DESC) AS l_ts,
    COUNT(st) OVER (PARTITION BY st ORDER BY created_date ASC)  AS c_ts_asc,
    COUNT(st) OVER (PARTITION BY st ORDER BY created_date DESC) AS c_ts_desc,
    o_url, 
    c_sess_id
  FROM
  (
    SELECT 
      sid,
      created_date, YYY, 
      SUM(testy) OVER (PARTITION BY testy ORDER BY created_date ASC) AS s, 
      ROW_NUMBER() 
        OVER (PARTITION BY testy ORDER BY created_date ASC) AS rn,
      SUM(testy) OVER y AS st,
      FIRST_VALUE(created_date) OVER (PARTITION BY testy ORDER BY created_date) AS fv,
      LAST_VALUE(created_date) OVER y AS lv,
      o_url,
      c_sess_id
    FROM
    (
      SELECT
        id AS sid,
        LAG(created_date, 1) OVER x AS lag_1,
        created_date,
        LEAD(created_date, 1) OVER x AS lead_1,  

        UNIX_TIMESTAMP(LEAD(created_date, 1) OVER x) - UNIX_TIMESTAMP(created_date) AS XXX,
 
        UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) AS YYY,
 
        --
        -- The IMPORTANT one!
        --
 
        CASE 
          WHEN (UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) > 180
          OR   UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LAG(created_date, 1) OVER x) IS NULL)
          THEN 1
          ELSE 0
        END AS testy,
      
        FIRST_VALUE(created_date) OVER x AS f_val,
  
        UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(FIRST_VALUE(created_date) OVER x) AS c_f_diff,
  
        ABS(TIMESTAMPDIFF(MINUTE, created_date, LEAD(created_date, 1) OVER x)) AS min_diff,  

        UNIX_TIMESTAMP(created_date) - UNIX_TIMESTAMP(LEAD(created_date, 1) OVER x) AS ut_d, 

        origin_url AS o_url,
        client_session_id AS c_sess_id

      FROM test
      WINDOW x AS (PARTITION BY origin_url, client_session_id 
                     ORDER BY created_date ASC, origin_url, client_session_id)
      ORDER BY created_date ASC
    ) AS t 
    WINDOW y AS (ORDER BY created_date ASC ROWS BETWEEN UNBOUNDED PRECEDING
                                                    AND CURRENT ROW)
    ORDER BY created_date ASC
  ) AS u
  ORDER BY created_date ASC
) AS v
WHERE (v.f_ts + v.l_ts) >= 5
GROUP BY v.st, o_url, c_sess_id, min_cd, max_cd;
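If you ever want to cross-check the window-function SQL, the whole pipeline can be mirrored in plain Python - partition by (origin_url, client_session_id), split on >180s gaps, keep groups of 4 or more. The name `bunched_sessions` and the row layout are mine; this is a sketch of the logic, not a replacement for the query:

```python
from collections import defaultdict
from datetime import datetime

# (id, created_date, origin_url, client_session_id) - mirrors the INSERT above
ROWS = [
    (1, "2021-01-18 10:30:24", "https://zendes.com/", "znkjoc3gfth2c3m0t1klii"),
    (2, "2021-01-18 11:02:24", "https://zes.com/", "znkjoc"),
    (3, "2021-01-18 11:02:35", "https://zes.com/", "znkjoc"),
    (4, "2021-01-18 11:03:03", "https://zes.com/", "znkjoc"),
    (5, "2021-01-18 11:04:15", "https://zes.com/", "znkjoc"),
    (6, "2021-01-18 11:49:28", "https://rararar.com/", "znkjoc3gfth2c3m0t1klii"),
    (7, "2021-01-18 11:49:48", "https://rararar.com/", "znkjoc3gfth2c3m0t1klii"),
    (8, "2021-01-18 11:50:13", "https://rararar.com/", "znkjoc3gfth2c3m0t1klii"),
    (9, "2021-01-18 12:57:24", "https://zendes.com/", "znkjoc3gfth2c3m0t1klii"),
    (10, "2021-02-18 09:02:24", "https://rar.com/", "znkjoc3"),
    (11, "2021-02-18 09:02:35", "https://rar.com/", "znkjoc3"),
    (12, "2021-02-18 09:03:03", "https://rar.com/", "znkjoc3"),
    (13, "2021-02-18 09:04:15", "https://rar.com/", "znkjoc3"),
]

def bunched_sessions(rows, gap_seconds=180, min_count=4):
    """Returns (sids, url, session_id, start, end) per qualifying group."""
    parts = defaultdict(list)
    for rid, ts, url, sess in rows:
        parts[(url, sess)].append((rid, datetime.fromisoformat(ts)))
    out = []
    for (url, sess), recs in parts.items():
        recs.sort(key=lambda r: r[1])
        groups, current = [], [recs[0]]
        for rec in recs[1:]:
            # >180s gap from the previous record starts a new group
            if (rec[1] - current[-1][1]).total_seconds() > gap_seconds:
                groups.append(current)
                current = []
            current.append(rec)
        groups.append(current)
        for g in groups:
            if len(g) >= min_count:
                out.append((",".join(str(r[0]) for r in g), url, sess,
                            g[0][1], g[-1][1]))
    return sorted(out, key=lambda r: r[3])
```

Running it on the test data reproduces the two result rows shown above - groups 2,3,4,5 and 10,11,12,13, with their start and end times - while the lone records and the 3-record rararar group are filtered out.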
Licensed under: CC-BY-SA with attribution