Question

Percona MySQL 5.6, Linux x64.

We have a customers_history table, which tracks changes to our customers over time. What we would like to do is count the changes by vendor (lead_source_id) over the course of a particular month.

+--------+-------------+----------------+---------------------+--------+
| id     | customer_id | lead_source_id |   repurchased_date  | Rating |
+--------+-------------+----------------+---------------------+--------+
| 422923 |      420450 |              4 | 2014-04-14 09:16:48 |   Warm |
| 422924 |      420450 |              4 | 2014-04-14 09:16:48 |   Cold |
| 422956 |      420450 |              4 | 2014-04-14 09:16:49 |    Hot |
| 422933 |      420451 |             37 | 2014-04-14 09:18:41 |    Hot |
| 422938 |      420452 |              1 | 2014-04-10 20:50:30 |    Hot |
| 422984 |      420452 |              1 | 2014-04-12 20:50:30 |    Hot |
| 422940 |      420453 |             47 | 2014-04-14 09:20:27 |    Hot |
+--------+-------------+----------------+---------------------+--------+

Given the above sample, what we would like is this report, which counts repurchases by vendor (lead_source_id). A change qualifies as a repurchase when repurchased_date is updated. Only changing the rating does not count.

+----------------+-------+
| lead_source_id | count |
+----------------+-------+
|              4 |     2 |
|             37 |     1 |
|              1 |     2 |
|             47 |     1 |
+----------------+-------+

We tried this initially:

SELECT count(DISTINCT(ch.repurchased_date)) FROM customers_history ch WHERE Year(ch.repurchased_date) = 2014 AND Month(ch.repurchased_date) = 4 AND ch.lead_source_id IS NOT NULL;

But the count differs from the number of rows returned when you change the select list to SELECT DISTINCT(ch.created_at), lead_source_id.

Anyway, we're in a pickle jar trying to figure this out. Thanks a ton for any help or pointers.

EDIT

CRAP. I'm sorry guys, thank you for the answers so far, but I totally left off why this problem is so danged hard. This is actually a history table, it records changes from multiple columns. I edited the original question.

Notice how the repurchased_date doesn't change when the rating changes. We would like to exclude row 422923 from the count, but count rows 422924 and 422956.


Solution

Your query looks real close. I'm thinking that all that's needed is to add a GROUP BY clause.

The COUNT(DISTINCT foo) will effectively "collapse" identical values, so that the count only gets incremented by 1 for each group of identical date values.
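As a small illustration of that collapsing behavior, here is a sketch that feeds the three repurchased_date values from the question's lead_source_id 4 rows through both COUNT(*) and COUNT(DISTINCT ...). The inline derived table d and the column aliases are made up for the demonstration; no real table is touched.

```sql
-- Three history rows for one vendor; two of them share a repurchased_date.
SELECT COUNT(*)                           AS all_rows
     , COUNT(DISTINCT d.repurchased_date) AS distinct_dates
  FROM ( SELECT '2014-04-14 09:16:48' AS repurchased_date
          UNION ALL SELECT '2014-04-14 09:16:48'
          UNION ALL SELECT '2014-04-14 09:16:49'
       ) d;
-- all_rows = 3, distinct_dates = 2
```

The rating-only change (same timestamp) is counted by COUNT(*) but not by COUNT(DISTINCT ...), which is the behavior the question asks for.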

Based on the sample data, and the desired resultset, this should work:

 SELECT ch.lead_source_id
      , COUNT(DISTINCT ch.repurchased_date)
   FROM customers_history ch
  WHERE ch.repurchased_date >= '2014-04-01'
    AND ch.repurchased_date  < '2014-04-01' + INTERVAL 1 MONTH
    AND ch.lead_source_id IS NOT NULL
  GROUP
     BY ch.lead_source_id

In the example data, the customer_id and the lead_source_id correlate with each other. (Could be due to a small sample size...)

(See NOTES below for additional comments regarding indexes, index range scans, and GROUP BY optimization using a covering index.)
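To check the GROUP BY query above against the sample rows from the question, something like the following scratch script could be used. The temporary table name sample_history and the column types are assumptions (the question doesn't show the real DDL), so treat it as a sketch rather than the actual schema.

```sql
-- Scratch copy of the question's sample rows.
-- Table name and column types are hypothetical.
CREATE TEMPORARY TABLE sample_history
( id               INT         NOT NULL PRIMARY KEY
, customer_id      INT         NOT NULL
, lead_source_id   INT         NULL
, repurchased_date DATETIME    NOT NULL
, rating           VARCHAR(10) NOT NULL
);

INSERT INTO sample_history VALUES
  (422923, 420450,  4, '2014-04-14 09:16:48', 'Warm')
, (422924, 420450,  4, '2014-04-14 09:16:48', 'Cold')
, (422956, 420450,  4, '2014-04-14 09:16:49', 'Hot')
, (422933, 420451, 37, '2014-04-14 09:18:41', 'Hot')
, (422938, 420452,  1, '2014-04-10 20:50:30', 'Hot')
, (422984, 420452,  1, '2014-04-12 20:50:30', 'Hot')
, (422940, 420453, 47, '2014-04-14 09:20:27', 'Hot');

-- Same shape as the query above, pointed at the scratch table.
SELECT sh.lead_source_id
     , COUNT(DISTINCT sh.repurchased_date) AS count_
  FROM sample_history sh
 WHERE sh.repurchased_date >= '2014-04-01'
   AND sh.repurchased_date <  '2014-04-01' + INTERVAL 1 MONTH
   AND sh.lead_source_id IS NOT NULL
 GROUP
    BY sh.lead_source_id;
-- lead_source_id 1 -> 2, 4 -> 2, 37 -> 1, 47 -> 1
```

Those counts match the desired report in the question, apart from the row order.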


ANSWER BELOW PRIOR TO QUESTION UPDATE

This is one way to return the specified result, except for the ordering; I wasn't able to discern a pattern there...

SELECT ch.lead_source_id
     , COUNT(1) AS count_
  FROM customers_history ch
 WHERE ch.cust_updated_at >= '2014-04-01' 
   AND ch.cust_updated_at <  '2014-04-01' + INTERVAL 1 MONTH
   AND ch.lead_source_id IS NOT NULL
 GROUP BY ch.lead_source_id
 ORDER BY ?

UPDATE

If you want the "count" to also be by cust_updated_at, include that column in the GROUP BY. For example, suppose that for this sample data:

+--------+-------------+----------------+---------------------+
| id     | customer_id | lead_source_id |   cust_updated_at   |
+--------+-------------+----------------+---------------------+
| 422924 |      420450 |              4 | 2014-04-14 09:16:48 |
| 422956 |      420450 |              4 | 2014-04-14 09:16:48 |
| ?????? |      420450 |              4 | 2014-04-15 22:22:22 |
+--------+-------------+----------------+---------------------+

You want to return:

+----------------+-------+
| lead_source_id | count |
+----------------+-------+
|              4 |     2 |
|              4 |     1 |
+----------------+-------+

Then, add the cust_updated_at column to the GROUP BY clause, e.g.

SELECT ch.lead_source_id
     , COUNT(1) AS count_
  FROM customers_history ch
 WHERE ch.cust_updated_at >= '2014-04-01' 
   AND ch.cust_updated_at <  '2014-04-01' + INTERVAL 1 MONTH
   AND ch.lead_source_id IS NOT NULL
 GROUP
    BY ch.lead_source_id
     , ch.cust_updated_at

NOTES:

(If we leave off the ORDER BY clause, the GROUP BY clause implicitly applies an ORDER BY on the same set of expressions. We only need to specify an ORDER BY clause to get a different ordering.)
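As a sketch of asking for a different ordering, the query can carry an explicit ORDER BY. Sorting by the count, largest first, is only an example choice here; the question doesn't say which order is wanted.

```sql
-- Explicit ordering: busiest vendors first, ties broken by lead_source_id.
-- The choice of sort keys is illustrative, not taken from the question.
SELECT ch.lead_source_id
     , COUNT(DISTINCT ch.repurchased_date) AS count_
  FROM customers_history ch
 WHERE ch.repurchased_date >= '2014-04-01'
   AND ch.repurchased_date <  '2014-04-01' + INTERVAL 1 MONTH
   AND ch.lead_source_id IS NOT NULL
 GROUP
    BY ch.lead_source_id
 ORDER
    BY count_ DESC
     , ch.lead_source_id;
```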

Also, wrapping date columns in functions in a predicate prevents MySQL from satisfying the predicate by using an index range scan. We normally like to have "bare date columns" in the predicates, and do whatever manipulation is required on the constant side. (Wrapping the date column in a function, like YEAR(), forces MySQL to evaluate that function for EVERY row in the table, or every row that isn't filtered out by another predicate.)
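One way to see the difference is to compare the EXPLAIN output for the two predicate styles. This sketch assumes customers_history has some index with repurchased_date as its leading column; that index is an assumption, the question doesn't mention one.

```sql
-- Column wrapped in functions: the optimizer cannot turn this into a range
-- on repurchased_date, so expect a full table or full index scan.
EXPLAIN
SELECT COUNT(*)
  FROM customers_history ch
 WHERE YEAR(ch.repurchased_date)  = 2014
   AND MONTH(ch.repurchased_date) = 4;

-- Bare column compared to constants: eligible for an index range scan
-- (look for type "range" in the EXPLAIN output, if a suitable index exists).
EXPLAIN
SELECT COUNT(*)
  FROM customers_history ch
 WHERE ch.repurchased_date >= '2014-04-01'
   AND ch.repurchased_date <  '2014-04-01' + INTERVAL 1 MONTH;
```

Both queries select the same rows; only the second form gives the optimizer a range it can use.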

For optimum performance, a suitable covering index for this query would be:

... ON customers_history (lead_source_id, created_at)

MySQL can satisfy the query entirely from the index; the EXPLAIN output will show "Using index". If we leave off the ORDER BY clause, MySQL will avoid a "Using filesort" operation as well.
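For the repurchased_date version of the query, a covering index along the same lines might look like the sketch below. The index name customers_history_ix1 is made up, and the second column is swapped to repurchased_date on the assumption that this is the date column the report filters on.

```sql
-- Hypothetical index name; columns chosen to cover the GROUP BY query:
-- lead_source_id for the grouping, repurchased_date for the filter and the count.
CREATE INDEX customers_history_ix1
    ON customers_history (lead_source_id, repurchased_date);

-- Check the plan; with the index in place, the Extra column
-- would be expected to include "Using index".
EXPLAIN
SELECT ch.lead_source_id
     , COUNT(DISTINCT ch.repurchased_date) AS count_
  FROM customers_history ch
 WHERE ch.repurchased_date >= '2014-04-01'
   AND ch.repurchased_date <  '2014-04-01' + INTERVAL 1 MONTH
   AND ch.lead_source_id IS NOT NULL
 GROUP
    BY ch.lead_source_id;
```

Because the query references only the two indexed columns, MySQL has no need to visit the underlying table rows.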


Other tips

I'm not sure I got what you're asking. However, do you mean this?

SELECT ch.lead_source_id, count(*)
FROM customers_history ch
WHERE
     Year(ch.created_at) = 2014 AND
     Month(ch.created_at) = 4 AND ch.lead_source_id IS NOT NULL
GROUP BY ch.lead_source_id;
License: CC-BY-SA with attribution
Not affiliated with StackOverflow