MySQL - Count only unique instances between specific dates

https://stackoverflow.com/questions/23609148

20-07-2023
Question

I've been looking at several other SO questions but I could not make out a solution from these. First, the description, then what I'm missing from the other threads. (Heads up: I'm very well aware of the non-normalised structure of our database, which is something I have addressed in meetings before but this is what we have and what I have to work with.)

Background description

We have a machine that manufactures products in 25 positions. Production data for these products is logged in a table that, among other things, records current and voltage for every position. Data is only logged while the machine is actually producing (i.e. has a product in it). When no product is present, nothing is logged.

This machine can run in two different production modes: full production and R&D production. Full production means that products are inserted continuously so that every position has a product at all times (i.e. 25 products are present in the machine at all times). The second mode, R&D production, only produces one product at a time (i.e. one product enters the machine, goes through the 25 positions one by one, and only when it is finished does the second product enter the machine).

To clarify: every position logs data once per second whenever a product is present, which means 25 rows per second when full production is running. In R&D mode, position 1 will have ~20 rows over 20 consecutive seconds, position 2 will have ~20 rows over the next 20 consecutive seconds, and so on.

Table structure

Productiondata:

id (autoincrement)
productID
position
time (timestamp for logged data)
current (amperes)
voltage (volts)

Question

We want to calculate the uptime of the machine, but we want to separate the uptime for production mode and R&D mode, and we want to separate this data on a weekly basis.

Guessed solution

Since a row is logged every second, I can count the number of DISTINCT time values in the table to find the total uptime for production and R&D mode combined. To find the R&D uptime, I can safely say that whenever a time value has only one entry, the machine is running in R&D mode (production mode would have 25 entries).

Progress so far

I have the following query which sums up all distinct instances to find both production and R&D mode:

SELECT YEARWEEK(time) AS YWeek, COUNT(DISTINCT time) AS Time_Seconds, ROUND(COUNT(DISTINCT time)/3600, 1) AS Time_Hours 
FROM Database.productiondata
WHERE YEARWEEK(time) >= YEARWEEK(curdate()) - 21
GROUP BY YWeek;

This query counts the DISTINCT time values in the table and groups the counts by week.
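One thing worth checking in the query above is the week filter. YEARWEEK returns a number such as 201401, so subtracting 21 yields 201380, which is not a real year-week, and the filter can cover fewer weeks than intended across a year boundary. Wrapping the column in a function also tends to prevent an index on time from being used. A minimal sketch of a date-range filter instead, assuming time is a DATETIME or TIMESTAMP column and that the schema containing productiondata is already selected:

```sql
-- Sketch only: assumes `time` is DATETIME/TIMESTAMP and the default schema is set.
SELECT YEARWEEK(time) AS YWeek,
       COUNT(DISTINCT time) AS Time_Seconds,
       ROUND(COUNT(DISTINCT time) / 3600, 1) AS Time_Hours
FROM productiondata
-- Compare the bare column with a computed date so an index on `time` can be used.
WHERE time >= CURDATE() - INTERVAL 21 WEEK
GROUP BY YWeek;
```

Note that this cutoff is 21 weeks before today rather than the start of a calendar week, so the oldest week in the result may be partial.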

Problem

The above query counts every distinct time value in the table, but I want to count ONLY the time values that occur exactly once. Basically, I'm trying to find something like: IF count(time) = 1, then count that instance; IF count(time) > 1, then don't count it at all (DISTINCT still counts this).

I looked at several other SO threads, but almost all explain how to find unique values with DISTINCT, which only accomplishes half of what I'm looking for. The closest I got was this which uses a HAVING clause. I'm currently stuck at the following:

SELECT YEARWEEK(time) as YWeek, COUNT(Distinct time) As Time_Seconds, ROUND(COUNT(Distinct time)/3600, 1) As Time_Hours
FROM 
(SELECT * FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY time
HAVING count(time) = 1) as temptime
GROUP BY YWeek
ORDER BY YWeek;

The problem here is the GROUP BY time inside the nested select, which takes forever (~5 million rows for this year alone, so I can understand that). Syntactically I think this is correct, but it takes forever to execute. Even EXPLAIN for this times out.

And that is where I am. Is this the correct approach or is there any other way that is smarter/requires less query time/avoids the group by time clause?

EDIT: As a sample, we have this table (apologies for formatting, don't know how to make a table format here on SO)

id    position    time
1     1           1
2     2           1
3     5           1
4     19          1
...   ...         ...
25    7           1
26    3           2
27    6           2
...   ...         ...

This table shows what it looks like when a production run is going on. As you can see, there is no general structure for which position gets the first entry when logging the data in the table; the 25 positions get logged every second, and the rows are added to the table in whatever order the PLC sends the data for each position. The following table shows what the data looks like when the machine runs in research mode.

id    position    time
245   1           1
246   1           2
247   1           3
...   ...         ...
269   1           25
270   2           26
271   2           27
...   ...         ...

Since all the data is consolidated into one single table, we want to find out how many instances there are when COUNT(time) is exactly equal to 1, or we could look for every instance when COUNT(time) is strictly larger than 1.
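Both counts described above can, in principle, come out of a single pass: group by second once, then classify each second by its row count. The following is a sketch under the question's own assumption that a second with exactly one row means R&D mode and a second with more than one row means full production. The column aliases (n, RnD_Seconds, Production_Seconds) are made up for illustration.

```sql
-- Sketch only: aliases are hypothetical; assumes the default schema is set.
SELECT YEARWEEK(t.time) AS YWeek,
       SUM(t.n = 1) AS RnD_Seconds,         -- seconds with exactly one row
       SUM(t.n > 1) AS Production_Seconds   -- seconds with several rows
FROM (
    -- One row per logged second, with the number of positions logged in it.
    SELECT time, COUNT(*) AS n
    FROM productiondata
    WHERE time > '2014-01-01 00:00:00'
    GROUP BY time
) AS t
GROUP BY YWeek
ORDER BY YWeek;
```

In MySQL a comparison evaluates to 1 or 0, so SUM over a comparison counts the rows for which it is true. The inner query selects only time and the count, not SELECT *, which keeps the grouped result small.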

EDIT2: As a reply to Alan, the suggestion gives me

YWeek    Time_Seconds    Time_Hours
201352   1               0.0
201352   1               0.0
201352   1               0.0
...      ...             ...
201352   1               0.0  (1000 row limit)

Whereas my desired output is

YWeek    Time_Seconds    Time_Hours
201352   2146            35.8
201401   5789            96.5
...      ...             ...
201419   8924            148.7

EDIT3: I have gathered the tries and the results so far here with a description in gray above the queries.

Solution

You might achieve better results by eliminating your sub select:

SELECT YEARWEEK(time) as YWeek, 
       COUNT(time) As Time_Seconds, 
       ROUND(COUNT(time)/3600, 1) As Time_Hours
FROM Database.productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY YWeek
HAVING count(time) = 1
ORDER BY YWeek;

I'm assuming time has an index on it, but if it does not you could expect a significant improvement in performance by adding one.
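As a rough sketch of the index suggestion, the statements below add an index on time and then ask MySQL how it would run the per-second grouping. The index name is hypothetical, and whether the optimizer actually picks the index depends on the server version and the data.

```sql
-- Hypothetical index name; creating it on a multi-million-row table may take a while.
CREATE INDEX idx_productiondata_time ON productiondata (time);

-- Check the plan for the per-second grouping before running it for real.
EXPLAIN
SELECT time
FROM productiondata
WHERE time > '2014-01-01 00:00:00'
GROUP BY time
HAVING COUNT(*) = 1;
```

If the key column of the EXPLAIN output names the new index, the grouping can probably be resolved from the index alone, since the query reads no column other than time.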

UPDATE:

Per the recently added sample data, I'm not sure your approach is correct. The time column appears to be an INT representing seconds, while you're treating it as a DATETIME with YEARWEEK. Below is a working example in T-SQL (SQL Server syntax, not MySQL) that does exactly what you asked IF time is actually a DATETIME column:

DECLARE @table TABLE
    (
      id INT ,
      [position] INT ,
      [time] DATETIME
    )


INSERT  INTO @table
VALUES  ( 1, 1, DATEADD(week, -1, GETDATE()) )
INSERT  INTO @table
VALUES  ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT  INTO @table
VALUES  ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT  INTO @table
VALUES  ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT  INTO @table
VALUES  ( 1, 1, DATEADD(week, -2, GETDATE()) )
INSERT  INTO @table
VALUES  ( 1, 1, DATEADD(week, -3, GETDATE()) )
INSERT  INTO @table
VALUES  ( 1, 1, DATEADD(week, -3, GETDATE()) )

SELECT  CAST(DATEPART(year, [time]) AS VARCHAR)
        + CAST(DATEPART(week, [time]) AS VARCHAR) AS YWeek ,
        COUNT([time]) AS Time_Seconds ,
        ROUND(COUNT([time]) / 3600.0, 1) AS Time_Hours
FROM    @table
WHERE [time] > '2014-01-01 00:00:00'
GROUP BY DATEPART(year, [time]) ,
        DATEPART(week, [time])
HAVING COUNT([time]) > 0
ORDER BY YWeek;
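Since the question is about MySQL, a rough MySQL equivalent of the example above may be easier to try out. This is a sketch only: the temporary table name sample_log is made up, and the rows mirror the seven inserts above.

```sql
-- Hypothetical scratch table; column names follow the question's table.
CREATE TEMPORARY TABLE sample_log (
    `id` INT,
    `position` INT,
    `time` DATETIME
);

INSERT INTO sample_log (`id`, `position`, `time`) VALUES
    (1, 1, NOW() - INTERVAL 1 WEEK),
    (1, 1, NOW() - INTERVAL 2 WEEK),
    (1, 1, NOW() - INTERVAL 2 WEEK),
    (1, 1, NOW() - INTERVAL 2 WEEK),
    (1, 1, NOW() - INTERVAL 2 WEEK),
    (1, 1, NOW() - INTERVAL 3 WEEK),
    (1, 1, NOW() - INTERVAL 3 WEEK);

-- Rows per week; YEARWEEK replaces the DATEPART concatenation.
SELECT YEARWEEK(`time`) AS YWeek,
       COUNT(`time`) AS Time_Seconds,
       ROUND(COUNT(`time`) / 3600, 1) AS Time_Hours
FROM sample_log
WHERE `time` > '2014-01-01 00:00:00'
GROUP BY YWeek
HAVING COUNT(`time`) > 0
ORDER BY YWeek;
```

Like the T-SQL version, this counts every row per week rather than only the seconds that occur once.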

OTHER TIPS

SELECT pd1.* 
FROM Database.productiondata pd1
LEFT JOIN Database.productiondata pd2 ON pd1.time=pd2.time AND pd1.id<pd2.id
WHERE pd1.time > '2014-01-01 00:00:00'
  AND pd2.id IS NULL

You can LEFT JOIN the table to itself and keep only the rows that have no matching row.

UPDATE: The following query works in SQL Fiddle:

SELECT pd1.* From productiondata pd1
left Join productiondata pd2
ON pd1.time = pd2.time and pd1.id < pd2.id
Where pd1.time > '2014-01-01 00:00:00' and pd2.id IS NULL;
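One caveat with the self-join as written: because the join condition is pd1.id < pd2.id, it keeps the highest-id row for every second, including seconds that have 25 rows, so the result resembles DISTINCT time rather than "only seconds that occur once". A possible variation, offered as a sketch rather than a tested solution, joins on pd1.id <> pd2.id and rolls the result up by week:

```sql
-- Sketch only: assumes `id` is unique and the default schema is set.
SELECT YEARWEEK(pd1.time) AS YWeek,
       COUNT(*) AS Time_Seconds,
       ROUND(COUNT(*) / 3600, 1) AS Time_Hours
FROM productiondata pd1
LEFT JOIN productiondata pd2
       ON pd1.time = pd2.time
      AND pd1.id <> pd2.id          -- any other row logged in the same second
WHERE pd1.time > '2014-01-01 00:00:00'
  AND pd2.id IS NULL                -- keep rows that have no such sibling
GROUP BY YWeek
ORDER BY YWeek;
```

Each surviving row is the only row for its second, so COUNT(*) per week is the number of single-entry seconds. Without an index on time, the self-join is likely to be at least as slow as the GROUP BY time subquery.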

Licensed under: CC-BY-SA with attribution