Question

I'm trying to get the records that have the highest value in one particular column ("version"). I select rows by base_id, and there may be more than one row with the same base_id, each with a different version number. The point of the statement is to get only the row with the highest version. The statement below works, but only if there is actually more than one version. If there is only one, I get no records back at all (instead of the expected one row). So how can I get only the row with the highest version number, even when only one version exists for a record?

SELECT r.id
     , r.title
     , u.name created_by
     , m.name modified_by
     , r.version
     , r.version_displayname
     , r.informationtype
     , r.filetype
     , r.base_id
     , r.resource_id
     , r.created
     , r.modified
     , GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';') categories 
  FROM resource r 
  JOIN category_resource cr 
    ON r.id = cr.resource_id 
  JOIN category c 
    ON cr.category_id = c.id 
  JOIN user u 
    ON r.created_by = u.id 
  JOIN user m 
    ON r.modified_by = m.id 
 WHERE r.base_id = 'uuid_033a7198-a213-11e3-93de-2b47e5a489c2' 
   AND r.version = (SELECT MAX(r.version) FROM resource r) 
 GROUP 
    BY r.id;

EDIT:

I realize the other parts of the query itself may complicate things, so I'll try to create a cleaner example, which should show what I'm after, I hope.

If I do this:

SELECT id, title, MAX(version) AS 'version' FROM resource GROUP BY title

on a table that looks like this:

[image: contents of the resource table]

Then I get the following results:

[image: results of the query above]

which is not correct, as you can see from the table. That is, it has fetched the highest version for each resource, but if you look at Introduction, for example, the row with version 2 has the id 6, whereas the row fetched has the id 1. So the query seems to somehow combine values from different rows...?

I should note that I'm very much a novice at SQL, and the original query that I exemplified the problem with was something I got help with here, so please do explain as clearly as possible, thanks.

Another note: I found some suggestions involving a subquery, but apart from not returning the correct results either, they were really slow. I'm testing on 5000 rows and need the query to take only a fraction of a second in order to meet performance requirements.

EDIT 2:

Found a way to incorporate a statement, sort of like one of the suggested ones, as well as the various solutions here: Retrieving the last record in each group

However, I tried them all, and even though most seem to work, they are incredibly slow…

Take this one:

SELECT r.id
     , r.title
     , u.name AS 'created_by'
     , m.name AS 'modified_by'
     , r.version
     , r.version_displayname
     , r.informationtype
     , r.filetype
     , r.base_id
     , r.resource_id
     , r.created
     , r.modified
     , GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';') AS 'Categories'
  FROM resource r
 INNER JOIN
       (SELECT DISTINCT r.id AS id
          FROM resource r
         INNER JOIN category_resource cr1
            ON r.id = cr1.resource_id
         WHERE cr1.category_id IN (9)
       ) mr
    ON r.id = mr.id
 INNER JOIN category_resource cr
    ON r.id = cr.resource_id
 INNER JOIN category c
    ON cr.category_id = c.id
 INNER JOIN user u
    ON r.created_by = u.id
 INNER JOIN user m
    ON r.modified_by = m.id
 INNER JOIN
       (SELECT MAX(version) MyVersion, base_id
          FROM resource
         GROUP BY base_id
       ) r2
    ON r.base_id = r2.base_id
   AND r.version = r2.MyVersion
 GROUP BY r.base_id
 ORDER BY r.version DESC;

The addition at the end (starting with the INNER JOIN) to get only the rows with the highest version value for each base_id slows the query down from 20 ms to around 6-8 seconds. That is a no go… But this surprises me. Although I’m obviously no database expert, it seems to me that database queries should be optimized for getting data like this. But if I do the only alternative I can think of, which is to get all the records regardless of version number, and then filter them in PHP, guess what? That is much faster than this…

I initially thought the performance hit caused by filtering in PHP was too much, but that is about a second’s delay, so still much better than this.

But I feel like I’m missing something. Shouldn’t it be possible to do this much more efficiently?

Solution 2

Well, I think I found the answer myself. As far as I can understand, a query like this will take a lot of time, and instead the database needs to be modified. I found this:

How to version control a record in a database

The suggestion to use startdate and enddate columns, and to set enddate to null for the latest version, made it very easy to query for the latest version. And it is again very, very fast. So this is what I needed. All put together, it gives me something like this:

SELECT r.id
     , r.title
     , u.name AS 'created_by'
     , m.name AS 'modified_by'
     , r.version
     , r.version_displayname
     , r.informationtype
     , r.filetype
     , r.base_id
     , r.resource_id
     , r.created
     , r.modified
     , GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';') AS 'categories'
     , startdate
     , enddate
  FROM resource r
 INNER JOIN
       (SELECT DISTINCT r.id AS id
          FROM resource r
         INNER JOIN category_resource cr1
            ON r.id = cr1.resource_id
         WHERE cr1.category_id IN (9)
       ) mr
    ON r.id = mr.id
 INNER JOIN category_resource cr
    ON r.id = cr.resource_id
 INNER JOIN category c
    ON cr.category_id = c.id
 INNER JOIN user u
    ON r.created_by = u.id
 INNER JOIN user m
    ON r.modified_by = m.id
 WHERE r.enddate IS NULL
 GROUP BY r.id;

And this query once again is back to the 20 ms execution time.
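For anyone wanting to try the startdate/enddate idea, here is a minimal sketch of how the "current version" marker could be maintained. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, a cut-down resource table, made-up dates, and a hypothetical helper called add_version; none of that comes from my actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE resource (
           id        INTEGER PRIMARY KEY,
           base_id   TEXT,
           version   INTEGER,
           startdate TEXT,
           enddate   TEXT      -- NULL marks the current version
       )"""
)

def add_version(conn, base_id, when):
    """Hypothetical helper: close the current version (if any), insert the next."""
    with conn:  # the UPDATE and INSERT are committed together, or rolled back
        cur = conn.execute(
            "SELECT MAX(version) FROM resource WHERE base_id = ?", (base_id,)
        )
        current = cur.fetchone()[0] or 0  # MAX() is NULL when no rows exist
        conn.execute(
            "UPDATE resource SET enddate = ? WHERE base_id = ? AND enddate IS NULL",
            (when, base_id),
        )
        conn.execute(
            "INSERT INTO resource (base_id, version, startdate, enddate) "
            "VALUES (?, ?, ?, NULL)",
            (base_id, current + 1, when),
        )

# Illustrative data only: 'a' gets two versions, 'b' gets one.
add_version(conn, "a", "2014-03-01")
add_version(conn, "a", "2014-03-05")
add_version(conn, "b", "2014-03-02")

# Fetching the latest version is now a plain filter, no MAX() or self-join.
latest = conn.execute(
    "SELECT base_id, version FROM resource WHERE enddate IS NULL ORDER BY base_id"
).fetchall()
print(latest)  # [('a', 2), ('b', 1)]
```

The trade-off is that every write has to keep enddate consistent, which is why the update and insert sit in one transaction in this sketch.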

OTHER TIPS

Given your own answer, your question was basically the same as in the link you supplied. Since you had some sub-questions I'll try to give you some additional help there.

If you want to have some kind of version control in your database, then you basically extend your primary key by some version column(s). I'd vote for using startdate/enddate columns too, for the reason you mentioned. Given your own answer, you could modify your layout accordingly. That's the route you should go if you can!

In your given example it is not clear what the primary key is, since the 'id' column has changing values, too. In your case the primary key would be the column 'title'. So you could use some query like

SELECT title, max(version) as version FROM resource GROUP BY title

to get a result in which you see your original primary key and the latest version -- which together form your actual primary key.

To get all other fields in that table, you'd join that result to the resource table and use the primary key fields as join condition.

SELECT * FROM (
        SELECT title, max(version) as version 
        FROM resource 
        GROUP BY title) as s 
    INNER JOIN resource r on (r.title = s.title AND r.version = s.version)
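If you want to try this join-back pattern without a MySQL server, here is a minimal sketch. It assumes Python's built-in sqlite3 module behaves like MySQL for this particular query, uses a three-column version of the resource table with the same sample rows as the test data posted elsewhere in this thread, and wraps the query in a hypothetical helper named latest_per_title.

```python
import sqlite3

# Sample rows; ids are assigned in insertion order (1..7).
ROWS = [
    ("Introduction", 1),
    ("Technical Data", 1),
    ("Warranty", 1),
    ("Product Description", 1),
    ("Warranty", 2),
    ("Introduction", 2),
    ("Technical Data", 3),
]

def latest_per_title(conn):
    """Return (id, title, version) of the highest version for each title."""
    return conn.execute(
        """
        SELECT r.id, r.title, r.version
          FROM (SELECT title, MAX(version) AS version
                  FROM resource
                 GROUP BY title) AS s
          JOIN resource r
            ON r.title = s.title AND r.version = s.version
         ORDER BY r.id
        """
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, title TEXT, version INTEGER)"
)
conn.executemany("INSERT INTO resource (title, version) VALUES (?, ?)", ROWS)

for row in latest_per_title(conn):
    print(row)
```

The inner query only ever selects the grouping column and the aggregate, so there is no column whose value the server has to pick arbitrarily; the id comes from the joined row that actually holds the highest version.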

Why did your query give you wrong results?

The reason is that you had an error in your query that MySQL somewhat "fixed" for you. Normally you would need to list every column that is not used in an aggregate function (like MAX()) in your GROUP BY clause. In your example

SELECT id, title, MAX(version) AS 'version' FROM resource GROUP BY title

you had a column ('id') in the select part of your query that you didn't list in your GROUP BY clause.

In MySQL you can ignore that rule when the ONLY_FULL_GROUP_BY SQL mode is disabled (see here).

When using this feature, all rows in each group should have the same values for the columns that are omitted from the GROUP BY part. The server is free to return any value from the group, so the results are indeterminate unless all values are the same.

Since the 'id' column had different values for your key (the 'title' column) you just got some result -- in that case MySQL probably just used the first row it found. But the result itself is undefined and might be subject to change e.g. when the database gets updated or the data grows. You should not depend on rules you deduce from results you see while testing!

On other databases like Oracle and SQL Server you would have gotten an error trying to execute that last query.

I hope I could clarify the reason for your results a little.

What if you try something like this:

SELECT r.id
     , r.title
     , u.name created_by
     , m.name modified_by
     , r.version
     , r.version_displayname
     , r.informationtype
     , r.filetype
     , r.base_id
     , r.resource_id
     , r.created
     , r.modified
     , GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';') categories 
  FROM resource r 
  JOIN category_resource cr 
    ON r.id = cr.resource_id 
  JOIN category c 
    ON cr.category_id = c.id 
  JOIN user u 
    ON r.created_by = u.id 
  JOIN user m 
    ON r.modified_by = m.id 
 WHERE r.base_id = 'uuid_033a7198-a213-11e3-93de-2b47e5a489c2' 
   AND r.version = (SELECT MAX(r1.version) FROM resource r1 WHERE r1.base_id = r.base_id) 
 GROUP 
    BY r.id;
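The difference from the query in the question is that the subquery is now tied to the outer row's base_id. In the original, MAX(version) was computed over the whole table, so a base_id whose only version is lower than the table-wide maximum matched nothing. A minimal sketch of that effect, assuming Python's built-in sqlite3 module as a stand-in for MySQL and a made-up three-row table (the helper name latest is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, base_id TEXT, version INTEGER)"
)
conn.executemany(
    "INSERT INTO resource (base_id, version) VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 1)],  # 'b' exists in one version only
)

# Uncorrelated: the subquery returns the maximum over the whole table (2).
UNCORRELATED = """
    SELECT r.id, r.version
      FROM resource r
     WHERE r.base_id = ?
       AND r.version = (SELECT MAX(r1.version) FROM resource r1)
"""

# Correlated: the subquery returns the maximum for the outer row's base_id.
CORRELATED = """
    SELECT r.id, r.version
      FROM resource r
     WHERE r.base_id = ?
       AND r.version = (SELECT MAX(r1.version)
                          FROM resource r1
                         WHERE r1.base_id = r.base_id)
"""

def latest(sql, base_id):
    return conn.execute(sql, (base_id,)).fetchall()

print(latest(UNCORRELATED, "b"))  # []        (version 1 != table-wide max 2)
print(latest(CORRELATED, "b"))    # [(3, 1)]
print(latest(CORRELATED, "a"))    # [(2, 2)]
```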

Similar to Steve's answer, you could use the following:

Select
id = (Select id From Resources R2 Where R2.title = R1.title And R2.version = (Select Max(version) From Resources R2 Where R2.title = R1.title)),
R1.title, 
version = (Select Max(version) From Resources R3 Where R3.title = R1.title) 
From Resources R1 
Group By R1.title
Order By R1.title

Try using windowing functions:

SELECT x.* FROM (
    SELECT 
       r.id
     , r.title
     , u.name created_by
     , m.name modified_by
     , r.version
     , row_number() over (partition by r.base_id order by r.version desc) AS row_indicator
     , r.version_displayname
     , r.informationtype
     , r.filetype
     , r.base_id
     , r.resource_id
     , r.created
     , r.modified
     , GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';')     categories 
     FROM resource r 
     JOIN category_resource cr 
     ON r.id = cr.resource_id 
     JOIN category c 
     ON cr.category_id = c.id 
     JOIN user u 
     ON r.created_by = u.id 
     JOIN user m 
     ON r.modified_by = m.id 
     WHERE r.base_id = 'uuid_033a7198-a213-11e3-93de-2b47e5a489c2'
     GROUP BY r.id
) x
where row_indicator = 1

The key part is the use of the row_number() window function. If you look up SQL Server window functions, you will find they are very powerful and eliminate the need for subqueries and/or self-joins in a lot of cases like this. (MySQL supports window functions from version 8.0 onward.)

To filter by the row_number() (aliased as "row_indicator"), you have to wrap the query in an inline view. Since the ORDER BY inside the OVER clause sorts each base_id partition by version descending, the highest version in each partition receives a row_number() of 1.
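Stripped of the joins, the pattern can be sketched as follows. This assumes Python's built-in sqlite3 module linked against SQLite 3.25 or newer (the first SQLite release with window functions) as a stand-in, and a made-up table with only id, base_id and version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, base_id TEXT, version INTEGER)"
)
conn.executemany(
    "INSERT INTO resource (base_id, version) VALUES (?, ?)",
    [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 2), ("b", 3)],
)

rows = conn.execute(
    """
    SELECT x.id, x.base_id, x.version
      FROM (SELECT r.*,
                   -- numbering restarts at 1 for every base_id,
                   -- with the highest version first
                   ROW_NUMBER() OVER (PARTITION BY r.base_id
                                      ORDER BY r.version DESC) AS row_indicator
              FROM resource r) AS x
     WHERE x.row_indicator = 1
     ORDER BY x.base_id
    """
).fetchall()
print(rows)  # [(3, 'a', 2), (6, 'b', 3), (4, 'c', 1)]
```

Note that 'c', which has a single version, is returned too, which is the case the original query missed.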

Good luck!

I wrote this from the perspective of SQL Server (2005), but I suspect it will be the same in MySQL.

First, your example query would result in an error:

SELECT id, title, MAX(version) AS 'version' FROM Resource GROUP BY title

Msg 8120, Level 16, State 1, Line XX Column 'Resource.ID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Adding the ID to the GROUP BY to fix the error gives a clue to why this won't accomplish your goal: if you include the ID in your grouping, you won't filter out your "duplicate" titles. You could instead use MAX(ID), and that would probably result in correct data, but (1) it would only be reliable if higher version numbers were always inserted after earlier version numbers, and (2) the query would become more complicated as you added fields, because they would also be involved in the grouping.

Instead, you can simply find the "TOP" entry in the table for each of the items in the distinct list. You can accomplish this with a query like this:

-- Populate Test Data
DECLARE @Resource TABLE
(
    ID int IDENTITY,
    Title varchar(100),
    Version int
);
INSERT INTO @Resource (Title, Version) VALUES ('Introduction', 1);
INSERT INTO @Resource (Title, Version) VALUES ('Technical Data', 1);
INSERT INTO @Resource (Title, Version) VALUES ('Warranty', 1);
INSERT INTO @Resource (Title, Version) VALUES ('Product Description', 1);
INSERT INTO @Resource (Title, Version) VALUES ('Warranty', 2);
INSERT INTO @Resource (Title, Version) VALUES ('Introduction', 2);
INSERT INTO @Resource (Title, Version) VALUES ('Technical Data', 3);

-- Query with desired results    
SELECT
    *
FROM        @Resource r1
WHERE       r1.ID =
            (
                SELECT
                    TOP 1 r2.ID
                FROM        @Resource r2
                WHERE       r2.Title = r1.Title
                ORDER BY    r2.Version DESC,
                            r2.ID DESC
            );
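TOP 1 is SQL Server syntax; in MySQL the counterpart is LIMIT 1 at the end of the subquery. A minimal sketch of the same idea, assuming Python's built-in sqlite3 module as a stand-in for MySQL and made-up rows that deliberately contain a duplicate version, to show the tie-break on ID:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, title TEXT, version INTEGER)"
)
conn.executemany(
    "INSERT INTO resource (title, version) VALUES (?, ?)",
    # 'Warranty' has version 2 twice (ids 2 and 3)
    [("Warranty", 1), ("Warranty", 2), ("Warranty", 2), ("Introduction", 1)],
)

rows = conn.execute(
    """
    SELECT r1.id, r1.title, r1.version
      FROM resource r1
     WHERE r1.id = (SELECT r2.id
                      FROM resource r2
                     WHERE r2.title = r1.title
                     ORDER BY r2.version DESC,  -- highest version first
                              r2.id DESC        -- ties: newest id wins
                     LIMIT 1)
     ORDER BY r1.id
    """
).fetchall()
print(rows)  # [(3, 'Warranty', 2), (4, 'Introduction', 1)]
```

Because the outer row is matched on ID rather than on Version, exactly one row per title comes back even when the highest version is duplicated.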

If you can guarantee that there won't be a duplicate Version number for a given Title, you can use either of these methods (each of which produces the same query plan):

SELECT
    *
FROM        @Resource r1
WHERE       r1.Version =
            (
                SELECT
                    MAX(r2.Version)
                FROM        @Resource r2
                WHERE       r2.Title = r1.Title
            )
ORDER BY    r1.Title;

SELECT      r1.*
FROM        (
                SELECT
                    r2.Title,
                    MAX(r2.Version) AS MaxVersion
                FROM        @Resource r2
                GROUP BY    r2.Title
            ) AS MaxVerList
JOIN        @Resource r1
ON          r1.Title = MaxVerList.Title
AND         r1.Version = MaxVerList.MaxVersion
ORDER BY    r1.Title;

Using the data Riley produced, changing the @ to a # for a temp table, and again from a SQL Server 2008 perspective (but it's core SQL), the following should work without causing undue performance issues.

SELECT
    *
FROM   #Resource r1
WHERE r1.Version = (SELECT MAX(r2.Version) 
FROM #Resource r2 WHERE r1.Title = r2.Title )
ORDER BY r1.ID

This gives the correct answer:

ID    Title                  Version
4     Product Description    1
5     Warranty               2
6     Introduction           2
7     Technical Data         3

You're looking for the Max(Version) per Title from what I can see. The major cost of this query is the ORDER BY, as there are no indexes.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow