Question

I have two tables, players and scores.

I want to generate a report that looks something like this:

player    first score             points
foo       2010-05-20              19
bar       2010-04-15              29
baz       2010-02-04              13

Right now, my query looks something like this:

select p.name        player,
       min(s.date)   first_score,
       s.points      points    
from  players p    
join  scores  s on  s.player_id = p.id    
group by p.name, s.points

I need the s.points that is associated with the row that min(s.date) returns. Is that happening with this query? That is, how can I be certain I'm getting the correct s.points value for the joined row?

Side note: I imagine this is somehow related to MySQL's lack of dense ranking. What's the best workaround here?

Solution

This is the greatest-n-per-group problem that comes up frequently on Stack Overflow.

Here's my usual answer:

select
  p.name        player,
  s.date        first_score,
  s.points      points

from  players p

join  scores  s
  on  s.player_id = p.id

left outer join scores  s2
  on  s2.player_id = p.id
      and s2.date < s.date

where
  s2.player_id is null

;

In other words, given score s, try to find a score s2 for the same player, but with an earlier date. If no earlier score is found, then s is the earliest one.
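To check the self-join logic described above, here is a minimal sketch that runs the same query against an in-memory SQLite database through Python's sqlite3 module. The sample rows are hypothetical, the dates are stored as ISO-8601 text (which sorts chronologically), and the `order by` is added only to make the output order predictable. It is an illustration under those assumptions, not a test against MySQL itself.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table players (id integer primary key, name text);
    create table scores  (id integer primary key, player_id integer,
                          date text, points integer);
    insert into players (id, name) values (1, 'foo'), (2, 'bar');
    insert into scores (player_id, date, points) values
        (1, '2010-05-20', 19),
        (1, '2010-06-01', 40),
        (2, '2010-07-09', 11),
        (2, '2010-04-15', 29);
""")

FIRST_SCORE_SQL = """
    select p.name, s.date, s.points
    from players p
    join scores s
      on s.player_id = p.id
    left outer join scores s2
      on s2.player_id = p.id
     and s2.date < s.date
    where s2.player_id is null   -- keep s only if no earlier s2 exists
    order by p.name              -- added for a stable output order
"""

first_scores = conn.execute(FIRST_SCORE_SQL).fetchall()
for row in first_scores:
    print(row)
```

Each player appears once, paired with the points from the same row as the earliest date.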


Re your comment about ties: you need a policy for which row to use when two scores share the same date. One possibility, if you use auto-incrementing primary keys, is to treat the row with the smaller key as the earlier one. See the additional term in the outer join below:

select
  p.name        player,
  s.date        first_score,
  s.points      points

from  players p

join  scores  s
  on  s.player_id = p.id

left outer join scores  s2
  on  s2.player_id = p.id
      and (s2.date < s.date or (s2.date = s.date and s2.id < s.id))

where
  s2.player_id is null

;

Basically you need to add tiebreaker terms until you get down to a column that's guaranteed to be unique, at least for the given player. The primary key of the table is often the best solution, but I've seen cases where another column was suitable.
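The effect of the tiebreaker can be seen with a small sketch, again using Python's sqlite3 module and an in-memory database as a stand-in for MySQL. The rows are hypothetical: one player has two scores on the same earliest date, and the `id` values are assumed to reflect insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table players (id integer primary key, name text);
    create table scores  (id integer primary key, player_id integer,
                          date text, points integer);
    insert into players (id, name) values (1, 'foo');
    -- Hypothetical rows: ids 1 and 2 share the earliest date.
    insert into scores (id, player_id, date, points) values
        (1, 1, '2010-05-20', 19),
        (2, 1, '2010-05-20', 25),
        (3, 1, '2010-06-01', 40);
""")

# Same query shape as above; only the "earlier than s" condition varies.
BASE = """
    select p.name, s.date, s.points
    from players p
    join scores s
      on s.player_id = p.id
    left outer join scores s2
      on s2.player_id = p.id
     and {earlier}
    where s2.player_id is null
    order by s.id
"""

DATE_ONLY = "s2.date < s.date"
DATE_THEN_ID = "(s2.date < s.date or (s2.date = s.date and s2.id < s.id))"

without_tiebreak = conn.execute(BASE.format(earlier=DATE_ONLY)).fetchall()
with_tiebreak = conn.execute(BASE.format(earlier=DATE_THEN_ID)).fetchall()

print(without_tiebreak)  # both tied rows survive
print(with_tiebreak)     # only the row with the smaller id survives
```

Without the extra term, neither tied row has a strictly earlier row, so both are returned. With it, the row with the larger id finds a match in s2 and is filtered out.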

Regarding the comments I shared with @OMG Ponies, remember that this type of query benefits hugely from the right index, most likely one covering the scores columns used in the self-join (player_id and date).
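As a rough sketch of what such an index might look like, the snippet below creates a composite index on scores (player_id, date) and prints the query plan. The index name and column choice are assumptions, not something stated in the answer. The plan shown comes from SQLite's `explain query plan`, whose output differs from MySQL's `EXPLAIN` and varies between versions, so treat it only as a way to see whether an index is being considered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table players (id integer primary key, name text);
    create table scores  (id integer primary key, player_id integer,
                          date text, points integer);
    -- Assumed index: the join column first, then the compared date column.
    create index scores_player_date on scores (player_id, date);
""")

QUERY = """
    select p.name, s.date, s.points
    from players p
    join scores s
      on s.player_id = p.id
    left outer join scores s2
      on s2.player_id = p.id
     and s2.date < s.date
    where s2.player_id is null
"""

# The last column of each plan row is the human-readable description.
plan = conn.execute("explain query plan " + QUERY).fetchall()
for step in plan:
    print(step[-1])

index_names = [
    row[0]
    for row in conn.execute("select name from sqlite_master where type = 'index'")
]
print(index_names)
```

On a real MySQL server, the equivalent check would be to run `EXPLAIN` on the query before and after adding the index and compare the results.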

OTHER TIPS

Most RDBMSs won't even let you include non-aggregate columns in your SELECT clause when using GROUP BY. In MySQL (when the ONLY_FULL_GROUP_BY SQL mode is not enabled), you'll end up with values from an arbitrary row in each group for your non-aggregate columns. This is useful if a particular column actually has the same value for all the rows in a group, so it's nice that MySQL doesn't restrict us, but it's an important behavior to understand.
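Adding s.points to the GROUP BY, as the query in the question does, sidesteps that restriction but changes what a group is. The sketch below runs the question's query on hypothetical sample rows in an in-memory SQLite database (via Python's sqlite3 module, as a stand-in for MySQL); the `order by` is added only for a stable output order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table players (id integer primary key, name text);
    create table scores  (id integer primary key, player_id integer,
                          date text, points integer);
    insert into players (id, name) values (1, 'foo');
    -- Hypothetical rows: one player, two scores with different points.
    insert into scores (player_id, date, points) values
        (1, '2010-05-20', 19),
        (1, '2010-06-01', 40);
""")

# The question's query: grouping by (name, points), not by name alone.
GROUPED_SQL = """
    select p.name, min(s.date), s.points
    from players p
    join scores s on s.player_id = p.id
    group by p.name, s.points
    order by p.name, s.points
"""

grouped_rows = conn.execute(GROUPED_SQL).fetchall()
for row in grouped_rows:
    print(row)
```

The player shows up once per distinct points value, each with the earliest date for that value, rather than once with the points of the first score.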

A whole chapter is devoted to this in SQL Antipatterns.

Licensed under: CC-BY-SA with attribution