Query optimization ndbcluster + order by

https://dba.stackexchange.com/questions/55509

02-11-2019
|

Question

I'm not really all too familiar with the ndbcluster storage engine of MySQL, but I've recently written an application that does some very simple queries. However, a co-worker of mine (who is now on vacation, or I'd ask him about it, of course) left a TODO in the comments next to what I thought to be a pretty harmless query.

The comment read "TODO: Avoid sorting on PK when querying on MySQL cluster"

Granted, there are some things I'm aware of when working on a cluster (like using less JOIN's) but I gathered the index (and therefore the PK) lookups were effectively O(1) operations, so a simple ORDER BY someID DESC wouldn't hurt performance.
I've been looking around on google, but I can't quite find a definitive answer as to what this ORDER BY means on a cluster. I can make a few educated guesses, but still...

What's more, I can't really think what query I can use instead. Basically, I have a generic method that enables me to fetch the last N rows based on 1 or 2 values. The simplified version of the query looks something like this:

SELECT * 
    FROM db.tbl 
WHERE user_Id = ? 
  AND req_action = ? 
ORDER BY tbl_Id DESC
LIMIT 0,?; -- 0,1 is default, though

Most of the time, the limit, though is 0,1, so I came up with this alternative approach that enables me to not use the ORDER BY clause, but it reduces the flexibility of my method, rather significantly:

SELECT *
    FROM db.tbl
WHERE 
    tbl_Id = (
        SELECT MAX(tbl_ID)
            FROM db.tbl
        WHERE user_Id = ?
          AND req_action = ?);

But in that case, I'm always going to end up with either an empty resultset, or a single row.

My questions:

Is it true that ORDER BY on the primary key is not advisable on a cluster
if so, what alternative constructs should I be looking at to maintain the flexibility of my initial query, while sticking to good practices for cluster queries?

Things to keep in mind:

The dataset I'm querying is certain to have hundreds of thousands, if not millions of rows.
The table will be used heavily, both read and write operations, 30 RW/s is not exceptional
I can write stored procedures, but we only use them as a last resort.
An EXPLAIN EXTENDED of both queries would have me believe that the first (my initial query) is still preferable, but I ran it on a vagrant box, and just noticed the test tables use InnoDB storage engine, so I take it I can't trust those results :)

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange