Optimizing a MySQL query with a large IN() clause or join on derived table

https://stackoverflow.com/questions/2091777

21-09-2019
|

Question

Let's say I need to query the associates of a corporation. I have a table, "transactions", which contains data on every transaction made.

CREATE TABLE `transactions` (
  `transactionID` int(11) unsigned NOT NULL,
  `orderID` int(11) unsigned NOT NULL,
  `customerID` int(11) unsigned NOT NULL,
  `employeeID` int(11) unsigned NOT NULL, 
  `corporationID` int(11) unsigned NOT NULL,
  PRIMARY KEY (`transactionID`),
  KEY `orderID` (`orderID`),
  KEY `customerID` (`customerID`),
  KEY `employeeID` (`employeeID`),
  KEY `corporationID` (`corporationID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

It's fairly straightforward to query this table for associates, but there's a twist: A transaction record is registered once per employee, and so there may be multiple records for one corporation per order.

For example, if employees A and B from corporation 1 were both involved in selling a vacuum cleaner to corporation 2, there would be two records in the "transactions" table; one for each employee, and both for corporation 1. This must not affect the results, though. A trade from corporation 1, regardless of how many of its employees were involved, must be treated as one.

Easy, I thought. I'll just make a join on a derived table, like so:

SELECT corporationID FROM transactions JOIN (SELECT DISTINCT orderID FROM transactions WHERE corporationID = 1) AS foo USING (orderID)

The query returns a list of corporations who have been involved in trades with corporation 1. That's exactly what I need, but it's very slow because MySQL can't use the corporationID index to determine the derived table. I understand that this is the case for all subqueries/derived tables in MySQL.

I've also tried to query a collection of orderIDs separately and use a ridiculously large IN() clause (typhically 100 000+ IDs), but as it turns out MySQL has issues using indices on ridiculously large IN() clauses as well and as a result the query time does not improve.

Are there any other options available, or have I exhausted them both?

Solution

If I understand your requirement, you could try this.

select distinct t1.corporationID
from transactions t1
where exists (
    select 1
    from transactions t2
    where t2.corporationID =  1
    and t2.orderID = t1.orderID)
and t1.corporationID != 1;

or this:

select distinct t1.corporationID
from transactions t1
join transactions t2
on t2.orderID = t1.orderID
and t1.transactionID != t2.transactionID
where t2.corporationID = 1
and t1.corporationID != 1;

OTHER TIPS

Your data makes no sense to me, I think you are using corporationID where you mean customer ID at some point in there, as your query joins the transaction table to the transaction table for corporationID=1 based on orderID to get the corporationIDs...which would then be 1, right?

Can you please specify what the customerID, employeeID, and corporationIDs mean? How do I know employees A and B are from corporation 1 - in that case, is corporation 1 the corporationID, and corporation 2 is the customer, and so stored in the customerID?

If that is the case, you just need to do a group by:

SELECT customerID
FROM transactions
WHERE corporationID = 1
GROUP BY customerID

(Or select and group by orderID if you want one row per order instead of one row per customer.)

By using the group by, you ignore the fact that there are multiple records that are duplicate except for the employeeID.

Conversely, to returns all corporations that have sold to corporation 2.

SELECT corporationID
FROM transactions
WHERE customerID = 2
GROUP BY corporationID

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow