Let the number of transactions be n, the number of items be k, and the average size of a transaction be d.

The naive approach (checking every pair of items against all records) gives an O(k^2 * n * d) solution, which is far from optimal. We can improve it to O(k*n*d), and if we assume a uniform distribution of items (i.e. each item repeats on average O(n*d/k) times), we can improve it further to O(d^2 * n + k^2), which is much better, since most likely d << k.
This can be done by building an inverted index of your transactions, meaning: create a map from each item to the transactions containing it. Creating the index is O(nd + k).
For example, if you have the transactions
transaction1 = ('apple','grape')
transaction2 = ('apple','banana','mango')
transaction3 = ('grape','mango')
The inverted index will be:
'apple' -> [1,2]
'grape' -> [1,3]
'banana' -> [2]
'mango' -> [2,3]
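A minimal sketch of building that index, using Python dictionaries (the function name and 1-based transaction ids are illustrative choices, not prescribed by the approach):

```python
from collections import defaultdict

def build_inverted_index(transactions):
    """Map each item to the ids of the transactions containing it: O(nd + k)."""
    index = defaultdict(list)
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            index[item].append(tid)
    return index

transactions = [
    ('apple', 'grape'),
    ('apple', 'banana', 'mango'),
    ('grape', 'mango'),
]
index = build_inverted_index(transactions)
# index['apple'] -> [1, 2], index['grape'] -> [1, 3], index['mango'] -> [2, 3]
```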
So, after understanding what an inverted index is, here are the guidelines for the solution:
- Build an inverted index for your data
- For each item x, iterate over all transactions it appears in, and build a histogram of all the pairs (x,y) such that y co-occurs with x.
- When you are done, you have a histogram containing up to k^2 entries, which you need to process. This question discusses how to get the top-k elements out of an unsorted list.
Complexity analysis:
- Building the inverted index is O(nd + k).
- Assuming each element repeats in O(nd/k) transactions, each iteration takes O(nd/k * d) time, and you have k iterations in this step, so you get O(nd^2 + k) for this step.
- Processing the list can be done in O(k^2 * logk) if you want a full ordering, or in O(k^2) if you just want to print the top X elements.
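For the last step, Python's standard library already gives a top-X selection that avoids a full sort: `Counter.most_common(X)` (backed by `heapq.nlargest`) runs in O(k^2 * logX) on a histogram of k^2 entries, which is close to the O(k^2) selection bound in practice. A sketch on hypothetical counts:

```python
from collections import Counter

# Hypothetical pair counts, standing in for the real histogram.
hist = Counter({
    ('apple', 'mango'): 7,
    ('apple', 'grape'): 5,
    ('grape', 'mango'): 3,
})
top2 = hist.most_common(2)  # top-X without sorting the whole histogram
# [(('apple', 'mango'), 7), (('apple', 'grape'), 5)]
```

A true worst-case O(k^2) bound would require a selection algorithm such as quickselect, but for printing a small top-X the heap-based route is the idiomatic choice.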
Totaling in an O(nd^2 + k^2) solution to get the top-X elements, which is MUCH better than the naive approach, assuming d << k.
In addition, note that the bottleneck (step 2) can be efficiently parallelized and distributed among threads if needed.