Question

Problem:

I have a list of millions of transactions. Each transaction contains items (e.g. 'carrots', 'apples'). The goal is to generate a list of pairs of items that frequently occur together in individual transactions. As far as I can tell, doing an exhaustive search isn't feasible.

Solution attempts:

So far I have two ideas: 1) randomly sample some appropriate fraction of transactions and only check those, or 2) count how often each item appears, use that data to calculate how often items should appear together by chance, and use that to adjust the estimate from 1.

Any tips, alternative approaches, ready-made solutions or just general reading suggestions are much appreciated.

Edit:

Some additional information from the comments

Number of different items: 1,000 to 100,000

Memory constraint: A few gigabytes of RAM at most, for a few hours.

Frequency of use: More or less a one-off.

Available resources: 20-100 hours of newbie programmer time.

Desired result list format: Pairs of items and some measure of how often they appear together, for the n most frequent pairs.

Distribution of items per transaction: Unknown as of now.

Solution

Let the number of transactions be n, the number of items be k, and the average size of a transaction be d.

The naive approach (checking every pair against all records) gives an O(k^2 * n * d) solution, which is far from optimal. We can improve it to O(k * n * d), and if we assume a uniform distribution of items (i.e. each item appears in O(n*d/k) transactions on average), we can improve it further to O(d^2 * n + k^2), which is much better, since most likely d << k.

This can be done by building an inverted index of your transactions, i.e. a map from each item to the transactions containing it (creating the index is O(nd + k)).

For example, if you have the transactions

transaction1 = ('apple','grape')
transaction2 = ('apple','banana','mango')
transaction3 = ('grape','mango')

The inverted index will be:

'apple' -> [1,2]
'grape' -> [1,3]
'banana' -> [2]
'mango' -> [2,3]
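
As a minimal sketch (assuming the transactions are available as a list of item tuples; the function name build_inverted_index is made up for illustration), the index could be built in Python like this:

from collections import defaultdict

def build_inverted_index(transactions):
    """Map each item to the IDs of the transactions that contain it."""
    index = defaultdict(list)
    for tid, items in enumerate(transactions, start=1):
        for item in set(items):                 # ignore duplicates within one transaction
            index[item].append(tid)
    return index

transactions = [
    ('apple', 'grape'),
    ('apple', 'banana', 'mango'),
    ('grape', 'mango'),
]
index = build_inverted_index(transactions)
# index == {'apple': [1, 2], 'grape': [1, 3], 'banana': [2], 'mango': [2, 3]}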

So, after understanding what an inverted index is, here are the guidelines for the solution:

  1. Build an inverted index for your data.
  2. For each item x, iterate over all transactions it appears in, and build a histogram of all pairs (x,y) such that y co-occurs with x (see the sketch after this list).
  3. When you are done, you have a histogram containing up to k^2 entries, which you need to process. Getting the top-k elements out of an unsorted list is a standard selection problem (a heap or a partial sort will do).
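
Continuing that sketch, steps 2 and 3 might look roughly like this (it reuses transactions and build_inverted_index from above; top_pairs and the use of heapq.nlargest for the top-X selection are illustrative choices, not prescribed here):

from collections import Counter
import heapq

def top_pairs(transactions, index, top_x):
    """Count co-occurring pairs via the inverted index, then keep the top_x."""
    counts = Counter()
    for x, tids in index.items():
        for tid in tids:
            for y in set(transactions[tid - 1]):   # transaction IDs are 1-based here
                if x < y:                          # count each unordered pair exactly once
                    counts[(x, y)] += 1
    return heapq.nlargest(top_x, counts.items(), key=lambda kv: kv[1])

print(top_pairs(transactions, build_inverted_index(transactions), top_x=3))
# e.g. [(('apple', 'banana'), 1), (('apple', 'grape'), 1), (('apple', 'mango'), 1)]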

Complexity analysis:

  1. Building the inverted index is O(nd + k).
  2. Assuming each item appears in O(nd/k) transactions, each iteration over one item takes O(nd/k * d) time; with k such iterations, this step is O(nd^2 + k).
  3. Processing the histogram can be done in O(k^2 log k) if you want a full ordering, or in O(k^2) if you just want the top X elements.

In total, this gives an O(nd^2 + k^2) solution for getting the top X elements, which is MUCH better than the naive approach, assuming d << k.

In addition, note that the bottleneck (step 2) can be efficiently parallelized and distributed among threads if needed.
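
For instance (a rough sketch only: it uses worker processes rather than threads, and the helper names, chunking scheme and worker count are all assumptions), step 2 could be split across workers and the partial histograms merged afterwards:

from collections import Counter
from multiprocessing import Pool

def count_pairs_for_items(args):
    """Count co-occurring pairs for one chunk of items, using the inverted index."""
    items_chunk, transactions, index = args
    counts = Counter()
    for x in items_chunk:
        for tid in index[x]:
            for y in set(transactions[tid - 1]):
                if x < y:
                    counts[(x, y)] += 1
    return counts

def parallel_pair_counts(transactions, index, workers=4):
    items = list(index)
    chunks = [items[i::workers] for i in range(workers)]     # round-robin split of the items
    with Pool(workers) as pool:
        partials = pool.map(count_pairs_for_items,
                            [(chunk, transactions, index) for chunk in chunks])
    return sum(partials, Counter())                          # merge the partial histograms

if __name__ == '__main__':                                   # guard needed where workers are spawned
    print(parallel_pair_counts(transactions, build_inverted_index(transactions)).most_common(3))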

OTHER TIPS

If the number of items in one purchase is small (< 10), do this: keep a map of maps (a dictionary of dictionaries). The key in the outer map is an item; its value is an inner map whose key is a second item and whose value counts how many times that pair of items appeared together in a purchase.

So go through every order and, for every pair of items it contains, update the map. At the end, go through the map of maps and look for "big values" in the inner maps (see the sketch below).
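
A minimal Python sketch of this map-of-maps approach (it reuses the transactions list from the example above, and the threshold for what counts as a "big value" is a made-up placeholder):

from collections import defaultdict

# counts[a][b]: number of purchases containing both a and b (stored once, with a < b)
counts = defaultdict(lambda: defaultdict(int))

for order in transactions:
    items = sorted(set(order))                  # sort so each pair is stored as (a, b) with a < b
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            counts[a][b] += 1

threshold = 2                                   # hypothetical cut-off for a "big value"
for a, inner in counts.items():
    for b, count in inner.items():
        if count >= threshold:
            print(a, b, count)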

Note: depending on the size and distribution of the input data, you might run out of memory.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow