Calculating average for two variables

Question

For a known number of columns, the table you describe (which is not actually an average; it is a count) can be built using IF:

SELECT Product,
       SUM(IF(VisitDayBeforePurchase = 0, 1, 0)) AS Day0,
       SUM(IF(VisitDayBeforePurchase = 1, 1, 0)) AS Day1,
       SUM(IF(VisitDayBeforePurchase = 2, 1, 0)) AS Day2
FROM yourtable
GROUP BY Product;

"Essentially, I want to see an average number of visits somebody visits the website X days before purchasing a specific product, i.e. sum(visits)/sum(uniqueVisitors) per product per days before visit."

This is a different request. You can do this by adding (or replacing) a column:

SELECT Product,
       AVG(VisitDayBeforePurchase) AS AverageDays
FROM yourtable
GROUP BY Product;

Combined, this gives you both the counts and the average (you can see it in action here):

SELECT Product,
       SUM(IF(VisitDayBeforePurchase = 0, 1, 0)) AS Day0,
       SUM(IF(VisitDayBeforePurchase = 1, 1, 0)) AS Day1,
       SUM(IF(VisitDayBeforePurchase = 2, 1, 0)) AS Day2,
       AVG(VisitDayBeforePurchase) AS AverageDays
FROM yourtable
GROUP BY Product;
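To try the combined query without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration, and the sketch swaps MySQL's IF() for the standard CASE WHEN so that it stays portable; otherwise it follows the query above.

```python
import sqlite3

# Hypothetical sample rows: (Product, VisitDayBeforePurchase)
rows = [("A", 0), ("A", 1), ("A", 1), ("B", 2), ("B", 0)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (Product TEXT, VisitDayBeforePurchase INTEGER)"
)
conn.executemany("INSERT INTO yourtable VALUES (?, ?)", rows)

# Same shape as the MySQL query, with CASE WHEN standing in for IF().
query = """
SELECT Product,
       SUM(CASE WHEN VisitDayBeforePurchase = 0 THEN 1 ELSE 0 END) AS Day0,
       SUM(CASE WHEN VisitDayBeforePurchase = 1 THEN 1 ELSE 0 END) AS Day1,
       SUM(CASE WHEN VisitDayBeforePurchase = 2 THEN 1 ELSE 0 END) AS Day2,
       AVG(VisitDayBeforePurchase) AS AverageDays
FROM yourtable
GROUP BY Product
ORDER BY Product
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each output row holds the product, the three per-day counts, and the plain average over all of that product's rows.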

Accounting for multiple visitors

In a nutshell: it's complicated, and maybe it's best not done at all.

Say a product gets viewed twice (or more) by the same visitor; we then do not want to count these as separate visits. If Mr. X visited the site three days before, two days before, and on the day of purchase, what do we do?

At first sight we might think of counting only the last visit. But that has an obvious unintended consequence: since you have to visit the site to purchase an item on it, the last visit before the purchase is the visit during which the purchase was made, so it will always be zero days before the purchase (possibly even in the same hour and minute). So while it is possible to consider only the last visit, it would give us worthless results.

Considering only the first visit has its own unintended consequence: it overlooks repeat purchases, so our best repeat customers would actually be counted as the most dithering and indecisive.

So one would have to consider, for instance, only the day intervals actually tabulated with SUM, and then do something:

VisitorID       ProductID       VDBeforeP
42              137             3
42              137             2
41              137             2

But what to do? If we consider only one record for visitor 42, whatever we pick we end up with an incorrect result, either too optimistic or too pessimistic on average. We can instead take visitor 42's average, which gives 2.5 with weight one (instead of two records), so in comparison with the "brute average" (solution above) repeat customers count a bit less.
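The difference between the two weightings can be checked by hand on the three rows of the table above. The following is only an illustrative sketch in Python; the variable names are made up for the example.

```python
from collections import defaultdict

# Rows from the table above: (VisitorID, ProductID, VDBeforeP)
rows = [(42, 137, 3), (42, 137, 2), (41, 137, 2)]

# "Brute" average: every row counts once, so visitor 42 weighs twice.
brute = sum(days for _, _, days in rows) / len(rows)

# Per-visitor average: first average within each (visitor, product) pair,
# then average those values so that each visitor has weight one.
per_pair = defaultdict(list)
for visitor, product, days in rows:
    per_pair[(visitor, product)].append(days)
pair_means = {key: sum(v) / len(v) for key, v in per_pair.items()}
weighted = sum(pair_means.values()) / len(pair_means)

print(brute)                  # (3 + 2 + 2) / 3, about 2.33
print(pair_means[(42, 137)])  # 2.5
print(weighted)               # (2.5 + 2.0) / 2 = 2.25
```

Visitor 42's two records pull the brute average up to about 2.33, while giving each visitor weight one yields 2.25.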

To do so, we use a subquery: we obtain the averaged data with only one row per Visitor and Product:

SELECT VisitorID, Product, AVG(VisitDayBeforePurchase) AS VisitDayBeforePurchase
    FROM visits GROUP BY VisitorID, Product;

and this will yield a table with the same format as the original one, but with averaged data. Fed to the original query as is, it will not work, because that query only tests for whole numbers of days, and 2.5 is neither 2 nor 3. So we have to make either an optimistic or a pessimistic correction; this is the optimistic one:

SELECT VisitorID, Product, FLOOR(AVG(VisitDayBeforePurchase)) AS VisitDayBeforePurchase
    FROM visits GROUP BY VisitorID, Product;

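while the pessimistic one would use FLOOR(1.0+AVG.... Note that FLOOR(1.0+x) also pushes whole-number averages up by one (2 becomes 3), so CEILING is probably closer to the intent. A compromise would be to use ROUND.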
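To see how the corrections differ, here is a small Python sketch applying each one to 2.5 (visitor 42's average) and to a whole-number average of 2.0. The function names are hypothetical. The compromise assumes that halves round up, which is my understanding of how MySQL's ROUND treats exact-value input; Python's own round() behaves differently, as the last line shows.

```python
import math

def optimistic(avg):
    # FLOOR(AVG(...)): round down to fewer days.
    return math.floor(avg)

def floor_plus_one(avg):
    # FLOOR(1.0 + AVG(...)): always lands one above the optimistic value.
    return math.floor(1.0 + avg)

def pessimistic(avg):
    # CEILING(AVG(...)): round up, but leave whole numbers alone.
    return math.ceil(avg)

def compromise(avg):
    # Round half up (assumed to match ROUND on exact values in MySQL).
    return math.floor(avg + 0.5)

for avg in (2.5, 2.0):
    print(avg, optimistic(avg), floor_plus_one(avg),
          pessimistic(avg), compromise(avg))
# prints: 2.5 2 3 3 3
# prints: 2.0 2 3 2 2

print(round(2.5))  # 2, because Python rounds halves to the nearest even number
```

The second line of output is the case to watch: a visitor whose average is exactly 2 days lands in the Day3 bucket under FLOOR(1.0 + x), but stays in Day2 with a ceiling.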

Now we repeat the query:

SELECT Product,
    SUM(IF(V = 0, 1, 0)) AS Day0,
    SUM(IF(V = 1, 1, 0)) AS Day1,
    SUM(IF(V = 2, 1, 0)) AS Day2,
    AVG(BetterV) AS AverageDays
FROM (
    SELECT VisitorID,
           Product,
           ROUND(AVG(VisitDayBeforePurchase)) AS V,
           AVG(VisitDayBeforePurchase) AS BetterV
    FROM visits
    GROUP BY VisitorID, Product
) AS grouped
GROUP BY Product;

A working example can also be found here.

Map-Reduce

To run the above in a map-reduce environment you would need two stages: a map stage that directly outputs VisitorID, Product and VisitDayBeforePurchase, and a reduce stage that groups by the key (VisitorID, Product) and outputs those together with the computed V (and BetterV, if you want the same AverageDays as in the query above).

This gets fed to a second reduce stage, keyed by Product, that performs the per-day counts and averages on the V's.
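The stages described above can be sketched in plain Python, with no particular map-reduce framework assumed. All function names here are hypothetical, the in-memory shuffle helper stands in for whatever grouping the real framework does between stages, and V is rounded half up as an assumption about how ROUND treats 2.5.

```python
from collections import defaultdict
import math

def map_stage(record):
    # Emit ((VisitorID, Product), VisitDayBeforePurchase) unchanged.
    visitor, product, days = record
    yield (visitor, product), days

def shuffle(pairs):
    # Stand-in for the framework's group-by-key step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_per_visitor(key, days_list):
    # First reduce: one (V, BetterV) pair per visitor and product,
    # re-keyed by product for the next stage.
    _, product = key
    better_v = sum(days_list) / len(days_list)
    v = math.floor(better_v + 0.5)  # round half up (assumption)
    return product, (v, better_v)

def reduce_per_product(product, values, max_day=2):
    # Second reduce: Day0..Day2 counts on V, average on BetterV.
    counts = [0] * (max_day + 1)
    for v, _ in values:
        if 0 <= v <= max_day:
            counts[v] += 1
    average = sum(better_v for _, better_v in values) / len(values)
    return product, counts, average

records = [(42, 137, 3), (42, 137, 2), (41, 137, 2)]
stage1 = shuffle(pair for rec in records for pair in map_stage(rec))
stage2 = shuffle(reduce_per_visitor(k, d) for k, d in stage1.items())
results = [reduce_per_product(p, vals) for p, vals in stage2.items()]
print(results)  # [(137, [0, 0, 1], 2.25)]
```

On the three sample rows, visitor 42 rounds to day 3 and so falls outside the Day0 to Day2 columns, just as in the SQL query, which only tabulates those three days.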