Aggregating weekly data in Hive

https://stackoverflow.com/questions/20984402

25-09-2022

Question

I want to aggregate a count of accounts that match the criteria in the query below, on a weekly basis, for the last 3 months. What is the most efficient way to get this data into a table with num_accounts and week as the columns?

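-- Count distinct accounts per week that have a failed or unbilled invoice.
SELECT COUNT(DISTINCT a.account_id) AS num_accounts,
       WEEKOFYEAR(a.ds) AS week
FROM
    (SELECT
        CAST(account_id AS BIGINT) AS account_id,  -- alias so the join can reference it
        ds                                         -- needed by WEEKOFYEAR(a.ds)
    FROM
        tableA
    WHERE ds = '2013-12-28') a
JOIN
    tableB b
ON a.account_id = b.account_id AND
    b.ds = '2013-12-28'
WHERE
    b.invoice_date BETWEEN '2013-12-22' AND '2013-12-28' AND
    -- a status cannot equal both values at once, so this assumes either one was meant
    b.payment_status IN ('failed', 'unbilled')
GROUP BY WEEKOFYEAR(a.ds)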

Answer

You're trying to compute a distinct count over a large set. One scalable approach is to use a probabilistic data structure such as a HyperLogLog or a KMV sketch set, like those provided in Brickhouse (http://github.com/klout/brickhouse). A blog post at http://brickhouseconfessions.wordpress.com/2013/12/11/using-sketch_set-for-reach-estimation/ describes a situation just like yours. This should give you a fairly close estimate without having to completely re-sort your data.
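To make the idea concrete, here is a minimal Python sketch of a KMV (k minimum values) estimator. It is only an illustration of the general technique, not Brickhouse's implementation: the names hash01 and kmv_estimate, the MD5-based hash, and the default k are all assumptions chosen for this example.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) (illustrative hash choice)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values, k=256):
    """Estimate the number of distinct values from the k smallest hashes seen."""
    heap = []     # max-heap via negation: holds the k smallest hashes so far
    kept = set()  # the same hashes, for cheap duplicate checks
    for value in values:
        h = hash01(value)
        if h in kept:
            continue  # a repeated value never changes the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the current largest kept hash with the smaller new one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    # The k-th smallest hash sits near k / n, so n is roughly (k - 1) / kth_min.
    return (k - 1) / -heap[0]


if __name__ == "__main__":
    account_ids = [i % 50000 for i in range(100000)]  # each id appears twice
    print(kmv_estimate(account_ids, k=1024))
```

The sketch keeps at most k hashes no matter how many rows it sees, which is why it avoids a full sort or shuffle of the ids. The relative error shrinks roughly as 1/sqrt(k), so a larger k trades memory for accuracy.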

If I understand you correctly, you just want to aggregate by week, and Hive's WEEKOFYEAR UDF returns the week number from a date string. Just use the sketch_set UDAF from Brickhouse:

SELECT WEEKOFYEAR(ds) AS week,
       estimated_reach(sketch_set(account_id)) AS num_account_est
  FROM myquery
 GROUP BY WEEKOFYEAR(ds);

where myquery is a view representing the business logic you expressed above.

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow