You're trying to do a unique count over a large set. One scalable approach is to use a probabilistic data structure such as a HyperLogLog or a KMV sketch set, like those provided in Brickhouse ( http://github.com/klout/brickhouse ). There is a blog post describing a situation just like yours at http://brickhouseconfessions.wordpress.com/2013/12/11/using-sketch_set-for-reach-estimation/ . This should give you a fairly close estimate without having to completely re-sort your data.
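To give a feel for why a KMV (k minimum values) sketch can estimate a unique count from a small, fixed amount of state, here is a minimal Python sketch of the idea. It is only an illustration, not Brickhouse's actual implementation: the function names, the use of MD5, and the default k are all assumptions made for the example.

```python
import hashlib
import heapq


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float between 0 and 1 (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k=256):
    """Estimate the number of distinct items by tracking the k smallest hash values."""
    heap = []        # max-heap (values negated) of the k smallest distinct hashes
    members = set()  # the same hashes, for duplicate detection
    for item in items:
        h = _hash01(item)
        if h in members:
            continue  # duplicate of a value already in the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct values seen: the sketch holds all of them.
        return float(len(heap))
    kth_smallest = -heap[0]
    # If n hashes are spread uniformly over (0, 1), the k-th smallest sits
    # near k / n, which gives the usual estimator (k - 1) / kth_smallest.
    return (k - 1) / kth_smallest
```

In Hive you would not write any of this yourself; the sketch_set and estimated_reach functions from Brickhouse play these two roles (building the sketch and reading the estimate off it).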
If I understand you correctly, you just want to aggregate by week. Hive has a WEEKOFYEAR UDF, which returns the week number from a date string, so just use the sketch_set UDAF from Brickhouse:
SELECT WEEKOFYEAR(ds), estimated_reach(sketch_set(account_id)) AS num_account_est
FROM myquery
GROUP BY WEEKOFYEAR(ds);
where myquery is a view representing the business logic you expressed above.