Question

I want to join three tables to count the shared and non-shared users in each group. Ideally I want to produce something similar to this.

Image borrowed from www.mathcentral.uregina.ca

I have a big table which stores data for many classes, including W, D and P. To assess whether a user is attending more than one class, I would create a sub-query for each group and merge the resulting tables with a full join. My issue is that I have to join all three tables on the same primary key (userpid). I've tried to write the query this way, but got an error message from Hive saying the IS NULL operator only accepts one argument.

select
        isnull(iq1.userpid, iq2.userpid) userpid,
        groupW,
        groupD,
        sum( case when iq2.userpid is not null then 1 else 0 end ) groupP


from
        ( select
                isnull(q1.userpid, q2.userpid) userpid,
                ( case when q1.userpid is not null then 1 else 0 end ) groupW,
                ( case when q2.userpid is not null then 1 else 0 end ) groupD

        from
                -- users who attended class W
                ( select distinct userpid
                from store.attendance
                where classid = 1165
                  and datehour between '2014-04-28 00:00:00' and '2014-04-30 00:00:00'
                 ) q1

                full outer join

                -- users who attended class D
                ( select distinct userpid
                from store.attendance
                where classid= 1174
                  and datehour between '2014-04-28 00:00:00' and '2014-04-30 00:00:00'
                 ) q2

                  on q1.userpid = q2.userpid ) iq1

                full outer join

                -- users who attended class P
                ( select distinct userpid
                from store.attendance
                where classid = 1173
                  and datehour between '2014-04-28 00:00:00' and '2014-04-30 00:00:00'
                 ) iq2

                  on iq2.userpid = iq1.userpid

;

Is there another function, or another way to write the query, that I can use to achieve the same goal? I could then get the number of shared and non-shared users by class, either with a series of case when expressions or by processing the result in R or Python.


Solution

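In Hive, isnull() takes a single argument and returns a boolean; the function that returns the first non-null of several values is coalesce(). That said, I would suggest avoiding the joins altogether and doing this with two layers of aggregation. First create a 0/1 flag per user for each class. Then aggregate by those flags, so that each output row is one region of the Venn diagram: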

select has_1165, has_1174, has_1173, count(*) as cnt, min(userpid), max(userpid)
from (select userpid,
             max(case when classid = 1165 then 1 else 0 end) as has_1165,
             max(case when classid = 1174 then 1 else 0 end) as has_1174,
             max(case when classid = 1173 then 1 else 0 end) as has_1173
      from store.attendance
      where classid in (1165, 1173, 1174) and
            datehour between '2014-04-28 00:00:00' and '2014-04-30 00:00:00'
      group by userpid
     ) a
group by has_1165, has_1174, has_1173;
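
To see what the result looks like, here is a small sketch that runs the same query on made-up data. It uses Python's built-in sqlite3 module rather than Hive, so treat it as an illustration only: the sample rows, the venn_counts helper and the unqualified table name attendance (instead of store.attendance) are assumptions of mine, not part of the original setup. The min/max columns are left out and an order by is added to keep the output stable.

```python
import sqlite3

# Hypothetical toy rows: (userpid, classid, datehour). Not from the original post.
ROWS = [
    (1, 1165, "2014-04-28 09:00:00"),  # user 1: class W only
    (2, 1165, "2014-04-28 10:00:00"),  # user 2: classes W and D
    (2, 1174, "2014-04-29 10:00:00"),
    (3, 1165, "2014-04-28 11:00:00"),  # user 3: classes W, D and P
    (3, 1174, "2014-04-28 12:00:00"),
    (3, 1173, "2014-04-29 13:00:00"),
    (3, 1173, "2014-04-29 14:00:00"),  # repeat visit, must not be double counted
    (4, 1173, "2014-04-29 15:00:00"),  # user 4: class P only
    (5, 1174, "2014-05-02 09:00:00"),  # outside the date window, filtered out
    (6, 9999, "2014-04-28 09:00:00"),  # unrelated class, filtered out
]

# Same two-layer aggregation as above: the inner query builds one row of
# flags per user, the outer query counts users per flag combination.
QUERY = """
select has_1165, has_1174, has_1173, count(*) as cnt
from (select userpid,
             max(case when classid = 1165 then 1 else 0 end) as has_1165,
             max(case when classid = 1174 then 1 else 0 end) as has_1174,
             max(case when classid = 1173 then 1 else 0 end) as has_1173
      from attendance
      where classid in (1165, 1173, 1174) and
            datehour between '2014-04-28 00:00:00' and '2014-04-30 00:00:00'
      group by userpid
     ) a
group by has_1165, has_1174, has_1173
order by has_1165, has_1174, has_1173
"""


def venn_counts(rows):
    """Load rows into an in-memory table and return {(W, D, P) flags: user count}."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table attendance (userpid integer, classid integer, datehour text)"
        )
        conn.executemany("insert into attendance values (?, ?, ?)", rows)
        return {(w, d, p): cnt for w, d, p, cnt in conn.execute(QUERY)}
    finally:
        conn.close()


if __name__ == "__main__":
    for flags, cnt in venn_counts(ROWS).items():
        print(flags, cnt)
```

Each key is a (has_1165, has_1174, has_1173) combination, so (1, 1, 0) counts the users who attended W and D but not P. Combinations that no user falls into simply do not appear in the result.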
Licensed under: CC-BY-SA with attribution