Efficiently find top-N values from multiple columns independently in Oracle

https://stackoverflow.com/questions/7278905

18-01-2021
|

Question

Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the top N most-frequent values for each column independently, and with the most elegant SQL possible. For example, if I have

FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Ferris    Freemont Possum         Ubik
Nancy     Freemont Lemur          Housekeeping
Nancy     Drew     Penguin        Ubik
Bill      Ribbits  Lemur          Dhalgren

and I want the top-1, then the result would be:

FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Nancy     Freemont Lemur          Ubik

I can probably think of ways to do it, but not sure if they're optimal, which is important when there are 30 billion rows; and the SQL might be big and ugly, and maybe too much temp space would be used.

Using Oracle.

Solution

This should only do one pass over the table. You can use the analytic version of count() to get the frequency of each value independently:

select firstname, count(*) over (partition by firstname) as c_fn,
    lastname, count(*) over (partition by lastname) as c_ln,
    favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
    favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table;

FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
------ ---- -------- ---- ------- ---- ------------ ----
Bill      1 Ribbits     1 Lemur      2 Dhalgren        1
Ferris    1 Freemont    2 Possum     1 Ubik            2
Nancy     2 Freemont    2 Lemur      2 Housekeeping    1
Nancy     2 Drew        1 Penguin    1 Ubik            2

You can then use that as a CTE (or subquery factoring, I think in oracle terminology) and pull only the highest-frequency value from each column:

with tmp_tab as (
    select /*+ MATERIALIZE */
        firstname, count(*) over (partition by firstname) as c_fn,
        lastname, count(*) over (partition by lastname) as c_ln,
        favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
        favoritebook, count(*) over (partition by favoritebook) as c_fb
    from my_table)
select (select firstname from (
        select firstname,
            row_number() over (partition by null order by c_fn desc) as r_fn
            from tmp_tab
        ) where r_fn = 1) as firstname,
    (select lastname from (
        select lastname,
            row_number() over (partition by null order by c_ln desc) as r_ln
        from tmp_tab
        ) where r_ln = 1) as lastname,
    (select favoriteanimal from (
        select favoriteanimal,
            row_number() over (partition by null order by c_fa desc) as r_fa
        from tmp_tab
        ) where r_fa = 1) as favoriteanimal,
    (select favoritebook from (
        select favoritebook,
            row_number() over (partition by null order by c_fb desc) as r_fb
        from tmp_tab
        ) where r_fb = 1) as favoritebook
from dual;

FIRSTN LASTNAME FAVORIT FAVORITEBOOK
------ -------- ------- ------------
Nancy  Freemont Lemur   Ubik

You're doing one pass over the CTE for each column, but that should still only hit the real table once (thanks to the materialize hint). And you may want to add to the order by clauses to tweak what do to if there are ties.

This is similar in concept to what Thilo, ysth and others have suggested, except you're letting Oracle keep track of all the counting.

Edit: Hmm, explain plan shows it doing four full table scans; may need to think about this a bit more... Edit 2: Adding the (undocumented) MATERIALIZE hint to the CTE seems to resolve this; it's creating a transient temporary table to hold the results, and only does one full table scan. The explain plan cost is higher though - at least on this time sample data set. Be interested in any comments on any downside to doing this.

OTHER TIPS

The best I've come up so far with pure Oracle SQL is something similar to what @AlexPoole did. I use count(A) rather than count(*) to push nulls to the bottom.

with 
NUM_ROWS_RETURNED as (
    select 4 as NUM from dual
),
SAMPLE_DATA as (
    select /*+ materialize */ 
        A,B,C,D,E
    from (
                  select 1 as A, 1 as B, 4 as C, 1 as D, 4 as E from dual
        union all select 1     , -2    , 3     , 2     , 3      from dual
        union all select 1     , -2    , 2     , 2     , 3      from dual
        union all select null  , 1     , 1     , 3     , 2      from dual
        union all select null  , 2     , 4     , null  , 2      from dual
        union all select null  , 1     , 3     , null  , 2      from dual
        union all select null  , 1     , 2     , null  , 1      from dual
        union all select null  , 1     , 4     , null  , 1      from dual
        union all select null  , 1     , 3     , 3     , 1      from dual
        union all select null  , 1     , 4     , 3     , 1      from dual
    )
),
RANKS as (
    select /*+ materialize */ 
        rownum as RANKED 
    from 
        SAMPLE_DATA 
    where 
        rownum <= (select min(NUM) from NUM_ROWS_RETURNED)
)
select
    r.RANKED,
    max(case when A_RANK = r.RANKED then A else null end) as A,
    max(case when B_RANK = r.RANKED then B else null end) as B,
    max(case when C_RANK = r.RANKED then C else null end) as C,
    max(case when D_RANK = r.RANKED then D else null end) as D,
    max(case when E_RANK = r.RANKED then E else null end) as E
from (
    select 
        A,  dense_rank() over (order by A_COUNTS desc) as A_RANK,
        B,  dense_rank() over (order by B_COUNTS desc) as B_RANK,
        C,  dense_rank() over (order by C_COUNTS desc) as C_RANK,
        D,  dense_rank() over (order by D_COUNTS desc) as D_RANK,
        E,  dense_rank() over (order by E_COUNTS desc) as E_RANK
    from (
        select 
            A,  count(A) over (partition by A) as A_COUNTS,
            B,  count(B) over (partition by B) as B_COUNTS,
            C,  count(C) over (partition by C) as C_COUNTS,
            D,  count(D) over (partition by D) as D_COUNTS,
            E,  count(E) over (partition by E) as E_COUNTS
        from
            SAMPLE_DATA
    )
)
cross join 
    RANKS r
group by
    r.RANKED
order by
    r.RANKED
/

gives:

RANKED|   A|   B|   C|   D|   E
------|----|----|----|----|----
     1|   1|   1|   4|   3|   1
     2|null|  -2|   3|   2|   2
     3|null|   2|   2|   1|   3
     4|null|null|   1|null|   4

with plan:

--------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |              |     1 |    93 |    57  (20)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION        |              |       |       |            |          |
|   2 |   LOAD AS SELECT                  |              |       |       |            |          |
|   3 |    VIEW                           |              |    10 |   150 |    20   (0)| 00:00:01 |
|   4 |     UNION-ALL                     |              |       |       |            |          |
|   5 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|   6 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|   7 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|   8 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|   9 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|  10 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|  11 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|  12 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|  13 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|  14 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|  15 |   LOAD AS SELECT                  |              |       |       |            |          |
|* 16 |    COUNT STOPKEY                  |              |       |       |            |          |
|  17 |     VIEW                          |              |    10 |       |     2   (0)| 00:00:01 |
|  18 |      TABLE ACCESS FULL            | SYS_TEMP_0FD9|    10 |   150 |     2   (0)| 00:00:01 |
|  19 |     SORT AGGREGATE                |              |     1 |       |            |          |
|  20 |      FAST DUAL                    |              |     1 |       |     2   (0)| 00:00:01 |
|  21 |   SORT GROUP BY                   |              |     1 |    93 |    33  (34)| 00:00:01 |
|  22 |    MERGE JOIN CARTESIAN           |              |   100 |  9300 |    32  (32)| 00:00:01 |
|  23 |     VIEW                          |              |    10 |   800 |    12  (84)| 00:00:01 |
|  24 |      WINDOW SORT                  |              |    10 |   800 |    12  (84)| 00:00:01 |
|  25 |       WINDOW SORT                 |              |    10 |   800 |    12  (84)| 00:00:01 |
|  26 |        WINDOW SORT                |              |    10 |   800 |    12  (84)| 00:00:01 |
|  27 |         WINDOW SORT               |              |    10 |   800 |    12  (84)| 00:00:01 |
|  28 |          WINDOW SORT              |              |    10 |   800 |    12  (84)| 00:00:01 |
|  29 |           VIEW                    |              |    10 |   800 |     7  (72)| 00:00:01 |
|  30 |            WINDOW SORT            |              |    10 |   150 |     7  (72)| 00:00:01 |
|  31 |             WINDOW SORT           |              |    10 |   150 |     7  (72)| 00:00:01 |
|  32 |              WINDOW SORT          |              |    10 |   150 |     7  (72)| 00:00:01 |
|  33 |               WINDOW SORT         |              |    10 |   150 |     7  (72)| 00:00:01 |
|  34 |                WINDOW SORT        |              |    10 |   150 |     7  (72)| 00:00:01 |
|  35 |                 VIEW              |              |    10 |   150 |     2   (0)| 00:00:01 |
|  36 |                  TABLE ACCESS FULL| SYS_TEMP_0FD9|    10 |   150 |     2   (0)| 00:00:01 |
|  37 |     BUFFER SORT                   |              |    10 |   130 |    33  (34)| 00:00:01 |
|  38 |      VIEW                         |              |    10 |   130 |     2   (0)| 00:00:01 |
|  39 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9|    10 |   130 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
16 - filter( (SELECT MIN(4) FROM "SYS"."DUAL" "DUAL")>=ROWNUM)

But with one of the real tables, it looks like (for a slightly modified query):

----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |              |     1 |   422 |       |  6026M  (1)|999:59:59 |       |       |        |      |            |
|   1 |  TEMP TABLE TRANSFORMATION           |              |       |       |       |            |          |       |       |        |      |            |
|   2 |   LOAD AS SELECT                     |              |       |       |       |            |          |       |       |        |      |            |
|*  3 |    COUNT STOPKEY                     |              |       |       |       |            |          |       |       |        |      |            |
|   4 |     PX COORDINATOR                   |              |       |       |       |            |          |       |       |        |      |            |
|   5 |      PX SEND QC (RANDOM)             | :TQ10000     |    10 |       |       |     2   (0)| 00:00:01 |       |       |  Q1,00 | P->S | QC (RAND)  |
|*  6 |       COUNT STOPKEY                  |              |       |       |       |            |          |       |       |  Q1,00 | PCWC |            |
|   7 |        PX BLOCK ITERATOR             |              |    10 |       |       |     2   (0)| 00:00:01 |     1 |   115 |  Q1,00 | PCWC |            |
|   8 |         INDEX FAST FULL SCAN         | IDX          |    10 |       |       |     2   (0)| 00:00:01 |     1 |   115 |  Q1,00 | PCWP |            |
|   9 |   SORT GROUP BY                      |              |     1 |   422 |       |  6026M  (1)|999:59:59 |       |       |        |      |            |
|  10 |    MERGE JOIN CARTESIAN              |              |    22G|  8997G|       |  6024M  (1)|999:59:59 |       |       |        |      |            |
|  11 |     VIEW                             |              |  2289M|   872G|       |  1443M  (1)|999:59:59 |       |       |        |      |            |
|  12 |      WINDOW SORT                     |              |  2289M|   872G|   970G|  1443M  (1)|999:59:59 |       |       |        |      |            |
|  13 |       WINDOW SORT                    |              |  2289M|   872G|   970G|  1443M  (1)|999:59:59 |       |       |        |      |            |
|  14 |        WINDOW SORT                   |              |  2289M|   872G|   970G|  1443M  (1)|999:59:59 |       |       |        |      |            |
|  15 |         WINDOW SORT                  |              |  2289M|   872G|   970G|  1443M  (1)|999:59:59 |       |       |        |      |            |
|  16 |          WINDOW SORT                 |              |  2289M|   872G|   970G|  1443M  (1)|999:59:59 |       |       |        |      |            |
|  17 |           WINDOW SORT                |              |  2289M|   872G|   970G|  1443M  (1)|999:59:59 |       |       |        |      |            |
|  18 |            VIEW                      |              |  2289M|   872G|       |   248M  (1)|829:16:06 |       |       |        |      |            |
|  19 |             WINDOW SORT              |              |  2289M|   162G|   198G|   248M  (1)|829:16:06 |       |       |        |      |            |
|  20 |              WINDOW SORT             |              |  2289M|   162G|   198G|   248M  (1)|829:16:06 |       |       |        |      |            |
|  21 |               WINDOW SORT            |              |  2289M|   162G|   198G|   248M  (1)|829:16:06 |       |       |        |      |            |
|  22 |                WINDOW SORT           |              |  2289M|   162G|   198G|   248M  (1)|829:16:06 |       |       |        |      |            |
|  23 |                 WINDOW SORT          |              |  2289M|   162G|   198G|   248M  (1)|829:16:06 |       |       |        |      |            |
|  24 |                  WINDOW SORT         |              |  2289M|   162G|   198G|   248M  (1)|829:16:06 |       |       |        |      |            |
|  25 |                   PARTITION RANGE ALL|              |  2289M|   162G|       |  3587K  (4)| 11:57:36 |     1 |   115 |        |      |            |
|  26 |                    TABLE ACCESS FULL | LARGE_TABLE  |  2289M|   162G|       |  3587K  (4)| 11:57:36 |     1 |   115 |        |      |            |
|  27 |     BUFFER SORT                      |              |    10 |   130 |       |  6026M  (1)|999:59:59 |       |       |        |      |            |
|  28 |      VIEW                            |              |    10 |   130 |       |     2   (0)| 00:00:01 |       |       |        |      |            |
|  29 |       TABLE ACCESS FULL              | SYS_TEMP_0FD9|    10 |   130 |       |     2   (0)| 00:00:01 |       |       |        |      |            |
----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(ROWNUM<=10)
6 - filter(ROWNUM<=10)

I could use from LARGE_TABLE sample (0.01) to speed things up though, at risk of getting a distorted picture. This returned an answer in 53 minutes for a table with 2 billion rows.

You can't.

There's no trick here, just raw work.

Simply, you have to run through every row in the table, and count occurrences of each column you're interested in, and then sort those results to find the ones with the highest value.

For a single column, it's easy:

SELECT col, count(*) FROM table GROUP BY col ORDER BY count(*) DESC

and fetch the first row.

N columns equals N table scans.

If you write logic and pass through the table once, you then you count each instance of each value of each column.

If you have 30 billion rows, with 30 billion values, you get to store them all and they all have a count of 1. And you would get to do that for each column you care about.

If this information is important to you, you're better tracking it independently and incrementally as your data comes in. But that's a different problem.

Assuming that you have not too many distinct values in every column, you'd want to do the following:

Create a map for every column that keeps the counters for every distinct value
Read the whole table (row by row, but just once)
For every row, increase the counters
After that, look at your maps and pull out the most frequent values

For a single column, SQL would do that:

select value from (
   select value, count(*) from the_table
   group by value
   order by count(*) desc 
) where rownum < 2

However, if you just combined several of these into one big SQL, I think it would scan the table several times (once for each column), which you don't want. Can you get the execution plans for this?

So you probably have to write a program to do that, either on the server (PL/SQL or Java if available), or as a client program.

Loop through your records, keeping an in memory count of times each value for each column of interest was encountered.

Every so often (every X records, or when you've accumulated an amount of data that is reaching a fixed memory limit), loop through your memory counts and increment corresponding counts in some on-disk storage and clear the in memory info.

Details depend on what programming language you are using.

Below, I've put in a naive approach. It would be totally unworkable for datasets above a few hundred thousand, I think. Perhaps a guru can use it as a basis for a more suitable answer.

How current do the results of the query need to be? You could select the results of the "group by" part of the below query into some kind of cache, perhaps on a nightly basis.

Then you could do the final selection on that.

Another possibility would be to create a trigger on the table in question which would update a "counter" table with each insert/update/delete.

The counter table would look like this:

field_value   count
Nancy         2
Bill          1
Ferris        1

You'd have to have a counter table for each field you wanted to count.

In short, I think you need to think about ways to observe this data indirectly. I don't think there's going to be any way around the fact that doing the actual counts is going to take a long time. But if you have a way to incrementally track what has changed, then you only have to do the heavy lifting once. Then your cache + whats new should give you what you need.

select top 1
  firstname, COUNT(*) as freq
from
  (
  select
    'Ferris' as firstname, 'Freemont' as lastname,
    'Possum' as favoriteanimal, 'Ubik' as favoritebook
  union all
  select 'Nancy','Freemont','Lemur','Housekeeping'
  union all
  select 'Nancy','Drew','Penguin','Ubik'
  union all
  select 'Bill','Ribbits','Lemur','Dhalgren'
  ) sample_data
group by
  firstname
order by
  COUNT(*) desc

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow