Question

Suppose I have these two data frames (simplified for my question):

users

+---------+
| user_id |
+---------+
| 1       |
| 2       |
| ...     |
+---------+

articles

+------------+------------+
| article_id |    date    |
+------------+------------+
| a          | 2019-01-01 |
| b          | 2018-03-03 |
| ...        |            |
+------------+------------+

And a dense matrix of user-article pairs where each value is how much I predict each user would want to read each article (from 0 to 1):

+-----+------+------+-----+
|     |  1   |  2   | ... |
+-----+------+------+-----+
| a   | 0.54 | 0.99 | ... |
| b   | 0    | 0.7  | ... |
| ... | ...  | ...  | ... |
+-----+------+------+-----+

I have a web app that needs to do things like return the top 10 most recommended articles for a single user, or the 11th-20th most recommended, within a given date range, etc.:

query: (user_id=123) AND (news_date IN ('2019-04-01', '2019-05-01')) LIMIT 10 OFFSET 10

+---------+-------+------+
| news_id | score | rank |
+---------+-------+------+
| g       | 0.98  | 11   |
| d       | 0.97  | 12   |
| ...     | ...   | ...  |
| q       | 0.8   | 20   |
+---------+-------+------+

The challenge is that I have users and articles numbering in the tens of thousands, so I can't just store my matrix as a Postgres table due to its column limit (roughly 1,600 columns per table).

I could store the recommendation scores in Postgres in a table as (user_id, article_id, score), which would be fast for querying, but this table would have 100M+ rows and be expensive to update, which I do daily.
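
For concreteness, the kind of query I would need against such a table looks roughly like this (just a sketch; scores is a hypothetical name for that (user_id, article_id, score) table, and I've written the date filter as a range rather than an IN list):

-- 11th-20th most recommended articles for user 123 within a date range
SELECT s.article_id, s.score,
       row_number() OVER (ORDER BY s.score DESC, s.article_id) AS rank
FROM scores s
JOIN articles a ON a.article_id = s.article_id
WHERE s.user_id = 123
  AND a.date BETWEEN '2019-04-01' AND '2019-05-01'
ORDER BY s.score DESC, s.article_id
LIMIT 10 OFFSET 10;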

My current solution is to store a single data frame (news_id, news_date, user_1_score, user_2_score, ..., user_n_score) as a gzipped Parquet file on disk, load the news_date and user_x_score columns, then filter, sort and slice. The only downside is that my web host has an ephemeral file system, so this file needs to be downloaded when the app boots. At least fetching the data during a web request is fast enough.

I don't know much about columnar data stores, but I have a feeling that one of these products might be good for my problem. Does anyone have an idea?

Solution

"but this table would have 100M+ rows and be expensive to update, which I do daily."

In order to refute this, I did the following:

CREATE TABLE test_article (
    the_series integer,
    user_id integer,
    article_id integer,
    rating numeric
);

Turn on timing, so that we have proper metrics:

\timing

Then, I inserted 10 million records into test_article:

INSERT INTO test_article
SELECT
  generate_series(1, 10000000),
  CAST(RANDOM() * 10 + 1 AS INTEGER),
  CAST(RANDOM() * 100 + 1 AS INTEGER),
  ROUND(CAST(RANDOM() AS NUMERIC), 2);

TIME:

INSERT 0 10000000
Time: 33520.809 ms (00:33.521)

Table content (sample):

test=# SELECT * FROM test_article;

 the_series | user_id | article_id | rating 
------------+---------+------------+--------
          1 |       5 |         85 |   0.95
          2 |       6 |         41 |   0.14
          3 |       5 |         90 |   0.34
          4 |       3 |         18 |   0.32
          5 |       7 |          6 |   0.30
          6 |      10 |         32 |   0.31
          7 |       8 |         70 |   0.84

I realise that this is not a perfect benchmark. In order for it to be so, there would have to be a UNIQUE index on (user_id, article_id) - however, in order to make it as realistic as possible, I'm going to put a (non-unique) index on those fields. I believe that it's not a huge distortion. EDIT: see below - this problem has since been resolved!

So, I created the index:

CREATE INDEX user_article_ix ON test_article (user_id, article_id);

TIME:

CREATE INDEX
Time: 20556.118 ms (00:20.556)

Then, I inserted 100K records:

INSERT INTO test_article
SELECT
  generate_series(1, 100000),
  CAST(RANDOM() * 10 + 1 AS INTEGER),
  CAST(RANDOM() * 100 + 1 AS INTEGER),
  ROUND(CAST(RANDOM() AS NUMERIC), 2);

TIME:

INSERT 0 100000
Time: 996.115 ms

Less than 1 second!

So, it would appear that there is no problem with inserting a large number of records into your linking table (also called an Associative Entity - aka joining table, association table...).

So, I very much suggest that you should go with this as a solution!

Unique combination of user_id and article_id.

After much wailing and gnashing of teeth, I finally figured out how to make the combination of user_id and article_id be unique (because any given user can only have one current rating of an article) using generate_series.

I won't show every step, just the ones which helped with uniqueness - based on what's above:

The "secret sauce" was this bit:

INSERT INTO test_article (user_id, article_id) 
SELECT * FROM
(
  WITH x AS
  (
    SELECT generate_series(1, 500) AS bill
  ),
  y AS
  (
    SELECT generate_series(1, 20000) AS fred
  )
  SELECT * FROM x
  CROSS JOIN y
) AS z
ORDER BY bill, fred;

It involves CROSS JOINing a table of 500 (i.e. users) with a table of 20,000 (i.e. articles) - the astute among you will realise that the product of these is 10,000,000 (seen above).

Now, the combination of user_id and article_id is guaranteed to be unique, because with (as a small sample) bill running from 1 to 2 and fred from 1 to 3, you get:

 bill | fred 
------+------
    1 |    1
    1 |    2
    1 |    3
    2 |    1
    2 |    2
    2 |    3

Every record is unique - et voilà!

In any case, I used this construct to test for dupes:

SELECT (user_id, article_id)::text, count(*)
FROM test_article
WHERE 1 = (SELECT 1)
GROUP BY user_id, article_id
HAVING count(*) > 1

TIME: 4s.

You can then make (user_id, article_id) the PRIMARY KEY (not shown - only took around 30s).
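
For reference, the step that's "not shown" is roughly this (a sketch; Postgres marks the key columns NOT NULL as part of adding the key, and the earlier non-unique user_article_ix index becomes redundant):

-- Promote the unique (user_id, article_id) combination to the primary key
ALTER TABLE test_article
    ADD CONSTRAINT test_article_pkey PRIMARY KEY (user_id, article_id);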

Then, to add 100,000 records, you leave the users alone (still 1 - 500), but you modify the generate_series() for the articles to run from 20,001 to 20,200 (i.e. 500 users x 200 new articles = 100,000 rows) and do the same INSERT as above, as sketched below. Blisteringly fast - even with the PRIMARY KEY (< 1s).
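
That modified INSERT would look roughly like this (same construct as above, only the article range changes):

INSERT INTO test_article (user_id, article_id)
SELECT * FROM
(
  WITH x AS
  (
    SELECT generate_series(1, 500) AS bill          -- same 500 users
  ),
  y AS
  (
    SELECT generate_series(20001, 20200) AS fred    -- 200 new articles
  )
  SELECT * FROM x
  CROSS JOIN y
) AS z
ORDER BY bill, fred;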

Obtaining all the articles of a particular user is very fast (~25 ms):

test=# EXPLAIN(ANALYZE, BUFFERS) SELECT * FROM test_article WHERE user_id = 77;
                                                                  QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_article_pkey on test_article  (cost=0.44..65174.74 rows=44503 width=44) (actual time=0.074..21.837 rows=20200 loops=1)
   Index Cond: (user_id = 77)
   Buffers: shared hit=40371 read=361 dirtied=271
 Planning Time: 0.131 ms
 Execution Time: 23.475 ms
(5 rows)

Time: 24.187 ms

And the pièce de résistance, a point search on the PK (< 1 ms):

test=# EXPLAIN(ANALYZE, BUFFERS) SELECT * FROM test_article WHERE user_id = 77 AND article_id = 4567;
                                                            QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_article_pkey on test_article  (cost=0.44..10.22 rows=2 width=44) (actual time=0.038..0.040 rows=1 loops=1)
   Index Cond: ((user_id = 77) AND (article_id = 4567))
   Buffers: shared hit=4
 Planning Time: 0.219 ms
 Execution Time: 0.078 ms
(5 rows)

Time: 0.947 ms

OTHER TIPS

When working with relational databases, stop thinking in matrices, think in relational terms instead. What you describe is a typical many-to-many relationship between users and articles, normally implemented using a relationship (link) table, as you mentioned.
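
A sketch of that normalized design, with illustrative names and types (the third table is the link table):

CREATE TABLE users (
    user_id integer PRIMARY KEY
);

CREATE TABLE articles (
    article_id   text PRIMARY KEY,
    article_date date NOT NULL
);

-- One row per (user, article) pair, carrying the predicted score
CREATE TABLE user_article_score (
    user_id    integer NOT NULL REFERENCES users,
    article_id text    NOT NULL REFERENCES articles,
    score      numeric NOT NULL CHECK (score BETWEEN 0 AND 1),
    PRIMARY KEY (user_id, article_id)
);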

A column-organized data store is not the answer, primarily because it's simply a different physical implementation of the same old relational model and therefore subject to the same table width and update performance limitations.

If your statement about "100M+ rows being expensive to update" is based on actual performance testing, you should ask a concrete question about the update performance, and I'm sure we'll be able to help with that. If it's just a presumption, I suggest you try it and see whether it holds.
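
If you do test it, one straightforward pattern for the daily refresh is to bulk-load the new predictions into a staging table (e.g. with COPY) and then upsert them into the link table. A sketch, where daily_scores_staging is a hypothetical staging table with the same three columns as the link table above:

INSERT INTO user_article_score (user_id, article_id, score)
SELECT user_id, article_id, score
FROM daily_scores_staging
ON CONFLICT (user_id, article_id)
DO UPDATE SET score = EXCLUDED.score;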

You might consider using SQL Server. Tables with a COLUMN_SET column can have up to 30,000 sparse columns, and performance is really great. SQL Server 2017+ is also Linux-compatible.
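
For reference, a table with sparse columns and a column set is declared roughly like this in T-SQL (a minimal sketch with only a couple of user columns, not the full DDL):

CREATE TABLE dbo.article_scores
(
    article_id   int PRIMARY KEY,
    article_date date NOT NULL,
    user_1_score decimal(3, 2) SPARSE NULL,
    user_2_score decimal(3, 2) SPARSE NULL,
    -- ... one sparse column per user, up to 30,000 ...
    all_scores   xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);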

I wrote a blog post about it here.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange