Question

I have an iPhone app connected to a Django server running on Heroku. The user taps a word (like "cape") and the app then queries the server for any other passages containing that word. Right now I do a SQL query with some regex:

SELECT "connectr_passage"."id", "connectr_passage"."third_party_id", "connectr_passage"."third_party_created", "connectr_passage"."source", "connectr_passage"."text", "connectr_passage"."author", "connectr_passage"."raw_data", "connectr_passage"."retweet_count", "connectr_passage"."favorited_count", "connectr_passage"."lang", "connectr_passage"."location", "connectr_passage"."author_followers_count", "connectr_passage"."created", "connectr_passage"."modified" FROM "connectr_passage" WHERE ("connectr_passage"."text" ~ E'(?i)\ycape\y' AND NOT ("connectr_passage"."text" ~ E'https?://' ))

On a table with about 412K rows of data, using the $9 'dev' database, this query takes 1320 ms, so for the app user it feels pretty slow since the total response time is even higher.

With the exact same database on my local machine (MBP, 8 GB RAM, SSD), this query takes 629.214 ms.

I understand the dev database has some limitations (no in-memory caching and such), so my questions are:

  1. Is there some way I can speed things up? Adding an index on the text column didn't seem to help.

  2. Will upgrading to one of the production databases significantly improve this performance? They're pretty expensive for my needs.

  3. Are there any other good alternatives you know of for hosting a database connected to Heroku?

  4. Any recommended alternatives to a regex SQL query for searching terms? I was thinking about creating a custom index store of words or something; maybe there's a plugin for that somewhere. Haystack? (One such approach is sketched below.)
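
For example, a word-level index could be built with PostgreSQL's built-in full-text search. This is only a minimal sketch of that idea, not something I have running; the index name is made up, and 'simple' is used to avoid stemming so 'cape' matches only 'cape':

-- Sketch only: a word-level (full-text) index on the text column.
create index connectr_passage_text_fts on connectr_passage
    using gin (to_tsvector('simple', text));

-- A word search then becomes an index-assisted full-text query
-- instead of a regex scan over every row:
select id, text
from connectr_passage
where to_tsvector('simple', text) @@ plainto_tsquery('simple', 'cape')
  and not (text ~ 'https?://');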


----- Edit -----

Here is what the elephant has to say about my query:

Sort  (cost=16979.75..16979.83 rows=34 width=175) (actual time=616.131..616.132 rows=18 loops=1)
   Sort Key: author_followers_count
   Sort Method:  quicksort  Memory: 30kB
   ->  Seq Scan on connectr_passage  (cost=0.00..16978.89 rows=34 width=175) (actual time=10.863..616.027 rows=18 loops=1)
         Filter: (((text)::text ~ '(?i)\\ycape\\y'::text) AND ((text)::text !~ 'https?://'::text))
 Total runtime: 616.229 ms

So it looks like it's doing a full table scan, meaning the index isn't being used. I'm a Postgres newbie so I'm not sure I have this right, but here is my index (created by setting db_index=True in the Django model):

public | connectr_passage_text                                | index | connectr | connectr_passage
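
Checking the catalog (generic query below, nothing schema-specific) shows it's just a standard b-tree index, and as far as I can tell a b-tree index can't be used for the ~ regex operator at all, which would explain the sequential scan:

-- List the index definitions on the passages table; db_index=True in
-- Django produces a plain b-tree index, which the ~ operator cannot use.
select indexname, indexdef
from pg_indexes
where tablename = 'connectr_passage';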

Another edit:

Here is the latest, after using the pg_trgm extension.

create extension pg_trgm;
create index passage_trgm_gin on connectr_passage using gin (text gin_trgm_ops);

First attempt:

d2lgd5pcso4g2k=> explain analyze select * from connectr_passage where text ~ E'cape\y';
                                                      QUERY PLAN                                                       

 Seq Scan on connectr_passage  (cost=0.00..28627.30 rows=95 width=177) (actual time=2647.828..2647.828 rows=0 loops=1)
   Filter: ((text)::text ~ 'capey'::text)
   Rows Removed by Filter: 970514
 Total runtime: 2647.866 ms
(4 rows)
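
From what I can tell, a trigram index can only drive LIKE/ILIKE-style pattern matches on this Postgres version (regex support for pg_trgm indexes only arrived in 9.3, which matches the seq scan above), so a bare ~ still means scanning every row. show_trgm() from the extension illustrates what the index actually stores:

-- pg_trgm indexes three-character chunks ("trigrams") of each passage,
-- which is why a substring pattern like '%cape%' can be matched against it.
select show_trgm('cape');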

Damn, still super slow. But wait, what if I do a simple LIKE filter before the regex:

d2lgd5pcso4g2k=> explain analyze select * from connectr_passage where text like '%cape%' and text ~ E'(?i)\ycape\y';
                                                          QUERY PLAN                                                          

 Bitmap Heap Scan on connectr_passage  (cost=578.14..762.70 rows=1 width=177) (actual time=11.432..11.432 rows=0 loops=1)
   Recheck Cond: ((text)::text ~~ '%cape%'::text)
   Rows Removed by Index Recheck: 165
   Filter: ((text)::text ~ '(?i)ycapey'::text)
   Rows Removed by Filter: 468
   ->  Bitmap Index Scan on passage_trgm_gin  (cost=0.00..578.14 rows=95 width=0) (actual time=8.845..8.845 rows=633 loops=1)
         Index Cond: ((text)::text ~~ '%cape%'::text)
 Total runtime: 11.479 ms
(8 rows)

Superfast!
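
One caveat I spotted in the plans above: the single backslashes in E'(?i)\ycape\y' get swallowed, so the Filter line shows '(?i)ycapey' and the regex was really matching the literal text "ycapey" rather than a word boundary; doubling the backslashes fixes that. Also, LIKE '%cape%' is case-sensitive while the regex isn't, and the trigram index can serve ILIKE too. So a sketch of the full query the app would actually run, with the original URL filter and ordering added back in:

select *
from connectr_passage
where text ilike '%cape%'              -- pre-filter that can use passage_trgm_gin
  and text ~ E'(?i)\\ycape\\y'         -- real word-boundary match, case-insensitive
  and not (text ~ E'https?://')        -- exclude passages containing links
order by author_followers_count;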


Solution

So this is pretty much solved thanks to mu-is-too-short's suggestion and a bit of googling. Basically PostgreSQL's pg_trgm extension solved the problem and led to an 800x faster query!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow