How do I speed up a string manipulation query where I want to replace characters, extract certain values and update a table with the results?

dba.stackexchange https://dba.stackexchange.com/questions/273699

Question

I am trying to extract information from strings that are presented in a key-value format, with the keys and values being separated by commas. I want to extract the values associated with certain keys and add them to dedicated columns in my table.

Some notes on the data:

  1. The keys I am interested in are connected, as in keyB relates to keyA.
  2. In some cases keyB or keyA may not exist.
  3. If keyB doesn't exist, but keyA is something specific, then I can set the value for keyB anyway.

I have a solution that does what I want ([db-fiddle]), but it is painfully slow (9.6 hours) and I can't help thinking that there must be a better way, as I've not been in this DB game long.

For info, the table has ~8.2M rows and is hosted on AWS RDS on a t3.large instance (2 vCPUs, 8.0 GB of memory).

Some pointers on where I can improve this are much appreciated.

Solution

You only need a single INSERT and a single UPDATE for the main table. All of the default detection and population of the "variable" columns can be done while inserting the data:

INSERT INTO main_tags (id, tags, variablea, variableb)
SELECT id, tags, var_a,
       CASE
          -- if keyB had no value, derive one from the presence of specific keyA values
          WHEN var_b IS NULL AND 'valA' = ANY(tags) THEN 'valB_default'
          WHEN var_b IS NULL AND 'valA_xx' = ANY(tags) THEN 'valB_default_two'
          ELSE var_b
       END AS var_b
FROM (
  -- the value for a key is the array element immediately after that key
  SELECT id, tags,
         tags[array_position(tags, 'keyA') + 1] AS var_a,
         tags[array_position(tags, 'keyB') + 1] AS var_b
  FROM (
    -- strip braces and quotes, then split the string into a text array
    SELECT id, string_to_array(regexp_replace(tags, '[{}"]', '', 'g'), ',') AS tags
    FROM main
  ) AS b
) x
WHERE tags && array['keyA','keyB'];  -- keep only rows that mention keyA or keyB

The final update can also handle the "unknown" value directly, so there is no need to run three updates:

UPDATE main e
  SET variableA = coalesce(et.variableA, 'unknown'),
      variableB = coalesce(et.variableB, 'unknown')
FROM main_tags et
WHERE e.id = et.id;

One way to speed that up is to create an index on main_tags (id).
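
A minimal sketch of that index (the index name is just a placeholder; if id is unique in main_tags, a unique index or primary key works just as well):

CREATE INDEX main_tags_id_idx ON main_tags (id);
ANALYZE main_tags;  -- refresh planner statistics after bulk-loading the table

With the index in place, the planner at least has the option of joining by id lookups instead of hashing or sorting the whole table.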


You can actually get rid of the temp table completely and do everything in a single statement. This might or might not be faster, but I think it's worth trying:

UPDATE main e
  SET variableA = coalesce(et.var_a, 'unknown'),
      variableB = coalesce(et.var_b, 'unknown')
FROM (
  -- same derivation as in the INSERT above, used directly as the join source
  SELECT id, tags,
         var_a,
         CASE
            WHEN var_b IS NULL AND 'valA' = ANY(tags) THEN 'valB_default'
            WHEN var_b IS NULL AND 'valA_xx' = ANY(tags) THEN 'valB_default_two'
            ELSE var_b
         END AS var_b
  FROM (
    SELECT id,
           tags,
           tags[array_position(tags, 'keyA') + 1] AS var_a,
           tags[array_position(tags, 'keyB') + 1] AS var_b
    FROM (
      SELECT id, string_to_array(regexp_replace(tags, '[{}"]', '', 'g'), ',') AS tags
      FROM main
    ) AS b
  ) x
  WHERE tags && array['keyA','keyB']
) AS et
WHERE e.id = et.id;

OTHER TIPS

If your data examples are representative (as far as the length of the "tags" field goes), there are only a few minutes' worth of string processing here, even at ~10 million rows. The rest of the time is spent making repeated, unnecessary passes over the data, and at least one of them is probably done in a horribly inefficient IO pattern.

Why are you loading data that requires so much processing in the first place, rather than processing it before or during the load? It looks like every row can be processed individually, so you should be able to process the rows in Python or Perl and then stream the processed data into the database.

Why are you using a t3 instance? The t3 class has performance problems by design. If you care about performance (and apparently you do or you wouldn't be asking the question), you shouldn't use them.

Why do you care about performance anyway? It may have been slow, but it is (apparently) already done. Are you going to have to keep doing this on an ongoing basis? If so, will it happen from a clean table each time, or will you be adding more rows to a populated table, then only needing to process the new rows?

Is there a reason you need to update "main"? Creating a new table main_new as the result of a query between "main" and main_tags (or inserting into main_real based on that query) should be much more efficient than trying to do an in-place update. This is the classic "staging table" design.
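
A minimal sketch of that approach, assuming main only has the columns shown so far (main_new is just a placeholder name; add whatever other columns main actually has to the SELECT list):

CREATE TABLE main_new AS
SELECT m.id,
       m.tags,
       coalesce(mt.variablea, 'unknown') AS variablea,
       coalesce(mt.variableb, 'unknown') AS variableb
FROM main m
LEFT JOIN main_tags mt ON mt.id = m.id;

Writing the result into a fresh table avoids the dead row versions and random IO that a full-table UPDATE generates; once main_new is built and indexed, it can simply take the place of main.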

Finally, PostgreSQL does a poor job of planning bulk UPDATE...FROM. It plans the UPDATE just as if it were a SELECT, without estimating the cost of the UPDATE itself. This ignores the fact that some plans would update the table in sequential order, while others would jump all around the table, creating lots of slow IO.

If you did an EXPLAIN on your UPDATE...FROM, you would probably get something like this:

                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Update on main e  (cost=427663.40..953462.12 rows=8498274 width=119)
   ->  Hash Join  (cost=427663.40..953462.12 rows=8498274 width=119)
         Hash Cond: (et.id = e.id)
         ->  Seq Scan on main_tags et  (cost=0.00..204676.74 rows=8498274 width=74)
         ->  Hash  (cost=238341.62..238341.62 rows=8502862 width=49)
               ->  Seq Scan on main e  (cost=0.00..238341.62 rows=8502862 width=49)

This is probably fine as long as the hash table is held in memory, but if work_mem is low enough that it spills to disk, you will be doing a lot of random IO. If you just set enable_hashjoin=off, you will probably get a much more IO-friendly plan (although that could depend on how well clustered your table is on "id").
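
A minimal sketch of trying that for the current session only, using the UPDATE ... FROM shown above (RESET puts the setting back afterwards):

SET enable_hashjoin = off;  -- discourage the hash join for this session only
EXPLAIN
UPDATE main e
  SET variableA = coalesce(et.variableA, 'unknown'),
      variableB = coalesce(et.variableB, 'unknown')
FROM main_tags et
WHERE e.id = et.id;
RESET enable_hashjoin;      -- restore the default

With the hash join disabled, the planner usually falls back to a merge join or an index-based nested loop, which tends to touch main in a much more sequential order (assuming main is reasonably well clustered on id).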

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange