Question

I am running a job class that contains the following:

  • A PostgreSQL connection that can issue SQL statements
  • A Redshift connection that can do the same
  • An S3 connection to act as an intermediary between the two

My current approach is to handle the column-type differences between PostgreSQL and Redshift on a case-by-case basis. The column types in use (disregarding duplicate types; I can give the rest if needed) are:

 id                             integer          NOT NULL,
 client_id                      integer          NOT NULL,
 manual_dt                      date,
 scheduled_at                   timestamp,
 some_other_id                  varchar(255),
 is_good                        integer          DEFAULT 0 NOT NULL,
 url                            varchar(4096)    NOT NULL,
 image_width                    integer,
 image_dim_ratio                float8,
 invalid_reason                 varchar(256)     DEFAULT NULL::character varying

There are a total of ~2.3 million records that need to be copied in this initial load. The job does a SELECT wrapped in a COPY on PostgreSQL, uploads the result to S3, then uses Redshift's COPY to load it from the S3 source. All of this can be posted if needed too.
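In outline, the flow looks something like this (the table name, bucket, DSNs, and IAM role below are simplified placeholders, not my real values):

  import io

  import boto3
  import psycopg2

  PG_DSN = "host=pg-replica dbname=appdb user=etl"           # placeholder
  REDSHIFT_DSN = "host=redshift-cluster dbname=dw user=etl"  # placeholder
  BUCKET = "my-etl-bucket"                                   # placeholder
  KEY = "exports/my_table.csv"                               # placeholder

  # 1. Export from PostgreSQL with COPY ... TO STDOUT into an in-memory buffer.
  buf = io.BytesIO()
  with psycopg2.connect(PG_DSN) as pg, pg.cursor() as cur:
      cur.copy_expert("COPY (SELECT * FROM my_table) TO STDOUT WITH CSV", buf)
  buf.seek(0)

  # 2. Stage the export on S3.
  boto3.client("s3").upload_fileobj(buf, BUCKET, KEY)

  # 3. Load it into Redshift with Redshift's COPY command.
  with psycopg2.connect(REDSHIFT_DSN) as rs, rs.cursor() as cur:
      cur.execute(
          f"COPY my_table FROM 's3://{BUCKET}/{KEY}' "
          "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' CSV"
      )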

This works on loads of 100, 1000, 10000, 100000, and 1000000 records. But if I go for the entire set, or a LIMIT of the exact record count, I receive the following trace:

psycopg2.extensions.TransactionRollbackError: canceling statement due to conflict with recovery

DETAIL: User query might have needed to see row versions that must be removed.

Is this error the actual cause of the failed copy, or am I on the wrong track and something else is the issue?


Solution

This is because you are querying a PostgreSQL hot-standby replica, which won't let queries run for very long. The DETAIL line spells it out: a VACUUM on the primary removed row versions that your long-running SELECT still needed, so the standby canceled the statement rather than delay WAL replay any further. Your smaller batches finish before that conflict arises; the full 2.3-million-row export does not. Ask the database administrator to raise max_standby_streaming_delay on that replica (or enable hot_standby_feedback, which stops the primary from removing row versions a standby query still needs), or run the export against the primary instead.
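If changing the replica's settings is off the table, another option is to keep each statement short enough to finish before the standby cancels it. Below is a minimal sketch of a chunked export using keyset pagination; it assumes id is an indexed, positive integer key, and the table name and DSN are placeholders.

  import io

  import psycopg2

  CHUNK = 100_000  # rows per statement; size this to finish well inside
                   # the replica's max_standby_streaming_delay

  def export_in_chunks(dsn, out):
      last_id = 0  # assumes ids start above 0
      # Default READ COMMITTED isolation: each statement takes a fresh
      # snapshot, so no single long-lived snapshot conflicts with recovery.
      with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
          while True:
              # Find the upper id bound of the next chunk.
              cur.execute(
                  "SELECT max(id) FROM (SELECT id FROM my_table"
                  " WHERE id > %s ORDER BY id LIMIT %s) t",
                  (last_id, CHUNK),
              )
              upper = cur.fetchone()[0]
              if upper is None:
                  break  # no rows left
              # copy_expert takes no bind parameters, but both bounds are
              # integers just read from the database, so inlining is safe.
              cur.copy_expert(
                  f"COPY (SELECT * FROM my_table WHERE id > {last_id}"
                  f" AND id <= {upper}) TO STDOUT WITH CSV",
                  out,
              )
              last_id = upper

  buf = io.BytesIO()
  export_in_chunks("host=pg-replica dbname=appdb user=etl", buf)  # placeholder

Each chunk could also be uploaded as its own S3 object under a common key prefix; Redshift's COPY accepts a prefix, so all of the part files load in a single command.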

Licensed under: CC-BY-SA with attribution