Question

I'm trying to run a SELECT statement on a PostgreSQL database and save its result to a file.

The code runs in my environment but fails once I run it on a lightweight server.

I monitored it and saw that it fails after several seconds because it runs out of memory (the machine has only 512 MB of RAM). I didn't expect this to be a problem, as all I want to do is save the whole result set to disk as a JSON file.

I was planning to use the fetchrow_array or fetchrow_arrayref functions, hoping to fetch and process only one row at a time.

Unfortunately, I discovered that with DBD::Pg there is no difference in the actual fetch operations between the two functions above and fetchall_arrayref. My script fails at the $sth->execute() call, even before it has a chance to call any fetch... function.

This suggests to me that the implementation of execute in DBD::Pg actually fetches ALL the rows into memory, leaving only the formatting of the returned data to the fetch... functions.
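
For reference, the relevant part of my script boils down to something like this (simplified, with an illustrative table name):

my $sth = $dbh->prepare('SELECT * FROM big_table');
$sth->execute();                          # dies here with out-of-memory
while (my $row = $sth->fetchrow_arrayref) {
  # encode $row and append it to the JSON file -- never reached
}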

A quick look at the DBI documentation gives a hint:

If the driver supports a local row cache for SELECT statements, then this attribute holds the number of un-fetched rows in the cache. If the driver doesn't, then it returns undef. Note that some drivers pre-fetch rows on execute, whereas others wait till the first fetch.

So in theory I would just need to set the RowCacheSize parameter. I've tried, but this feature doesn't seem to be implemented by DBD::Pg:

Not used by DBD::Pg
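
For completeness, this is the kind of thing I tried (the attribute is accepted but, as far as I can tell, simply ignored by DBD::Pg):

$dbh->{RowCacheSize} = 100;   # hint for a 100-row local cache -- no effect with DBD::Pg
my $sth = $dbh->prepare('SELECT * FROM big_table');
$sth->execute();              # still tries to pull everything into memory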

I find this limitation a huge general problem (the execute() call pre-fetches all rows?), and I'm more inclined to believe that I'm missing something here than that this is actually a true limitation of interacting with PostgreSQL databases from Perl.


Update (2014-03-09): My script works now thanks to the workaround described in my comment to Borodin's answer. The maintainer of the DBD::Pg library got back to me on the issue, saying that the root cause lies deeper, inside libpq (the PostgreSQL client library that DBD::Pg uses). Also, I think a very similar issue to the one described here affects pgAdmin. Despite being a native PostgreSQL tool, it still doesn't offer an option to set a default limit on the number of rows in a result set. This is probably why its Query tool sometimes waits a good while before presenting the results of bulky queries, and in some cases may break the application altogether.


Solution

In the Cursors section, the documentation for the database driver says this:

Therefore the "execute" method fetches all data at once into data structures located in the front-end application. This fact must to be considered when selecting large amounts of data!

So your supposition is correct. However, the same section goes on to describe how you can use cursors in your Perl application to read the data in chunks. I believe this would fix your problem.
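
Here is a rough, untested sketch of that cursor approach (the cursor name, chunk size, and table name are just placeholders):

$dbh->begin_work;    # cursors must live inside a transaction

$dbh->do('DECLARE csr CURSOR FOR SELECT * FROM table');

while (1) {
  my $sth = $dbh->prepare('FETCH 1000 FROM csr');
  $sth->execute;
  last if $sth->rows == 0;    # cursor is exhausted

  while (my @data = $sth->fetchrow_array) {
    # Process data -- at most 1000 rows are in client memory at any time
  }
}

$dbh->do('CLOSE csr');
$dbh->commit;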

Another alternative is to use OFFSET and LIMIT clauses on your SELECT statement to emulate cursor functionality. If you write

my $select = $dbh->prepare('SELECT * FROM table OFFSET ? LIMIT 1');

then you can say something like (all of this is untested)

my $i = 0;
while ($select->execute($i++)) {
  my @data = $select->fetchrow_array;
  last unless @data;    # stop once OFFSET has moved past the last row
  # Process data
}

to read your table one row at a time.

You may find that you need to increase the chunk size to get an acceptable level of efficiency.
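
For example, something along these lines (again untested) processes the table in chunks of 1000 rows while keeping at most one chunk in memory:

my $chunk  = 1000;
my $select = $dbh->prepare('SELECT * FROM table OFFSET ? LIMIT ?');

my $offset = 0;
while (1) {
  $select->execute($offset, $chunk);
  my $rows = $select->fetchall_arrayref;   # at most $chunk rows at a time
  last unless @$rows;
  # Process @$rows
  $offset += $chunk;
}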

Licensed under: CC-BY-SA with attribution