Question

I am working on an API to query a database server (Oracle in my case) to retrieve massive amounts of data. (This is actually a layer on top of JDBC.)

The API I created tries, as much as possible, to avoid loading all of the queried information into memory. By that I mean that I prefer to iterate over the result set and process the returned rows one by one, instead of loading all the rows into memory and processing them later.
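A minimal sketch of this row-by-row style, assuming a plain JDBC `Connection` (the helper name, the callback shape, and the fetch size are my own, not the actual API):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.function.Consumer;

public class StreamingQuery {
    // Hypothetical helper: iterate over a query's rows one at a time,
    // handing each row to a callback instead of materializing a list.
    static void forEachRow(Connection conn, String sql, Consumer<ResultSet> rowHandler) {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(100); // hint to the driver: stream in batches, not all at once
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rowHandler.accept(rs); // only the current row needs to be in memory
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The result set stays open for the entire traversal, which is exactly the trade-off discussed below.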

But I am wondering if this is the best practice since it has some issues:

  • The result set is kept open for the whole duration of the processing; if the processing takes as long as retrieving the data, the result set stays open twice as long.
  • Running another query inside my processing loop means opening a second result set while I am already using one, and it may not be a good idea to have too many result sets open simultaneously.
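The kind of nested query I have in mind looks like this (table and column names are made up; at least the inner statement is prepared only once, and only its result set is opened and closed per iteration, so at most two result sets are open at any moment):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class NestedLookup {
    // Hypothetical example: for each order row, look up its customer
    // with a second statement while the outer result set is still open.
    static int enrichOrders(Connection conn) {
        int matched = 0;
        try (PreparedStatement outer = conn.prepareStatement("SELECT customer_id FROM orders");
             PreparedStatement inner = conn.prepareStatement("SELECT name FROM customers WHERE id = ?")) {
            try (ResultSet rows = outer.executeQuery()) {
                while (rows.next()) {
                    inner.setInt(1, rows.getInt(1));
                    try (ResultSet lookup = inner.executeQuery()) {
                        if (lookup.next()) {
                            matched++; // e.g. validate or export the joined data here
                        }
                    }
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        return matched;
    }
}
```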

On the other side, it has some advantages:

  • I never have more than one row of data in memory per result set; since my queries tend to return around 100k rows, this may be worth it.
  • Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time.
  • Starting to process the first rows while the database engine is still returning the remaining ones is a great performance boost.

In response to Gandalf, here is some more information:

  • I will always have to process the entire result set
  • I am not doing any aggregation of rows

I am integrating with a master data management application and retrieving data either to validate it or to export it in many different formats (to the ERP, to the web platform, etc.).


Solution

There is no universal answer. I have personally implemented both solutions dozens of times.

It depends on what matters more to you: memory or network traffic.

If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.

If you work over the Internet, then batch fetching will help you.

You can tune the prefetch count via your database layer's properties and find a happy medium.
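With the Oracle JDBC driver, the prefetch count can be set connection-wide through the `defaultRowPrefetch` connection property, or per statement via the standard `Statement.setFetchSize(int)` hint. A small sketch of the connection-property approach (the value 500 is an arbitrary starting point to tune, not a recommendation):

```java
import java.util.Properties;

public class PrefetchConfig {
    // Build connection properties with Oracle's defaultRowPrefetch set.
    // These Properties would be passed to DriverManager.getConnection(url, props).
    static Properties withPrefetch(String user, String password, int rows) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Oracle-specific driver property: rows fetched per round trip
        props.setProperty("defaultRowPrefetch", String.valueOf(rows));
        return props;
    }
}
```

Benchmark with your own row sizes and network before settling on a value.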

The rule of thumb is: fetch everything you can hold without noticing it.

If you need a more detailed analysis, there are six factors involved:

  • Row generation response time / rate (how soon Oracle generates the first row / the last row)
  • Row delivery response time / rate (how soon you can get the first row / the last row)
  • Row processing response time / rate (how soon you can show the first row / the last row)

One of them will be the bottleneck.

As a rule, rate and response time are antagonists.

With prefetching, you can control the row delivery response time and the row delivery rate: a higher prefetch count increases the rate but delays the first row, while a lower prefetch count does the opposite.

Choose which one is more important to you.

You can also do the following: create separate threads for fetching and processing.

Select just enough rows to keep the user amused in low-prefetch mode (which delivers the first rows quickly), then switch into high-prefetch mode.

It will fetch the rows in the background, and you can process them in the background too, while the user browses the first rows.
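The two-thread idea can be sketched with a bounded queue between the fetcher and the processor; a sentinel value marks the end of the stream. This is a simplified model with rows as plain `String`s, not a JDBC integration (all names here are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedFetch {
    // Sentinel pushed by the fetcher to signal end of the row stream.
    static final String END = "<end-of-rows>";

    // Start a background thread that feeds rows into a bounded queue.
    static BlockingQueue<String> startFetcher(final Iterable<String> rows) {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100); // ~prefetch window
        Thread fetcher = new Thread(() -> {
            try {
                for (String row : rows) {
                    queue.put(row); // blocks when the consumer falls behind
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.setDaemon(true);
        fetcher.start();
        return queue;
    }

    // Consume rows from the queue until the sentinel appears.
    static List<String> drain(BlockingQueue<String> queue) {
        List<String> out = new ArrayList<>();
        try {
            String row;
            while (!(row = queue.take()).equals(END)) {
                out.add(row); // process the row here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out;
    }
}
```

The bounded queue capacity plays the same role as the prefetch count: it caps how far the fetcher can run ahead of the processing.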

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow