Question

I am trying to crawl data from an Oracle database table and am looking for a way to checkpoint my position in the rows.

  • Basically, the crawler fetches at most 500 rows, then stops so the application can process the data. After 2 minutes the crawler must continue from the last row (the checkpoint).

  • Currently, I am using a rownum-based solution, but the results show that some rows are missing.

This is probably a basic question, but being inexperienced with databases, I just cannot find an acceptable solution.


Solution

Normally, you wouldn't design the system this way. Normally, your application would simply open a cursor, fetch 500 rows, process those rows in the application, fetch the next 500 rows, and so on. As long as you don't close the cursor, Oracle will continue to return results as of the system change number (SCN) at which the cursor was opened. Any rows that are added after the cursor is opened won't be returned, and any rows that are deleted after the cursor is opened will still be returned. Since readers do not block writers in Oracle, this approach does not prevent other applications from using the table (which is often an issue in other databases).
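As a minimal sketch of that approach in PL/SQL (table_name and some_key are placeholder names; an application would normally do the same thing through its database driver, for example by setting a fetch size):

DECLARE
  -- explicit cursor over the table; it stays open between batches
  CURSOR c IS
    SELECT *
      FROM table_name
     ORDER BY some_key;

  TYPE t_rows IS TABLE OF c%ROWTYPE;
  l_rows t_rows;
BEGIN
  OPEN c;
  LOOP
    -- grab the next batch of at most 500 rows from the open cursor
    FETCH c BULK COLLECT INTO l_rows LIMIT 500;
    EXIT WHEN l_rows.COUNT = 0;
    -- hand l_rows to the application for processing here
  END LOOP;
  CLOSE c;
END;
/

The read-consistent view only lasts as long as the cursor is open, so the pause for processing happens inside the same session rather than through an external checkpoint.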

If you really want to design a system the way you describe, you would need to have a column (or set of columns) that you can order by consistently. This would have to be something like a monotonically increasing primary key (e.g. a sequence-generated primary key). You could then do something like

SELECT *
  FROM( SELECT a.*,
               rownum rn
          FROM( SELECT *
                  FROM table_name
                 ORDER BY some_key ) a
         WHERE rownum <= MAX_ROW)
 WHERE rn >= MIN_ROW
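For example, to fetch the second batch of 500 rows you would supply MIN_ROW = 501 and MAX_ROW = 1000 (ideally as bind variables), and the checkpoint you record between runs is simply the last MAX_ROW you fetched.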

Of course, with this sort of pagination query, your process will get slower and slower as you work toward the "end" of the table. Getting the first 500 rows is reasonably efficient since you only have to sort enough data to produce the first 500 rows. By the time you're fetching rows 9,500 through 10,000, you're sorting 20 times as much data, so that query will likely be an order of magnitude slower. If your table is large, this will generally cause performance to be quite poor. You can (slightly) improve the performance by adding logic that flips the order in the inner query and starts fetching "earlier" pages of the reversed sort once you've fetched half the pages.
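As a sketch of that order-flipping trick (assuming you know the result set has, say, 10,000 rows), the last 500 rows can be fetched as the first 500 rows of a descending sort:

SELECT *
  FROM( SELECT a.*,
               rownum rn
          FROM( SELECT *
                  FROM table_name
                 ORDER BY some_key DESC ) a
         WHERE rownum <= 500)
 WHERE rn >= 1
 ORDER BY some_key  -- restore the ascending order the application expects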

This approach can also miss rows. If, for example, the transaction that inserts ID 502 commits, then you fetch rows 501-550, and then the transaction that inserts ID 501 commits, you will see row 502 but not row 501.

Licensed under: CC-BY-SA with attribution