Persistent (Disk Based) R-Tree (or R* Tree)

Question 1

If you need to have an on-disk R-Tree index, I would suggest using Spatialite or Postgis. Spatialite is lightweight and easy to embed in a standalone application. Alternatively, have you looked at the C# Spatial Index project?. I wrote an R-Tree implementation in Java a few years back and wouldn't recommend doing it if something already exists.

Question 2

Architecture of the file

Well, it's pages (= blocks). The pages should have a multiple of the page size of the underlying storage, so probably 1kb or 8kb blocks. Each block has a number and can be references this way.

Directory pages store the bounding boxes of the children and their page numbers.

Child pages store the actual data objects.

Managing the tree

Well, in theory: whenever you modify a page in memory, you write the changes to disk. That's it.

In practise, you might want to use a cache for performance, and you might want to have transactions for keeping your tree consistent in case of an application crash.

On both of these things you can find a lot of literature in the field of RDBMS architecture.

A key benefit of the R*-tree is that it is a regular page-oriented tree as you would have them in database systems all over the place. If you have a good on disk B+-tree implementation, you can reuse most of your code for the R*-tree.

How to get started

To get started, you need to get used to disk-based data indexing, as it is done in classic RDBMS. I'd suggest starting with an on disk B-tree or B+-tree. Do allow deletions, because you need to think about managing deleted pages and all this.

Once you figured out the B-Tree on disk (and maybe spend some time on optimizing it!), doing an on-disk R-tree should be fairly obvious.

I havn't looked at the code, but this might be a good starting point: http://www.die-schoens.de/prg/ or some of the others linked in Looking for a disk-based B+ tree implementation in C++ or C

Question 3

If you already have a main-memory implementation you may reuse the same code just add writes to the disk. You have to take into consideration page size and optimize tree nodes to fit in a page (you can read it in one go).

It would be better (performance wise) to have a snapshots of the main memory tree stored in the disk (snapshots could be taken when the tree is not under high pressure) versus writing every change into the disk.

In the question you specify that querying the tree is of higher importance thus you should be better with R*-tree since it minimizes the overlap between the tree nodes. However if your requirements would be focused on update operations (insert/delete) I would suggest to take a look at Supporting frequent updates in R-trees: a bottom-up approach paper.