Question

I have a query that does selection and comparison. Given the number of tuples, the number of disk blocks, and the index types (such as a primary B+ tree index on a key), how can I calculate the number of block transfers and seek operations required to complete the query? Say,

select cid
from payment
where eid = 1200 and amount > 30

where we know that the values of eid are uniformly distributed between 1 and 100, and the values of amount are uniformly distributed between 1 and 50. There are 1000000 tuples stored in 15000 disk blocks.

Specifically, the given cases are:

No index, primary B+ tree index on eid, primary B+ tree index on amount, secondary B+ tree index on eid, secondary B+ tree index on amount.

eid is a primary key in employee, cid is a primary key in customer, and (cid, eid) forms a candidate key for payment. amount is an attribute of payment.


Solution

OK, here is a practical walkthrough for those who want to learn the basics of block transfers during a database query when indexing with a B+ tree or a similar structure.

If there is no index to use for the search, all the data residing on secondary storage must be transferred block by block, and the values are searched sequentially. It is enough to know the disk address where the blocks for our data start; from there, the values are scanned in sequence. So, in our example, 15000 blocks are transferred during the query, and 1 seek is enough.
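The no-index case can be sketched as a tiny cost model. This is a minimal illustration of the argument above; the constant names and function are my own, not part of the question.

```python
# Rough cost model for a full sequential scan (illustrative names).
N_TUPLES = 1_000_000   # tuples in payment
N_BLOCKS = 15_000      # disk blocks holding those tuples

def full_scan_cost():
    """With no index, every block is transferred after a single seek."""
    return N_BLOCKS, 1  # (block transfers, seeks)

print(full_scan_cost())  # (15000, 1)
```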

If there is a primary B+ tree index on eid, then we can make two assumptions about the given query:

First, the B+ tree searches for eid = 1200 and cannot find an eid with value 1200, because eid lies between 1 and 100. Therefore, there is no block transfer or seek on the disk: by first searching the B+ tree, we have already ensured that no such value exists on disk.

Second, if we instead assume that one of the 100 uniformly distributed eid values is 1200, then the B+ tree search for eid = 1200 succeeds. In that case, the pointer in the leaf with value 1200 points to the disk location where the tuples are SORTED on eid, and it points to the first address where eid = 1200. Starting from this address, the data is scanned until eid is no longer 1200. Since there are 100 unique eid values spread over 15000 blocks, the tuples with eid = 1200 occupy approximately 15000/100 = 150 blocks. We can divide the total number of blocks this way because the tuples are sorted on eid (the index is primary): once we know the address of the first tuple with eid = 1200, we are certain the following tuples also have eid = 1200 until we see a different value. Therefore 150 blocks are transferred, and 1 seek is enough, because once the scan starts there is no need to seek the disk again; we read the tuples sequentially until eid takes a different value.

In a similar vein, if we have a primary index on amount and we want the tuples where amount is in [31, 50], we need to transfer (15000/50)*20 = 6000 blocks. We divide by 50 because there are 50 distinct amount values, and we multiply by 20 because we search 20 of those 50 values, not 1, so these 20 values span roughly 6000 blocks. Again, 1 seek is enough: starting from the first matching address, we scan the tuples sequentially.
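Both primary-index estimates follow the same uniform-distribution formula, which can be written out as a small sketch (function and constant names are mine, for illustration):

```python
# Estimate for a primary (clustering) B+ tree index: matching tuples
# are stored contiguously, so one seek suffices and the transfers are
# a proportional slice of the blocks.
N_BLOCKS = 15_000

def primary_index_cost(distinct_values, values_wanted):
    """Return (block transfers, seeks) when tuples are sorted on the attribute."""
    transfers = N_BLOCKS // distinct_values * values_wanted
    return transfers, 1

print(primary_index_cost(100, 1))   # eid = 1200        -> (150, 1)
print(primary_index_cost(50, 20))   # amount in [31,50] -> (6000, 1)
```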

If the index is secondary, we can no longer assume the values are sorted on disk. The pointer from a leaf node of the B+ tree points to a bucket that contains pointers to the actual locations of the tuples on disk. So we first go to the bucket, and from there we visit the disk directly: one lookup from the B+ tree to the bucket, then one from the bucket to the disk. Note that, in the worst case, every tuple we want can be placed in a different block, so we may need to transfer as many blocks as there are tuples matching the condition.

If there is a secondary index on eid and the values lie between 1 and 100, there is again no I/O, since no tuple has eid = 1200. However, under our second assumption about eid, approximately 1000000/100 = 10000 tuples contain eid = 1200. In the worst case, we need to transfer 10000 blocks if none of these tuples are adjacent and each lies in a different block. We also need 10000 seeks to reach the disk locations the bucket points to.

If there is a secondary index on amount and we need 20 of the 50 values, they appear in approximately (1000000/50)*20 = 400000 tuples. In the worst case, we may need to transfer 400000 blocks, if we assume that the next desired tuple is never in the block transferred last. In such a situation we need, in the worst case, 400000 block transfers and 400000 seeks, for the reason stressed above.
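The two secondary-index worst cases share one formula: one transfer and one seek per matching tuple. A minimal sketch of that reasoning (names are illustrative, not from the question):

```python
# Worst-case estimate for a secondary (non-clustering) B+ tree index:
# every matching tuple may sit in its own block, so each match costs
# one block transfer and one seek.
N_TUPLES = 1_000_000

def secondary_index_worst_case(distinct_values, values_wanted):
    """Return (block transfers, seeks) when matching tuples are scattered."""
    matches = N_TUPLES // distinct_values * values_wanted
    return matches, matches

print(secondary_index_worst_case(100, 1))   # eid = 1200        -> (10000, 10000)
print(secondary_index_worst_case(50, 20))   # amount in [31,50] -> (400000, 400000)
```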

We can apply the selection eid = 1200 while scanning via the index on amount, or vice versa, so the second condition costs no extra I/O: we either check eid = 1200 while searching the index for amount > 30, or check amount > 30 while searching the index for eid = 1200.

AGAIN, these are approximate and, in general, worst-case results meant to explain the basics of block transfers and seeks while executing a query. They give a basic idea of how data is moved from disk to main memory and on to the user.

OTHER TIPS

Although I voted to close this question, I do want to make some comments that are too long for a comment.

Your basic question, "how can I calculate the number of block transfers and seek operations required to complete the query?", is indeterminate. The number of such operations depends on the state of the database and, in particular, on whether the data pages and index pages are already cached in memory.

For the statement that's given, I think the only important thing that 99.999% of the world needs to know is that an index on payment(eid, amount, cid) is the optimal index. The first two elements of the index (in that order) support the where clause. The last allows the index to cover the query, so the original data table does not need to be used. The specific type of index seems entirely unimportant.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow