Question

I want to understand how a DBMS is implemented.

To give an example :

MySQL stores each table in its own set of pages, each 16 KB in size.

So each table is a file whose size is a multiple of 16 KB, depending on how large the table is and therefore how many pages it needs.

Now, I read somewhere that these pages don't get fragmented, so my question is: how?

How do DBMS developers tell the operating system, "Hey, I just added 16 KB of data (a page) to this file, but make sure this page doesn't get fragmented"?

Sorry if this is a duplicate; I searched and couldn't find any related question. Also, let's say the OS is Windows or Linux.

My point is this: say the OS stores files in 4 KB chunks and may fragment files that grow beyond one chunk, while the DBMS uses 16 KB pages. How is the DBMS implemented so that the 16 KB pages added to table files don't get fragmented? When I append 16 KB of data to a file, is that space reserved contiguously by default so that it will never be fragmented? (Basically, how do they reserve 16 KB on the disk and make sure it's not going to get fragmented?)

If you can give an example, in any language, of how this kind of appending is done, that's fine. I'm not looking for a specific language; I just want to know how it's done.

Also, I'm not asking about any specific database; all the relational databases use these pages.

Also, I'm talking about fragmentation inside a disk image or memory image (I'm not sure whether these images are logical or not). When I take an image of the database folder, or of the database process in memory, these pages are not fragmented. How?


Solution

I can't speak reliably about all filesystems and all platforms, but I have some experience with controlling where a file is allocated within the disk/volume/partition/region structure. The most obvious case is NTFS on Windows NT (I'm not entirely sure about other Windows filesystems), where it can be done easily with the userspace defragmentation API. In my experience, you can not only place parts of a file at particular positions but also put internal NTFS structures (e.g. the MFT, directory trees) in a predefined order, although the latter is not a 100% reliable process. I have done this on a live, running system. Take a look at the Jetico BCWipe application.

Another thought: you can get full control of file allocation with your own filesystem driver or userspace utility, but those must either work on an unmounted device or replace the system's allocation logic entirely.

As for other systems, I think it is possible to devise heuristic, though not completely reliable, algorithms for each filesystem type to control allocation behaviour. See the note above about NTFS structures.

So, to summarize: everything is possible, but reliability and the accompanying risks (e.g. design complexity) depend on the way you choose to implement such features.

OTHER TIPS

You don't. At least in the physical disk access sense.

There may be a way on particular platforms to allocate multiple contiguous chunks (e.g. by asking for them all in one go), but it doesn't matter whether they are physically adjacent or not. The OS presents every file to you as a logically contiguous byte sequence.
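To make the "logically contiguous byte sequence" point concrete, here is a minimal sketch of how a page-based table file could be appended to and read back. It assumes the 16 KB page size from the question; `append_page`, `read_page` and `demo` are hypothetical helper names for illustration, not part of any real database engine. The code never asks where the blocks land on disk: page N is simply the bytes at offset N * PAGE_SIZE in the file.

```python
import os
import tempfile

PAGE_SIZE = 16 * 1024  # 16 KB, the page size mentioned in the question


def append_page(f, payload: bytes) -> int:
    """Pad payload to one full page, append it, and return its page number."""
    if len(payload) > PAGE_SIZE:
        raise ValueError("payload does not fit in one page")
    f.seek(0, os.SEEK_END)
    offset = f.tell()
    # Zero-pad so the file size always stays a multiple of PAGE_SIZE.
    f.write(payload.ljust(PAGE_SIZE, b"\x00"))
    return offset // PAGE_SIZE


def read_page(f, page_no: int) -> bytes:
    """Read one whole page by its logical page number."""
    f.seek(page_no * PAGE_SIZE)
    return f.read(PAGE_SIZE)


def demo():
    # A temporary file stands in for a table file here.
    with tempfile.TemporaryFile() as f:
        first = append_page(f, b"row data for page 0")
        second = append_page(f, b"row data for page 1")
        f.flush()
        size = os.fstat(f.fileno()).st_size
        return first, second, size, read_page(f, second).rstrip(b"\x00")
```

From the program's point of view each page is always one unbroken 16 KB range of the file, whatever the filesystem does with the underlying blocks.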

Hard drives are pretty complex, especially nowadays, when they can have several layers of caching in solid-state memory. Particularly active blocks in a file may never actually be written to the hard disk, and may reside permanently in solid-state memory.

If you are optimising for sequential I/O, don't bother. The OS and hardware are already way ahead of you, and any action you take will likely slow things down. The only exception is concatenating numerous small files together, in which case you should consider this to be closer to the random I/O case.

If you are optimising for random I/O, pay attention to the block size of the device. Like a memory page, this is the smallest unit of read/write the device will perform. Design your data structures to respect that boundary and co-locate as much relevant data within each block as possible, while avoiding splitting data across block boundaries. In this case, leaving space unused because it is too small to store anything useful or relevant is not a sin, but a virtue.
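As a rough illustration of respecting the block boundary, the sketch below packs variable-length records into fixed-size pages and never lets a record straddle two pages; leftover space at the end of a page is simply padded with zeros. The 16 KB page size is taken from the question, and `pack_records` is a hypothetical helper, not how any particular DBMS lays out its pages.

```python
PAGE_SIZE = 16 * 1024  # assumed page size, as in the question


def pack_records(records, page_size=PAGE_SIZE):
    """Pack records into full-size pages without splitting any record."""
    pages = []
    current = bytearray()
    for rec in records:
        if len(rec) > page_size:
            raise ValueError("record larger than a page")
        # If the record would cross the page boundary, close the current
        # page (wasting the remaining space) and start a new one.
        if len(current) + len(rec) > page_size:
            pages.append(bytes(current.ljust(page_size, b"\x00")))
            current = bytearray()
        current += rec
    if current:
        pages.append(bytes(current.ljust(page_size, b"\x00")))
    return pages
```

With three 7000-byte records, for instance, the first two share a page and the third starts a fresh page rather than being split across the boundary, so reading any one record touches exactly one page.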

Licensed under: CC-BY-SA with attribution