Question

I have been trying to understand how a database system works. Digging through the internet, I have arrived at the following picture.

  1. A database system is required because we need a way to store data, mainly at large scale, so that we can easily update, query, delete, and otherwise manipulate it. A traditional file system is not designed for this and fails to serve the purpose.
  2. We can implement such a database system in different ways, which leads to different data models, RDBMS being the most dominant one.
  3. There is a database management system, which is a piece of software somewhat like a file management system, and it handles requests to create, delete, and insert data, etc. E.g. SQLite.
  4. We interact with the database through an interface provided by the database management system, using a language that is compatible with it. E.g. SQL.

I hope my understanding so far is correct.

Now, the database management system interacts with the database, and I am not able to understand what a database is exactly. I used to think that the collection of all the tables is termed a database and that it is just a name. But then I found out that several databases are possible and that a database can be created using a DBMS, just like files. Further research indicated that a database is a file.

My doubts are the following:

  1. If a database is a file, how does it differ from ordinary files (in terms of implementation)?
  2. I believe that tables are also files. But tables are stored in a database. How can a file be inside a file while both are considered separate?
  3. Where are the different data models implemented (relations to store data in an RDBMS, objects to store data in OOP-based database systems): in the database management system or in the database?

And if a database is not a file, what is it exactly, and how are tables, the database, and the database management system related in terms of implementation?

I have tried to find answers to these, but there seems to be little content on these topics and I have had no luck. Also, please explain it avoiding very involved code snippets, as I am an analys


Solution

The following is the beginning of a bottom-up approach to explaining how one particular DBMS stores and retrieves data. The particular DBMS is Oracle-RDB, formerly known as DEC RDB/VMS. This is a distinct product from Oracle-RDBMS, which is the product that everyone knows and loves (or doesn't love).

The database files consist of one Main file, and several storage area files, along with several snapshot files. The snapshot files contain data needed to preserve a virtual snapshot of the database for the sake of long running read only transactions. I'm going to ignore those files for now.

The main file contains information the DBMS needs in order to locate and use the storage area files, and also the snapshot files.

Each storage area is divided into what are called "Database pages". The name "pages" may be misleading because this isn't the same thing as the memory pages that are managed by the memory mapper inside the CPU. However, the database page size is always a multiple of both the page size in the memory mapper and the disk block size in the file system.
All database pages in a single storage area are the same size, but the size may be different between different storage areas.

Each page is divided up into lines. A line is located through a line index. Each index entry contains a byte offset, and a line size (in bytes). The byte offset is from the beginning of the page.
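
The page layout just described can be sketched in a few lines of Python. This is only an illustration, not the actual on-disk format of Oracle-RDB: the page size, the 2-byte field widths, and the names `build_page` and `read_line` are all assumptions made for the example.

```python
import struct

PAGE_SIZE = 512                 # assumed page size in bytes
ENTRY = struct.Struct("<HH")    # one line-index entry: (byte offset, size)
HEADER = 2                      # assumed 2-byte line count at the page start

def build_page(lines):
    """Pack byte strings into one page: line count, line index, then the
    line data placed from the end of the page backwards."""
    if HEADER + len(lines) * ENTRY.size + sum(map(len, lines)) > PAGE_SIZE:
        raise ValueError("lines do not fit in one page")
    page = bytearray(PAGE_SIZE)
    struct.pack_into("<H", page, 0, len(lines))
    end = PAGE_SIZE
    for i, data in enumerate(lines):
        end -= len(data)
        page[end:end + len(data)] = data
        # The offset is measured from the beginning of the page.
        ENTRY.pack_into(page, HEADER + i * ENTRY.size, end, len(data))
    return bytes(page)

def read_line(page, line_no):
    """Find a line through the line index: look up (offset, size), then slice."""
    (count,) = struct.unpack_from("<H", page, 0)
    if not 0 <= line_no < count:
        raise IndexError("no such line on this page")
    offset, size = ENTRY.unpack_from(page, HEADER + line_no * ENTRY.size)
    return page[offset:offset + size]

page = build_page([b"row: Alice,30", b"row: Bob,25", b"index node"])
print(read_line(page, 1))
```

Because a line is found through its index entry rather than by position, the DBMS can move or resize a line inside the page without changing its line number.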

A line may contain: a table row; an index node; a hash bucket; or any of several data structures the DBMS needs in order to manage free space inside of storage areas, and such.

All of this structure is completely opaque to the application developer writing an app in some programming language, and also in SQL. The programmer doesn't need to know this stuff at all. The database builder may benefit from knowing this stuff in order to build a high performance database.

The various rows that make up a table need not be located contiguously to each other in a storage area. A table can even be split across storage areas, and database builders exploit this fact in order to improve the performance of multiuser databases.

The DBMS can locate all the rows of a given table fairly quickly, but a full scan of a large table is still a performance disaster in many cases. It's a sort of last resort. Most data access is done by way of indexes and/or hash buckets. Indexes provide a very rapid way to turn keyed access into direct access.
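
The difference between a full table scan and indexed access can be observed in any DBMS that reports its access plan. The sketch below uses SQLite through Python's `sqlite3` module, not Oracle-RDB, simply because it is easy to run. The table, column, and index names are made up for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("name%d" % i, "city%d" % (i % 10)) for i in range(1000)],
)

def plan(sql):
    """Return the readable plan steps (the last column of each plan row)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM customer WHERE name = 'name42'"

before = plan(query)    # no index on name yet: every row must be examined
conn.execute("CREATE INDEX customer_name ON customer (name)")
after = plan(query)     # the index turns the keyed lookup into direct access

print(before)
print(after)
conn.close()
```

The first plan should describe a scan of the whole table, and the second a search that uses the `customer_name` index.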

The access info provided by an index entry consists of a storage area number, a page number within the storage area, and a line number within the page. Given this info, the DBMS can rapidly find the file blocks that make up the database page, and get that page into memory. Then, it can rapidly locate the line inside the page, using the line number and the line index.
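
As a rough sketch of that lookup, the Python below resolves a three-part address into a line. Everything here is hypothetical: the `DbKey` name, the page size, the entry layout, and the use of in-memory byte streams as stand-ins for storage area files.

```python
import io
import struct
from collections import namedtuple

# Hypothetical address with the three parts named above.
DbKey = namedtuple("DbKey", "area page line")

PAGE_SIZE = 256                 # assumed; real storage areas may each differ
ENTRY = struct.Struct("<HH")    # line-index entry: (byte offset, size)

def make_page(lines):
    """Build one page: line index at the start, line data at the end."""
    page = bytearray(PAGE_SIZE)
    end = PAGE_SIZE
    for i, data in enumerate(lines):
        end -= len(data)
        page[end:end + len(data)] = data
        ENTRY.pack_into(page, i * ENTRY.size, end, len(data))
    return bytes(page)

# Two in-memory stand-ins for storage area files, each a run of equal pages.
areas = {
    1: io.BytesIO(make_page([b"Alice"]) + make_page([b"Bob", b"Carol"])),
    2: io.BytesIO(make_page([b"Dave"])),
}

def fetch(key):
    """Area number -> file, page number -> byte position, line number -> slice."""
    f = areas[key.area]
    f.seek(key.page * PAGE_SIZE)        # pages are fixed size within an area
    page = f.read(PAGE_SIZE)
    offset, size = ENTRY.unpack_from(page, key.line * ENTRY.size)
    return page[offset:offset + size]

# A toy index: key value -> address of the row.
index = {"Carol": DbKey(area=1, page=1, line=1), "Dave": DbKey(2, 0, 0)}
print(fetch(index["Carol"]))
```

No searching happens in `fetch`: each part of the address is turned into a position by simple arithmetic, which is why this kind of access is fast.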

All of this seems to have little to do with the SQL a programmer may have issued with a SELECT statement. All of what I've said is relevant to building a platform where the DBMS can locate and retrieve the data needed by a SELECT statement. But there is lots more to the story than what I have told. I've just begun to scratch the surface.

RDB has a cost-based query optimizer built into it, and this optimizer helps the DBMS choose which among several equivalent retrieval strategies is likely to require the fewest disk I/O operations.

That's all I'm going to try to write for now. If you are interested in learning on your own, here's a link: http://neilrieck.net/docs/openvms_notes_rms_rdb.html#rdb

However, if I were you, I'd spend more time trying to learn the internals of Oracle RDBMS, which is organized somewhat differently. Also, the internals of SQL Server, another major contender. There are other important players like Postgres, etc., etc.

Good luck.


Addendum

One data structure inside the database is worth mentioning: the Data Dictionary. This is a repository of data definitions stored inside the database. When the database builder creates the database and its contents, using CREATE, ALTER, and DROP commands, the DBMS records its activity in the form of metadata stored inside the database. This is called the Data Dictionary, using industry-wide parlance. Oracle-RDB uses the term "System Relations". This is really a second database, in which the DBMS shares data definitions with itself. Every table name, column name, and storage area name is stored in here, along with lots of other stuff. The DBMS is going to need this at a later point in time, when it parses the SQL that comes from the programmers or the DBA. It can also be used to extract a create script to make a new database without any data in it, but with the then-current definitions. Self-describing data is a key element of any real database system.
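
The same idea can be seen in SQLite, which keeps its definitions in a built-in table named `sqlite_master`. This is SQLite's catalog, not Oracle-RDB's System Relations, and the `customer` table and `customer_name` index are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX customer_name ON customer (name)")

# The DBMS recorded both CREATE statements as ordinary rows of metadata,
# which can be queried with the same SQL used for user data.
catalog = conn.execute(
    "SELECT type, name, tbl_name, sql FROM sqlite_master ORDER BY name"
).fetchall()
for row in catalog:
    print(row)
conn.close()
```

The `sql` column holds the text of each definition, so reading it back is one way to produce a create script for an empty copy of the database.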

(Thanks to Shaheed Haque for his answer, which jogged my memory with regard to this part. His answer is a useful counterpoint to this one, because Shaheed has taken a top-down approach to describing the internals of a DBMS.)

There may be more to come.

OTHER TIPS

It is probably helpful to start thinking not about the low level details, but the high-level properties/behaviours of different kinds of storage.

For example, one might consider simple files (flat, binary content) and indexed files (structured access layered on top of a single simple file), each with or without locking supplied by OS-level primitives.

In such terms, one might say that different database systems (using the term very loosely) can be characterised and contrasted with files by things like:

  • A query language that is used to read (and to some extent, write) data according to some complex criteria. Simple files can only be read or written by file offset. Indexed files can have some criteria, but the query language is often very simple.

  • A database system often has multiple types of object stored in it, with metadata structures that allow those objects to be treated in a uniform manner by the query language. With simple files, and indexed files, you have to build all those aspects yourself.

  • A database system is often network-accessible, including with locking semantics. Simple files can be exposed by a network filesystem layer, but the locking semantics are often lost or changed when you do so. I'm not aware of any system (other than possibly some obsolete systems from the 1980s) that even attempts this for indexed files.

One might continue such an analysis for many other behaviours and properties. As you do that, the details of your questions and assumptions can be answered with regard to those behaviours and properties.

Of course, it is also true that many database systems are built on top of simple files provided by the operating system. But that is not a given - it is in fact common for high performance database systems to access the disks using some kind of "raw" I/O too. Files just happen to be a convenient abstraction to avoid the database system having to replicate a lot of things that the operating system's file access layers already do.
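
SQLite is one DBMS that takes the file-based route, and it makes a convenient illustration of how several tables can live inside one ordinary file. The sketch below is an assumption-laden example rather than a description of any other product: the file name `demo.db` and both table names are invented.

```python
import os
import sqlite3
import tempfile

# Create a database in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.commit()

# Ask the DBMS how it has divided the file into pages.
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
page_count = conn.execute("PRAGMA page_count").fetchone()[0]
conn.close()

files = os.listdir(os.path.dirname(path))
file_size = os.path.getsize(path)
print(files, file_size, page_size, page_count)
```

The operating system sees a single file whose size is a whole number of pages. The two tables are not files inside a file: they are structures that the DBMS lays out across those pages.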

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange