Understanding block sizes

https://dba.stackexchange.com/questions/15510

22-10-2019

Question

My question targets Postgres, but answers might just be good enough coming from any database background.

Are my assumptions correct:

Disks have a fixed block size?
RAID controller can have a different block size? Does one RAID block get split onto multiple real disk blocks?
The filesystem also has an independent block size which again gets split onto the RAID block size?
Postgres works with fixed 8k blocks. How does the mapping to the filesystem block size happen here? Are Postgres 8k blocks batched together by the filesystem?

When setting up a system, is it best to have all blocks at 8k? Or do the settings not really matter? I was also wondering whether some "wrong" block size settings could endanger data integrity in case of a crash, for example if a Postgres 8k block has to be split onto multiple disk blocks.

Or does nothing get batched together, and therefore I lose disk space with every mismatch between defined block sizes?

Answer

Disk Sectors

A disk has a fixed sector size, normally 512 bytes, or 4096 bytes on some modern disks; the 4096-byte disks also have a mode where they emulate 512-byte sectors. The disk has tracks with varying numbers of sectors; tracks closer to the outside of the disk have more sectors, as they have more room at a given bit density. This allows more efficient use of the disk space; typically a track will have something like 1,000 512-byte sectors on a modern disk.

Some formatting structures can also include error-correcting information in the sectors, which manifests itself in the disks being low-level formatted with 520 or 528 byte sectors. In this case the sector still has 512 bytes of user data. Neither Windows nor Linux supports this directly, although i5/OS (IBM iSeries) and various SAN controllers do.

Normally the sector/head/track is translated into a logical block address; due to historical issues with backward compatibility the geometry (heads x sectors x tracks) seen by the operating system (particularly on IDE and SATA disks) normally has little to do with its physical structure.
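The translation mentioned above can be illustrated with the conventional CHS-to-LBA formula. This is only a sketch: the function names are hypothetical, and the default geometry of 16 heads and 63 sectors per track is an assumed logical geometry of the kind a drive might report, not the physical layout of any real disk.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder=16, sectors_per_track=63):
    """Map a logical cylinder/head/sector triple to a logical block address.

    The default geometry is an assumption for illustration only.
    Sectors are numbered from 1; cylinders and heads from 0.
    """
    if not 0 <= head < heads_per_cylinder:
        raise ValueError("head out of range for this geometry")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range (sectors are numbered from 1)")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cylinder=16, sectors_per_track=63):
    """Inverse mapping: recover (cylinder, head, sector) from a block address."""
    cylinder, remainder = divmod(lba, heads_per_cylinder * sectors_per_track)
    head, sector_index = divmod(remainder, sectors_per_track)
    return cylinder, head, sector_index + 1


print(chs_to_lba(0, 0, 1))   # first sector of the disk
print(chs_to_lba(1, 0, 1))   # first sector of the second cylinder
print(lba_to_chs(1008))
```

Because the operating system only ever sees the logical block address, the geometry values matter for compatibility rather than for performance.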

RAID Stripe Size

A RAID controller can have a stripe size for an array using striping (e.g. RAID-5 or RAID-10). If the array has (for example) a 128k stripe, each disk has 128k of contiguous data, and then the next set of data is on the next disk. Normally you can expect to get approximately one stripe per revolution of the disk, so the stripe size may affect performance on certain workloads.
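As a rough illustration of the layout described above, the following sketch maps a byte offset in the array to a disk and an offset on that disk. It assumes plain RAID-0 style striping across four data disks; parity rotation (RAID-5) and mirroring (RAID-10) are deliberately ignored, and the function name and defaults are hypothetical.

```python
def locate_in_stripe_set(offset, stripe_size=128 * 1024, data_disks=4):
    """Return (disk_index, offset_on_disk) for a byte offset in the array.

    Assumes simple round-robin striping: stripe 0 on disk 0, stripe 1 on
    disk 1, and so on, wrapping around after the last disk.
    """
    stripe_number, offset_in_stripe = divmod(offset, stripe_size)
    row, disk_index = divmod(stripe_number, data_disks)
    return disk_index, row * stripe_size + offset_in_stripe


print(locate_in_stripe_set(0))                # start of the array: disk 0
print(locate_in_stripe_set(128 * 1024))       # next 128k lands on disk 1
print(locate_in_stripe_set(4 * 128 * 1024))   # wraps back to disk 0
```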

Partition Alignment

A disk partition may or may not align exactly with a RAID stripe, and a misaligned partition can degrade performance because reads get split across stripes. Some systems (e.g. Windows Server 2008) will automatically configure partitions to align with disk volume stripe sizes. Some (e.g. Windows Server 2003) will not, and you have to use a partition utility that supports stripe alignment to ensure they do.
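Whether a partition starts on a stripe boundary is simple arithmetic on its starting offset. The sketch below assumes 512-byte sectors and a 64k stripe; the function name is hypothetical. The two sample start sectors are the legacy 63-sector offset used by older partitioning tools and the 1 MiB (2048-sector) offset used by newer ones.

```python
def partition_misalignment(start_sector, sector_size=512, stripe_size=64 * 1024):
    """Bytes by which the partition start lies past a stripe boundary.

    A result of 0 means the partition begins exactly on a stripe boundary.
    """
    return (start_sector * sector_size) % stripe_size


for start in (63, 2048):
    off = partition_misalignment(start)
    state = "aligned" if off == 0 else f"misaligned by {off} bytes"
    print(f"start sector {start}: {state}")
```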

File System Block Size

The file system will allocate blocks of storage in chunks of a certain size. Generally this is configurable - for example NTFS will support allocation units from (IIRC) 4K to 64K. Misalignment of partitions and file system blocks to RAID stripes can cause a single filesystem block read to generate multiple disk accesses where only one would be necessary if the file system blocks aligned correctly with the RAID stripes.

Database Block Size

The database will allocate space in a table or index in some given block size. In the case of SQL Server this is 8K, and 8K is the default on many systems. On some systems such as Oracle, this is configurable, and on PostgreSQL it is a build-time option. On most systems space allocation to tables is normally done in larger chunks, with blocks allocated within those chunks.

Misalignment of filesystem and data allocation blocks can generate multiple I/Os for a single block write, which can drive a performance penalty.

I/O Chunking

Normally a DBMS will actually do its I/O in chunks of more than one block. For example, on SQL Server, all I/O is done in chunks of 8 blocks (64k in total). On Oracle this is configurable. Casual inspection of the PostgreSQL docs doesn't reveal a specific description of whether PostgreSQL does this, so I'm not sure how it works on this platform.

When the I/O chunk is larger than the file system block size or is misaligned with RAID stripe boundaries, a single disk write from the DB can cause multiple disk writes, which generates a performance penalty.
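The penalty can be made concrete by counting how many underlying units (filesystem blocks or RAID stripes) a single database I/O touches. This is a minimal sketch; the function name and the sample sizes are assumptions chosen for illustration.

```python
def units_touched(offset, length, unit_size):
    """Count the fixed-size units overlapped by the byte range
    [offset, offset + length)."""
    if length <= 0:
        return 0
    first_unit = offset // unit_size
    last_unit = (offset + length - 1) // unit_size
    return last_unit - first_unit + 1


KB = 1024
# A 64k I/O chunk on 64k stripes, starting on a stripe boundary: one stripe.
print(units_touched(0, 64 * KB, 64 * KB))
# The same I/O shifted by a 63-sector partition offset: two stripes.
print(units_touched(63 * 512, 64 * KB, 64 * KB))
# An 8k database block on a 4k filesystem block size: two filesystem blocks.
print(units_touched(0, 8 * KB, 4 * KB))
```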

Disk Space Usage

No disk space is wasted - the database I/O will use one or more physical I/O operations on the disk to complete - but incorrectly tuned I/O can generate inefficiencies which will slow down the database. The main things that have to be in alignment are:

RAID stripes and partitions - the partition should begin on a RAID stripe boundary.
Filesystem I/O allocation and RAID stripe/partition boundaries - a RAID stripe boundary must align with a filesystem allocation unit boundary, and the stripe size should be a multiple of the filesystem allocation unit size.
Disk write size and filesystem allocation unit size. There should be a 1:1 relationship between database I/O operations and filesystem I/O operations.
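The three conditions above can be sketched as one check. This is an illustration under stated assumptions rather than a tuning tool: the function name is hypothetical, all sizes are in bytes, and the 1:1 rule is read strictly as "database I/O size equals filesystem allocation unit size", which is one possible interpretation.

```python
def check_alignment(partition_start, stripe_size, fs_block_size, db_io_size):
    """Return a list of human-readable alignment problems (empty if none)."""
    problems = []
    if partition_start % stripe_size != 0:
        problems.append("partition does not begin on a RAID stripe boundary")
    if stripe_size % fs_block_size != 0:
        problems.append("stripe size is not a multiple of the filesystem allocation unit")
    if db_io_size != fs_block_size:
        # Strict reading of the 1:1 rule; an assumption of this sketch.
        problems.append("database I/O size differs from the filesystem allocation unit")
    return problems


KB = 1024
print(check_alignment(1024 * KB, 64 * KB, 64 * KB, 64 * KB))
print(check_alignment(63 * 512, 64 * KB, 4 * KB, 8 * KB))
```

The first call describes a fully aligned stack and reports no problems; the second reports the partition offset and the I/O size mismatch.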

Misalignment does not create a greater data integrity problem than would otherwise be present. The database and file system have mechanisms in place to ensure file system operations are atomic. Generally a disk crash will result in data loss but not data integrity issues.

License: CC-BY-SA with attribution

Not affiliated with dba.stackexchange