Question

If impatient, skip to the "SO WHAT IS THE QUESTION?" headline below.

CONTEXT

I work with Unix(like) system administration and infrastructure development, but I think that my question is answered best by programmers :o)

What I want to do is to learn how to benchmark file systems (plain, volume managed, virtualized, encrypted, etc.) using iozone. As an exercise, I benchmarked a USB pendrive meant to be used as the system disk in my slug (http://www.nslu2-linux.org/), formatted in turn with vfat, ntfs, ext3, ext4 and xfs. The test produced some surprising results, which are posted below. The reason the results surprised me, though, may very well be that I am still new to iozone and don't really know how to interpret the numbers. Hence, this post.

In my test, iozone ran benchmarks on 11 different file operations, but only with one record size (4k, matching the block size of all the tested file systems) and only with one file size (512 MB). This one-sidedness in record size and file size of course leaves the test with some bias. Anyway, the file operations are listed below, each with my own short explanation:

  • initial write: write new data to disk sequentially, regular file usage
  • rewrite: write new data sequentially over an existing file, regular file usage
  • read: read data sequentially, regular file usage
  • re-read: re-read data sequentially (buffer test, or what?)
  • reverse read: ???
  • stride read: ???
  • random read: read data non-sequentially, typically database usage
  • random write: write data non-sequentially, typically database usage
  • pread: read data at a certain position - for indexing databases?
  • pwrite: write data at a certain position - for indexing databases?
  • mixed workload: (obvious)

Some of these operations seem straightforward. I guess that initial write, rewrite and read are all used for regular file handling, involving letting the pointer seek until a certain block is reached, then reading or writing sequentially (often through many blocks), sometimes having to jump forward a little because of fragmented files. The sole objective of the re-read test (I guess) would be buffer testing. Similarly, random read/write are typical database operations, where the pointer has to jump from place to place within the same file collecting database records, for example when joining tables.
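To make the contrast between the sequential and random patterns concrete, here is a minimal Python sketch of the two access patterns (record size and count are invented for the example; os.pread requires a Unix-like system):

```python
import os
import random
import tempfile

RECORD = 4096          # matches the 4k record size used in the test
COUNT = 256            # a tiny 1 MiB scratch file, just for illustration

fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(RECORD * COUNT))

# Sequential read, as in iozone's "read" test: one record after another.
os.lseek(fd, 0, os.SEEK_SET)
sequential = [os.read(fd, RECORD) for _ in range(COUNT)]

# Random read, as in "random read": the same records in shuffled order,
# fetched with pread so no seek is needed between records.
offsets = [i * RECORD for i in range(COUNT)]
random.shuffle(offsets)
randomized = [os.pread(fd, RECORD, off) for off in offsets]

os.close(fd)
os.unlink(path)
```

On a real disk the difference between the two loops is seek cost; on a cached file, both mostly measure the page cache, which is worth remembering when reading the numbers below.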

SO WHAT IS THE QUESTION?

So far, so good. I would highly appreciate any corrections to the above assumptions, although they seem fairly common knowledge. Now for the real question: Why would you ever do a reverse read? What is a stride read? And the "position" operations pread and pwrite, I've been told, are used with indexed databases, but why not simply keep the index in memory? Or is that what actually happens, and the pread then comes in handy for jumping to the exact location of a record once given a certain index? What else do you use a pread/pwrite for?

To sum it up, as of this time I feel that I am only able to interpret my iozone results about halfway. I more or less know why high numbers on the random operations would make a good file system for a database, but why would I need to read files in reverse order, and what does a good stride read tell me? What would the typical application uses of these operations be?

BONUS QUESTION

Having asked that, here is a bonus question. As an administrator of a given file system, having gratefully learned how to interpret my file system benchmarks from insightful programmers ;) - does anyone have suggestions on how to make an analysis of the actual use of a file system? Experimenting with file system record (block) size is trivial, although time consuming. And concerning the size and distribution of files in a given file system, 'find' is my friend. But what do I do to get counts on the actual file system calls like read(), pwrite(), etc.?

Also, I would greatly appreciate any comments on the influence of other resources on file system test results, such as the role of processor power and RAM capacity and speed. For example, what difference does it make that I run this test on a machine with a 1.66 GHz Atom processor and 2 GB of DDR2 RAM when I want to use the pendrive in a slug with a 266 MHz ARM Intel XScale processor and 32/8 MB SD/flash RAM?

ARCHITECTURALLY MINDED DOCUMENTATION?

Since I don't like to repeat myself too much, I don't like to ask it of others either. So, if these questions cannot be answered briefly, I would greatly appreciate links to further documentation. The important thing is not that it explains what the above file operations actually do (I could look to APIs for that), but that it is architecturally minded - that is, that it explains how these operations would typically be used in real-life applications.

TEST RESULTS

Right. I promised the results of my rather humble USB pendrive file system test. My main expectation was generally poor results on writes (a flash drive, given its nature, often has a bigger block size than the file system administering it, meaning that to write a small change, relatively large amounts of unchanged data have to be rewritten) and nice results on reads. The main points turned out to be:

  • vfat did very well on all operations, except the somewhat obscure (to me, anyway) reverse and stride read. I guess the lack of features eliminates a lot of bookkeeping.

  • ntfs sucks on the rewrite and read operations, making it a poor candidate for regular file usage. It also sucks on the pread operation, making it a poor candidate for indexed databases.

  • surprisingly, ext3 and ext4 (the latter marginally better on all operations) suck at the initial write, rewrite, read, random write and pwrite operations, making them poor candidates for regular file usage as well as for intensely updated databases. ext4, though, is a master at random read and pread, making it an excellent candidate for somewhat static databases(?). Both ext3 and ext4 score high on the obscure reverse read and stride read operations, whatever that means.

  • the unsurpassed all-round test winner was xfs, whose only weak point seems to be reverse read. On initial write, rewrite, read, random write and pwrite it is among the best, making it an excellent candidate for regular file usage as well as for (intensely updated) databases. On re-read, random read and pread it is among the runners-up, making it a good candidate for (somewhat static) databases. It also does well on stride read - whatever that means!
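The write penalty hypothesized above (flash erase blocks larger than filesystem blocks) can be put into rough numbers. A sketch with invented geometry - real drives vary, and the 128 KiB erase-block size here is purely illustrative:

```python
# Hypothetical flash geometry: flash is erased in large blocks, so updating
# one small filesystem block can force the drive to rewrite a whole erase
# block (read-modify-write). The 128 KiB figure is an assumption, not a
# measured property of the Verbatim drive.
ERASE_BLOCK = 128 * 1024   # illustrative erase-block size
FS_BLOCK = 4 * 1024        # the 4k block size used in the test

# Worst case: a single 4 KiB change costs a full erase-block rewrite.
amplification = ERASE_BLOCK // FS_BLOCK
print(amplification)  # 32
```

That worst-case factor of 32 is one plausible reason why the small-record write numbers below are so much lower than the read numbers.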

Any comments on the interpretation of these results are most welcome! The numbers are listed below (somewhat cut for reasons of length), one iozone test suite per file system type, all tested on a standard 4GB Verbatim pendrive (orange of colour ;)), docked in a Samsung N105P laptop with an N450 1.66 GHz Atom CPU and 2 GB of DDR2 667 MHz RAM, running a Linux 3.2.0-24 x86 kernel with encrypted swap (yeah, I know, I should install a 64-bit Linux and leave the swap in the clear!).

Regards, Torsten

PS. After writing this I found out that apparently, the Debian NSLU2 distribution does not support xfs. My questions still stand, though!

--- vfat ---

Iozone: Performance Test of File I/O
        Version $Revision: 3.397 $
    Compiled for 32 bit mode.
    Build: linux 

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
             Ben England.

Run began: Mon Jun  4 14:23:57 2012

Record Size 4 KB
File size set to 524288 KB
Command line used: iozone -l 1 -u 1 -r 4k -s 512m -F /mnt/iozone.tmp
Output is in Kbytes/sec
Time Resolution = 0.000002 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 1 
Max process = 1 
Throughput test with 1 process
Each process writes a 524288 Kbyte file in 4 Kbyte records

Children see throughput for  1 initial writers  =   12864.82 KB/sec
Parent sees throughput for  1 initial writers   =    3033.39 KB/sec

Children see throughput for  1 rewriters    =   25271.86 KB/sec
Parent sees throughput for  1 rewriters     =    2876.36 KB/sec

Children see throughput for  1 readers      =  685333.00 KB/sec
Parent sees throughput for  1 readers       =  682464.06 KB/sec

Children see throughput for 1 re-readers    =  727929.94 KB/sec
Parent sees throughput for 1 re-readers     =  726612.47 KB/sec

Children see throughput for 1 reverse readers   =  458174.00 KB/sec
Parent sees throughput for 1 reverse readers    =  456910.21 KB/sec

Children see throughput for 1 stride readers    =  351768.00 KB/sec
Parent sees throughput for 1 stride readers     =  351504.09 KB/sec

Children see throughput for 1 random readers    =  553705.94 KB/sec
Parent sees throughput for 1 random readers     =  552630.83 KB/sec

Children see throughput for 1 mixed workload    =  549812.50 KB/sec
Parent sees throughput for 1 mixed workload     =  547645.03 KB/sec

Children see throughput for 1 random writers    =   19958.66 KB/sec
Parent sees throughput for 1 random writers     =    2752.23 KB/sec

Children see throughput for 1 pwrite writers    =   13355.57 KB/sec
Parent sees throughput for 1 pwrite writers     =    3119.04 KB/sec

Children see throughput for 1 pread readers     =  574273.31 KB/sec
Parent sees throughput for 1 pread readers  =  572121.97 KB/sec

--- ntfs ---

Run began: Mon Jun  4 13:59:37 2012

Record Size 4 KB
File size set to 524288 KB
Command line used: iozone -l 1 -u 1 -r 4k -s 512m -F /mnt/iozone.tmp
Output is in Kbytes/sec
Time Resolution = 0.000002 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 1 
Max process = 1 
Throughput test with 1 process
Each process writes a 524288 Kbyte file in 4 Kbyte records

Children see throughput for  1 initial writers  =   11153.75 KB/sec
Parent sees throughput for  1 initial writers   =    2848.69 KB/sec

Children see throughput for  1 rewriters    =    8723.95 KB/sec
Parent sees throughput for  1 rewriters     =    2794.81 KB/sec

Children see throughput for  1 readers      =   24935.60 KB/sec
Parent sees throughput for  1 readers       =   24878.74 KB/sec

Children see throughput for 1 re-readers    =  144415.05 KB/sec
Parent sees throughput for 1 re-readers     =  144340.90 KB/sec

Children see throughput for 1 reverse readers   =   76627.60 KB/sec
Parent sees throughput for 1 reverse readers    =   76362.93 KB/sec

Children see throughput for 1 stride readers    =  367293.25 KB/sec
Parent sees throughput for 1 stride readers     =  366002.25 KB/sec

Children see throughput for 1 random readers    =  505843.41 KB/sec
Parent sees throughput for 1 random readers     =  500556.16 KB/sec

Children see throughput for 1 mixed workload    =  553075.56 KB/sec
Parent sees throughput for 1 mixed workload     =  551754.97 KB/sec

Children see throughput for 1 random writers    =    9747.23 KB/sec
Parent sees throughput for 1 random writers     =    2381.89 KB/sec

Children see throughput for 1 pwrite writers    =   10906.05 KB/sec
Parent sees throughput for 1 pwrite writers     =    1931.43 KB/sec

Children see throughput for 1 pread readers     =   16730.47 KB/sec
Parent sees throughput for 1 pread readers  =   16194.80 KB/sec

--- ext3 ---

Run began: Sun Jun  3 16:05:27 2012

Record Size 4 KB
File size set to 524288 KB
Command line used: iozone -l 1 -u 1 -r 4k -s 512m -F /media/verbatim/1/iozone.tmp
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 1 
Max process = 1 
Throughput test with 1 process
Each process writes a 524288 Kbyte file in 4 Kbyte records

Children see throughput for  1 initial writers  =    3704.61 KB/sec
Parent sees throughput for  1 initial writers   =    3238.73 KB/sec

Children see throughput for  1 rewriters    =    3693.52 KB/sec
Parent sees throughput for  1 rewriters     =    3291.40 KB/sec

Children see throughput for  1 readers      =  103318.38 KB/sec
Parent sees throughput for  1 readers       =  103210.16 KB/sec

Children see throughput for 1 re-readers    =  908090.88 KB/sec
Parent sees throughput for 1 re-readers     =  906356.05 KB/sec

Children see throughput for 1 reverse readers   =  744801.38 KB/sec
Parent sees throughput for 1 reverse readers    =  743703.54 KB/sec

Children see throughput for 1 stride readers    =  623353.88 KB/sec
Parent sees throughput for 1 stride readers     =  622295.11 KB/sec

Children see throughput for 1 random readers    =  725649.06 KB/sec
Parent sees throughput for 1 random readers     =  723891.82 KB/sec

Children see throughput for 1 mixed workload    =  734631.44 KB/sec
Parent sees throughput for 1 mixed workload     =  733283.36 KB/sec

Children see throughput for 1 random writers    =     177.59 KB/sec
Parent sees throughput for 1 random writers     =     137.83 KB/sec

Children see throughput for 1 pwrite writers    =    2319.47 KB/sec
Parent sees throughput for 1 pwrite writers     =    2200.95 KB/sec

Children see throughput for 1 pread readers     =   13614.82 KB/sec
Parent sees throughput for 1 pread readers  =   13614.45 KB/sec

--- ext4 ---

Run began: Sun Jun  3 17:59:26 2012

Record Size 4 KB
File size set to 524288 KB
Command line used: iozone -l 1 -u 1 -r 4k -s 512m -F /media/verbatim/2/iozone.tmp
Output is in Kbytes/sec
Time Resolution = 0.000005 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 1 
Max process = 1 
Throughput test with 1 process
Each process writes a 524288 Kbyte file in 4 Kbyte records

Children see throughput for  1 initial writers  =    4086.64 KB/sec
Parent sees throughput for  1 initial writers   =    3533.34 KB/sec

Children see throughput for  1 rewriters    =    4039.37 KB/sec
Parent sees throughput for  1 rewriters     =    3409.48 KB/sec

Children see throughput for  1 readers      = 1073806.38 KB/sec
Parent sees throughput for  1 readers       = 1062541.84 KB/sec

Children see throughput for 1 re-readers    =  991162.00 KB/sec
Parent sees throughput for 1 re-readers     =  988426.34 KB/sec

Children see throughput for 1 reverse readers   =  811973.62 KB/sec
Parent sees throughput for 1 reverse readers    =  810333.28 KB/sec

Children see throughput for 1 stride readers    =  779127.19 KB/sec
Parent sees throughput for 1 stride readers     =  777359.89 KB/sec

Children see throughput for 1 random readers    =  796860.56 KB/sec
Parent sees throughput for 1 random readers     =  795138.41 KB/sec

Children see throughput for 1 mixed workload    =  741489.56 KB/sec
Parent sees throughput for 1 mixed workload     =  739544.09 KB/sec

Children see throughput for 1 random writers    =     499.05 KB/sec
Parent sees throughput for 1 random writers     =     399.82 KB/sec

Children see throughput for 1 pwrite writers    =    4092.66 KB/sec
Parent sees throughput for 1 pwrite writers     =    3451.62 KB/sec

Children see throughput for 1 pread readers     =  840101.38 KB/sec
Parent sees throughput for 1 pread readers  =  831083.31 KB/sec

--- xfs ---

Run began: Mon Jun  4 14:47:49 2012

Record Size 4 KB
File size set to 524288 KB
Command line used: iozone -l 1 -u 1 -r 4k -s 512m -F /mnt/iozone.tmp
Output is in Kbytes/sec
Time Resolution = 0.000005 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 1 
Max process = 1 
Throughput test with 1 process
Each process writes a 524288 Kbyte file in 4 Kbyte records

Children see throughput for  1 initial writers  =   21854.47 KB/sec
Parent sees throughput for  1 initial writers   =    3836.32 KB/sec

Children see throughput for  1 rewriters    =   29420.40 KB/sec
Parent sees throughput for  1 rewriters     =    3955.65 KB/sec

Children see throughput for  1 readers      =  624136.75 KB/sec
Parent sees throughput for  1 readers       =  614326.13 KB/sec

Children see throughput for 1 re-readers    =  577542.62 KB/sec
Parent sees throughput for 1 re-readers     =  576533.42 KB/sec

Children see throughput for 1 reverse readers   =  483368.06 KB/sec
Parent sees throughput for 1 reverse readers    =  482598.67 KB/sec

Children see throughput for 1 stride readers    =  537227.12 KB/sec
Parent sees throughput for 1 stride readers     =  536313.77 KB/sec

Children see throughput for 1 random readers    =  525219.19 KB/sec
Parent sees throughput for 1 random readers     =  524062.07 KB/sec

Children see throughput for 1 mixed workload    =  561513.50 KB/sec
Parent sees throughput for 1 mixed workload     =  560142.18 KB/sec

Children see throughput for 1 random writers    =   24118.34 KB/sec
Parent sees throughput for 1 random writers     =    3117.71 KB/sec

Children see throughput for 1 pwrite writers    =   32512.07 KB/sec
Parent sees throughput for 1 pwrite writers     =    3825.54 KB/sec

Children see throughput for 1 pread readers     =  525244.94 KB/sec
Parent sees throughput for 1 pread readers  =  523331.93 KB/sec

No correct solution

OTHER TIPS

The only times I have needed to dig in depth into filesystem performance, I was on Windows systems. The general principles apply no matter what OS/filesystem you are using...

Why would you ever do a reverse read?

As the program runs, it reads block 987654 and then, using that data, determines that it needs block 123456. This might happen on a join: your DB might use the index on table 1 to pick records out of table 2, and the picking might happen in table 1 order, which can be the reverse of the table 2 order.

A similar situation can occur in single-table selects when using two keys.
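Whatever triggers it, the pattern iozone is measuring is simply a backward scan. A hypothetical Python sketch (file layout and sizes invented for the example):

```python
import os
import tempfile

BLOCK = 4096
fd, path = tempfile.mkstemp()
# 16 blocks, each filled with its own index byte so the order is visible.
os.write(fd, b"".join(bytes([i]) * BLOCK for i in range(16)))

# Reverse read: start at the last block and step backwards through the file.
size = os.fstat(fd).st_size
blocks = [os.pread(fd, BLOCK, off) for off in range(size - BLOCK, -1, -BLOCK)]

os.close(fd)
os.unlink(path)
```

This is hostile to most readahead heuristics, which assume forward motion - one reason the reverse-read numbers usually trail the plain read numbers.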

What is a stride read?

Reading every N-th block, e.g. reading block 12345600, then block 12345700, then block 12345800, is a stride of 100. Imagine a table with many and/or large columns. That table might have rows that need several filesystem blocks to hold the data. Typically a database would organize this data into a record for each row, with each record occupying several sequential filesystem blocks. If your DB rows occupy 10 filesystem blocks and you are selecting on two columns, you might only need to read the 1st and 6th blocks of each 10-block record. Your query would then read blocks 10001, 10006, 10011, 10016, 10021, 10026 - a stride of 5.
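That stride-of-5 example can be sketched directly (row layout and sizes are hypothetical, following the description above):

```python
import os
import tempfile

BLOCK = 4096
ROW_BLOCKS = 10    # each row occupies 10 filesystem blocks, as in the example
ROWS = 6

fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(BLOCK * ROW_BLOCKS * ROWS))

# Read only the 1st and 6th block of every row: each read lands 5 blocks
# after the previous one, i.e. a stride of 5.
offsets = []
for row in range(ROWS):
    base = row * ROW_BLOCKS
    offsets += [base * BLOCK, (base + 5) * BLOCK]

reads = [os.pread(fd, BLOCK, off) for off in offsets]

os.close(fd)
os.unlink(path)
```

A good stride-read number therefore suggests the filesystem (and the kernel's readahead) copes well with regular but non-contiguous scans of a single file.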

And the "position" operations pread and pwrite, I've been told, are used with indexed databases, but why not simply keep the index in memory?

The size of the index may exceed a reasonable amount of RAM usage. Or your prior usage called other indexes or data into RAM, causing the unused index to be evicted from the filesystem/DB cache.

Or is that what actually happens, and the pread then comes in handy for jumping to the exact location of a record once given a certain index? Yep, that might be what your database is doing.

What else do you use a pread/pwrite for?

Some data files have predefined "interesting" locations. This might be the root of a B-tree index, a table header, a log/journal tail or something else, depending on your DB implementation. pread/pwrite tests the performance of hopping to a set of specific locations repeatedly, instead of to a uniformly random mix of locations.
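As a sketch of such a predefined "interesting" location, here is a hypothetical file format with a small fixed header at offset 0 that stores where the current record lives; pread/pwrite jump straight to those offsets without touching any shared file position (the 16-byte header layout is invented for the example):

```python
import os
import struct
import tempfile

fd, path = tempfile.mkstemp()

# Hypothetical layout: a header at offset 0 holds the offset and length of
# the "current" record, much like a B-tree root or journal-tail pointer.
record = b"some record payload"
os.pwrite(fd, record, 64)                              # record at a known offset
os.pwrite(fd, struct.pack("<QQ", 64, len(record)), 0)  # header points to it

# A reader preads the header, then preads exactly the record it names.
off, length = struct.unpack("<QQ", os.pread(fd, 16, 0))
payload = os.pread(fd, length, off)

os.close(fd)
os.unlink(path)
```

Because pread/pwrite take the offset as an argument, many threads can do this on the same descriptor concurrently without racing over lseek() - a common reason databases prefer them.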

Links?

There exist system utilities for all mainstream OSes that can capture every OS filesystem operation; on *NIX systems these go by names like DTrace, SystemTap and strace. You can use the mountains of data (intelligently filtered) from these monitors to see the disk access pattern on your system.

The general rule of thumb is that for DB usage, obscene amounts of RAM are helpful. Then all your indexes reside in RAM all the time.

I'm sorry: I cannot add information about the specific system calls you asked about. So I add some opinionated content instead...

In my opinion, iozone is not a very interesting benchmarking tool. And profiling the various system calls isn't that interesting either, I think.

What matters is how the file system works in the Real World. Benchmarking with real-world scenarios can be very time consuming, however; for example, it can take a long time to create a valid test environment. And that's why a benchmark tool does come in handy. But the benchmark tool should work in a way that is as close as possible to real applications; it's also normally nice if the benchmark tool works in a brutal way, so that the limits of the involved hardware/software are explored.

Two benchmark tools which fulfill these requirements are fio and Oracle's Orion. With both tools, it's relatively easy to specify a benchmark which will use a sensible mix of reads and writes, and to specify how parallel the benchmark should run. And it's possible to perform both device-level and FS-level benchmarks; this is nice, because sometimes you want to benchmark storage equipment without the overhead of a specific file system. Compared to Orion, fio has the advantage of an active mailing list where there is a very high probability of good answers (I haven't found a mailing list for Orion).

I can provide some background on two parts of your question. The "backward read" test was introduced as a result of observing the I/O behavior of some mechanical engineering applications. These applications would frequently read from disk sequentially forward and then backward. There was speculation that this was associated with (linear algebra) forward and backward substitution, or that it was related to original implementations which relied on magnetic tape drives.

As for stride access, this was a commonplace I/O pattern for many seismic exploration applications (depth and/or time migration, IIRC). As was the case with the "backward read" scenario, this too was introduced after observing the I/O behavior of those applications.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow