Question

I have noticed that reading a file byte-by-byte takes more time to read the whole file than reading it using fread.

According to cplusplus:
size_t fread ( void * ptr, size_t size, size_t count, FILE * stream );

Reads an array of count elements, each one with a size of size bytes, from the stream and stores them in the block of memory specified by ptr.

Q1) So if fread also reads the file 1 byte at a time, isn't it the same as the byte-by-byte method?

Q2) Results show that fread still takes less time.

From here:

I ran this with a file of approximately 44 megabytes as input. When compiled with VC++2012, I got the following results:

using getc Count: 400000 Time: 2.034
using fread Count: 400000 Time: 0.257

Also, a few posts on SO say that it depends on the OS.
Q3) What is the role of the OS?

Why is this so, and what exactly goes on behind the scenes?


Solution

fread does not read a file one byte at a time. The interface, which lets you specify size and count separately, is purely for your convenience. Behind the scenes, fread will simply read size * count bytes.

The number of bytes that fread will try to read at once is highly dependent on your C implementation and the underlying filesystem. Unless you're intimately familiar with both, it's often safe to assume that fread will be closer to optimal than anything you invent yourself.
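
For illustration, here is a minimal sketch (the file name test_file and the 64 KB block size are just placeholder choices) of reading a file in large blocks. Note that fread(buffer, 1, 65536, fp) and fread(buffer, 65536, 1, fp) both request 65536 bytes per call; they only differ in how the return value is counted:

#include <stdio.h>

int main(void) {
    FILE *fp = fopen("test_file", "rb");   /* placeholder file name */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    char buffer[65536];
    unsigned long long total = 0;

    /* Each call asks the C library for up to 64 KB; the library in turn
       issues whatever lower-level reads it considers appropriate. */
    for (;;) {
        size_t got = fread(buffer, 1, sizeof buffer, fp);
        if (got == 0) {
            break;
        }
        total += got;
    }

    printf("read %llu bytes\n", total);
    fclose(fp);
    return 0;
}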

EDIT: physical disks tend to have a relatively high seek time compared to their throughput. In other words, they take relatively long to start reading. But once started, they can read consecutive bytes relatively fast. So without any OS/filesystem support, any call to fread would result in a severe overhead to start each read. So to utilize your disk efficiently, you'll want to read as many bytes at once as possible. But disks are slow compared to CPU, RAM and physical caches. Reading too much at once means your program spends a lot of time waiting for the disk to finish reading, when it could have been doing something useful (like processing already read bytes).

This is where the OS/filesystem comes in. The smart people who work on those have spent a lot of time figuring out the right amount of bytes to request from a disk. So when you call fread and request X bytes, the OS/filesystem will translate that to N requests for Y bytes each. Where Y is some generally optimal value that depends on more variables than can be mentioned here.

Another role of the OS/filesystem is what's called 'readahead'. The basic idea is that most IO occurs inside loops. So if a program requests some bytes from disk, there's a very good chance it'll request the next bytes shortly afterwards. Because of this, the OS/filesystem will typically read slightly more than you actually requested at first. Again, the exact amount depends on too many variables to mention. But basically, this is the reason that reading a single byte at a time is still somewhat efficient (it would be another ~10x slower without readahead).

In the end, it's best to think of fread as giving some hints to the OS/filesystem about how many bytes you'll want to read. The more accurate those hints are (closer to the total amount of bytes you'll want to read), the better the OS/filesystem will optimize the disk IO.
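
On POSIX systems you can also pass such a hint explicitly. The following is a minimal sketch, assuming posix_fadvise is available (it is not part of standard C, and the file name is a placeholder); it tells the kernel that the file will be read sequentially, so it may use a larger readahead window:

#define _XOPEN_SOURCE 600   /* expose posix_fadvise and fileno */
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    FILE *fp = fopen("test_file", "rb");   /* placeholder file name */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    /* Hint: we intend to read the whole file sequentially, so the kernel
       may read ahead more aggressively. Failure here is non-fatal. */
    posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);

    char buffer[65536];
    size_t got;
    unsigned long long total = 0;
    while ((got = fread(buffer, 1, sizeof buffer, fp)) > 0) {
        total += got;
    }

    printf("read %llu bytes\n", total);
    fclose(fp);
    return 0;
}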

OTHER TIPS

Protip: Use your profiler to identify the most significant bottlenecks in an actual, real-life problem...

Q1) So if fread also reads the file 1 byte at a time, isn't it the same as the byte-by-byte method?

Is there anything in the manual to suggest that bytes can only be read one at a time? Flash memory, which is becoming more and more common, typically requires that your OS read chunks as large as 512 KB at a time. Perhaps your OS performs buffering for your benefit, so you don't have to inspect the entire amount...

Q2) Results show that fread still takes less time.

Logically speaking, that's a fallacy. There is no requirement that fgetc be any slower at retrieving a block of bytes than fread. In fact, an optimal compiler may very well produce the same machine code following optimisation passes.

In reality, it also turns out to be invalid. Most proofs (for example, the ones you're citing) neglect to consider the influence that setvbuf (or stream.rdbuf()->pubsetbuf, in C++) has.

The empirical evidence below, however, integrates setvbuf and, at least on every implementation I've tested it on, has shown fgetc to be roughly as fast as fread at reading a large block of data, within some meaningless margin of error that swings either way... Please, run these tests multiple times and let me know if you find a system where one of these is significantly faster than the other. I suspect you won't. There are two programs to build from this code:

gcc -o fread_version -std=c99 file.c
gcc -o fgetc_version -std=c99 -DUSE_FGETC file.c

Once both programs are compiled, generate a test_file containing a large number of bytes and you can test like so:

time cat test_file | ./fread_version
time cat test_file | ./fgetc_version
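
If you don't already have a suitable test_file, one possible way to generate one (assuming a Unix-like system; the size is arbitrary and the dd block-size syntax may need adjusting for your platform) is:

dd if=/dev/urandom of=test_file bs=1M count=100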

Without further ado, here's the code:

#include <assert.h>
#include <stdio.h>

int main(void) {
    unsigned int criteria[2] = { 0 };  /* counts of non-'a' and 'a' bytes */

#   ifdef USE_FGETC
    /* Give stdin a large, fully buffered (64 KB) buffer so most fgetc
       calls are served from memory rather than a separate system call. */
    int n = setvbuf(stdin, NULL, _IOFBF, 65536);
    assert(n == 0);

    for (;;) {
        int c = fgetc(stdin);
        if (c < 0) {
            break;
        }
        criteria[c == 'a']++;
    }
#   else
    /* Read stdin in 64 KB blocks, then scan each block with a nested loop. */
    char buffer[65536];
    for (;;) {
        size_t size = fread(buffer, 1, sizeof buffer, stdin);
        if (size == 0) {
            break;
        }
        for (size_t x = 0; x < size; x++) {
            criteria[buffer[x] == 'a']++;
        }
    }
#   endif

    printf("%u %u\n", criteria[0], criteria[1]);

    return 0;
}

P.S. You might have even noticed the fgetc version is simpler than the fread version; it doesn't require a nested loop to traverse the characters. That should be the lesson to take away, here: Write code with maintenance in mind, rather than performance. If necessary, you can usually provide hints (such as setvbuf) to optimise bottlenecks that you've used your profiler to identify.

P.P.S. You did use your profiler to identify this as a bottleneck in an actual, real-life problem, right?

It depends on how you are reading byte-by-byte. But there is significant overhead to each call to fread (which may ultimately need to make an OS/kernel call).

If you call fread 1000 times to read 1000 bytes one by one then you pay that cost 1000 times; if you call fread once to read 1000 bytes then you only pay that cost once.
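
As a minimal sketch of the two patterns (the 1000-byte figure is just the example size from above, and test_file is a placeholder name; both versions go through the C library's buffer, but the first still pays the per-call overhead 1000 times):

#include <stdio.h>

/* Pays the per-call cost 1000 times: one fread per byte. */
static size_t read_one_by_one(FILE *fp, unsigned char *buf) {
    size_t total = 0;
    for (int i = 0; i < 1000; i++) {
        if (fread(&buf[total], 1, 1, fp) != 1) {
            break;
        }
        total++;
    }
    return total;
}

/* Pays the per-call cost once: a single fread for all 1000 bytes. */
static size_t read_in_bulk(FILE *fp, unsigned char *buf) {
    return fread(buf, 1, 1000, fp);
}

int main(void) {
    unsigned char buf[1000];
    FILE *fp = fopen("test_file", "rb");   /* placeholder file name */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    printf("byte-by-byte: %zu bytes\n", read_one_by_one(fp, buf));
    rewind(fp);
    printf("bulk:         %zu bytes\n", read_in_bulk(fp, buf));
    fclose(fp);
    return 0;
}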

Consider what's physically happening with the disk. Every time you ask it to perform a read, its head must seek to the correct position and then wait for the right part of the platter to spin under it. If you do 100 separate 1-byte reads, you have to do that 100 times (as a first approximation; in reality the OS probably has a caching policy that's smart enough to figure out what you're trying to do and read ahead). But if you read 100 bytes in one operation, and those bytes are roughly contiguous on the disk, you only have to do all of this once.

Hans Passant's comment about caching is right on the money too, but even in the absence of that effect, I'd expect 1 bulk read operation to be faster than many small ones.

Other contributors to the slowdown are instruction pipeline reloads and databus contention. Data cache misses behave similarly to instruction pipeline reloads, so I am not covering them here.

Function calls and Instruction Pipeline

Internally, the processor has an instruction pipeline in cache (fast memory physically near the processor). The processor fills the pipeline with instructions, executes them, and then fills the pipeline again. (Note: some processors may fetch instructions as slots open up in the pipeline.)

When a function call is executed, the processor encounters a branch statement. The processor can't fetch any new instructions into the pipeline until the branch is resolved. If the branch is taken, the pipeline may need to be reloaded, wasting time. (Note: some processors can read enough instructions into the cache that no further instruction fetches are necessary. An example is a small loop.)

Worst case, when you call the read function 1000 times, you cause 1000 reloads of the instruction pipeline. If you call the read function once, the pipeline is only reloaded once.

Databus Collisions

Data flows through a databus from the hard drive to the processor, then from the processor to the memory. Some platforms allow Direct Memory Access (DMA) from the hard drive to memory. In either case, multiple users contend for the data bus.

The most efficient use of the databus is to send large blocks of data. When a user (a component such as the processor or the DMA controller) wants to use the databus, it must wait for the bus to become available. Worst case, another user is sending large blocks, so there is a long delay. When sending 1000 bytes one at a time, the user has to wait 1000 times for other users to give up the databus.

Picture waiting in a queue (line) at a market or restaurant. You need to purchase many items, but you purchase one, then have to go back and wait in line again. Or you could be like other shoppers and purchase many items. Which consumes more time?

Summary

There are many reasons to use large blocks for I/O transfers. Some involve the physical drive; others involve instruction pipelines, data caches, and databus contention. By reducing the number of data requests and increasing the size of each request, the cumulative time is reduced. One request has far less overhead than 1000 requests: if the overhead is 1 millisecond, one request costs 1 millisecond of overhead, while 1000 requests cost 1 second.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow