This is all perfectly normal, you are observing the behavior of the file system cache. Which is a chunk of RAM that's is set aside by the operating system to buffer disk data. It is normally a fat gigabyte, can be much more if your machine has lots of RAM. Sounds like you've got 4 GB installed, not that much for a 64-bit operating system. Depends however on the RAM needs of other processes that run on the machine.
Your calls to fwrite() or ofstream::write() write to a small buffer created by the CRT, it in turns make operating system calls to flush full buffers. The OS writes normally completely very quickly, it is a simple memory-to-memory copy going from the CRT buffer to the file system cache. Effective write speed is in excess of a gigabyte/second.
The file system driver lazily writes the file system cache data to the disk. Optimized to minimize the seek time on the write head, by far the most expensive operation on the disk drive. Effective write speed is determined by the rotational speed of the disk platter and the time needed to position the write head. Typical is around 30 megabytes/second for consumer-level drives, give or take a factor of 2.
Perhaps you see the fire-hose problem here. You are writing to the file cache a lot faster than it can be emptied. This does hit the wall eventually, you'll manage to fill the cache to capacity and suddenly see the perf of your program fall off a cliff. Your program must now wait until space opens up in the cache so the write can complete, effective write speed is now throttled by disk write speeds.
The 20 msec delays you observe are normal as well. That's typically how long it takes to open a file. That is a time that's completely dominated by disk head seek times, it needs to travel to the file system index to write the directory entry. Nominal times are between 20 and 50 msec, you are on the low end of that already.
Clearly there is very little you can do in your code to improve this. What CRT functions you use certainly don't make any difference, as you found out. At best you could increase the size of the files you write, that reduces the overhead spent on creating the file.
Buying more RAM is always a good idea. But it of course merely delays the moment where the firehose overflows the bucket. You need better drive hardware to get ahead. An SSD is pretty nice, so is a striped raid array. Best thing to do is to simply not wait for your program to complete :)