My experience indicates that memory mapping isn't particularly fast, so that would probably be the first thing I'd abandon.
Threading (explicitly or via IOCPs) probably won't do much good either, unless the target system has lots of disk drives and the work can be split up so different threads are operating on different physical drives.
Once you've given up on memory mapping and do explicit I/O, you probably want to use FILE_FLAG_NO_BUFFERING and read relatively large blocks (say, a few megabytes at a time). Do check the alignment requirements on your block of memory though--they're a little tricky (or maybe "tedious" would be a better word to describe them): the buffer address, the file offset, and the transfer size all need to be multiples of the volume's sector size. Because transfers have to be whole sectors, in a typical case you end up opening a file twice, once with FILE_FLAG_NO_BUFFERING to handle the bulk of the data, then again without that flag to handle the "tail" of the file (the code below does this for the output file).
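The bulk/tail split and the alignment check are plain arithmetic, so here is a minimal, portable sketch of them. The names `io_split`, `split_for_unbuffered`, and `is_aligned` are hypothetical helpers made up for illustration, not part of any Windows API, and the sector size is assumed to be known already.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: how a transfer divides between an unbuffered
   handle (whole sectors only) and an ordinary handle (the remainder). */
typedef struct {
    uint64_t bulk;  /* largest multiple of sector_size that is <= length */
    uint32_t tail;  /* leftover bytes, always < sector_size */
} io_split;

/* Hypothetical helper: split `length` bytes at the last sector boundary.
   sector_size is an assumption supplied by the caller and must be nonzero. */
static io_split split_for_unbuffered(uint64_t length, uint32_t sector_size) {
    io_split s;
    assert(sector_size != 0);
    s.tail = (uint32_t)(length % sector_size);
    s.bulk = length - s.tail;
    return s;
}

/* Hypothetical helper: nonzero if p sits on an `alignment`-byte boundary. */
static int is_aligned(const void *p, uintptr_t alignment) {
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

For a 10000-byte transfer with 4096-byte sectors, `split_for_unbuffered(10000, 4096)` gives 8192 bytes for the unbuffered handle and 1808 bytes for the ordinary one.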
Although it only copies a file (rather than processing the contents), and it's probably pure C, not C++, perhaps some demo code will be of at least a little help:
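    #include <stdio.h>
    #include <windows.h>

    #define size (1024 * 8192)   /* 8 MB buffer, a multiple of any common sector size */

    int do_copy(char const *in, char const *out) {
        HANDLE infile;
        HANDLE outfile;
        char *buffer;
        DWORD read = 0, written;
        LONG junk = 0;
        unsigned long little_tail;
        unsigned long big_tail;
        unsigned __int64 total_copied = 0;
        unsigned __int64 total_size = 0;
        BY_HANDLE_FILE_INFORMATION file_info;

        /* VirtualAlloc returns page-aligned memory, which satisfies the alignment
           requirement as long as the sector size is no larger than the page size. */
        buffer = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (NULL == buffer)
            return 0;

        infile = CreateFileA(in,
            GENERIC_READ,
            FILE_SHARE_READ,
            NULL,
            OPEN_EXISTING,           /* the source file must already exist */
            FILE_FLAG_NO_BUFFERING,
            NULL);
        if (infile == INVALID_HANDLE_VALUE) {
            VirtualFree(buffer, 0, MEM_RELEASE);
            return 0;
        }

        GetFileInformationByHandle(infile, &file_info);
        /* One percent of the file size, used to scale the progress display. */
        total_size = ((unsigned __int64)file_info.nFileSizeHigh << 32 | file_info.nFileSizeLow) / 100;

        outfile = CreateFileA(out,
            GENERIC_WRITE,
            FILE_SHARE_READ,
            NULL,
            CREATE_ALWAYS,
            FILE_FLAG_NO_BUFFERING,
            NULL);
        if (outfile == INVALID_HANDLE_VALUE) {
            CloseHandle(infile);
            VirtualFree(buffer, 0, MEM_RELEASE);
            return 0;
        }

        /* Copy full buffers through the unbuffered handles. */
        while (ReadFile(infile, buffer, size, &read, NULL) && read == size) {
            WriteFile(outfile, buffer, read, &written, NULL);
            total_copied += written;
            fprintf(stderr, "\rcopied: %lu %%", (unsigned long)(total_copied / total_size));
        }

        /* The last, short read: write its sector-aligned part unbuffered.
           4096 is assumed here to be a multiple of the sector size. */
        little_tail = read % 4096;
        big_tail = read - little_tail;
        WriteFile(outfile, buffer, big_tail, &written, NULL);
        CloseHandle(infile);
        CloseHandle(outfile);

        /* Reopen without FILE_FLAG_NO_BUFFERING to append the unaligned remainder. */
        outfile = CreateFileA(out,
            GENERIC_WRITE,
            0,
            NULL,
            OPEN_EXISTING,
            FILE_FLAG_SEQUENTIAL_SCAN,
            NULL);
        if (outfile == INVALID_HANDLE_VALUE) {
            VirtualFree(buffer, 0, MEM_RELEASE);
            return 0;
        }
        fprintf(stderr, "\rcopied: 100 %%\n");
        SetFilePointer(outfile, 0, &junk, FILE_END);
        WriteFile(outfile, buffer + big_tail, little_tail, &written, NULL);
        CloseHandle(outfile);
        VirtualFree(buffer, 0, MEM_RELEASE);   /* the size must be 0 with MEM_RELEASE */
        return 1;
    }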
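The progress display depends on joining the two 32-bit halves of the file size before doing any division; since `/` binds tighter than `|`, the join needs its own parentheses. Here is a small portable sketch of that arithmetic on its own. `join_size` and `percent_done` are hypothetical names used only for illustration, and treating an empty file as 100% done is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: combine the high and low 32-bit halves of a file
   size (as reported in nFileSizeHigh / nFileSizeLow) into one 64-bit value. */
static uint64_t join_size(uint32_t high, uint32_t low) {
    return ((uint64_t)high << 32) | low;
}

/* Hypothetical helper: whole-number percentage of `total` covered by `copied`.
   An empty file is reported as 100% (an assumption of this sketch). */
static unsigned percent_done(uint64_t copied, uint64_t total) {
    if (total == 0)
        return 100;
    if (copied <= UINT64_MAX / 100)
        return (unsigned)(copied * 100 / total);  /* exact, cannot overflow */
    return (unsigned)(copied / (total / 100));    /* huge sizes: divide first */
}
```

For example, `join_size(1, 0)` is 4294967296 (4 GB), and `percent_done(8388608, 16777216)` is 50.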