IOCP - post overlapped or read packet?

Question 1

The answer depends on the infrastructure you are using. Generally, the best thing is to do nothing. I know this sounds weird, so let me explain.

When the OS talks to a NIC, it generally has at least one pair of RX/TX ring buffers and, on commodity hardware, is most likely talking to the device over a PCIe bus. The NIC uses DMA over that bus to read and write host memory without involving the CPU. In other words, while the NIC is active, it reads and writes packets on its own, with minimal CPU intervention. There are, of course, a lot of details, but at the driver level you can generally assume that this is what is going on: reads and writes are always performed by the NIC using DMA, whether or not your application reads or writes anything.

On top of that sits the OS infrastructure that allows user-space applications to send and receive data to/from the NIC. When you open a socket, the OS determines what kind of data your application is interested in and adds an entry to a list of applications talking to the network interface. From then on, incoming data for the application is placed in some sort of per-application queue in the kernel. It doesn't matter whether you call read() or not; the data is placed there. Once the data is there, the application gets notified. The notification mechanisms in the kernel vary, but they all share a similar idea: let the application know that data is available so it can call read().

Once the data is in that "queue", the application can pick it up by calling read(). The difference between a blocking and a non-blocking read is simple: if the read is blocking, the kernel suspends the application until data has arrived. With a non-blocking read, control returns to the application in any case, with or without data. If the latter happens, the application can either keep trying (aka spin on the socket) or wait for a notification from the kernel saying that data is available, and then proceed to read it.

Now let's get back to "doing nothing". What it means is that the socket is registered to receive notifications only once. Once registered, the application doesn't have to do anything but wait for a notification saying "the data is there". So the application should listen for that notification and perform the read only when the data is there. Once enough data has been received, the app can start processing it. Knowing all that, let's see which of the three approaches is better...

Post another overlapped read on the socket, this time with the size of the packet so it receives it in the next completion?

This is a good approach. Ideally, you wouldn't have to "post" anything, but this depends on how good the OS interface is. If you cannot "register" your application once and then keep receiving notifications whenever new data is available, calling read() when it is, then posting an asynchronous read request is the next best thing.
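As a rough illustration of the "register once, read only when notified" idea, here is a minimal sketch using Python's standard selectors module on a local socket pair. Note that this is a readiness-style interface, not IOCP itself (IOCP reports completed operations rather than readiness); the names on_readable and received are made up for the example.

```python
import selectors
import socket

# A connected pair of local sockets stands in for a real network connection.
reader, writer = socket.socketpair()
reader.setblocking(False)

sel = selectors.DefaultSelector()
received = []  # hypothetical sink for whatever arrives


def on_readable(sock):
    # Called only after the kernel reported data, so recv() will not block.
    data = sock.recv(4096)
    if data:
        received.append(data)


# Register once; the selector keeps reporting readiness from then on.
sel.register(reader, selectors.EVENT_READ, on_readable)

writer.sendall(b"hello")
writer.sendall(b" world")

# Wait for notifications instead of spinning on the socket.
while sum(len(chunk) for chunk in received) < 11:
    for key, _events in sel.select(timeout=1.0):
        key.data(key.fileobj)

sel.unregister(reader)
sel.close()
reader.close()
writer.close()
```

The application never polls the socket itself; it only reacts when the selector says there is something to read.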

Read inside the routine the whole packet using blocking sockets and then post another overlapped with recv with 9 bytes?

This is a good approach if your application has absolutely nothing else to do and you have only one socket to read from. In other words, it is the easy way: very easy to program, the OS takes care of completions itself, etc. Keep in mind, though, that once you have more than one socket to read from, you will have to either do a very stupid thing like having a thread per socket (terrible!) or rewrite your application using the first approach.

Read in chunks (decide the size) say - 4096 and have a counter to keep reading each overlapped completion until the data was read (say it would complete 12 times till all the packet was read).

This is the way to go! In fact, this is almost the same as approach #1, with a nice optimization: make as few round trips to the kernel as possible and read as much as possible in one go. At first I wanted to amend the first approach with these details, but then I noticed you've done it yourself.

Hope it helps. Good Luck!

Question 2

Vlad's answer is interesting but somewhat OS-agnostic and theoretical. Here's something a little more focused on the design considerations for IOCP.

It seems that you are reading a stream of messages from a TCP connection, where each message starts with a fixed-size, 9-byte header that gives the length of the complete message.

Please bear in mind that each overlapped read completion can return anything between 1 byte and the size of your buffer. You should NOT assume that you can issue a 9-byte read and always get a complete header, and you should NOT assume that you can then issue a read with a buffer big enough for a complete message and receive that message in its entirety when the read completes. You WILL need to deal with completions that return fewer bytes than you expect. The best way to deal with this is to advance the WSABUF's buffer pointer (and reduce its length to match) so that the subsequent overlapped read places more data in the buffer just beyond the point where this read finished...
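The short-read handling described above can be sketched in a few lines. This is only an illustration in Python using blocking calls on a local socket pair for brevity, not overlapped I/O; read_exact is a hypothetical helper, and the slice passed to recv_into plays the role that advancing WSABUF.buf and shrinking WSABUF.len would play before re-posting a WSARecv.

```python
import socket


def read_exact(sock, size):
    """Fill a buffer of `size` bytes, tolerating short reads."""
    buf = bytearray(size)
    view = memoryview(buf)
    filled = 0
    while filled < size:
        # Read into the unfilled tail only, so new data lands just past
        # the point where the previous read finished.
        n = sock.recv_into(view[filled:])
        if n == 0:
            raise ConnectionError("peer closed before the buffer was filled")
        filled += n
    return bytes(buf)


# Demonstration: the 9-byte header is sent in two separate parts.
a, b = socket.socketpair()
a.sendall(b"HEAD")
a.sendall(b"ER123")
header = read_exact(b, 9)
a.close()
b.close()
```

Whether the two sends arrive as one read or two, the caller only ever sees the complete 9 bytes.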

The best way to read the data will depend on the following things:

how big the messages are (on average, and the largest possible message size)
how many connections you're likely to be dealing with
whether you can process messages in pieces or only as complete messages
whether the peer can send multiple messages without a response from you or if it's a "message-response" style protocol

Most of the decisions about how to read the data using IOCP come down to where data copying will occur and how convenient you want to make the data that you're processing. Assuming that you have NOT turned off socket-level read buffering, there is likely to be a data copy whenever you read data. The TCP stack will be accumulating data in its per-socket read buffers, and your overlapped reads will be copying this into your own buffer and returning it to you.

The easiest situation is if you can process messages in pieces as they arrive. In this case, simply issue an overlapped read for your full buffer size and process the completion (the buffer will contain between 1 byte and the buffer size of data). Then issue a new read (possibly into the end of the same buffer), repeating until you have enough data to process, and process that data until you need to read more. The advantage is that you issue the minimum number of overlapped reads (for your buffer size), which reduces the number of user-mode to kernel-mode transitions.
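To make the "process in pieces" case concrete, here is a small sketch that consumes each completion's data as soon as it arrives without ever holding the whole message. It is a simulation, not IOCP code: PiecewiseBody is a hypothetical class, the completions are faked by slicing a byte string into pieces of at most 4096 bytes, and a running CRC-32 stands in for whatever real processing you would do.

```python
import zlib


class PiecewiseBody:
    """Hypothetical consumer that processes a message body as pieces arrive."""

    def __init__(self, body_length):
        self.remaining = body_length  # bytes of this body still expected
        self.crc = 0                  # running CRC-32 stands in for real work

    def feed(self, piece):
        """Process one completion's data; return bytes of the next message."""
        take = min(len(piece), self.remaining)
        self.crc = zlib.crc32(piece[:take], self.crc)
        self.remaining -= take
        return piece[take:]


# Simulate completions of at most 4096 bytes for a 10240-byte body,
# followed by the first bytes of the next message.
body = bytes(range(256)) * 40
stream = body + b"NEXT"
consumer = PiecewiseBody(len(body))
leftover = b""
for start in range(0, len(stream), 4096):
    leftover = consumer.feed(stream[start:start + 4096])
```

Any bytes returned by feed belong to the following message and would be handed to the next consumer.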

If you MUST process messages as complete messages, then how you handle them depends on how big they can be and how big your buffers are. You COULD issue a read for a header (by specifying that the buffer is only 9 bytes in length) and then issue more overlapped reads to accumulate the complete message into one or more buffers (by adjusting the buffer start and length as you go), chaining the buffers together inside your 'per-connection' data structure. Alternatively, do not issue a "special" read for the header, and deal with the possibility of a single read returning more than one message.
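The second alternative, with no special header read and a single read that may contain several messages, might look roughly like the sketch below. The header layout is an assumption made purely for illustration (1-byte message type followed by an 8-byte big-endian total length that includes the header); the actual 9-byte format is not given here. MessageFramer and pack are hypothetical names.

```python
import struct

# Hypothetical 9-byte header: 1-byte type + 8-byte big-endian total length
# (header included). The real layout is not specified in this thread.
HEADER = struct.Struct(">BQ")


class MessageFramer:
    """Per-connection accumulator that splits a byte stream into messages."""

    def __init__(self):
        self.pending = bytearray()

    def feed(self, data):
        """Add one read's worth of bytes; return every complete message."""
        self.pending += data
        messages = []
        while len(self.pending) >= HEADER.size:
            msg_type, total = HEADER.unpack_from(self.pending)
            if total < HEADER.size:
                raise ValueError("length field smaller than the header")
            if len(self.pending) < total:
                break  # body incomplete; wait for the next read
            messages.append((msg_type, bytes(self.pending[HEADER.size:total])))
            del self.pending[:total]
        return messages


def pack(msg_type, body):
    """Build one message in the assumed format."""
    return HEADER.pack(msg_type, HEADER.size + len(body)) + body


framer = MessageFramer()
stream = pack(1, b"first") + pack(2, b"second message")
# Deliver the stream awkwardly: first a partial header, then everything else.
out = framer.feed(stream[:5])
out += framer.feed(stream[5:])
```

In a real server you would also cap the accepted length so that a bad or hostile header cannot make the per-connection buffer grow without bound.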

I have some example IOCP servers that do most of this; you can download them from here and read about them in the accompanying articles.