I want to support around 10,000 simultaneous HTTP clients on a small cluster of machines (as small as possible). I'd like to keep a connection to each client alive while the user is using the application, to allow the server to push updates.

I believe that async IO is often recommended for these kinds of long-lived connections, to avoid having lots of threads sitting idle. But what are the issues in having threads sitting idle? I find the threaded model mentally easier to work with, but I don't want to do something that is going to cause me major headaches. I guess I'll have to experiment, but I wondered if anyone knows of any previous experiments along these lines?


Solution

Asynchronous I/O basically means that your application does most of the thread scheduling. Instead of letting the OS randomly suspend your thread and schedule another one, you have only as many threads as there are CPU cores and yield to other tasks at the most appropriate points—when the thread reaches an I/O operation, which will take some time.
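
To make that scheduling model concrete, here is a minimal sketch of a single-threaded event loop built on Java NIO's Selector; the port and the echo-style handling are arbitrary choices for the example, not anything prescribed by your problem:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class EventLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                    // block until some channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {         // new connection: register it, don't block
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {    // data arrived: handle it and move on
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) < 0) {
                        key.cancel();
                        client.close();
                    }
                    // ... parse request, queue work, write response later ...
                }
            }
        }
    }
}
```

One thread services every connection; it "yields" to other work simply by returning to select(), which is the application-controlled scheduling point described above.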

The above seems like a clear win from the performance standpoint, but the asynchronous programming model is much more complex in several regards:

  1. it can't be expressed as a single function call so the workflow is not obvious, especially when transfer of control flow due to exceptions is considered;
  2. without specifically targeted support from the programming language, the idioms are very messy: spaghetti code and/or an extremely weak signal-to-noise ratio are the norm;
  3. mostly due to 1. above, debugging is much more difficult, as the stack trace does not represent the progress of a unit of work as a whole (see the sketch after this list);
  4. execution jumps from thread to thread within a pool (or even several pools, where each layer of abstraction has its own), so profiling and monitoring with the usual tools are rendered virtually useless.
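
A toy illustration of points 1 and 3, using CompletableFuture; the helpers loadUser, loadOrders, and render are hypothetical stand-ins for real I/O. One logical request is split across three callbacks, and an exception inside any of them surfaces with a stack trace rooted in a pool worker thread, not in handleRequest:

```java
import java.util.concurrent.CompletableFuture;

public class FragmentedFlow {
    // One logical "unit of work", split across three callbacks that may each
    // run on a different pool thread.
    static CompletableFuture<String> handleRequest(String userId) {
        return loadUser(userId)                         // hop 1: pool thread A
                .thenCompose(user -> loadOrders(user))  // hop 2: possibly thread B
                .thenApply(orders -> render(orders));   // hop 3: possibly thread C
    }

    static CompletableFuture<String> loadUser(String id) {
        return CompletableFuture.supplyAsync(() -> "user:" + id);
    }

    static CompletableFuture<String> loadOrders(String user) {
        return CompletableFuture.supplyAsync(() -> "orders-of-" + user);
    }

    static String render(String orders) {
        return "<html>" + orders + "</html>";
    }

    public static void main(String[] args) {
        System.out.println(handleRequest("42").join());
    }
}
```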

On the other hand, modern OSes have seen many improvements and optimizations that mostly eliminate the performance downsides of synchronous I/O programming (see the sketch after this list):

  • the address space is huge, so space reserved for stacks isn't a problem;
  • the actual physical RAM load of call stacks is not very large, as only the part of the stack a thread actually uses is committed to RAM, and a call stack doesn't normally exceed 64 KB;
  • context switching, which used to be prohibitively expensive for larger thread counts, has been improved to the point where its overhead is negligible for all practical purposes.
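
Those points are why the straightforward blocking model stays viable even at high connection counts. A minimal thread-per-connection sketch, where the port and the trivial echo handler are placeholders:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerConnectionServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();            // blocks until a client connects
                new Thread(() -> handle(client)).start();   // one thread per connection
            }
        }
    }

    static void handle(Socket client) {
        try (Socket c = client) {
            // Blocking read/write: while waiting for data the thread simply
            // sleeps in the kernel, costing little beyond its stack memory.
            int b;
            while ((b = c.getInputStream().read()) != -1) {
                c.getOutputStream().write(b);               // trivial echo handler
            }
        } catch (IOException ignored) {
        }
    }
}
```

The whole lifetime of a connection reads as one straight-line function, which is exactly the mental-model advantage the question is asking about.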

A classic paper that covers much of the above, along with some other points, is a good complement to what I am saying here:

https://www.usenix.org/legacy/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html

Other tips

There are already some good pointers in the comments of your question.

The reason for not using 10K threads is that threads cost memory, and memory costs energy. The programming model is not an argument here, because the thread sitting on the client connection need not be the same one that posts the event.

Please take a look at the WebSocket standard and the asynchronous request processing model in the Servlet 3.0 standard. All recent Java web application servers implement it now (e.g. GlassFish and Tomcat), and it is the solution for your problem.
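
To make the Servlet 3.0 model concrete, here is a sketch of a push endpoint; the class name, URL pattern, and broadcast method are invented for the example. The container thread is released immediately, and any application thread can later write to the stored AsyncContext:

```java
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/events", asyncSupported = true)
public class PushServlet extends HttpServlet {
    // Connections waiting for an update; no thread is parked on any of them.
    private final Queue<AsyncContext> waiting = new ConcurrentLinkedQueue<>();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync();  // detach from the container thread
        ctx.setTimeout(0);                    // keep the connection open indefinitely
        waiting.add(ctx);
    }

    // Called from any application thread when there is something to push.
    public void broadcast(String message) {
        AsyncContext ctx;
        while ((ctx = waiting.poll()) != null) {
            try {
                ctx.getResponse().getWriter().write(message);
                ctx.complete();               // finish the request
            } catch (IOException ignored) {
            }
        }
    }
}
```

Note how this realizes the point above: the thread that calls broadcast() is not the thread that accepted the connection.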

The question as posed cannot be answered, since the OS, JVM, and application server you will use are not specified. However, you can test it quite quickly yourself by creating a servlet or JSP that does Thread.sleep(9999999) and running siege -c 10000 ... against it.
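
That experiment might look like the following (the URL pattern is an arbitrary choice); deploy it and point siege at its address while watching memory and CPU on the server:

```java
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/hold")
public class HoldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        try {
            // Park the container thread, simulating a long-lived idle connection.
            Thread.sleep(9999999);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Then something like siege -c 10000 against the /hold URL (adjust host and context path for your deployment) shows how your particular container behaves with that many blocked request threads.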

10,000 simultaneous HTTP clients...what are the issues in having threads sitting idle?

It seems that the cost of an idle thread is only the memory allocated for kernel structures (a few KB) and for the thread's stack (512 KB to several MB).
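
One way to check the memory side of that claim is to park a large number of threads and watch the process's resident set size; in this sketch the thread count and the requested stack size are arbitrary knobs to experiment with:

```java
import java.util.concurrent.locks.LockSupport;

public class IdleThreadCost {
    public static void main(String[] args) throws InterruptedException {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 10_000;
        long stackSize = 256 * 1024;  // requested stack size; 0 means the JVM default

        Runnable idle = () -> {
            while (true) {
                LockSupport.park();   // sleep in the kernel; costs little but memory
            }
        };
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(null, idle, "idle-" + i, stackSize);
            t.setDaemon(true);
            t.start();
        }
        System.out.println(count + " threads parked; inspect RSS with e.g. top or pmap");
        Thread.sleep(Long.MAX_VALUE);  // keep the process alive for inspection
    }
}
```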

But obviously, you are going to wake up each of your n hundred threads from time to time, right? And that is the moment when you pay the cost of context switching, which may not be so small (time to call the system scheduler, more cache misses, etc.). See, for instance: http://www.cs.rochester.edu/u/cli/research/switch.pdf

And you will have to pin your threads very carefully so that they don't interfere with the system ones. As a result, a thread-per-connection (blocking I/O) architecture can increase the latency of the system compared to async I/O. But it can still work for your case if almost all of the threads are parked most of the time.

And a final word: we don't know how much time your threads will spend blocked on read(), nor how much work they will need to do to process the received data, nor what hardware, OS, and network interfaces will be used. So, test a prototype of your system.
