How is a warp formed and handled by the hardware warp scheduler?

Question 1

Correcting some misconceptions:

A. ...From here, I think a warp(32 threads) is scheduled twice since 16 cores out of 32 are grouped together.

When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores run at twice the scheduler clock (Fermi's "hotclock"), so each core executes two threads' worth of computation in a single scheduler cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice.

B. ...Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.

It's true that all threads in a block (and therefore all warps) are executing from the same instruction stream, but they are not necessarily executing the same instruction. Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time. The diagram on page 10 of the Fermi whitepaper makes this evident.

Q1: Which part handles the threads grouping (into warps)? software or hardware?

It is done by hardware, as explained in the hardware implementation section of the programming guide: "The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block."
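To make the partitioning rule concrete, here is a minimal host-side sketch in plain C++ (not device code). It assumes a warp size of 32 and the linear thread ID formula from the programming guide's Thread Hierarchy section; the names `linear_thread_id`, `warp_slot`, and `WarpSlot` are made up for illustration and are not part of any CUDA API.

```cpp
#include <cassert>

// Assumed warp size (32, as on Fermi).
constexpr int kWarpSize = 32;

// Hypothetical helper type: which warp a thread lands in, and its lane there.
struct WarpSlot {
    int warp;
    int lane;
};

// Linear thread ID inside a block for thread index (x, y, z) and block
// dimensions (dim_x, dim_y, *): x + y*Dx + z*Dx*Dy.
inline int linear_thread_id(int x, int y, int z, int dim_x, int dim_y) {
    assert(dim_x > 0 && dim_y > 0);
    assert(x >= 0 && x < dim_x && y >= 0 && y < dim_y && z >= 0);
    return x + y * dim_x + z * dim_x * dim_y;
}

// Consecutive linear IDs fill warp 0 first, then warp 1, and so on.
inline WarpSlot warp_slot(int linear_tid) {
    assert(linear_tid >= 0);
    return WarpSlot{linear_tid / kWarpSize, linear_tid % kWarpSize};
}
```

For example, in a 16x4 block the thread with index (3, 2, 0) has linear ID 35, so `warp_slot(35)` places it in warp 1, lane 3.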

and how the hardware warp scheduler is implemented and work?

I don't believe this is formally documented anywhere. Greg Smith has provided various explanations about it, and you may wish to search on "user:124092 scheduler" or a similar query to read some of his comments.

Q2. If I have 64 threads, threads 0-15 and 32-47 are executing the same instruction while 16-31 and 48-63 executes another instruction, is the scheduler smart enough to group nonconsecutive threads( with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into the same warp, and to group threads 16-31 and 48-63 into another warp)?

This question is predicated on misconceptions outlined earlier. The grouping of threads into a warp is not dynamic; it is fixed at threadblock launch time, and it follows the methodology described above in the answer to Q1. Furthermore, threads 0-15 will never be scheduled with any threads other than 16-31, as 0-31 comprise a warp, which is indivisible for scheduling purposes, on Fermi.
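The scenario in Q2 can be checked with a small host-side model in plain C++ (not device code). It assumes a warp size of 32 and a hypothetical branch in which each group of 16 consecutive threads alternates between two paths; `path_of` and `paths_per_warp` are illustrative names only.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Assumed warp size (32, as on Fermi).
constexpr int kWarpSize = 32;

// Hypothetical branch from Q2: threads 0-15 and 32-47 take path 0,
// threads 16-31 and 48-63 take path 1.
inline int path_of(int tid) {
    return (tid / 16) % 2;
}

// For each fixed warp of a block, count how many distinct paths its threads take.
inline std::vector<int> paths_per_warp(int num_threads) {
    assert(num_threads > 0);
    const int num_warps = (num_threads + kWarpSize - 1) / kWarpSize;
    std::vector<int> result;
    for (int w = 0; w < num_warps; ++w) {
        std::set<int> paths;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            const int tid = w * kWarpSize + lane;  // grouping is fixed: consecutive IDs
            if (tid < num_threads) {
                paths.insert(path_of(tid));
            }
        }
        result.push_back(static_cast<int>(paths.size()));
    }
    return result;
}
```

With 64 threads, `paths_per_warp(64)` reports two paths in warp 0 and two paths in warp 1: because the grouping is fixed, both warps diverge, and nothing regroups threads 0-15 with 32-47.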

Q3. What's the point to have a warp size(32) larger than the scheduling group size(16 cores)?

Again, I believe this question is predicated on previous misconceptions. The hardware units used to provide resources for a warp may exist in groups of 16 (or some other number) at some functional level, but at an operational level, the warp is scheduled as 32 threads, and each instruction is scheduled for the entire warp and executed together, within some number of Fermi hotclocks.
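As a rough illustration of "some number of hotclocks", the following plain C++ sketch counts how many passes a 32-thread warp instruction needs through a group of identical units. This is an assumed simplification that ignores pipelining, latency, and clock domains; it is not a description of the actual scheduler, and `passes_per_warp_instruction` is a hypothetical name.

```cpp
#include <cassert>

// Assumed warp size (32, as on Fermi).
constexpr int kWarpSize = 32;

// Simplified model (an assumption, not documented hardware behavior):
// a warp instruction occupies a group of `units` identical execution units
// for ceil(32 / units) back-to-back passes.
inline int passes_per_warp_instruction(int units) {
    assert(units > 0);
    return (kWarpSize + units - 1) / units;
}
```

Under this model, a group of 16 units needs 2 passes to cover the warp, which matches the "2 hotclocks" figure given earlier, while a hypothetical group of 4 units would need 8.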

Question 2

As far as I know:

Q1 - Scheduling is done at the hardware level. Warps are the scheduling units; warps, their constituent lanes (a lane ID is the hardware equivalent of the thread index within a warp), SMs, and other components at this level are all hardware units that are abstracted and programmed via the CUDA programming model.

Q2 - It also depends on the grid: if you launch two blocks containing a single thread each, you will end up with two warps, each of which contains only one active thread. As I said, all scheduling and execution is done on a per-warp basis, and the more warps the hardware has, the more it can schedule (although they may contain dummy NOP threads) to try to hide latency and reduce instruction pipeline stalls.
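The two-block example can be worked through with a short host-side sketch in plain C++ (not device code). It assumes a warp size of 32 and relies on the fact that a warp never spans two blocks; the function names are illustrative only.

```cpp
#include <cassert>

// Assumed warp size (32, as on Fermi).
constexpr int kWarpSize = 32;

// Warps needed by one block: partial warps still occupy a whole warp.
inline int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes in a block's warps that have no thread assigned to them.
inline int idle_lanes_per_block(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}

// Warps for the whole grid: warps are formed per block, never across blocks.
inline int total_warps(int num_blocks, int threads_per_block) {
    assert(num_blocks > 0);
    return num_blocks * warps_per_block(threads_per_block);
}
```

For two blocks of one thread each, `total_warps(2, 1)` is 2 and `idle_lanes_per_block(1)` is 31, whereas a single block of two threads, `total_warps(1, 2)`, needs only one warp.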

Q3 - Once resources are allocated, threads are always divided into 32-thread warps. On Fermi, the warp schedulers pick two warps per cycle and dispatch them to execution units. On pre-Fermi architectures, SMs had fewer than 32 thread processors; Fermi has 32. However, a full memory request can only retrieve 128 bytes at a time, which is exactly 32 threads x 4 bytes. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into half-warp-sized pieces (https://stackoverflow.com/a/14927626/1938163). Besides, as the Fermi whitepaper puts it:

The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.

You don't have a "scheduling group size" at the thread level as you wrote, but if you re-read the statement above, you'll see that each group of 16 cores (or 16 load/store units, or 4 SFUs) is fed one instruction from a 32-thread warp. If you were asking "why 16?", well, that's another architectural story, and I suspect it's a carefully designed tradeoff. I'm sorry, but I don't know more about this.
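Going back to the 128-byte limit mentioned under Q3, the half-warp split follows from simple arithmetic. The plain C++ sketch below assumes a warp size of 32, a 128-byte maximum per request, and that every thread accesses one aligned, contiguous element of a power-of-two size; the function names are hypothetical.

```cpp
#include <cassert>

// Assumed warp size (32, as on Fermi).
constexpr int kWarpSize = 32;
// Maximum bytes per memory request, as stated in the answer above.
constexpr int kMaxRequestBytes = 128;

// Requests needed when each of the 32 threads accesses bytes_per_thread bytes.
inline int requests_per_warp_access(int bytes_per_thread) {
    assert(bytes_per_thread > 0);
    const int total_bytes = kWarpSize * bytes_per_thread;
    return (total_bytes + kMaxRequestBytes - 1) / kMaxRequestBytes;
}

// Threads served by each request; meaningful for power-of-two element sizes.
inline int threads_per_request(int bytes_per_thread) {
    return kWarpSize / requests_per_warp_access(bytes_per_thread);
}
```

With 4-byte elements the whole warp fits in one request; with 8-byte elements the access needs two requests of 16 threads each (one per half-warp), and with 16-byte elements four requests of 8 threads each.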