Question

Different NVIDIA graphics cards have different specifications: a different number of streaming multiprocessors (SMs), and a different number of cores in each SM.

Thread blocks are assigned to a single multiprocessor according to the capacity of the device, for example 1 block of 32 warps or 2 blocks of 16 warps.

But I could not understand the number of cores in each streaming multiprocessor. What is the significance of a device having a larger number of cores in each streaming multiprocessor?

I suppose we need to make better use of the device properties for better optimization.

How does a CUDA program actually flow within the device, with regard to the streaming multiprocessors and the cores per multiprocessor?


Solution

What is the significance of a device having a larger number of cores in each streaming multiprocessor?

The number of cores per SM translates roughly to how many warp instructions can be processed in any given clock cycle. A single warp instruction can be issued in any given clock cycle, but it requires 32 cores to execute (and may take multiple clock cycles to complete, depending on the instruction). A cc 2.0 Fermi SM with 32 "cores" can retire at most 1 instruction per clock on average (it is actually 2 instructions every 2 clocks). A Kepler SMX with 192 cores can retire 4 or more instructions per clock. For a more precise answer, refer to the compute capabilities architecture section of the programming guide, and note that there is one section for each compute capability (1.x, 2.x, 3.x).
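The "cores divided by warp size" rule of thumb above can be sketched in a few lines of host-side C++. This is only a rough estimate, not a model of the real schedulers: the function name `warpInstructionsPerClock` is made up for illustration, and the per-SM core counts (8 for cc 1.x, 32 for cc 2.0, 192 for cc 3.0) are hard-coded assumptions based on NVIDIA's published figures, since the CUDA runtime does not report cores per SM directly.

```cpp
#include <cstdio>

// One warp is 32 threads, so one warp instruction needs 32 lanes of work.
constexpr int kWarpSize = 32;

// Assumed core counts per SM (not queried from a device).
constexpr int kCc1xCoresPerSM   = 8;    // compute capability 1.x
constexpr int kFermiCoresPerSM  = 32;   // compute capability 2.0
constexpr int kKeplerCoresPerSM = 192;  // compute capability 3.0 (SMX)

// Rough upper bound on arithmetic warp instructions per clock for one SM.
// Real throughput also depends on the warp schedulers and the instruction mix.
constexpr double warpInstructionsPerClock(int coresPerSM) {
    return static_cast<double>(coresPerSM) / kWarpSize;
}

// Hypothetical helper that prints the estimate for one architecture.
void printEstimate(const char* name, int coresPerSM) {
    std::printf("%s: %d cores/SM, about %.2f warp instructions per clock\n",
                name, coresPerSM, warpInstructionsPerClock(coresPerSM));
}
```

Calling `printEstimate("Fermi cc 2.0", kFermiCoresPerSM)` gives an estimate of 1 warp instruction per clock, and the Kepler figure comes out at 6, which is a ceiling consistent with "4 or more". With 8 cores the estimate is 0.25, i.e. a warp instruction occupies the cores for about 4 clocks.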

How does a CUDA program actually flow within the device, with regard to the streaming multiprocessors and the cores per multiprocessor?

This question has been answered many times on the CUDA tag. Each threadblock in the grid associated with a kernel launch is assigned to one SM (when the SM has a free slot). The SM then "unpacks" the threadblock into warps, and schedules warp instructions on the SM internal resources (e.g. "cores", and special function units), as those resources become available.
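As a minimal sketch of the "unpacking" step, the host-side C++ below computes how many warps a threadblock becomes and how many such blocks could be resident on one SM at once. The struct `SmLimits` and both function names are hypothetical, and the limits used in the example (32 resident warps, 8 resident blocks) are illustrative values chosen to match the figures in the question, not queried from a device. Register and shared memory usage also limit residency in practice and are ignored here.

```cpp
#include <algorithm>

constexpr int kThreadsPerWarp = 32;

// Hypothetical per-SM residency limits; real values depend on the
// compute capability of the device.
struct SmLimits {
    int maxWarps;   // maximum resident warps per SM
    int maxBlocks;  // maximum resident threadblocks per SM
};

// A threadblock is split into warps of 32 threads; a partial warp still
// occupies a full warp slot, hence the rounding up.
constexpr int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kThreadsPerWarp - 1) / kThreadsPerWarp;
}

// Number of blocks of the given size that fit on one SM at the same time,
// considering only the warp and block limits.
inline int residentBlocksPerSM(const SmLimits& limits, int threadsPerBlock) {
    const int byWarps = limits.maxWarps / warpsPerBlock(threadsPerBlock);
    return std::min(limits.maxBlocks, byWarps);
}
```

With `SmLimits{32, 8}`, a block of 1024 threads (32 warps) gives 1 resident block and a block of 512 threads (16 warps) gives 2, which matches the "1 block of 32 warps or 2 blocks of 16 warps" case from the question. Small blocks of 64 threads are capped at 8 by the block limit rather than the warp limit.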

License: CC-BY-SA with attribution
Not affiliated with StackOverflow