Streaming multiprocessors and cores per streaming multiprocessor in CUDA

Question

What is the significance of a device having a larger number of cores in each streaming multiprocessor?

The number of cores per SM translates roughly to how many warp instructions can be processed in any given clock cycle. A single warp instruction can be processed in a given clock cycle, but it needs 32 cores to do so (and may take multiple clock cycles to complete, depending on the instruction). A cc2.0 Fermi SM with 32 "cores" can therefore retire at most 1 warp instruction per clock on average (it is actually 2 instructions every 2 clocks). A Kepler SMX with 192 cores can retire 4 or more instructions per clock. For a more precise answer, refer to the compute capabilities section of the CUDA programming guide, and note that it has a separate section for each major compute capability (1.x, 2.x, 3.x).
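The rule of thumb above can be written out as a small back-of-the-envelope calculation. This is only a sketch of the "cores divided by warp size" arithmetic, not a performance model: the function name `warp_instructions_per_clock` is made up for illustration, and the result is an upper bound from core count alone. The real rate also depends on the warp schedulers and the instruction mix, which is why the figure quoted for Kepler is "4 or more" rather than the bound computed here.

```python
WARP_SIZE = 32  # threads per warp on the GPUs discussed here


def warp_instructions_per_clock(cores_per_sm: int) -> float:
    """Rough upper bound on warp instructions per clock for one SM.

    Assumes each warp instruction occupies WARP_SIZE cores for one clock,
    and ignores scheduler limits and multi-cycle instructions.
    """
    return cores_per_sm / WARP_SIZE


if __name__ == "__main__":
    # Core counts taken from the answer: Fermi cc2.0 SM and Kepler SMX.
    for name, cores in [("Fermi cc2.0 SM", 32), ("Kepler SMX", 192)]:
        bound = warp_instructions_per_clock(cores)
        print(f"{name}: {cores} cores, at most about {bound:g} warp instructions per clock")
```

Treat the output as a ceiling set by the number of cores; the architecture sections of the programming guide give the actual issue rates.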

How does a CUDA program actually flow through the device, with regard to streaming multiprocessors and cores per streaming multiprocessor?

This question has been answered many times under the CUDA tag. Each threadblock in the grid associated with a kernel launch is assigned to one SM (when that SM has a free slot). The SM then "unpacks" the threadblock into warps and schedules warp instructions on its internal resources (e.g. "cores" and special function units) as those resources become available.
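The flow described above can be sketched as a toy host-side model. Everything here is illustrative: `unpack_block` and `first_wave` are hypothetical helper names, the `slots_per_sm` limit stands in for whatever resource limits apply on a real device, and the round-robin order is an assumption made for simplicity, since the order in which the hardware distributes threadblocks is not specified.

```python
WARP_SIZE = 32  # threads per warp


def unpack_block(threads_per_block):
    """Split a threadblock's thread indices into warps of consecutive threads.

    The last warp is only partially populated when the block size is not a
    multiple of WARP_SIZE.
    """
    return [
        range(start, min(start + WARP_SIZE, threads_per_block))
        for start in range(0, threads_per_block, WARP_SIZE)
    ]


def first_wave(num_blocks, num_sms, slots_per_sm):
    """Toy model: give blocks to SMs until every free slot is taken.

    Returns (assignment, waiting), where assignment maps an SM index to the
    block indices resident on it and waiting lists the blocks that must wait
    for a slot to free up. Round-robin order is an assumption.
    """
    assignment = {sm: [] for sm in range(num_sms)}
    block = 0
    for _ in range(slots_per_sm):
        for sm in range(num_sms):
            if block < num_blocks:
                assignment[sm].append(block)
                block += 1
    waiting = list(range(block, num_blocks))
    return assignment, waiting


if __name__ == "__main__":
    warps = unpack_block(100)
    print(f"100 threads -> {len(warps)} warps, last warp has {len(warps[-1])} threads")

    assignment, waiting = first_wave(num_blocks=10, num_sms=2, slots_per_sm=2)
    print("resident blocks per SM:", assignment)
    print("blocks still waiting:", waiting)
```

The model stops at the first wave; on a real device, waiting threadblocks are assigned as resident ones finish and free their slots.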