Launching any kernel, parent or child, has a cost. If your child kernels expose little parallelism and gain little over their non-parallel counterparts, that small gain may be cancelled out by the child kernel launch overhead.
In formulas, let $t_o$ be the overhead of launching a child kernel, $t_e$ its execution time, and $t_s$ the time to execute the same code without the help of dynamic parallelism. The speedup arising from the use of dynamic parallelism is $t_s/(t_o+t_e)$. Perhaps (but this cannot be inferred from your code) $t_e < t_s$ but $t_e, t_s \ll t_o$, so that $t_s/(t_o+t_e) \approx t_s/t_o < 1$ and you observe a slowdown instead of a speedup.
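To make the formula concrete, here is a minimal sketch that simply evaluates the speedup for two cases. The function name and all the timings are hypothetical, chosen only for illustration; they are not measurements from your code or from any particular GPU.

```python
def dynamic_parallelism_speedup(t_s, t_o, t_e):
    """Return t_s / (t_o + t_e).

    t_s: time to run the work without dynamic parallelism
    t_o: overhead of launching the child kernel
    t_e: execution time of the child kernel
    All three must use the same time unit.
    """
    if t_s <= 0 or t_e <= 0 or t_o < 0:
        raise ValueError("t_s and t_e must be positive, t_o non-negative")
    return t_s / (t_o + t_e)


# Made-up timings in microseconds (assumptions, not measurements).
# Case 1: tiny child kernel, so t_e < t_s but both are far below t_o.
small = dynamic_parallelism_speedup(t_s=2.0, t_o=20.0, t_e=0.5)

# Case 2: the child kernel does enough work to amortize the launch overhead.
large = dynamic_parallelism_speedup(t_s=2000.0, t_o=20.0, t_e=180.0)

print(f"small child kernel: speedup = {small:.3f}")  # below 1, a slowdown
print(f"large child kernel: speedup = {large:.3f}")  # above 1, a speedup
```

Rearranging the formula, the child launch only pays off when $t_s - t_e > t_o$, that is, when the time saved by running in parallel exceeds the launch overhead.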