I suppose it's not working because the assembly code does not null-terminate the result buffer.
I would always prefer substring semantics with a starting position and a count rather than two positions. People find it a little easier to think in those terms.
There is no need to return any value from this function.
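    #include <stddef.h>  /* size_t */

    /* Copies count bytes starting at src[s_idx] into dest and null-terminates.
       dest must have room for count + 1 bytes; count must not be negative. */
    static inline void asm_sub_str(char *dest, const char *src, int s_idx, int count)
    {
        const char *s = src + s_idx;
        size_t n = (size_t)count;  /* pointer-sized, so the whole count register is defined */

        __asm__ __volatile__("cld\n\t"
                             "rep movsb\n\t"        /* copy n bytes from [SI] to [DI] */
                             "xor %%al,%%al\n\t"
                             "stosb"                /* write the terminating 0 */
                             : "+S"(s), "+D"(dest), "+c"(n)  /* all three are modified */
                             :
                             : "eax", "memory", "cc");
    }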
EDIT: Note that this implementation is far from optimal, even though it is written in assembly. Memory alignment and word size matter for speed on any particular architecture, and the best way to do the copy is probably in aligned machine-size words: first copy up to (word size - 1) bytes one at a time until the address is aligned, then copy the bulk of the string word by word, and finally copy the remaining (up to word size - 1) bytes one at a time.
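The head/body/tail scheme described above could look roughly like the following in plain C. This is only a sketch: the name sub_str_words is made up for illustration, it aligns the destination only (the source may stay unaligned, which is why the word moves go through a fixed-size memcpy), and it assumes dest has room for count + 1 bytes and does not overlap the source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the original answer. */
static void sub_str_words(char *dest, const char *src, size_t s_idx, size_t count)
{
    const char *s = src + s_idx;
    char *d = dest;
    const size_t W = sizeof(uintptr_t);  /* assumed to match the machine word size */

    /* Head: at most W - 1 byte copies until the destination is word-aligned. */
    while (count > 0 && ((uintptr_t)d % W) != 0) {
        *d++ = *s++;
        count--;
    }

    /* Body: one word per iteration. The fixed-size memcpy keeps the possibly
       unaligned source read well-defined; compilers usually turn it into a
       single load and store. */
    while (count >= W) {
        uintptr_t word;
        memcpy(&word, s, W);
        memcpy(d, &word, W);
        d += W;
        s += W;
        count -= W;
    }

    /* Tail: the remaining (at most W - 1) bytes, then the terminator. */
    while (count > 0) {
        *d++ = *s++;
        count--;
    }
    *d = '\0';
}
```

Whether this actually beats rep movsb depends on the CPU and the string lengths involved, so it is worth measuring before relying on it.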
I take the question as an exercise in inline assembly and parameter passing, not as the best way to copy strings. With a modern C compiler at -O2, plain C code can be expected to compile to something faster.
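For comparison, a plain C counterpart with the same (start, count) semantics might be as short as the sketch below. The name c_sub_str is hypothetical; it assumes dest has room for count + 1 bytes and that the two buffers do not overlap, which memcpy requires.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical plain-C counterpart to asm_sub_str, for illustration only. */
static inline void c_sub_str(char *dest, const char *src, size_t s_idx, size_t count)
{
    memcpy(dest, src + s_idx, count);  /* the library/compiler picks the copy strategy */
    dest[count] = '\0';
}
```

Calling c_sub_str(buf, "hello world", 6, 5) leaves "world" in buf.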