Unrolling does work with AMD.
http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
That tool includes KernelAnalyzer, which lets you see the actual output of AMD's compiler. I used it to verify that unrolling really does produce a different kernel.
However, unrolling loops does not necessarily give you any speedup. After all, it only saves the loop's jump instructions at the expense of program size, whereas on a GPU you are usually bound by memory latency.
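To make the trade-off concrete, here is a minimal plain-C sketch of what unrolling by 4 amounts to. It is hand-unrolled for illustration, not actual compiler output, and the function names sum_rolled and sum_unrolled4 are made up for this example.

```c
#include <stddef.h>

/* Rolled loop: one compare-and-branch per element. */
float sum_rolled(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Manually unrolled by 4: one compare-and-branch per four elements,
   but four copies of the loop body in the generated code. */
float sum_unrolled4(const float *x, size_t n)
{
    float acc = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += x[i];
        acc += x[i + 1];
        acc += x[i + 2];
        acc += x[i + 3];
    }
    for (; i < n; ++i)      /* remainder when n is not a multiple of 4 */
        acc += x[i];
    return acc;
}
```

Both versions read exactly the same elements, so the number of memory accesses is unchanged. Only the branch count drops, which is why a memory-bound kernel may not get measurably faster.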
In your case the bottleneck is probably the sin/cos functions; those are extremely slow on AMD hardware (and on other GPUs too). You should use native_sin and native_cos. They are not as precise and do not support as wide an input range as the normal ones, which is why the compiler does not use them by default, but in most cases they are good enough. The precision of the native_ functions is incidentally the same as that required by DirectX shaders for sin and cos.
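As a rough sketch of the swap, assume a hypothetical OpenCL C kernel named unit_phasor with made-up buffer arguments (this is not the kernel from the question). native_sin and native_cos are standard OpenCL built-ins; everything else here is illustrative.

```c
/* OpenCL C. Hypothetical kernel, for illustration only.
   Assumes the global work size equals the length of the buffers. */
__kernel void unit_phasor(__global const float *phase,
                          __global float *re,
                          __global float *im)
{
    size_t i = get_global_id(0);
    float p = phase[i];

    /* Precise versions would be: re[i] = cos(p); im[i] = sin(p); */
    re[i] = native_cos(p);   /* accuracy and input range are implementation-defined */
    im[i] = native_sin(p);
}
```

Since the accuracy and valid range of the native_ variants depend on the device, it is worth comparing the results against the precise versions on your own data before committing to them.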