The tutorial you refer to is wrong and will never work on GPU devices, because of the hardware architecture.
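Any kind of sync mechanism that blocks a workitem inside a workgroup will simply not work. GPU workitems typically execute in lockstep, so the blocked state affects the whole workgroup, including the workitem that would have to release the lock, and the result is an infinite loop.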
You can only do this kind of thing with a workgroup size of 1, or across workgroups.
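
To make the reasoning concrete, here is a rough toy model in Python (not OpenCL, and not how any particular GPU is actually implemented). It assumes that all workitems of a workgroup step through a spin loop together, and that a workitem which wins the lock is parked until every other workitem has also left the loop. The function name `spinlock_finishes` and the `max_steps` cutoff are made up for this illustration.

```python
def spinlock_finishes(group_size, max_steps=10_000):
    """Toy model of a spinlock taken by every workitem of one workgroup.

    Assumption (not real hardware): all lanes run the spin loop in lockstep,
    and a lane that acquires the lock cannot continue to its critical section
    until every other lane has left the loop as well.
    Returns True if the workgroup gets past the loop within max_steps.
    """
    lock = 0                            # 0 = free, 1 = held
    spinning = list(range(group_size))  # lanes still inside the spin loop
    parked = []                         # lanes that left the loop and wait

    for _ in range(max_steps):
        still_spinning = []
        for lane in spinning:
            if lock == 0:               # models a successful atomic compare-and-swap
                lock = 1
                parked.append(lane)
            else:                       # lock is held, lane keeps spinning
                still_spinning.append(lane)
        spinning = still_spinning

        if not spinning:
            # Every lane has left the loop, so the holder can finally run
            # its critical section and release the lock.
            lock = 0
            return True

    # Some lane was still spinning when we gave up: the holder never got
    # the chance to release the lock.
    return False


print(spinlock_finishes(1))   # True
print(spinlock_finishes(64))  # False
```

With a workgroup size of 1 there is nobody else to wait for, so the lock is taken and released. With any larger size, the one workitem that holds the lock is stuck waiting for the others, and the others are stuck waiting for the lock.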