Question

I am just getting started with OpenCL and still learning.

Kernel Code:

__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));
    int x = 0;
    if (get_global_id(0) == 1) {
        // work-item 1 spins here forever
        while (x < 1) {
            x = 0;
        }
    }
    printf("workitem %d completed\n", (int)get_global_id(0));
}

C host code for invoking the kernel:

size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 1; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

Output:

workitem 3 invoked
workitem 3 completed
workitem 0 invoked
workitem 0 completed
workitem 1 invoked
workitem 2 invoked
workitem 2 completed

## Here the program hangs on the terminal, waiting for work-item #1 to finish, which will never happen

This suggests that the work-items run in parallel (but each in a different work-group).

Another C host snippet for invoking the kernel (1 work-group with 4 work-items):

size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 4; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

Output:

workitem 0 invoked
workitem 0 completed
workitem 1 invoked
## Here the program hangs on the terminal, waiting for work-item #1 to finish, which will never happen

This suggests the work-items run in sequence: work-item 0 completed, execution then got stuck on work-item 1, and the rest were never started.

My Question:

I need to launch 1 work-group with 4 work-items that run in parallel, so that I can use a barrier in my code (which, as I understand it, is only possible within a single work-group). How can I do that?

Any help/suggestion/pointer will be appreciated.


Solution

Your second host code snippet correctly launches a single work-group that contains 4 work-items. You have no guarantees that these work-items will run in parallel, since the hardware might not have the resources to do so. However, they will run concurrently, which is exactly what you need in order to be able to use work-group synchronisation constructs such as barriers. See this Stack Overflow question for a concise description of the difference between parallelism and concurrency. Essentially, the work-items in a work-group will make forward progress independently of each other, even if they aren't actually executing in parallel.

OpenCL 1.2 Specification (Section 3.2: Execution Model)

The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

Based on your previous question on a similar topic, I assume you are using AMD's OpenCL implementation targeting the CPU. The way most OpenCL CPU implementations work is by serialising all work-items from a work-group into a single thread. This thread then executes each work-item in turn (ignoring vectorisation for the sake of argument), switching between them when they either finish or hit a barrier. This is how they achieve concurrent execution, and gives you all the guarantees you need in order to safely use barriers within your kernel. Parallel execution is achieved by having multiple work-groups (as in your first example), which will result in multiple threads executing on multiple cores (if available).
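For illustration, here is a minimal sketch of the kind of work-group cooperation this execution model makes safe: each work-item writes its ID into local memory, all work-items synchronise at a barrier, and each then reads its neighbour's slot. The kernel name neighbour_exchange, the out buffer, and the fixed local size of 4 are assumptions for this example, not part of your code.

__kernel void neighbour_exchange(__global int* out)
{
    __local int scratch[4];               // sized for a local size of 4 (assumed)

    int lid   = (int)get_local_id(0);
    int lsize = (int)get_local_size(0);

    scratch[lid] = lid;                   // each work-item fills its own slot

    // No work-item may pass this point until every work-item in the
    // work-group has executed the write above.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Reading a neighbour's slot is safe only because of the barrier.
    out[get_global_id(0)] = scratch[(lid + 1) % lsize];
}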

If you replaced your infinite loop with a barrier, you would clearly see that this does actually work.
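As a rough sketch of that change (same kernel, with the infinite loop swapped for a barrier; the unused data argument is kept so your host code stays the same), you would expect all four "invoked" lines to appear before any "completed" line:

__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));

    // Every work-item in the work-group must reach this barrier before
    // any of them is allowed to continue past it.
    barrier(CLK_LOCAL_MEM_FENCE);

    printf("workitem %d completed\n", (int)get_global_id(0));
}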

Licensed under: CC-BY-SA with attribution