Question

I have come across the word "warp" in a few places but haven't seen a thorough definition (there's no Wikipedia page on it either).

A brief definition is found here:

In the SIMT paradigm, threads are automatically grouped into 32-wide bundles called warps. Warps are the base unit used to schedule both computation on Arithmetic and Logic Units (ALUs) and memory accesses. Threads within the same warp follow the SIMD pattern, i.e. they are supposed to execute the same operation at a given clock cycle...

Another definition is found here:

In a SIMT execution, some number of threads will be combined into a single group (called a “warp” in NVIDIA parlance, and a “wavefront” by AMD; for brevity, we will use the term “warp” hereafter). These threads will execute in lockstep, each executing the same instruction simultaneously.

Wondering if one could describe in more detail what warps are exactly, and how you should be using them or thinking about them when doing parallel / GPU programming. They seem to be mentioned in relation to some optimizations. An example is from the second link:

...efficiently mapping tree traversals on GPUs requires carefully scheduling those traversals so that traversals that are grouped together into the same warp are as similar as possible.

Then later:

The CPU then uses this information to dynamically reorder the traversals so that when the second kernel is called, threads grouped into warps perform similar work, improving SIMT efficiency.

I'm wondering how many warps there are, where they are, how to use them (like can you use them in WebGL through some API, or is it just the way you organize memory).


Solution

SIMT stands for Single Instruction, Multiple Threads. Unlike cores on a CPU, which (more or less) act independently of each other, each core in a group on a GPU executes the same instructions, from the same program, as the other cores in its group. They just act on different data.

In one sense this is quite limiting. For example, if there is a branch in the code, and some cores want to take the branch and others don't, then both paths will be executed one after the other, with the cores not taking the current path sitting idle (masked off).
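To make that concrete, here is a minimal CUDA sketch (the kernel name and the arithmetic are made up for illustration) in which even and odd threads of the same warp want different branches, so the warp has to run both paths in turn:

```
// Warp divergence: even and odd lanes take different branches, so the warp
// executes both paths one after the other, with the non-participating
// threads masked off (idle) each time.
__global__ void divergent(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x % 2 == 0) {
        out[i] = 2.0f * i;   // executed first, odd lanes sit idle
    } else {
        out[i] = i / 2.0f;   // executed second, even lanes sit idle
    }
}
```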

However, it has the big advantage that it requires far less machinery than fully independent cores, so the designers can fit more cores onto the chip and run more work in parallel.

Now, it doesn't make sense to have every core on the chip run the same instruction, because quite a lot of time is spent waiting for things (mostly memory accesses). So the designers group the threads into bundles. The idea is that you start running one bundle; if it stalls on a memory access, you park it and run another bundle. If that one stalls and the memory for the first bundle has arrived, you switch back to the first. And so on.
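As a rough sketch of why this works, assuming CUDA and made-up buffer names: you launch far more threads (and hence bundles) than there are cores, and the hardware switches between the resident bundles whenever one is waiting on memory:

```
// Each thread does a global load; while one bundle of threads waits for its
// load to come back, the hardware issues instructions from another bundle.
__global__ void add_one(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1.0f;   // the load parks this bundle; others run meanwhile
}

// Host-side launch (d_in, d_out, n are hypothetical): far more threads than
// physical cores, so there is always another bundle ready to run.
// add_one<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```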

On NVIDIA chips, these bundles are called warps. This is sort of a pun on textile making, where the warp is the set of threads held in parallel on the loom.

The number of threads in a warp is somewhat arbitrary. It is fixed for a given chip (to keep the scheduling hardware simple) and chosen as a balance between the considerations above; on NVIDIA hardware it has been 32 for a long time, which matches the quote in your question. Newer/more expensive chips tend to support more threads in flight, and hence more warps, as the manufacturers fit more cores onto the chip.
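If you are using CUDA, you can query the fixed warp size of a device rather than hard-coding it; a minimal sketch:

```
// Ask the runtime for the warp size of device 0 (32 on NVIDIA hardware to date).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    std::printf("warp size: %d threads\n", prop.warpSize);
    return 0;
}
```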

As to your point about optimisations: what you're trying to achieve is to keep the cores busy. Ideally, you write your program so that all the threads in a warp execute the same instruction, i.e. follow the same branches; that way none of them sit idle. You also want the work arranged so that while some warps are waiting for memory, others can be scheduled to run instructions. This is called latency hiding.
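As a sketch of the first point, assuming CUDA: the divergent branch from the earlier example can be made warp-uniform by branching on the warp index instead of the thread index, so every thread in a given warp takes the same path:

```
// Warp-uniform branching: the condition depends only on which warp a thread
// belongs to (threadIdx.x / warpSize), so no lanes are masked off.
__global__ void uniform_branch(float *out)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // same value for every lane in the warp

    if (warp % 2 == 0) {
        out[i] = 2.0f * i;   // the whole warp takes this path
    } else {
        out[i] = i / 2.0f;   // the whole warp takes this path
    }
}
```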

In practice, this is a bit of an art form, as there are numerous tradeoffs which differ from chip to chip.

Finally, with regard to using them: if you run a program on a GPU, it will just use them. As the warp size is fixed by the hardware, you don't have any decision to make here, so there's no explicit API for creating or placing warps (and WebGL exposes nothing of the sort). The only time you interact with them is through the optimisations above, by arranging your threads and data so that warps stay busy.
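To illustrate that, a short CUDA sketch (hypothetical kernel): warp membership simply falls out of the thread index, so there is nothing to create or configure, you can only observe it:

```
#include <cstdio>

// Warp membership is implicit: consecutive thread indices within a block
// form a warp. Here each warp's first lane reports where its warp starts.
__global__ void whoami()
{
    int lane = threadIdx.x % warpSize;   // position within the warp
    int warp = threadIdx.x / warpSize;   // which warp within the block

    if (lane == 0)
        printf("block %d: warp %d starts at thread %d\n",
               (int)blockIdx.x, warp, warp * warpSize);
}
```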

Licensed under: CC-BY-SA with attribution