Question

SSE 4.2 performs comparisons on two 16-byte operands at a time. But it is also possible to compare two 8-byte operands at a time with ordinary processor instructions.

The difference does not seem large enough to justify a special hardware implementation of such a comparison. Is SSE 4.2 really that irrelevant, or have I missed something?

Solution

I'm not sure of the specifics of how the standard register comparison instructions perform relative to their wider SSE equivalents (it's possible that the standard comparison instruction requires more cycles), but a 2x improvement in throughput is nothing to sneeze at.

I think you're asking "why even have SSE 4.2 if all you get is 2 comparisons at once instead of 1?" I think you're overlooking a few things:

  • As I noted before, twice the width on an operation is nice to have. If you're working on an application that does a lot of these comparisons, you're probably happy that it's there.

  • It's likely that the incremental cost of adding this instruction to the already-existing SSE execution units was relatively small. There is already a lot of hardware in place to perform the wide range of operations already defined for the earlier SSE instruction sets.

    Nowadays, the instructions that seem to get added are either wider versions of older capabilities (e.g. many of the AVX instructions) or operations that are important for certain specific applications (e.g. the CRC/AES instructions, 4-element dot products). It's possible that there is some application that benefits a lot from such a comparison instruction and the cost of adding it was worth the marketing benefit achieved by being faster on those types of code.

OTHER TIPS

x64 processors are only guaranteed to have SSE2; you need to use CPUID to check for SSE 4.2 support (via the CPUID.01H:ECX.SSE42[Bit 20] flag). However, SSE2 already supports 16-byte comparison, via _mm_cmpeq_epi8.
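As a rough illustration of both points, here is a minimal C sketch. The function names has_sse42 and equal16 are made up for this example, and the CPUID part assumes GCC or Clang, whose <cpuid.h> provides __get_cpuid. When the compiler is not targeting SSE2, equal16 simply falls back to memcmp.

```c
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* GCC/Clang wrapper around the CPUID instruction */
#endif
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: returns 1 if CPUID.01H:ECX bit 20 (SSE4.2) is set. */
int has_sse42(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* CPUID leaf 1 not available */
    return (int)((ecx >> 20) & 1u);
#else
    return 0;                          /* not an x86 build */
#endif
}

/* Hypothetical helper: returns 1 if the 16 bytes at a and b are identical. */
int equal16(const void *a, const void *b)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);     /* 0xFF in every equal byte lane */
    return _mm_movemask_epi8(eq) == 0xFFFF;  /* all 16 lanes matched */
#else
    return memcmp(a, b, 16) == 0;            /* portable fallback */
#endif
}
```

Note that equal16 needs nothing beyond SSE2, so on x64 it can be used without any runtime check; has_sse42 only matters before calling the SSE 4.2 string instructions.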

While it's true that all SSE versions except 4.2 added instructions that were "generally useful", the new string operations are so general that they have potential uses outside of string processing as well. I don't know of any cases where that actually helps, though, because they're quite slow.

The SSE4.2 instructions compare two packed operands. So you are not comparing two bytes or words; you are doing a very complex comparison between up to 16 bytes and up to 16 other bytes (or up to 8 words and up to 8 other words).

The SSE4.2 instructions are generally slower than normal compares because they are almost always microcoded. But given that each SSE4.2 instruction starts off by doing up to 256 compares (in the byte case) and then calculates a bunch of more useful output, there is usually a net gain in algorithm performance, unless your search pattern is unable to skip over several characters with each iteration.
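To make the "up to 256 compares" point concrete, here is a hedged C sketch of one such operation: finding the first byte of a buffer that matches any byte in a small set, using _mm_cmpestri in its "equal any" mode. Every byte of a 16-byte chunk is compared against every byte of the set in one instruction. The function names are invented for this example, the SSE4.2 path assumes GCC or Clang (target attribute and __builtin_cpu_supports), and a scalar loop is used whenever SSE4.2 is unavailable or the set is longer than 16 bytes.

```c
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#define USE_SSE42_PATH 1
#endif

/* Scalar reference: index of first byte of hay found in set, or n if none. */
static size_t find_first_of_scalar(const char *hay, size_t n,
                                   const char *set, size_t set_len)
{
    for (size_t i = 0; i < n; i++)
        if (memchr(set, hay[i], set_len) != NULL)
            return i;
    return n;
}

#ifdef USE_SSE42_PATH
/* SSE4.2 version; requires set_len <= 16. */
__attribute__((target("sse4.2")))
static size_t find_first_of_sse42(const char *hay, size_t n,
                                  const char *set, size_t set_len)
{
    char set_buf[16] = {0};
    memcpy(set_buf, set, set_len);
    __m128i set_v = _mm_loadu_si128((const __m128i *)set_buf);

    for (size_t i = 0; i < n; i += 16) {
        size_t chunk = (n - i < 16) ? n - i : 16;
        __m128i h;
        if (chunk == 16) {
            h = _mm_loadu_si128((const __m128i *)(hay + i));
        } else {
            /* Copy the tail so we never read past the end of hay. */
            char tail[16] = {0};
            memcpy(tail, hay + i, chunk);
            h = _mm_loadu_si128((const __m128i *)tail);
        }
        /* Compares each valid byte of h against each valid byte of set_v;
           returns the lowest matching index in h, or 16 if none matched. */
        int idx = _mm_cmpestri(set_v, (int)set_len, h, (int)chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < (int)chunk)
            return i + (size_t)idx;
    }
    return n;
}
#endif

/* Dispatches to the SSE4.2 path only when the CPU reports support for it. */
size_t find_first_of(const char *hay, size_t n,
                     const char *set, size_t set_len)
{
#ifdef USE_SSE42_PATH
    if (set_len <= 16 && __builtin_cpu_supports("sse4.2"))
        return find_first_of_sse42(hay, n, set, set_len);
#endif
    return find_first_of_scalar(hay, n, set, set_len);
}
```

For example, find_first_of("hello, world", 12, ",!", 2) returns 5, the position of the comma, and the function returns n when nothing matches. Whether this beats the scalar loop in practice depends on the data and the CPU, so it is worth measuring rather than assuming.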

Licensed under: CC-BY-SA with attribution