Question

websocket spec defines unmasking data as

j                   = i MOD 4
transformed-octet-i = original-octet-i XOR masking-key-octet-j

where mask is 4 bytes long and unmasking has to be applied per byte.

Is there a way to do this more efficiently, than to just loop bytes?

Server running the code can assumed to be a Haswell CPU, OS is Linux with kernel > 3.2, so SSE etc are all present. Coding is done in C, but I can do asm as well if necessary.

I'd tried to look up the solution myself, but was unable to figure out if there was an appropriate instruction in any of the dozens of SSE1-5/AVE/(whatever extension - lost track of the many over the years)

Thank you very much!

Edit: After rereading the spec a couple of times it seems that it's actually only XOR'ing the data bytes with the mask bytes, which I can do 8 bytes at a time till the last few bytes. Question is still open, as I think there could probably be still a way to optimize this using SSE or the like (maybe processing even 16 bytes at a time? letting the process do the for loop? ...)

Était-ce utile?

La solution

Yes, you can XOR 16 bytes in one instruction using SSE2, or 32 bytes at a time with AVX2 (Haswell and later).

SSE2:

#include <emmintrin.h>                     // SSE2 instrinsics

__m128i v, v_mask;
uint8_t *buff;                             // buffer - must be 16 byte aligned

for (int i = 0; i < N; i += 16)            // note that N must be multiple of 16
{
    v = _mm_load_si128(&buff[i]);          // load 16 bytes
    v = _mm_xor_si128(v, v_mask);          // XOR with mask
    v = _mm_store_si128(&buff[i], v);      // store 16 masked bytes
}

AVX2:

#include <immintrin.h>                     // AVX2 intrinsics

__m256i w, w_mask;
uint8_t *buff;                             // buffer - must be 16 byte aligned,
                                           // and preferably 32 byte aligned

for (int i = 0; i < N; i += 32)            // note that N must be multiple of 32
{
    w = _mm256_load_si256(&buff[i]);       // load 32 bytes
    w = _mm256_xor_si256(w, w_mask);       // XOR with mask
    w = _mm256_store_si256(&buff[i], w);   // store 32 masked bytes
}
Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top