Question

Background: I'm trying to create a pure D language implementation of functionality that's roughly equivalent to C's memchr, but one that uses arrays and indices instead of pointers. The reason is so that std.string will work with compile-time function evaluation. For those of you unfamiliar with D: functions can be evaluated at compile time if certain restrictions are met. One restriction is that they can't use pointers. Another is that they can't call C functions or use inline assembly language. Having the string library work at compile time is useful for some compile-time code-gen hacks.

Question: How does memchr work under the hood to perform as fast as it does? On Win32, anything I've been able to create in pure D using simple loops is at least 2x slower, even with obvious optimization techniques such as disabling bounds checking, loop unrolling, etc. What kinds of non-obvious tricks are available for something as simple as finding a character in a string?


Solution

I would suggest taking a look at GNU libc's source. As with most functions, it contains both a generic optimized C version of the function and optimized assembly-language versions for as many supported architectures as possible, taking advantage of machine-specific tricks.

The x86-64 SSE2 version combines the results of pcmpeqb on a whole cache line of data at once (four 16-byte vectors) to amortize the overhead of the early-exit pmovmskb/test/jcc.
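
To make that concrete, here is a minimal sketch of the same idea in C using SSE2 intrinsics rather than hand-written assembly. This is not glibc's actual code: the function name is made up, unaligned loads are assumed to be acceptable, and the head/tail handling of a real implementation is reduced to a plain byte loop.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hedged sketch, not glibc's implementation: check 64 bytes (four 16-byte
 * vectors) per iteration, paying for only one pmovmskb/test/branch per
 * cache line instead of one per 16 bytes. */
static const void *memchr_sse2_sketch(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const __m128i needle = _mm_set1_epi8((char)c);
    size_t i = 0;

    for (; i + 64 <= n; i += 64) {
        __m128i m0 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + i)),      needle);
        __m128i m1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + i + 16)), needle);
        __m128i m2 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + i + 32)), needle);
        __m128i m3 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + i + 48)), needle);
        /* OR the four compare results so a single test covers all 64 bytes. */
        __m128i any = _mm_or_si128(_mm_or_si128(m0, m1), _mm_or_si128(m2, m3));
        if (_mm_movemask_epi8(any) != 0) {
            /* A match is somewhere in this 64-byte block; locate it byte by byte. */
            for (size_t j = i; j < i + 64; j++)
                if (p[j] == (unsigned char)c)
                    return p + j;
        }
    }
    for (; i < n; i++)              /* leftover tail, fewer than 64 bytes */
        if (p[i] == (unsigned char)c)
            return p + i;
    return NULL;
}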

gcc and clang are currently incapable of auto-vectorizing loops with early-exit conditions (if (...) break), so they generate naive byte-at-a-time assembly from the obvious C implementation.

OTHER TIPS

This implementation of memchr from newlib is one example of an optimized memchr: it reads and tests 4 bytes (one word) at a time. (Beyond memchr, newlib's other string functions are worth a look as well.)
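
A rough sketch of that word-at-a-time idea in portable C (this is not newlib's exact code; the function name and structure are only illustrative) might look like this:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hedged sketch of the word-at-a-time technique: XOR each 32-bit word with
 * the search byte repeated four times, then use the classic "does this word
 * contain a zero byte?" bit trick so one test covers four bytes. A production
 * version would align p first and use aligned word loads; memcpy is used
 * here as a portable unaligned load. */
static void *memchr_word_sketch(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const unsigned char ch = (unsigned char)c;
    const uint32_t repeated = 0x01010101u * ch;   /* ch in every byte */

    while (n >= 4) {
        uint32_t w;
        memcpy(&w, p, sizeof w);                  /* compilers lower this to a single load */
        w ^= repeated;                            /* matching bytes become zero */
        /* nonzero iff some byte of w is zero, i.e. some of these 4 bytes match */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;                                /* fall through to the byte loop */
        p += 4;
        n -= 4;
    }
    while (n-- > 0) {                             /* remaining bytes, or the matching word */
        if (*p == ch)
            return (void *)p;
        p++;
    }
    return NULL;
}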

Incidentally, most of the source code for the MSVC run-time library is available as an optional part of the MSVC installation (so you could look at that).

Here is FreeBSD's (BSD-licensed) memchr() from memchr.c. FreeBSD's online source code browser is a good reference for time-tested, BSD-licensed code examples.

void *
memchr(s, c, n)
    const void *s;
    unsigned char c;
    size_t n;
{
    if (n != 0) {
        const unsigned char *p = s;

        do {
            if (*p++ == c)
                return ((void *)(p - 1));
        } while (--n != 0);
    }
    return (NULL);
}

memchr, like memset and memcpy, generally reduces to a fairly small amount of machine code. You are unlikely to be able to reproduce that kind of speed without inlining similar assembly code. One major issue to consider in an implementation is data alignment.

One generic technique you may be able to use is to insert a sentinel at the end of the string being searched, which guarantees that you will find it. That lets you move the test for end of string from inside the loop to after the loop (see the sketch below).
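
Here is a minimal sketch of the sentinel idea, assuming the buffer is writable and has one spare byte past the searched region (which a D array-based implementation could arrange); the function name is only illustrative:

#include <stddef.h>

/* Sentinel sketch: plant the target byte just past the data so the inner
 * loop needs no end-of-data check; restore the overwritten byte afterwards. */
static size_t find_byte_with_sentinel(unsigned char *buf, size_t len, unsigned char c)
{
    unsigned char saved = buf[len];  /* byte we temporarily overwrite */
    buf[len] = c;                    /* sentinel guarantees the loop terminates */

    size_t i = 0;
    while (buf[i] != c)              /* no "i < len" test inside the loop */
        i++;

    buf[len] = saved;                /* put the original byte back */
    return i < len ? i : (size_t)-1; /* (size_t)-1 means "not found" */
}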

GNU libc definitely uses an assembly version of memchr() (on any common Linux distro). This is why it is so unbelievably fast.

For example, counting the lines in an 11 GB file (as "wc -l" does) takes around 2.5 seconds with the assembly version of memchr() from GNU libc. But if we replace the assembly memchr() call with, for example, the C implementation of memchr() from FreeBSD, the time goes up to roughly 30 seconds.

That is about the same as replacing memchr() with a plain while loop that compares one char after another.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow