Question

On Linux systems, POSIX provides a function (posix_memalign) for cache-line-aligned allocation to prevent false sharing, and the libnuma library lets us allocate from a specific NUMA node of the architecture. What I want needs both at once: I bind certain threads to certain processors, and I want to allocate the local data structures for each thread from the corresponding NUMA node, to reduce the latency of the threads' memory operations. How can I do this?


Solution

If you're just looking to add alignment on top of a NUMA allocator, you can easily build your own wrapper.

The idea is to call the unaligned malloc()-style allocator with a little extra space, then return the first aligned address inside the block. To be able to free it later, store the block's base address at a known location just below the returned address.

Here's an example. Just substitute the names with whatever is appropriate:

pint         //  An unsigned integer large enough to store a pointer (uintptr_t from <stdint.h> fits).
NUMA_malloc  //  The NUMA malloc function
NUMA_free    //  The NUMA free function

void* my_NUMA_malloc(size_t bytes, size_t align, /* NUMA parameters */ ){

    //  Over-allocate: room for the data, the alignment shift, and a
    //  pointer-sized slot for the base address.
    //  (align must be a nonzero power of two)
    void *ptr = NUMA_malloc(
        bytes + align + sizeof(pint),
        /* NUMA parameters */
    );

    if (ptr == NULL)
        return NULL;

    //  First aligned address that leaves room for the base-address slot
    pint *ret = (pint*)((((pint)ptr + sizeof(pint)) & ~(pint)(align - 1)) + align);

    //  Save the base address just below the returned address,
    //  so my_NUMA_free() can recover it
    ret[-1] = (pint)ptr;

    return ret;
}

void my_NUMA_free(void *ptr){
    if (ptr == NULL)
        return;

    //  Recover the base address stored by my_NUMA_malloc()
    ptr = (void*)(((pint*)ptr)[-1]);

    //  The NUMA free function
    NUMA_free(ptr);
}

When you use this, anything allocated with my_NUMA_malloc() must be freed with my_NUMA_free().
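If you instantiate this with libnuma specifically, note that the real numa_free() takes the allocation size as a second argument, so the wrapper has to stash the size as well as the base pointer. A minimal sketch of that instantiation (numa_malloc_aligned/numa_free_aligned are names I made up, not libnuma's; align must be a nonzero power of two):

#include <numa.h>
#include <stddef.h>
#include <stdint.h>

//  Same trick as above, backed by numa_alloc_onnode(). Two
//  pointer-sized slots are reserved below the returned address:
//  the base pointer and the total size, both needed by numa_free().
void *numa_malloc_aligned(size_t bytes, size_t align, int node){
    size_t total = bytes + align + 2 * sizeof(uintptr_t);

    void *base = numa_alloc_onnode(total, node);
    if (base == NULL)
        return NULL;

    uintptr_t *ret = (uintptr_t*)
        ((((uintptr_t)base + 2 * sizeof(uintptr_t)) & ~(uintptr_t)(align - 1)) + align);

    ret[-1] = (uintptr_t)base;   //  base address for numa_free()
    ret[-2] = (uintptr_t)total;  //  allocation size for numa_free()

    return ret;
}

void numa_free_aligned(void *ptr){
    if (ptr == NULL)
        return;

    uintptr_t *p = (uintptr_t*)ptr;
    numa_free((void*)p[-1], (size_t)p[-2]);
}

Link with -lnuma, and check numa_available() >= 0 once at startup before relying on any numa_*() calls.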

OTHER TIPS

The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything that comes back from numa_alloc_*() will already be aligned on a cache-line boundary.
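A quick way to convince yourself (a minimal sketch; node 0 and the 64-byte line size are assumptions, and it needs -lnuma):

#include <assert.h>
#include <numa.h>
#include <stdint.h>

int main(void){
    if (numa_available() < 0)
        return 1;  //  kernel/library has no NUMA support

    //  One page, allocated on node 0
    void *p = numa_alloc_onnode(4096, 0);
    if (p == NULL)
        return 1;

    //  Page alignment (4096 bytes) implies cache-line alignment (64 bytes)
    assert(((uintptr_t)p % 64) == 0);

    numa_free(p, 4096);
    return 0;
}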

Beware the numa_alloc_*() functions, however. The man page says they are slower than a corresponding malloc(), which I'm sure is true, but the much bigger problem I've found is that simultaneous allocations from numa_alloc_*() running on lots of cores at once suffer massive contention. In my case, replacing malloc() with numa_alloc_onnode() was a wash (everything I gained by using local memory was offset by increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16 kB mallocs on 32 threads/cores at once, and timing experiments showed that it wasn't numa_alloc_onnode()'s single-threaded speed that was responsible for the large amount of time my process spent performing the allocations, which left lock/contention issues as the likely cause.

The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then distribute them to the threads running on each node as needed. I use gcc atomic builtins to let each thread (I pin threads to CPUs) grab from the memory allocated on each node. You can cache-line-align the distributions as they are made, if you want: I do. This approach beats the pants off even tcmalloc (which is thread-aware but not NUMA-aware; at least the Debian Squeeze version doesn't seem to be).

The downside of this approach is that you can't free the individual distributions (well, not without a lot more work, anyway); you can only free the whole underlying on-node allocations. However, if this is temporary on-node scratch space for a function call, or if you can otherwise specify exactly when that memory is no longer needed, this approach works very well. It helps if you can predict how much memory you need to allocate on each node, too.

@nandu: I won't post the full source; it's long and in places tied to something else I do, which makes it less than perfectly transparent. What I will post is a slightly shortened version of my new malloc() function to illustrate the core idea:

void *my_malloc(struct node_memory *nm, int node, long size)
{
  long off, obytes;

  // round size up to the nearest cache-line multiple
  // (optional, though some rounding is essential to avoid misalignment problems)

  if ((obytes = (size % CACHE_LINE_SIZE)) > 0)
    size += CACHE_LINE_SIZE - obytes;

  // atomically advance the offset for the requested node by size

  if (((off = __sync_fetch_and_add(&(nm->off[node]), size)) + size) > nm->bytes) {
    fprintf(stderr, "Out of allocated memory on node %d\n", node);
    return(NULL);
  }
  else
    return((void *) (nm->ptr[node] + off));
}

where struct node_memory is

struct node_memory {
  long bytes;         // the number of bytes of memory allocated on each node
  char **ptr;         // ptr array of ptrs to the base of the memory on each node
  long *off;          // array of offsets from those bases (in bytes)
  int nptrs;          // the size of the ptr[] and off[] arrays
};

and nm->ptr[node] is set up using the libnuma function numa_alloc_onnode().
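For completeness, here is a sketch of how that setup might look (node_memory_init is a name I made up; error handling is trimmed, and a matching teardown would numa_free() each arena with its size):

#include <numa.h>
#include <stdlib.h>

//  One arena per node, offsets starting at zero
int node_memory_init(struct node_memory *nm, long bytes_per_node)
{
  int node;

  nm->nptrs = numa_max_node() + 1;
  nm->bytes = bytes_per_node;
  nm->ptr = malloc(nm->nptrs * sizeof(char *));
  nm->off = calloc(nm->nptrs, sizeof(long));
  if (nm->ptr == NULL || nm->off == NULL)
    return(-1);

  for (node = 0; node < nm->nptrs; node++) {
    nm->ptr[node] = numa_alloc_onnode(bytes_per_node, node);
    if (nm->ptr[node] == NULL)
      return(-1);
  }

  return(0);
}

A thread pinned to a CPU on node n then just calls my_malloc(nm, n, size) for its local scratch space.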

I usually store allowable node information in the structure too, so my_malloc() can check that node requests are sensible without making function calls; I also check that nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc atomic builtin; if you're not compiling with gcc you'll need something else. I use atomics because, in my limited experience, they are much faster than mutexes in high thread/core-count conditions (as on 4P NUMA machines).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow