Question

I have allocated memory using valloc, let's say array A of [15*sizeof(double)]. Now I divided it into three pieces and I want to bind each piece (of length 5) into three NUMA nodes (let's say 0,1, and 2). Currently, I am doing the following:

double* A=(double*)valloc(15*sizeof(double));

piece=5; 
nodemask=1;
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

nodemask=2;
mbind(&A[5],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

nodemask=4;
mbind(&A[10],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

First question is am I doing it right? I.e. is there any problems with being properly aligned to page size for example? Currently with size of 15 for array A it runs fine, but if I reset the array size to something like 6156000 and piece=2052000, and subsequently three calls to mbind start with &A[0], &A[2052000], and &A[4104000] then I am getting a segmentation fault (and sometimes it just hangs there). Why it runs for small size fine but for larger gives me segfault? Thanks.

Was it helpful?

Solution

For this to work, you need to deal with chunks of memory that are at least page-size and page-aligned - that means 4KB in most systems. In your case, I suspect the page gets moved twice (possibly three times), due to you calling mbind() three times over.

The way numa memory is located is that CPU socket 0 has a range of 0..X-1 MB, socket 1 has X..2X-1, socket three has 2X-3X-1, etc. Of course, if you stick a 4GB stick of ram next to socket 0 and a 16GB in the socket 1, then the distribution isn't even. But the principle still stands that a large chunk of memory is allocated for each socket, in accordance to where the memory is actually located.

As a consequence of how the memory is located, the physical location of the memory you are using will have to be placed in the linear (virtual) address space by page-mapping.

So, for large "chunks" of memory, it is fine to move it around, but for small chunks, it won't work quite right - you certainly can't "split" a page into something that is affine to two different CPU sockets.

Edit:

To split an array, you first need to find the page-aligned size.

page_size = sysconf(_SC_PAGESIZE);

objs_per_page = page_size / sizeof(A[0]); 
// We should be an even number of "objects" per page. This checks that that 
// no object straddles a page-boundary
ASSERT(page_size % sizeof(A[0]));   

split_three = SIZE / 3; 

aligned_size = (split_three / objs_per_page) * objs_per_page;

remnant = SIZE - (aligned_size * 3);

piece = aligned_size;

mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

mbind(&A[aligned_size],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

mbind(&A[aligned_size*2 + remnant],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

Obviously, you will now need to split the three threads similarly using the aligned size and remnant as needed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top