For this to work, you need to deal with chunks of memory that are at least page-size and page-aligned - that means 4KB in most systems. In your case, I suspect the page gets moved twice (possibly three times), due to you calling mbind()
three times over.
The way numa memory is located is that CPU socket 0 has a range of 0..X-1 MB, socket 1 has X..2X-1, socket three has 2X-3X-1, etc. Of course, if you stick a 4GB stick of ram next to socket 0 and a 16GB in the socket 1, then the distribution isn't even. But the principle still stands that a large chunk of memory is allocated for each socket, in accordance to where the memory is actually located.
As a consequence of how the memory is located, the physical location of the memory you are using will have to be placed in the linear (virtual) address space by page-mapping.
So, for large "chunks" of memory, it is fine to move it around, but for small chunks, it won't work quite right - you certainly can't "split" a page into something that is affine to two different CPU sockets.
Edit:
To split an array, you first need to find the page-aligned size.
page_size = sysconf(_SC_PAGESIZE);
objs_per_page = page_size / sizeof(A[0]);
// We should be an even number of "objects" per page. This checks that that
// no object straddles a page-boundary
ASSERT(page_size % sizeof(A[0]));
split_three = SIZE / 3;
aligned_size = (split_three / objs_per_page) * objs_per_page;
remnant = SIZE - (aligned_size * 3);
piece = aligned_size;
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
mbind(&A[aligned_size],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
mbind(&A[aligned_size*2 + remnant],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
Obviously, you will now need to split the three threads similarly using the aligned size and remnant as needed.