Question

Important: Scroll down to the "final update" before you invest too much time here. Turns out the main lesson is to beware of the side effects of other tests in your unittest suite, and to always reproduce things in isolation before jumping to conclusions!


On the face of it, the following 64-bit code allocates (and accesses) a million (2^20) 4k pages using VirtualAlloc, a total of 4GByte:

#include <boost/test/unit_test.hpp>

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

#include <algorithm>
#include <vector>

const size_t N=4;  // Tests with this many Gigabytes
const size_t pagesize4k=4096;
const size_t npages=(N<<30)/pagesize4k;

BOOST_AUTO_TEST_CASE(test_VirtualAlloc) {

  std::vector<void*> pages(npages,0);
  for (size_t i=0;i<pages.size();++i) {
    pages[i]=VirtualAlloc(0,pagesize4k,MEM_RESERVE|MEM_COMMIT,PAGE_READWRITE);
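    // Touch each page so it actually ends up in the process's working set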
    *reinterpret_cast<char*>(pages[i])=1;
  }

  // Check all allocs succeeded
  BOOST_CHECK(std::find(pages.begin(),pages.end(),nullptr)==pages.end()); 

  // Free what we allocated
  bool trouble=false;
  for (size_t i=0;i<pages.size();++i) {
    const BOOL ok=VirtualFree(pages[i],0,MEM_RELEASE);
    if (!ok) trouble=true;
  }
  BOOST_CHECK(!trouble);
}

However, while it executes, the "Working Set" reported in Windows Task Manager (and confirmed by the value "sticking" in the "Peak Working Set" column) grows from a baseline of ~200,000K (~200MByte) to over 6,000,000 or 7,000,000K (tested on 64-bit Windows 7, and also on ESX-virtualized 64-bit Server 2003 and Server 2008; unfortunately I didn't note which of the observed numbers came from which system).

Another very similar test case in the same unittest executable does a million 4k mallocs (followed by frees), and that one only grows the working set by roughly the expected 4GByte while it runs.
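For reference, that test is roughly the following shape (a reconstructed sketch reusing the constants above, not the verbatim code):

#include <cstdlib>

BOOST_AUTO_TEST_CASE(test_malloc) {

  std::vector<void*> blocks(npages,0);
  for (size_t i=0;i<blocks.size();++i) {
    blocks[i]=std::malloc(pagesize4k);
    *reinterpret_cast<char*>(blocks[i])=1;
  }

  // Check all allocs succeeded
  BOOST_CHECK(std::find(blocks.begin(),blocks.end(),nullptr)==blocks.end());

  // Free what we allocated
  for (size_t i=0;i<blocks.size();++i) {
    std::free(blocks[i]);
  }
}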

I don't get it: does VirtualAlloc have some quite high per-allocation overhead? If so it's clearly a significant fraction of the page size; why is so much extra needed, and what is it for? Or am I misunderstanding what the reported "Working Set" actually means? What's going on here?
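As an aside, the same numbers can be read from inside the process via PSAPI rather than eyeballing Task Manager; something like the following helper (the name is mine, error handling omitted) is how I'd instrument the tests:

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <Psapi.h>  // GetProcessMemoryInfo; link against Psapi.lib

#include <cstdio>

// Print this process's current and peak working set, in KBytes.
void report_working_set(const char* label) {
    PROCESS_MEMORY_COUNTERS pmc = {};
    pmc.cb = sizeof(pmc);
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
        std::printf("%s: working set %lluK, peak %lluK\n", label,
                    static_cast<unsigned long long>(pmc.WorkingSetSize)/1024,
                    static_cast<unsigned long long>(pmc.PeakWorkingSetSize)/1024);
    }
}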

Update: With reference to Hans' answer, I note that the following fails with an access violation on the second page access, so whatever is going on isn't as simple as the allocation being rounded up to the 64K "granularity".

char*const ptr = reinterpret_cast<char*>(
  VirtualAlloc(0, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE)
);
ptr[0] = 1;
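// The next write faults: only the first 4k page is usable; nothing past it was committed.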
ptr[4096] = 1;
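To see exactly what that single VirtualAlloc produced, the region can be probed with VirtualQuery; a small standalone diagnostic along these lines (my own sketch, separate from the test code):

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

#include <cstdio>

int main() {
    // Same single-page allocation as above.
    char* const ptr = static_cast<char*>(
        VirtualAlloc(0, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (!ptr) return 1;

    // Ask the OS what backs the first page and the byte just past it.
    const char* probes[] = { ptr, ptr + 4096 };
    for (const char* p : probes) {
        MEMORY_BASIC_INFORMATION info = {};
        if (VirtualQuery(p, &info, sizeof(info))) {
            std::printf("addr %p: state %s, region size %lluK\n",
                        static_cast<const void*>(p),
                        info.State == MEM_COMMIT ? "MEM_COMMIT" :
                        info.State == MEM_RESERVE ? "MEM_RESERVE" : "MEM_FREE",
                        static_cast<unsigned long long>(info.RegionSize)/1024);
        }
    }
    VirtualFree(ptr, 0, MEM_RELEASE);
    return 0;
}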

Update: Now on an AWS/EC2 Windows Server 2008 R2 instance, with Visual Studio Express 2013 installed, I can't reproduce the problem with this minimal code (compiled 64-bit), which tops out with an apparently overhead-free peak working set of 4,335,816K; that's the sort of number I'd expected to see originally. So either there is something different about the other machines I was running on, or about the boost-test based exe used in the previous testing. Bizarro, to be continued...

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

#include <vector>

int main(int, char**) {

    const size_t N = 4;
    const size_t pagesize4k = 4096;
    const size_t npages = (N << 30) / pagesize4k;

    std::vector<void*> pages(npages, 0);
    for (size_t i = 0; i < pages.size(); ++i) {
        pages[i] = VirtualAlloc(0, pagesize4k, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        *reinterpret_cast<char*>(pages[i]) = 1;
    }

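    // Pause so the peak working set can be read off Task Manager before the pages are freed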
    Sleep(5000);

    for (size_t i = 0; i < pages.size(); ++i) {
        VirtualFree(pages[i], 0, MEM_RELEASE);
    }

    return 0;
}

Final update: Apologies! I'd delete this question if I could, because it turns out the observed problems were entirely due to an immediately preceding unittest in the test suite which used TBB's "scalable allocator" to allocate/deallocate a couple of GByte of stuff. It seems the scalable allocator actually retains such allocations in its own pool rather than returning them to the system (see e.g. here or here). This became obvious once I ran the tests individually, with enough of a Sleep after each one to observe its on-completion working set in Task Manager. (Whether anything can be done about the TBB behaviour might be an interesting question, but as-is the question here is a red herring.)
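For anyone curious about the TBB side: newer tbbmalloc releases expose scalable_allocation_command, which can be asked to flush the allocator's cached buffers back to the OS. A sketch, assuming a TBB version that actually provides it:

#include <tbb/scalable_allocator.h>

// Sketch: ask tbbmalloc to release its cached (pooled) memory back to the OS.
// scalable_allocation_command / TBBMALLOC_CLEAN_ALL_BUFFERS only exist in
// newer TBB releases, so check your version before relying on this.
inline void flush_tbb_allocator_pools() {
    scalable_allocation_command(TBBMALLOC_CLEAN_ALL_BUFFERS, 0);
}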


Solution 2

It turns out the observed problems were entirely due to an immediately preceding unittest in the test suite which used TBB's "scalable allocator" to allocate/deallocate a couple of GByte of stuff. It seems the scalable allocator actually retains such allocations in its own pool rather than returning them to the system (see e.g. here or here). This became obvious once I ran the tests individually, with enough of a Sleep after each one to observe its on-completion working set in Task Manager.

OTHER TIPS

   pages[i]=VirtualAlloc(0,pagesize4k,MEM_RESERVE|MEM_COMMIT,PAGE_READWRITE);

You won't get 4096 bytes; the request is rounded up to the smallest permitted allocation, which is SYSTEM_INFO.dwAllocationGranularity. That has been 64KB for a long time. It is a very basic counter-measure against address space fragmentation.

So you are allocating way more than you think.
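The numbers are easy to confirm on any particular machine; a few lines with GetSystemInfo print both the page size and the allocation granularity:

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

#include <cstdio>

int main() {
    SYSTEM_INFO si = {};
    GetSystemInfo(&si);
    // Typically reports a 4K page size and a 64K allocation granularity.
    std::printf("page size: %lu bytes, allocation granularity: %lu bytes\n",
                si.dwPageSize, si.dwAllocationGranularity);
    return 0;
}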

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow