Question

I am trying to fix a performance problem in a multi-threaded application that uses tcmalloc. Each thread creates a large number of objects, and my analysis is that the thread caches in tcmalloc cannot satisfy the allocations and often have to fetch memory from the central page heap. This is the output of the app with MALLOCSTATS=2 for 4 threads:

Total size of freelists for per-thread caches,
transfer cache, and central cache, by size class
------------------------------------------------
class   1 [        8 bytes ] :     2046 objs;   0.0 MiB;   0.0 cum MiB
class   2 [       16 bytes ] :     1023 objs;   0.0 MiB;   0.0 cum MiB
class   3 [       32 bytes ] :      507 objs;   0.0 MiB;   0.0 cum MiB
class   5 [       64 bytes ] :      511 objs;   0.0 MiB;   0.1 cum MiB
class   6 [       80 bytes ] :      204 objs;   0.0 MiB;   0.1 cum MiB
class   9 [      128 bytes ] :      128 objs;   0.0 MiB;   0.1 cum MiB
class  15 [      224 bytes ] :       73 objs;   0.0 MiB;   0.1 cum MiB
class  16 [      240 bytes ] :       68 objs;   0.0 MiB;   0.1 cum MiB
class  17 [      256 bytes ] :       64 objs;   0.0 MiB;   0.2 cum MiB
class  19 [      320 bytes ] :       47 objs;   0.0 MiB;   0.2 cum MiB
class  25 [      512 bytes ] :      352 objs;   0.2 MiB;   0.3 cum MiB
class  26 [      576 bytes ] :       28 objs;   0.0 MiB;   0.4 cum MiB
class  33 [     1024 bytes ] :     1072 objs;   1.0 MiB;   1.4 cum MiB
class  39 [     2048 bytes ] :      832 objs;   1.6 MiB;   3.0 cum MiB
class  45 [     4096 bytes ] :      276 objs;   1.1 MiB;   4.1 cum MiB
class  50 [     8192 bytes ] :        2 objs;   0.0 MiB;   4.1 cum MiB
------------------------------------------------
PageHeap: 16 sizes;  713.5 MiB free;    0.0 MiB unmapped
------------------------------------------------
     2 pages *     39 spans ~    0.6 MiB;    0.6 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
     4 pages *     19 spans ~    0.6 MiB;    1.2 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
     6 pages *     17 spans ~    0.8 MiB;    2.0 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
     8 pages *      6 spans ~    0.4 MiB;    2.4 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    10 pages *      4 spans ~    0.3 MiB;    2.7 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    12 pages *      2 spans ~    0.2 MiB;    2.9 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    14 pages *      2 spans ~    0.2 MiB;    3.1 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    16 pages *      2 spans ~    0.2 MiB;    3.3 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    20 pages *      1 spans ~    0.2 MiB;    3.5 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    28 pages *      1 spans ~    0.2 MiB;    3.7 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    30 pages *      2 spans ~    0.5 MiB;    4.2 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    34 pages *      1 spans ~    0.3 MiB;    4.5 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    44 pages *      2 spans ~    0.7 MiB;    5.1 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    76 pages *      1 spans ~    0.6 MiB;    5.7 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
    78 pages *      1 spans ~    0.6 MiB;    6.3 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum
   108 pages *      1 spans ~    0.8 MiB;    7.2 MiB cum; unmapped:    0.0 MiB;    0.0 MiB cum

255 large * 15 spans ~ 706.3 MiB; 713.5 MiB cum; unmapped: 0.0 MiB; 0.0 MiB cum
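
As a cross-check of the dump above, the per-class figures can be recomputed from the object counts. The following is only a minimal sketch: it assumes the line format shown above, and the regex and the function name `freelist_bytes` are illustrative helpers, not part of tcmalloc.

```python
import re

# Size-class lines copied from the MALLOCSTATS=2 dump above.
STATS = """\
class   1 [        8 bytes ] :     2046 objs;   0.0 MiB;   0.0 cum MiB
class   2 [       16 bytes ] :     1023 objs;   0.0 MiB;   0.0 cum MiB
class   3 [       32 bytes ] :      507 objs;   0.0 MiB;   0.0 cum MiB
class   5 [       64 bytes ] :      511 objs;   0.0 MiB;   0.1 cum MiB
class   6 [       80 bytes ] :      204 objs;   0.0 MiB;   0.1 cum MiB
class   9 [      128 bytes ] :      128 objs;   0.0 MiB;   0.1 cum MiB
class  15 [      224 bytes ] :       73 objs;   0.0 MiB;   0.1 cum MiB
class  16 [      240 bytes ] :       68 objs;   0.0 MiB;   0.1 cum MiB
class  17 [      256 bytes ] :       64 objs;   0.0 MiB;   0.2 cum MiB
class  19 [      320 bytes ] :       47 objs;   0.0 MiB;   0.2 cum MiB
class  25 [      512 bytes ] :      352 objs;   0.2 MiB;   0.3 cum MiB
class  26 [      576 bytes ] :       28 objs;   0.0 MiB;   0.4 cum MiB
class  33 [     1024 bytes ] :     1072 objs;   1.0 MiB;   1.4 cum MiB
class  39 [     2048 bytes ] :      832 objs;   1.6 MiB;   3.0 cum MiB
class  45 [     4096 bytes ] :      276 objs;   1.1 MiB;   4.1 cum MiB
class  50 [     8192 bytes ] :        2 objs;   0.0 MiB;   4.1 cum MiB
"""

# Matches "class N [ SIZE bytes ] : COUNT objs" (assumed format, taken from the dump).
CLASS_LINE = re.compile(r"class\s+(\d+)\s+\[\s*(\d+) bytes \]\s*:\s*(\d+) objs")


def freelist_bytes(stats_text):
    """Return {size_class: bytes sitting in free lists} for every class line."""
    totals = {}
    for match in CLASS_LINE.finditer(stats_text):
        size_class, object_size, object_count = map(int, match.groups())
        totals[size_class] = object_size * object_count
    return totals


totals = freelist_bytes(STATS)
total_mib = sum(totals.values()) / (1024 * 1024)
print(f"free-list total: {total_mib:.1f} MiB")  # prints "free-list total: 4.1 MiB"
```

The recomputed total agrees with the 4.1 cum MiB in the last class line, against 713.5 MiB reported free in the page heap. Note that, per its own heading, this table adds up the per-thread caches, the transfer cache, and the central cache, so it cannot by itself show how much any single thread cache holds.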

Now I don't really understand whether this output shows that thread caches are getting exhausted, or which ones. My conclusion that the thread caches are getting exhausted is based on observing the program running under GDB and reading the tcmalloc code, which ends up calling the futex system call.

UPDATE: I also noticed that the per-thread cache figures do not change when the number of threads is increased or decreased. It is the page heap that grows.
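
One way the observation above could be probed is to rerun the program with a different overall thread-cache budget and compare the dumps. The sketch below assumes the gperftools build of tcmalloc, which reads the `TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES` environment variable; the helper name `tcmalloc_env`, the binary path `./app`, and the 256 MiB value are hypothetical.

```python
import os


def tcmalloc_env(cache_bytes, base_env=None):
    """Return a copy of the environment with tcmalloc stats and cache budget set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MALLOCSTATS"] = "2"  # same stats level as in the dump above
    # Assumption: gperftools tcmalloc; budget shared by all thread caches, in bytes.
    env["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(cache_bytes)
    return env


# Hypothetical usage; "./app" stands in for the real binary:
#   import subprocess
#   subprocess.run(["./app"], env=tcmalloc_env(256 * 1024 * 1024), check=True)
```

If the free-list totals and the futex calls change noticeably between runs with different budgets, that would support the thread-cache explanation; if they do not, the contention is probably elsewhere.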

No correct solution

Licensed under: CC-BY-SA with attribution