Your mlx4_core parameters allow for the registration of only 2^20 * 2^4 * 4 KiB = 64 GiB. With 192 GiB of physical memory per node, and given the recommendation that the registerable memory be at least twice the physical memory, you should set log_num_mtt to 23. That raises the limit to 2^23 * 2^4 * 4 KiB = 512 GiB, the smallest power of two greater than or equal to twice the amount of RAM (384 GiB). For the change to take effect, reboot the node(s) or unload and then reload the kernel module.
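The arithmetic above can be reproduced with a short shell sketch. It assumes that the registerable memory is 2^log_num_mtt * 2^log_mtts_per_seg * page size, with log_mtts_per_seg = 4 and a 4 KiB page as implied by the figures above. The function names max_reg_mem_gib and min_log_num_mtt are made up for this illustration.

```shell
#!/bin/bash

# Registerable memory in GiB for the given mlx4_core parameters.
# Args: log_num_mtt, log_mtts_per_seg, page size in bytes (default 4096).
max_reg_mem_gib() {
    local log_num_mtt=$1 log_mtts_per_seg=$2 page_size=${3:-4096}
    echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size / (1 << 30) ))
}

# Smallest log_num_mtt whose limit covers twice the given RAM (in GiB).
# Args: RAM in GiB, log_mtts_per_seg, page size in bytes (default 4096).
min_log_num_mtt() {
    local ram_gib=$1 log_mtts_per_seg=$2 page_size=${3:-4096}
    local target=$(( 2 * ram_gib * (1 << 30) ))
    local n=0
    while (( (1 << n) * (1 << log_mtts_per_seg) * page_size < target )); do
        n=$(( n + 1 ))
    done
    echo "$n"
}

max_reg_mem_gib 20 4     # limit with the current settings, in GiB
min_log_num_mtt 192 4    # suggested log_num_mtt for 192 GiB of RAM
max_reg_mem_gib 23 4     # limit after the change, in GiB
```

The new value would typically go into an options line for mlx4_core in a file under /etc/modprobe.d/ (the file name is site-specific). After reloading the module, the active values can usually be read back from /sys/module/mlx4_core/parameters/.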
You should also submit a simple Torque job script that executes ulimit -l in order to check the limit on locked memory as seen by jobs and make sure that it is reported as unlimited. Note that ulimit -c unlimited does not remove the limit on the amount of locked memory; it removes the limit on the size of core dump files.
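A minimal job script for this check might look like the sketch below. The job name, resource request, and the helper function check_memlock are placeholders chosen for illustration, so adjust them to your site's queue configuration. Running the check inside a job matters because processes started by the batch system may see different limits than an interactive login shell.

```shell
#!/bin/bash
#PBS -N check_memlock
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00
#PBS -j oe

# Report the locked-memory limit as seen from inside the job.
# bash prints the value of "ulimit -l" in KiB, or the word "unlimited".
check_memlock() {
    local limit
    limit=$(ulimit -l)
    if [ "$limit" = "unlimited" ]; then
        echo "OK: locked memory is unlimited on ${HOSTNAME}"
    else
        echo "WARNING: locked memory is limited to ${limit} KiB on ${HOSTNAME}"
    fi
}

check_memlock
```

Submit it with qsub and inspect the job's output file. Any result other than the OK line means that the limit still has to be raised on the compute nodes.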