what does it mean configuring MPI for shared memory?

Question

Open MPI is very modular. It has its own component model called Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from - it is used to provide runtime values to MCA parameters, exported by the different components in the MCA.

Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components, that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared memory BTL component, known as sm. If both processes reside on different nodes, Open MPI walks the available network interfaces and choses the fastest one that can connect to the other node. It puts some preferences on fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback if the tcp BTL component is in the list of allowed BTLs.

By default you do not need to do anything special in order to enable shared memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared memory part of the Open MPI FAQ which gives hints on what parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems. Note that the default shared memory communication implementation copies the data twice - once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download it and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node - the copy is done by the kernel device and it is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.

Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block have all processes operate on it directly. If data is stored in the shared memory, it could be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory from remote sockets on the same board is slower. Process pinning/binding is also important - pass --bind-to-socket to mpiexec to have it pinn each MPI process to a separate CPU core.