Unfortunately, there is no better way to trace the inner workings of MPI collective operations. The standard tracing interface in MPI is built around the PMPI mechanism: all MPI_* calls are implemented as weak aliases of the actual functions, which are available under the PMPI_* names (the PMPI_* symbols being either the real implementations or aliases themselves). This allows tracer libraries to provide their own strong MPI_* functions that generate trace events before and after calling through to PMPI_*. For example:
int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
{
    int result;

    trace_event_start("MPI_Reduce");
    result = PMPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
    trace_event_end("MPI_Reduce");
    return result;
}
When this code is linked with the rest of the program, all calls to MPI_Reduce resolve to the tracing version (since MPI_Reduce was originally a weak alias, the linker won't complain about the symbol being redefined).
Now, the real problem in your case is that MPI_Reduce is implemented not in terms of MPI_Send and MPI_Recv but in terms of low-level internal MPICH2 functions such as MPIC_Send_ft and MPIC_Recv_ft, which cannot be intercepted through the PMPI mechanism. What you can do in this case is extract the reduction code from the MPICH2 source, replace the internal calls with calls to MPI_Send and MPI_Recv, and then trace the resulting implementation.
I have done the procedure described above, and it works quite well with Open MPI, except for a minor inconvenience: once you provide your own implementation of an MPI function such as MPI_Reduce, it is no longer a weak alias, and linking with the tracing library could produce a duplicate-symbol error. In that case I would simply name my implementation MyMPI_Reduce and put #define MPI_Reduce MyMPI_Reduce at the beginning of those source files that have to be traced. I'm not that familiar with MPICH2, but from the source code I could tell that it allows user implementations to be plugged in, which would make things easier (e.g. no need to cheat with the preprocessor).
One more thing: MPICH2 has several implementations of the reduction (at least in version 3.0) and chooses between them at run time using simple heuristic logic:
if ((count*type_size > MPIR_PARAM_REDUCE_SHORT_MSG_SIZE) &&
    (HANDLE_GET_KIND(op) == HANDLE_KIND_BUILTIN) && (count >= pof2)) {
    /* do a reduce-scatter followed by gather to root. */
    mpi_errno = MPIR_Reduce_redscat_gather(sendbuf, recvbuf, count, datatype,
                                           op, root, comm_ptr, errflag);
    if (mpi_errno) {
        /* for communication errors, just record the error but continue */
        *errflag = TRUE;
        MPIU_ERR_SET(mpi_errno, MPI_ERR_OTHER, "**fail");
        MPIU_ERR_ADD(mpi_errno_ret, mpi_errno);
    }
}
else {
    /* use a binomial tree algorithm */
    mpi_errno = MPIR_Reduce_binomial(sendbuf, recvbuf, count, datatype,
                                     op, root, comm_ptr, errflag);
    if (mpi_errno) {
        /* for communication errors, just record the error but continue */
        *errflag = TRUE;
        MPIU_ERR_SET(mpi_errno, MPI_ERR_OTHER, "**fail");
        MPIU_ERR_ADD(mpi_errno_ret, mpi_errno);
    }
}