Your approach will probably work OK for a moderate amount of data, but you've made one rank the central point of communication, and a single-rank bottleneck like that isn't going to scale well.
You're on the right track with your part 2: a parallel write using MPI-IO sounds like a good approach to me. Here's how that might go:
- Continue to have your T processes read their inputs.
- I'm going to assume that 'id' is densely allocated. What I mean is: in this collection of files, if a process sees data with id 4, can it know that other processes hold ids 1, 2, 3, and 5? If so, then every process knows where its data has to go.
- Let's also assume each 'data' item is a fixed size. The approach is only a little more complicated if that's not the case.
If you don't know the max ID and the max timestep, you'll have to do a bit of communication (an MPI_Allreduce with MPI_MAX as the operation) to find them.
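For instance, a minimal sketch of that reduction, where local_max_id and local_max_step are hypothetical names for the largest id and timestep each rank has read:

    int local_max[2] = { local_max_id, local_max_step };
    int global_max[2];
    /* every rank learns the global maxima in one collective call */
    MPI_Allreduce(local_max, global_max, 2, MPI_INT, MPI_MAX, MPI_COMM_WORLD);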
With these preliminaries in place, you can describe each process's piece of the file with an MPI-IO "file view", probably built with MPI_Type_indexed.
On rank 0 this gets a bit more complicated, because you also need to add the timestep markers to your list of data. Alternatively, you can define a file format with an index of timesteps, and store that index in a header or footer.
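If you go the header route, one hedged sketch of rank 0's part (HEADER_BYTES, nsteps, and step_offsets are hypothetical names, and every rank would then set its file view at a displacement of HEADER_BYTES instead of 0):

    if (rank == 0) {
        /* rank 0 alone records where each timestep begins */
        MPI_File_write_at(fh, 0, step_offsets, nsteps,
                          MPI_LONG_LONG, MPI_STATUS_IGNORE);
    }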
The code would look roughly like this:
    MPI_Datatype filetype;
    for (i = 0; i < nitems; i++) {
        datalen[i] = sizeof(item);
        /* e.g. index_of_item = timestep * max_id + id for a dense layout */
        offsets[i] = sizeof(item) * index_of_item;
    }
    MPI_Type_indexed(nitems, datalen, offsets, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buffer, nitems * sizeof(item), MPI_BYTE, &status);
    MPI_Type_free(&filetype);
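For context, the surrounding collective open and close would look something like this (the filename here is just a placeholder):

    MPI_File fh;
    MPI_Status status;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* ... build the filetype, set the view, and write as above ... */
    MPI_File_close(&fh);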
The _all here is important: each MPI process is going to generate a highly noncontiguous, irregular access pattern. Issuing the write collectively gives the MPI-IO library a chance to optimize that request.
Also note that the displacements in an MPI-IO file view must be monotonically non-decreasing, so you'll have to sort your items locally before writing the data out collectively. Local memory operations cost next to nothing compared to an I/O operation, so this usually isn't a big deal.
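The local sort can be as simple as a qsort keyed on whatever determines the file offset; here 'struct item' and its 'id' field are hypothetical stand-ins for your record type:

    #include <stdlib.h>

    struct item { int id; /* ... payload ... */ };

    /* ascending id order == ascending file offset */
    static int compare_by_id(const void *a, const void *b)
    {
        const struct item *x = a, *y = b;
        return (x->id > y->id) - (x->id < y->id);
    }

    /* ... */
    qsort(items, nitems, sizeof(struct item), compare_by_id);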