MPI (OpenMPI) - MPI_Publish_name cannot contact global ompi-server and throws error

StackOverflow https://stackoverflow.com/questions/23450839

  •  15-07-2023

Question

I am attempting to write an MPI application consisting of programs in the client-server mould. I am stuck trying to get the server to publish its name to the ompi-server in the global scope.

Here is the server code:

#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs, errmpi;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    // Request global scope so the name is published to the ompi-server
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "ompi_global_scope", "yes");
    MPI_Open_port(info, port_name);

    // Fails here
    MPI_Publish_name("ServerName", info, port_name);

    // Rest of code...

I get the following error on running it:

$ ./mpi/bin/mpirun -np 1 --mca btl self ServerName
--------------------------------------------------------------------------
Process rank 0 attempted to publish to a global ompi_server that
could not be contacted. This is typically caused by either not
specifying the contact info for the server, or by the server not
currently executing. If you did specify the contact info for a
server, please check to see that the server is running and start
it again (or have your sys admin start it) if it isn't.

--------------------------------------------------------------------------
[xxx:18205] *** An error occurred in MPI_Publish_name
[xxx:18205] *** reported by process [1424949249,139676631433216]
[xxx:18205] *** on communicator MPI_COMM_WORLD
[xxx:18205] *** MPI_ERR_INTERN: internal error
[xxx:18205] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[xxx:18205] ***    and potentially your MPI job)

I do have the ompi-server process running in debug mode in a console:

$ ./ompi-server --no-daemonize -d -r +
[xxx:14140] [[9416,0],0] orte-server: up and running!

Ultimately I will distribute the processes across various nodes, but for now I would really like to get the framework working on a single node. Could someone please help? Thanks very much indeed!

EDIT 1: Thank you very much for your quick reply. I made the following change, so that ompi-server writes its URI to a file:

$ mpi/bin/ompi-server --no-daemonize -d -r mpiuri

If I now run the program as follows, it hangs at the point where it previously failed:

$./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v Server

While if I run the program with the following,

$ ./mpi/bin/mpirun --ompi-server file:mpiuri -mca btn tcp,self,sm -np 1 -v --wait-for-server --server-wait-time 10 Server

I get the following error:

--------------------------------------------------------------------------
mpirun was instructed to wait for the requested ompi-server, but was unable to
establish contact with the server during the specified wait time:

Server uri:  799801344.0;tcp://192.168.1.113:44487
Timeout time: 10

Error received: Not supported

Please check to ensure that the requested server matches the actual server
information, and that the server is in operation.
--------------------------------------------------------------------------

I must be close, but I can't quite figure it out.

I am fairly sure it is not the firewall, since I added the rule ALLOW 192.168.1.0/24 to ufw.


Solution

Here is how to connect to the ompi-server:

1) Ensure that ompi-server is up and running and is writing its URI to a file, using the following command:

$ mpi/bin/ompi-server --no-daemonize -d -r mpiuri

2) Start all the MPI processes with this URI file, ensuring that you

  1. prefix the URI filename with "file:" when you pass the --ompi-server parameter
  2. pass the hostname of the node where you are running mpirun, like so:

    $ ./mpi/bin/mpirun --ompi-server file:mpiuri -host myHostName -np 1 -v Server
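
Since both steps depend on the URI file, it can help to confirm that ompi-server actually wrote something usable to it before launching mpirun. The following is only a minimal sketch in plain C (no MPI calls). The function name uri_file_looks_valid is hypothetical, and the expected shape of the line ("<job id>;<transport>://...") is an assumption taken from the "Server uri" line in the error output above; other Open MPI versions may write a different format.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the first line of `path` looks like an ompi-server URI,
 * e.g. "799801344.0;tcp://192.168.1.113:44487", and 0 otherwise.
 * ASSUMPTION: the "<job id>;<transport>://..." shape is inferred from the
 * "Server uri" line printed by mpirun, not from Open MPI documentation. */
int uri_file_looks_valid(const char *path)
{
    char line[1024];
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return 0;                           /* file missing or unreadable */

    int ok = 0;
    if (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0'; /* strip the trailing newline */
        const char *sep = strchr(line, ';');
        /* require a non-empty job id before ';' and a transport URI after */
        ok = sep != NULL && sep != line && strstr(sep + 1, "://") != NULL;
    }
    fclose(fp);
    return ok;
}
```

Calling uri_file_looks_valid("mpiuri") from a small launcher or test program would catch the case where the file is missing or empty. It only checks the shape of the line, so it says nothing about whether the server is still running or reachable.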

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow