Question

I'm trying to set up an MPI client that connects to a server which publishes a certain name, but it doesn't work and I can't figure out why.

The MPI implementation is Open MPI 1.6 with g++-4.7, and /usr/lib64/mpi/gcc/openmpi/etc/openmpi-default-hostfile contains one line:

MY_IP

The following "minimal" (I don't like questions using too much code but I think I should include it here) example illustrates the problem:

mpi_srv.cc

#include <iostream>
#include <mpi.h>

int main (void)
{

  int rank(0);
  MPI_Init(0, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &rank);
  std::cout << "Rank: " << rank << std::endl;
  char port_name[MPI_MAX_PORT_NAME];
  MPI_Open_port(MPI_INFO_NULL, port_name);
  char publish_name[1024] = {'t','e','s','t','_','p','o','r','t','\0'};
  MPI_Publish_name(publish_name, MPI_INFO_NULL, port_name);
  std::cout << "Port: " << publish_name << " (" << port_name << ")" << std::endl;
  MPI_Comm client;
  std::cout << "Wating for Comm..." << std::endl;
  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
  std::cout << "Comm accepted" << std::endl;
  MPI_Comm_free(&client);
  MPI_Unpublish_name(publish_name, MPI_INFO_NULL, port_name);
  MPI_Close_port(port_name);
  MPI_Finalize();
  return 0;

}

compiled and executed via

mpic++ mpi_srv.cc -o mpi_srv.x
mpirun mpi_srv.x

prints

Rank: 1
Port: test_port (2428436480.0;tcp://MY_IP:33573+2428436481.0;tcp://MY_IP:43172:300)
Wating for Comm...

and blocks as required.

My client

mpi_client.cc

#include <iostream>
#include <mpi.h>

int main (void)
{

  int rank(0);
  MPI_Init(0, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &rank);
  std::cout << "Rank: " << rank << std::endl;
  char port_name[MPI_MAX_PORT_NAME];
  char publish_name[1024] = {'t','e','s','t','_','p','o','r','t','\0'};
  MPI_Lookup_name(publish_name, MPI_INFO_NULL, port_name);
  MPI_Comm client;
  MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);
  MPI_Comm_disconnect(&client);
  MPI_Finalize();
  return 0;

}

compiled and executed via

mpic++ mpi_client.cc -o mpi_client.x
mpirun mpi_client.x

prints

Rank: 1
[MY_HOST:24870] *** An error occurred in MPI_Lookup_name
[MY_HOST:24870] *** on communicator MPI_COMM_WORLD
[MY_HOST:24870] *** MPI_ERR_NAME: invalid name argument
[MY_HOST:24870] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

with the server still running.

I removed the error checking from the examples above, but the function return values indicate that the port name was published successfully in the server executable. I found out that this problem can arise when the published port is invisible to the client because different mpirun executables were used, but I used the same mpirun executable to run both.

Why doesn't the client connect to the server as I'd expect here?

Solution

When you run two separate MPI sessions, e.g.:

$ mpirun mpi_server.x
...

and

$ mpirun mpi_client.x
...

the second (client) MPI session has to be told where the naming service that holds the name/port mapping is located. With Open MPI you have several choices of naming service:

  • an instance of the dedicated naming service daemon ompi-server, or
  • the mpirun process of the server session.

In both cases the client session has to be provided with the location of the naming service. See this question and my answer to it for more information on how to deal with this in Open MPI.

Other tips

Name publishing is a tricky thing and can behave a little differently from one implementation to the next. It's up to the implementation to decide what level of support it will provide. For Open MPI (https://www.open-mpi.org/doc/v1.5/man3/MPI_Publish_name.3.php), it appears that you can set an MPI_Info key to specify that the name should be published locally or globally. You should make sure that you're publishing globally if you won't be starting your clients via MPI_Comm_spawn (which you're not).

Beyond that, this isn't a feature that I've used a lot so it may be that there's something else going on here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow