Question

Consider the following simple program:

#include <mpi.h>
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <string>
#include <vector>

using std::cout;
using std::string;
using std::vector;

vector<float> test;
#ifdef GLOBAL
string hostname;
#endif

int main(int argc, char** argv) {
  int rank;  // The node id of this processor.
  int size;  // The total number of nodes.
#ifndef GLOBAL
  string hostname;
#endif
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  cout << "Joining the job as processor: " << rank << std::endl;

  {
    char buf[2048] = "HELLO";
    hostname.assign(buf, 2048);
  }
  test.push_back(1.0f);

  cout << "Hostname: " << hostname << "::" << test[0] << std::endl;

  MPI_Finalize();
  return 0;
}

If I compile/run this with:

mpicxx -c test.cc && mpicxx -lstdc++ test.o -o test && ./test

there is no segmentation fault, but if I run it with:

mpicxx -DGLOBAL -c test.cc && mpicxx -lstdc++ test.o -o test && ./test

then there is a segmentation fault at the hostname.assign() line. In addition, if I remove this assignment, there is a segmentation fault in the string destructor once main returns, so the assign call isn't the actual culprit.

Notice that the only difference is where the "global" variable hostname gets declared.

I am compiling with MPICH2 version 1.6, and don't really have the option to change this since I am running this on a supercomputer.

If I remove the MPI calls (MPI_Init, etc.), the error goes away, which leads me to believe that something unexpected is happening between MPI and this global variable.

I found some other examples of this happening to people online, but they all resolved their issues by installing a new version of MPICH, which again is not a possibility for me.

Moreover, I want to know WHY this is happening, more than just a way around it.

Thanks for your time.


Solution

Ok, after quite a bit of debugging I have found that the MVAPICH2-1.6 library defines a variable called hostname in:

mpid/ch3/channels/mrail/src/rdma/ch3_shmem_coll.c

Here is the line (55 in this version of the file):

char hostname[SHMEM_COLL_HOSTNAME_LEN];

The compiler didn't complain about the name clash here, but this is almost certainly the culprit since changing the variable name in my program removed the error. I imagine this is changed in later versions of MVAPICH2, but I will file the bug if not.
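For anyone who wants to keep the name rather than rename the variable: as far as I can tell, the clash is possible because a variable at global namespace scope keeps its plain, unmangled name as its linker symbol under the C++ ABI that GCC uses, so the program's hostname and the library's hostname can end up being treated as the same symbol. A minimal sketch of one way around it, assuming the goal is only to stop the symbol from being visible outside the translation unit, is to put the variable in an unnamed namespace so it gets internal linkage. MPI is left out so the snippet compiles on its own, and set_hostname is a hypothetical helper, not part of the original program.

```cpp
#include <cassert>
#include <string>

// Unnamed namespace: `hostname` gets internal linkage, so it is not
// exported under the plain symbol name `hostname` and cannot be matched
// against a variable of the same name defined inside another library.
namespace {
std::string hostname;
}  // namespace

// Hypothetical helper (not from the original program): stores a
// NUL-terminated name in the file-local string and returns a reference to it.
const std::string& set_hostname(const char* name) {
  assert(name != nullptr);
  hostname.assign(name);
  return hostname;
}
```

In the original program this would amount to wrapping the `string hostname;` definition under `#ifdef GLOBAL` in `namespace { ... }`. Declaring the variable in a named namespace should also avoid the clash, since the symbol name then gets mangled, but I have only reasoned about this, not tested it against MVAPICH2-1.6.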

Licensed under: CC-BY-SA with attribution