Question

I'm trying to find some memory errors in a program of mine using Electric Fence. My program uses Open MPI, and when I try to run it, it segfaults with the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
__memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2001
2001    ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: No such file or directory.
(gdb) bt
#0  __memcpy_ssse3_back ()
    at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2001
#1  0x00007ffff72d6b7f in ompi_ddt_copy_content_same_ddt ()
   from /usr/lib/libmpi.so.0
#2  0x00007ffff72d4d0d in ompi_ddt_sndrcv () from /usr/lib/libmpi.so.0
#3  0x00007ffff72dd5b3 in PMPI_Allgather () from /usr/lib/libmpi.so.0
#4  0x00000000004394f1 in ppl::gvec<unsigned int>::gvec (this=0x7fffffffdd60, 
    length=1) at qppl/gvec.h:32
#5  0x0000000000434a35 in TreeBuilder::TreeBuilder (this=0x7fffffffdc60, 
    octree=..., mygbodytab=..., mylbodytab=..., cellpool=0x7fffef705fc8, 
    leafpool=0x7fffef707fc8, bodypool=0x7fffef6bdfc0) at treebuild.cxx:93
#6  0x000000000042fb6b in BarnesHut::BuildOctree (this=0x7fffffffde50)
    at barnes.cxx:155
#7  0x000000000042af52 in BarnesHut::Run (this=0x7fffffffde50)
    at barnes.cxx:386
#8  0x000000000042b164 in main (argc=1, argv=0x7fffffffe118) at barnes.cxx:435

The relevant portion of my code is:

   me = spr_locale_id();
   world_size = spr_num_locales();
   my_elements = std::shared_ptr<T>(new T[1]);

   world_element_pointers = std::shared_ptr<T*>(new T*[world_size]);

   MPI_Allgather(my_elements.get(), sizeof(T*), MPI_BYTE,
       world_element_pointers.get(), sizeof(T*), MPI_BYTE,
       MPI_COMM_WORLD);

I'm not sure why __memcpy_ssse3_back is causing a segfault. This part of the program doesn't segfault when I run it without Electric Fence. Does anyone know what's going on? I'm using Open MPI version 1.4.3.


Solution

There are two possible reasons for the error:

There is a bug in the data-copy routines of older Open MPI versions that appears to have been fixed in version 1.4.4. If that is the case, upgrading the Open MPI library to a newer version would solve the problem.

Another possible reason is that my_elements points to an array holding a single element of type T, yet in the MPI_Allgather call you specify sizeof(T*) as the number of bytes to send from it. By default, Electric Fence places each new allocation at the end of a memory page and maps an inaccessible page immediately after it. If T happens to be smaller than a pointer (e.g. T is int and you are running on a 64-bit LP64 platform), the send reads past the allocation into the inaccessible page, hence the segfault. Since your intention is to gather the pointers themselves, you should instead pass MPI_Allgather the address of a variable holding the value returned by my_elements.get().
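A minimal sketch of that fix, using the buffers from your snippet; the variable name local_ptr is only illustrative:

   // Gather one T* from every rank. The send buffer is the address of
   // the pointer itself, so exactly sizeof(T*) bytes are read from it,
   // rather than sizeof(T*) bytes out of a one-element T array.
   T *local_ptr = my_elements.get();

   MPI_Allgather(&local_ptr, sizeof(T*), MPI_BYTE,
       world_element_pointers.get(), sizeof(T*), MPI_BYTE,
       MPI_COMM_WORLD);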

By the way, passing raw pointers around like this is not a nice thing to do. MPI provides its own portable RDMA mechanism: see the One-sided Communications chapter of the MPI standard. It is a bit cumbersome, but it is at least portable.
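For illustration, here is a minimal, self-contained sketch of that one-sided approach using the MPI-2 window calls available in Open MPI 1.4.x; the element count NELEMS, the MPI_INT element type, and reading from rank 0 are assumptions made purely for the example:

   #include <mpi.h>

   int main(int argc, char **argv)
   {
       MPI_Init(&argc, &argv);

       enum { NELEMS = 1 };            // illustrative element count
       int local[NELEMS]  = { 42 };    // data this rank exposes
       int remote[NELEMS] = { 0 };     // buffer for data fetched from rank 0

       // Expose the local array in a window instead of publishing its
       // address to the other ranks.
       MPI_Win win;
       MPI_Win_create(local, NELEMS * sizeof(int), sizeof(int),
           MPI_INFO_NULL, MPI_COMM_WORLD, &win);

       MPI_Win_fence(0, win);              // open an access epoch
       MPI_Get(remote, NELEMS, MPI_INT,    // read NELEMS ints from rank 0,
           0, 0, NELEMS, MPI_INT, win);    // starting at displacement 0
       MPI_Win_fence(0, win);              // close the epoch; 'remote' is valid

       MPI_Win_free(&win);
       MPI_Finalize();
       return 0;
   }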
