openmpi/ompi
Ralph Castain ab4f8585b0 When we abort during MPI_Init, we currently emit a totally incorrect error message stating that we were unable to aggregate error messages and cannot guarantee all other processes were killed. This simply isn't true IF the rte has been initialized.
So track that the rte has reached that point, and only emit the new message if it is accurate.
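
A minimal sketch of the idea described above, not the actual change in this commit: a flag records that the rte has been initialized, and the abort path consults it before warning. The names sketch_rte_initialized, sketch_mark_rte_initialized, and sketch_abort_warning, and the warning text itself, are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag; not the real Open MPI variable name. */
static bool sketch_rte_initialized = false;

/* Call once the runtime environment (rte) has finished initializing. */
static void sketch_mark_rte_initialized(void)
{
    sketch_rte_initialized = true;
}

/*
 * Returns the extra warning to print when aborting, or NULL when the
 * warning would be inaccurate because the rte is already up.
 * The message text is illustrative, not the wording Open MPI uses.
 */
static const char *sketch_abort_warning(void)
{
    if (sketch_rte_initialized) {
        return NULL;
    }
    return "Unable to aggregate error messages; cannot guarantee "
           "that all other processes were killed.";
}
```

Under these assumptions, an abort before sketch_mark_rte_initialized() is called would print the warning, while an abort after it would stay silent about aggregation.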

Note that we still generate a TON of output for a minor error:

Ralphs-iMac:examples rhc$ mpirun -n 3 -mca btl sm ./hello_c
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[50239,1],2]) is on host: Ralphs-iMac
  Process 2 ([[50239,1],2]) is on host: Ralphs-iMac
  BTLs attempted: sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[50239,1],2]
  Exit code:    1
--------------------------------------------------------------------------
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[Ralphs-iMac.local:23227] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
Ralphs-iMac:examples rhc$ 

Hopefully, we can agree on a way to reduce this verbiage!

This commit was SVN r31686.

The following SVN revision numbers were found above:
  r2 --> open-mpi/ompi@58fdc18855
2014-05-08 15:48:16 +00:00
attribute Return non-SUCCESS error codes from attribute copy functions. 2014-03-19 15:45:38 +00:00
class call element destructors in ompi_free_list_destruct 2013-08-29 22:56:31 +00:00
communicator As per the RFC: 2014-04-29 21:49:23 +00:00
contrib Per RFC, continue pecking away at the build system renaming 2014-05-06 16:27:38 +00:00
datatype Roll back r31519 and r31521: George convinced us that these approaches 2014-04-24 20:27:03 +00:00
debuggers The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees. 2014-05-08 02:01:35 +00:00
errhandler osc: add missing MPI_ERR_RMA_SHARED error code and internal equivalent 2014-03-27 20:06:43 +00:00
etc Use MKDIR_P instead of mkdir_p in Makefiles, as MKDIR_P is the only one 2012-06-21 16:52:37 +00:00
file == Highlights == 2012-04-18 15:57:29 +00:00
group MPI-3.0: update C bindings with const and consistent use of [] for 2013-09-26 21:56:20 +00:00
include ompi_config.h: no need to AC_CONFIG_HEADER this file 2014-04-15 15:38:49 +00:00
info Fix an MPI-3 compliance buglet - MPI_Info_set should now use "const" qualifiers on the key and value parameters 2014-03-11 23:50:09 +00:00
mca Couple of minor fixes 2014-05-08 02:26:45 +00:00
message Remove a bunch of unused variables. 2013-03-26 14:34:29 +00:00
mpi Assume we always have fortran PROCEDURE support 2014-05-01 18:18:38 +00:00
mpiext The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature: 2014-05-07 21:48:53 +00:00
op Deprecated comment. 2014-04-21 23:30:05 +00:00
patterns coll/ml: fix leaks 2014-03-27 23:25:31 +00:00
peruse - Sanity check initialization and finalization of PERUSE. 2010-01-12 16:36:24 +00:00
proc Add some further debug to the dstore framework. When doing comm_spawn, we have to exchange any provided cpu bitmaps to ensure both sides compute the same locality, else various mpi frameworks can go bonkers. 2014-04-30 19:29:00 +00:00
request Add support for MPI_Comm_dup_with_info, MPI_Comm_create_group, and 2013-10-02 14:26:40 +00:00
runtime When we abort during MPI_Init, we currently emit a totally incorrect error message stating that we were unable to aggregate error messages and cannot guarantee all other processes were killed. This simply isn't true IF the rte has been initialized. 2014-05-08 15:48:16 +00:00
tools Per RFC: OMPI_INSTALL_BINARIES -> OPAL_INSTALL_BINARIES 2014-05-05 21:43:05 +00:00
win Merge one-sided updates to the trunk - written by Brian Barrett and Nathan Hjelmn 2014-02-25 17:36:43 +00:00
Makefile.am The bulk of the remaining renaming changes, in one final glorious "blob". Thanks to Jeff for some help chasing down a few spots. Per chat with Jeff, we decided to cleanup a few things that were historical in nature: 2014-05-07 21:48:53 +00:00