Add some new proc/job states
Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed.
Add pretty-print of signals for "proc aborted due to signal" reports.
This commit was SVN r24378.
The following SVN revision numbers were found above:
r24371 --> open-mpi/ompi@93d28a5792
This means that the converters (opal_err2str, orte_err2str) can now
return NULL as a "silent error". The return value of opal_err2str_fn_t
is the status of the operation (OPAL_SUCCESS or OPAL_ERROR).
This fixes the "Unknown error" message issues on the trunk.
This commit was SVN r24371.
Temporarily remove hwloc's internal version of myriexpress.h. It is
causing a problem when compiling Open MPI with MX support because
hwloc uses AC_CONFIG_HEADER in hwloc's hwloc.m4 to generate
opal/mca/paffinity/hwloc/hwloc/include/hwloc/config.h.
AC_CONFIG_HEADER apparently has the (undocumented) side effect of
adding -I$(top_builddir)/opal/mca/paffinity/hwloc/hwloc/include/hwloc
to OMPI's compilation flags. Hence, when the OMPI MX components are
compiled and #include "myriexpress.h" (or <myriexpress.h>) they see
hwloc's myriexpress.h before the system one. Badness ensures.
This removal is temporary because we need to figure out a better
solution. But for now, OMPI is not using hwloc's myriexpress.h file --
so it's safe to remove. I'll push this issue upstream to hwloc to
figure out a better solution...
This commit was SVN r24354.
The following Trac tickets were found above:
Ticket 2690 --> https://svn.open-mpi.org/trac/ompi/ticket/2690
Short Version:
--------------
Event engine needs to be flushed so it does not use old/stale file descriptors.
Long Version:
-------------
The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive.
After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote in to the new fd the event engine was stealing the first byte (*shakes fist at event engine*) before the recv() could be posted.
The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem.
A few other minor things:
-------------------------
* Add a check to make sure the event engine is balanced in its init/finalize
* Add the opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is).
This commit was SVN r24296.
last December. :-(
Add new MCA param: maffinity_libnuma_policy. Thanks to David
Singleton for the suggestion. Here's the help text about it:
{{{
MCA maffinity: parameter "maffinity_libnuma_policy" (current value:
<loose>, data source: default value)
Binding policy that determines what happens if memory
is unavailable on the local NUMA node. A value of
"strict" means that the memory allocation will fail;
a value of "loose" means that the memory allocation
will spill over to another NUMA node.
}}}
This commit was SVN r24290.
either direct link to these basic predefined types, or a combination of them.
Anyway, the first items in the datatype list belong to OPAL, the second round
are MPI datatypes created by composing basic OPAL datatypes, and the last
batch are mapped datatype (direct correspondance between an OMPI datatype and
an OPAL one such as int -> int32_t).
Modify the op to fit this new scheme.
This commit was SVN r24247.
the module to use the new hwloc bitmap API (the cpuset API is both
klunkier and deprecated), which simplified a few things.
This commit was SVN r24217.
{{{
base/paffinity_base_service.c: In function ‘opal_paffinity_base_cset2mapstr’:
base/paffinity_base_service.c:623: warning: unused variable ‘range_last’
base/paffinity_base_service.c:623: warning: unused variable ‘range_first’
base/paffinity_base_service.c:622: warning: unused variable ‘count’
base/paffinity_base_service.c:622: warning: unused variable ‘m’
}}}
{{{
connect/btl_openib_connect_oob.c: In function ‘init_ud_qp’:
connect/btl_openib_connect_oob.c:1111: warning: control reaches end of non-void function
connect/btl_openib_connect_oob.c: In function ‘init_device’:
connect/btl_openib_connect_oob.c:1235: warning: unused variable ‘i’
connect/btl_openib_connect_oob.c: In function ‘get_pathrecord_sl’:
connect/btl_openib_connect_oob.c:1323: warning: unused variable ‘i’
}}}
This commit was SVN r24196.
It is statically initialized to the real back-end OPAL show_help
function. During orte_show_help_init(), the variable is re-assigned
with the value of the back-end ORTE show_help function (the one that
does error message aggregation).
Therefore, anything that calls opal_show_help() after a certain point
in orte_init() will have their show_help messages be aggregated.
w00t! Even code down in OPAL -- that has no knowledge of ORTE -- will
have their messages aggregated. '''Double w00t!'''
During orte_show_help_finalize(), we restore the original pointer
value so that it something calls opal_show_help() after
orte_finalize(), it'll still work properly (but it won't be
aggregated).
This commit was SVN r24185.
1. Remove it from libevent207.h because it is not needed.
2. Add compat to the include list so it can use queue.h when needed.
This commit was SVN r24144.