In short applications, it's possible that the agent (i.e., local rank
0) will finalize after the other local procs (i.e., those that are not
local rank 0) detect the connectivity checker named socket, but before
they complete a connect() on it. As such, their connect() gets
ECONNREFUSED.
This commit adds a simple counter in the agent that won't let it quit
until it has accept()'ed from all local procs or 10 seconds have gone
by (whichever occurs first). This is similar to the timeout for the
clients: they'll exit if they don't see the expected named socket
within 10 seconds.
There's no longer any need for the usnic BTL to have its own progress
thread: it can use the opal_progress_thread() infrastructure. This
commit removes the code to start up and shut down the
usnic-BTL-specific progress thread and instead just adds its events
to the OPAL-wide progress thread.
This necessitated a small change in the finalization step.
Previously, we would stop the progress thread and then tear down the
events. We can no longer stop the progress thread, and if we start
tearing down events, this will cause shutdown/hangups to be sent
across sockets, potentially firing some of the still-remaining events
while some (but not all) of the data structures have been torn down.
Chaos ensues.
Instead, queue up an event to tear down all the pending events. Since
the progress thread will only fire one event at a time, having a
teardown event means that it can tear down all the pending events
"atomically" and not have to worry that one of those events will get
fired in the middle of the teardown process.
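
The idea can be illustrated with a toy, single-threaded event loop.
This is only a sketch of the technique: it does not use libevent or
the OPAL event API, and every name in it (toy_base_t, toy_add(),
toy_teardown_cb(), and so on) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_EVENTS 16

typedef void (*toy_event_cb_t)(void *arg);

typedef struct {
    toy_event_cb_t cb;
    void *arg;
    bool active;
} toy_event_t;

typedef struct {
    toy_event_t events[TOY_MAX_EVENTS];
    size_t count;
} toy_base_t;

/* Register an event on the base; returns 0 on success, -1 if full. */
static int toy_add(toy_base_t *base, toy_event_cb_t cb, void *arg)
{
    if (base->count >= TOY_MAX_EVENTS) {
        return -1;
    }
    base->events[base->count++] = (toy_event_t){ cb, arg, true };
    return 0;
}

/* Example callback: counts how many times it has fired. */
static void toy_count_cb(void *arg)
{
    ++*(int *) arg;
}

/* Teardown callback: deactivates every event on the base (itself
 * included).  Because the loop fires one callback at a time, nothing
 * else can fire while this loop is running. */
static void toy_teardown_cb(void *arg)
{
    toy_base_t *base = arg;
    for (size_t i = 0; i < base->count; ++i) {
        base->events[i].active = false;
    }
}

/* One pass of the "progress thread": fire each active event in turn.
 * Returns the number of callbacks that were fired. */
static int toy_run_once(toy_base_t *base)
{
    int fired = 0;
    for (size_t i = 0; i < base->count; ++i) {
        if (base->events[i].active) {
            base->events[i].cb(base->events[i].arg);
            ++fired;
        }
    }
    return fired;
}
```

In this model, finalization calls toy_add(&base, toy_teardown_cb,
&base) instead of deactivating events directly from another thread;
any event that would have fired after the teardown callback is
already inactive by the time the loop reaches it.
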
Added Cloneable to the implemented interface list as per
Coverity suggestion. The required methods were already
implemented, but the interface was not explicitly declared.
This is an intent-revealing change.
Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
There are now four functions and one global constant:
* opal_progress_thread_name: the name of the OPAL-wide async progress
thread. If you have general purpose events that you need to run in
*a* progress thread, but not a *dedicated* progress thread, use this
name in the functions below to glom your events on to the general
OPAL-wide async progress thread.
* opal_progress_thread_init(): return an event base corresponding to a
progress thread of the specified name (a progress thread will be
created for that name if it does not already exist).
* opal_progress_thread_finalize(): decrement the refcount on the
passed progress thread name. If the refcount reaches 0, stop the
thread and destroy the event base.
* opal_progress_thread_pause(): stop processing events on the event
base corresponding to the progress thread name, but do not destroy
the event base.
* opal_progress_thread_resume(): resume processing events on the event
base corresponding to a previously-paused progress thread name.
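
The refcounting semantics described in the list above can be modeled
with a small name-keyed registry. This is a toy sketch, not the OPAL
implementation: the toy_* names, the fixed-size table, and the return
values are all assumptions, and no thread or event base is actually
created.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_THREADS 8

typedef struct {
    char name[32];
    int refcount;               /* 0 means this slot is free */
    int paused;
} toy_tracker_t;

static toy_tracker_t toy_trackers[TOY_MAX_THREADS];

/* Look up a live tracker by name; returns NULL if there is none. */
static toy_tracker_t *toy_find(const char *name)
{
    for (size_t i = 0; i < TOY_MAX_THREADS; ++i) {
        if (toy_trackers[i].refcount > 0 &&
            0 == strcmp(name, toy_trackers[i].name)) {
            return &toy_trackers[i];
        }
    }
    return NULL;
}

/* Return the tracker for `name`, creating it if it does not exist. */
static toy_tracker_t *toy_progress_thread_init(const char *name)
{
    toy_tracker_t *t = toy_find(name);
    if (NULL != t) {
        ++t->refcount;          /* share the existing "thread" */
        return t;
    }
    if (strlen(name) >= sizeof(t->name)) {
        return NULL;            /* name does not fit in the table */
    }
    for (size_t i = 0; i < TOY_MAX_THREADS; ++i) {
        if (0 == toy_trackers[i].refcount) {
            t = &toy_trackers[i];
            strcpy(t->name, name);
            t->refcount = 1;    /* real code would start a thread here */
            t->paused = 0;
            return t;
        }
    }
    return NULL;                /* table is full */
}

/* Drop one reference; at zero the slot becomes free (real code would
 * stop the thread and destroy the event base at that point). */
static int toy_progress_thread_finalize(const char *name)
{
    toy_tracker_t *t = toy_find(name);
    if (NULL == t) {
        return -1;
    }
    --t->refcount;
    return 0;
}

/* Stop processing events without destroying anything. */
static int toy_progress_thread_pause(const char *name)
{
    toy_tracker_t *t = toy_find(name);
    if (NULL == t) {
        return -1;
    }
    t->paused = 1;
    return 0;
}

/* Resume processing events on a previously paused tracker. */
static int toy_progress_thread_resume(const char *name)
{
    toy_tracker_t *t = toy_find(name);
    if (NULL == t) {
        return -1;
    }
    t->paused = 0;
    return 0;
}
```

Two components that both pass the same name to
toy_progress_thread_init() share one tracker, and it only goes away
after both have called toy_progress_thread_finalize().
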
Ensure that pointers are non-NULL at every level of indirection,
which will save us if there are errors that cause us to exit very
early during component / module initialization.
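
As a minimal sketch of this kind of check (the struct layout and the
toy_* names below are hypothetical and are not taken from the actual
component):

```c
#include <stddef.h>

typedef struct { int fd; } toy_channel_t;
typedef struct { toy_channel_t *channel; } toy_endpoint_t;
typedef struct { toy_endpoint_t *endpoint; } toy_module_t;

/* Returns the channel's fd, or -1 if any pointer in the chain is
 * NULL (e.g., because initialization failed part of the way in). */
static int toy_module_fd(const toy_module_t *module)
{
    if (NULL == module ||
        NULL == module->endpoint ||
        NULL == module->endpoint->channel) {
        return -1;
    }
    return module->endpoint->channel->fd;
}
```
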
Per open-mpi/ompi#751, make sure to patch up all Autotools output (not
just in some cases).
Also, adjust the patching process to only write our verbose statements
and a new configure script if the content actually changed as a result
of the patching.
When configured with --enable-picky, topo_base_lazy_init.c compiles
with a warning:
  CC base/topo_base_lazy_init.lo
  base/topo_base_lazy_init.c:46:67: warning: implicit conversion from enumeration type 'enum mca_base_register_flag_t' to different enumeration type 'mca_base_open_flag_t' (aka 'enum mca_base_open_flag_t') [-Wenum-conversion]
  err = mca_base_framework_open (&ompi_topo_base_framework, MCA_BASE_REGISTER_DEFAULT);
This commit fixes this implicit conversion problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
In order to address issue #741, the orteds are now
always launched with the Cray PMI environment variables
  PMI_NO_FORK
  PMI_NO_PREINITIALIZE
set to disable running of the library's ctor.
So there's no longer a need to set these for the
application(s) being launched by the orteds.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>