Also, update some properties (source files should not be executable...), and remove a couple of unneeded inclusions of orte_proc_table.h
This commit was SVN r17655.
Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occurs outside of a recv.
Created two new macros to assist:
ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object
ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd.
Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect the revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output.
This has been tested on Mac, TM, and SLURM.
This commit was SVN r17647.
about linkers, have all OPAL, ORTE, and OMPI components '''not''' link
against the OPAL, ORTE, or OMPI libraries.
See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for
details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a
better-formatted version of the same info).
This commit was SVN r16968.
This commit also cleans up the checkpoint and terminate case making it more
precise than before. Previously the application could make a small amount of
progress between checkpoint completion and application termination. Now the
application will make no progress at all in this time span.
Additional minor change:
- Start using OPAL_INT_TO_BOOL instead of if/else logic
This commit was SVN r16952.
methods (in order of precedence):
1. #pragma ident <ident string> (e.g., Intel and Sun)
2. #ident <ident string> (e.g., GCC)
3. static const char ident[] = <ident string> (all others)
By default, the ident string used is the standard Open MPI version string. Only
the following libraries will get the embedded version strings (e.g., DSOs will
not):
* libmpi.so
* libmpi_cxx.so
* libmpi_f77.so
* libopen-pal.so
* libopen-rte.so
* Added two new configure options:
* `--with-package-name="STRING"` (defaults to "Open MPI username@hostname
Distribution"). `STRING` is displayed by `ompi_info` next to the "Package"
heading.
* `--with-ident-string="STRING"` (defaults to the standard Open MPI version
string - e.g., X.Y.Zr######). `%VERSION%` will expand to the Open MPI
version string if it is supplied to this configure option.
This commit was SVN r16644.
has its own range, which is defined by a min value and a range. By default
there is no limitation on the port range, which is exactly the same
behavior as before.
This commit was SVN r16584.
If we cannot resolve the route to the peer that we're trying to send
to, don't queue up the message in the TCP OOB -- instead, return it to
the upper layer (e.g., the RML) and let it decide what to do.
In the case of the routed RML, the tree component will queue it up for
later transmission. Hence, we don't want the message queued up both
here in the TCP OOB and in the tree routed component. Also see some more
discussion / explanation in #1171.
This commit was SVN r16540.
The following SVN revision numbers were found above:
r16513 --> open-mpi/ompi@7ae9589d70
The following Trac tickets were found above:
Ticket 1170 --> https://svn.open-mpi.org/trac/ompi/ticket/1170
Note that this means ALL procs in the parent job are updated, even though they may not be participating in the comm_spawn. This doesn't really hurt anything - just unnecessary.
Comm_spawn still has a problem when a child process shares a node with a parent, so this doesn't fix everything. It only fixes the bug of ensuring all procs know how to talk to each other.
This commit was SVN r16460.
This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed.
In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean.
I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. In the meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it.
This commit was SVN r16451.
variable is not defined. Make sure to set it to something reasonable
so that file preloading still works (instead of seg faulting :)
Thanks to Hiep Bui Hoang for reporting this bug.
This commit was SVN r16433.