Clean up the logic in the odls for when processes terminate. It turns out that we were only going through the kill_proc logic once instead of looping over all local children when we ordered a daemon to kill its local procs. This went unnoticed for some time because on most systems the local procs were terminated anyway when the daemon terminated, due to the parent/child relationship.
Solaris is apparently different - the children are not automatically terminated when the parent dies. As a result, it acts as a detector for this bug.
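For illustration only, the corrected behavior can be sketched in Python rather than the actual odls C code. The names kill_local_procs and kill_fn are hypothetical; the point is that every local child receives the signal instead of only the first one.

```python
import os
import signal


def kill_local_procs(child_pids, kill_fn=os.kill, sig=signal.SIGTERM):
    """Signal every local child; return the pids that could not be signalled.

    Hypothetical sketch: kill_fn is injectable so the loop can be exercised
    without real processes.
    """
    failed = []
    for pid in child_pids:  # loop over ALL children, not just the first
        try:
            kill_fn(pid, sig)
        except ProcessLookupError:
            continue  # child already exited; nothing left to kill
        except OSError:
            failed.append(pid)  # remember the failure and keep going
    return failed
```

Collecting failures instead of returning on the first error keeps one stubborn child from shielding the rest.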
Mucho thanks to Rolf V. for his help in debugging - and to IM for letting me follow his gdb progress in quasi real-time!
This commit was SVN r19044.
lsb_launch tampers with the SIGCHLD signal handler, so we are forced to reinstall our own signal handler after each call to this function.
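A minimal Python sketch of the reinstall pattern, assuming a Unix system and the main thread. hypothetical_launch stands in for a library call that replaces the handler; it is not the real lsb_launch, and on_sigchld is a placeholder.

```python
import signal


def on_sigchld(signum, frame):
    """Placeholder handler; a real one would reap terminated children."""


def hypothetical_launch():
    # Stand-in for a library call that silently replaces the SIGCHLD handler.
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    return 0


def launch_and_restore(launch, handler):
    """Run launch(), then reinstall our own SIGCHLD handler no matter what."""
    try:
        return launch()
    finally:
        signal.signal(signal.SIGCHLD, handler)
```

Reinstalling in a finally clause means the handler comes back even when the launch call fails.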
This commit fixes trac:1356.
This commit was SVN r19033.
The following Trac tickets were found above:
Ticket 1356 --> https://svn.open-mpi.org/trac/ompi/ticket/1356
Fix a few bugs in the mappers:
1. Ensure that bynode with no -np fills all available slots - it just does so with the ranks set bynode instead of byslot
2. Fix --nolocal behavior so it works correctly in all cases. We still have to test the host's name using opal_ifislocal in the mapper because the name returned by gethostname to orte_process_info.hostname can be an FQDN, while a hostfile may contain a non-FQDN version.
3. Add missing --nolocal logic to the seq mapper
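The difference between byslot and bynode rank assignment in item 1 can be sketched as follows. This is an illustrative Python model, not the rmaps code: map_ranks and its arguments are hypothetical, and oversubscription is deliberately left out.

```python
def map_ranks(slots_per_node, np=None, bynode=False):
    """Assign ranks to nodes; with np=None, fill every available slot.

    slots_per_node: dict of node name -> slot count (hypothetical input).
    """
    total = sum(slots_per_node.values())
    if np is None:
        np = total  # no -np given: use all available slots
    if np > total:
        raise ValueError("oversubscription is not modelled in this sketch")

    placement = {node: [] for node in slots_per_node}
    rank = 0
    if not bynode:
        # byslot: fill one node's slots before moving to the next node
        for node, slots in slots_per_node.items():
            for _ in range(slots):
                if rank == np:
                    return placement
                placement[node].append(rank)
                rank += 1
        return placement

    # bynode: deal ranks round-robin across nodes, skipping full nodes
    while rank < np:
        for node, slots in slots_per_node.items():
            if rank == np:
                break
            if len(placement[node]) < slots:
                placement[node].append(rank)
                rank += 1
    return placement
```

Both policies place the same number of procs on each node when np is omitted; only the rank numbering differs.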
Oversubscribed mapping seemed to be working okay without repair, so I couldn't verify my own bug report in that regard.
Also included are some preliminary changes to support the modified hostfile behavior, which will be committed shortly:
1. Remove the totally useless "allocate" field in the orte_node_t object, since every node is automatically allocated for use and everything ignored the field anyway
2. Correctly initialize the slots_alloc field when the allocation is read
This commit was SVN r19030.
Modify the odls to remove a (size_t) typecast in front of the num_processors variable just in case it is returned negative. This usually is accompanied by an opal_error, so this shouldn't make any difference - but it is more technically correct.
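Why the cast matters can be shown with Python's ctypes, used here only to mimic C's unsigned conversion: a negative value reinterpreted as size_t becomes a huge positive count. The helper name as_size_t is hypothetical.

```python
import ctypes


def as_size_t(value):
    """Mimic a C (size_t) cast: ctypes wraps out-of-range values silently."""
    return ctypes.c_size_t(value).value


# Largest value representable by size_t on this platform.
SIZE_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_size_t)) - 1
```

With the cast in place, an error return of -1 would look like SIZE_MAX processors instead of a negative number that a simple sign check can catch.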
This commit was SVN r19008.
Fixed allocation of all ranks when a RANKFILE is used but does not assign every rank
Abort a little earlier when a RANKFILE is used but np wasn't specified
Clean mca_rmaps_rank_file_component.debug
This commit was SVN r19004.
--debug flag to help developers figure out possible future issues.
This fixes trac:1335.
This commit was SVN r18979.
The following Trac tickets were found above:
Ticket 1335 --> https://svn.open-mpi.org/trac/ompi/ticket/1335
1. add a new API delete_route(orte_process_name_t*) to delete the specified proc from the routing table
2. modify update_route so that it actually updates pre-existing routes instead of only adding routing info to the end of the hash table
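The intended semantics of the two calls can be sketched with a small Python table. This is an assumption-laden model, not the routed framework: RouteTable, get_route, and the (jobid, vpid) tuple used as a proc name are all hypothetical.

```python
class RouteTable:
    """Toy routing table keyed by a (jobid, vpid) tuple (hypothetical key)."""

    def __init__(self):
        self._routes = {}

    def update_route(self, target, next_hop):
        """Add a route, or overwrite the existing one for the same target.

        Returns True if a pre-existing route was replaced.
        """
        existed = target in self._routes
        self._routes[target] = next_hop
        return existed

    def delete_route(self, target):
        """Remove the route for target; return False if there was none."""
        return self._routes.pop(target, None) is not None

    def get_route(self, target):
        return self._routes.get(target)
```

Keying on the target guarantees at most one entry per proc, so a lookup can never return a stale route left behind by an earlier insert.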
This fixes ticket #1403
This commit was SVN r18970.
can have a pub_endpoint and a sub_endpoint that are not equal but go
to the same place (fd). I didn't think that that was possible. :-\
So just use a bool to track whether we have forwarded the fragment at
all; if we have, then don't forward to the sub_endpoint.
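The flag-based fix can be sketched like this in Python. The function and its parameters are hypothetical stand-ins for the IOF internals; send is an injected callable so the behavior is easy to check.

```python
def forward_fragment(fragment, pub_endpoints, sub_endpoint, send):
    """Forward a fragment, but never to the sub endpoint if already sent.

    Hypothetical sketch: the endpoints may be distinct objects that write to
    the same fd, so comparing them for equality is not enough.
    """
    forwarded = False
    for endpoint in pub_endpoints:
        send(endpoint, fragment)
        forwarded = True
    if not forwarded and sub_endpoint is not None:
        send(sub_endpoint, fragment)
        forwarded = True
    return forwarded
```

Tracking "was it forwarded at all" sidesteps the question of whether two endpoints are really the same destination.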
IOF is going to be re-written for v1.4.
This commit was SVN r18950.
The following SVN revision numbers were found above:
r18873 --> open-mpi/ompi@773c92a6eb
Short version: remove opal_paffinity_alone and restore
mpi_paffinity_alone. ORTE makes various information available for the
MPI layer to decide what it wants to do in terms of processor
affinity.
Details:
* remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone
MCA param
* move opal_paffinity_slot_list param registration to paffinity base
* ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that
succeeds use that. If no slot list was set, see if
mpi_paffinity_alone was set. If so, bind this process to its Node
Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you
COMM_SPAWN to a host in this ORTE universe that already has procs
on it, the NLR for the new job will start at N (not 0). So this is
slightly better than mpi_paffinity_alone in the v1.2 series.
* If a slot list is specified *and* mpi_paffinity_alone is set, we
display an error and abort.
* Remove calls from rmaps/rank_file component to register and lookup
opal_paffinity mca params.
* Remove code in orte/odls that set affinities - instead, have them
just pass a slot_list if it exists.
* Clean up the orte/odls code that determined
oversubscribed/want_processor as these were just opposites of each
other.
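The decision order described for ompi_mpi_init() can be summarized in a short Python sketch. choose_binding and its return values are hypothetical; only the precedence (slot list first, then mpi_paffinity_alone, error if both) comes from the notes above.

```python
def choose_binding(slot_list, paffinity_alone, node_local_rank):
    """Return a (kind, value) binding decision, or None for no binding.

    Hypothetical model of the precedence rules; not the real API.
    """
    if slot_list and paffinity_alone:
        # Both requests at once are contradictory: report an error and abort.
        raise ValueError("slot list and mpi_paffinity_alone are both set")
    if slot_list:
        return ("slot_list", slot_list)
    if paffinity_alone:
        # Bind to the Node Local Rank, which keeps counting across jobs.
        return ("processor", node_local_rank)
    return None
```

Because the NLR keeps counting across jobs on a host, a spawned job whose NLR starts at N binds to processors the existing procs are not using.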
This commit was SVN r18874.
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
Short version: when the HNP launches VPID 0 on the same node as
itself, the STDIN IOF endpoint will have both a pub and a sub on it.
We need to ensure to only forward incoming messages ''once'' (not
twice, as was happening). A lengthy comment in the code explains in
more detail.
This commit was SVN r18873.
The following Trac tickets were found above:
Ticket 1135 --> https://svn.open-mpi.org/trac/ompi/ticket/1135
Add comments to both orterun and orted code explaining why we take a snapshot of the local environment and apply it to the local procs when they are spawned.
This commit was SVN r18842.
Actually, the problem was that we were simply -adding- any enviro MCA params to whatever had been found on the cmd line. Thus, MCA param directives given in both places wound up duplicated in the environment. Some shells took the first one in the environ array - others took the last! So we could get completely different behavior based on the whims of the shell.
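A minimal Python sketch of a merge that leaves exactly one entry per param name. merge_mca_params is a hypothetical name, and letting the cmd line value win is an assumption made for this example, not something stated in the commit.

```python
def merge_mca_params(cmdline_params, environ_params):
    """Merge two name -> value dicts into a duplicate-free environ list.

    Assumption for this sketch: a cmd line value overrides the environment.
    """
    merged = dict(environ_params)
    merged.update(cmdline_params)  # same name: the cmd line value replaces it
    return [f"{name}={value}" for name, value in merged.items()]
```

With a single entry per name, it no longer matters whether a shell reads the first or the last match in the environ array.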
This commit fixes trac:1373
This commit was SVN r18836.
The following Trac tickets were found above:
Ticket 1373 --> https://svn.open-mpi.org/trac/ompi/ticket/1373
1. repair of the linear and direct routed modules
2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and to properly handle a failure to parse the provided ompi-server URI
3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature
4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.
Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB
5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.
Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.
6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch
7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch
8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies
9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.
10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon
There is still more cleanup to be done for efficiency, but this at least works.
Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.
Fixes ticket #1256
This commit was SVN r18804.
- Change the arguments for launch failed function according to changeset r18611.
This commit was SVN r18795.
The following SVN revision numbers were found above:
r18611 --> open-mpi/ompi@7bee71aa59