Cleanup the logic in the odls for when processes terminate. It turns out that we were only going through the kill_proc logic once instead of looping over all local children when we ordered a daemon to kill its local procs. This went unnoticed for some time as for most systems the local procs were terminated anyway when the daemon terminated due to the parent/child relationship.
Solaris is apparently different - the children are not automatically terminated when the parent dies. As a result, it acts as a detector for this bug.
Mucho thanks to Rolf V. for his help in debugging - and to IM for letting me follow his gdb progress in quasi real-time!
This commit was SVN r19044.
are properly linked against libmpi.la.
This required a little creative AM usage, inspired by discussion on
OMPI devel list:
* Make a new ompi/mpi/f77/Makefile_f77base.include; effectively move
the building of the f77 "base" glue stuff (libmpi_f77base.la) into
this Makefile and away from ompi/mpi/f77/Makefile.am. The sources
in question require some specific CPPFLAGS, so we couldn't just add
the raw sources into libmpi_la_SOURCES, unfortunately.
* Include this new Makefile in the top-level ompi/Makefile.am
* The libmpi_f77base.la LT convenience library was already sucked
into libmpi.la; breaking it out into its own Makefile allows us
to build it earlier and therefore complete buidling libmpi.la
earlier.
* Side effect: the ompi/mpi/Makefile.am is now mostly unnecessary; it
no longer specifies a SUBDIRS for each of the bindings directories
to traverse into (since they are now in the top-level SUBDIRS). As
such, the man pages are now also now included in the top-level
ompi/Makefile.am.
The end of the result is that libmpi.la -- including a few sources
from mpi/f77 -- is fully built before the C++, F77, and F90 bindings
are built. Therefore, the C++, F77, and F90 bindings libraries can
all link against libmpi.la.
This commit was SVN r19040.
The following Trac tickets were found above:
Ticket 1409 --> https://svn.open-mpi.org/trac/ompi/ticket/1409
lsb_launch tampers with SIGCHLD signal handler. We are forced to reinstall our own signal handler after a call to this function.
This commit fixes trac:1356.
This commit was SVN r19033.
The following Trac tickets were found above:
Ticket 1356 --> https://svn.open-mpi.org/trac/ompi/ticket/1356
Fix a few bugs in the mappers:
1. Ensure that bynode with no -np fills all available slots - it just does so with the ranks set bynode instead of byslot
2. fix --nolocal behavior so it works correctly in all cases. We still have to test the host's name using opal_ifislocal in the mapper because the name returned by gethostname to orte_process_info.hostname can be an FQDN, but a hostfile may contain a non-FQDN version.
3. Add missing --nolocal logic to the seq mapper
Oversubscribed mapping seemed to be working okay without repair, so I couldn't verify my own bug report in that regard.
Also included are some preliminary changes to support the modified hostfile behavior, which will be committed shortly:
1. removed the totally useless "allocate" field in the orte_node_t object since every node is automatically allocated for use - and everything ignored the field anyway
2. correctly initialize the slots_alloc field when the allocation is read
This commit was SVN r19030.
* Use synonym/deprecated MCA param API for some mca base params
* In openib BTL, if we have appropriate memory hooks support, and if
mpi_leave_pinned and mpi_leave_pinned_pipeline were not set by the
user, set mpi_leave_pinned to 1.
* Defer checking mpi_leave_pinned_* until as late as possible (i.e.,
until after the btl's have had a chance to set mpi_leave_pinned to
1):
* in ob1 pml
* in rdma mpool
This commit was SVN r19022.
The following Trac tickets were found above:
Ticket 1379 --> https://svn.open-mpi.org/trac/ompi/ticket/1379
Kinda found by MTT. MTT calls 'ompi_info --all --parsable' and it was livelocked and had to be killed by hand.
I'm going to push this one to Jeff to push to v1.3 since he did the original implementation and should check this code.
This commit was SVN r19014.
a little bit more than "BTL was able to add some procs". The real condition to
allow the BTL progress is that we will use it to send/recv data to/from some
of the peers (this include the BTL exclusivity in the process).
This commit was SVN r19010.
Modify the odls to remove a (size_t) typecast in front of the num_processors variable just in case it is returned negative. This usually is accompanied by an opal_error, so this shouldn't make any difference - but it is more technically correct.
This commit was SVN r19008.
Fixed allocation of all ranks when using RANKFILE, but not all ranks assigned
Aborting if using RANKFILE, but np wasn't specified a little earlier
Clean mca_rmaps_rank_file_component.debug
This commit was SVN r19004.
* Fix linux paffinity component to make a "best" guess when PLPA
can't find topology information in the Linux kernel. That is, if
PLPA can't tell us the max_processor_id, just assume that it's the
same as the number of processors. If you have a more complex
system than that (e.g., you have holes in your available processor
IDs), you'll likely be running a Linux kernel that supports the
topology information, and this problem won't happen.
* Make sure to conver the return codes from PLPA to OPAL_ERR* codes.
This commit was SVN r19001.
The following Trac tickets were found above:
Ticket 1250 --> https://svn.open-mpi.org/trac/ompi/ticket/1250
For now, hide the OSX component with .ompi_ignore so only I can see it until I can ensure that it doesn't inadvertently interfere with Linux and Solaris support.
This clears the conflict with Windows.
This commit was SVN r18989.
"!OpenFabrics" / neutral (i.e., refer to IB and/or iWARP).
* Mostly just type, variable/field, and funcion name changes, such as
s/hca/device/g, etc.
* Changed the INI file for the hardware-specific parameters to be
mca-btl-openib-device-params.ini.
* Updated a lot of help messages in the help-*.txt files, not just to
update it to be !OpenFabrics/neutral language, but also for some
consistency of tone, indenting, etc.
* Deprecated a bunch of MCA params in favor of language-neutral new
ones:
* btl_openib_warn_no_hca_params_found (s/hca/device/)
* btl_openib_hca_param_files
* btl_openib_ib_cq_size (s/_ib_/_of_/)
* btl_openib_ib_max_inline_data
* btl_openib_ib_psn
* btl_openib_ib_mtu
* btl_openib_ib_pkey_ix
* btl_openib_ib_pkey_val
This commit was SVN r18985.
The following Trac tickets were found above:
Ticket 1295 --> https://svn.open-mpi.org/trac/ompi/ticket/1295
The rdmacm event handler has no way of reporting fatal errors to the upper
layers. By calling mca_btl_openib_endpoint_invoke_error in the rdmacm event
handler for the errors encountered, these errors can now be handled
appropriately.
Closes out Ticket #1283
This commit was SVN r18980.
--debug flag to help developers figure out possible future issues.
This fixes trac:1335.
This commit was SVN r18979.
The following Trac tickets were found above:
Ticket 1335 --> https://svn.open-mpi.org/trac/ompi/ticket/1335
1. add a new API delete_route(orte_process_name_t*) to delete the specified proc from the routing table
2. modify update_route so that it actually updates pre-existing routes instead of only adding routing info the end of the hash table
This fixes ticket #1403
This commit was SVN r18970.
can have a pub_endpoint and a sub_endpoint that are not equal but go
to the same place (fd). I didn't think that that was possible. :-\
So just use a bool to track whether we have forwarded the fragment at
all; if we have, then don't forward to the sub_endpoint.
IOF is going to be re-written for v1.4.
This commit was SVN r18950.
The following SVN revision numbers were found above:
r18873 --> open-mpi/ompi@773c92a6eb