This commit fixes two bugs in XRC support:
- When dynamic add_procs support was added to master the remote
process name was added to the non-XRC request structure. The same
value was not added to the XRC xconnect structure. This error was
not caught because the send/recv code was using the wrong structure
member. This commit adds the member and ensures the xconnect code
uses the correct structure.
- XRC loopback QP support has been fixed. It was 1) not setting the
correct fields on the endpoint structure, 2) calling
udcm_xrc_recv_qp_connect, and 3) not initializing the endpoint
data.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
mca_btl_openib_put incorrectly checks the qp inline max before
allowing an inline put. This check will always fail for an endpoint
that has not been connected. The commit changes the check to use the
btl_put_local_registration_threshold instead.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Commit open-mpi/ompi@400af6c52d
introduced a regression in XRC support. The commit reversed the
ordering of shared receive queue (SRQ) and completion queue (CQ)
creation. CQ creation must always precede SRQ creation when using
XRC as the CQs are needed to create the SRQs. This commit fixes the
ordering so that CQs are always created before SRQs.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons.
* Create a new errmgr component for the DVM to handle the differences.
* Clean up the DVM state component.
* Add ORTE bindings directory and brief README
* Pass a local tool index around to match jobs.
* Pass the jobid on job completion.
* Fix initialization logic.
* Add framework for python wrapper.
* Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing.
* Add some missing options to orte-dvm
* Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests.
* It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless.
Extend the cmd_line utility to easily allow layering of command line definitions
Catch up with ORTE interface change and make build more generic.
Disable "fixed dvm" logic for now.
Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code
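The merge-with-duplicate-check behaviour described above could look roughly like the sketch below. It is illustrative only: opt_entry_t, merge_opt_tables, and the MERGE_* codes are hypothetical names, not the real cmd_line utility API, and the all-or-nothing behaviour on error is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MERGE_OK = 0, MERGE_ERR_DUPLICATE = -1, MERGE_ERR_NO_SPACE = -2 };

/* Hypothetical option entry; a real command line table tracks more fields. */
typedef struct opt_entry {
    const char *long_name;
    const char *help;
} opt_entry_t;

/* Append the src table to the dst table. Any option name that already
 * exists in dst is reported as an error, and dst is left unchanged. */
int merge_opt_tables(opt_entry_t *dst, size_t *dst_len, size_t dst_cap,
                     const opt_entry_t *src, size_t src_len)
{
    assert(dst != NULL && dst_len != NULL && src != NULL);
    assert(*dst_len <= dst_cap);

    /* Check for duplicates first so nothing is copied on failure. */
    for (size_t i = 0; i < src_len; i++) {
        for (size_t j = 0; j < *dst_len; j++) {
            if (strcmp(src[i].long_name, dst[j].long_name) == 0) {
                return MERGE_ERR_DUPLICATE;
            }
        }
    }
    if (src_len > dst_cap - *dst_len) {
        return MERGE_ERR_NO_SPACE;
    }
    for (size_t i = 0; i < src_len; i++) {
        dst[(*dst_len)++] = src[i];
    }
    return MERGE_OK;
}
```

A tool that layers its own options on top of a shared table would call merge_opt_tables() once per layer and abort on MERGE_ERR_DUPLICATE.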
Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun.
Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten.
Add new commands to get_orted_comm_cmd_str().
Move ORTE command line options to orte_globals.[ch].
Catch up with extra orte_submit_init parameter.
Add example code.
Add documentation.
Bump version.
The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail.
Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb.
Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched.
Extend the error reporting capability to job completion as well.
Add index parameter to orte_submit_job().
Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD.
Factor out dvm termination.
Parse the terminate option at tool level.
Add error string for ORTE_ERR_JOB_CANCELLED.
Add some safeguards.
Cleanup and/of comments.
Enable the return.
Properly ORTE_DECLSPEC orte_submit_halt.
Add orte_submit_halt and orte_submit_cancel to interface.
Use the plm interface to terminate the job
Clean up the configury so we properly check for Singularity under the various typical use cases
Bring the Singularity support online. We have to turn "off" the sm BTL as it segfaults from inside the container - root cause remains unclear. Also turned "off" the various OPAL shmem components in case they are involved and someone else tries to use them. Happily, the vader BTL works just fine!
These changes fix issue https://github.com/open-mpi/ompi/issues/1336
- improve abstractions: the opal/memory/linux component should be the single place that operates on
Memory Allocation Hooks.
- avoid collisions when dynamic components are opened/closed: it is safe because it is linked statically.
- does not change the original behaviour.
As reported by @marksantcroos, this substitution in opal.pc was
incorrect -- it left @{libdir} in the string (vs. ${libdir}). The fix
is simple: use the proper substitution variable in opal.pc (it was
never updated to reflect the new/correct name that was created just
for the pkg-config files).
Fixes open-mpi/ompi#1343.
0715802f52c24c236700ac085090d5441524644c missed that there is a call
to a common/verbs_usnic symbol in the common/verbs component. This
call needs to be compiled out when the common/verbs_usnic component is
not built.
This component is a workaround to a bug in libibverbs that prints a
dire warning that usNIC devices are not supported (of course not --
usNIC devices provide functionality through libfabric, not
libibverbs). This component was written before a better workaround
was created: a "no op" libibverbs plugin for usNIC devices
(https://github.com/cisco/libusnic_verbs; it is also available in
binary form on cisco.com).
Hence, this component no longer builds by default. It's still
available if a user specifically asks for it (e.g., if they do not
want to install the "no op" libibverbs plugin), but it's not the
default. This component also has the side-effect of making
libopen-pal.so depend on libibverbs.so, which can be annoying for
packagers (which is another reason it isn't built by default any
more).
The send code in the ugni btl has an optimization that enables it to
return 1 (fragment gone) in some cases. This optimization involved
removing the btl ownership and callback flags to ensure the fragment
stuck around long enough for its completion flag to be checked. This
works fine for the single-threaded case but not in the multi-threaded
case. It is possible that a fragment will be completed by another
thread while a thread is in mca_btl_ugni_send. This race can
lead to a leaked fragment, missed callback, or both. To fix the issue
without removing the optimization, a reference count has been added to
the fragment. Callbacks and fragment release will not be made until
the fragment reference count has reached 0. The count is incremented
before sending the frag and decremented after the completion flag has
been checked. The fix has been verified to work using a multi-threaded
RMA benchmark with the osc/pt2pt component.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
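A minimal sketch of the fragment reference-count scheme described above, assuming C11 atomics and a hypothetical frag_t type rather than the real ugni btl fragment. The names frag_send, frag_put, and frag_complete are illustrative, and the simulate_early_completion flag only stands in for a second thread completing the fragment during the send.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical fragment; not the real ugni btl fragment layout. */
typedef struct frag {
    atomic_int  ref_cnt;        /* references held on the fragment */
    atomic_bool complete;       /* set once the transfer has completed */
    int         callbacks_run;  /* how many times the callback fired */
    bool        released;       /* true once the fragment was returned */
} frag_t;

void frag_init(frag_t *f)
{
    atomic_init(&f->ref_cnt, 1);      /* reference owned by the in-flight send */
    atomic_init(&f->complete, false);
    f->callbacks_run = 0;
    f->released = false;
}

/* Drop one reference. Only the holder that brings the count to 0
 * runs the callback and releases the fragment. */
void frag_put(frag_t *f)
{
    int prev = atomic_fetch_sub(&f->ref_cnt, 1);
    assert(prev > 0);
    if (prev == 1) {
        f->callbacks_run++;
        f->released = true;
    }
}

/* Completion path; in a threaded run this may execute on another thread. */
void frag_complete(frag_t *f)
{
    atomic_store(&f->complete, true);
    frag_put(f);
}

/* Send path. Returns 1 if the fragment completed during the call
 * ("fragment gone"), 0 otherwise. */
int frag_send(frag_t *f, bool simulate_early_completion)
{
    atomic_fetch_add(&f->ref_cnt, 1);   /* increment before sending */
    if (simulate_early_completion) {
        frag_complete(f);               /* stands in for a competing thread */
    }
    int done = atomic_load(&f->complete) ? 1 : 0;
    frag_put(f);                        /* decrement after checking the flag */
    return done;
}
```

Whichever path drops the last reference runs the callback, so it fires exactly once regardless of which thread finishes first.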
This commit fixes a race condition that can cause an endpoint to be
added to the wait list multiple times. To fix the issue an additional
check has been added to ensure the endpoint is not on the wait list
after the wait list lock is held. The wait list processing code has
also been updated to keep the wait list lock until all wait listed
endpoints have been handled. This reduces the chance that an endpoint
that is being processed by the wait list code is re-added to the
list by a competing send.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
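The two parts of the wait list fix described above (re-checking membership once the lock is held, and holding the lock for the whole processing pass) can be sketched as follows. This is an assumption-laden illustration using a pthread mutex and hypothetical endpoint_t and wait_list_t types, not the actual btl code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITING 16

/* Hypothetical endpoint; only the fields needed for the sketch. */
typedef struct endpoint {
    bool on_wait_list;   /* protected by the wait list lock */
    int  handled;        /* times the wait list code processed it */
} endpoint_t;

typedef struct wait_list {
    pthread_mutex_t lock;
    endpoint_t     *items[MAX_WAITING];
    size_t          count;
} wait_list_t;

void wait_list_init(wait_list_t *wl)
{
    pthread_mutex_init(&wl->lock, NULL);
    wl->count = 0;
}

/* Returns true if the endpoint was queued, false if it was already
 * on the list (or the list is full). */
bool wait_list_add(wait_list_t *wl, endpoint_t *ep)
{
    assert(wl != NULL && ep != NULL);
    bool added = false;
    pthread_mutex_lock(&wl->lock);
    /* Check again under the lock: a competing sender may have
     * queued the endpoint since the caller last looked. */
    if (!ep->on_wait_list && wl->count < MAX_WAITING) {
        ep->on_wait_list = true;
        wl->items[wl->count++] = ep;
        added = true;
    }
    pthread_mutex_unlock(&wl->lock);
    return added;
}

/* Handle every queued endpoint without dropping the lock in between. */
size_t wait_list_process(wait_list_t *wl)
{
    assert(wl != NULL);
    pthread_mutex_lock(&wl->lock);
    size_t n = wl->count;
    for (size_t i = 0; i < n; i++) {
        wl->items[i]->handled++;
        wl->items[i]->on_wait_list = false;
    }
    wl->count = 0;
    pthread_mutex_unlock(&wl->lock);
    return n;
}
```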
Three minor updates from the code review of
https://github.com/open-mpi/ompi-release/pull/933:
* Remove an extra blank line in a show_help message
* We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so
change the flag to REGINT_GE_ONE
* Change "num_blocks" definition to be in terms of block_len (not
eq_size)
A bunch of empirical testing has shown that increasing the retransmit
timeout from 1ms to 5ms doesn't adversely affect performance, yet
decreases the number of gratuitous retransmissions.
Add endpoints in a blocked manner so that we don't overrun the
fi_av_insert() event queue. Also make the AV EQ length an MCA param,
and report it in mca_btl_base_verbose >=5 output.
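As a rough illustration of blocked insertion, the sketch below splits n endpoints into batches of at most block_len entries. The insert_fn callback is a hypothetical stand-in for one batched insert call (it is not fi_av_insert() itself), and record_batch is just an example callback that records what it was given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for one batched insert call. */
typedef void (*insert_fn)(size_t first, size_t count, void *ctx);

/* Number of blocks needed for n endpoints, in terms of block_len. */
size_t num_blocks(size_t n, size_t block_len)
{
    assert(block_len > 0);
    return (n + block_len - 1) / block_len;
}

/* Insert n endpoints in batches of at most block_len entries and
 * return the number of batches issued. */
size_t insert_in_blocks(size_t n, size_t block_len,
                        insert_fn insert, void *ctx)
{
    size_t blocks = num_blocks(n, block_len);
    for (size_t b = 0; b < blocks; b++) {
        size_t first = b * block_len;
        size_t count = (n - first < block_len) ? n - first : block_len;
        insert(first, count, ctx);
        /* A real implementation would drain the batch's completion
         * events here before issuing the next batch. */
    }
    return blocks;
}

/* Example callback: tracks the largest batch and the running total. */
typedef struct batch_stats {
    size_t max_batch;
    size_t total;
} batch_stats_t;

void record_batch(size_t first, size_t count, void *ctx)
{
    batch_stats_t *s = (batch_stats_t *)ctx;
    (void)first;
    if (count > s->max_batch) {
        s->max_batch = count;
    }
    s->total += count;
}
```

Keeping block_len no larger than the event queue length means a single batch cannot produce more events than the queue can hold.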
Sequence numbers will wrap around; it is not sufficient to check for
(seq-1) -- must use the SEQ_DIFF macro to properly handle the
wraparound.
This bug wasn't serious; it just meant we might retransmit one or two
extra times when retransmits were triggered and the sequence numbers
wrapped around their sliding windows.
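To illustrate why a plain comparison is not enough once sequence numbers wrap, here is a small sketch of a wraparound-safe difference. It assumes a 16-bit sequence type and uses hypothetical helper names (seq_diff, seq_lt); the real SEQ_DIFF macro and sequence width in the usnic BTL may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t seq_t;   /* assumed width, for illustration only */

/* Signed distance from b to a, modulo 2^16. The conversion to int16_t
 * relies on two's complement, which holds on common platforms. */
int16_t seq_diff(seq_t a, seq_t b)
{
    return (int16_t)(seq_t)(a - b);
}

/* True if a comes before b in the sliding window, even across a wrap. */
bool seq_lt(seq_t a, seq_t b)
{
    return seq_diff(a, b) < 0;
}

/* Small self-check showing the wraparound cases. */
void seq_diff_selfcheck(void)
{
    assert(seq_diff(0, 65535) == 1);    /* 0 is one step after 65535 */
    assert(seq_lt(65535, 0));           /* a plain 65535 < 0 is false */
    assert(!seq_lt(0, 65535));
}
```

The plain test `a < b` gets the 65535-to-0 case wrong, which is the kind of mistake that leads to unnecessary retransmits.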
When SMT is enabled, a core must be counted as long as one of its hwthreads is allowed.
Thanks Ben Menadue for the report.
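The counting rule above can be sketched with plain bitmasks. This is illustrative only and does not use the hwloc API: each core is assumed to be described by a 64-bit mask of its hwthreads, and count_usable_cores is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the cores that have at least one allowed hardware thread.
 * core_masks[i] has one bit set per hwthread belonging to core i;
 * allowed has one bit set per hwthread the process may use. */
size_t count_usable_cores(const uint64_t *core_masks, size_t ncores,
                          uint64_t allowed)
{
    assert(core_masks != NULL || ncores == 0);
    size_t usable = 0;
    for (size_t i = 0; i < ncores; i++) {
        /* One allowed hwthread is enough; the core need not be
         * fully available to be counted. */
        if ((core_masks[i] & allowed) != 0) {
            usable++;
        }
    }
    return usable;
}
```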
This fixes a regression from open-mpi/ompi@6d149554a7
The eviction callback, for convenience (and to avoid code
duplication), used to call opal_hotel_checkout(). However,
opal_hotel_checkout() deletes the eviction event -- which is fine to
do when opal_hotel_checkout() is invoked by the application. But when
it's invoked by the same event that it's deleting, it can cause Bad
Things to happen.
For simplicity, instead of invoking opal_hotel_checkout() from the
eviction callback, just duplicate the checkout logic into the eviction
callback function (and skip the delete-the-evict-event part).
For good measure, put a comment in all three places where the checkout
logic occurs (because it's inlined): don't change this logic without
changing all 3 places.
Finally, also add a line in the docs for opal_hotel_init() warning
users against calling opal_hotel_checkout() from their eviction
callback.
`cm_message_event_active == 1` but the main thread has already stopped
processing messages, so one message is left unhandled, leading to a
hang.
setmntent() doesn't support root_fd, but manual parsing of
/proc/mounts is fragile, and actually buggy for very long mount lines
(see open-mpi/hwloc#142 (comment)).
Since we only openat("/proc/mounts") there, just manually concatenate
the fsroot_path and use setmntent().
Thanks to Nathan Hjelm for the report.
(Cherry-picked from open-mpi/hwloc@d2d07b9a22)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>