Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors.
Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.
Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi
ll be beneficial at large scales. Leave it "off" by default.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Passed the below set of symbols into a script that added ompi_ to them all.
Note that if processing a symbol named "foo" the script turns
foo into ompi_foo
but doesn't turn
foobar into ompi_foobar
But beyond that the script is blind to C syntax, so it hits strings and
comments etc as well as vars/functions.
coll_base_comm_get_reqs
comm_allgather_pml
comm_allreduce_pml
comm_bcast_pml
fcoll_base_coll_allgather_array
fcoll_base_coll_allgatherv_array
fcoll_base_coll_bcast_array
fcoll_base_coll_gather_array
fcoll_base_coll_gatherv_array
fcoll_base_coll_scatterv_array
fcoll_base_sort_iovec
mpit_big_lock
mpit_init_count
mpit_lock
mpit_unlock
netpatterns_base_err
netpatterns_base_verbose
netpatterns_cleanup_narray_knomial_tree
netpatterns_cleanup_recursive_doubling_tree_node
netpatterns_cleanup_recursive_knomial_allgather_tree_node
netpatterns_cleanup_recursive_knomial_tree_node
netpatterns_init
netpatterns_register_mca_params
netpatterns_setup_multinomial_tree
netpatterns_setup_narray_knomial_tree
netpatterns_setup_narray_tree
netpatterns_setup_narray_tree_contigous_ranks
netpatterns_setup_recursive_doubling_n_tree_node
netpatterns_setup_recursive_doubling_tree_node
netpatterns_setup_recursive_knomial_allgather_tree_node
netpatterns_setup_recursive_knomial_tree_node
pml_v_output_close
pml_v_output_open
intercept_extra_state_t
odls_base_default_wait_local_proc
_event_debug_mode_on
_evthread_cond_fns
_evthread_id_fn
_evthread_lock_debugging_enabled
_evthread_lock_fns
cmd_line_option_t
cmd_line_param_t
crs_base_self_checkpoint_fn
crs_base_self_continue_fn
crs_base_self_restart_fn
event_enable_debug_output
event_global_current_base_
event_module_include
eventops
sync_wait_mt
trigger_user_inc_callback
var_type_names
var_type_sizes
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).
Remove unneeded test
Fix memory corruption by re-initializing variable to NULL in loop
Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.
Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.
Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.
Have the mindist module add procs to the job's proc array as it is a fully described module
Protect the hnp-not-in-allocation case
Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap
Get the DVM running again
Fix direct modex by eliminating race condition caused by releasing data while sending it
Up the size limit before compressing
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
onto the backend daemons. By default, let mpirun only pack the app_context
info and send that to the backend daemons where the mapping will
be done. This significantly reduces the computational time on mpirun as it isn't
running up/down the topology tree computing thousands of binding
locations, and it reduces the launch message to a very small number of
bytes.
When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.
Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port, but then passes that back to
all the other daemons so they will use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wireup at daemon
start. We then use the routing tree to rollup the initial
launch report, and limit the number of open sockets on mpirun's node.
Update ras simulator to track the new nidmap code
Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found.
Update gadget platform file
Initialize the range count when starting a new range
Fix the no-np case in managed allocation
Ensure DVM node usage gets cleaned up after each job
Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).
This commit:
1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
rest of the code base.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
direct. Don't resend wireup info if nothing has changed
Fix release of buffer
Correct the unpacking order
Fix the DVM - now minimized data transfer to it
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit fixes rare race condition that occurs when the process
that is calling `exit(-1)` has delay between fd cleanup and actual
OS-level exit. This may happen if the process has some work to do
`on_exit()`.
**Problem description**:
Consider an application process that has called `exit(nonzero)`, it's
fd's was closed
but it's actual termination at OS level is delayed by some cleanups (eg.
in callbacks registered via `on_exit()`).
Observed sequence of events was the following:
* orted gets stdio disconnection and activating `IOF COMPLETE` state.
* parallel OOB disconnection causes `COMMUNICATION FAILURE` state to be
activated.
* during `COMMUNICATION FAILURE` processing `odls_base_default_wait_local_proc`
is called even though real waitpid wasn't yet called (code mentions that
waitpid might not be called for unspecified reason). Because of that real exit
code is unknown and set to 0. `odls_base_default_wait_local_proc` callback sees
`IOF COMPLETE` flag and in conjunction with 0-exit-code it activates
`WAITPID FIRED` state.
* processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` to be
activated.
* `NORMALLY TERMINATED` state in particular leads `ORTE_PROC_FLAG_ALIVE` flag
for this proc to be dropped.
* when application process finally exits and `wait_signal_callback` is
launched. It sets real exit code and calls `odls_base_default_wait_local_proc`
again but at this time since the process has `ORTE_PROC_FLAG_ALIVE` flag
dropped `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`)
leading to a hang that was observed.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn
Update to track master
Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
* Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons.
* Create a new errmgr component for the DVM to handle the differences.
* Cleanup the DVM state component.
* Add ORTE bindings directory and brief README
* Pass a local tool index around to match jobs.
* Pass the jobid on job completion.
* Fix initialization logic.
* Add framework for python wrapper.
* Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing.
* Add some missing options to orte-dvm
* Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests.
* It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless.
Extend the cmd_line utility to easily allow layering of command line definitions
Catch up with ORTE interface change and make build more generic.
Disable "fixed dvm" logic for now.
Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code
Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun.
Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten.
Add new commands to get_orted_comm_cmd_str().
Move ORTE command line options to orte_globals.[ch].
Catch up with extra orte_submit_init parameter.
Add example code.
Add documentation.
Bump version.
The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail.
Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb.
Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched.
Extend the error reporting capability to job completion as well.
Add index parameter to orte_submit_job().
Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD.
Factor out dvm termination.
Parse the terminate option at tool level.
Add error string for ORTE_ERR_JOB_CANCELLED.
Add some safeguards.
Cleanup and/of comments.
Enable the return.
Properly ORTE_DECLSPEC orte_submit_halt.
Add orte_submit_halt and orte_submit_cancel to interface.
Use the plm interface to terminate the job