The PMIx support for "instant on" remains experimental, so disable it by default. Provide an MCA param and corresponding command line option to enable it at runtime.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event.
Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the prge code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 456ac7f7af3d9ba09888e3c899eb001daaa24aef)
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":
* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch
* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.
* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.
* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Fix a problem in packing/unpacking job updates. There remains a race condition that causes messages to attempt to be sent to the second new daemon before it is completely ready. Not entirely sure where it is coming from.
Refs #4665
Rebase to master. Reset orte_nidmap_communicated if hosts are added. Check for duplicate hostnames in an add_host command. Turn off tree_spawn for dynamic launch of additional daemons.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Handle the need for different regex generator/parsers by moving the
orte/util/nidmap and orte/util/regex code into a new "regx" framework.
Use the original code to complete a "fwd" component, and create a
scaffold for IBM's "reverse" component.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Change the determination of #spawn threads to be done on basis of #local procs in first job being spawned. Someone can look at an optimization that handles subsequent dynamic spawns that might be larger in size.
Leave the threads running, but blocked, for the life of the daemon, and use them to harvest the local procs as they terminate. This helps short-lived jobs in particular.
Add MCA params to set:
* max number of spawn threads (default: 4)
* set a specific number of spawn threads (default: -1, indicating no set number)
* cutoff - minimum number of local procs before using spawn threads (default: 32)
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors.
Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.
Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi
ll be beneficial at large scales. Leave it "off" by default.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Passed the below set of symbols into a script that added ompi_ to them all.
Note that if processing a symbol named "foo" the script turns
foo into ompi_foo
but doesn't turn
foobar into ompi_foobar
But beyond that the script is blind to C syntax, so it hits strings and
comments etc as well as vars/functions.
coll_base_comm_get_reqs
comm_allgather_pml
comm_allreduce_pml
comm_bcast_pml
fcoll_base_coll_allgather_array
fcoll_base_coll_allgatherv_array
fcoll_base_coll_bcast_array
fcoll_base_coll_gather_array
fcoll_base_coll_gatherv_array
fcoll_base_coll_scatterv_array
fcoll_base_sort_iovec
mpit_big_lock
mpit_init_count
mpit_lock
mpit_unlock
netpatterns_base_err
netpatterns_base_verbose
netpatterns_cleanup_narray_knomial_tree
netpatterns_cleanup_recursive_doubling_tree_node
netpatterns_cleanup_recursive_knomial_allgather_tree_node
netpatterns_cleanup_recursive_knomial_tree_node
netpatterns_init
netpatterns_register_mca_params
netpatterns_setup_multinomial_tree
netpatterns_setup_narray_knomial_tree
netpatterns_setup_narray_tree
netpatterns_setup_narray_tree_contigous_ranks
netpatterns_setup_recursive_doubling_n_tree_node
netpatterns_setup_recursive_doubling_tree_node
netpatterns_setup_recursive_knomial_allgather_tree_node
netpatterns_setup_recursive_knomial_tree_node
pml_v_output_close
pml_v_output_open
intercept_extra_state_t
odls_base_default_wait_local_proc
_event_debug_mode_on
_evthread_cond_fns
_evthread_id_fn
_evthread_lock_debugging_enabled
_evthread_lock_fns
cmd_line_option_t
cmd_line_param_t
crs_base_self_checkpoint_fn
crs_base_self_continue_fn
crs_base_self_restart_fn
event_enable_debug_output
event_global_current_base_
event_module_include
eventops
sync_wait_mt
trigger_user_inc_callback
var_type_names
var_type_sizes
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).
Remove unneeded test
Fix memory corruption by re-initializing variable to NULL in loop
Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.
Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.
Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.
Have the mindist module add procs to the job's proc array as it is a fully described module
Protect the hnp-not-in-allocation case
Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap
Get the DVM running again
Fix direct modex by eliminating race condition caused by releasing data while sending it
Up the size limit before compressing
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
onto the backend daemons. By default, let mpirun only pack the app_context
info and send that to the backend daemons where the mapping will
be done. This significantly reduces the computational time on mpirun as it isn't
running up/down the topology tree computing thousands of binding
locations, and it reduces the launch message to a very small number of
bytes.
When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.
Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port, but then passes that back to
all the other daemons so they will use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wireup at daemon
start. We then use the routing tree to rollup the initial
launch report, and limit the number of open sockets on mpirun's node.
Update ras simulator to track the new nidmap code
Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found.
Update gadget platform file
Initialize the range count when starting a new range
Fix the no-np case in managed allocation
Ensure DVM node usage gets cleaned up after each job
Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).
This commit:
1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
rest of the code base.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
direct. Don't resend wireup info if nothing has changed
Fix release of buffer
Correct the unpacking order
Fix the DVM - now minimized data transfer to it
Signed-off-by: Ralph Castain <rhc@open-mpi.org>