openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	7a83fdb9bb	Update to hwloc 2.0.0a with shmem support. Update to support passing of HWLOC shmem topology to client procs Update use of distance API per @bgoglin Have the openib component lookup its object in the distance matrix Bring usnic up-to-date Restore binding for hwloc2 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-25 20:26:22 -07:00
Ralph Castain	b225366012	Bring the ofi/rml component online by completing the wireup protocol for the daemons. Cleanup the current confusion over how connection info gets created and passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors. Remove the no-longer-required get_contact_info and set_contact_info from the RML layer. Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi ll be beneficial at large scales. Leave it "off" by default. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-20 21:01:57 -07:00
Mark Allen	552216f9ba	scripted symbol name change (ompi_ prefix) Passed the below set of symbols into a script that added ompi_ to them all. Note that if processing a symbol named "foo" the script turns foo into ompi_foo but doesn't turn foobar into ompi_foobar But beyond that the script is blind to C syntax, so it hits strings and comments etc as well as vars/functions. coll_base_comm_get_reqs comm_allgather_pml comm_allreduce_pml comm_bcast_pml fcoll_base_coll_allgather_array fcoll_base_coll_allgatherv_array fcoll_base_coll_bcast_array fcoll_base_coll_gather_array fcoll_base_coll_gatherv_array fcoll_base_coll_scatterv_array fcoll_base_sort_iovec mpit_big_lock mpit_init_count mpit_lock mpit_unlock netpatterns_base_err netpatterns_base_verbose netpatterns_cleanup_narray_knomial_tree netpatterns_cleanup_recursive_doubling_tree_node netpatterns_cleanup_recursive_knomial_allgather_tree_node netpatterns_cleanup_recursive_knomial_tree_node netpatterns_init netpatterns_register_mca_params netpatterns_setup_multinomial_tree netpatterns_setup_narray_knomial_tree netpatterns_setup_narray_tree netpatterns_setup_narray_tree_contigous_ranks netpatterns_setup_recursive_doubling_n_tree_node netpatterns_setup_recursive_doubling_tree_node netpatterns_setup_recursive_knomial_allgather_tree_node netpatterns_setup_recursive_knomial_tree_node pml_v_output_close pml_v_output_open intercept_extra_state_t odls_base_default_wait_local_proc _event_debug_mode_on _evthread_cond_fns _evthread_id_fn _evthread_lock_debugging_enabled _evthread_lock_fns cmd_line_option_t cmd_line_param_t crs_base_self_checkpoint_fn crs_base_self_continue_fn crs_base_self_restart_fn event_enable_debug_output event_global_current_base_ event_module_include eventops sync_wait_mt trigger_user_inc_callback var_type_names var_type_sizes Signed-off-by: Mark Allen <markalle@us.ibm.com>	2017-07-11 02:13:23 -04:00
Ralph Castain	206aec6083	By default, apply signals to all direct children _and_ any children they might have spawned (so long as they remain in the same process group). Provide an MCA param (odls_base_signal_direct_children_only) to indicate that the signal is to go _only_ to our direct children, and not be delivered to any children spawned by those procs. Refs https://www.mail-archive.com/users@lists.open-mpi.org/msg31221.html Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-15 12:26:11 -07:00
Ralph Castain	8f09929469	Fix rank-file mapper launch by correctly setting up the remote map from the provided data Put a simple protection for the case where procs fail while we are trying to deregister handlers Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-15 08:33:29 -07:00
Ralph Castain	93cf3c7203	Update OPAL and ORTE for thread safety (I swear, if I look this over one more time, I'll puke) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-06 12:30:57 -07:00
Ralph Castain	9d6b929894	Fix uninitialized variable. Set exit codes for failed launch so we get pretty error messages Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-31 07:38:37 -07:00
Ralph Castain	5d990b557c	Reorg ordering so that bare executable names also are found Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-30 15:58:55 -07:00
Ralph Castain	321abfc8c6	Fix cwd and preload-binary options Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-30 14:07:22 -07:00
Ralph Castain	ad108ba44d	Fix the DVM Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-30 11:42:42 -07:00
Ralph Castain	657e701c65	Add debug verbosity to the orte data server and pmix pub/lookup functions Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it). Remove unneeded test Fix memory corruption by re-initializing variable to NULL in loop Resolve the race condition identified by @ggouaillardet by resetting the mapped flag within the same event where it was set. There is no need to retain the flag beyond that point as it isn't used again. Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them. Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers. Have the mindist module add procs to the job's proc array as it is a fully described module Protect the hnp-not-in-allocation case Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-25 18:41:27 -07:00
Ralph Castain	bb1aaa3286	Use the node index to compare to daemon vpid when identifying procs to bind Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-14 02:37:25 -07:00
Ralph Castain	67156556ce	On behalf of Josh, ensure we flag that the child is no longer alive since we are killing it with SIGKILL Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-13 21:07:26 -07:00
Ralph Castain	0500cc1c66	Update the debugger launch code to reflect the new backend mapping method. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-12 13:31:18 -07:00
Ralph Castain	92c996487c	Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well. Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap Get the DVM running again Fix direct modex by eliminating race condition caused by releasing data while sending it Up the size limit before compressing Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 19:25:15 -07:00
Ralph Castain	583dbe954c	Silence coverity dead-code warning Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-26 20:36:43 -07:00
Ralph Castain	35f817911e	Fix coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 08:09:46 -07:00
Ralph Castain	10d401b6ec	Merge pull request #3217 from rhc54/topic/wdirs Resolve a race condition for setting our working directory when fork/exec'ing application procs.	2017-03-21 17:39:54 -07:00
Ralph Castain	f8e1e3bed3	Ensure we properly exit with error if we cannot map the job Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 15:15:32 -07:00
Ralph Castain	75684dc260	Resolve a race condition for setting our working directory when fork/exec'ing application procs. We have to ensure we do it after the fork occurs since we want to use multiple threads in the odls. Otherwise, the different threads are bouncing the entire process around. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 13:54:03 -07:00
Ralph Castain	dc85e7fde7	Provide a little more help on the error messages when an executable isn't found so we have some better idea where we were looking for it. Don't double-report such errors. Ensure the ORTE_ERROR_NAME doesn't get a NULL back for the string name of an error code as that might cause some systems to segfault Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-17 09:54:37 -07:00
Ralph Castain	105fb152e1	Silence Coverity warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-13 08:38:51 -07:00
Ralph Castain	b9f5cab710	Add a minor debug statement Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-12 18:15:44 -07:00
Ralph Castain	70591bf4dc	Enable parallel fork/exec of local procs by providing the option of multiple odls progress threads Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 20:48:04 -08:00
Ralph Castain	48fc339718	Create an alternative mapping method that pushes responsibility onto the backend daemons. By default, let mpirun only pack the app_context info and send that to the backend daemons where the mapping will be done. This significantly reduces the computational time on mpirun as it isn't running up/down the topology tree computing thousands of binding locations, and it reduces the launch message to a very small number of bytes. When running -novm, fall back to the old way of doing things where mpirun computes the entire map and binding, and then sends the full info to the backend daemon. Add a new cmd line option/mca param --fwd-mpirun-port that allows mpirun to dynamically select a port, but then passes that back to all the other daemons so they will use that port as a static port for their own wireup. In this mode, we no longer "phone home" directly to mpirun, but instead use the static port to wireup at daemon start. We then use the routing tree to rollup the initial launch report, and limit the number of open sockets on mpirun's node. Update ras simulator to track the new nidmap code Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found. Update gadget platform file Initialize the range count when starting a new range Fix the no-np case in managed allocation Ensure DVM node usage gets cleaned up after each job Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-07 20:43:12 -08:00
Jeff Squyres	fec519a793	hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h Per a prior commit, the presence of "hwloc.h" can cause ambiguity when using --with-hwloc=external (i.e., whether to include opal/mca/hwloc/hwloc.h or whether to include the system-installed hwloc.h). This commit: 1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h. 2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc. 3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the rest of the code base. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-28 07:48:42 -08:00
Ralph Castain	68b53e2179	Fix comm_spawn by registering nspace info only when needed - either when we have local procs, or when job-level info is required by connecting jobs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 19:47:56 -08:00
Ralph Castain	230d15f0d9	Cleanup the ras simulator capability, and the relay route thru grpcomm direct. Don't resend wireup info if nothing has changed Fix release of buffer Correct the unpacking order Fix the DVM - now minimized data transfer to it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-01 15:01:58 -08:00
Ralph Castain	86ab751c5e	Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 19:54:47 -08:00
Ralph Castain	6560617c04	Fix comm_spawn and orte-dvm by resetting all used "node mapped" flags after building the child list Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-22 05:55:53 -08:00
Ralph Castain	639cdd4f9d	Add missing flag set to ensure nodes do not get double-added to job map. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 20:06:50 -08:00
Ralph Castain	be3ef77739	Improve packing efficiency by raising the initial buffer size and modifying the extension code. Flag if a job map has had its nodes added so we don't have to loop repeatedly to check it. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 14:03:19 -08:00
Ralph Castain	668421b6ec	Compress the xcast message if bigger than a defined size to further improve launch performance at scale Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-19 22:08:02 -08:00
Ralph Castain	74a285be83	Cancel the waitpid callback once the waitpid on a process has fired to avoid multiple notifications Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-16 14:32:02 -08:00
Artem Polyakov	3eb6c98542	orte/odls: Fix ORTE state machine for the non-zero exit case This commit fixes rare race condition that occurs when the process that is calling `exit(-1)` has delay between fd cleanup and actual OS-level exit. This may happen if the process has some work to do `on_exit()`. Problem description: Consider an application process that has called `exit(nonzero)`, it's fd's was closed but it's actual termination at OS level is delayed by some cleanups (eg. in callbacks registered via `on_exit()`). Observed sequence of events was the following: * orted gets stdio disconnection and activating `IOF COMPLETE` state. * parallel OOB disconnection causes `COMMUNICATION FAILURE` state to be activated. * during `COMMUNICATION FAILURE` processing `odls_base_default_wait_local_proc` is called even though real waitpid wasn't yet called (code mentions that waitpid might not be called for unspecified reason). Because of that real exit code is unknown and set to 0. `odls_base_default_wait_local_proc` callback sees `IOF COMPLETE` flag and in conjunction with 0-exit-code it activates `WAITPID FIRED` state. * processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` to be activated. * `NORMALLY TERMINATED` state in particular leads `ORTE_PROC_FLAG_ALIVE` flag for this proc to be dropped. * when application process finally exits and `wait_signal_callback` is launched. It sets real exit code and calls `odls_base_default_wait_local_proc` again but at this time since the process has `ORTE_PROC_FLAG_ALIVE` flag dropped `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`) leading to a hang that was observed. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-01-06 11:12:55 +02:00
Ralph Castain	884fb7fcf2	Update the PMIx2 support to include the latest shared memory optimizations Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn Update to track master Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 15:00:10 -08:00
Ralph Castain	85a634926b	Update signal handling to introduce a pause between SIGCONT and SIGTERM, followed by another pause before SIGKILL. Do this within the odls/kill_local_procs function while we know we are blocked in an event, and before the daemon shuts down the event progress loop Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 12:34:42 -08:00
Ralph Castain	88313debc2	Per discussion on email thread, restore placement of child procs in their own process group so that any signal sent to one of our children is automatically propagated to any child process they might have spawned. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 03:36:22 -08:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Gilles Gouaillardet	b2a2be0e5a	odls: fix memory leak plug This fixes commit open-mpi/ompi@e2c343cdfc.	2016-09-08 10:02:52 +09:00
Gilles Gouaillardet	e2c343cdfc	odls: plus memory leak as reported by Coverity with CID 710645	2016-09-07 10:08:44 +09:00
Ralph Castain	f85dcaee2a	Fixes CID 1369067 and CID 1196684 Fixes CID 1369648 Fixes CID 1372409	2016-09-06 08:43:15 -07:00
Ralph Castain	0ba9572f9f	Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT Update alps component	2016-06-02 17:48:21 -07:00
Ralph Castain	2970becd6b	Revert "Merge pull request #1451 from ggouaillardet/topic/orte_fork_wrapper_fullname" This reverts commit efafd62d38bb12c161330d5a6e4f338e9b560a7e, reversing changes made to a93b849f13b12a7b1c1cdde71a9e491ddc220e17.	2016-03-18 07:18:36 -07:00
Gilles Gouaillardet	589924c4aa	odls/base: use the full app name when using an orte fork agent	2016-03-14 11:18:21 +09:00
Ralph Castain	d653cf2847	Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.	2016-02-21 11:55:49 -08:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Ralph Castain	68912d04a8	Fix the grpcomm operations at scale. Restore the direct component to be the default, and to execute a rollup collective. This may in fact be faster than the alternatives, and something appears broken at scale when using brks in particular. Turn off the rcd and brks components as they don't work at scale right now - they can be restored at some future point when someone can debug them. Adjust to Jeff's quibbles Fixes open-mpi/mpi#1215	2016-02-04 05:42:29 -08:00
Ralph Castain	f1483eb2dc	Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP.	2015-11-06 21:35:24 -08:00
Mark Santcroos	30aab75b86	Make message consistent.	2015-10-24 13:40:03 +02:00

1 2 3 4 5 ...

381 Коммитов