openmpi

Author	SHA1	Message	Date
Mark Santcroos	36ac54b5d8	Bring ALPS ODLS up to par regarding wdir. Signed-off-by: Mark Santcroos <mark.santcroos@rutgers.edu>	2017-04-10 08:15:07 -04:00
Ralph Castain	74863a0ea4	Fix the DVM by ensuring that all nodes, even those that didn't participate (i.e., didn't have any local children) in a job, clean up all resources associated with that job upon its completion. With the advent of backend distributed mapping, nodes that weren't part of the job would still allocate resources on other nodes - and then start from that point when mapping the next job. This change ensures that all daemons start from the same point each time. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-04 17:31:38 -07:00
Ralph Castain	92c996487c	Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well. Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap Get the DVM running again Fix direct modex by eliminating race condition caused by releasing data while sending it Up the size limit before compressing Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 19:25:15 -07:00
Ralph Castain	583dbe954c	Silence coverity dead-code warning Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-26 20:36:43 -07:00
Ralph Castain	35f817911e	Fix coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 08:09:46 -07:00
Ralph Castain	10d401b6ec	Merge pull request #3217 from rhc54/topic/wdirs Resolve a race condition for setting our working directory when fork/exec'ing application procs.	2017-03-21 17:39:54 -07:00
Ralph Castain	74fd2c30af	Cleanup alps odls module Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 17:41:11 -06:00
Ralph Castain	f8e1e3bed3	Ensure we properly exit with error if we cannot map the job Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 15:15:32 -07:00
Ralph Castain	75684dc260	Resolve a race condition for setting our working directory when fork/exec'ing application procs. We have to ensure we do it after the fork occurs since we want to use multiple threads in the odls. Otherwise, the different threads are bouncing the entire process around. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 13:54:03 -07:00
Ralph Castain	dc85e7fde7	Provide a little more help on the error messages when an executable isn't found so we have some better idea where we were looking for it. Don't double-report such errors. Ensure the ORTE_ERROR_NAME doesn't get a NULL back for the string name of an error code as that might cause some systems to segfault Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-17 09:54:37 -07:00
Ralph Castain	105fb152e1	Silence Coverity warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-13 08:38:51 -07:00
Ralph Castain	b9f5cab710	Add a minor debug statement Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-12 18:15:44 -07:00
Ralph Castain	6d6bc9bd07	Update alps module to new APIs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-12 09:43:07 -07:00
Ralph Castain	70591bf4dc	Enable parallel fork/exec of local procs by providing the option of multiple odls progress threads Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 20:48:04 -08:00
Ralph Castain	48fc339718	Create an alternative mapping method that pushes responsibility onto the backend daemons. By default, let mpirun only pack the app_context info and send that to the backend daemons where the mapping will be done. This significantly reduces the computational time on mpirun as it isn't running up/down the topology tree computing thousands of binding locations, and it reduces the launch message to a very small number of bytes. When running -novm, fall back to the old way of doing things where mpirun computes the entire map and binding, and then sends the full info to the backend daemon. Add a new cmd line option/mca param --fwd-mpirun-port that allows mpirun to dynamically select a port, but then passes that back to all the other daemons so they will use that port as a static port for their own wireup. In this mode, we no longer "phone home" directly to mpirun, but instead use the static port to wireup at daemon start. We then use the routing tree to rollup the initial launch report, and limit the number of open sockets on mpirun's node. Update ras simulator to track the new nidmap code Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found. Update gadget platform file Initialize the range count when starting a new range Fix the no-np case in managed allocation Ensure DVM node usage gets cleaned up after each job Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-07 20:43:12 -08:00
Jeff Squyres	fec519a793	hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h Per a prior commit, the presence of "hwloc.h" can cause ambiguity when using --with-hwloc=external (i.e., whether to include opal/mca/hwloc/hwloc.h or whether to include the system-installed hwloc.h). This commit: 1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h. 2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc. 3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the rest of the code base. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-28 07:48:42 -08:00
Ralph Castain	68b53e2179	Fix comm_spawn by registering nspace info only when needed - either when we have local procs, or when job-level info is required by connecting jobs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 19:47:56 -08:00
Ralph Castain	230d15f0d9	Cleanup the ras simulator capability, and the relay route thru grpcomm direct. Don't resend wireup info if nothing has changed Fix release of buffer Correct the unpacking order Fix the DVM - now minimized data transfer to it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-01 15:01:58 -08:00
Ralph Castain	86ab751c5e	Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 19:54:47 -08:00
Ralph Castain	6560617c04	Fix comm_spawn and orte-dvm by resetting all used "node mapped" flags after building the child list Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-22 05:55:53 -08:00
Ralph Castain	639cdd4f9d	Add missing flag set to ensure nodes do not get double-added to job map. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 20:06:50 -08:00
Ralph Castain	be3ef77739	Improve packing efficiency by raising the initial buffer size and modifying the extension code. Flag if a job map has had its nodes added so we don't have to loop repeatedly to check it. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 14:03:19 -08:00
Ralph Castain	668421b6ec	Compress the xcast message if bigger than a defined size to further improve launch performance at scale Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-19 22:08:02 -08:00
Ralph Castain	368684bd63	Revert e9bc293 and try a different approach for scalably dealing with hetero clusters. Have each orted send back its topo "signature". If mpirun detects that this signature has not been seen before, then ask for that daemon to send back its full topology description. This allows the system to only get the topology once for each unique topo in the cluster. Cleanup a typo, and remove no longer needed MCA params for hetero nodes and hetero apps. Hetero nodes will always be automatically detected. We don't support a mix of 32 and 64 bit apps Modify the orte_node_t to use orte_topology_t instead of hwloc_topology_t, updating all the places that use it. Ensure that we properly update topology when we see a different one on a compute node. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-18 10:22:15 -08:00
Ralph Castain	74a285be83	Cancel the waitpid callback once the waitpid on a process has fired to avoid multiple notifications Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-16 14:32:02 -08:00
Artem Polyakov	3eb6c98542	orte/odls: Fix ORTE state machine for the non-zero exit case This commit fixes a rare race condition that occurs when the process that is calling `exit(-1)` has a delay between fd cleanup and actual OS-level exit. This may happen if the process has some work to do `on_exit()`. Problem description: Consider an application process that has called `exit(nonzero)`; its fds were closed but its actual termination at OS level is delayed by some cleanups (e.g., in callbacks registered via `on_exit()`). Observed sequence of events was the following: * orted gets stdio disconnection and activates the `IOF COMPLETE` state. * parallel OOB disconnection causes the `COMMUNICATION FAILURE` state to be activated. * during `COMMUNICATION FAILURE` processing `odls_base_default_wait_local_proc` is called even though the real waitpid wasn't yet called (code mentions that waitpid might not be called for an unspecified reason). Because of that, the real exit code is unknown and set to 0. The `odls_base_default_wait_local_proc` callback sees the `IOF COMPLETE` flag and in conjunction with the 0 exit code it activates the `WAITPID FIRED` state. * processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` being activated. * the `NORMALLY TERMINATED` state in particular leads to the `ORTE_PROC_FLAG_ALIVE` flag for this proc being dropped. * when the application process finally exits, `wait_signal_callback` is launched. It sets the real exit code and calls `odls_base_default_wait_local_proc` again, but at this time, since the process has the `ORTE_PROC_FLAG_ALIVE` flag dropped, the `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`), leading to the hang that was observed. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-01-06 11:12:55 +02:00
Ralph Castain	884fb7fcf2	Update the PMIx2 support to include the latest shared memory optimizations Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn Update to track master Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 15:00:10 -08:00
Ralph Castain	85a634926b	Update signal handling to introduce a pause between SIGCONT and SIGTERM, followed by another pause before SIGKILL. Do this within the odls/kill_local_procs function while we know we are blocked in an event, and before the daemon shuts down the event progress loop Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 12:34:42 -08:00
Ralph Castain	88313debc2	Per discussion on email thread, restore placement of child procs in their own process group so that any signal sent to one of our children is automatically propagated to any child process they might have spawned. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 03:36:22 -08:00
Howard Pritchard	2cbc0e8472	pmix/cray: fix disable-dlopen problem PR open-mpi/ompi#2432 introduced a regression where configure and build with --disable-dlopen caused a build failure owing to unresolved alps lli symbols in the libopal-pal shared library. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-21 13:45:10 -06:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Gilles Gouaillardet	b2a2be0e5a	odls: fix memory leak plug This fixes commit open-mpi/ompi@e2c343cdfc.	2016-09-08 10:02:52 +09:00
Gilles Gouaillardet	e2c343cdfc	odls: plug memory leak as reported by Coverity with CID 710645	2016-09-07 10:08:44 +09:00
Ralph Castain	f85dcaee2a	Fixes CID 1369067 and CID 1196684 Fixes CID 1369648 Fixes CID 1372409	2016-09-06 08:43:15 -07:00
Ralph Castain	c1050bc01e	Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.	2016-08-31 09:32:07 -07:00
Ralph Castain	0ba9572f9f	Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT Update alps component	2016-06-02 17:48:21 -07:00
Ralph Castain	ebe159acef	Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests If requested, obtain stacktraces for each application process and report it to stderr upon timeout stack traces: minor improvements - Also include the hostname and PID of each process for which we're sending the stack traces (vs. just including the ORTE process name) - Send a specific error message if we couldn't find "gstack" in the $PATH (e.g., on OS X) - Send a specific error message if gstack fails to run - Print a message that obtaining the stack traces may take a few seconds so that users don't wonder what's happening Signed-off-by: Jeff Squyres <jsquyres@cisco.com> help-orterun.txt: minor tweaks Trivial update: show "--timeout" (instead of "-timeout") in the help message, just to encourage the use of double-dash options. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> trivial: stacktrace -> stack trace Trivial wordsmithing. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-28 08:36:25 -07:00
Jeff Squyres	dd9a819a1c	odls_default: do not opal_output() while creating a process! It is verboten to use opal_output() after the fork() but before the exec()! It results in all manner of undefined behavior. For example, on some OS X systems, if you run a trivial "hello world" MPI program with a high level of ODLS verbosity: ```sh $ mpirun -np 3 --mca odls_base_verbose 100 ./hello_c ``` You will see a bunch of output from the mpirun ODLS base, but then it may hang in odls_default_module.c:do_child() -- after the fork() but before the exec() -- while trying to opal_output() some debugging statements. The solution is to remove these extraneous opal_output() statements. Indeed, the ODLS base is already outputting the same information that these opal_output() statements are trying to emit, anyway. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-24 21:28:57 -04:00
Ralph Castain	2970becd6b	Revert "Merge pull request #1451 from ggouaillardet/topic/orte_fork_wrapper_fullname" This reverts commit efafd62d38bb12c161330d5a6e4f338e9b560a7e, reversing changes made to a93b849f13b12a7b1c1cdde71a9e491ddc220e17.	2016-03-18 07:18:36 -07:00
Gilles Gouaillardet	589924c4aa	odls/base: use the full app name when using an orte fork agent	2016-03-14 11:18:21 +09:00
Ralph Castain	d72c1c72ff	Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".	2016-03-06 17:55:09 -08:00
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00
Ralph Castain	d653cf2847	Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.	2016-02-21 11:55:49 -08:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). 
Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Ralph Castain	68912d04a8	Fix the grpcomm operations at scale. Restore the direct component to be the default, and to execute a rollup collective. This may in fact be faster than the alternatives, and something appears broken at scale when using brks in particular. Turn off the rcd and brks components as they don't work at scale right now - they can be restored at some future point when someone can debug them. Adjust to Jeff's quibbles Fixes open-mpi/mpi#1215	2016-02-04 05:42:29 -08:00
Mark Santcroos	5ec2b4d98c	Fix some messages in the process.	2015-11-09 18:03:26 -05:00
Ralph Castain	f1483eb2dc	Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP.	2015-11-06 21:35:24 -08:00
Mark Santcroos	30aab75b86	Make message consistent.	2015-10-24 13:40:03 +02:00
Nathan Hjelm	8b5810f7f7	mca/base: add priority output to mca_base_select The mca_base_select function uses returned priorities to select the best component/module. This priority may be of use to the caller so pass that information back in an optional argument. If the priority is not needed pass NULL. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-19 12:32:41 -06:00
Howard Pritchard	d899320574	odls/alps: close the directory Close the /proc/self/fd dir after checking for open fds. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-10-06 11:13:44 -07:00

568 commits