openmpi

Author	SHA1	Message	Date
Ralph Castain	dd0f843843	Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE	2016-06-05 21:39:44 -07:00
Ralph Castain	0ba9572f9f	Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT Update alps component	2016-06-02 17:48:21 -07:00
Jeff Squyres	873cebb4c0	Merge pull request #1727 from jsquyres/pr/mpirun-timeout-and-friends mpirun.1in: add descriptions of new options	2016-06-01 17:11:44 -04:00
Gilles Gouaillardet	5f565dfec3	configury: clean the flex generated .c files	2016-06-01 11:13:31 +09:00
Jeff Squyres	e9ce11c6a7	help-orterun.txt: minor wordsmithing Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-31 16:33:46 -07:00
Jeff Squyres	347497cc7e	mpirun.1in: add descriptions of new options Add descriptions for the new --report-state-on-timeout and --get-stack-traces options. Also add --timeout, and cross-reference MPIEXEC_TIMEOUT with it. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-31 16:33:46 -07:00
Ralph Castain	0cd0ccb7fd	Provide ETIMEDOUT as the mpirun exit code if the timeout limit was hit	2016-05-31 07:45:31 -07:00
Ralph Castain	3913595e10	Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster. Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.	2016-05-29 18:56:18 -07:00
Ralph Castain	ebe159acef	Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests If requested, obtain stacktraces for each application process and report it to stderr upon timeout stack traces: minor improvements - Also include the hostname and PID of each process for which we're sending the stack traces (vs. just including the ORTE process name) - Send a specific error message if we couldn't find "gstack" in the $PATH (e.g., on OS X) - Send a specific error message if gstack fails to run - Print a message that obtaining the stack traces may take a few seconds so that users don't wonder what's happening Signed-off-by: Jeff Squyres <jsquyres@cisco.com> help-orterun.txt: minor tweaks Trivial update: show "--timeout" (instead of "-timeout") in the help message, just to encourage the use of double-dash options. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> trivial: stacktrace -> stack trace Trivial wordsmithing. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-28 08:36:25 -07:00
Jeff Squyres	dd9a819a1c	odls_default: do not opal_output() while creating a process! It is verboten to use opal_output() after the fork() but before the exec()! It results in all manner of undefined behavior. For example, on some OS X systems, if you run a trivial "hello world" MPI program with a high level of ODLS verbosity: `mpirun -np 3 --mca odls_base_verbose 100 ./hello_c` You will see a bunch of output from the mpirun ODLS base, but then it may hang in odls_default_module.c:do_child() -- after the fork() but before the exec() -- while trying to opal_output() some debugging statements. The solution is to remove these extraneous opal_output() statements. Indeed, the ODLS base is already outputting the same information that these opal_output() statements are trying to emit, anyway. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-24 21:28:57 -04:00
rhc54	b7928c2607	Merge pull request #1693 from rhc54/topic/eval2 Fix the dist mapper option	2016-05-24 05:32:12 -07:00
Ralph Castain	30aaf785a8	Fix the dist mapper option	2016-05-23 23:20:33 -07:00
rhc54	927d3f4c3c	Merge pull request #1692 from rhc54/topic/eval2 Fix the --tune problem by searching the argv for MCA params in advance of opal_init_util	2016-05-23 22:19:09 -07:00
Ralph Castain	80f4e3b872	Fix the --tune problem by searching the argv for MCA params in advance of opal_init_util. Only search the first app_context as we historically have done - we can debate whether or not to search all app_contexts	2016-05-23 21:09:44 -07:00
Ralph Castain	2da0210de3	Fix command line usage when Java user provides the -Djava.library.path=foo options	2016-05-23 15:29:36 -07:00
Ralph Castain	42ecffb6d0	Move the registration of MCA params out of the init of the var system - put them in with the rest of the OPAL MCA param registrations Take another shot at untangling the spaghetti orterun: fix for command line parsing orte-submit calls opal_init_util () before parsing out MCA command line options (-mca, -am, etc). This prevents mpirun from setting opal MCA variables for some frameworks as well as the MCA base. This is because when a framework is opened all of its variables are set to read-only. Eventually we want to lift this restriction on some MCA variables but since -mca is affected we must parse out the MCA command line options before opal_init_util(). This commit fixes the bug by adding a new option to opal_cmd_line_parse (ignore unknown option) so orte-submit can pre-parse the command line for MCA options. Signed-off-by: Nathan Hjelm <hjelmn@me.com> Minor cleanups to avoid releasing/recreating the cmd line	2016-05-20 09:59:50 -07:00
George Bosilca	50b37758d4	Don't overwrite the function argument. In a MPMD setup the app in the jdata can be NULL, so make sure we don't leave the main argument to an inconsistent value.	2016-05-19 10:35:23 -04:00
Ralph Castain	ca69403cc8	In MPMD case, add slots given to each of the executables instead of overwriting	2016-05-15 08:55:43 -07:00
Ralph Castain	7767882346	Per user request, add some missing data and definitions: OPAL_PMIX_UNIV_RANK - synonym for OPAL_PMIX_GLOBAL_RANK OPAL_PMIX_APP_SIZE - #ranks in the application of this proc	2016-05-09 08:39:01 -07:00
Ralph Castain	1911d74095	Prevent segfault when -debug given to mpirun	2016-05-08 10:19:05 -07:00
Ralph Castain	7e5ef6a240	Fix the env_list support - the MCA param was being set way too early, so provide a "backdoor" way of providing the value	2016-05-06 15:38:39 -07:00
Ralph Castain	58dd41facf	Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>. Remove debug, let orterun send terminate cmd to DVM Recover the DVM support	2016-05-06 13:14:03 -07:00
rhc54	ff8518853e	Merge pull request #1604 from rhc54/topic/psm2 Improve the transport key print statement to ensure that we don't get…	2016-05-03 13:43:10 -07:00
Jeff Squyres	265e5b9795	Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1 ompi/opal/orte/oshmem/test: max hostname length cleanup	2016-05-02 09:44:18 -04:00
rhc54	2fa8b6c6ac	Merge pull request #1525 from rhc54/topic/schizo Extend the schizo framework	2016-05-01 15:09:08 -07:00
Ralph Castain	6ac7929bd0	Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need. Cleanups per @jjhursey review	2016-05-01 11:30:25 -07:00
Ralph Castain	0f05893952	Ensure consistency between max_procs and univ_size values - since orte wants max_procs, have the proc get that value instead of univ_size Make the singleton module consistent as well	2016-05-01 11:13:33 -07:00
Ralph Castain	29bc24bdd5	Improve the transport key print statement to ensure that we don't get zero fields as this can be a problem for PSM	2016-04-28 20:11:12 -07:00
Ralph Castain	fac409d094	Ensure the personality gets set for the debugger job launch when attaching	2016-04-28 15:28:55 -07:00
Ralph Castain	e6ad1ad621	Up-port of change for 2.x: if user directs oversubscribe, then do not bind as we will otherwise overload resources	2016-04-28 13:21:10 -07:00
Ralph Castain	75dc4c305a	Correctly set the #procs in the job to "job_size", and the max_procs to "univ_size"	2016-04-27 12:00:19 -07:00
Gilles Gouaillardet	6bf57c799f	orte/rml: ORTE_RML_SEND_COMPLETE handles messages with both NULL iov and cbfunc.buffer	2016-04-26 09:19:31 +09:00
Karol Mroz	5c11bdb251	orte: fixup hostname max length usage Also removes orte specific max hostname value. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-04-25 07:08:23 +02:00
Joshua Hursey	29b49351af	ras/lsf: Fix affinity for MPMD jobs running under LSF	2016-04-22 11:18:34 -05:00
Jeff Squyres	68c1a5eb6c	Merge pull request #1567 from jsquyres/pr/fix-ompi-to-opal-name-conversion m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*	2016-04-20 13:10:06 -04:00
Jeff Squyres	6800ef9ec0	m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_* These macros should really be named OPAL_SUMMARY_*; they're used in all projects, and therefore should be in the lowest layer project (OPAL). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-04-20 08:40:00 -07:00
Ralph Castain	449ec41532	Roll to PMIx 1.1.4rc1 and remove the PMIx 1.2.0 directory as the community has decided to not do that release version. This incorporates a number of bug fixes that have been identified and repaired in the PMIx and OMPI code bases. Also includes several minor corrections to the PMIx code so it now supports run-thru without hanging on collectives involving a process that exits	2016-04-15 10:11:11 -07:00
Ralph Castain	1fa236b26c	Ensure that we exit with a non-zero status when oversubscribe fails	2016-04-14 05:51:10 -07:00
Ralph Castain	437f5b4289	Fix map-by node and do-not-launch	2016-04-13 09:21:19 -07:00
Ralph Castain	2432daf065	Some minor cleanups of a memory leak and error output	2016-04-08 07:46:18 -07:00
Rainer Keller	ad690a4bc0	Move the help into the proper file: all orte_show_help in orte/orted/pmix/pmix_server.c reference orterun.	2016-04-07 22:52:23 +02:00
Rainer Keller	52080a5736	As per the pull request to pmix/master: https://github.com/pmix/master/pull/71 Have OMPI's current version of pmix120 nicely fail in case of too long sun_path (longer than 108 or in case of OSX 103 chars). And have OMPI return proper error messages with hints how to amend.	2016-04-07 22:12:53 +02:00
rhc54	a95de6e8ef	Merge pull request #1353 from rhc54/topic/host Per the discussion on the telecon, change the -host behavior yet again	2016-04-04 10:30:36 -07:00
Gilles Gouaillardet	d757fbba5d	oob/usock: drop message to be sent in process_send()	2016-04-04 16:04:54 +09:00
Gilles Gouaillardet	170734182b	oob/usock: mca_oob_usock_peer_close() sets peer->sd = -1 after close() so usock_peer_create_socket knows it must re-create the socket /* assuming it is ever supposed to occur */ also fix a typo (peer->sd >= 0) in usock_peer_create_socket	2016-04-04 16:02:05 +09:00
Gilles Gouaillardet	2ede47c462	pmix: fix misc missing conversion and type issues	2016-04-04 10:12:34 +09:00
Ralph Castain	503e1274a9	Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the #found processing elements Ensure the returned exit status is non-zero if we fail to map If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported. If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules. If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect. With that one correction, this now passes all the given use-cases on that spreadsheet. Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then no-np shall fill things up. Fixes #1344	2016-03-29 11:21:57 -07:00
Ralph Castain	bd18d9c9d5	Ensure the compiler knows that a critical variable is volatile	2016-03-29 09:18:25 -07:00
Howard Pritchard	e7433fcb44	Merge pull request #1486 from hppritcha/topic/fix_wlm_detect_code plm/alps: fix usage of cray wlm_detect methods	2016-03-26 13:22:50 -06:00
Ralph Castain	0e1350f5b7	Add missing header files	2016-03-25 09:06:51 -07:00


5120 Commits