openmpi

Author	SHA1	Message	Date
Ralph Castain	de7b1494d9	Clean out old cruft from the ORCM project	2016-09-21 00:13:30 -07:00
Gilles Gouaillardet	83399adb3f	singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton	2016-09-20 14:56:58 +09:00
Gilles Gouaillardet	e7ae6975d0	orted: fix spawn in singleton mode. In singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports() and send the transport key back to the singleton	2016-09-20 14:39:22 +09:00
Gilles Gouaillardet	d84ac9bdc5	orted: remove debug. Remove debug code that was added by mistake in open-mpi/ompi@eae9d31784	2016-09-19 19:15:42 +09:00
Gilles Gouaillardet	eae9d31784	pre_condition_transports: code cleanup. Replace hard-coded "OMPI_MCA_orte_precondition_transports" environment variable name with macro'ed OPAL_MCA_PREFIX"orte_precondition_transports"	2016-09-19 13:31:47 +09:00
Ralph Castain	e55cc63da9	Remove debug	2016-09-16 07:06:58 -07:00
Ralph Castain	a16b3cc33d	Fix some minor complaints - missing "void" in function parameters	2016-09-15 15:18:42 -07:00
Ralph Castain	6f086189e6	Fix trivial typo	2016-09-15 13:10:55 -07:00
Gregory M. Kurtzer	16794cc260	Updates to support Singularity containers v2.2	2016-09-15 09:52:06 -07:00
Gilles Gouaillardet	11ebf3ab23	ess/singleton: when forking hnp, use the PMIX_NAMESPACE sent by the hnp as the jobid	2016-09-15 13:57:23 +09:00
Gilles Gouaillardet	628c730196	pkgconfig: define the pkgincludedir variable in *.pc files. This has been made necessary with open-mpi/ompi@12e796dcaf Refs open-mpi/ompi#2069	2016-09-13 09:50:14 +09:00
Gilles Gouaillardet	e84b35217f	oob/tcp: plug a memory leak as reported by Coverity with CID 1196711	2016-09-08 18:50:18 +09:00
Gilles Gouaillardet	b2a2be0e5a	odls: fix memory leak plug This fixes commit open-mpi/ompi@e2c343cdfc.	2016-09-08 10:02:52 +09:00
Jeff Squyres	fd829ac389	Merge pull request #1982 from jsquyres/pr/fix-pkg-config-static pkg-config: fix static linking	2016-09-07 14:55:50 -04:00
Jeff Squyres	b811b0a15c	Merge pull request #2060 from jsquyres/pr/remove-unused-var orte proc_info.c: remove unused variable	2016-09-07 06:33:26 -04:00
Artem Polyakov	9eba1b0b75	Merge pull request #2042 from artpol84/pmix_sdirs Several fixes related to session directories:	2016-09-07 14:15:47 +07:00
Artem Polyakov	a9a7f39773	ess/pmi: fix the comments about MCA/PMIx setting conflict resolution.	2016-09-07 07:47:35 +03:00
Gilles Gouaillardet	be41b120d0	orted: plug misc memory leaks as reported by Coverity with CID 1362603 and 1362606	2016-09-07 10:08:44 +09:00
Gilles Gouaillardet	e2c343cdfc	odls: plug memory leak as reported by Coverity with CID 710645	2016-09-07 10:08:44 +09:00
Gilles Gouaillardet	c09899f6af	plm: plug resource leaks as reported by Coverity with CIDs 72274 and 1196733	2016-09-07 10:08:44 +09:00
Jeff Squyres	722d5eecf1	orte proc_info.c: remove unused variable Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-09-06 16:38:15 -07:00
Josh Hursey	f6337f9eae	Merge pull request #2047 from jjhursey/topic/mixed-host2 orte: !FQDN implementation to use opal_net_isaddr	2016-09-06 13:08:54 -05:00
Ralph Castain	f85dcaee2a	Fixes CID 1369067 and CID 1196684 Fixes CID 1369648 Fixes CID 1372409	2016-09-06 08:43:15 -07:00
Artem Polyakov	74a11d7832	Fix session dir cleanup code.	2016-09-05 07:53:55 +03:00
Artem Polyakov	dc0ab674de	Add PMIx key to provide RM with ability to indicate that it will cleanup session directories provided through OPAL_PMIX_TMPDIR, OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR	2016-09-05 07:48:44 +03:00
Artem Polyakov	81195ab724	Several fixes related to session directories: * enable OMPI to retrieve paths from RM through PMIx * cleanups related to tempdirs.	2016-09-05 07:48:44 +03:00
Ralph Castain	fb51d65049	Minor change: check for NULL before using the job map to avoid segfault when erroring out prior to creating the map	2016-09-04 07:53:12 -07:00
Joshua Hursey	fe937d1e82	orte: !FQDN implementation to use opal_net_isaddr * Switch to use opal_net_isaddr() for checking if a name is an IP address - as it is a bit cleaner, and uses common functionality.	2016-09-02 13:31:49 -05:00
Ralph Castain	4e0788e9ad	Enable PSM to support dynamic processes Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object	2016-09-02 10:22:04 -07:00
Ralph Castain	0ea1cff733	Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it to disabled by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional	2016-09-01 13:10:10 -07:00
Gilles Gouaillardet	0b8c58298d	oob/usock: fix handling of orte_process_name_t * orte_process_name_t is aligned on 32 bits, so it cannot simply be cast into an int64_t. Use memcpy() instead. Thanks Paul Hargrove for the report	2016-09-01 13:18:02 +09:00
Ralph Castain	c1050bc01e	Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.	2016-08-31 09:32:07 -07:00
Ralph Castain	9b991bd1f5	Ensure that the "running" state is correctly updated It is possible that one or more procs could get thru PMIx_Init, and thus be marked as in state "registered", before all local procs have been started. If that happens, then we would report some of the procs in state "running", and the others in state "registered" - which means that the HNP would miss the "running" stage of the state machine. Thanks to Jingchao Zhang for his patience in tracking this down on the 2.0 branch	2016-08-30 19:24:39 -07:00
Ralph Castain	cfa784c9a6	Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking	2016-08-29 20:22:24 -07:00
Josh Hursey	b0d8638824	Merge pull request #2015 from jjhursey/topic/mixed-hostnames orte: Expand use of !orte_keep_fqdn_hostnames MCA parameter	2016-08-29 09:14:54 -05:00
Ralph Castain	2f6e0fec90	Provide the number of nodes in the job	2016-08-26 14:50:41 -07:00
Joshua Hursey	d26dd2c20e	orte: Expand the application of !orte_keep_fqdn_hostnames * Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when it is set to false. * If that parameter is set to false (default) then short hostnames (e.g., `node01`) will match with the long hostnames (e.g., `node01.mycluster.org`). This allows a user (or resource manager) to mix the use of short and long hostnames. - Note that this mechanism does _not_ perform a DNS lookup, but instead strips off the FQDN by truncating the hostname string at the first `.` character (when not an IP address). - By default (`false`) the following is true: `node01 == node01.mycluster.org == node01.bogus.com` since we use `node01` as the hostname.	2016-08-26 16:09:04 -05:00
Artem Polyakov	55ac3b0be3	orte/schizo: fix binding detection in slurm component in SLURM 16.05 the SLURM_CPU_BIND_TYPE is equal to "mask_cpu:" instead of "mask_cpu". Account for that.	2016-08-26 09:55:52 +03:00
rhc54	19b0f4db9f	Merge pull request #1995 from rhc54/topic/pe-per-rank Change the behavior of cpus-per-rank.	2016-08-25 14:38:12 -05:00
Ralph Castain	440eae90ec	Correct the binding algorithm to decouple it from oversubscribe. Oversubscribe stipulates that we allow more procs on the node than assigned slots - it has nothing to do with the number of available pe's. Let overload directives handle the pe situation.	2016-08-24 21:17:22 -07:00
Ralph Castain	92102304b6	Minor typo - init the job_data stdin_target field to 0 for default behavior. Add test.	2016-08-22 21:03:45 -07:00
Gilles Gouaillardet	93e73841f9	ess/singleton: push all PMIX_* environment variables, regardless how many there are	2016-08-23 09:46:55 +09:00
Gilles Gouaillardet	a1e8e58a8a	ess/singleton: expects 4 PMIX_* environment variables or more	2016-08-23 09:34:03 +09:00
Ralph Castain	7de4d6922b	Change the behavior of cpus-per-rank. We previously counted each cpu against the #slots. However, IBM has pointed out that "slot" is equated to the number of processes allowed to run on each node, and not the number of cpus on the node. This has been a continuing source of confusion, so make the distinction a "hard" one. Each process occupies a "slot". We automatically set #slots = #cpus if nothing else is told to us. If you want to run more procs than slots, you must tell us to allow oversubscription. A process can utilize multiple pe's if that option is given. If you try to bind more than one proc to a given pe, then we will error out unless you tell us to allow overloading.	2016-08-22 15:54:41 -07:00
Ralph Castain	9888615e75	Restore the coll/sync module and provide a test to verify its operation	2016-08-20 10:14:52 -07:00
Jeff Squyres	fb894e6e3e	pkg-config: fix static linking We need to list all major project libraries in the private libraries line to enable static linking to work properly. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-08-17 20:37:51 -05:00
Jeff Squyres	71ec5cfb43	rsh: robustify the check for plm_rsh_agent default value Don't strcmp against the default value -- the default value may change over time. Instead, check to see if the MCA var source is not DEFAULT. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-08-16 06:58:20 -05:00
rhc54	d7cd802426	Merge pull request #1971 from rhc54/topic/sesdir Update the session dir structure. Restore the creation of a top-level…	2016-08-16 03:14:08 -05:00
Ralph Castain	ae2af61ee3	Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.	2016-08-15 22:46:46 -05:00
Ralph Castain	9f43db7303	Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods.	2016-08-15 07:51:36 -07:00
Ralph Castain	be8424b691	Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start. Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts dd	2016-08-13 12:13:04 -07:00
Ralph Castain	08a0644df5	Fix shared memory rendezvous	2016-08-13 08:14:50 -07:00
rhc54	ddde154d28	Merge pull request #1962 from rhc54/topic/notify Ensure we properly convert pmix status to ORTE state before activatin…	2016-08-13 06:59:50 -07:00
Ralph Castain	48d35a9627	Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program	2016-08-12 21:14:29 -07:00
rhc54	9eed451916	Merge pull request #1960 from rhc54/topic/rsh Restore the rsh template creation code	2016-08-12 13:38:43 -07:00
rhc54	8d67f753ca	Merge pull request #1959 from rhc54/topic/nodeid The node index isn't normally passed with the packed node object, so …	2016-08-12 13:30:10 -07:00
rhc54	1ef3c86d44	Merge pull request #1931 from hjelmn/ess_fix ess/base: set up nidmap after pmix	2016-08-12 13:10:30 -07:00
Ralph Castain	5717b75b45	Restore the rsh template creation code	2016-08-12 12:43:40 -07:00
Ralph Castain	d4327fd973	The node index isn't normally passed with the packed node object, so we need to set it on the remote end as the orted needs to pass it down to the procs. Refactor the registration code to better package proc-level info - we will separate out the node and app levels in a subsequent change.	2016-08-12 12:06:23 -07:00
Ralph Castain	1c44543854	If the ssh agent hasn't been given, then check for qrsh and friends	2016-08-12 07:46:39 -07:00
Ralph Castain	527b5c692a	Update to include extended tool support, new datatypes	2016-08-08 13:39:46 -07:00
Artem Polyakov	1351a7065c	ess/pmi: minor code readability cleanup. Split process name variable "name" to - "wildcard_rank" for the cases where wildcard is used. - "pname" for the case where reference to particular process is needed.	2016-08-06 15:45:19 +06:00
Howard Pritchard	ff669e7b15	code cleanup: clang is now a happier panda Clang 5.1 on my mac was a sad panda compiling a couple of files, complaining about uninitialized stack variables. This commit makes clang a happier panda (or at least not so sad). Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-08-04 19:34:44 -06:00
Nathan Hjelm	3c23502dfe	ess/base: set up nidmap after pmix This fixes a SEGV when the nidmap code attempts to use opal_pmix.store_local before pmix is set up. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-08-02 09:50:00 -06:00
Ralph Castain	16fccd4964	Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives	2016-07-29 09:52:36 -07:00
Ralph Castain	b748afceb1	Fix copy/paste error	2016-07-29 06:41:30 -07:00
Gilles Gouaillardet	e67c3d0a14	orted/pmix: protect against NULL node in orte_pmix_server_register_nspace()	2016-07-29 16:20:31 +09:00
Gilles Gouaillardet	273e56096b	configury: capture configury command line. The configury command line is quoted and made available via the OPAL_CONFIGURE_CLI macro. It can be retrieved via {orte-info,ompi_info,oshmem_info} -c, or {orte-info,ompi_info,oshmem_info} --all --parseable | grep ^config:cli:	2016-07-29 09:14:09 +09:00
rhc54	19a2dbb04f	Merge pull request #1915 from rhc54/topic/connect Support timeout values when performing connect/accept operations. Bum…	2016-07-28 15:51:06 -07:00
Jeff Squyres	cc651408dc	help-orterun: remove blank line at end of help message Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-07-28 14:53:34 -07:00
Ralph Castain	cacb582ecd	Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application	2016-07-28 14:09:06 -07:00
Ralph Castain	9ab20cafe3	Pass the nodeid for each proc in the job. Fix a mistaken error output message	2016-07-25 15:41:15 -07:00
Ralph Castain	71de03fc67	Cleanup the new naming requirements to ensure that info is correctly retrieved Cleanup permissions Restore singleton operations	2016-07-21 09:46:03 -07:00
Ralph Castain	01a653d50a	Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found". Remove stale file reference Restore autogen pass thru pmix Remove generated file	2016-07-20 00:58:19 -07:00
Ralph Castain	99f7096031	Fix permissions	2016-07-16 21:03:55 -07:00
Ralph Castain	d4071fbd1c	Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling	2016-07-16 13:20:41 -07:00
rhc54	2414244171	Merge pull request #1872 from rhc54/topic/continuous Add support for continuously operating applications	2016-07-13 15:29:31 -07:00
Ralph Castain	20a91c2baf	Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program. Cleanup debug message	2016-07-13 15:28:33 -07:00
rhc54	cc2a648124	Merge pull request #1862 from rhc54/topic/mapping Fix a bug in the handling of nper<foo> when -host or -hostfile was gi…	2016-07-12 10:40:28 -07:00
Ralph Castain	aa78f902f2	Add some missing info to the job map so remote procs get their app_rank	2016-07-12 09:50:12 -07:00
Ralph Castain	ddd0d05de3	Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assign them.	2016-07-12 09:27:02 -07:00
Ralph Castain	ae8444682f	Remove stale variable	2016-07-05 20:07:16 -07:00
Ralph Castain	ee56d9dc1a	Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field	2016-07-05 14:59:50 -07:00
Ralph Castain	c9ada8e095	Silence Coverity warnings	2016-07-03 20:45:08 -07:00
Ralph Castain	6e434d6785	Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information Update to match PMIx RFC Fix configury to point to correct libevent and hwloc locations	2016-06-29 19:19:19 -07:00
Gilles Gouaillardet	5d32282230	orted/pmix_server_pub: fix packing type in pmix_server_lookup_fn() and make it match the one used when unpacking in orte_data_server()	2016-06-27 14:37:08 +09:00
Ralph Castain	e3e4d73986	Need to be a little more careful when checking the range on a publish/lookup operation. If the range was constrained at publish, then we need to check that the lookup fits within that constraint. Otherwise, we should provide the data. More detailed constraint checking will be provided later.	2016-06-24 17:01:49 -07:00
Ralph Castain	380cc8f040	Add a test program to help diagnose binding issues	2016-06-23 06:27:18 -07:00
Ralph Castain	0ba02821e6	Add requested key and job-level info	2016-06-19 18:22:31 -07:00
Jeff Squyres	98a2f5248d	orte: add missing break statement This seems like an obvious typo: insert a missing "break" statement so that we don't fall through to the next case. Fixes CIDs 1362756 and 1362764. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-06-18 07:48:45 -07:00
Ralph Castain	5d330d5220	Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler. Add PMIx 2.0 Remove PMIx 1.1.4 Cleanup copying of component Add missing file Touchup a typo in the Makefile.am Update the pmix ext114 component Minor cleanups and resync to master Update to latest PMIx 2.x Update to the PMIx event notification branch latest changes	2016-06-14 13:08:41 -07:00
Ralph Castain	a6e6c37484	Remove stale map-reduce support	2016-06-12 07:41:57 -07:00
Ralph Castain	dd0f843843	Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE	2016-06-05 21:39:44 -07:00
Ralph Castain	0ba9572f9f	Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT Update alps component	2016-06-02 17:48:21 -07:00
Jeff Squyres	873cebb4c0	Merge pull request #1727 from jsquyres/pr/mpirun-timeout-and-friends mpirun.1in: add descriptions of new options	2016-06-01 17:11:44 -04:00
Gilles Gouaillardet	5f565dfec3	configury: clean the flex generated .c files	2016-06-01 11:13:31 +09:00
Jeff Squyres	e9ce11c6a7	help-orterun.txt: minor wordsmithing Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-31 16:33:46 -07:00
Jeff Squyres	347497cc7e	mpirun.1in: add descriptions of new options Add descriptions for the new --report-state-on-timeout and --get-stack-traces options. Also add --timeout, and cross-reference MPIEXEC_TIMEOUT with it. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-31 16:33:46 -07:00
Ralph Castain	0cd0ccb7fd	Provide ETIMEDOUT as the mpirun exit code if the timeout limit was hit	2016-05-31 07:45:31 -07:00
Ralph Castain	3913595e10	Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster. Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.	2016-05-29 18:56:18 -07:00
Ralph Castain	ebe159acef	Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests If requested, obtain stacktraces for each application process and report it to stderr upon timeout stack traces: minor improvements - Also include the hostname and PID of each process for which we're sending the stack traces (vs. just including the ORTE process name) - Send a specific error message if we couldn't find "gstack" in the $PATH (e.g., on OS X) - Send a specific error message if gstack fails to run - Print a message that obtaining the stack traces may take a few seconds so that users don't wonder what's happening Signed-off-by: Jeff Squyres <jsquyres@cisco.com> help-orterun.txt: minor tweaks Trivial update: show "--timeout" (instead of "-timeout") in the help message, just to encourage the use of double-dash options. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> trivial: stacktrace -> stack trace Trivial wordsmithing. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-28 08:36:25 -07:00
Jeff Squyres	dd9a819a1c	odls_default: do not opal_output() while creating a process! It is verboten to use opal_output() after the fork() but before the exec()! It results in all manner of undefined behavior. For example, on some OS X systems, if you run a trivial "hello world" MPI program with a high level of ODLS verbosity (`mpirun -np 3 --mca odls_base_verbose 100 ./hello_c`), you will see a bunch of output from the mpirun ODLS base, but then it may hang in odls_default_module.c:do_child() -- after the fork() but before the exec() -- while trying to opal_output() some debugging statements. The solution is to remove these extraneous opal_output() statements. Indeed, the ODLS base is already outputting the same information that these opal_output() statements are trying to emit, anyway. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-24 21:28:57 -04:00
rhc54	b7928c2607	Merge pull request #1693 from rhc54/topic/eval2 Fix the dist mapper option	2016-05-24 05:32:12 -07:00
Ralph Castain	30aaf785a8	Fix the dist mapper option	2016-05-23 23:20:33 -07:00
rhc54	927d3f4c3c	Merge pull request #1692 from rhc54/topic/eval2 Fix the --tune problem by searching the argv for MCA params in advance of opal_init_util	2016-05-23 22:19:09 -07:00
Ralph Castain	80f4e3b872	Fix the --tune problem by searching the argv for MCA params in advance of opal_init_util. Only search the first app_context as we historically have done - we can debate whether or not to search all app_contexts	2016-05-23 21:09:44 -07:00
Ralph Castain	2da0210de3	Fix command line usage when Java user provides the -Djava.library.path=foo options	2016-05-23 15:29:36 -07:00
Ralph Castain	42ecffb6d0	Move the registration of MCA params out of the init of the var system - put them in with the rest of the OPAL MCA param registrations Take another shot at untangling the spaghetti orterun: fix for command line parsing orte-submit calls opal_init_util () before parsing out MCA command line options (-mca, -am, etc). This prevents mpirun from setting opal MCA variables for some frameworks as well as the MCA base. This is because when a framework is opened all of its variables are set to read-only. Eventually we want to lift this restriction on some MCA variables but since -mca is affected we must parse out the MCA command line options before opal_init_util(). This commit fixes the bug by adding a new option to opal_cmd_line_parse (ignore unknown option) so orte-submit can pre-parse the command line for MCA options. Signed-off-by: Nathan Hjelm <hjelmn@me.com> Minor cleanups to avoid releasing/recreating the cmd line	2016-05-20 09:59:50 -07:00
George Bosilca	50b37758d4	Don't overwrite the function argument. In a MPMD setup the app in the jdata can be NULL, so make sure we don't leave the main argument to an inconsistent value.	2016-05-19 10:35:23 -04:00
Ralph Castain	ca69403cc8	In MPMD case, add slots given to each of the executables instead of overwriting	2016-05-15 08:55:43 -07:00
Ralph Castain	7767882346	Per user request, add some missing data and definitions: OPAL_PMIX_UNIV_RANK - synonym for OPAL_PMIX_GLOBAL_RANK OPAL_PMIX_APP_SIZE - #ranks in the application of this proc	2016-05-09 08:39:01 -07:00
Ralph Castain	1911d74095	Prevent segfault when -debug given to mpirun	2016-05-08 10:19:05 -07:00
Ralph Castain	7e5ef6a240	Fix the env_list support - the MCA param was being set way too early, so provide a "backdoor" way of providing the value	2016-05-06 15:38:39 -07:00
Ralph Castain	58dd41facf	Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>. Remove debug, let orterun send terminate cmd to DVM Recover the DVM support	2016-05-06 13:14:03 -07:00
rhc54	ff8518853e	Merge pull request #1604 from rhc54/topic/psm2 Improve the transport key print statement to ensure that we don't get…	2016-05-03 13:43:10 -07:00
Jeff Squyres	265e5b9795	Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1 ompi/opal/orte/oshmem/test: max hostname length cleanup	2016-05-02 09:44:18 -04:00
rhc54	2fa8b6c6ac	Merge pull request #1525 from rhc54/topic/schizo Extend the schizo framework	2016-05-01 15:09:08 -07:00
Ralph Castain	6ac7929bd0	Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need. Cleanups per @jjhursey review	2016-05-01 11:30:25 -07:00
Ralph Castain	0f05893952	Ensure consistency between max_procs and univ_size values - since orte wants max_procs, have the proc get that value instead of univ_size Make the singleton module consistent as well	2016-05-01 11:13:33 -07:00
Ralph Castain	29bc24bdd5	Improve the transport key print statement to ensure that we don't get zero fields as this can be a problem for PSM	2016-04-28 20:11:12 -07:00
Ralph Castain	fac409d094	Ensure the personality gets set for the debugger job launch when attaching	2016-04-28 15:28:55 -07:00
Ralph Castain	e6ad1ad621	Up-port of change for 2.x: if user directs oversubscribe, then do not bind as we will otherwise overload resources	2016-04-28 13:21:10 -07:00
Ralph Castain	75dc4c305a	Correctly set the #procs in the job to "job_size", and the max_procs to "univ_size"	2016-04-27 12:00:19 -07:00
Gilles Gouaillardet	6bf57c799f	orte/rml: ORTE_RML_SEND_COMPLETE handles messages with both NULL iov and cbfunc.buffer	2016-04-26 09:19:31 +09:00
Karol Mroz	5c11bdb251	orte: fixup hostname max length usage Also removes orte specific max hostname value. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-04-25 07:08:23 +02:00
Joshua Hursey	29b49351af	ras/lsf: Fix affinity for MPMD jobs running under LSF	2016-04-22 11:18:34 -05:00
Jeff Squyres	68c1a5eb6c	Merge pull request #1567 from jsquyres/pr/fix-ompi-to-opal-name-conversion m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*	2016-04-20 13:10:06 -04:00
Jeff Squyres	6800ef9ec0	m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_* These macros should really be named OPAL_SUMMARY_*; they're used in all projects, and therefore should be in the lowest layer project (OPAL). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-04-20 08:40:00 -07:00
Ralph Castain	449ec41532	Roll to PMIx 1.1.4rc1 and remove the PMIx 1.2.0 directory as the community has decided to not do that release version. This incorporates a number of bug fixes that have been identified and repaired in the PMIx and OMPI code bases. Also includes several minor corrections to the PMIx code so it now supports run-thru without hanging on collectives involving a process that exits	2016-04-15 10:11:11 -07:00
Ralph Castain	1fa236b26c	Ensure that we exit with a non-zero status when oversubscribe fails	2016-04-14 05:51:10 -07:00
Ralph Castain	437f5b4289	Fix map-by node and do-not-launch	2016-04-13 09:21:19 -07:00
Ralph Castain	2432daf065	Some minor cleanups of a memory leak and error output	2016-04-08 07:46:18 -07:00
Rainer Keller	ad690a4bc0	Move the help into the proper file: all orte_show_help in orte/orted/pmix/pmix_server.c reference orterun.	2016-04-07 22:52:23 +02:00
Rainer Keller	52080a5736	As per the pull request to pmix/master: https://github.com/pmix/master/pull/71 Have OMPI's current version of pmix120 nicely fail in case of too long sun_path (longer than 108 or in case of OSX 103 chars). And have OMPI return proper error messages with hints how to amend.	2016-04-07 22:12:53 +02:00
rhc54	a95de6e8ef	Merge pull request #1353 from rhc54/topic/host Per the discussion on the telecon, change the -host behavior yet again	2016-04-04 10:30:36 -07:00
Gilles Gouaillardet	d757fbba5d	oob/usock: drop message to be sent in process_send()	2016-04-04 16:04:54 +09:00
Gilles Gouaillardet	170734182b	oob/usock: mca_oob_usock_peer_close() sets peer->sd = -1 after close() so usock_peer_create_socket know it must re-create the socket /* assuming it is ever supposed to occur */ also fix a typo (peer->sd >= 0) in usock_peer_create_socket	2016-04-04 16:02:05 +09:00
Gilles Gouaillardet	2ede47c462	pmix: fix misc missing conversion and type issues	2016-04-04 10:12:34 +09:00
Ralph Castain	503e1274a9	Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the #found processing elements Ensure the returned exit status is non-zero if we fail to map If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported. If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules. If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect. With that one correction, this now passes all the given use-cases on that spreadsheet. Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then no-np shall fill things up. Fixes #1344	2016-03-29 11:21:57 -07:00
Ralph Castain	bd18d9c9d5	Ensure the compiler knows that a critical variable is volatile	2016-03-29 09:18:25 -07:00
Howard Pritchard	e7433fcb44	Merge pull request #1486 from hppritcha/topic/fix_wlm_detect_code plm/alps: fix usage of cray wlm_detect methods	2016-03-26 13:22:50 -06:00
Ralph Castain	0e1350f5b7	Add missing header files	2016-03-25 09:06:51 -07:00
Ralph Castain	a3fea58d1c	Minor cleanups to prior PR commit	2016-03-24 15:55:14 -07:00
rhc54	6756e19aa2	Merge pull request #1457 from anandhis/master rml changes	2016-03-24 15:17:29 -07:00
rhc54	ba8c8700aa	Merge pull request #1493 from rhc54/topic/sing Update singularity support to track changes in upstream Singularity code	2016-03-24 15:16:38 -07:00
Ralph Castain	8c14df2328	Revert "Modify singularity support per patch from Greg Kurtzer" This reverts commit open-mpi/ompi@f7257a8310. Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable. Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl	2016-03-24 11:27:18 -07:00
Ralph Castain	378d9cbb5e	Extend the abort on non zero status flag to apply to processes which die as the result of signals.	2016-03-24 08:33:55 -07:00
Ralph Castain	cdd3dc99ca	Correct the binding for the --map-by node case - we should still use our default binding algorithms	2016-03-23 09:55:24 -07:00
Ralph Castain	6e6bbfda91	Very minor typo	2016-03-23 08:31:47 -07:00
Ralph Castain	4a623778a9	Fix the debugger attach - previous commit had fixed one instance of a check prior to sending the release message, but there was a second code path that included a similar check that was missed. Thanks to John DelSignore for spotting it!	2016-03-23 08:25:25 -07:00
Howard Pritchard	69200e6229	plm/alps: fix usage of cray wlm_detect methods Turns out there are some cases where the Cray wlm_detect_get_active may return NULL, in which case fallback to wlm_detect_get_default method is suggested. Make use of the fallback to avoid segfaults under some circumstances in the ALPS plm selection method. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-03-22 11:40:56 -07:00
Ralph Castain	c146c4969b	Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation. Fixes #1467 Restore debugger attach operations Fixes #1225	2016-03-18 21:49:04 -07:00
Ralph Castain	8f410d7897	Revert one part of open-mpi/ompi@4d0cc27eb7	2016-03-18 07:23:30 -07:00
Ralph Castain	2970becd6b	Revert "Merge pull request #1451 from ggouaillardet/topic/orte_fork_wrapper_fullname" This reverts commit `efafd62d38`, reversing changes made to `a93b849f13`.	2016-03-18 07:18:36 -07:00
Ralph Castain	a67ff065ae	Silence coverity warnings	2016-03-16 08:43:16 -07:00
Nysal Jan K.A	f6e932c864	Fix memory corruption in orte-ps orte-ps ends up free'ing the same pointer multiple times	2016-03-15 16:03:31 +05:30
Ralph Castain	6d7ada9675	Silence Coverity warning	2016-03-14 09:42:43 -07:00
Gilles Gouaillardet	589924c4aa	odls/base: use the full app name when using an orte fork agent	2016-03-14 11:18:21 +09:00
Anandhi S Jayakumar	a31292abc7	fixes to ud for removing qos channel	2016-03-10 18:03:17 -08:00
Ralph Castain	a4c8e8c28a	Cleanup the proposed change: * qos framework is moving to the scon layer and is no longer required in ORTE * remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought * no need for separating the "base" from "API" module definition. The two are identical * move the "stub" functions into their own file for cleanliness * general cleanup to meet coding standards * cleanup some logic in the stubs	2016-03-10 13:14:17 -08:00
Jeff Squyres	48c650c47a	configury: minor updates to config summary output	2016-03-10 13:02:52 -08:00
Anandhi S Jayakumar	0188c3cf81	Adding commit for multiple plugin loading support in RML	2016-03-09 18:13:48 -08:00
Ralph Castain	f7257a8310	Modify singularity support per patch from Greg Kurtzer	2016-03-09 07:52:11 -08:00
Ralph Castain	f3ae30ff39	Fix singletons yet again...	2016-03-08 10:33:35 -08:00
Ralph Castain	d72c1c72ff	Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".	2016-03-06 17:55:09 -08:00
Ralph Castain	4d0cc27eb7	Update the singularity support to match that of the latest singularity master. Remove the restriction on shared memory components by instructing singularity to not isolate the PID space. Add a new schizo API to allow setting up the original app_context. Ensure the container is installed prior to execution.	2016-03-05 21:47:42 -08:00
Ralph Castain	ce0a05d7d1	Minor cleanup - Singularity now has an internal check for installed, so we no longer need to do so.	2016-03-04 19:07:53 -08:00
Gilles Gouaillardet	80bdbfd9e7	add missing include file	2016-03-03 13:46:28 +09:00
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00
Ralph Castain	f0680008d1	Add test file for singularity	2016-03-02 05:40:41 -08:00
Ralph Castain	06e811c5a6	Properly use the OPAL_MCA_PREFIX in orte_submit	2016-03-01 18:16:40 -08:00
Ralph Castain	1b81d90eaa	Minor cleanups required for orte-dvm operation	2016-03-01 18:12:53 -08:00
Ralph Castain	c9f7bb6751	Add the include file to all the schizo components	2016-03-01 13:18:23 -08:00
Ralph Castain	625083fe18	Add include file	2016-03-01 13:04:20 -08:00
Ralph Castain	011403c04a	Fix a number of issues, some of which have lingered for a long time: * provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case. * change the relative priority of the pmix120 and pmix112 components to make pmix120 the default * fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons * ensure orterun doesn't propagate any ess or pmix directives in its environment * Cleanup a few valgrind issues and memory leaks * Fix a race condition that prevented the client from completing notification registrations (missing thread shift) * Ensure the schizo/alps component detects launch by mpirun	2016-03-01 06:53:00 -08:00
Ralph Castain	263b0c95a8	Fix a segfault that can occur when very short-lived, non-ORTE procs are run	2016-02-28 12:30:20 -08:00
Ralph Castain	cdb494566d	Provide an option to allow isolated singletons	2016-02-25 11:33:26 -06:00
Ralph Castain	e8d347d7bd	Add missing includes	2016-02-24 08:56:02 -06:00
Ralph Castain	77f800b7e8	Tools don't create the orte_job_data table, so don't remove jobs from it	2016-02-21 16:29:00 -08:00
Ralph Castain	64b7728f33	Fix typo - do not look at daemon job when considering completion of launch	2016-02-21 14:44:51 -08:00
Ralph Castain	d653cf2847	Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.	2016-02-21 11:55:49 -08:00
Ralph Castain	309e23ab3a	Fix minor typo	2016-02-20 01:33:10 -08:00
Ralph Castain	0c72ba89b9	Cleanup the output-filename options so they work as expected. Have the remote nodes output locally to the files instead of sending it all back to the HNP. Fix Solaris issues by renaming struct field	2016-02-19 12:41:46 -08:00
rhc54	bfd4254a7b	Merge pull request #1382 from rhc54/topic/cleanup Cleanup some valgrind complaints about jumps with uninitialized values.	2016-02-18 17:29:37 -08:00
Nathan Hjelm	27e7b6e466	Merge pull request #1381 from hjelmn/ddt_colon_fix orterun: allow DDT if options contain :'s	2016-02-18 17:48:21 -07:00
Ralph Castain	6e68d758b9	Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files). ck Remove stale debug Fix a segfault if no subscribers are present	2016-02-18 16:30:37 -08:00
Nathan Hjelm	69de442136	orterun: allow DDT if options contain :'s There is a bug in MPMD detection that disables totalview if a : is found anywhere on the command line. This includes inside an argument option or MCA variable value. This commit changes the check to look for the string " : " instead of the character : which should eliminate the issue in most cases. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-18 16:56:08 -07:00
Ralph Castain	1748f44147	Stop a segfault that results in zombied processes by checking for NULL prior to object release	2016-02-18 13:48:41 -08:00
Ralph Castain	60a7bc2e50	Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion. Fixes #1225	2016-02-18 09:29:12 -08:00
Nysal Jan K.A	cc9b1316a4	Make UD OOB memory registrations a multiple of page size If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK. If we only partially use a page, any data allocated on the remainder of the page will be inaccessible to the child process. Fixes open-mpi/ompi#1363	2016-02-17 22:19:49 -05:00
rhc54	dc4d3edc06	Merge pull request #1372 from rhc54/topic/sing Further enhance the support for Singularity containers.	2016-02-17 16:39:23 -08:00
Ralph Castain	8f9508cace	Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP.	2016-02-17 13:33:06 -08:00
Howard Pritchard	31841b4367	ras/alps: squelch common symbol warnings squelch a couple of warnings from the common symbols script. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-02-17 13:27:29 -06:00
Ralph Castain	e0de4423ba	Remove debug	2016-02-16 20:58:53 -08:00
Ralph Castain	50431001a3	Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO. Fix stdin reader	2016-02-16 18:54:38 -08:00
Mark Santcroos	14f0390b7d	Release child object when we are recording someone's relatives. (Thanks to Mark Santcroos!) Release routing list entries. (Thanks to Mark Santcroos!) Address some Coverity concerns	2016-02-15 20:50:42 -08:00
Ralph Castain	351070659e	Correct ordering when checking for privileged ports	2016-02-14 09:43:01 -08:00
rhc54	59cc1f0a96	Merge pull request #1357 from rhc54/topic/oob Protect against a non-privileged port connecting to us when we are running as root	2016-02-13 08:12:29 -08:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Ralph Castain	233bd085ca	Protect against a non-privileged port connecting to us when we are running as root Don't close the listener socket upon error unless we are giving up Cleanup the incoming socket	2016-02-13 08:07:27 -08:00


5362 commits