openmpi

Author	SHA1	Message	Date
Gilles Gouaillardet	e2c343cdfc	odls: plug memory leak as reported by Coverity with CID 710645	2016-09-07 10:08:44 +09:00
Gilles Gouaillardet	c09899f6af	plm: plug resource leaks as reported by Coverity with CIDs 72274 and 1196733	2016-09-07 10:08:44 +09:00
Josh Hursey	f6337f9eae	Merge pull request #2047 from jjhursey/topic/mixed-host2 orte: !FQDN implementation to use opal_net_isaddr	2016-09-06 13:08:54 -05:00
Ralph Castain	f85dcaee2a	Fixes CID 1369067 and CID 1196684 Fixes CID 1369648 Fixes CID 1372409	2016-09-06 08:43:15 -07:00
Artem Polyakov	74a11d7832	Fix session dir cleanup code.	2016-09-05 07:53:55 +03:00
Artem Polyakov	dc0ab674de	Add PMIx key to provide RM with ability to indicate that it will clean up session directories provided through OPAL_PMIX_TMPDIR, OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR	2016-09-05 07:48:44 +03:00
Artem Polyakov	81195ab724	Several fixes related to session directories: * enable OMPI to retrieve paths from RM through PMIx * cleanups related to tempdirs.	2016-09-05 07:48:44 +03:00
Ralph Castain	fb51d65049	Minor change: check for NULL before using the job map to avoid segfault when erroring out prior to creating the map	2016-09-04 07:53:12 -07:00
Joshua Hursey	fe937d1e82	orte: !FQDN implementation to use opal_net_isaddr * Switch to use opal_net_isaddr() for checking if a name is an IP address - as it is a bit cleaner, and uses common functionality.	2016-09-02 13:31:49 -05:00
Ralph Castain	4e0788e9ad	Enable PSM to support dynamic processes Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object	2016-09-02 10:22:04 -07:00
Ralph Castain	0ea1cff733	Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and disable it by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional	2016-09-01 13:10:10 -07:00
Gilles Gouaillardet	0b8c58298d	oob/usock: fix handling of orte_process_name_t * orte_process_name_t is aligned on 32 bits, so it cannot simply be cast to an int64_t. Use memcpy() instead. Thanks Paul Hargrove for the report	2016-09-01 13:18:02 +09:00
Ralph Castain	c1050bc01e	Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.	2016-08-31 09:32:07 -07:00
Ralph Castain	9b991bd1f5	Ensure that the "running" state is correctly updated It is possible that one or more procs could get thru PMIx_Init, and thus be marked as in state "registered", before all local procs have been started. If that happens, then we would report some of the procs in state "running", and the others in state "registered" - which means that the HNP would miss the "running" stage of the state machine. Thanks to Jingchao Zhang for his patience in tracking this down on the 2.0 branch	2016-08-30 19:24:39 -07:00
Josh Hursey	b0d8638824	Merge pull request #2015 from jjhursey/topic/mixed-hostnames orte: Expand use of !orte_keep_fqdn_hostnames MCA parameter	2016-08-29 09:14:54 -05:00
Ralph Castain	2f6e0fec90	Provide the number of nodes in the job	2016-08-26 14:50:41 -07:00
Joshua Hursey	d26dd2c20e	orte: Expand the application of !orte_keep_fqdn_hostnames * Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when it is set to false. * If that parameter is set to false (default) then short hostnames (e.g., `node01`) will match with the long hostnames (e.g., `node01.mycluster.org`). This allows a user (or resource manager) to mix the use of short and long hostnames. - Note that this mechanism does _not_ perform a DNS lookup, but instead strips off the FQDN by truncating the hostname string at the first `.` character (when not an IP address). - By default (`false`) the following is true: `node01 == node01.mycluster.org == node01.bogus.com` since we use `node01` as the hostname.	2016-08-26 16:09:04 -05:00
Artem Polyakov	55ac3b0be3	orte/schizo: fix binding detection in slurm component in SLURM 16.05 the SLURM_CPU_BIND_TYPE is equal to "mask_cpu:" instead of "mask_cpu". Account for that.	2016-08-26 09:55:52 +03:00
rhc54	19b0f4db9f	Merge pull request #1995 from rhc54/topic/pe-per-rank Change the behavior of cpus-per-rank.	2016-08-25 14:38:12 -05:00
Ralph Castain	440eae90ec	Correct the binding algorithm to decouple it from oversubscribe. Oversubscribe stipulates that we allow more procs on the node than assigned slots - it has nothing to do with the number of available pe's. Let overload directives handle the pe situation.	2016-08-24 21:17:22 -07:00
Gilles Gouaillardet	93e73841f9	ess/singleton: push all PMIX_* environment variables, regardless how many there are	2016-08-23 09:46:55 +09:00
Gilles Gouaillardet	a1e8e58a8a	ess/singleton: expects 4 PMIX_* environment variables or more	2016-08-23 09:34:03 +09:00
Ralph Castain	7de4d6922b	Change the behavior of cpus-per-rank. We previously counted each cpu against the #slots. However, IBM has pointed out that "slot" is equated to the number of processes allowed to run on each node, and not the number of cpus on the node. This has been a continuing source of confusion, so make the distinction a "hard" one. Each process occupies a "slot". We automatically set #slots = #cpus if nothing else is told to us. If you want to run more procs than slots, you must tell us to allow oversubscription. A process can utilize multiple pe's if that option is given. If you try to bind more than one proc to a given pe, then we will error out unless you tell us to allow overloading.	2016-08-22 15:54:41 -07:00
Jeff Squyres	71ec5cfb43	rsh: robustify the check for plm_rsh_agent default value Don't strcmp against the default value -- the default value may change over time. Instead, check to see if the MCA var source is not DEFAULT. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-08-16 06:58:20 -05:00
rhc54	d7cd802426	Merge pull request #1971 from rhc54/topic/sesdir Update the session dir structure. Restore the creation of a top-level…	2016-08-16 03:14:08 -05:00
Ralph Castain	ae2af61ee3	Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.	2016-08-15 22:46:46 -05:00
Ralph Castain	9f43db7303	Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods.	2016-08-15 07:51:36 -07:00
Ralph Castain	be8424b691	Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start. Add a contrib script to clean up permissions incorrectly modified due to things like smb mounts	2016-08-13 12:13:04 -07:00
Ralph Castain	08a0644df5	Fix shared memory rendezvous	2016-08-13 08:14:50 -07:00
rhc54	ddde154d28	Merge pull request #1962 from rhc54/topic/notify Ensure we properly convert pmix status to ORTE state before activatin…	2016-08-13 06:59:50 -07:00
Ralph Castain	48d35a9627	Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program	2016-08-12 21:14:29 -07:00
rhc54	9eed451916	Merge pull request #1960 from rhc54/topic/rsh Restore the rsh template creation code	2016-08-12 13:38:43 -07:00
rhc54	1ef3c86d44	Merge pull request #1931 from hjelmn/ess_fix ess/base: set up nidmap after pmix	2016-08-12 13:10:30 -07:00
Ralph Castain	5717b75b45	Restore the rsh template creation code	2016-08-12 12:43:40 -07:00
Ralph Castain	1c44543854	If the ssh agent hasn't been given, then check for qrsh and friends	2016-08-12 07:46:39 -07:00
Artem Polyakov	1351a7065c	ess/pmi: minor code readability cleanup. Split process name variable "name" to - "wildcard_rank" for the cases where wildcard is used. - "pname" for the case where reference to particular process is needed.	2016-08-06 15:45:19 +06:00
Nathan Hjelm	3c23502dfe	ess/base: set up nidmap after pmix This fixes a SEGV when the nidmap code attempts to use opal_pmix.store_local before pmix is set up. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-08-02 09:50:00 -06:00
Ralph Castain	71de03fc67	Cleanup the new naming requirements to ensure that info is correctly retrieved Cleanup permissions Restore singleton operations	2016-07-21 09:46:03 -07:00
Ralph Castain	01a653d50a	Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found". Remove stale file reference Restore autogen pass thru pmix Remove generated file	2016-07-20 00:58:19 -07:00
rhc54	2414244171	Merge pull request #1872 from rhc54/topic/continuous Add support for continuously operating applications	2016-07-13 15:29:31 -07:00
Ralph Castain	20a91c2baf	Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program. Cleanup debug message	2016-07-13 15:28:33 -07:00
Ralph Castain	ddd0d05de3	Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assign them.	2016-07-12 09:27:02 -07:00
Ralph Castain	ee56d9dc1a	Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field	2016-07-05 14:59:50 -07:00
Ralph Castain	5d330d5220	Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler. Add PMIx 2.0 Remove PMIx 1.1.4 Cleanup copying of component Add missing file Touchup a typo in the Makefile.am Update the pmix ext114 component Minor cleanups and resync to master Update to latest PMIx 2.x Update to the PMIx event notification branch latest changes	2016-06-14 13:08:41 -07:00
Ralph Castain	a6e6c37484	Remove stale map-reduce support	2016-06-12 07:41:57 -07:00
Ralph Castain	dd0f843843	Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE	2016-06-05 21:39:44 -07:00
Ralph Castain	0ba9572f9f	Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT Update alps component	2016-06-02 17:48:21 -07:00
Gilles Gouaillardet	5f565dfec3	configury: clean the flex generated .c files	2016-06-01 11:13:31 +09:00
Ralph Castain	3913595e10	Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster. Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.	2016-05-29 18:56:18 -07:00
Ralph Castain	ebe159acef	Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests If requested, obtain stacktraces for each application process and report it to stderr upon timeout stack traces: minor improvements - Also include the hostname and PID of each process for which we're sending the stack traces (vs. just including the ORTE process name) - Send a specific error message if we couldn't find "gstack" in the $PATH (e.g., on OS X) - Send a specific error message if gstack fails to run - Print a message that obtaining the stack traces may take a few seconds so that users don't wonder what's happening Signed-off-by: Jeff Squyres <jsquyres@cisco.com> help-orterun.txt: minor tweaks Trivial update: show "--timeout" (instead of "-timeout") in the help message, just to encourage the use of double-dash options. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> trivial: stacktrace -> stack trace Trivial word smithing. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-28 08:36:25 -07:00
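
Commit d26dd2c20e above describes how short and long hostnames are matched when `orte_keep_fqdn_hostnames` is false: the name is truncated at the first `.` unless it is an IP address, and no DNS lookup is performed. The following is a minimal Python sketch of that rule, not the ORTE implementation. The function names `short_hostname` and `hosts_match` are hypothetical, and Python's `ipaddress` module stands in for the `opal_net_isaddr()` check mentioned in fe937d1e82.

```python
import ipaddress


def short_hostname(name: str) -> str:
    """Return the name truncated at the first '.', unless it is an IP address.

    Hypothetical helper: mirrors the truncation rule described in the commit
    message, using ipaddress as a stand-in for opal_net_isaddr().
    """
    try:
        ipaddress.ip_address(name)
        return name  # IP addresses are left untouched
    except ValueError:
        pass  # not an IP address, so treat it as a hostname
    return name.split(".", 1)[0]


def hosts_match(a: str, b: str, keep_fqdn: bool = False) -> bool:
    """Compare two host names the way the commit message describes.

    keep_fqdn models the orte_keep_fqdn_hostnames parameter (default false).
    """
    if keep_fqdn:
        return a == b  # full names must match exactly
    return short_hostname(a) == short_hostname(b)


if __name__ == "__main__":
    print(hosts_match("node01", "node01.mycluster.org"))
    print(hosts_match("node01.mycluster.org", "node01.bogus.com"))
    print(hosts_match("10.0.0.1", "10.0.0.2"))
```

As the commit message notes, the default behavior treats `node01`, `node01.mycluster.org`, and `node01.bogus.com` as the same host, because only the part before the first `.` is compared.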


3962 commits