openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	9d6b929894	Fix uninitialized variable. Set exit codes for failed launch so we get pretty error messages Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-31 07:38:37 -07:00
Ralph Castain	26d96061aa	Roll in latest PMIx updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-30 21:35:35 -07:00
Ralph Castain	9f1f9d6606	Update to PMIx v2.0.0rc1 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-28 10:30:58 -07:00
Ralph Castain	9f60cd0fe7	Update the connect/accept support so we check to see if we have the proper infrastructure and RTE support, including whether we have ompi-server available if the connect/accept spans multiple applications. Print pretty help messages in all cases where we do not have support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-27 10:47:08 -07:00
Nathan Hjelm	33d59886e1	Merge pull request #3587 from hjelmn/event_abstraction pmix/pmix2x: fix errors in event abstration	2017-05-26 10:44:18 -06:00
Nathan Hjelm	a512b8962d	pmix/pmix2x: fix errors in event abstration Parts of the pmix2x component called the event_* functions directly instead of the opal_event_* wrappers. This is fine as long as we are using libevent but becomes a problem with other event libraries. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-05-26 09:49:11 -06:00
Ralph Castain	2f721a3366	Merge pull request #3585 from rhc54/topic/pmix20 Update to pmix v2.0beta	2017-05-26 06:05:44 -07:00
Ralph Castain	e1e264711a	Update to pmix v2.0beta Fix atomics - again Fix initialization of notification ring buffer Fix wait_sync definitions Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-26 03:33:18 -07:00
Ralph Castain	657e701c65	Add debug verbosity to the orte data server and pmix pub/lookup functions Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it). Remove unneeded test Fix memory corruption by re-initializing variable to NULL in loop Resolve the race condition identified by @ggouaillardet by resetting the mapped flag within the same event where it was set. There is no need to retain the flag beyond that point as it isn't used again. Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them. Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers. Have the mindist module add procs to the job's proc array as it is a fully described module Protect the hnp-not-in-allocation case Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-25 18:41:27 -07:00
Gilles Gouaillardet	026f3dd2dd	pmix2x: plug a misc memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-05-10 14:57:44 +09:00
Ralph Castain	0afcb1a448	Update to support server self-notifications Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-08 10:04:50 -07:00
Ralph Castain	ef0e0171c9	Implement the changes required to support cross-library coordination. Update PMIx to support intra-process notifications and ensure that we always notify ourselves for events. Add a new ompi/interlib directory where cross-lib coordination code can go, and put the code to declare ourselves there (called from ompi_mpi_init.c). Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-08 10:04:50 -07:00
Ralph Castain	3bca715780	Fix pmix configury so that libpmix is still emitted when --with-devel-headers is given, even under static builds Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-05 11:15:32 -07:00
Jeff Squyres	af336ac0e8	pmix/configure.m4: always use embedded mode Looks like embedded mode was mistakenly disabled when --with-devel-headers was specified. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-05-04 10:01:41 -07:00
Ralph Castain	8b1f01dfe6	Set the default modex parameters back to full blocking modex while we continue to test and debug the slow modex - it seems to be having issues on the Cray Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-22 15:19:46 -07:00
Ralph Castain	9fc3079ac2	Implement a background fence that collects all data during modex operation The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kickoff the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called. Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always out-perform the time spent doing a modex. However, this provides a decent compromise in the interim. This PR changes the default settings of a few relevant params to make "background modex" the default behavior: * pmix_base_async_modex -> defaults to true * pmix_base_collect_data -> continues to default to true (no change) * async_mpi_init - defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything. The logic in MPI_Init is: * if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation * if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed * if async_modex is not set, then we block until the fence completes (regardless of collecting data or not) * if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation. * if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case. HTH Ralph Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-21 10:29:23 -07:00
Ralph Castain	ffbfd22d84	Fix event registration - need to increment the event index and record the number of codes in the event handler Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-13 17:35:10 -07:00
Ralph Castain	9f73974fe1	Update to latest PMIx master, including disabling the pmi-1 and pmi-2 backward compatibility as these interfere with the s1,s2 components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-12 12:34:27 -07:00
Ralph Castain	95ae0d1df3	Cleanup timing macros for portability across compilers. Rename the --enable-timing configure option to be --enable-pmix-timing so it doesn't pickup external timing requests. Remove a stale function reference in PMIx so it can compile with timing enabled. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-10 12:56:38 +06:00
Ralph Castain	92c996487c	Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well. Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap Get the DVM running again Fix direct modex by eliminating race condition caused by releasing data while sending it Up the size limit before compressing Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 19:25:15 -07:00
Ralph Castain	2cc5fea8be	Update to PMIx v2.0alpha Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 10:02:29 -07:00
Ralph Castain	7dd34d0c9a	Use the correct callback data - the callback function was expecting a bool, not a pmix_ptl_sr_t. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-28 17:21:47 -07:00
Ralph Castain	35f817911e	Fix coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 08:09:46 -07:00
Ralph Castain	c0bcd11bcf	Fix permissions - no CI required Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 08:05:52 -07:00
Ralph Castain	55e4fba5f5	If we lose connection to the server after initiating a send/recv in PMIx (e.g., in PMIx_Abort), then we need to "resolve" all pending recvs to avoid hanging. Fixes #3225 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 02:53:21 -07:00
Ralph Castain	d645557fa0	Update to include the PMIx 2.0 APIs for monitoring and job control. Include required integration, but leave the monitors off for now. Move the sensor framework out of ORTE as it is being absorbed into PMIx Fix typo and silence warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 17:47:08 -07:00
Ralph Castain	4b6d220a83	You cannot include both pmi.h and pmi2.h as they have conflicting defines in them. Thanks to Kilian Cavalotti for pointing it out Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-19 11:53:54 -07:00
Ralph Castain	c6bc3ccb76	Sync to latest PMIx master and PMIx reference server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 12:50:38 -08:00
Ralph Castain	aca7091114	Fix some minor compatibility issues by ensuring job-level data gets stored against wildcard rank in the cray, s1, and s2 components, and that the ext1 component translates all wildcard rank requests into the peer's rank since v1.x of PMIx doesn't understand wildcard ranks Closes #3101 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-05 10:30:59 -08:00
Ralph Castain	1de72ff023	Silence an unnecessary error log Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-02 17:18:34 -08:00
Ralph Castain	e86a0dbf39	Update to PMIx master to include dlopen fixes and addition of libltdl support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-22 11:54:33 -08:00
Ralph Castain	8cffdcf127	Ensure that the pmix headers and lib get installed when --with-devel-headers is given so that PMIx applications can be built and executed against the "embedded" PMIx version Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-21 13:46:46 -08:00
Gilles Gouaillardet	bb2481a84b	pmix2x: synchronize to the latest PMIx master pmix/master@f57d9b2953 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-20 10:45:17 +09:00
Ralph Castain	f49118eaab	Fix some pmix configuration code Remove stale file reference that caused a check to always fail. Update psm2 function check to new libs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-16 10:54:47 -08:00
Howard Pritchard	b272f87926	Merge pull request #2968 from hjelmn/pmix_cray pmix/cray: performance improvements and cleanup	2017-02-16 11:41:59 -07:00
Ralph Castain	201f8571ca	Ensure we retain the peer object until we are done with it, then detect that the socket has closed due to a lost connection and cleanly release the message event Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 18:30:55 -08:00
Ralph Castain	9cd7349d7c	Instead of completely free'ing the event base, pause the PMIx progress thread before tearing down the infrastructure, and then release the event base at the end of the procedure. This allows any infrastructure objects holding events to delete them prior to free'ing the event base. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 05:02:43 -08:00
Ralph Castain	f7fe2f7189	Merge pull request #2977 from rhc54/topic/spawn Fix comm_spawn by registering nspace info only when needed	2017-02-15 04:31:54 -08:00
Ralph Castain	68b53e2179	Fix comm_spawn by registering nspace info only when needed - either when we have local procs, or when job-level info is required by connecting jobs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 19:47:56 -08:00
Ralph Castain	0c8609ca16	Update to newest PMIx master (includes configuration cleanups). Silence trivial Coverity warning in hwloc base. Cleanup a race condition segfault during finalize by ensuring the PMIx progress thread is stopped prior to starting to tear down the messaging components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 15:14:00 -08:00
Nathan Hjelm	3b912ea2a7	pmix/cray: performance improvements and cleanup Do not use opal_output_verbose inside O(n) loops. This was causing us to make O(n) calls to snprintf which was greatly slowing launch at scale. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-02-14 11:13:10 -07:00
Ralph Castain	35578b4009	Update to lastest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-13 23:19:26 -08:00
Gilles Gouaillardet	7acef4833e	pmix2x: Update to latest PMIx master pmix/master@6ed27be839 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-08 13:23:27 +09:00
KAWASHIMA Takahiro	4b2eba34a6	Merge pull request #2933 from kawashima-fj/pr/dstore-config-desc pmix/pmix2x: Correct configure option description	2017-02-08 13:03:27 +09:00
Jeff Squyres	100b112d3c	pmix: fix zlib protection macro usage It's possible that we can have zlib.h but still not have zlib support. Use the correct macro to protect the usage of calling zlib functions. This fixes 32-bit MTT builds at Cisco (e.g., https://mtt.open-mpi.org/index.php?do_redir=2389). Submitted upstream to PMIX: https://github.com/pmix/master/pull/290 Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-07 05:52:32 -08:00
KAWASHIMA Takahiro	750406f67b	pmix/pmix2x: Correct configure option description `--enable-pmix-dstore` option was enabled by default in `f4a5511`. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2017-02-07 11:52:56 +09:00
Ralph Castain	edcfdf2365	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-31 08:01:37 -08:00
Gilles Gouaillardet	b078e57e73	pmix/ext1x: fix misc memory leaks in namespace registration Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:42 +09:00
Gilles Gouaillardet	f51fc293a2	ext1x/pmix1x_client: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:42 +09:00
Gilles Gouaillardet	022cca79ea	pmix/ext1x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:36 +09:00
Gilles Gouaillardet	f485d12a82	pmix: rename the ext11 component into ext1x also use the same naming scheme thann pmix/ext2x Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:35 +09:00
Gilles Gouaillardet	dccb1899e6	pmix/ext11: correctly use PMIx_server_register_nspace() PMIx_server_register_nspace() is an asynchronous operation, so the pmix glue wait for it completes before returning. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 09:23:19 +09:00
Gilles Gouaillardet	6955e1e25c	pmix/ext11: fix compilation the argc field from the opal_pmix_app_t struct was removed, so adjust the pmix/ext11 glue accordingly. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 09:23:18 +09:00
Ralph Castain	3302864a7d	Cleanup a typo that can cause a segfault - use a local variable name different than the one passed into the function Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-27 16:49:25 -08:00
Gilles Gouaillardet	896434b1bd	pmix/ext2x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-26 14:07:15 +09:00
Gilles Gouaillardet	6b8e1c217c	pmix/ext2x: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-26 14:06:58 +09:00
Gilles Gouaillardet	142b95df87	pmix/ext2x: plug misc memory leaks regarding opal_pmix2x_event_chain_t handling Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-25 16:17:10 +09:00
Gilles Gouaillardet	7a3d39f079	pmix/ext2x: plug a memory leak in _reg_nspace() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-25 16:17:01 +09:00
Gilles Gouaillardet	189da7fdab	pmix2x: plug a memory leak in _event_hdlr() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:30 +09:00
Gilles Gouaillardet	acbc32d3b2	pmix2x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Gilles Gouaillardet	b5b21043c4	pmix2x: plug a memory leak in _reg_nspace() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Gilles Gouaillardet	0f47310a75	pmix2x/pmix2x_client: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Ralph Castain	8c960bae8d	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 07:07:40 -08:00
Ralph Castain	e568b211e4	Silence Coverity CID 1398541 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-10 15:30:50 -08:00
Gilles Gouaillardet	6d59b476de	Merge pull request #2686 from ggouaillardet/topic/pmix2x_ptl_base_sendrecv pmix2x: ptl/base: send header and message data together via writev()	2017-01-10 16:26:10 +09:00
Gilles Gouaillardet	44c1ff60f1	Merge pull request #2672 from ggouaillardet/topic/misc_memory_leaks Plug misc memory leaks	2017-01-10 13:16:04 +09:00
Gilles Gouaillardet	a01960bee5	pmix2x: ptl/base: send header and message data together via writev() on Linux, sending the header and then the message data does severely impact performances of ptl/tcp : on the receiver, reading the data can often result in an PMIX_ERR_RESOURCE_BUSY or PMIX_ERR_WOULD_BLOCK, which ends up degrading performances) this commit send both header and message data at the same time via writev() and makes ptl/tcp virtually as efficient as ptl/usock. Short writev generally occur when the kernel buffer is full, so there is no point for retrying in this case. fwiw, no such degradation was observed on OSX. Refs open-mpi/ompi#2657 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-10 13:07:39 +09:00
Ralph Castain	67fce2861b	Merge pull request #2685 from rhc54/topic/cov Resolve Coverity issues	2017-01-07 13:11:40 -08:00
Ralph Castain	e25e69dc2f	Resolve Coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-07 10:45:52 -08:00
Ralph Castain	822e2680ba	Cleanup some configure stuff for static builds - still can't get wrapper extra libs to be recognized Signed-off-by: Ralph Castain <rhc@open-mpi.org> pmix2x: minor configure updates Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-07 08:37:36 -08:00
Ralph Castain	444f5fa35d	Raise the priority of the usock component so it gets preferentially picked Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 22:53:04 -08:00
Gilles Gouaillardet	6ef281e163	pmix/base: fix misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:35 +09:00
Ralph Castain	6509f60929	Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node. Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given. Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example: $ mpirun -npernode 2 ./mpi_memprobe Sampling memory usage after MPI_Init Data for node rhc001 Daemon: 12.483398 Client: 6.514648 Data for node rhc002 Daemon: 11.865234 Client: 4.643555 Sampling memory usage after MPI_Barrier Data for node rhc001 Daemon: 12.520508 Client: 6.576660 Data for node rhc002 Daemon: 11.879883 Client: 4.703125 Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 10:32:17 -08:00
Ralph Castain	91d714fe93	Add flags to direct PMIx to only use one listener, but without directing which one (tcp or usock) to use. This allows the user to set PMIX_MCA_ptl in their environment to select the transport method. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:44 -08:00
Ralph Castain	f355fb926d	Continue cleanup of notifications. Resolve a race condition that can result in attempt to send a message on a closed socket Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:33 -08:00
Ralph Castain	9eab9a1ed3	Remove stale global variables Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers. Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation). Begin first cut at memory profiler Some minor cleanups of memprobe Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-02 14:04:24 -08:00
Ralph Castain	e8aea2ebfc	Minor cleanups Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 16:19:42 -08:00
Ralph Castain	08c76a42bb	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org> Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with. Signed-off-by: Ralph Castain <rhc@open-mpi.org> Protect users of hwloc membind functions Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update PMIx to include NULL string protection Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update to PMIx master to include key overwrite protection Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 12:44:47 -08:00
Ralph Castain	fe68f23099	Only instantiate the HWLOC topology in an MPI process if it actually will be used. There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced: * shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string instead of the topology itself, if available, thus avoiding instantiating the topology * openib BTL. This uses the distance matrix. At present, I haven't developed a method for replacing that reference. Thus, this component will instantiate the topology * usnic BTL. Uses the distance matrix. * treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate the topology * ess base functions. If a process is direct launched and not bound at launch, this code attempts to bind it. Thus, procs in this scenario will instantiate the topology Note that instantiating the topology on complex chips such as KNL can consume megabytes of memory. Fix pernode binding policy Properly handle the unbound case Correct pointer usage Do not free static error messages! Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-29 10:33:29 -08:00
Ralph Castain	3a2d6a5ab6	Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-28 09:14:26 -08:00
Ralph Castain	d3aa3777f3	Per @jsquyres: avoid mangling user-provided CFLAGS by using the new PMIX_FLAGS_UNIQ autoconf script in place of PMIX_UNIQ Refs #2636 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 09:00:59 -08:00
Gilles Gouaillardet	22db1d36b6	pmix2x: silence misc warnings Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-26 13:35:17 +09:00
Ralph Castain	ea133206ec	Sync the internal OMPI component to PMIx master Update external PMIx v2.x component Add missing Makefile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 19:14:16 -08:00
Ralph Castain	256b5adac5	Transfer across final fixes from debugger attach work Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 00:34:27 -08:00
Ralph Castain	c6f6f40529	Transfer debugger support changes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 18:14:46 -08:00
Ralph Castain	269753f5c1	Transfer back changes from debugger attach work Silence warning Remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 10:00:52 -08:00
Ralph Castain	215d6290e0	Add a flux component for LLNL Fine tuning of flux component Fix a few minor issues with the initial cut: * Job id could be obtained from the PMI kvsname like SLURM, but simpler to getenv (FLUX_JOB_ID) * Flux pmi-1 doesn't define PMI_BOOL, PMI_TRUE, PMI_FALSE * Flux pmi-1 maps the deprecated PMI_Get_kvs_domain_id() to PMI_KVS_Get_my_name() internally, so just call that instead. * Drop residual slurm references. Add wrappers for PMI functions so that if HAVE_FLUX_PMI_LIBRARY is not defined, the component can dlopen libpmi.so at location specified by the FLUX_PMI_LIBRARY_PATH env variable, which adds flexibility. If HAVE_FLUX_PMI_LIBRARY is defined, link with libpmi.so at build time in the usual way. Update configury for flux component Update m4 so the configure options work as follows: --with-flux-pmi Build Flux PMI support (default: yes) --with-flux-pmi-library Link Flux PMI support with PMI library at build time. Otherwise the library is opened at runtime at location specified by FLUX_PMI_LIBRARY_PATH environment variable. Use this option to enable Flux support when building statically or without dlopen support (default: no) If the latter option is provided, the library/header is located at build time using the pkg-config module 'flux-pmi'. Otherwise there is no library/header dependency. Handle the case where ompi is configured with --disable-dlopen or --enable-statkc. In those cases, don't build the component unless --with-flux-pmi-library is provided. It is fatal if the user explicitly requests --with-flux-pmi but it cannot be built (e.g. due to --disable-dlopen). Add a schizo/flux component Update schizo/flux component Eliminate slurm-specific usage cases. Since the module is only loaded if FLUX_JOB_ID is set, there are only two cases to handle: 1) App was launched indirectly through mpirun. This is not yet supported with Flux, but hook remains in case this mode is supported in the future. 2) App was launched directly by Flux, with Flux providing CPU binding, if any. Fix up white space in pmix/flux component Drop non-blocking fence from pmix:flux component The flux PMI-1 library is not thread safe, therefore register a regular blocking fence callback instead of the thread-shifting fencenb(). pmix/flux component avoids extra PMI_KVS_Gets Keys stored into the base cache under the wildcard rank are not intended to be part of the global key namespace. These keys therefore should not trigger a PMI_KVS_Get() if they are not found in the cache. Minor pmix/flux component cleanup pmix/flux: drop code for fetching unused pmix_id pmix/flux: err_exit must return error Problem: in flux_init(), although 'ret' (variable holding err_exit return code) is initialized to OPAL_ERROR, the variable is reused as a temporary result code, so if there are some successes followed by a failure that doesn't set 'ret', flux_init() could return success with PMI not initialized. Ensure that a "goto err_exit" returns OPAL_ERROR if 'ret' is not set to some other error code. pmix/flux: don't mix OPAL_ and PMI_ return codes Problem: flux_init() can return both PMI_ and OPAL_ return codes. Although OPAL_SUCCESS and PMI_SUCCESS are both defined as 0, other codes are not compatible. Ensure that flux_init() consistently uses 'rc' for PMI_ return codes and 'ret' for OPAL_ return codes. pmix/flux: factor out repeated code for cache put Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-16 18:26:38 -08:00
rhc54	00b87ea829	Merge pull request #2584 from rhc54/topic/warnings Reduce the flood of warnings due to uninitialized variables, mismatch…	2016-12-15 10:09:01 -08:00
Ralph Castain	585540bcee	Reduce the flood of warnings due to uninitialized variables, mismatched types, and unused things to a more bearable trickle Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 16:33:50 -08:00
Gilles Gouaillardet	a019095b84	pmix2x/class: correctly handle concurrent class initialization (back-ported from upstream commit pmix/master@ceedbd67fd) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-15 09:07:24 +09:00
Ralph Castain	884fb7fcf2	Update the PMIx2 support to include the latest shared memory optimizations Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn Update to track master Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 15:00:10 -08:00
Ralph Castain	1961a1c22a	Use the server tmpdir instead of the system tmpdir for tool contact files Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 08:42:09 -08:00
Ralph Castain	fbed2d794a	Update to latest PMIx master + PTL branch Update the usock component to disable it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 20:47:44 -08:00
Ralph Castain	d51821cbc7	Update pmix check headers to support Open BSD Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 19:37:06 -08:00
Ralph Castain	f91f8ce494	Protect against NULL param Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 21:19:32 -08:00
Ralph Castain	f633d5c1b1	Initialize var Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 21:08:45 -08:00
rhc54	003f7d308f	Merge pull request #2504 from hppritcha/topic/fix_pmix_base_help_file pmix: Fix pmix base help file.	2016-12-03 11:57:07 -08:00
Ralph Castain	a43fae74a5	Avoid hanging in show_help if PMIx was unable to initialize Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 10:12:58 -08:00
Ralph Castain	a4e3f615e3	Fix executable modes of pmix config scripts Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 20:50:59 -08:00
Ralph Castain	342dbfcf4e	Update Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 20:33:03 -08:00
Ralph Castain	8e64382edf	Update to correct tarball Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 19:16:07 -08:00
Ralph Castain	1a0bccb536	Now that PMIx has settled on its release strategy and numbering, update the OPAL pmix framework to track Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 15:44:43 -08:00
Howard Pritchard	3049848731	Fix pmix base help file. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-12-02 15:03:22 -06:00
Ralph Castain	6041467df0	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-01 14:47:44 -08:00
Howard Pritchard	eee9f7ae3a	pmix/cray: abort job if using aprun for general case It turns that there is an incompatibility between the Cray PMI library and the default configuration for building Open MPI (master). To work around this, we now disable use of aprun for direct launch of Open MPI jobs except under specific conditions. The problem is that there are now (on master) packages getting initialized that do not work properly across a fork operation. As part of a constructor in the Cray PMI library, a fork operation is done to simplify use of shared memory between the processes in a job on the same node. This ends up thoroughly messing up the Open MPI initialization process in the case that dlopen support is enabled. The initialization process gets about half-way through when the PMIX framework is opened and components are loaded, which triggers the Cray PMI constructor and hence the fork operation. There are two workarounds for this: 1) configure Open MPI for Cray XE/XC systems using aprun with the --disable-dlopen option 2) set the PMI_NO_FORK environment variable in the shell in which the aprun command is run. Without taking these measures, a Open MPI job will just hang at job startup in the first attempt to "thread-shift" the PMIx fence_nb operation. Additional hangs occur at shutdown if this problem is worked around, again due to the insertion of a fork operation halfway through the Open MPI initialization procedure. This commit detects if the conditions that bring out the hang situation are present, and if so, prints out a message and aborts the job launch. Note on systems using slurm, the PMI_NO_FORK environment variable is set as part of the srun job launch, hence this issue is avoided on those systems. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-25 06:28:19 -07:00
Howard Pritchard	2cbc0e8472	pmix/cray: fix disable-dlopen problem PR open-mpi/ompi#2432 introduced a regression where configure and build with --disable-dlopn caused build failure owing to unresolved alps lli symbols in the libopal-pal shared library. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-21 13:45:10 -06:00
Howard Pritchard	0bbb319246	Merge pull request #2444 from hppritcha/topic/cray_pmix_ws_cleanup pmix/cray: whitespace cleanup	2016-11-21 06:03:56 -07:00
Howard Pritchard	08dce4f161	pmix/cray: whitespace cleanup Get rid of tabs. This is anti-ompi style. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-18 19:30:40 -07:00
Howard Pritchard	de3de131af	pmix/cray: set some envars for MPI_INFO_ENV object Enhance the cray pmix component to set some OMPI internal env. variables used to set some key/value pairs on the MPI_INFO_ENV object. This allows more of the ompi-tests ibm unit tests to pass when using aprun/srun direct launch and Cray PMI. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-16 17:37:52 -06:00
Howard Pritchard	703b464c03	pmix: fix a typo in a help file Fixes #2391 Thanks to @njoly for reporting Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-12 11:49:15 -07:00
Ralph Castain	64873487b4	Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-01 12:17:11 -07:00
Ralph Castain	f4a55118e6	Update to latest PMIx master - mostly updates example codes, but includes one critical cleanup during finalize. NOTE: set dstore shared memory storage option to "on" by default Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-27 11:16:03 -07:00
Ralph Castain	f298f294e1	Update PMIx to latest master tarball. Ensure we set the HNP name for orted's so that PMIx_Lookup can find the server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-26 15:48:56 -07:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Ralph Castain	9131eca9c6	Update to latest PMIx master	2016-10-20 21:13:40 -07:00
Ralph Castain	2f966bf3bf	Cleanup external PMIx v3 component for copy/paste errors - component and module require unique names	2016-10-20 09:11:46 -07:00
Ralph Castain	8113a8d1b0	Now that we are hiding symbols in the internal PMIx component, we cannot reuse that component for integration to the external PMIx master as the symbols don't match. So create a new "ext3x" component and copy the PMIx v3 integration over there. Also, remove a couple of build-product files from the pmix3x component.	2016-10-18 13:15:32 -07:00
Ralph Castain	50c9f3de55	Ensure the PMIx progress thread is stopped prior to tearing anything down. Thanks to Gilles for spotting this error!	2016-10-18 00:27:52 -07:00
Ralph Castain	6f65d0a173	Repair event notification support. Cleanup the long-suffering "epoll: warning" coming out of libevent whenever a process abnormally terminated. Add changes to test program Sync to PMIx master	2016-10-13 16:27:39 -07:00
Ralph Castain	6417f217e1	Turn PMIx dstore off by default as MTT was effectively broken	2016-10-13 08:14:51 -07:00
Ralph Castain	8f05beb1ec	Sync pmix/master@cb53105	2016-10-11 20:54:59 -07:00
Ralph Castain	6ce4b6d098	Eliminate -Wall from being hardcoded	2016-10-11 12:50:31 -07:00
Ralph Castain	1859b03416	Enable PMIx shared memory support by default	2016-10-11 12:18:01 -07:00
Ralph Castain	1d7d7c201b	Update PMIx support to latest PMIx master	2016-10-11 10:17:23 -07:00
Ralph Castain	5b1484a836	Implement the backend support for process-generated event notification	2016-10-08 09:24:28 -07:00
Gilles Gouaillardet	f1f1fb15eb	pmix3x: configury: output major, minor and release version after checking them and hence fix the configure output (back-ported from upstream commit pmix/master@7b7cdda2de)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	f3af799608	pmix3x: misc fixes to get pmix build on Solaris - replace MAXHOSTNAMELEN with hardcoded 1024. unlike Linux, Solaris #define MAXHOSTNAMELEN in <netdb.h>, so use a hard coded value to keep the test simpl - stdout cannot be assigned on Solaris, so use freopen instead (back-ported from upstream commit pmix/master@a63f6e53f4)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	5cbfddb8f1	pmix3x: fix misc memory leaks (back-ported from upstream commit pmix/master@1eff526929)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	b4e4e4a5f1	pmix3x: enhance pmix_nspace_t destructor PMIX_RELEASE all elements stored in the internal and modex hash tables (back-ported from upstream commit pmix/master@b90674fc52)	2016-10-08 13:01:27 +09:00
Gilles Gouaillardet	f1dc033767	pmix3x: add the PMIX_HASH_TABLE_FOREACH macro this is a convenience macro similar to the PMIX_LIST_FOREACH macro, that can be used to iterate on all the key/value pairs of a pmix_hash_table_t (back-ported from upstream commit pmix/master@349971c68c)	2016-10-08 13:01:27 +09:00
Gilles Gouaillardet	7601e783cc	pmix3x: sec/munge: add a missing include file (cherry picked from upstream pmix/master@f7cfb11f6b)	2016-10-03 16:09:10 +09:00
Ralph Castain	e773c17cf3	Put show_help thru the PMIx "log" API. This pushes the show_help output from apps into the pmix thread, thus avoiding conflicts in the RML thread, which should help with thread lock situations.	2016-10-02 16:02:23 -07:00
Gilles Gouaillardet	871ade9231	pmix/{cray,s1,s2}: make pmi_opcaddy_t class static theses three pmix components use the same class name, declare it as static so Open MPI can be built with --disable-dlopen Thanks Limin Gu for the report	2016-09-28 09:18:36 +09:00
Gilles Gouaillardet	1fbc9a5431	pmix3x: dstore/pmix: flock portability Using the fcntl-locking instead of the flock (back-ported from upstream pmix/master@3030a0cca1)	2016-09-27 13:21:03 +09:00
Gilles Gouaillardet	362a5886de	pmix3x: client: fix PMIx_Finalize() sequence pmix_progress_thread_finalize() invokes libevent event_base_free, so all libevent stuff cannot be used after. Hence, pmix_client_globals.myserver must be PMIX_DESTRUCT'ed before invoking pmix_progress_thread_finalize()	2016-09-24 00:01:23 +09:00
Gilles Gouaillardet	5479c6cca7	pmix3x: add missing #include and get Open MPI build on OpenBSD 6.0	2016-09-23 11:23:18 +09:00
Gilles Gouaillardet	fbf03299c3	Merge pull request #2079 from ggouaillardet/topic/pmix_configury_dlopen pmix3x: configury: correctly handle --disable-dlopen	2016-09-21 10:59:33 +09:00
Gilles Gouaillardet	6c1e25b76e	pmix/ext11: fix pmix1_value_unload() prototype and call pmix1_value_unload() was added a "key" argument which is unused, and pmix1_value_unload() was sometimes invoked with two arguments instead of three. since the "key" argument is unused, simply remove it from the subroutine prototype and calls.	2016-09-20 14:34:41 +09:00
Gilles Gouaillardet	041a431966	pmix3x: configury: correctly handle --disable-dlopen the LT_* macros do overwrite the enable_dlopen variable, so it must be tested and saved before invoking LT_INIT. delay the invokation of the LT_* macros and use the PMIX_ENABLE_DLOPEN_SUPPORT variable to figure out whether --disable-dlopen was invoked	2016-09-15 13:26:20 +09:00
Artem Polyakov	dc0ab674de	Add PMIx key to provide RM with ability to indicate that it will cleanup session directories provided at through OPAL_PMIX_TMPDIR, OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR	2016-09-05 07:48:44 +03:00
Ralph Castain	34f04a7924	Remove spurious Makefile.am line	2016-09-01 15:31:09 -07:00
Ralph Castain	0ea1cff733	Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disable by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional	2016-09-01 13:10:10 -07:00
Ralph Castain	39992d1ad7	Silence trivial Coverity warnings	2016-08-31 09:42:33 -07:00
Ralph Castain	cfa784c9a6	Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking	2016-08-29 20:22:24 -07:00
rhc54	b90a64e734	Merge pull request #2022 from rhc54/topic/nnodes Provide the number of nodes in the job	2016-08-26 18:15:24 -05:00
Ralph Castain	2f6e0fec90	Provide the number of nodes in the job	2016-08-26 14:50:41 -07:00
Jeff Squyres	e03a40a0e9	pmix3x: remove generated file Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-08-26 10:30:47 -07:00
Gilles Gouaillardet	e4bf915e75	pmix3x: remove auto-generated file remove opal/mca/pmix/pmix3x/pmix/src/include/pmix_config.h.in .gitignore is correct, so it seems this file was added before .gitignore was updated	2016-08-26 15:00:18 +09:00
Ralph Castain	af67f16422	Update configury to support multiple PMIx versions, rename pmix2x component to pmix3x for support of PMIx master Update support for external v1.1.x and v2.x libraries. Minor corrections to the v3.x component	2016-08-25 18:19:05 -07:00
Gilles Gouaillardet	02847d9e7b	pmix2x: dstore: add missing <fcntl.h> include file in pmix_esh.c (back-ported from upstream pmix/master@5c66ffe0f0)	2016-08-24 11:18:46 +09:00
Gilles Gouaillardet	c11e8163f8	pmix2x: sec/native: fix the pmix_native module under solaris by using getpeerucred() and fail with a user friendly message if no method is available: "sec: native cannot validate_cred on this system" (back-ported from upstream pmix/master@c474a1fc60)	2016-08-24 11:18:40 +09:00
Gilles Gouaillardet	e91292aa41	pmix2x: configury: add missing check for <netdb.h> header file (back-ported from upstream pmix/master@e54ce6d423)	2016-08-24 11:18:32 +09:00
Ralph Castain	639dbdb7ea	For maintainability, fold the external PMIx 2.x integration into the internal PMIx 2.x library component. This ensures that we always stay in sync with the two as that is becoming a problem.	2016-08-22 13:28:55 -07:00
Ralph Castain	61ffba668b	Roll in the latest PMIx version - includes shared memory datastore and reduced memory footprint	2016-08-20 07:53:06 -07:00
Artem Polyakov	6ea8cccdab	Merge pull request #1969 from artpol84/pmix_jobid_fix Pmix jobid fix	2016-08-18 17:24:58 +07:00
Ralph Castain	7da9793fef	Support the PMIX_TIMEOUT key at the PMIx server when timeout=0 - this indicates that the user doesn't want a lookup of any data from the host RM.	2016-08-17 16:26:58 -05:00
Gilles Gouaillardet	3126ff77e2	pmix2x: common syms: whitelist bison-generated common symbols Bison generates some common symbols that we can't do anything about, so whitelist them.	2016-08-16 11:29:06 +09:00
Artem Polyakov	c5a91c5c9d	opal/pmix: fix pmix jobid calculation if external PMIx server is used.	2016-08-15 21:13:51 +03:00
Ralph Castain	ecbedee8bb	Fix typo	2016-08-15 07:32:00 -07:00
Artem Polyakov	f3c816b52e	opal/pmix: fix indentation in some files.	2016-08-15 18:21:50 +07:00
Gilles Gouaillardet	483685eb6a	update .gitignore remove autogenerated opal/mca/pmix/pmix2x/pmix/src/include/pmix_config.h.in	2016-08-15 17:00:20 +09:00
Ralph Castain	be8424b691	Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start. Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts dd	2016-08-13 12:13:04 -07:00
rhc54	ddde154d28	Merge pull request #1962 from rhc54/topic/notify Ensure we properly convert pmix status to ORTE state before activatin…	2016-08-13 06:59:50 -07:00
Ralph Castain	48d35a9627	Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program	2016-08-12 21:14:29 -07:00
Ralph Castain	4a4c9703a9	Setup the job list in the PMIx integration so that static ports can run	2016-08-12 13:27:10 -07:00
Ralph Castain	1d44f0c0e2	Silence Coverity warnings	2016-08-11 21:22:01 -07:00
Ralph Castain	73544d2e00	Rename symbol	2016-08-11 13:06:46 -07:00
Ralph Castain	b0cc9b0bc8	Update to latest PMIx toolext branch Fix indentations Update the ext20 component to match latest PMIx master. Cleanup name conflicts and uninit vars	2016-08-11 12:29:48 -07:00
Ralph Castain	527b5c692a	Update to include extended tool support, new datatypes	2016-08-08 13:39:46 -07:00
Artem Polyakov	b24ec3e3b9	pmix/s2: fix indentation (only)	2016-08-06 16:31:19 +06:00
Artem Polyakov	2cb923a413	pmix/s1: fix indentation (only)	2016-08-06 16:30:45 +06:00
Artem Polyakov	8aa3ef7799	pmix/s2: fix s2 component data placement Use wildcard for the information related to the job-level data. Fixes s2 component with regard to PR https://github.com/open-mpi/ompi/pull/1897.	2016-08-06 15:49:16 +06:00
Artem Polyakov	81063f1717	pmix/s1: fix s1 component data placement Use wildcard for the information related to the job-level data. Fixes s1 component with regard to PR https://github.com/open-mpi/ompi/pull/1897.	2016-08-06 15:45:46 +06:00
Gilles Gouaillardet	30f98cd9d0	pmix: redefine OPAL_PMIX_ARCH macro Architecture is set by the ompi layer after job startup, so the key cannot have the "pmix" prefix since optimizations in open-mpi/ompi@01a653d50a otherwise architecture cannot be retrieved	2016-08-04 13:31:28 +09:00
Gilles Gouaillardet	21e7f31dbe	pmix2x: fix unpack sequence in PMIx_Get callback first unpack the nspace (PMIX_STRING) before unpacking the various keys (PMIX_KVAL)	2016-08-01 14:21:22 +09:00
Ralph Castain	16fccd4964	Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives	2016-07-29 09:52:36 -07:00
Ralph Castain	cacb582ecd	Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application	2016-07-28 14:09:06 -07:00
Howard Pritchard	b65bbe017f	pmix/cray: switch to using wildcards for some items so that at least srun native launch on cray works again. More issues to fix when using alps. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-07-26 17:07:58 -05:00
Ralph Castain	71de03fc67	Cleanup the new naming requirements to ensure that info is correctly retrieved Cleanup permissions Restore singleton operations	2016-07-21 09:46:03 -07:00
Ralph Castain	2b55ee8118	Cleanup Coverity warnings	2016-07-20 20:31:58 -07:00
Ralph Castain	01a653d50a	Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found". Remove stale file reference Restore autogen pass thru pmix Remove generated file	2016-07-20 00:58:19 -07:00
Nathan Hjelm	03bce91de8	pmix/pmix2x: add missing increment in loop This commit fixes a bug in the pmix2x client code where a loop variable is not correctly incremented. This was leading to hangs and crashes when creating intercommunicators. Also fixed two double increments in other loops. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-07-18 10:35:05 -06:00
Jeff Squyres	72f41d4490	pmix: replace all tabs with spaces No code or logic changes Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-07-17 15:08:33 -04:00
Jeff Squyres	1c32742c66	pmix_ext20: fix syntax error Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-07-17 15:04:12 -04:00
Ralph Castain	99f7096031	Fix permissions	2016-07-16 21:03:55 -07:00
Ralph Castain	d4071fbd1c	Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling	2016-07-16 13:20:41 -07:00
Ralph Castain	1ceb35ba5c	Fix singletons - do not include the PMIx tool URI in the environment provided to child processes	2016-07-13 17:33:34 -07:00
Ralph Castain	20a91c2baf	Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program. Cleanup debug message	2016-07-13 15:28:33 -07:00
Artem Polyakov	72585a905f	opal/pmix: add blocking Fence to SLURM components. Blocking fence is used in yalla del proc. Native pmix exposes this functionality. We need to expose it for SLURM's s1/s2 components as well. Also this commit fixes uninitialized `rc` in fencenb's of both components.	2016-07-11 09:43:15 +03:00
Artem Polyakov	8e16f47492	Merge pull request #1688 from artpol84/fix_base64 Fix base64 implementation in pmix framework.	2016-07-07 10:47:50 +06:00
Gilles Gouaillardet	acda07472a	configury: revamp and re-ident sub configure.m4 after open-mpi/ompi@846360fd4c	2016-07-06 11:59:51 +09:00
Gilles Gouaillardet	846360fd4c	configury: correctly perform make distclean when {libevent,hwloc,pmix} are external components Thanks Jeff for the guidance Fixes open-mpi/ompi#1683 note: in order to keep this commit easy to review, some AS_IF([...]) were replaced with AS_IF([false], ...) or AS_IF_([true], ...) these will be removed and re-idented in a subsequent commit	2016-07-06 11:57:24 +09:00
Ralph Castain	ee56d9dc1a	Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field	2016-07-05 14:59:50 -07:00
Ralph Castain	7e0af3f4f0	Update pmix2x to track upstream changes	2016-07-05 11:54:22 -07:00
Gilles Gouaillardet	267821f0dd	pmix2x/pmix: fix a typo in PMIx_tool_init() and remove now useless local variable i	2016-07-05 13:47:50 +09:00
Gilles Gouaillardet	efce8cc734	pmix2x/pmix: add missing include files pmix cannot be built on alpine linux because of some missing includes. uid_t and gid_t are defined in unistd.h or sys/types.h, and unistd.h is not indirectly pulled under alpine linux, so do it manually. Thanks N.L.K Nguyen for the report (back-ported from upstream pmix/master@c8d55350a9)	2016-07-05 09:03:14 +09:00
Ralph Castain	c9ada8e095	Silence Coverity warnings	2016-07-03 20:45:08 -07:00
Ralph Castain	673f82e2b6	Update the PMIx listener to avoid leaking sockets into children, and better handle race condition errors	2016-07-03 08:23:33 -07:00
Ralph Castain	6e434d6785	Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information Update to match PMIx RFC Fix configury to point to correct libevent and hwloc locations	2016-06-29 19:19:19 -07:00
Ralph Castain	08b1438f15	Add missing PMIx range value so OPAL and PMIx align again	2016-06-22 22:03:25 -07:00
Gilles Gouaillardet	bf133c401e	pmix2x: fix a typo in dereg_event_hdlr() This bug has been fixed when open-mpi/ompi@dde69e1be2 was backported into upstream pmix in pmix/master@5e5577778c but it was not fixed in open-mpi/ompi	2016-06-22 13:45:29 +09:00
Ralph Castain	441739b5a4	Cleanup a lagging message that generates an annoying (but seemingly harmless) warning	2016-06-20 12:23:27 -07:00
Ralph Castain	0ba02821e6	Add requested key and job-level info	2016-06-19 18:22:31 -07:00
Ralph Castain	0a29f5cb77	Sigh - missed two typos	2016-06-18 20:57:53 -07:00
Ralph Castain	dd38cf1fed	Fix typo	2016-06-18 20:56:43 -07:00
Ralph Castain	dde69e1be2	Cleanup CIDs 1362763, 1362762, 1362760, 1362759, 1362758, 1362757, 1362756, 1362755, 1362754. Unsure how to resolve 1362761. Fixes #1792	2016-06-18 12:28:46 -07:00
Ralph Castain	044c561cba	Roll to latest PMIx master	2016-06-16 17:30:30 -07:00
Ralph Castain	5d330d5220	Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler. Add PMIx 2.0 Remove PMIx 1.1.4 Cleanup copying of component Add missing file Touchup a typo in the Makefile.am Update the pmix ext114 component Minor cleanups and resync to master Update to latest PMIx 2.x Update to the PMIx event notification branch latest changes	2016-06-14 13:08:41 -07:00
Ralph Castain	d58da99dbc	Shift to memcpy to avoid Solaris issues	2016-06-09 12:07:17 -07:00
Ralph Castain	8fa935534b	Abstract the strnlen function for environments that do not have it (e.g., Solaris 10)	2016-06-08 10:12:43 -07:00
Gilles Gouaillardet	b707d138fe	pmix114/pmix1_client: fix misc memory leaks Fixes CID 1325146-1325149	2016-06-06 09:33:35 +09:00
Ralph Castain	ecea1e3bb5	Update to 1.1.4rc3	2016-06-01 20:56:07 -07:00
Ralph Castain	12ecf972af	Split the pmix external component into one for the 1.1.4 release, and another for the upcoming 2.0 release. Clean up the configury so the components look for a series-specific function instead of running a program. NOTE: the changes for the 2.0 series are not yet in the PMIx master.	2016-06-01 14:15:24 -07:00
Ralph Castain	55923eacd3	Stealing some pieces of Josh Hursey's PR #1583 and modifying a bit, allow the opal/pmix external component to handle both PMIx 1.1.4 and PMIx 2.0 versions. Automatically detect the version of the target external library and adjust the only two APIs that changed (PMIx_Init and PMIx_Finalize) Rename temp vars in .m4 to avoid conflict with Travis	2016-05-27 08:06:31 -07:00
Artem Polyakov	725eea2819	Fix base64 implementation in pmix framework. In the commit `80f07b65f1` setting of '-' marker used as the string termination sign was moved from base64 code: from: `80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67L491)` to: `80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67R189)` However the decoding function wasn't fixed and still expects on extra byte at the end of the encoded string which leads to data truncation during extraction (was noticed on standalone code that was using base64 from OMPI).	2016-05-23 23:30:31 +06:00
Gilles Gouaillardet	8466a3daf3	pmix: update .gitignore git ignore opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in git rm opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in git ignore opal/mca/pmix/pmix*/...	2016-05-23 11:58:07 +09:00
Gilles Gouaillardet	cbbdce05b1	pmix/pmix114: silence a warning	2016-05-20 09:35:26 +09:00
Ralph Castain	6f743f81b6	Update PMIx 114 to current release candidate	2016-05-19 12:55:05 -07:00
rhc54	8b534e9897	Merge pull request #1668 from rhc54/topic/slurm When direct launching applications, we must allow the MPI layer to pr…	2016-05-16 12:23:19 -07:00
Howard Pritchard	1a676e5b35	pmix/cray: fix some breakage Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-05-16 12:45:05 -05:00
Ralph Castain	01ba861f2a	When direct launching applications, we must allow the MPI layer to progress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread sping in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization. Update external as well Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro	2016-05-14 16:37:00 -07:00
Ralph Castain	7767882346	Per user request, add some missing data and definitions: OPAL_PMIX_UNIV_RANK - synonym for OPAL_PMIX_GLOBAL_RANK OPAL_PMIX_APP_SIZE - #ranks in the application of this proc	2016-05-09 08:39:01 -07:00
Ralph Castain	8ec1891d11	Silence warning	2016-05-05 20:04:10 -07:00
Ralph Castain	08022d7af1	Some minor cleanups of warnings from gcc 6.0.0. Update s1/s2 pmix to get max_procs as required.	2016-05-05 15:28:13 -07:00
rhc54	648043597a	Merge pull request #1612 from ggouaillardet/poc/pmix_external_configury pmix/external: revamp external pmix package detection	2016-05-02 09:46:05 -07:00
Jeff Squyres	265e5b9795	Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1 ompi/opal/orte/oshmem/test: max hostname length cleanup	2016-05-02 09:44:18 -04:00
Gilles Gouaillardet	45f9a47d77	pmix/external: fix typo and silence a warning	2016-05-02 17:15:52 +09:00
Gilles Gouaillardet	08d91b9a03	pmix/external: revamp external pmix package detection	2016-05-02 16:23:31 +09:00
Ralph Castain	42d9d861fc	Fix minor typo in PMIx packing of pmix_app_t - thanks to Gilles for pointing it out	2016-04-29 08:55:46 -07:00
Howard Pritchard	f52dd511d4	Merge pull request #1600 from hppritcha/topic/pmix_fix_for_finalize pmix/cray: set fence_nb to NULL	2016-04-28 13:50:15 -06:00
hppritcha	aa1d7b9c50	pmix/cray: set fence_nb to NULL Rather than have a stub function for the pmix fence_nb operation, just set to NULL. Causes fewer problems. Fixes #1597 Fixes #1527 Signed-off-by: hppritcha <howardp@lanl.gov>	2016-04-28 13:48:54 -05:00
Ralph Castain	02876564d4	Silence warning of zero-byte malloc	2016-04-26 11:55:59 -07:00
Karol Mroz	e1c64e6e59	opal: standardize on max hostname length Define OPAL_MAXHOSTNAMELEN to be either: (MAXHOSTNAMELEN + 1) or (limits.h:HOST_NAME_MAX + 1) or (255 + 1) For pmix code, define above using PMIX_MAXHOSTNAMELEN. Fixup opal layer to use the new max. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-04-24 08:19:47 +02:00
Gilles Gouaillardet	d96919638f	pmix: remote autogenerated file and update .gitignore removed: opal/mca/pmix/pmix114/pmix/src/include/private/autogen/config.h.in	2016-04-18 12:57:41 +09:00
Ralph Castain	b009e58d25	Roll to PMIx 1.1.4rc2 - replaces some code that was incorrectly removed in prior update	2016-04-16 18:24:24 -07:00
Ralph Castain	8ff114e668	Update to official PMIx 1.1.4rc1	2016-04-15 21:47:46 -07:00
Ralph Castain	449ec41532	Roll to PMIx 1.1.4rc1 and remove the PMIx 1.2.0 directory as the community has decided to not do that release version. This incorporates a number of bug fixes that have been identified and repaired in the PMIx and OMPI code bases. Also includes several minor corrections to the PMIx code so it now supports run-thru without hanging on collectives involving a process that exits	2016-04-15 10:11:11 -07:00
Ralph Castain	2432daf065	Some minor cleanups of a memory leak and error output	2016-04-08 07:46:18 -07:00
Rainer Keller	52080a5736	As per the pull request to pmix/master: https://github.com/pmix/master/pull/71 Have OMPI's current version of pmix120 nicely fail in case of too long sun_path (longer than 108 or in case of OSX 103 chars). And have OMPI return proper error messages with hints how to amend.	2016-04-07 22:12:53 +02:00
Gilles Gouaillardet	6f450630d8	pmix/external: fix misc missing conversion and type issues	2016-04-04 10:12:34 +09:00
Gilles Gouaillardet	2ede47c462	pmix: fix misc missing conversion and type issues	2016-04-04 10:12:34 +09:00
Nysal Jan K.A	75233573d1	pmix: Increment the reference count in PMIx_Init The reference counting was broken which led PMIx_Finalize to release resources early. This fixes the "use after free" scenarios that I encountered. (based on commit pmix/master@abfaa4c)	2016-03-27 04:11:25 -04:00
Josh Hursey	099170bb31	Merge pull request #1496 from jjhursey/topic/pmix120-obj-patch pmix/pmix120: Fix OBJ_ to PMIX_ symbol name	2016-03-25 19:35:43 -05:00
Ralph Castain	0b4310b186	Remove an unnecessary header that forced exposure of the PMIx internal headers	2016-03-25 16:57:41 -07:00
Ralph Castain	e8246e079b	Minor cleanup to match the changes in the PMIx master	2016-03-25 15:12:41 -07:00
Joshua Hursey	8ebeaa5861	pmix/pmix120: Fix OBJ_ to PMIX_ symbol name	2016-03-25 16:17:08 -05:00
Ralph Castain	af1444b6e1	Cleanup a debug statement. Plug a memory leak	2016-03-08 18:27:55 -08:00
Ralph Castain	bac6290b22	Ensure the process name is positive when using direct launch Fixes #1425	2016-03-08 08:31:05 -08:00
Ralph Castain	b57a191ccc	Update the external client to the new PMIx init/finalize signatures	2016-03-03 20:50:20 -08:00
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00

... 3 4 5 6 7 ...

641 Коммитов