openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	d593e5a4ce	When we specify --with-devel-headers, we also emit a copy of libpmix. However, that library was built against the OPAL libevent component, which means all the libevent functions are prefixed with OPAL names. So ensure that the emitted libpmix is linked back against libopen-pal so those symbols will be resolved. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-08-07 09:36:16 -07:00
Ralph Castain	a239b4c3c3	Per discussion on the PMIx side, do a better job of detecting mismatches between location directives for OPAL and PMIx. Provide a more helpful error message and error out if we find a mismatch. If any OPAL values are set and the PMIx equivalent is not, then transfer it. Do not clear PMIX_INSTALL_PREFIX from the daemon's launch environment Fixes #3980 Closes #4007 Refs #3985 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-08-04 19:36:00 -07:00
Ralph Castain	f128b4c546	Fix incorrect usage of '==' in test comparisons Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-08-03 21:21:26 -07:00
Artem Polyakov	500c8be888	pmix: fix PMIx envar name for the installation prefix. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-08-02 08:03:36 +03:00
Ralph Castain	f39ce67982	Merge pull request #3951 from rhc54/topic/hwloc2 Update to hwloc 2.0.0a	2017-08-01 15:18:31 -06:00
Ralph Castain	8f34fa4a56	Move the detection of OPAL_PREFIX and subsequent posting of PMIX_PREFIX to the internal integration code for PMIx so we only do this when running with the embeddied PMIx Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-08-01 08:24:27 -06:00
Boris Karasev	e20b581529	pmix: fixed immediate request This commit fixes a hang when using external PMIx v1 module Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2017-07-28 15:53:48 +06:00
Ralph Castain	7a83fdb9bb	Update to hwloc 2.0.0a with shmem support. Update to support passing of HWLOC shmem topology to client procs Update use of distance API per @bgoglin Have the openib component lookup its object in the distance matrix Bring usnic up-to-date Restore binding for hwloc2 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-25 20:26:22 -07:00
Ralph Castain	0042c758f1	Update the tools support so it allows tools to access PMIx Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-25 17:10:08 -07:00
Ralph Castain	058e802b11	Add missing export directives Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-25 07:19:08 -07:00
George Bosilca	1ea8fab095	Make external symbols visible. All symbols that need to be accessed from a MCA component must be marked explicitly as visible using PMIX_EXPORT. This patch allows current trunk to almost work on OsX. More on the devel mailing list. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>	2017-07-25 01:14:22 -04:00
Ralph Castain	af85e48dd7	Silence Coverity warning, silence pmix_error_log of success Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-21 15:33:16 -07:00
Ralph Castain	492f98f8a5	Update to latest PMIx v2.1.0a Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-21 12:58:09 -07:00
Ralph Castain	f7e8780a42	Remove fortran support from platform file Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-20 21:02:30 -07:00
Ralph Castain	b225366012	Bring the ofi/rml component online by completing the wireup protocol for the daemons. Cleanup the current confusion over how connection info gets created and passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors. Remove the no-longer-required get_contact_info and set_contact_info from the RML layer. Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi ll be beneficial at large scales. Leave it "off" by default. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-20 21:01:57 -07:00
Ralph Castain	8c30958879	Update to PMIx v2.1.0alpha Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-20 11:12:06 -07:00
Ralph Castain	fca68b070b	Merge pull request #3934 from rhc54/topic/singleton Fix the isolated pmix component. Cleanup the ess/singleton component …	2017-07-19 16:02:37 -05:00
Ralph Castain	543c16b28d	Fix the isolated pmix component. Cleanup the ess/singleton component - we shouldn't be automatically discovering the local topology as that is now done on-demand. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-19 12:14:29 -07:00
Howard Pritchard	2fa0c4c6ec	pmix/s1: fix problems with ref counting in s1 s1 pmix component wasn't doing proper ref counting Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-07-18 15:59:28 -06:00
Gilles Gouaillardet	9124afbeae	pmix: do not invoke PMIX_INFO_CREATE() with a zero size Thanks Lisandro Dalcin for the report Fixes open-mpi/ompi#3854 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-07-14 15:00:05 +09:00
Artem Polyakov	4d3e22e815	Merge pull request #3870 from hppritcha/topic/repair_s2_launch pmix/s2: fix srun native launch for pmi2	2017-07-13 12:45:22 -05:00
Howard Pritchard	eeb91bc82b	pmix/s2: fix srun native launch for pmi2 recent changes that broke native launch on cray using srun or aprun was also broke native launch using pmi2. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-07-12 17:45:52 -06:00
Jeff Squyres	ccf17808b6	Merge pull request #3258 from markalle/pr/symbol_name_pollution symbol name pollution	2017-07-12 16:19:25 -05:00
Howard Pritchard	550e8c4afe	Merge pull request #3842 from hppritcha/topic/fix_cray_pmix_problem pmix/cray: add a bit of debug output	2017-07-11 08:29:56 -06:00
Howard Pritchard	26a8142c97	pmix/cray: add a bit of debug output add a bit of debug output to help with pmix finalize issues Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-07-11 05:45:49 -05:00
Mark Allen	552216f9ba	scripted symbol name change (ompi_ prefix) Passed the below set of symbols into a script that added ompi_ to them all. Note that if processing a symbol named "foo" the script turns foo into ompi_foo but doesn't turn foobar into ompi_foobar But beyond that the script is blind to C syntax, so it hits strings and comments etc as well as vars/functions. coll_base_comm_get_reqs comm_allgather_pml comm_allreduce_pml comm_bcast_pml fcoll_base_coll_allgather_array fcoll_base_coll_allgatherv_array fcoll_base_coll_bcast_array fcoll_base_coll_gather_array fcoll_base_coll_gatherv_array fcoll_base_coll_scatterv_array fcoll_base_sort_iovec mpit_big_lock mpit_init_count mpit_lock mpit_unlock netpatterns_base_err netpatterns_base_verbose netpatterns_cleanup_narray_knomial_tree netpatterns_cleanup_recursive_doubling_tree_node netpatterns_cleanup_recursive_knomial_allgather_tree_node netpatterns_cleanup_recursive_knomial_tree_node netpatterns_init netpatterns_register_mca_params netpatterns_setup_multinomial_tree netpatterns_setup_narray_knomial_tree netpatterns_setup_narray_tree netpatterns_setup_narray_tree_contigous_ranks netpatterns_setup_recursive_doubling_n_tree_node netpatterns_setup_recursive_doubling_tree_node netpatterns_setup_recursive_knomial_allgather_tree_node netpatterns_setup_recursive_knomial_tree_node pml_v_output_close pml_v_output_open intercept_extra_state_t odls_base_default_wait_local_proc _event_debug_mode_on _evthread_cond_fns _evthread_id_fn _evthread_lock_debugging_enabled _evthread_lock_fns cmd_line_option_t cmd_line_param_t crs_base_self_checkpoint_fn crs_base_self_continue_fn crs_base_self_restart_fn event_enable_debug_output event_global_current_base_ event_module_include eventops sync_wait_mt trigger_user_inc_callback var_type_names var_type_sizes Signed-off-by: Mark Allen <markalle@us.ibm.com>	2017-07-11 02:13:23 -04:00
Ralph Castain	a190b4b89f	Prefix the MB macro in one more place Fixes #3830 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-07 06:07:47 -07:00
Ralph Castain	ed43492867	Not really necessary, but technically correct Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-06 06:00:03 -07:00
Ralph Castain	31130a4bee	Replace syntax with something less strictly C99 Fixes #3809 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-07-05 16:54:36 -07:00
Howard Pritchard	1f2f3db553	pmix/cray: fix handling of multiple finis The fini code for cray pmix wasn't correct. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2017-07-03 14:30:34 -05:00
Ralph Castain	9178219e6b	Deregister event handlers only on final call to finalize. Ensure we pass PMIx mca params Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-28 15:00:43 -07:00
Ralph Castain	d619de4f4c	Fix a threadlock when notifying clients of failures Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-28 08:58:41 -07:00
Ralph Castain	e6c2a8d346	Track PMIx v2.0.1 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-26 09:34:57 -07:00
Ralph Castain	79fd359848	Merge pull request #3713 from rhc54/topic/ofi Enable use of OFI fabrics for launch and other collective operations.…	2017-06-25 11:47:40 -07:00
Ralph Castain	ed85512a7c	Update to track PMIx v2.0.1 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-25 07:29:32 -07:00
Ralph Castain	ef56c7d47a	Correctly transfer size_t data fields Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-24 20:11:54 -07:00
Ralph Castain	f4411c4393	Enable use of OFI fabrics for launch and other collective operations. Update the PMIx repo to the latest master to get the required support for the server to "push" modex info, and to retrieve all its own "modex" values for sending back to mpirun. Have mpirun cache them in its local modex hash as OFI goes point-to-point direct and doesn't route - so the remote daemons don't need a copy of this connection info. Remove the opal_ignore from the RML/OFI component, but disable that component unless the user specifically requests it via the "rml_ofi_desired=1" MCA param. This will let us test compile in various environments without interfering with operations while we continue to debug Fix an error when computing the number of infos during server init Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-23 19:57:21 -07:00
Ralph Castain	8263efff65	Fix uninitialized variables Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-23 11:12:26 -07:00
Ralph Castain	6ec2ad5288	Fix the pmix_query API when it asks for something that returns an array of pmix_info_t. Protect the PMIX_INFO_FREE macro from NULL arrays. Update the mpi_memprobe scaling test Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-22 20:11:36 -07:00
Ralph Castain	3e78f84093	Silence Coverity warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-21 13:19:51 -07:00
Ralph Castain	cba127bc43	Update the ext2x component to match the internal one Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-20 11:42:14 -07:00
Ralph Castain	952726c121	Update to latest PMIx master - equivalent to 2.0rc2. Update the thread support in the opal/pmix framework to protect the framework-level structures. This now passes the loop test, and so we believe it resolves the random hangs in finalize. Changes in PMIx master that are included here: * Fixed a bug in the PMIx_Get logic * Fixed self-notification procedure * Made pmix_output functions thread safe * Fixed a number of thread safety issues * Updated configury to use 'uname -n' when hostname is unavailable Work on cleaning up the event handler thread safety problem Rarely used functions, but protect them anyway Fix the last part of the intercomm problem Ensure we don't cover any PMIx calls with the framework-level lock. Protect against NULL argv comm_spawn Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-20 09:02:15 -07:00
Ralph Castain	8f09929469	Fix rank-file mapper launch by correctly setting up the remote map from the provided data Put a simple protection for the case where procs fail while we are trying to deregister handlers Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-15 08:33:29 -07:00
Ralph Castain	548cd24e4e	Forward-port changes proposed for v3.0 to master from PR #3677 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-09 07:51:21 -07:00
Ralph Castain	2d65908184	Correct the external pmix configury Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-07 00:33:29 -07:00
Ralph Castain	bd1793ad17	Get the pmix/ext2x component to work. Fix a minor problem in the libevent external component. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-06 20:06:28 -07:00
Ralph Castain	c3e6dc2022	Update to pmix v2.0.0rc1, including thread safety fixes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-06 15:16:34 -07:00
Ralph Castain	93cf3c7203	Update OPAL and ORTE for thread safety (I swear, if I look this over one more time, I'll puke) Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-06 12:30:57 -07:00
Ralph Castain	2f85d10600	Update to PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-06 08:19:25 -07:00
Ralph Castain	8f526968c2	Do not hang if we cannot relay messages. Eliminate extra error log message Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-06-05 06:35:19 -07:00
Ralph Castain	9d6b929894	Fix uninitialized variable. Set exit codes for failed launch so we get pretty error messages Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-31 07:38:37 -07:00
Ralph Castain	26d96061aa	Roll in latest PMIx updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-30 21:35:35 -07:00
Ralph Castain	9f1f9d6606	Update to PMIx v2.0.0rc1 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-28 10:30:58 -07:00
Ralph Castain	9f60cd0fe7	Update the connect/accept support so we check to see if we have the proper infrastructure and RTE support, including whether we have ompi-server available if the connect/accept spans multiple applications. Print pretty help messages in all cases where we do not have support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-27 10:47:08 -07:00
Nathan Hjelm	33d59886e1	Merge pull request #3587 from hjelmn/event_abstraction pmix/pmix2x: fix errors in event abstration	2017-05-26 10:44:18 -06:00
Nathan Hjelm	a512b8962d	pmix/pmix2x: fix errors in event abstration Parts of the pmix2x component called the event_* functions directly instead of the opal_event_* wrappers. This is fine as long as we are using libevent but becomes a problem with other event libraries. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-05-26 09:49:11 -06:00
Ralph Castain	2f721a3366	Merge pull request #3585 from rhc54/topic/pmix20 Update to pmix v2.0beta	2017-05-26 06:05:44 -07:00
Ralph Castain	e1e264711a	Update to pmix v2.0beta Fix atomics - again Fix initialization of notification ring buffer Fix wait_sync definitions Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-26 03:33:18 -07:00
Ralph Castain	657e701c65	Add debug verbosity to the orte data server and pmix pub/lookup functions Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it). Remove unneeded test Fix memory corruption by re-initializing variable to NULL in loop Resolve the race condition identified by @ggouaillardet by resetting the mapped flag within the same event where it was set. There is no need to retain the flag beyond that point as it isn't used again. Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them. Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers. Have the mindist module add procs to the job's proc array as it is a fully described module Protect the hnp-not-in-allocation case Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-25 18:41:27 -07:00
Gilles Gouaillardet	026f3dd2dd	pmix2x: plug a misc memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-05-10 14:57:44 +09:00
Ralph Castain	0afcb1a448	Update to support server self-notifications Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-08 10:04:50 -07:00
Ralph Castain	ef0e0171c9	Implement the changes required to support cross-library coordination. Update PMIx to support intra-process notifications and ensure that we always notify ourselves for events. Add a new ompi/interlib directory where cross-lib coordination code can go, and put the code to declare ourselves there (called from ompi_mpi_init.c). Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-08 10:04:50 -07:00
Ralph Castain	3bca715780	Fix pmix configury so that libpmix is still emitted when --with-devel-headers is given, even under static builds Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-05-05 11:15:32 -07:00
Jeff Squyres	af336ac0e8	pmix/configure.m4: always use embedded mode Looks like embedded mode was mistakenly disabled when --with-devel-headers was specified. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-05-04 10:01:41 -07:00
Ralph Castain	8b1f01dfe6	Set the default modex parameters back to full blocking modex while we continue to test and debug the slow modex - it seems to be having issues on the Cray Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-22 15:19:46 -07:00
Ralph Castain	9fc3079ac2	Implement a background fence that collects all data during modex operation The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kickoff the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called. Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always out-perform the time spent doing a modex. However, this provides a decent compromise in the interim. This PR changes the default settings of a few relevant params to make "background modex" the default behavior: * pmix_base_async_modex -> defaults to true * pmix_base_collect_data -> continues to default to true (no change) * async_mpi_init - defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything. The logic in MPI_Init is: * if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation * if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed * if async_modex is not set, then we block until the fence completes (regardless of collecting data or not) * if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation. * if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case. HTH Ralph Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-21 10:29:23 -07:00
Ralph Castain	ffbfd22d84	Fix event registration - need to increment the event index and record the number of codes in the event handler Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-13 17:35:10 -07:00
Ralph Castain	9f73974fe1	Update to latest PMIx master, including disabling the pmi-1 and pmi-2 backward compatibility as these interfere with the s1,s2 components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-12 12:34:27 -07:00
Ralph Castain	95ae0d1df3	Cleanup timing macros for portability across compilers. Rename the --enable-timing configure option to be --enable-pmix-timing so it doesn't pickup external timing requests. Remove a stale function reference in PMIx so it can compile with timing enabled. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-10 12:56:38 +06:00
Ralph Castain	92c996487c	Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well. Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap Get the DVM running again Fix direct modex by eliminating race condition caused by releasing data while sending it Up the size limit before compressing Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 19:25:15 -07:00
Ralph Castain	2cc5fea8be	Update to PMIx v2.0alpha Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-04-03 10:02:29 -07:00
Ralph Castain	7dd34d0c9a	Use the correct callback data - the callback function was expecting a bool, not a pmix_ptl_sr_t. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-28 17:21:47 -07:00
Ralph Castain	35f817911e	Fix coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-24 08:09:46 -07:00
Ralph Castain	c0bcd11bcf	Fix permissions - no CI required Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 08:05:52 -07:00
Ralph Castain	55e4fba5f5	If we lose connection to the server after initiating a send/recv in PMIx (e.g., in PMIx_Abort), then we need to "resolve" all pending recvs to avoid hanging. Fixes #3225 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-23 02:53:21 -07:00
Ralph Castain	d645557fa0	Update to include the PMIx 2.0 APIs for monitoring and job control. Include required integration, but leave the monitors off for now. Move the sensor framework out of ORTE as it is being absorbed into PMIx Fix typo and silence warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-21 17:47:08 -07:00
Ralph Castain	4b6d220a83	You cannot include both pmi.h and pmi2.h as they have conflicting defines in them. Thanks to Kilian Cavalotti for pointing it out Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-19 11:53:54 -07:00
Ralph Castain	c6bc3ccb76	Sync to latest PMIx master and PMIx reference server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-11 12:50:38 -08:00
Ralph Castain	aca7091114	Fix some minor compatibility issues by ensuring job-level data gets stored against wildcard rank in the cray, s1, and s2 components, and that the ext1 component translates all wildcard rank requests into the peer's rank since v1.x of PMIx doesn't understand wildcard ranks Closes #3101 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-05 10:30:59 -08:00
Ralph Castain	1de72ff023	Silence an unnecessary error log Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-02 17:18:34 -08:00
Ralph Castain	e86a0dbf39	Update to PMIx master to include dlopen fixes and addition of libltdl support Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-22 11:54:33 -08:00
Ralph Castain	8cffdcf127	Ensure that the pmix headers and lib get installed when --with-devel-headers is given so that PMIx applications can be built and executed against the "embedded" PMIx version Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-21 13:46:46 -08:00
Gilles Gouaillardet	bb2481a84b	pmix2x: synchronize to the latest PMIx master pmix/master@f57d9b2953 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-20 10:45:17 +09:00
Ralph Castain	f49118eaab	Fix some pmix configuration code Remove stale file reference that caused a check to always fail. Update psm2 function check to new libs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-16 10:54:47 -08:00
Howard Pritchard	b272f87926	Merge pull request #2968 from hjelmn/pmix_cray pmix/cray: performance improvements and cleanup	2017-02-16 11:41:59 -07:00
Ralph Castain	201f8571ca	Ensure we retain the peer object until we are done with it, then detect that the socket has closed due to a lost connection and cleanly release the message event Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 18:30:55 -08:00
Ralph Castain	9cd7349d7c	Instead of completely free'ing the event base, pause the PMIx progress thread before tearing down the infrastructure, and then release the event base at the end of the procedure. This allows any infrastructure objects holding events to delete them prior to free'ing the event base. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-15 05:02:43 -08:00
Ralph Castain	f7fe2f7189	Merge pull request #2977 from rhc54/topic/spawn Fix comm_spawn by registering nspace info only when needed	2017-02-15 04:31:54 -08:00
Ralph Castain	68b53e2179	Fix comm_spawn by registering nspace info only when needed - either when we have local procs, or when job-level info is required by connecting jobs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 19:47:56 -08:00
Ralph Castain	0c8609ca16	Update to newest PMIx master (includes configuration cleanups). Silence trivial Coverity warning in hwloc base. Cleanup a race condition segfault during finalize by ensuring the PMIx progress thread is stopped prior to starting to tear down the messaging components Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-14 15:14:00 -08:00
Nathan Hjelm	3b912ea2a7	pmix/cray: performance improvements and cleanup Do not use opal_output_verbose inside O(n) loops. This was causing us to make O(n) calls to snprintf which was greatly slowing launch at scale. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-02-14 11:13:10 -07:00
Ralph Castain	35578b4009	Update to lastest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-13 23:19:26 -08:00
Gilles Gouaillardet	7acef4833e	pmix2x: Update to latest PMIx master pmix/master@6ed27be839 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-02-08 13:23:27 +09:00
KAWASHIMA Takahiro	4b2eba34a6	Merge pull request #2933 from kawashima-fj/pr/dstore-config-desc pmix/pmix2x: Correct configure option description	2017-02-08 13:03:27 +09:00
Jeff Squyres	100b112d3c	pmix: fix zlib protection macro usage It's possible that we can have zlib.h but still not have zlib support. Use the correct macro to protect the usage of calling zlib functions. This fixes 32-bit MTT builds at Cisco (e.g., https://mtt.open-mpi.org/index.php?do_redir=2389). Submitted upstream to PMIX: https://github.com/pmix/master/pull/290 Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-07 05:52:32 -08:00
KAWASHIMA Takahiro	750406f67b	pmix/pmix2x: Correct configure option description `--enable-pmix-dstore` option was enabled by default in `f4a5511`. Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>	2017-02-07 11:52:56 +09:00
Ralph Castain	edcfdf2365	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-31 08:01:37 -08:00
Gilles Gouaillardet	b078e57e73	pmix/ext1x: fix misc memory leaks in namespace registration Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:42 +09:00
Gilles Gouaillardet	f51fc293a2	ext1x/pmix1x_client: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:42 +09:00
Gilles Gouaillardet	022cca79ea	pmix/ext1x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:36 +09:00
Gilles Gouaillardet	f485d12a82	pmix: rename the ext11 component into ext1x also use the same naming scheme thann pmix/ext2x Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 10:52:35 +09:00
Gilles Gouaillardet	dccb1899e6	pmix/ext11: correctly use PMIx_server_register_nspace() PMIx_server_register_nspace() is an asynchronous operation, so the pmix glue wait for it completes before returning. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 09:23:19 +09:00
Gilles Gouaillardet	6955e1e25c	pmix/ext11: fix compilation the argc field from the opal_pmix_app_t struct was removed, so adjust the pmix/ext11 glue accordingly. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-30 09:23:18 +09:00
Ralph Castain	3302864a7d	Cleanup a typo that can cause a segfault - use a local variable name different than the one passed into the function Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-27 16:49:25 -08:00
Gilles Gouaillardet	896434b1bd	pmix/ext2x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-26 14:07:15 +09:00
Gilles Gouaillardet	6b8e1c217c	pmix/ext2x: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-26 14:06:58 +09:00
Gilles Gouaillardet	142b95df87	pmix/ext2x: plug misc memory leaks regarding opal_pmix2x_event_chain_t handling Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-25 16:17:10 +09:00
Gilles Gouaillardet	7a3d39f079	pmix/ext2x: plug a memory leak in _reg_nspace() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-25 16:17:01 +09:00
Gilles Gouaillardet	189da7fdab	pmix2x: plug a memory leak in _event_hdlr() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:30 +09:00
Gilles Gouaillardet	acbc32d3b2	pmix2x: plug a memory leak in opal_lkupcbfunc() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Gilles Gouaillardet	b5b21043c4	pmix2x: plug a memory leak in _reg_nspace() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Gilles Gouaillardet	0f47310a75	pmix2x/pmix2x_client: plug misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:13:29 +09:00
Ralph Castain	8c960bae8d	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 07:07:40 -08:00
Ralph Castain	e568b211e4	Silence Coverity CID 1398541 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-10 15:30:50 -08:00
Gilles Gouaillardet	6d59b476de	Merge pull request #2686 from ggouaillardet/topic/pmix2x_ptl_base_sendrecv pmix2x: ptl/base: send header and message data together via writev()	2017-01-10 16:26:10 +09:00
Gilles Gouaillardet	44c1ff60f1	Merge pull request #2672 from ggouaillardet/topic/misc_memory_leaks Plug misc memory leaks	2017-01-10 13:16:04 +09:00
Gilles Gouaillardet	a01960bee5	pmix2x: ptl/base: send header and message data together via writev() on Linux, sending the header and then the message data does severely impact performances of ptl/tcp : on the receiver, reading the data can often result in an PMIX_ERR_RESOURCE_BUSY or PMIX_ERR_WOULD_BLOCK, which ends up degrading performances) this commit send both header and message data at the same time via writev() and makes ptl/tcp virtually as efficient as ptl/usock. Short writev generally occur when the kernel buffer is full, so there is no point for retrying in this case. fwiw, no such degradation was observed on OSX. Refs open-mpi/ompi#2657 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-10 13:07:39 +09:00
Ralph Castain	67fce2861b	Merge pull request #2685 from rhc54/topic/cov Resolve Coverity issues	2017-01-07 13:11:40 -08:00
Ralph Castain	e25e69dc2f	Resolve Coverity issues Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-07 10:45:52 -08:00
Ralph Castain	822e2680ba	Cleanup some configure stuff for static builds - still can't get wrapper extra libs to be recognized Signed-off-by: Ralph Castain <rhc@open-mpi.org> pmix2x: minor configure updates Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-07 08:37:36 -08:00
Ralph Castain	444f5fa35d	Raise the priority of the usock component so it gets preferentially picked Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 22:53:04 -08:00
Gilles Gouaillardet	6ef281e163	pmix/base: fix misc memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:35 +09:00
Ralph Castain	6509f60929	Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node. Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given. Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example: $ mpirun -npernode 2 ./mpi_memprobe Sampling memory usage after MPI_Init Data for node rhc001 Daemon: 12.483398 Client: 6.514648 Data for node rhc002 Daemon: 11.865234 Client: 4.643555 Sampling memory usage after MPI_Barrier Data for node rhc001 Daemon: 12.520508 Client: 6.576660 Data for node rhc002 Daemon: 11.879883 Client: 4.703125 Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 10:32:17 -08:00
Ralph Castain	91d714fe93	Add flags to direct PMIx to only use one listener, but without directing which one (tcp or usock) to use. This allows the user to set PMIX_MCA_ptl in their environment to select the transport method. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:44 -08:00
Ralph Castain	f355fb926d	Continue cleanup of notifications. Resolve a race condition that can result in attempt to send a message on a closed socket Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:33 -08:00
Ralph Castain	9eab9a1ed3	Remove stale global variables Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers. Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation). Begin first cut at memory profiler Some minor cleanups of memprobe Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-02 14:04:24 -08:00
Ralph Castain	e8aea2ebfc	Minor cleanups Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 16:19:42 -08:00
Ralph Castain	08c76a42bb	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org> Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with. Signed-off-by: Ralph Castain <rhc@open-mpi.org> Protect users of hwloc membind functions Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update PMIx to include NULL string protection Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update to PMIx master to include key overwrite protection Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 12:44:47 -08:00
Ralph Castain	fe68f23099	Only instantiate the HWLOC topology in an MPI process if it actually will be used. There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced: * shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string instead of the topology itself, if available, thus avoiding instantiating the topology * openib BTL. This uses the distance matrix. At present, I haven't developed a method for replacing that reference. Thus, this component will instantiate the topology * usnic BTL. Uses the distance matrix. * treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate the topology * ess base functions. If a process is direct launched and not bound at launch, this code attempts to bind it. Thus, procs in this scenario will instantiate the topology Note that instantiating the topology on complex chips such as KNL can consume megabytes of memory. Fix pernode binding policy Properly handle the unbound case Correct pointer usage Do not free static error messages! Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-29 10:33:29 -08:00
Ralph Castain	3a2d6a5ab6	Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-28 09:14:26 -08:00
Ralph Castain	d3aa3777f3	Per @jsquyres: avoid mangling user-provided CFLAGS by using the new PMIX_FLAGS_UNIQ autoconf script in place of PMIX_UNIQ Refs #2636 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 09:00:59 -08:00
Gilles Gouaillardet	22db1d36b6	pmix2x: silence misc warnings Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-26 13:35:17 +09:00
Ralph Castain	ea133206ec	Sync the internal OMPI component to PMIx master Update external PMIx v2.x component Add missing Makefile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 19:14:16 -08:00
Ralph Castain	256b5adac5	Transfer across final fixes from debugger attach work Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 00:34:27 -08:00
Ralph Castain	c6f6f40529	Transfer debugger support changes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 18:14:46 -08:00
Ralph Castain	269753f5c1	Transfer back changes from debugger attach work Silence warning Remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 10:00:52 -08:00
Ralph Castain	215d6290e0	Add a flux component for LLNL Fine tuning of flux component Fix a few minor issues with the initial cut: * Job id could be obtained from the PMI kvsname like SLURM, but simpler to getenv (FLUX_JOB_ID) * Flux pmi-1 doesn't define PMI_BOOL, PMI_TRUE, PMI_FALSE * Flux pmi-1 maps the deprecated PMI_Get_kvs_domain_id() to PMI_KVS_Get_my_name() internally, so just call that instead. * Drop residual slurm references. Add wrappers for PMI functions so that if HAVE_FLUX_PMI_LIBRARY is not defined, the component can dlopen libpmi.so at location specified by the FLUX_PMI_LIBRARY_PATH env variable, which adds flexibility. If HAVE_FLUX_PMI_LIBRARY is defined, link with libpmi.so at build time in the usual way. Update configury for flux component Update m4 so the configure options work as follows: --with-flux-pmi Build Flux PMI support (default: yes) --with-flux-pmi-library Link Flux PMI support with PMI library at build time. Otherwise the library is opened at runtime at location specified by FLUX_PMI_LIBRARY_PATH environment variable. Use this option to enable Flux support when building statically or without dlopen support (default: no) If the latter option is provided, the library/header is located at build time using the pkg-config module 'flux-pmi'. Otherwise there is no library/header dependency. Handle the case where ompi is configured with --disable-dlopen or --enable-statkc. In those cases, don't build the component unless --with-flux-pmi-library is provided. It is fatal if the user explicitly requests --with-flux-pmi but it cannot be built (e.g. due to --disable-dlopen). Add a schizo/flux component Update schizo/flux component Eliminate slurm-specific usage cases. Since the module is only loaded if FLUX_JOB_ID is set, there are only two cases to handle: 1) App was launched indirectly through mpirun. This is not yet supported with Flux, but hook remains in case this mode is supported in the future. 2) App was launched directly by Flux, with Flux providing CPU binding, if any. Fix up white space in pmix/flux component Drop non-blocking fence from pmix:flux component The flux PMI-1 library is not thread safe, therefore register a regular blocking fence callback instead of the thread-shifting fencenb(). pmix/flux component avoids extra PMI_KVS_Gets Keys stored into the base cache under the wildcard rank are not intended to be part of the global key namespace. These keys therefore should not trigger a PMI_KVS_Get() if they are not found in the cache. Minor pmix/flux component cleanup pmix/flux: drop code for fetching unused pmix_id pmix/flux: err_exit must return error Problem: in flux_init(), although 'ret' (variable holding err_exit return code) is initialized to OPAL_ERROR, the variable is reused as a temporary result code, so if there are some successes followed by a failure that doesn't set 'ret', flux_init() could return success with PMI not initialized. Ensure that a "goto err_exit" returns OPAL_ERROR if 'ret' is not set to some other error code. pmix/flux: don't mix OPAL_ and PMI_ return codes Problem: flux_init() can return both PMI_ and OPAL_ return codes. Although OPAL_SUCCESS and PMI_SUCCESS are both defined as 0, other codes are not compatible. Ensure that flux_init() consistently uses 'rc' for PMI_ return codes and 'ret' for OPAL_ return codes. pmix/flux: factor out repeated code for cache put Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-16 18:26:38 -08:00
rhc54	00b87ea829	Merge pull request #2584 from rhc54/topic/warnings Reduce the flood of warnings due to uninitialized variables, mismatch…	2016-12-15 10:09:01 -08:00
Ralph Castain	585540bcee	Reduce the flood of warnings due to uninitialized variables, mismatched types, and unused things to a more bearable trickle Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 16:33:50 -08:00
Gilles Gouaillardet	a019095b84	pmix2x/class: correctly handle concurrent class initialization (back-ported from upstream commit pmix/master@ceedbd67fd) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-15 09:07:24 +09:00
Ralph Castain	884fb7fcf2	Update the PMIx2 support to include the latest shared memory optimizations Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn Update to track master Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 15:00:10 -08:00
Ralph Castain	1961a1c22a	Use the server tmpdir instead of the system tmpdir for tool contact files Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 08:42:09 -08:00
Ralph Castain	fbed2d794a	Update to latest PMIx master + PTL branch Update the usock component to disable it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 20:47:44 -08:00
Ralph Castain	d51821cbc7	Update pmix check headers to support Open BSD Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 19:37:06 -08:00
Ralph Castain	f91f8ce494	Protect against NULL param Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 21:19:32 -08:00
Ralph Castain	f633d5c1b1	Initialize var Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 21:08:45 -08:00
rhc54	003f7d308f	Merge pull request #2504 from hppritcha/topic/fix_pmix_base_help_file pmix: Fix pmix base help file.	2016-12-03 11:57:07 -08:00
Ralph Castain	a43fae74a5	Avoid hanging in show_help if PMIx was unable to initialize Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 10:12:58 -08:00
Ralph Castain	a4e3f615e3	Fix executable modes of pmix config scripts Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 20:50:59 -08:00
Ralph Castain	342dbfcf4e	Update Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 20:33:03 -08:00
Ralph Castain	8e64382edf	Update to correct tarball Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 19:16:07 -08:00
Ralph Castain	1a0bccb536	Now that PMIx has settled on its release strategy and numbering, update the OPAL pmix framework to track Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 15:44:43 -08:00
Howard Pritchard	3049848731	Fix pmix base help file. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-12-02 15:03:22 -06:00
Ralph Castain	6041467df0	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-01 14:47:44 -08:00
Howard Pritchard	eee9f7ae3a	pmix/cray: abort job if using aprun for general case It turns that there is an incompatibility between the Cray PMI library and the default configuration for building Open MPI (master). To work around this, we now disable use of aprun for direct launch of Open MPI jobs except under specific conditions. The problem is that there are now (on master) packages getting initialized that do not work properly across a fork operation. As part of a constructor in the Cray PMI library, a fork operation is done to simplify use of shared memory between the processes in a job on the same node. This ends up thoroughly messing up the Open MPI initialization process in the case that dlopen support is enabled. The initialization process gets about half-way through when the PMIX framework is opened and components are loaded, which triggers the Cray PMI constructor and hence the fork operation. There are two workarounds for this: 1) configure Open MPI for Cray XE/XC systems using aprun with the --disable-dlopen option 2) set the PMI_NO_FORK environment variable in the shell in which the aprun command is run. Without taking these measures, a Open MPI job will just hang at job startup in the first attempt to "thread-shift" the PMIx fence_nb operation. Additional hangs occur at shutdown if this problem is worked around, again due to the insertion of a fork operation halfway through the Open MPI initialization procedure. This commit detects if the conditions that bring out the hang situation are present, and if so, prints out a message and aborts the job launch. Note on systems using slurm, the PMI_NO_FORK environment variable is set as part of the srun job launch, hence this issue is avoided on those systems. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-25 06:28:19 -07:00
Howard Pritchard	2cbc0e8472	pmix/cray: fix disable-dlopen problem PR open-mpi/ompi#2432 introduced a regression where configure and build with --disable-dlopn caused build failure owing to unresolved alps lli symbols in the libopal-pal shared library. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-21 13:45:10 -06:00
Howard Pritchard	0bbb319246	Merge pull request #2444 from hppritcha/topic/cray_pmix_ws_cleanup pmix/cray: whitespace cleanup	2016-11-21 06:03:56 -07:00
Howard Pritchard	08dce4f161	pmix/cray: whitespace cleanup Get rid of tabs. This is anti-ompi style. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-18 19:30:40 -07:00
Howard Pritchard	de3de131af	pmix/cray: set some envars for MPI_INFO_ENV object Enhance the cray pmix component to set some OMPI internal env. variables used to set some key/value pairs on the MPI_INFO_ENV object. This allows more of the ompi-tests ibm unit tests to pass when using aprun/srun direct launch and Cray PMI. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-16 17:37:52 -06:00
Howard Pritchard	703b464c03	pmix: fix a typo in a help file Fixes #2391 Thanks to @njoly for reporting Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-12 11:49:15 -07:00
Ralph Castain	64873487b4	Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-01 12:17:11 -07:00
Ralph Castain	f4a55118e6	Update to latest PMIx master - mostly updates example codes, but includes one critical cleanup during finalize. NOTE: set dstore shared memory storage option to "on" by default Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-27 11:16:03 -07:00
Ralph Castain	f298f294e1	Update PMIx to latest master tarball. Ensure we set the HNP name for orted's so that PMIx_Lookup can find the server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-26 15:48:56 -07:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Ralph Castain	9131eca9c6	Update to latest PMIx master	2016-10-20 21:13:40 -07:00
Ralph Castain	2f966bf3bf	Cleanup external PMIx v3 component for copy/paste errors - component and module require unique names	2016-10-20 09:11:46 -07:00
Ralph Castain	8113a8d1b0	Now that we are hiding symbols in the internal PMIx component, we cannot reuse that component for integration to the external PMIx master as the symbols don't match. So create a new "ext3x" component and copy the PMIx v3 integration over there. Also, remove a couple of build-product files from the pmix3x component.	2016-10-18 13:15:32 -07:00
Ralph Castain	50c9f3de55	Ensure the PMIx progress thread is stopped prior to tearing anything down. Thanks to Gilles for spotting this error!	2016-10-18 00:27:52 -07:00
Ralph Castain	6f65d0a173	Repair event notification support. Cleanup the long-suffering "epoll: warning" coming out of libevent whenever a process abnormally terminated. Add changes to test program Sync to PMIx master	2016-10-13 16:27:39 -07:00
Ralph Castain	6417f217e1	Turn PMIx dstore off by default as MTT was effectively broken	2016-10-13 08:14:51 -07:00
Ralph Castain	8f05beb1ec	Sync pmix/master@cb53105	2016-10-11 20:54:59 -07:00
Ralph Castain	6ce4b6d098	Eliminate -Wall from being hardcoded	2016-10-11 12:50:31 -07:00
Ralph Castain	1859b03416	Enable PMIx shared memory support by default	2016-10-11 12:18:01 -07:00
Ralph Castain	1d7d7c201b	Update PMIx support to latest PMIx master	2016-10-11 10:17:23 -07:00
Ralph Castain	5b1484a836	Implement the backend support for process-generated event notification	2016-10-08 09:24:28 -07:00
Gilles Gouaillardet	f1f1fb15eb	pmix3x: configury: output major, minor and release version after checking them and hence fix the configure output (back-ported from upstream commit pmix/master@7b7cdda2de)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	f3af799608	pmix3x: misc fixes to get pmix build on Solaris - replace MAXHOSTNAMELEN with hardcoded 1024. unlike Linux, Solaris #define MAXHOSTNAMELEN in <netdb.h>, so use a hard coded value to keep the test simpl - stdout cannot be assigned on Solaris, so use freopen instead (back-ported from upstream commit pmix/master@a63f6e53f4)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	5cbfddb8f1	pmix3x: fix misc memory leaks (back-ported from upstream commit pmix/master@1eff526929)	2016-10-08 13:01:28 +09:00
Gilles Gouaillardet	b4e4e4a5f1	pmix3x: enhance pmix_nspace_t destructor PMIX_RELEASE all elements stored in the internal and modex hash tables (back-ported from upstream commit pmix/master@b90674fc52)	2016-10-08 13:01:27 +09:00
Gilles Gouaillardet	f1dc033767	pmix3x: add the PMIX_HASH_TABLE_FOREACH macro this is a convenience macro similar to the PMIX_LIST_FOREACH macro, that can be used to iterate on all the key/value pairs of a pmix_hash_table_t (back-ported from upstream commit pmix/master@349971c68c)	2016-10-08 13:01:27 +09:00
Gilles Gouaillardet	7601e783cc	pmix3x: sec/munge: add a missing include file (cherry picked from upstream pmix/master@f7cfb11f6b)	2016-10-03 16:09:10 +09:00
Ralph Castain	e773c17cf3	Put show_help thru the PMIx "log" API. This pushes the show_help output from apps into the pmix thread, thus avoiding conflicts in the RML thread, which should help with thread lock situations.	2016-10-02 16:02:23 -07:00
Gilles Gouaillardet	871ade9231	pmix/{cray,s1,s2}: make pmi_opcaddy_t class static theses three pmix components use the same class name, declare it as static so Open MPI can be built with --disable-dlopen Thanks Limin Gu for the report	2016-09-28 09:18:36 +09:00
Gilles Gouaillardet	1fbc9a5431	pmix3x: dstore/pmix: flock portability Using the fcntl-locking instead of the flock (back-ported from upstream pmix/master@3030a0cca1)	2016-09-27 13:21:03 +09:00
Gilles Gouaillardet	362a5886de	pmix3x: client: fix PMIx_Finalize() sequence pmix_progress_thread_finalize() invokes libevent event_base_free, so all libevent stuff cannot be used after. Hence, pmix_client_globals.myserver must be PMIX_DESTRUCT'ed before invoking pmix_progress_thread_finalize()	2016-09-24 00:01:23 +09:00
Gilles Gouaillardet	5479c6cca7	pmix3x: add missing #include and get Open MPI build on OpenBSD 6.0	2016-09-23 11:23:18 +09:00
Gilles Gouaillardet	fbf03299c3	Merge pull request #2079 from ggouaillardet/topic/pmix_configury_dlopen pmix3x: configury: correctly handle --disable-dlopen	2016-09-21 10:59:33 +09:00
Gilles Gouaillardet	6c1e25b76e	pmix/ext11: fix pmix1_value_unload() prototype and call pmix1_value_unload() was added a "key" argument which is unused, and pmix1_value_unload() was sometimes invoked with two arguments instead of three. since the "key" argument is unused, simply remove it from the subroutine prototype and calls.	2016-09-20 14:34:41 +09:00
Gilles Gouaillardet	041a431966	pmix3x: configury: correctly handle --disable-dlopen the LT_* macros do overwrite the enable_dlopen variable, so it must be tested and saved before invoking LT_INIT. delay the invokation of the LT_* macros and use the PMIX_ENABLE_DLOPEN_SUPPORT variable to figure out whether --disable-dlopen was invoked	2016-09-15 13:26:20 +09:00
Artem Polyakov	dc0ab674de	Add PMIx key to provide RM with ability to indicate that it will cleanup session directories provided at through OPAL_PMIX_TMPDIR, OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR	2016-09-05 07:48:44 +03:00
Ralph Castain	34f04a7924	Remove spurious Makefile.am line	2016-09-01 15:31:09 -07:00
Ralph Castain	0ea1cff733	Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disable by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional	2016-09-01 13:10:10 -07:00
Ralph Castain	39992d1ad7	Silence trivial Coverity warnings	2016-08-31 09:42:33 -07:00
Ralph Castain	cfa784c9a6	Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking	2016-08-29 20:22:24 -07:00
rhc54	b90a64e734	Merge pull request #2022 from rhc54/topic/nnodes Provide the number of nodes in the job	2016-08-26 18:15:24 -05:00
Ralph Castain	2f6e0fec90	Provide the number of nodes in the job	2016-08-26 14:50:41 -07:00
Jeff Squyres	e03a40a0e9	pmix3x: remove generated file Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-08-26 10:30:47 -07:00
Gilles Gouaillardet	e4bf915e75	pmix3x: remove auto-generated file remove opal/mca/pmix/pmix3x/pmix/src/include/pmix_config.h.in .gitignore is correct, so it seems this file was added before .gitignore was updated	2016-08-26 15:00:18 +09:00
Ralph Castain	af67f16422	Update configury to support multiple PMIx versions, rename pmix2x component to pmix3x for support of PMIx master Update support for external v1.1.x and v2.x libraries. Minor corrections to the v3.x component	2016-08-25 18:19:05 -07:00
Gilles Gouaillardet	02847d9e7b	pmix2x: dstore: add missing <fcntl.h> include file in pmix_esh.c (back-ported from upstream pmix/master@5c66ffe0f0)	2016-08-24 11:18:46 +09:00

... 2 3 4 5 6 ...

641 Коммитов