gcc 7.[1,2] (at least) fails to correctly parse the OSX 10.13 sys/syslog.h
header. As a result we need to protect syslog support in OPAL, PMIX and
ORTE.
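A minimal sketch of the kind of guard involved (the macro and function names below are placeholders, not the exact ones used in the tree):
```
#include <stdio.h>

/* Only touch <syslog.h>/syslog() when configure decided the header is
 * actually usable; EXAMPLE_HAVE_USABLE_SYSLOG is an illustrative name. */
#if defined(EXAMPLE_HAVE_USABLE_SYSLOG) && EXAMPLE_HAVE_USABLE_SYSLOG
#include <syslog.h>

static void example_log(const char *msg)
{
    openlog("opal-example", LOG_PID, LOG_USER);
    syslog(LOG_INFO, "%s", msg);
    closelog();
}
#else
static void example_log(const char *msg)
{
    /* syslog header unusable (e.g. gcc 7.x vs. the macOS 10.13 header):
     * fall back to stderr */
    fprintf(stderr, "%s\n", msg);
}
#endif

int main(void)
{
    example_log("hello");
    return 0;
}
```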
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Turns out the UCX PML calls opal_pmix.fence in its del_procs
method without checking whether or not the fence method
for the pmix component was defined. Rather than patch the
UCX PML, actually define a fence method for the cray pmix component.
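For illustration only, a self-contained sketch of the underlying hazard and the shape of the fix; the struct and function names are placeholders, not the real opal_pmix module types:
```
#include <stddef.h>

/* Placeholder module struct; the real type is the opal pmix module, whose
 * optional fence entry the cray component previously left unset. */
typedef struct {
    int (*fence)(void *procs, int collect_data);
} example_pmix_module_t;

static int example_fence(void *procs, int collect_data)
{
    (void)procs;
    (void)collect_data;
    return 0;   /* the real fix supplies an actual fence implementation here */
}

/* With the slot filled in, callers such as the UCX PML del_procs path can
 * invoke module.fence() without first testing the pointer for NULL. */
static example_pmix_module_t example_module = { .fence = example_fence };

int main(void)
{
    return example_module.fence(NULL, 0);
}
```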
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Debounce "unreachable" notifications for tools when they disconnect
Enable the -x cmd line option for prun
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 0a5b36180a22959654461ac1303cec35313f8b4a)
Remove some build product. Tell PMIx that we don't need a new nspace generated when OMPI calls connect
Add missing Makefile
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
purpose. Continue to link the new library back to libopen-pal to resolve the renamed symbols.
Update opal configure logic to set disable_dlopen when disable_mca_dso is given. Fix typos in disable_dlopen when setting variables (incorrect inclusion of quotes)
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This fixes a hang with immediate PMIx requests. PMIx v1.2 does not support
the info key `PMIX_IMMEDIATE`, which leads to a hang. For such requests the
fix uses the key `PMIX_OPTIONAL` instead, so the request does not go to the server.
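A sketch of the idea using the standard PMIx_Get/PMIX_INFO_LOAD calls; the key string and function wrapper are illustrative, and the real change goes through the opal/pmix glue rather than calling PMIx directly:
```
#include <stdbool.h>
#include <pmix.h>

/* Mark the request optional so a PMIx v1.2 server (which lacks
 * PMIX_IMMEDIATE support) is never consulted and therefore cannot hang us. */
static pmix_status_t example_optional_get(const pmix_proc_t *proc,
                                          pmix_value_t **val)
{
    pmix_info_t info;
    bool optional = true;

    PMIX_INFO_LOAD(&info, PMIX_OPTIONAL, &optional, PMIX_BOOL);
    return PMIx_Get(proc, "example.key", &info, 1, val);
}
```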
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
Still in the "needs to be done" category:
* mapping/ranking/binding options aren't correctly supported
* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Unlike "orterun", "prun" is a PMIx-only program that discovers the DVM connection instead of requiring that we explicitly provide it. Only build "prun" if PMIx v2.x is available.
This gets the DVM working again, but it still shows problems for multiple executions. I'll detail those in a separate issue. Thus, the DVM should still be considered "broken".
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
* Resolves #3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la
```
Note: The changes in this commit were automated by the
`libadd_mca_comp_update.py` script in the adjacent commit.
Some components were not included in this change because
they are only built statically.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
All symbols that need to be accessed from an MCA component must be marked
explicitly as visible using PMIX_EXPORT. This patch allows the current trunk
to almost work on OS X. More on the devel mailing list.
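As a standalone illustration of the visibility issue (placeholder macro and symbol names; the real code uses PMIX_EXPORT from the PMIx headers):
```
#include <stdio.h>

/* When a library is built with -fvisibility=hidden, any symbol a dlopen'ed
 * MCA component needs must be decorated back to default visibility - which is
 * essentially what PMIX_EXPORT does. */
#if defined(__GNUC__)
#define EXAMPLE_EXPORT __attribute__((visibility("default")))
#else
#define EXAMPLE_EXPORT
#endif

EXAMPLE_EXPORT int example_visible_counter = 0;

EXAMPLE_EXPORT void example_visible_bump(void)
{
    printf("counter=%d\n", ++example_visible_counter);
}
```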
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
passed to make it all flow through the opal/pmix "put/get" operations. Update the PMIx code to the latest master to pick up some required behaviors.
Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.
Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing will be beneficial at large scales. Leave it "off" by default.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Recent changes broke native launch on Cray using srun or aprun, and also
broke native launch using PMI2.
This commit fixes the problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Passed the below set of symbols into a script that added an ompi_ prefix to them all.
Note that when processing a symbol named "foo" the script turns
foo into ompi_foo
but doesn't turn
foobar into ompi_foobar
Beyond that, though, the script is blind to C syntax, so it hits strings and
comments etc. as well as vars/functions.
coll_base_comm_get_reqs
comm_allgather_pml
comm_allreduce_pml
comm_bcast_pml
fcoll_base_coll_allgather_array
fcoll_base_coll_allgatherv_array
fcoll_base_coll_bcast_array
fcoll_base_coll_gather_array
fcoll_base_coll_gatherv_array
fcoll_base_coll_scatterv_array
fcoll_base_sort_iovec
mpit_big_lock
mpit_init_count
mpit_lock
mpit_unlock
netpatterns_base_err
netpatterns_base_verbose
netpatterns_cleanup_narray_knomial_tree
netpatterns_cleanup_recursive_doubling_tree_node
netpatterns_cleanup_recursive_knomial_allgather_tree_node
netpatterns_cleanup_recursive_knomial_tree_node
netpatterns_init
netpatterns_register_mca_params
netpatterns_setup_multinomial_tree
netpatterns_setup_narray_knomial_tree
netpatterns_setup_narray_tree
netpatterns_setup_narray_tree_contigous_ranks
netpatterns_setup_recursive_doubling_n_tree_node
netpatterns_setup_recursive_doubling_tree_node
netpatterns_setup_recursive_knomial_allgather_tree_node
netpatterns_setup_recursive_knomial_tree_node
pml_v_output_close
pml_v_output_open
intercept_extra_state_t
odls_base_default_wait_local_proc
_event_debug_mode_on
_evthread_cond_fns
_evthread_id_fn
_evthread_lock_debugging_enabled
_evthread_lock_fns
cmd_line_option_t
cmd_line_param_t
crs_base_self_checkpoint_fn
crs_base_self_continue_fn
crs_base_self_restart_fn
event_enable_debug_output
event_global_current_base_
event_module_include
eventops
sync_wait_mt
trigger_user_inc_callback
var_type_names
var_type_sizes
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Remove the opal_ignore from the RML/OFI component, but disable that component unless the user specifically requests it via the "rml_ofi_desired=1" MCA param. This will let us test compile in various environments without interfering with operations while we continue to debug.
Fix an error when computing the number of infos during server init
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This now passes the loop test, and so we believe it resolves the random hangs in finalize.
Changes in PMIx master that are included here:
* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable
Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Parts of the pmix2x component called the event_* functions directly
instead of the opal_event_* wrappers. This is fine as long as we are
using libevent but becomes a problem with other event libraries.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).
Remove unneeded test
Fix memory corruption by re-initializing variable to NULL in loop
Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.
Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.
Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.
Have the mindist module add procs to the job's proc array as it is a fully described module
Protect the hnp-not-in-allocation case
Per the approach suggested by Gilles, protect the HNP node when it gets added in the absence of any other allocation or hostfile
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kick off the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called.
Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always out-perform the time spent doing a modex. However, this provides a decent compromise in the interim.
This PR changes the default settings of a few relevant params to make "background modex" the default behavior:
* pmix_base_async_modex -> defaults to true
* pmix_base_collect_data -> continues to default to true (no change)
* async_mpi_init - defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything.
The logic in MPI_Init is (a rough sketch of this decision tree follows the list):
* if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation
* if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed
* if async_modex is not set, then we block until the fence completes (regardless of collecting data or not)
* if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation.
* if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case.
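The sketch below is self-contained and only mirrors the decision tree above; the variable names are stand-ins, not the actual MCA params or ompi_mpi_init() code:
```
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins for the relevant settings and capability checks. */
static bool async_modex   = true;   /* pmix_base_async_modex  */
static bool collect_data  = true;   /* pmix_base_collect_data */
static bool have_nb_fence = true;   /* non-blocking fence available (PMIx)? */
static bool barrier_req   = false;  /* user asked for a barrier at end of MPI_Init */

int main(void)
{
    bool background = false;

    if (!have_nb_fence) {
        puts("perform the full blocking modex");   /* e.g., not using PMIx */
    } else if (async_modex && collect_data) {
        puts("kick off the background modex");     /* trap modex_recv until done */
        background = true;
    } else if (async_modex) {
        puts("skip the modex entirely");           /* no fence performed */
    } else {
        puts("block until the fence completes");
    }

    if (background && barrier_req) {
        /* if the modex already finished, run the barrier; otherwise waiting
         * for the modex doubles as the barrier (never two barriers) */
        puts("check modex completion before the extra barrier");
    }
    return 0;
}
```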
HTH
Ralph
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap
Get the DVM running again
Fix direct modex by eliminating race condition caused by releasing data while sending it
Up the size limit before compressing
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Cleanup a race condition segfault during finalize by ensuring the PMIx progress thread is stopped prior to starting to tear down the messaging components
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Do not use opal_output_verbose inside O(n) loops. This was causing us
to make O(n) calls to snprintf which was greatly slowing launch at
scale.
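One common pattern for this, sketched under the assumption that the verbosity check is hoisted out of the loop via opal_output_get_verbosity(); the exact change in this commit may differ, and the stream id, level, and message are illustrative:
```
#include "opal/util/output.h"

/* Fetch the verbosity once and bail out before doing any per-iteration
 * formatting work. */
static void example_report(int output_id, int n, char **names)
{
    if (opal_output_get_verbosity(output_id) < 10) {
        return;
    }
    for (int i = 0; i < n; ++i) {
        opal_output_verbose(10, output_id, "proc %d: %s", i, names[i]);
    }
}
```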
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
It's possible to have zlib.h but still not have zlib support.
Use the correct macro to protect calls to zlib functions.
This fixes 32-bit MTT builds at Cisco (e.g.,
https://mtt.open-mpi.org/index.php?do_redir=2389).
Submitted upstream to PMIX: https://github.com/pmix/master/pull/290
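The shape of the fix, sketched with a placeholder macro name (the real code uses the configure-generated "zlib is usable" macro rather than merely HAVE_ZLIB_H):
```
#include <stddef.h>

/* EXAMPLE_HAVE_ZLIB is a placeholder for the configure result that zlib is
 * actually usable; the header being present is not sufficient. */
#if defined(EXAMPLE_HAVE_ZLIB) && EXAMPLE_HAVE_ZLIB
#include <zlib.h>

static int example_compress(const unsigned char *in, size_t inlen,
                            unsigned char *out, size_t *outlen)
{
    uLongf destlen = (uLongf)*outlen;
    int rc = compress(out, &destlen, in, (uLong)inlen);
    *outlen = (size_t)destlen;
    return (Z_OK == rc) ? 0 : -1;
}
#else
/* zlib not usable: send the payload uncompressed */
#endif
```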
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>