Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touch up a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
Update external as well
Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
Rather than have a stub function for the pmix fence_nb
operation, just set it to NULL. This causes fewer problems.
Fixes #1597
Fixes #1527
Signed-off-by: hppritcha <howardp@lanl.gov>
Define OPAL_MAXHOSTNAMELEN to be either:
(MAXHOSTNAMELEN + 1) or
(limits.h:HOST_NAME_MAX + 1) or
(255 + 1)
For pmix code, define above using PMIX_MAXHOSTNAMELEN.
Fixup opal layer to use the new max.
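A minimal sketch of the intended fallback chain (illustrative; the actual
header text may differ):

    #include <limits.h>    /* may provide HOST_NAME_MAX */

    #if defined(MAXHOSTNAMELEN)
    #define OPAL_MAXHOSTNAMELEN (MAXHOSTNAMELEN + 1)
    #elif defined(HOST_NAME_MAX)
    #define OPAL_MAXHOSTNAMELEN (HOST_NAME_MAX + 1)
    #else
    /* last resort: POSIX hostnames fit in 255 bytes */
    #define OPAL_MAXHOSTNAMELEN (255 + 1)
    #endif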
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
https://github.com/pmix/master/pull/71
Have OMPI's current version of pmix120 fail gracefully when the
sun_path is too long (longer than 108 chars, or 103 chars on OS X),
and have OMPI return proper error messages with hints on how to
fix the problem.
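A hedged sketch of the kind of check involved (variable names
hypothetical; the real code lives in the rendezvous-socket setup):

    #include <string.h>
    #include <sys/un.h>

    struct sockaddr_un addr;
    /* sun_path is a fixed-size array (108 bytes on Linux, 104 on OS X);
       the rendezvous file name plus NUL terminator must fit */
    if (strlen(rendezvous_file) >= sizeof(addr.sun_path)) {
        /* bail out with a helpful error instead of silently truncating */
        return OPAL_ERR_BAD_PARAM;
    }
    strcpy(addr.sun_path, rendezvous_file);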
The reference counting was broken which led PMIx_Finalize
to release resources early. This fixes the "use after free" scenarios
that I encountered.
(based on commit pmix/master@abfaa4c)
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.
* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default
* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons
* ensure orterun doesn't propagate any ess or pmix directives in its environment
* Cleanup a few valgrind issues and memory leaks
* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)
* Ensure the schizo/alps component detects launch by mpirun
Update the configure logic for the new pmix120 component
ckpt
Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates
Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works
Cleanup the rename files to use the pretty macros
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes #1204 (replaces it)
Fixes #1064
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes
Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.
Update the PMIx code and continue attempting to debug direct modex
Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.
Update PMIx to v1.1.2
There was a bug in the way the cray pmix component
was setting the locality property for ranks on the
same node.
Improve location/syntax of a comment block.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
to continue current default behavior.
Also add an MCA param pmix_base_collect_data to direct that the blocking
fence shall return all data to each process. Obviously, this param has no
effect if async_modex is used.
If CFLAGS and/or CPPFLAGS are passed on the ompi configure command line, the
pmix1xx configure will not use the correct ones previously passed in the environment.
See the discussion started at http://www.open-mpi.org/community/lists/devel/2015/10/18159.php
Thanks to Siegmar Gross for bringing this to our attention
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.
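A hedged call-site sketch under the new signature (surrounding names
illustrative):

    int priority = 0;
    /* the winning component's priority comes back through the new,
       optional trailing argument */
    ret = mca_base_select("pmix", opal_pmix_base_framework.framework_output,
                          &opal_pmix_base_framework.framework_components,
                          (mca_base_module_t **) &best_module,
                          (mca_base_component_t **) &best_component,
                          &priority);  /* or NULL to ignore the priority */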
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Looks like in ess_pmi_module.c a u32 is being used
to retrieve OPAL_PMIX_LOCAL_SIZE, while the s1/s2/cray
pmix components were storing it as a u16.
This commit fixes the mismatch.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Add more stubs to reduce likelihood of future
mysterious segfaults if some of the newer pmix
funcs start to get used within ompi.
Add a get_version to return the version of the
Cray PMI library being used, since the Cray PMI
library actually has a function to get that info.
Be more accurate about which functions have a hope
of being implemented using Cray PMI and those which
never will.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
* update configury to silence ident messages (thanks Gilles!)
* fix for warnings Jeff saw when get didn't find the requested data
* fix for Mac OSX operations
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
So add a brief timer event to kick us out of the communication. The precise amount of time we should wait is somewhat TBD, but set something short for now and we can adjust.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way through retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.
We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.
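A hedged sketch of the centralized listener idea (APIs simplified; not
the actual orte listener code):

    #include <pthread.h>
    #include <sys/socket.h>

    typedef struct {
        int listen_fd;              /* socket already bound and listening */
        void (*handler)(int fd);    /* registered by each interested component */
    } listener_t;

    static void *listen_thread(void *arg)
    {
        listener_t *l = (listener_t *) arg;
        while (1) {
            int fd = accept(l->listen_fd, NULL, NULL);
            if (fd < 0) {
                continue;           /* interrupted or transient error */
            }
            /* hand the new connection off for event-driven processing;
               the real code pushes it into the event library */
            l->handler(fd);
        }
        return NULL;
    }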
This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
CID 1269707 Logically dead code (DEADCODE)
Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269730 Dereference after null check (FORWARD_NULL)
The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).
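The fix follows this shape (callback signature hypothetical):

    if (NULL != cb) {
        if (NULL != cb->cbfunc) {
            cb->cbfunc(status, cb->cbdata);
        }
        /* the release is now inside the same NULL check */
        OBJ_RELEASE(cb);
    }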
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Code for setting proc node locality
was absent after the removal of Cray
PMI KVS usage. This commit puts that
functionality back in place.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
A few uninitialized common symbols are remaining:
common symbols generated by flex :
* opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng
* opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext
* opal/util/show_help_lex.l: opal_show_help_yyleng
* opal/util/show_help_lex.l: opal_show_help_yytext
common symbol generated by "external" hwloc library:
* opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.
All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.
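For illustration, a variable registered as below (names hypothetical) can
now be matched with or without the project prefix, i.e. as both
pmix_base_verbose and opal_pmix_base_verbose:

    static int verbose_level = 0;
    (void) mca_base_var_register("opal", "pmix", "base", "verbose",
                                 "Verbosity of the pmix framework",
                                 MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
                                 OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_LOCAL,
                                 &verbose_level);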
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
Per feedback from rhc, manually set the base_ptr member
of the opal_buffer_t variable to NULL prior to calling
OBJ_RELEASE. A similar feature of opal_dss.load also
exists so likewise reset the base_ptr to NULL prior to
invoking it.
Hopefully the opal_buffer_t struct does not change
frequently.
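A hedged sketch of the pattern (variable names hypothetical):

    opal_buffer_t *buf = OBJ_NEW(opal_buffer_t);
    /* ... ownership of buf->base_ptr's memory is handed off ... */

    /* prevent the destructor from freeing memory we no longer own */
    buf->base_ptr = NULL;
    OBJ_RELEASE(buf);

    /* likewise, opal_dss.load releases any existing base_ptr, so reset
       it first when the memory has been given away */
    buf2->base_ptr = NULL;
    opal_dss.load(buf2, payload, num_bytes);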
Minor cleanups to reduce output when the pmix_base_verbose
MCA parameter is set.
Remove use of the Cray PMI KVS - which is designed for a lightweight
MPI that exchanges only a minimal amount of connection info
(about 128 bytes per rank) - within cray/pmix. Use Cray PMI
collective extensions instead.
This is the first of several steps to accelerate launch of
Open MPI on Cray systems using either native aprun or nativized
slurm.
Per open-mpi/ompi#381, convert the specific initialization of opal_pmix
to use the generic "= { 0 }" initializer. This form can be used to
initialize any type when the intent is just to zero out / assign
*some* value.
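For instance (type name as used in the opal pmix base; treat as
illustrative):

    /* zero out every member without enumerating them */
    opal_pmix_base_module_t opal_pmix = { 0 };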
Use a more reliable way to tell if a process is
1) in a Cray PAGG, and
2) actually considered an application process on
a compute node (not, for example, a process in a PAGG
on a mom node).
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
Per discussions with pmix folks, it was determined that
the way the cray pmi pmix component was computing the
PMIX_NODE_RANK attribute for a process was incorrect.
This commit fixes the problem.
This commit was SVN r32810.
The ess pmi module was not handling aprun-launched
daemons. All daemons were thinking they were vpid 1.
Also, it turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.
Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.
Have the ESS pmi component open the pmix framework and
select a pmix component.
This commit was SVN r32773.
The question of whether or not the openib BTL supports loopback is a separate question. It may be more appropriate to make the modex be PMIX_GLOBAL for cases where openib can support loopback so someone can run without a shared memory component. I'll leave that decision to the IB vendors.
This commit was SVN r32702.
opal_process_name_t is a uint64_t, which is not equivalent to
an unsigned long on 32-bit systems.
This is now parsed as an unsigned long long.
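A minimal sketch of the portable parse (input handling simplified):

    #include <stdint.h>
    #include <stdio.h>

    uint64_t name;
    unsigned long long tmp;
    /* "%lu" would truncate on 32-bit systems where long is 32 bits */
    if (1 == sscanf(str, "%llu", &tmp)) {
        name = (uint64_t) tmp;
    }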
This commit was SVN r32592.
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs (see the sketch after this list). This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
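A hedged sketch of the signature idea from the grpcomm rework above
(struct shape approximate, not the exact definition):

    /* a collective is identified by the list of participants,
       not by a globally agreed-upon id */
    typedef struct {
        opal_process_name_t *signature;  /* names of participating procs */
        size_t sz;                       /* number of participants */
    } grpcomm_signature_t;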
This commit was SVN r32570.