Commit graph

644 commits

Author SHA1 Message Date
Ralph Castain
673f82e2b6 Update the PMIx listener to avoid leaking sockets into children, and better handle race condition errors 2016-07-03 08:23:33 -07:00
Ralph Castain
6e434d6785 Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information
Update to match PMIx RFC

Fix configury to point to correct libevent and hwloc locations
2016-06-29 19:19:19 -07:00
Ralph Castain
08b1438f15 Add missing PMIx range value so OPAL and PMIx align again 2016-06-22 22:03:25 -07:00
Gilles Gouaillardet
bf133c401e pmix2x: fix a typo in dereg_event_hdlr()
This bug has been fixed when open-mpi/ompi@dde69e1be2 was backported into upstream pmix in pmix/master@5e5577778c
but it was not fixed in open-mpi/ompi
2016-06-22 13:45:29 +09:00
Ralph Castain
441739b5a4 Cleanup a lagging message that generates an annoying (but seemingly harmless) warning 2016-06-20 12:23:27 -07:00
Ralph Castain
0ba02821e6 Add requested key and job-level info 2016-06-19 18:22:31 -07:00
Ralph Castain
0a29f5cb77 Sigh - missed two typos 2016-06-18 20:57:53 -07:00
Ralph Castain
dd38cf1fed Fix typo 2016-06-18 20:56:43 -07:00
Ralph Castain
dde69e1be2 Cleanup CIDs 1362763, 1362762, 1362760, 1362759, 1362758, 1362757, 1362756, 1362755, 1362754. Unsure how to resolve 1362761.
Fixes #1792
2016-06-18 12:28:46 -07:00
Ralph Castain
044c561cba Roll to latest PMIx master 2016-06-16 17:30:30 -07:00
Ralph Castain
5d330d5220 Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fall back to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
Add PMIx 2.0

Remove PMIx 1.1.4

Cleanup copying of component

Add missing file

Touchup a typo in the Makefile.am

Update the pmix ext114 component

Minor cleanups and resync to master

Update to latest PMIx 2.x

Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00
Ralph Castain
d58da99dbc Shift to memcpy to avoid Solaris issues 2016-06-09 12:07:17 -07:00
Ralph Castain
8fa935534b Abstract the strnlen function for environments that do not have it (e.g., Solaris 10) 2016-06-08 10:12:43 -07:00
Gilles Gouaillardet
b707d138fe pmix114/pmix1_client: fix misc memory leaks
Fixes CID 1325146-1325149
2016-06-06 09:33:35 +09:00
Ralph Castain
ecea1e3bb5 Update to 1.1.4rc3 2016-06-01 20:56:07 -07:00
Ralph Castain
12ecf972af Split the pmix external component into one for the 1.1.4 release, and another for the upcoming 2.0 release. Clean up the configury so the components look for a series-specific function instead of running a program.
NOTE: the changes for the 2.0 series are not yet in the PMIx master.
2016-06-01 14:15:24 -07:00
Ralph Castain
55923eacd3 Stealing some pieces of Josh Hursey's PR #1583 and modifying a bit, allow the opal/pmix external component to handle both PMIx 1.1.4 and PMIx 2.0 versions. Automatically detect the version of the target external library and adjust the only two APIs that changed (PMIx_Init and PMIx_Finalize)
Rename temp vars in .m4 to avoid conflict with Travis
2016-05-27 08:06:31 -07:00
Artem Polyakov
725eea2819 Fix base64 implementation in pmix framework.
In the commit 80f07b65f1 setting of '-' marker used
as the string termination sign was moved from base64 code:
from: 80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67L491)
to: 80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67R189)

However the decoding function wasn't fixed and still expects one extra
byte at the end of the encoded string which leads to data truncation
during extraction (was noticed on standalone code that was using base64
from OMPI).
2016-05-23 23:30:31 +06:00
Gilles Gouaillardet
8466a3daf3 pmix: update .gitignore
git ignore opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in
git rm opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in
git ignore opal/mca/pmix/pmix*/...
2016-05-23 11:58:07 +09:00
Gilles Gouaillardet
cbbdce05b1 pmix/pmix114: silence a warning 2016-05-20 09:35:26 +09:00
Ralph Castain
6f743f81b6 Update PMIx 114 to current release candidate 2016-05-19 12:55:05 -07:00
rhc54
8b534e9897 Merge pull request #1668 from rhc54/topic/slurm
When direct launching applications, we must allow the MPI layer to pr…
2016-05-16 12:23:19 -07:00
Howard Pritchard
1a676e5b35 pmix/cray: fix some breakage
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-05-16 12:45:05 -05:00
Ralph Castain
01ba861f2a When direct launching applications, we must allow the MPI layer to progress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread spin in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization.
Update external as well

Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
2016-05-14 16:37:00 -07:00
Ralph Castain
7767882346 Per user request, add some missing data and definitions:
OPAL_PMIX_UNIV_RANK - synonym for OPAL_PMIX_GLOBAL_RANK
OPAL_PMIX_APP_SIZE - #ranks in the application of this proc
2016-05-09 08:39:01 -07:00
Ralph Castain
8ec1891d11 Silence warning 2016-05-05 20:04:10 -07:00
Ralph Castain
08022d7af1 Some minor cleanups of warnings from gcc 6.0.0. Update s1/s2 pmix to get max_procs as required. 2016-05-05 15:28:13 -07:00
rhc54
648043597a Merge pull request #1612 from ggouaillardet/poc/pmix_external_configury
pmix/external: revamp external pmix package detection
2016-05-02 09:46:05 -07:00
Jeff Squyres
265e5b9795 Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1
ompi/opal/orte/oshmem/test: max hostname length cleanup
2016-05-02 09:44:18 -04:00
Gilles Gouaillardet
45f9a47d77 pmix/external: fix typo and silence a warning 2016-05-02 17:15:52 +09:00
Gilles Gouaillardet
08d91b9a03 pmix/external: revamp external pmix package detection 2016-05-02 16:23:31 +09:00
Ralph Castain
42d9d861fc Fix minor typo in PMIx packing of pmix_app_t - thanks to Gilles for pointing it out 2016-04-29 08:55:46 -07:00
Howard Pritchard
f52dd511d4 Merge pull request #1600 from hppritcha/topic/pmix_fix_for_finalize
pmix/cray: set fence_nb to NULL
2016-04-28 13:50:15 -06:00
hppritcha
aa1d7b9c50 pmix/cray: set fence_nb to NULL
Rather than have a stub function for the pmix fence_nb
operation, just set to NULL.  Causes fewer problems.

Fixes #1597
Fixes #1527

Signed-off-by: hppritcha <howardp@lanl.gov>
2016-04-28 13:48:54 -05:00
Ralph Castain
02876564d4 Silence warning of zero-byte malloc 2016-04-26 11:55:59 -07:00
Karol Mroz
e1c64e6e59 opal: standardize on max hostname length
Define OPAL_MAXHOSTNAMELEN to be either:
  (MAXHOSTNAMELEN + 1) or
  (limits.h:HOST_NAME_MAX + 1) or
  (255 + 1)

For pmix code, define above using PMIX_MAXHOSTNAMELEN.

Fixup opal layer to use the new max.

Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-24 08:19:47 +02:00
Gilles Gouaillardet
d96919638f pmix: remove autogenerated file and update .gitignore
removed: opal/mca/pmix/pmix114/pmix/src/include/private/autogen/config.h.in
2016-04-18 12:57:41 +09:00
Ralph Castain
b009e58d25 Roll to PMIx 1.1.4rc2 - replaces some code that was incorrectly removed in prior update 2016-04-16 18:24:24 -07:00
Ralph Castain
8ff114e668 Update to official PMIx 1.1.4rc1 2016-04-15 21:47:46 -07:00
Ralph Castain
449ec41532 Roll to PMIx 1.1.4rc1 and remove the PMIx 1.2.0 directory as the community has decided to not do that release version. This incorporates a number of bug fixes that have been identified and repaired in the PMIx and OMPI code bases. Also includes several minor corrections to the PMIx code so it now supports run-thru without hanging on collectives involving a process that exits 2016-04-15 10:11:11 -07:00
Ralph Castain
2432daf065 Some minor cleanups of a memory leak and error output 2016-04-08 07:46:18 -07:00
Rainer Keller
52080a5736 As per the pull request to pmix/master:
https://github.com/pmix/master/pull/71

Have OMPI's current version of pmix120 nicely fail in case of
too long sun_path (longer than 108 or in case of OSX 103 chars).
And have OMPI return proper error messages with hints how to
amend.
2016-04-07 22:12:53 +02:00
Gilles Gouaillardet
6f450630d8 pmix/external: fix misc missing conversion and type issues 2016-04-04 10:12:34 +09:00
Gilles Gouaillardet
2ede47c462 pmix: fix misc missing conversion and type issues 2016-04-04 10:12:34 +09:00
Nysal Jan K.A
75233573d1 pmix: Increment the reference count in PMIx_Init
The reference counting was broken which led PMIx_Finalize
to release resources early. This fixes the "use after free" scenarios
that I encountered.

(based on commit pmix/master@abfaa4c)
2016-03-27 04:11:25 -04:00
Josh Hursey
099170bb31 Merge pull request #1496 from jjhursey/topic/pmix120-obj-patch
pmix/pmix120: Fix OBJ_ to PMIX_ symbol name
2016-03-25 19:35:43 -05:00
Ralph Castain
0b4310b186 Remove an unnecessary header that forced exposure of the PMIx internal headers 2016-03-25 16:57:41 -07:00
Ralph Castain
e8246e079b Minor cleanup to match the changes in the PMIx master 2016-03-25 15:12:41 -07:00
Joshua Hursey
8ebeaa5861 pmix/pmix120: Fix OBJ_ to PMIX_ symbol name 2016-03-25 16:17:08 -05:00
Ralph Castain
af1444b6e1 Cleanup a debug statement. Plug a memory leak 2016-03-08 18:27:55 -08:00
Ralph Castain
bac6290b22 Ensure the process name is positive when using direct launch
Fixes #1425
2016-03-08 08:31:05 -08:00
Ralph Castain
b57a191ccc Update the external client to the new PMIx init/finalize signatures 2016-03-03 20:50:20 -08:00
Ralph Castain
4a55fba414 Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration. 2016-03-02 15:01:01 -08:00
Ralph Castain
011403c04a Fix a number of issues, some of which have lingered for a long time:
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.

* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default

* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons

* ensure orterun doesn't propagate any ess or pmix directives in its environment

* Cleanup a few valgrind issues and memory leaks

* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)

* Ensure the schizo/alps component detects launch by mpirun
2016-03-01 06:53:00 -08:00
Ralph Castain
d28d3ee901 Make the error message on external pmix library a little clearer by separating out the libevent from the libhwloc checks 2016-02-24 11:20:25 -06:00
Ralph Castain
d653cf2847 Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM. 2016-02-21 11:55:49 -08:00
Ralph Castain
8c92a179c0 Minor memory leak 2016-02-19 15:05:39 -08:00
Ralph Castain
6e68d758b9 Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files).
ck

Remove stale debug

Fix a segfault if no subscribers are present
2016-02-18 16:30:37 -08:00
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes #1225
2016-02-18 09:29:12 -08:00
rhc54
2745610eb7 Merge pull request #1377 from rhc54/topic/pmix
Plug a leak in the PMIx subsystem
2016-02-17 20:05:45 -08:00
Ralph Castain
efb0eff43e Plug a leak in the PMIx subsystem 2016-02-17 19:00:36 -08:00
Ralph Castain
8f9508cace Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP. 2016-02-17 13:33:06 -08:00
Gilles Gouaillardet
f5a53b5f1e pmix: fix Makefile.am to correctly exclude autogenerated file from tarball
(back-ported from pmix/master@73daf58ee5)
2016-01-28 11:42:03 +09:00
Gilles Gouaillardet
15e26da1e1 pmix configury: add missing PMIX_CHECK_ICC_VARARGS function
Thanks Paul Hargrove for the report

(back-ported from pmix/master@7b16e914bf)
2016-01-26 10:57:16 +09:00
rhc54
b172b8599b Merge pull request #1285 from ggouaillardet/topic/pmix_dist_fix
pmix: do not include automatically generated include/private/autogen/…
2016-01-16 20:49:41 -08:00
Gilles Gouaillardet
1d38430e43 opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid 2016-01-14 10:39:03 +09:00
Gilles Gouaillardet
955fe85cb6 pmix/pmix120: add missing include file 2016-01-12 11:35:32 +09:00
Gilles Gouaillardet
73daf58ee5 pmix: do not include automatically generated include/private/autogen/config.h into dist tarball
Thanks Siegmar Gross for the initial report of this issue
2016-01-08 13:18:15 +09:00
Nysal Jan K.A
13f9bb9202 Use PMI2 constants for consistency 2016-01-07 11:46:22 +05:30
Jeff Squyres
e4bdad09c1 pmix: remove extra wrapper LIBS
These extra libs are now no longer necessary.

Fixes open-mpi/ompi#1281.
2016-01-05 12:09:53 -08:00
Ralph Castain
0a6b8d2c14 Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system 2015-12-30 07:16:43 -08:00
Ralph Castain
a04f1cd643 Silence some Coverity warnings 2015-12-29 20:37:25 -08:00
Gilles Gouaillardet
0ca1ee5156 configury: misc pmix120 fixes 2015-12-28 23:17:41 +09:00
Gilles Gouaillardet
3300d7cc00 pmix: rename pmix_munge_module 2015-12-28 23:16:27 +09:00
Ralph Castain
a5b95a0939 Continue work on error notification system 2015-12-28 23:15:59 +09:00
Ralph Castain
810f2446b7 Add pmix120 component, update the error handling functions in the PMIx API.
Update the configure logic for the new pmix120 component

ckpt

Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates

Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works

Cleanup the rename files to use the pretty macros
2015-12-28 23:15:44 +09:00
Gilles Gouaillardet
c757c5c612 pmix/external: Fix error handler usage 2015-12-28 23:15:17 +09:00
Gilles Gouaillardet
1157329732 configury: misc pmix112 fixes 2015-12-28 23:15:16 +09:00
Gilles Gouaillardet
d416c7fd8a pmix/external: no more circular dependencies if not building shared DSO 2015-12-28 23:14:03 +09:00
Gilles Gouaillardet
f0e3e16f49 pmix/base: add missing #include <unistd.h>
Thanks Marco Atzeri for contributing the original patch
2015-12-24 14:41:52 +09:00
rhc54
978c54880d Merge pull request #1238 from rhc54/topic/cleanup
Cleanup warnings in opal and orte layers when building optimized on Mac
2015-12-17 09:37:48 -08:00
Ralph Castain
64b695669a Cleanup warnings in opal and orte layers when building optimized on Mac 2015-12-17 07:51:24 -08:00
Gilles Gouaillardet
75d16cfb27 Fix a few places where opal/util/argv.h were required when building pmix components (go figure) 2015-12-17 16:19:25 +09:00
Ralph Castain
3a56f0d34b Create the pmix external component. Fix a few places where opal/util/argv.h were required when building with an external pmix (go figure).
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.

Closes #1204  (replaces it)
Fixes #1064
2015-12-15 15:26:13 -08:00
Jeff Squyres
7977fa3f0b pmix112 config.h.in: remove generated file 2015-12-13 06:46:55 -08:00
Ralph Castain
03eb1a80bf Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes

Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.

Update the PMIx code and continue attempting to debug direct modex

Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.

Update PMIx to v1.1.2
2015-12-12 18:46:38 -08:00
Howard Pritchard
fecb326256 pmix/cray: fix locality bug
There was a bug with the way the cray pmix component
was setting the locality property for ranks on the
same node, etc.

Improve location/syntax of a comment block.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-08 11:13:48 -08:00
Ralph Castain
9803d69d02 Ensure the embedded PMIx respects an OMPI-level --disable-debug 2015-12-01 08:00:24 -08:00
Ralph Castain
52ea538bc1 Per fix from Nysal: set the listener_active flag before starting the progress thread, and declare the flag to be volatile 2015-11-09 09:00:59 -08:00
Ralph Castain
fed28e4cfc Add missing file that was previously ignored 2015-11-06 14:37:09 -08:00
Ralph Castain
5f446570d8 Work on cleaning up memory leaks that are causing orte-dvm to eventually run out of memory. Still don't have everything plugged, but getting better. Sync to the PMIx master that includes removal of the pmix_common.h.in file that really didn't need to be generated, and update to the PMIx_server_init API. 2015-11-06 14:15:30 -08:00
Ralph Castain
bfdf08ae86 Fix intercomm_create by ensuring that both sides know how to translate jobid to/from nspace
Return something just to ensure that pack is happy
2015-11-06 02:19:45 -08:00
Ralph Castain
206e9a011e Add a couple of missing translations to/from PMIx internal and OPAL error constants 2015-10-29 12:33:02 -07:00
Ralph Castain
8ad9b450c4 Silence Coverity warning 2015-10-28 20:10:28 -07:00
George Bosilca
c9d0fffab3 Add a missing include. 2015-10-28 00:50:58 -04:00
Ralph Castain
267ca8fcd3 Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now
to continue current default behavior.

Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_modex is used.
2015-10-27 17:31:56 -07:00
George Bosilca
6c28f114f1 Silence a warning regarding the format str for snprintf. 2015-10-24 15:24:40 -04:00
Jeff Squyres
b43fcb7695 Merge pull request #1028 from ggouaillardet/poc/pmix1xx_configury
pmix1xx configury: invoke sub-configure with CFLAGS and CPPFLAGS on t…
2015-10-24 13:19:33 -04:00
Ralph Castain
4c12022a50 Silence a couple of warnings from valgrind and compilers. Since some pmix components may return success with a NULL value from a "get", check for that situation before attempting to unload the data. Preset the hostname before calling modex_recv to get it so unload properly checks for NULL. Cast a returned value to the correct ompi_proc_t pointer 2015-10-22 20:56:02 -07:00
Gilles Gouaillardet
0221f59197 pmix1xx configury: invoke sub-configure with CFLAGS and CPPFLAGS on the command line
if CFLAGS and/or CPPFLAGS are passed to the ompi configure command line, pmix1xx
configure will not use the correct ones previously passed in the environment
see discussion started at http://www.open-mpi.org/community/lists/devel/2015/10/18159.php
Thanks Siegmar Gross for bringing this to our attention
2015-10-22 10:13:52 +09:00
Nathan Hjelm
9602484568 Merge pull request #1040 from hjelmn/mtl_priority
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
8b5810f7f7 mca/base: add priority output to mca_base_select
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
Ralph Castain
363f62a506 Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem 2015-10-17 20:24:03 -07:00
annu13
cc5e1e26a5 sync with pmix master (repo_rev git69c398e) 2015-10-09 15:17:43 -07:00
annu13
5787e9248f cleaned up debug stmts 2015-10-06 06:25:36 -07:00
annu13
30ba00e05d sync with master 2015-10-06 06:04:54 -07:00
annu13
6f37c0e3e8 sync with PMIX master 2015-10-02 17:25:48 -07:00
annu13
7434c47626 sync with PMIX master 2015-10-02 17:17:48 -07:00
Ralph Castain
8f6855459d Cleanup some coverity warnings 2015-09-30 10:33:53 -07:00
Ralph Castain
ec5d001538 Don't set the return value pointer to NULL as it actually is required to point to real storage - just return an error code if a modex recv doesn't succeed. 2015-09-28 20:45:50 -07:00
Ralph Castain
a4a3dfd480 Cleanup the code a bit by simply adding our nspace to the top of the list of jobid <-> nspace correlations. Add two new APIs to opal_pmix for registering new jobid/nspace pairs and retrieving an nspace given a jobid - these are required to support connect/accept. No impact on the PMIx library. 2015-09-28 08:50:13 -07:00
Ralph Castain
f713e71d51 Minor cleanup - add jobid <-> nspace in one more place 2015-09-27 14:48:39 -07:00
Ralph Castain
fad5638596 Resolve the naming issue when direct-launched by PMIx-enabled RMs using a minimal-impact approach. Detect if we were launched via ORTE - if so, then use our standard methods for computing the jobid. If not, then just hash the nspace to create the jobid, and track the jobid <-> nspace correspondence down in the opal/mca/pmix/pmix1xx component. We then do the translation any time a function that passes process names is invoked. 2015-09-27 09:57:59 -07:00
Ralph Castain
209600fe26 Sync to PMIx master 2015-09-23 21:00:30 -07:00
Ralph Castain
749bd4e6fe Plug a few memory leaks identified by valgrind 2015-09-23 15:21:04 -07:00
Nathan Hjelm
8c4da756cf pmix: do not touch recently freed memory
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-23 08:44:50 -06:00
Ralph Castain
4c654ffd94 Sync to PMIx master 2015-09-21 21:27:06 -07:00
Ralph Castain
1b7930ad52 Silence some warnings and address Coverity issues 2015-09-16 07:58:22 -07:00
Ralph Castain
c1bbbb5e2f Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds 2015-09-15 13:08:35 -07:00
Ralph Castain
22d7c0081a Fix the no-disconnect test by resolving a segfault on free - opal_dss.unload will return the remaining unpacked portion of a buffer. As such, it cannot return the pointer to that info as it might be partway inside of a malloc'd region. So copy the data out of the buffer. 2015-09-11 13:01:35 -07:00
Ralph Castain
dc5796b8a1 Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
Fix the locality computation by correctly computing the vpid of the local peer

This reverts commit open-mpi/ompi@6a8fad49e5.
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5 Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
This reverts commit f94f3cda21.
2015-09-11 02:01:25 -07:00
Ralph Castain
e0a52354d4 Sync to PMIx master at open-mpi/pmix@89680d6663
Includes changes to support BigEndian machines
2015-09-10 20:47:40 -07:00
Ralph Castain
a2a15cea8a Fix the s1 component so direct launch is supported for SLURM 2015-09-10 16:07:37 -07:00
rhc54
3430f154fc Merge pull request #885 from hppritcha/topic/pmix_not_pmix1xx_u16_prob
pmix/~pmix1xx: use u32 for OPAL_PMIX_LOCAL_SIZE
2015-09-10 15:38:54 -07:00
Howard Pritchard
2bbf22e2d0 pmix/~pmix1xx: use u32 for OPAL_PMIX_LOCAL_SIZE
Looks like in ess_pmi_module.c u32 is being used
for retrieving OPAL_PMIX_LOCAL_SIZE, while s1/s2/cray
pmix components were storing as u16.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-10 11:41:39 -07:00
Ralph Castain
f94f3cda21 Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local 2015-09-10 10:25:30 -07:00
Ralph Castain
4c47c498ac Sync to latest PMIx master
Allow the blocking send and recv to keep trying
2015-09-09 11:48:47 -07:00
Gilles Gouaillardet
7f0ed74d24 pmix1xx: fix CPPFLAGS when DSO are not built 2015-09-09 14:20:12 +09:00
rhc54
f6b6b9a9ca Merge pull request #877 from rhc54/topic/s1s2
Cleanup s1 and s2 components
2015-09-08 19:20:59 -07:00
Ralph Castain
1cdb86b8c7 Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components. 2015-09-08 18:37:09 -07:00
rhc54
3a446c9797 Merge pull request #876 from rhc54/topic/hnp
Fix segfault upon job error
2015-09-08 15:10:51 -07:00
rhc54
47f437608d Merge pull request #875 from rhc54/topic/dynamics
Stop a segfault in the test by correctly passing all the argv during spawn
2015-09-08 14:35:42 -07:00
Ralph Castain
459f169e06 Fix segfault upon job error
Silence some unnecessary error-logs
2015-09-08 14:03:06 -07:00
Ralph Castain
ae7156cabb Stop a segfault in the test by correctly passing all the argv during spawn 2015-09-08 13:42:46 -07:00
rhc54
8053357fcc Merge pull request #873 from rhc54/topic/static
Add the libs required for PMIx to support static builds (and trim all excess whitespace)
2015-09-08 11:28:47 -07:00
Ralph Castain
291afe502f Add the libs required for PMIx to support static builds
Remove unneeded CPPFLAGS
2015-09-08 10:21:06 -07:00
Jeff Squyres
bc9e5652ff whitespace: purge whitespace at end of lines
Generated by running "./contrib/whitespace-purge.sh".
2015-09-08 09:47:17 -07:00
Ralph Castain
e6add86e4f Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues. 2015-09-07 09:19:24 -07:00
Ralph Castain
37c3ed68e7 Cleanup connect/disconnect and bring comm_spawn back online! 2015-09-06 10:27:39 -07:00
rhc54
665b30376a Merge pull request #868 from rhc54/topic/hwloc
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 17:58:07 -07:00
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
rhc54
d45ccda813 Merge pull request #866 from rhc54/topic/updatepmix
Update PMIx support
2015-09-04 11:09:36 -07:00
Ralph Castain
f6948c2bb4 Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working 2015-09-04 10:07:17 -07:00
Howard Pritchard
0557beee22 Merge pull request #864 from hppritcha/topic/pmix_cray_more_funcs
pmix/cray: more stubs plus a get_version method
2015-09-03 14:52:46 -06:00
Howard Pritchard
6e7345c790 pmix/cray: more stubs plus a get_version method
Add more stubs to reduce likelihood of future
mysterious segfaults if some of the newer pmix
funcs start to get used within ompi.

Add a get_version to return the version of the
Cray PMI library being used, since the Cray PMI
library actually has a function to get that info.

Be more accurate about which functions have a hope
of being implemented using Cray PMI and those which
never will.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-03 12:51:50 -07:00
Ralph Castain
a772b46c15 Bring the MPI_Publish and friends online 2015-09-02 12:04:07 -07:00
Ralph Castain
95dbd70f44 Sync to PMIx 1.1, sha- 51479b0 2015-09-01 14:09:25 -07:00
rhc54
d8cb3fe705 Merge pull request #852 from rhc54/topic/pmix
Sync to PMIx tarball - includes:
2015-09-01 06:54:34 -07:00
Gilles Gouaillardet
6dfa996760 configury: fix a typo in opal/mca/pmix/pmix1xx/configure.m4 2015-09-01 14:59:07 +09:00
Ralph Castain
c1bbd7bc78 Sync to PMIx tarball - includes:
* update to configury to silence ident messages (thanks Gilles!)
* fix for warnings Jeff saw when get didn't find the requested data
* fix for Mac OSX operations
2015-08-31 21:51:02 -07:00
Ralph Castain
ef69958e01 Only copy the value across if the "get" operation succeeded 2015-08-31 17:11:26 -07:00
Ralph Castain
a3842af709 Sync to PMIx tarball 2015-08-31 07:47:46 -07:00
Ralph Castain
bcabd1e282 Sync with PMIx tarball, bringing across the warning fixes pointed out by Gilles 2015-08-30 21:13:55 -07:00
Gilles Gouaillardet
7e6a213465 pmix: fix compilation error
compilation failed because of missing prototypes when configure'd with --enable-debug --enable-picky on a CentOS 7 box
2015-08-31 10:33:13 +09:00
rhc54
51a8a0f5d7 Merge pull request #842 from rhc54/topic/smfix
Fix shared memory operations by resolving local peers
2015-08-30 14:49:43 -07:00
Ralph Castain
b0d7564400 Sync to PMIx 1.1 - do not check pmix version when making connections 2015-08-30 12:15:30 -07:00
Ralph Castain
38ba54366c Fix shared memory operations by resolving local peers 2015-08-30 12:07:14 -07:00
Ralph Castain
0d5814b5ca Cleanup Coverity issues 2015-08-29 21:19:27 -07:00
Ralph Castain
3cab860a01 Some cleanups - still some errors that impact shared memory operations 2015-08-29 18:11:11 -07:00
Ralph Castain
1d71037139 Update some APIs 2015-08-29 17:26:32 -07:00
Ralph Castain
79827ceaa8 Remove stale directory 2015-08-29 17:15:17 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated
ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Jeff Squyres
d7c25f683e pmix_native: update to the new opal_progress_thread API 2015-08-07 10:13:40 -07:00
Ralph Castain
219c4dfba5 Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one. 2015-07-12 08:23:34 -07:00
Ralph Castain
a2243dcddd Add an opal/errhandler so opal-level errors can be up-leveled 2015-07-11 07:09:11 -07:00
Ralph Castain
861fe1d9dd This is the third time I am fixing this - I have no idea who is resetting it or why. 2015-07-02 08:39:48 -05:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
cc9b416ab3 Ensure we properly commit suicide if/when we lose connection to the daemon. There are multiple paths by which a lost daemon can be reported, and so a race condition exists in the pmix support. Our MPI layer wants the ability to determine the response to the failure, and so it will call down to the RTE with any abort request. This comes down to the pmix layer as a "pmix_abort" command, which involves communicating the request to the daemon - who is gone. Sadly, the pmix component may not know that just yet, and so we hang.
So add a brief timer event to kick us out of the communication. The precise amount of time we should wait is somewhat TBD, but set something short for now and we can adjust.
2015-06-18 09:45:52 -07:00
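The bounded wait described in commit cc9b416ab3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the commit adds a timer event through the event library, whereas this sketch swaps in a plain POSIX poll() timeout to show the same idea. The name wait_for_ack and the one-byte acknowledgement are hypothetical and do not come from the Open MPI code.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function): wait up to timeout_ms
 * for a one-byte acknowledgement on fd.
 * Returns 1 if an ack byte was read, 0 if the wait timed out, and -1 on
 * error or if the peer closed the connection. */
int wait_for_ack(int fd, int timeout_ms)
{
    struct pollfd pfd;
    char ack;
    ssize_t n;
    int rc;

    assert(timeout_ms >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait. Note that each retry restarts
     * the full timeout, which is acceptable for a short, approximate bound. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0) {
        return -1;      /* poll itself failed */
    }
    if (rc == 0) {
        return 0;       /* nothing arrived in time: stop waiting */
    }

    /* Readable or hung up: a zero-byte read means the peer is gone. */
    n = read(fd, &ack, 1);
    return (n == 1) ? 1 : -1;
}
```

A caller that gets 0 or -1 back can go ahead and terminate locally instead of hanging on a peer that will never answer.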
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Nathan Hjelm
7b7993e406 pmix/base: fix coverity issue
CID 1269707 Logically dead code (DEADCODE)

Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 09:02:56 -06:00
Nathan Hjelm
1d27b1f944 pmix/native: fix coverity issue
CID 1269730 Dereference after null check (FORWARD_NULL)

The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-29 08:48:15 -06:00
Howard Pritchard
62a278d29c Merge pull request #590 from hppritcha/topic/coverity_133
pmix/base: fix coverity error
2015-05-18 06:52:37 -06:00
Howard Pritchard
0980423c5f pmix/base: fix coverity error
Remove some obviously dead code and thus fix a coverity
error - CID #133

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-16 13:24:03 -06:00
Howard Pritchard
a1d65cfd8b pmix/cray: fix locality setting
Code for setting proc node locality
was absent after the removal of Cray
PMI KVS usage.  This commit puts that
functionality back in place.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-15 12:17:15 -07:00
Gilles Gouaillardet
c809aace47 initialize common symbols from opal
A few uninitialized common symbols remain:

common symbols generated by flex :
 * opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng
 * opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext
 * opal/util/show_help_lex.l: opal_show_help_yyleng
 * opal/util/show_help_lex.l: opal_show_help_yytext

common symbol generated by "external" hwloc library:
 * opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map
2015-05-08 09:48:51 +09:00
Nathan Hjelm
33181b2543 opal: use C99 subobject naming for component initialization
This commit helps future-proof opal components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
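The by-name initialization that commit 33181b2543 refers to is the C99 designated-initializer syntax. A minimal sketch follows; the struct and its members are hypothetical stand-ins, not the real MCA component type.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical component descriptor; not the real MCA component struct. */
typedef struct {
    int         major_version;
    int         minor_version;
    const char *name;
    int       (*open)(void);
    int       (*close)(void);
} example_component_t;

static int example_open(void)
{
    return 0;
}

/* Each member is set by name, so adding or reordering fields in the struct
 * cannot silently shift a value into the wrong slot. Members that are not
 * named here (close) are zero-initialized. */
const example_component_t example_component = {
    .name          = "example",
    .major_version = 2,
    .minor_version = 1,
    .open          = example_open,
};

/* Call the component's open function; the caller must supply one. */
int example_component_open(const example_component_t *c)
{
    assert(c != NULL && c->open != NULL);
    return c->open();
}
```

With positional initialization, inserting a new member ahead of name would have required touching every component; with named members only the components that care about the new field need to change.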
Nathan Hjelm
3436f2917d Merge pull request #449 from hjelmn/mca_base_update
mca/base update
2015-04-16 08:41:48 -06:00
Ralph Castain
acc2c7937c Thanks Nathan - decrement the counter to ensure singletons start up correctly 2015-04-08 11:23:35 -07:00
Ralph Castain
d07dc362d5 Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common. 2015-03-28 20:34:26 -07:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Ralph Castain
d7d8ae46ed We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info. 2015-03-17 06:10:20 -07:00
Ralph Castain
d81c372ea2 Remove the "forwarding" of envars when direct launched - there aren't any envars we can forward under that use-case 2015-02-27 12:19:48 -08:00
Howard Pritchard
bf89131f9e add owner files to opal/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Gilles Gouaillardet
0ce59f2d29 pmix: fix misc memory leaks
as reported by Coverity as CID 1269843, 1269854, 1269856, 1269857 and 1269858
2015-02-16 11:19:43 +09:00
Howard Pritchard
bd9d185951 pmix/cray: remove workaround for OBJ_RELEASE
Per feedback from rhc, manually set the base_ptr member
of the opal_buffer_t variable to NULL prior to calling
OBJ_RELEASE.  A similar feature of opal_dss.load also
exists so likewise reset the base_ptr to NULL prior to
invoking it.

Hopefully the opal_buffer_t struct does not change
frequently.

Minor cleanups to reduce output when pmix_base_verbose
mca parameter is set.
2015-02-13 07:47:26 -08:00
Howard Pritchard
9955834ff1 pmix/cray: initial kvs removal work
Remove use of the Cray PMI KVS - which is designed for a lightweight
MPI that exchanges only a minimal amount of connection info
(about 128 bytes per rank) - within cray/pmix.  Use Cray PMI
collective extensions instead.

This is the first of several steps to accelerate launch of
Open MPI on Cray systems using either native aprun or nativized
slurm.
2015-02-11 15:14:55 -08:00
Ralph Castain
3ae3b96c17 Fix master compilation - a buried header dependency must have been removed. 2015-02-10 07:22:10 -08:00
Ralph Castain
a3275aa867 Once again, fix the blasted singleton comm_spawn 2015-02-05 17:34:25 -08:00
Jeff Squyres
0dbbffb753 pmix_base_frame: use the "= { 0 }" initializer
Per open-mpi/ompi#381, convert the specific initialization of opal_pmix
to use the generic "= { 0 }" initializer.  This form can be used to
initialize any type when the intent is just to zero out / assign
*some* value.
2015-02-05 17:51:06 -05:00
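The generic initializer named in commit 0dbbffb753 can be illustrated with a small sketch. The struct below is a hypothetical stand-in for the real opal_pmix type; only the "= { 0 }" form itself comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical module struct, standing in for the real opal_pmix type. */
typedef struct {
    int (*init)(void);
    int (*finalize)(void);
    int  priority;
    char name[16];
} example_module_t;

/* File-scope definition with an explicit initializer. The single 0
 * initializes the first member; all remaining members are then
 * zero-initialized as well, whatever their types. */
example_module_t example_module = { 0 };

/* Returns 1 when every member holds its zero value, 0 otherwise. */
int example_module_is_zeroed(const example_module_t *m)
{
    assert(m != NULL);
    return m->init == NULL && m->finalize == NULL &&
           m->priority == 0 && m->name[0] == '\0';
}
```

Because the form does not name any member, it keeps working if the struct layout changes later, which is what makes it usable for any type.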
Jeff Squyres
621af3aa07 pmix_base: fix global opal_pmix symbol for static linking on OS X
OS X has weirdness when static linking.  If a symbol is not
initialized, it is put into the common block section, and Weird Things
happen (linking when trying to use that global symbol will fail).
If you initialize the variable, it goes into a different section (and
linking to it will work).

This link (that might go stale someday) has some information about OS
X linker scope and treatment of symbol definitions:
https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachOTopics/1-Articles/executing_files.html#//apple_ref/doc/uid/TP40001829-98432-TPXREF120

Fixes #375.
2015-02-04 12:12:31 -05:00
Ralph Castain
294ebc907a Fix singleton operations so they can work inside a slurm environment 2015-01-27 09:29:42 -06:00
Ralph Castain
ba25e8a0ce Fix singletons 2015-01-27 09:29:42 -06:00
Ralph Castain
028b00154d Complete implementation of the schizo framework to support OMPI component 2015-01-27 09:29:42 -06:00
Gilles Gouaillardet
9e9261e90a pmix: correctly set locality flags in proc_flags
do not use opal_process_info.cpuset which is not
set at that time.
2014-12-26 15:37:08 +09:00
Howard Pritchard
91b0d03bf2 pmix/cray: remove dead code 2014-12-19 13:08:23 -08:00
Ralph Castain
573a574a3c Remove an unused dstore type that was redundant with another one. Define a corresponding PMIX_NODE_ID type (contains the vpid of the daemon hosting the proc) and ensure that the PMIx server includes that info in its process map 2014-12-15 12:11:13 -08:00
Ralph Castain
9658256a98 Restore the passing of the complete job map to the local proc on first get_attr so the info can be used by the MPI layer without continual calls back to the server. We'll find a more memory efficient method later. 2014-12-13 18:44:09 -08:00
Howard Pritchard
c75dccede1 pmix/cray: remove finalize call from comp close
The finalize call in component close method is
no longer being matched by an equivalent init call,
so remove this call in the close method.
2014-12-03 09:44:18 -07:00
Ralph Castain
d9b23c1054 Increment the init_count in the Slurm pmix components so they correctly respond to calls to pmix.initialized 2014-12-02 20:20:29 -08:00
Gilles Gouaillardet
578fe41788 fix hangs introduced by previous commit a6744b8177 2014-11-25 17:50:44 +09:00
Howard Pritchard
a632b632ca better way to tell if a process is in a Cray PAGG
Use a more reliable way to tell whether a process
1) is in a Cray PAGG and
2) is actually considered an application process on
   a compute node (not, for example, a process in a PAGG
   on a mom node).
2014-11-12 12:56:15 -07:00
Howard Pritchard
72bb4a2eee make cray pmi compile again
Commit @80f07b65 resulted in changes that
caused cray pmi component to no longer compile.
This commit fixes that issue.
2014-11-12 12:33:30 -07:00
Artem Polyakov
fce08a3db3 Fix SLURM PMI2 component. set s2_nrank to the relative position of a process inside the node
(not relative position of a node inside the allocation).
2014-11-12 16:26:35 +06:00
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
Gilles Gouaillardet
80f07b65f1 pmix: correctly split pmi messages
Thanks to @elenash for all the reviews
2014-11-11 17:16:00 +09:00
elenash
2687637071 Merge pull request #263 from elenash/master
dstore sm component implementing shared memory database for pmix client/server communication
2014-11-07 07:56:55 +03:00
Howard Pritchard
b389895c66 fix make dist for pmix/cray
An include file was left out of the "sources" list, which prevented
building for cray from the dist tarball.
2014-11-06 15:10:51 -07:00
Elena
03fc809bc9 This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory. 2014-11-06 16:01:19 +02:00
Gilles Gouaillardet
ca0b969991 pmix: fix a return status in native_get_attr 2014-10-30 15:26:23 +09:00
Gilles Gouaillardet
8c556bbc66 pmix: fix alignment issue 2014-10-29 13:19:23 +09:00
Ralph Castain
4f0c1ae8d9 Continue cleanup of the PMI config code. Eliminate the multiple calls to check for pmi1 and pmi2 - we must check it only once to get the pmix components to build only in the correct situations. Ensure we set the wrapper flags so we handle static builds correctly. 2014-10-27 20:37:33 -07:00
Gilles Gouaillardet
248acbbc3b pmix/slurm: correctly set locality of the local ranks as "not found" 2014-10-23 17:02:07 +09:00
Gilles Gouaillardet
7508c6f3ad pmix: correctly handle NULL OPAL_BYTE_OBJECT object 2014-10-22 17:15:21 +09:00
Nadezhda Kogteva
2bce929330 MTL MXM cleanup: unnecessary OMPI_MTL_MXM_CONNECT_ON_FIRST_COMM variable removed 2014-10-20 10:29:47 +03:00
Ralph Castain
b6aa691e0a Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons. 2014-10-16 12:58:56 -07:00
Gilles Gouaillardet
27dcca0bb2 pmi/s1: fix large keys
do not overwrite the PMI key when pushing a message that does
not fit within 255 bytes
2014-10-16 13:29:32 +09:00
Gilles Gouaillardet
5c81658d58 pmix: fix big endian arch
use the appropriate 64-bit type, otherwise data gets incorrectly
truncated on big endian arch
2014-10-15 17:17:09 +09:00
Elena
c905fe9b78 pmix: removed pmix_base_direct modex mca parameter, renamed orte_full_modex_cutoff and ompi_hostname_cutoff to direct_modex_cutoff 2014-10-09 06:15:31 +02:00
Gilles Gouaillardet
5c5453b8b1 pmix: fix test in native_get_attr 2014-10-03 11:54:08 +09:00
Ralph Castain
9e35f80ab6 Don't multiply define WANT_PMI_SUPPORT and friends. Turns out they weren't being used anywhere anyway, so no point in defining them at all
This commit was SVN r32822.
2014-09-30 20:43:25 +00:00
Howard Pritchard
8da51fab81 cray pmi equivalent to commit 5eb65b24
This commit was SVN r32820.
2014-09-30 19:25:00 +00:00
Ralph Castain
8d0b4f222a The pmix.get functions should not be returning "success" if the requested info isn't found. Fix the macros and the component functions so they correctly return "not found" in that situation, and set the data regions and size to NULL and 0, respectively.
This commit was SVN r32818.
2014-09-30 18:03:12 +00:00
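The return convention described in commit 8d0b4f222a can be sketched as a small lookup. All names here (example_get, example_entry_t, the EXAMPLE_* codes) are hypothetical and are not the pmix.get macros or component functions; the sketch only shows the contract of returning "not found" with the data pointer set to NULL and the size set to 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes, not the OPAL ones. */
#define EXAMPLE_SUCCESS         0
#define EXAMPLE_ERR_NOT_FOUND (-1)

/* Hypothetical key/value entry. */
typedef struct {
    const char *key;
    const void *data;
    size_t      size;
} example_entry_t;

/* Look up key in table. On a miss, report "not found" and leave the
 * outputs as NULL and 0 so the caller never sees stale values. */
int example_get(const example_entry_t *table, size_t n, const char *key,
                const void **data, size_t *size)
{
    size_t i;

    assert(key != NULL && data != NULL && size != NULL);

    /* Reset the outputs first so every failure path is covered. */
    *data = NULL;
    *size = 0;

    for (i = 0; i < n; ++i) {
        if (strcmp(table[i].key, key) == 0) {
            *data = table[i].data;
            *size = table[i].size;
            return EXAMPLE_SUCCESS;
        }
    }
    return EXAMPLE_ERR_NOT_FOUND;
}
```

Resetting the outputs before searching means a caller that forgets to check the return code still cannot read leftover data from an earlier call.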
Howard Pritchard
201d4ec3ad fix setting of PMIX_NODE_RANK in cray pmix comp.
Per discussions with pmix folks, it was determined that
the way the cray pmi pmix component was computing the
PMIX_NODE_RANK attribute for a process was incorrect.
This commit fixes the problem.

This commit was SVN r32810.
2014-09-29 16:55:31 +00:00
Howard Pritchard
1508a01325 Fixes to enable mpirun to work again on Cray
The ess pmi module was not handling aprun launched
daemons.  All daemons were thinking they were vpid 1.

Also, turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.

Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.

Have the ESS pmi component open the pmix framework and
select a pmix component.

This commit was SVN r32773.
2014-09-23 15:37:26 +00:00
Howard Pritchard
820b34e5d2 Fix bad cut/paste for commit c19e7369
This commit was SVN r32712.
2014-09-11 21:00:04 +00:00
Howard Pritchard
d07c5674a3 Fix potential double free in cray pmi cray_fini
This commit was SVN r32711.
2014-09-11 20:30:40 +00:00
Ralph Castain
a7c5b77d70 Just because the openib BTL can't reach a process doesn't mean it is a job-ending error. If we have other methods for reaching the process (e.g., sm for a local proc), then that's okay. If there is no method for reaching a proc, then that's an error - but the BML will report that situation.
The question of whether or not the openib BTL supports loopback is a separate question. It may be more appropriate to make the modex be PMIX_GLOBAL for cases where openib can support loopback so someone can run without a shared memory component. I'll leave that decision to the IB vendors.

This commit was SVN r32702.
2014-09-10 17:02:16 +00:00
Ralph Castain
6323b226c7 Bring over some updates from the PMIx branch - mostly just minor cleanups. Make the direct grpcomm component no longer be the default. For now, we seem to be having problems with non-blocking fence operations, so make them not be the default under any scenario (e.g., when sm is the only btl in operation).
This commit was SVN r32673.
2014-09-06 19:19:44 +00:00
Howard Pritchard
fe2ea1f0fb fix handling of OPAL_DSTORE_LOCALITY and ref cnt
This commit was SVN r32671.
2014-09-05 21:36:19 +00:00
Ralph Castain
41c6058153 Bring over changes to MXM from pmix branch:
MTL MXM: establish endpoint connection on the first communication when direct_modex used

This commit was SVN r32668.
2014-09-03 18:22:11 +00:00
Ralph Castain
5cdbc00136 Re-enable the usock oob component. Ensure the TCP component promotes messages for other procs to the OOB base so that other components have a chance to send the relay. Seems to be passing MTT, so let's see how it works for others.
This commit was SVN r32650.
2014-08-30 19:33:46 +00:00
Ralph Castain
9ac75451ff Nathan had requested this before as he needs to know the #procs in the job to optimize the UGNI btl. Add the fetch for that data - the native pmix component already provides it, but ensure the Slurm PMI-1 support does too. If not found, fall back to the non-optimized number
This commit was SVN r32648.
2014-08-29 22:53:35 +00:00
Ralph Castain
f865ef61ab Need local_size returned by the Slurm components
This commit was SVN r32646.
2014-08-29 22:23:27 +00:00
Howard Pritchard
9a2891f2d6 handle PMIX_LOCAL_SIZE attr arg in cray pmix
This commit was SVN r32645.
2014-08-29 21:18:02 +00:00
Ralph Castain
730e28349e Some minor uninitialized variable cleanups
This commit was SVN r32629.
2014-08-29 02:21:13 +00:00
Gilles Gouaillardet
d743da18bf pmix: fix process name parsing on 32-bit systems
opal_process_name_t is a uint64_t, which is not equivalent to
an unsigned long on 32-bit systems.
It is now parsed as an unsigned long long.

This commit was SVN r32592.
2014-08-25 03:08:02 +00:00
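The width problem behind commit d743da18bf can be shown with a short sketch. The commit message does not say which parsing call was used, so the use of strtoull here is an assumption, and parse_name_u64 is a hypothetical helper rather than Open MPI code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal 64-bit value from a string.
 * Returns 0 on success, -1 on malformed or out-of-range input. */
int parse_name_u64(const char *s, uint64_t *out)
{
    char *end = NULL;
    unsigned long long v;

    assert(s != NULL && out != NULL);

    /* strtoull would accept leading whitespace or a sign; reject both. */
    if (*s < '0' || *s > '9') {
        return -1;
    }

    errno = 0;
    /* unsigned long long is at least 64 bits wide everywhere, whereas
     * unsigned long may be only 32 bits, so strtoul could not represent
     * every uint64_t value on a 32-bit system. */
    v = strtoull(s, &end, 10);
    if (*end != '\0' || errno == ERANGE || v > UINT64_MAX) {
        return -1;
    }

    *out = (uint64_t)v;
    return 0;
}
```

On a platform where unsigned long is 32 bits, the same input parsed with strtoul would saturate at ULONG_MAX and set errno to ERANGE for any value above 4294967295.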
Ralph Castain
f00af81c1d Little more cleanup under the abort cases cited by Gilles. All seem to be working now
This commit was SVN r32585.
2014-08-22 19:57:57 +00:00
Ralph Castain
b1a7375192 Fix the "unreachable" message so it outputs the correct hostname for the remote proc. Cleanup some of the pmix stuff when running corner cases of errors
This commit was SVN r32584.
2014-08-22 19:20:45 +00:00
Ralph Castain
6ff2a60829 Handle the non-blocking fence case correctly, and ensure we always at least pass back the hostname of the process whose info is being requested so that the ompi_proc_t can correctly initialize it when we are in a non-blocking fence with np < cutoff scenario
This commit was SVN r32578.
2014-08-22 14:26:24 +00:00
Ralph Castain
8f1b9b463e Fix shared memory operations - need to pass the local topology and cpusets of all local peers so we can properly compute relative locality for them. Also need to set default locality to "on node" in case where cpusets are not passed because procs are not bound.
This commit was SVN r32577.
2014-08-22 05:17:51 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “pmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
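The signature idea from the RFC in commit aec5cd08bd, identifying a collective by the set of participating procs rather than by a global id, can be sketched as below. Every type and function name is hypothetical, and the sketch assumes that both sides list the procs in the same canonical order; the commit message does not describe the actual representation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical process name; the real type is not shown in the log. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} example_name_t;

/* Hypothetical signature: the ordered list of participating procs. */
typedef struct {
    const example_name_t *procs;
    size_t                nprocs;
} example_signature_t;

/* Two collectives refer to the same operation only if their signatures
 * name exactly the same procs. Returns 1 on a match, 0 otherwise.
 * Assumes both lists are in the same canonical order. */
int example_signature_equal(const example_signature_t *a,
                            const example_signature_t *b)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if (a->nprocs != b->nprocs) {
        return 0;
    }
    for (i = 0; i < a->nprocs; ++i) {
        if (a->procs[i].jobid != b->procs[i].jobid ||
            a->procs[i].vpid  != b->procs[i].vpid) {
            return 0;
        }
    }
    return 1;
}
```

A tracker keyed this way would reject a second collective only when its signature matches one already in flight, which corresponds to the one-active-collective-per-combination constraint stated in the RFC.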