Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Protect users of hwloc membind functions
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Update PMIx to include NULL string protection
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Update to PMIx master to include key overwrite protection
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:
* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
instead of the topology itself, if available, thus avoiding instantiating the topology
* openib BTL. This uses the distance matrix. At present, I haven't developed a method
for replacing that reference. Thus, this component will instantiate the topology
* usnic BTL. Uses the distance matrix.
* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
the topology
* ess base functions. If a process is direct launched and not bound at launch, this
code attempts to bind it. Thus, procs in this scenario will instantiate the
topology
Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.
Fix pernode binding policy
Properly handle the unbound case
Correct pointer usage
Do not free static error messages!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Fine tuning of flux component
Fix a few minor issues with the initial cut:
* Job id could be obtained from the PMI kvsname like SLURM,
but simpler to getenv (FLUX_JOB_ID)
* Flux pmi-1 doesn't define PMI_BOOL, PMI_TRUE, PMI_FALSE
* Flux pmi-1 maps the deprecated PMI_Get_kvs_domain_id() to
PMI_KVS_Get_my_name() internally, so just call that instead.
* Drop residual slurm references.
Add wrappers for PMI functions so that if HAVE_FLUX_PMI_LIBRARY
is not defined, the component can dlopen libpmi.so at location
specified by the FLUX_PMI_LIBRARY_PATH env variable, which adds
flexibility. If HAVE_FLUX_PMI_LIBRARY is defined, link with
libpmi.so at build time in the usual way.
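A minimal sketch of the runtime-loading path described above, not the component's actual code. The helper names `flux_pmi_resolve()` and `wrap_pmi_init()` are hypothetical, and `-1` is a placeholder error value; only the `FLUX_PMI_LIBRARY_PATH` variable and the PMI-1 `PMI_Init()` symbol come from the text and the PMI-1 API.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: resolve one PMI symbol from the library named by
 * FLUX_PMI_LIBRARY_PATH. Returns NULL if the variable is unset, the
 * library cannot be opened, or the symbol is missing. */
static void *flux_pmi_resolve(const char *symbol)
{
    static void *handle = NULL;   /* opened once, reused for later lookups */
    const char *path = getenv("FLUX_PMI_LIBRARY_PATH");

    if (NULL == handle) {
        if (NULL == path) {
            return NULL;
        }
        handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
        if (NULL == handle) {
            fprintf(stderr, "dlopen(%s): %s\n", path, dlerror());
            return NULL;
        }
    }
    return dlsym(handle, symbol);
}

/* Wrapper with the PMI-1 PMI_Init() signature; -1 is a placeholder error. */
static int wrap_pmi_init(int *spawned)
{
    int (*fn)(int *) = (int (*)(int *)) flux_pmi_resolve("PMI_Init");
    return (NULL == fn) ? -1 : fn(spawned);
}
```

With this shape, callers go through the wrapper in both build modes; when the library is linked at build time the wrapper could simply call the PMI function directly. Older glibc versions need `-ldl` at link time.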
Update configury for flux component
Update m4 so the configure options work as follows:
--with-flux-pmi
Build Flux PMI support (default: yes)
--with-flux-pmi-library
Link Flux PMI support with PMI library at build
time. Otherwise the library is opened at runtime at
location specified by FLUX_PMI_LIBRARY_PATH environment
variable. Use this option to enable Flux support when
building statically or without dlopen support (default: no)
If the latter option is provided, the library/header is located at
build time using the pkg-config module 'flux-pmi'. Otherwise there
is no library/header dependency.
Handle the case where ompi is configured with --disable-dlopen
or --enable-static. In those cases, don't build the component
unless --with-flux-pmi-library is provided.
It is fatal if the user explicitly requests --with-flux-pmi but
it cannot be built (e.g. due to --disable-dlopen).
Add a schizo/flux component
Update schizo/flux component
Eliminate slurm-specific usage cases.
Since the module is only loaded if FLUX_JOB_ID is set, there are
only two cases to handle:
1) App was launched indirectly through mpirun. This is not yet
supported with Flux, but hook remains in case this mode is supported
in the future.
2) App was launched directly by Flux, with Flux providing
CPU binding, if any.
Fix up white space in pmix/flux component
Drop non-blocking fence from pmix:flux component
The flux PMI-1 library is not thread safe, therefore
register a regular blocking fence callback instead of the
thread-shifting fencenb().
pmix/flux component avoids extra PMI_KVS_Gets
Keys stored into the base cache under the wildcard
rank are not intended to be part of the global key namespace.
These keys therefore should not trigger a PMI_KVS_Get() if they
are not found in the cache.
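The lookup rule above can be sketched as follows. This is a toy model under stated assumptions: the cache layout, `RANK_WILDCARD`, `lookup()`, and `remote_get()` are all hypothetical stand-ins, and `remote_get()` only simulates the cost of a PMI_KVS_Get() round trip.

```c
#include <stddef.h>
#include <string.h>

#define RANK_WILDCARD (-1)   /* hypothetical stand-in for the wildcard rank */

struct entry { int rank; const char *key; const char *value; };

/* Toy local cache; contents are illustrative only. */
static const struct entry cache[] = {
    { RANK_WILDCARD, "local.size", "4" },
    { 0,             "endpoint",   "tcp://a" },
};

static int remote_gets = 0;   /* counts simulated PMI_KVS_Get() calls */

/* Stand-in for the expensive PMI_KVS_Get() round trip. */
static const char *remote_get(int rank, const char *key)
{
    (void) rank;
    (void) key;
    remote_gets++;
    return NULL;
}

static const char *lookup(int rank, const char *key)
{
    size_t i;

    for (i = 0; i < sizeof(cache) / sizeof(cache[0]); i++) {
        if (cache[i].rank == rank && 0 == strcmp(cache[i].key, key)) {
            return cache[i].value;
        }
    }
    /* Wildcard-rank keys are local only, so a cache miss is final. */
    if (RANK_WILDCARD == rank) {
        return NULL;
    }
    return remote_get(rank, key);
}
```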
Minor pmix/flux component cleanup
pmix/flux: drop code for fetching unused pmix_id
pmix/flux: err_exit must return error
Problem: in flux_init(), although 'ret' (variable holding
err_exit return code) is initialized to OPAL_ERROR, the
variable is reused as a temporary result code, so if there are
some successes followed by a failure that doesn't set 'ret',
flux_init() could return success with PMI not initialized.
Ensure that a "goto err_exit" returns OPAL_ERROR if 'ret'
is not set to some other error code.
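A minimal sketch of the guarded error exit, assuming hypothetical names throughout: `SK_SUCCESS` and `SK_ERROR` stand in for OPAL_SUCCESS and OPAL_ERROR, and the two flags simulate whether each initialization step succeeds.

```c
#define SK_SUCCESS 0      /* stand-in for OPAL_SUCCESS */
#define SK_ERROR   (-1)   /* stand-in for OPAL_ERROR */

/* Non-zero step flags simulate a successful initialization step. */
static int init_sketch(int step1_ok, int step2_ok)
{
    int ret;

    /* 'ret' is reused as a temporary result code, as in the problem case. */
    ret = step1_ok ? SK_SUCCESS : SK_ERROR;
    if (SK_SUCCESS != ret) {
        goto err_exit;
    }
    if (!step2_ok) {
        /* Failure path that does not set 'ret'. */
        goto err_exit;
    }
    return SK_SUCCESS;

err_exit:
    /* The error path must never report success. */
    if (SK_SUCCESS == ret) {
        ret = SK_ERROR;
    }
    return ret;
}
```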
pmix/flux: don't mix OPAL_ and PMI_ return codes
Problem: flux_init() can return both PMI_ and OPAL_ return
codes. Although OPAL_SUCCESS and PMI_SUCCESS are both defined
as 0, other codes are not compatible.
Ensure that flux_init() consistently uses 'rc' for PMI_
return codes and 'ret' for OPAL_ return codes.
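One way to keep the two code spaces apart is an explicit translation step at the boundary. The sketch below is an assumption, not the component's code: the `SK_` constants and their numeric values are illustrative stand-ins for the PMI_ and OPAL_ definitions, and `pmi_to_opal()` is a hypothetical helper.

```c
/* Illustrative stand-ins; the real values come from the PMI and OPAL headers. */
enum { SK_PMI_SUCCESS = 0, SK_PMI_FAIL = -1,
       SK_PMI_ERR_NOMEM = 2, SK_PMI_ERR_INVALID_ARG = 3 };
enum { SK_OPAL_SUCCESS = 0, SK_OPAL_ERROR = -1,
       SK_OPAL_ERR_OUT_OF_RESOURCE = -2, SK_OPAL_ERR_BAD_PARAM = -5 };

/* 'rc' holds a PMI-style code; the result is an OPAL-style code for 'ret'. */
static int pmi_to_opal(int rc)
{
    switch (rc) {
    case SK_PMI_SUCCESS:
        return SK_OPAL_SUCCESS;
    case SK_PMI_ERR_NOMEM:
        return SK_OPAL_ERR_OUT_OF_RESOURCE;
    case SK_PMI_ERR_INVALID_ARG:
        return SK_OPAL_ERR_BAD_PARAM;
    default:
        /* Anything unrecognized maps to the generic error. */
        return SK_OPAL_ERROR;
    }
}
```

Only the success codes happen to coincide numerically, which is why returning a raw PMI code from a function documented to return OPAL codes is unsafe.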
pmix/flux: factor out repeated code for cache put
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Update ORTE support for dynamic PMIx operations, e.g., PMIx_Spawn
Update to track master
Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
It turns out that there is an incompatibility between the Cray PMI
library and the default configuration for building Open MPI (master).
To work around this, we now disable use of aprun for direct launch
of Open MPI jobs except under specific conditions.
The problem is that there are now (on master) packages getting
initialized that do not work properly across a fork operation.
As part of a constructor in the Cray PMI library, a fork operation
is done to simplify use of shared memory between the
processes in a job on the same node. This ends up thoroughly
messing up the Open MPI initialization process in the case
that dlopen support is enabled. The initialization process gets
about half-way through when the PMIX framework is opened and
components are loaded, which triggers the Cray PMI constructor
and hence the fork operation.
There are two workarounds for this:
1) configure Open MPI for Cray XE/XC systems using aprun with the
--disable-dlopen option
2) set the PMI_NO_FORK environment variable in the shell in which
the aprun command is run.
Without taking these measures, an Open MPI job will just hang at
job startup in the first attempt to "thread-shift" the PMIx
fence_nb operation. Additional hangs occur at shutdown if this
problem is worked around, again due to the insertion of a fork
operation halfway through the Open MPI initialization procedure.
This commit detects if the conditions that bring out the hang
situation are present, and if so, prints out a message and
aborts the job launch.
Note that on systems using Slurm, the PMI_NO_FORK environment variable
is set as part of the srun job launch, hence this issue is avoided
on those systems.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
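The detection in the commit above might look roughly like the following sketch. It assumes the hang conditions are exactly the two implied by the workarounds (dlopen support enabled and PMI_NO_FORK unset); `aprun_launch_unsafe()` and its `dlopen_enabled` parameter are hypothetical, with the parameter standing in for a build-time configuration flag.

```c
#include <stdlib.h>

/* Returns 1 when a direct launch under aprun should be refused, 0 otherwise.
 * 'dlopen_enabled' is a hypothetical stand-in for the build configuration. */
static int aprun_launch_unsafe(int dlopen_enabled)
{
    if (!dlopen_enabled) {
        /* Built with --disable-dlopen: the fork in the Cray PMI
         * constructor does not hit the problematic path. */
        return 0;
    }
    /* With dlopen support, the launch is safe only if PMI_NO_FORK is set. */
    return NULL == getenv("PMI_NO_FORK");
}
```

A caller would print a help message and abort the launch when this returns 1.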
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopen caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.
This commit fixes this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Enhance the cray pmix component to set some OMPI internal
env. variables used to set some key/value pairs
on the MPI_INFO_ENV object. This allows more of the
ompi-tests ibm unit tests to pass when using aprun/srun
direct launch and Cray PMI.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
- replace MAXHOSTNAMELEN with hardcoded 1024.
unlike Linux, Solaris #defines MAXHOSTNAMELEN in <netdb.h>,
so use a hard coded value to keep the test simple
- stdout cannot be assigned on Solaris, so use freopen instead
(back-ported from upstream commit pmix/master@a63f6e53f4)
this is a convenience macro similar to the PMIX_LIST_FOREACH macro,
that can be used to iterate on all the key/value pairs of a pmix_hash_table_t
(back-ported from upstream commit pmix/master@349971c68c)
these three pmix components use the same class name,
declare it as static so Open MPI can be built with --disable-dlopen
Thanks Limin Gu for the report
pmix_progress_thread_finalize() invokes libevent event_base_free,
so no libevent facilities can be used afterwards.
Hence, pmix_client_globals.myserver must be PMIX_DESTRUCT'ed
before invoking pmix_progress_thread_finalize()
a "key" argument was added to pmix1_value_unload() but is unused,
and pmix1_value_unload() was sometimes invoked with two arguments instead of three.
since the "key" argument is unused, simply remove it from the
subroutine prototype and calls.
the LT_* macros do overwrite the enable_dlopen variable,
so it must be tested and saved before invoking LT_INIT.
delay the invocation of the LT_* macros and use the
PMIX_ENABLE_DLOPEN_SUPPORT variable to figure out whether
--disable-dlopen was invoked
and fail with a user friendly message if no method is available:
"sec: native cannot validate_cred on this system"
(back-ported from upstream pmix/master@c474a1fc60)
Architecture is set by the ompi layer *after* job startup, so the key cannot
have the "pmix" prefix since optimizations in open-mpi/ompi@01a653d50a
otherwise architecture cannot be retrieved
This commit fixes a bug in the pmix2x client code where a loop
variable is not correctly incremented. This was leading to hangs and
crashes when creating intercommunicators. Also fixed two double
increments in other loops.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Blocking fence is used in yalla del proc. Native pmix exposes this functionality.
We need to expose it for SLURM's s1/s2 components as well.
Also this commit fixes uninitialized `rc` in fencenb's of both
components.
Thanks Jeff for the guidance
Fixes open-mpi/ompi#1683
note:
in order to keep this commit easy to review, some AS_IF([...]) were replaced with
AS_IF([false], ...) or AS_IF([true], ...)
these will be removed and re-indented in a subsequent commit
pmix cannot be built on alpine linux because of some missing includes.
uid_t and gid_t are defined in unistd.h or sys/types.h, and unistd.h
is not indirectly pulled in under alpine linux, so include it manually.
Thanks N.L.K Nguyen for the report
(back-ported from upstream pmix/master@c8d55350a9)
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
Update external as well
Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
Rather than have a stub function for the pmix fence_nb
operation, just set to NULL. Causes fewer problems.
Fixes #1597
Fixes #1527
Signed-off-by: hppritcha <howardp@lanl.gov>
Define OPAL_MAXHOSTNAMELEN to be either:
(MAXHOSTNAMELEN + 1) or
(limits.h:HOST_NAME_MAX + 1) or
(255 + 1)
For pmix code, define above using PMIX_MAXHOSTNAMELEN.
Fixup opal layer to use the new max.
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
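The fallback chain in the commit above can be sketched with the preprocessor as shown here. The choice of headers is an assumption (`<sys/param.h>` provides MAXHOSTNAMELEN on many systems but not all), and `get_hostname()` is a hypothetical helper added only to show the buffer being used.

```c
#include <limits.h>      /* HOST_NAME_MAX, where available */
#include <stddef.h>
#include <sys/param.h>   /* MAXHOSTNAMELEN on many systems (assumption) */
#include <unistd.h>      /* gethostname() */

#if defined(MAXHOSTNAMELEN)
#define OPAL_MAXHOSTNAMELEN (MAXHOSTNAMELEN + 1)
#elif defined(HOST_NAME_MAX)
#define OPAL_MAXHOSTNAMELEN (HOST_NAME_MAX + 1)
#else
#define OPAL_MAXHOSTNAMELEN (255 + 1)
#endif

/* Hypothetical helper: fetch the hostname and guarantee NUL termination. */
static int get_hostname(char *buf, size_t len)
{
    if (0 == len || 0 != gethostname(buf, len)) {
        return -1;
    }
    buf[len - 1] = '\0';
    return 0;
}
```

The extra byte leaves room for the terminating NUL, so a buffer declared as `char name[OPAL_MAXHOSTNAMELEN]` can hold a maximum-length hostname.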
https://github.com/pmix/master/pull/71
Have OMPI's current version of pmix120 nicely fail in case of
too long sun_path (longer than 108 or in case of OSX 103 chars).
And have OMPI return proper error messages with hints how to
amend.
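A minimal sketch of such a length check, assuming a hypothetical helper `set_usock_path()`; it relies on `sizeof(addr->sun_path)` rather than a hard-coded 108 or 103, so the limit follows the platform.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Returns 0 and fills 'addr' if 'path' fits in sun_path including the
 * terminating NUL; returns -1 without touching 'addr' otherwise. */
static int set_usock_path(struct sockaddr_un *addr, const char *path)
{
    size_t len = strlen(path);

    if (len >= sizeof(addr->sun_path)) {
        return -1;   /* caller can report the error and hint at a shorter path */
    }
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path, path, len + 1);
    return 0;
}
```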
The reference counting was broken which led PMIx_Finalize
to release resources early. This fixes the "use after free" scenarios
that I encountered.
(based on commit pmix/master@abfaa4c)
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.
* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default
* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons
* ensure orterun doesn't propagate any ess or pmix directives in its environment
* Cleanup a few valgrind issues and memory leaks
* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)
* Ensure the schizo/alps component detects launch by mpirun
Update the configure logic for the new pmix120 component
ckpt
Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates
Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works
Cleanup the rename files to use the pretty macros
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes #1204 (replaces it)
Fixes #1064
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes
Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.
Update the PMIx code and continue attempting to debug direct modex
Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.
Update PMIx to v1.1.2
There was a bug with the way the cray pmix component
was setting the locality property for ranks on the
same node, etc.
Improve location/syntax of a comment block.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
to continue current default behavior.
Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_modex is used.
if CFLAGS and/or CPPFLAGS are passed to the ompi configure command line, pmix1xx
configure will not use the correct ones previously passed in the environment
see discussion started at http://www.open-mpi.org/community/lists/devel/2015/10/18159.php
Thanks Siegmar Gross for bringing this to our attention
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
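The optional out-argument pattern from the commit above, as a sketch. This is not the real mca_base_select() signature; `struct candidate` and `select_best()` are hypothetical and only illustrate returning the winning priority when the caller asks for it.

```c
#include <stddef.h>

struct candidate { const char *name; int priority; };

/* Returns the highest-priority candidate, or NULL if n == 0.
 * If priority_out is non-NULL, the winning priority is stored there;
 * callers that do not need it pass NULL. */
static const struct candidate *select_best(const struct candidate *c,
                                           size_t n, int *priority_out)
{
    const struct candidate *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (NULL == best || c[i].priority > best->priority) {
            best = &c[i];
        }
    }
    if (NULL != best && NULL != priority_out) {
        *priority_out = best->priority;
    }
    return best;
}
```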
Looks like in ess_pmi_module.c u32 is being used
for retrieving OPAL_PMIX_LOCAL_SIZE, while s1/s2/cray
pmix components were storing as u16.
This commit fixes this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
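The mismatch in the commit above is easiest to see with a tagged value. The sketch below uses hypothetical types (`struct sk_value`, `sk_unload_u32()`), not the OPAL data structures: a value stored as a 16-bit integer cannot be unloaded by a reader that expects 32 bits.

```c
#include <stdint.h>

enum sk_type { SK_UINT16, SK_UINT32 };

struct sk_value {
    enum sk_type type;
    union { uint16_t u16; uint32_t u32; } data;
};

/* Returns 0 and stores the value if it was stored as 32-bit; -1 on a
 * type mismatch, which is what a u16 writer and a u32 reader produce. */
static int sk_unload_u32(const struct sk_value *v, uint32_t *out)
{
    if (SK_UINT32 != v->type) {
        return -1;
    }
    *out = v->data.u32;
    return 0;
}
```

The fix is for writer and reader to agree on one width for the key.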
Add more stubs to reduce likelihood of future
mysterious segfaults if some of the newer pmix
funcs start to get used within ompi.
Add a get_version to return the version of the
Cray PMI library being used, since the Cray PMI
library actually has a function to get that info.
Be more accurate about which functions have a hope
of being implemented using Cray PMI and those which
never will.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
* update to configury to silence ident messages (thanks Gilles!)
* fix for warnings Jeff saw when get didn't find the requested data
* fix for Mac OSX operations
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
So add a brief timer event to kick us out of the communication. The precise amount of time we should wait is somewhat TBD, but set something short for now and we can adjust.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.
We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.
This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
CID 1269707 Logically dead code (DEADCODE)
Coverity is correct that tmp3 can never be NULL here. Deleted the dead
code.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
CID 1269730 Dereference after null check (FORWARD_NULL)
The code checked for cb == NULL before checking for a callback
function but did not have the same protection around the
OBJ_RELEASE(cb).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
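The FORWARD_NULL fix above amounts to putting the release under the same guard as the callback. A sketch with hypothetical names (`struct sk_cb`, `sk_release()`, `complete()`); `sk_release()` merely decrements a counter in place of OBJ_RELEASE.

```c
#include <stddef.h>

struct sk_cb {
    void (*cbfunc)(int status);
    int refcount;
};

/* Stand-in for OBJ_RELEASE(): just drops one reference. */
static void sk_release(struct sk_cb *cb)
{
    cb->refcount--;
}

static void complete(struct sk_cb *cb, int status)
{
    if (NULL != cb) {
        if (NULL != cb->cbfunc) {
            cb->cbfunc(status);
        }
        /* Release inside the same NULL guard as the callback check. */
        sk_release(cb);
    }
}
```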
Code for setting proc node locality
was absent after the removal of Cray
PMI KVS usage. This commit puts that
functionality back in place.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
A few uninitialized common symbols are remaining:
common symbols generated by flex:
* opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng
* opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext
* opal/util/show_help_lex.l: opal_show_help_yyleng
* opal/util/show_help_lex.l: opal_show_help_yytext
common symbol generated by "external" hwloc library:
* opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map