This now passes the loop test, and so we believe it resolves the random hangs in finalize.
Changes in PMIx master that are included here:
* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable
Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Hostname and PID are output as a message prefix in many places in
our code. Their printf-formats were either `[%s:%d]` or `[%s:%05d]`.
This commit changes `[%s:%d]` to `[%s:%05d]`. The latter was more
widely used in our code (including OPAL output system and the signal
handler).
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This commit expands the effect of the MCA parameter `opal_abort_delay`
to the OPAL signal handler. This allows attaching of a debugger on
segmentation fault etc. before quitting the job.
The sleep code is moved to the `opal_delay_abort` function from the
`ompi_mpi_abort` and `oshmem_shmem_abort` functions for code cleanup.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
The expected sequence of events for processing info during object creation
is that if there's an incoming info arg, it is opal_info_dup()ed into the obj
at obj->s_info first. Then interested components register callbacks for
keys they want to know about using opal_infosubscribe_infosubscribe().
Inside info_subscribe_subscribe() the specified callback() is called with
whatever matching k/v is in the object's info, or with the default. The
return string from the callback goes into the new k/v stored in info, and
the input k/v is saved as __IN_<key>/<val>. It's saved the same way
whether the input came from info or whether it was a default. A null return
from the callback indicates an ignored key/val, and no k/v is stored for
it, but an __IN_<key>/<val> is still kept so we still have access to the
original.
At MPI_*_set_info() time, opal_infosubscribe_change_info() is used. That
function calls the registered callbacks for each item in the provided info.
If the callback returns non-null, the info is updated with that k/v, or if
the callback returns null, that key is deleted from info. An __IN_<key>/<val>
is saved either way, and overwrites any previously saved value.
When MPI_*_get_info() is called, opal_info_dup_mpistandard() is used, which
allows relatively easy changes in interpretation of the standard, by looking
at both the <key>/<val> and __IN_<key>/<val> in info. Right now it does
1. includes system extras, eg k/v defaults not expliclty set by the user
2. omits ignored keys
3. shows input values, not callback modifications, eg not the internal values
Currently the callbacks are doing things like
return some_condition ? "true" : "false"
that is, returning static strings that are not to be freed. If the return
strings start becoming more dynamic in the future I don't see how unallocated
strings could support that, so I'd propose a change for the future that
the callback()s registered with info_subscribe_subscribe() do a strdup on
their return, and we change the callers of callback() to free the strings
it returns (there are only two callers).
Rough outline of the smaller changes spread over the less central files:
comm.c
initialize comm->super.s_info to NULL
copy into comm->super.s_info in comm creation calls that provide info
OBJ_RELEASE comm->super.s_info at free time
comm_init.c
initialize comm->super.s_info to NULL
file.c
copy into file->super.s_info if file creation provides info
OBJ_RELEASE file->super.s_info at free time
win.c
copy into win->super.s_info if win creation provides info
OBJ_RELEASE win->super.s_info at free time
comm_get_info.c
file_get_info.c
win_get_info.c
change_info() if there's no info attached (shouldn't happen if callbacks
are registered)
copy the info for the user
The other category of change is generally addressing compiler warnings where
ompi_info_t and opal_info_t were being used a little too interchangably. An
ompi_info_t* contains an opal_info_t*, at &(ompi_info->super)
Also this commit updates the copyrights.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
ompi_communicator_t, ompi_win_t, ompi_file_t all have a super class of type opal_infosubscriber_t instead of a base/super type of opal_object_t (in previous code comm used c_base, but file used super). It may be a bit bold to say that being a subscriber of MPI_Info is the foundational piece that ties these three things together, but if you object, then I would prefer to turn infosubscriber into a more general name that encompasses other common features rather than create a different super class. The key here is that we want to be able to pass comm, win and file objects as if they were opal_infosubscriber_t, so that one routine can heandle all 3 types of objects being passed to it.
MPI_INFO_NULL is still an ompi_predefined_info_t type since an MPI_Info is part of ompi but the internal details of the underlying information concept is part of opal.
An ompi_info_t type still exists for exposure to the user, but it is simply a wrapper for the opal object.
Routines such as ompi_info_dup, etc have all been moved to opal_info_dup and related to the opal directory.
Fortran to C translation tables are only used for MPI_Info that is exposed to the application and are therefore part of the ompi_info_t and not the opal_info_t
The data structure changes are primarily in the following files:
communicator/communicator.h
ompi/info/info.h
ompi/win/win.h
ompi/file/file.h
The following new files were created:
opal/util/info.h
opal/util/info.c
opal/util/info_subscriber.h
opal/util/info_subscriber.c
This infosubscriber concept is that communicators, files and windows can have subscribers that subscribe to any changes in the info associated with the comm/file/window. When xxx_set_info is called, the new info is presented to each subscriber who can modify the info in any way they want. The new value is presented to the next subscriber and so on until all subscribers have had a chance to modify the value. Therefore, the order of subscribers can make a difference but we hope that there is generally only one subscriber that cares or modifies any given key/value pair. The final info is then stored and returned by a call to xxx_get_info.
The new model can be seen in the following files:
ompi/mpi/c/comm_get_info.c
ompi/mpi/c/comm_set_info.c
ompi/mpi/c/file_get_info.c
ompi/mpi/c/file_set_info.c
ompi/mpi/c/win_get_info.c
ompi/mpi/c/win_set_info.c
The current subscribers where changed as follows:
mca/io/ompio/io_ompio_file_open.c
mca/io/ompio/io_ompio_module.c
mca/osc/rmda/osc_rdma_component.c (This one actually subscribes to "no_locks")
mca/osc/sm/osc_sm_component.c (This one actually subscribes to "blocking_fence" and "alloc_shared_contig")
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Conflicts:
AUTHORS
ompi/communicator/comm.c
ompi/debuggers/ompi_mpihandles_dll.c
ompi/file/file.c
ompi/file/file.h
ompi/info/info.c
ompi/mca/io/ompio/io_ompio.h
ompi/mca/io/ompio/io_ompio_file_open.c
ompi/mca/io/ompio/io_ompio_file_set_view.c
ompi/mca/osc/pt2pt/osc_pt2pt.h
ompi/mca/sharedfp/addproc/sharedfp_addproc.h
ompi/mca/sharedfp/addproc/sharedfp_addproc_file_open.c
ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
ompi/mpi/c/lookup_name.c
ompi/mpi/c/publish_name.c
ompi/mpi/c/unpublish_name.c
opal/mca/mpool/base/mpool_base_alloc.c
opal/util/Makefile.am
The direct modex operation is slow, especially at scale for even modestly-connected applications. Likewise, blocking in MPI_Init while we wait for a full modex to complete takes too long. However, as George pointed out, there is a middle ground here. We could kickoff the modex operation in the background, and then trap any modex_recv's until the modex completes and the data is delivered. For most non-benchmark apps, this may prove to be the best of the available options as they are likely to perform other (non-communicating) setup operations after MPI_Init, and so there is a reasonable chance that the modex will actually be done before the first modex_recv gets called.
Once we get instant-on-enabled hardware, this won't be necessary. Clearly, zero time will always out-perform the time spent doing a modex. However, this provides a decent compromise in the interim.
This PR changes the default settings of a few relevant params to make "background modex" the default behavior:
* pmix_base_async_modex -> defaults to true
* pmix_base_collect_data -> continues to default to true (no change)
* async_mpi_init - defaults to true. Note that the prior code attempted to base the default setting of this value on the setting of pmix_base_async_modex. Unfortunately, the pmix value isn't set prior to setting async_mpi_init, and so that attempt failed to accomplish anything.
The logic in MPI_Init is:
* if async_modex AND collect_data are set, AND we have a non-blocking fence available, then we execute the background modex operation
* if async_modex is set, but collect_data is false, then we simply skip the modex entirely - no fence is performed
* if async_modex is not set, then we block until the fence completes (regardless of collecting data or not)
* if we do NOT have a non-blocking fence (e.g., we are not using PMIx), then we always perform the full blocking modex operation.
* if we do perform the background modex, and the user requested the barrier be performed at the end of MPI_Init, then we check to see if the modex has completed when we reach that point. If it has, then we execute the barrier. However, if the modex has NOT completed, then we block until the modex does complete and skip the extra barrier. So we never perform two barriers in that case.
HTH
Ralph
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Adds:
- enabling/disabling of timings throught environment variable `OMPI_TIMING_ENABLE`
- output format: [file name]:[function name]:[description]: avg/min/max
- dynamically extending array of results for case then inited size was exhausted
- catch and collect errors
- cleanup
Note:
For use feature need to configure with `--enable-timings`
and set env `OMPI_TIMING_ENABLE = 1`
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
This is an extension of OPAL timing framework that allows to use
MPI_reduce to provide the compact representation of the collected
timings throughout the whole application.
NOTE: the functionality is disabled now, it will be enabled after
the runtime verification.
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
As we changed the ABI (forcing a major release), we can limit
the size of the predefined communicators by moving the collective
structure outside the communicator. This might have a minimal,
but unnoticeable, impact on performance. This approach has been
discussed during the January 2017 devel meeting.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* Include a 'demo' component that shows some of the features.
* Currently has hooks for:
- MPI_Initialized
- top, bottom
- MPI_Init_thread
- top, bottom
- MPI_Finalized
- top, bottom
- MPI_Init
- top (pre-opal_init), top (post-opal_init), error, bottom
- MPI_Finalize
- top, bottom
* Other places in ompi can 'register' to hook into any one of these places
by passing back a component structure filled with function pointers.
* Add a `MCA_BASE_COMPONENT_FLAG_REQUIRED` flag to the MCA structure that
is checked by the `hook` framework. If a required, static component has
been excluded then the `hook` framework will fail to initialize.
- See note in `opal/mca/mca.h` as to why this is checked in the `hook`
framework and not in `opal/mca/base/mca_base_component_find.c`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:
* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
instead of the topology itself, if available, thus avoiding instantiating the topology
* openib BTL. This uses the distance matrix. At present, I haven't developed a method
for replacing that reference. Thus, this component will instantiate the topology
* usnic BTL. Uses the distance matrix.
* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
the topology
* ess base functions. If a process is direct launched and not bound at launch, this
code attempts to bind it. Thus, procs in this scenario will instantiate the
topology
Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.
Fix pernode binding policy
Properly handle the unbound case
Correct pointer usage
Do not free static error messages!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
F90 types cannot be freed by the enduser as specified by the standard.
but since they are ompi_datatype_dup'ed from predefined datatypes,
they have to be explicitly free'd at finalize time in order
to avoid a memory leak.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
declare ompi_mpi_show_mca_params_file as NULL
so MPI_T_Init_thread() can be invoked without leaking memory
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
As long as it is illegal to call MPI_T_init_thread() after MPI_Finalize(),
be gentle and release as much memory as possible in MPI_Finalize().
opal_cleanup() will be invoked again by the OPAL destructor, but will
do nothing since classes was set to NULL
* Add a configure time option to rename libmpi(_FOO).*
- `--with-libmpi-name=STRING`
* This commit only impacts the installed libraries.
Internal, temporary libraries have not been renamed to limit the
scope of the patch to only what is needed.
For example:
```shell
shell$ ./configure --with-libmpi-name=wookie
...
shell$ find . -name "libmpi*"
shell$ find . -name "libwookie*"
./lib/libwookie.so.0.0.0
./lib/libwookie.so.0
./lib/libwookie.so
./lib/libwookie.la
./lib/libwookie_mpifh.so.0.0.0
./lib/libwookie_mpifh.so.0
./lib/libwookie_mpifh.so
./lib/libwookie_mpifh.la
./lib/libwookie_usempi.so.0.0.0
./lib/libwookie_usempi.so.0
./lib/libwookie_usempi.so
./lib/libwookie_usempi.la
shell$
```
Relax CPU usage pressure from the application processes when doing
modex and barrier in ompi_mpi_init.
We see significant latencies in SLURM/pmix plugin barrier progress
because app processes are aggressively call opal_progress pushing
away daemon process doing collective progress.
Newer versions of gcc have "poisoned" the __malloc_initialize_hook
name and it can no longer be used. Added a configure check and
protection around its usage.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
Update external as well
Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
MPIR-1.0 specifies that the following symbols are only relevant in the
starter process:
- MPIR_Breakpoint
- MPIR_being_debugged
- MPIR_debug_state
- MPIR_debug_abort_string
I.e., the code filling in values in these various symbols was useless
/ never used.
MPIR-1.1 will define that MPIR_being_debugged *is* relevant in MPI
processes. That symbol is currently defined in libopen-rte (which is
currently causing a duplicate symbol error for static builds -- this
commit fixes that error), and is therefore still available for MPI
processes.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
There is a potential race condition in MPI_Init() where an orte even
thread could be in a function that uses OPAL_THREAD_LOCK /
OPAL_THREAD_UNLOCK when ompi_mpi_init calls opal_set_using_threads().
Closesopen-mpi/ompi#1586
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This is a follow-on to open-mpi/ompi@7373111: add some comments
explaining why the code is the way it is. Also update a previous
comment.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit makes it possible to set relative priorities for
components. Before the addition of the patched component there was
only one component that would run on any system but that is no longer
the case. When determining which component to open each component's
query function is called and the one that returns the highest priority
is opened. The default priority of the patcher component is set
slightly higher than the old ptmalloc2/ummunotify component.
This commit fixes a long-standing break in the abstration of the
memory components. ompi_mpi_init.c was referencing the linux malloc
hook initilize function to ensure the hooks are initialized for
libmpi.so. The abstraction break has been fixed by adding a memory
base function that calls the open memory component's malloc hook init
function if it has one. The code is not yet complete but is intended
to support ptmalloc in 2.0.0. In that case the base function will
always call the ptmalloc hook init if exists.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit ensures the bml is always enabled whether or not it will
be used. This ensures that any available btls communicate their modex
so that they can be used for one-sided communication.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit rewrites both the mpool and rcache frameworks. Summary of
changes:
- Before this change a significant portion of the rcache
functionality lived in mpool components. This meant that it was
impossible to add a new memory pool to use with rdma networks
(ugni, openib, etc) without duplicating the functionality of an
existing mpool component. All the registration functionality has
been removed from the mpool and placed in the rcache framework.
- All registration cache mpools components (udreg, grdma, gpusm,
rgpusm) have been changed to rcache components. rcaches are
allocated and released in the same way mpool components were.
- It is now valid to pass NULL as the resources argument when
creating an rcache. At this time the gpusm and rgpusm components
support this. All other rcache components require non-NULL
resources.
- A new mpool component has been added: hugepage. This component
supports huge page allocations on linux.
- Memory pools are now allocated using "hints". Each mpool component
is queried with the hints and returns a priority. The current hints
supported are NULL (uses posix_memalign/malloc), page_size=x (huge
page mpool), and mpool=x.
- The sm mpool has been moved to common/sm. This reflects that the sm
mpool is specialized and not meant for any general
allocations. This mpool may be moved back into the mpool framework
if there is any objection.
- The opal_free_list_init arguments have been updated. The unused0
argument is not used to pass in the registration cache module. The
mpool registration flags are now rcache registration flags.
- All components have been updated to make use of the new framework
interfaces.
As this commit makes significant changes to both the mpool and rcache
frameworks both versions have been bumped to 3.0.0.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>