value NULL for the descriptor
The send inline optimization uses the btl_sendi function to achieve
lower latency and higher message rates. The problem is the btl_sendi
function was allowed to return a descriptor to the caller. This is fine
for some paths but not ok for the send inline optimization. To fix
this the btl now must be able to handle descriptor = NULL.
structure
This structure member was originally used to specify the remote segment
for an RDMA operation. Since the new btl interface no longer uses
desriptors for RDMA this member no longer has a purpose. In addition
to removing these members the local segment information has been
renamed to des_segments/des_segment_count.
The old BTL interface provided support for RDMA through the use of
the btl_prepare_src and btl_prepare_dst functions. These functions were
expected to prepare as much of the user buffer as possible for the RDMA
operation and return a descriptor. The descriptor contained segment
information on the prepared region. The btl user could then pass the
RDMA segment information to a remote peer. Once the peer received that
information it then packed it into a similar descriptor on the other
side that could then be passed into a single btl_put or btl_get
operation.
Changes:
- Removed the btl_prepare_dst function. This reflects the fact that
RDMA operations no longer depend on "prepared" descriptors.
- Removed the btl_seg_size member. There is no need to btl's to
subclass the mca_btl_base_segment_t class anymore.
...
Add more
Use a more reliable way to tell if a process is
1) in a Cray PAGG
2) is actually considered an application process on
a compute node (not for example, a process in a PAGG
on a mom node).
using knem
It is valid to modify the remote segment that will be used with the
btl put/get operations as long as the resulting address range falls in
the originally prepared segment. Vader should have been calculating the
offset of the remote address in the registered region. This commit
fixes this issue.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
There were mistakes in the Makefiles for the ugni btl and
mca/common/ugni that prevented the ugni btl from being
used unless one happened to set the --disable-dlopen option
on the config line.
This commit fixes this problem.
This enables IBM's software iWARP provider. With this driver you can
run iWARP/RDMA over any ethernet NIC. Useful for testing OMPI RDMA
logic without requiring an expensive RDMA adapter/infrastructure.
The Soft iWARP code is at: https://www.gitorious.org/softiwarp
Avoid a problem with double-derefence of a variable macro name (i.e.,
a macro with part of its name from an AC_SUBST, such as
```$(foo@BAR@baz)```.
In what might be a bug in Automake 1.14.1, if you do a pattern like
this:
```makefile
lib_LTLIBRARIES = lib@A_PREFIX@a_lib.la
noinst_LTLIBRARIES = lib@A_PREFIX@a_noinst.la
lib@A_PREFIX@a_lib_la_SOURCES = a.c
lib@A_PREFIX@a_noinst_la_SOURCES = $(lib@A_PREFIX@a_lib_la_SOURCES)
```
Then in the resulting Makefile, the value of
```$(lib@A_PREFIX@a_lib_la_OBJECTS)``` will be *blank* (when it really
should be ```a.o```).
To workaround this potential bug, I've simply avoided doing
double-derefences like this, and effectively set the second
```_SOURCES``` line equal to ```a.c``` (just like the first
```_SOURCES``` line).
Fixes#250.
These two macros set the MCA prefix and MCA cmd line id,
respectively. Specifically, MCA parameters will be named
PREFIX<foo> in the environment, and the cmd line will use
-ID foo bar.
These macros must be called during configure.ac and a value
supplied. In the case of Open MPI, the values given are
PREFIX=OMPI_MCA_ and ID=mca.
Other projects (such as ORCM) will call these macros with
their own unique values. For example, ORCM uses PREFIX=ORCM_MCA_
and ID=omca
This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL. For example,
when running OMPI applications under ORCM, we need the MCA params passed
to the ORCM daemons to be separated from those recognized by the OMPI application.
At some point we added a sanity check to the btl base to ensure that
the btl flags match the available functions (this prevents user's from
specifying get or put when no function exists). This check was
disabling get for the sm btl since at the time of the check there is
no btl_get function. The simplest fix is to set a dummy value to btl_get
that will be overwritten with the proper value on btl initialization.
Closes#239.
If there are no usnic BTL modules, then just avoid sending any modex
message at all (other BTLs do this; it's safe to do).
The change is smaller than it looks: I added a "if 0 ==..." check at
the top to return immediately if there are no BTL modules. Then I
removed some now-unnecessary conditionals and un-indented as
appropriate.
Fixes#248
These two macros set the prefix for the OPAL and ORTE libraries,
respectively. Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.
These macros must be called, even if the prefix argument is empty.
The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix. For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.
This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL. For example,
when running MPI applications under ORTE, if the ORTE and OPAL
libraries between OMPI and ORCM are not identical (which, because they
are released at different times, are likely to be different), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
the OPAL and ORTE libraries. This is required by projects such as ORCM
that have their own ORTE and OPAL libraries in order to avoid library
confusion. By renaming their version of the libraries, the OMPI
applications can correctly dynamically load the correct one for their
build."
This reverts commit 63f619f871.
This commit makes the folowing changes:
- Add support for the knem single-copy mechanism. Initially vader will only
support the synchronous copy mode. Asynchronous copy support may be added
int the future.
- Improve Linux cross memory attach (CMA) when using restrictive ptrace
settings. This will allow Open MPI to use CMA without modifying the system
settings to support ptrace attach (see /etc/sysctl.d/10-ptrace.conf).
- Allow runtime selection of the single copy mechanism. The default behavior
is to use the best available. The priority list of single-copy mehanisms is
as follows: xpmem, cma, and knem.
- Allow disabling support for kernel-assisted single copy.
- Some tuning and bug fixes.
Restore the functionality to error out (and show a helpful message) if
knem support is requested by is either not compiled in or cannot be
activated.
Thanks to Gus Correa for bringing the matter to our attention.
Somehow, this MCA param was accidentally dropped after v1.6.5. Thanks
to Gus Correa for bringing this matter to our attention.
Also moving some MCA params down from level 9 to levels 4/5.
* redefine orte_process_name_t so it can be converted
between host and network format as an opal_identifier_t
aka uint64_t by the OPAL layer.
* correctly send OPAL_DSTORE_ARCH key
Revert "OPAL: drop dead with core on bad flow. rarely happens with helloworld on large scale."
This reverts commit 86f1d5af3e.
Will be reconsidered via RFC as it represents a significant change in behavior
of preregistered buffers
Before this change eager gets we retried on each progress loop. This commit
modifies the protocol to only retry eager gets when another eager get has
completed. This commit also cleans up some callback code that is no longer
needed.
The GNI_RDMAMODE_FENCE bit was a left over from
async progress work that is not needed at this point
in the gni BTL. Removing the bit also allows
for the removal of the GNI_CDM_MODE_BTE_SINGLE_CHANNEL
bit from the GNI_CdmCreate call.
1. It's actually hashing now, whereas the old OPAL hash table was not. Thus, it is a bug fix for and, as such, should be included in the 1.8 series.
2. It is dynamic and can grow and shrink the number of buckets in accordance with job size, whereas the old OPAL hash table had a fixed number of buckets which resulted in poor retrieval performance at large scale.
This scheme has been deployed in the field on very large H.P./Mellanox systems and has been demonstrated to significantly decrease job start-up time (~ 20% improvement) when launching applications directly with srun in SLURM environments. However, neither SLURM nor direct launch are prerequisites to take advantage of this change as any entity that utilizes OPAL hash table objects can benefit (at least partially) from this contribution.
This commit adds initial ugni thread safety support.
With this commit, sun thread tests (excepting MPI-2 RMA)
pass with various process counts and threads/process.
Also osu_latency_mt passes.
With --enable-memchecker builds, use calloc(3) for OBJ_NEW instead of
malloc(3). This cuts down on a lot of valgrind/memory checker false
positive output.
Also make a minor change in the valgrind configure.m4; have it assign
0xf to a char. The prior assignment (of 0xff) was warning about an
overflow. This didn't really matter, but we might as well make the
test not have a gratuitious warning in it.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
Update the VERSION file scheme:
* Remove "want_repo_rev".
* Add "tarball_version".
All values are now always included (major, minor, release, greek,
repo_rev). However, configure.ac now runs "opal_get_version.sh
... --tarball", which will return the value of tarball_version (if it
is non-empty) or the "full" version string (i.e.,
"major.minor.releasegreek").
Remove configure.params support: configure.params hasn't been used in
years.
Also remove autogen.subdirs support; those should really be handled by
their respective Makefile.am's.
In all previous releases, the version number would be "A.B.C" unless C
was 0, in which case it would be "A.B". This commit changes that
scheme to always be "A.B.C", even if C==0.
Hence, v1.9.0 will be the first release where this new scheme is evident.
This commit was SVN r32816.
Per discussions with pmix folks, it was determined that
the way the cray pmi pmix component was computing the
PMIX_NODE_RANK attribute for a process was incorrect.
This commit fixes the problem.
This commit was SVN r32810.
This is a large update that does the following:
- Only allocate fast boxes for a peer if a send count threshold
has been reached (default: 16). This will greatly reduce the memory
usage with large numbers of local peers.
- Improve performance by limiting the number of fast boxes that can
be allocated per peer (default: 32). This will reduce the amount
of time spent polling for fast box messages.
- Provide new MCA variables to configure the size, maximum count,
and send count thresholds for fast boxes allocations.
- Updated buffer design to increase the range of message sizes that
can be sent with a fast box.
- Add thread protection around fast box allocation (locks). When
spin locks are available this should be updated to use spin locks.
- Various fixes and cleanup.
This commit was SVN r32774.
The ess pmi module was not handling aprun launched
daemons. All daemons were thinking they were vpid 1.
Also, turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.
Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.
Have the ESS pmi component open the pmix framework and
select a pmix component.
This commit was SVN r32773.
1. Fixes according to (http://www.open-mpi.org/community/lists/devel/2014/09/15869.php)
2. Force mpisync:rank0 to gather results. Now sync info is written by rank0 to the output file.
3. Improve mpirun_prof: 1) adopt to the environment (SLURM/TORQUE); 2) recognize some noteset-related mpirun options.
This commit was SVN r32772.