btl sendi functions currently can not handle the descriptor being NULL. The
send inline optimization was assuming (incorrectly) that NULL was ok.
cmr=v1.7.5:ticket=trac:4149
This commit was SVN r30364.
The following Trac tickets were found above:
Ticket 4149 --> https://svn.open-mpi.org/trac/ompi/ticket/4149
allgather.
The new collectives provide a signifigant performance increase over tuned for
small and medium messages. We are initially setting the priority lower than
tuned until this has had some time to soak in the trunk. Please set
coll_ml_priority to 90 for MTT runs.
Credit for this work goes to Manjunath Gorentla Venkata (ORNL), Pavel Shamis (ORNL),
and Nathan Hjelm (LANL).
Commit details (for reference):
Import ORNL's collectives for MPI_Allreduce, MPI_Reduce, and MPI_Allgather.
We need to take the basesmuma header into account when calculating the
ptpcoll small message thresholds. Add a define to bcol.h indicating the
maximum header size so we can take the header into account while not
making ptpcoll dependent on information from basesmuma.
This resolves an issue with allreduce where ptpcoll overwrites the
header of the next buffer in the basesmuma bank.
Fix reduce and make a sequential collective launcher in coll_ml_inlines.h
The root calculation for reduce was wrong for any root != 0. There are
four possibilities for the root:
- The root is not the current process but is in the current hierarchy. In
this case the root is the index of the global root as specified in the
root vector.
- The root is not the current process and is not in the next level of the
hierarchy. In this case 0 must be the local root since this process will
never communicate with the real root.
- The root is not the current process but will be in next level of the
hierarchy. In this case the current process must be the root.
- I am the root. The root is my index.
Tested with IMB which rotates the root on every call to MPI_Reduce. Consider
IMB the reproducer for the issue this commit solves.
Make the bcast algorithm decision an enumerated variable
Resolve various asset failures when destructing coll ml requests.
Two issues:
- Always reset the request to be invalid before returning it to the
free list. This will avoid an asset in ompi_request_t's destructor.
OMPI_REQUEST_FINI does this (and also releases the fortran handle
index).
- Never explicitly construct or destruct the superclass of an opal
object. This screws up the class function tables and will cause
either an assert failure or a segmentation fault when destructing
coll ml requests.
Cleanup allgather.
I removed the duplicate non-blocking and blocking functions and modeled
the cleanup after what I found in allreduce. Also cleaned up the code
somewhat.
Don't bother copying from the send to the recieve buffer in
bcol_basesmuma_allreduce_intra_fanin_fanout if the pointers are the
same.
The eliminates a warning about memcpy and aliasing and avoids an
unnecessary call to memcpy.
Alwasy call CHECK_AND_RELEASE on memsync collectives.
There was a call to OBJ_RELEASE on the collective communicator but
because CHECK_AND_RECYLCE was never called there was not matching call
to OBJ_RELEASE. This caused coll ml to leak communicators.
Make allreduce use the sequential collective launcher in coll_ml_inlines.h
Just launch the next collective in the component progress.
I am a little unsure about this patch. There appears to be some sort
of race between collectives that causes buffer exhaustion in some cases
(IMB Allreduce is a reproducer). Changing progress to only launch the
next bcol seems to resolve the issue but might not be the best fix.
Note that I see little-no performance penalty for this change.
Fix allreduce when there are extra sources.
There was an issue with the buffer offset calculation when there are
extra sources. In the case of extra sources == 1 the offset was set
to buffer_size (just past the header of the next buffer). I adjusted
the buffer size to take into accoun the maximum header size (see the
earlier commit that added this) and simplified the offset calculation.
Make reduce/allreduce non-blocking. This is required for MPI_Comm_idup
to work correctly.
This has been tested with various layouts using the ibm testsuite and
imb and appears to have the same performance as the old blocking version.
Fix allgather for non-contiguous layouts and simplify parsing the
topology.
Some things in this patch:
- There were several comments to the effect that level 0 of the
hierarchy MUST contain all of the ranks. At least one function
made this assumption but it was not true. I changed the sbgp
components and the coll ml initization code to enforce this
requirement.
- Ensure that hierarchy level 0 has the ranks in the correct
scatter gather order. This removes the need for a separate
sort list and fixes the offset calculation for allgather.
- There were several passes over the hierarchy to determine
properties of the hierarchy. I eliminated these extra passes
and the memory allocation associated with them and calculate the
tree properties on the fly. The same DFS recursion also handles
the re-order of level 0.
All these changes have been verified with MPI_Allreduce, MPI_Reduce, and
MPI_Allgather. All functions now pass all IBM/Open MPI, and IMB tests.
coll/ml: correct pointer usage for MPI_BOTTOM
Since contiguous datatypes are copied via memcpy (bypassing the convertor) we
need to adjust for the lb of the datatype. This corrects problems found testing
code that uses MPI_BOTTOM (NULL) as the send pointer.
Add fallback collectives for allreduce and reduce.
cmr=v1.7.5:reviewer=pasha
This commit was SVN r30363.
Per RFC. There are two optimizations in this commit:
- Allocate requests for blocking sends and receives on the stack. This
bypasses the request free list and saves two atomics on the critical path.
This change improves the small message ping-pong by 50-200ns on both AMD
and Intel CPUs.
- For small messages try to use the btl sendi function before intializing a
send request. If the sendi fails or the btl does not have a sendi function
silently fallback on the standard send path.
cmr=v1.7.5:reviewer=brbarret
This commit was SVN r30343.
Adds coll_hcoll_np mca parameter similar to that of fca component (defaults to 32). Those who use hcoll be aware that from now on the communicators less than 32 procs will run w/o hcoll by default. - Resolves fallback issue in case libhcoll runs out of allowed contexts. The solution is moving hcoll_context_create from comm_enable to comm_query. Shortly, comm_enable should never return OMPI_ERROR in the coll component with highest priority (hcoll). Otherwise the ompi coll_base_select will unselect the coll funtion pointers and module references leaving the communicator w/o coll pointer. This will cause the fail. Same behavior can be reproduced even with tuned if one would hardcore some "return OMPI_ERROR" into it's module_enable funtion. - Additionally, removed all the dead code under #if 0; removed unused variables (path for library, active_modules list) and classes (module list wrapper)
Fixed by Val, Reviewed by Devendar/Josh/Miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30341.
This commit fixes an error path that occurs when huge page allocations are
enabled. In this case we allocate a huge page and try to register it but fail.
We then were calling free on the opal object. Fix this by calling the proper free
function.
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30289.
Also add a verbose flag so one can see what devices are selected as well as another flag to override
locality information and use all devices on the node.
This commit was SVN r30287.
NetBSD puts the AIO functions in -lrt, vs. the usual libc. So we
need the fbtl/posix configure.m4 to test for -lrt properly.
Reviewed by Jeff Squyres.
cmr=v1.7.4:reviewer=ompi-rm1.7:subject=Fix NetBSD use of -laio
This commit was SVN r30274.
NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!
Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.
This behavior will *only* take effect under the following conditions:
1. launched via mpirun
2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX
3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.
The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:
1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".
2. adjust the SM rendezvous logic to loop until the required file has been created
Creating a placeholder to bring this over to 1.7.5 when ready.
cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale
This commit was SVN r30259.
intended to be used and emits a compile-time warning.
Thanks to Paul Hargrove for identifying the issue.
cmr=v1.7.4:reviewer=hjelmn:subject=remove/replace malloc.h
This commit was SVN r30231.
Avoid compiler warning about (unnecessarily) initializing 2 variables
during instantiation at the top of a switch block (but outside of any
case statements): just declare the variables at the top of the outter
block. They're already safely initialized, so don't worry about
initializing them in the instantiation.
Reviewed by Dave Goodell.
cmr=v1.7.4:reviewer=ompi-rm1.7:subject=Don't instantiate+init variables in a switch block
This commit was SVN r30228.
ob1 dummy registration to actually be used when using udreg. Fix this by
always setting reg to NULL when mpool/udreg's register function fails.
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30214.
Set comm attribute with keyval.
Wait for pending hcoll module tasks in comm delete callback where PML
still valid on the communicator. safely destroy hcoll context during
hcoll module destructor.
Author: Devendar Bureddy
reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30175.
needed for correctness. The if_include/if_exclude are level 1, and
the TCP port range params are level 2; this parameter seems to be on
par with the TCP port range params.
Refs trac:4019
This commit was SVN r30161.
The following Trac tickets were found above:
Ticket 4019 --> https://svn.open-mpi.org/trac/ompi/ticket/4019
* Remove some set-but-not-used variables
* Make a convenience function return void (we weren't using the
return code, anyway)
* Mark a function as inline (it was supposed to be inline anyway)
Reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7:subject=Fix usnic BTL compiler warnings
This commit was SVN r30160.
Thanks to Tetsuya Mishima for detecting it!
cmr=v1.7.4:reviewer=jsquyres:subject=Correct tcp_not_use_nodelay option processing
This commit was SVN r30157.
- HCOLL close without init
- Call hcoll progress after comm finalize
- mpirun default for coll_hcoll_enable is 1
fixed by Igor, reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30156.
configury/Makefile.am changes; this commit renames the internal
installdirs.h framework struct field names to match the configry macro
names:
* pkgdatdir -> ompidatadir
* pkglibdir -> ompilibdir
* pkgincludedir -> ompiincludedir
This commit was SVN r30145.
The following SVN revision numbers were found above:
r30140 --> open-mpi/ompi@8b778903d8
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
Complements r30073: tighten up the string parsing of the vendor parts
ID MCA param a bit. Also fix a small memory leak: ensure to free the
array uint32_t's parsed out of the MCA param.
This commit was SVN r30128.
The following SVN revision numbers were found above:
r30073 --> open-mpi/ompi@6003702a51
The following Trac tickets were found above:
Ticket 4301 --> https://svn.open-mpi.org/trac/ompi/ticket/4301
This commit adds support for placing the send memory segment in a
traditional shared memory segment when XPMEM is not available. The
current default is to reserve 4MB for shared memory on each process.
The latest benchmarks show vader performing better than sm on both
Intel and AMD CPUs.
For large messages vader will now use CMA if it is available (and
XPMEM is not).
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30123.
Per RFC which expired two weeks ago:
We are planning to make a change to Open MPI to always set up the btls. This
means the btl init will be called even if add_procs is never called for that
btl. In the openib btl free lists fragments are currently allocated in btl_init.
To avoid wasting that memory this commit moves that final device setup to
the add_procs function. This included allocating free lists, and starting the
async event thread.
At this time this change is safe since we have a barrier after add_procs in
MPI_Init. If this changes we will need to re-think some of the initialization
since we might have the possibility of a connection request before add_procs
is called.
Tested with Mellanox ConnectX2 and QLogic HCAs.
Commit also cleans up tabs in btl_openib_async.c.
cmr=v1.7.5:reviewer=miked
This commit was SVN r30122.
1. Fix ompi_info memory leak in usnic BTL: do not allocate memory in
the component register function, because ompi_info only calls the
component register function and then dlclose's the component -- it
does not call component finalize. Instead, defer parsing the MCA
param (and alloc'ing memory) until the component init function so
that any allocated memory can be freed in the component close
function.
1. Also add a new check to ensure that we actually have some part
numbers to check. Add a show_help message if we don't find any
vendor part IDs to check.
1. Add a verbose output if usnic disqualifies itself from selection
because THREAD_MULTIPLE was specified.
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r30073.
- Modifications to coll/hcoll component related to the changes in the libhcoll API.
Now, hcoll_destroy_context accepts one more parameter that indicates if the context was
really destroyed as a result of the call.
This new "non-blocking" context destruction fixes hang discovered in IMB with mcast enabled.
- Clean up all the left contexts (if any) on the comm_world destruction.
fixed by Val, reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30055.
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)
This commit was SVN r30036.
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* only #ifdef out the code where the behaviour is changed
(used to be blocking; now non-blocking)
This commit was SVN r30035.
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
Some NFS scenarios can result in an infinite ESTALE return, which will
hang ROMIO. This commit causes ROMIO to error out after a large number
of retries instead of spinning forever.
This is MPICH commit b250d338:
http://git.mpich.org/mpich.git/commit/b250d338e66667a8a1071a5f73a4151fd59f83b2
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r29993.
function pointers set to the _map functions, and we get segv's in MTT
testing (e.g., the C++ suite, which actually calls MPI_Cart_map and
MPI_Graph_map).
cmr=v1.7.4:reviewer=bosilca:subject=Fix topo _map function pointer assignments
This commit was SVN r29988.
branch (it's not necessary on trunk/v1.7 because they require C99,
which allows variadic macros).
Also fix another compiler warning (using %p to print a (void*)).
Submitted by Jeff, reviewed by Dave.
cmr=v1.7.4:reviewer=ompi-rm1.7:subject=two usnic BTL fixes
This commit was SVN r29966.
usnic_channel_finalize() was deregistering recv buffers before
destroying the QP to which they were posted. The QP needs to be
destroyed first so that the NIC does not attemp tto write to
deregistered memory, causing the DMAR messages.
Submitted by Reese, reviewed by Jeff.
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29963.
* automatically retrieve the hostname (and all RTE info) for all procs during MPI_Init if nprocs < cutoff
* if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon the first call to modex_recv for that proc. This would provide the hostname for debugging purposes as we only report errors on messages, and so we must have called modex_recv to get the endpoint info
* BTLs are not to call modex_recv until they need the endpoint info for first message - i.e., not during add_procs so we don't call it for every process in the job, but only those with whom we communicate
My understanding is that only some BTLs have been modified to meet that third requirement, but those include the Cray ones where jobs are big enough that launch times were becoming an issue. Other BTLs would hopefully be modified as time went on and interest in using them at scale arose. Meantime, those BTLs would call modex_recv on every proc, and we would therefore be no worse than the prior behavior.
This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of the ompi_process_name_t for the proc so that the hostname can be easily inserted. I have advised the ORNL folks of the change.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock
This commit was SVN r29931.
The following SVN revision numbers were found above:
r29917 --> open-mpi/ompi@1a972e2c9d
includes various fixes all over the C/R code which are
hard to group like the other patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values
Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)
This commit was SVN r29922.
discovered when removing some components.
This commit was SVN r29895.
The following SVN revision numbers were found above:
r29894 --> open-mpi/ompi@58ed00296c
cmr=v1.7.4:reviewer=brbarret:subject=Disqualify sm btl for hetero procs
This commit was SVN r29882.
The following Trac tickets were found above:
Ticket 2433 --> https://svn.open-mpi.org/trac/ompi/ticket/2433