NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!
Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.
This behavior will *only* take effect under the following conditions:
1. launched via mpirun
2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX
3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.
The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:
1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".
2. adjust the SM rendezvous logic to loop until the required file has been created
Creating a placeholder to bring this over to 1.7.5 when ready.
cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale
This commit was SVN r30259.
intended to be used and emits a compile-time warning.
Thanks to Paul Hargrove for identifying the issue.
cmr=v1.7.4:reviewer=hjelmn:subject=remove/replace malloc.h
This commit was SVN r30231.
Avoid compiler warning about (unnecessarily) initializing 2 variables
during instantiation at the top of a switch block (but outside of any
case statements): just declare the variables at the top of the outter
block. They're already safely initialized, so don't worry about
initializing them in the instantiation.
Reviewed by Dave Goodell.
cmr=v1.7.4:reviewer=ompi-rm1.7:subject=Don't instantiate+init variables in a switch block
This commit was SVN r30228.
ob1 dummy registration to actually be used when using udreg. Fix this by
always setting reg to NULL when mpool/udreg's register function fails.
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30214.
Set comm attribute with keyval.
Wait for pending hcoll module tasks in comm delete callback where PML
still valid on the communicator. safely destroy hcoll context during
hcoll module destructor.
Author: Devendar Bureddy
reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30175.
needed for correctness. The if_include/if_exclude are level 1, and
the TCP port range params are level 2; this parameter seems to be on
par with the TCP port range params.
Refs trac:4019
This commit was SVN r30161.
The following Trac tickets were found above:
Ticket 4019 --> https://svn.open-mpi.org/trac/ompi/ticket/4019
* Remove some set-but-not-used variables
* Make a convenience function return void (we weren't using the
return code, anyway)
* Mark a function as inline (it was supposed to be inline anyway)
Reviewed by Dave Goodell.
cmr=v1.7.5:reviewer=ompi-rm1.7:subject=Fix usnic BTL compiler warnings
This commit was SVN r30160.
Thanks to Tetsuya Mishima for detecting it!
cmr=v1.7.4:reviewer=jsquyres:subject=Correct tcp_not_use_nodelay option processing
This commit was SVN r30157.
- HCOLL close without init
- Call hcoll progress after comm finalize
- mpirun default for coll_hcoll_enable is 1
fixed by Igor, reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30156.
configury/Makefile.am changes; this commit renames the internal
installdirs.h framework struct field names to match the configry macro
names:
* pkgdatdir -> ompidatadir
* pkglibdir -> ompilibdir
* pkgincludedir -> ompiincludedir
This commit was SVN r30145.
The following SVN revision numbers were found above:
r30140 --> open-mpi/ompi@8b778903d8
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
Complements r30073: tighten up the string parsing of the vendor parts
ID MCA param a bit. Also fix a small memory leak: ensure to free the
array uint32_t's parsed out of the MCA param.
This commit was SVN r30128.
The following SVN revision numbers were found above:
r30073 --> open-mpi/ompi@6003702a51
The following Trac tickets were found above:
Ticket 4301 --> https://svn.open-mpi.org/trac/ompi/ticket/4301
This commit adds support for placing the send memory segment in a
traditional shared memory segment when XPMEM is not available. The
current default is to reserve 4MB for shared memory on each process.
The latest benchmarks show vader performing better than sm on both
Intel and AMD CPUs.
For large messages vader will now use CMA if it is available (and
XPMEM is not).
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30123.
Per RFC which expired two weeks ago:
We are planning to make a change to Open MPI to always set up the btls. This
means the btl init will be called even if add_procs is never called for that
btl. In the openib btl free lists fragments are currently allocated in btl_init.
To avoid wasting that memory this commit moves that final device setup to
the add_procs function. This included allocating free lists, and starting the
async event thread.
At this time this change is safe since we have a barrier after add_procs in
MPI_Init. If this changes we will need to re-think some of the initialization
since we might have the possibility of a connection request before add_procs
is called.
Tested with Mellanox ConnectX2 and QLogic HCAs.
Commit also cleans up tabs in btl_openib_async.c.
cmr=v1.7.5:reviewer=miked
This commit was SVN r30122.
1. Fix ompi_info memory leak in usnic BTL: do not allocate memory in
the component register function, because ompi_info only calls the
component register function and then dlclose's the component -- it
does not call component finalize. Instead, defer parsing the MCA
param (and alloc'ing memory) until the component init function so
that any allocated memory can be freed in the component close
function.
1. Also add a new check to ensure that we actually have some part
numbers to check. Add a show_help message if we don't find any
vendor part IDs to check.
1. Add a verbose output if usnic disqualifies itself from selection
because THREAD_MULTIPLE was specified.
cmr=v1.7.5:reviewer=dgoodell
This commit was SVN r30073.
- Modifications to coll/hcoll component related to the changes in the libhcoll API.
Now, hcoll_destroy_context accepts one more parameter that indicates if the context was
really destroyed as a result of the call.
This new "non-blocking" context destruction fixes hang discovered in IMB with mcast enabled.
- Clean up all the left contexts (if any) on the comm_world destruction.
fixed by Val, reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30055.
This patch changes all send/send_buffer occurrences in the C/R code
to send_nb/send_buffer_nb.
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* just replace the blocking calls with the non-blocking calls
* all #ifdef's introduced in V1 are gone
* send_* returns error code or ORTE_SUCCESS (not the number of bytes)
This commit was SVN r30036.
This patch changes all recv/recv_buffer occurrences in the C/R code
to recv_nb/recv_buffer_nb.
The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
The new code compiles but does not work.
Changes from V1:
* #ifdef out the code (so it is preserved for later re-design)
* marked the broken C/R code with ENABLE_FT_FIXED
Changes from V2:
* only #ifdef out the code where the behaviour is changed
(used to be blocking; now non-blocking)
This commit was SVN r30035.
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
Some NFS scenarios can result in an infinite ESTALE return, which will
hang ROMIO. This commit causes ROMIO to error out after a large number
of retries instead of spinning forever.
This is MPICH commit b250d338:
http://git.mpich.org/mpich.git/commit/b250d338e66667a8a1071a5f73a4151fd59f83b2
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r29993.
function pointers set to the _map functions, and we get segv's in MTT
testing (e.g., the C++ suite, which actually calls MPI_Cart_map and
MPI_Graph_map).
cmr=v1.7.4:reviewer=bosilca:subject=Fix topo _map function pointer assignments
This commit was SVN r29988.
branch (it's not necessary on trunk/v1.7 because they require C99,
which allows variadic macros).
Also fix another compiler warning (using %p to print a (void*)).
Submitted by Jeff, reviewed by Dave.
cmr=v1.7.4:reviewer=ompi-rm1.7:subject=two usnic BTL fixes
This commit was SVN r29966.
usnic_channel_finalize() was deregistering recv buffers before
destroying the QP to which they were posted. The QP needs to be
destroyed first so that the NIC does not attemp tto write to
deregistered memory, causing the DMAR messages.
Submitted by Reese, reviewed by Jeff.
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29963.
* automatically retrieve the hostname (and all RTE info) for all procs during MPI_Init if nprocs < cutoff
* if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon the first call to modex_recv for that proc. This would provide the hostname for debugging purposes as we only report errors on messages, and so we must have called modex_recv to get the endpoint info
* BTLs are not to call modex_recv until they need the endpoint info for first message - i.e., not during add_procs so we don't call it for every process in the job, but only those with whom we communicate
My understanding is that only some BTLs have been modified to meet that third requirement, but those include the Cray ones where jobs are big enough that launch times were becoming an issue. Other BTLs would hopefully be modified as time went on and interest in using them at scale arose. Meantime, those BTLs would call modex_recv on every proc, and we would therefore be no worse than the prior behavior.
This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of the ompi_process_name_t for the proc so that the hostname can be easily inserted. I have advised the ORNL folks of the change.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock
This commit was SVN r29931.
The following SVN revision numbers were found above:
r29917 --> open-mpi/ompi@1a972e2c9d
includes various fixes all over the C/R code which are
hard to group like the other patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values
Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)
This commit was SVN r29922.
discovered when removing some components.
This commit was SVN r29895.
The following SVN revision numbers were found above:
r29894 --> open-mpi/ompi@58ed00296c
cmr=v1.7.4:reviewer=brbarret:subject=Disqualify sm btl for hetero procs
This commit was SVN r29882.
The following Trac tickets were found above:
Ticket 2433 --> https://svn.open-mpi.org/trac/ompi/ticket/2433
For exammple, mca_btl_sm.knem_fd remained 0, and mca_btl_sm_component_close() ended up doing closing fd 0 which belongs to someone else.
fixed by Yossi, reviewed by miked
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29875.
flexible members.
UDCM is ready to go for 1.7.4 with this patch.
cmr=v1.7.4:ticket=3940
This commit was SVN r29861.
The following Trac tickets were found above:
Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940
Note that this event should never happen within a single OMPI job,
because OMPI will ignore usnic ports that are down. The PORT_ACTIVE
event should only occur if a port ''was'' down and is now ''up''. But
what the heck -- if we ever do get this event, it is harmless -- just
ignore it.
This commit was SVN r29852.
This is helpful in the work for #3694: ensure that many places that
eventually end up in configure don't overly-pollute the global shell
variable space (because debugging accidental shell variable pollution
can be a real pain).
Refs trac:3694
This commit was SVN r29830.
The following Trac tickets were found above:
Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
On the off chance that the PML is twiddling fields that it really
shouldn't be...
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29804.
MOFED apparently has a /usr/include/infiniband/verbs.h that also
defines a (slightly different but fully compatible) container_of
macro. So put proper #ifndef protection around our definition of
container_of.
Thanks to Rolf vandeVaart for pointing out the issue.
Reviewed by Dave Goodell.
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29799.
Originally udcm acks used the immediate data to indicate which message was
being acknowleged. This data was (mysteriously) junk when using QLogic HCAs so I
updated udcm to use the source info (slid, qp, etc) to determine which message was being
acked. This works as long as we don't have two messages simultaneously in flight
to a particular peer and then loose the first of the two messages. The chances of this
happening are tiny. To fix this case I updated the udcm message header to include
a pointer to the in flight message. This pointer is then sent back to the sending
process to ack receipt.
cmr=v1.7.4:ticket=trac:3940
This commit was SVN r29775.
The following Trac tickets were found above:
Ticket 3940 --> https://svn.open-mpi.org/trac/ompi/ticket/3940
This commit updates the udcm cpc to support xrc. The steps followed by udcm
mimic those in the removed xoob cpc. This update has been tested with both XRC
and RC.
Mellanox, this is intended to go into 1.7.4. Please review carefully and let
me know if there are any issues.
cmr=v1.7.4:reviewer=miked
This commit was SVN r29767.
(aka the root). This commit is based on a patch provided by Pierre
Jolivet.
Fix all the output to match the failing MPI call.
This commit was SVN r29761.
To support the new mpool two changes were made to the mpool infrastructure:
1) Added an mpool flag to indicate that an mpool does not need the memory
hooks to use the leave pinned protocols. This flag is checked in the
mpool lookup.
2) Add a mpool context to the base registration. This new member is used
by the udreg mpool to store the udreg context associated with the
particular registration. The new member will not break the ABI
compatibility as the new member is only currently used by the udreg
mpool.
Dynamics support for Cray systems makes use of the global rank provided by
orte to give the ugni library a unique rank for each process. Dynamics
support is not available under direct-launch (srun.)
cmr=v1.7.4
This commit was SVN r29719.
This isn't being used yet - just enabling Nathan to do what he needs.
***** NOTE: any use of the OMPI_DB_GLOBAL_RANK database key must be protected by #ifdef OMPI_DB_GLOBAL_RANK as not all RTE's will define this key. *****
This commit was SVN r29708.
http://www.open-mpi.org/community/lists/devel/2013/10/13072.php
Add support for pinning GPU Direct RDMA in openib BTL for better small message latency of GPU buffers.
Note that none of this is compiled in unless CUDA-aware support is requested.
This commit was SVN r29680.
Gah! The "device" variable isn't used at all in this loop (my eye
glossed over the next line and thought that "device" was used in the
free() statement, but it's actually "devices" -- not "device").
This commit was SVN r29665.
The following Trac tickets were found above:
Ticket 3091 --> https://svn.open-mpi.org/trac/ompi/ticket/3091
<usnic device name>,<eth device>,<ip address>/<CIDR prefix>
For example:
usnic_0,eth4,10.1.0.15/16
This is just handy for mapping the usnic_X device back to the IP
network to which it corresponds.
This commit was SVN r29656.
Resolves a hang when using scif for shared memory transfers. This is a
simple change and doesn't require a review.
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r29653.
Cisco v1.6 git commit 913ec6c and upstream trunk r29593 (segfault fix)
introduced a performance regression by inadvertently disabling the
`module_recv_buffers` functionality. With those changes in place, the
`btl_usnic_recv.c` logic would end up mallocing a buffer that should
have otherwise come from a `module_recv_buffers` pool. It also resulted
in a small, bounded memory leak (128 buffers at each power-of-two size
interval).
The new version just places the buffer after the free list item with a
flexible array member. I bumped the pool to allocate all 128 elements
up front because the deferred allocation was modestly impacting IMB
Sendrecv performance at a few sizes.
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29631.
The following SVN revision numbers were found above:
r29593 --> open-mpi/ompi@1ed9b8ff43
Only use Portals on communicators with more than one rank
Fix computation of number of children when using the hypercube tree
This commit was SVN r29616.
Without this commit, if you run IMB pingpong between two nodes with only
one usnic selected (e.g., via `--mca btl_usnic_if_include usnic_0`) then
the run will seem fine but will segfault at MPI_Finalize time.
This behavior has happened since Cisco v1.6 git commit ec7ddf8, upstream
trunk r29484, and upstream v1.7 r29507.
Root cause was that the free list element was being used as the recv
buffer instead of the data buffer associated with the element. So the
reassembly code would stomp all over the free list element, which would
cause the destructor to explode when the free list attempted to clean up
all of its elements. This surprisingly did not cause any other problems
until now.
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29593.
The following SVN revision numbers were found above:
r29484 --> open-mpi/ompi@a6ed232a10
r29507 --> open-mpi/ompi@790d269ce8
If we need to use a convertor, go back to stashing that convertor in the
frag and populating segments "on the fly" (in
ompi_btl_usnic_module_progress_sends). Previously we would pack into a
chain of chunk segments at prepare_src time, unnecessarily consuming
additional memory.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29592.
This makes it a little easier to see what's happening with callbacks to
the PML.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29591.
This includes suppressing picky-mode warnings about __VA_ARGS__, which
we know are supported by any compilers we care about.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29590.
Ensure that they never are touched by checking in their destructors.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Reese Faucette <rfaucett@cisco.com>
This commit was SVN r29589.
Let imagine that we have two btls in btl_openib_component_init() both points to the same openib_btl->device and as a result have the same openib_btl->device->endpoints array.
Finalization phase calls twice mca_btl_openib_finalize()->mca_btl_openib_finalize_resources().
mca_btl_openib_finalize_resources() frees endpoint related btl. But the second call of mca_btl_openib_finalize_resources() checks endpoint that is released by previus call.
fixed by Igor, reviewed by miked/vasily
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r29563.