
426 Commits

Author SHA1 Message Date
Nathan Hjelm
408da16d50 ompi/proc: add proc hash table for ompi_proc_t objects
This commit adds an opal hash table to keep track of mapping between
process identifiers and ompi_proc_t's. This hash table is used by the
ompi_proc_by_name() function to look up (in O(1) time) a given
process. This can be used by a BTL or other component to get an
ompi_proc_t when handling an incoming message from an as yet unknown
peer.

Additionally, this commit adds a new MCA variable to control the new
add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in
the job falls below the threshold an ompi_proc_t is created for
every process. If the number of ranks is above the threshold then an
ompi_proc_t is only created for the local rank. The code needed to
generate additional ompi_proc_t's for a communicator is not yet
complete.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-10 08:55:54 -06:00
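The lookup described in the commit above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: proc_name_t, proc_t, proc_table_t and the two functions are hypothetical stand-ins, not the real ompi_proc_t or the OPAL hash table API. It only shows how a (jobid, vpid) process name can be packed into one key and resolved to a proc object in expected O(1) time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a process name and a proc object. */
typedef struct { uint32_t jobid; uint32_t vpid; } proc_name_t;
typedef struct { proc_name_t name; } proc_t;

#define TABLE_SIZE 64u /* power of two, so masking replaces modulo */

typedef struct { proc_t *slots[TABLE_SIZE]; } proc_table_t;

/* Pack the two 32-bit identifiers into a single 64-bit key. */
uint64_t name_key(proc_name_t n)
{
    return ((uint64_t) n.jobid << 32) | n.vpid;
}

/* Mix the key bits and reduce them to a slot index. */
size_t slot_for(uint64_t key)
{
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33;
    return (size_t) (key & (TABLE_SIZE - 1u));
}

/* Insert with linear probing. Returns 0 on success, -1 if the table is full. */
int proc_table_insert(proc_table_t *t, proc_t *p)
{
    assert(NULL != t && NULL != p);
    uint64_t key = name_key(p->name);
    size_t start = slot_for(key);
    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        size_t idx = (start + i) & (TABLE_SIZE - 1u);
        if (NULL == t->slots[idx] || name_key(t->slots[idx]->name) == key) {
            t->slots[idx] = p;
            return 0;
        }
    }
    return -1;
}

/* Look up a proc by name. Returns NULL for an as yet unknown peer. */
proc_t *proc_table_lookup(const proc_table_t *t, proc_name_t name)
{
    assert(NULL != t);
    uint64_t key = name_key(name);
    size_t start = slot_for(key);
    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        size_t idx = (start + i) & (TABLE_SIZE - 1u);
        if (NULL == t->slots[idx]) {
            return NULL; /* an empty slot ends the probe sequence */
        }
        if (name_key(t->slots[idx]->name) == key) {
            return t->slots[idx];
        }
    }
    return NULL;
}
```

A caller handling an incoming message would look up the sender's name and, on a NULL result, create and insert a new proc object. The sketch has no deletion or resizing, which a real table would need.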
Nathan Hjelm
6f8f2325ed btl: btls are now required to set the send flag if supported
This commit updates each non-compliant btl to set the
MCA_BTL_FLAGS_SEND flag in the btl_flags field if send is
supported. This fixes a problem identified after the latest bml/r2
update which explicitly checks for the send flag.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-10 08:55:54 -06:00
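As a rough sketch of the kind of check the commit above refers to, a capability test on a flags field might look like the following. The EX_ names, the bit values and the ex_btl_module_t type are hypothetical and are not taken from the Open MPI headers. Only the idea of testing a send bit in btl_flags comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only; not copied from the BTL headers. */
enum {
    EX_BTL_FLAGS_SEND = 0x1,
    EX_BTL_FLAGS_PUT  = 0x2,
    EX_BTL_FLAGS_GET  = 0x4
};

/* Hypothetical module type carrying a capability bitmask. */
typedef struct { uint32_t btl_flags; } ex_btl_module_t;

/* True only if the module explicitly advertises send support. */
bool ex_btl_supports_send(const ex_btl_module_t *btl)
{
    assert(NULL != btl);
    return 0 != (btl->btl_flags & EX_BTL_FLAGS_SEND);
}
```

Under this kind of explicit check, a module that implements send but leaves the bit unset would be treated as unable to send, which is consistent with the problem the commit describes.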
Matias Cabral
f360eebfeb Merge pull request #855 from matcabral/btl_openib_mtu
Fix for openib btl mca command line parameter btl_openib_mtu being ignored
2015-09-09 11:22:00 -07:00
Rolf vandeVaart
2e64a69fa9 Add some verbosity to help debug hwloc issues 2015-09-08 10:50:22 -07:00
Jeff Squyres
bc9e5652ff whitespace: purge whitespace at end of lines
Generated by running "./contrib/whitespace-purge.sh".
2015-09-08 09:47:17 -07:00
Jeff Squyres
f782a7640e usnic: minor re-order of Makefile.am sources
Put the hwloc.c file alphabetically in the list.
2015-09-05 05:02:00 -07:00
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
matcabral
1f9218a0bc Fix for openib btl mca command line parameter btl_openib_mtu being
ignored.
2015-09-02 02:22:30 -07:00
Nathan Hjelm
f926796e57 Merge pull request #828 from hjelmn/openib_thread_fix
openib thread fixes
2015-09-01 09:12:50 -06:00
Jeff Squyres
8558458bb9 usnic: adjust for new PMIX argument type 2015-08-31 14:55:58 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated
ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Nathan Hjelm
64e4419d76 btl/openib: allow the use of the openib btl in thread multiple
There were several issues preventing the openib btl from running in
thread multiple mode:

 - Missing locks in UDCM when generating a loopback endpoint. Fixed in
   open-mpi/ompi@8205d79819.

 - Incorrect sequence numbers generated in debug mode. This did not
   prevent the openib btl from running but instead produced incorrect
   error messages in debug builds.

 - Recursive locking of the rcache lock caused by the malloc
   hooks. This is fixed by open-mpi/ompi#827

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 16:04:52 -06:00
Nathan Hjelm
c101385f64 btl/openib: fix sequence number generation for debug mode
When using eager RDMA in debug builds the openib btl generates a
sequence number for each send. The code independently updated the head
index and the sequence number for the eager rdma transaction. If
multiple threads enter this code at the same time and run in the
following order:

thread 1: update sequence (0 -> 1)
thread 2: update sequence (1 -> 2)
thread 2: update head (0 -> 1)
thread 1: update head (1 -> 2)

the sequence number for head[0] gets 1 and the sequence number for
head[1] gets 0. The fix is to generate the sequence number from the
head index.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 16:00:06 -06:00
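The fix described in the commit above can be sketched as follows. This is not the openib code: eager_ring_t, RING_SIZE and eager_ring_claim are hypothetical names, and the sketch assumes a monotonically increasing head counter updated with C11 atomics. The point is that a single atomic operation claims the position and the sequence number is derived from it, so two threads can no longer pair a head value with the wrong sequence number.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 16u /* hypothetical number of eager RDMA slots */

typedef struct {
    atomic_uint head;        /* next position to claim; only ever increases */
    uint32_t seq[RING_SIZE]; /* debug-only sequence number recorded per slot */
} eager_ring_t;

/*
 * Claim the next slot and stamp it with a sequence number.
 * The position comes from one atomic fetch-and-add, and the sequence
 * number is computed from that position instead of from a second,
 * independently updated counter.
 */
unsigned eager_ring_claim(eager_ring_t *ring)
{
    assert(NULL != ring);
    unsigned pos = atomic_fetch_add(&ring->head, 1u);
    unsigned slot = pos % RING_SIZE;
    ring->seq[slot] = pos;
    return slot;
}
```

With two independent counters, the interleaving listed in the commit message assigns sequence 1 to head[0] and sequence 0 to head[1]. Here each thread's slot and sequence number come from the same fetched value, so they always agree.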
Nathan Hjelm
8205d79819 btl/openib: add missing lock calls
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 12:21:49 -06:00
Gilles Gouaillardet
d02ccd67de btl/openib: remove OFED version runtime check when XRC is used
this test seems broken :
 - some false positives were reported
 - it fails to detect some OFED version mismatches
this commit simply removes this test, which means the application
will likely fail if XRC is used and the OFED version is different
between compile time and runtime
2015-08-14 09:10:03 +09:00
Jeff Squyres
14340770c4 usnic: remove some logically dead code
This code really had no purpose; just assign FI_VERSION(1, 1).  This
fixes CID 1315274.

Also clarify the comment about why we still retain libfabric v1.0.0
compatibility code, even though configure.m4 requires libfabric >= v1.1.0.
2015-08-12 05:21:18 -07:00
Jeff Squyres
236cf7ff62 usnic: add v1.8/v1.10 compat code
Add compat code so that I can sync master against the v1.10 branch.
2015-08-10 16:27:38 -07:00
Jeff Squyres
7da1c4b875 usnic: avoid race condition in connectivity checker
In short applications, it's possible that the agent (i.e., local rank
0) will finalize after non-local rank 0 procs detect the connectivity
checker named socket, but before they complete a connect() on it.  As
such, their connect() gets ECONNREFUSED.

This commit adds a simple counter in the agent that won't let it quit
before it accept()'s from all local procs, or 10 seconds goes by
(whichever occurs first).  This is similar to the timeout for the
clients: they'll exit if they don't see the expected named socket
within 10 seconds.
2015-08-10 15:40:33 -07:00
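The exit condition described in the commit above can be sketched as a small predicate. The agent_state_t type, its fields and both functions are hypothetical and do not come from the usnic BTL. Only the rule itself (wait for an accept() from every expected local proc, or give up after 10 seconds) is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define AGENT_EXIT_TIMEOUT_SEC 10 /* timeout stated in the commit message */

/* Hypothetical agent bookkeeping. */
typedef struct {
    int expected_clients;    /* local procs that should connect */
    int accepted_clients;    /* connections accept()ed so far */
    time_t finalize_started; /* when the agent first wanted to quit */
} agent_state_t;

/* Call once for every successful accept(). */
void agent_on_accept(agent_state_t *a)
{
    assert(NULL != a);
    ++a->accepted_clients;
}

/* The agent may quit once every client connected or the timeout expired. */
bool agent_may_exit(const agent_state_t *a, time_t now)
{
    assert(NULL != a);
    if (a->accepted_clients >= a->expected_clients) {
        return true;
    }
    return difftime(now, a->finalize_started) >= AGENT_EXIT_TIMEOUT_SEC;
}
```

The timeout keeps the agent from hanging forever if a local proc never connects, mirroring the 10 second limit the clients apply when waiting for the named socket.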
Jeff Squyres
bad508687e usnic: move cchecker to OPAL-wide progress thread
There's no longer any need for the usnic BTL to have its own progress
thread: it can use the opal_progress_thread() infrastructure.  This
commit removes the code to startup/shutdown the usnic-BTL-specific
progress thread and instead, just adds its events to the OPAL-wide
progress thread.

This necessitated a small change in the finalization step.
Previously, we would stop the progress thread and then tear down the
events.  We can no longer stop the progress thread, and if we start
tearing down events, this will cause shutdown/hangups to be sent
across sockets, potentially firing some of the still-remaining events
while some (but not all) of the data structures have been torn down.
Chaos ensues.

Instead, queue up an event to tear down all the pending events.  Since
the progress thread will only fire one event at a time, having a
teardown event means that it can tear down all the pending events
"atomically" and not have to worry that one of those events will get
fired in the middle of the teardown process.
2015-08-10 15:40:33 -07:00
Jeff Squyres
b5c37dbfe2 CSCuv67889: usnic: fix an error corner case
Ensure that we have non-NULL on all levels of pointers, which will
save us if there are exitable errors very early during component /
module initialization.
2015-08-06 10:54:28 -07:00
Jeff Squyres
cbcd16b399 usnic: remove a stale shell variable name 2015-07-31 18:53:54 -07:00
Jeff Squyres
0ee8295e6e usnic: ensure that we have libfabric >= v1.1 2015-07-31 18:53:54 -07:00
Jeff Squyres
2e7f794aae usnic: convert to use fi_recvmsg / FI_MORE
Minor optimization to post 16 receive buffers at a time (vs. 1).
2015-07-31 12:45:40 -07:00
Jeff Squyres
ec3a38384f Merge pull request #688 from jsquyres/pr/usnic-libfabric-msg-prefix-fix
usnic fixes for differences between libfabric v1.0.0 and v1.1.0
2015-07-21 10:18:36 -04:00
Gilles Gouaillardet
f7cf7d5070 configury: fix XRC detection on OFED < 3.12
since ibv_create_xrc_rcv_qp is now deprecated, and in order to
be "future-proof", we have to consider the case in which only XRC Domains are supported.
also, correctly handle distros that ship broken ibverbs devel headers

Thanks Paul Hargrove for the detailed report.
2015-07-13 10:43:22 +09:00
Ralph Castain
683efcb850 Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename. 2015-07-11 10:08:19 -07:00
Ralph Castain
61fb067f14 Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base 2015-07-11 06:42:23 -07:00
Jeff Squyres
633da6641e usnic: gracefully handle when we can't alloc an ACK
The comment didn't match the debugging code (which was ugly, and
apparently never happens, anyway).  Just return and let the sender
retransmit.
2015-07-10 14:19:33 -07:00
Jeff Squyres
3327fa56b5 usnic: minor code cleanups 2015-07-10 10:10:43 -07:00
Jeff Squyres
f9c65a701e usnic: "sin" assignment needs to be outside the #if
The "sin" variable is used below; need to ensure that it is assigned
for all builds (not just debug builds).
2015-07-10 06:51:03 -07:00
Jeff Squyres
cd87c8ad41 usnic: misc compiler warnings fixes 2015-07-10 06:51:03 -07:00
Jeff Squyres
ba429dc890 usnic: temporarily disable the BTL put method
The usnic BTL put method is currently broken.  Disable it until we can
fix it properly.
2015-07-10 06:51:03 -07:00
Jeff Squyres
f265358fbe usnic: handle FI_MSG_PREFIX differences libfabric v1.0.0->v1.1.0
In libfabric v1.0.0 (i.e., API v1.0), the usnic provider handled
FI_MSG_PREFIX inconsistently between sends and receives.  This has
been fixed in libfabric v1.1.0 (i.e., API v1.1): FI_MSG_PREFIX is
handled consistently for both sends and receives.

Run-time detect which libfabric we are running with and adapt behavior
appropriately.
2015-07-10 06:51:03 -07:00
Jeff Squyres
ddd0de6cfc usnic: make more OS-bypass memory Valgrind-defined
This helps reduce false positives when running MPI apps through
Valgrind.
2015-07-10 06:51:03 -07:00
Jeff Squyres
9bc7a54e0c usnic: correctly count CRC errors
Handle the differences between libfabric v1.0.0 and v1.1.0 in the
return value of fi_cq_readerr().

Also consolidate CRC and truncation errors into the same handling
block, since truncation errors are typically another symptom of CRC
errors.  This ensures that buffers get reposted properly.
2015-07-10 06:51:03 -07:00
Jeff Squyres
fc686f5538 usnic: make configure complain if libfabric cannot be found
Instead of silently determining that the usnic BTL can't be built,
announce that usnic is checking for libfabric support, and then
AC_MSG_RESULT the result of that check.
2015-07-10 06:45:33 -07:00
Jeff Squyres
4341639a66 Revert "configury: fix (again) XRC detection on OFED < 3.12"
@ggouaillardet is likely offline for the weekend, but master is broken
on RHEL 6.5 systems that do not have MOFED installed.  So I'm taking
the liberty of reverting this commit; I'm guessing Gilles will fixup
and re-commit next week.

This reverts commit 77f8282d51d8f40f6ae988ef84c9c852de75c625.
2015-07-10 06:45:33 -07:00
Gilles Gouaillardet
77f8282d51 configury: fix (again) XRC detection on OFED < 3.12
since ibv_create_xrc_rcv_qp is now deprecated, and in order to
be "future-proof", we have to consider the case in which only XRC Domains are supported.

Thanks Paul Hargrove for the detailed report.
2015-07-10 15:31:45 +09:00
Rolf vandeVaart
ae0f3cfee7 Make explicit call to initialize MCA parameters in common CUDA code. This allows us to view them with ompi_info and possibly modify with tools interface 2015-07-09 12:51:55 -04:00
Rolf vandeVaart
cdffa4724d Force smcuda BTL to use CUDA IPC path for all GPU buffers where possible 2015-07-08 17:11:25 -04:00
Gilles Gouaillardet
9f171de412 btl/openib: queue pending fragments once only when running out of credit
Fixes open-mpi/ompi#640
2015-07-06 09:45:01 +09:00
bosilca
77367ca02c Merge pull request #687 from rolfv/pr/fix-smcuda-perfprob
Add the ability to use different size buffers for host and CUDA buffers
2015-07-02 18:42:41 -04:00
Rolf vandeVaart
30a872b478 Add the ability to send host buffers through staging buffers of one size and CUDA buffers through staging buffers of a different size. Fixes performance issues 2015-07-02 11:11:15 -04:00
Alina Sklarevich
27797654db openib btl: added a new vendor_part_id for Mellanox ConnectX4-LX. 2015-06-29 13:50:43 +03:00
Jeff Squyres
a172bd161e usnic: switch to use the new libfabric common library
The usnic BTL configure.m4 no longer needs to OPAL_CHECK_LIBFABRIC; it
just uses the results from opal/mca/common/libfabric's configure.m4.
We also now don't need to link against libfabric -- they just link
against the opal_common_libfabric library.
2015-06-25 13:33:15 -07:00
Nathan Hjelm
ee36d813dc Merge pull request #657 from hjelmn/c99
more c99 updates
2015-06-25 11:21:09 -06:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Howard Pritchard
e49a37c034 ownership: update ownership files
per discussions at OMPI devel workshop

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-25 10:04:42 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Jeff Squyres
8ab2b11f88 btl_openib.c: fix another compiler warning
Remove this unused variable
2015-06-17 09:00:12 -07:00