397 Commits

Author SHA1 Message Date
Gilles Gouaillardet
135ecce0eb btl/openib: rename OPAL_HAVE_XRCD macro into OPAL_HAVE_CONNECTX_XRC_DOMAINS 2015-01-07 13:27:25 +09:00
Nathan Hjelm
6733d89cf9 btl/vader: fix return code check when opening ptrace_scope file 2015-01-06 15:17:56 -07:00
Nathan Hjelm
cde79bfa60 btl/openib: misc cleanup (tabs, etc) and put credit code into a common place (was duplicated in the send and sendi paths) 2015-01-06 11:39:23 -07:00
Nathan Hjelm
9bae131589 btl/openib: fix message coalescing
There was a bug in the openib btl handling this valid sequence of
calls:

desc = btl_alloc ();
btl_free (desc);

When triggered the bug would cause either fragment loss or undefined
behavior (SEGV, etc). The problem occurred because btl_alloc contained
the logic to modify the pending fragment (length, etc) and these
changes were not corrected if the fragment was freed instead of sent.

To fix this issue I 1) moved some of the coalescing logic to the
btl_send function, and 2) retry the coalesced fragment on btl_free
if it was never sent. This appears to completely address the issue.
2015-01-06 11:39:16 -07:00
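
The first part of the fix in 9bae131589 (deferring the change to the pending fragment until send time) can be pictured with a toy model. This is only a sketch under stated assumptions: `pending_frag_t`, `desc_t`, `toy_alloc`, `toy_send` and `toy_free` are hypothetical names invented for illustration and are not the openib BTL's data structures or functions. The retry-on-free part of the fix is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pending fragment that small messages are coalesced into. */
typedef struct {
    size_t length;    /* bytes committed to the fragment so far */
    size_t capacity;  /* total room in the fragment */
} pending_frag_t;

/* Hypothetical descriptor handed out by the allocator. */
typedef struct {
    pending_frag_t *target;  /* fragment this descriptor would extend */
    size_t size;             /* bytes reserved for this descriptor */
    bool sent;               /* set once the descriptor has been sent */
} desc_t;

/* Reserve room, but leave the pending fragment itself untouched. */
bool toy_alloc(pending_frag_t *frag, size_t size, desc_t *out)
{
    if (frag->length + size > frag->capacity) {
        return false;
    }
    out->target = frag;
    out->size = size;
    out->sent = false;
    return true;
}

/* Only at send time is the pending fragment's length actually changed. */
void toy_send(desc_t *d)
{
    assert(!d->sent);
    assert(d->target->length + d->size <= d->target->capacity);
    d->target->length += d->size;
    d->sent = true;
}

/* Freeing an unsent descriptor has nothing to undo on the fragment. */
void toy_free(desc_t *d)
{
    d->target = NULL;
    d->size = 0;
}
```

In this model the alloc-then-free sequence from the commit message leaves the fragment length unchanged, because alloc never modified it.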
Nathan Hjelm
9aaac11648 btl/openib: fix receive queue source detection 2015-01-06 11:39:11 -07:00
Howard Pritchard
7df648f1cf btl/openib: fix problems from commit b3617e73
For systems whose OFED lacks XRC support, commit b3617e73
broke the build of the openib btl.  This commit addresses
the issues introduced by that commit.
2015-01-06 11:31:12 -07:00
Gilles Gouaillardet
b3617e736e btl/openib: add XRC support with OFED 3.12+
Based on an original patch contributed by Bull.
2015-01-06 15:30:52 +09:00
Howard Pritchard
c857cc926c Merge pull request #327 from hppritcha/topic/async_progress
Topic/async progress
2015-01-05 16:20:44 -07:00
Howard Pritchard
0a6f841d5f xpmem/config: simple xpmem search on Cray's
Use the pkg-config related m4 functions to find out where
Cray's xpmem.h and libxpmem are located on a system.

With this commit, there is no longer any need to explicitly
specify an xpmem install location on the configure
line, at least for Cray systems running CLE 4.X and 5.X.
2014-12-24 14:40:06 -07:00
Howard Pritchard
065c756860 btl/ugni: improve error handling
Improve error handling when pthread functions return errors.
Remove stale debug code.
2014-12-24 11:50:24 -07:00
Howard Pritchard
f8e354ce00 btl/ugni: add a request_progress_thread mca param
Replace temporary environment variables with an MCA
parameter for the ugni btl.  A user wishing to
use the ugni btl async progress thread needs to
set the request_progress_thread param to true.
For example, using the environment variable format:

export OMPI_MCA_btl_ugni_request_progress_thread=1
2014-12-24 11:50:24 -07:00
Howard Pritchard
8b250cc15b btl/ugni: more debug cleanup 2014-12-24 11:50:24 -07:00
Howard Pritchard
f0c519517b btl/ugni: switch to using opal_progress
Switch to invoking opal_progress from the async progress
thread, rather than calling ugni btl specific progress.
2014-12-24 11:50:24 -07:00
Howard Pritchard
47747c1b27 btl/ugni: remove some debug output 2014-12-24 11:50:24 -07:00
Howard Pritchard
2d14c2a204 btl/ugni: switch to using tx cq irqs for rdma
Verified via unit tests and other testing that BTE TX
descriptors using CQs configured to generate IRQs
work correctly on Cray XC.  Disable sending a message
back to self and just use the IRQs generated
by completion of TX descriptors posted to the BTE.
2014-12-24 11:50:24 -07:00
Howard Pritchard
acd07d98da btl/ugni: turn off chatty debug in irq cq setup 2014-12-24 11:50:24 -07:00
Howard Pritchard
0dec2f4af7 btl/ugni: mark btl frags for irqs as btl owned
Make sure frags allocated to generate irqs to wake
the progress thread, etc. set the MCA_BTL_DES_FLAGS_BTL_OWNERSHIP
flag.
2014-12-24 11:50:23 -07:00
Howard Pritchard
d188f0bc6f btl/ugni: honor enable_mpi_threads
Honor enable_mpi_threads setting to enable the ugni btl
async progress thread.  If the app doesn't request thread-multiple
the thread will not be created.
2014-12-24 11:50:23 -07:00
Howard Pritchard
43cdcb745f btl/ugni: add missing mutex lock 2014-12-24 11:50:23 -07:00
Howard Pritchard
83bcbd1cf9 btl/ugni: compilation fixes
Fix compilation problems in ugni btl associated with
async progress additions.
2014-12-24 11:50:23 -07:00
Howard Pritchard
13ab8a9e5a btl/ugni: use MCA_BTL_DES_FLAGS_SIGNAL
Use MCA_BTL_DES_FLAGS_SIGNAL frag flag to indicate
whether or not an interrupt needs to be delivered
along with a control message going through smsg.
2014-12-24 11:50:23 -07:00
Howard Pritchard
3fc7b389ff initial async progress changes for gni 2014-12-24 11:50:23 -07:00
Devendar Bureddy
ccafc62c07 OMPI: btl openib: fix max registrable memory calculation
- By default, allow registering the maximum possible memory
  (i.e. 2 * total_memory). This behaviour can be turned off using the
  MCA parameter "btl_openib_allow_max_memory_registration".

- In the fallback case, use device-specific parameters to calculate
  the memory limit.
2014-12-23 23:35:54 +02:00
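
The default limit described in ccafc62c07 (twice the total memory) can be sketched as follows. This is an illustration only, not the openib BTL code: the function names are hypothetical, and reading physical memory through `sysconf(_SC_PHYS_PAGES)` is an assumption that holds on Linux and similar systems. The device-specific fallback is not shown because the commit message does not spell it out.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Total physical memory in bytes, or 0 if it cannot be determined.
 * Assumes the platform supports _SC_PHYS_PAGES (e.g. Linux). */
uint64_t total_memory_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages <= 0 || page_size <= 0) {
        return 0;
    }
    return (uint64_t)pages * (uint64_t)page_size;
}

/* Hypothetical helper: the default cap is 2 * total_memory. */
uint64_t default_registration_limit(uint64_t total_memory)
{
    assert(total_memory <= UINT64_MAX / 2);  /* guard against overflow */
    return 2 * total_memory;
}
```

A caller would pass `total_memory_bytes()` to `default_registration_limit()` and fall back to another calculation when the first returns 0.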
Howard Pritchard
ffbf9738a3 btl/vader: disable SGI UV xpmem for now
This commit allows master to build again on SGI UV systems.

Fixes #322
2014-12-23 12:04:25 -07:00
Rolf vandeVaart
26482db736 Bump up max send size. Gives much better performance for GPU transfers while only slightly reducing host transfer performance. 2014-12-18 13:22:58 -08:00
Jeff Squyres
c621d1e622 libfabric: don't LIBADD the common library in the static case
Adding the libfabric common library in the --disable-dlopen case will
result in duplicate symbols.
2014-12-18 11:04:08 -08:00
Jeff Squyres
269d7f9713 openib: don't use opal_using_threads() in component_init
Use the flag that was passed in, instead.
2014-12-17 15:08:43 -08:00
Jeff Squyres
d6f059f538 configury: add some descriptive output messages in configure
Ensure that the ofi MTL and the usnic BTL have good descriptive output
messages in configure.
2014-12-17 13:36:01 -08:00
Rolf vandeVaart
f55de452ab Change the way we register the sm memory pool with CUDA. Rather than just registering local free lists, register the entire pool as the local process does not know which memory the remote processes are using for free lists. Fixes performance problem we were seeing with copying out of memory (since host piece was not pinned). 2014-12-17 14:21:34 -05:00
George Bosilca
830df07202 Fix the indentation. 2014-12-16 16:07:42 -05:00
George Bosilca
146ab96e29 These variables are now unnecessary. 2014-12-16 16:05:00 -05:00
Aurélien Bouteiller
ee3b090316 The fallback case when yama is not installed was not correct in CMA vader 2014-12-16 14:39:14 -05:00
Aurélien Bouteiller
0bf860ef02 indentation 2014-12-16 14:22:26 -05:00
Jeff Squyres
95da4a5a0e usnic: no longer use opal_using_threads()
Instead, use the flag that is passed in.
2014-12-16 08:49:01 -08:00
George Bosilca
2fec570fe7 There is no need to keep track of these events. They are scheduled
as triggers in libevent, so a single set of bookkeeping should be enough.
2014-12-15 22:35:29 -05:00
George Bosilca
46baab350c The event is automatically deleted by default. 2014-12-15 21:59:20 -05:00
George Bosilca
b01abfa0d7 Don't over-do it! 2014-12-15 21:33:32 -05:00
George Bosilca
f87a4b691b Solve another handshake problem, where one thread was calling del_event
while cleaning up after receiving zero bytes on the connect socket
(locally started connection), while another was trying to accept a
new connection from the same peer. Create a zero-timed event and
move the accept into a timer_event.
Add support for registering an error callback, that can be used when a
connection is discovered as failed during the initialization process.
2014-12-15 20:27:32 -05:00
George Bosilca
e20413c885 Rearrange the code to remove a compiler complaint about
the missing return from a non-void function.
2014-12-15 15:42:57 -05:00
George Bosilca
2edbe16c47 Add the necessary infrastructure to allow the dumping of all TCP
information related to an endpoint (status and all pending fragments).
Do some minor space cleanup.
2014-12-13 01:59:55 -05:00
George Bosilca
5b8616d890 Fix the race condition in endpoint connection initialization. The race
was quite subtle, and only happened on the process with the smallest
guid (as this process will tear down the connection created locally and
replace it with the result of accept). If multiple threads are active in
the system, the deadlock occurs during the recv event deletion as one
thread will hold the recv event lock of the endpoint and try to access
the TCP event base lock, while the other thread will hold the TCP event
base lock while trying to access the recv event lock (in case data is
available on the socket).

The proposed solution lets the event callback fail to process the data,
preventing the deadlock and allowing the other thread to always complete
its job. As the event is not executed, the same trigger will fire
again at the next opportunity, so this solution introduces a minimal
delay in the connection establishment.
2014-12-13 01:45:00 -05:00
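
The deadlock in 5b8616d890 is a lock-order inversion: each thread holds one lock and waits for the other. The "let the callback fail and fire again later" idea can be sketched with plain pthreads. This is a minimal illustration under stated assumptions, not the TCP BTL or libevent code: `endpoint_lock`, `bytes_processed` and `recv_callback` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the endpoint's recv event lock. */
static pthread_mutex_t endpoint_lock = PTHREAD_MUTEX_INITIALIZER;
static int bytes_processed = 0;

/*
 * Imagined to run while the caller already holds the event base lock.
 * Blocking on endpoint_lock here could deadlock against a thread that
 * holds endpoint_lock and is waiting for the event base lock, so the
 * callback tries once and reports failure instead of waiting.
 */
bool recv_callback(int nbytes)
{
    assert(nbytes >= 0);
    if (pthread_mutex_trylock(&endpoint_lock) != 0) {
        return false;  /* busy: skip for now, the event will fire again */
    }
    bytes_processed += nbytes;
    pthread_mutex_unlock(&endpoint_lock);
    return true;
}
```

The cost of this design is the one the commit message names: when the lock is busy, the data is handled on a later trigger, which adds a small delay.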
Nathan Hjelm
38d66272c5 btl/vader: fix compile on SGI UV 2014-12-12 09:09:01 -07:00
Jeff Squyres
cd0a54d76f usnic: short term fix to enable builds on non-libfabric platforms
This isn't quite the Right fix yet, because it doesn't address usnic
for external libfabric builds.  I'll fix that separately / later.
2014-12-09 09:19:26 -08:00
Jeff Squyres
6e24a1eb85 usnic: update for libfabric API change
Use FI_ADDR_UNSPEC for posting a receive from an unspecified source.
2014-12-09 06:06:52 -08:00
Jeff Squyres
9547345b18 usnic: fix show_help message
Rename a few symbols to use libfabric-friendly names.  Fix a show_help
message when fi_av_insert times out.
2014-12-08 11:39:07 -08:00
Jeff Squyres
8e49cc754f usnic: update to latest libfabric API changes 2014-12-08 11:37:37 -08:00
Jeff Squyres
984982790a usnic: convert from verbs to libfabric (yay!)
This commit represents the conversion of the usnic BTL from verbs to
libfabric.

For the moment, libfabric is embedded in Open MPI (currently in the
usnic BTL).  This is because the libfabric API is still changing, and
also has not yet been released.  Ultimately, this embedded copy of
libfabric will likely disappear and the usnic BTL will rely on an
external installation of libfabric.

New configure options:

* --with-libfabric: will cause configure to fail if libfabric support
    cannot be built
* --without-libfabric: will prevent libfabric support from being built
* --with-libfabric=DIR: use an external libfabric installation
* --with-libfabric-libdir=LIBDIR: when paired with --with-libfabric=DIR,
    use LIBDIR for the libfabric installation library dir

The --with-libnl3[-libdir] arguments are now gone.
2014-12-08 11:37:37 -08:00
Nathan Hjelm
f989fe27b8 btl/vader: workaround to make jenkins happy 2014-12-03 15:51:58 -07:00
Todd Kordenbrock
c0c680bccb Portals4 BTL: Do not disqualify if a peer does not put Portals4 BTL modex info
If OPAL_MODEX_RECV() returns OPAL_ERR_NOT_FOUND, the peer didn't
send any Portals4 BTL info.  This is not a fatal error.  Instead of
disqualifying the Portals4 BTL just ignore that peer.

@jsquyres reported this in #194.
2014-12-03 14:22:10 -06:00
George Bosilca
dee243c58d ompi_proc_finalize has an interesting side effect. A proc is
inserted in the ompi_proc_list as soon as it is created and it
is removed only upon the call to the destructor. In ompi_proc_finalize
we loop over all procs in ompi_proc_list and release them once.
However, as a proc is not removed from this list right away, we
decrease the ref count for each proc until it reaches zero and the
proc is finally removed. Thus, we cannot clean the BML/BTL after
the call to ompi_proc_finalize.
A quick fix is to delay the call to ompi_proc_finalize until all
other frameworks have been finalized, and then the behavior
depicted above will give the expected outcome.
2014-11-28 18:26:36 -05:00
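
The side effect described in dee243c58d can be reproduced with a toy reference-counted list. This is a sketch only: `proc_t` and the `proc_*` functions are hypothetical and do not use the OPAL object system. What it shares with the description is that an object leaves the list only in its destructor, and that finalize keeps releasing the head of the list until the list is empty.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted proc kept on a global list. */
typedef struct proc {
    int refcount;
    struct proc *next;
} proc_t;

static proc_t *proc_list = NULL;

/* A proc is put on the list as soon as it is created. */
proc_t *proc_create(void)
{
    proc_t *p = malloc(sizeof(*p));
    assert(p != NULL);
    p->refcount = 1;
    p->next = proc_list;
    proc_list = p;
    return p;
}

void proc_retain(proc_t *p)
{
    p->refcount++;
}

/* The proc is unlinked from the list only when it is destroyed. */
void proc_release(proc_t *p)
{
    if (--p->refcount > 0) {
        return;
    }
    for (proc_t **pp = &proc_list; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == p) {
            *pp = p->next;
            break;
        }
    }
    free(p);
}

/* Releases the head until the list is empty, so references held by
 * other owners are consumed too. Returns the number of release calls. */
int proc_finalize(void)
{
    int releases = 0;
    while (proc_list != NULL) {
        proc_release(proc_list);
        releases++;
    }
    return releases;
}
```

With one proc holding three references and another holding one, finalize performs four releases. Any owner that still expected to drop its own reference afterwards would touch freed memory, which is why the commit delays the call until the other frameworks have been finalized.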