1
1

177 Коммитов

Автор SHA1 Сообщение Дата
Howard Pritchard
13ab8a9e5a btl/ugni: use MCA_BTL_DES_FLAGS_SIGNAL
Use MCA_BTL_DES_FLAGS_SIGNAL frag flag to indicate
whether or not an interrupt needs to be delivered
along with a control message going through smsg.
2014-12-24 11:50:23 -07:00
Howard Pritchard
3fc7b389ff initial async progress changes for gni 2014-12-24 11:50:23 -07:00
Devendar Bureddy
ccafc62c07 OMPI: btl openib: fix max registarable memory caluclation
- by default allow to register maximum possible (i.e 2 * total_memory)
      memory. This beheviour can be turned off using mca parameter
      "btl_openib_allow_max_memory_registration"

    - In fallback case, use device specific parameters to calulate
      memory limit.
2014-12-23 23:35:54 +02:00
Howard Pritchard
ffbf9738a3 btl/vader: disable SGI UV xpmem for now
This commit allows master to build again on SGI UV systems.

Fixes #322
2014-12-23 12:04:25 -07:00
Rolf vandeVaart
26482db736 Bump up max send size. Gives much better performance for GPU transfers while only decreasing host transfers by a small amount. 2014-12-18 13:22:58 -08:00
Jeff Squyres
c621d1e622 libfabric: don't LIBADD the common library in the static case
Adding the libfabric common library in the --disable-dlopen case will
result in duplicate symbols.
2014-12-18 11:04:08 -08:00
Jeff Squyres
269d7f9713 openib: don't use opal_using_threads() in component_init
Use the flag that was passed in, instead.
2014-12-17 15:08:43 -08:00
Jeff Squyres
d6f059f538 configury: add some descriptive output messages in configure
Ensure that the ofi MTL and the usnic BTL have good descriptive output
messages in configure.
2014-12-17 13:36:01 -08:00
Rolf vandeVaart
f55de452ab Change the way we register the sm memory pool with CUDA. Rather than just registering local free lists, register the entire pool as the local process does not know which memory the remote processes are using for free lists. Fixes performance problem we were seeing with copying out of memory (since host piece was not pinned). 2014-12-17 14:21:34 -05:00
George Bosilca
830df07202 Fix the indentation. 2014-12-16 16:07:42 -05:00
George Bosilca
146ab96e29 These variables are now unnecessary. 2014-12-16 16:05:00 -05:00
Aurélien Bouteiller
ee3b090316 The fallback case when yama is not installed was not correct in CMA vader 2014-12-16 14:39:14 -05:00
Aurélien Bouteiller
0bf860ef02 indentation 2014-12-16 14:22:26 -05:00
Jeff Squyres
95da4a5a0e usnic: no longer use opal_using_threads()
Instead, use the flag that is passed in.
2014-12-16 08:49:01 -08:00
George Bosilca
2fec570fe7 There is no need to keep track of these events. They are scheduled
as triggers in libevent, so one bookkepping should be enough.
2014-12-15 22:35:29 -05:00
George Bosilca
46baab350c The event is automatically deleted by default. 2014-12-15 21:59:20 -05:00
George Bosilca
b01abfa0d7 Don't over-do it! 2014-12-15 21:33:32 -05:00
George Bosilca
f87a4b691b Solve another handshake problem, where one threads was calling del_event
while cleaning up after receiving a zero byte on the connect socket
(localyy started connection), while another was trying to accept a
new connection from the same peer. Create a zero-timed event and
delocalize the accept into a timer_event.
Add support for registering an error callback, that can be used when a
connection is discovered as failed during the initialization process.
2014-12-15 20:27:32 -05:00
George Bosilca
e20413c885 Rearrange the code to remove a compiler complaint about
the missing return from a non-void function.
2014-12-15 15:42:57 -05:00
George Bosilca
2edbe16c47 Add the necessary infrastructure to allow the dumping of all TCP
informations related to an endpoint (status and all pending fragments).
Do some minor space cleanup.
2014-12-13 01:59:55 -05:00
George Bosilca
5b8616d890 Fix the race condition in endpoint connection initialization. The race
was quite subtle, and only happened on the process with the smallest
guid (as this process will tear down the connection created locally and
replace it with the result of accept). If multiple threads are active in
the system, the deadlock occurs during the recv event deletion as one
thread will hold the recv event lock of the endpoint and try to access
the TCP event base lock, while the other thread will hold the TCP event
base lock while trying to access the recv event lock (in case data is
available on the socket).

The proposed solution let the event callback fail to process the data,
preventing the deadlock and allowing the other thread to always complete
it's job. As the event is not execute the same triggered will trigger
again at the next opportunity, so this solution introduce a minimal
delay in the connection establishement.
2014-12-13 01:45:00 -05:00
Nathan Hjelm
38d66272c5 btl/vader: fix compile on SGI UV 2014-12-12 09:09:01 -07:00
Jeff Squyres
cd0a54d76f usnic: short term fix to enable builds on non-libfabric platforms
This isn't quite the Right fix yet, because it doesn't address usnic
for external libfabric builds.  I'll fix that separately / later.
2014-12-09 09:19:26 -08:00
Jeff Squyres
6e24a1eb85 usnic: update for libfabric API change
Use FI_ADDR_UNSPEC for posting a receive from an unspecified source.
2014-12-09 06:06:52 -08:00
Jeff Squyres
9547345b18 usnic: fix show_help message
Rename a few symbols to use libfabric-friendly names.  Fix a show_help
message when fi_av_insert times out.
2014-12-08 11:39:07 -08:00
Jeff Squyres
8e49cc754f usnic: update to latest libfabric API changes 2014-12-08 11:37:37 -08:00
Jeff Squyres
984982790a usnic: convert from verbs to libfabric (yay!)
This commit represents the conversion of the usnic BTL from verbs to
libfabric.

For the moment, libfabric is embedded in Open MPI (currently in the
usnic BTL).  This is because the libfabric API is still changing, and
also has not yet been released.  Ultimately, this embedded copy of
libfabric will likely disappear and the usnic BTL will rely on an
external installation of libfabric.

New configure options:

* --with-libfabric: will cause configure to fail if libfabric support
    cannot be built
* --without-libfabric: will prevent libfabric support from being built
* --with-libfabric=DIR: use an external libfabric installation
* --with-libfabric-libdir=LIBDIR: when paired with --with-libfabric=DIR,
    use LIBDIR for the libfabric installation library dir

The --with-libnl3[-libdir] arguments are now gone.
2014-12-08 11:37:37 -08:00
Nathan Hjelm
f989fe27b8 btl/vader: workaround to make jenkins happy 2014-12-03 15:51:58 -07:00
Todd Kordenbrock
c0c680bccb Portals4 BTL: Do not disqualify if a peer does not put Portals4 BTL modex info
If OPAL_MODEX_RECV() returns OPAL_ERR_NOT_FOUND, the peer didn't
send any Portals4 BTL info.  This is not a fatal error.  Instead of
disqualifying the Portals4 BTL just ignore that peer.

@jsquyres reported this in #194.
2014-12-03 14:22:10 -06:00
George Bosilca
dee243c58d ompi_proc_finalize has an interesting side effect. A proc is
inserted in the ompi_proc_list as soon as it is created and it
is removed only upon the call to the destructor. In ompi_proc_finalize
we loop over all procs in ompi_proc_finalize and release them once.
However, as a proc is not removed from this list right away, we
decrease the ref count for each proc until it reach zero and the
proc is finally removed. Thus, we cannot clean the BML/BTL after
the call the ompi_proc_finalize.
A quick fix is to delay the call to ompi_proc_finalize until all
other frameworks have been finalized, and then the behavior
depicted above will give the expected outcome.
2014-11-28 18:26:36 -05:00
Gilles Gouaillardet
a6744b8177 fix misc memory leaks specific to the master 2014-11-25 13:52:10 +09:00
Gilles Gouaillardet
758f7ab768 Revert "btl/vader: use FRAG_ALLOC_USER when single_copy_mechanism is VADER_NONE"
as discussed with @hjelmn in open-mpi/ompi-release#86

This reverts commit d2d7f39a4bc8dde7e8c480f248819ec0d0e6be62.
2014-11-20 16:04:55 +09:00
Nathan Hjelm
1b564f62bd Revert "Merge pull request #275 from hjelmn/btlmod"
This reverts commit ccaecf0fd6c862877e6a1e2643f95fa956c87769, reversing
changes made to 6a19bf85dde5306f559f09952cf3919d97f52502.
2014-11-19 23:22:43 -07:00
Nathan Hjelm
b1f9569b7d Revert "btl/openib: fix warnings"
This reverts commit 6e6c786b49dc9df70c87bb8348c12b6f68de173b.
2014-11-19 23:16:16 -07:00
Nathan Hjelm
6e6c786b49 btl/openib: fix warnings 2014-11-19 15:57:01 -07:00
Nathan Hjelm
2b579610f2 btl/openib: fix compilation issues with XRC 2014-11-19 11:44:48 -07:00
Nathan Hjelm
2a382c2ec1 add btl comment 2014-11-19 11:33:04 -07:00
Nathan Hjelm
bf7daac388 btl/openib: add atomic operation support 2014-11-19 11:33:04 -07:00
Nathan Hjelm
45d1fac8af ugni thread safety fixes 2014-11-19 11:33:03 -07:00
Nathan Hjelm
5e7c77c576 btl/ugni: add support for atomic operations 2014-11-19 11:33:03 -07:00
Nathan Hjelm
90554d0f95 btl/openib: misc cleanup (tabs, etc) and put credit code into a common place (was duplicated in the send and sendi paths) 2014-11-19 11:33:03 -07:00
Nathan Hjelm
4122067236 btl/openib: fix message coalescing 2014-11-19 11:33:03 -07:00
Nathan Hjelm
38e9611930 btl/openib: fix recieve queue source detection 2014-11-19 11:33:03 -07:00
Nathan Hjelm
7c43b566d2 more openib updates 2014-11-19 11:33:03 -07:00
Nathan Hjelm
3ea10476a4 btl/sm fix compilation when not using CMA or KNEM 2014-11-19 11:33:03 -07:00
Nathan Hjelm
4ccb20b097 btl: fix warning about enum type and modify btl_sendi to allow the
value NULL for the descriptor

The send inline optimization uses the btl_sendi function to achieve
lower latency and higher message rates. The problem is the btl_sendi
function was allowed to return a descriptor to the caller. This is fine
for some paths but not ok for the send inline optimization. To fix
this the btl now must be able to handle descriptor = NULL.
2014-11-19 11:33:03 -07:00
Nathan Hjelm
2a70238f4d First crack at adding atomic operation support 2014-11-19 11:33:03 -07:00
Nathan Hjelm
249e5e009f Fix knem support in both sm and vader 2014-11-19 11:33:02 -07:00
Nathan Hjelm
e03956e099 Update the scif and openib btls for the new btl interface
Other changes:
 - Remove the registration argument from prepare_src since it no
   longer is meant for RDMA buffers.

 - Additional cleanup and bugfixes.
2014-11-19 11:33:02 -07:00
Nathan Hjelm
ec33374339 btl: remove des_remote/des_remote_count from the mca_btl_base_descriptor_t
structure

This structure member was originally used to specify the remote segment
for an RDMA operation. Since the new btl interface no longer uses
desriptors for RDMA this member no longer has a purpose. In addition
to removing these members the local segment information has been
renamed to des_segments/des_segment_count.
2014-11-19 11:33:02 -07:00