1
1

2438 Коммитов

Автор SHA1 Сообщение Дата
Gilles Gouaillardet
d8482ce6f4 opal/mca/memory: add a memoryc_set_alignment subroutine to the OPAL memory MCA
this commit also (partially) reverts :
 - open-mpi/ompi@7de01b347c
 - open-mpi/ompi@8b05f308f9
2016-02-24 09:50:12 +09:00
Ralph Castain
d653cf2847 Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM. 2016-02-21 11:55:49 -08:00
Ralph Castain
8c92a179c0 Minor memory leak 2016-02-19 15:05:39 -08:00
Nathan Hjelm
2031bb6f01 btl/openib: XRC save SRQ#s on the loopback endpoint
This commit fixes a bug that can occur when communicating via XRC to
peers on the same node. UDCM was not saving the SRQ numbers on the
loopback endpoint (which shares its ib_addr info with all local peers)
so any messages to local peers use an invalid SRQ number.

Fixes open-mpi/ompi#1383

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-02-18 20:59:11 -07:00
rhc54
bfd4254a7b Merge pull request #1382 from rhc54/topic/cleanup
Cleanup some valgrind complaints about jumps with uninitialized values.
2016-02-18 17:29:37 -08:00
Ralph Castain
6e68d758b9 Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files).
ck

Remove stale debug

Fix a segfault if no subscribers are present
2016-02-18 16:30:37 -08:00
Nathan Hjelm
371df45bf8 btl/openib: fix locking bugs with XRC ib_addr lock
This bug fixes two issue with the ib_addr lock:

 - The ib_addr lock must always be obtained regardless of
   opal_using_threads() as the CPC is run in a seperate thread.

 - The ib_addr lock is held in mca_btl_openib_endpoint_connected when
   calling back into the CPC start_connect on any pending
   connections. This will attempt to obtain the ib_addr lock
   again. Since this is not a performance-critical part of the code
   the lock has been changed to be recursive.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 15:55:34 -07:00
Nathan Hjelm
4dc73d7765 btl/openib: XRC fix bug that could cause an invalid SRQ# to be used
This commit fixes a bug that occurs when attempting a get or put
operation on an endpoint that is not already connected. In this case
the remote_srqn may be set to an invalid value as the rem_srqs array
on the endpoint is not populated. This commit moves the usage of the
rem_srqs array to the internal put/get functions where it is
guaranteed this array is populated.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 15:44:29 -07:00
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes ##1225
2016-02-18 09:29:12 -08:00
rhc54
2745610eb7 Merge pull request #1377 from rhc54/topic/pmix
Plug a leak in the PMIx subsystem
2016-02-17 20:05:45 -08:00
Ralph Castain
efb0eff43e Plug a leak in the PMIx subsystem 2016-02-17 19:00:36 -08:00
rhc54
dc4d3edc06 Merge pull request #1372 from rhc54/topic/sing
Further enhance the support for Singularity containers.
2016-02-17 16:39:23 -08:00
Nathan Hjelm
4f4ea96940 btl/openib/udcm: fix local XRC connections
This commit ensures ib_addr->remote_xrc_rcv_qp_num value is set when
creating the loopback queue pair. This is needed when communicating
with any other local peer.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-17 14:54:19 -07:00
Ralph Castain
8f9508cace Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP. 2016-02-17 13:33:06 -08:00
Nathan Hjelm
bf8360388f btl/openib/udcm: fix XRC support
This commit fixes two bugs in XRC support

 - When dynamic add_procs support was added to master the remote
   process name was added to the non-XRC request structure. The same
   value was not added to the XRC xconnect structure. This error was
   not caught because the send/recv code was incorrectly using the
   wrong structure member. This commmit adds the member and ensure the
   xconnect code uses the correct structure.

 - XRC loopback QP support has been fixed. It was 1) not setting the
   correct fields on the endpoint structure, 2) calling
   udcm_xrc_recv_qp_connect, and 3) was not initializing the endpoint
   data.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:49:04 -07:00
Nathan Hjelm
201c280e6c btl/openib: fix error in param check in mca_btl_openib_put
mca_btl_openib_put incorrectly checks the qp inline max before
allowing an inline put. This check will always fail for an endpoint
that has not been connected. The commit changes the check to use the
btl_put_local_registration_threshold instead.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:46:32 -07:00
Nathan Hjelm
123a39ac3c btl/openib: fix regression in XRC support
Commit open-mpi/ompi@400af6c52d
introduced a regression in XRC support. The commit reversed the
ordering of shared receive queue (SRQ) and completion queue (CQ)
completion. CQ creation must always preceed SRQ creation when using
XRC as the CQs are needed to create the SRQs. This commit fixes the
ordering so that CQs are always created before SRQs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:46:20 -07:00
George Bosilca
9dc79f4bc1 Initialize these 2 common symbols. 2016-02-15 12:27:24 -05:00
igor-ivanov
d9eefefa74 Merge pull request #1351 from igor-ivanov/pr/issue-1336
opal/memory: Move Memory Allocation Hooks usage from openib
2016-02-15 14:07:36 +04:00
Ralph Castain
aa9e5a1a27 Add support for Singularity containers, including a .m4 file for checking if Singularity is available and an orte/schizo component for setting the proper support if a container was given as the executable
Cleanup the configury so we properly check for Singularity under the various typical use-cases

Bring the Singularity support online. We have to turn "off" the sm BTL as it segfaults from inside the container - root cause remains unclear. Also turned "off" the various OPAL shmem components in case they are involved and someone else tries to use them. Happily, the vader BTL works just fine!
2016-02-13 04:40:22 -08:00
Igor Ivanov
8b05f308f9 opal/memory: Move Memory Allocation Hooks usage from openib
These changes fix issue https://github.com/open-mpi/ompi/issues/1336

- improve abstractions: opal/memory/linux component should be single place that opeartes with
Memory Allocation Hooks.
- avoid collisions in case dynamic component open/close: it is safe because it is linked statically.
- does not change original behaivour.
2016-02-11 14:46:35 +02:00
Jeff Squyres
8d0a592563 usnic: update a few verbose reachability messages 2016-02-06 03:28:48 -08:00
Jeff Squyres
87dbe6ce01 usnic: add high-verbose reachability messages 2016-02-06 03:28:47 -08:00
Jeff Squyres
dac2fe1589 usnic: ensure to use ntohl() for network-order values 2016-02-06 03:28:47 -08:00
Jeff Squyres
51240394a7 usnic: ensure to init module->av_eq_num 2016-02-06 03:28:47 -08:00
Jeff Squyres
89eea51075 usnic: fix calculation for number of blocks 2016-02-02 16:56:34 -08:00
Jeff Squyres
d812695201 verbs: fix typo 2016-02-02 14:23:45 -08:00
Jeff Squyres
2cf9b26d34 verbs_usnic: previous commit missed a symbol
0715802f52c24c236700ac085090d5441524644c missed that there is a call
to a common/verbs_usnic symbol in the common/verbs component.  This
call needs to be compiled out when the common/verbs_usnic component is
not built.
2016-02-02 14:05:59 -08:00
Nathan Hjelm
a016c17714 Merge pull request #1338 from hjelmn/ugni_threading
UNGI threading fixes
2016-02-02 13:22:57 -07:00
Jeff Squyres
0715802f52 verbs_usnic: do not build by default
This component is a workaround to a bug in libibverbs that prints a
dire warning that usNIC devices are not supported (of course not --
usNIC devices provide functionality through libfabric, not
libibverbs).  This component was written before a better workaround
was created: a "no op" libibverbs plugin for usNIC devices
(https://github.com/cisco/libusnic_verbs, and is also available in
binary form on cisco.com).

Hence, this component no longer builds by default.  It's still
available if a user specifically asks for it (e.g., if they do not
want to install the "no op" libibverbs plugin), but it's not the
default.  This component also has the side-effect of making
libopen-pal.so depend on libibverbs.so, which can be annoying for
packagers (which is another reason it isn't built by default any
more).
2016-02-02 11:22:04 -08:00
Nathan Hjelm
cd11fc3081 btl/ugni: fix race condition that causes completions to be dropped
The send code in the ugni btl has an optimization that enables it to
return 1 (fragment gone) in some cases. This optimization involved
removing the btl ownership and callback flags to ensure the fragment
stuck around long enough for its completion flag to be checked. This
works fine for the single-threaded case but not in the multi-threaded
case. It is possible that a fragment will be completed by another
thread while a thread is in mca_btl_ugni_send. This competition can
lead to a leaked fragment, missed callback, or both. To fix the issue
without removing the optimization a reference count has been added to
the fragment. Callbacks and fragment release will not be made until
the fragment reference count has reach 0. The count is incremented
before sending the frag and decremented after the completion flag has
been checked. The fix has been verified to work using a multi-threaded
RMA benchmark with the osc/pt2pt component.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:14:31 -07:00
Nathan Hjelm
14704201e2 btl/ugni: fix race condition when adding endpoint to wait list
This commit fixes a race condition that can cause an endpoint to be
added to the wait list multiple times. To fix the issue an additional
check has been added to ensure the endpoint is not on the wait list
after the wait list lock is held. The wait list processing code has
also been updated to keep the wait list lock until all wait listed
endpoints have been handled. This reduces the chance that an endpoint
that is being processed by the wait list code is not re-added to the
list by a competing send.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:13:49 -07:00
Jeff Squyres
9f3ed00125 usnic: minor updates from code review
Three minor updates from the code review of
https://github.com/open-mpi/ompi-release/pull/933:

* Remove an extra blank line a show_help message
* We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so
  change the flag to REGINT_GE_ONE
* Change "num_blocks" definition to be in terms of block_len (not
  eq_size)
2016-02-01 11:14:30 -08:00
Jeff Squyres
c2615a4732 usnic: change retrans timeout to 5ms
A bunch of empirical testing has shown that increasing the retranmit
timeout from 1ms to 5ms doesn't adversely affect performance, yet
decreases the number of gratuitious retransmissions.
2016-01-30 10:49:14 -08:00
Jeff Squyres
797d5026c8 usnic: better av_eq_num default value handling 2016-01-30 10:46:14 -08:00
Jeff Squyres
db825abc00 usnic: don't overrun the fi_av_insert() EQ
Add endpoints in a blocked manner so that we don't overrun the
fi_av_insert() event queue.  Also make the AV EQ length an MCA param,
and report it in mca_btl_base_verbose >=5 output.
2016-01-30 08:33:48 -08:00
Jeff Squyres
d624e0d60f usnic: fix wraparound sequence number issue
Sequence numbers will wrap around; it is not sufficient to check for
(seq-1) -- must use the SEQ_DIFF macro to properly handle the
wraparound.

This bug wasn't serious; it just meant we might retransmit one or two
extra times when retransmits were triggerd and the sequence numbers
wrapped around their sliding windows.
2016-01-30 08:32:13 -08:00
Jeff Squyres
4de4a263f5 usnic: ensure all messages are sent on the data channel
Messages should go on the data channel, even if they're short.  Only
ACKs go on the priority channel.
2016-01-30 08:31:21 -08:00
Gilles Gouaillardet
d529951206 hwloc: correctly count cores with at least one allowed PU
when SMT is enabled, a core must be counted as long as one of its hwthread is allowed

Thanks Ben Menadue for the report.

This fixes a regression from open-mpi/ompi@6d149554a7
2016-01-29 11:54:34 +09:00
Edgar Gabriel
3f7fff5780 Merge pull request #1331 from edgargabriel/solaris-statfs-fix
Solaris statfs fix
2016-01-28 20:16:33 -06:00
Gilles Gouaillardet
f5a53b5f1e pmix: fix Makefile.am to correctly exclude autogenerated file from tarball
(back-ported from pmix/master@73daf58ee5)
2016-01-28 11:42:03 +09:00
Edgar Gabriel
722aab92e6 - extend opal_path_nfs to retrieve the file system type
- use opal_path_nfs in the fs_base function to avoid code duplication.
2016-01-26 13:36:21 -06:00
Gilles Gouaillardet
6d149554a7 hwloc: have opal_hwloc_base_get_pu search for HWLOC_OBJ_PU when mpirun is invoked with --use-hwthread-cpus
Fixes open-mpi/ompi#1247
2016-01-26 18:10:33 +09:00
Gilles Gouaillardet
15e26da1e1 pmix configury: add missing PMIX_CHECK_ICC_VARARGS function
Thanks Paul Hargrove for the report

(back-ported from pmix/master@7b16e914bf)
2016-01-26 10:57:16 +09:00
rhc54
b172b8599b Merge pull request #1285 from ggouaillardet/topic/pmix_dist_fix
pmix: do not include automatically generated include/private/autogen/…
2016-01-16 20:49:41 -08:00
Jeff Squyres
348ac507c2 usnic: explain why we still have OPAL_HAVE_HWLOC
Put in a comment explaining why btl_usnic_compat.h still defines
OPAL_HAVE_HWLOC, even though master/v2.x no longer does.
2016-01-16 04:11:05 -08:00
Jeff Squyres
0f5fcf9029 usnic: fix common symbol 2016-01-16 03:55:27 -08:00
Jeff Squyres
60ffe713b8 common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-01-16 03:53:14 -08:00
Tim Mattox
958de82471 hwloc_base_util.c: Remove newly unused variable 'i'. 2016-01-14 16:35:47 -05:00
Gilles Gouaillardet
1d38430e43 opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid 2016-01-14 10:39:03 +09:00