1
1
Граф коммитов

1779 Коммитов

Автор SHA1 Сообщение Дата
George Bosilca
46baab350c The event is automatically deleted by default. 2014-12-15 21:59:20 -05:00
George Bosilca
b01abfa0d7 Don't over-do it! 2014-12-15 21:33:32 -05:00
George Bosilca
f87a4b691b Solve another handshake problem, where one threads was calling del_event
while cleaning up after receiving a zero byte on the connect socket
(localyy started connection), while another was trying to accept a
new connection from the same peer. Create a zero-timed event and
delocalize the accept into a timer_event.
Add support for registering an error callback, that can be used when a
connection is discovered as failed during the initialization process.
2014-12-15 20:27:32 -05:00
George Bosilca
e20413c885 Rearrange the code to remove a compiler complaint about
the missing return from a non-void function.
2014-12-15 15:42:57 -05:00
Ralph Castain
573a574a3c Remove an unused dstore type that was redundant with another one. Define a corresponding PMIX_NODE_ID type (contains the vpid of the daemon hosting the proc) and ensure that the PMIx server includes that info in its process map 2014-12-15 12:11:13 -08:00
Ralph Castain
9658256a98 Restore the passing of the complete job map to the local proc on first get_attr so the info can be used by the MPI layer without continual calls back to the server. We'll find a more memory efficient method later. 2014-12-13 18:44:09 -08:00
George Bosilca
2edbe16c47 Add the necessary infrastructure to allow the dumping of all TCP
informations related to an endpoint (status and all pending fragments).
Do some minor space cleanup.
2014-12-13 01:59:55 -05:00
George Bosilca
5b8616d890 Fix the race condition in endpoint connection initialization. The race
was quite subtle, and only happened on the process with the smallest
guid (as this process will tear down the connection created locally and
replace it with the result of accept). If multiple threads are active in
the system, the deadlock occurs during the recv event deletion as one
thread will hold the recv event lock of the endpoint and try to access
the TCP event base lock, while the other thread will hold the TCP event
base lock while trying to access the recv event lock (in case data is
available on the socket).

The proposed solution let the event callback fail to process the data,
preventing the deadlock and allowing the other thread to always complete
it's job. As the event is not execute the same triggered will trigger
again at the next opportunity, so this solution introduce a minimal
delay in the connection establishement.
2014-12-13 01:45:00 -05:00
Ralph Castain
bffb2b7a4b Correct some issues with variables used before being set 2014-12-12 17:23:32 -08:00
Ralph Castain
0630680f36 Two cleanups required for transfer to 1.8.4:
* Use %d format for the topo signature as some systems apparently have problems with %u
* Use correct variable in show_help message
2014-12-12 17:23:32 -08:00
Howard Pritchard
6cf258638a mpool/udreg: minor comment improvement 2014-12-12 14:05:18 -07:00
Nathan Hjelm
38d66272c5 btl/vader: fix compile on SGI UV 2014-12-12 09:09:01 -07:00
Jeff Squyres
e4b3c6f1c4 libfabric psm: fix (void*) dereference
Committed upstream to libfabric as well.
2014-12-11 20:12:13 -08:00
Jeff Squyres
0f28233b35 libfabric: don't use __thread
There's no real reason that this routine should use thread local
storage.  Plus, __thread appears to be a GCC extension.
2014-12-11 14:10:48 -08:00
Jeff Squyres
4551cab6f1 help messages: fix obvious typos 2014-12-11 12:23:33 -08:00
rolfv
f471b09ae9 Add support for CUDA Unified memory. Basically, add a new flag and disable some
optimizations when that flag is detected.  Lightly reviewed by bosilca.
2014-12-10 05:46:00 -08:00
Jeff Squyres
e6c8bfc201 libfabric: Gah -- also remove the "pragma pop" line
Thanks to Nathan for pointing out that I missed snipping one line in
2f9c69f016 (I removed the trailing
comment, but not the trailing pragma -- oops!).
2014-12-09 14:03:39 -08:00
Jeff Squyres
2f9c69f016 libfabric: use correct C99 notation for var-length array
Nathan pointed out the correct C99 way to notate a variable-length
array in a struct.  This change has now been accepted upstream in
libfabric.
2014-12-09 13:33:15 -08:00
Jeff Squyres
cd0a54d76f usnic: short term fix to enable builds on non-libfabric platforms
This isn't quite the Right fix yet, because it doesn't address usnic
for external libfabric builds.  I'll fix that separately / later.
2014-12-09 09:19:26 -08:00
Nathan Hjelm
b2b7ecc7c4 Merge pull request #300 from hjelmn/topic/atomic_lifo_fifo
Add opal_fifo_t class and rename opal_atomic_lifo_t to opal_lifo_t
2014-12-09 10:54:50 -06:00
Jeff Squyres
c40fd09d2a libfabric: fix providers to conditionally add libs/flags
Only allow the usnic and PSM providers to add CPPFLAGS and LIBADD
flags when they are going to be built.
2014-12-09 07:15:25 -08:00
Jeff Squyres
45d6f29a27 Merge pull request #310 from yburette/master
libfabric: add optional PSM provider.
2014-12-09 06:39:34 -08:00
Jeff Squyres
6e24a1eb85 usnic: update for libfabric API change
Use FI_ADDR_UNSPEC for posting a receive from an unspecified source.
2014-12-09 06:06:52 -08:00
Jeff Squyres
f5a07f651c libfabric: Open MPI addition to stem a flood of warnings
Add a pragma to not warn about zero-length arrays.  This needs to be
addressed upstream, but for now, do it here.
2014-12-09 06:04:37 -08:00
Jeff Squyres
f331f48796 libfabric: update embedded libfabric to 934a714
Update the embedded copy of libfabric to the github ofiwg/libfabric
repo hash 934a714ca85f1a30a1e384a7d5f714ee962dc253.
2014-12-09 06:03:51 -08:00
Jeff Squyres
09d03a154b libfabric: fix some typos in the usnic configury 2014-12-09 05:52:24 -08:00
Ralph Castain
18d9fdfd8d Restore full topology comparison to support inventory monitoring 2014-12-09 01:33:06 -08:00
Ralph Castain
9b2f8cd840 Add the processor architecture to the topology signature 2014-12-09 01:17:00 -08:00
Howard Pritchard
3a14c8eeff fix build for cray xc
Recent addition of libfabric embdded broke build on Cray XC/XE.
This commit fixes this problem.
2014-12-08 22:21:13 -08:00
Yohann Burette
f90a7b51d2 libfabric: add optional PSM provider. 2014-12-08 16:49:41 -08:00
Ralph Castain
bb529ebd8e Revise the way we handle hetero nodes as users are finding this (a) a significant surprise, and (b) confusing as to when it is required. So try to automate it a bit by creating a topology "signature" that mpirun can share on the cmd line with the remote daemons, thus allowing them to check to see if they match. This isn't comprehensive of course - for now, it only checks the number of each type of hwloc object on the node. This is good enough to pickup major differences (e.g., where we have different numbers of sockets or assigned core bindings).
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.

Will obviously require further refinement as we find out which variances it can detect, and which it cannot.
2014-12-08 15:38:14 -08:00
Yohann Burette
f33a9afd22 libfabric: fix typo in Makefile.am 2014-12-08 13:19:43 -08:00
Jeff Squyres
ac8e9d103c libfabric: need to make AM_CONDITIONALs always be run
Ensure that the usnic-specific AM_CONDITIONAL for the embedded
libfabric is always run.
2014-12-08 11:51:26 -08:00
Jeff Squyres
d64881f040 psm_am.h: add missing file from libfabric snapshot
This is just about to be fixed upstream, but "make dist" was not
including this file in the libfabric tarball.
2014-12-08 11:39:08 -08:00
Jeff Squyres
d02756cdbb libfabric: various configury updates
1. Ensure to override CFLAGS properly.  Move the setting of CFLAGS outside the AM_CONDITIONAL so that Automake doesn't get confused (because CFLAGS is already set inside an AM_CONDITIONAL -- moving it outside the conditional ensure that this local CFLAGS override trumps all other CFLAGS overrides).
2. Only build libfabric on Linux. Add a little more configury to ensure that we only try to build libfabric on Linux.
3. Remove a dead/unused file
4. Fix typo in condition check
5. Use "false", not "/bin/false"
2014-12-08 11:39:07 -08:00
Jeff Squyres
92818d1fa5 usnic: remove SVN-style $Id$ tokens (and #idents)
This commit is also upstream in libfabric.
2014-12-08 11:39:07 -08:00
Jeff Squyres
9547345b18 usnic: fix show_help message
Rename a few symbols to use libfabric-friendly names.  Fix a show_help
message when fi_av_insert times out.
2014-12-08 11:39:07 -08:00
Jeff Squyres
8e49cc754f usnic: update to latest libfabric API changes 2014-12-08 11:37:37 -08:00
Jeff Squyres
c4e8d67515 libfabric: sync to upstream libfabric github
Bring down the latest from the libfabric github, as of
9d051567c8eb7adc2af89516f94c7d0539152948.
2014-12-08 11:37:37 -08:00
Jeff Squyres
7a96b58882 common verbs: remove usnic-specific code
Now that the usnic BTL uses libfabric, we can remove the
usnic-specific code from opal/mca/common/verbs.
2014-12-08 11:37:37 -08:00
Jeff Squyres
984982790a usnic: convert from verbs to libfabric (yay!)
This commit represents the conversion of the usnic BTL from verbs to
libfabric.

For the moment, libfabric is embedded in Open MPI (currently in the
usnic BTL).  This is because the libfabric API is still changing, and
also has not yet been released.  Ultimately, this embedded copy of
libfabric will likely disappear and the usnic BTL will rely on an
external installation of libfabric.

New configure options:

* --with-libfabric: will cause configure to fail if libfabric support
    cannot be built
* --without-libfabric: will prevent libfabric support from being built
* --with-libfabric=DIR: use an external libfabric installation
* --with-libfabric-libdir=LIBDIR: when paired with --with-libfabric=DIR,
    use LIBDIR for the libfabric installation library dir

The --with-libnl3[-libdir] arguments are now gone.
2014-12-08 11:37:37 -08:00
Nathan Hjelm
20c6eb5237 Rename opal_atomic_lifo_t to opal_lifo_t and improve interface
- Rename opal_atomic_lifo_t to opal_lifo_t to reflect both atomic and
   non-atomic usage. Added new routines (opal_lifo_*_st) for non-atomic
   usage as well as routines conditioned off opal_using_threads(). The
   atomic versions are always thread safe and the non-atomic are always
   not thread safe.

 - Add a new atomic lifo implementation that makes use of 128-bit
   compare-and-swap. The new implementation should scale better with
   larger numbers of threads.

 - Add threading unit test for opal_lifo_t.
2014-12-04 15:30:02 -07:00
George Bosilca
04a4cbd77a Fix the clock_gettime monotonic timer. Thanks to Gilles for the
first sketch of the patch.
2014-12-04 00:20:56 -05:00
Jeff Squyres
983bd49f11 opal_timer_require_monotinic: change to bool / level 5 2014-12-03 17:09:43 -08:00
Jeff Squyres
8880b070b8 Merge pull request #295 from jsquyres/topic/bosilca-accurate-timers
Topic/bosilca accurate timers
2014-12-03 19:46:14 -05:00
Howard Pritchard
c67afadcfc Merge pull request #289 from hppritcha/topic/remove_pmi
Topic/remove pmi
2014-12-03 16:58:35 -07:00
Nathan Hjelm
f989fe27b8 btl/vader: workaround to make jenkins happy 2014-12-03 15:51:58 -07:00
Todd Kordenbrock
c0c680bccb Portals4 BTL: Do not disqualify if a peer does not put Portals4 BTL modex info
If OPAL_MODEX_RECV() returns OPAL_ERR_NOT_FOUND, the peer didn't
send any Portals4 BTL info.  This is not a fatal error.  Instead of
disqualifying the Portals4 BTL just ignore that peer.

@jsquyres reported this in #194.
2014-12-03 14:22:10 -06:00
Howard Pritchard
c75dccede1 pmix/cray: remove finalize call from comp close
The finalize call in component close method is
no longer being matched by an equivalent init call,
so remove this call in the close method.
2014-12-03 09:44:18 -07:00
Ralph Castain
d9b23c1054 Increment the init_count in the Slurm pmix components so they correctly respond to calls to pmix.initialized 2014-12-02 20:20:29 -08:00
Ralph Castain
cb15cc06e1 Minor changes per Jeff's request on PR for 1.8.4 2014-12-02 19:54:10 -08:00
George Bosilca
a35d2b9fb5 Update copyrights and mark ia32 timers as non-monotonic. 2014-12-01 14:03:54 -08:00
George Bosilca
5277fd5aa2 Various cleanups. 2014-12-01 14:03:47 -08:00
George Bosilca
00300f464d Add support for clock_gettime on Linux. Allow the user to
request a monotonic timer via MCA parameters.
2014-12-01 14:03:40 -08:00
Ralph Castain
960ef34988 Ensure the LSF ras adds the hosts to the allocation. Correctly handle the semi-colon vs comma situation in hwloc slot_lists 2014-11-30 14:37:37 -08:00
Ralph Castain
3f9d9ae8b6 Provide tighter LSF integration by correctly handling scenarios where the user has asked LSF to assign bindings. Fix a couple of typos in lex parser definitions. Tell hostfile parser to ignore binding designations in hostfiles. Add an attribute to indicate that cpusets were provided as physical cpu ids.
Once validated, a version of this will be backported to the v1.8.4 release.
2014-11-30 11:50:31 -08:00
George Bosilca
dee243c58d ompi_proc_finalize has an interesting side effect. A proc is
inserted in the ompi_proc_list as soon as it is created and it
is removed only upon the call to the destructor. In ompi_proc_finalize
we loop over all procs in ompi_proc_finalize and release them once.
However, as a proc is not removed from this list right away, we
decrease the ref count for each proc until it reach zero and the
proc is finally removed. Thus, we cannot clean the BML/BTL after
the call the ompi_proc_finalize.
A quick fix is to delay the call to ompi_proc_finalize until all
other frameworks have been finalized, and then the behavior
depicted above will give the expected outcome.
2014-11-28 18:26:36 -05:00
bosilca
8cae899a42 Merge pull request #285 from bosilca/master
Reenable high accuracy timers
2014-11-25 17:09:34 -05:00
Gilles Gouaillardet
578fe41788 fix hangs introduced by previous commit a6744b8177 2014-11-25 17:50:44 +09:00
Gilles Gouaillardet
a6744b8177 fix misc memory leaks specific to the master 2014-11-25 13:52:10 +09:00
George Bosilca
261684858f Improved support for OSX timers. 2014-11-24 17:15:49 -05:00
George Bosilca
1877dfd0df On Darwin make sure the field we expect to be 0 is indeed 0. 2014-11-24 14:16:36 -05:00
George Bosilca
766cfece36 Remove useless header. 2014-11-24 00:57:54 -05:00
George Bosilca
5f49a11b29 Minor cleanups. 2014-11-24 00:44:50 -05:00
George Bosilca
e27759956f Allow the use of the optimized used timers 2014-11-23 23:51:13 -05:00
George Bosilca
324e43909d Enable CUDA support on Mac OS X. 2014-11-20 13:51:10 -06:00
Gilles Gouaillardet
758f7ab768 Revert "btl/vader: use FRAG_ALLOC_USER when single_copy_mechanism is VADER_NONE"
as discussed with @hjelmn in open-mpi/ompi-release#86

This reverts commit d2d7f39a4b.
2014-11-20 16:04:55 +09:00
Nathan Hjelm
1b564f62bd Revert "Merge pull request #275 from hjelmn/btlmod"
This reverts commit ccaecf0fd6, reversing
changes made to 6a19bf85dd.
2014-11-19 23:22:43 -07:00
Nathan Hjelm
b1f9569b7d Revert "btl/openib: fix warnings"
This reverts commit 6e6c786b49.
2014-11-19 23:16:16 -07:00
Nathan Hjelm
6e6c786b49 btl/openib: fix warnings 2014-11-19 15:57:01 -07:00
Nathan Hjelm
ccaecf0fd6 Merge pull request #275 from hjelmn/btlmod
Updated the btl interface. Please update your components.
2014-11-19 15:01:40 -07:00
Ralph Castain
6a19bf85dd Add a little debug 2014-11-19 13:45:00 -08:00
Ralph Castain
27be73c6fb Allow callers to dstore.open to not request a specific list of desired components, but just take the highest priority one that is available 2014-11-19 13:31:31 -08:00
Nathan Hjelm
2b579610f2 btl/openib: fix compilation issues with XRC 2014-11-19 11:44:48 -07:00
Nathan Hjelm
2a382c2ec1 add btl comment 2014-11-19 11:33:04 -07:00
Nathan Hjelm
bf7daac388 btl/openib: add atomic operation support 2014-11-19 11:33:04 -07:00
Nathan Hjelm
45d1fac8af ugni thread safety fixes 2014-11-19 11:33:03 -07:00
Nathan Hjelm
5e7c77c576 btl/ugni: add support for atomic operations 2014-11-19 11:33:03 -07:00
Nathan Hjelm
90554d0f95 btl/openib: misc cleanup (tabs, etc) and put credit code into a common place (was duplicated in the send and sendi paths) 2014-11-19 11:33:03 -07:00
Nathan Hjelm
4122067236 btl/openib: fix message coalescing 2014-11-19 11:33:03 -07:00
Nathan Hjelm
38e9611930 btl/openib: fix recieve queue source detection 2014-11-19 11:33:03 -07:00
Nathan Hjelm
7c43b566d2 more openib updates 2014-11-19 11:33:03 -07:00
Nathan Hjelm
3ea10476a4 btl/sm fix compilation when not using CMA or KNEM 2014-11-19 11:33:03 -07:00
Nathan Hjelm
4ccb20b097 btl: fix warning about enum type and modify btl_sendi to allow the
value NULL for the descriptor

The send inline optimization uses the btl_sendi function to achieve
lower latency and higher message rates. The problem is the btl_sendi
function was allowed to return a descriptor to the caller. This is fine
for some paths but not ok for the send inline optimization. To fix
this the btl now must be able to handle descriptor = NULL.
2014-11-19 11:33:03 -07:00
Nathan Hjelm
2a70238f4d First crack at adding atomic operation support 2014-11-19 11:33:03 -07:00
Nathan Hjelm
249e5e009f Fix knem support in both sm and vader 2014-11-19 11:33:02 -07:00
Nathan Hjelm
e03956e099 Update the scif and openib btls for the new btl interface
Other changes:
 - Remove the registration argument from prepare_src since it no
   longer is meant for RDMA buffers.

 - Additional cleanup and bugfixes.
2014-11-19 11:33:02 -07:00
Nathan Hjelm
ec33374339 btl: remove des_remote/des_remote_count from the mca_btl_base_descriptor_t
structure

This structure member was originally used to specify the remote segment
for an RDMA operation. Since the new btl interface no longer uses
desriptors for RDMA this member no longer has a purpose. In addition
to removing these members the local segment information has been
renamed to des_segments/des_segment_count.
2014-11-19 11:33:02 -07:00
Nathan Hjelm
2d381f800f Update the interface to provide a cleaner interface for RDMA operations.
The old BTL interface provided support for RDMA through the use of
the btl_prepare_src and btl_prepare_dst functions. These functions were
expected to prepare as much of the user buffer as possible for the RDMA
operation and return a descriptor. The descriptor contained segment
information on the prepared region. The btl user could then pass the
RDMA segment information to a remote peer. Once the peer received that
information it then packed it into a similar descriptor on the other
side that could then be passed into a single btl_put or btl_get
operation.

Changes:

 - Removed the btl_prepare_dst function. This reflects the fact that
   RDMA operations no longer depend on "prepared" descriptors.

 - Removed the btl_seg_size member. There is no need to btl's to
   subclass the mca_btl_base_segment_t class anymore.

...

Add more
2014-11-19 11:33:02 -07:00
Howard Pritchard
a632b632ca better way to tell if a process is in a Cray PAGG
Use a more reliable way to tell if a process is
1) in a Cray PAGG
2) is actually considered an application process on
   a compute node (not for example, a process in a PAGG
   on a mom node).
2014-11-12 12:56:15 -07:00
Howard Pritchard
72bb4a2eee make cray pmi compile again
Commit @80f07b65 resulted in changes that
caused cray pmi component to no longer compile.
This commit fixes that issue.
2014-11-12 12:33:30 -07:00
Nathan Hjelm
cfbb9cba16 btl/vader: don't assume the address in the put/get segment is unmodified when
using knem

It is valid to modify the remote segment that will be used with the
btl put/get operations as long as the resulting address range falls in
the originally prepared segment. Vader should have been calculating the
offset of the remote address in the registered region. This commit
fixes this issue.
2014-11-12 10:12:52 -07:00
rhc54
87fa1061d4 Merge pull request #267 from artpol84/s2_fix
Fix SLURM PMI2 component. set s2_nrank to the relative position of a pro...

Good catch!
2014-11-12 08:43:03 -08:00
Jeff Squyres
f39b294afe mca base: fix trivial typos in help message 2014-11-12 08:40:17 -08:00
Artem Polyakov
fce08a3db3 Fix SLURM PMI2 component. set s2_nrank to the relative position of a process inside the node
(not relative position of a node inside the allocation).
2014-11-12 16:26:35 +06:00
Gilles Gouaillardet
b088175705 btl/vader: fix a typo in mca_btl_vader_put_knem 2014-11-12 19:00:00 +09:00
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
Gilles Gouaillardet
40942c62ce dstore: remove unused variables 2014-11-11 18:14:59 +09:00
Gilles Gouaillardet
80f07b65f1 pmix: correctly split pmi messages
Thanks to @elenash for all the reviews
2014-11-11 17:16:00 +09:00
Ralph Castain
d0704ef118 Restore handling of physical processors in rankfiles. Note that the prior implementation was likely incorrect as it falsely assumed that physical core indices were unique, which isn't always true. Stipulate that physical rankfiles can only include PU numbers, and bind the result to the core that contains that physical PU. Update the mpirun man page to cover the new use-case. 2014-11-10 14:00:40 -08:00
Ralph Castain
2a90788724 Support physical processor ids in rankfile 2014-11-10 14:00:40 -08:00
Howard Pritchard
6c8c9cb4a3 another fix for --enable-dlopen for ugni btl
missed a change to create libmca_common_ugni.la
file correctly.
2014-11-10 13:40:59 -07:00
Gilles Gouaillardet
d2d7f39a4b btl/vader: use FRAG_ALLOC_USER when single_copy_mechanism is VADER_NONE 2014-11-10 17:02:45 +09:00
Howard Pritchard
5c08aa8552 enable ugni btl to work without disable-dlopen
There were mistakes in the Makefiles for the ugni btl and
mca/common/ugni that prevented the ugni btl from being
used unless one happened to set the --disable-dlopen option
on the config line.

This commit fixes this problem.
2014-11-09 15:19:47 -07:00
rolfv
022612c83b Missed a removal from previous commit 2014-11-07 11:08:41 -08:00
rolfv
cbb43d5ac3 Make sure initialization happens 2014-11-07 11:00:45 -08:00
Ralph Castain
b56b744041 Silence some warnings and remove debug output 2014-11-07 07:54:01 -08:00
Elena
eb7872488c fix incorrect mca param registration in latest commit 2014-11-07 07:31:42 +02:00
elenash
2687637071 Merge pull request #263 from elenash/master
dstore sm component implementing shared memory database for pmix client/server communication
2014-11-07 07:56:55 +03:00
Howard Pritchard
b389895c66 fix make dist for pmix/cray
Include file was left out of "sources" list that prevented
building for cray from dist tarball.
2014-11-06 15:10:51 -07:00
Howard Pritchard
59f8d0a92d cleanup ugni compiler warnings 2014-11-06 12:25:10 -07:00
George Bosilca
8da5dcc22e Don't release the provided opal_proc in the error path. 2014-11-06 08:42:23 -08:00
Gilles Gouaillardet
e269a52ac7 btl/openib: send openib modex with the PMIX_GLOBAL flag 2014-11-06 08:42:23 -08:00
Elena
03fc809bc9 This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory. 2014-11-06 16:01:19 +02:00
George Bosilca
5443f69efb Merge branch 'master' of github.com:open-mpi/ompi 2014-11-06 01:40:13 -05:00
George Bosilca
54ddb0aece Don't release the provided opal_proc in the error path. 2014-11-06 01:39:25 -05:00
Gilles Gouaillardet
d542c9ff2d btl/openib: send openib modex with the PMIX_GLOBAL flag 2014-11-06 15:00:08 +09:00
Steve Wise
7316a88754 openib btl: add Soft iWARP device to the ini file
This enables IBM's software iWARP provider.  With this driver you can
run iWARP/RDMA over any ethernet NIC.  Useful for testing OMPI RDMA
logic without requiring an expensive RDMA adapter/infrastructure.

The Soft iWARP code is at: https://www.gitorious.org/softiwarp
2014-11-04 14:48:43 -08:00
Gilles Gouaillardet
76ee98c86a btl/scif: start the listening thread once only 2014-10-31 16:34:02 +09:00
Gilles Gouaillardet
ca0b969991 pmix: fix a return status in native_get_attr 2014-10-30 15:26:23 +09:00
Gilles Gouaillardet
8c556bbc66 pmix: fix alignment issue 2014-10-29 13:19:23 +09:00
Ralph Castain
4f0c1ae8d9 Continue cleanup of the PMI config code. Eliminate the multiple calls to check for pmi1 and pmi2 - we must check it only once to get the pmix components to build only in the correct situations. Ensure we set the wrapper flags so we handle static builds correctly. 2014-10-27 20:37:33 -07:00
Gilles Gouaillardet
b4e445afb5 btl/sm: fix a typo in the error message 2014-10-28 11:25:42 +09:00
Jeff Squyres
9334abc474 Makefile: fix problems with static linking
Avoid a problem with double-derefence of a variable macro name (i.e.,
a macro with part of its name from an AC_SUBST, such as
```$(foo@BAR@baz)```.

In what might be a bug in Automake 1.14.1, if you do a pattern like
this:

```makefile
lib_LTLIBRARIES = lib@A_PREFIX@a_lib.la
noinst_LTLIBRARIES = lib@A_PREFIX@a_noinst.la

lib@A_PREFIX@a_lib_la_SOURCES = a.c

lib@A_PREFIX@a_noinst_la_SOURCES = $(lib@A_PREFIX@a_lib_la_SOURCES)
```

Then in the resulting Makefile, the value of
```$(lib@A_PREFIX@a_lib_la_OBJECTS)``` will be *blank* (when it really
should be ```a.o```).

To workaround this potential bug, I've simply avoided doing
double-derefences like this, and effectively set the second
```_SOURCES``` line equal to ```a.c``` (just like the first
```_SOURCES``` line).

Fixes #250.
2014-10-24 16:27:54 -07:00
rolfv
9134f48d4c Do not use sendi path with GPU buffer 2014-10-24 13:35:01 -07:00
Nathan Hjelm
56a8687c2a shmem/mmap: do not use O_CREAT in shared memory attach 2014-10-24 11:02:04 -06:00
Jeff Squyres
5207429734 help-opal-shmem-mmap.txt: trivial typo fix 2014-10-24 03:23:53 -07:00
Nathan Hjelm
d72fc7a05f btl/vader: more updates to the help messages 2014-10-23 08:48:54 -06:00
Gilles Gouaillardet
55a5c99ff0 btl/vader: fix typos in the help file 2014-10-23 19:28:09 +09:00
Gilles Gouaillardet
248acbbc3b pmix/slurm: correctly set locality of the local ranks as "not found" 2014-10-23 17:02:07 +09:00
Ralph Castain
894acb0aa8 configury: new OPAL_SET_MCA_PREFIX/ORTE_SET_MCA_CMD_LINE_ID macros
These two macros set the MCA prefix and MCA cmd line id,
   respectively.  Specifically, MCA parameters will be named
   PREFIX<foo> in the environment, and the cmd line will use
   -ID foo bar.

   These macros must be called during configure.ac and a value
   supplied. In the case of Open MPI, the values given are
   PREFIX=OMPI_MCA_ and ID=mca.

   Other projects (such as ORCM) will call these macros with
   their own unique values.  For example, ORCM uses PREFIX=ORCM_MCA_
   and ID=omca

   This scheme is necessary to allow running Open MPI applications under
   systems that use their own versions of ORTE and OPAL.  For example,
   when running OMPI applications under ORCM, we need the MCA params passed
   to the ORCM daemons to be separated from those recognized by the OMPI application.
2014-10-22 18:57:40 -07:00
Ralph Castain
5059077510 Sigh - revert changes to file that shouldn't have been included in the prior commit 2014-10-22 14:06:14 -07:00
Ralph Castain
2ec59acac4 Silence a slew of warnings when --enable-memchecker is given. Reviewed by Jeff 2014-10-22 13:59:08 -07:00
Nathan Hjelm
e1bc2de853 btl/vader: defensive programming: use an actual function for the dummy btl_get and btl_put 2014-10-22 14:57:55 -06:00
Nathan Hjelm
19fbe868b8 btl/sm: defensive programming: use an actual function for the dummy btl_get 2014-10-22 14:57:55 -06:00
Aurélien Bouteiller
f232e94c02 Merge branch 'master' of github.com:open-mpi/ompi 2014-10-22 16:56:06 -04:00
Aurélien Bouteiller
55e49470de Patch from Nathan outlined with a crash the mishandling of the case where CMA is requested but not available. 2014-10-22 16:55:18 -04:00
Nathan Hjelm
998e69a6fa btl/sm: add some protection for the use_knem = -1 case
Need to unset the dummy btl_get and remove the MCA_BTL_FLAGS_GET flag
if neither knem nor cma can be used.
2014-10-22 13:57:01 -06:00
Nathan Hjelm
d7c7bb3993 btl/sm: re-enable the use of CMA and knem
At some point we added a sanity check to the btl base to ensure that
the btl flags match the available functions (this prevents user's from
specifying get or put when no function exists). This check was
disabling get for the sm btl since at the time of the check there is
no btl_get function. The simplest fix is to set a dummy value to btl_get
that will be overwritten with the proper value on btl initialization.

Closes #239.
2014-10-22 13:30:27 -06:00
Jeff Squyres
ec4268b59c usnic: do not send zero-length modex message
If there are no usnic BTL modules, then just avoid sending any modex
message at all (other BTLs do this; it's safe to do).

The change is smaller than it looks: I added a "if 0 ==..." check at
the top to return immediately if there are no BTL modules.  Then I
removed some now-unnecessary conditionals and un-indented as
appropriate.

Fixes #248
2014-10-22 11:11:58 -07:00
Jeff Squyres
e415c8f9a8 vader: Remove stale comment 2014-10-22 10:32:33 -07:00
Jeff Squyres
c22e1ae33b configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros
These two macros set the prefix for the OPAL and ORTE libraries,
respectively.  Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.

These macros must be called, even if the prefix argument is empty.

The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix.  For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.

This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL.  For example,
when running MPI applications under ORTE, if the ORTE and OPAL
libraries between OMPI and ORCM are not identical (which, because they
are released at different times, are likely to be different), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
2014-10-22 10:32:19 -07:00
Gilles Gouaillardet
75e8387a4e vader: vader_add_procs report the error if init_vader_endpoint fails 2014-10-22 19:11:54 +09:00
Gilles Gouaillardet
7508c6f3ad pmix: correctly handle NULL OPAL_BYTE_OBJECT object 2014-10-22 17:15:21 +09:00
Nathan Hjelm
1a3734ae57 btl/vader: fix compilation on OS X 2014-10-21 09:27:36 -06:00
Gilles Gouaillardet
f56169cee6 btl/vader: silence warning
correctly check HAVE_SYS_PRCTL_H
2014-10-21 19:51:29 +09:00
Gilles Gouaillardet
d60f0cbd88 btl/vader: report an error when a segment cannot be attached 2014-10-21 10:42:22 +09:00
Nathan Hjelm
13643f5b6e btl/vader: improved single-copy support
This commit makes the folowing changes:

 - Add support for the knem single-copy mechanism. Initially vader will only
   support the synchronous copy mode. Asynchronous copy support may be added
   int the future.

 - Improve Linux cross memory attach (CMA) when using restrictive ptrace
   settings. This will allow Open MPI to use CMA without modifying the system
   settings to support ptrace attach (see /etc/sysctl.d/10-ptrace.conf).

 - Allow runtime selection of the single copy mechanism. The default behavior
   is to use the best available. The priority list of single-copy mehanisms is
   as follows: xpmem, cma, and knem.

 - Allow disabling support for kernel-assisted single copy.

 - Some tuning and bug fixes.
2014-10-20 11:44:52 -06:00
Nadezhda Kogteva
2bce929330 MTL MXM cleanup: unnecessary OMPI_MTL_MXM_CONNECT_ON_FIRST_COMM variable removed 2014-10-20 10:29:47 +03:00
Aurélien Bouteiller
e3be1fb9a5 Quick pass over the sm-knem code, indent fixes 2014-10-17 10:38:35 -04:00
Jeff Squyres
43aff4d8b3 btl sm: error if knem support is requested and cannot be activated
Restore the functionality to error out (and show a helpful message) if
knem support is requested by is either not compiled in or cannot be
activated.

Thanks to Gus Correa for bringing the matter to our attention.
2014-10-16 20:01:26 -07:00
Jeff Squyres
b04a2634c6 btl sm: restore btl_sm_have_knem_support MCA param
Somehow, this MCA param was accidentally dropped after v1.6.5.  Thanks
to Gus Correa for bringing this matter to our attention.

Also moving some MCA params down from level 9 to levels 4/5.
2014-10-16 19:48:21 -07:00
Ralph Castain
b6aa691e0a Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons. 2014-10-16 12:58:56 -07:00
Gilles Gouaillardet
27dcca0bb2 pmi/s1: fix large keys
do not overwrite the PMI key when pushing a message that does
not fit within 255 bytes
2014-10-16 13:29:32 +09:00
George Bosilca
7541c03b4c Mark all instances where atomic operations are used but their return value is unnecessary 2014-10-15 21:47:32 -04:00
Jeff Squyres
dc66e197cc var: fix segv in deprecated file var show_help()
Ensure to include the new variable filename in the show_help() output
when we load a deprecated MCA param from a file.

Fixes #236
2014-10-15 08:07:31 -07:00
Jeff Squyres
51027a6635 usnic: fix minor typo
Change harmless-but-weird comma to semicolon.  Found during code
review.
2014-10-15 05:32:36 -07:00
Gilles Gouaillardet
5c81658d58 pmix: fix big endian arch
use the appropriate 64 bits type otherwise data gets incorrectly
truncated on big endian arch
2014-10-15 17:17:09 +09:00
Ralph Castain
4d27eb70f2 Extend the dstore framework to include a new "update_handle" API so the attributes of an existing handle can be changed. We can't just open a new handle as the upper layers won't know where to find the info. :-( 2014-10-10 12:40:32 -07:00
Ralph Castain
1ae34da5e5 Add an attributes parameter to the dstore.open function so we can pass directives to the active storage component. This can, for example, include the backing file info for a new shared memory segment. 2014-10-10 12:13:25 -07:00
Nathan Hjelm
a31cf3b740 btl/vader: missing include 2014-10-09 13:57:21 -06:00
Nathan Hjelm
9e0c07e4ce btl/ugni: improve the handling of eager get fragments when the btl runs out
of preregistered buffers

Before this change eager gets we retried on each progress loop. This commit
modifies the protocol to only retry eager gets when another eager get has
completed. This commit also cleans up some callback code that is no longer
needed.
2014-10-09 13:57:21 -06:00
Howard Pritchard
ebc368d26b remove GNI_RDMAMODE_FENCE bit in GNI_PostRdma
The GNI_RDMAMODE_FENCE bit was a left over from
async progress work that is not needed at this point
in the gni BTL.  Removing the bit also allows
for the removal of the GNI_CDM_MODE_BTE_SINGLE_CHANNEL
bit from the GNI_CdmCreate call.
2014-10-09 12:41:19 -06:00
Elena
c905fe9b78 pmix: removed pmix_base_direct modex mca parameter, renamed orte_full_modex_cutoff and ompi_hostname_cutoff to direct_modex_cutoff 2014-10-09 06:15:31 +02:00
Howard Pritchard
9947758d98 initial thread safety for ugni btl
This commit adds initial ugni thread safety support.
With this commit, sun thread tests (excepting MPI-2 RMA)
pass with various process counts and threads/process.
Also osu_latency_mt passes.
2014-10-08 10:13:22 -06:00
Jeff Squyres
a422d893b8 memchecker: per RFC, use calloc for OBJ_NEW
With --enable-memchecker builds, use calloc(3) for OBJ_NEW instead of
malloc(3).  This cuts down on a lot of valgrind/memory checker false
positive output.

Also make a minor change in the valgrind configure.m4; have it assign
0xf to a char.  The prior assignment (of 0xff) was warning about an
overflow.  This didn't really matter, but we might as well make the
test not have a gratuitious warning in it.
2014-10-07 09:55:54 -07:00
Ralph Castain
fd6a044b7f Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
2014-10-03 16:02:57 -06:00
Howard Pritchard
5428301c81 Remove catamount timer support
With the 1.9 release, support for catamount is being
dropped. Hence,  removing catamount timer support.
2014-10-03 14:53:09 -06:00
rolfv
697b18db63 Making async copy the default 2014-10-03 06:42:18 -07:00
Gilles Gouaillardet
5c5453b8b1 pmix: fix test in native_get_attr 2014-10-03 11:54:08 +09:00
Jeff Squyres
8468424f45 distscript: remove configure.params and autogen.subdirs kruft
Remove configure.params support: configure.params hasn't been used in
years.

Also remove autogen.subdirs support; those should really be handled by
their respective Makefile.am's.
2014-10-02 11:32:54 -07:00
Ralph Castain
9e35f80ab6 Don't multiply define WANT_PMI_SUPPORT and friends. Turns out they weren't being used anywhere anyway, so no point in defining them at all
This commit was SVN r32822.
2014-09-30 20:43:25 +00:00
Howard Pritchard
8da51fab81 cray pmi equivalent to commit 5eb65b24
This commit was SVN r32820.
2014-09-30 19:25:00 +00:00
Ralph Castain
8d0b4f222a The pmix.get functions should not be returning "success" if the requested info isn't found. Fix the macros and the component functions so they correctly return "not found" in that situation, and set the data regions and size to NULL and 0, respectively.
This commit was SVN r32818.
2014-09-30 18:03:12 +00:00
Howard Pritchard
1df933ea27 remove ompi/runtime/params.h include in ugni btl
This commit was SVN r32813.
2014-09-29 19:26:33 +00:00
Howard Pritchard
201d4ec3ad fix setting of PMIX_NODE_RANK in cray pmix comp.
Per discussions with pmix folks, it was determined that
the way the cray pmi pmix component was computing the
PMIX_NODE_RANK attribute for a process was incorrect.
This commit fixes the problem.

This commit was SVN r32810.
2014-09-29 16:55:31 +00:00
Rolf vandeVaart
399dc3db43 Code to check for managed memory. Configure support also.
This commit was SVN r32801.
2014-09-26 16:24:45 +00:00
Rolf vandeVaart
35858f837a Revert r32713. Have different code for this.
This commit was SVN r32800.

The following SVN revision numbers were found above:
  r32713 --> open-mpi/ompi@9a2bab0e27
2014-09-26 14:56:18 +00:00
Nathan Hjelm
e0eb1f2e73 btl/vader: make vader registration lookup/caching thread safe
This commit was SVN r32798.
2014-09-25 22:24:06 +00:00
George Bosilca
53e012ae97 Fix typo.
This commit was SVN r32795.
2014-09-25 17:18:27 +00:00
Nathan Hjelm
aba87f3776 btl/vader:silence warning
This commit was SVN r32788.
2014-09-24 22:10:23 +00:00
Nathan Hjelm
79881ca892 btl/vader: prevent double-destruction of endpoints and move endpoint teardown code into destructor
This commit was SVN r32779.
2014-09-23 21:51:15 +00:00
Nathan Hjelm
2d8fba0861 btl/vader: silence warning
This commit was SVN r32778.
2014-09-23 21:33:45 +00:00
Nathan Hjelm
8bd3160432 btl/vader: fix several typos in vader update
This commit was SVN r32775.
2014-09-23 20:25:36 +00:00
Nathan Hjelm
12bfd13150 btl/vader: improve performance for both single and multiple threads
This is a large update that does the following:

 - Only allocate fast boxes for a peer if a send count threshold
   has been reached (default: 16). This will greatly reduce the memory
   usage with large numbers of local peers.

 - Improve performance by limiting the number of fast boxes that can
   be allocated per peer (default: 32). This will reduce the amount
   of time spent polling for fast box messages.

 - Provide new MCA variables to configure the size, maximum count,
   and send count thresholds for fast boxes allocations.

 - Updated buffer design to increase the range of message sizes that
   can be sent with a fast box.

 - Add thread protection around fast box allocation (locks). When
   spin locks are available this should be updated to use spin locks.

 - Various fixes and cleanup.

This commit was SVN r32774.
2014-09-23 18:11:22 +00:00
Howard Pritchard
1508a01325 Fixes to enable mpirun to work again on Cray
The ess pmi module was not handling aprun launched
daemons.  All daemons were thinking they were vpid 1.

Also, turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.

Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.

Have the ESS pmi component open the pmix framework and
select a pmix component.

This commit was SVN r32773.
2014-09-23 15:37:26 +00:00
Vasily Filipov
e26af91a64 BTL/OPENIB: set "max_lmc" param to be "1" and not "all available values" by default.
cmr=v1.8.3:reviewer=miked 

This commit was SVN r32736.
2014-09-15 13:56:41 +00:00
Alex Mikheev
31d0724a08 OMPI: btl openib: fix detection of max registarable memory
Deal with the case when mlx4 module is loaded but device
is not present

cmr=v1.8.3:reviewer=miked

This commit was SVN r32734.
2014-09-15 12:17:23 +00:00
Ralph Castain
fad4384463 Not sure how we could get to this point without having already detected the error, but just to be safe - check for end-of-array and return if error.
Refs trac:4897

This commit was SVN r32731.

The following Trac tickets were found above:
  Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-13 02:23:30 +00:00
Ralph Castain
7269dae2da Per patch from Samuel Thibault, silence warning from Clang
This commit was SVN r32720.
2014-09-12 22:22:11 +00:00
Ralph Castain
0445052a1c Check for multiple declarations of a given MCA param and error out if detected as that can create an ambiguous definition of the param value.
Refs trac:4897

This commit was SVN r32719.

The following Trac tickets were found above:
  Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-12 22:21:30 +00:00
Jeff Squyres
d244b7b860 mca_base_var: fix possibilty of unaligned variable assignments
Add a debugging check that ensures that the registered storage is
aligned appropriately for the type that is specified.

When we know that the storage is properly aligned, we can cast the
mbv_storage to the appropriate type and then simply do the assignment.
We used to do this assignment via a union, but clang's
-fsanitizer=alignment complained about this.

This commit was SVN r32716.
2014-09-11 23:02:49 +00:00
Ralph Castain
1f2c5863f0 Revert r32675 in favor of a different solution proposed by Brice
This commit was SVN r32715.

The following SVN revision numbers were found above:
  r32675 --> open-mpi/ompi@916f98a3ee
2014-09-11 21:58:48 +00:00
Howard Pritchard
e43715574a remove ignored restrct return type qualifier
The use of restrict in the return type qualifier for mca_btl_vader_reserve_fbox
is being ignored by gnu compiler.  for newer gcc, one sees this warning only
with -Wignored-qualifiers set, but for older variants of gcc it was reported
that numerous warning messages about this ignored qualifier were being
generated as vader is being compiled.

The warning reported by gcc is

btl_vader_fbox.h:53:47: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
 static inline mca_btl_vader_fbox_t * restrict mca_btl_vader_reserve_fbox (struct mca_btl_base_endpoint_t *ep, const size_t size)

This commit was SVN r32714.
2014-09-11 21:12:41 +00:00
Rolf vandeVaart
9a2bab0e27 Add support for detecting CUDA managed memory. Disabled for now.
This commit was SVN r32713.
2014-09-11 21:07:17 +00:00
Howard Pritchard
820b34e5d2 Fix bad cut/paste for commit c19e7369
This commit was SVN r32712.
2014-09-11 21:00:04 +00:00
Howard Pritchard
d07c5674a3 Fix potential double free in cray pmi cray_fini
This commit was SVN r32711.
2014-09-11 20:30:40 +00:00
Ralph Castain
cb2ad98f57 Silence an unused function warning
This commit was SVN r32704.
2014-09-10 17:36:34 +00:00
Ralph Castain
a7c5b77d70 Just because the openib BTL can't reach a process doesn't mean it is a job-ending error. If we have other methods for reaching the process (e.g., sm for a local proc), then that's okay. If there is no method for reaching a proc, then that's an error - but the BML will report that situation.
The question of whether or not the openib BTL supports loopback is a separate question. It may be more appropriate to make the modex be PMIX_GLOBAL for cases where openib can support loopback so someone can run without a shared memory component. I'll leave that decision to the IB vendors.

This commit was SVN r32702.
2014-09-10 17:02:16 +00:00
Ralph Castain
e671620ac7 Per request from Jeff: tune up the help messages for binding options
Refs trac:4898

This commit was SVN r32691.

The following Trac tickets were found above:
  Ticket 4898 --> https://svn.open-mpi.org/trac/ompi/ticket/4898
2014-09-09 22:39:22 +00:00