Commit graph

2145 commits

Author SHA1 Message Date
Jeff Squyres
0c502d90cd hwloc README-ompi.txt: update for what we pulled from hwloc
Document what we pulled from the hwloc tree.
2015-03-27 06:49:42 -07:00
Brice Goglin
29ccbfd590 hwloc pci: fix bridge depth
It was set up in the PCI backend before filtering,
and partially updated after filtering in the core.
Only set it up once, correctly, after filtering in the core.

(cherry picked from commit open-mpi/hwloc@9659653d24)

Conflicts:
	tests/hwloc/linux/40intel64-2g2n4c+pci.output
	tests/hwloc/xml/192em64t-12gr2n8c2t-distancegroups.xml
	tests/hwloc/xml/192em64t-24n8c2t-distancegroups.xml
	tests/hwloc/xml/192em64t-24n8c2t-nodistancegroups.xml
	tests/hwloc/xml/24em64t-2n6c2t-pci.xml
	tests/hwloc/xml/32em64t-2n8c2t-pci-normalio.xml
	tests/hwloc/xml/96em64t-4n4d3ca2co-pci.xml
	utils/hwloc/test-hwloc-compress-dir.input.tar.gz
	utils/hwloc/test-hwloc-compress-dir.output.tar.gz

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 06:49:39 -07:00
Brice Goglin
1905f35a1e hwloc: bitmap: fix a corner case in hwloc_bitmap_isincluded() with infinite sets
If super_set contains more allocated ulongs than sub_set,
we did not check the last ulongs.
We would return true instead of false when sub_set is
infinite while the last ulongs in super_set are not full.

This fixes tests/hwloc_bitmap_compare_inclusion on some platforms.

(cherry picked from commit open-mpi/hwloc@299e6e846f)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:44 -07:00
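Note on commit 1905f35a1e: the corner case described there can be illustrated with a simplified model. The sketch below is not the hwloc implementation; `struct toy_bitmap`, `toy_word()` and `toy_isincluded()` are hypothetical names, and the representation (explicit words followed by an implicit all-ones tail when `infinite` is set) is an assumption made for illustration. The point is that the comparison has to run over the longer of the two word arrays, so the last words of the super-set are still checked against an infinite sub-set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified bitmap: `count` explicit words, then an implicit tail. */
struct toy_bitmap {
    const unsigned long *words;
    size_t count;
    int infinite; /* nonzero: every bit past the explicit words is set */
};

/* Word i of the bitmap, materializing the implicit tail when i >= count. */
static unsigned long toy_word(const struct toy_bitmap *b, size_t i)
{
    if (i < b->count)
        return b->words[i];
    return b->infinite ? ~0UL : 0UL;
}

/* Returns 1 if every bit set in `sub` is also set in `super`, else 0. */
int toy_isincluded(const struct toy_bitmap *sub, const struct toy_bitmap *super)
{
    assert(sub != NULL && super != NULL);

    /* Walk the longer of the two arrays so no explicit word is skipped. */
    size_t n = sub->count > super->count ? sub->count : super->count;
    for (size_t i = 0; i < n; i++) {
        if (toy_word(sub, i) & ~toy_word(super, i))
            return 0;
    }

    /* An infinite tail can only be covered by another infinite tail. */
    if (sub->infinite && !super->infinite)
        return 0;
    return 1;
}
```

With an infinite sub-set of one word and a super-set of two words whose last word is not full, this version returns 0, which is the behaviour the commit message describes as the fix.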
Brice Goglin
5c9157c547 hwloc: core: only update root->complete sets if insert succeeds
Otherwise we get spurious bits for crazy topologies such as 8em64t-2s2ca2c-buggynuma.output

Will make debug asserts easier.

(cherry picked from commit open-mpi/hwloc@546cd9330a)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:44 -07:00
Brice Goglin
dec01097f8 hwloc: groups: add complete sets when inserting distance/pci groups
Make sure we define complete cpuset/nodeset when we define groups' main cpuset/nodeset
during later insert of groups (for PCI hostbridges or distances).
Otherwise they may end up clearing child/parent complete sets which
suddenly become incoherent while they were fixed earlier.

Needed to fix allowed_nodeset meaning.

(cherry picked from commit open-mpi/hwloc@7c88d17add)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:33 -07:00
Brice Goglin
d6e415cd41 hwloc: AIX: Fix PU os_index
When looking for PUs inside R_MAXSDL rads, some AIX 6.1 releases
return one first rad without any PU.
AIX 6.1 00F63F144C00 does (on quad-power7).
AIX 6.1 00CBAAC24C00 doesn't (on 16x power6).

So we can't assume rad #x contains PU #x. But we already have the right
code to fill the cpuset from the rad, so use that to obtain the PU os_index
as well.

Cannot be used to obtain NUMA node os_index since there's no way to directly
retrieve NUMA nodes from rads (mempools seem unrelated). Just keep using #rad
for NUMA nodes os_index and document that convention when converting back in
set_membind().

Thanks to Hendryk Bockelmann and Erik Schnetter for help with debugging.

(cherry picked from commit open-mpi/hwloc@60006c7b88)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:33 -07:00
Brice Goglin
80140bbe7b hwloc: distances: when we fail to insert an intermediate group, don't try to group further above
Otherwise we'd have some NULL objects above, which would be annoying.
No need to dig further, the distance matrix is likely buggy.

We still keep the inserted groups at this level (incomplete level)
because removing them is hard.

(cherry picked from commit open-mpi/hwloc@312a971ec9)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:33 -07:00
Brice Goglin
29c99156cf hwloc: pci: fix SR-IOV VF vendor/device names
Commit 626129d2818693e62b83c1cfa2ba6e058e5bed66 fixed the hwloc
device/vendor numbers obtained from libpciaccess.
But the corresponding names are still retrieved from pciaccess numbers,
so fix these numbers inside pciaccess structures before retrieving the names.

(cherry picked from commit open-mpi/hwloc@85ea6e4acc)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:32 -07:00
Brice Goglin
da164be0ef hwloc: error: point to the FAQ when displaying the big OS error message
(cherry picked from commit open-mpi/hwloc@b191f816f6)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:32 -07:00
Brice Goglin
3f96e7a271 hwloc: synthetic: Misc levels are not allowed in the synthetic description
Misc objects were used between system and machine in the past
but quickly got replaced with groups.

(cherry picked from commit open-mpi/hwloc@6c2aa6d1ea)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
050bb35feb hwloc: x86: use Group instead of Misc for unknown x2apic levels
Misc are reserved for annotating the topology, the core
doesn't like merging them. Group is more appropriate.

(cherry picked from commit open-mpi/hwloc@3c47649591)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
379c7b0d8b hwloc: x86: use ulong for cache sizes, uint won't be enough in the near future
(cherry picked from commit open-mpi/hwloc@ae82597773)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
6caf9edbea hwloc: hpux: improve hwloc_hpux_find_ldom() looking for NUMA node
hwloc_get_first_largest_obj_inside_cpuset() returns the largest/highest object,
but it could still have a child with the same cpuset.
So check children as well in case there's a matching NUMA node there.

(cherry picked from commit open-mpi/hwloc@57a1c4fbe4)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
fff1bb5dcd hwloc: core: reorder children in merge_useless_child() as well
When ignore_keep_structure is enabled, intermediate level can disappear
between parent and child, making the new child complete_cpuset smaller,
causing the child list to require a reorder just like in remove_ignored().

(cherry picked from commit open-mpi/hwloc@88afbe6b62)

Embed this related commit:
core: abstract out reorder_children(), needed when merging modifies the list of children
(cherry picked from commit open-mpi/hwloc@14db82d391)

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
77978a846e hwloc: core: fix the merging of identical objects in presence of Misc objects
If object A contains B + I/O as children, we can "ignore" I/Os and still
try to merge A and B. We now do the same for Misc objects without cpusets
instead of I/Os.

This fixes a corner case when export/reimport to XML creates a slightly
different topology (making hwloc_insert_misc fail inside a Linux cgroup).

Thanks to Dave Love for reporting the problem.

Fixes #118

(cherry picked from commit open-mpi/hwloc@650371e115)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
5427b33caf hwloc: debug: fix an overzealous assertion about the parent cpuset vs its children
When I/O are attached under a PU, removing the children's cpusets from
the parent cpuset doesn't give 0, it gives the PU cpuset.
The assertion fails on single-pu machines with I/O when --merge is given,
only one PU remains with I/O under it.

But if we insert Misc by cpuset under PU, it gives 0 as expected.

Fix the assertion accordingly.

Thanks to Thomas Van Doren for reporting the issue.

(cherry picked from commit open-mpi/hwloc@45c94c336d)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
9b59d532fc hwloc: cpuid-x86: Fix duplicate asm labels in case of heavy inlining on x86-32
hwloc_x86_discover() calls hwloc_look_x86() twice, which calls hwloc_have_x86_cpuid().
If everything gets inlined, the asm label inside hwloc_have_x86_cpuid()
is duplicated.
Use a local label with f annotation in jumps to avoid the problem.

Thanks to Thomas Van Doren for reporting the issue (found with gcc -m32).

(cherry picked from commit open-mpi/hwloc@50e447f5bc)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
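Note on commit 9b59d532fc: the "local label with f annotation" technique can be sketched as follows. This is an illustration only, not the hwloc source; `toy_jump_over()` and `toy_call_twice()` are hypothetical names, and the sketch assumes a GCC-compatible compiler with AT&T assembler syntax on x86 (on other targets the asm is compiled out). A numeric label such as `1:` may appear any number of times in one object file, and `1f` refers to the next such label going forward, so inlining the function twice does not produce a duplicate symbol.

```c
/* Illustrative only: a jump over one instruction using a local numeric label. */
static inline int toy_jump_over(void)
{
    int r = 0;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "jmp 1f\n\t"     /* "1f": the next local label "1", searching forward */
        "movl $1, %0\n"  /* skipped by the jump, so r keeps its value */
        "1:\n"           /* local label: may be emitted repeatedly */
        : "+r"(r));
#endif
    return r;
}

int toy_call_twice(void)
{
    /* Both calls may be inlined here; a named label would then be defined twice. */
    return toy_jump_over() + toy_jump_over();
}
```

Had the label been a named one (for example a hypothetical `have_cpuid:`), the two inlined copies would define the same symbol and the assembler would reject the file.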
Brice Goglin
86a536ca58 hwloc: x86 and OSF: Don't forget to set NUMA node nodeset
x86: Not critical since BSDs that use this backend have no membind support,
but better fix it for uniformization.
(cherry picked from commit open-mpi/hwloc@a431361c7d)

OSF: Looks like nobody ever tried to play with memory binding on OSF/Tru64.
(cherry picked from commit open-mpi/hwloc@2d6c73356d)

Conflicts:
	NEWS

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
db5bc72496 hwloc: API: clearly state that os_index isn't unique while logical_index is
(cherry picked from commit open-mpi/hwloc@6c75302ab2)

Conflicts:
	opal/mca/hwloc/hwloc191/hwloc/include/hwloc.h

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
a636790604 hwloc: opal/mca/hwloc/hwloc191/hwloc/NEWS update
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:13:41 -07:00
Brice Goglin
7c96aecfaf hwloc: errors: improve the advice to send hwloc-gather-topology files in the OS error message
(cherry picked from commit open-mpi/hwloc@f77aa01b3c)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:13:41 -07:00
Brice Goglin
d5f8c89527 hwloc: configure: fix the check for X11/Xutil.h
At least some Solaris versions require X11/Xlib.h to be #included first.

Thanks to Siegmar Gross for reporting the issue.

(cherry picked from commit open-mpi/hwloc@005a7e89b6)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:13:01 -07:00
Brice Goglin
50b035dddb hwloc: misc.h: Fix hwloc_strncasecmp() with some icc
tolower needs <ctype.h>

Thanks to Ralph Castain for reporting the failure.

(cherry picked from commit open-mpi/hwloc@038c372a58)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:10:57 -07:00
Brice Goglin
6764413aa3 hwloc: misc.h: Fix hwloc_strncasecmp() build under strict flags on BSD
strncasecmp() needs <strings.h>

Thanks to Pavan Balaji for reporting the failure.

(cherry picked from commit open-mpi/hwloc@37439c4801)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:10:46 -07:00
Brice Goglin
6b0011f138 hwloc: v1.9.1 released, doing 1.9.2rc1 now
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:06:15 -07:00
Jeff Squyres
a85edb8ad4 libfabric: update to Github libfabric 0d7daf720f04 2015-03-26 14:40:46 -07:00
Elena
90f5b2bb84 Introduce -tune command line option to set env vars and mca params from file 2015-03-26 18:33:53 +02:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Ralph Castain
9dbc69df0f Stop an ugly infinite loop caused by continual re-opening of the opal if framework. 2015-03-24 17:50:14 -07:00
Rolf vandeVaart
dfb7e00ef5 Make sure context is still around when doing some other cleanup 2015-03-24 16:47:40 -04:00
Ralph Castain
ed5d10b816 Somehow slipped by - ensure we correctly count the cores 2015-03-19 17:56:18 -07:00
Ralph Castain
43a3baad5e Ensure we use the first compute node's topology for mapping
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.

Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.

Correctly count the number of available PUs under each object when given a cpuset

Fix the default binding settings, and correctly count PUs when no cpuset is given

Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Nathan Hjelm
ccba8ce856 Merge pull request #457 from hjelmn/mpit_fixes
mca/base: fix bugs in framework deregistration/re-registration
2015-03-18 08:37:49 -06:00
Ralph Castain
d7d8ae46ed We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info. 2015-03-17 06:10:20 -07:00
Mike Dubman
7640507438 Merge pull request #472 from miked-mellanox/topic/fix_compile_warn
btl/openib: fix compiler warning, by HalR
2015-03-13 14:06:07 +02:00
Jeff Squyres
4ab9e67832 hwloc external: portability updates
Change "test -a" to "&& test", and change foo="$bar" to foo=$bar.  No
substantive code changes.
2015-03-13 04:40:09 -07:00
Jeff Squyres
4d63c88ed1 hwloc external: whitespace cleanup, no code changes 2015-03-13 04:40:05 -07:00
Mike Dubman
00784ae3ba btl/openib: fix compiler warning, by HalR 2015-03-13 13:17:23 +02:00
Todd Kordenbrock
9350b06f7d btl-portals4: fix compiler warnings 2015-03-12 20:34:04 -05:00
Jeff Squyres
65a0e041ac dl: need to use LIBADD, not LIBS
When we use LIBADD for static libraries, the dependent libraries get
propagated properly.  For example, the dl/dlopen component will almost
certainly require the -ldl library; when using LIBS, that doesn't get
propagated elsewhere in the tree, but when using LIBADD, it does
(e.g., when linking opal_wrapper_compiler).
2015-03-12 15:01:14 -07:00
Ryan Grant
6f76984a3c Merge pull request #470 from tkordenbrock/topic/update-portals4-to-btl3
btl-portals4: implement the BTL 3.0 interface
2015-03-12 15:34:05 -06:00
Jeff Squyres
a1daa39425 libfabric: update to Github libfabric 90ac5a258418e
Update to latest upstream Github libfabric in order to fix some usnic
bugs.
2015-03-12 13:23:32 -07:00
Todd Kordenbrock
d1656347c8 btl-portals4: implement the BTL 3.0 interface 2015-03-12 14:19:44 -05:00
adrianreber
714d9aa67e Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart
Topic/orte cr continue like restart
2015-03-12 14:54:02 +01:00
Nathan Hjelm
fd78491768 Merge pull request #451 from elenash/master
fix: mca_base_env_var mca parameter is never handled if it's set from am...
2015-03-11 09:54:25 -06:00
Nathan Hjelm
ce6caab2a7 Merge pull request #463 from hjelmn/cuda_async
btl/openib: cuda: fix CUDA-aware support with async copy
2015-03-11 09:52:48 -06:00
Jeff Squyres
c61dd4d56f usnic: each err eq entry reports *1* completion
Actually, the return from fi_eq_readerr() only indicates a *single*
error completion (not err_entry.data completions).
2015-03-11 08:07:20 -07:00
Ralph Castain
2de5cd6e5f Ensure we don't install the libevent internal headers 2015-03-11 07:35:20 -07:00
Nathan Hjelm
395635f017 Merge pull request #461 from hjelmn/btl_openib_cleanup
btl/openib: remove derived btl segment type
2015-03-11 08:20:41 -06:00
Jeff Squyres
9c926e5e82 usnic: add more comments/explanation about error cases
If we really get a catastrophic error from a libfabric call, don't
bother trying to continue (because data has been corrupted and there's
nothing sane left to do).  Just call opal_btl_usnic_exit() (which
tries to call the PML error callback, but we're so early in the
module_init process that this likely hasn't been set up yet, so the job
will likely abort).
2015-03-11 07:16:28 -07:00
Jeff Squyres
51583789fb usnic: re-indent some show_help code
Nothing too substantial here, but two of the messages moved from
"libfabric API failed" to "internal error during init", just to be a
bit more descriptive.
2015-03-11 07:15:28 -07:00
Jeff Squyres
1b836d784c usnic: subtract number of errored insertions from loop count
When we get errors, the entry.data field tells us how many errors are
being reported.  So decrement the loop count variable by that much.

This fixes CSCut30441.
2015-03-11 07:13:10 -07:00
Adrian Reber
f45dd069bd FT: fix compilation using --with-ft (1/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Nathan Hjelm
b308afa8fd btl/openib: remove derived btl segment type
The derived segment type (btl_openib_segment_t) was intended to store
the registration info needed for put and get. In BTL 3.0 this is no
longer required. I intended to remove this type as part of
open-mpi/ompi@74f1af4548 .

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-10 14:41:15 -06:00
Nathan Hjelm
3d32dbd793 btl/openib: cuda: fix CUDA-aware support with async copy
This commit should resolve an issue seen with CUDA-aware support. The
problem came in with BTL 3.0. Before 3.0 the size of the copy was
stored in the incoming segment's des_remote_count field. This field
does not exist in BTL 3.0 so I stored the value in the
des_segment_count field. This caused problems with the cuda support
code. To fix the issue, the endpoint pointer is now stored in the
fragment's endpoint pointer, which frees up the segment's des_cbdata
pointer for storing the transfer size.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-10 14:38:12 -06:00
Jeff Squyres
cbd99d5f60 libfabric: update to Github upstream 1b4bb2285b
Get a usnic bug fix.
2015-03-10 12:09:02 -07:00
Jeff Squyres
d97551bdb1 usnic: endpoint type hint moved to a sub-struct
Update to match new libfabric API/structure change.
2015-03-10 09:47:41 -07:00
Jeff Squyres
1a1be2efa0 libfabric: update to Github upstream 7095f3dc 2015-03-10 09:47:40 -07:00
Jeff Squyres
afec1454f5 usnic: only set up the connectivity checker if we have modules
If we ended up with no modules (e.g., all usnic devices were
excluded), there was a race condition in that the connectivity agent
could tear down its local socket before one or more of the local
clients saw it.  Therefore, the local clients would timeout waiting
for the socket to appear.

So move the connectivity checker init later in the bootstrapping
process (it *must* be set up before module_init()), and have it only
invoked if we actually ended up with one or more modules.
2015-03-10 07:43:20 -07:00
Jeff Squyres
06accb721c usnic: ensure to free all resources if no usnic BTLs found
If all usnic devices are excluded, then we need to ensure the error
path includes freeing the filter.

This was Coverity CID 1288085
2015-03-10 07:43:20 -07:00
Jeff Squyres
8fef4e865f dl dlopen: fix use-after-free
Re-structure the loop looking for duplicates a little so that we only
have a single free of the string that happens regardless of whether we
found a duplicate or not.

This was Coverity CID 1288090
2015-03-10 07:43:20 -07:00
Jeff Squyres
3efb5f56ae dl dlopen: ensure dirs is not NULL
opal_argv_split() may have returned NULL.

This was Coverity CID 1288088
2015-03-10 07:43:20 -07:00
Jeff Squyres
86968dcdda dl dlopen: fix resource leak
closedir() was one block higher than it should have been.

This was Coverity CID 1288087.
2015-03-10 07:43:20 -07:00
Jeff Squyres
546ad3f060 dl dlopen: free resources upon error
Ensure to take the right path out upon errors (that will free any
pending resources).

This was Coverity CID 1288086
2015-03-10 07:43:19 -07:00
Rolf vandeVaart
49b5eb6c91 Fix missing initialization of variable 2015-03-10 10:33:27 -04:00
Jeff Squyres
4b2cba46f4 usnic: fix bootstrap error paths
Fix previously-unfinished error paths during startup/bootstrapping.
Instead of just blindly continuing on when an fi_* function call
fails, opal_show_help and skip that device.

Also, only check the usnic config minimums once.  They're VIC-wide and
won't change on a per-device basis -- we only need to check them once.

Fixes CSCut19179.
2015-03-09 16:57:41 -07:00
Nathan Hjelm
005c6022e2 mca/base: fix bugs in framework deregistration/re-registration
There were a number of bugs in the framework/variable code that
affected deregistration:

 - Frameworks could be erroneously closed if separately registered and
   opened then subsequently closed. This was a bug in the original
   design which only reference counted opens but not
   registrations. This would cause undefined behavior if
   MPI_T_finalize actually calls ompi_info_close_components as
   intended. Now both registrations and opens are reference counted
   and frameworks/components are not torn down until the matching
   number of close calls have been made.

 - group_find_by_name did not pass the invalidok flags down
   to mca_base_var_group_get_internal correctly.

 - Group deregistration caused the group to be completely reset. This
   does not match the behavior required by MPI_T as it could reduce
   the number of variables/subgroups in a group.

This commit also updates MPI_T_finalize to call
ompi_info_close_components as originally intended.

Closes #374

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-09 16:52:53 -06:00
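Note on commit 005c6022e2: the first bullet of that message (counting registrations as well as opens) can be sketched as below. This is a toy model under stated assumptions, not the mca/base code; `struct toy_framework` and all `toy_*` functions are hypothetical names. Components are released only when the last open is matched by a close, and the framework counts as fully released only once no registration remains either.

```c
#include <assert.h>

/* Hypothetical framework state with two independent reference counts. */
struct toy_framework {
    int registrations;     /* outstanding toy_register() calls */
    int opens;             /* outstanding toy_open() calls */
    int components_loaded; /* 1 while components are usable */
};

void toy_register(struct toy_framework *fw)
{
    fw->registrations++;
}

void toy_open(struct toy_framework *fw)
{
    /* The first open loads components; later opens only bump the count. */
    if (fw->opens++ == 0)
        fw->components_loaded = 1;
}

void toy_close(struct toy_framework *fw)
{
    assert(fw->opens > 0);
    /* Components go away only when the matching number of closes has been made. */
    if (--fw->opens == 0)
        fw->components_loaded = 0;
}

void toy_deregister(struct toy_framework *fw)
{
    assert(fw->registrations > 0);
    fw->registrations--;
}

/* The framework is still in use while either count is nonzero. */
int toy_in_use(const struct toy_framework *fw)
{
    return fw->registrations > 0 || fw->opens > 0;
}
```

In this model a single close after two opens leaves the components loaded, which is the situation the original open-only counting reportedly got wrong when registration and opening happened separately.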
Jeff Squyres
1995f6beba cuda: convert to opal_dl interface 2015-03-09 08:18:13 -07:00
Jeff Squyres
a9d86129c6 mca base: convert to opal_dl interface 2015-03-09 08:16:55 -07:00
Jeff Squyres
39364d315c libltdl: dl component based on libltdl
Works on any system that libltdl supports and has ltdl.h and libltdl
available.
2015-03-09 08:16:55 -07:00
Jeff Squyres
7d340c0c26 dlopen: simple dl component based on POSIX dlopen
Works on systems with dlopen (e.g., Linux and OS X).  It requires
dlfcn.h and libdl, which many systems have installed by default.
2015-03-09 08:16:55 -07:00
Jeff Squyres
e81c070ef0 dl framework: new dynamic loader framework
Embedding libltdl without the use of Libtool bootstrapping has
proven... difficult.  Instead, create a new simple "dl" framework.  It
only provides 4 functions:

- open a DSO (very similar to lt_dlopenadvise())
- lookup a symbol in a previously-opened DSO (very similar to lt_dlsym())
- close a previously-opened DSO (very similar to lt_dlclose())
- iterate over all files in a directory (very similar to lt_dlforeachfile())

There will be follow-on commits with a simple dlopen-based component
(nowhere near as complete/functional as libltdl, but good enough for
Linux and OS X), and a libltdl-based component for all other
platforms.

The intent is that the dlopen-based component can be built by default
in almost all cases.  But if libltdl is available, that component will
be built.  End result: we still get DSO-based functionality by default
in (almost?) all cases.  Without embedding libltdl.  Which is what we
want.
2015-03-09 08:16:55 -07:00
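Note on commit e81c070ef0: of the four operations listed there, the directory iteration is the easiest to sketch with plain POSIX calls. The code below is a minimal illustration, not the opal dl framework API; `toy_foreachfile()`, `toy_file_cb` and `toy_count_cb()` are hypothetical names. It uses only `opendir()`, `readdir()` and `closedir()`, and keeps a single `closedir()` on the way out so the directory handle is released whether or not the callback stops the walk early.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: return nonzero to stop the iteration. */
typedef int (*toy_file_cb)(const char *dir, const char *name, void *data);

/* Calls cb for every entry in dir except "." and "..".
 * Returns -1 if the directory cannot be opened, otherwise the last
 * callback result (0 when the whole directory was visited). */
int toy_foreachfile(const char *dir, toy_file_cb cb, void *data)
{
    assert(dir != NULL && cb != NULL);

    DIR *dp = opendir(dir);
    if (dp == NULL)
        return -1;

    int rc = 0;
    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        rc = cb(dir, de->d_name, data);
        if (rc != 0)
            break;
    }

    closedir(dp); /* single release point for every path after a successful open */
    return rc;
}

/* Example callback: counts entries in the int pointed to by data. */
int toy_count_cb(const char *dir, const char *name, void *data)
{
    (void)dir;
    (void)name;
    ++*(int *)data;
    return 0;
}
```

A caller would pass a directory and a callback, for example `toy_foreachfile(".", toy_count_cb, &n)` with `int n = 0`. A real component would presumably filter for loadable files in the callback before opening them.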
Gilles Gouaillardet
2789d782ab timer/linux: fix insecure data handling
as reported by Coverity with CID 1269923
2015-03-09 19:14:56 +09:00
Rolf vandeVaart
237c268a09 Add extra check during cleanup to make sure we really should clean up the CUDA resources. 2015-03-06 13:12:19 -05:00
Elena
737f06dd68 fix: mca_base_env_var mca parameter is never handled if it's set from amca conf file 2015-03-06 12:12:26 +02:00
Gilles Gouaillardet
da90ed4483 opal/compress: remove misc dead code
as reported by Coverity with CIDs 71856,
1269714, 1269715, 1269717, 1269718, 1269723, 1269724
2015-03-06 15:34:08 +09:00
Gilles Gouaillardet
521317341d btl/openib: fix a double free
as reported by Coverity with CID 1287033
2015-03-06 14:58:11 +09:00
George Bosilca
75479c0f17 Fix some typos. 2015-03-05 12:59:58 -05:00
Alina Sklarevich
1560ed9761 initialize opal_common_verbs_want_fork_support to -1.
This way, if the call to ibv_fork_init() fails, the job will still
continue.
2015-03-05 14:29:09 +02:00
Gilles Gouaillardet
e1cc931e1b btl/tcp: silence CID 710616 2015-03-05 14:20:08 +09:00
Gilles Gouaillardet
852dbafd51 mca/base: fix misc memory leaks
as reported by Coverity with CIDs 710628, 1196713 and 1269855
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
134c866aa9 btl/openib: fix misc memory leaks
as reported by Coverity with CIDs 1269848, 1269852 and 1269862
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
d1b2f043ff fix misc memory leaks
as already reported by Coverity with CIDs
71818, 71819, 72250, 715767, 1196749 and 1274002
2015-03-05 13:58:05 +09:00
Mike Dubman
171d674ca4 Merge pull request #441 from open-mpi/revert-438-topic/use_opal_common_verbs_want_fork_support
Revert "create the opal_common_verbs_want_fork_support parameter."
2015-03-04 10:13:00 +02:00
Gilles Gouaillardet
f43b5b46ee btl/openib: fix heterogeneous support
Thanks @bosilca for the pointer
2015-03-04 13:53:05 +09:00
Gilles Gouaillardet
81b0444ef2 btl/openib: fix comment syntax, no code change
and silence gcc warning about nested comments
2015-03-04 11:24:14 +09:00
Rolf vandeVaart
edf58eb549 Implement CUDA-aware workaround while fork support is worked out 2015-03-03 09:50:01 -05:00
Mike Dubman
98503b56e0 Revert "create the opal_common_verbs_want_fork_support parameter." 2015-03-03 14:28:31 +02:00
Mike Dubman
cc7caf699e Merge pull request #438 from alinask/topic/use_opal_common_verbs_want_fork_support
create the opal_common_verbs_want_fork_support parameter.
2015-03-02 07:43:37 +02:00
Gilles Gouaillardet
04a0438b56 sec/munge: send NULL terminated strings 2015-03-02 12:19:46 +09:00
Alina Sklarevich
8fe42f1bc1 create the opal_common_verbs_want_fork_support parameter.
call the opal_common_verbs_mca_register function to make sure that
opal_common_verbs_want_fork_support mca parameter is created and therefore
can be used to control the fork support.
2015-03-01 17:40:49 +02:00
Ralph Castain
d81c372ea2 Remove the "forwarding" of envars when direct launched - there aren't any envars we can forward under that use-case 2015-02-27 12:19:48 -08:00
Rolf vandeVaart
e48bc77342 Fix the coverity fix 2015-02-27 12:49:44 -05:00
Rolf vandeVaart
cfe91d4d0f Fix compile error for CUDA-aware by using new fork MCA parameter. 2015-02-27 10:00:46 -05:00
Gilles Gouaillardet
e0026224e7 pstat linux: close the files
as reported by Coverity with CID 71983
2015-02-27 19:48:01 +09:00
Gilles Gouaillardet
f33cd58ee9 btl/tcp: fix misc memory leaks
as reported by Coverity with CIDs 710615, 710616 and 710618
2015-02-27 19:16:22 +09:00
Gilles Gouaillardet
60404d1953 btl/base: fix misc memory leaks
as reported by Coverity with CIDs 71818 and 71819
2015-02-27 19:06:06 +09:00
George Bosilca
455b465329 Reflect in the naming the location of the variable. 2015-02-26 18:22:23 -05:00
Jeff Squyres
5215dc0db3 shmem base: do not allow framework selection to occur twice
Both opal_shmem_base_select() and
opal_shmem_base_best_runnable_component_name() were calling
opal_shmem_base_runtime_query(), which would do component selection
(and closing of losing components) twice.

Put protection in opal_shmem_base_runtime_query() to return the cached
results the second time.  Additionally, make
opal_shmem_base_runtime_query() "own" the cached results
(vs. opal_shmem_base_select).
2015-02-26 14:56:46 -08:00
Jeff Squyres
312b0afb67 shmem base: make these the version-less struct names
Minor style commit; no substantive code change.
2015-02-26 14:56:46 -08:00
Jeff Squyres
90a2c3cd99 shmem base: this function had no purpose being public
Make it static to the base, and move it up higher in the file so that
multiple functions beneath it can call it.
2015-02-26 14:56:46 -08:00
Jeff Squyres
9666884cf1 shmem: make the base_module_t a real mca_base_module_t
This allows the opal_shmem_base_module_t to be properly cast to an
mca_base_module_t.

(this commit is the rationale for the previous shmem C99 .member
initialization commit)
2015-02-26 14:56:46 -08:00
Jeff Squyres
62259a74f5 shmem: use C99 struct initialization
Use .member=foo initialization for the shmem framework and components
and modules.
2015-02-26 14:56:46 -08:00
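Note on commit 62259a74f5: the `.member=foo` style refers to C99 designated initializers. The sketch below shows the idea with made-up types; `struct toy_base_module`, `struct toy_shmem_module` and the `toy_*` functions are hypothetical and do not reflect the real shmem structures. Members are named explicitly, so their order in the initializer does not matter, and any member left out is zero-initialized. Placing the base struct first also makes a cast to the base type valid, which seems to be the property the follow-up commit 9666884cf1 relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical base type shared by all modules. */
struct toy_base_module {
    int priority;
};

/* Hypothetical derived module; the base is the first member so that a
 * pointer to the module can be cast to a pointer to the base. */
struct toy_shmem_module {
    struct toy_base_module base;
    const char *name;
    int (*init)(void);
    int (*finalize)(void);
};

static int toy_init(void)
{
    return 0;
}

/* C99 designated initializers: each member is named explicitly. */
const struct toy_shmem_module toy_module = {
    .base = { .priority = 10 },
    .name = "toy",
    .init = toy_init,
    /* .finalize is omitted and is therefore initialized to NULL */
};

/* Runs the module's init hook; the hook must be present. */
int toy_module_start(const struct toy_shmem_module *m)
{
    assert(m != NULL && m->init != NULL);
    return m->init();
}
```

Compared with positional initialization, adding or reordering struct members later does not silently shift values into the wrong fields.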
George Bosilca
778ba0317e Revert "Minor cleanups."
This reverts commit 3b4da0bda4.
2015-02-26 17:53:58 -05:00
George Bosilca
2c60c18e6f A better fix for the want_fork_support issue. I noticed a naming
conflict where ompi was used down in OPAL. I correctly renamed the
MCA parameter, and created a deprecated synonym for the old
name.
2015-02-26 17:35:54 -05:00
George Bosilca
5c3ce3a737 Merge branch 'master' of github.com:open-mpi/ompi 2015-02-26 17:10:18 -05:00
George Bosilca
aeace0468e A more sensible fix, move the MCA variable in the verbs common area. 2015-02-26 16:51:09 -05:00
Nathan Hjelm
855d422e62 Merge pull request #408 from hjelmn/btl_3_0_mod
btl: expose local registration thresholds
2015-02-26 12:57:43 -07:00
Mike Dubman
dbc15009b6 Merge pull request #415 from alinask/topic/fix_fork_support_flow
Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
2015-02-26 21:50:11 +02:00
Rolf vandeVaart
bbdcf9ff33 Fix missing cast change from opal_free_list_changes. Fixes warning 2015-02-26 11:24:49 -05:00
Gilles Gouaillardet
b888768ca3 btl/scif: fix a typo
this is likely a typo introduced by open-mpi/ompi@5f1254d710
@hjelmn could you please double check this?
2015-02-26 13:45:51 +09:00
Nathan Hjelm
8a17e69067 btl/ugni: fix typos introduced by free list update 2015-02-25 12:43:05 -07:00
George Bosilca
f3b58006c8 Merge branch 'master' of github.com:open-mpi/ompi 2015-02-25 12:01:35 -05:00
Jeff Squyres
9381f38a98 libevent2021: remove stale owner.txt file
I'm guessing this directory was accidentally left in the tree when
creating the owner.txt files.
2015-02-25 07:37:27 -08:00
Jeff Squyres
f3c9354d4b usnic: restore compatibility with the v1.8 branch
Also include two other minor changes:

1. More C99-style member initialization in the component struct
2. Fix the BTL module member initialization to not be redundant
2015-02-25 05:37:51 -08:00
Nysal Jan K.A
881a9f3d58 Fix cache line size detection on power
Due to the nature of the cache architecture on power,
we don't export coherency_line_size for L2 in sysfs.
If we are unable to get the L2 cache line size, try L1.

See open-mpi/ompi#383 for more information.
2015-02-25 17:26:28 +05:30
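Note on commit 881a9f3d58: the fallback it describes (try L2, then L1) can be sketched as below. This is not the Open MPI code; `toy_read_int()` and `toy_cache_line_size()` are hypothetical helpers, and the caller supplies the file paths and the final default. On Linux the files would plausibly be the `coherency_line_size` entries under `/sys/devices/system/cpu/cpu0/cache/indexN/`, but which `indexN` corresponds to which cache level is an assumption that varies by system.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the positive integer stored in the file at path, or -1 if the
 * file is missing, unreadable, or does not hold a positive number. */
static int toy_read_int(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    int value = -1;
    if (fscanf(fp, "%d", &value) != 1 || value <= 0)
        value = -1;

    fclose(fp);
    return value;
}

/* Tries the L2 file first, then the L1 file, then the supplied default. */
int toy_cache_line_size(const char *l2_path, const char *l1_path, int fallback)
{
    assert(l2_path != NULL && l1_path != NULL);

    int size = toy_read_int(l2_path);
    if (size < 0)
        size = toy_read_int(l1_path); /* L2 not exported: fall back to L1 */
    return size < 0 ? fallback : size;
}
```

If neither file can be read, the function returns whatever default the caller passed in, so the caller decides what a safe line size is for the platform.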
Alina Sklarevich
e4c4e7df5e Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)

This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.

Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and therefore it failed).
2015-02-25 10:58:50 +02:00
Jeff Squyres
a85a392896 Merge pull request #422 from jsquyres/topic/coverity-fixes
Some Coverity fixes
2015-02-24 17:00:10 -05:00
Jeff Squyres
3cd36ab12a openib: fix double free
This was CID 1269989
2015-02-24 15:24:10 -05:00
Jeff Squyres
f381c5ea8b crs none: ensure file is != NULL before closing it
This was CID 71701
2015-02-24 15:24:09 -05:00
Jeff Squyres
8fd5e75463 crs base: ensure metadata != NULL
This was CID 71700
2015-02-24 15:24:08 -05:00
Jeff Squyres
5894f0c1f2 usnic: update to new mpool API
NOTE: Have not added cross-compatibility with v1.8 branch yet
2015-02-24 10:05:45 -07:00
Nathan Hjelm
5f1254d710 Update code base to use the new opal_free_list_t
Use of the old ompi_free_list_t and ompi_free_list_item_t is
deprecated. These classes will be removed in a future commit.

This commit updates the entire code base to use opal_free_list_t and
opal_free_list_item_t.

Notes:

OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ())

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-24 10:05:45 -07:00
Nathan Hjelm
ed78553512 Update opal_free_list_t usage to reflect new class interface.
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:

OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st

I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
2015-02-24 10:05:44 -07:00
Howard Pritchard
c9e81b54fb Merge pull request #412 from hppritcha/topic/owner_files
add owner files to opal/ompi/orte mca directories
2015-02-23 09:48:20 -07:00
Jeff Squyres
3e8f468709 mca_base_framework: use the right type for dequeued list items
The items on the list are (mca_base_component_list_item_t*)'s, not
(mca_base_component_t*)'s.
2015-02-23 08:30:58 -08:00
Gilles Gouaillardet
8d44d7086a hwloc/base: fix misc memory leaks
as reported by Coverity with CIDs 710636 and 1270441
2015-02-23 13:55:04 +09:00
Howard Pritchard
bf89131f9e add owner files to opal/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
George Bosilca
3b4da0bda4 Minor cleanups. 2015-02-21 16:36:29 -05:00
Jeff Squyres
937bbbac34 libfabric: update to 8528d35551a78b5241e615c0e6ac5a711f96a03c
Update to latest from libfabric Github master
ofiwg/libfabric@8528d35551
2015-02-20 12:37:27 -08:00
Nathan Hjelm
cc750b00a6 btl: export local registration thresholds
Some BTLs do not require local registration for some rdma
transactions. For example: inline put on openib, fma put on ugni. This
commit adds code to expose the local registration thresholds to BTL
users. Optimized code can take advantage of this information to
improve rdma performance.
2015-02-19 16:13:37 -07:00
Jeff Squyres
6098b84294 libfabric: pass the appropriate LDFLAGS to libfabric components
When compiling against an external libfabric, ensure to also pass the
appropriate -L flags so that the compiler/linker can find it.
2015-02-19 05:35:38 -08:00
Jeff Squyres
2d636147e3 reachable netlink: fix the component symbol name 2015-02-19 04:25:15 -08:00
Ralph Castain
008755ab17 Remove stale file reference 2015-02-18 18:36:08 -08:00
Nathan Hjelm
0e09b9298a mca/base: add framework flag indicating a framework does not have
dso components

This flag is needed for a special case framework: dl. The framework is
needed before any dl components can be used.
2015-02-18 14:03:51 -07:00
rhc54
ae16a168ec Merge pull request #401 from rhc54/reachable
Add reachable framework for determining TCP connections
2015-02-18 08:22:48 -08:00
Jeff Squyres
b66fc3aed9 opal_check_visibility.m4: remove extraneous sym link
The sym link to this m4 is not necessary down in the component.
2015-02-18 03:40:25 -08:00
Jeff Squyres
f040ef09ff libfabric: properly define HAVE_ALIAS_ATTRIBUTE
@ggouaillardet identified that HAVE_ALIAS_ATTRIBUTE was not properly
being defined in the embedded libfabric.  This is because the
embedded configury missed the test for it (i.e., the real configure.ac
for libfabric always defines HAVE_ALIAS_ATTRIBUTE to 0 or 1 -- we
didn't emulate that properly here in libfabric's configure.m4).

Also, fix some grammar and properly escape another AC_MSG_CHECKING
message in libfabric's configure.m4.
2015-02-18 03:26:34 -08:00
Gilles Gouaillardet
28714b60cb btl/sm: fix misc errors
as reported by Coverity as CIDs 711636 and 1269847
2015-02-18 17:05:19 +09:00
Ralph Castain
9ef523c152 Add reachable framework for determining TCP connections 2015-02-17 21:47:09 -08:00
Jeff Squyres
9cb047c1ee libfabric: don't install the osd.h headers
When configured --with-devel-headers, there's now 2 "osd.h" header
files in libfabric (in different dirs).  Automake's "install" target
didn't like this, and errored out.

Since embedding libfabric is a temporary measure, just avoid the
problem by not installing any libfabric headers.
2015-02-17 07:10:12 -08:00
Gilles Gouaillardet
55948f2a6d hwloc: fix misc memory leak
as reported by Coverity with CID 1270441
(previous commit open-mpi/ompi@c25185f3a9 did not fully fix that one)
2015-02-17 14:06:15 +09:00
Gilles Gouaillardet
da7ffb6448 btl/vader: fix memory leak
as reported by Coverity with CID 1269904
2015-02-16 13:51:05 +09:00
Gilles Gouaillardet
c25185f3a9 opal/hwloc: fix misc memory leaks
as reported by Coverity with CIDs 710631-710638, 1196705,
1196716, 1196717, 1196752, 1196753
2015-02-16 12:23:37 +09:00
Gilles Gouaillardet
8dd77c692e opal/hwloc: fix misc bugs
as reported by Coverity with CIDs 72224, 703566,
1196821, 1196842, 1196657 and 1196658
2015-02-16 11:59:48 +09:00
Gilles Gouaillardet
0ce59f2d29 pmix: fix misc memory leaks
as reported by Coverity as CID 1269843, 1269854, 1269856, 1269857 and 1269858
2015-02-16 11:19:43 +09:00
George Bosilca
a7a4d6335e Various cleanups. 2015-02-15 11:39:09 -05:00
George Bosilca
a4aa74d4b9 Fix the SM BTL. 2015-02-15 11:38:45 -05:00
George Bosilca
84994c7438 This comment seems to contradict the compiler's opportunity to
optimize the unused data out.
2015-02-15 11:37:22 -05:00
Jeff Squyres
2ca14acaf0 libfabric: add missing files into Makefile.am 2015-02-14 05:01:29 -08:00
Jeff Squyres
955d8b7525 usnic: adapt for new libfabric API 2015-02-13 14:44:23 -08:00
Jeff Squyres
3abebe7251 libfabric: update to ofiwg/libfabric@06fdfbef98 2015-02-13 14:44:06 -08:00
Nathan Hjelm
1162093d34 btl/scif: fix debug build
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:39 -07:00
Jeff Squyres
67ee1e6cf8 usnic: restore compatibility between master and v1.8
Add the functions that changed between BTL 2.0 and 3.0 into compat.h
and compat.c:

* module.btl_prepare_src: the signature and body of this method
  changed between 2.0 and 3.0.  However, the functions that this
  method calls did *not* need to change, so they are copied over
  wholesale (with the exception that they no longer accept the unused
  `registration` parameter).
* module.btl_prepare_dst: this method does not exist in BTL 3.0.
* module.btl_put: the signature and body of this method changed
  between 2.0 and 3.0.
2015-02-13 11:46:38 -07:00
Jeff Squyres
ad841d7ba3 usnic: update to BTL 3.0 2015-02-13 11:46:38 -07:00
Jeff Squyres
0a5fd8e36a usnic: update README for new BTL 3.0 scheme details 2015-02-13 11:46:38 -07:00
Jeff Squyres
cf99f0c905 usnic: just add comments/explanations -- no code changes 2015-02-13 11:46:38 -07:00
Jeff Squyres
af61065b87 usnic: minor update of member field names 2015-02-13 11:46:38 -07:00
Jeff Squyres
8311428602 btl.h: whitespace cleanup
No code changes
2015-02-13 11:46:38 -07:00
Jeff Squyres
7971fd57f0 btl.h: add more description for reg/dereg functions 2015-02-13 11:46:38 -07:00
Nathan Hjelm
a3b739d117 btl/ugni: use pthread_join to wait on progress thread completion
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:38 -07:00
Nathan Hjelm
953efc3eb2 btl/openib: fix compilation issues with XRC
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:38 -07:00
Nathan Hjelm
a9763e123d add btl comment
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:38 -07:00
Nathan Hjelm
1e518504e4 btl/smcuda: update for BTL 3.0 interface 2015-02-13 11:46:37 -07:00
Nathan Hjelm
aba0675fe7 btl/vader: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:37 -07:00
Nathan Hjelm
f8ac3fb1e8 btl/ugni: add support for atomic operations
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:37 -07:00
Nathan Hjelm
655604f509 btl/ugni: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:37 -07:00
Nathan Hjelm
4972d97b8b btl/template: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:37 -07:00
Nathan Hjelm
f241b6e0a7 btl/tcp: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
25176cad27 btl/sm: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
19abc19ad9 btl/self: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
f96d48a2e1 btl/scif: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
cf91156105 btl/openib: add atomic operation support
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
74f1af4548 btl/openib: update for BTL 3.0 interface
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
fc7397949c btl: require that btls handle descriptor = NULL in the btl_sendi function
The send inline optimization uses the btl_sendi function to achieve lower
latency and higher message rates. Before this commit BTLs were allowed to
assume the descriptor was non-NULL and were expected to return a valid
descriptor if the send could not be completed using btl_sendi. This
behavior was fine until the usage of btl_sendi was changed in ob1. This
commit allows the caller to specify NULL for the descriptor. The affected
btls have been updated to handle this case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
593f97ae92 btl: add support for 64-bit atomic operations
This commit adds an interface for BTLs to export support for 64-bit atomic
operations on integers. BTLs that can support atomic operations should
implement these functions and set the appropriate btl_flags and btl_atomic_flags.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Nathan Hjelm
f8e15ca83d Update the interface to provide a cleaner interface for RDMA operations.
The old BTL interface provided support for RDMA through the use of
the btl_prepare_src and btl_prepare_dst functions. These functions were
expected to prepare as much of the user buffer as possible for the RDMA
operation and return a descriptor. The descriptor contained segment
information on the prepared region. The btl user could then pass the
RDMA segment information to a remote peer. Once the peer received that
information it then packed it into a similar descriptor on the other
side that could then be passed into a single btl_put or btl_get
operation.

Changes:

 - Added functions to register and deregister memory regions with the
   btl. If no registration is needed a btl should set these function
   pointers to NULL. These functions take over for btl_prepare_src/dst
   and btl_free for RDMA operations. The caller should specify the
   maximum permissions needed on the memory.

 - Changed the function signatures for both btl_put and btl_get. In
   place of a prepared descriptor the caller should provide the source
   and destination addresses and registration handles as well as a
   new callback function. The callback will be provided with the local
   address and registration handle, callback context, callback data, and
   status. See mca_btl_base_rdma_completion_fn_t in btl.h.

 - Added a new btl constraint: MCA_BTL_REG_HANDLE_MAX_SIZE. This
   value specifies the maximum size of any btl's registration handle.

 - Removed the btl_prepare_dst function. This reflects the fact that
   RDMA operations no longer depend on "prepared" descriptors.

 - Removed the btl_seg_size member. There is no need for btls to
   subclass the mca_btl_base_segment_t class anymore.

 - Expose the btl's put/get limitations with new struct members:
   btl_put_limit, btl_put_alignment, btl_get_limit, btl_get_alignment.

 - Remove the mca_mpool_base_registration_t argument from the btl_prepare_src
   function. The argument was intended to support RDMA operations and is no
   longer necessary.

 - Remove des_remote/des_remote_count from the mca_btl_base_descriptor_t
   structure. These structure members were originally used to specify the remote
   segment for RDMA operations. Since the new btl interface no longer uses
   descriptors for RDMA these members no longer have a purpose. In addition
   to removing these members the local segment structure fields have been
   renamed from des_local/des_local_count to des_segments/des_segment_count.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-13 11:46:36 -07:00
Howard Pritchard
6a275f4489 Merge pull request #395 from hppritcha/topic/pmix_cray_kvs
pmix/cray: remove workaround for OBJ_RELEASE
2015-02-13 11:25:50 -07:00
Howard Pritchard
bd9d185951 pmix/cray: remove workaround for OBJ_RELEASE
Per feedback from rhc, manually set the base_ptr member
of the opal_buffer_t variable to NULL prior to calling
OBJ_RELEASE.  A similar feature of opal_dss.load also
exists so likewise reset the base_ptr to NULL prior to
invoking it.

Hopefully the opal_buffer_t struct does not change
frequently.

Minor cleanups to reduce output when pmix_base_verbose
mca parameter is set.
2015-02-13 07:47:26 -08:00
Jeff Squyres
f7b4b23383 usnic: ensure to NULL-terminate the string/not overflow
This was CID 1269921.
2015-02-12 13:41:30 -08:00
Jeff Squyres
8febd41a39 usnic: fix minor memory leak
This was CID 1269859.
2015-02-12 13:41:30 -08:00
Jeff Squyres
4c074da1c2 usnic: fix minor memory leak
This was CID 1269853.
2015-02-12 13:41:30 -08:00
Jeff Squyres
a7ce2d406c usnic: don't bother comparing unsigned values for <0
This was CID 1269812.
2015-02-12 13:41:30 -08:00
Jeff Squyres
caacc6ad91 usnic: properly differentiate data pool vs. malloc
usnic_fls() can actually return 0, leading us to incorrectly free() a
buffer instead of OMPI_FREE_LIST_RETURN_MT'ing it.

So add an explicit bool in the struct that tracks whether the buffer
came from malloc or a freelist.

This was CID 1269660.
2015-02-12 13:41:30 -08:00
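Note on caacc6ad91: the idea of recording a buffer's origin explicitly, instead of inferring it from another value, can be sketched as follows. This is a minimal illustration, not the usnic code; `buffer_sketch_t`, the single-slot `pool_slot` and the `pool_returns` counter are hypothetical stand-ins for the real structure and free list.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical buffer record: the flag states where the memory came from. */
typedef struct {
    void *data;
    bool from_malloc;   /* true: malloc(); false: borrowed from the pool */
} buffer_sketch_t;

/* Stand-in for a free list: one fixed slot plus a return counter. */
char pool_slot[64];
int pool_returns = 0;

/* Small requests use the pool, large ones fall back to malloc(). */
buffer_sketch_t buffer_get(size_t len)
{
    buffer_sketch_t b;
    if (len <= sizeof(pool_slot)) {
        b.data = pool_slot;
        b.from_malloc = false;
    } else {
        b.data = malloc(len);
        b.from_malloc = true;
    }
    return b;
}

/* Release through the path that matches the recorded origin. */
void buffer_release(buffer_sketch_t *b)
{
    if (b->from_malloc) {
        free(b->data);
    } else {
        pool_returns++;   /* stands in for returning the item to the list */
    }
    b->data = NULL;
}
```

Because the release path reads only the flag, a size class that happens to compute to 0 can no longer send a pooled buffer to free().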
Jeff Squyres
3b39535ebb usnic: ensure that the string is NULL-terminated
This was CID 1269666.
2015-02-12 13:41:30 -08:00
Jeff Squyres
41c6e26a38 usnic: ensure the copied string is NULL-terminated
This was CID 1269667
2015-02-12 13:41:30 -08:00
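Note on the string-termination fixes (CIDs 1269666 and 1269667): strncpy() does not write a terminating NUL when the source is at least as long as the limit, so the copy has to be terminated by hand. A minimal sketch, assuming a hypothetical helper named `copy_string_sketch`:

```c
#include <string.h>

/* Copy src into dst (capacity dst_len bytes), always NUL-terminating.
 * Overlong input is truncated rather than overflowing dst. */
void copy_string_sketch(char *dst, size_t dst_len, const char *src)
{
    if (dst_len == 0) {
        return;                      /* no room even for the terminator */
    }
    strncpy(dst, src, dst_len - 1);  /* leave the last byte for '\0' */
    dst[dst_len - 1] = '\0';
}
```

Calling it with a 4-byte destination and the source "abcdefg" leaves "abc" in the destination.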
Jeff Squyres
81585c0a7c usnic: strengthen the check-if-accept()-failed test
This was Coverity CID 1269801.
2015-02-12 13:41:30 -08:00
Jeff Squyres
117e6feaa1 shmem sysv: ensure we don't shmdt(NULL)
This was CID 71999.
2015-02-12 13:41:30 -08:00
Jeff Squyres
6d3a84514f mca_base_cmd_line.c: fix minor memory leak
This was CID 1269874.
2015-02-12 13:41:29 -08:00
Jeff Squyres
f8e334357d mca_base_pvar.c: protect removal from list
Only remove it from the list if it is actually on the list.

This was CID 1269758.
2015-02-12 13:41:29 -08:00
Nathan Hjelm
f1dc29b145 btl/vader: fix modex size when xpmem is in use
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-12 14:06:24 -07:00
Nathan Hjelm
49ba150972 mca/base: fix path string parsing
CID 993709
2015-02-12 13:03:46 -07:00
Jeff Squyres
00c878957c mca_base_var.c: add debug check for another programming error
Coverity alerted us to the fact that there are places where
the synonym_for param is hard-coded to -1 when calling
register_variable().  It would be a coding error if synonym_for==-1
and (flags & MCA_BASE_VAR_FLAG_SYNONYM)>0, so let's add that to the
debug-only check at the top of the function.

This was CID 993717.
2015-02-12 10:24:02 -08:00
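Note on 00c878957c: the debug-only consistency check described above might look roughly like this. The flag value, the predicate and `register_variable_sketch` are hypothetical; only the rule itself (the synonym flag must not be combined with synonym_for == -1) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; the real MCA_BASE_VAR_FLAG_SYNONYM value may differ. */
#define SKETCH_FLAG_SYNONYM 0x1u

/* True when the arguments are consistent: a variable flagged as a
 * synonym must name the variable it is a synonym for. */
bool synonym_args_consistent(unsigned flags, int synonym_for)
{
    return !((flags & SKETCH_FLAG_SYNONYM) && synonym_for == -1);
}

/* Sketch of a registration function with the check at the top.
 * assert() compiles away when NDEBUG is defined, so the check costs
 * nothing in optimized builds. */
int register_variable_sketch(unsigned flags, int synonym_for)
{
    assert(synonym_args_consistent(flags, synonym_for));
    /* ... actual registration work would go here ... */
    return 0;
}
```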
Jeff Squyres
332943f1c3 pstat linux: ensure to close the file
This was CID 71983.
2015-02-12 10:24:02 -08:00
Jeff Squyres
6a64fe85a1 pstat linux: ensure read() returns >=0
This was CID 71182.
2015-02-12 10:24:02 -08:00
Jeff Squyres
8be0e0b0ca usnic: don't close fp upon error
Let the caller close fp.  Properly check for errors when calling
subroutines.

This was Coverity CID 1269995.
2015-02-12 10:24:01 -08:00
Howard Pritchard
0cf2b478e0 Merge pull request #391 from hppritcha/topic/cray_pmi_kvs
pmix/cray: initial kvs removal work
2015-02-11 19:55:34 -07:00
Howard Pritchard
9955834ff1 pmix/cray: initial kvs removal work
Remove use of the Cray PMI KVS - which is designed for a lightweight
MPI that exchanges only a minimal amount of connection info
(about 128 bytes per rank) - within cray/pmix.  Use Cray PMI
collective extensions instead.

This is the first of several steps to accelerate launch of
Open MPI on Cray systems using either native aprun or nativized
slurm.
2015-02-11 15:14:55 -08:00
Rolf vandeVaart
08dceda2c0 Fix logic for handling priority and eager RDMA. There was some refactoring that was done
in this code and it ended up changing the logic that is used to set up eager RDMA.
Rather than setting up eager RDMA with a high priority message, it did it the other
way around.  For some reason, CUDA-aware support did not like this.  So, basically,
restore the logic to the way it was prior to the refactoring.  The refactoring did not
intend to change this.  Lightly reviewed by hjelmn.
2015-02-11 16:38:36 -05:00
Jeff Squyres
4f1996df5d various: remove $(LTDLINCL) from Makefile.am's that didn't need it 2015-02-11 12:25:20 -08:00
Ralph Castain
3de8c5c7c6 Cleanup the munge support - the credential cannot be reused for multiple connections 2015-02-10 20:34:35 -08:00
George Bosilca
e173f9b0c0 Somehow we lost one of the most critical parameters
allowing the PML to decide how to order the different
interconnects. Bring it back!
2015-02-10 20:32:05 -05:00
Ralph Castain
3ae3b96c17 Fix master compilation - a buried header dependency must have been removed. 2015-02-10 07:22:10 -08:00
Mike Dubman
6816e3421f Merge pull request #377 from regrant/ib_wr_fix
fix problem with get_pathrecord posting too many recv requests
2015-02-10 08:47:23 +02:00
Ralph Castain
bef830efef Fix debug output 2015-02-09 20:49:04 -08:00
Ralph Castain
07134f5b17 Add munge security 2015-02-09 20:49:03 -08:00
Ralph Castain
a3275aa867 Once again, fix the blasted singleton comm_spawn 2015-02-05 17:34:25 -08:00
Jeff Squyres
0dbbffb753 pmix_base_frame: use the "= { 0 }" initializer
Per open-mpi/ompi#381, convert the specific initialization of opal_pmix
to use the generic "= { 0 }" initializer.  This form can be used to
initialize any type when the intent is just to zero out / assign
*some* value.
2015-02-05 17:51:06 -05:00
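Note on 0dbbffb753: in C, "= { 0 }" explicitly initializes the first member, and every remaining member is initialized as if it had static storage duration (zero for arithmetic types, null for pointers). A minimal sketch with a hypothetical module structure, not the real opal_pmix type:

```c
#include <stddef.h>

/* Hypothetical module table mixing function pointers and plain data. */
typedef struct {
    int (*init)(void);
    int (*finalize)(void);
    const char *name;
    int priority;
} module_sketch_t;

/* Explicitly initialized global: every member starts out zero/NULL,
 * and the definition is no longer a tentative one. */
module_sketch_t global_module = { 0 };

/* The same form works for automatic variables. */
module_sketch_t make_empty_module(void)
{
    module_sketch_t m = { 0 };
    return m;
}
```

The initializer does not depend on the member order or types, so it keeps working if the structure changes later.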
Ralph Castain
4d882796b6 Silence warnings 2015-02-05 11:41:00 -08:00
Howard Pritchard
e508a4078e Merge pull request #376 from regrant/ib_error_fix
fixes OpenIB connect error reporting for ibv_* calls that return an errn...
2015-02-04 10:22:03 -07:00
Jeff Squyres
621af3aa07 pmix_base: fix global opal_pmix symbol for static linking on OS X
OS X has weirdness when static linking.  If a symbol is not
initialized, it is put into the common block section, and Weird Things
happen (linking will fail when trying to use that global symbol).
If you initialize the variable, it goes into a different section (and
linking to it will work).

This link (that might go stale someday) has some information about OS
X linker scope and treatment of symbol definitions:
https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachOTopics/1-Articles/executing_files.html#//apple_ref/doc/uid/TP40001829-98432-TPXREF120

Fixes #375.
2015-02-04 12:12:31 -05:00
Ryan Grant
de93497789 fix problem with get_pathrecord posting too many recv requests 2015-02-04 09:53:58 -07:00
Ryan Grant
5d5e9bc1f8 fixes OpenIB connect error reporting for ibv_* calls that return an errno 2015-02-04 09:09:14 -07:00
Jeff Squyres
a3728f09af libfabric: add another missing file to the Makefile.am 2015-02-04 04:02:27 -08:00
Jeff Squyres
66a680879e libfabric: fix header file name in Makefile.am 2015-02-03 19:41:25 -08:00
Jeff Squyres
cb7cc171f9 usnic: update README.txt notes
Update notes about copying the usnic BTL between master and the v1.8
branch.
2015-02-03 15:54:36 -08:00
Jeff Squyres
edf7232e00 usnic: enable building with an external libfabric 2015-02-03 13:46:06 -08:00
Jeff Squyres
bfa54d5d7b usnic: update to match new libfabric 2015-02-03 13:46:06 -08:00
Jeff Squyres
d2490d2fd8 libfabric: update Makefile.am to match new libfabric drop 2015-02-03 13:46:05 -08:00
Jeff Squyres
3dc0abfbc4 libfabric: update to (just past) 1.0rc1
Updated to Github ofiwg/libfabric@6b005d0d19.
2015-02-03 13:46:05 -08:00
Ralph Castain
d3267c200f Add missing OMPI-changes to libevent 2.0.22 2015-02-02 20:57:40 -08:00
Jeff Squyres
965ccab6cc libfabric: remove a few warnings
Embedding libfabric is a temporary measure; I'm removing some warning
notifications so that the output isn't so cluttered (we're getting
the real warnings fixed upstream, but the OMPI community doesn't
really care/need to see the warnings in the meantime).
2015-01-29 17:38:02 -08:00
Todd Kordenbrock
37e6096fe7 Copyright update. 2015-01-29 11:08:13 -06:00
Todd Kordenbrock
ca30e129e8 Add the option to use the Portals4 logical to physical table.
This commit adds an MCA variable to select Portals4 logical
addressing, populates the logical-to-physical mapping table and
initializes the NI in this mode.
2015-01-29 11:08:13 -06:00
George Bosilca
b9a63cbe7a One less warning. 2015-01-27 13:25:55 -05:00
Ralph Castain
294ebc907a Fix singleton operations so they can work inside a slurm environment 2015-01-27 09:29:42 -06:00
Ralph Castain
ba25e8a0ce Fix singletons 2015-01-27 09:29:42 -06:00
Ralph Castain
028b00154d Complete implementation of the schizo framework to support OMPI component 2015-01-27 09:29:42 -06:00
Jeff Squyres
436223959d usnic: update to match new libfabric APIs 2015-01-24 05:49:36 -08:00
Jeff Squyres
7d5755f62b libfabric: update to ofiwg/libfabric@b3f7af4c67
Pull down a new embedded copy of libfabric from
https://github.com/ofiwg/libfabric.
2015-01-24 05:48:48 -08:00
Howard Pritchard
056daa05bf btl/ugni: use PMIX_GLOBAL for modex_send in ugni
Using PMIX_REMOTE is not the right thing for ugni
BTL when it's possible that spawned ranks end up
on the same node as some of the spawnee ranks.
2015-01-22 06:53:45 -08:00
Gilles Gouaillardet
9f80aa2d28 btl/openib: regression fix when rdmacm or udcm are disabled
This fixes a regression introduced in open-mpi/ompi@661c35ca67

Thanks to Mark Santcroos for reporting this issue
2015-01-20 11:31:50 +09:00
Rolf vandeVaart
66f6026214 Improve error message to help user figure out what to do 2015-01-16 13:55:27 -05:00
Jeff Squyres
65a279019e usnic: fix typo in memchecker usage 2015-01-16 09:42:19 -08:00
Jeff Squyres
3969fe3a94 libfabric: ensure wrapper libs are loaded for static builds
For static builds, we need to also set
<framework>_<component>_WRAPPER_EXTRA_LIBS so that the wrappers know
what other libraries to add to link executables.
2015-01-16 09:29:52 -08:00
Gilles Gouaillardet
661c35ca67 cleanup dead code caused by the removal of the --with-threads configure option 2015-01-16 19:13:59 +09:00
Nathan Hjelm
006074c48d Merge pull request #332 from hjelmn/openib_updates
Openib updates
2015-01-15 15:05:18 -06:00
Jeff Squyres
d13c14ec82 CSCus22527: fix off-by-one error in checking the number of VFs
Ensure to count *this* process when checking for how many VFs we need
on the local server.

(cherry picked from commit 386c01934e98cb8dcb48ff648ecdfb0c8677baa9)
2015-01-15 11:44:29 -08:00
Jeff Squyres
4685767b2d libfabric: update usnic configury
Use new common m4 macro for choosing between libnl3 and libnl.
2015-01-15 07:12:39 -08:00
Jeff Squyres
400b02e566 libfabric: update to github:ofiwg/libfabric HEAD
Specifically: bbf0f3ea8e92c92a7cee56473ecdbbbb34cceb7d (15 Jan 2015)
2015-01-15 07:11:54 -08:00
Aurélien Bouteiller
f49981bb2a Disable coalescing until pull request #332 gets in. 2015-01-14 14:12:47 -05:00
Nathan Hjelm
cf4975501d rcache/vma: fix parent class of mca_rcache_vma_t
There was a mismatch between the structure for mca_rcache_vma_t and
the OBJ_CLASS_INSTANCE. One was opal_list_item_t and the other was
ompi_free_list_item_t. The super class in the structure looks like it
is the correct one. Changed the superclass in OBJ_CLASS_INSTANCE to
match.
2015-01-14 10:21:24 -07:00
Jeff Squyres
e4e5e7dbc0 usnic: ensure to clean up nicely in case of low resources
If there are not enough resources (e.g., low VFs), we can end up
calling finalize_one_channel() on the same channel multiple times.  So
ensure to NULL out fields that we have freed already so that we do not
try to free them a second time.

Fixes CSCus26648.
2015-01-13 14:37:31 -08:00
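Note on e4e5e7dbc0: making a cleanup routine safe to call more than once comes down to resetting each pointer right after it is freed, since free(NULL) is a no-op. A minimal sketch; `channel_sketch_t` and its fields are hypothetical and much simpler than the real usnic channel:

```c
#include <stdlib.h>

/* Hypothetical channel holding two heap allocations. */
typedef struct {
    void *recv_buffers;
    void *send_buffers;
} channel_sketch_t;

/* Idempotent cleanup: each pointer is set to NULL after free(), so a
 * second call passes NULL to free(), which does nothing. */
void finalize_channel_sketch(channel_sketch_t *ch)
{
    free(ch->recv_buffers);
    ch->recv_buffers = NULL;

    free(ch->send_buffers);
    ch->send_buffers = NULL;
}
```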
Jeff Squyres
8807ae2497 usnic libfabric: also set the us_netmask_be field.
From libfabric upstream commit ofiwg/libfabric@3976745.

Part of the fix for CSCus22495.
2015-01-13 12:04:57 -08:00
Jeff Squyres
d00cede718 usnic: fix if_include/exclude of CIDR-specified networks
Fix the ordering so that we obtain the usnic netmask information
*before* we do the filtering based on CIDR-specified networks.

Also requires upstream Github libfabric commit 3976745.

Fixes CSCus22495.
2015-01-13 12:04:51 -08:00
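Note on d00cede718: matching an interface against a CIDR-specified network needs the netmask, which is why the netmask has to be known before the filtering step. The sketch below is a generic IPv4 illustration in host byte order, not the usnic code; both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a prefix length (0..32) to a host-byte-order netmask. */
uint32_t prefix_to_mask(unsigned prefix_len)
{
    if (prefix_len == 0) return 0u;             /* avoid shifting by 32 */
    if (prefix_len >= 32) return 0xFFFFFFFFu;
    return (uint32_t)(0xFFFFFFFFu << (32 - prefix_len));
}

/* True when addr lies inside network/prefix_len. */
bool addr_in_network(uint32_t addr, uint32_t network, unsigned prefix_len)
{
    uint32_t mask = prefix_to_mask(prefix_len);
    return (addr & mask) == (network & mask);
}
```

For example, 10.1.2.3 (0x0A010203) matches 10.1.0.0/16, while 10.2.2.3 does not.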
Jeff Squyres
a220b92cf8 usnic: fix function name in opal_output 2015-01-13 12:04:07 -08:00
Gilles Gouaillardet
955f3c2730 configury: check existence of the atomic_init function in libfabric
Intel compilers implement atomic_init in C++ only,
so disable C11 atomics in libfabric for now
2015-01-13 16:39:41 +09:00
Gilles Gouaillardet
cbe0d26b2d configury: do test the __STDC_NO_ATOMICS__ macro for libfabric 2015-01-13 16:06:37 +09:00
Jeff Squyres
5ed688a074 usnic: ensure that we only get "usnic"-named providers
Also, a minor update to a verbose message.
2015-01-12 13:21:22 -08:00
Jeff Squyres
881b1dcf19 usnic: document libfabric abstractions
Handy tips to remember the libfabric abstractions and what they
correspond to in usnic/VIC terms.
2015-01-09 15:21:51 -08:00