1
1
Граф коммитов

3473 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
1e6a558993 opal_info_support.c: whitespace cleanup
No code changes
2015-04-20 08:56:42 -07:00
Jeff Squyres
14afe14b6a mca_base_var.c: whitespace cleanup
No code changes
2015-04-20 08:56:42 -07:00
Jeff Squyres
1f237b78d1 *_info tools: quote parsable values if they contain colons
Thanks to Lev Givon for the suggestion.
2015-04-20 08:56:42 -07:00
Nathan Hjelm
662460b06b Modify destructor function configury
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-20 09:51:06 -06:00
Nathan Hjelm
33181b2543 opal: use C99 subobject naming for component initialization
This commit helps future-proof opal components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
Mike Dubman
45396451cc Merge pull request #537 from nkogteva/timings_fixes
opal timing: added ability to choose the timer type
2015-04-17 18:27:46 +03:00
Nathan Hjelm
bfacb5dd73 common/ugni: update for MCA 2.1 2015-04-17 08:09:18 -06:00
Nadezhda Kogteva
116169c38a opal timing: added ability to choose the timer type 2015-04-17 11:15:55 +03:00
Nathan Hjelm
38589c46c0 opal: add a destructor/fini function to opal
This commit is related to an RFC from June 2014. Disscussion can be
found at:

http://www.open-mpi.org/community/lists/devel/2014/07/15140.php

The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).

I moved the following to the destructor function:

 - Class system finalize. This solves a bug when MPI_T_finalize is
   called before MPI_Init. The only downside to this change is we will
   leave the footprint of the opal class system after
   MPI_Finalize. This footprint should be relatively small.

This is an alternative to #517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (#517
changes internal ABI).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-16 19:53:52 -06:00
Nathan Hjelm
51fec14ac7 mca/base: silence some clang/coverity warnings
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-16 12:07:34 -06:00
Nathan Hjelm
3436f2917d Merge pull request #449 from hjelmn/mca_base_update
mca/base update
2015-04-16 08:41:48 -06:00
Nathan Hjelm
ef01e130aa mca/base: protect mca_base_component_repository_release if dlopen support is disabled 2015-04-15 10:06:43 -06:00
Nathan Hjelm
e794658f2d Merge pull request #516 from hjelmn/repository_update
RFC: Repository update
2015-04-15 10:03:08 -06:00
Nathan Hjelm
81502fafa8 Merge pull request #379 from hjelmn/remove_enable_smp_locks
Per-RFC: remove the --disable-smp-locks configure option
2015-04-15 10:02:23 -06:00
Elena
96bdf595c2 fix for -am -tune options issue came from PR 520 2015-04-15 15:51:49 +03:00
Nathan Hjelm
c954f457d9 mca/base: update the way dynamic components are handled
This commit is a rework of the component repository. The changes
included in this commit are:

 - Remove the component dependency code based off .ompi_info
   files. This code is legacy code dating back 10 years that and is no
   longer used.

 - Move the plugin scanning code to the component repository. New
   calls have been added to add new scanning paths, query available
   components, and dlopen/load components.

 - Pass the framework down to mca_base_component_find/filter. Eventually
   the framework structure will be used to further validate components
   before they are used.

 - Add support to the MCA framework system to disable scanning for
   dlopened components on open (support already existed in
   register). This is really only relevant to installdirs as it has no
   register function and no DSO components.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-14 15:55:33 -06:00
Nathan Hjelm
f8158a5ec1 btl/vader: fix deadlock in mca_btl_vader_progress_endpoints
This commit fixes a typo in mca_btl_vader_progress_endpoints where
OPAL_THREAD_LOCK was used when OPAL_THREAD_UNLOCK was intended.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-14 09:34:45 -06:00
Jeff Squyres
fadc3ad01a hwloc/external/configure.m4: no need to unset
Instead, use a safe environment variable name (that is SCOPE_PUSHed and
SCOPE_POPed).
2015-04-14 07:04:01 -07:00
Nathan Hjelm
113c890ccf Merge pull request #520 from hjelmn/valgrind_cleanness
fix memory leaks and valgrind errors
2015-04-13 10:09:34 -06:00
Nathan Hjelm
a7b0c00ab6 fix memory leaks and valgrind errors
This commit fixes several vagrind errors. Included:

 - installdirs did not correctly reinitialize all pointers to NULL
   at close. This causes valgrind errors on a subsequent call to
   opal_init_tool.

 - several opal strings were leaked by opal_deregister_params which
   was setting them to NULL instead of letting them be freed by the
   MCA variable system.

 - move opal_net_init to AFTER the variable system is initialized and
   opal's MCA variables have been registered. opal_net_init uses a
   variable registered by opal_register_params!

 - do not leak ompi_mpi_main_thread when it is allocated by
   MPI_T_init_thread.

 - do not overwrite ompi_mpi_main_thread if it is already set (by
   MPI_T_init_thread).

 - mca_base_var: read_files was overwritting mca_base_var_file_list
   even if it was non-NULL.

 - mca_base_var: set all file global variables to initial states on
   finalize.

 - btl/vader: decrement enumerator reference count to ensure that it
   is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-11 09:28:35 -06:00
Ralph Castain
033418f62a Correct a typo that reversed the default binding pattern. Ensure we default bind to hwthread if user specified --use-hwthread-cpus if nprocs <= 2, and bind to hwthread if told to do so. 2015-04-10 15:58:35 -07:00
Nathan Hjelm
75f210fdb9 opal/util/error: check for existing convertor for error range
This commit fixes a bug when opal_error_init is called with the same
values multiple times. If opal_error_init is called too many times it
will start failing with OPAL_ERR_OUT_OF_RESOURCE. To fix the problem
check if an existing convertor matching the requested one and return
that one instead.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-09 11:51:36 -06:00
Rolf vandeVaart
b7913836fc Initialize variables for safety 2015-04-09 12:58:55 -04:00
Rolf vandeVaart
d6d7184703 Enhance verbose message 2015-04-09 12:29:09 -04:00
Ralph Castain
acc2c7937c Thanks Nathan - decrement the counter to ensure singleton's startup correctly 2015-04-08 11:23:35 -07:00
Nathan Hjelm
9cd955badf opal: fix multiple bugs in MCA and opal
This commit fixes the following bugs:

 - opal_output_finalize did not properly set internal state. This
   caused problems when calling the sequence opal_output_init (),
   opal_output_finalize (), opal_output_init ().

 - opal_info support called mca_base_open () but never called the
   matching mca_base_close (). mca_base_open () and mca_base_close ()
   have been updated to use a open count instead of an open flag to
   allow mca_base_open to be called through multiple paths (as may be
   the case when MPI_T is in use).

 - orte_info support did not register opal variables. This can cause
   orte-info to not return opal variables.

 - opal_info, orte_info, and ompi_info support have been updated to
   use a register count.

 - When opening the dl framework the reference count was added to
   ensure the framework stuck around. The framework being closed
   prematurely was a bug in the MCA base that has since been
   corrected. The increment (and associated decrement) have been
   removed.

 - dl/dlopen did not set the value of
   mca_dl_dlopen_component.filename_suffixes_mca_storage on each call
   to register. Instead the value was set in the component
   structure. This caused the value to be lost when re-loading the
   component. Fixed by setting the default value in register.

 - Reset shmem framework state on close to avoid returning a stale
   component after reloading opal/shmem.

 - MCA base parameters were not properly deregistered when the MCA
   base was closed.

This commit may fix #374.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-07 19:13:20 -06:00
Ralph Castain
c32609b1c7 Bring over open-mpi/hwloc@f714f8d
linux: only use the device-tree on Power machines

It's available on ARM but the assumption that cpus' "reg" start at 0
is invalid.
We could make that work but the device-tree doesn't currently
bring anything better than sysfs on ARM, so don't bother for now.
2015-04-04 09:30:21 -07:00
Jeff Squyres
d825ec7cc7 usnic: fix the CQ-reading logic for -FI_EAGAIN 2015-04-02 15:56:50 -07:00
Jeff Squyres
5aabee2644 libfabric: a few fixes since 1.0rc3
Including a critical atomic initialization fix for the usnic provider.
2015-04-02 15:54:01 -07:00
Jeff Squyres
d6d8ab01e5 libfabric: the fi_log.h file moved 2015-04-01 14:43:07 -07:00
Jeff Squyres
26b3c48ccb usnic: update to API change in libfabric 2015-04-01 06:43:08 -07:00
Jeff Squyres
5e47eb81bf libfabric: update component configury for new libfabric test 2015-04-01 06:43:08 -07:00
Jeff Squyres
a89a5872c2 libfabric: update to official 1.0.0rc3 release
One change was made to the 1.0.0rc3 tarball: remove an errand
debugging printf that accidentally made its way into the tarball (but
isn't in git).
2015-04-01 06:43:08 -07:00
Mike Dubman
58d002098b Merge pull request #474 from elenash/master
Introduce -tune command line option to set env vars and mca params from ...
2015-04-01 08:23:34 +03:00
Nathan Hjelm
b6043ec459 Merge pull request #503 from hjelmn/vader_32bit_fix
btl/vader: fix fast box support for 32-bit architectures
2015-03-31 09:12:22 -06:00
Nathan Hjelm
17b80a987e btl/vader: fix fast box support for 32-bit architectures
On 32-bit architectures loads/stores of fast box headers may take
multiple instructions. This can lead to a data race between the
sender/receiver when reading/writing the sequence number. This can
lead to a situation where the receiver could process incomplete
data. To fix the issue this commit re-orders the fast box header to
put the sequence number and the tag in the same 32-bits to ensure they
are always loaded/stored together.

Fixes #473

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-30 16:28:16 -06:00
rhc54
bc016617a0 Merge pull request #501 from rhc54/topic/sec2
Support authentication across security domains
2015-03-30 09:59:43 -07:00
Ralph Castain
79b90a54b6 Remove stale and unused component 2015-03-30 09:56:06 -07:00
Ralph Castain
d07dc362d5 Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common. 2015-03-28 20:34:26 -07:00
Ralph Castain
b67b3619fc If we are using the default bindings, and one or more nodes are not setup to support binding, then don't error out - just don't bind.
Thanks to Annu Desari for pointing out the problem.
2015-03-28 08:20:24 -07:00
Ralph Castain
d2d02a1642 ckpt 2015-03-28 07:59:20 -07:00
Jeff Squyres
89e14f5ad6 usnic: fix comment typos 2015-03-27 17:21:53 -07:00
Jeff Squyres
2672d8d26b usnic: update comments explaining fi_av_insert() reaping
Asynchronous fi_av_insert()s are somewhat tricky; update the comments
explaining what is going on (and fix some old/stale comments).
2015-03-27 12:01:59 -07:00
Jeff Squyres
6da77ee940 usnic: only warn about each unreachable endpoint once
fi_av_insert() is invoked with a context containing each endpoint
USNIC_NUM_CHANNELS times.  If the address on that endpoint fails to
resolve / is unreachable / has some error, we'll therefore get
USNIC_NUM_CHANNELS error completions with that same endpoint.  We
therefore only want to warn about the unreachability of (and
OBJ_RELEASE) that endpoint the *first* time.

Fixes CSCut46822.
2015-03-27 12:01:59 -07:00
Jeff Squyres
506431d1b6 usnic: count fi_av_insert() completions properly
num_left represents the number of times we called fi_av_insert(), and
therefore it indicates the number of non-error completions that we
will receive.
2015-03-27 12:01:59 -07:00
Nathan Hjelm
6d1a41611f MCA/base: Detect overlapping variable names.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:05 -06:00
Nathan Hjelm
de1e7d58e1 Add support setting variables from project_name environment variables
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:05 -06:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Jeff Squyres
0c502d90cd hwloc README-ompi.txt: update for what we pulled from hwloc
Document what we pulled from the hwloc tree.
2015-03-27 06:49:42 -07:00
Brice Goglin
29ccbfd590 hwloc pci: fix bridge depth
It was setup in the PCI backend before filtering,
and partially updated after filtering in the core.
Only setup once correctly after filtering in the core.

(cherry picked from commit open-mpi/hwloc@9659653d24)

Conflicts:
	tests/hwloc/linux/40intel64-2g2n4c+pci.output
	tests/hwloc/xml/192em64t-12gr2n8c2t-distancegroups.xml
	tests/hwloc/xml/192em64t-24n8c2t-distancegroups.xml
	tests/hwloc/xml/192em64t-24n8c2t-nodistancegroups.xml
	tests/hwloc/xml/24em64t-2n6c2t-pci.xml
	tests/hwloc/xml/32em64t-2n8c2t-pci-normalio.xml
	tests/hwloc/xml/96em64t-4n4d3ca2co-pci.xml
	utils/hwloc/test-hwloc-compress-dir.input.tar.gz
	utils/hwloc/test-hwloc-compress-dir.output.tar.gz

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 06:49:39 -07:00
Brice Goglin
1905f35a1e hwloc: bitmap: fix a corner case in hwloc_bitmap_isincluded() with infinite sets
If super_set contains more allocated ulongs than sub_set,
we did not check the last ulongs.
We would return true instead of false when sub_set is
infinite while the last ulongs in super_set are not full.

This fixes tests/hwloc_bitmap_compare_inclusion on some platforms.

(cherry picked from commit open-mpi/hwloc@299e6e846f)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:44 -07:00
Brice Goglin
5c9157c547 hwloc: core: only update root->complete sets if insert succeeds
Otherwise we get spurious bits for crazy topologies such as 8em64t-2s2ca2c-buggynuma.output

Will make debug asserts easier.

(cherry picked from commit open-mpi/hwloc@546cd9330a)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:44 -07:00
Brice Goglin
dec01097f8 hwloc: groups: add complete sets when inserting distance/pci groups
Make sure we define complete cpuset/nodeset when we define groups' main cpuset/nodeset
during later insert of groups (for PCI hostbridges or distances).
Otherwise they may end up clearing child/parent complete sets which
suddenly become incoherent while they were fixed earlier.

Needed to fix allowed_nodeset meaning.

(cherry picked from commit open-mpi/hwloc@7c88d17add)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:33 -07:00
Brice Goglin
d6e415cd41 hwloc: AIX: Fix PU os_index
When looking for PUs inside R_MAXSDL rads, some AIX 6.1 releases
return one first rad without any PU.
AIX 6.1 00F63F144C00 does (on quad-power7).
AIX 6.1 00CBAAC24C00 doesn't (on 16x power6).

So we can't assume rad #x contains PU #x. But we already have the right
code to fill the cpuset from the rad, so use that to obtain the PU os_index
as well.

Cannot be used to obtain NUMA node os_index since there's no way to directly
retrieve NUMA nodes from rads (mempools seem unrelated). Just keep using #rad
for NUMA nodes os_index and document that convention when converting back in
set_membind().

Thanks to Hendryk Bockelmann and Erik Schnetter for helping debugging.

(cherry picked from commit open-mpi/hwloc@60006c7b88)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:33 -07:00
Brice Goglin
80140bbe7b hwloc: distances: when we fail to insert an intermediate group, don't try to group further above
Otherwise we'll have some NULL objects above, would be annoying.
No need to dig further, the distance matrix is likely buggy.

We still keep the inserted groups at this level (incomplete level)
because removing them is hard.

(cherry picked from commit open-mpi/hwloc@312a971ec9)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:33 -07:00
Brice Goglin
29c99156cf hwloc: pci: fix SR-IOV VF vendor/device names
Commit 626129d2818693e62b83c1cfa2ba6e058e5bed66 fixed the hwloc
device/vendor numbers obtained from libpciaccess.
But the corresponding names are still retrieved from pciaccess numbers,
so fix these numbers inside pciaccess structures before retrieving the names.

(cherry picked from commit open-mpi/hwloc@85ea6e4acc)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:32 -07:00
Brice Goglin
da164be0ef hwloc: error: point to the FAQ when displaying the big OS error message
(cherry picked from commit open-mpi/hwloc@b191f816f6)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:32 -07:00
Brice Goglin
3f96e7a271 hwloc: synthetic: Misc levels are not allowed in the synthetic description
Misc objects were used between system and machine in the past
but quickly got replaced with groups.

(cherry picked from commit open-mpi/hwloc@6c2aa6d1ea)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
050bb35feb hwloc: x86: use Group instead of Misc for unknown x2apic levels
Misc are reserved for annotating the topology, the core
doesn't like merging them. Group is more appropriate.

(cherry picked from commit open-mpi/hwloc@3c47649591)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
379c7b0d8b hwloc: x86: use ulong for cache sizes, uint won't be enough in the near future
(cherry picked from commit open-mpi/hwloc@ae82597773)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
6caf9edbea hwloc: hpux: improve hwloc_hpux_find_ldom() looking for NUMA node
hwloc_get_first_largest_obj_inside_cpuset() returns the largest/highest object,
but it could still have a child with the same cpuset.
So check children as well in case there's a matching NUMA node there.

(cherry picked from commit open-mpi/hwloc@57a1c4fbe4)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
fff1bb5dcd hwloc: core: reorder children in merge_useless_child() as well
When ignore_keep_structure is enabled, intermediate level can disappear
between parent and child, making the new child complete_cpuset smaller,
causing the child list to require a reorder just like in remove_ignored().

(cherry picked from commit open-mpi/hwloc@88afbe6b62)

Embed this related commit:
core: abstract out reorder_children(), needed when merging modifies the list of children
(cherry picked from commit open-mpi/hwloc@14db82d391)

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:16 -07:00
Brice Goglin
77978a846e hwloc: core: fix the merging of identical objects in presence of Misc objects
If object A contains B + I/O as children, we can "ignore" I/Os and still
try to merge A and B. We now do the same for Misc objects without cpusets
instead of I/Os.

This fixes a corner case when export/reimport to XML creates a slightly
different topology (making hwloc_insert_misc fail inside a Linux cgroup).

Thanks to Dave Love for reporting the problem.

Fixes #118

(cherry picked from commit open-mpi/hwloc@650371e115)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
5427b33caf hwloc: debug: fix an overzealous assertion about the parent cpuset vs its children
When I/O are attached under a PU, removing the children's cpusets from
the parent cpuset doesn't give 0, it gives the PU cpuset.
The assertion fails on single-pu machines with I/O when --merge is given,
only one PU remains with I/O under it.

But if we insert Misc by cpuset under PU, it gives 0 as expected.

Fix the assertion accordingly.

Thanks to Thomas Van Doren for reporting the issue.

(cherry picked from commit open-mpi/hwloc@45c94c336d)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
9b59d532fc hwloc: cpuid-x86: Fix duplicate asm labels in case of heavy inlining on x86-32
hwloc_x86_discover() calls hwloc_look_x86() twice, which calls hwloc_have_x86_cpuid().
If everything gets inlined, the asm label inside hwloc_have_x86_cpuid()
is duplicated.
Use a local label with f annotation in jumps to avoid the problem.

Thanks to Thomas Van Doren for reporting the issue (found with gcc -m32).

(cherry picked from commit open-mpi/hwloc@50e447f5bc)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
86a536ca58 hwloc: x86 and OSF: Don't forget to set NUMA node nodeset
x86: Not critical since BSDs that use this backend have no membind support,
but better fix it for uniformization.
(cherry picked from commit open-mpi/hwloc@a431361c7d)

OSF: Looks like nobody ever tried to play with memory binding on OSF/Tru64.
(cherry picked from commit open-mpi/hwloc@2d6c73356d)

Conflicts:
	NEWS

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
db5bc72496 hwloc: API: clearly state that os_index isn't unique while logical_index is
(cherry picked from commit open-mpi/hwloc@6c75302ab2)

Conflicts:
	opal/mca/hwloc/hwloc191/hwloc/include/hwloc.h

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:14:15 -07:00
Brice Goglin
a636790604 hwloc: opal/mca/hwloc/hwloc191/hwloc/NEWS update
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:13:41 -07:00
Brice Goglin
7c96aecfaf hwloc: errors: improve the advice to send hwloc-gather-topology files in the OS error message
(cherry picked from commit open-mpi/hwloc@f77aa01b3c)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:13:41 -07:00
Brice Goglin
d5f8c89527 hwloc: configure: fix the check for X11/Xutil.h
At least some solaris enforce the need to #include X11/Xlib.h first.

Thanks to Siegmar Gross for reporting the issue.

(cherry picked from commit open-mpi/hwloc@005a7e89b6)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:13:01 -07:00
Brice Goglin
50b035dddb hwloc: misc.h: Fix hwloc_strncasecmp() with some icc
tolower needs <ctype.h>

Thanks to Ralph Castain for reporting the failure.

(cherry picked from commit open-mpi/hwloc@038c372a58)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:10:57 -07:00
Brice Goglin
6764413aa3 hwloc: misc.h: Fix hwloc_strncasecmp() build under strict flags on BSD
strncasecmp() needs <strings.h>

Thanks to Pavan Balaji for reporting the failure.

(cherry picked from commit open-mpi/hwloc@37439c4801)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:10:46 -07:00
Brice Goglin
6b0011f138 hwloc: v1.9.1 released, doing 1.9.2rc1 now
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2015-03-27 05:06:15 -07:00
Jeff Squyres
a85edb8ad4 libfabric: update to Github libfabric 0d7daf720f04 2015-03-26 14:40:46 -07:00
Elena
90f5b2bb84 Introduce -tune command line option to set env vars and mca params from file 2015-03-26 18:33:53 +02:00
rhc54
2ff7575dde Merge pull request #497 from rhc54/topic/sec
Allow for different security domains.
2015-03-25 21:01:29 -07:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Nathan Hjelm
26f96c0049 asm/ia32: do not use the = constraint modifier for input/output
operands

Also added support for the xchg instruction. The instruction is
supported by ia32 and may benefit vader.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-25 13:42:27 -06:00
Ralph Castain
9dbc69df0f Stop an ugly infinite loop caused by continual re-opening of the opal if framework. 2015-03-24 17:50:14 -07:00
Rolf vandeVaart
dfb7e00ef5 Make sure context is still around when doing some other cleanup 2015-03-24 16:47:40 -04:00
Ralph Castain
ed5d10b816 Somehow slipped by - ensure we correctly count the cores 2015-03-19 17:56:18 -07:00
Ralph Castain
43a3baad5e Ensure we use the first compute node's topology for mapping
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.

Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.

Correctly count the number of available PUs under each object when given a cpuset

Fix the default binding settings, and correctly count PUs when no cpuset is given

Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Nathan Hjelm
ccba8ce856 Merge pull request #457 from hjelmn/mpit_fixes
mca/base: fix bugs in framework deregistration/re-registration
2015-03-18 08:37:49 -06:00
Ralph Castain
d7d8ae46ed We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info. 2015-03-17 06:10:20 -07:00
Mike Dubman
7640507438 Merge pull request #472 from miked-mellanox/topic/fix_compile_warn
btl/openib: fix compiler warning, by HalR
2015-03-13 14:06:07 +02:00
Jeff Squyres
4ab9e67832 hwloc external: portability updates
Change "test -a" to "&& test", and change foo="$bar" to foo=$bar.  No
substantive code changes.
2015-03-13 04:40:09 -07:00
Jeff Squyres
4d63c88ed1 hwloc external: whitespace cleanup, no code changes 2015-03-13 04:40:05 -07:00
Mike Dubman
00784ae3ba btl/openib: fix compiler warning, by HalR 2015-03-13 13:17:23 +02:00
Todd Kordenbrock
9350b06f7d btl-portals4: fix compiler warnings 2015-03-12 20:34:04 -05:00
Jeff Squyres
65a0e041ac dl: need to use LIBADD, not LIBS
When we use LIBADD for static libraries, the dependent libraries get
propagated properly.  For example, the dl/dlopen component will almost
certainly require the -ldl library; when using LIBS, that doesn't get
propagated elsewhere in the tree, but when using LIBADD, it does
(e.g., when linking opal_wrapper_compiler).
2015-03-12 15:01:14 -07:00
Ryan Grant
6f76984a3c Merge pull request #470 from tkordenbrock/topic/update-portals4-to-btl3
btl-portals4: implement the BTL 3.0 interface
2015-03-12 15:34:05 -06:00
Jeff Squyres
a1daa39425 libfabric: update to Github lifabric 90ac5a258418e
Update to latest upstream Github lifabric in order to fix some usnic
bugs.
2015-03-12 13:23:32 -07:00
Todd Kordenbrock
d1656347c8 btl-portals4: implement the BTL 3.0 interface 2015-03-12 14:19:44 -05:00
adrianreber
714d9aa67e Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart
Topic/orte cr continue like restart
2015-03-12 14:54:02 +01:00
Nathan Hjelm
fd78491768 Merge pull request #451 from elenash/master
fix: mca_base_env_var mca parameter is never handled if it's set from am...
2015-03-11 09:54:25 -06:00
Nathan Hjelm
ce6caab2a7 Merge pull request #463 from hjelmn/cuda_async
btl/openib: cuda: fix CUDA-aware support with async copy
2015-03-11 09:52:48 -06:00
Jeff Squyres
c61dd4d56f usnic: each err eq entry reports *1* completion
Actually, the return from fi_eq_readerr() only indicates a *single*
error completion (not err_entry.data completions).
2015-03-11 08:07:20 -07:00
Ralph Castain
2de5cd6e5f Ensure we don't install the libevent internal headers 2015-03-11 07:35:20 -07:00
Nathan Hjelm
395635f017 Merge pull request #461 from hjelmn/btl_openib_cleanup
btl/openib: remove derived btl segment type
2015-03-11 08:20:41 -06:00
Jeff Squyres
9c926e5e82 usnic: add more commments/explanation about error cases
If we really get a catastrophic error from a libfabric call, don't
bother trying to continue (because data has been corrupted and there's
nothing sane left to do).  Just call opal_btl_usnic_exit() (which
tries to call the PML error callback, but we're so early in the
module_init process that this likely hasn't been setup yet, so the job
will likely abort).
2015-03-11 07:16:28 -07:00
Jeff Squyres
51583789fb usnic: re-indent some show_help code
Nothing too substantial here, but two of the messages moved from
"libfabric API failed" to "internal error during init", just to be a
bit more descriptive.
2015-03-11 07:15:28 -07:00
Jeff Squyres
1b836d784c usnic: subtract number of errored insertions from loop count
When we get errors, the entry.data field tells us how many errors are
being reported.  So decrement the loop count variable by that much.

This fixes CSCut30441.
2015-03-11 07:13:10 -07:00
Adrian Reber
f45dd069bd FT: fix compilation using --with-ft (1/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Nathan Hjelm
b308afa8fd btl/openib: remove derived btl segment type
The derived segment type (btl_openib_segment_t) was intended to store
the registration info needed for put and get. In BTL 3.0 this is no
longer required. I intended to remove this type as part of
open-mpi/ompi@74f1af4548 .

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-10 14:41:15 -06:00
Nathan Hjelm
3d32dbd793 btl/openib: cuda: fix CUDA-aware support with async copy
This commit should resolve an issue seen with CUDA-aware support. The
problem came in with BTL 3.0. Before 3.0 the size of the copy was
stored in the incoming segment's des_remote_count field. This field
does not exist in BTL 3.0 so I stored the value in the
des_segment_count field. This caused problems with the cuda support
code. To fix the issue the endpoint pointer is now stored in the in
fragment's endpoint pointer which free's up the segment's des_cbdata
pointer for storing the transfer size.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-10 14:38:12 -06:00
Jeff Squyres
cbd99d5f60 libfabric: update to Github upstream 1b4bb2285b
Get a usnic bug fix.
2015-03-10 12:09:02 -07:00
Jeff Squyres
d97551bdb1 usnic: endpoint type hint moved to a sub-struct
Update to match new libfabric API/structure change.
2015-03-10 09:47:41 -07:00
Jeff Squyres
1a1be2efa0 libfabric: update to Github upstream 7095f3dc 2015-03-10 09:47:40 -07:00
Jeff Squyres
afec1454f5 usnic: only setup the connectivity checker if we have modules
If we ended up with no modules (e.g., all usnic devices were
excluded), there was a race condition in that the connectivity agent
could tear down its local socket before one or more of the local
clients saw it.  Therefore, the local clients would timeout waiting
for the socket to appear.

So move the connectivity checker init later in the bootstrapping
process (it *must* be setup before module_init()), and have it only
invoked if we actually ended up with one or more modules.
2015-03-10 07:43:20 -07:00
Jeff Squyres
06accb721c usnic: ensure to free all resources if no usnic BTLs found
If all usnic devices are excluded, then we need to ensure the error
path includes freeing the filter.

This was Coverity CID 1288085
2015-03-10 07:43:20 -07:00
Jeff Squyres
8fef4e865f dl dlopen: fix use-after-free
Re-structure the loop looking for duplicates a little so that we only
have a single free of the string that happens regardless of whether we
found a duplicate or not.

This was Coverity CID 1288090
2015-03-10 07:43:20 -07:00
Jeff Squyres
3efb5f56ae dl dlopen: ensure dirs is not NULL
opal_argv_split() may have returned NULL.

This was Coverity CID 1288088
2015-03-10 07:43:20 -07:00
Jeff Squyres
86968dcdda dl dlopen: fix resource leak
closedir() was one block higher than it should have been.

This was Coverity CID 1288087.
2015-03-10 07:43:20 -07:00
Jeff Squyres
546ad3f060 dl dlopen: free resources upon error
Ensure to take the right path out upon errors (that will free any
pending resources).

This was Coverity CID 1288086
2015-03-10 07:43:19 -07:00
Rolf vandeVaart
49b5eb6c91 Fix missing initialization of variable 2015-03-10 10:33:27 -04:00
Gilles Gouaillardet
dc0bc756dc iof/base: fix misc memory leak
as reported by Coverity with CID 1196732
2015-03-10 14:37:53 +09:00
Gilles Gouaillardet
f7f7fa73dd opal_cr: fix incorrect NULL assignment
as reported by Coverity with CID 1288084
2015-03-10 12:06:57 +09:00
Jeff Squyres
4b2cba46f4 usnic: fix bootstrap error paths
Fix previously-unfinished error paths during startup/bootstrapping.
Instead of just blindly continuing on when an fi_* function call
fails, opal_show_help and skip that device.

Also, only check the usnic config minimums once.  They're VIC-wide and
won't change on a per-device basis -- we only need to check them once.

Fixes CSCut19179.
2015-03-09 16:57:41 -07:00
Nathan Hjelm
005c6022e2 mca/base: fix bugs in framework deregistration/re-registration
There were a number of bugs in the framework/variable code that
affected deregistration:

 - Frameworks could be erroneously closed if seperately registered and
   opened then subsequently closed. This was a bug in the original
   design which only reference counted opens but not
   registrations. This would cause undefined behavior if
   MPI_T_finalize actually calls ompi_info_close_components as
   intended. Now both registrations and opens are reference counted
   and frameworks/components are not torn down until the matching
   number of close calls have been made.

 - group_find_by_name did not pass the invalidok flags down
   to mca_base_var_group_get_internal correctly.

 - Group deregistration caused the group to be completely reset. This
   does not match the behavior required by MPI_T as it could reduce
   the number of variables/subgroups in a group.

This commit also updates MPI_T_finalize to call
ompi_info_close_components as originally intended.

Closes #374

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-09 16:52:53 -06:00
Jeff Squyres
914880a368 libtldl: remove libltdl from the tree
The libltdl interface has been completely replaced by the OPAL DL
framework (i.e., the opal_dl interface).

Fixes open-mpi/ompi#311
2015-03-09 08:18:14 -07:00
Jeff Squyres
0a2767a5d3 opal lt_interface: remove in favor of opal_dl interface 2015-03-09 08:18:13 -07:00
Jeff Squyres
1995f6beba cuda: convert to opal_dl interface 2015-03-09 08:18:13 -07:00
Jeff Squyres
a9d86129c6 mca base: convert to opal_dl interface 2015-03-09 08:16:55 -07:00
Jeff Squyres
39364d315c libltdl: dl component based on libltdl
Works on any system that libltdl supports and has ltdl.h and libltdl
available.
2015-03-09 08:16:55 -07:00
Jeff Squyres
7d340c0c26 dlopen: simple dl component based on POSIX dlopen
Works on systems with dlopen (e.g., Linux and OS X).  It requires
dlfcn.h and libdl, which many systems have installed by default.
2015-03-09 08:16:55 -07:00
Jeff Squyres
e81c070ef0 dl framework: new dynamic loader framework
Embedding libltdl without the use of Libtool bootstrapping has
proven... difficult.  Instead, create a new simple "dl" framework.  It
only provides 4 functions:

- open a DSO (very similar to lt_dlopenadvise())
- lookup a symbol in a previously-opened DSO (very similar to lt_dlsym())
- close a previously-opened DSO (very similar to lt_dlclose())
- iterate over all files in a directory (very similar to ld_dlforeachfile())

There will be follow-on commits with a simple dlopen-based component
(nowhere near as complete/functional as libltdl, but good enough for
Linux and OS X), and a libltdl-based component for all other
platforms.

The intent is that the dlopen-based component can be built by default
in almost all cases.  But if libltdl is available, that component will
be built.  End result: we still get DSO-based functionality by default
in (almost?) all cases.  Without embedding libltdl.  Which is what we
want.
2015-03-09 08:16:55 -07:00
Gilles Gouaillardet
1746e23f11 opal/cr: fix misc memory leak and error case
as reported by Coverity with CIDs 71858 and 710640
2015-03-09 19:28:52 +09:00
Gilles Gouaillardet
2789d782ab timer/linux: fix insecure data handling
as reported by Coverity with CID 1269923
2015-03-09 19:14:56 +09:00
Gilles Gouaillardet
307a33991c opal/dss_unpack: correctly handle some bozo cases
as reported by Coverity with CIDs 1269987 and 1269988
2015-03-09 14:24:00 +09:00
Rolf vandeVaart
237c268a09 Add extra check during cleanup to make sure we really should clean up the CUDA resources. 2015-03-06 13:12:19 -05:00
Elena
737f06dd68 fix: mca_base_env_var mca parameter is never handled if it's set from amca conf file 2015-03-06 12:12:26 +02:00
Gilles Gouaillardet
da90ed4483 opal/compress: remove misc dead code
as reported by Coverity with CIDs 71856,
1269714, 1269715, 1269717, 1269718, 1269723, 1269724
2015-03-06 15:34:08 +09:00
Gilles Gouaillardet
5b41de4886 opal/dss: fix memory allocation in opal_dss_copy_null
as reported by Coverity with CID 71915
2015-03-06 15:15:11 +09:00
Gilles Gouaillardet
521317341d btl/openib: fix a double free
as reported by Coverity with CID 1287033
2015-03-06 14:58:11 +09:00
George Bosilca
5905bd2ff9 As the pStack pointer changes during the execution it is not
safe to free it. Instead, keep a copy of the original location
to be able to free the correct memory region.
2015-03-05 13:36:21 -05:00
George Bosilca
75479c0f17 Fix some typos. 2015-03-05 12:59:58 -05:00
George Bosilca
3c489ea5d5 Use malloc instead of alloca as some users could use very large arrays
of datatypes. Thanks Bogdan Sataric for the bug report.
2015-03-05 12:24:28 -05:00
Alina Sklarevich
1560ed9761 initialize opal_common_verbs_want_fork_support to -1.
This way, if the call to ibv_fork_init() fails, the job will still
continue.
2015-03-05 14:29:09 +02:00
Gilles Gouaillardet
e1cc931e1b btl/tcp: silence CID 710616 2015-03-05 14:20:08 +09:00
Gilles Gouaillardet
852dbafd51 mca/base: fix misc memory leaks
as reported by Coverity with CIDs 710628, 1196713 and 1269855
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
134c866aa9 btl/openib: fix misc memory leaks
as reported by Coverity with CIDs 1269848, 1269852 and 1269862
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
d1b2f043ff fix misc memory leaks
as already reported by Coverity with CIDs
71818, 71819, 72250, 715767, 1196749 and 1274002
2015-03-05 13:58:05 +09:00
Nathan Hjelm
ac82d1a6be Per-RFC: remove the --disable-smp-locks configure option
Use of this configuration option can cause crashing, hanging, and
(worse) incorrect results when btl/sm, btl/scif, or btl/vader are
in use. We discussed this at the January 2015 developers meeting
and it was decided to remove the option entirely. This commit does
just that. All usage of OPAL_WANT_SMP_LOCKS has been removed.
2015-03-04 11:31:43 -07:00
Mike Dubman
171d674ca4 Merge pull request #441 from open-mpi/revert-438-topic/use_opal_common_verbs_want_fork_support
Revert "create the opal_common_verbs_want_fork_support parameter."
2015-03-04 10:13:00 +02:00
Gilles Gouaillardet
f43b5b46ee btl/openib: fix heterogeneous support
Thanks @bosilca for the pointer
2015-03-04 13:53:05 +09:00
Gilles Gouaillardet
81b0444ef2 btl/openib: fix comment syntax, no code change
and silence gcc warning about nested comments
2015-03-04 11:24:14 +09:00
Rolf vandeVaart
edf58eb549 Implement CUDA-aware workaround while fork support worked out 2015-03-03 09:50:01 -05:00
Mike Dubman
98503b56e0 Revert "create the opal_common_verbs_want_fork_support parameter." 2015-03-03 14:28:31 +02:00
Mike Dubman
cc7caf699e Merge pull request #438 from alinask/topic/use_opal_common_verbs_want_fork_support
create the opal_common_verbs_want_fork_support parameter.
2015-03-02 07:43:37 +02:00
Gilles Gouaillardet
04a0438b56 sec/munge: send NULL terminated strings 2015-03-02 12:19:46 +09:00