It was setup in the PCI backend before filtering,
and partially updated after filtering in the core.
Only setup once correctly after filtering in the core.
(cherry picked from commit open-mpi/hwloc@9659653d24)
Conflicts:
tests/hwloc/linux/40intel64-2g2n4c+pci.output
tests/hwloc/xml/192em64t-12gr2n8c2t-distancegroups.xml
tests/hwloc/xml/192em64t-24n8c2t-distancegroups.xml
tests/hwloc/xml/192em64t-24n8c2t-nodistancegroups.xml
tests/hwloc/xml/24em64t-2n6c2t-pci.xml
tests/hwloc/xml/32em64t-2n8c2t-pci-normalio.xml
tests/hwloc/xml/96em64t-4n4d3ca2co-pci.xml
utils/hwloc/test-hwloc-compress-dir.input.tar.gz
utils/hwloc/test-hwloc-compress-dir.output.tar.gz
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
If super_set contains more allocated ulongs than sub_set,
we did not check the last ulongs.
We would return true instead of false when sub_set is
infinite while the last ulongs in super_set are not full.
This fixes tests/hwloc_bitmap_compare_inclusion on some platforms.
(cherry picked from commit open-mpi/hwloc@299e6e846f)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Otherwise we get spurious bits for crazy topologies such as 8em64t-2s2ca2c-buggynuma.output
Will make debug asserts easier.
(cherry picked from commit open-mpi/hwloc@546cd9330a)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Make sure we define complete cpuset/nodeset when we define groups' main cpuset/nodeset
during later insert of groups (for PCI hostbridges or distances).
Otherwise they may end up clearing child/parent complete sets which
suddenly become incoherent while they were fixed earlier.
Needed to fix allowed_nodeset meaning.
(cherry picked from commit open-mpi/hwloc@7c88d17add)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
When looking for PUs inside R_MAXSDL rads, some AIX 6.1 releases
return one first rad without any PU.
AIX 6.1 00F63F144C00 does (on quad-power7).
AIX 6.1 00CBAAC24C00 doesn't (on 16x power6).
So we can't assume rad #x contains PU #x. But we already have the right
code to fill the cpuset from the rad, so use that to obtain the PU os_index
as well.
Cannot be used to obtain NUMA node os_index since there's no way to directly
retrieve NUMA nodes from rads (mempools seem unrelated). Just keep using #rad
for NUMA nodes os_index and document that convention when converting back in
set_membind().
Thanks to Hendryk Bockelmann and Erik Schnetter for helping debugging.
(cherry picked from commit open-mpi/hwloc@60006c7b88)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Otherwise we'll have some NULL objects above, would be annoying.
No need to dig further, the distance matrix is likely buggy.
We still keep the inserted groups at this level (incomplete level)
because removing them is hard.
(cherry picked from commit open-mpi/hwloc@312a971ec9)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Commit 626129d2818693e62b83c1cfa2ba6e058e5bed66 fixed the hwloc
device/vendor numbers obtained from libpciaccess.
But the corresponding names are still retrieved from pciaccess numbers,
so fix these numbers inside pciaccess structures before retrieving the names.
(cherry picked from commit open-mpi/hwloc@85ea6e4acc)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Misc objects were used between system and machine in the past
but quickly got replaced with groups.
(cherry picked from commit open-mpi/hwloc@6c2aa6d1ea)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Misc are reserved for annotating the topology, the core
doesn't like merging them. Group is more appropriate.
(cherry picked from commit open-mpi/hwloc@3c47649591)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
hwloc_get_first_largest_obj_inside_cpuset() returns the largest/highest object,
but it could still have a child with the same cpuset.
So check children as well in case there's a matching NUMA node there.
(cherry picked from commit open-mpi/hwloc@57a1c4fbe4)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
When ignore_keep_structure is enabled, intermediate level can disappear
between parent and child, making the new child complete_cpuset smaller,
causing the child list to require a reorder just like in remove_ignored().
(cherry picked from commit open-mpi/hwloc@88afbe6b62)
Embed this related commit:
core: abstract out reorder_children(), needed when merging modifies the list of children
(cherry picked from commit open-mpi/hwloc@14db82d391)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
If object A contains B + I/O as children, we can "ignore" I/Os and still
try to merge A and B. We now do the same for Misc objects without cpusets
instead of I/Os.
This fixes a corner case when export/reimport to XML creates a slightly
different topology (making hwloc_insert_misc fail inside a Linux cgroup).
Thanks to Dave Love for reporting the problem.
Fixes#118
(cherry picked from commit open-mpi/hwloc@650371e115)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
When I/O are attached under a PU, removing the children's cpusets from
the parent cpuset doesn't give 0, it gives the PU cpuset.
The assertion fails on single-pu machines with I/O when --merge is given,
only one PU remains with I/O under it.
But if we insert Misc by cpuset under PU, it gives 0 as expected.
Fix the assertion accordingly.
Thanks to Thomas Van Doren for reporting the issue.
(cherry picked from commit open-mpi/hwloc@45c94c336d)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
hwloc_x86_discover() calls hwloc_look_x86() twice, which calls hwloc_have_x86_cpuid().
If everything gets inlined, the asm label inside hwloc_have_x86_cpuid()
is duplicated.
Use a local label with f annotation in jumps to avoid the problem.
Thanks to Thomas Van Doren for reporting the issue (found with gcc -m32).
(cherry picked from commit open-mpi/hwloc@50e447f5bc)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
x86: Not critical since BSDs that use this backend have no membind support,
but better fix it for uniformization.
(cherry picked from commit open-mpi/hwloc@a431361c7d)
OSF: Looks like nobody ever tried to play with memory binding on OSF/Tru64.
(cherry picked from commit open-mpi/hwloc@2d6c73356d)
Conflicts:
NEWS
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
At least some solaris enforce the need to #include X11/Xlib.h first.
Thanks to Siegmar Gross for reporting the issue.
(cherry picked from commit open-mpi/hwloc@005a7e89b6)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
tolower needs <ctype.h>
Thanks to Ralph Castain for reporting the failure.
(cherry picked from commit open-mpi/hwloc@038c372a58)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
strncasecmp() needs <strings.h>
Thanks to Pavan Balaji for reporting the failure.
(cherry picked from commit open-mpi/hwloc@37439c4801)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.
Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.
Correctly count the number of available PUs under each object when given a cpuset
Fix the default binding settings, and correctly count PUs when no cpuset is given
Ensure the binding policy gets set in all cases
When we use LIBADD for static libraries, the dependent libraries get
propagated properly. For example, the dl/dlopen component will almost
certainly require the -ldl library; when using LIBS, that doesn't get
propagated elsewhere in the tree, but when using LIBADD, it does
(e.g., when linking opal_wrapper_compiler).
If we really get a catastrophic error from a libfabric call, don't
bother trying to continue (because data has been corrupted and there's
nothing sane left to do). Just call opal_btl_usnic_exit() (which
tries to call the PML error callback, but we're so early in the
module_init process that this likely hasn't been setup yet, so the job
will likely abort).
Nothing too substantial here, but two of the messages moved from
"libfabric API failed" to "internal error during init", just to be a
bit more descriptive.
When we get errors, the entry.data field tells us how many errors are
being reported. So decrement the loop count variable by that much.
This fixes CSCut30441.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
The derived segment type (btl_openib_segment_t) was intended to store
the registration info needed for put and get. In BTL 3.0 this is no
longer required. I intended to remove this type as part of
open-mpi/ompi@74f1af4548 .
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit should resolve an issue seen with CUDA-aware support. The
problem came in with BTL 3.0. Before 3.0 the size of the copy was
stored in the incoming segment's des_remote_count field. This field
does not exist in BTL 3.0 so I stored the value in the
des_segment_count field. This caused problems with the cuda support
code. To fix the issue the endpoint pointer is now stored in the in
fragment's endpoint pointer which free's up the segment's des_cbdata
pointer for storing the transfer size.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If we ended up with no modules (e.g., all usnic devices were
excluded), there was a race condition in that the connectivity agent
could tear down its local socket before one or more of the local
clients saw it. Therefore, the local clients would timeout waiting
for the socket to appear.
So move the connectivity checker init later in the bootstrapping
process (it *must* be setup before module_init()), and have it only
invoked if we actually ended up with one or more modules.
Re-structure the loop looking for duplicates a little so that we only
have a single free of the string that happens regardless of whether we
found a duplicate or not.
This was Coverity CID 1288090
Fix previously-unfinished error paths during startup/bootstrapping.
Instead of just blindly continuing on when an fi_* function call
fails, opal_show_help and skip that device.
Also, only check the usnic config minimums once. They're VIC-wide and
won't change on a per-device basis -- we only need to check them once.
Fixes CSCut19179.
There were a number of bugs in the framework/variable code that
affected deregistration:
- Frameworks could be erroneously closed if seperately registered and
opened then subsequently closed. This was a bug in the original
design which only reference counted opens but not
registrations. This would cause undefined behavior if
MPI_T_finalize actually calls ompi_info_close_components as
intended. Now both registrations and opens are reference counted
and frameworks/components are not torn down until the matching
number of close calls have been made.
- group_find_by_name did not pass the invalidok flags down
to mca_base_var_group_get_internal correctly.
- Group deregistration caused the group to be completely reset. This
does not match the behavior required by MPI_T as it could reduce
the number of variables/subgroups in a group.
This commit also updates MPI_T_finalize to call
ompi_info_close_components as originally intended.
Closes#374
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Embedding libltdl without the use of Libtool bootstrapping has
proven... difficult. Instead, create a new simple "dl" framework. It
only provides 4 functions:
- open a DSO (very similar to lt_dlopenadvise())
- lookup a symbol in a previously-opened DSO (very similar to lt_dlsym())
- close a previously-opened DSO (very similar to lt_dlclose())
- iterate over all files in a directory (very similar to ld_dlforeachfile())
There will be follow-on commits with a simple dlopen-based component
(nowhere near as complete/functional as libltdl, but good enough for
Linux and OS X), and a libltdl-based component for all other
platforms.
The intent is that the dlopen-based component can be built by default
in almost all cases. But if libltdl is available, that component will
be built. End result: we still get DSO-based functionality by default
in (almost?) all cases. Without embedding libltdl. Which is what we
want.
call the opal_common_verbs_mca_register function to make sure that
opal_common_verbs_want_fork_support mca parameter is created and therefore
can be used to control the fork support.
Both opal_shmem_base_select() and
opal_shmem_base_best_runnable_component_name() and were calling
opal_shmem_base_runtime_query(), which would do component selection
(and closing of losing components) twice.
Put protection in opal_shmem_base_runtime_query() to return the cached
results the second time. Additionally, make
opal_shmem_base_runtime_query() "own" the cached results
(vs. opal_shmem_base_select).
This allows the opal_shmem_base_module_t to be properly cast to an
mca_base_module_t.
(this commit is the rationale for the previous shmem C99 .member
initialization commit)
Also include two other minor changes:
1. More C99-style member initialization in the component struct
1. Fix the BTL module member initialization to not be redundant
Due to the nature of the cache architecture on power,
we don't export coherency_line_size for L2 in sysfs.
If we are unable to get the L2 cache line size, try L1.
See open-mpi/ompi#383 for more information.
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)
This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.
Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and thereofre it failed).
Use of the old ompi_free_list_t and ompi_free_list_item_t is
deprecated. These classes will be removed in a future commit.
This commit updates the entire code base to use opal_free_list_t and
opal_free_list_item_t.
Notes:
OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ())
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:
OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st
I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
Some BTLs do not require local registration for some rdma
transactions. For example: inline put on openib, fma put on ugni. This
commit adds code to expose the local registration thresholds to BTL
users. Optimized code can take advantage of this information to
improve rdma performance.
@ggouaillardet identified that HAVE_ALIAS_ATTRIBUTE was not properly
being defined in the embedded libfabric. This is because the
embedded configury missed the test for it (i.e., the real configure.ac
for libfabric always defines HAVE_ALIAS_ATTRIBUTE to 0 or 1 -- we
didn't emulate that properly here in libfabric's configure.m4).
Also, fix some grammar and properly escape another AC_MSG_CHECKING
message in libfabric's configure.m4.
When configured --with-devel-headers, there's now 2 "osd.h" header
files in libfabric (in different dirs). Automake's "install" target
didn't like this, and errored out.
Since embedding libfabric is a temporary measure, just avoid the
problem by not installing any libfabric headers.
Add the functions that changed between BTL 2.0 and 3.0 into compat.h
and compat.c:
* module.btl_prepare_src: the signature and body of this method
changed between 2.0 and 3.0. However, the functions that this
method calls did *not* need to change, so they are copied over
wholesale (with the exception that they no longer accept the unused
`registration` parameter).
* module.btl_prepare_dst: this method does not exist in BTL 3.0.
* module.btl_put: the signature and body of this method changed
between 2.0 and 3.0.
The send inline optimization uses the btl_sendi function to achieve lower
latency and higher message rates. Before this commit BTLs were allowed to
assume the descriptor was non-NULL and were expected to return a valid
descriptor if the send could not be completed using btl_sendi. This
behavior was fine until the usage of btl_sendi was changed in ob1. This
commit allows the caller to specify NULL for the descriptor. The affected
btls have been updated to handle this case.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds an interface for btl's to export support for 64-bit atomic
operations on integers. BTL's that can support atomic operations should
implement these functions and set the appropriate btl_flags and btl_atomic_flags.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The old BTL interface provided support for RDMA through the use of
the btl_prepare_src and btl_prepare_dst functions. These functions were
expected to prepare as much of the user buffer as possible for the RDMA
operation and return a descriptor. The descriptor contained segment
information on the prepared region. The btl user could then pass the
RDMA segment information to a remote peer. Once the peer received that
information it then packed it into a similar descriptor on the other
side that could then be passed into a single btl_put or btl_get
operation.
Changes:
- Added functions to register and deregister memory regions with the
btl. If no registration is needed a btl should set these function
pointers to NULL. These function take over for btl_prepare_src/dst
and btl_free for RDMA operations. The caller should specify the
maximum permissions needed on the memory.
- Changed the function signatures for both btl_put and btl_get. In
place of a prepared descriptor the caller should provide the source
and destination addresses and registration handles as well as a
new callback function. The callback will be provided with the local
address and registration handle, callback context, callback data, and
status. See mca_btl_base_rdma_completion_fn_t in btl.h.
- Added a new btl constraint: MCA_BTL_REG_HANDLE_MAX_SIZE. This
value specifies the maximum size of any btl's registration handle.
- Removed the btl_prepare_dst function. This reflects the fact that
RDMA operations no longer depend on "prepared" descriptors.
- Removed the btl_seg_size member. There is no need to btl's to
subclass the mca_btl_base_segment_t class anymore.
- Expose the btl's put/get limitations with new struct members:
btl_put_limit, btl_put_alignment, btl_get_limit, btl_get_alignment.
- Remove the mca_mpool_base_registration_t argument from the btl_prepare_src
function. The argument was intended to support RDMA operations and is no
longer necessary.
- Remove des_remote/des_remote_count from the mca_btl_base_descriptor_t
structure. This structure member was originally used to specify the remote
segment for RDMA operations. Since the new btl interface no longer uses
desriptors for RDMA this member no longer has a purpose. In addition
to removing these members the local segment structure fields have been
renamed to from des_local/des_local_count to des_segments/des_segment_count.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per feedback from rhc, manually set the base_ptr member
of the opal_buffer_t variable to NULL prior to calling
OBJ_RELEASE. A similar feature of opal_dss.load also
exists so likewise reset the base_ptr to NULL prior to
invoking it.
Hopefully the opal_buffer_t struct does not change
frequently.
Minor cleanups to reduce output when pmix_base_verbose
mca paramater is set.
usnic_fls() can actually return 0, leading us to incorrectly free() a
buffer instead of OMPI_FREE_LIST_RETURN_MT'ing it.
So add an explicit bool in the struct that tracks whether the buffer
came from malloc or a freelist.
This was CID 1269660.
Coverity alerted us to the fact that there are places where
the synonym_for param is hard-coded to -1 when calling
register_variable(). It would be a coding error if synonym_for==-1
and (flags & MCA_BASE_VAR_FLAG_SYNONYM)>0, so let's add that to the
debug-only check at the top of the function.
This was CID 993717.
Remove use of the Cray PMI KVS - which is designed for a lighweight
MPI that exchanges only a minimimal amount of connection info
(about 128 bytes per rank) - within cray/pmix. Use Cray PMI
collective extensions instead.
This is the first of several steps to accelerate launch of
Open MPI on Cray systems using either native aprun or nativized
slurm.
in this code and it ended up changing the logic that is used to set up eager RDMA.
Rather than setting up eager RDMA with a high priority message, it did it the other
way around. For some reason, CUDA-aware support did not like this. So, basically,
restore the logic to the way it was prior to the refactoring. The refactoring did not
intend to change this. Lightly reviewed by hjelmn.
Per open-mpi/ompi#381, convert the specific intialization of opal_pmix
to use the generic "= { 0 }" initializer. This form can be used to
initialize any type when the intent is just to zero out / assign
*some* value.
Embedding libfabric is a temporary measure; I'm removing some warning
notifications so that the output isn't so cluttered (we're getting
the real warnings fixed upstream, but the OMPI community doesn't
really care/need to see the warnings in the meantime).
This commit adds an MCA variable to select Portals4 logical
addressing, populates the logical-to-physical mapping table and
initializes the NI in this mode.
For static builds, we need to also set
<framework>_<component>_WRAPPER_EXTRA_LIBS so that the wrappers know
what other libraries to add to link executables.
Ensure to count *this* process when checking for how many VFs we need
on the local server.
(cherry picked from commit 386c01934e98cb8dcb48ff648ecdfb0c8677baa9)
There was a mismatch between the structure for mca_rcache_vma_t and
the OBJ_CLASS_INSTANCE. One was opal_list_item_t and the other was
ompi_free_list_item_t. The super class in the structure looks like it
is the correct one. Changed the superclass in OBJ_CLASS_INSTANCE to
match.
If there are not enough resources (e.g., low VFs), we can end up
calling finalize_one_channel() on the same channel multiple times. So
ensure to NULL out fields that we have freed already so that we do not
try to free them a second time.
Fixes CSCus26648.
Fix the ordering so that we obtain the usnic netmask information
*before* we do the filtering based on CIDR-specified networks.
Also requires upstream Github libfabric commit 3976745.
Fixes CSCus22495.