1
1

159 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
e02c39385a Merge branch 'master' into topic/modex 2017-08-22 20:06:35 -07:00
Gilles Gouaillardet
565b516dae hwloc/base: fix opal_output() usage
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-08-23 10:24:47 +09:00
Ralph Castain
d80b0c7990 If the HWLOC shared memory system is unable to connect, then fallback to providing the topology via XML. Do not automatically provide the XML to every process as that defeats the purpose of the shared memory system. Instead, use PMIx_Query_info_nb to get the info from the server when required.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-22 18:12:26 -07:00
Brice Goglin
2d242ab9f0 hwloc/shmem: don't abort on failure to load from shmem
Adopting can fail if the server-side hole isn't available on the client.

We can fallback to other ways to load the topology.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 19:57:38 +02:00
Brice Goglin
ffd209fc2e hwloc/shmem: dump /proc/self/maps if failed to find a hole and verbosity > 4
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 19:57:38 +02:00
Ralph Castain
41df973359 Add diagnostics for hwloc get_topology
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-16 14:21:27 -07:00
Ralph Castain
d1b7c3d8d5 Silence some compile-time warnings. Update scripts now that AUTHORS is gone
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-04 20:08:31 -07:00
Ralph Castain
f39ce67982 Merge pull request #3951 from rhc54/topic/hwloc2
Update to hwloc 2.0.0a
2017-08-01 15:18:31 -06:00
Gilles Gouaillardet
825116044e hwloc/base: fix info message for opal_hwloc_base_binding_policy
if np > 2, the default binding is now "numa"

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-28 11:17:15 +09:00
Ralph Castain
6ebaed8c01 Restore support for user-provided cpulist
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-25 23:51:21 -07:00
Ralph Castain
7a83fdb9bb Update to hwloc 2.0.0a with shmem support.
Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-25 20:26:22 -07:00
Ralph Castain
96f07aebfa Restore binding support
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-25 18:44:44 -07:00
Gilles Gouaillardet
60aa9cfcb6 hwloc: add support for hwloc v2 API
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-20 17:39:44 +09:00
Gilles Gouaillardet
9f29f3bff4 hwloc: since WHOLE_SYSTEM is no more used, remove useless
checks related to offline and disallowed elements

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-20 17:39:21 +09:00
Gilles Gouaillardet
1a34224948 hwloc: do not set the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-20 17:39:16 +09:00
Ralph Castain
2753f53e6d Detect that we have a mix of BE/LE in the system, provide a warning that OMPI doesn't currently support this environment, and error out
Fixes #2817

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-03 15:47:05 -07:00
Nathan Hjelm
ffd8ee2dfd opal: use opal_list_t convienience macros
This commit cleans up code in opal to use OPAL_LIST_FOREACH(_SAFE),
OPAL_LIST_DESTRUCT, and OPAL_LIST_RELEASE.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-06-20 12:37:12 -06:00
Gilles Gouaillardet
7e01be60d9 hwloc: add support for hwloc v1.5
hwloc v1.5 does not support HWLOC_OBJ_OSDEV_COPROC
nor hwloc_topology_dup(), so for this version :
- do not search for coprocessors
- do not try hwloc_topology_dup(), note this is not
  used anywhere in the code base

Thanks Jeff for helping with the wording

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-03-03 09:39:24 +09:00
Jeff Squyres
fec519a793 hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-02-28 07:48:42 -08:00
Ralph Castain
0c8609ca16 Update to newest PMIx master (includes configuration cleanups). Silence trivial Coverity warning in hwloc base.
Cleanup a race condition segfault during finalize by ensuring the PMIx progress thread is stopped prior to starting to tear down the messaging components

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-14 15:14:00 -08:00
Ralph Castain
ef86707fbe Deprecate the --slot-list paramaeter in favor of --cpu-list. Remove the --cpu-set param (mark it as deprecated) and use --cpu-list instead as it was confusing having the two params. The --cpu-list param defines the cpus to be used by procs of this job, and the binding policy will be overlayed on top of it.
Note: since the discovered cpus are filtered against this list, #slots will be set to the #cpus in the list if no slot values are given in a -host or -hostname specification.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-24 13:33:22 -08:00
Ralph Castain
08c76a42bb Update to latest PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Protect users of hwloc membind functions

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Update PMIx to include NULL string protection

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Update to PMIx master to include key overwrite protection

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-30 12:44:47 -08:00
Ralph Castain
fe68f23099 Only instantiate the HWLOC topology in an MPI process if it actually will be used.
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:

* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
  instead of the topology itself, if available, thus avoiding instantiating the topology

* openib BTL. This uses the distance matrix. At present, I haven't developed a method
  for replacing that reference. Thus, this component will instantiate the topology

* usnic BTL. Uses the distance matrix.

* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
  the topology

* ess base functions. If a process is direct launched and not bound at launch, this
  code attempts to bind it. Thus, procs in this scenario will instantiate the
  topology

Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.

Fix pernode binding policy

Properly handle the unbound case

Correct pointer usage

Do not free static error messages!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-29 10:33:29 -08:00
Ralph Castain
3a2d6a5ab6 Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-28 09:14:26 -08:00
Ralph Castain
585540bcee Reduce the flood of warnings due to uninitialized variables, mismatched types, and unused things to a more bearable trickle
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-14 16:33:50 -08:00
Gilles Gouaillardet
45732fd764 hwloc/base: fix a memory leak in buffer_cleanup()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-01 14:24:29 +09:00
Gilles Gouaillardet
cd2b5a82ed hwloc: plug memory leak
as reported by Coverity with CID 1270441
2016-09-07 10:08:44 +09:00
Karol Mroz
e1c64e6e59 opal: standardize on max hostname length
Define OPAL_MAXHOSTNAMELEN to be either:
  (MAXHOSTNAMELEN + 1) or
  (limits.h:HOST_NAME_MAX + 1) or
  (255 + 1)

For pmix code, define above using PMIX_MAXHOSTNAMELEN.

Fixup opal layer to use the new max.

Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-24 08:19:47 +02:00
Gilles Gouaillardet
d529951206 hwloc: correctly count cores with at least one allowed PU
when SMT is enabled, a core must be counted as long as one of its hwthread is allowed

Thanks Ben Menadue for the report.

This fixes a regression from open-mpi/ompi@6d149554a7
2016-01-29 11:54:34 +09:00
Gilles Gouaillardet
6d149554a7 hwloc: have opal_hwloc_base_get_pu search for HWLOC_OBJ_PU when mpirun is invoked with --use-hwthread-cpus
Fixes open-mpi/ompi#1247
2016-01-26 18:10:33 +09:00
Tim Mattox
958de82471 hwloc_base_util.c: Remove newly unused variable 'i'. 2016-01-14 16:35:47 -05:00
Tim Mattox
f2d4a8d266 Replace a bit counting loop with a call to an efficient population count routine 2016-01-12 10:48:56 -05:00
Gilles Gouaillardet
975b6fd51b hwloc: do not count not allowed cores in df_search_cores 2015-09-17 13:10:34 +09:00
Nathan Hjelm
899bf548a2 opal/hwloc: fix topology detection when socket is above numa
The OPAL_PROC_ON_* definitions have been changed from values to
flags. This should not cause any problems as these values were already
used as flags throughout the code base. Note, there will be a
difference between localities produced by the new code and the
old. For example, if a machine does not have a level-3 but two cores
share a level-1 or level-2 cache cache the level-3 bit will not be set
in the locality and OPAL_PROC_ON_LOCAL_L3CACHE will return 0. Before
this change it would have returned 1.

In addition the OPAL_PROC_ON_LOCAL_* macros have been simplified.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-09-10 14:17:45 -06:00
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
Ralph Castain
ed93154e43 Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line. 2015-07-07 12:52:16 -07:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
033418f62a Correct a typo that reversed the default binding pattern. Ensure we default bind to hwthread if user specified --use-hwthread-cpus if nprocs <= 2, and bind to hwthread if told to do so. 2015-04-10 15:58:35 -07:00
Ralph Castain
ed5d10b816 Somehow slipped by - ensure we correctly count the cores 2015-03-19 17:56:18 -07:00
Ralph Castain
43a3baad5e Ensure we use the first compute node's topology for mapping
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.

Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.

Correctly count the number of available PUs under each object when given a cpuset

Fix the default binding settings, and correctly count PUs when no cpuset is given

Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Nysal Jan K.A
881a9f3d58 Fix cache line size detection on power
Due to the nature of the cache architecture on power,
we don't export coherency_line_size for L2 in sysfs.
If we are unable to get the L2 cache line size, try L1.

See open-mpi/ompi#383 for more information.
2015-02-25 17:26:28 +05:30
Howard Pritchard
c9e81b54fb Merge pull request #412 from hppritcha/topic/owner_files
add owner files to opa/ompi/orte mca directories
2015-02-23 09:48:20 -07:00
Gilles Gouaillardet
8d44d7086a hwloc/base: fix misc memory leaks
as reported by Coverity with CIDs 710636 and 1270441
2015-02-23 13:55:04 +09:00
Howard Pritchard
bf89131f9e add owner files to opa/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Gilles Gouaillardet
55948f2a6d hwloc: fix misc memory leak
as reported by Coverity with CID 1270441
(previous commit open-mpi/ompi@c25185f3a9 did not fully fix that one)
2015-02-17 14:06:15 +09:00
Gilles Gouaillardet
c25185f3a9 opal/hwloc: fix misc memory leaks
as reported by Coverity with CIDS 710631-710638, 1196705,
1196716, 1196717, 1196752, 1196753
2015-02-16 12:23:37 +09:00
Gilles Gouaillardet
8dd77c692e opal/hwloc: fix misc bugs
as reported by Coverity with CIDs 72224, 703566,
1196821, 1196842, 1196657 and 1196658
2015-02-16 11:59:48 +09:00
Ralph Castain
123fdd603f If we are using hwthread cpus, then default to binding there, letting the user override to whatever they want 2014-12-19 08:04:28 -08:00
Ralph Castain
0630680f36 Two cleanups required for transfer to 1.8.4:
* Use %d format for the topo signature as some systems apparently have problems with %u
* Use correct variable in show_help message
2014-12-12 17:23:32 -08:00
Ralph Castain
18d9fdfd8d Restore full topology comparison to support inventory monitoring 2014-12-09 01:33:06 -08:00