1
1
openmpi/opal/mca
Mark Allen bdd92a7a64 -cpu-set as a constraint rather than as a binding
The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
  % mpirun -np 2 --report-bindings --use-hwthread-cpus \
      --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.

Example output which seems fine to me:
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that
> Conflicting directives for mapping policy are causing the policy
> to be redefined:
>   New policy:   RANK_FILE
>   Prior policy:  BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
    NULL != orte_rankfile
but also errors out if
    (NULL != opal_hwloc_base_cpu_list &&
    !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
    NULL != opal_hwloc_base_cpu_list
to
    OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy.  Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has
    /* did the user provide a slot list? */
    if (NULL != opal_hwloc_base_cpu_list) {
        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
    }
where it used to (long ago) only do that if
    !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.

That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
    opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search().  That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
    available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes.  And it used to skip objs that
had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start->cpuset that it was otherwise processing.  With that change in
place the first command works:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the
available mask:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're
comprehensive.  I only traced through enough code to fix the two specific
bugs I was dealing with.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2019-04-12 15:33:56 -04:00
..
allocator mca: Dynamic components link against project lib 2017-08-24 11:56:16 -04:00
backtrace stacktrace: Add flexibility in stacktrace ouptut 2017-01-26 11:55:32 -06:00
base opal: clean up init/finalize 2018-12-18 14:37:04 -07:00
btl Merge pull request #6541 from bwbarrett/bugfix/enotconn 2019-03-28 22:42:52 -04:00
common ompi/oshmem/spml/ucx: use lockfree array to optimize spml_ucx_progress/delete oshmem_barrier in shmem_ctx_destroy 2019-03-21 23:01:45 +02:00
compress Correctly name the package so the flags get set 2019-02-08 15:11:48 -08:00
crs remove some dead crs components 2018-10-17 10:29:00 -06:00
dl Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
event event/external: fix version requirement 2018-10-29 10:12:56 +09:00
hwloc -cpu-set as a constraint rather than as a binding 2019-04-12 15:33:56 -04:00
if Merge pull request #5786 from jsquyres/pr/string-madness 2018-10-04 16:12:46 -05:00
installdirs Handle asprintf errors with opal_asprintf wrapper 2018-10-08 16:43:53 -07:00
memchecker mca: Dynamic components link against project lib 2017-08-24 11:56:16 -04:00
memcpy Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
memory opal: Disable memory patcher component on MacOS 2018-10-02 13:35:15 -04:00
mpool mpool: add new base module type "basic" 2019-01-14 15:48:46 -07:00
patcher misc: compiler warning fixes 2018-09-15 06:04:13 -07:00
pmix pmix/pmix4x: refresh to the latest PMIx 2019-04-09 14:03:00 +09:00
pstat opal: convert from strncpy() -> opal_string_copy() 2018-09-27 11:56:18 -07:00
rcache opal: convert from strncpy() -> opal_string_copy() 2018-09-27 11:56:18 -07:00
reachable opal: convert from strncpy() -> opal_string_copy() 2018-09-27 11:56:18 -07:00
shmem opal: convert from strncpy() -> opal_string_copy() 2018-09-27 11:56:18 -07:00
timer opal/sys: Introduce OPAL_HAVE_SYS_TIMER_GET_FREQ macro 2019-04-04 11:48:02 +09:00
Makefile.am Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
mca.h mca/base: enforce max string lengths 2018-09-05 08:42:00 -07:00