The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
% mpirun -np 2 --report-bindings --use-hwthread-cpus \
--bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.
Example output which seems fine to me:
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname
The first command above errors that
> Conflicting directives for mapping policy are causing the policy
> to be redefined:
> New policy: RANK_FILE
> Prior policy: BYHWTHREAD
The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.
But the check has been expanded to not just check
NULL != orte_rankfile
but also errors out if
(NULL != opal_hwloc_base_cpu_list &&
!OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.
For now I've changed the
NULL != opal_hwloc_base_cpu_list
to
OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy. Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.
Another place similar logic is used is hwloc_base_frame.c where it has
/* did the user provide a slot list? */
if (NULL != opal_hwloc_base_cpu_list) {
OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
}
where it used to (long ago) only do that if
!OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.
That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search(). That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes. And it used to skip objs that
had hwloc_bitmap_iszero(available).
So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start->cpuset that it was otherwise processing. With that change in
place the first command works:
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
The other command uses a different path though that still ignored the
available mask:
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
I think the above changes are improvements, but I don't feel like they're
comprehensive. I only traced through enough code to fix the two specific
bugs I was dealing with.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
The new proc group is created from the "world_group" based on the
ranks mapping which can be directly taken from proc_name->vpid.
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
... to avoid using an architecture name macro in
`opal/mca/timer/linux/timer_linux_component.c`.
The function name `opal_sys_timer_freq` is also changed for
consistency with `opal_sys_timer_get_cycles`.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
... in the case of `OPAL_GCC_INLINE_ASSEMBLY == 0`
In this case, `OPAL_HAVE_SYS_TIMER_GET_CYCLES` should be 0 because
the `opal_sys_timer_get_cycles` function is not defined.
The history:
1. Before 8d4175ad89, `OPAL_HAVE_SYS_TIMER_GET_CYCLES` was 0.
2. In 8d4175ad89, adf92d6237, adf92d6237, and c62ce1593a,
`OPAL_HAVE_SYS_TIMER_GET_CYCLES` was changed to 1 by introducing
`opal/asm/base/*.asm`.
3. In ebce88b7ad, `opal/asm/base/*.asm` were removed.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
In case of using a btl_put in ob1, the handle of the locally registered
memory is sent with a PUT control message. In the current master code
the sent handle is necessary the handle in the frag but if the handle
has been successfully registered in the request, the frag structure does
not have any valid handle and all fragments use the request one.
I suggest to check if the handle in the fragment is valid and if not to
send the handle from the request.
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
In the case the btl_get fails Ob1 tries to fallback on btl_put first but
the return code was ignored. So the code fell back on both btl_put and
btl_send.
Signed-off-by: Brelle Emmanuel <emmanuel.brelle@atos.net>
In fint_2_int.h there are some conversion macros for logicals. It has
one path for OMPI_SIZEOF_FORTRAN_LOGICAL != SIZEOF_INT where a new array
would be allocated and the conversions then might expand to
c_array[i] = (array[i] == 0 ? 0 : 1)
and another path for OMPI_SIZEOF_FORTRAN_LOGICAL == SIZEOF_INT where it
does things "in place", so the same conversion there would just be
array[i] = (array[i] == 0 ? 0 : 1)
The problem is some of the logical arrays being converted are INPUT
arguments. And it's possible for some compilers to even put the argument
in read-only memory so the above "in place" conversion SEGV's. A
testcase I have used
call MPI_CART_SUB(oldcomm, (/.true.,.false./), newcomm, ierr)
and gfortran put the second arg in read-only mem.
In cart_sub_f.c you can trace the ompi_fortran_logical_t *remain_dims arg.
remain_dims[] is for input only, but the file uses
OMPI_LOGICAL_ARRAY_NAME_DECL(remain_dims);
OMPI_ARRAY_LOGICAL_2_INT(remain_dims, ndims);
PMPI_Cart_sub(..., OMPI_LOGICAL_ARRAY_NAME_CONVERT(remain_dims), ...);
OMPI_ARRAY_INT_2_LOGICAL(remain_dims, ndims);
to convert it to c-ints make a C call then restore it to Fortran logicals
before returning.
It's not always wrong to convert purely in-place, eg cart_get_f.c has
a periods[] that's exclusively for OUTPUT and it would be fine with the
macros as they were. But I still say the macros are invalid because they
don't distinguish whether they're being used on INPUT or OUTPUT args and
thus they can't be used in a way that's legal for both cases.
It might be possible to fix the macros by adding more of them so that
cart_create_f.c and cart_get_f.c would use different macros that give
more context. But my fix here is just to turn off the first block and
make all paths run as if OMPI_SIZEOF_FORTRAN_LOGICAL != SIZEOF_INT.
The main macros that get enlarged by this change are
define OMPI_ARRAY_LOGICAL_2_INT_ALLOC : mallocs now
define OMPI_ARRAY_LOGICAL_2_INT : also mallocs now
But these are only used in 4 places, three of which are the purpose of
this checkin, to avoid the former in-place expansion of an INPUT arg:
cart_create_f.c
cart_map_f.c
cart_sub_f.c
and one of which is an OUPUT arg that was fine and that gets
unnecessarily expanded into a separate array by this checkin.
cart_get_f.c
So I think an unnecessary malloc in cart_get_f.c is the only downside
to this change, where the logicals array argument could have been used
and converted in place.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Update provided by Gilles Gouaillardet to keep the in-place option
if OMPI_FORTRAN_VALUE_TRUE == 1 where no conversion is needed.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This is mostly based off recent UCX additions to their patcher:
https://github.com/openucx/ucx/pull/2703
They added triggers for
* mmap when (flags & MAP_FIXED) && (addr != NULL)
* shmat when (shmflg & SHM_REMAP) && (shmaddr != NULL)
Beyond that I noticed they already had a trigger for
* madvise when (advice == MADV_FREE)
that we didn't so I added that.
And the other main thing is we didn't really have shmat/shmdt
active for some systems because we only had a path for
syscall(SYS_shmdt, ) but we needed to also have a path for
syscall(SYS_ipc, IPCOP_shmdt, ) and same for shmat.
Signed-off-by: Mark Allen <markalle@us.ibm.com>
Avoid printing an error message about ENOTCONN return codes from
getpeername() when handling an incoming connection request. At
this point in the receive state machine, the remote process has
been verified to be a valid OMPI instance. In all-to-all startup
at 4k rank scale, we're seeing this error message when the remote
side drops the connection because it realizes it's the "loser"
in the connection race. We were already doing all the right things,
other than printing a scary error message. So skip the error
message and call it good.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This is not fixing any issue, it is simply preventing a sefault if the
communicator creation has not happened as expected. Thus, this code path
should never really be hit in a correct MPI application with a valid
communicator creation support.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Jeff Squyres <jsquyres@cisco.com>
free the component mpool in mca_btl_vader_component_close()
and after freeing soem objects that depend on it such as
mca_btl_vader_component.vader_frags_user
Thanks Christoph Niethammer for reporting this.
Refs. open-mpi/ompi#6524
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>