1
1
Граф коммитов

5886 Коммитов

Автор SHA1 Сообщение Дата
Joshua Hursey
0e8a97c598 Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion.
* The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
   30 seconds before sending SIGTERM then another 30 seconds before sending
   SIGKILL to remaining processes. This usually happens on an abnormal
   termination. Sometimes the user wants to delay the cleanup to give the
   system time to write out corefile or run other diagnostics.
 * The problem is that child processes may be completing while ORTE is
   in this loop. The SIGCHLD will interrupt the `sleep` system call.
   Without the loop the sleep could effectively be ignored in this case.
   - Sleep returns the amount of time remaining to sleep. If it was
     interrupted by a signal then it is a positive number less than or
     equal to the parameter passed to it. If it slept the whole time
     then it returns 0.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-10-02 14:41:34 -04:00
Ralph Castain
7444b32494
Remove stale references to orte_oob_base.ev_base
The oob is restricted to the main event base

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-29 18:52:03 -07:00
Ralph Castain
5e9d07d4e0
Merge pull request #7010 from rhc54/topic/oob
Cleanup stale code in ORTE/OOB
2019-09-25 14:23:43 -07:00
Ralph Castain
41eb41c3f2
Cleanup stale code in ORTE/OOB
Remove code for multiple OOB progress threads as it is an optimization
nobody uses. Also turns out to have a race condition that can cause
segfault on finalize, so maybe good that nobody is using it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-25 13:27:41 -07:00
Austen Lauria
77144689f0 Add 'orte_' prefix to noop_mpir_breakpoint_ptr.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-09-18 17:44:40 -04:00
Austen Lauria
067adfa417 Conform MPIR_Breakpoint to MPIR standard.
- Fix MPIR_Breakpoint standard violation by returning void
  instead of a void*.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-09-17 15:19:00 -04:00
Ralph Castain
06d188ebf3
Be a little less restrictive on interface requirements
If both types of interfaces are enabled, don't error out if one of them
isn't able to open listener sockets. Only one interface family may be
available on some machines, but someone might want to build the code to
run more generally.

Refs https://github.com/pmix/prrte/pull/249

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-06 08:27:05 -07:00
Ralph Castain
373e816b37
Ensure buffer_unload leaves the buffer in a clean state
Silence a warning in orte/nidmap

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-04 08:32:27 -07:00
Jeff Squyres
197beb30d5 orterun: remove duplicate code
https://github.com/open-mpi/ompi/pull/6895 fixed the code in orterun.c
to allow running as root if both OMPI_ALLOW_RUN_AS_ROOT and
OMPI_ALLOW_RUN_AS_ROOT_CONFIRM env vars are set.  However, this
env-var-checking code already exists in
orte_submit.c:orte_submit_init() -- it looks like the
geteuid()/getenv()-checking code here in orterun is now duplicate
code.

So let's just get rid of the duplicate code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-08-19 15:36:59 -04:00
Simon Byrne
9c8671c48b Run-as-root env vars in orterun.c
I found that I needed to apply the same change as #5597 to orterun.c for the environment variables to work correctly.

Signed-off-by: Simon Byrne <simonbyrne@gmail.com>
2019-08-12 18:52:52 -07:00
Ralph Castain
bd5a1765ee
Fix typos
Provide a missing header and paren

Thanks to @zerothi for the assistance

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-08-07 05:47:12 -07:00
Ralph Castain
ea0dfc3218
Allow individual jobs to set their map/rank/bind policies
Override the defaults when provided. Ignore LSF binding file if user
overrides by specifying a policy.

Fixes #6631

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-08-06 07:48:58 -07:00
William Zhang
4ebb37a26c opal/util: Change opal/util/if.h macro IF_NAMESIZE to OPAL_IF_NAMESIZE
Due to IF_NAMESIZE being a reused and conditionally defined macro,
issues could arise from macro mismatches. In particular, in cases where
opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32.
If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This
can cause a mismatch when using the same macro on a system. Thus
different parts of the code can have differring ideas on the size of a
structure containing a char name[IF_NAMESIZE]. To avoid this error case,
we avoid reusing the IF_NAMESIZE macro and instead define our own as
OPAL_IF_NAMESIZE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-07-29 21:24:39 +00:00
Austen Lauria
00106f5ac9 Try to prevent the compiler from optimizing out MPIR_Breakpoint().
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-07-24 09:16:54 -04:00
Gilles Gouaillardet
24f2961156 ess/base: fix a misc memory leak in orte_ess_base_proc_binding()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-07-12 17:01:04 +09:00
Jeff Squyres
b738fa295d
Merge pull request #6796 from orivej/fix-tcp_component_close-segfault
Fix oob_tcp tcp_component_close segfault with active listeners
2019-07-08 18:13:52 -04:00
Orivej Desh
78b7e342bd Fix oob_tcp tcp_component_close segfault with active listeners
oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the
event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp
will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to
component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

Signed-off-by: Orivej Desh <orivej@gmx.fr>
2019-07-04 20:45:47 +00:00
Orivej Desh
de522545c0 Fix ORTE_FORCED_TERMINATE message
The format string expects to see the file and line before the error text and code.

Signed-off-by: Orivej Desh <orivej@gmx.fr>
2019-06-28 00:46:55 +00:00
Jordan Hayes
7dad74032e plm_slurm_module: adjust for new SLURM CLI options
SLURM 19 discontinued the use of --cpu_bind (and changed it to
--cpu-bind).  There's no easy way to test at run time which one is
accepted, so set the environment variable SLURM_CPU_BIND to "none",
which should do the same thing as the srun CLI parameter.

Signed-off-by: Jordan Hayes <jhayes@ucr.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-05-15 14:52:33 -07:00
Eisuke Kawashima
027f74bc39 orterun.1in: Fix typo and other minor updates
Signed-off-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-04-22 09:35:57 -04:00
Mark Allen
bf3980d70c fix hang in -np 3 --rank-by core
The following command hangs:
  % mpirun --rank-by core -np 3 --report-bindings hostname
because of a loop where i is supposed to cycle through an
array of size num_objs, but for some reason it's only
looking at node->num_procs entries.

I changed the counter so it stays in the loop (stays on this
node) until it makes a full cycle through the array of objects
without any assignments then it ends the loop so it can go to
the next node.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2019-04-12 15:34:02 -04:00
Mark Allen
bdd92a7a64 -cpu-set as a constraint rather than as a binding
The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
  % mpirun -np 2 --report-bindings --use-hwthread-cpus \
      --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.

Example output which seems fine to me:
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that
> Conflicting directives for mapping policy are causing the policy
> to be redefined:
>   New policy:   RANK_FILE
>   Prior policy:  BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
    NULL != orte_rankfile
but also errors out if
    (NULL != opal_hwloc_base_cpu_list &&
    !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
    NULL != opal_hwloc_base_cpu_list
to
    OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy.  Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has
    /* did the user provide a slot list? */
    if (NULL != opal_hwloc_base_cpu_list) {
        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
    }
where it used to (long ago) only do that if
    !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.

That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
    opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search().  That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
    available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes.  And it used to skip objs that
had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start->cpuset that it was otherwise processing.  With that change in
place the first command works:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the
available mask:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're
comprehensive.  I only traced through enough code to fix the two specific
bugs I was dealing with.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2019-04-12 15:33:56 -04:00
James Clark
20f5840cbb Add a compilation flag that adds unwind info to all files that are present in the stack starting from MPI_Init.
This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.

Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Co-authored-by: Jeff Squyres <jsquyres@cisco.com>
2019-03-27 14:32:15 +00:00
Ralph Castain
8174286530 Sync nidmap to PRRTE to fix hetero topo problem
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-03-26 08:24:09 -07:00
Ralph Castain
dfbc14430d
Merge pull request #6440 from ggouaillardet/topic/yield_when_idle
schizo/ompi: correctly handle the yield_when_idle option
2019-03-25 12:17:34 -07:00
Ralph Castain
5aa775c02e Correctly set the byte_object size
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-03-18 14:29:37 -07:00
Ralph Castain
aed06e68b9 Protect against NULL node pointer
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-03-16 01:31:28 -07:00
Ralph Castain
2794ae43b3 Update nidmap
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-03-16 01:20:15 -07:00
Ralph Castain
35a597178d Ensure that nodes are always used in order provided
If a user provides a list of nodes to use via -host or -hostfile, then
ensure that the ranks are placed according to that order. Also fix a bug
where the number of slots on a node was incorrectly computed for
localhost if the name given didn't exactly match the return from
get_hostname.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-03-15 12:58:10 -07:00
Gilles Gouaillardet
cc97c0f611 schizo/ompi: correctly handle the yield_when_idle option
in schizo/ompi, sets the new OMPI_MCA_mpi_oversubscribe environment
variable according to the node oversubscription state.

This MCA parameter is used to set the default value of the
mpi_yield_when_idle parameter.

This two steps tango is needed so the mpi_yield_when_idle setting
is always honored when set in a config file.

Refs. open-mpi/ompi#6433

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-28 09:53:29 +09:00
Ralph Castain
60961ceb41 Fix cross-mpirun connect/accept operations
Ensure we publish all the info required to be returned to the other
mpirun when executing this operation. We need to know the daemon (and
its URI) that is hosting each of the other procs so we can do a direct
modex operation and retrieve their connection info.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-26 17:08:48 -08:00
Ralph Castain
2f15379171 Remove stale singularity/schizo component
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-20 17:38:24 -08:00
Ralph Castain
2da5651869 Restore orted hnp_uri cmd line option
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-18 13:24:03 -08:00
Ralph Castain
e56ee1e06a Remove the remaining cruft from dual oob transport
* When we moved to allowing dual rml/oob transports, we added a bunch of
stuff that is no longer needed. Remove it so as to simplify the
messaging system.

* Fix the routed/radix component so it correctly returns the parent's
vpid

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:12:31 -08:00
Gilles Gouaillardet
b80210c36a orte/util: strdup() in orte_util_decode_nidmap() since opal_argv_free() will free()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-08 11:11:25 -08:00
Gilles Gouaillardet
78152aec85 orte/nidmap: do not use compressed when uninitialized
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-08 11:11:25 -08:00
Ralph Castain
1ee6c185f7 Remove stale code
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:25 -08:00
Ralph Castain
01e9aca40f Add topology support for hetero systems
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:25 -08:00
Gilles Gouaillardet
88ac05fca6 misc fixes
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-08 11:11:25 -08:00
Ralph Castain
125d236173 Move from the use of regex to compression
We've been fighting the battle of trying to create a regex generator and
parser that can handle arbitrary hostname schemes - without long-term
success. The worst of it is that there is no way of checking to see if
the computed regex is correct short of parsing it and doing a
character-by-character comparison with the original string. Ugh...there
has to be a better solution.

One option is to investigate using 3rd-party regex libraries as
those are coming from communities whose sole focus is resolving that
problem. However, someone would need to spend the time to investigate
it, and we'd have to find a license-friendly implementation.

Another option is to quit beating our heads against the wall and just
compress the information. It won't be as much of a reduction, but we
also won't keep hitting scenarios where things break. In this case, it
seems that "perfection" is definitely the enemy of "good enough".

This PR implements the compression option while retaining the
possibility of people adding regex-generating components. The
compression code used in ORTE is consolidated into the opal/compress
framework. That framework currently held bzip and gzip components for
use in compressing checkpoint files - since we no longer support C/R, I
have .opal_ignore'd those components.

However, I have left the original framework APIs alone in case someone
ever decides to redo C/R. The APIs of interest here are added to the
framework - specifically, the "compress_block" and "decompress_block"
functions. I then moved the ORTE zlib compression code into a new
component in this framework.

Unfortunately, the framework currently is a single-select one - i.e.,
only one active component at a time. Since I .opal_ignore'd the other
two and made the priority of zlib high, this isn't a problem. However,
if someone wants to re-enable bzip/gzip or add another component, they
might need to transition opal/compress to a multi-select framework.

Included changes:

* Consolidate the compression code into the opal/compress framework

* Move the ORTE zlib compression code into a new opal/compress/zlib
  component

* Ignore the bzip and gzip components in opal/compress framework

* Add a "compress_base_limit" MCA param to set the threshold above which
  we compress data - defaults to 4096 bytes

* Delete stale brucks and rcd components from orte/grpcomm framework

* Delete the orte/regx framework

* Update the launch system to use opal/compress instead of string regex

* Provide a default module if no zlib is available

* Fix some misc multi-node issues

* Properly generate the nidmap in response to a "connection warmup"
  message so the remote daemon knows the children it needs to launch.

* Remove stale references to orte_node_regex

* opal_byte_object_t's are not OPAL objects - properly release allocated
  memory.

* Set the topology

* Currently only handling homogeneous case

* Update the compress framework files to conform

* Consolidate open/close into one "frame" file. Ensure we open/close the
  framework

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:14 -08:00
Ralph Castain
fcbc7ea298
Merge pull request #6306 from karasevb/regx_host_ordering_fix
regex: fixed host ordering for different prefixes
2019-02-08 11:09:55 -08:00
Ralph Castain
8794077520 Remove stale rml/ofi component
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-30 12:41:50 -08:00
Boris Karasev
46e38b9193 regx: fixed the order of hosts for ranges with different prefixes
Example:
For the list of hosts `a01,b00,a00` a regex is generated:
`a[2:1.0],b[2:0]`, where `a`-hosts prefixes moved to the begining,
it breaks the hosts ordering.
This commit fixes regex for that case to `a[2:1],b[2:0],a[2:0]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 15:06:30 +06:00
Boris Karasev
1967e41a71 regx/reverse: fixed adding an empty range for no numerical hostnames
Example:
For the nodelist `jjss,jjss0000001,jjss0000003,jjss0000002` a regular
expression was `jjss[0:0],jjss[7:1,3,2]` that led to incorrect unpacking
the first host as `jjs0`. This commit fixes an adding empty range for
not numeric hostnames. Here is the fixed regex for this exapmle:
`jjss,jjss[7:1,3,2]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 09:41:00 +06:00
Boris Karasev
d1ad90f47e regx/test: update regex test
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 09:40:59 +06:00
Jason Williams
98d81a5f7a Adding changes for issue #6303 for branch master.
Signed-off-by: Jason Williams <uberlinuxguy@gmail.com>
2019-01-26 10:49:47 -05:00
Howard Pritchard
b46e15535a orte: shutdown
be more careful about closing framewworks as part of
orte_finalize.  Owing to recent restructuring in opal to handle
finalize in a more general fashion, the missing framework
closes were causing meltdowns as the mca vars subsystem
was cleaning itself up.

This problem was recently reported by Siegmar:

https://www.mail-archive.com/users@lists.open-mpi.org//msg32946.html

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-01-04 11:04:13 -07:00
Ralph Castain
b19e5edf76 Correct parsing of ppr directives
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2019-01-02 09:03:13 -08:00
Jeff Squyres
f96c04244d odls_base_default_fns.c: put the free() in the right place
Fixes CID 1441826.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-12-22 06:40:05 -08:00
Ralph Castain
d728380741 If job is fully described, there will be no ppn string to unpack
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-12-17 16:13:55 -08:00