4443 Commits

Author SHA1 Message Date
Ralph Castain
7b138ec6d9
Update Slurm launch support
Assign all CPUs on the node to the daemon

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 7bac7eed6ef423e47fe980b4c32eae36b8e1d4cb)
2020-12-15 08:26:33 -08:00
Ralph Castain
09e9fe0178
Fix the verbose output in ess base
Only get the locality string and output binding message when requested

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-11-25 14:24:47 -08:00
Raghu Raja
38011d3402
Merge pull request #8204 from jsquyres/pr/v4.1.x/fix-warnings
v4.1.x: fix many warnings
2020-11-23 17:28:44 -08:00
Jeff Squyres
f9e2bf7c6b Fix many compiler warnings
Fixes #8195. This PR doesn't fix all the warnings from #8195, but
fixes many of them (e.g., I didn't get the "string might be truncated"
warnings on my Mac).

This is an adaptation of 14aa5fae3c42f14a1c6a259dede93d5ca7ecb82c from
master; it drops some things that aren't relevant here on the v4.1.x
branch and adds a few more warnings fixes that are relevant here on
v4.1.x that aren't relevant on master.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry-picked from 14aa5fae3c42f14a1c6a259dede93d5ca7ecb82c)
2020-11-23 12:43:33 -08:00
Ralph Castain
4993a091a0
Correctly skip the "mpirun" node when launching orted on it
Mark the node as "unusable" so it does not get included when computing
number of procs for the case where the user does not specify -np.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-11-13 23:10:42 -08:00
Ralph Castain
ec3589389a Correct computation of relative locality
Ensure we always pass the cpuset as well as the locality string for each
proc. Correct the mtl/ofi component's computation of relative locality
as the function being called expects to be given the locality string of
each proc, not the cpuset. If the locality string of the current proc
isn't available, then use the cpuset if available and compute the
locality before trying to compute relative localities of our peers.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-11-10 13:07:55 -08:00
Joshua Hursey
7c67fb36b3
Fix cpu-list for non-uniform nodes
* The `--cpu-list` argument restricts the hwloc topology. When that topology is
   sent from the remote daemon back to the HNP it is packed as XML. This packing
   results in the loss of the applied cpu-list. Mapping will then be using
   the full topology instead of the restricted topology when mapping the
   processes. When the launch command reaches the remote node it will not
   be congruent with the remote node's view of the topology leading to a
   launch failure especially if the HNP and remote node have differing
   topologies.
   - The solution is to make sure that the HNP re-applies the cpu-list to
     the incoming topology from the remote node. This way the HNP and the
     remote node are using the same restricted topology.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-10-22 20:10:49 -05:00
Wei Zhang
48cca8ab83 oob/tcp: fix a race condition on stop_thread pipe
This patch fixes a race condition that caused the main thread
to sometimes write to a closed pipe.

The following are details:

Currently, during shutdown, the main thread does the following:

1. sets listen_thread_action to false.
2. writes to the stop_thread pipe to tell the listener thread to exit.

The listener thread does the following:

1. calls select() to listen to a set of file descriptors with
   a maximum wait time.

2. checks listen_thread_action. If it is false, closes the stop_thread
   pipe.

The main thread writes to a closed pipe when the following sequence occurs:

1. the listener's call to select() returns because the maximum wait time was reached.

2. the main thread sets listen_thread_action to false.

3. the listener thread checks listen_thread_action and closes the pipe.

4. the main thread writes to the closed pipe.

This patch addresses the issue by having the main thread close the pipe
after the listener thread has been joined. This way, the main thread
both writes to and closes the pipe, so there is no conflict.
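
The ordering described above can be sketched as follows. This is a minimal illustration with POSIX threads, not the actual oob/tcp code: the names `stop_pipe`, `listener_active`, `listener_main`, and `run_and_stop_listener` are hypothetical, and the 100 ms timeout is an arbitrary choice.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical stand-ins for the stop_thread pipe and listen_thread_action. */
static int stop_pipe[2] = { -1, -1 };
static atomic_bool listener_active = true;

static void *listener_main(void *arg)
{
    (void)arg;
    while (atomic_load(&listener_active)) {
        fd_set readfds;
        struct timeval timeout = { .tv_sec = 0, .tv_usec = 100000 };

        FD_ZERO(&readfds);
        FD_SET(stop_pipe[0], &readfds);
        /* Returns on a stop request or when the timeout expires. */
        (void)select(stop_pipe[0] + 1, &readfds, NULL, NULL, &timeout);
    }
    /* The listener only exits; it never closes the pipe. */
    return NULL;
}

/* Starts the listener, then shuts it down. Returns 0 on success, -1 on error. */
static int run_and_stop_listener(void)
{
    pthread_t tid;
    char wake = 0;
    int rc = 0;

    if (pipe(stop_pipe) != 0) {
        return -1;
    }
    assert(stop_pipe[0] >= 0 && stop_pipe[1] >= 0);
    atomic_store(&listener_active, true);
    if (pthread_create(&tid, NULL, listener_main, NULL) != 0) {
        close(stop_pipe[0]);
        close(stop_pipe[1]);
        return -1;
    }

    atomic_store(&listener_active, false);
    /* Safe: only this thread closes the pipe, and it has not done so yet. */
    if (write(stop_pipe[1], &wake, 1) != 1) {
        rc = -1;
    }
    if (pthread_join(tid, NULL) != 0) {
        rc = -1;
    }
    /* Close only after the join, so no thread can still be using the pipe. */
    close(stop_pipe[0]);
    close(stop_pipe[1]);
    return rc;
}
```

Because the same thread performs the write and both closes, the write can never land on a descriptor that another thread has already closed.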

Note: this patch was opened directly against the v4.1.x branch because
the orte/mca/oob/tcp directory has been removed from master branch.

Signed-off-by: Wei Zhang <wzam@amazon.com>
2020-09-30 18:22:53 +00:00
Jeff Squyres
57044aa0b0
Merge pull request #7934 from jjhursey/v4-jsm-no-bind
v4.1.x: schizo/jsm: Disable binding when direct launched
2020-07-14 11:19:19 -04:00
Joshua Hursey
c71e1fa1db
v4.1.x: schizo/jsm: Disable binding when direct launched
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-07-13 15:31:21 -05:00
Artem Polyakov
8b5b0c5a2a schizo/slurm: Disable binding in case of Slurm direct launch
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit c72f295dfaced7e9f879b1eba3eebb7c35cb5bf5)
2020-07-13 12:54:24 -07:00
Ralph Castain
12468349d2
Increment the vpid after assignment
Fix the rank-by operation

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-06-30 06:54:04 -07:00
Jeff Squyres
470b1c518d
Merge pull request #7812 from hppritcha/topic/cobalt_for_orte
RAS:ALPS add support for ANL Cobalt
2020-06-25 10:40:57 -04:00
Joshua Hursey
3234079bbc
Add detection for JSM direct launch
* Adds the `schizo/jsm` component that detects if the process was
   direct launched with IBM's Job Step Manager (JSM). JSM is a PMIx
   enhanced runtime environment so flag it as such.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 4f1de51371048085b86ee64e05849ad929c9f35c)
2020-06-11 08:51:17 -05:00
Howard Pritchard
3137a78bce RAS:ALPS add support for ANL Cobalt
This commit enables the ALPS RAS to get reservation information
from COBALT.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-06-09 19:38:21 +00:00
Christoph Niethammer
1632aa06c0 Correct typo in mapping-too-low* help messages
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-05-10 22:39:28 +02:00
Geoff Paulsen
64b78cca3f
Merge pull request #7643 from alex--m/topic/rmaps_fix
rmaps/base: fix logic (crash, in some cases) when num_procs > num_obj…
2020-04-24 13:59:00 -05:00
Ralph Castain
79c426885a
Correct the error codes and clarify/correct logic
Every mapper is required to set the locale, which is why it is an error
if the locale attribute isn't found. Likewise, it is an error for any
mapper to set a NULL locale as it makes no sense. However, I can see
that maybe some compiler or static code checker might want to see
concrete evidence we checked it - so check it in the right place.

Backport the equivalent code from PRRTE as we know that works - more
confidence than trying to add another patch to this old code.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-18 06:32:07 -07:00
Alex Margolin
1cd89c9d7b
rmaps/base: fix logic (crash, in some cases) when num_procs > num_objects
Signed-off-by: Alex Margolin <alex.margolin@huawei.com>
2020-04-17 16:11:34 +03:00
Geoff Paulsen
1b0741160c
Merge pull request #7609 from hppritcha/topic/fix_for_issue5115
debuggers: don't remove session directory
2020-04-10 08:41:49 -05:00
Howard Pritchard
e679a0a6f3 debuggers: don't remove session directory
when the exiting process is a debugger process.

relates to #5115

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-04-07 16:43:44 -06:00
Ralph Castain
16c8931ec9
Daemonize the orteds during tree-spawn
Somehow, the line of code that actually added the daemonize option to
the orted cmd line was removed.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-30 15:48:27 -07:00
Joshua Hursey
05d003b109
plm/rsh: Fix segv on missing agent.
* Additionally, fixes the `NULL` option to `OMPI_MCA_plm_rsh_agent`,
   which would also lead to a segv. Now it operates as intended by
   disqualifying the `rsh` component and falling back onto the `isolated`
   component.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 62d0058738e8a111cd099199bc5f1886f13aa8ec)
2020-01-27 10:34:28 -06:00
Geoff Paulsen
93c879962e
Merge pull request #7168 from wbailey2/pr/fix-yield_when_idle
v4.0.x: schizo/ompi: correctly handle the yield_when_idle option
2020-01-03 14:06:36 -06:00
Maxwell Coil
3964144ca5 Fix misleading error message with missing #! interpreter
This change fixes the misleading error message. I added a conditional to
determine whether the error is due to a missing file or a bad interpreter.
If it is the latter, a new, more precise error message will be displayed.
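
One way such a conditional could look is sketched below. It assumes the check is made after a failed exec call that reported `ENOENT`, which is the error `execve()` gives both for a missing file and for a script whose `#!` interpreter does not exist. The enum and the function `classify_exec_failure` are hypothetical names for illustration, not the code from this commit.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

enum exec_failure {
    EXEC_FILE_MISSING,     /* the executable itself does not exist */
    EXEC_BAD_INTERPRETER,  /* the file exists, so the #! interpreter is missing */
    EXEC_OTHER             /* any other errno value */
};

/* Classify a failed exec of `path` given the errno it produced. */
static enum exec_failure classify_exec_failure(const char *path, int exec_errno)
{
    assert(path != NULL);
    if (exec_errno != ENOENT) {
        return EXEC_OTHER;
    }
    /* ENOENT with an existing file points at the interpreter line. */
    return (access(path, F_OK) == 0) ? EXEC_BAD_INTERPRETER : EXEC_FILE_MISSING;
}
```

A caller would pick the more precise help message when the result is `EXEC_BAD_INTERPRETER` and keep the existing "file not found" message otherwise.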

Fixes #4528

Signed-off-by: Maxwell Coil <mcoil@nd.edu>
(cherry picked from commit 9b73f6ac83b1efd6db60d71a6e0a4c83b52af0aa)
2019-12-14 12:22:55 -05:00
Gilles Gouaillardet
b004f4c391 schizo/ompi: correctly handle the yield_when_idle option
in schizo/ompi, sets the new OMPI_MCA_mpi_oversubscribe environment
variable according to the node oversubscription state.

This MCA parameter is used to set the default value of the
mpi_yield_when_idle parameter.

This two-step tango is needed so the mpi_yield_when_idle setting
is always honored when set in a config file.

Refs. open-mpi/ompi#6433

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry-picked from cc97c0f6116ae62feef)
2019-12-02 17:18:26 -05:00
Scott Miller
8eae54fd27 plm/rsh: Add chdir option to change directory before orted exec
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
(cherry picked from commit c1b8599528feaac3c1d9ad44f0113f65eb7cc97f)

Conflicts:
	orte/mca/plm/rsh/plm_rsh_module.c
2019-10-29 15:49:41 -04:00
Joshua Hursey
c6fab32137
Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion.
* The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
   30 seconds before sending SIGTERM then another 30 seconds before sending
   SIGKILL to remaining processes. This usually happens on an abnormal
   termination. Sometimes the user wants to delay the cleanup to give the
   system time to write out corefile or run other diagnostics.
 * The problem is that child processes may be completing while ORTE is
   in this loop. The SIGCHLD will interrupt the `sleep` system call.
   Without the loop the sleep could effectively be ignored in this case.
   - Sleep returns the amount of time remaining to sleep. If it was
     interrupted by a signal then it is a positive number less than or
     equal to the parameter passed to it. If it slept the whole time
     then it returns 0.
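
The retry loop described above can be sketched in a few lines. This is an illustration of the general technique, not the ORTE code, and the helper name `sleep_fully` is hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Sleep for the full duration even if signals such as SIGCHLD interrupt
 * the call. Returns the unslept time, which is 0 once the loop finishes. */
static unsigned int sleep_fully(unsigned int seconds)
{
    unsigned int remaining = seconds;

    while (remaining > 0) {
        /* sleep() returns the unslept seconds when interrupted, else 0. */
        remaining = sleep(remaining);
        assert(remaining <= seconds);
    }
    return remaining;
}
```

With a timeout of 30, a call such as `sleep_fully(30)` keeps re-entering `sleep()` with the remaining time until the whole interval has elapsed, regardless of how many children exit in the meantime.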

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 0e8a97c598d841d472047ea1025931813c3ef8a9)
2019-10-02 14:49:47 -05:00
Ralph Castain
edbfcf090a
Cleanup stale code in ORTE/OOB
Remove code for multiple OOB progress threads as it is an optimization
nobody uses. Also turns out to have a race condition that can cause
segfault on finalize, so maybe good that nobody is using it.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 41eb41c3f224abcacec78f4d77c138197c36f172)
(cherry picked from commit a2f35c1834ab2fcb216285621d177a179e33dfe7)
2019-09-26 15:21:15 -07:00
KAWASHIMA Takahiro
e5be033c14 ess/pmi: Fix --enable-timing compilation error
This commit fixes a compilation error when configured
with `--enable-timing`.

Procedures in the function `orte_ess_base_app_setup`
in `orte/mca/ess/base/ess_base_std_app.c` were moved
to `orte/mca/ess/pmi/ess_pmi_module.c`
and `orte/mca/ess/singleton/ess_singleton_module.c`
in the recent commit 57f6b94fa5.

In `ess_pmi_module.c`, the first argument of the
`OPAL_TIMING_ENV_NEXT` macro should have been adapted
to the destination function but was not.

In `ess_singleton_module.c`, `OPAL_TIMING_ENV_INIT`
was not used in the destination function originally.
So `OPAL_TIMING_ENV_NEXT` cannot be used in the function.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
(cherry picked from commit 8e7d874e14a5485dceff836419e36b6b24a66f48)
2019-09-19 18:14:06 -04:00
Geoff Paulsen
a482edc14e
Merge pull request #6944 from jjhursey/v4/fix-tree-launch
Fix tree spawn routed component issue
2019-09-09 13:10:36 -05:00
Ralph Castain
95cc53e331
Be a little less restrictive on interface requirements
If both types of interfaces are enabled, don't error out if one of them
isn't able to open listener sockets. Only one interface family may be
available on some machines, but someone might want to build the code to
run more generally.

Refs https://github.com/pmix/prrte/pull/249

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 06d188ebf3646760f50d4513361b50642af9cec4)
2019-09-09 07:52:03 -07:00
Joshua Hursey
4c1160e257 Fix tree spawn routed component issue
* Fix #6618
   - See comments on Issue #6618 for finer details.
 * The `plm/rsh` component uses the highest priority `routed` component
   to construct the launch tree. The remote orteds will activate all
   available `routed` components when updating routes. This allows the
   opportunity for the parent vpid on the remote `orted` to not match
   that which was expected in the tree launch. The result is that the
   remote orted tries to contact their parent with the wrong contact
   information and orted wireup will fail.
 * This fix forces the orteds to use the same `routed` component as
   the HNP used when constructing the tree, if tree launch is enabled.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-08-29 16:26:43 -04:00
Scott Miller
1b0cfdf264 v4.0.x: regx/naive: add regx/naive component
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
2019-08-26 11:37:07 -04:00
Ralph Castain
f0f25b60a8
Fix typos
Provide a missing header and paren

Thanks to @zerothi for the assistance

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit bd5a1765eea200651babc5bfd9f45a9f3cedefbc)
2019-08-07 05:51:29 -07:00
Ralph Castain
9898332ae0
Allow individual jobs to set their map/rank/bind policies
Override the defaults when provided. Ignore LSF binding file if user
overrides by specifying a policy.

Fixes #6631

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit ea0dfc321809db50f78e742da1d22f9ef59650a3)
2019-08-07 05:51:06 -07:00
Orivej Desh
667fe3f3f3 Fix oob_tcp tcp_component_close segfault with active listeners
oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the
event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp
will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to
component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

Signed-off-by: Orivej Desh <orivej@gmx.fr>
(cherry picked from commit 78b7e342bd26f493547f750dac842252e7a15143)
2019-07-08 15:14:43 -07:00
Geoff Paulsen
0cd5a5afb8
Merge pull request #6714 from rhc54/cmr40/routed
Fix tree spawn at scale
2019-06-07 14:13:22 -05:00
perrynzhou
5acaf006ae regx/base: fix an integer overflow
use strtol() instead of atoi() in order to handle hostnames
containing a large number.
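
The difference matters because `atoi()` has undefined behavior when the value is out of range, while `strtol()` reports the condition through `errno`. A minimal sketch of range-checked parsing follows; the helper name `parse_host_number` is hypothetical and is not taken from the regx code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse the decimal digits of a hostname suffix (for example the "0042"
 * in "node0042"). Returns 0 and stores the value on success, -1 if the
 * text is not a number or does not fit in a long. */
static int parse_host_number(const char *digits, long *value)
{
    char *end = NULL;
    long parsed;

    assert(digits != NULL && value != NULL);
    errno = 0;
    parsed = strtol(digits, &end, 10);
    if (end == digits || *end != '\0' || errno == ERANGE) {
        return -1;
    }
    *value = parsed;
    return 0;
}
```

The caller can then reject or specially handle a hostname whose numeric part overflows instead of silently computing with a garbage value.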

This is a one-off commit for the release branches since
the regx framework has already been removed from master.

Refs. open-mpi/ompi#6729

Signed-off-by: perrynzhou <perrynzhou@gmail.com>
2019-06-06 14:37:33 +09:00
Ralph Castain
6c2cd10d68
Fix tree spawn at scale
Remove the debruijn component as it changes the daemon's parent
process ID, thus breaking the other routed components

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-06-04 09:49:01 -07:00
Jordan Hayes
e00d0abe56 plm_slurm_module: adjust for new SLURM CLI options
SLURM 19 discontinued the use of --cpu_bind (and changed it to
--cpu-bind).  There's no easy way to test at run time which one is
accepted, so set the environment variable SLURM_CPU_BIND to "none",
which should do the same thing as the srun CLI parameter.
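
A sketch of the environment-variable approach is shown below. It assumes the variable is set in the launching process before `srun` is spawned, so that the child inherits it; the function name `disable_slurm_cpu_binding` is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Ask Slurm not to bind the launched daemons, independent of whether the
 * installed srun spells the option --cpu_bind or --cpu-bind.
 * Returns 0 on success, -1 on failure. */
static int disable_slurm_cpu_binding(void)
{
    /* Third argument 1: overwrite any value inherited from the user. */
    if (setenv("SLURM_CPU_BIND", "none", 1) != 0) {
        return -1;
    }
    assert(getenv("SLURM_CPU_BIND") != NULL);
    return 0;
}
```

Setting the variable avoids probing the Slurm version at run time, at the cost of overriding a `SLURM_CPU_BIND` value the user may have exported.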

Signed-off-by: Jordan Hayes <jhayes@ucr.edu>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7dad74032e30259506da7fa582dd8c4351e6e0a1)
2019-05-16 09:13:28 -07:00
Geoff Paulsen
811dfc63e0
Merge pull request #6550 from rhc54/cmr402/clnup
v4.0.x: Cleanup race condition in finalize that leads to incomplete vader cleanup
2019-04-09 10:13:15 -05:00
James Clark
d8dc69feb5 Add a compilation flag that adds unwind info to all files that are present in the stack starting from MPI_Init.
This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.

Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Co-authored-by: Jeff Squyres <jsquyres@cisco.com>

(cherry-picked from 20f5840)
2019-04-01 11:10:04 +01:00
Ralph Castain
2536b4f869 Remove stale ORTE code
Functionality moved to PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit cfdd08d309d9ebc48229f0ca68ceec64a7e6389f)
2019-03-31 11:26:18 -07:00
Ralph Castain
861016c3b2 Cleanup race condition in finalize
See https://github.com/open-mpi/ompi/issues/5798#issuecomment-426545893
for a lengthy explanation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 57f6b94fa53166bc4d513be4507382e832a3a8c7)
2019-03-31 11:23:27 -07:00
Ralph Castain
b5d46494cd Fix cross-mpirun connect/accept operations
Ensure we publish all the info required to be returned to the other
mpirun when executing this operation. We need to know the daemon (and
its URI) that is hosting each of the other procs so we can do a direct
modex operation and retrieve their connection info.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit 60961ceb41cde6fbdc84694748696e90ddcfb345)
2019-03-01 08:41:23 -08:00
Howard Pritchard
55915c3885 rml/ofi: remove
per discussion at the 2/19/19 devel-core meeting,
remove rml/ofi from 4.0.x

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-19 10:27:47 -07:00
Geoff Paulsen
c593b2004e
Merge pull request #6380 from hppritcha/ggouaillardet-topic/oob_tcp_cross_version_compatibility
v4.0.x: oob/tcp: add cross version compatibility support
2019-02-15 13:39:12 -06:00
Howard Pritchard
de1dd1c2b0 oob/tcp: hardwire oob_tcp version string to 4.0.0
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-02-13 12:54:03 -07:00
Gilles Gouaillardet
dd750795ee oob/tcp: add cross version compatibility support
Since we intend to provide cross version compatibility
between versions with the same major and minor, use
MAJOR.MINOR.0 instead of orte_version_string
(e.g. MAJOR.MINOR.RELEASEGREEK).

Open MPI 4.0.0 has already been released, so to stay compatible
with future 4.0.x releases we have to use 4.0.0 as the version
string. That is why we use MAJOR.MINOR.0 instead
of MAJOR.MINOR
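
As an illustration of the scheme, the string exchanged between peers could be built as in the sketch below. The function name `compat_version_string` and its signature are hypothetical; the real code uses the build-time version macros rather than run-time integers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the compatibility string "MAJOR.MINOR.0" so that every release in
 * the same MAJOR.MINOR series presents the same value to its peers.
 * Returns 0 on success, -1 if the buffer is too small. */
static int compat_version_string(int major, int minor, char *buf, size_t len)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "%d.%d.0", major, minor);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Under this scheme a 4.0.1 or 4.0.2 build would still present "4.0.0", which matches what the already released 4.0.0 sends.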

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-13 10:21:32 -07:00