Commit graph

5788 commits

Author SHA1 Message Date
Ralph Castain
e27e945d9a Complete job control integration
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-08-20 16:08:54 -07:00
Ralph Castain
50e6e14020 Protect against infinite loops
Flag that we provided a notification and ignore it if it attempts to come back up.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit ea0d70bc9396def61545e2ce492a55c4c3aa7772)
2018-07-19 13:57:52 -07:00
Ralph Castain
6b6e63a346 Control inheritance of launch directives by child jobs
Do not have child jobs inherit launch directives unless requested to do so. This affects the map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, and cpus-per-rank directives. Values provided in the spawn call always take precedence - if a particular value isn't specified, then the ORTE defaults will be used if inheritance is not requested, and the values specified by MCA param will be used if inheritance is set.

Always inherit oversubscribe for now as otherwise MTT will break

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-10 15:12:05 -07:00
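
The precedence rule described in the commit above can be sketched roughly as follows. This is only an illustration: `directive_sources_t` and `resolve_directive` are hypothetical names, not ORTE identifiers, and the fallback to the default when inheritance is requested but no MCA value exists is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration only: these names are not ORTE identifiers. */
typedef struct {
    const char *spawn_value;   /* value given in the spawn call, or NULL */
    const char *mca_value;     /* value set via MCA param, or NULL */
    const char *orte_default;  /* built-in default */
} directive_sources_t;

/* Pick the effective value of one launch directive for a child job. */
const char *resolve_directive(const directive_sources_t *src, bool inherit)
{
    if (NULL != src->spawn_value) {
        return src->spawn_value;      /* the spawn call always takes precedence */
    }
    if (inherit && NULL != src->mca_value) {
        return src->mca_value;        /* inheritance requested: use the MCA param */
    }
    /* No inheritance (or, by assumption, no MCA value): use the default. */
    return src->orte_default;
}
```

Under these assumptions, a child job with no spawn-call value gets the default unless inheritance is requested, in which case it gets the MCA param value.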
Ralph Castain
c6ee8d2f5c
Merge pull request #5300 from hjelmn/goodbye_oobud
oob/ud: remove as it has bitrotted
2018-07-09 12:40:04 -07:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00
Ralph Castain
3b2390e5d5 Silence coverity warnings, remove/ignore build product
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-25 08:01:28 -07:00
Jeff Squyres
4603852740 orterun: use consistent CLI option name for --bind-to
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different).  For example:

    mpirun --bind-to cpu-list:ordered ...

Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.

Also add some minor updates to the orterun.1in man page for
clarification.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-21 08:22:00 -07:00
Ralph Castain
d2838139e4 Update man and help output for new binding option
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-21 06:36:11 -07:00
Ralph Castain
f17d47087a Define a new binding method and qualifier
Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-20 21:26:09 -07:00
Ralph Castain
97665d44cd Prevent thread log when show_help msgs are emitted
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 21:07:03 -07:00
Ralph Castain
98b4ed9a3a Fix the no-disconnect test
A race condition exists based on whether or not the userdata object attached to a hwloc_obj_t has been initialized. These objects are set up whenever we scan for resources under that location. You therefore must not set a variable to the pointer to the userdata object and then call a function that will initialize the data in it. Instead, set the variable after the function call, and protect against a NULL pointer.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 13:52:34 -07:00
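
The ordering issue in the commit above might look like the following minimal sketch. The types and functions (`object_t`, `summary_t`, `scan_resources`, `get_nprocs`) are hypothetical stand-ins, not hwloc or OPAL code; the point is only that the pointer is read after the call that may initialize it, and is checked for NULL.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: not hwloc or OPAL types. */
typedef struct { int nprocs; } summary_t;
typedef struct { void *userdata; } object_t;

/* Lazily attaches userdata, as a resource scan might. */
void scan_resources(object_t *obj)
{
    if (NULL == obj->userdata) {
        obj->userdata = calloc(1, sizeof(summary_t));
    }
}

/* Returns the proc count, or -1 if no userdata is available. */
int get_nprocs(object_t *obj)
{
    summary_t *data;

    /* The wrong order would be:
     *     data = obj->userdata;  scan_resources(obj);
     * which can leave 'data' holding a NULL captured before initialization. */
    scan_resources(obj);
    data = (summary_t *)obj->userdata;  /* read the pointer after the call */
    if (NULL == data) {                 /* still guard: allocation may fail */
        return -1;
    }
    return data->nprocs;
}
```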
Nathan Hjelm
845516ca11 oob/ud: remove as it has bitrotted
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-19 13:06:26 -06:00
Ralph Castain
cdb3d798f0 Silence Coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 07:47:50 -07:00
Ralph Castain
f0a0d606a0 Correct accounting for tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1be080f7b92bad39745f42628a8cb6afefad2d2a)
2018-06-18 13:24:25 -07:00
Ralph Castain
795140e590 Make use of "instant-on" feature optional
The PMIx support for "instant on" remains experimental, so disable it by default. Provide an MCA param and corresponding command line option to enable it at runtime.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-17 02:42:00 -07:00
Ralph Castain
fa18ba395d Sync to latest PMIx v3.0rc
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-17 02:41:46 -07:00
Ralph Castain
ea21f7175a Silence warnings and remove unused code
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-16 17:42:48 -07:00
Ralph Castain
06eb51161f Rename prun to ompi-prun
This is a minor abstraction break in naming, but hopefully acceptable for now. I will update the contents of the program a little later. This resolves the immediate issue of naming conflict with the PRRTE binary.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-06 07:40:03 -07:00
Ralph Castain
7c0ec7e851 Cleanup warnings in binding code
This still leaves two unresolved warnings:

base/rmaps_base_binding.c:577:22: warning: variable ‘clvm’ set but not used [-Wunused-but-set-variable]
     unsigned clvl=0, clvm=0;
                      ^~~~
base/rmaps_base_binding.c:576:27: warning: variable ‘hwm’ set but not used [-Wunused-but-set-variable]
     hwloc_obj_type_t hwb, hwm;
                           ^~~

The problem is that these values are used in the OPAL_HWLOC_MAKE_OBJ_CACHE macro to form a variable name. Thus, the compiler doesn't recognize the values as being "used". I'm not entirely sure how to resolve it cleanly.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-03 11:47:14 -07:00
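
The kind of warning described in the commit above can be reproduced with a small sketch. `OBJ_COUNT` is a hypothetical macro, not the actual OPAL_HWLOC_MAKE_OBJ_CACHE definition; it only shows how token pasting uses an argument's name without reading the variable. The `(void)` cast is one commonly used idiom for silencing such a warning, shown here as an assumption and not as what the project adopted.

```c
/* Hypothetical macro: pastes its argument into an identifier.
 * It uses the argument's spelling, never the variable's value. */
#define OBJ_COUNT(level) count_##level

int count_for_core(void)
{
    int count_core = 4;   /* the variable the macro expansion actually names */
    int core = 0;

    core = 1;             /* assigned, but never read anywhere below */

    /* Without this cast, gcc is expected to report 'core' under
     * -Wunused-but-set-variable, since OBJ_COUNT(core) does not read it. */
    (void)core;

    return OBJ_COUNT(core);   /* expands to count_core */
}
```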
Brice Goglin
c4dffa1d0f rmaps: simplify the lookup for the binding object and fix for hwloc 2.0
Don't bother doing a lookup upwards or downwards for the target object type.
Just use the target depth, iterate over the level until we find the min_bound
object that intersects the locale cpuset.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
bd08a6ead9 hwloc: fix hwloc/shmem.h in the external case
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Jeff Squyres
af4299ebc5 hwloc: updates for hwloc 2.0.x API
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-05-24 11:53:07 +02:00
Ralph Castain
d2040497b8 Silence Coverity warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-27 07:30:00 -07:00
Ralph Castain
1e8add52d7 Silence warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-26 11:45:25 -07:00
Ralph Castain
9ae80596f6 Fix rank-by option and improve npernode/skt
This fixes a problem reported by @bgoglin where rank-by was incorrectly generating values when ranking by a type of object (e.g., socket). It also corrects the handling of the pernode, npernode, and npersocket options - these should only set the #procs and the default mapping pattern. They specifically should not prohibit the user from requesting a different mapping.

Thus, the following should be valid:

    mpirun -npernode 2 --map-by socket ...

This should put 2 procs on each node, mapping them by socket on each node.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-25 20:35:43 -07:00
Gilles Gouaillardet
a05456ab5e orte: only set the ORTE_NODE_ALIAS attribute when required
When there is no alias for a given node, do not set the
ORTE_NODE_ALIAS attribute to an empty string any more.

Thanks Erico for reporting this issue.
Thanks Ralph for the guidance.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-25 11:43:46 +09:00
Gilles Gouaillardet
49d6658f60 Revert "orte_dt_print_node: correctly handle nodes with no alias(es)"
This reverts commit 866f449cff.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-25 11:43:35 +09:00
Gilles Gouaillardet
866f449cff orte_dt_print_node: correctly handle nodes with no alias(es)
Thanks Erico for reporting this issue

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-24 15:44:26 +09:00
Jeff Squyres
3c8d08a9f3
Merge pull request #5078 from jsquyres/pr/slurm-plm-minor-warning-message-update
plm/slurm: slightly improve verbose warning message
2018-04-17 07:59:50 -07:00
Gilles Gouaillardet
4f1cb4747c odls/base: fix support for PMIx < v2.1
wrap opal_pmix.get() around opal_pmix.legacy_get() to support previous PMIx releases.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-17 09:59:47 +09:00
Jeff Squyres
5394845ce6 plm/slurm: slightly improve verbose warning message
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-16 17:04:12 -07:00
Nathan Hjelm
664ba32435 plm/base: fix typo in variable name
An incorrectly named variable caused all pml variables to disappear
from ompi_info. This commit fixes the typo. We may add some logic into
the MCA base to catch these sorts of things in the future.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-04-10 17:53:16 -06:00
Jeff Squyres
7a6e8cac58 orterun.1in: fix typo
Found via https://github.com/open-mpi/ompi-www/pull/61.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-09 14:13:24 -04:00
Ralph Castain
d644f7ee26 Correctly fix the ranking policy
Shorten the loops as much as possible - if someone wants to further optimize, they are welcome to do so.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-26 16:06:46 -07:00
Ralph Castain
322f6c5056 Fix a breakage in the ranking system
While it may be faster to reverse the order of the assignment loops, it also results in the wrong answer

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-25 15:55:56 -07:00
Ralph Castain
8454fc8a65 Allow oversubscription on managed allocations
Fixes https://github.com/pmix/pmix-reference-server/issues/42

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-25 08:37:51 -07:00
Artem Polyakov
77ff99e9ee
Merge pull request #4933 from karasevb/timings_update
timings: added new timing points
2018-03-25 00:10:49 -07:00
Boris Karasev
6afc7099a0 plm/base: fixed the hosts filtering
Reset the `ORTE_NODE_FLAG_MAPPED` flag after host filtering; this
flag is used subsequently and can affect the node mapping logic.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-03-23 09:41:16 +03:00
Jeff Squyres
0f8077ace6 oob/tcp: add show_help message about version mismatch
Be more explicit about version mismatch between ORTE processes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-21 20:18:28 -07:00
Boris Karasev
3796307a57 timings: added new timing points
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-03-21 05:16:25 +02:00
Stanislav Kirillov
86061fbf8d
fixed ipv6 OOB connection problems
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2018-03-20 16:07:53 +00:00
Aurelien Bouteiller
e08e580e27
Merge pull request #4916 from abouteiller/topic/scaling.pl-m
Scaling.pl: Fix Srun options and wait for DVM launch
2018-03-15 22:06:01 -04:00
Aurélien Bouteiller
9e23d24bb4
Scaling.pl: Fix Srun options and wait for DVM launch
Flush out the DVM ready notice on stdout

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-03-15 00:00:49 -04:00
Joshua Hursey
ccb4f43c9b Fix MPIR_proctable structure visibility
* The `MPIR_PROCDESC` structure needs to be visible even in optimized
   builds so that debuggers can attach to `mpirun` and properly read the
   `MPIR_proctable`.
 * In the v2.0.x and v2.x series this structure resided in the `orterun`
   directory and included the `CFLAGS` fix included here. This code
   moved in the v3.x series and the `CFLAGS` did not move causing this
   issue.
   - Instead of applying the debug `CFLAGS` globally to libopen-rte,
     only apply them to the `orted_submit.c` compile which contains the
     MPIR symbols.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2018-03-09 21:15:28 -05:00
Ralph Castain
2f85db9791 Always register the nspace for jobs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Ralph Castain
7241043809 Modify the internal logic for resolve nodes/peers
The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event.

Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the preg code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 456ac7f7af3d9ba09888e3c899eb001daaa24aef)
2018-03-02 02:00:31 -08:00
Ralph Castain
17c40f4cea Implement support for proctable queries
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Ralph Castain
0434b615b5 Update ORTE to support PMIx v3
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":

* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch

* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.

* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.

* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allow envars to be excluded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Scott Miller
d7e594fcff Fix PATH and LD_LIBRARY_PATH prefixing to use first app context value for ORTE_APP_PREFIX_DIR
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
2018-02-28 18:41:47 -05:00
Gilles Gouaillardet
02b97146de orted_submit: fix the --oversubscribe option
Set the ORTE_MAPPING_SUBSCRIBE_GIVEN directive when --oversubscribe is used.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-02-20 17:15:12 +09:00