1
1
Граф коммитов

5475 Коммитов

Автор SHA1 Сообщение Дата
Artem Polyakov
45898a9c65 opal/timing: add the draft of env-based timings
This commit adds new timing feature that uses environment variables to
expose timing information. This allows easy access to this data (if
timing is enabled) from any other part of the application for the subsequent
postprocessing.
In particular this will be integrated with OMPI-level timing framework that
whill use MPI_Reduce functionality to provide more compact and easy-to use
information.

This commit also adds the example of usage of this framework by annotating
rte_init function. The result is not used anywhere for now. It will be
postprocessed in subsequent commits.

NOTE: that functionality is currently disabled untill it will be verified at runtime

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-04-07 21:16:22 +06:00
Artem Polyakov
88ed79ea25 opal/timing: remove old framework
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-04-07 21:16:22 +06:00
Artem Polyakov
482d7c9322 opal/timing: remove RML timings
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-04-07 21:16:21 +06:00
Artem Polyakov
79100de014 opal/timing: Remove oob tracing
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-04-07 21:16:21 +06:00
Ralph Castain
b526bca56c Fix a potential segfault by avoiding NULL topologies prior to launching the VM.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-06 20:51:19 -07:00
Ralph Castain
b33b4607df Correctly identify the source of the event when notifying of abnormal termination by a process
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-06 20:50:38 -07:00
Ralph Castain
a29ca2bb0d Enable slurm operations on Cray with constraints
Cleanup some errors in the nidmap code that caused us to send unnecessary topologies

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-06 08:58:06 -07:00
Ralph Castain
db8943cedd Provide further (hopefully) helpful messages about the hotel size
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-05 04:27:32 -07:00
Ralph Castain
b7e9711f45 Resolve the direct modex race condition. The request hotel was running out of rooms, thereby returning an error upon checkin - and we had missed error_logging a couple of those places. Hence no error message and things just hung.
Output a (hopefully) helpful message when we timeout an operation

Thanks to Nathan for tracking it down.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 21:32:44 -07:00
Ralph Castain
9a69b20d09 Merge pull request #3282 from rhc54/topic/direct
Set the PARENT vpid for direct routed module
2017-04-04 20:55:12 -07:00
Ralph Castain
40ca43e157 Set the PARENT vpid for direct routed module
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 19:03:28 -07:00
Ralph Castain
734b90aa6b Adjust the timeout for direct modex requests to reflect the size of the job. It can take several seconds to start all the procs, and we don't want to timeout due to differences in start times of the various procs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 18:20:51 -07:00
Ralph Castain
9cb18b8348 Merge pull request #3280 from rhc54/topic/dvm
Fix the DVM by ensuring that all nodes, even those that didn't partic…
2017-04-04 18:15:33 -07:00
Ralph Castain
74863a0ea4 Fix the DVM by ensuring that all nodes, even those that didn't participate (i.e., didn't have any local children) in a job, clean up all resources associated with that job upon its completion. With the advent of backend distributed mapping, nodes that weren't part of the job would still allocate resources on other nodes - and then start from that point when mapping the next job. This change ensures that all daemons start from the same point each time.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 17:31:38 -07:00
Nathaniel Graham
7063f3021f Merge pull request #3231 from nrgraham23/revamp_mpirun_help
mpirun --help output revamp
2017-04-04 12:32:20 -06:00
Nathaniel Graham
19e5d15491 mpirun --help output revamp
This commit modifies the output from the mpirun --help
command.  The options have been split into groups, to
make the output smaller and more readable.  The groups
are: general, debug, output, input, mapping, ranking,
binding, devel, compatibility, launch, dvm, and
unsupported. There is also a special "full" command
that can be used to get the old behaviour of printing
out all of the options.  Unsupported options may only
be seen with this full output.

This commit also adds a special case for the help
argument.  It makes it possible for the user to
enter 0 or 1 arguments instead of having to always
enter an argument.  This defaults to printing out
the "general" help options so the user can then
see what help arguments there are.

Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
2017-04-04 10:59:32 -06:00
Ralph Castain
393c4536eb Remove stale code line
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 08:13:15 -07:00
Ralph Castain
92c996487c Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well.
Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap

Get the DVM running again

Fix direct modex by eliminating race condition caused by releasing data while sending it

Up the size limit before compressing

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-03 19:25:15 -07:00
Ralph Castain
d782542c5c Merge pull request #3241 from rhc54/topic/cov
Silence coverity dead-code warning
2017-03-27 08:33:33 -07:00
Jeff Squyres
a333cf691a orte: minor tweaks to run-as-root message
Two updates:

1. Remove the "run as root" error message from orterun.c, because that
   functionality is now in orted_submit.c (although it is still
   required in orte-dvm.c -- so sync the message in orted_submit.c and
   orte-dvm.c to be identical).
2. Slightly tweak the text of the "run as root" error message to
   explicitly state that we (strongly) suggest running as a non-root
   user (and add a little whitespace).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-03-27 04:50:21 -07:00
Ralph Castain
583dbe954c Silence coverity dead-code warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-26 20:36:43 -07:00
Ralph Castain
ecc8000136 Silence a flood of warnings when compiling with gcc on Cray
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-24 13:37:11 -06:00
Ralph Castain
35f817911e Fix coverity issues
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-24 08:09:46 -07:00
Ralph Castain
ea84a53faa Merge pull request #3218 from rhc54/topic/pmix2
Update to include the PMIx 2.0 APIs for monitoring and job control.
2017-03-21 20:11:10 -07:00
Ralph Castain
d645557fa0 Update to include the PMIx 2.0 APIs for monitoring and job control. Include required integration, but leave the monitors off for now. Move the sensor framework out of ORTE as it is being absorbed into PMIx
Fix typo and silence warnings

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 17:47:08 -07:00
Ralph Castain
10d401b6ec Merge pull request #3217 from rhc54/topic/wdirs
Resolve a race condition for setting our working directory when fork/exec'ing application procs.
2017-03-21 17:39:54 -07:00
Ralph Castain
74fd2c30af Cleanup alps odls module
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 17:41:11 -06:00
Ralph Castain
f8e1e3bed3 Ensure we properly exit with error if we cannot map the job
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 15:15:32 -07:00
Ralph Castain
75684dc260 Resolve a race condition for setting our working directory when fork/exec'ing application procs. We have to ensure we do it after the fork occurs since we want to use multiple threads in the odls. Otherwise, the different threads are bouncing the entire process around.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 13:54:03 -07:00
Ralph Castain
dc85e7fde7 Provide a little more help on the error messages when an executable isn't found so we have some better idea where we were looking for it. Don't double-report such errors. Ensure the ORTE_ERROR_NAME doesn't get a NULL back for the string name of an error code as that might cause some systems to segfault
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-17 09:54:37 -07:00
Howard Pritchard
1709febdea Merge pull request #3166 from hppritcha/topic/swat_state_orted_comp_warning
ORTED: swat another compiler warning
2017-03-15 08:40:59 -06:00
Ralph Castain
96d7d10c1d Merge pull request #3170 from rhc54/topic/reg
Ensure the backend daemons know if we are in a managed allocation and if the HNP was included in the allocation
2017-03-14 12:48:09 -07:00
Ralph Castain
61a71e25ef Ensure the backend daemons know if we are in a managed allocation and if
the HNP was included in the allocation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-14 10:06:43 -07:00
Howard Pritchard
5daaf7f3fd ORTED: swat another compiler warning
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-03-14 08:41:51 -06:00
Ralph Castain
52c9e631de Silence Coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-14 07:30:42 -07:00
Ralph Castain
b1a01d77ae Update the TM module to support regex passing
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-13 21:50:40 -07:00
Ralph Castain
bb574a41df Update launchers to get correct regex
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-13 11:21:44 -07:00
Ralph Castain
105fb152e1 Silence Coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-13 08:38:51 -07:00
Ralph Castain
b9f5cab710 Add a minor debug statement
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-12 18:15:44 -07:00
Gilles Gouaillardet
23d44a5284 sensor/base: initialize orte_sensor_base global variable
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-03-13 09:39:43 +09:00
Ralph Castain
6d6bc9bd07 Update alps module to new APIs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-12 09:43:07 -07:00
Ralph Castain
70591bf4dc Enable parallel fork/exec of local procs by providing the option of multiple odls progress threads
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-11 20:48:04 -08:00
Ralph Castain
ab50665222 Restore sensor framework
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-11 17:46:32 -08:00
Ralph Castain
c6bc3ccb76 Sync to latest PMIx master and PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-11 12:50:38 -08:00
Howard Pritchard
f8183f71f7 rmaps/base: swat compiler warning
gcc was complaining about variables possibly used uninitialized

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-03-09 14:30:06 -06:00
Ralph Castain
48fc339718 Create an alternative mapping method that pushes responsibility
onto the backend daemons. By default, let mpirun only pack the app_context
info and send that to the backend daemons where the mapping will
be done. This significantly reduces the computational time on mpirun as it isn't
running up/down the topology tree computing thousands of binding
locations, and it reduces the launch message to a very small number of
bytes.

When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.

Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port, but then passes that back to
all the other daemons so they will use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wireup at daemon
start. We then use the routing tree to rollup the initial
launch report, and limit the number of open sockets on mpirun's node.

Update ras simulator to track the new nidmap code

Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found.

Update gadget platform file

Initialize the range count when starting a new range

Fix the no-np case in managed allocation

Ensure DVM node usage gets cleaned up after each job

Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-07 20:43:12 -08:00
Ralph Castain
83199979ba Remove the stale opal/sec framework
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-02 15:41:56 -08:00
Ralph Castain
c757c3d260 Fix double-free in rml/ofi shutdown
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-01 11:53:46 -08:00
Jeff Squyres
fec519a793 hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-02-28 07:48:42 -08:00
Thomas Naughton
74f8c2ae30 orte-clean: fix bad username/uid usage, add orte-dvm
This fixes a mismatch between PS listing that returned
USERNAME but code was pruning based on UID.

This changes the OPAL_PS_FLAVOR_CHECK format to return
'uid' instead of 'user'.  (Note: Avoiding call to
getlogin_r() but assuming UID is uniform on system,
same assumption exists for session dir anyway.)

Note, still maintains behavior from man page for root
running orte-clean on node (kills all orteds).

Adds 'orte-dvm' to list of procnames that will be checked/killed.

Signed-off-by: Thomas Naughton <naughtont@ornl.gov>
2017-02-28 08:00:06 -05:00