1
1

557 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
3f032d39e8 Mark the proc as alive so waitpid callback system doesn't immediately activate the callback
Refs trac:4717

This commit was SVN r32026.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-18 14:04:55 +00:00
Ralph Castain
8e7c0257f0 Cleanup some missed updates to orte_wait_cb as params have changed
Refs trac:4717

This commit was SVN r32025.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 23:40:31 +00:00
Ralph Castain
5216bd5558 Multiple sigchld reports can occur within a single event callback, so have to reap them until none remain. Also, need to ensure the daemon is flagged as alive prior to calling wait_cb
Refs trac:4717

This commit was SVN r32020.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 18:46:40 +00:00
Ralph Castain
42bf7466fc This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on.
Refs trac:4717

This commit was SVN r32019.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 17:57:51 +00:00
Ralph Castain
b2413a6b88 Cannot update the proc state prior to activating the state machine as some callback functions need to compare the prior proc state against the new one.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r31949.
2014-06-04 03:40:08 +00:00
Ralph Castain
c5384d44d7 Protect against NULL result in get_attr
This commit was SVN r31947.
2014-06-04 03:09:37 +00:00
Ralph Castain
f1978fba7c Cleanup a set of typos on the orte_get_attribute call
This commit was SVN r31942.
2014-06-03 20:36:38 +00:00
Ralph Castain
5668f085a3 Silence some useless warnings, and fix a missed updated in the tm plm
This commit was SVN r31930.
2014-06-02 17:57:56 +00:00
Ralph Castain
742c0d2284 Fix typo that would cause a segfault if orte_startup_timeout was set
This commit was SVN r31929.
2014-06-02 15:59:18 +00:00
Ralph Castain
65a35d92ef Cleanup compile issues - missing updates to some plm components and the slurm ras component
This commit was SVN r31921.
2014-06-01 17:59:06 +00:00
Ralph Castain
8736a1c138 Per RFC:
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php

Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).

This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Nathan Hjelm
041b72b0cc plm/alps: better workaround for the noisy cray pmi implementation
This commit is a slightly better workaround to prevent mesages of
the form:
[unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
[unset]:_pmi_alps_get_appLayout:pmi_alps_get_apid returned with error: Bad file descriptor

It works by completely disabling PMI in the application process when using
mpirun. This should not be an issue for any apps.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31882.
2014-05-22 16:04:36 +00:00
Nathan Hjelm
2a57e71a47 plm/alps: fix typo introduced in r31589
This commit was SVN r31747.

The following SVN revision numbers were found above:
  r31589 --> open-mpi/ompi@445b552d3a
2014-05-13 22:36:54 +00:00
Ralph Castain
5602156a1c Use the correct abstraction layer name for the data dirs
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
445b552d3a Try again to get an error message printed when a daemon fails to successfully report back to mpirun. In this case, there is no guaranteed way for the daemon to output the error report itself - we don't have a connection back to the HNP, and we have tied stderr off to /dev/null (for good reasons). So the HNP has to detect the failure itself and report it.
The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes.

Refs trac:4571

This commit was SVN r31589.

The following Trac tickets were found above:
  Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-05-01 19:48:21 +00:00
Ralph Castain
238ecea311 When we comm_spawn, we really want to respect the original -host directives and not expand the daemon virtual machine unless directed to do so in the comm_spawn command. Otherwise, we will automatically launch daemons on every node in the allocation.
cmr=v1.8.2:reviewer=rhc:subject=respect vm boundaries during comm_spawn

This commit was SVN r31578.
2014-04-30 22:26:18 +00:00
Jeff Squyres
ea4c916096 plm_slurm_module.c: don't leave the extra fd to /dev/null open
Prior to r29058, this same logic was in place (i.e., ensure that the
extra fd to /dev/null is closed).  It looks like it was accidentally
removed in the ORTE conversion to the state machine in r29058.

This ''might'' have something to do with many hangs that we're seeing
in Cisco MTT with jobs that exhibit failure (e.g., call MPI_ABORT)...?

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31469.

The following SVN revision numbers were found above:
  r29058 --> open-mpi/ompi@a200e4f865
2014-04-21 20:09:15 +00:00
Ralph Castain
a368e84e70 Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pickup crosstalk in components
This commit was SVN r31403.
2014-04-15 21:48:24 +00:00
Nathan Hjelm
9df795d1dd plm/alps: silence annoying warning message when using Cray PMI 3.x or
newer

This commit adds a workaround for messages printed by the Cray PMI library
when launching using mpirun. We are still talking with Cray to find a
better fix but this will silence the warnings for now.

cmr=v1.8.1:reviewer=manjugv

This commit was SVN r31352.
2014-04-08 21:54:10 +00:00
Dave Goodell
19efa09540 plm/slurm: tweak /dev/null usage (#4489)
See the ticket for more details.

cmr=v1.8.1:reviewer=rhc:ticket=4489

This commit was SVN r31351.

The following Trac tickets were found above:
  Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 21:46:07 +00:00
Ralph Castain
957c9ecf53 Okay, silence the anality by simplifying the already irrelevant code, thus allowing us to turn our attention to things that actually matter
Refs trac:4489

This commit was SVN r31348.

The following Trac tickets were found above:
  Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 19:51:11 +00:00
Ralph Castain
8ce98ccc8d Not sure when this got messed up, but correct the stdout/stderr redirection on the srun command so we don't get all those slurm warnings
cmr=v1.8.1:reviewer=dgoodell:subject=silence srun warning output

This commit was SVN r31308.
2014-04-04 04:23:31 +00:00
Ralph Castain
3fdcaeab97 Fix a problem where we need to abort due to a mapping failure, but we are in a managed environment and thus the orteds have not wired up. Thus, if we send the exit message across the routed network, the remote daemons won't have a way to relay the message along - and we won't exit.
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.

cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup

This commit was SVN r31304.
2014-04-02 04:17:55 +00:00
Ralph Castain
70ee3fb000 Ensure that orted's are not bound to single processors if the TaskAffinity option is set by default. Thanks to Artem Polyakov for the patch, and for his patience in explaining the situation.
Reviewed with Moe Jette to ensure this was correct, and confirmed by me.

RM-approved

cmr=v1.8:reviewer=ompi-gk1.8

This commit was SVN r31288.
2014-03-29 18:30:38 +00:00
Ralph Castain
bd9bd2ff16 Be consistent in our handling of the "only HNP in allocation" case when setting up the VM. Thanks to Tetsuya Mishima for the suggestion.
cmr=v1.8:reviewer=rhc

This commit was SVN r31195.
2014-03-24 15:28:09 +00:00
Ralph Castain
d17f811ff5 Surrender to the tyranny of C++ and give up on enum for node states, as nice as that would be, in favor of retaining memory footprint constraints.
This commit was SVN r31149.
2014-03-19 16:15:24 +00:00
Ralph Castain
0aa23cdc35 Cleanup copy/paste errors to ensure we progress the launch
cmr=v1.7.5:reviewer=rhc

This commit was SVN r31102.
2014-03-18 01:24:49 +00:00
Ralph Castain
45196d222b Minor cleanup of the node state definitions - using the enum allows the debuggers to pretty-print the value
This commit was SVN r31090.
2014-03-17 21:27:58 +00:00
Ralph Castain
b248b27637 Remove a check that prevented mpirun from exiting when it should in the single-node case
Refs trac:4393

This commit was SVN r31080.

The following Trac tickets were found above:
  Ticket 4393 --> https://svn.open-mpi.org/trac/ompi/ticket/4393
2014-03-15 15:25:44 +00:00
Ralph Castain
fbc5e3b773 Deal with the corner case where we encounter an error when attempting to launch a daemon. In this case, we will order abnormal termination before daemons callback to us, and thus any attempt to send them a "die" message will fail. Ensure that mpirun at least exits cleanly in this scenario, thereby allowing the remote daemons that did get launched to commit suicide when comm fails.
cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r31068.
2014-03-14 15:32:30 +00:00
Adrian Reber
7304b700e1 Fix the newly added FT event state when compiling --with-ft
This commit was SVN r30988.
2014-03-11 13:20:08 +00:00
Ralph Castain
7a44af375c Add an FT event state and set the state machine to callback to the OOB base ft event when activated
This commit was SVN r30950.
2014-03-06 02:44:29 +00:00
Ralph Castain
c9465d97b4 Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test.
Thanks to Paul Kapinos for reporting the problem.

cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition

This commit was SVN r30942.
2014-03-05 04:38:17 +00:00
Ralph Castain
0ac97761cc Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= paramater. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.

In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.

Also cleanup some a couple of issues in the mapping/binding system:

* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed

* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch

* minor cleanup to the warning message when oversubscribed and binding was requested

cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system

This commit was SVN r30909.
2014-03-03 16:46:37 +00:00
Ralph Castain
0dc5f50d27 Add a plm component for local-only operation that doesn't require rsh/ssh to be installed. Requested by Fedora packagers for testing purposes.
cmr=v1.7.5:reviewer=jsquyres:subject=Add a plm component for local-only operation

This commit was SVN r30645.
2014-02-09 15:53:10 +00:00
Ralph Castain
1326ed704f Per the RFC discussed here:
http://www.open-mpi.org/community/lists/devel/2014/01/13789.php

add support for async modex when requested.

cmr=v1.7.5:reviewer=jsquyres:subject=Add async modex support

This commit was SVN r30565.
2014-02-05 14:39:27 +00:00
Adrian Reber
fde1040d2f Use unique collective ids for the checkpoint/restart code
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
53b1be5067 Only report launch progress when specifically requested to do so. Thanks to Tetsuya Mishima for spotting it.
Reviewed by rhc and RM-approved

cmr=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r30434.
2014-01-27 15:17:42 +00:00
Ralph Castain
f73d23e723 Correct the location of the counter when tracking process launch for reporting progress
cmr=v1.7.4:reviewer=hjelmn

This commit was SVN r30415.
2014-01-24 21:03:05 +00:00
Ralph Castain
e3cb4b4a5b Grant Nathan his wish - add an --disable-getpwuid to the configure options and protect all users of that code so it disappears if disabled.
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested

This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
fcdd904af4 Simplify and update hostfile handling to correctly support hostfiles that list nodes multiple times, once for each slot, and those that list a host once and include an explicit slot count. Eliminate support for mixing those two modes as this logic became just too complex when attempting to handle all the corner cases.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30325.
2014-01-18 16:08:40 +00:00
Ralph Castain
4cdc291df1 Ensure slurm properly dies on abnormal termination
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure slurm properly dies on abnormal termination

This commit was SVN r30182.
2014-01-09 16:52:02 +00:00
Ralph Castain
80497d73cf Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require only one component be selected
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30162.
2014-01-08 22:35:48 +00:00
Brian Barrett
8b778903d8 Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi.  This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.

This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Nathan Hjelm
3be4536d9b Cleanup various leaks in ompi_info reported by valgrind.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30058.
2013-12-23 17:47:43 +00:00
Ralph Castain
71b52fe861 Ensure that comm_spawn'd procs get user-specified forwarded envars
Thanks to Tim Miller for reporting the regression from the 1.6 series

cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that comm_spawn'd procs get user-specified forwarded envars

This commit was SVN r30012.
2013-12-20 14:47:35 +00:00
Adrian Reber
b42aad44a3 Trying to get the C/R code to compile again. This patch
includes various fixes all over the C/R code which are
hard to group like the other patches.

Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values

Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)

This commit was SVN r29922.
2013-12-16 15:35:28 +00:00
Jeff Squyres
770bf77149 Fix some minor memory leaks in error code paths.
Many thanks to Tom Fogal for the patch.

cmr=v1.7.4:reviewer=rhc:subject=Fix minor memory leaks in error code paths

This commit was SVN r29905.
2013-12-14 00:41:21 +00:00
Jeff Squyres
2e7653e4c2 Add missing argv.h includes.
Noticed these as part of #3694: external libevent's don't cause argv.h
to automatically get included.

Refs trac:3694

This commit was SVN r29897.

The following Trac tickets were found above:
  Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
2013-12-13 21:17:36 +00:00