
540 Commits

Author SHA1 Message Date
Jeff Squyres
ea4c916096 plm_slurm_module.c: don't leave the extra fd to /dev/null open
Prior to r29058, this same logic was in place (i.e., ensure that the
extra fd to /dev/null is closed).  It looks like it was accidentally
removed in the ORTE conversion to the state machine in r29058.

This ''might'' have something to do with many hangs that we're seeing
in Cisco MTT with jobs that exhibit failure (e.g., call MPI_ABORT)...?

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31469.

The following SVN revision numbers were found above:
  r29058 --> open-mpi/ompi@a200e4f865
2014-04-21 20:09:15 +00:00
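
The descriptor handling restored by ea4c916096 follows a common pattern: open /dev/null, duplicate it onto the descriptor being silenced, then close the temporary descriptor so it does not stay open. The following is a minimal Python sketch of that pattern, not the C code in plm_slurm_module.c; the helper name `redirect_to_devnull` is hypothetical.

```python
import os


def redirect_to_devnull(target_fd: int) -> int:
    """Point target_fd at /dev/null without leaving an extra fd open.

    Returns the number of the temporary descriptor, which is already
    closed on return unless it happened to be target_fd itself.
    """
    fd = os.open(os.devnull, os.O_RDWR)
    if fd != target_fd:
        os.dup2(fd, target_fd)  # target_fd now refers to /dev/null
        os.close(fd)            # the step the commit restores
    return fd
```

Without the final `os.close(fd)`, every redirection would leave one more open descriptor behind.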
Ralph Castain
a368e84e70 Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pickup crosstalk in components
This commit was SVN r31403.
2014-04-15 21:48:24 +00:00
Nathan Hjelm
9df795d1dd plm/alps: silence annoying warning message when using Cray PMI 3.x or
newer

This commit adds a workaround for messages printed by the Cray PMI library
when launching using mpirun. We are still talking with Cray to find a
better fix but this will silence the warnings for now.

cmr=v1.8.1:reviewer=manjugv

This commit was SVN r31352.
2014-04-08 21:54:10 +00:00
Dave Goodell
19efa09540 plm/slurm: tweak /dev/null usage (#4489)
See the ticket for more details.

cmr=v1.8.1:reviewer=rhc:ticket=4489

This commit was SVN r31351.

The following Trac tickets were found above:
  Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 21:46:07 +00:00
Ralph Castain
957c9ecf53 Okay, silence the anality by simplifying the already irrelevant code, thus allowing us to turn our attention to things that actually matter
Refs trac:4489

This commit was SVN r31348.

The following Trac tickets were found above:
  Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
2014-04-08 19:51:11 +00:00
Ralph Castain
8ce98ccc8d Not sure when this got messed up, but correct the stdout/stderr redirection on the srun command so we don't get all those slurm warnings
cmr=v1.8.1:reviewer=dgoodell:subject=silence srun warning output

This commit was SVN r31308.
2014-04-04 04:23:31 +00:00
Ralph Castain
3fdcaeab97 Fix a problem where we need to abort due to a mapping failure, but we are in a managed environment and thus the orteds have not wired up. Thus, if we send the exit message across the routed network, the remote daemons won't have a way to relay the message along - and we won't exit.
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.

cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup

This commit was SVN r31304.
2014-04-02 04:17:55 +00:00
Ralph Castain
70ee3fb000 Ensure that orted's are not bound to single processors if the TaskAffinity option is set by default. Thanks to Artem Polyakov for the patch, and for his patience in explaining the situation.
Reviewed with Moe Jette to ensure this was correct, and confirmed by me.

RM-approved

cmr=v1.8:reviewer=ompi-gk1.8

This commit was SVN r31288.
2014-03-29 18:30:38 +00:00
Ralph Castain
bd9bd2ff16 Be consistent in our handling of the "only HNP in allocation" case when setting up the VM. Thanks to Tetsuya Mishima for the suggestion.
cmr=v1.8:reviewer=rhc

This commit was SVN r31195.
2014-03-24 15:28:09 +00:00
Ralph Castain
d17f811ff5 Surrender to the tyranny of C++ and give up on enum for node states, as nice as that would be, in favor of retaining memory footprint constraints.
This commit was SVN r31149.
2014-03-19 16:15:24 +00:00
Ralph Castain
0aa23cdc35 Cleanup copy/paste errors to ensure we progress the launch
cmr=v1.7.5:reviewer=rhc

This commit was SVN r31102.
2014-03-18 01:24:49 +00:00
Ralph Castain
45196d222b Minor cleanup of the node state definitions - using the enum allows the debuggers to pretty-print the value
This commit was SVN r31090.
2014-03-17 21:27:58 +00:00
Ralph Castain
b248b27637 Remove a check that prevented mpirun from exiting when it should in the single-node case
Refs trac:4393

This commit was SVN r31080.

The following Trac tickets were found above:
  Ticket 4393 --> https://svn.open-mpi.org/trac/ompi/ticket/4393
2014-03-15 15:25:44 +00:00
Ralph Castain
fbc5e3b773 Deal with the corner case where we encounter an error when attempting to launch a daemon. In this case, we will order abnormal termination before daemons callback to us, and thus any attempt to send them a "die" message will fail. Ensure that mpirun at least exits cleanly in this scenario, thereby allowing the remote daemons that did get launched to commit suicide when comm fails.
cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r31068.
2014-03-14 15:32:30 +00:00
Adrian Reber
7304b700e1 Fix the newly added FT event state when compiling --with-ft
This commit was SVN r30988.
2014-03-11 13:20:08 +00:00
Ralph Castain
7a44af375c Add an FT event state and set the state machine to callback to the OOB base ft event when activated
This commit was SVN r30950.
2014-03-06 02:44:29 +00:00
Ralph Castain
c9465d97b4 Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test.
Thanks to Paul Kapinos for reporting the problem.

cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition

This commit was SVN r30942.
2014-03-05 04:38:17 +00:00
Ralph Castain
0ac97761cc Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= parameter. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.

In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.

Also clean up a couple of issues in the mapping/binding system:

* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed

* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch

* minor cleanup to the warning message when oversubscribed and binding was requested

cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system

This commit was SVN r30909.
2014-03-03 16:46:37 +00:00
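
The new default described in 0ac97761cc can be summarized as a small decision rule. The sketch below is an illustration in Python, not the ORTE implementation; the function name `default_slots` and its parameters are hypothetical, and it covers only the hostfile case the commit message describes.

```python
from typing import Optional


def default_slots(hostfile_slots: Optional[int],
                  cores: int,
                  hwthreads: int,
                  use_hwthreads_as_cpus: bool = False) -> int:
    """Return the slot count for a host named in a hostfile.

    hostfile_slots is the explicit slots= value, or None when the
    user listed the host without slot information.
    """
    if hostfile_slots is not None:
        return hostfile_slots  # an explicit value is used as given
    # No slot information: fall back to the sensed hardware.
    return hwthreads if use_hwthreads_as_cpus else cores
```

For a host with 8 cores and 16 hardware threads, this gives 8 slots by default, 16 when hardware threads are treated as CPUs, and the explicit value whenever slots= is present.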
Ralph Castain
0dc5f50d27 Add a plm component for local-only operation that doesn't require rsh/ssh to be installed. Requested by Fedora packagers for testing purposes.
cmr=v1.7.5:reviewer=jsquyres:subject=Add a plm component for local-only operation

This commit was SVN r30645.
2014-02-09 15:53:10 +00:00
Ralph Castain
1326ed704f Per the RFC discussed here:
http://www.open-mpi.org/community/lists/devel/2014/01/13789.php

add support for async modex when requested.

cmr=v1.7.5:reviewer=jsquyres:subject=Add async modex support

This commit was SVN r30565.
2014-02-05 14:39:27 +00:00
Adrian Reber
fde1040d2f Use unique collective ids for the checkpoint/restart code
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
53b1be5067 Only report launch progress when specifically requested to do so. Thanks to Tetsuya Mishima for spotting it.
Reviewed by rhc and RM-approved

cmr=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r30434.
2014-01-27 15:17:42 +00:00
Ralph Castain
f73d23e723 Correct the location of the counter when tracking process launch for reporting progress
cmr=v1.7.4:reviewer=hjelmn

This commit was SVN r30415.
2014-01-24 21:03:05 +00:00
Ralph Castain
e3cb4b4a5b Grant Nathan his wish - add a --disable-getpwuid to the configure options and protect all users of that code so it disappears if disabled.
cmr=v1.7.5:reviewer=hjelmn:subject=disable getpwuid if requested

This commit was SVN r30413.
2014-01-24 19:18:37 +00:00
Ralph Castain
fcdd904af4 Simplify and update hostfile handling to correctly support hostfiles that list nodes multiple times, once for each slot, and those that list a host once and include an explicit slot count. Eliminate support for mixing those two modes as this logic became just too complex when attempting to handle all the corner cases.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30325.
2014-01-18 16:08:40 +00:00
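
The two hostfile styles named in fcdd904af4 can be illustrated with a small parser. This is a rough Python sketch, not the ORTE hostfile code. It assumes that "mixing" means the same host appearing in both styles, that a repeated slots= entry for a host simply overrides the earlier one, and that `#` starts a comment; `parse_hostfile` is a hypothetical name.

```python
from collections import Counter
from typing import Dict


def parse_hostfile(text: str) -> Dict[str, int]:
    """Map each host to a slot count from one of two listing styles."""
    explicit: Dict[str, int] = {}   # hosts given with slots=N
    repeats: Counter = Counter()    # hosts listed once per slot
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        host = fields[0]
        slots = None
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        if slots is None:
            repeats[host] += 1
        else:
            explicit[host] = slots  # assumption: last entry wins
    mixed = set(explicit) & set(repeats)
    if mixed:
        raise ValueError(f"hosts mix both styles: {sorted(mixed)}")
    return {**repeats, **explicit}
```

A file that lists `node1` twice and `node2 slots=4` once yields two slots for `node1` and four for `node2`; listing `node1` both bare and with slots= is rejected.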
Ralph Castain
4cdc291df1 Ensure slurm properly dies on abnormal termination
cmr=v1.7.4:reviewer=jsquyres:subject=Ensure slurm properly dies on abnormal termination

This commit was SVN r30182.
2014-01-09 16:52:02 +00:00
Ralph Castain
80497d73cf Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require only one component be selected
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30162.
2014-01-08 22:35:48 +00:00
Brian Barrett
8b778903d8 Fix longstanding issue with our multi-project support. Rather than using
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi.  This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.

This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Nathan Hjelm
3be4536d9b Cleanup various leaks in ompi_info reported by valgrind.
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30058.
2013-12-23 17:47:43 +00:00
Ralph Castain
71b52fe861 Ensure that comm_spawn'd procs get user-specified forwarded envars
Thanks to Tim Miller for reporting the regression from the 1.6 series

cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that comm_spawn'd procs get user-specified forwarded envars

This commit was SVN r30012.
2013-12-20 14:47:35 +00:00
Adrian Reber
b42aad44a3 Trying to get the C/R code to compile again. This patch
includes various fixes all over the C/R code which are
hard to group like the other patches.

Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values

Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)

This commit was SVN r29922.
2013-12-16 15:35:28 +00:00
Jeff Squyres
770bf77149 Fix some minor memory leaks in error code paths.
Many thanks to Tom Fogal for the patch.

cmr=v1.7.4:reviewer=rhc:subject=Fix minor memory leaks in error code paths

This commit was SVN r29905.
2013-12-14 00:41:21 +00:00
Jeff Squyres
2e7653e4c2 Add missing argv.h includes.
Noticed these as part of #3694: external libevent's don't cause argv.h
to automatically get included.

Refs trac:3694

This commit was SVN r29897.

The following Trac tickets were found above:
  Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
2013-12-13 21:17:36 +00:00
Ralph Castain
83e59e6761 Once again, the Slurm folks have decided to redefine their envars, reversing what they had previously told us to do. So cleanup the Slurm allocation code, and also adjust to a change in srun behavior that now aborts a job if the ntasks-per-node doesn't get specified when ORTE calls it, but the user specified it when getting an allocation. Sigh.
cmr=v1.7.4:reviewer=miked:subject=Update Slurm allocation and launch

This commit was SVN r29849.
2013-12-09 17:58:46 +00:00
Ralph Castain
f1e510154c Revise the launch timeout detection so we don't mistakenly declare "failed to start". Recognize that timeout is at the per-job level, and define the timeout param as a total value instead of seconds/daemon as it otherwise can get to be an enormous (and useless) number.
Resolves problems in loop_spawn where the timer was incorrectly firing and killing the overall job.

cmr=v1.7.4:reviewer=hjelmn

This commit was SVN r29661.
2013-11-11 23:50:40 +00:00
Ralph Castain
604970a1a2 Initialize orte_coprocessors hash table to NULL. Delay coprocessor detection on HNP until after node topology final definition in case rmaps changes it. Minor spacing change.
Refs trac:3847

This commit was SVN r29504.

The following Trac tickets were found above:
  Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847
2013-10-24 00:08:47 +00:00
Ralph Castain
f5920e9312 Revert r29489. This function only executes in the HNP. In orte/mca/ess/hnp/ess_hnp_module.c, we already check for local coprocessors and add them to the hash table if found. Thus, r29489 simply overwrote what was already present.
The data for each remote daemon is added later in the daemon callback function. Only the HNP retains info in the hash table.

If it is desirable to have each daemon retain its own coprocessor info, then this must be done in orte/mca/ess/base/ess_base_std_orted.c.

This commit was SVN r29497.

The following SVN revision numbers were found above:
  r29489 --> open-mpi/ompi@2e2794fa15
2013-10-23 22:35:24 +00:00
Nathan Hjelm
2e2794fa15 Fix coprocessor detection by always adding the local daemon's co-processors
to the hash table.

Tested and working on a system with 2 Xeon Phi co-processors.

cmr=v1.7.4:ticket=3847:reviewer=ompi-rm1.7

This commit was SVN r29489.

The following Trac tickets were found above:
  Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847
2013-10-23 15:56:23 +00:00
Ralph Castain
960a255e7f Do some cleanup of the --without-hwloc build - no need to work on coprocessors since we can't detect them anyway, cleanup some unused variables in the ppr mapper
This commit was SVN r29476.
2013-10-23 01:45:21 +00:00
Ralph Castain
b12167abef Per a good suggestion from Jeff, make the coprocessor mapping more scalable by using a hash table to cache the coprocessor list, and then do a single pass thru the nodes at the end to assign hostid's.
Refs trac:3847

This commit was SVN r29439.

The following Trac tickets were found above:
  Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847
2013-10-14 22:01:48 +00:00
Ralph Castain
24c811805f ****************************************************************
This change contains a non-mandatory modification
       of the MPI-RTE interface. Anyone wishing to support
       coprocessors such as the Xeon Phi may wish to add
       the required definition and underlying support
****************************************************************

Add locality support for coprocessors such as the Intel Xeon Phi.

Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.

So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:

1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board

2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions

3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.

4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.

5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.

6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.

cmr:v1.7.4:reviewer=hjelmn

This commit was SVN r29435.
2013-10-14 16:52:58 +00:00
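
Items 1 and 2 of 24c811805f split locality into "same host" and "same board", with "same node" requiring both. A minimal Python sketch of that check follows. The bit values and the names `ON_HOST`, `ON_BOARD`, `ON_NODE` and `on_local_node` are illustrative assumptions, not the definitions in the OPAL headers.

```python
# Hypothetical bit values, chosen only for this illustration.
ON_HOST = 0x1    # shares a host, possibly across a coprocessor boundary
ON_BOARD = 0x2   # shares a board
ON_NODE = ON_HOST | ON_BOARD  # "same node" means host AND board


def on_local_node(locality: int) -> bool:
    """True only when both the host and the board bits are set."""
    return (locality & ON_NODE) == ON_NODE
```

Testing with `locality & ON_NODE != 0` would wrongly report a process on an attached coprocessor (host bit only) as being on the local node, which is why both conditions must be checked explicitly.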
Ralph Castain
f4f2287958 Singletons currently start out by spawning an HNP - this is required solely in the cases where the singleton subsequently calls MPI_Comm_spawn or publishes port info without support from an external orte-server. In all other cases, the HNP is of no value and can actually be a detriment by creating additional overhead on the node. This is particularly concerning for async operations where processes may begin as singletons and then dynamically wireup to perform pt2pt communications.
So we now allow singletons to start on their own, only spawning an HNP when initiating an operation that actually requires it.

cmr:v1.7.4:reviewer=jsquyres

This commit was SVN r29354.
2013-10-04 02:58:26 +00:00
Ralph Castain
6522963b9c Flag that a daemon has been launched when it reports back to the HNP so we avoid re-launching it on spawns against dynamic allocations
cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29245.
2013-09-25 16:58:19 +00:00
Ralph Castain
23c8848157 Only connect the first time thru the Torque launch, remove stale code
cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29227.
2013-09-22 23:53:57 +00:00
Ralph Castain
d32dfc96be Use the rankfile to obtain list of nodes for VM launch if/when rankfile is given.
cmr:v1.7.3:reviewer=jsquyres:subject=Obtain VM nodes from rankfile

This commit was SVN r29119.
2013-09-04 16:37:30 +00:00
Ralph Castain
43d1cd92ac Ensure we activate the "daemons launched" state when only the HNP is left or else we will hang.
cmr:v1.7.3:reviewer=jsquyres

This commit was SVN r29094.
2013-08-29 22:50:51 +00:00
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in  https://bitbucket.org/rhc/ompi-oob2


WHAT:    Rewrite of ORTE OOB

WHY:       Support asynchronous progress and a host of other features

WHEN:    Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
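
The subnet-based reachability test that a200e4f865 describes (a peer is reachable through a NIC when its address falls inside the NIC's address/mask range) can be sketched with Python's standard `ipaddress` module. This is an illustration under that stated rule, not the OOB/TCP code; the function name `reachable` is hypothetical.

```python
import ipaddress


def reachable(nic_addr: str, nic_mask: str, peer_addr: str) -> bool:
    """True if peer_addr lies in the subnet given by nic_addr/nic_mask."""
    # strict=False lets a host address (not a network address) define the net.
    network = ipaddress.ip_network(f"{nic_addr}/{nic_mask}", strict=False)
    peer = ipaddress.ip_address(peer_addr)
    # An IPv4 peer can never be inside an IPv6 subnet, and vice versa.
    return peer.version == network.version and peer in network
```

As the commit message notes under its known limitations, a shared-subnet test like this cannot account for peers that are only reachable through a gateway.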
Nathan Hjelm
841ed962f6 fix MCA variable and component system leaks
cmr=v1.7.3:reviewer=rhc

This commit was SVN r29011.
2013-08-09 19:50:28 +00:00
Nathan Hjelm
299d5b3dd7 Fix two debugger attach bugs.
- orte_debugger_init_after_spawn was not being called for debuggers that
  use the MPIR_attach_fifo to co-locate debugger daemons.
- MPIR_Breakpoint was not getting called if a debugger reattached. Add
  a job state (ORTE_JOB_STATE_DEBUGGER_DETACH) to reset mpir_breakpoint_fired
  to false when a debugger detaches to ensure MPIR_Breakpoint is called if
  another debugger attaches. Tested with STAT 2.0/launchmon 1.0.

cmr:v1.7

This commit was SVN r28665.
2013-06-20 16:18:05 +00:00
Ralph Castain
f15fe5045e Ensure that debugger connect can occur by getting the rml contact info updated before calling init_after_spawn
cmr:v1.7.3,reviewer=jsquyres

This commit was SVN r28455.
2013-05-06 22:00:45 +00:00