1
1

211 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
b5bf0a7f1d Add a new posix_spawn component to the ODLS framework.
Only selectable when specifically requested via "-mca odls pspawn"

Note that there are several concerns:
  * we aren't getting SIGCHLD calls when the procs terminate
  * we aren't seeing the IO pipes close on termination, though
    we are getting output forwarded to mpirun
  * I haven't found a way to bind the child process prior to exec.
    If we want to use this method, we probably need someone to
    implement a cgroup component for the orte/rtc framework

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-30 18:01:31 -08:00
Ralph Castain
3906aaf41a Silence warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-25 11:50:18 -08:00
Ralph Castain
30f23ac67a Save one more file descriptor per process by not opening one for stddiag
if PMIx (version > 1.x) is active since all diagnostic messages will instead flow thru
the PMIx connection. Unfortunately, PMIx v1 does not support this
feature, but we can remove the stddiag support once PMIx v1 slides out
of the support window

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-25 11:48:53 -08:00
Gilles Gouaillardet
e88767866e iof: optimize handling of stderr when iof_base_redirect_app_stderr_to_stdout is set
avoid creating a pipe for a task stderr when we know it will be redirected to stdout

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-11-24 13:20:03 +09:00
Gilles Gouaillardet
84e96522f2 iof: optimize handling of stdin
since some tasks migth end up having /dev/null as their stdin,
simply avoid pipe creation and destruction for these tasks.

From a pragmatic and MPI point of view, and unless explicitly required
otherwise, all MPI tasks but (the first) one end up with /dev/null
as their stdin.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-11-22 13:18:32 +09:00
Joshua Hursey
e1d079544b mca: Dynamic components link against project lib
* Resolves #3705
 * Components should link against the project level library to better
   support `dlopen` with `RTLD_LOCAL`.
 * Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
   with the appropriate project level library:
```
MCA components in ompi/
       $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
       $(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
       $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
       $(top_builddir)/oshmem/liboshmem.la"
```

Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-08-24 11:56:16 -04:00
Ralph Castain
c6c0258cd8 Need to signal -pgrp to get to all members of a process group.
Thanks to Ted Sussman for the report and patience in tracking it down

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-27 12:10:34 -07:00
Ralph Castain
206aec6083 By default, apply signals to all direct children _and_ any children they might have spawned (so long as they remain in the same process group). Provide an MCA param (odls_base_signal_direct_children_only) to indicate that the signal is to go _only_ to our direct children, and not be delivered to any children spawned by those procs.
Refs https://www.mail-archive.com/users@lists.open-mpi.org/msg31221.html

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-15 12:26:11 -07:00
Ralph Castain
93cf3c7203 Update OPAL and ORTE for thread safety
(I swear, if I look this over one more time, I'll puke)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 12:30:57 -07:00
Gilles Gouaillardet
16fc0996e6 odls: fix handling of the orte fork agent
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-05-08 16:07:13 +09:00
Ralph Castain
1585854335 Minor coverity cleanups
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-12 19:31:35 -07:00
Mark Santcroos
27fa8aabd6 Hardcode basename to "orted" for error reporting.
Signed-off-by: Mark Santcroos <mark.santcroos@rutgers.edu>
2017-04-11 18:59:23 -04:00
Mark Santcroos
af3a6e1a29 Verify that the chdir(2) succeeds.
Signed-off-by: Mark Santcroos <mark.santcroos@rutgers.edu>
2017-04-12 00:37:37 +02:00
Ralph Castain
75684dc260 Resolve a race condition for setting our working directory when fork/exec'ing application procs. We have to ensure we do it after the fork occurs since we want to use multiple threads in the odls. Otherwise, the different threads are bouncing the entire process around.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 13:54:03 -07:00
Ralph Castain
dc85e7fde7 Provide a little more help on the error messages when an executable isn't found so we have some better idea where we were looking for it. Don't double-report such errors. Ensure the ORTE_ERROR_NAME doesn't get a NULL back for the string name of an error code as that might cause some systems to segfault
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-17 09:54:37 -07:00
Ralph Castain
70591bf4dc Enable parallel fork/exec of local procs by providing the option of multiple odls progress threads
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-11 20:48:04 -08:00
Jeff Squyres
fec519a793 hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-02-28 07:48:42 -08:00
Ralph Castain
85a634926b Update signal handling to introduce a pause between SIGCONT and SIGTERM, followed by another pause before SIGKILL. Do this within the odls/kill_local_procs function while we know we are blocked in an event, and before the daemon shuts down the event progress loop
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-06 12:34:42 -08:00
Ralph Castain
88313debc2 Per discussion on email thread, restore placement of child procs in their own process group so that any signal sent to one of our children is automatically propagated to any child process they might have spawned.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-02 03:36:22 -08:00
Ralph Castain
0ba9572f9f Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT
Update alps component
2016-06-02 17:48:21 -07:00
Jeff Squyres
dd9a819a1c odls_default: do not opal_output() while creating a process!
It is verbotten to use opal_output() after the fork() but before the
exec()!  It results in all manner of undefined behavior.  For example,
on some OS X systems, if you run a trivial "hello world" MPI program
with a high level of ODLS verbosity:

```sh
$ mpirun -np 3 --mca odls_base_verbose 100 ./hello_c
```

You will see a bunch of output from the mpirun ODLS base, but then it
*may* hang in odls_default_module.c:do_child() -- after the fork() but
before the exec() -- while trying to opal_output() some debugging
statements.

The solution is to remove these extraneous opal_output() statements.
Indeed, the ODLS base is already outputting the same information that
these opal_output() statements are trying to emit, anyway.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-24 21:28:57 -04:00
Ralph Castain
d72c1c72ff Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on". 2016-03-06 17:55:09 -08:00
Ralph Castain
4a55fba414 Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration. 2016-03-02 15:01:01 -08:00
Ralph Castain
06c3dfc052 Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP.
* Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons.
* Create a new errmgr component for the DVM to handle the differences.
* Cleanup the DVM state component.
* Add ORTE bindings directory and brief README
* Pass a local tool index around to match jobs.
* Pass the jobid on job completion.
* Fix initialization logic.
* Add framework for python wrapper.
* Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing.
* Add some missing options to orte-dvm
* Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests.
* It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless.

Extend the cmd_line utility to easily allow layering of command line definitions

Catch up with ORTE interface change and make build more generic.

Disable "fixed dvm" logic for now.

Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code

Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun.

Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten.

Add new commands to get_orted_comm_cmd_str().

Move ORTE command line options to orte_globals.[ch].

Catch up with extra orte_submit_init parameter.

Add example code.

Add documentation.

Bump version.

The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail.

Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb.

Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched.

Extend the error reporting capability to job completion as well.

Add index parameter to orte_submit_job().

Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD.

Factor out dvm termination.

Parse the terminate option at tool level.

Add error string for ORTE_ERR_JOB_CANCELLED.

Add some safeguards.

Cleanup and/of comments.

Enable the return.

Properly ORTE_DECLSPEC orte_submit_halt.

Add orte_submit_halt and orte_submit_cancel to interface.

Use the plm interface to terminate the job
2016-02-13 08:10:44 -08:00
Ralph Castain
f28448702a Eliminate malloc by utilizing /proc/self/fd - optimization 2015-09-22 07:24:54 -07:00
Ralph Castain
92ae386a34 As Jeff proposed, change the check to looking for the filename's first character to be a digit 2015-09-21 08:22:58 -07:00
Ralph Castain
c167acc5a7 Cleanup the odls "close file descriptor" commit to conform to OMPI coding standards and remove memory leaks 2015-09-19 20:46:36 -07:00
Piotr Lesnicki
1dd5487fae odls: close only used file descriptors at fork/exec 2015-09-18 16:44:57 +02:00
Jeff Squyres
31b329e585 odls default: ensure to initialize opts
This fixes CID 71127.
2015-08-12 05:27:37 -07:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would backup beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Nathan Hjelm
45e053dbce orte: use C99 subobject naming for component initialization
This commit helps future-proof orte components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Howard Pritchard
bf89131f9e add owner files to opa/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Elena
03fc809bc9 This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory. 2014-11-06 16:01:19 +02:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “lmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Ralph Castain
5dbf4a62c4 Cleanup: we were accidentally killing ourselves (bad idea)
Refs trac:4717

This commit was SVN r32022.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 20:38:42 +00:00
Ralph Castain
42bf7466fc This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on.
Refs trac:4717

This commit was SVN r32019.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 17:57:51 +00:00
Ralph Castain
8736a1c138 Per RFC:
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php

Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).

This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Ralph Castain
5602156a1c Use the correct abstraction layer name for the data dirs
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
87d809eefe Add a new "run-time controls" framework for setting controls on processes. Initially, just move the process binding code there under a new "hwloc" component. Additional components to support cgroups, power settings, etc. to follow
This commit was SVN r31633.
2014-05-05 19:22:06 +00:00
Jeff Squyres
e1655ae68d opal/util/fd.c: add new convenience function for setting FD_CLOEXEC
Paul Hargrove pointed out that Stevens tells us that we should
FD_GETFL before FD_SETFL.  And so we shall.

Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.

Refs trac:4550

This commit was SVN r31513.

The following Trac tickets were found above:
  Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 13:04:49 +00:00
Jeff Squyres
3da579139b More corrections w.r.t. process groups
To accompany r31092 and r310924, also ensure to create a new process
group in the child right after the orted forks.  Add trivial configury
to ensure that we have setpgid, and only do the setpgid/getpgid if we
have setpgid.

Without this commit, killing the entire process group can do
unexpected things (e.g., kill the orted, mpirun, and even mpirun's
parent!).

cmr=v1.7.5:reviewer=rhc

This commit was SVN r31132.

The following SVN revision numbers were found above:
  r31092 --> open-mpi/ompi@99c9ecaed0

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r310924
2014-03-18 21:31:01 +00:00
Ralph Castain
545ac7dc58 Remove the job_control_forwarding logic as we want *any* signal to go to all members of the process group
Refs trac:4404

This commit was SVN r31094.

The following Trac tickets were found above:
  Ticket 4404 --> https://svn.open-mpi.org/trac/ompi/ticket/4404
2014-03-17 22:45:33 +00:00
Ralph Castain
99c9ecaed0 Ensure that we send the specified signal to the entire process group of each member of the pid provided to us. This ensures that any children spawned by our children also see the signal
cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r31092.
2014-03-17 22:12:15 +00:00
Ralph Castain
081669b440 When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it
cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings

This commit was SVN r30968.
2014-03-10 15:53:07 +00:00
Ralph Castain
c9465d97b4 Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test.
Thanks to Paul Kapinos for reporting the problem.

cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition

This commit was SVN r30942.
2014-03-05 04:38:17 +00:00
Ralph Castain
0ac97761cc Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= paramater. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.

In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.

Also cleanup some a couple of issues in the mapping/binding system:

* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed

* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch

* minor cleanup to the warning message when oversubscribed and binding was requested

cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system

This commit was SVN r30909.
2014-03-03 16:46:37 +00:00