1
1
Граф коммитов

990 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
4603852740 orterun: use consistent CLI option name for --bind-to
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different).  For example:

    mpirun --bind-to cpu-list:ordered ...

Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.

Also add some minor updates to the orterun.1in man page for
clarification.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-21 08:22:00 -07:00
Ralph Castain
d2838139e4 Update man and help output for new binding option
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-21 06:36:11 -07:00
Ralph Castain
06eb51161f Rename prun ompi-prun
This is a minor abstraction break in naming, but hopefully acceptable for now. I will update the contents of the program a little later. This resolves the immediate issue of naming conflict with the PRRTE binary.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-06 07:40:03 -07:00
Jeff Squyres
7a6e8cac58 orterun.1in: fix typo
Found via https://github.com/open-mpi/ompi-www/pull/61.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-09 14:13:24 -04:00
Joshua Hursey
ccb4f43c9b Fix MPIR_proctable structure visibility
* The `MPIR_PROCDESC` structure needs to be visible even in optimized
   builds so that debuggers can attach to `mpirun` and properly read the
   `MPIR_proctable`.
 * In the v2.0.x and v2.x series this structure resided in the `orterun`
   directory and included the `CFLAGS` fix included here. This code
   moved in the v3.x series and the `CFLAGS` did not move causing this
   issue.
   - Instead of applying the debug `CFLAGS` globally to libopen-rte,
     only apply them to the `orted_submit.c` compile which contains the
     MPIR symbols.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2018-03-09 21:15:28 -05:00
Ralph Castain
0434b615b5 Update ORTE to support PMIx v3
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":

* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch

* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.

* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.

* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Ralph Castain
af07b3df89 Update help and man pages for output-filename
Warn that relative path will be converted to absolute path, meaning that the file system on remote nodes must be the same as on the node where mpirun is executed.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-13 15:33:33 -08:00
Ralph Castain
b643852d8a Properly terminate the job when executable not found
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-26 12:09:24 -08:00
Ralph Castain
ac522a521f Ensure that prun doesn't prematurely exit
Ensure that prun doesn't exit until notified that its own child job
terminated.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-11 19:03:32 -08:00
Ralph Castain
3b71be4db4 Update the scaling script to avoid use of "system" command, thus ensuring that each command sees the same environment. Fix prun to pickup and propagate OMPI MCA params
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-23 16:27:41 -07:00
Ralph Castain
60b338e857 Sync to PMIx v3. Ensure prun uses the ess/tool component.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-14 08:24:57 -07:00
Ralph Castain
388034c814 Add support for the -v (verbose) option to prun and silence the "executing" and "completed" output otherwise.
Debounce "unreachable" notifications for tools when they disconnect
Enable the -x cmd line option for prun

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 0a5b36180a22959654461ac1303cec35313f8b4a)
2017-10-10 12:54:49 -07:00
Ralph Castain
5352c31914 Enable remote tool connections for the DVM. Fix notifications so we "de-bounce" termination calls
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-06 10:47:05 -07:00
Ralph Castain
7d538c661c Initialize variable and correctly compare against success
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-29 16:07:42 -07:00
Ralph Castain
fe9b584c05 Fully support OMPI spawn options. Fix a bug in the round-robin mappers where we weren't adding nodes to the job map node array, and so resources were not released
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 285d8cfef74ffc899e9c51e1d9c597b7fb2ceb89)
2017-09-21 10:29:27 -07:00
Ralph Castain
e575c4d6f9 Fix tool connection logic so we properly search for default session server, perform specified number of retries, etc.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 7c755e01004f8b86c71f1729662979ea45ab1adb)
2017-09-19 13:35:46 -07:00
Ralph Castain
ed508010b4 Remove stale tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-18 07:30:47 -07:00
Ralph Castain
3c914a7a97 Complete the fix of the ORTE DVM. We will now use "prun" instead of "orterun -hnp foo" to execute jobs. This provides the feature of automatic discovery of the orte-dvm so you don't need to manually enter URI's or contact file locations. All IO is forwarded to prun.
Still in the "needs to be done" category:

* mapping/ranking/binding options aren't correctly supported

* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-16 13:13:07 -07:00
Ralph Castain
7c7d8a69a0 Backport changes from PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
Ralph Castain
3f8908871b Since the DVM is now tied to prun, don't build the DVM either unless prun can be built
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-13 11:55:10 -07:00
Ralph Castain
589cc03d8e Only build prun if building --with-devel-headers
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-13 11:38:11 -07:00
Ralph Castain
bbd83fd4c0 Add a new launcher "prun" for starting applications against the ORTE DVM.
Unlike "orterun", "prun" is a PMIx-only program that discovers the DVM connection instead of requiring that we explicitly provide it. Only build "prun" if PMIx v2.x is available.

This gets the DVM working again, but still is showing problems for multiple executions. I'll detail those in a separate issue. Thus, the DVM should still be considered "broken".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-12 21:40:41 -07:00
Ralph Castain
b225366012 Bring the ofi/rml component online by completing the wireup protocol for the daemons. Cleanup the current confusion over how connection info gets created and
passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors.

Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.

Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi
ll be beneficial at large scales. Leave it "off" by default.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-20 21:01:57 -07:00
Ralph Castain
8a4565874e Enable ORTE to continue running when a node fails - user takes responsibility for zombies. Minor cleanup to orte-clean
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-27 09:05:26 -07:00
Ralph Castain
38636f4f0a Ensure we properly cleanup on termination, including when terminating due to ctrl-c
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-21 06:33:37 -07:00
Ralph Castain
2aa286c9d0 Update orte-clean so it cleans legacy session directories as well as pmix artifacts
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-20 17:46:39 -07:00
Gilles Gouaillardet
72c7329462 configury: use 'uname -n' when 'hostname' is not available
the 'hostname' command might not be available on some platforms
such as Fedora Core 26, so mimick config/libtool.m4 and fallback
to 'uname -n' if needed

Refs. #3680

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-06-12 15:04:32 +09:00
Ralph Castain
93cf3c7203 Update OPAL and ORTE for thread safety
(I swear, if I look this over one more time, I'll puke)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 12:30:57 -07:00
Ralph Castain
9f60cd0fe7 Update the connect/accept support so we check to see if we have the proper infrastructure and RTE support, including whether we have ompi-server available if the connect/accept spans multiple applications. Print pretty help messages in all cases where we do not have support
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-27 10:47:08 -07:00
Ralph Castain
9cb18b8348 Merge pull request #3280 from rhc54/topic/dvm
Fix the DVM by ensuring that all nodes, even those that didn't partic…
2017-04-04 18:15:33 -07:00
Ralph Castain
74863a0ea4 Fix the DVM by ensuring that all nodes, even those that didn't participate (i.e., didn't have any local children) in a job, clean up all resources associated with that job upon its completion. With the advent of backend distributed mapping, nodes that weren't part of the job would still allocate resources on other nodes - and then start from that point when mapping the next job. This change ensures that all daemons start from the same point each time.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 17:31:38 -07:00
Nathaniel Graham
19e5d15491 mpirun --help output revamp
This commit modifies the output from the mpirun --help
command.  The options have been split into groups, to
make the output smaller and more readable.  The groups
are: general, debug, output, input, mapping, ranking,
binding, devel, compatibility, launch, dvm, and
unsupported. There is also a special "full" command
that can be used to get the old behaviour of printing
out all of the options.  Unsupported options may only
be seen with this full output.

This commit also adds a special case for the help
argument.  It makes it possible for the user to
enter 0 or 1 arguments instead of having to always
enter an argument.  This defaults to printing out
the "general" help options so the user can then
see what help arguments there are.

Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
2017-04-04 10:59:32 -06:00
Jeff Squyres
a333cf691a orte: minor tweaks to run-as-root message
Two updates:

1. Remove the "run as root" error message from orterun.c, because that
   functionality is now in orted_submit.c (although it is still
   required in orte-dvm.c -- so sync the message in orted_submit.c and
   orte-dvm.c to be identical).
2. Slightly tweak the text of the "run as root" error message to
   explicitly state that we (strongly) suggest running as a non-root
   user (and add a little whitespace).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-03-27 04:50:21 -07:00
Ralph Castain
dc85e7fde7 Provide a little more help on the error messages when an executable isn't found so we have some better idea where we were looking for it. Don't double-report such errors. Ensure the ORTE_ERROR_NAME doesn't get a NULL back for the string name of an error code as that might cause some systems to segfault
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-17 09:54:37 -07:00
Thomas Naughton
74f8c2ae30 orte-clean: fix bad username/uid usage, add orte-dvm
This fixes a mismatch between PS listing that returned
USERNAME but code was pruning based on UID.

This changes the OPAL_PS_FLAVOR_CHECK format to return
'uid' instead of 'user'.  (Note: Avoiding call to
getlogin_r() but assuming UID is uniform on system,
same assumption exists for session dir anyway.)

Note, still maintains behavior from man page for root
running orte-clean on node (kills all orteds).

Adds 'orte-dvm' to list of procnames that will be checked/killed.

Signed-off-by: Thomas Naughton <naughtont@ornl.gov>
2017-02-28 08:00:06 -05:00
Nathaniel Graham
f9c05bdb03 Update the mpirun man page
This update should fix the mpirun man page so all
mpirun command line options are included, and
mpirun commands that have been removed are no
longer in the man page.  I also fixed some of
the file formatting, and bolding of command
parameters.

Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
2017-02-15 17:24:28 -07:00
Ralph Castain
cfce565ce9 Merge pull request #2763 from naughtont3/tjn-ortedvm-daemonize
dvm: add daemonize and set-sid options
2017-01-20 08:08:21 -08:00
Thomas Naughton
39d335a277 dvm: add daemonize and set-sid options
Signed-off-by: Thomas Naughton <naughtont@ornl.gov>
2017-01-20 09:28:26 -05:00
Ralph Castain
19bb64cfb8 If a tool sees the HNP it is attached to die (thereby losing connection), then stop the event loop instead of going through the abort code path. This will allow the tool to cleanup before exiting
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-19 14:04:06 -08:00
Ralph Castain
2fb9e7cc2b Add some missing qualifiers to the wrapper compilers for -lopen-rte and -lopen-pal
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-16 19:24:16 -08:00
Ralph Castain
5c87fc10dc Update the mpirun man page to accurately reflect the use of -host
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-12 20:28:53 -08:00
Ralph Castain
ef3f748d0d Transfer some minor cleanups back from the PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-23 08:46:04 -08:00
Ralph Castain
1e2019ce2a Revert "Update to sync with OMPI master and cleanup to build"
This reverts commit cb55c88a8b.
2016-11-22 15:03:20 -08:00
Ralph Castain
cb55c88a8b Update to sync with OMPI master and cleanup to build
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-22 14:24:54 -08:00
Ralph Castain
649301a3a2 Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Gilles Gouaillardet
628c730196 pkgconfig: define the pkgincludedir variable in *.pc files
this has been made necesarry with open-mpi/ompi@12e796dcaf

Refs open-mpi/ompi#2069
2016-09-13 09:50:14 +09:00
Jeff Squyres
fd829ac389 Merge pull request #1982 from jsquyres/pr/fix-pkg-config-static
pkg-config: fix static linking
2016-09-07 14:55:50 -04:00
Ralph Castain
4e0788e9ad Enable PSM to support dynamic processes
Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object
2016-09-02 10:22:04 -07:00
Jeff Squyres
fb894e6e3e pkg-config: fix static linking
We need to list all major project libraries in the private libraries
line to enable static linking to work properly.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-17 20:37:51 -05:00
Gilles Gouaillardet
273e56096b configury: capture configury command line
configury command line is quoted and made available via the OPAL_CONFIGURE_CLI macro.
it can be retrieved via {orte-info,ompi_info,oshmem_info} -c, or
{orte-info,ompi_info,oshmem_info} --all --parseable | grep ^config:cli:
2016-07-29 09:14:09 +09:00