1
1
Граф коммитов

12 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
0434b615b5 Update ORTE to support PMIx v3
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":

* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch

* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.

* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.

* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Ralph Castain
b643852d8a Properly terminate the job when executable not found
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-26 12:09:24 -08:00
Ralph Castain
ac522a521f Ensure that prun doesn't prematurely exit
Ensure that prun doesn't exit until notified that its own child job
terminated.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-11 19:03:32 -08:00
Ralph Castain
3b71be4db4 Update the scaling script to avoid use of "system" command, thus ensuring that each command sees the same environment. Fix prun to pickup and propagate OMPI MCA params
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-23 16:27:41 -07:00
Ralph Castain
60b338e857 Sync to PMIx v3. Ensure prun uses the ess/tool component.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-14 08:24:57 -07:00
Ralph Castain
388034c814 Add support for the -v (verbose) option to prun and silence the "executing" and "completed" output otherwise.
Debounce "unreachable" notifications for tools when they disconnect
Enable the -x cmd line option for prun

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 0a5b36180a22959654461ac1303cec35313f8b4a)
2017-10-10 12:54:49 -07:00
Ralph Castain
5352c31914 Enable remote tool connections for the DVM. Fix notifications so we "de-bounce" termination calls
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-06 10:47:05 -07:00
Ralph Castain
fe9b584c05 Fully support OMPI spawn options. Fix a bug in the round-robin mappers where we weren't adding nodes to the job map node array, and so resources were not released
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 285d8cfef74ffc899e9c51e1d9c597b7fb2ceb89)
2017-09-21 10:29:27 -07:00
Ralph Castain
e575c4d6f9 Fix tool connection logic so we properly search for default session server, perform specified number of retries, etc.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 7c755e01004f8b86c71f1729662979ea45ab1adb)
2017-09-19 13:35:46 -07:00
Ralph Castain
3c914a7a97 Complete the fix of the ORTE DVM. We will now use "prun" instead of "orterun -hnp foo" to execute jobs. This provides the feature of automatic discovery of the orte-dvm so you don't need to manually enter URI's or contact file locations. All IO is forwarded to prun.
Still in the "needs to be done" category:

* mapping/ranking/binding options aren't correctly supported

* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-16 13:13:07 -07:00
Ralph Castain
7c7d8a69a0 Backport changes from PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
Ralph Castain
bbd83fd4c0 Add a new launcher "prun" for starting applications against the ORTE DVM.
Unlike "orterun", "prun" is a PMIx-only program that discovers the DVM connection instead of requiring that we explicitly provide it. Only build "prun" if PMIx v2.x is available.

This gets the DVM working again, but still is showing problems for multiple executions. I'll detail those in a separate issue. Thus, the DVM should still be considered "broken".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-12 21:40:41 -07:00