Commit graph

634 commits

Author SHA1 Message Date
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated
ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
0b1d4b62be Cleanup some cruft and update to coordinate with CM operations:
* don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway
* ensure state/orted component disqualifies itself from CM operations
* clarify the DVM proc_type definitions
* ensure we stop littering the tmp dir with session directories
2015-08-12 10:32:14 -07:00
Howard Pritchard
1b55d14dff plm/alps: remove unneeded env. variable setting
In order to address issue #741, the orted's now are
always launched with the Cray PMI environment variables

PMI_NO_FORK
PMI_NO_PREINITIALIZE

set to disable running of the library's ctor.
So there's no longer a need to set these for the
application(s) being launched by the orted's.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-08-05 13:27:18 -07:00
Ralph Castain
023936e84b Silence coverity warnings 2015-07-29 07:28:08 -07:00
Howard Pritchard
70096d3753 plm/alps: fix orted based launch failures.
Turns out that when one builds Open MPI with --disable-dlopen
for Cray, a whole bunch of cray specific libraries get linked
in to the orted executable.  One of these is Cray PMI.  The
Cray PMI has a ctor which, if run, causes job launches using
mpirun to fail.  This commit suppresses the running of the
ctor and thus prevents failure to launch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-07-23 15:07:57 -07:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
869b2891c4 When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary 2015-06-17 09:20:08 -07:00
Ralph Castain
c21cd1c91e Ensure the ssh session is dead 2015-05-23 08:14:29 -07:00
Ralph Castain
920562d9b4 Ensure that all ssh sessions are terminated when abnormally terminating the job 2015-05-23 08:14:29 -07:00
Gilles Gouaillardet
2e384a3b65 initialize common symbols from orte
A few uninitialized common symbols are remaining (generated by flex) :
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
8e3f0b1d33 Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support 2015-05-06 15:18:15 -07:00
Ralph Castain
7d1980ba83 Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param. 2015-04-30 20:35:23 -07:00
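The slot shorthand described in the commit above can be illustrated with a small sketch. This is not the ORTE implementation: the function name `parse_host_list`, the use of `None` to mean "let the topology decide", and the handling of a host that appears both with and without `:*` are assumptions made for illustration only.

```python
def parse_host_list(spec):
    """Parse a comma-separated --host value into {hostname: slots}.

    slots is an int, or None when "host:*" asks for the slot count
    to be derived from the node topology (hypothetical convention).
    """
    slots = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, count = entry.partition(":")
        if not sep:
            requested = 1          # bare name => one slot per mention
        elif count == "*":
            requested = None       # topology decides
        else:
            requested = int(count)  # "foo:3" => three slots
        # Assumption: once a host is marked "*", it stays topology-driven.
        if requested is None or (name in slots and slots[name] is None):
            slots[name] = None
        else:
            slots[name] = slots.get(name, 0) + requested
    return slots


print(parse_host_list("foo,foo,bar:3,baz:*"))
```

Under these assumptions, naming `foo` twice gives the same result as `foo:2`, matching the "one slot per copy" rule in the commit message.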
Jeff Squyres
11e8c2096b plm rsh: assign some levels to the rsh PLM MCA params 2015-04-20 16:18:57 -07:00
Nathan Hjelm
45e053dbce orte: use C99 subobject naming for component initialization
This commit helps future-proof orte components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
Ralph Castain
34b53ac3dc Silence Coverity warnings 2015-04-18 07:48:22 -07:00
Ralph Castain
12bfb27161 Redo in cleaner form: Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command 2015-04-17 16:11:37 -07:00
Nathan Hjelm
3436f2917d Merge pull request #449 from hjelmn/mca_base_update
mca/base update
2015-04-16 08:41:48 -06:00
Ralph Castain
d9c555b547 Revert "Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command"
This reverts commit open-mpi/ompi@278324c52a.

Revert "Add the ability to pass args to the rsh/ssh command line"

This reverts commit open-mpi/ompi@6f227f8564.
2015-04-16 08:03:14 -06:00
Ralph Castain
278324c52a Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command 2015-04-15 20:30:04 -06:00
Ralph Castain
6f227f8564 Add the ability to pass args to the rsh/ssh command line 2015-04-15 20:07:13 -06:00
Ralph Castain
91e1cbf284 Init variable 2015-04-11 07:44:57 -07:00
Ralph Castain
3e44d3c9e3 Enable singletons to run without any active OOB module until they attempt to comm_spawn 2015-04-10 14:06:42 -07:00
Ralph Castain
9f8ae59162 Properly enclose the different && clauses 2015-04-01 18:48:25 -07:00
Ralph Castain
57c21d5209 Ensure the DVM flows thru the "daemons reported" state 2015-04-01 16:47:34 -07:00
Mike Dubman
58d002098b Merge pull request #474 from elenash/master
Introduce -tune command line option to set env vars and mca params from ...
2015-04-01 08:23:34 +03:00
Ralph Castain
6f9140a341 Add a little more debug to launch 2015-03-31 20:10:21 -07:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Elena
90f5b2bb84 Introduce -tune command line option to set env vars and mca params from file 2015-03-26 18:33:53 +02:00
Ralph Castain
6aa33deafb Remove debug 2015-03-25 19:58:51 -07:00
Ralph Castain
6ba76ed8d8 Per user request, we allow -host to specify a host that is not included in a hostfile (however, we reject it if we were given an allocation by a resource manager). Since we cannot know if an IP addr form references the same node that was previously given as a string name, we have no choice but to assume they are different. Get the topology from the right place in that situation so mpirun can succeed. 2015-03-25 06:16:01 -07:00
Ralph Castain
43a3baad5e Ensure we use the first compute node's topology for mapping
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.

Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.

Correctly count the number of available PUs under each object when given a cpuset

Fix the default binding settings, and correctly count PUs when no cpuset is given

Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Gilles Gouaillardet
2ab9a411f8 plm/base: fix misc memory leaks
as reported by Coverity with CIDs 1196733 and 1196745
2015-03-09 16:25:07 +09:00
Gilles Gouaillardet
7de3f35b90 plm/rsh: fix misc memory leaks
as reported by Coverity with CIDs 71091, 71230, 71231, 72274, 72389,
1196718 and 1196719
2015-03-05 20:03:37 +09:00
Jeff Squyres
05f00aface plm base: ensure mca_base_var_get_value() and mca_base_var_find() succeed
This was CID 993712
2015-02-24 15:48:50 -05:00
Jeff Squyres
e2223cd9bf plm_rsh: ensure cwd array is \0-terminated
This was CID 72257
2015-02-24 15:24:08 -05:00
Howard Pritchard
bf89131f9e add owner files to opal/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Ralph Castain
3ae3b96c17 Fix master compilation - a buried header dependency must have been removed. 2015-02-10 07:22:10 -08:00
Ralph Castain
a3275aa867 Once again, fix the blasted singleton comm_spawn 2015-02-05 17:34:25 -08:00
Ralph Castain
2b0b012460 Continue refinement of the DVM operations. Send the spawn request to the right place (it helps) as it isn't a comm_spawn request and has to be treated a little differently. Ensure IO gets forwarded back to the tool. Ensure the tool outputs show_help locally as there is no place to send it. 2015-02-04 06:21:54 -08:00
Ralph Castain
ec5ccb76cf Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time. 2015-01-30 11:00:43 -08:00
Howard Pritchard
f34dd5f5fd plm/alps: update copyright 2015-01-07 12:33:38 -07:00
Howard Pritchard
c454d11b01 plm/alps: fix orted abort hang problem
Turns out the alps plm component wasn't changing the state
of the job upon terminating the orted's in the case of
an abnormal termination.  This caused mpirun to hang
with a zombie'd aprun process if an orted on a node
in the job was killed via signal.
2015-01-07 12:31:41 -07:00
Jeff Squyres
7b43bdc984 plm base: move flag inside the #if in which it is used
Avoid a compiler warning by declaring the tflag only inside the #if in
which it is used (i.e., if hwloc support is built).
2014-12-18 10:56:23 -08:00
Ralph Castain
bb529ebd8e Revise the way we handle hetero nodes as users are finding this (a) a significant surprise, and (b) confusing as to when it is required. So try to automate it a bit by creating a topology "signature" that mpirun can share on the cmd line with the remote daemons, thus allowing them to check to see if they match. This isn't comprehensive of course - for now, it only checks the number of each type of hwloc object on the node. This is good enough to pick up major differences (e.g., where we have different numbers of sockets or assigned core bindings).
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.

Will obviously require further refinement as we find out which variances it can detect, and which it cannot.
2014-12-08 15:38:14 -08:00
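The topology "signature" idea in the commit above, counting each type of hwloc object and comparing the result across nodes, can be sketched roughly as follows. The signature string format, the function names, and the plain list of type names standing in for an hwloc topology are all hypothetical; the actual ORTE encoding is not shown in this log.

```python
from collections import Counter


def topology_signature(object_types):
    """Build a signature from the count of each object type on a node.

    object_types is a flat list of type names (e.g. "socket", "core"),
    standing in for a walk over a real hwloc topology.
    """
    counts = Counter(object_types)
    # Sort by type name so equal counts always give the same string.
    return ":".join(f"{counts[t]}{t}" for t in sorted(counts))


def topologies_match(local_sig, remote_sig):
    """A daemon would compare its own signature with the one it was sent."""
    return local_sig == remote_sig


head_node = ["socket"] + ["core"] * 2 + ["pu"] * 4
compute_node = ["socket"] * 2 + ["core"] * 4 + ["pu"] * 8

print(topology_signature(head_node))
print(topologies_match(topology_signature(head_node),
                       topology_signature(compute_node)))
```

As the commit message notes, a count-based check is not comprehensive: two nodes with the same object counts but different layouts would produce the same signature, which is why the hetero-nodes flag is retained.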
Ralph Castain
c4002a8485 Further cleanups on the LSF integration - the affinity file is apparently always present, but simply empty if affinity wasn't set. 2014-12-04 12:24:35 -08:00
Ralph Castain
c88f181efe Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate. 2014-12-03 18:11:17 -08:00
Jeff Squyres
a3af7d6dbb Revert "lsf configury: add dependent libraries for static linking"
This reverts commit 56cfa90dda.
2014-12-03 13:32:56 -08:00
Jeff Squyres
92c2ff91ec Revert "Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic."
This reverts commit open-mpi/ompi@32bf0e7b7e.
2014-12-03 13:15:20 -08:00