1
1

625 Коммитов

Автор SHA1 Сообщение Дата
rhc54
341ab683de Merge pull request #2532 from rhc54/topic/pmixptl
Update to latest PMIx master + PTL branch
2016-12-07 17:28:22 -08:00
Gilles Gouaillardet
123036dbf8 ess/base: invoke orte_routed.update_routing_plan() earlier
fix an issue that can be evidenced with two nodes
n0$ mpirun --host n1:1 --mca oob_tcp_static_ipv4_ports 1234 -np 1 --mca routed radix --mca oob tcp true

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-07 17:19:25 +09:00
Ralph Castain
fbed2d794a Update to latest PMIx master + PTL branch
Update the usock component to disable it

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-06 20:47:44 -08:00
Howard Pritchard
2cbc0e8472 pmix/cray: fix disable-dlopen problem
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopn caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-21 13:45:10 -06:00
Ralph Castain
188880be3f Since static ports are only used by ORTE if the runtime option is given,
there is no need for a configure option as well - so remove the
--enable-orte-static-ports configure option. When decoding the daemon
nidmap, mark new daemons as ALIVE by default - we will discover dead
ones as we go.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-04 05:01:42 -07:00
Ralph Castain
64873487b4 Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-01 12:17:11 -07:00
Ralph Castain
b8c5d1ad88 Update the routed components as we no longer need to init_routes. Fixes case of direct launch via srun
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-31 12:38:13 -07:00
Gilles Gouaillardet
fb5bcc47ce ess/singleton: use opal_setenv instead of putenv
so it fixes a memory leak on finalize

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-28 09:32:30 +09:00
Ralph Castain
f298f294e1 Update PMIx to latest master tarball. Ensure we set the HNP name for orted's so that PMIx_Lookup can find the server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-26 15:48:56 -07:00
Ralph Castain
227d4d9609 Open the conduits for application procs - we probably can remove all the
RML-related frameworks from MPI applications now, but let's wait a bit
to ensure we have cleaned up all the points where messaging might occur.
2016-10-24 16:53:19 -07:00
Ralph Castain
649301a3a2 Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Ralph Castain
7be607582e ORTE applications need to commit any modex send's prior to calling fence 2016-10-18 09:22:56 -07:00
Gilles Gouaillardet
451b9dc467 ess: tear down pmix (if any) before oob 2016-10-13 14:08:02 +09:00
Ralph Castain
a2919174d0 Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport.
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.

While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress
2016-10-11 16:01:02 -07:00
Gilles Gouaillardet
0931d09afa ess/singleton: silence a valgrind warning
initialize a pointer and keep valgrind happy about it
2016-09-27 15:22:39 +09:00
Gilles Gouaillardet
f9ebba4668 ess/singleton: only realloc() when required in fork_hnp() 2016-09-23 16:35:59 +09:00
Gilles Gouaillardet
c7bf9a0ec9 ess/singleton: fix read on the pipe to spawn'ed orted
and close the pipe on both ends when it is no more needed
2016-09-22 14:21:52 +09:00
Gilles Gouaillardet
83399adb3f singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton 2016-09-20 14:56:58 +09:00
Gilles Gouaillardet
e7ae6975d0 orted: fix spawn in singleton mode
in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports()
and send the transport key back to the singleton
2016-09-20 14:39:22 +09:00
Ralph Castain
a16b3cc33d Fix some minor complaints - missing "void" in function parameters 2016-09-15 15:18:42 -07:00
Ralph Castain
6f086189e6 Fix trivial typo 2016-09-15 13:10:55 -07:00
Gilles Gouaillardet
11ebf3ab23 ess/singleton: when forking hnp, use the PMIX_NAMESPACE sent by the hnp
as the jobid
2016-09-15 13:57:23 +09:00
Artem Polyakov
a9a7f39773 ess/pmi: fix the comments about MCA/PMIx setting conflict resolution. 2016-09-07 07:47:35 +03:00
Artem Polyakov
74a11d7832 Fix session dir cleanup code. 2016-09-05 07:53:55 +03:00
Artem Polyakov
dc0ab674de Add PMIx key to provide RM with ability to indicate that it will cleanup
session directories provided at through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Artem Polyakov
81195ab724 Several fixes related to session directories:
* enable OMPI to retrieve paths from RM through PMIx
* cleanups related to tempdirs.
2016-09-05 07:48:44 +03:00
Ralph Castain
2f6e0fec90 Provide the number of nodes in the job 2016-08-26 14:50:41 -07:00
Gilles Gouaillardet
93e73841f9 ess/singleton: push all PMIX_* environment variables, regardless how many there are 2016-08-23 09:46:55 +09:00
Gilles Gouaillardet
a1e8e58a8a ess/singleton: expects 4 PMIX_* environment variables or more 2016-08-23 09:34:03 +09:00
Ralph Castain
be8424b691 Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts

dd
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5 Fix shared memory rendezvous 2016-08-13 08:14:50 -07:00
rhc54
1ef3c86d44 Merge pull request #1931 from hjelmn/ess_fix
ess/base: set up nidmap after pmix
2016-08-12 13:10:30 -07:00
Artem Polyakov
1351a7065c ess/pmi: minor code readablility cleanup.
Split process name variable "name" to
- "wildcard_rank" for the cases where wildcard is used.
- "pname" for the case where reference to particular process is needed.
2016-08-06 15:45:19 +06:00
Nathan Hjelm
3c23502dfe ess/base: set up nidmap after pmix
This fixes a SEGV when the nidmap code attempts to use
opal_pmix.store_local before pmix is set up.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-02 09:50:00 -06:00
Ralph Castain
71de03fc67 Cleanup the new naming requirements to ensure that info is correctly retrieved
Cleanup permissions

Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
01a653d50a Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
Remove stale file reference

Restore autogen pass thru pmix

Remove generated file
2016-07-20 00:58:19 -07:00
Ralph Castain
ee56d9dc1a Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field 2016-07-05 14:59:50 -07:00
Ralph Castain
3913595e10 Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster.
Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.
2016-05-29 18:56:18 -07:00
rhc54
ff8518853e Merge pull request #1604 from rhc54/topic/psm2
Improve the transport key print statement to ensure that we don't get…
2016-05-03 13:43:10 -07:00
rhc54
2fa8b6c6ac Merge pull request #1525 from rhc54/topic/schizo
Extend the schizo framework
2016-05-01 15:09:08 -07:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Ralph Castain
0f05893952 Ensure consistency between max_procs and univ_size values - since orte wants max_procs, have the proc get that value instead of univ_size
Make the singleton module consistent as well
2016-05-01 11:13:33 -07:00
Ralph Castain
29bc24bdd5 Improve the transport key print statement to ensure that we don't get zero fields as this can be a problem for PSM 2016-04-28 20:11:12 -07:00
Ralph Castain
75dc4c305a Correctly set the #procs in the job to "job_size", and the max_procs to "univ_size" 2016-04-27 12:00:19 -07:00
Ralph Castain
2432daf065 Some minor cleanups of a memory leak and error output 2016-04-08 07:46:18 -07:00
Rainer Keller
52080a5736 As per the pull request to pmix/master:
https://github.com/pmix/master/pull/71

Have OMPI's current version of pmix120 nicely fail in case of
too long sun_path (longer than 108 or in case of OSX 103 chars).
And have OMPI return proper error messages with hints how to
amend.
2016-04-07 22:12:53 +02:00
Ralph Castain
a3fea58d1c Minor cleanups to prior PR commit 2016-03-24 15:55:14 -07:00
rhc54
6756e19aa2 Merge pull request #1457 from anandhis/master
rml changes
2016-03-24 15:17:29 -07:00
Ralph Castain
8c14df2328 Revert "Modify singularity support per patch from Greg Kurtzer"
This reverts commit open-mpi/ompi@f7257a8310.

Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable.

Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl
2016-03-24 11:27:18 -07:00
Ralph Castain
c146c4969b Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
Fixes #1467

Restore debugger attach operations

Fixes #1225
2016-03-18 21:49:04 -07:00