1
1
Граф коммитов

4023 Коммитов

Автор SHA1 Сообщение Дата
Artem Polyakov
58300afff2 orte/oob/tcp: Plug the memory leak.
Plug coverity defect CID 1396541.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2016-12-01 06:48:25 +07:00
Ralph Castain
47ed214458 Do not resend if max_retries is exceeded. Make a verbose output available to tell us where the intended message was to go.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 19:21:16 -08:00
rhc54
d31f173744 Merge pull request #2476 from rhc54/topic/dbgupdate
Bring forward the debugger-related changes
2016-11-29 19:10:32 -08:00
Ralph Castain
d5fd635efe Bring forward the debugger-related changes
Refs https://github.com/open-mpi/ompi/pull/2425

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 13:15:20 -08:00
Ralph Castain
30ff8be9c9 Silence minor warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 08:33:22 -08:00
Jeff Squyres
a6d390fe7b Merge pull request #2461 from artpol84/oob/msg_drop
orte/oob/tcp: Fix message dropping in case of concurrent connection.
2016-11-29 11:23:15 -05:00
Ralph Castain
f7699a7eeb Silence warnings in a .opal_ignore'd component
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-28 13:18:25 -08:00
Artem Polyakov
ada93e0c02 orte/oob/tcp: Fix message dropping in case of concurrent connection.
The problem was observed for direct modex used with recursive doubling
algorithm (used for collective ID calculation prior to d52a2d081e9598a9ac9a50fb4b013a6d2a72375b)
that has pairwise nature and counter-connections are highly likely.

The following scenario was uncovering the issue:
* ranks `x` and `y` want to communicate with each other, `x` < `y`;
* rank `x` initiates the connection and sends the ack;
* rank `y` starts to `connect()` and gets the ack from `x`;
* `y` identifies that it already started connecting and `y` > `x` so it rejects incoming connection.
* `x` sees that his connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when trying to
read the message header using `tcp_peer_recv_blocking()` which calls `mca_oob_tcp_peer_close()`
that effectively flushes all the messages in the peer->send_queue.
* `y` send the ack to `x` and the connection is established, however all the messages for the peer
at `x` are vanished (except the front one in peer->send_msg).

This commit introduces a "nack" function that will be used at `y` side to tell `x` that `y` has the
priority and `x`'s connection should be closed. This allows to avoid "guessing" on the unexpectedly
closed connection.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2016-11-27 04:58:34 +07:00
Howard Pritchard
2cbc0e8472 pmix/cray: fix disable-dlopen problem
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopn caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-21 13:45:10 -06:00
Ralph Castain
eb67c2fd44 Update OFI/rml component - still .opal_ignore'd
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-18 14:54:26 -08:00
Ralph Castain
9c6c2fa61d Bring the v2.0.x debugger patch up to the master branch
Ensure the personality gets set as specified by user, or defaults to
"ompi"

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-18 12:45:45 -08:00
Ralph Castain
188880be3f Since static ports are only used by ORTE if the runtime option is given,
there is no need for a configure option as well - so remove the
--enable-orte-static-ports configure option. When decoding the daemon
nidmap, mark new daemons as ALIVE by default - we will discover dead
ones as we go.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-04 05:01:42 -07:00
Gilles Gouaillardet
da0c873e14 oob/tcp: enhance debugging output
display the hop node used to send a message
(if the message is sent directly, then the hop is the destination)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-11-04 14:16:06 +09:00
Josh Hursey
b18598f6c7 Merge pull request #2329 from jjhursey/topic/short-hostname-lsf-fix
ras/*: Fix !orte_keep_fqdn_hostnames for RAS components
2016-11-02 10:49:08 -05:00
Ralph Castain
435d771e76 Fix the radix routed component to correctly handle connected tools - in such cases, the route must be direct to the tool.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-01 19:03:26 -07:00
Ralph Castain
64873487b4 Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-01 12:17:11 -07:00
Joshua Hursey
ed5268a96a ras/slurm: Fix !orte_keep_fqdn_hostnames for Slurm
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:30 -05:00
Joshua Hursey
5a4c52d9cb ras/loadleveler: Fix !orte_keep_fqdn_hostnames for Loadleveler
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:30 -05:00
Joshua Hursey
8230201ad1 ras/gridengine: Fix !orte_keep_fqdn_hostnames for GridEngine
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:30 -05:00
Joshua Hursey
9643175e40 ras/tm: Fix !orte_keep_fqdn_hostnames for TORQUE
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:24 -05:00
Joshua Hursey
8d02a33639 ras/lsf: Fix !orte_keep_fqdn_hostnames for LSF
* By default, make sure that we are using the short hostnames and not
   the fully qualified hostnames when running under LSF.
 * Related to commit open-mpi/ompi@d26dd2c20e

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:04:52 -05:00
rhc54
6074c2a2a9 Merge pull request #2322 from rhc54/topic/routed
Update the routed components as we no longer need to init_routes.
2016-10-31 13:37:07 -07:00
Ralph Castain
b8c5d1ad88 Update the routed components as we no longer need to init_routes. Fixes case of direct launch via srun
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-31 12:38:13 -07:00
Jeff Squyres
773d6039e7 Merge pull request #2306 from hjelmn/alps_cores
ras/alps: use cpuCnt if using hwthreads as cores
2016-10-31 15:22:13 -04:00
Gilles Gouaillardet
30298cc83c oob/tcp: remove debug that should have never been commited
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-31 16:41:14 +09:00
Gilles Gouaillardet
75e96004a4 oob/tcp: fix a typo in mca_oob_tcp_component_no_route()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-31 16:30:24 +09:00
Gilles Gouaillardet
fb5bcc47ce ess/singleton: use opal_setenv instead of putenv
so it fixes a memory leak on finalize

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-28 09:32:30 +09:00
Gilles Gouaillardet
ef2b3ac8d2 rml/oob: fix misc memory leaks in open_conduit()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-28 09:28:42 +09:00
Gilles Gouaillardet
831f7d9c9d rml/base: plug misc memory leaks
plug leaks in orte_rml_API_get_contact_info() and orte_rml_base_close()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-28 09:28:05 +09:00
Nathan Hjelm
c3614d30fa ras/alps: use cpuCnt if using hwthreads as cores
This commit updates the alps ras component to allow the use of
hyperthreads on compute nodes. In this case we need to use the cpuCnt
value from the node structure instead of numPEs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-10-27 09:51:17 -06:00
Gilles Gouaillardet
3d4285b04d oob/tcp: silence valgrind warning
fully initialize allocated memory to keep valgrind happy

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-27 17:12:46 +09:00
rhc54
2b18044051 Merge pull request #2301 from rhc54/topic/update
Update PMIx to latest master tarball. Ensure we set the HNP name for …
2016-10-26 16:42:15 -07:00
Ralph Castain
f298f294e1 Update PMIx to latest master tarball. Ensure we set the HNP name for orted's so that PMIx_Lookup can find the server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-26 15:48:56 -07:00
Anandhi S Jayakumar
94593ca20b Adding ofi plugin to allow for opening a conduit to use ethernet/fabric.
modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/base/rml_base_stubs.c
	deleted:    ../orte/mca/rml/ofi/.opal_ignore
	modified:   ../orte/mca/rml/ofi/Makefile.am
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c
	modified:   ../orte/test/system/ofi_conduit_stress.c

	Removed stale include directive
	modified:   ../orte/mca/rml/ofi/Makefile.am

The ofi plugin supports multiple providers, and identifies them
by ofi_prov_id,  changed the previous name conduit_id to ofi_prov_id
	modified:   ../orte/mca/rml/base/base.h
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_request.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Adding ofi plugin to allow for opening a conduit to use ethernet/fabric.

	modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/base/rml_base_stubs.c
	deleted:    ../orte/mca/rml/ofi/.opal_ignore
	modified:   ../orte/mca/rml/ofi/Makefile.am
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c
	modified:   ../orte/test/system/ofi_conduit_stress.c

	Removed stale include directive
	modified:   ../orte/mca/rml/ofi/Makefile.am

The ofi plugin supports multiple providers, and identifies them
by ofi_prov_id,  changed the previous name conduit_id to ofi_prov_id
	modified:   ../orte/mca/rml/base/base.h
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_request.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Fixed merge issues, and minor pull-request comments
	modified:   ../orte/mca/rml/base/base.h
	modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Adding ofi plugin to allow for opening a conduit to use ethernet/fabric.

	modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/base/rml_base_stubs.c
	deleted:    ../orte/mca/rml/ofi/.opal_ignore
	modified:   ../orte/mca/rml/ofi/Makefile.am
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c
	modified:   ../orte/test/system/ofi_conduit_stress.c

	Removed stale include directive
	modified:   ../orte/mca/rml/ofi/Makefile.am

The ofi plugin supports multiple providers, and identifies them
by ofi_prov_id,  changed the previous name conduit_id to ofi_prov_id
	modified:   ../orte/mca/rml/base/base.h
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_request.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Adding ofi plugin to allow for opening a conduit to use ethernet/fabric.

	modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/base/rml_base_stubs.c
	deleted:    ../orte/mca/rml/ofi/.opal_ignore
	modified:   ../orte/mca/rml/ofi/Makefile.am
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c
	modified:   ../orte/test/system/ofi_conduit_stress.c

	Removed stale include directive
	modified:   ../orte/mca/rml/ofi/Makefile.am

Fixed merge issues, and minor pull-request comments
	modified:   ../orte/mca/rml/base/base.h
	modified:   ../orte/mca/rml/base/rml_base_frame.c
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Removed trailing space
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Cleaned up test- ofi_conduit_stress.c
	modified:   ../orte/test/system/ofi_conduit_stress.c

cleaned up printing the provider info during initialisation
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Fixing warnings
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

minor cleanup
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

more cleanup
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Sending the ethernet address only in the get_contact_info, rest will be sent through modex
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Adding error logging on failures
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Handling the OPAL_MODEX_SEND/RECV generically for all ofi providers.
	modified:   ../orte/mca/rml/ofi/rml_ofi.h
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
	modified:   ../orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Adding to build ofi for limited people
	new file:   ../orte/mca/rml/ofi/.opal_ignore
	new file:   ../orte/mca/rml/ofi/.opal_unignore

Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com>

Removign the error logging for now
	modified:   ../orte/mca/rml/ofi/rml_ofi_component.c
2016-10-26 13:11:07 -07:00
Ralph Castain
d031946c46 When mpirun operates in --continuous mode, we won't terminate the job when a remote process dies. In that case, we have to activate both the waitpid _and_ the IOF complete states to ensure we properly mark the proc as dead and perform any required notifications
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-25 12:18:14 -07:00
Ralph Castain
227d4d9609 Open the conduits for application procs - we probably can remove all the
RML-related frameworks from MPI applications now, but let's wait a bit
to ensure we have cleaned up all the points where messaging might occur.
2016-10-24 16:53:19 -07:00
Ralph Castain
649301a3a2 Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Ralph Castain
df8ac7b747 Properly mark a node as down and decrease the number of daemons so any
subsequent grpcomm collectives can correctly operate. Note that only the
direct grpcomm component knows how to deal with down nodes.
2016-10-21 09:53:37 -07:00
Gilles Gouaillardet
1846c2d8ad plm/rsh: use an alternate port if the ORTE_NODE_PORT attribute is set 2016-10-19 16:18:52 +09:00
Ralph Castain
16540c7422 Properly report failure to launch when someone mis-types the name of the application
Fixes #2233
2016-10-18 10:09:30 -07:00
Ralph Castain
7be607582e ORTE applications need to commit any modex send's prior to calling fence 2016-10-18 09:22:56 -07:00
Ralph Castain
57114a09ae Pickup the npernode and npersocket options and include them in the job object 2016-10-17 12:26:21 -07:00
Gilles Gouaillardet
bd1b6fe661 rml/oob: add a missing include file 2016-10-16 10:25:00 +09:00
Gilles Gouaillardet
451b9dc467 ess: tear down pmix (if any) before oob 2016-10-13 14:08:02 +09:00
Ralph Castain
fca1556787 Some compilers apparently complain about this, so modify the typedef statements 2016-10-12 08:44:03 -07:00
Ralph Castain
a2919174d0 Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport.
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.

While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress
2016-10-11 16:01:02 -07:00
Gilles Gouaillardet
c92e9a5406 use the new OPAL_HASH_TABLE_FOREACH convenience macro 2016-10-08 16:58:20 +09:00
Gilles Gouaillardet
0931d09afa ess/singleton: silence a valgrind warning
initialize a pointer and keep valgrind happy about it
2016-09-27 15:22:39 +09:00
Gilles Gouaillardet
f9ebba4668 ess/singleton: only realloc() when required in fork_hnp() 2016-09-23 16:35:59 +09:00
Gilles Gouaillardet
c7bf9a0ec9 ess/singleton: fix read on the pipe to spawn'ed orted
and close the pipe on both ends when it is no more needed
2016-09-22 14:21:52 +09:00
Ralph Castain
de7b1494d9 Clean out old cruft from the ORCM project 2016-09-21 00:13:30 -07:00
Gilles Gouaillardet
83399adb3f singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton 2016-09-20 14:56:58 +09:00
Gilles Gouaillardet
e7ae6975d0 orted: fix spawn in singleton mode
in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports()
and send the transport key back to the singleton
2016-09-20 14:39:22 +09:00
Ralph Castain
a16b3cc33d Fix some minor complaints - missing "void" in function parameters 2016-09-15 15:18:42 -07:00
Ralph Castain
6f086189e6 Fix trivial typo 2016-09-15 13:10:55 -07:00
Gregory M. Kurtzer
16794cc260 Updates to support Singularity containers v2.2 2016-09-15 09:52:06 -07:00
Gilles Gouaillardet
11ebf3ab23 ess/singleton: when forking hnp, use the PMIX_NAMESPACE sent by the hnp
as the jobid
2016-09-15 13:57:23 +09:00
Gilles Gouaillardet
e84b35217f oob/tcp: plug a memory leak
as reported by Coverity with CID 1196711
2016-09-08 18:50:18 +09:00
Gilles Gouaillardet
b2a2be0e5a odls: fix memory leak plug
This fixes commit open-mpi/ompi@e2c343cdfc.
2016-09-08 10:02:52 +09:00
Artem Polyakov
9eba1b0b75 Merge pull request #2042 from artpol84/pmix_sdirs
Several fixes related to session directories:
2016-09-07 14:15:47 +07:00
Artem Polyakov
a9a7f39773 ess/pmi: fix the comments about MCA/PMIx setting conflict resolution. 2016-09-07 07:47:35 +03:00
Gilles Gouaillardet
e2c343cdfc odls: plus memory leak
as reported by Coverity with CID 710645
2016-09-07 10:08:44 +09:00
Gilles Gouaillardet
c09899f6af plm: plus resource leaks
as reported by Coverity with CIDs 72274 and 1196733
2016-09-07 10:08:44 +09:00
Josh Hursey
f6337f9eae Merge pull request #2047 from jjhursey/topic/mixed-host2
orte: !FQDN implementation to use opal_net_isaddr
2016-09-06 13:08:54 -05:00
Ralph Castain
f85dcaee2a Fixes CID 1369067 and CID 1196684
Fixes CID 1369648

    Fixes CID 1372409
2016-09-06 08:43:15 -07:00
Artem Polyakov
74a11d7832 Fix session dir cleanup code. 2016-09-05 07:53:55 +03:00
Artem Polyakov
dc0ab674de Add PMIx key to provide RM with ability to indicate that it will cleanup
session directories provided at through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Artem Polyakov
81195ab724 Several fixes related to session directories:
* enable OMPI to retrieve paths from RM through PMIx
* cleanups related to tempdirs.
2016-09-05 07:48:44 +03:00
Ralph Castain
fb51d65049 Minor change: check for NULL before using the job map to avoid segfault when erroring out prior to creating the map 2016-09-04 07:53:12 -07:00
Joshua Hursey
fe937d1e82 orte: !FQDN implementation to use opal_net_isaddr
* Switch to use opal_net_isaddr() for checking if a name is an IP
   address - as it is a bit cleaner, and uses common functionality.
2016-09-02 13:31:49 -05:00
Ralph Castain
4e0788e9ad Enable PSM to support dynamic processes
Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object
2016-09-02 10:22:04 -07:00
Ralph Castain
0ea1cff733 Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disable by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional 2016-09-01 13:10:10 -07:00
Gilles Gouaillardet
0b8c58298d oob/usock: fix handling of orte_process_name_t *
orte_process_name_t is aligned on 32 bits, so it cannot simply be casted
into an int64_t. use memcpy() instead

Thanks Paul Hargrove for the report
2016-09-01 13:18:02 +09:00
Ralph Castain
c1050bc01e Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose. 2016-08-31 09:32:07 -07:00
Ralph Castain
9b991bd1f5 Ensure that the "running" state is correctly updated
It is possible that one or more procs could get thru PMIx_Init, and thus be marked as in state "registered", before all local procs have been started. If that happens, then we would report some of the procs in state "running", and the others in state "registered" - which means that the HNP would miss the "running" stage of the state machine.

Thanks to Jingchao Zhang for his patience in tracking this down on the 2.0 branch
2016-08-30 19:24:39 -07:00
Josh Hursey
b0d8638824 Merge pull request #2015 from jjhursey/topic/mixed-hostnames
orte: Expand use of !orte_keep_fqdn_hostnames MCA parameter
2016-08-29 09:14:54 -05:00
Ralph Castain
2f6e0fec90 Provide the number of nodes in the job 2016-08-26 14:50:41 -07:00
Joshua Hursey
d26dd2c20e orte: Expand the application of !orte_keep_fqdn_hostnames
* Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when
   it is set to false.
 * If that parameter is set to false (default) then short hostnames
   (e.g., `node01`) will match with the long hostnames (e.g.,
   `node01.mycluster.org`). This allows a user (or resource manager)
    to mix the use of short and long hostnames.
  - Note that this mechanism does _not_ perform a DNS lookup, but
    instead strips off the FQDN by truncating the hostname string at
    the first `.` character (when not an IP address).
     - By default (`false`) the following is true:
       `node01 == node01.mycluster.org == node01.bogus.com`
       since we use `node01` as the hostname.
2016-08-26 16:09:04 -05:00
Artem Polyakov
55ac3b0be3 orte/schizo: fix binding detection in slurm component
in SLURM 16.05 the SLURM_CPU_BIND_TYPE is equal to "mask_cpu:"
instead of "mask_cpu". Account for that.
2016-08-26 09:55:52 +03:00
rhc54
19b0f4db9f Merge pull request #1995 from rhc54/topic/pe-per-rank
Change the behavior of cpus-per-rank.
2016-08-25 14:38:12 -05:00
Ralph Castain
440eae90ec Correct the binding algorithm to decouple it from oversubscribe.
Oversubscribe stipulates that we allow more procs on the node than assigned slots - it has nothing to do with the number of available pe's. Let overload directives handle the pe situation.
2016-08-24 21:17:22 -07:00
Gilles Gouaillardet
93e73841f9 ess/singleton: push all PMIX_* environment variables, regardless how many there are 2016-08-23 09:46:55 +09:00
Gilles Gouaillardet
a1e8e58a8a ess/singleton: expects 4 PMIX_* environment variables or more 2016-08-23 09:34:03 +09:00
Ralph Castain
7de4d6922b Change the behavior of cpus-per-rank. We previously counted each cpu against the #slots. However, IBM has pointed out that "slot" is equated to the number of processes allowed to run on each node, and not the number of cpus on the node. This has been a continuing source of confusion, so make the distinction a "hard" one.
Each process occupies a "slot". We automatically set #slots = #cpus if nothing else is told to us. If you want to run more procs and slots, you must tell us to allow oversubscription.

A process can utilize multiple pe's if that option is given. If you try to bind more than one proc to a given pe, then we will error out unless you tell us to allow overloading.
2016-08-22 15:54:41 -07:00
Jeff Squyres
71ec5cfb43 rsh: robustify the check for plm_rsh_agent default value
Don't strcmp against the default value -- the default value may change
over time.  Instead, check to see if the MCA var source is not
DEFAULT.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-16 06:58:20 -05:00
rhc54
d7cd802426 Merge pull request #1971 from rhc54/topic/sesdir
Update the session dir structure. Restore the creation of a top-level…
2016-08-16 03:14:08 -05:00
Ralph Castain
ae2af61ee3 Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs. 2016-08-15 22:46:46 -05:00
Ralph Castain
9f43db7303 Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods. 2016-08-15 07:51:36 -07:00
Ralph Castain
be8424b691 Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts

dd
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5 Fix shared memory rendezvous 2016-08-13 08:14:50 -07:00
rhc54
ddde154d28 Merge pull request #1962 from rhc54/topic/notify
Ensure we properly convert pmix status to ORTE state before activatin…
2016-08-13 06:59:50 -07:00
Ralph Castain
48d35a9627 Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program 2016-08-12 21:14:29 -07:00
rhc54
9eed451916 Merge pull request #1960 from rhc54/topic/rsh
Restore the rsh template creation code
2016-08-12 13:38:43 -07:00
rhc54
1ef3c86d44 Merge pull request #1931 from hjelmn/ess_fix
ess/base: set up nidmap after pmix
2016-08-12 13:10:30 -07:00
Ralph Castain
5717b75b45 Restore the rsh template creation code 2016-08-12 12:43:40 -07:00
Ralph Castain
1c44543854 If the ssh agent hasn't been given, then check for qrsh and friends 2016-08-12 07:46:39 -07:00
Artem Polyakov
1351a7065c ess/pmi: minor code readablility cleanup.
Split process name variable "name" to
- "wildcard_rank" for the cases where wildcard is used.
- "pname" for the case where reference to particular process is needed.
2016-08-06 15:45:19 +06:00
Nathan Hjelm
3c23502dfe ess/base: set up nidmap after pmix
This fixes a SEGV when the nidmap code attempts to use
opal_pmix.store_local before pmix is set up.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-02 09:50:00 -06:00
Ralph Castain
71de03fc67 Cleanup the new naming requirements to ensure that info is correctly retrieved
Cleanup permissions

Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
01a653d50a Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
Remove stale file reference

Restore autogen pass thru pmix

Remove generated file
2016-07-20 00:58:19 -07:00