1
1

5441 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
c6f6f40529 Transfer debugger support changes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-17 18:14:46 -08:00
Ralph Castain
269753f5c1 Transfer back changes from debugger attach work
Silence warning

Remove debug

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-17 10:00:52 -08:00
Ralph Castain
215d6290e0 Add a flux component for LLNL
Fine tuning of flux component
Fix a few minor issues with the initial cut:
* Job id could be obtained from the PMI kvsname like SLURM,
  but simpler to getenv (FLUX_JOB_ID)
* Flux pmi-1 doesn't define PMI_BOOL, PMI_TRUE, PMI_FALSE
* Flux pmi-1 maps the deprecated PMI_Get_kvs_domain_id() to
  PMI_KVS_Get_my_name() internally, so just call that instead.
* Drop residual slurm references.

Add wrappers for PMI functions so that if HAVE_FLUX_PMI_LIBRARY
is not defined, the component can dlopen libpmi.so at location
specified by the FLUX_PMI_LIBRARY_PATH env variable, which adds
flexibility.  If HAVE_FLUX_PMI_LIBRARY is defined, link with
libpmi.so at build time in the usual way.

Update configury for flux component

Update m4 so the configure options work as follows:

 --with-flux-pmi
      Build Flux PMI support (default: yes)

 --with-flux-pmi-library
      Link Flux PMI support with PMI library at build
      time. Otherwise the library is opened at runtime at
      location specified by FLUX_PMI_LIBRARY_PATH environment
      variable. Use this option to enable Flux support when
      building statically or without dlopen support (default: no)

If the latter option is provided, the library/header is located at
build time using the pkg-config module 'flux-pmi'.  Otherwise there
is no library/header dependency.

Handle the case where ompi is configured with --disable-dlopen
or --enable-statkc.  In those cases, don't build the component
unless --with-flux-pmi-library is provided.

It is fatal if the user explicitly requests --with-flux-pmi but
it cannot be built (e.g. due to --disable-dlopen).

Add a schizo/flux component

Update schizo/flux component

Eliminate slurm-specific usage cases.

Since the module is only loaded if FLUX_JOB_ID is set, there are
only two cases to handle:

1) App was launched indirectly through mpirun.  This is not yet
supported with Flux, but hook remains in case this mode is supported
in the future.

2) App was launched directly by Flux, with Flux providing
CPU binding, if any.

Fix up white space in pmix/flux component

Drop non-blocking fence from pmix:flux component

The flux PMI-1 library is not thread safe, therefore
register a regular blocking fence callback instead of the
thread-shifting fencenb().

pmix/flux component avoids extra PMI_KVS_Gets

Keys stored into the base cache under the wildcard
rank are not intended to be part of the global key namespace.
These keys therefore should not trigger a PMI_KVS_Get() if they
are not found in the cache.

Minor pmix/flux component cleanup

pmix/flux: drop code for fetching unused pmix_id

pmix/flux: err_exit must return error

Problem: in flux_init(), although 'ret' (variable holding
err_exit return code) is initialized to OPAL_ERROR, the
variable is reused as a temporary result code, so if there are
some successes followed by a failure that doesn't set 'ret',
flux_init() could return success with PMI not initialized.

Ensure that a "goto err_exit" returns OPAL_ERROR if 'ret'
is not set to some other error code.

pmix/flux: don't mix OPAL_ and PMI_ return codes

Problem: flux_init() can return both PMI_ and OPAL_ return
codes.  Although OPAL_SUCCESS and PMI_SUCCESS are both defined
as 0, other codes are not compatible.

Ensure that flux_init() consistently uses 'rc' for PMI_
return codes and 'ret' for OPAL_ return codes.

pmix/flux: factor out repeated code for cache put

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-16 18:26:38 -08:00
Ralph Castain
2af677b1cf Ensure that we don't bind-by-default in an oversubscribed condition
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-15 07:58:52 -08:00
Ralph Castain
884fb7fcf2 Update the PMIx2 support to include the latest shared memory optimizations
Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn
Update to track master
Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-14 15:00:10 -08:00
Ralph Castain
9f69b0183f Ensure jobs that fail always return a non-zero exit code.
Thanks to Ashley Pittman for the report.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-14 09:41:06 -08:00
rhc54
341ab683de Merge pull request #2532 from rhc54/topic/pmixptl
Update to latest PMIx master + PTL branch
2016-12-07 17:28:22 -08:00
Ralph Castain
e1aa7939ef Correctly cleanup the local children and node map info on remote orteds upon job completion. Ensure that register_nspace only includes procs from that job in the proc map
Thanks to Ashley Pittman for the report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-07 13:53:00 -08:00
Gilles Gouaillardet
123036dbf8 ess/base: invoke orte_routed.update_routing_plan() earlier
fix an issue that can be evidenced with two nodes
n0$ mpirun --host n1:1 --mca oob_tcp_static_ipv4_ports 1234 -np 1 --mca routed radix --mca oob tcp true

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-07 17:19:25 +09:00
Ralph Castain
fbed2d794a Update to latest PMIx master + PTL branch
Update the usock component to disable it

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-06 20:47:44 -08:00
Ralph Castain
85a634926b Update signal handling to introduce a pause between SIGCONT and SIGTERM, followed by another pause before SIGKILL. Do this within the odls/kill_local_procs function while we know we are blocked in an event, and before the daemon shuts down the event progress loop
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-06 12:34:42 -08:00
Ralph Castain
d8f262e39b Resolve a duplicate symbol issue when the rml/ofi component is enabled
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-05 13:41:38 -08:00
Ralph Castain
79cde184ad Allow a PMIx tool to spawn a job
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-03 16:00:47 -08:00
Ralph Castain
af9a55ccf1 Fix the session directory cleanup - only remove the jobfam session dir level if we are the local daemon and are cleaning up our own session directory.
Update the scaling test to run more trials and report the options being tested each time

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-03 09:59:18 -08:00
Ralph Castain
1a0bccb536 Now that PMIx has settled on its release strategy and numbering, update the OPAL pmix framework to track
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-02 15:44:43 -08:00
Ralph Castain
88313debc2 Per discussion on email thread, restore placement of child procs in their own process group so that any signal sent to one of our children is automatically propagated to any child process they might have spawned.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-02 03:36:22 -08:00
Ralph Castain
dd491db21f Fix IOF when outputing to files - the remote orteds were failing to output stdout/err from their procs.
Silence a warning in orted_submit

Protect against a free'd value in an error path when forming oob tcp connections

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-01 14:12:47 -08:00
Artem Polyakov
58300afff2 orte/oob/tcp: Plug the memory leak.
Plug coverity defect CID 1396541.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2016-12-01 06:48:25 +07:00
rhc54
b6647ce286 Merge pull request #2480 from rhc54/topic/sessiondir
Fix session directory cleanup
2016-11-30 12:24:37 -08:00
Ralph Castain
9c596a6f54 We added another layer to the session directory tree (the jobfam layer), but we forgot to include it in the teardown procedure, thus causing us to leave droppings behind. Add the jobfam_session_dir to the teardown, and ensure that all levels are addressed
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 20:51:28 -08:00
Ralph Castain
47ed214458 Do not resend if max_retries is exceeded. Make a verbose output available to tell us where the intended message was to go.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 19:21:16 -08:00
rhc54
d31f173744 Merge pull request #2476 from rhc54/topic/dbgupdate
Bring forward the debugger-related changes
2016-11-29 19:10:32 -08:00
Ralph Castain
d5fd635efe Bring forward the debugger-related changes
Refs https://github.com/open-mpi/ompi/pull/2425

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 13:15:20 -08:00
Ralph Castain
30ff8be9c9 Silence minor warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 08:33:22 -08:00
Jeff Squyres
a6d390fe7b Merge pull request #2461 from artpol84/oob/msg_drop
orte/oob/tcp: Fix message dropping in case of concurrent connection.
2016-11-29 11:23:15 -05:00
Ralph Castain
f7699a7eeb Silence warnings in a .opal_ignore'd component
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-28 13:18:25 -08:00
Artem Polyakov
ada93e0c02 orte/oob/tcp: Fix message dropping in case of concurrent connection.
The problem was observed for direct modex used with recursive doubling
algorithm (used for collective ID calculation prior to d52a2d081e9598a9ac9a50fb4b013a6d2a72375b)
that has pairwise nature and counter-connections are highly likely.

The following scenario was uncovering the issue:
* ranks `x` and `y` want to communicate with each other, `x` < `y`;
* rank `x` initiates the connection and sends the ack;
* rank `y` starts to `connect()` and gets the ack from `x`;
* `y` identifies that it already started connecting and `y` > `x` so it rejects incoming connection.
* `x` sees that his connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when trying to
read the message header using `tcp_peer_recv_blocking()` which calls `mca_oob_tcp_peer_close()`
that effectively flushes all the messages in the peer->send_queue.
* `y` send the ack to `x` and the connection is established, however all the messages for the peer
at `x` are vanished (except the front one in peer->send_msg).

This commit introduces a "nack" function that will be used at `y` side to tell `x` that `y` has the
priority and `x`'s connection should be closed. This allows to avoid "guessing" on the unexpectedly
closed connection.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2016-11-27 04:58:34 +07:00
Ralph Castain
1e2019ce2a Revert "Update to sync with OMPI master and cleanup to build"
This reverts commit cb55c88a8b7817d5891ff06a447ea190b0e77479.
2016-11-22 15:03:20 -08:00
Ralph Castain
cb55c88a8b Update to sync with OMPI master and cleanup to build
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-22 14:24:54 -08:00
Howard Pritchard
2cbc0e8472 pmix/cray: fix disable-dlopen problem
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopn caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-21 13:45:10 -06:00
Ralph Castain
eb67c2fd44 Update OFI/rml component - still .opal_ignore'd
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-18 14:54:26 -08:00
Ralph Castain
9c6c2fa61d Bring the v2.0.x debugger patch up to the master branch
Ensure the personality gets set as specified by user, or defaults to
"ompi"

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-18 12:45:45 -08:00
Ralph Castain
188880be3f Since static ports are only used by ORTE if the runtime option is given,
there is no need for a configure option as well - so remove the
--enable-orte-static-ports configure option. When decoding the daemon
nidmap, mark new daemons as ALIVE by default - we will discover dead
ones as we go.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-04 05:01:42 -07:00
Gilles Gouaillardet
da0c873e14 oob/tcp: enhance debugging output
display the hop node used to send a message
(if the message is sent directly, then the hop is the destination)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-11-04 14:16:06 +09:00
Josh Hursey
b18598f6c7 Merge pull request #2329 from jjhursey/topic/short-hostname-lsf-fix
ras/*: Fix !orte_keep_fqdn_hostnames for RAS components
2016-11-02 10:49:08 -05:00
Ralph Castain
435d771e76 Fix the radix routed component to correctly handle connected tools - in such cases, the route must be direct to the tool.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-01 19:03:26 -07:00
Ralph Castain
64873487b4 Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-01 12:17:11 -07:00
Joshua Hursey
ed5268a96a ras/slurm: Fix !orte_keep_fqdn_hostnames for Slurm
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:30 -05:00
Joshua Hursey
5a4c52d9cb ras/loadleveler: Fix !orte_keep_fqdn_hostnames for Loadleveler
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:30 -05:00
Joshua Hursey
8230201ad1 ras/gridengine: Fix !orte_keep_fqdn_hostnames for GridEngine
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:30 -05:00
Joshua Hursey
9643175e40 ras/tm: Fix !orte_keep_fqdn_hostnames for TORQUE
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:21:24 -05:00
Joshua Hursey
8d02a33639 ras/lsf: Fix !orte_keep_fqdn_hostnames for LSF
* By default, make sure that we are using the short hostnames and not
   the fully qualified hostnames when running under LSF.
 * Related to commit open-mpi/ompi@d26dd2c20e

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2016-11-01 13:04:52 -05:00
rhc54
6074c2a2a9 Merge pull request #2322 from rhc54/topic/routed
Update the routed components as we no longer need to init_routes.
2016-10-31 13:37:07 -07:00
Ralph Castain
b8c5d1ad88 Update the routed components as we no longer need to init_routes. Fixes case of direct launch via srun
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-10-31 12:38:13 -07:00
Jeff Squyres
773d6039e7 Merge pull request #2306 from hjelmn/alps_cores
ras/alps: use cpuCnt if using hwthreads as cores
2016-10-31 15:22:13 -04:00
Gilles Gouaillardet
30298cc83c oob/tcp: remove debug that should have never been commited
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-31 16:41:14 +09:00
Gilles Gouaillardet
75e96004a4 oob/tcp: fix a typo in mca_oob_tcp_component_no_route()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-31 16:30:24 +09:00
Gilles Gouaillardet
fb5bcc47ce ess/singleton: use opal_setenv instead of putenv
so it fixes a memory leak on finalize

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-28 09:32:30 +09:00
Gilles Gouaillardet
ef2b3ac8d2 rml/oob: fix misc memory leaks in open_conduit()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-28 09:28:42 +09:00
Gilles Gouaillardet
831f7d9c9d rml/base: plug misc memory leaks
plug leaks in orte_rml_API_get_contact_info() and orte_rml_base_close()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-10-28 09:28:05 +09:00