openmpi

Author	SHA1	Message	Date
Artem Polyakov	3eb6c98542	orte/odls: Fix ORTE state machine for the non-zero exit case This commit fixes a rare race condition that occurs when a process calling `exit(-1)` has a delay between fd cleanup and the actual OS-level exit. This may happen if the process has work to do in `on_exit()` callbacks. Problem description: Consider an application process that has called `exit(nonzero)`: its fds were closed, but its actual termination at the OS level is delayed by some cleanups (e.g. in callbacks registered via `on_exit()`). The observed sequence of events was the following: * orted gets the stdio disconnection and activates the `IOF COMPLETE` state. * a parallel OOB disconnection causes the `COMMUNICATION FAILURE` state to be activated. * during `COMMUNICATION FAILURE` processing, `odls_base_default_wait_local_proc` is called even though the real waitpid hasn't been called yet (the code mentions that waitpid might not be called for an unspecified reason). Because of that, the real exit code is unknown and set to 0. The `odls_base_default_wait_local_proc` callback sees the `IOF COMPLETE` flag and, in conjunction with the 0 exit code, activates the `WAITPID FIRED` state. * processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` being activated. * the `NORMALLY TERMINATED` state in particular causes the `ORTE_PROC_FLAG_ALIVE` flag for this proc to be dropped. * when the application process finally exits, `wait_signal_callback` is launched. It sets the real exit code and calls `odls_base_default_wait_local_proc` again, but this time, since the process has the `ORTE_PROC_FLAG_ALIVE` flag dropped, the `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`), leading to the hang that was observed. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-01-06 11:12:55 +02:00
Ralph Castain	b343df43a1	Merge pull request #2669 from rhc54/topic/memprobe Complete the memprobe support.	2017-01-05 12:02:56 -08:00
Ralph Castain	6509f60929	Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node. Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given. Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example: $ mpirun -npernode 2 ./mpi_memprobe Sampling memory usage after MPI_Init Data for node rhc001 Daemon: 12.483398 Client: 6.514648 Data for node rhc002 Daemon: 11.865234 Client: 4.643555 Sampling memory usage after MPI_Barrier Data for node rhc001 Daemon: 12.520508 Client: 6.576660 Data for node rhc002 Daemon: 11.879883 Client: 4.703125 Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 10:32:17 -08:00
Ralph Castain	b4088c331a	Merge pull request #2662 from rhc54/topic/stuff Variety of cleanups	2017-01-04 10:25:26 -08:00
Ralph Castain	91d714fe93	Add flags to direct PMIx to only use one listener, but without directing which one (tcp or usock) to use. This allows the user to set PMIX_MCA_ptl in their environment to select the transport method. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:44 -08:00
Ralph Castain	f355fb926d	Continue cleanup of notifications. Resolve a race condition that can result in attempt to send a message on a closed socket Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-04 09:16:33 -08:00
Joshua Ladd	57c0c847d0	Merge pull request #2603 from xinzhao3/topic/revert-ucx-mt Revert "PML/SPML/UCX: add UCX MT support to PML and SPML."	2017-01-04 11:50:37 -05:00
Ralph Castain	5737a45b35	Merge pull request #2658 from rhc54/topic/removal Remove the bcol, coll/ml, and sbgp code as stale and lacking a maintainer	2017-01-03 20:34:09 -08:00
Ralph Castain	66131b4183	Remove the bcol, coll/ml, and sbgp code as stale and lacking a maintainer Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-03 19:32:48 -08:00
Ralph Castain	dadc6fbaf6	Merge pull request #2448 from thananon/remove_request_lock Completely removed ompi_request_lock and ompi_request_cond	2017-01-03 19:31:46 -08:00
Jeff Squyres	33d2988985	Merge pull request #2647 from OMGtechy/master Fixed -Wmisleading-indentation in ad_read_coll.c	2017-01-03 12:24:22 -05:00
Ralph Castain	218aed144d	Merge pull request #2654 from rhc54/topic/memory Remove stale global variables	2017-01-02 15:09:09 -08:00
Ralph Castain	9eab9a1ed3	Remove stale global variables Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers. Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation). Begin first cut at memory profiler Some minor cleanups of memprobe Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-02 14:04:24 -08:00
rhc54	5f68d655d6	Merge pull request #2651 from rhc54/topic/minor Minor cleanups	2016-12-30 18:52:12 -08:00
Ralph Castain	e8aea2ebfc	Minor cleanups Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 16:19:42 -08:00
rhc54	56b1e10ac0	Merge pull request #2649 from rhc54/topic/foot2 Update to latest PMIx master	2016-12-30 15:36:03 -08:00
Ralph Castain	08c76a42bb	Update to latest PMIx master Signed-off-by: Ralph Castain <rhc@open-mpi.org> Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with. Signed-off-by: Ralph Castain <rhc@open-mpi.org> Protect users of hwloc membind functions Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update PMIx to include NULL string protection Signed-off-by: Ralph Castain <rhc@open-mpi.org> Update to PMIx master to include key overwrite protection Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-30 12:44:47 -08:00
rhc54	a16162832b	Merge pull request #2648 from rhc54/topic/topo Only instantiate the HWLOC topology in an MPI process if it actually will be used.	2016-12-29 11:52:08 -08:00
Ralph Castain	fe68f23099	Only instantiate the HWLOC topology in an MPI process if it actually will be used. There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced: * shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string instead of the topology itself, if available, thus avoiding instantiating the topology * openib BTL. This uses the distance matrix. At present, I haven't developed a method for replacing that reference. Thus, this component will instantiate the topology * usnic BTL. Uses the distance matrix. * treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate the topology * ess base functions. If a process is direct launched and not bound at launch, this code attempts to bind it. Thus, procs in this scenario will instantiate the topology Note that instantiating the topology on complex chips such as KNL can consume megabytes of memory. Fix pernode binding policy Properly handle the unbound case Correct pointer usage Do not free static error messages! Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-29 10:33:29 -08:00
Ralph Castain	52533f755e	Remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-28 13:24:39 -08:00
Joshua Gerrard	94e87654c6	Fixed -Wmisleading-indentation in ad_read_coll.c Signed-off-by: Joshua Gerrard <joshuagerrard+ompi-commit@protonmail.com>	2016-12-28 20:14:13 +00:00
rhc54	acbf1cbaef	Merge pull request #2646 from rhc54/topic/squeze Begin to reduce reliance of application procs on the topology tree it…	2016-12-28 10:16:58 -08:00
Ralph Castain	3a2d6a5ab6	Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-28 09:14:26 -08:00
rhc54	75be023f90	Merge pull request #2645 from rhc54/topic/maps Fix mapping directive checks	2016-12-28 03:43:46 -08:00
Ralph Castain	7866bb1119	Add debug, cleanup cpus/rank Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 21:25:52 -08:00
Ralph Castain	1e4bffd937	Fix mapping directive checks Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 20:42:47 -08:00
Jeff Squyres	31e98401c7	Merge pull request #2636 from jsquyres/pr/fix-cflags-uniqueness configury: fix OMPI_UNIQUE -> OMPI_FLAGS_UNIQUE	2016-12-27 17:24:06 -05:00
Jeff Squyres	d772fcf8f1	Merge pull request #2509 from OMGtechy/master Fixed memory leak and some -Werror=unused-result warnings	2016-12-27 17:13:23 -05:00
Jeff Squyres	15c1ee13fb	Merge pull request #2624 from jsquyres/pr/one-more-buildrpm-fix buildrpm.sh: don't use $HOME	2016-12-27 17:02:36 -05:00
Jeff Squyres	fb74c80e4b	configury: fix OMPI_UNIQUE -> OMPI_FLAGS_UNIQUE Looks like we missed one place where we needed to swap OMPI_UNIQUE for OMPI_FLAGS_UNIQUE. Thanks to Phil Tooley (@Telemin) for reporting the issue. Fixes #2635. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-12-27 13:36:53 -08:00
rhc54	24000aae84	Merge pull request #2638 from rhc54/topic/pmixcflags Avoid mangling user-provided CFLAGS by using the new PMIX_FLAGS_UNIQ autoconf script in place of PMIX_UNIQ	2016-12-27 10:13:58 -08:00
Ralph Castain	d3aa3777f3	Per @jsquyres: avoid mangling user-provided CFLAGS by using the new PMIX_FLAGS_UNIQ autoconf script in place of PMIX_UNIQ Refs #2636 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 09:00:59 -08:00
Ralph Castain	791f4f1ce3	Adjust debug output for clarity Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-26 14:04:20 -08:00
Gilles Gouaillardet	22db1d36b6	pmix2x: silence misc warnings Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-26 13:35:17 +09:00
rhc54	67a08e825e	Merge pull request #2632 from rhc54/topic/updates Transfer some minor cleanups back from the PMIx reference server	2016-12-23 12:49:33 -08:00
Ralph Castain	ef3f748d0d	Transfer some minor cleanups back from the PMIx reference server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-23 08:46:04 -08:00
Nysal Jan K A	19e3be31e5	Merge pull request #2421 from nysal/master mpit: Fix MPI_T_pvar_get_index	2016-12-22 15:33:51 +05:30
Gilles Gouaillardet	54c84196a6	btl/vader: plug a memory leak as reported by Coverity with CID 1362691 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-22 16:04:36 +09:00
Nysal Jan K.A	25ba507ada	mpit: Fix MPI_T_pvar_get_index MPI_T_pvar_get_index was returning an incorrect index. The index was never set correctly while registering the performance variables. Additionally fix a missing case in the mca_base_var_type_t to MPI datatype conversion. This type is currently used for control variables registered by mxm, fca and hcoll components. Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>	2016-12-22 12:30:21 +05:30
Gilles Gouaillardet	773cad6b3e	ompi/debugger: fix mqs_version_string() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-22 15:00:47 +09:00
Jeff Squyres	3571c3c5bb	hwloc external: minor fixes to 9649c44 - Fix capitalization typos - Make comment more correct / flow better - Use AM_CPPFLAGS, not DEFAULT_INCLUDES - Remove extra "hwloc/" from external hwloc.h specification Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-12-21 09:06:24 -08:00
Jeff Squyres	6002a8bca5	buildrpm.sh: don't use $HOME This is news to me: I didn't know that some distros do not set $HOME. So use "~" instead, and only try to grep ~/.rpmmacros if it exists. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-12-21 07:42:32 -08:00
Jeff Squyres	678e314c0e	opal_configure_options: remove stale option help --with-libltdl is now added (via AC_ARG_WITH) in opal/mca/dl/libltdl/configure.m4 -- it no longer belongs up here in this top-level m4 file. Plus, the help string in this stale entry is also stale/incorrect. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-12-21 07:28:31 -08:00
Gilles Gouaillardet	9649c44fa0	hwloc: correctly handle --with-hwloc=external - simply #include "hwloc.h" to use the external hwloc header - do use the external hwloc header instead of opal/mca/hwloc/hwloc.h Thanks Orion Poplawski for the report Fixes open-mpi/ompi#2616 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-21 11:58:10 +09:00
Jeff Squyres	5ecd271934	buildrpm.sh: minor fixes Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-12-20 10:54:37 -08:00
rhc54	75ec38db7d	Merge pull request #2609 from rhc54/topic/psrv Bring across some more patches from the debugger work	2016-12-19 20:38:28 -08:00
Ralph Castain	ea133206ec	Sync the internal OMPI component to PMIx master Update external PMIx v2.x component Add missing Makefile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 19:14:16 -08:00
Ralph Castain	4774eb8b5a	Update NEWS with 1.10.5 items Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 16:48:45 -08:00
Xin Zhao	2d77912c19	Revert "PML/SPML/UCX: add UCX MT support to PML and SPML." This reverts commit 0ecf3c951c0a87ab5bdd76a541a69852af671ba9. Signed-off-by: Xin Zhao <xinz@mellanox.com>	2016-12-19 18:57:48 +02:00
rhc54	0acdcebab2	Merge pull request #2601 from rhc54/topic/dbgr Transfer across final fixes from debugger attach work	2016-12-19 03:56:42 -08:00


26314 commits