Ralph Castain
548cd24e4e
Forward-port changes proposed for v3.0 to master from PR #3677
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-09 07:51:21 -07:00
Ralph Castain
1f0f03b45b
Print a better error message when srun isn't found in the path. Ensure we don't segfault if -host specifies a node not included in the allocation
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-09 07:46:47 -07:00
Ralph Castain
3d0fc29b4b
Merge pull request #3684 from rhc54/topic/trivial
...
Protect against NULL topology
2017-06-09 06:04:20 -07:00
Ralph Castain
00ba6a1be6
Protect against NULL topology
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-08 20:56:44 -07:00
Nathan Hjelm
db2204f2f3
ompi: add support for new communicator info assertions
...
This commit adds code to allow support for the info assertions added
by mpi-forum/mpi-issues#11. The assertions added are:
mpi_assert_no_any_tag, mpi_assert_no_any_source,
mpi_assert_exact_length, and mpi_assert_allow_overtaking.
This commit also adds support for the mpi_assert_no_any_source and
mpi_assert_allow_overtaking info keys to the ob1 pml.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-06-08 15:52:12 -06:00
Ralph Castain
a9005d6f72
Merge pull request #3679 from rhc54/topic/spawn
...
Fix the backend mapper algorithm for comm_spawn. The front and back e…
2017-06-08 10:23:07 -07:00
Geoff Paulsen
bdc7206230
Merge pull request #3672 from markalle/pr/darray_fix
...
Type_create_darray with mix of BLOCK/CYCLIC
2017-06-08 10:52:50 -05:00
Ralph Castain
7b39f19f60
Fix the backend mapper algorithm for comm_spawn. The front and back ends need to get the nodes into the job map in the same order so that the ranking algorithms will reach the same results
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-08 08:00:52 -07:00
Ralph Castain
20166460c7
Merge pull request #3676 from rhc54/topic/orted
...
Ensure the orted doesn't go into an infinite loop during force-terminate
2017-06-08 05:51:20 -07:00
KAWASHIMA Takahiro
362445d486
Use same prefix format for [host:pid]
...
Hostname and PID are output as a message prefix in many places in
our code. Their printf-formats were either `[%s:%d]` or `[%s:%05d]`.
This commit changes `[%s:%d]` to `[%s:%05d]`. The latter was more
widely used in our code (including OPAL output system and the signal
handler).
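A minimal sketch of the unified prefix, for illustration only. `format_prefix` is a hypothetical helper, not a function in the Open MPI tree; only the `[%s:%05d]` format string comes from this commit.

```c
#include <stdio.h>

/* Hypothetical helper (not an Open MPI function): write "[host:pid]" into
 * buf using the zero-padded, five-digit PID format this commit settles on.
 * Returns what snprintf returns: the length the full prefix needs. */
static int format_prefix(char *buf, size_t len, const char *host, int pid)
{
    return snprintf(buf, len, "[%s:%05d]", host, pid);
}
```

With host "node01" and PID 1234 this yields `[node01:01234]`. The `05` is a minimum field width, so a PID with more than five digits is printed in full rather than truncated.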
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2017-06-08 19:35:03 +09:00
KAWASHIMA Takahiro
6b91eddc8b
Apply opal_abort_delay to the signal handler
...
This commit expands the effect of the MCA parameter `opal_abort_delay`
to the OPAL signal handler. This allows a debugger to be attached on a
segmentation fault or similar signal before the job quits.
The sleep code is moved to the `opal_delay_abort` function from the
`ompi_mpi_abort` and `oshmem_shmem_abort` functions for code cleanup.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2017-06-08 19:34:48 +09:00
Ralph Castain
81ab79f311
Ensure the orted doesn't go into an infinite loop during force-terminate
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-07 21:44:49 -07:00
Ralph Castain
7002535059
Merge pull request #3671 from rhc54/topic/ofi
...
We cannot use OFI to determine when daemons can finalize as we don't …
2017-06-07 15:08:56 -07:00
George Bosilca
484004b03d
simple_spawn should be independent of ORTE.
2017-06-07 17:51:46 -04:00
Mark Allen
aeb2c02d2f
Type_create_darray with mix of BLOCK/CYCLIC
...
Example (using MPI_ORDER_C so the below has 6 rows of 4 ints to parcel out)
size = 4;
rank = 0;
ndims=2;
gsizes[0] = 6;
gsizes[1] = 4;
distribs[0] = MPI_DISTRIBUTE_CYCLIC;
distribs[1] = MPI_DISTRIBUTE_BLOCK;
dargs[0] = 2;
dargs[1] = 2;
psizes[0] = 2;
psizes[1] = 2;
MPI_Type_create_darray(size, rank, ndims,
gsizes, distribs, dargs, psizes,
MPI_ORDER_C, MPI_INT, &mydt);
Expectation for the layout:
inner dimension (1) is:
4 items (ints) distributed block over 2 ranks with 2 items each
e.g. for rank 0: [ x x . . ]
outer dimension (0) is:
6 items (the above [ x x . . ]) cyclic over 2 ranks with 2 items each
e.g. for rank 0:
[ x x . . ] : offset=0 bytes=8
[ x x . . ] : offset=16 bytes=8
[ . . . . ]
[ . . . . ]
[ x x . . ] : offset=64 bytes=8
[ x x . . ] : offset=80 bytes=8
Or more specifically a stream of ints 0,1,2,3,4,5,6,7 sent into that
type should be
[ 0 1 . . ]
[ 2 3 . . ]
[ . . . . ]
[ . . . . ]
[ 4 5 . . ]
[ 6 7 . . ]
The data was instead being laid out as
[ 0 1 2 3 ]
[ . . . . ]
[ . . . . ]
[ . . . . ]
[ 4 5 6 7 ]
[ . . . . ]
because the recursive construction inside the block() function (which
creates the smaller row datatype [ x x . . ]) wasn't setting the extent
of that type.
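The expected offsets above can be checked with plain arithmetic, without MPI. The sketch below assumes the same setup as the example (dimension 0 cyclic, dimension 1 block, C order). `darray_row_offsets` and its parameters are hypothetical names used for illustration; this is not the code in the datatype engine.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Open MPI or the MPI API.
 * Dimension 0 (rows) is cyclic: chunks of darg0 rows are dealt round-robin
 * to psize0 processes. Dimension 1 (columns) is block-distributed with
 * block size darg1. For the process at grid coordinates (coord0, coord1),
 * store the byte offset of the first owned element in each owned row and
 * return the number of rows it owns. */
static int darray_row_offsets(int gsize0, int gsize1,
                              int darg0, int psize0, int coord0,
                              int darg1, int coord1,
                              size_t elem_size,
                              size_t *offsets, int max_offsets)
{
    int count = 0;
    for (int row = 0; row < gsize0; row++) {
        /* cyclic ownership test for this row */
        if ((row / darg0) % psize0 != coord0) {
            continue;
        }
        if (count < max_offsets) {
            /* block distribution: owned columns start at coord1 * darg1 */
            offsets[count] = ((size_t)row * (size_t)gsize1
                              + (size_t)coord1 * (size_t)darg1) * elem_size;
        }
        count++;
    }
    return count;
}
```

For rank 0, which sits at grid coordinates (0, 0), calling this with gsizes 6 and 4, dargs 2 and 2, psizes[0] = 2 and a 4-byte int gives offsets 0, 16, 64 and 80, each starting a run of 2 ints (8 bytes). That matches the expected layout listed above.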
Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-06-07 16:53:03 -04:00
Ralph Castain
919d7fcf49
We cannot use OFI to determine when daemons can finalize as we don't see the "sockets" go away. So always use the OOB for the mgmt conduit - this provides the necessary termination signal AND ensures that IOF and other mgmt messages go solely across TCP.
...
Clean up the way we look for matching OFI addresses by using the opal_net_samenetwork helper function. This now works for multi-network environments, but only using the socket provider
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-07 13:51:30 -07:00
Nathan Hjelm
f038fe6427
Merge pull request #3661 from jjhursey/fix/ppc-wmb
...
atomics/powerpc: Fix WMB instruction
2017-06-07 12:14:20 -06:00
Ralph Castain
c9a0fd3d3f
Merge pull request #3666 from rhc54/topic/extpmix
...
Correct the external pmix configury
2017-06-07 06:21:36 -07:00
Ralph Castain
2d65908184
Correct the external pmix configury
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-07 00:33:29 -07:00
Ralph Castain
ea5649d381
Merge pull request #3665 from rhc54/topic/trivial
...
Add missing constant to error-strings
2017-06-07 00:09:30 -07:00
Ralph Castain
88b5ec3597
Merge pull request #3664 from rhc54/topic/ext2
...
Get the pmix/ext2x component to work. Fix a minor problem in the libevent external component.
2017-06-06 21:11:47 -07:00
Ralph Castain
bd1793ad17
Get the pmix/ext2x component to work. Fix a minor problem in the libevent external component.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 20:06:28 -07:00
Ralph Castain
17484409a3
Merge pull request #3662 from rhc54/topic/pmixupagain
...
Update to pmix v2.0.0rc1, including thread safety fixes
2017-06-06 16:12:24 -07:00
Ralph Castain
acd60a2cc4
Add missing constant to error-strings
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 16:10:52 -07:00
Ralph Castain
c3e6dc2022
Update to pmix v2.0.0rc1, including thread safety fixes
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 15:16:34 -07:00
Ralph Castain
21fba8b7f3
Merge pull request #3659 from rhc54/topic/threads
...
Update OPAL and ORTE for thread safety
2017-06-06 14:52:40 -07:00
Joshua Hursey
4796193cdb
atomics/powerpc: Fix WMB instruction
...
* `lwsync` is a write memory barrier.
- `eieio` is really not meant for this type of operation.
* `lwsync` can also be used for the read memory barrier according to
my reading of the Power 8 ISA docs (v2.07)
- https://www-01.ibm.com/marketing/iwm/iwm/web/reg/download.do?source=swg-opower&S_PKG=dl&lang=en_US&cp=UTF-8
* References https://github.com/pmix/pmix/pull/391
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-06-06 16:41:37 -05:00
Ralph Castain
93cf3c7203
Update OPAL and ORTE for thread safety
...
(I swear, if I look this over one more time, I'll puke)
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 12:30:57 -07:00
Ralph Castain
7be09f8143
Merge pull request #3658 from rhc54/topic/pmixup
...
Update to PMIx master
2017-06-06 11:23:20 -07:00
Ralph Castain
2f85d10600
Update to PMIx master
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 08:19:25 -07:00
George Bosilca
ba46b35515
Don't assume a size for constants with UL and ULL.
...
According to Section 6.4.4.1 of the C standard, we do not need to append a type suffix
to a constant to get the right size. The compiler will infer the type
according to the number of bits in the constant.
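A small sketch of the rule being relied on, assuming a C99 or later compiler. The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* An unsuffixed hexadecimal constant that needs 33 bits. Under C99 6.4.4.1
 * its type is the first of int, unsigned int, long, unsigned long,
 * long long, unsigned long long that can represent the value, so the value
 * is not truncated. */
static uint64_t bit32_unsuffixed(void)
{
    return 0x100000000;
}

/* The same value with an explicit suffix, for comparison. */
static uint64_t bit32_suffixed(void)
{
    return 0x100000000ULL;
}
```

Both functions return the same value. The rule covers the constant itself only: an expression built from small constants, such as `1 << 40`, is still evaluated in `int` and needs a cast or a suffix.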
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2017-06-05 22:07:53 -04:00
Ralph Castain
29411472f2
Merge pull request #3656 from rhc54/topic/silence
...
Silence warnings when terminating
2017-06-05 15:22:02 -07:00
Ralph Castain
a28eaf914a
Silence warnings when terminating
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-05 13:53:07 -07:00
Jeff Squyres
44aef39b24
Merge pull request #3641 from ggouaillardet/topic/fortran_strings
...
fortran/base: rename strings.h into fortran_base_strings.h
2017-06-05 15:31:08 -04:00
Ralph Castain
8a377beb25
Merge pull request #3651 from rhc54/topic/stuff
...
Do not hang if we cannot relay messages. Eliminate extra error log message
2017-06-05 09:36:29 -07:00
Ralph Castain
594c0e2876
Retain the max terminal length of 78 characters, replace the word "disabled" with a simple "-" and hope people know what that means
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-05 07:10:05 -07:00
Ralph Castain
8f526968c2
Do not hang if we cannot relay messages. Eliminate extra error log message
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-05 06:35:19 -07:00
Ralph Castain
dea9ef2020
Merge pull request #3637 from hjelmn/osc_sm_info_fix
...
osc/sm: fix SEGV in new info usage
2017-06-05 05:45:21 -07:00
Ralph Castain
6d68d2ee0b
Merge pull request #3650 from rhc54/topic/info
...
Change the default sizes for opal_info output
2017-06-05 05:21:59 -07:00
Ralph Castain
e25a051f41
Change the default sizes for opal_info output
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-04 20:30:53 -07:00
Ralph Castain
51b4078b70
Merge pull request #3648 from rhc54/topic/ofi
...
Clean up the conduit open code so we return detectable errors when co…
2017-06-02 18:08:55 -07:00
Ralph Castain
e884cbf5f5
Even though the ofi component doesn't do any routing itself, the rest of the code base (e.g., grpcomm) needs to know what routing module this component is using. So set it to the "direct" module, and don't allow ofi to be used if that module isn't available.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 15:47:25 -07:00
Jeff Squyres
68a22689c4
Merge pull request #3649 from jsquyres/pr/fix-signal-include
...
ess: add missing <signal.h> header
2017-06-02 18:41:48 -04:00
Ralph Castain
ba9a6078c2
Add the ability to select a transport, and only compare the first one in the conduit list for a match. This lets you select which conduit to use for OFI: if you set "-mca rml_ofi_transports ethernet" you'll pick up the mgmt conduit. If you set "-mca rml_ofi_transports fabric", you'll get the coll conduit.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 14:31:23 -07:00
Jeff Squyres
af9565ec25
ess: add missing <signal.h> header
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-06-02 14:11:40 -07:00
Ralph Castain
b0b985bb06
Merge pull request #3644 from rhc54/topic/signals
...
Shift the signal forwarding code to ess/base...
2017-06-02 13:45:13 -07:00
Ralph Castain
066d5eedce
Shift the signal forwarding code to ess/base so it can be available to more than just the hnp component. Extend the slurm component to use it so that any signals given directly to the daemons by their slurmstepd get forwarded to their local clients
...
Check for NULL
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 10:59:14 -07:00
Ralph Castain
6b3bbd30c5
Clean up the conduit open code so we return detectable errors when conduit not opened.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 10:40:51 -07:00
Ralph Castain
e45a358bf0
Merge pull request #3647 from rhc54/topic/forced
...
Provide better help when forced_terminate is invoked
2017-06-02 10:27:41 -07:00
Ralph Castain
2ab4f93f6a
Instead of "forced_terminate" just quietly causing the daemon to disappear, let's at least attempt to let the user know where the problem occurred.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 08:28:16 -07:00