25805 Commits

Author SHA1 Message Date
Jeff Squyres
1953e3406f btl/tcp: add show_help message when peer hangs up
We commonly see messages on the users list where a peer has hung up
because it has crashed.  Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-06 09:40:03 -04:00
Gilles Gouaillardet
894be7860a gcc_builtin/atomic: Silence numerous warnings from Studio compilers
This commit adds selective use of a compiler-specific pragma to
silence the numerous warnings the Sun/Oracle/Studio compilers emit for
the GNU-style inline asm used in atomic.h.

Thanks Paul Hargrove for the initial patch and the guidance.
2016-09-06 09:07:16 +09:00
Gilles Gouaillardet
7b39d9065c Merge pull request #2054 from ggouaillardet/topic/mca_btl_tcp_proc_insert
btl/tcp: make mca_btl_tcp_proc_insert re-entrant
2016-09-06 08:54:38 +09:00
Gilles Gouaillardet
91e1200c14 ompi/request: correctly handle zero count in ompi_request_default_wait_{all,any,some} 2016-09-05 17:19:30 +09:00
Gilles Gouaillardet
4b208e4463 btl/tcp: make mca_btl_tcp_proc_insert re-entrant
otherwise bad things happen with
 --mca btl_tcp_progress_thread 1 (non default)
and
 --mca mpi_add_procs_cutoff 0 (default)
2016-09-05 15:57:34 +09:00
Artem Polyakov
74a11d7832 Fix session dir cleanup code. 2016-09-05 07:53:55 +03:00
Artem Polyakov
dc0ab674de Add PMIx key to provide RM with ability to indicate that it will cleanup
session directories provided through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Artem Polyakov
81195ab724 Several fixes related to session directories:
* enable OMPI to retrieve paths from RM through PMIx
* cleanups related to tempdirs.
2016-09-05 07:48:44 +03:00
Ralph Castain
fb51d65049 Minor change: check for NULL before using the job map to avoid segfault when erroring out prior to creating the map 2016-09-04 07:53:12 -07:00
Alex Mikheev
439456ae96 OSHMEM: spml ikrit: fixes zero copy
Allow mxm to use zero copy in put() and get() for the large messages.
2016-09-04 12:16:09 +03:00
Nathan Hjelm
a36bdfe69f opal/asm: updates to powerpc assembly
This commit contains the following changes:

 - There is a bug in the PGI 16.x betas for ppc64 that causes them to
   emit the incorrect instruction for loading 64-bit operands. If not
   cast to void * the operands are loaded with lwz (load word and
   zero) instead of ld. This does not affect optimized mode. The
   workaround is to cast to void *, implemented similarly to a
   workaround for an xlc bug.

 - Actually implement 64-bit add/sub. These functions were missing and
   fell back to the less efficient compare-and-swap implementations.

Thanks to @PHHargrove for helping to track this down. With this update
the GCC inline assembly works as expected with pgi and ppc64.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 23:47:47 -06:00
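The difference between a compare-and-swap fallback and a direct 64-bit add, as mentioned in commit a36bdfe69f, can be sketched with C11 atomics. This is only an illustration: the function names `add64_via_cas` and `add64_direct` are hypothetical, and the code uses `<stdatomic.h>` rather than the Open MPI powerpc assembly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Fallback style: emulate an atomic add with a compare-and-swap loop.
 * (Hypothetical helper, not an Open MPI function.) */
static int64_t add64_via_cas(_Atomic int64_t *addr, int64_t delta)
{
    assert(addr != NULL);
    int64_t old = atomic_load(addr);
    /* On failure, 'old' is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
    }
    return old + delta;
}

/* Direct style: a single atomic read-modify-write operation.
 * (Hypothetical helper, not an Open MPI function.) */
static int64_t add64_direct(_Atomic int64_t *addr, int64_t delta)
{
    assert(addr != NULL);
    return atomic_fetch_add(addr, delta) + delta;
}
```

Both helpers return the new value. The loop version may retry under contention, which is one reason a dedicated add can be cheaper.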
Jeff Squyres
95c6f6cfc0 btl/tcp: fix help message
It looks like one help message was accidentally pasted in the middle
of another.  Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-02 17:14:22 -04:00
rhc54
9c496f767b Merge pull request #1602 from rhc54/topic/psm
Enable PSM to support dynamic processes
2016-09-02 14:41:19 -05:00
Nathan Hjelm
795833bfac config: re-enable GCC inline ASM check for PGI
We disabled this support a long time ago. Probably safe to assume
whatever bug we were working around no longer exists.

Closes open-mpi/ompi#2044

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 12:44:08 -06:00
Joshua Hursey
fe937d1e82 orte: !FQDN implementation to use opal_net_isaddr
* Switch to use opal_net_isaddr() for checking if a name is an IP
   address - as it is a bit cleaner, and uses common functionality.
2016-09-02 13:31:49 -05:00
Ralph Castain
4e0788e9ad Enable PSM to support dynamic processes
Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object
2016-09-02 10:22:04 -07:00
Nathan Hjelm
3274203f8a Merge pull request #2046 from hjelmn/ugni_fix
btl/ugni: fix erroneous warning message
2016-09-02 10:32:28 -06:00
Nathan Hjelm
f93c1f2106 btl/ugni: fix erroneous warning message
This commit prevents the connection code from trying to connect an
endpoint if the directed datagram has been posted but not received.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 09:17:44 -06:00
Ralph Castain
5c9ea565b6 Update NEWS 2016-09-01 18:10:21 -07:00
Ralph Castain
34f04a7924 Remove spurious Makefile.am line 2016-09-01 15:31:09 -07:00
Nathan Hjelm
1ce5847e8b osc/rdma: add support for network AMOs
This commit adds support for using network AMOs for MPI_Accumulate,
MPI_Fetch_and_op, and MPI_Compare_and_swap. This support is only
enabled if the ompi_single_intrinsic info key is specified or the
acc_single_intrinsic MCA variable is set. This configuration
indicates to this implementation that no long accumulates will be
performed since these do not currently mix with the AMO
implementation.

This commit also cleans up the code somewhat. This includes removing
unnecessary struct keywords where the type is also typedef'd.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-01 15:47:33 -06:00
rhc54
fde6e6c6f8 Merge pull request #2043 from rhc54/topic/notifycomplete
Implement notification of completion on comm_spawn'd child jobs.
2016-09-01 16:42:30 -05:00
Ralph Castain
0ea1cff733 Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and disable it by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional 2016-09-01 13:10:10 -07:00
Nathan Hjelm
43b2e3a844 Merge pull request #2041 from hjelmn/osc_pt2pt_fix
osc/pt2pt: do not use frag send to send lock request
2016-09-01 13:02:45 -06:00
Nathan Hjelm
cb1cb5ffed osc/pt2pt: do not use frag send to send lock request
This commit cleans up some code in the passive target path. The code
used the buffered frag control send path but it is more appropriate to
use the unbuffered one. This avoids checking structures that
should not be in use in this path.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-01 09:57:27 -06:00
Gilles Gouaillardet
184d53a018 oshmem: swap fields of oshmem_proc_data_t to prevent padding
previously, the definition was

struct oshmem_proc_data_t {
    int num_transports;
    char * transport_ids;
};

so on a 64-bit architecture, the compiler would very likely insert 4 bytes
of padding between the two fields in order to have transport_ids aligned
2016-09-01 14:20:14 +09:00
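The layout effect described in commit 184d53a018 can be inspected with `offsetof`. The sketch below uses two hypothetical stand-in structs (`proc_data_int_first`, `proc_data_ptr_first`) rather than the real Open MPI definitions, and the exact numbers depend on the platform ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two field orders; not the actual
 * Open MPI definitions. */
struct proc_data_int_first {
    int num_transports;
    char *transport_ids;   /* typical LP64 ABIs pad before this field */
};

struct proc_data_ptr_first {
    char *transport_ids;
    int num_transports;    /* no gap is needed between the two fields */
};

/* Bytes of padding between the two fields of the int-first layout. */
static size_t padding_between_fields(void)
{
    size_t off = offsetof(struct proc_data_int_first, transport_ids);
    assert(off >= sizeof(int));
    return off - sizeof(int);
}
```

On a typical LP64 target `padding_between_fields()` would be expected to return 4. With the pointer first, any remaining padding moves to the end of the struct.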
Gilles Gouaillardet
0a25420dac oshmem: get rid of oshmem_proc_t and use ompi_proc_t instead
store oshmem related per proc data in an oshmem_proc_data_t struct,
that is stored in the padding section of an ompi_proc_t

this data can be accessed via the OSHMEM_PROC_DATA(proc) macro

Fixes open-mpi/ompi#2023
2016-09-01 14:20:14 +09:00
Gilles Gouaillardet
0b8c58298d oob/usock: fix handling of orte_process_name_t *
orte_process_name_t is aligned on 32 bits, so it cannot simply be cast
to an int64_t. Use memcpy() instead

Thanks Paul Hargrove for the report
2016-09-01 13:18:02 +09:00
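The memcpy() approach from commit 0b8c58298d can be sketched as follows. The type `proc_name_t` and the helper `name_to_i64` are hypothetical stand-ins for illustration, assuming a name made of two 32-bit fields; this is not the actual oob/usock code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a process name: two 32-bit fields, so the
 * struct is only guaranteed 32-bit alignment. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} proc_name_t;

/* Copy the 8 bytes of the name into a 64-bit integer instead of
 * dereferencing a possibly misaligned int64_t pointer. */
static int64_t name_to_i64(const proc_name_t *name)
{
    int64_t key;
    assert(sizeof(key) == sizeof(*name));
    memcpy(&key, name, sizeof(key));
    return key;
}
```

Dereferencing `(int64_t *)name` directly could fault on architectures that enforce 64-bit alignment. Compilers usually turn a fixed-size memcpy() like this into plain loads and stores where that is safe.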
Gilles Gouaillardet
75b7ef97a0 coll/libnbc: fix nbc_ireduce when sendbuf == recvbuf
if sendbuf is equal to recvbuf, that should not be interpreted
as equivalent to MPI_IN_PLACE on the non root rank(s)

Thanks Valentin Petrov for the report
2016-09-01 10:19:05 +09:00
Gilles Gouaillardet
2969235324 libnbc: fix NBC_Copy for predefined datatypes
predefined datatypes such as MPI_LONG_DOUBLE_INT are not really contiguous,
so use span as returned by opal_datatype_span() instead of type extent,
otherwise data might be written past the end of the allocated memory.

Thanks Valentin Petrov for the report
2016-09-01 10:18:57 +09:00
Jeff Squyres
16fe18eb7c Merge pull request #2036 from edgargabriel/pr/datatype-refcount-fix
io/ompio: fix the reference count of basic datatypes used as etypes o…
2016-08-31 19:05:49 -04:00
Edgar Gabriel
be183cb3dd io/ompio: fix the reference count of basic datatypes used as etypes or ftypes. 2016-08-31 14:08:26 -05:00
rhc54
39d086e000 Merge pull request #2035 from rhc54/topic/memprofile
Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint
2016-08-31 14:06:48 -05:00
Ralph Castain
39992d1ad7 Silence trivial Coverity warnings 2016-08-31 09:42:33 -07:00
Ralph Castain
c1050bc01e Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose. 2016-08-31 09:32:07 -07:00
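As an illustration of the PSS metric mentioned in commit c1050bc01e: on Linux, per-mapping PSS values appear as `Pss:` lines in `/proc/<pid>/smaps`. The helper below is a hypothetical sketch that sums those lines from any stream in that format. It is an assumption about how such a figure could be gathered, not the code mpirun or the daemons use.

```c
#include <assert.h>
#include <stdio.h>

/* Sum every "Pss:" entry (in kB) from an smaps-formatted stream.
 * (Hypothetical helper, not an Open MPI function.) */
static unsigned long sum_pss_kb(FILE *fp)
{
    char line[256];
    unsigned long total = 0;
    unsigned long kb;

    assert(fp != NULL);
    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Only lines that start with "Pss:" match; "SwapPss:" does not. */
        if (sscanf(line, "Pss: %lu kB", &kb) == 1) {
            total += kb;
        }
    }
    return total;
}
```

A caller would open `/proc/self/smaps` (or the file for a child PID) with `fopen()` and pass the stream in.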
Jeff Squyres
ead9b6389a README: update for new mailman and main web sites
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-31 09:55:36 -04:00
Jeff Squyres
c33aaa5604 Merge pull request #1997 from jsquyres/pr/make-cma-configury-better
opal_check_cma: make consistent with rest of configury
2016-08-31 09:50:24 -04:00
Jeff Squyres
ba0ed2401a Merge pull request #2031 from jsquyres/pr/fix-fortran-runpath-detection
opal_setup_wrappers.m4: fix typo in Fortran rpath detection
2016-08-31 09:28:32 -04:00
rhc54
ed5846038b Merge pull request #2033 from rhc54/topic/state
Ensure that the "running" state is correctly updated
2016-08-31 01:50:38 -05:00
Ralph Castain
9b991bd1f5 Ensure that the "running" state is correctly updated
It is possible that one or more procs could get through PMIx_Init, and thus be marked as in state "registered", before all local procs have been started. If that happens, then we would report some of the procs in state "running", and the others in state "registered" - which means that the HNP would miss the "running" stage of the state machine.

Thanks to Jingchao Zhang for his patience in tracking this down on the 2.0 branch
2016-08-30 19:24:39 -07:00
Jeff Squyres
1b9e165a4c opal_setup_wrappers.m4: fix typo in Fortran rpath detection
Add missing space in Libtool command for runpath detection.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-30 16:48:05 -07:00
rhc54
bfe0327f7b Merge pull request #2029 from rhc54/topic/notify
Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking
2016-08-29 23:11:23 -05:00
Ralph Castain
cfa784c9a6 Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking 2016-08-29 20:22:24 -07:00
Nathan Hjelm
99b26644c1 Merge pull request #2011 from hjelmn/osc_pt2pt_fix
osc/pt2pt: fix possible race in peer locking
2016-08-29 09:17:36 -06:00
Josh Hursey
b0d8638824 Merge pull request #2015 from jjhursey/topic/mixed-hostnames
orte: Expand use of !orte_keep_fqdn_hostnames MCA parameter
2016-08-29 09:14:54 -05:00
George Bosilca
a6d515ba9e Fixes opal_atomic_ll_64. Thanks to Paul Hargrove for
the report and his patch.

This is an addition to #1140 and should go in 2.x
2016-08-27 12:43:48 -04:00
Nathan Hjelm
d33204b0dc Merge pull request #2021 from hjelmn/xlc_fix
opal/patcher: fix xlc support
2016-08-26 18:15:41 -06:00
rhc54
b90a64e734 Merge pull request #2022 from rhc54/topic/nnodes
Provide the number of nodes in the job
2016-08-26 18:15:24 -05:00
Ralph Castain
2f6e0fec90 Provide the number of nodes in the job 2016-08-26 14:50:41 -07:00
Joshua Hursey
d26dd2c20e orte: Expand the application of !orte_keep_fqdn_hostnames
 * Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when
   it is set to false.
 * If that parameter is set to false (default) then short hostnames
   (e.g., `node01`) will match with the long hostnames (e.g.,
   `node01.mycluster.org`). This allows a user (or resource manager)
   to mix the use of short and long hostnames.
   - Note that this mechanism does _not_ perform a DNS lookup, but
     instead strips off the FQDN by truncating the hostname string at
     the first `.` character (when not an IP address).
   - By default (`false`) the following is true:
     `node01 == node01.mycluster.org == node01.bogus.com`
     since we use `node01` as the hostname.
2016-08-26 16:09:04 -05:00
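The matching rule described in commit d26dd2c20e (truncate at the first `.` unless the name is an IP address) can be sketched as below. The helpers `looks_like_ip`, `short_len` and `hostnames_match` are hypothetical. In particular, `looks_like_ip` uses POSIX `inet_pton()` as a stand-in for `opal_net_isaddr()`, whose exact behavior is not reproduced here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Stand-in for an "is this an IP address?" check (assumption: the real
 * code relies on opal_net_isaddr() instead). */
static bool looks_like_ip(const char *name)
{
    unsigned char buf[sizeof(struct in6_addr)];
    return inet_pton(AF_INET, name, buf) == 1 ||
           inet_pton(AF_INET6, name, buf) == 1;
}

/* Length of the part used for comparison: up to the first '.', or the
 * whole string for IP addresses and names without a dot. */
static size_t short_len(const char *name)
{
    assert(name != NULL);
    if (looks_like_ip(name)) {
        return strlen(name);
    }
    return strcspn(name, ".");
}

/* True when both names reduce to the same short hostname. */
static bool hostnames_match(const char *a, const char *b)
{
    size_t la = short_len(a);
    size_t lb = short_len(b);
    return la == lb && strncmp(a, b, la) == 0;
}
```

With this sketch, `node01`, `node01.mycluster.org` and `node01.bogus.com` all compare equal, while `10.0.0.1` and `10.0.0.2` stay distinct because IP addresses are not truncated.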