1
1
Граф коммитов

4479 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
0445052a1c Check for multiple declarations of a given MCA param and error out if detected as that can create an ambiguous definition of the param value.
Refs trac:4897

This commit was SVN r32719.

The following Trac tickets were found above:
  Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-12 22:21:30 +00:00
Ralph Castain
9e7e90265f Temporarily make the direct grpcomm component the default until we can debug the other modules
This commit was SVN r32707.
2014-09-11 14:47:54 +00:00
Ralph Castain
4eb6291334 Avoid conflicts when multiple collectives are underway in ORTE by giving each grpcomm component its own RML tag and posting persistent receives. We use the signature anyway to determine which collective the received message is addressing, so there is no need to post non-persistent receives.
This commit was SVN r32703.
2014-09-10 17:36:16 +00:00
Ralph Castain
ea11e63f59 Per patch from Tetsuya, allow the user to bind-to none when specifying multiple pe's/rank as requested by Reuti. This allows the user to reserve multiple "slots" in the allocation for each process while mapping, but not to bind the process to specific processing elements on the node.
Reviewed by rhc, so RM-approved to go across to v1.8.3

cmr=v1.8.3:reviewer=ompi-gk1.8

This commit was SVN r32701.
2014-09-10 15:52:18 +00:00
Ralph Castain
e671620ac7 Per request from Jeff: tune up the help messages for binding options
Refs trac:4898

This commit was SVN r32691.

The following Trac tickets were found above:
  Ticket 4898 --> https://svn.open-mpi.org/trac/ompi/ticket/4898
2014-09-09 22:39:22 +00:00
Gilles Gouaillardet
63209eac5b orte/util: use ORTE_JOB_FAMILY and ORTE_LOCAL_JOBID macros
This commit was SVN r32688.
2014-09-09 05:13:00 +00:00
Ralph Castain
4207b4c4ad Improve the --bind-to help message to better indicate the default options under various values of np. Remove the warning message if the user doesn't specify a binding policy and we are overloaded
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32687.
2014-09-08 21:03:51 +00:00
Ralph Castain
4df1aa63f7 Since we've run into the situation where someone puts a script wrapper around a launcher such as srun, we need to always protect MCA cmd line params with quotes. This means we also need to protect the backend from quotes coming into the system as part of a value, or else the parser gets confused.
So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes.

cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32686.
2014-09-08 20:38:46 +00:00
Ralph Castain
6323b226c7 Bring over some updates from the PMIx branch - mostly just minor cleanups. Make the direct grpcomm component no longer be the default. For now, we seem to be having problems with non-blocking fence operations, so make them not be the default under any scenario (e.g., when sm is the only btl in operation).
This commit was SVN r32673.
2014-09-06 19:19:44 +00:00
Ralph Castain
94ffca4901 Correct the cutoff point for full modex operation as it is based on the number of nodes in the system, not the number of procs in the signature.
This commit was SVN r32666.
2014-09-03 17:28:12 +00:00
Ralph Castain
2bfb18e004 Resolve some race conditions when async pmix modex modes are invoked. Since calls to "get" data can come both locally and remotely before data for a given proc has actually been received, we have to track all requests that cannot be immediately fulfilled and provide the data once it has been received.
This commit was SVN r32664.
2014-09-02 20:04:17 +00:00
Ralph Castain
4d186e6402 Properly protect the MCA parameters being registered by the OOB/TCP component when IPv6 is enabled
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32662.
2014-09-02 14:53:00 +00:00
Ralph Castain
f2b26bde4c Resolve a race condition that could cause us to hang during abnormal terminations due to multi-counting num_terminated
This commit was SVN r32660.
2014-09-02 00:32:52 +00:00
Ralph Castain
e49ca05f11 Remove unused variable
This commit was SVN r32651.
2014-08-31 03:11:50 +00:00
Ralph Castain
5cdbc00136 Re-enable the usock oob component. Ensure the TCP component promotes messages for other procs to the OOB base so that other components have a chance to send the relay. Seems to be passing MTT, so let's see how it works for others.
This commit was SVN r32650.
2014-08-30 19:33:46 +00:00
Ralph Castain
a2085a5916 Fix the PSM transport key generator to match prior releases
This commit was SVN r32649.
2014-08-30 00:48:25 +00:00
Ralph Castain
cb0739dfd4 Update the regex to resolve a bug
This commit was SVN r32647.
2014-08-29 22:24:20 +00:00
Ralph Castain
8faabed2cd Add some further initialization and protection for zero-byte messages
This commit was SVN r32644.
2014-08-29 17:24:55 +00:00
Ralph Castain
2b225e3776 Cleanup a race condition regarding marking that waitpid_fired. We should always mark it as fired when we enter the wait_local_proc routine, and also mark it as no longer alive if iof_complete has also been found. If other places in the code also update those flags, there is no harm done.
This commit was SVN r32643.
2014-08-29 17:03:31 +00:00
Ralph Castain
730e28349e Some minor uninitialized variable cleanups
This commit was SVN r32629.
2014-08-29 02:21:13 +00:00
Ralph Castain
fafdbeec0c Cleanup and enable the new daemon collective modules for more scalable operations. Thanks to Nadezhda Kogteva (Mellanox) for doing them.
This commit was SVN r32624.
2014-08-28 20:35:35 +00:00
Ralph Castain
731a878ff3 Add a bunch of debug to help track down the problem, and eventually find another place where comparison of signatures was incorrectly performed - use the dss compare operation to be consistent and safe
This commit was SVN r32620.
2014-08-27 19:52:20 +00:00
Ralph Castain
5fb7c7d23b Don't explicitly add the hostname to the data fetch when we already cached a remote blob
This commit was SVN r32619.
2014-08-27 16:18:05 +00:00
Ralph Castain
3c24770bce Protect debug printing on backend nodes
This commit was SVN r32618.
2014-08-27 16:17:28 +00:00
Ralph Castain
b87b69e977 Ensure the nodes get added to the job map on the remote nodes, add some debug to grpcomm daemon array construction
This commit was SVN r32617.
2014-08-27 16:16:46 +00:00
Ralph Castain
842aaf6167 Correctly end mapping oversubscribed nodes round-robin byslot
cmr=v1.8.3:reviewer=rhc

This commit was SVN r32616.
2014-08-27 16:15:18 +00:00
Gilles Gouaillardet
2679629a12 pmix: fix compilation when configured with --without-hwloc
This commit was SVN r32604.
2014-08-26 08:31:05 +00:00
Ralph Castain
1221e8a96f Compare the full signature - thanks to Gilles for identifying the problem
This commit was SVN r32595.
2014-08-25 14:52:06 +00:00
Ralph Castain
5a13cdb739 Fix a race condition caused by a bad attribute flag that created an OR instead of an AND condition check
This commit was SVN r32587.
2014-08-22 22:48:16 +00:00
Ralph Castain
039b7acfb5 Fix the quoting algorithm so only rsh command lines get quoted values
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32586.
2014-08-22 22:47:38 +00:00
Ralph Castain
f00af81c1d Little more cleanup under the abort cases cited by Gilles. All seem to be working now
This commit was SVN r32585.
2014-08-22 19:57:57 +00:00
Ralph Castain
b1a7375192 Fix the "unreachable" message so it outputs the correct hostname for the remote proc. Cleanup some of the pmix stuff when running corner cases of errors
This commit was SVN r32584.
2014-08-22 19:20:45 +00:00
Ralph Castain
6ff2a60829 Handle the non-blocking fence case correctly, and ensure we always at least pass back the hostname of the process whose info is being requested so that the ompi_proc_t can correctly initialize it when we are in a non-blocking fence with np < cutoff scenario
This commit was SVN r32578.
2014-08-22 14:26:24 +00:00
Ralph Castain
8f1b9b463e Fix shared memory operations - need to pass the local topology and cpusets of all local peers so we can properly compute relative locality for them. Also need to set default locality to "on node" in case where cpusets are not passed because procs are not bound.
This commit was SVN r32577.
2014-08-22 05:17:51 +00:00
Ralph Castain
c6f78d6e54 The PMI ess component now gets used for more than direct launch, so only set standalone_operation flag if no daemon uri is available so we aggregate show_help messages
This commit was SVN r32574.
2014-08-22 03:00:56 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “lmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Ralph Castain
b4511913f6 Remove an unnecessary optimization that can cause more trouble than it's worth - just try all the addresses that are given to us.
Refs trac:4870

This commit was SVN r32558.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-20 20:58:07 +00:00
Ralph Castain
fa28710d53 Track down the last piece of the connection problem. It appears that
providing a netmask of 0 to opal_net_samenetwork results in everything
looking like it is on the same network. Hence, we were not retaining any
of the alternative addresses, so we had no other way to check them.

Refs trac:4870

This commit was SVN r32556.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-20 16:55:36 +00:00
Ralph Castain
343038af7b Frazzle-frump! Missed that we reset the peer state just before the new check.
Refs trac:4870

This commit was SVN r32554.

The following Trac tickets were found above:
  Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870
2014-08-19 22:34:49 +00:00
Ralph Castain
0a91fdf85f If an initial address fails to connect, record that fact and attempt the next address for that proc. If nothing succeeds, then declare failure.
cmr=v1.8.2:reviewer=edgar

This commit was SVN r32553.
2014-08-19 19:48:24 +00:00
Ralph Castain
024572cb6c Sigh - I promised to remove these deprecation warnings back in June. My apologies to Dave Goodell and others who requested it.
cmr=v1.8.2:reviewer=dgoodell:subject=remove deprecation warnings for pernode, npernode, and npersocket

This commit was SVN r32552.
2014-08-19 19:40:20 +00:00
Jeff Squyres
1551339eba rsh: revert part of r32517: keep the quoting
As part of reviewing CMR #4860, I talked through r32517 with Ralph.

In attempt to fix various rsh quoting problems, r32517 removed all the
quoting from the main code path and then only added it back in at the
end in some cases.

This commit puts back the quoting parts that were removed in r32517
(r32517 fixed 2 other important bugs: a) change "--<foo>" to "--mca
<foo_equivalent> 1" so that de-duplication works, and b) change a !=
to ==).

refs trac:4860

This commit was SVN r32524.

The following SVN revision numbers were found above:
  r32517 --> open-mpi/ompi@7342bce58f

The following Trac tickets were found above:
  Ticket 4860 --> https://svn.open-mpi.org/trac/ompi/ticket/4860
2014-08-13 19:27:10 +00:00
Gilles Gouaillardet
f96d382d1d Fix typo.
Thanks to Christopher Samuel for reporting it

This commit was SVN r32520.
2014-08-13 05:54:59 +00:00
Ralph Castain
7342bce58f Cleanup the over-aggressive quoting of params on the orted cmd line. Remove duplicates caused by passing on both cmd line shortcuts and the mca param version of the same thing.
Fixes trac:4857

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32517.

The following Trac tickets were found above:
  Ticket 4857 --> https://svn.open-mpi.org/trac/ompi/ticket/4857
2014-08-13 03:51:04 +00:00
George Bosilca
de7191132d Remove few warnings.
This commit was SVN r32506.
2014-08-11 13:34:44 +00:00
Gilles Gouaillardet
a873f45a90 Fix r32460 race condition resolution when procs call MPI_Abort.
do not invoke orte_session_dir_finalize(...) so
orte_ess_base_app_abort(...) can successfully createi
<orte_process_info.proc_session_dir>/aborted

cmr=v1.8.2:reviewer=rhc

This commit was SVN r32498.

The following SVN revision numbers were found above:
  r32460 --> open-mpi/ompi@abedb97be4
2014-08-11 05:50:32 +00:00
Gilles Gouaillardet
e184733ef6 check-help-strings cleanup
This commit was SVN r32496.
2014-08-11 03:26:21 +00:00
Gilles Gouaillardet
f24699623f check-help-strings cleanup
This commit was SVN r32495.
2014-08-11 03:25:22 +00:00
Gilles Gouaillardet
c3c364a262 check-help-strings cleanup
This commit was SVN r32494.
2014-08-11 03:22:05 +00:00
Gilles Gouaillardet
d9e0212e0e check-help-strings cleanup
This commit was SVN r32493.
2014-08-11 03:21:08 +00:00