Commit graph

20575 commits

Author SHA1 Message Date
Ralph Castain
356e7ea904 Move all collective id's into the attributes and let the job pack/unpack take care of them instead of singling them out. Add the envars just prior to forking the children instead of into the launch message itself. Remove a few #if CR as the attributes functionality can handle this condition now.
This commit was SVN r32133.
2014-07-03 15:58:13 +00:00
Ralph Castain
0a4639308e Remove a potential race condition - we'll clean up the local children when we are all done
This commit was SVN r32132.
2014-07-03 14:13:43 +00:00
Jeff Squyres
852af8b834 ompi_mpi_abort: fix corner cases, simplify logic
I recently found a case where ompi_mpi_abort() segv's:

{{{
$ mpirun --mca btl non_existent_btl_name ...
}}}

In this case, the BML init fails because we have no paths to any
peers.  It calls ompi_mpi_abort(), but this is before ompi_comm_self
has been set up.  ompi_mpi_abort() assumes that if the comm parameter
is != NULL, it can be used.  But since we aborted so early in
MPI_INIT, that's a false assumption.

(note that this isn't happening on v1.8 because the check for
INIT/FINALIZE in ompi_mpi_abort() is a little different.  Hence: this
is a trunk issue -- at least for now)

When fixing this problem, I noticed a few other problems in ompi_mpi_abort():

* the group access was incorrect (it didn't use accessor functions)
* it wasn't clear that ORTE's ompi_rte_abort_peers() returns
  NOT_IMPLEMENTED and falls through down to ompi_rte_abort()
* the check for my proc in the communicator was a little more
  complicated than necessary
* the logic for checking for aborts early in MPI_INIT wasn't right
* some comments were stale
* the hostname output in error messages would be NULL if MPI_FINALIZE
  had been invoked
* it was possible to abort, but still exit with a 0 status

This commit fixes all of the above problems, and makes the logic a
little more straightforward.  Thanks to Ralph Castain and George
Bosilca for the assists with this patch.

This commit was SVN r32125.
2014-07-03 02:38:27 +00:00
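
Two of the points in 852af8b834 (do not trust a non-NULL communicator before initialization has finished, and never exit with status 0 after an abort) can be illustrated with a small sketch. This is not Open MPI code: `MpiState`, `abort_plan`, and the fallback status of 1 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MpiState:
    # Hypothetical flags standing in for the library's init/finalize tracking.
    initialized: bool = False
    finalized: bool = False


def abort_plan(state, comm, errcode):
    """Return (use_comm, exit_status) for an abort request.

    use_comm is True only when a communicator was passed AND the library is
    between a completed init and the start of finalize.
    """
    use_comm = comm is not None and state.initialized and not state.finalized
    # An abort must not look like success: map errcode 0 to a nonzero status.
    exit_status = errcode if errcode != 0 else 1
    return use_comm, exit_status
```

With `MpiState()` (init not yet complete), a non-`None` communicator is ignored, which corresponds to the early-abort case described in the commit message.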
George Bosilca
843ef1fcb0 ompi_mpi_abort had one extra argument that was never used. Clean it up.
This commit was SVN r32124.
2014-07-03 00:34:44 +00:00
George Bosilca
2883adcdf3 Remove useless variables.
This commit was SVN r32123.
2014-07-03 00:30:54 +00:00
Ralph Castain
149810f02c Per request from Jeff, slightly modify the show_help message as the precise name of the NUMA-containing packages differs based on OS and distro
cmr=v1.8.2:reviewer=jsquyres:subject=modify show_help message

This commit was SVN r32122.
2014-07-02 14:46:00 +00:00
Gilles Gouaillardet
8a2a0293fd fix sort_devs_by_distance in btl/openib
no need to #include <math.h> ...

cmr=v1.8.2:reviewer=miked:ticket=4759

This commit was SVN r32121.

The following Trac tickets were found above:
  Ticket 4759 --> https://svn.open-mpi.org/trac/ompi/ticket/4759
2014-07-02 08:08:10 +00:00
Gilles Gouaillardet
134eee1c4f fix sort_devs_by_distance in btl/openib
The distances returned by hwloc_get_whole_distance_matrix_by_type are of type float.
This patch handles all distances as float.

cmr=v1.8.2:reviewer=miked

This commit was SVN r32120.
2014-07-02 07:56:40 +00:00
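
As a rough illustration of why 134eee1c4f keeps distances as float, the sketch below sorts devices by distance with and without narrowing the value to an integer. Whether the earlier code narrowed the values in exactly this way is an assumption; the device names and distances are made up.

```python
def sort_devs_by_distance(devs, truncate=False):
    """Sort (name, distance) pairs by distance and return the names.

    With truncate=True the distance is narrowed to int before comparison,
    which is the hypothetical failure mode being illustrated.
    """
    if truncate:
        key = lambda dev: int(dev[1])
    else:
        key = lambda dev: dev[1]
    return [name for name, _ in sorted(devs, key=key)]


# Made-up devices: both distances truncate to 1, so the integer sort
# cannot tell them apart and simply keeps the input order.
devs = [("dev_far", 1.8), ("dev_near", 1.2)]
as_float = sort_devs_by_distance(devs)
as_int = sort_devs_by_distance(devs, truncate=True)
```

Here `as_float` puts `dev_near` first, while `as_int` leaves `dev_far` first because the two truncated keys compare equal.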
Ryan Grant
a1d312343b This commit fixes trac:4681 - ibm c_fence_lock hangs
cmr=v1.8.2:reviewer=tkordenbrock:subject=Portals4/MTL hanging fix

This commit was SVN r32113.

The following Trac tickets were found above:
  Ticket 4681 --> https://svn.open-mpi.org/trac/ompi/ticket/4681
2014-07-01 17:03:03 +00:00
Ryan Grant
5cb8cc856c Refs trac:4682 - This commit fixes c_flush test failure in the ibm test suite for Portals 4 OSC
cmr=v1.8.2:reviewer=tkordenbrock:subject=Move r32112 to v1.8.2 branch

This commit was SVN r32112.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r32112

The following Trac tickets were found above:
  Ticket 4682 --> https://svn.open-mpi.org/trac/ompi/ticket/4682
2014-07-01 16:26:16 +00:00
Ralph Castain
e9d69ca370 Remove stale test
This commit was SVN r32104.
2014-06-29 16:37:19 +00:00
Mike Dubman
503db51715 OSHMEM: properly handle situation when verb is unavailable
fixed by Alex, reviewed by Miked
cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32103.
2014-06-28 19:02:24 +00:00
Mike Dubman
142f7290bc BUILD: update platform file with debug caps
cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32102.
2014-06-28 18:55:31 +00:00
Mike Dubman
ce6d5b8cd7 HCOLL: make it OFF by default
fixed by miked, reviewed by Alex

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32101.
2014-06-28 18:45:03 +00:00
Mike Dubman
247da2819f OSHMEM: fix wrong btl/sm processing and typo
fixed by Igor reviewed by Alex,Mike,Yossi

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32100.
2014-06-28 18:40:28 +00:00
Mike Dubman
5a06f5dff5 OSHMEM: fix bss check
fixed by AlexM reviewed by Miked

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32099.
2014-06-28 18:37:45 +00:00
Dave Goodell
c104604387 common/verbs: update usnic transport probe
RHEL 7 has shipped with kernel support for the RDMA_TRANSPORT_USNIC
enum, but ''not'' the RDMA_TRANSPORT_USNIC_UDP enum.  This means that
when you install usNIC drivers from cisco.com, the kernel will report
IBV_TRANSPORT_USNIC, even though the transport is actually using UDP.

Therefore, we have to modify the logic in common/verbs to do the
additional magic probe if the device reports either an
IBV_TRANSPORT_IWARP or IBV_TRANSPORT_USNIC (because both of those might
be lies -- do the probe to figure out the real transport).

The code changed by this patch is fairly trivial; it simply moves the
logic of the magic probe to its own short function, and then calls that
short function in both the IBV_TRANSPORT_(IWARP|USNIC) cases.  It looks
longer because several lengthy comments were also updated.

Authored-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Dave Goodell <dgoodell@cisco.com>

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32098.
2014-06-27 18:43:32 +00:00
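
The control flow described in c104604387 (run the extra probe whenever the device reports either iWARP or usNIC, and trust the reported value otherwise) might look roughly like the following. The `Transport` enum, the dictionary-based device, and `magic_probe` are hypothetical stand-ins, not the libibverbs or common/verbs API.

```python
from enum import Enum, auto


class Transport(Enum):
    IB = auto()
    IWARP = auto()
    USNIC = auto()
    USNIC_UDP = auto()


def magic_probe(device):
    # Hypothetical stand-in for the real probe: the fake device simply
    # carries a flag saying whether it is actually running over UDP.
    if device.get("really_udp"):
        return Transport.USNIC_UDP
    return device["reported"]


def real_transport(device):
    reported = device["reported"]
    # Both of these reported values might be wrong, so probe in either case.
    if reported in (Transport.IWARP, Transport.USNIC):
        return magic_probe(device)
    return reported
```

Moving the probe into its own short function, as the commit describes, is what lets both branches share it.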
Devendar Bureddy
228772ae81 hcoll gatherv support
cmr=v1.8.2:reviewer=jladd

This commit was SVN r32097.
2014-06-26 18:14:41 +00:00
George Bosilca
99561c5cc1 If the enable fails, don't give up, but instead keep going with
the other collective modules. If we end up without some of the
collectives, the code will raise an error anyway.

cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r32096.
2014-06-26 15:52:45 +00:00
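
The behavior described in 99561c5cc1 can be sketched as a selection loop that skips a module whose enable step fails and only raises an error at the end if a required collective is still missing. The module tuples, `REQUIRED`, and `select_collectives` are hypothetical and do not reflect the actual coll framework interfaces.

```python
REQUIRED = ("barrier", "bcast", "reduce")  # illustrative subset only


def select_collectives(modules):
    """modules: list of (name, enable_fn, provided_collectives), best first.

    Returns a mapping collective -> module name.
    """
    chosen = {}
    for name, enable, provided in modules:
        if not enable():
            # Enable failed: skip this module but keep trying the others.
            continue
        for coll in provided:
            # The first (highest-priority) module to offer a collective wins.
            chosen.setdefault(coll, name)
    missing = [coll for coll in REQUIRED if coll not in chosen]
    if missing:
        raise RuntimeError("no module provides: " + ", ".join(missing))
    return chosen
```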
Jeff Squyres
8e52ba423f finalize/disconnect: add explicit comment about why we use an RTE barrier
Based on extensive discussions before/at the June 2014 developer's
meeting, put a lengthy comment explaining a second reason why we
''must'' use an RTE barrier during MPI_FINALIZE and
MPI_COMM_DISCONNECT (i.e., unreliable transports).  Also explain the
original reason why we do this in slightly more detail (BTLs can
lie/buffer a message without actually injecting it on the network).

This commit was SVN r32095.
2014-06-26 14:31:40 +00:00
Adrian Reber
47b118c0ae fix FT compilation
This commit was SVN r32094.
2014-06-26 03:40:07 +00:00
Adrian Reber
cabf1d4e68 use the orte attributes in the FT code to fix compile errors
This commit was SVN r32093.
2014-06-26 03:19:17 +00:00
Adrian Reber
10c1a50705 "handle" removal of opal_db.remove() in the FT code
This commit was SVN r32092.
2014-06-26 03:11:37 +00:00
Dave Goodell
f6bb853409 usnic: properly check src iface in route queries
rtnetlink doesn't check the source address when determining whether to
return route info for a query.  So we need to check that the OIF matches
the OIF of the source interface name.  Without this check, OMPI might
pair a local interface which does not have a route to a particular
remote interface.

Fixes Cisco bug CSCup55797.

Reviewed-by: Jeff Squyres <jsquyres@cisco.com>

cmr=v1.8.2:reviewer=ompi-rm1.8

This commit was SVN r32090.
2014-06-25 22:39:02 +00:00
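
The extra check described in f6bb853409 amounts to comparing the output interface index (OIF) of the returned route with the index of the intended source interface. A minimal sketch, using a plain dictionary in place of a real interface lookup; the function and parameter names are hypothetical:

```python
def route_usable(route_oif, src_ifname, ifindex_by_name):
    """Accept a route only if it leaves through the source interface.

    route_oif: interface index reported for the route.
    ifindex_by_name: mapping of interface name -> index (a stand-in for a
    real lookup such as socket.if_nametoindex).
    """
    src_index = ifindex_by_name.get(src_ifname)
    return src_index is not None and route_oif == src_index
```

For example, with `{"eth4": 6, "eth5": 7}`, a route whose OIF is 7 is rejected for source interface `eth4`, even though the kernel returned route information for the query.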
Ralph Castain
f3cb124e50 Revert r32082 and r32070 - the developer's conference has decided to go a different direction on the threaded progress effort. This will involve some degree of prototyping to understand the tradeoffs prior to making a final design decision, and so we'll hold off on the final change until that is completed.
This commit was SVN r32089.

The following SVN revision numbers were found above:
  r32070 --> open-mpi/ompi@12d92d0c22
  r32082 --> open-mpi/ompi@aa6438ef7a
2014-06-25 20:43:28 +00:00
Adrian Reber
9f73e79d91 also change the callback function prototype (to get the FT code to compile again)
This commit was SVN r32088.
2014-06-25 20:37:02 +00:00
Adrian Reber
4aca7095dc fix a syntax error in the FT code
This commit was SVN r32087.
2014-06-25 20:35:50 +00:00
Adrian Reber
4b25e92194 get the FT code to compile again by adding/removing #includes
This commit was SVN r32086.
2014-06-25 18:42:17 +00:00
Ralph Castain
8fca77c3d3 Protect the binding policy setting so it builds when --without-hwloc
Refs trac:4742

This commit was SVN r32085.

The following Trac tickets were found above:
  Ticket 4742 --> https://svn.open-mpi.org/trac/ompi/ticket/4742
2014-06-25 18:13:54 +00:00
Adrian Reber
72f1c7941f use a consistent naming scheme for the SNAPSHOT attributes
This commit was SVN r32083.
2014-06-25 15:26:24 +00:00
Rolf vandeVaart
aa6438ef7a Remove OMPI_USE_PROGRESS_THREADS that was missed.
This commit was SVN r32082.
2014-06-25 14:42:21 +00:00
Gilles Gouaillardet
53ae38cfb1 Handle error case in mca_spml_yoda_register
if source memory could not be registered, then return NULL
some cleanup might be needed, please refer to the FIXME in the code

cmr=v1.8.2:reviewer=miked

This commit was SVN r32081.
2014-06-25 08:58:45 +00:00
MPI Team
db35021a6c Update git/hg ignore files
This commit was SVN r32080.
2014-06-25 05:00:31 +00:00
Gilles Gouaillardet
fae7adf8ee Remove legacy FCA_IS_LOCAL_PROCESS macro
and use OPAL_PROC_ON_LOCAL_NODE instead

cmr=v1.8.2:reviewer=rhc

This commit was SVN r32079.
2014-06-25 02:37:53 +00:00
Ralph Castain
f70b4a33ec Per the developer conference, let's be a little nicer during MPI_Finalize and ease up on the cpu by inserting usleep into the loop over opal_progress while waiting for the RTE barrier to complete. This is a non-performant area of the code, and while most codes may call finalize at close-to-similar times, there are some that may choose to have one or more procs continue to perform some work prior to finalizing.
So save a little power while we are waiting.

cmr=v1.8.2:reviewer=jladd:subject=save power during finalize

This commit was SVN r32077.
2014-06-24 21:59:50 +00:00
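
The change in f70b4a33ec is, in outline, a progress loop that naps briefly on each iteration instead of spinning. The sketch below uses Python callables in place of the RTE barrier flag and `opal_progress`; the function name and the 100 microsecond default are assumptions, not values taken from the commit.

```python
import time


def wait_for_barrier(barrier_done, progress, nap_seconds=0.0001):
    """Drive progress until the barrier completes, sleeping between polls.

    Returns the number of loop iterations, which is handy for testing.
    """
    iterations = 0
    while not barrier_done():
        progress()               # keep pending communication moving
        time.sleep(nap_seconds)  # yield the CPU instead of busy-waiting
        iterations += 1
    return iterations
```

The trade-off is a small amount of added latency at finalize in exchange for lower CPU use while waiting for slower peers.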
Nathan Hjelm
563eaf0726 Fix support for Cray alps
The alps ras and plm components were broken by recent changes in ORTE. This
commit resolves those issues.

Changes:

 - Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's
   PMI implementation which does not define (for some reason) PMI2_SUCCESS. We
   had previously just used PMI_SUCCESS.

 - Add a missing definition and fix a typo in pml_alps_module.

 - launch_id is no longer available in the orte_node_t structure. Use the
   attribute lookup to get the value.

 - Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use
   opal_list_sort instead (O(nlogn)).

This commit was SVN r32076.
2014-06-24 21:29:04 +00:00
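
Two of the changes listed in 563eaf0726 (reading `launch_id` through an attribute lookup, and ordering nodes with a library sort rather than a hand-written quadratic one) can be sketched together. The dictionary layout is hypothetical and only mimics a node that carries an attribute table.

```python
def order_nodes(nodes):
    """Return nodes ordered by the launch_id stored in their attributes.

    sorted() is O(n log n), in contrast to a hand-rolled O(n^2) insertion.
    """
    return sorted(nodes, key=lambda node: node["attrs"]["launch_id"])


# Hypothetical nodes; launch_id lives in an attribute table, not a field.
nodes = [
    {"name": "nid00012", "attrs": {"launch_id": 12}},
    {"name": "nid00003", "attrs": {"launch_id": 3}},
    {"name": "nid00007", "attrs": {"launch_id": 7}},
]
ordered_names = [node["name"] for node in order_nodes(nodes)]
```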
Jeff Squyres
bce33635a7 sctp: remove from trunk
At the developer meeting today, the question was raised as to whether
the SCTP BTL was maintained any more.  I emailed Alan Wagner to see if
he had any interest/resources to continue to maintain the SCTP BTL.
He indicated that he unfortunately had no resources to maintain it;
it would be fine to remove the SCTP BTL from the trunk.

So long, SCTP BTL... fare thee well...

This commit was SVN r32075.
2014-06-24 21:23:09 +00:00
Jeff Squyres
69fa331cc2 openib/ugni: output verbose message when a BTL is ignored due to THREAD_MULTIPLE
usnic and portals4 already do this.

cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r32074.
2014-06-24 21:13:17 +00:00
Jeff Squyres
d7a2d964f0 usnic: use the correct output stream name
cmr=v1.8.2:reviewer=dgoodell

This commit was SVN r32073.
2014-06-24 18:13:49 +00:00
Ralph Castain
5f6be06b54 Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended.
cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load

This commit was SVN r32072.
2014-06-24 18:11:39 +00:00
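
The effect of 5f6be06b54 is that one user-facing option implies two internal properties. A minimal sketch, with hypothetical key names for the mapping policy:

```python
def apply_oversubscribe(policy, oversubscribe):
    """Return a new policy dict; --oversubscribe implies both properties."""
    updated = dict(policy)
    if oversubscribe:
        updated["oversubscribe"] = True
        updated["overload_allowed"] = True
    return updated
```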
Jeff Squyres
fb9d063be2 Fortran: include the type functions (eq/ne) in libmpi_usempif08
This file has to be pre-emptively compiled to generate the module, but
then it also has to be included in libmpi_usempif08.

cmr=v1.8.2:ticket=trac:4736

This commit was SVN r32071.

The following Trac tickets were found above:
  Ticket 4736 --> https://svn.open-mpi.org/trac/ompi/ticket/4736
2014-06-24 17:48:15 +00:00
Ralph Castain
12d92d0c22 Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS
This commit was SVN r32070.
2014-06-24 17:05:11 +00:00
Ralph Castain
1949f485ac Update platform file
cmr=v1.8.2:reviewer=ompi-gk1.8

This commit was SVN r32069.
2014-06-24 13:53:05 +00:00
Ralph Castain
20535bca19 Reorder the var release so a debugger can still see the var name that caused a segfault, thus helping to identify the var in question
cmr=v1.8.2:reviewer=hjelmn

This commit was SVN r32068.
2014-06-24 13:51:31 +00:00
Gilles Gouaillardet
926e29c972 Fortran: add ompi/mpi/fortran/use-mpi-f08/mpi-f08-sizeof.F90 to the dist tarball.
cmr=v1.8.2:ticket=trac:4736

This commit was SVN r32065.

The following Trac tickets were found above:
  Ticket 4736 --> https://svn.open-mpi.org/trac/ompi/ticket/4736
2014-06-23 04:14:28 +00:00
Gilles Gouaillardet
d1f5d9f675 Fortran: fix OMPI_GENERATE_F77_BINDINGS macro invocation
Some parameters were omitted and compilation failed if
configured with --disable-weak-symbols

cmr=v1.8.2:ticket=trac:4736

This commit was SVN r32064.

The following Trac tickets were found above:
  Ticket 4736 --> https://svn.open-mpi.org/trac/ompi/ticket/4736
2014-06-23 02:10:35 +00:00
Ralph Castain
34e5573988 Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination.
This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose.

Refs trac:4717

This commit was SVN r32063.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-21 17:09:02 +00:00
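
The idea in 34e5573988 (treat a communication failure as a second source of termination notice, and guard against reporting the same process twice) can be sketched with a small tracker. `ProcTracker` and its method names are hypothetical and are not ORTE data structures.

```python
class ProcTracker:
    """Track which child processes have terminated, reporting each once."""

    def __init__(self, procs):
        self.alive = set(procs)
        self.reported = set()     # plays the role of a per-proc "reported" flag
        self.notifications = []   # (proc, reason) pairs actually emitted

    def _terminated(self, proc, reason):
        if proc in self.reported:
            return False          # already reported: avoid a duplicate
        self.reported.add(proc)
        self.alive.discard(proc)
        self.notifications.append((proc, reason))
        return True

    def on_sigchld(self, proc):
        return self._terminated(proc, "sigchld")

    def on_comm_failure(self, proc):
        # A lost connection also counts as evidence that the proc exited.
        return self._terminated(proc, "comm_failure")

    def all_exited(self):
        return not self.alive
```

If the SIGCHLD for a process is missed, the later communication failure still lets `all_exited()` become true; if both arrive, only the first is reported.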
Jeff Squyres
011db6974e usnic: refactor usnic_add_procs() into 2 distinct parts
1: find/create procs, and create associated endpoint for each
2: resolve peer addresses

The 2nd part is done as a separate loop so that the address lookups
can be parallelized.

The overall result is to split usnic_add_procs() into two smaller,
simpler parts.

cmr=v1.8.2:ticket=trac:4734

This commit was SVN r32062.

The following Trac tickets were found above:
  Ticket 4734 --> https://svn.open-mpi.org/trac/ompi/ticket/4734
2014-06-20 20:58:36 +00:00
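
The two-part structure described in 011db6974e can be sketched as one loop that creates an endpoint per peer, followed by a separate pass that resolves addresses. Using a thread pool for the second pass is an assumption made here to show why the split helps; the commit only says the lookups can be parallelized. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def add_procs(peers, resolve, max_workers=4):
    """Create endpoints first, then resolve all peer addresses.

    resolve: callable mapping a peer to its address (may be slow).
    """
    # Part 1: find/create procs and the associated endpoint for each.
    endpoints = {peer: {"peer": peer, "addr": None} for peer in peers}

    # Part 2: resolve addresses in a separate pass so lookups can overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input peers.
        for peer, addr in zip(peers, pool.map(resolve, peers)):
            endpoints[peer]["addr"] = addr
    return endpoints
```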
Jeff Squyres
1ea7bad5a0 usnic: behave better when ibv_create_ah() fails
When ibv_create_ah() fails due to an address resolution failure, it
really only means that we can't reach that one peer -- so we should
just ignore that one peer.  If ibv_create_ah() fails for some other
reason, then give up on the entire usnic_X device.

Change the show_help() message that is displayed when ibv_create_ah()
fails due to address resolution failure; indicate that it's likely a
routing problem.  Also opal_output_verbose() the same info, since
show_help() is de-duplicated (and this particular show_help() message
can be squelched).

Fixes Cisco bugs CSCup35851 and CSCup35872.

cmr=v1.8.2:ticket=trac:4734

This commit was SVN r32061.

The following Trac tickets were found above:
  Ticket 4734 --> https://svn.open-mpi.org/trac/ompi/ticket/4734
2014-06-20 20:53:50 +00:00
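
The policy in 1ea7bad5a0 distinguishes two failure modes: an address resolution failure drops only the affected peer, while any other failure abandons the whole device. The sketch below also mirrors the note about messaging, with a help message shown once and a verbose log line emitted for every affected peer. The exception class and callback names are hypothetical, not the verbs or OPAL API.

```python
class AddressResolutionError(Exception):
    """Hypothetical marker for 'could not resolve this peer's address'."""


def reachable_peers(peers, create_ah, show_help, verbose):
    """Return the peers for which create_ah succeeded.

    Any exception other than AddressResolutionError propagates to the
    caller, which is expected to give up on the device.
    """
    usable = []
    help_shown = False
    for peer in peers:
        try:
            create_ah(peer)
        except AddressResolutionError:
            if not help_shown:
                show_help("cannot reach some peers; likely a routing problem")
                help_shown = True  # de-duplicated: shown only once
            verbose("unreachable peer: %s" % peer)
            continue               # ignore just this one peer
        usable.append(peer)
    return usable
```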
Ralph Castain
9cfc408fd4 Little more debug - getting close to figuring this one out
Refs trac:4717

This commit was SVN r32060.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-20 16:24:06 +00:00