Commit graph

11639 commits

Author SHA1 Message Date
Ralph Castain
72530f8fed Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed:
1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done.

2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear.

This represents the best-available fix for ticket #221 so I am closing that ticket at this time.

This commit was SVN r18536.
2008-05-29 13:38:27 +00:00
Ralph Castain
52fb773c6c Tell slurm to kill the job if an orted abnormally exits
This commit was SVN r18535.
2008-05-29 12:26:58 +00:00
Ralph Castain
b2a566b610 Break an infinite loop in orte_output caused by debugging of ORTE comm subsystems
This commit was SVN r18534.
2008-05-29 12:21:17 +00:00
Ralph Castain
e5e542ddcf Clarify an error message
This commit was SVN r18533.
2008-05-29 12:20:24 +00:00
Jeff Squyres
d5bf8fe005 Remove unused variables.
This commit was SVN r18532.
2008-05-29 11:58:16 +00:00
Jeff Squyres
ed5bc2cd08 Per http://www.open-mpi.org/community/lists/devel/2008/05/4057.php, remove the darwin memory hooks component
This commit was SVN r18531.
2008-05-28 23:50:53 +00:00
Jeff Squyres
e5ea9d08ca Fixes trac:1305: check to see if $sysfsdir/class/infiniband exists and is
non-empty.  If not, then exit the openib btl silently.  This addresses
the case where libibverbs is installed (which is getting more common)
and therefore the openib BTL was built/installed, but the kernel
drivers are not loaded (presumably because there is no RDMA hardware
present).  In this case, "mpirun a.out" will not issue a warning.

There appears to be no good way to definitively tell if there are no
RDMA hardware devices present.  For example, if libibverbs/the openib
BTL is installed, there are no RDMA devices present, but the RDMA
hardware kernel drivers ''are'' loaded, OMPI will warn that it was
unable to find suitable devices.  This warning is easily eliminated by
unloading the kernel drivers.

This commit was SVN r18530.

The following Trac tickets were found above:
  Ticket 1305 --> https://svn.open-mpi.org/trac/ompi/ticket/1305
2008-05-28 22:05:47 +00:00
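The "exists and is non-empty" test described in the commit above can be sketched as follows. This is a minimal illustration using POSIX directory calls, not the code from the commit; the helper name `dir_has_entries` is hypothetical, and the caller is assumed to pass a path such as the `class/infiniband` directory under the sysfs mount point.

```c
#include <dirent.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from the Open MPI source): returns true if
 * `path` is a readable directory containing at least one entry other
 * than "." and "..". Returns false if the directory is missing,
 * unreadable, or empty. */
bool dir_has_entries(const char *path)
{
    DIR *dir = opendir(path);
    if (NULL == dir) {
        return false;               /* missing or not a directory */
    }

    bool found = false;
    struct dirent *entry;
    while (NULL != (entry = readdir(dir))) {
        if (0 != strcmp(entry->d_name, ".") &&
            0 != strcmp(entry->d_name, "..")) {
            found = true;           /* at least one real entry */
            break;
        }
    }
    closedir(dir);
    return found;
}
```

A component could call such a helper early in its initialization and return quietly when it reports false, which matches the "exit silently" behavior the commit message describes.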
Josh Hursey
4ac7016200 Make sure to check "opal_list_get_last" instead of "opal_list_get_end".
The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list.

When oversubscribing by way of an appfile, we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory.

This commit fixes trac:1279

This commit was SVN r18529.

The following Trac tickets were found above:
  Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279
2008-05-28 19:37:20 +00:00
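The difference between the "last" item and the "end" marker described in the commit above can be illustrated with a small sentinel-based list. This is a sketch under assumptions: the types and functions (`item_t`, `list_t`, `list_get_last`, `list_get_end`) are hypothetical stand-ins modeled on the commit message, not the actual opal_list implementation.

```c
#include <stddef.h>

/* Hypothetical circular doubly linked list with a sentinel node. */
typedef struct item {
    struct item *next;
    struct item *prev;
    int value;
} item_t;

typedef struct {
    item_t sentinel;   /* marks the end of the list; carries no user data */
} list_t;

void list_init(list_t *list)
{
    list->sentinel.next = &list->sentinel;
    list->sentinel.prev = &list->sentinel;
    list->sentinel.value = -1;
}

void list_append(list_t *list, item_t *it)
{
    it->prev = list->sentinel.prev;
    it->next = &list->sentinel;
    list->sentinel.prev->next = it;
    list->sentinel.prev = it;
}

/* Last valid item (on an empty list this is the sentinel itself,
 * so callers should check for emptiness first). */
item_t *list_get_last(list_t *list)
{
    return list->sentinel.prev;
}

/* End marker: never a valid item, only useful as a loop bound. */
item_t *list_get_end(list_t *list)
{
    return &list->sentinel;
}
```

In this model, treating the result of `list_get_end` as a user item means reading or writing the sentinel rather than real data, which is the kind of mistake the commit message reports.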
Ralph Castain
f76240e7cc Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algo's, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch.
This commit was SVN r18528.
2008-05-28 18:38:47 +00:00
Pavel Shamis
28c763f751 Fixing the error flow when somebody tries to use XRC without XOOB.
This commit was SVN r18527.
2008-05-28 15:56:04 +00:00
Pavel Shamis
2c81b0ab9a Fixing compilation warning in btl_openib_connect_ibcm.c
This commit was SVN r18526.
2008-05-28 15:20:48 +00:00
Ralph Castain
347752e40b One more corner case (caught by Jeff) occurs when someone requests usage help - e.g., with mpirun --help. In this case, we do a show_help prior to orte_init, but it is okay to do so.
To allow this, we let show_help just operate correctly without any warning about pre-orte_output_init.

This commit was SVN r18525.
2008-05-28 13:58:03 +00:00
Ralph Castain
828ae26d90 ORTE-level MCA params are defined in several places. Ompi_info cannot call orte_init due to an issue with the memory allocator, thus making it impossible for ompi_info to display all of the ORTE-level MCA params.
By consolidating them all into one function, ompi_info can call that function and register the desired variables. This also requires, however, that ompi_info call orte_output_init to avoid generating tons of error messages, so make that adjustment too. 

Fixes ticket #1314

In addition, orte_output has a race condition issue whereby calls to orte_output/verbose can occur prior to either the RML being defined/setup, or the HNP being defined. This latter occurs during the initialization of the orte_process_info structure. In both cases, there is no way orte_output can send the output to the HNP. Hence, the message must be simply output locally.

Fixes ticket #1315

This commit was SVN r18524.
2008-05-28 13:29:58 +00:00
Pavel Shamis
879a9fe45c setup_qps() may exit with error.
This commit was SVN r18523.
2008-05-28 11:36:38 +00:00
Pavel Shamis
e657a03143 Fixing broken XRC initialization flow.
This commit was SVN r18522.
2008-05-28 11:31:38 +00:00
Jeff Squyres
6a82b7bbb4 This file does not need to be executable.
This commit was SVN r18516.
2008-05-27 22:03:05 +00:00
Jeff Squyres
1cba10e11b Per advice from Ralf W. (see bug-libtool list post 4:48pm US Eastern
time, 27 May 2008 -- not web archived as of this commit), do the
following:

 * move libtoolize earlier in the process
 * remove most of acinclude.m4; instead, use "aclocal -I config" at
   the top-level to have it automatically pull in any relevant .m4
   file
 * add patch for ifort shared library support for LT 2.2.4
   (http://lists.gnu.org/archive/html/bug-libtool/2008-05/msg00049.html);
   will likely be unnecessary in future LT versions

This commit was SVN r18515.
2008-05-27 21:58:09 +00:00
Jeff Squyres
15fce83c5b Change some define's to AC_DEFUN's so that "aclocal -I config" will
find all the right .m4 files upon demand.

This commit was SVN r18514.
2008-05-27 21:54:23 +00:00
Ralph Castain
5a2992dea2 After some discussion with Jeff, we have determined that the only time orte_output functions can be accessed prior to calling orte_output_init is if someone calls them prior to calling orte_init. This is clearly an error, so we now report that fact to the caller so it can be fixed.
It is still possible that someone can call an orte_output function during orte_finalize - this is not an error. Prior commits ensured that this is correctly handled. This commit only deals with improper calls prior to calling orte_init.

This commit was SVN r18513.
2008-05-27 20:13:04 +00:00
Ralph Castain
be193ead83 Ensure that any use of orte_output prior to calling orte_output_init gets a properly initiated stream tracking object to avoid later segfaults
This commit was SVN r18512.
2008-05-27 19:07:31 +00:00
Tim Mattox
2446f543c3 Resync the trunk NEWS file with the 1.2.x NEWS file.
This commit was SVN r18511.
2008-05-27 18:28:38 +00:00
George Bosilca
1eb1742225 Remove this left over dependency.
This commit was SVN r18508.
2008-05-27 16:57:40 +00:00
Ralph Castain
93d932aa0c Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function.
This commit was SVN r18507.
2008-05-27 15:46:21 +00:00
Ralph Castain
e190a990ba Do not re-init the orte_output system if we have already finalized it as part of orte_finalize. Instead, default to routing the output to the std opal_output system so that the message still gets out. Of course, such messages cannot be filtered, but they are only for debug purposes by ORTE developers, so this should be a minimal issue.
This commit was SVN r18506.
2008-05-27 15:15:55 +00:00
Ralph Castain
0b2b655de5 Initialize a variable so it can correctly be dealt with at shutdown - fixes trac:1312
This commit was SVN r18505.

The following Trac tickets were found above:
  Ticket 1312 --> https://svn.open-mpi.org/trac/ompi/ticket/1312
2008-05-27 14:53:24 +00:00
Rolf vandeVaart
18879285c7 Fix the selection logic to prevent memory leaks. More work may be done in the priority logic but for now we just fix the leaks and preserve current behavior.
This commit fixes trac:1307.

This commit was SVN r18504.

The following Trac tickets were found above:
  Ticket 1307 --> https://svn.open-mpi.org/trac/ompi/ticket/1307
2008-05-27 14:16:39 +00:00
Pak Lui
695c158192 silence some intel and pgcc compiler warnings.
This commit was SVN r18501.
2008-05-26 20:35:13 +00:00
Sharon Melamed
64fe554b8e Fix bug in carto component select. After the insertion of mca_base_select the carto file component was never selected.
This commit was SVN r18496.
2008-05-26 12:52:41 +00:00
Pavel Shamis
6596d19c90 Adding new ConnectX vendor_part_id. Fix for ticket #1310.
This commit was SVN r18495.
2008-05-26 12:25:49 +00:00
Gleb Natapov
5fabade090 Use payload_buffer_alignment value for payload alignment.
This commit was SVN r18493.
2008-05-26 08:29:02 +00:00
Jeff Squyres
e1f118d0e6 Remove unused variable
This commit was SVN r18491.
2008-05-24 13:05:04 +00:00
Rolf vandeVaart
5baa733ad5 Fix another warning (using a variable before it was initialized.)
Thanks Jeff for pointing this out.

This commit was SVN r18489.
2008-05-23 13:57:55 +00:00
Rolf vandeVaart
0d8faf7559 Fix the fix for ticket #1298. Thanks George for pointing it out.
This commit was SVN r18488.
2008-05-23 13:33:38 +00:00
Jeff Squyres
1b50e5f6a5 Use the right variable in the output
This commit was SVN r18487.
2008-05-23 13:11:12 +00:00
Rich Graham
b08839f9f5 change reduce-scatter/gather for non-power of 2. Spreading out the
load for the non-power of 2 phase of the reduction.

This commit was SVN r18486.
2008-05-22 21:42:42 +00:00
Rich Graham
f2a4b67809 automate the allreduce selection logic.
This commit was SVN r18484.
2008-05-22 20:53:35 +00:00
Jeff Squyres
8faeeab81a Style cleanup only: s/struct foo/foo_t/g to conform to rest of code
base

This commit was SVN r18483.
2008-05-22 19:26:00 +00:00
Jeff Squyres
1f7f0e1f96 Fixes trac:1281
 * s/port/tcp_port/g where relevant to disambiguate TCP port from
   device port
 * Rework ipaddrcheck to make it work in the LMC>0 case

This commit was SVN r18482.

The following Trac tickets were found above:
  Ticket 1281 --> https://svn.open-mpi.org/trac/ompi/ticket/1281
2008-05-22 19:18:15 +00:00
Rolf vandeVaart
8c3b31b181 Need to properly handle zero-length scatters and gathers on intercommunicators. Add a check for the MPI_ROOT and MPI_PROC_NULL processes so they do not enter collective module when count=0.
This commit was SVN r18481.
2008-05-22 19:09:43 +00:00
Rich Graham
5900415a25 for non-powers of 2, distribute the work on the first step among all
the procs doing the work.

This commit was SVN r18480.
2008-05-22 18:50:53 +00:00
Jon Mason
d0e26b1cf6 Add pretty comments for *_iwarp.*
This commit was SVN r18478.
2008-05-22 18:02:20 +00:00
Jeff Squyres
62ac6533e0 * Add proper copyrights
* Ensure _iwarp.h is always included, or you'll get warnings on
   platforms that don't have the RDMACM
 * Add skeleton for function descriptions in comments in iwarp.h

This commit was SVN r18477.
2008-05-22 17:41:43 +00:00
Jeff Squyres
28b56c389a Only check if the opal_ifindex is >= 0 (opal_ifbegin() and
opal_ifnext() return -1 upon completion); don't check it against
opal_ifcount() -- the interface indexes aren't necessarily related to
how many interfaces were found.

This commit was SVN r18476.
2008-05-22 02:10:23 +00:00
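The iteration rule from the commit above (stop when the index goes negative, do not compare it against the interface count) can be sketched with stand-in functions. Everything here is hypothetical: `if_begin`, `if_next`, and `if_count` only mimic the behavior the commit message attributes to opal_ifbegin(), opal_ifnext(), and opal_ifcount(), and the index values 1, 4, 7 are made up to show non-contiguous indexes.

```c
#include <stddef.h>

/* Made-up, non-contiguous interface indexes (an assumption for illustration). */
static const int if_indexes[] = { 1, 4, 7 };
#define IF_COUNT ((int)(sizeof(if_indexes) / sizeof(if_indexes[0])))

int if_count(void) { return IF_COUNT; }

/* First interface index, or -1 if there are none. */
int if_begin(void) { return IF_COUNT > 0 ? if_indexes[0] : -1; }

/* Index following `index`, or -1 when iteration is complete. */
int if_next(int index)
{
    for (int i = 0; i < IF_COUNT; ++i) {
        if (if_indexes[i] == index) {
            return (i + 1 < IF_COUNT) ? if_indexes[i + 1] : -1;
        }
    }
    return -1;
}

/* Correct loop: terminate only on a negative index. */
int visit_all(void)
{
    int visited = 0;
    for (int i = if_begin(); i >= 0; i = if_next(i)) {
        ++visited;
    }
    return visited;
}

/* Flawed loop: also bounds the index by the count, so it stops early. */
int visit_bounded(void)
{
    int visited = 0;
    for (int i = if_begin(); i >= 0 && i < if_count(); i = if_next(i)) {
        ++visited;
    }
    return visited;
}
```

With these made-up indexes, the bounded loop stops as soon as it reaches index 4, because 4 is not less than the count of 3, even though two interfaces remain unvisited.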
Jeff Squyres
27978b29f8 Fixes trac:1302: ensure to also use the LID for identifying an incoming
IBCM request (not just the port number).

This commit was SVN r18475.

The following Trac tickets were found above:
  Ticket 1302 --> https://svn.open-mpi.org/trac/ompi/ticket/1302
2008-05-22 01:28:34 +00:00
George Bosilca
21b940887a Tricky stuff !!! If we post a receive for ZERO bytes and we match it
with something with a different size ... well we segfault. The reason was
that the logic in the PML OB1 calls the convertor based on the length
of the data on the wire and not the length of the data that the receiver
expects.

In other words, this is only half a patch :) It fixes the problem, but we
still have to make sure the unpack is not called at all when the receiver
expects ZERO bytes.

This commit was SVN r18474.
2008-05-21 23:31:34 +00:00
Pak Lui
7b3d7dcac4 This commit closes trac:1300.
This commit was SVN r18473.

The following Trac tickets were found above:
  Ticket 1300 --> https://svn.open-mpi.org/trac/ompi/ticket/1300
2008-05-21 22:35:04 +00:00
George Bosilca
c31cc5b270 Remove a warning about line being unused.
This commit was SVN r18472.
2008-05-21 20:46:22 +00:00
George Bosilca
df2156568d The Elan BTL is now thread safe, and can be built in all conditions.
This commit was SVN r18471.
2008-05-21 20:44:37 +00:00
Pak Lui
1585789e8b Fix the undeclared variable.
This commit was SVN r18470.
2008-05-21 04:09:54 +00:00
Rich Graham
afd71abde6 remove some useless qualifiers.
This commit was SVN r18469.
2008-05-21 01:11:49 +00:00