Commit graph

11605 commits

Author SHA1 Message Date
Ralph Castain
c992e99035 Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface
This commit was SVN r18557.
2008-06-03 14:24:01 +00:00
Ralph Castain
95578b0528 Fix single-node operations so that the HNP correctly exits when the job completes
This commit was SVN r18556.
2008-06-03 14:23:04 +00:00
Ralph Castain
b456fb2d42 Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. The heart rate is set via an MCA param. The HNP checks for a callback every 2*heartrate and declares orted failure if none has been seen in the last 2*heartrate.
Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher.

Neither detection is enabled by default; each is active only if the heartrate and/or launch timeout is set. The exception is SLURM, where orted failure is always detected and reported.

More info to come on devel list.

This commit was SVN r18555.
2008-06-02 21:46:34 +00:00
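The heartbeat rule in the r18555 message above (check every 2*heartrate, declare failure if nothing was seen within the last 2*heartrate) can be sketched roughly as follows. This is an illustrative model only, not the ORTE code: the class name, method names, and the use of plain float timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical stand-in for the HNP side of the heartbeat check."""

    heartrate: float  # seconds between beats (assumed unit)
    last_seen: dict = field(default_factory=dict)  # daemon name -> last beat time

    def record_beat(self, daemon: str, now: float) -> None:
        # Called whenever a beat arrives from a daemon.
        self.last_seen[daemon] = now

    def failed_daemons(self, now: float) -> list:
        # A daemon counts as failed if no beat arrived in the last 2*heartrate.
        window = 2 * self.heartrate
        return sorted(d for d, t in self.last_seen.items() if now - t > window)
```

A caller would run `failed_daemons` once per 2*heartrate interval and treat every returned name as a failed orted.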
Jeff Squyres
69d78c6739 Fixes trac:1215: adds specific show_help messages about PP vs. SRQ/XRC RNR
retry exceeded errors.

This commit was SVN r18554.

The following Trac tickets were found above:
  Ticket 1215 --> https://svn.open-mpi.org/trac/ompi/ticket/1215
2008-06-02 11:03:48 +00:00
Jeff Squyres
8c267d50a3 Fixes trac:1121.
We already show_help when we fail to create queues, so I just made the
message a little more verbose, noting that OMPI may be trying to use
a feature that is not supported on the hardware.

This commit was SVN r18553.

The following Trac tickets were found above:
  Ticket 1121 --> https://svn.open-mpi.org/trac/ompi/ticket/1121
2008-05-30 19:03:58 +00:00
Tim Mattox
203462bc25 Apparently "make -j 4 distcheck" has a race condition when "installing" in
parallel.  Remove the "-j 4" so we don't get random tarball build failures.
Hopefully this won't take all that much longer to make the tarball each night.

This commit was SVN r18552.
2008-05-30 04:39:48 +00:00
George Bosilca
e361bcb64c Send optimizations.
1. The send path gets shorter. The BTL is allowed to return > 0 to specify that the
   descriptor was pushed to the network, and that the memory attached to it is
   available again for the upper layer. The MCA_BTL_DES_SEND_ALWAYS_CALLBACK flag
   can be used by the PML to force the BTL to always trigger the callback.
   Unmodified BTLs will continue to work as expected, as they will return OMPI_SUCCESS,
   which forces the PML to have exactly the same behavior as before. Some BTLs have
   been modified: self, sm, tcp, mx.
2. Add send immediate interface to BTL.
   The idea is to have a mechanism of allowing the BTL to take advantage of
   send optimizations such as the ability to deliver data "inline". Some
   network APIs such as Portals allow data to be sent using a "thin" event
   without packing data into a memory descriptor. This interface change
   allows the BTL to use such capabilities and allows for other optimizations
   in the future. All existing BTLs except for Portals and sm have this interface
   set to NULL.

This commit was SVN r18551.
2008-05-30 03:58:39 +00:00
Galen Shipman
4da4c44210 Receive side changes, basically uses multiple active message callbacks rather
than using a single receive callback followed by a switch on the header.
Also fast pathed the matching for small fragments. 

This commit was SVN r18549.
2008-05-30 01:29:09 +00:00
Rolf vandeVaart
8b01499d6e Fix up the checking for whether the epoll interface is working properly. We now make sure that we can properly use the 3 interfaces to epoll. This was needed as there are compilers that do not recognize the packed attribute. The plan is to also request this change get moved upstream to libevent.
Reviewed by Jeff Squyres
This change fixes trac:1316.

This commit was SVN r18548.

The following Trac tickets were found above:
  Ticket 1316 --> https://svn.open-mpi.org/trac/ompi/ticket/1316
2008-05-30 01:00:19 +00:00
Jeff Squyres
3b568d4b14 Remove an old attempt to understand the tradeoffs with using GNU libc's malloc_hooks functionality, which turned out to be totally unusable in practice. I think we just always forgot to remove them.
This commit was SVN r18547.
2008-05-30 00:11:12 +00:00
Shiqing Fan
af656b2b3d Fix some typing mistakes, make the sources compile again for Windows Visual Studio.
This commit was SVN r18542.
2008-05-29 15:27:43 +00:00
Shiqing Fan
b67a1244b6 Some small fixes.
This commit was SVN r18541.
2008-05-29 15:05:28 +00:00
Jeff Squyres
728ee47be4 Just check for the presence of $sysfsdir/class/infiniband and check
that it's a directory.  That's good enough to know that the
OpenFabrics kernel drivers have been loaded.  If you have no RDMA
devices and don't want to see the OMPI warning about not finding any
devices, then don't start the OpenFabrics kernel drivers.

This commit was SVN r18540.
2008-05-29 14:19:51 +00:00
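The check described in the r18540 message above amounts to a single directory test. A minimal sketch, assuming the sysfs root is passed in as a string and defaults to `/sys`; the function name and the default are hypothetical, and the real check lives in the openib BTL's C code:

```python
import os


def openfabrics_drivers_loaded(sysfsdir: str = "/sys") -> bool:
    """Return True if <sysfsdir>/class/infiniband exists and is a directory.

    Per the commit message, this is taken as sufficient evidence that the
    OpenFabrics kernel drivers have been loaded.
    """
    return os.path.isdir(os.path.join(sysfsdir, "class", "infiniband"))
```

`os.path.isdir` returns False both when the path is missing and when it exists but is not a directory, which covers the two conditions the message names.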
Ralph Castain
2772221ae0 Add the -xml option to mpirun to indicate that xml output is desired
This commit was SVN r18539.
2008-05-29 14:11:31 +00:00
Ralph Castain
2b28bef15a Provide a "nicer" indication that we don't know the pid of the failed orted
This commit was SVN r18538.
2008-05-29 14:10:58 +00:00
Nysal Jan
25ac3629e9 eHCA does not have SRQ. Adding receive_queues value so that it works out of the box
This commit was SVN r18537.
2008-05-29 13:55:39 +00:00
Ralph Castain
72530f8fed Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and makes a best-effort attempt to clean up the mess. However, it still has two unresolved issues that need to be eventually addressed:
1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done.

2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear.

This represents the best-available fix for ticket #221 so I am closing that ticket at this time.

This commit was SVN r18536.
2008-05-29 13:38:27 +00:00
Ralph Castain
52fb773c6c Tell slurm to kill the job if an orted abnormally exits
This commit was SVN r18535.
2008-05-29 12:26:58 +00:00
Ralph Castain
b2a566b610 Break an infinite loop in orte_output caused by debugging of ORTE comm subsystems
This commit was SVN r18534.
2008-05-29 12:21:17 +00:00
Ralph Castain
e5e542ddcf Clarify an error message
This commit was SVN r18533.
2008-05-29 12:20:24 +00:00
Jeff Squyres
d5bf8fe005 Remove unused variables.
This commit was SVN r18532.
2008-05-29 11:58:16 +00:00
Jeff Squyres
ed5bc2cd08 Per http://www.open-mpi.org/community/lists/devel/2008/05/4057.php, remove the darwin memory hooks component
This commit was SVN r18531.
2008-05-28 23:50:53 +00:00
Jeff Squyres
e5ea9d08ca Fixes trac:1305: check to see if $sysfsdir/class/infiniband exists and is
non-empty.  If not, then exit the openib btl silently.  This addresses
the case where libibverbs is installed (which is getting more common)
and therefore the openib BTL was built/installed, but the kernel
drivers are not loaded (presumably because there is no RDMA hardware
present).  In this case, "mpirun a.out" will not issue a warning.

There appears to be no good way to definitively tell if there are no
RDMA hardware devices present.  For example, if libibverbs/the openib
BTL is installed, there are no RDMA devices present, but the RDMA
hardware kernel drivers ''are'' loaded, OMPI will warn that it was
unable to find suitable devices.  This warning is easily eliminated by
unloading the kernel drivers.

This commit was SVN r18530.

The following Trac tickets were found above:
  Ticket 1305 --> https://svn.open-mpi.org/trac/ompi/ticket/1305
2008-05-28 22:05:47 +00:00
Josh Hursey
4ac7016200 Make sure to check "opal_list_get_last" instead of "opal_list_get_end".
The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list.

When oversubscribing by way of an appfile, we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory.

This commit fixes trac:1279

This commit was SVN r18529.

The following Trac tickets were found above:
  Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279
2008-05-28 19:37:20 +00:00
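The distinction drawn in the r18529 message above (the last real item versus an end marker that is not a valid item) can be modelled with a small sentinel-based list. This is a generic sketch of the idea under that assumption, not the actual `opal_list` implementation; the class names and the Python method names are hypothetical.

```python
class Item:
    def __init__(self, value=None):
        self.value = value
        self.prev = self
        self.next = self


class SentinelList:
    """Circular doubly linked list with one sentinel marking the end."""

    def __init__(self):
        self.sentinel = Item()  # placeholder node, carries no payload

    def append(self, value):
        item = Item(value)
        last = self.sentinel.prev
        last.next = item
        item.prev = last
        item.next = self.sentinel
        self.sentinel.prev = item
        return item

    def get_first(self):
        return self.sentinel.next

    def get_last(self):
        # Last real item (or the sentinel itself if the list is empty).
        return self.sentinel.prev

    def get_end(self):
        # End marker: only valid for comparison, never for reading a payload.
        return self.sentinel

    def values(self):
        out = []
        item = self.get_first()
        while item is not self.get_end():
            out.append(item.value)
            item = item.next
        return out
```

In this model the end marker is used only as a loop bound, as in `values`, while `get_last` is what a caller wants when it needs the final element's data.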
Ralph Castain
f76240e7cc Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algorithms, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch.
This commit was SVN r18528.
2008-05-28 18:38:47 +00:00
Pavel Shamis
28c763f751 Fixing the error flow when somebody tries to use XRC without XOOB.
This commit was SVN r18527.
2008-05-28 15:56:04 +00:00
Pavel Shamis
2c81b0ab9a Fixing compilation warning in btl_openib_connect_ibcm.c
This commit was SVN r18526.
2008-05-28 15:20:48 +00:00
Ralph Castain
347752e40b One more corner case (caught by Jeff) occurs when someone requests usage help - e.g., with mpirun --help. In this case, we do a show_help prior to orte_init, but it is okay to do so.
To allow this, we let show_help just operate correctly without any warning about pre-orte_output_init.

This commit was SVN r18525.
2008-05-28 13:58:03 +00:00
Ralph Castain
828ae26d90 ORTE-level MCA params are defined in several places. Ompi_info cannot call orte_init due to an issue with the memory allocator, thus making it impossible for ompi_info to display all of the ORTE-level MCA params.
By consolidating them all into one function, ompi_info can call that function and register the desired variables. This also requires, however, that ompi_info call orte_output_init to avoid generating tons of error messages, so make that adjustment too. 

Fixes ticket #1314

In addition, orte_output has a race condition issue whereby calls to orte_output/verbose can occur prior to either the RML being defined/setup, or the HNP being defined. This latter occurs during the initialization of the orte_process_info structure. In both cases, there is no way orte_output can send the output to the HNP. Hence, the message must be simply output locally.

Fixes ticket #1315

This commit was SVN r18524.
2008-05-28 13:29:58 +00:00
Pavel Shamis
879a9fe45c setup_qps() may exit with error.
This commit was SVN r18523.
2008-05-28 11:36:38 +00:00
Pavel Shamis
e657a03143 Fixing broken XRC initialization flow.
This commit was SVN r18522.
2008-05-28 11:31:38 +00:00
Jeff Squyres
6a82b7bbb4 This file does not need to be executable.
This commit was SVN r18516.
2008-05-27 22:03:05 +00:00
Jeff Squyres
1cba10e11b Per advice from Ralf W. (see bug-libtool list post 4:48pm US Eastern
time, 27 May 2008 -- not web archived as of this commit), do the
following:

 * move libtoolize earlier in the process
 * remove most of acinclude.m4; instead, use "aclocal -I config" at
   the top-level to have it automatically pull in any relevant .m4
   file
 * add patch for ifort shared library support for LT 2.2.4
   (http://lists.gnu.org/archive/html/bug-libtool/2008-05/msg00049.html);
   will likely be unnecessary in future LT versions

This commit was SVN r18515.
2008-05-27 21:58:09 +00:00
Jeff Squyres
15fce83c5b Change some define's to AC_DEFUN's so that "aclocal -I config" will
find all the right .m4 files upon demand.

This commit was SVN r18514.
2008-05-27 21:54:23 +00:00
Ralph Castain
5a2992dea2 After some discussion with Jeff, we have determined that the only time orte_output functions can be accessed prior to calling orte_output_init is if someone calls them prior to calling orte_init. This is clearly an error, so we now report that fact to the caller so it can be fixed.
It is still possible that someone can call an orte_output function during orte_finalize - this is not an error. Prior commits ensured that this is correctly handled. This commit only deals with improper calls prior to calling orte_init.

This commit was SVN r18513.
2008-05-27 20:13:04 +00:00
Ralph Castain
be193ead83 Ensure that any use of orte_output prior to calling orte_output_init gets a properly initialized stream tracking object to avoid later segfaults
This commit was SVN r18512.
2008-05-27 19:07:31 +00:00
Tim Mattox
2446f543c3 Resync the trunk NEWS file with the 1.2.x NEWS file.
This commit was SVN r18511.
2008-05-27 18:28:38 +00:00
George Bosilca
1eb1742225 Remove this left over dependency.
This commit was SVN r18508.
2008-05-27 16:57:40 +00:00
Ralph Castain
93d932aa0c Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function.
This commit was SVN r18507.
2008-05-27 15:46:21 +00:00
Ralph Castain
e190a990ba Do not re-init the orte_output system if we have already finalized it as part of orte_finalize. Instead, default to routing the output to the std opal_output system so that the message still gets out. Of course, such messages cannot be filtered, but they are only for debug purposes by ORTE developers, so this should be a minimal issue.
This commit was SVN r18506.
2008-05-27 15:15:55 +00:00
Ralph Castain
0b2b655de5 Initialize a variable so it can correctly be dealt with at shutdown - fixes trac:1312
This commit was SVN r18505.

The following Trac tickets were found above:
  Ticket 1312 --> https://svn.open-mpi.org/trac/ompi/ticket/1312
2008-05-27 14:53:24 +00:00
Rolf vandeVaart
18879285c7 Fix the selection logic to prevent memory leaks. More work may be done in the priority logic but for now we just fix the leaks and preserve current behavior.
This commit fixes trac:1307.

This commit was SVN r18504.

The following Trac tickets were found above:
  Ticket 1307 --> https://svn.open-mpi.org/trac/ompi/ticket/1307
2008-05-27 14:16:39 +00:00
Pak Lui
695c158192 silence some intel and pgcc compiler warnings.
This commit was SVN r18501.
2008-05-26 20:35:13 +00:00
Sharon Melamed
64fe554b8e Fix bug in carto component select. After the insertion of mca_base_select the carto file component was never selected.
This commit was SVN r18496.
2008-05-26 12:52:41 +00:00
Pavel Shamis
6596d19c90 Adding new ConnectX vendor_part_id. Fix for ticket #1310.
This commit was SVN r18495.
2008-05-26 12:25:49 +00:00
Gleb Natapov
5fabade090 Use payload_buffer_alignment value for payload alignment.
This commit was SVN r18493.
2008-05-26 08:29:02 +00:00
Jeff Squyres
e1f118d0e6 Remove unused variable
This commit was SVN r18491.
2008-05-24 13:05:04 +00:00
Rolf vandeVaart
5baa733ad5 Fix another warning (using a variable before it was initialized.)
Thanks Jeff for pointing this out.

This commit was SVN r18489.
2008-05-23 13:57:55 +00:00
Rolf vandeVaart
0d8faf7559 Fix the fix for ticket #1298. Thanks George for pointing it out.
This commit was SVN r18488.
2008-05-23 13:33:38 +00:00
Jeff Squyres
1b50e5f6a5 Use the right variable in the output
This commit was SVN r18487.
2008-05-23 13:11:12 +00:00