Commit graph

12464 commits

Author SHA1 Message Date
Ralph Castain
9a57db4a81 To support comm_spawn in fully routed environments, daemons need to know the route to all procs in their job family. They already had this information, but were not retaining it. The infrastructure to do so has existed for some time - just never had the time to complete it.
This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message.

This commit was SVN r20022.
2008-11-18 15:35:50 +00:00
Ralph Castain
9ba78f6e5f Ensure exit-no-reply gets relayed to downstream orteds prior to exiting ourselves
This commit was SVN r20021.
2008-11-18 14:54:52 +00:00
Ralph Castain
89559396ea Resolve a race condition when running under a SLURM environment.
The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit.

If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state.

Mucho bad. :-/

This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths).

This commit was SVN r20018.
2008-11-18 13:59:23 +00:00
Ralph Castain
182b15e252 Remove duplicate definition of orte_xml_output - thanks Shiqing for catching it!
This commit was SVN r20017.
2008-11-18 13:53:13 +00:00
Jon Mason
4757970438 This patch consists of two parts. Part one fixes a bug in
determining the IP subnet: the netmask was being used improperly when
determining which subnet each connection is on. Part two adds the ability to
include/exclude specific subnets.

This patch fixes ticket #1665

This commit was SVN r20016.
2008-11-17 20:20:24 +00:00
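
The subnet fix in r20016 depends on how a netmask is applied. The following is a minimal sketch of the general technique, not the code from the patch: two IPv4 addresses are on the same subnet when they are equal after masking with the netmask. The name `same_subnet` is hypothetical, and the sketch assumes addresses and netmask are given in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the Open MPI tree): addresses and netmask
 * are IPv4 values in host byte order. */
static bool same_subnet(uint32_t addr_a, uint32_t addr_b, uint32_t netmask)
{
    /* The mask clears the host bits, so only the network bits are compared. */
    return (addr_a & netmask) == (addr_b & netmask);
}
```

For example, 192.168.1.5 and 192.168.1.254 compare equal under 255.255.255.0, while 192.168.1.5 and 192.168.2.5 do not.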
Terry Dontje
5ce4f6fc1d Updated the compiler and system section for Sun's products.
This commit was SVN r20014.
2008-11-17 18:48:41 +00:00
Shiqing Fan
4d2c118d3b - fix a type cast. The whole libmpi library has to be compiled as CXX on Windows, and MS compiler recognizes this as an error.
This commit was SVN r20012.
2008-11-17 12:18:01 +00:00
Jeff Squyres
a5ed965c78 A few updates from the community.
This commit was SVN r20006.
2008-11-15 17:34:38 +00:00
Josh Hursey
8359b86061 Fix a typo.
This commit was SVN r20005.
2008-11-15 16:03:07 +00:00
Jeff Squyres
bb8fe9a893 Various updates to README, but several questions still remain that
must be answered by others in the community.

This commit was SVN r20004.
2008-11-15 15:27:05 +00:00
Patrick Geoffray
0f331b4c13 Define a "fake" mpool to provide a memory release callback for the
memory hooks (munmap) and initialize the mallopt component, and 
nothing else.
Use this mpool in the MX common initialization, supporting both BTL 
and MTL. Automatically set the MX_RCACHE environment variable to 
enable registration cache in MX.

Tested with success for munmap() and large free().

This commit was SVN r20003.
2008-11-15 04:17:58 +00:00
Jeff Squyres
d7f3dd2230 Add a comment explaining exactly what is returned by this function
because we wasted a good amount of time today assuming that it was
returning the actual netmask.  Specifically, we were confused why it
returned 0x18 instead of 0xffffff00 for a class C subnet (the
head-smacking moment wasn't until [much] later when we converted 0x18
to decimal, which is 24.  Then the Clue Light(tm) went on).

This commit was SVN r20002.
2008-11-14 22:59:41 +00:00
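
The confusion described in r20002 is between a prefix length (0x18, i.e. 24) and the netmask it stands for (0xffffff00). A minimal sketch of the conversion follows; `prefix_to_netmask` is a hypothetical name, not the function the commit documents, and the result is assumed to be in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a prefix length (0..32) to an IPv4 netmask
 * in host byte order. */
static uint32_t prefix_to_netmask(unsigned prefix_len)
{
    assert(prefix_len <= 32);

    /* Shifting a 32-bit value by 32 is undefined in C, so a prefix length
     * of 0 is handled separately. */
    if (prefix_len == 0) {
        return 0;
    }
    return (uint32_t)(UINT32_MAX << (32 - prefix_len));
}
```

With this sketch, `prefix_to_netmask(0x18)` yields 0xffffff00, the class C netmask mentioned in the commit message.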
Ralph Castain
68423f7544 Partially restore the iof changes - this repairs the initial observation of inconsistent and incomplete output
This commit was SVN r19999.
2008-11-14 20:36:18 +00:00
Ralph Castain
586334d1c8 Per discussion with Tim Mattox, reset the trunk to pre-19991 level for the iof only. I will shortly add a changeset that will repair the one known error where we were incorrectly closing the stdout/err/diag file descriptors when all we wanted to do was close stdin. I will leave out the changes associated with coordinating proc termination due to race conditions IU encountered during MTT testing. I have been unable to replicate those so far, but we hope to resolve it in the near future.
This commit was SVN r19998.
2008-11-14 20:22:36 +00:00
Ralph Castain
891630ae85 Handle a race condition between mpirun detecting stdin closed (and releasing the read event), and receiving an xon/xoff notice from a remote orted that detects proc termination and tells mpirun "don't send any more input - the proc is gone". This latter was necessary since we might have hung an infinite source of input on mpirun, while the proc terminated after some point in time.
This commit was SVN r19997.
2008-11-14 15:19:53 +00:00
Ralph Castain
101b6fdeb8 Clean up how we handle the stdin write when we encounter end-of-input. Ensure that mpirun handles it correctly if the proc receiving stdin is local to mpirun.
This commit was SVN r19996.
2008-11-14 14:31:33 +00:00
Nysal Jan
e4bdaac6d8 Fixed the case where a device does not support inline data. Redefined the interpretation of max_inline_data MCA parameter.
* If max_inline_data == -1 perform runtime detection 
* If max_inline_data >=0 use the value provided 
* If the user does not explicitly set this via command line, use the value from INI file

This commit fixes trac:1662

This commit was SVN r19995.

The following Trac tickets were found above:
  Ticket 1662 --> https://svn.open-mpi.org/trac/ompi/ticket/1662
2008-11-14 12:15:35 +00:00
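
The three rules for max_inline_data in r19995 can be sketched as a small resolution function. All names here are hypothetical and none are taken from the openib BTL. The sketch also assumes that an INI value of -1 triggers runtime detection, which the commit message does not state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the selection rules; not the actual BTL code.
 *   user_set       - whether the user set the parameter explicitly
 *   user_value     - the user's value (only meaningful if user_set)
 *   ini_value      - the value from the INI file
 *   detected_value - the result of runtime detection (0 if the device
 *                    does not support inline data) */
static int resolve_max_inline_data(bool user_set, int user_value,
                                   int ini_value, int detected_value)
{
    /* Not set explicitly by the user: fall back to the INI file value. */
    int value = user_set ? user_value : ini_value;

    assert(value >= -1);

    /* -1 requests runtime detection (assumed to apply to INI values too). */
    if (value == -1) {
        return detected_value;
    }

    /* Any value >= 0 is used as provided. */
    return value;
}
```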
Ralph Castain
875741a5e3 Don't set the stdin fd to -1 before calling the object destructor as that function calls event delete, which uses the fd as an index into the event array.
This commit was SVN r19994.
2008-11-13 19:34:29 +00:00
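
The ordering problem in r19994 can be shown with a toy model. This is a sketch under stated assumptions, not the ORTE or libevent code: the event table, `toy_event_add`, `toy_event_del`, and `toy_sink_destruct` are all hypothetical stand-ins for an event array indexed by file descriptor.

```c
#include <assert.h>

#define MAX_FDS 16

/* Toy stand-in for an event library's table, indexed by file descriptor. */
static int event_registered[MAX_FDS];

static int toy_event_add(int fd)
{
    if (fd < 0 || fd >= MAX_FDS) {
        return -1;
    }
    event_registered[fd] = 1;
    return 0;
}

static int toy_event_del(int fd)
{
    /* An fd already reset to -1 cannot be used as an index: the event
     * would never be removed (or, without this check, memory before the
     * array would be touched). */
    if (fd < 0 || fd >= MAX_FDS) {
        return -1;
    }
    event_registered[fd] = 0;
    return 0;
}

typedef struct {
    int fd;
} toy_sink_t;

/* Destructor: delete the event while the real fd is still stored, and
 * invalidate the fd only afterwards. */
static int toy_sink_destruct(toy_sink_t *sink)
{
    int rc = toy_event_del(sink->fd);
    sink->fd = -1;
    return rc;
}
```

If the caller sets `sink->fd = -1` before calling the destructor, the delete step fails and the event stays registered, which is the kind of mistake the commit avoids.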
Ralph Castain
b8ae4604ed Correct the notifier default module to include the newly added API
This commit was SVN r19993.
2008-11-13 18:03:41 +00:00
Ralph Castain
702fc7154c Remove stale function definition
This commit was SVN r19992.
2008-11-13 05:07:11 +00:00
Ralph Castain
555bbf0c02 Fix the iof race conditions wrt proc termination. This comprises two parts:
1. modify the iof to track when a proc actually closes all of its open iof output pipes. When this occurs, notify the odls that the proc's iof is complete. This is done via a zero-time event so that we can step out of the read event before processing the notification.

2. in the odls, modify the waitpid callback so it only flags that it was called. Add a function to receive the iof-complete notification, and a function that checks for both iof complete and waitpid callback before declaring a proc fully terminated. This ensures that we read and deliver -all- of the IO prior to declaring the job complete.

Also modified the odls call to orte_iof.close (and the component's implementation) so it only closes stdin, leaving the other io channels alone. This fixes the other half of the known problem.

This should fix the ticket on this subject, but I'll wait to close it pending further testing in the trunk.

This commit was SVN r19991.
2008-11-12 23:32:01 +00:00
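
The termination rule in r19991 (a proc is fully terminated only after both the waitpid callback and the iof-complete notification have arrived, in either order) can be sketched as follows. The type and function names are hypothetical and do not come from the odls code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-proc state: one flag per notification. */
typedef struct {
    bool waitpid_fired;
    bool iof_complete;
} toy_proc_t;

/* A proc counts as fully terminated only when both events have been seen. */
static bool toy_proc_terminated(const toy_proc_t *proc)
{
    return proc->waitpid_fired && proc->iof_complete;
}

/* waitpid callback: only record that it was called, then check. */
static bool toy_on_waitpid(toy_proc_t *proc)
{
    proc->waitpid_fired = true;
    return toy_proc_terminated(proc);
}

/* iof-complete notification: record it, then run the same check. */
static bool toy_on_iof_complete(toy_proc_t *proc)
{
    proc->iof_complete = true;
    return toy_proc_terminated(proc);
}
```

Whichever notification arrives second is the one that reports termination, so no output still in the pipes is dropped when the process exits first.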
Ralph Castain
26cd1c1955 Fix a typo and some formatting
This commit was SVN r19990.
2008-11-12 22:01:40 +00:00
Josh Hursey
bf96a8dea0 Fixes a bug that may occur with really long environment variables on job restart.
This happens with really long paths as part of the variable name.

Found in MTT testing (where the paths are long). This will need to be moved to v1.3

This commit was SVN r19989.
2008-11-12 21:43:34 +00:00
Rolf vandeVaart
76f8ce01cf Need to add sppp to list of default excluded interfaces
to support Sun M9000 server.

This commit was SVN r19988.
2008-11-12 20:30:14 +00:00
Jeff Squyres
120e09b9cd * Consolidate the list of copyrights a bit
* Print warnings for some common copyright problems

This commit was SVN r19987.
2008-11-12 18:15:09 +00:00
Ralph Castain
ce26e3a2fb Update the notifier framework in prep for move to v1.3. Add an API to handle the case where error messages have been expressed via "show_help" so they can look similar to what was presented to users. Add three key calls in the openib btl to drop messages into syslog.
This will sit in trunk for a few days - would like to actually see some errors reported to syslog before moving the code to 1.3

This commit was SVN r19986.
2008-11-12 18:03:51 +00:00
Jeff Squyres
a48b2d45be Fix wonky copyright year.
This commit was SVN r19985.
2008-11-12 17:51:54 +00:00
Jeff Squyres
bb0b5b04bd Remove duplicate copyright notice (found by script).
This commit was SVN r19984.
2008-11-12 17:42:40 +00:00
Jeff Squyres
9c07842148 Script to help find copyright notices in the tree.
This commit was SVN r19983.
2008-11-12 17:36:10 +00:00
Jeff Squyres
3419d93368 Refs trac:1399: update copyrights in LICENSE file after checking all the
files in the tree with a script.

This commit was SVN r19982.

The following Trac tickets were found above:
  Ticket 1399 --> https://svn.open-mpi.org/trac/ompi/ticket/1399
2008-11-12 17:35:40 +00:00
Kenneth Matney
07f7f00c91 This disables sendi, since it may do 0-byte requests and it still has
another bug.  This also causes 0-byte requests to be treated as a buffer
error, causing the base request to be requeued.  On Cray XT, it may be
temporarily impossible to make allocations for buffer requests, as the
default stack size is small (8 MB) and there is no true swap device.
Even with the stack size increased, there will be cases in which this
condition recurs.

One possibility is to make the buffer allocations off of the heap; but,
this does not change the fact that eventually an out-of-memory condition
will occur and we need to support multiple receives in transit, a
condition for which the available buffer space may change.  On the other
hand, if we switch to allocating the buffer space from the heap, we will
need to return an error when the allocation fails and there are no other
buffers in transit.

This commit was SVN r19981.
2008-11-12 16:04:14 +00:00
George Bosilca
e84af7920e Move __counter outside the #ifdef section. Cleanup the usage of __counter.
This commit was SVN r19979.
2008-11-11 16:46:11 +00:00
Jeff Squyres
ccab62d5e6 Refs trac:1399: updates to the INSTALL file.
This commit was SVN r19978.

The following Trac tickets were found above:
  Ticket 1399 --> https://svn.open-mpi.org/trac/ompi/ticket/1399
2008-11-11 15:52:21 +00:00
Jeff Squyres
69821184ee Refs trac:1399. Minor updates to HACKING.
This commit was SVN r19977.

The following Trac tickets were found above:
  Ticket 1399 --> https://svn.open-mpi.org/trac/ompi/ticket/1399
2008-11-11 15:12:46 +00:00
George Bosilca
6344b8dffe Force an explicit cast to keep the compilers quiet.
This commit was SVN r19975.
2008-11-11 14:58:53 +00:00
George Bosilca
aac4724c9d Add a high accuracy timer for MIPS.
This commit was SVN r19974.
2008-11-11 14:57:39 +00:00
George Bosilca
584154c2d3 Remove the group header file dependency.
This commit was SVN r19965.
2008-11-10 19:37:52 +00:00
Josh Hursey
d5c38c2601 fix some typos. should be moved to v1.3
This commit was SVN r19964.
2008-11-10 19:05:26 +00:00
Josh Hursey
080e581422 This commit removes some duplicate finalize code between the component's finalize, and the version that C/R needed in the ft_event function. From my testing everything looks fine, but should probably soak overnight just to be sure. It will need to be moved to v1.3
Thanks to Jeff, Pasha, and Tim M. for bringing this to my attention.

This commit was SVN r19963.
2008-11-10 18:35:57 +00:00
Josh Hursey
460e84f174 A fix for the intel "MPI_Send_init_ator_c" test.
It highlighted a bug in the bookmark component where for persistent sends we were not copying the context, but just moving it. This caused us to lose track of the message if it is started/completed multiple times.

This will need to be brought over to the v1.3 branch, but it should soak overnight to get a round of testing first.

This commit was SVN r19962.
2008-11-10 16:55:58 +00:00
Josh Hursey
077b3df7cc Fix C/R restart case by passing the correct address to the orte_ess_base_build_nidmap() function. This cropped up from r19866.
It does not look like this affects the v1.3 branch since r19866 has not moved to the release branch.

Thanks to Leonardo Fialho for reporting this and supplying a patch.

This commit was SVN r19961.

The following SVN revision numbers were found above:
  r19866 --> open-mpi/ompi@f54fda489e
2008-11-10 15:19:28 +00:00
Pavel Shamis
29cc6de40b OOB, XOOB, RDMACM and IBCM do not support qp creation and connection for self communication. So we must use self.
This commit was SVN r19960.
2008-11-10 11:24:57 +00:00
Ralph Castain
5889dcd30b Fix a warning reported by Jeff that actually could cause singleton operations to fail. Ensure that the byte object used to init the job map for singletons is properly initialized.
This commit was SVN r19957.
2008-11-08 01:09:06 +00:00
Jeff Squyres
f4ba25cf3c Remove linking components against ORTE and OPAL libs. This was
removed from all other components long ago; I'm not sure how these
survived.

This commit was SVN r19956.
2008-11-08 00:56:57 +00:00
Jeff Squyres
ecd0b12576 Add note about MPI_REAL16 support.
This commit was SVN r19955.
2008-11-08 00:54:15 +00:00
Jeff Squyres
9712e41a29 Fix svn:ignore
This commit was SVN r19954.
2008-11-07 22:59:20 +00:00
Jeff Squyres
7b32402959 Fixes from Brian for OS X 10.4.
This commit was SVN r19953.
2008-11-07 22:13:43 +00:00
Jeff Squyres
4f028171a2 Refs trac:1603:
* Add OMPI_F77_CHECK_REAL16_C_EQUV test whether REAL*16 is bit
   equivalent to long double.  AC_DEFINE OMPI_REAL16_MATCHES_C with
   result (0 or 1).
 * Update ompi_info to only show real16 support if
   OMPI_REAL16_MATCHES_C is 1.
 * Update DDT to only support REAL16 and COMPLEX32 if
   1==OMPI_REAL16_MATCHES_C.
 * MPI Op function pointer tables will have NULL for the REAL16 and
   COMPLEX32 entries if 0==OMPI_REAL16_MATCHES_C.
 * Slightly cleaned up OMPI_F77_GET_ALIGNMENT and OMPI_F77_CHECK m4
   tests (use OMPI_VAR_SCOPE_PUSH/POP).

This commit was SVN r19948.

The following Trac tickets were found above:
  Ticket 1603 --> https://svn.open-mpi.org/trac/ompi/ticket/1603
2008-11-07 20:37:21 +00:00
Matthias Jurenz
aafa318248 Fixed faulty length-parameter in snprintf call
This commit was SVN r19947.
2008-11-07 17:15:07 +00:00
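
The commit message for r19947 does not say what the faulty length was. As a general illustration only, the length passed to snprintf should be the size of the destination buffer, and the return value shows whether the output was truncated. `format_label` is a hypothetical helper, not code from the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write "name.rank" into buf, which holds buf_len
 * bytes. Returns 0 on success, -1 on error or truncation. */
static int format_label(char *buf, size_t buf_len, const char *name, int rank)
{
    /* Pass the size of the destination buffer, not the length of the
     * source string or any other quantity. */
    int n = snprintf(buf, buf_len, "%s.%d", name, rank);

    /* snprintf returns the length the full output would have had, so a
     * value >= buf_len means the output was cut short. */
    return (n >= 0 && (size_t)n < buf_len) ? 0 : -1;
}
```

A typical call is `format_label(buf, sizeof(buf), name, rank)` with `buf` declared as an array in the caller.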
George Bosilca
03434f8f10 Some compilers complain about casting a pointer to an integer type with a different
size. The correct way is to cast to an integer type that has the same length, and
then allow the compiler to upgrade to the read type.

This commit was SVN r19944.
2008-11-07 16:27:05 +00:00
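
The two-step cast described in r19944 can be sketched as below. This is an illustration of the idea, not the changed code: `ptr_to_u64` is a hypothetical name, and `uintptr_t` is used as the pointer-sized integer type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store a pointer value in a 64-bit integer. */
static uint64_t ptr_to_u64(const void *ptr)
{
    /* First cast to uintptr_t, an integer type wide enough to hold a
     * pointer; the outer cast then converts to the wider (or equal)
     * target type without a size-mismatch warning. */
    return (uint64_t)(uintptr_t)ptr;
}
```

A direct cast such as `(uint64_t)ptr` is what triggers the warning on platforms where pointers are narrower than 64 bits.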