1
1
Граф коммитов

867 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
19767802de Let the errmgr know how to deal with incomplete starts
This commit was SVN r14495.
2007-04-24 19:04:29 +00:00
Ralph Castain
ef71055cf8 Teach the odls to properly test for and report failed-to-start for application processes.
Test for system limits (where known) prior to doing things like fork and pipe since some systems aren't very nice about it when we try to exceed such limits.

This commit was SVN r14494.
2007-04-24 18:54:45 +00:00
Ralph Castain
f5ef3d795e Tell the smr how to handle failed-to-start
This commit was SVN r14488.
2007-04-24 16:23:26 +00:00
Jeff Squyres
0674bbd001 Fix segv when the shell is not recognized. Thanks to Mostyn Lewis for
noticing the problem.

This commit was SVN r14483.
2007-04-24 12:00:54 +00:00
Ralph Castain
2d04298002 Update the orted cmd xmit functions to match orted recv's. This fixes trac:1004.
This commit was SVN r14482.

The following Trac tickets were found above:
  Ticket 1004 --> https://svn.open-mpi.org/trac/ompi/ticket/1004
2007-04-24 01:58:40 +00:00
Josh Hursey
260e7612ad Fix a few interface changes introduced by r14475
This commit was SVN r14479.

The following SVN revision numbers were found above:
  r14475 --> open-mpi/ompi@18b2dca51c
2007-04-23 20:18:27 +00:00
Ralph Castain
5f94d6d791 Fix the cnos rml to match revised xcast API
This commit was SVN r14478.
2007-04-23 19:07:44 +00:00
Ralph Castain
18b2dca51c Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).

This also involved a slight change to the oob.xcast API, so propagated that as required.

Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)

Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.

This commit was SVN r14475.
2007-04-23 18:41:04 +00:00
Ralph Castain
b260f8ee36 Enable the job_family API
This commit was SVN r14473.
2007-04-23 18:26:33 +00:00
Ralph Castain
3d4f1b86d2 Modify the name service to provide necessary support for failed-to-start scenarios. Add a new API to get_vpid_range - this should be used in place of the rmgr API of that name to avoid race conditions (will remove that API in later commit).
This commit was SVN r14465.
2007-04-23 12:48:19 +00:00
Jeff Squyres
0ba47105ed Merge the /tmp/jms-installdirs-trunk branch into the trunk. This
finally brings in functionality that is already on the 1.2 branch, and
was developed and tested in the v1.2ofed branch (and other places).

Short version of new features:

 * Support for ibv_fork_init() 
 * Automatically fill in the openib BTL bandwidth value by 
   querying the HCA port 
 * Installdirs functionality 
 * Fixes to always use -I in the Fortran wrapper compilers (#924) 
 * Gleb's mpool updates 
 * Remove some kruft in btl/openib/configure.m4, therefore 
   fixing the harmless warnings noted in #665 
 * Bunches of updates to the Linux RPM spec file 

I.e., effectively the same thing that r14411 brought to the v1.2
branch.

Also effectively brought in r14432 and r14433 (some fixes on top of
the original r14411 commit to v1.2).  Still need to bring in the moral
equivalent of r14445 after this commit (fixes to installdirs).

This commit was SVN r14449.

The following SVN revision numbers were found above:
  r14411 --> open-mpi/ompi@83b31314ae
  r14432 --> open-mpi/ompi@a48f160595
  r14433 --> open-mpi/ompi@68f346d2bc
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:15:05 +00:00
Josh Hursey
b9da59ebc3 Fix the way we determine which sequence number to restart with.
Create a sentinel value in the metadata file to clearly indicate
that the sequence number is complete (versus in progress). This
way we do not try to restart from an invalid sequence number
which can lead to badness.

This commit was SVN r14423.
2007-04-19 15:04:27 +00:00
Sven Stork
037b01ce9e - more symbols that need to be exported
This commit was SVN r14415.
2007-04-18 14:53:56 +00:00
Jeff Squyres
1e364218a2 Remove unused variable
This commit was SVN r14413.
2007-04-18 13:10:10 +00:00
Josh Hursey
6ee0c641fd Cleanup the output from orte-checkpoint so it is a bit more clear and references
the sequence number.

Before:
[...] Finished - Global Snapshot Reference: ompi_global_snapshot_1234.ckpt

After:
Snashot Ref.:   1 ompi_global_snapshot_1234.ckpt

This commit was SVN r14381.
2007-04-15 14:28:56 +00:00
George Bosilca
9e840fbe14 The missing orted bug is now fixed. orterun will not deadlock when
the program it try to spawn is missing.

Description of the problem: When the rsh pls try to spawn a local
process which is missing (such as a removed orted) the orterun
deadlock.

Description of the fix: The forked child deal with finding the
program to be executed. If it fails to find it, then instead of
calling exit (as a normal forked program is expected to do) it 
continue the execution using a execution path it was never
expected to use (back in orterun and then main). Bad things 
happens as expected. Forcing the child to use exit when it fails
to find the orted (and forcing the child to use exit everywhere
instead of return) correct the logic of the rsh pls and make it
behave as expected.

This commit was SVN r14377.
2007-04-14 17:36:27 +00:00
Ralph Castain
adb44c44b1 Revert prior commits from last night that involved significant change to the GPR, along with cosmetic changes to the odls_default module pending review and test.
Reverts r14328, r14329, r14331, r14333, r14335, r14338, and r14336.

This commit was SVN r14351.

The following SVN revision numbers were found above:
  r14328 --> open-mpi/ompi@d1ce4a44ca
  r14329 --> open-mpi/ompi@604e79f2d2
  r14331 --> open-mpi/ompi@b2b3417475
  r14333 --> open-mpi/ompi@8882f355b4
  r14335 --> open-mpi/ompi@10dfd534f6
  r14336 --> open-mpi/ompi@5c65c55e59
  r14338 --> open-mpi/ompi@579184cd72
2007-04-12 13:13:28 +00:00
Jeff Squyres
51f286d737 Just like r14289 on the ORTE trunk:
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search patch is checking
$pkglibdir).

This commit was SVN r14345.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r14289
2007-04-12 11:19:42 +00:00
George Bosilca
1c037df7e7 Only print information if the condition is met.
This commit was SVN r14340.
2007-04-12 07:28:18 +00:00
George Bosilca
579184cd72 Rollback commit r14335 it get into the trunk too early.
This commit was SVN r14338.

The following SVN revision numbers were found above:
  r14335 --> open-mpi/ompi@10dfd534f6
2007-04-12 06:21:59 +00:00
George Bosilca
5c65c55e59 Few cleanups. The most important is getting rid of the orte_bitmap_t class
which is not used anymore in the orte code.

This commit was SVN r14336.
2007-04-12 05:50:33 +00:00
George Bosilca
10dfd534f6 Correctly remove the itag if we fail the condition. And be pedantic with the code.
This commit was SVN r14335.
2007-04-12 05:33:31 +00:00
George Bosilca
b882b7e1b3 Update the Windows ODLS.
This commit was SVN r14334.
2007-04-12 05:19:25 +00:00
George Bosilca
8882f355b4 Move these functions at their right place.
This commit was SVN r14333.
2007-04-12 05:18:23 +00:00
George Bosilca
9de6ae0753 ORTE_MODULE_DECLSPEC is not required here.
This commit was SVN r14332.
2007-04-12 05:17:03 +00:00
George Bosilca
b2b3417475 A more optimized version of the orte_gpr_replica_check_itag_list function. Strictly
follow the same behavior as before, the changes just make sure the check is done
in linear time and the memory usage is kept to a minimum.

This commit was SVN r14331.
2007-04-12 05:13:10 +00:00
George Bosilca
604e79f2d2 There is a cleanup label, so I expect to use it in all cases.
This commit was SVN r14329.
2007-04-12 05:05:36 +00:00
George Bosilca
d1ce4a44ca Fix small memory leak (only happens in debug mode).
This commit was SVN r14328.
2007-04-12 05:02:57 +00:00
George Bosilca
cad93a7693 Add more output. Fix some typos, and some small cleanups.
This commit was SVN r14327.
2007-04-12 05:01:29 +00:00
George Bosilca
0d82473b9d Enable the null IOF.
This commit was SVN r14326.
2007-04-12 05:00:05 +00:00
George Bosilca
e7c4f1ca64 Remove some unused code and correct the finalize function (cancel the pending
receive request).

This commit was SVN r14324.
2007-04-12 04:58:12 +00:00
George Bosilca
4a87c782c3 Release all unselected components. This is a little bit more tricky than usual,
as the IOF components lack the required finalize function. Instead rely on the
module finalize. Read the comment or more informations.

This commit was SVN r14323.
2007-04-12 04:57:08 +00:00
George Bosilca
c15cd5e4ab Unload all non necessary PLS. Once the selection process is done, we should release all
unselected PLS. This decrease the footprint of all Open MPI based processes.

This commit was SVN r14322.
2007-04-12 04:55:23 +00:00
Tim Prins
6872f21af0 remove unused variable
This commit was SVN r14306.
2007-04-11 17:15:14 +00:00
Pak Lui
e9e8dc2765 * comment out unused code
This commit was SVN r14297.
2007-04-10 22:38:34 +00:00
Josh Hursey
cd5047a9bf Refs trac:976
Collect the base 'orted' command line into a base function since most of the
PLS components were duplicating this code. Add AMCA parameter command line
component to the base set.

Add Aggregate MCA parameter support to the following PLS components:
 - gridengine
 - process
 - slurm
 - poe
 - tm

Improve support for 'rsh' component.

Did/could not support the following components:
 - bproc
 - proxy
 - xcpu
 - cnos
 - xgrid

The above components had peculiar needs that made it non-trivial to add an 
option. The authors of these components need to help in supporting this
new option.

I was only able to test the SLURM and RSH components due to system availability.
The others should work without problem.

This commit was SVN r14284.

The following Trac tickets were found above:
  Ticket 976 --> https://svn.open-mpi.org/trac/ompi/ticket/976
2007-04-10 14:23:32 +00:00
Tim Prins
2ffc02870d Reduce the memory usage of the GPR:
- Make it so that all the GPR pointer arrays are allocated initially at 16 elements instead of 512. This saves (on a 64 bit machine) approximately 4*(# procs + # nodes) KB.
- Fix up the segment prealloc function so that preallocating an existant segment is not an error, and make the areas where we do large inserts use it.

Fix the orte_pointer_array to efficiently implement setting its size. Before we just realloced the array one block at a time until the desired size was reached. Now we resize it all in one realloc.

This commit was SVN r14264.
2007-04-09 00:40:15 +00:00
Brian Barrett
13a4bba13f Yet another dumb thing that shouldn't have been in r14261.
This commit was SVN r14263.

The following SVN revision numbers were found above:
  r14261 --> open-mpi/ompi@8a55c84d0b
2007-04-07 23:23:23 +00:00
Brian Barrett
32f0090f81 fix dumb variable scope mistake
This commit was SVN r14262.
2007-04-07 23:00:57 +00:00
Brian Barrett
8a55c84d0b Fix a number of OOB issues:
* Remove the connect() timeout code, as it had some nasty race conditions
    when connections were established as the trigger was firing.  A better
    solution has been found for the cluster where this was needed, so just
    removing it was easiest.
  * When a fatal error (too many connection failures) occurs, set an error
    on messages in the queue even if there isn't an active message.  The
    first message to any peer will be queued without being active (and
    so will all subsequent messages until the connection is established),
    and the orteds will hang until that first message completes.  So if
    an orted can never contact it's peer, it will never exit and just sit
    waiting for that message to complete.
  * Cover an interesting RST condition in the connect code.  A connection
    can complete the three-way handshake, the connector can even send
    some data, but the server side will drop the connection because it
    can't move it from the half-connected to fully-connected state because
    of space shortage in the listen backlog queue.  This causes a RST to
    be received first time that recv() is called, which will be when waiting
    for the remote side of the OOB ack.  In this case, transition the
    connection back into a CLOSED state and try to connect again.
  * Add levels of debugging, rather than all or nothing, each building on
    the previous level.  0 (default) is hard errors.  1 is connection 
    error debugging info.  2 is all connection info.  3 is more state
    info.  4 includes all message info.
  * Add some hopefully useful comments

This commit was SVN r14261.
2007-04-07 22:33:30 +00:00
Tim Prins
df4c468bb4 fix some more minor memory leaks
This commit was SVN r14260.
2007-04-07 18:41:16 +00:00
Tim Prins
8e7765e456 Fix a gigantic memory leak. We were copying a message to send into a buffer, then never freeing the copy we made. But we were mistakenly allocating the buffer on the stack, so the memory checking tools never caught the leak. On 96 nodes, 384 processes, mpirun memory usage went from about 12M to 3M for me after this minor change...
This commit was SVN r14257.
2007-04-07 02:25:48 +00:00
George Bosilca
33bf6c6e54 Move the comment at the right place.
This commit was SVN r14237.
2007-04-05 20:36:33 +00:00
George Bosilca
5c355d0bea Always return an initialized variable. More output if we fail to read
from the shell detection child. Don't spawn orted, instead spawn what's
inside the mca_pls_rsh_component.orted.

This commit was SVN r14236.
2007-04-05 20:17:10 +00:00
George Bosilca
8fb8363868 Correctly detect the remote shell, and the local one. Big clean-up on how we
deal with the PLS RSH. Remove support for unknown user (i.e. if the user is
not known by the system, then it shouldn't be allowed to spawn anything).

This commit was SVN r14232.
2007-04-05 19:22:26 +00:00
Ralph Castain
d5b5cd2d3c Add test code for multiple comm_spawn calls.
Add ERROR_LOG calls to more clearly document failures in the rsh launcher.

This commit was SVN r14214.
2007-04-04 13:24:39 +00:00
Jeff Squyres
fe58753a23 Add a little documentation to iof.h.
This commit was SVN r14208.
2007-04-03 18:17:35 +00:00
George Bosilca
f2a6b9394f Deal with the include spree. Protect "environ" on Windows.
Some others minors modifications in order to make it
compile [again] on Windows.

This commit was SVN r14188.
2007-04-01 16:16:54 +00:00
George Bosilca
01a4f56369 Mostly DECLSPEC cleanups and some include corrections.
This commit was SVN r14186.
2007-04-01 16:08:27 +00:00
Tim Prins
2f74160a37 Fix some more memory leaks
This commit was SVN r14175.
2007-03-30 13:43:50 +00:00