Commit Graph

1275 Commits

Author SHA1 Message Date
Galen Shipman
df86202202 get bproc to compile, other issues still remain..
This commit was SVN r14661.
2007-05-15 23:11:33 +00:00
Brian Barrett
21e00f6f0c Clean up a couple of configure things:
  * Require Autoconf 2.60 or higher and remove some cruft
    required for AC 2.59 or the AC 2.59 / AC 2.60 mix
  * Remove a bunch of now unnecessary AC_SUBST calls
  * Use the libtool-provided variables for the -I and
    library to use when compiling against ltdl

Fixes trac:1000

This commit was SVN r14652.

The following Trac tickets were found above:
  Ticket 1000 --> https://svn.open-mpi.org/trac/ompi/ticket/1000
2007-05-15 04:23:48 +00:00
Jeff Squyres
c5782642d9 Fix some param names so that they show up when you "ompi_info --param
oob all".

This commit was SVN r14646.
2007-05-11 20:58:11 +00:00
Rich Graham
5359cee937 declare undeclared function, so that the code will compile.
This commit was SVN r14625.
2007-05-09 04:47:40 +00:00
Jeff Squyres
51ff779a5d Minor grammatical nit found by Karen/Sun.
This commit was SVN r14622.
2007-05-08 21:24:44 +00:00
Jeff Squyres
395d05b6bc Update the man page to describe both -wdir and -wd. -wdir is considered
the "primary" option and -wd is the synonym.  Regardless, each of
them functions exactly like the other.

This commit was SVN r14618.
2007-05-08 20:27:20 +00:00
Jeff Squyres
8a68b2dba7 Add -wdir option as a synonym for -wd (to make us match the man page).
This commit was SVN r14614.
2007-05-08 19:09:32 +00:00
Ralph Castain
ad541e163e Fix compiler warning
This commit was SVN r14605.
2007-05-08 13:21:18 +00:00
Sven Stork
3707207cca - we don't need to export this symbol
This commit was SVN r14593.
2007-05-07 13:05:52 +00:00
Sven Stork
a04c8eb39a - Bring over the visibility feature, for finer symbol export control
via the visibility support that is provided by some compilers.

  This feature is disabled by default. To enable it, configure with
  --enable-visibility; you also need a compiler with visibility
  support. Please refer to the wiki for more information.
  https://svn.open-mpi.org/trac/ompi/wiki/Visibility

This commit was SVN r14582.
2007-05-04 09:03:37 +00:00
Ralph Castain
2683c85085 Update the TM launcher so it provides an appropriate error message when encountering an invalid launch_id. This is a first step towards fixing ticket #1016, but needs to be followed by a more complete solution.
This commit was SVN r14578.
2007-05-03 20:14:24 +00:00
Gleb Natapov
25190b85f8 Init locks in open() function. Init() function is not called from ompi_info and
we crash in close().

This commit was SVN r14568.
2007-05-02 09:03:14 +00:00
Ralph Castain
4510b42638 Hold the RMGR in the spawn command until the application process actually launches. Previously, we returned from spawn immediately after launching the daemons - this meant that the caller had to define their own "wait until app launches". This only tells the caller that the app procs were launched, of course - it doesn't mean that they have started execution or reached any particular stage. However, for non-MPI procs, this is as far as we can go - there is no further stage gate we can provide.
Still, better than what we provided before...

This commit was SVN r14554.
2007-05-01 11:27:36 +00:00
Shiqing Fan
c166e3d02c Too few arguments for call, fixed according to the corresponding definition.
This commit was SVN r14538.
2007-04-27 13:14:43 +00:00
Ralph Castain
7d6d0a1c00 Update reuse_daemons to find the daemons again - requires that orteds now report their nodenames (probably temporary patch pending upcoming minor revision of orted)
This commit was SVN r14533.
2007-04-26 15:09:54 +00:00
Ralph Castain
c733a7916b Update the gridengine pls to handle failed-to-start. Fix a few places where the fork'd child incorrectly called "return" instead of "exit" (undoubtedly copied from the same error in the old rsh pls).
This commit was SVN r14532.
2007-04-26 15:08:37 +00:00
Ralph Castain
bca2de3a57 Complete the update of the rsh pls to handle failed-to-start
This commit was SVN r14531.
2007-04-26 15:07:40 +00:00
Josh Hursey
d68ff8c2a3 minor typo
This commit was SVN r14516.
2007-04-25 19:54:53 +00:00
Josh Hursey
596062d34b Seems that the recent changes in the sds and oob exposed some invalid
assumptions in the FT restart code for the ORTE layer.

This fixes those problems by having the RML completely shut down and
restart the OOB framework (instead of just the module as before).
This makes it much easier to manage and to maintain as the OOB
changes in the future.

The SDS now does communication as part of its startup procedure, so
we need to make sure we restart the RML before the SDS so that it can
communicate properly.

OOB base [close|open] used a static bool to determine if they have
been called previously or not. I needed to expose this boolean so 
that I can close() then open() the oob base in the restart procedure.
The functionality has not changed, we just now have the ability to 
open/close the framework as many times as we need to as long as we
always call them in that order. (So calling open twice in a row is not allowed
as before, it is only allowed if you open(), close(), then open() again).

Things seem to be working now.

This commit was SVN r14515.
2007-04-25 19:51:52 +00:00
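The open/close ordering rule described in r14515 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `demo_framework_*` names are hypothetical stand-ins rather than the real OOB base functions, and the commit message does not say how the real code reacts to a second open(), so returning an error here is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return codes and names for illustration only; these are
 * not the real OOB base symbols. */
#define DEMO_SUCCESS  0
#define DEMO_ERROR   -1

/* Exposed (non-static) so that restart code can see the state. */
bool demo_framework_is_open = false;

int demo_framework_open(void)
{
    if (demo_framework_is_open) {
        return DEMO_ERROR;      /* assumption: a second open() is rejected */
    }
    /* ... allocate resources, select components ... */
    demo_framework_is_open = true;
    return DEMO_SUCCESS;
}

int demo_framework_close(void)
{
    if (!demo_framework_is_open) {
        return DEMO_ERROR;      /* nothing to close */
    }
    /* ... release resources ... */
    demo_framework_is_open = false;
    return DEMO_SUCCESS;
}

/* Restart helper: close, then open again, always in that order. */
int demo_framework_restart(void)
{
    assert(demo_framework_is_open);
    if (DEMO_SUCCESS != demo_framework_close()) {
        return DEMO_ERROR;
    }
    return demo_framework_open();
}
```

With a guard like this, open(), close(), open() succeeds any number of times, while two consecutive open() calls do not.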
Brian Barrett
4b8bb70afb A couple cleanups for the IPv6 support:
  - make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
    so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
    of #ifs in the OOB code.
  - Fix a compiler warning in the TCP BTL due to run-time determined
    array size by making it a dynamically allocated array.
  - Fix the unpacking code of IPv4 addresses when using IPv6 support, so
    that the address is in the correct location (instead of in an IPv6
    structure, use an IPv4 structure).  Refs trac:1005.

This commit was SVN r14514.

The following Trac tickets were found above:
  Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
2007-04-25 19:08:07 +00:00
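To illustrate why taking a sockaddr_storage lets one function serve both address families (r14514), here is a minimal sketch. `demo_sockaddr2str` is a hypothetical helper and does not reproduce the real opal_sockaddr2str() signature or behavior; it only relies on the standard POSIX `inet_ntop` call.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper, not the real opal_sockaddr2str(): formats an IPv4
 * or IPv6 address into buf and returns buf, or NULL on failure. */
const char *demo_sockaddr2str(const struct sockaddr_storage *ss,
                              char *buf, size_t len)
{
    assert(ss != NULL && buf != NULL);

    if (AF_INET == ss->ss_family) {
        const struct sockaddr_in *in4 = (const struct sockaddr_in *) ss;
        return inet_ntop(AF_INET, &in4->sin_addr, buf, (socklen_t) len);
    }
    if (AF_INET6 == ss->ss_family) {
        const struct sockaddr_in6 *in6 = (const struct sockaddr_in6 *) ss;
        return inet_ntop(AF_INET6, &in6->sin6_addr, buf, (socklen_t) len);
    }
    return NULL;  /* unsupported address family */
}
```

The caller dispatches on `ss_family` at run time, so no compile-time #if is needed at the call site; a buffer of INET6_ADDRSTRLEN bytes is large enough for either family.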
Adrian Knoth
d1ce39de4f Move mca_btl_tcp_addr_isipv4public to opal_addr_isipv4public
This commit was SVN r14512.
2007-04-25 18:06:06 +00:00
Ralph Castain
7d0f51e6b9 Begin setting up for a change to the OOB information passing functionality - this is totally transparent at the moment (need to change computers).
This commit was SVN r14510.
2007-04-25 17:36:26 +00:00
Adrian Knoth
35fce38f43 Don't know why this line was here.
This commit was SVN r14509.
2007-04-25 12:31:13 +00:00
Ralph Castain
8517a5a3a6 cleanup a few compiler warnings
This commit was SVN r14507.
2007-04-25 11:51:18 +00:00
Adrian Knoth
868d8febfa Enable rds/hostfile to accept IPv6 addresses.
This commit was SVN r14505.
2007-04-25 06:55:58 +00:00
Jeff Squyres
c4c68e666a Merge in the ipv6 work from /tmp/ipv6-merge.
This commit was SVN r14503.
2007-04-25 01:55:40 +00:00
Jeff Squyres
321e08c605 Add some missing header files
This commit was SVN r14500.
2007-04-24 21:39:12 +00:00
Ralph Castain
18cb5c9762 Complete modifications for failed-to-start of applications. Modifications for failed-to-start of orteds coming next.
This completes the minor changes required to the PLS components. Basically, there is a small change required to the parameter list of the orted cmd functions. I caught and did it for xcpu and poe, in addition to the components listed in my email - so I think that only leaves xgrid unconverted.

The orted fail-to-start mods will also make changes in the PLS components, but those can be localized so they come in one at a time.

This commit was SVN r14499.
2007-04-24 20:53:54 +00:00
Ralph Castain
a764aa6395 Modify iof to report back more descriptive errors
This commit was SVN r14497.
2007-04-24 19:28:37 +00:00
Ralph Castain
c774f641fb Modify orterun to provide more user-friendly reporting on jobs that fail to start
This commit was SVN r14496.
2007-04-24 19:19:14 +00:00
Ralph Castain
19767802de Let the errmgr know how to deal with incomplete starts
This commit was SVN r14495.
2007-04-24 19:04:29 +00:00
Ralph Castain
ef71055cf8 Teach the odls to properly test for and report failed-to-start for application processes.
Test for system limits (where known) prior to doing things like fork and pipe since some systems aren't very nice about it when we try to exceed such limits.

This commit was SVN r14494.
2007-04-24 18:54:45 +00:00
Ralph Castain
f5ef3d795e Tell the smr how to handle failed-to-start
This commit was SVN r14488.
2007-04-24 16:23:26 +00:00
Jeff Squyres
0674bbd001 Fix segv when the shell is not recognized. Thanks to Mostyn Lewis for
noticing the problem.

This commit was SVN r14483.
2007-04-24 12:00:54 +00:00
Ralph Castain
2d04298002 Update the orted cmd xmit functions to match orted recv's. This fixes trac:1004.
This commit was SVN r14482.

The following Trac tickets were found above:
  Ticket 1004 --> https://svn.open-mpi.org/trac/ompi/ticket/1004
2007-04-24 01:58:40 +00:00
Josh Hursey
260e7612ad Fix a few interface changes introduced by r14475
This commit was SVN r14479.

The following SVN revision numbers were found above:
  r14475 --> open-mpi/ompi@18b2dca51c
2007-04-23 20:18:27 +00:00
Ralph Castain
5f94d6d791 Fix the cnos rml to match revised xcast API
This commit was SVN r14478.
2007-04-23 19:07:44 +00:00
Ralph Castain
18b2dca51c Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).

This also involved a slight change to the oob.xcast API, so propagated that as required.

Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)

Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.

This commit was SVN r14475.
2007-04-23 18:41:04 +00:00
Ralph Castain
009be1c1b5 Reorganize the orted code for easier maintenance. Add ability to deliver xcast messages to local procs (not used at this point).
This commit was SVN r14474.
2007-04-23 18:28:20 +00:00
Ralph Castain
b260f8ee36 Enable the job_family API
This commit was SVN r14473.
2007-04-23 18:26:33 +00:00
Ralph Castain
7a57b694bb Allow caller to get session directory name without anything else
This commit was SVN r14472.
2007-04-23 18:25:36 +00:00
Ralph Castain
9cd85ef55a Add a few more error constants that will help provide more definitive output to the user
This commit was SVN r14471.
2007-04-23 18:25:03 +00:00
Brian Barrett
0a8af62c64 Fix broken build on OS X with static compiles. Everything that uses
anything in OPAL *MUST* call either opal_init() or opal_init_util().

This commit was SVN r14468.
2007-04-23 15:45:39 +00:00
Ralph Castain
477828159e Add a few test functions transferred from ORTE trunk
This commit was SVN r14467.
2007-04-23 14:43:55 +00:00
Ralph Castain
f47e7382e3 Add a new function to wake orterun up - used in failed-to-start scenarios, but can be used anytime a lower level needs to ensure orterun wakes up
This commit was SVN r14466.
2007-04-23 12:49:25 +00:00
Ralph Castain
3d4f1b86d2 Modify the name service to provide necessary support for failed-to-start scenarios. Add a new API to get_vpid_range - this should be used in place of the rmgr API of that name to avoid race conditions (will remove that API in later commit).
This commit was SVN r14465.
2007-04-23 12:48:19 +00:00
Josh Hursey
27a42f48d3 Make sure to call opal_init_util before mca_base_open().
This bug(?) became apparent due to the installdirs commit since these tools
were not finding the proper libraries since the paths were wonky.

It all looks good now. :)

This commit was SVN r14461.
2007-04-21 22:38:15 +00:00
Jeff Squyres
5bebd24250 Bring over Brian's installdirs fixes from this afternoon (r14445).
This commit was SVN r14450.

The following SVN revision numbers were found above:
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:16:31 +00:00
Jeff Squyres
0ba47105ed Merge the /tmp/jms-installdirs-trunk branch into the trunk. This
finally brings in functionality that is already on the 1.2 branch, and
was developed and tested in the v1.2ofed branch (and other places).

Short version of new features:

 * Support for ibv_fork_init() 
 * Automatically fill in the openib BTL bandwidth value by 
   querying the HCA port 
 * Installdirs functionality 
 * Fixes to always use -I in the Fortran wrapper compilers (#924) 
 * Gleb's mpool updates 
 * Remove some kruft in btl/openib/configure.m4, therefore 
   fixing the harmless warnings noted in #665 
 * Bunches of updates to the Linux RPM spec file 

I.e., effectively the same thing that r14411 brought to the v1.2
branch.

Also effectively brought in r14432 and r14433 (some fixes on top of
the original r14411 commit to v1.2).  Still need to bring in the moral
equivalent of r14445 after this commit (fixes to installdirs).

This commit was SVN r14449.

The following SVN revision numbers were found above:
  r14411 --> open-mpi/ompi@83b31314ae
  r14432 --> open-mpi/ompi@a48f160595
  r14433 --> open-mpi/ompi@68f346d2bc
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:15:05 +00:00
Josh Hursey
b9da59ebc3 Fix the way we determine which sequence number to restart with.
Create a sentinel value in the metadata file to clearly indicate
that the sequence number is complete (versus in progress). This
way we do not try to restart from an invalid sequence number
which can lead to badness.

This commit was SVN r14423.
2007-04-19 15:04:27 +00:00
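The sentinel idea in r14423 can be sketched as below. The "seq N" / "done N" line format is invented purely for this illustration and is not the real metadata file format; `demo_last_complete_seq` is likewise a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the highest sequence number that has a "done" sentinel line,
 * or -1 if no checkpoint completed. The "seq N" / "done N" format is an
 * assumption made for this sketch. */
int demo_last_complete_seq(const char *metadata)
{
    assert(metadata != NULL);

    int best = -1;
    const char *line = metadata;

    while (*line != '\0') {
        int seq;
        /* Only a "done" line proves the checkpoint finished writing. */
        if (1 == sscanf(line, "done %d", &seq) && seq > best) {
            best = seq;
        }
        const char *nl = strchr(line, '\n');
        if (NULL == nl) {
            break;
        }
        line = nl + 1;
    }
    return best;
}
```

A sequence number that was started but never marked done is ignored, so a restart would pick the last checkpoint known to be complete rather than one that was still in progress.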
Sven Stork
037b01ce9e - more symbols that need to be exported
This commit was SVN r14415.
2007-04-18 14:53:56 +00:00
Jeff Squyres
1e364218a2 Remove unused variable
This commit was SVN r14413.
2007-04-18 13:10:10 +00:00
Josh Hursey
6ee0c641fd Cleanup the output from orte-checkpoint so it is a bit more clear and references
the sequence number.

Before:
[...] Finished - Global Snapshot Reference: ompi_global_snapshot_1234.ckpt

After:
Snashot Ref.:   1 ompi_global_snapshot_1234.ckpt

This commit was SVN r14381.
2007-04-15 14:28:56 +00:00
George Bosilca
9e840fbe14 The missing orted bug is now fixed. orterun will not deadlock when
the program it tries to spawn is missing.

Description of the problem: When the rsh pls tries to spawn a local
process which is missing (such as a removed orted), orterun
deadlocks.

Description of the fix: The forked child deals with finding the
program to be executed. If it fails to find it, then instead of
calling exit (as a normal forked program is expected to do) it
continues execution along a path it was never
expected to use (back in orterun and then main). Bad things
happen, as expected. Forcing the child to use exit when it fails
to find the orted (and forcing the child to use exit everywhere
instead of return) corrects the logic of the rsh pls and makes it
behave as expected.

This commit was SVN r14377.
2007-04-14 17:36:27 +00:00
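The fix described in r14377 follows a general POSIX pattern, sketched below. `demo_spawn_and_wait` is a hypothetical helper and not the rsh PLS code. The sketch uses `_exit` rather than `exit` so the child does not flush stdio buffers inherited from the parent; which of the two the actual commit used is not something this sketch claims.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Launches `path` in a child process and returns the child's exit status,
 * or -1 if fork/wait fails. Hypothetical helper for illustration. */
int demo_spawn_and_wait(const char *path)
{
    assert(path != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }

    if (0 == pid) {
        /* Child: try to become the target program. */
        execl(path, path, (char *) NULL);

        /* exec failed. The child must terminate here; returning would let
         * it keep running the parent's code path as a second copy. */
        _exit(127);
    }

    /* Parent: wait for the child and report how it ended. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

If the child returned instead of exiting, both processes would continue through the caller's logic, which is the kind of misbehavior the commit message describes.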
Ralph Castain
adb44c44b1 Revert prior commits from last night that involved significant change to the GPR, along with cosmetic changes to the odls_default module pending review and test.
Reverts r14328, r14329, r14331, r14333, r14335, r14338, and r14336.

This commit was SVN r14351.

The following SVN revision numbers were found above:
  r14328 --> open-mpi/ompi@d1ce4a44ca
  r14329 --> open-mpi/ompi@604e79f2d2
  r14331 --> open-mpi/ompi@b2b3417475
  r14333 --> open-mpi/ompi@8882f355b4
  r14335 --> open-mpi/ompi@10dfd534f6
  r14336 --> open-mpi/ompi@5c65c55e59
  r14338 --> open-mpi/ompi@579184cd72
2007-04-12 13:13:28 +00:00
Jeff Squyres
51f286d737 Just like r14289 on the ORTE trunk:
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search path is checking
$pkglibdir).

This commit was SVN r14345.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r14289
2007-04-12 11:19:42 +00:00
George Bosilca
1c037df7e7 Only print information if the condition is met.
This commit was SVN r14340.
2007-04-12 07:28:18 +00:00
George Bosilca
579184cd72 Roll back commit r14335; it got into the trunk too early.
This commit was SVN r14338.

The following SVN revision numbers were found above:
  r14335 --> open-mpi/ompi@10dfd534f6
2007-04-12 06:21:59 +00:00
George Bosilca
5c65c55e59 A few cleanups. The most important is getting rid of the orte_bitmap_t class
which is not used anymore in the orte code.

This commit was SVN r14336.
2007-04-12 05:50:33 +00:00
George Bosilca
10dfd534f6 Correctly remove the itag if we fail the condition. And be pedantic with the code.
This commit was SVN r14335.
2007-04-12 05:33:31 +00:00
George Bosilca
b882b7e1b3 Update the Windows ODLS.
This commit was SVN r14334.
2007-04-12 05:19:25 +00:00
George Bosilca
8882f355b4 Move these functions to their proper place.
This commit was SVN r14333.
2007-04-12 05:18:23 +00:00
George Bosilca
9de6ae0753 ORTE_MODULE_DECLSPEC is not required here.
This commit was SVN r14332.
2007-04-12 05:17:03 +00:00
George Bosilca
b2b3417475 A more optimized version of the orte_gpr_replica_check_itag_list function. It strictly
follows the same behavior as before; the changes just make sure the check is done
in linear time and the memory usage is kept to a minimum.

This commit was SVN r14331.
2007-04-12 05:13:10 +00:00
George Bosilca
604e79f2d2 There is a cleanup label, so I expect to use it in all cases.
This commit was SVN r14329.
2007-04-12 05:05:36 +00:00
George Bosilca
d1ce4a44ca Fix small memory leak (only happens in debug mode).
This commit was SVN r14328.
2007-04-12 05:02:57 +00:00
George Bosilca
cad93a7693 Add more output. Fix some typos, and some small cleanups.
This commit was SVN r14327.
2007-04-12 05:01:29 +00:00
George Bosilca
0d82473b9d Enable the null IOF.
This commit was SVN r14326.
2007-04-12 05:00:05 +00:00
George Bosilca
f5478d95df Don't do anything if the array is already empty.
This commit was SVN r14325.
2007-04-12 04:58:47 +00:00
George Bosilca
e7c4f1ca64 Remove some unused code and correct the finalize function (cancel the pending
receive request).

This commit was SVN r14324.
2007-04-12 04:58:12 +00:00
George Bosilca
4a87c782c3 Release all unselected components. This is a little bit more tricky than usual,
as the IOF components lack the required finalize function. Instead rely on the
module finalize. Read the comment for more information.

This commit was SVN r14323.
2007-04-12 04:57:08 +00:00
George Bosilca
c15cd5e4ab Unload all unnecessary PLS. Once the selection process is done, we should release all
unselected PLS. This decreases the footprint of all Open MPI based processes.

This commit was SVN r14322.
2007-04-12 04:55:23 +00:00
George Bosilca
af6891f471 Fix a small typo.
This commit was SVN r14321.
2007-04-12 04:53:30 +00:00
Tim Prins
6872f21af0 remove unused variable
This commit was SVN r14306.
2007-04-11 17:15:14 +00:00
Pak Lui
e9e8dc2765 * comment out unused code
This commit was SVN r14297.
2007-04-10 22:38:34 +00:00
Josh Hursey
cd5047a9bf Refs trac:976
Collect the base 'orted' command line into a base function since most of the
PLS components were duplicating this code. Add AMCA parameter command line
component to the base set.

Add Aggregate MCA parameter support to the following PLS components:
 - gridengine
 - process
 - slurm
 - poe
 - tm

Improve support for 'rsh' component.

Did/could not support the following components:
 - bproc
 - proxy
 - xcpu
 - cnos
 - xgrid

The above components had peculiar needs that made it non-trivial to add an 
option. The authors of these components need to help in supporting this
new option.

I was only able to test the SLURM and RSH components due to system availability.
The others should work without problem.

This commit was SVN r14284.

The following Trac tickets were found above:
  Ticket 976 --> https://svn.open-mpi.org/trac/ompi/ticket/976
2007-04-10 14:23:32 +00:00
Tim Prins
1e7ff7f0fe Fix another buglet.
This commit was SVN r14270.
2007-04-09 17:54:11 +00:00
Tim Prins
2ffc02870d Reduce the memory usage of the GPR:
- Make it so that all the GPR pointer arrays are allocated initially at 16 elements instead of 512. This saves (on a 64 bit machine) approximately 4*(# procs + # nodes) KB.
- Fix up the segment prealloc function so that preallocating an existing segment is not an error, and make the areas where we do large inserts use it.

Fix the orte_pointer_array to efficiently implement setting its size. Before we just realloced the array one block at a time until the desired size was reached. Now we resize it all in one realloc.

This commit was SVN r14264.
2007-04-09 00:40:15 +00:00
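The "resize it all in one realloc" change from r14264 can be sketched as follows. The struct and function names are hypothetical and do not mirror orte_pointer_array_t. As a side note on the savings figure, and assuming 8-byte pointers, shrinking the initial allocation from 512 to 16 slots saves (512 - 16) * 8 = 3968 bytes, roughly 4 KB per array.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a growable pointer array; field names are
 * hypothetical. */
typedef struct {
    void  **addr;   /* storage for the pointers */
    size_t  size;   /* number of allocated slots */
} demo_pointer_array_t;

/* Grow the array to at least new_size slots with a single realloc.
 * Returns 0 on success, -1 on allocation failure (array left unchanged). */
int demo_pointer_array_set_size(demo_pointer_array_t *arr, size_t new_size)
{
    assert(arr != NULL);

    if (new_size <= arr->size) {
        return 0;   /* already large enough */
    }

    void **tmp = realloc(arr->addr, new_size * sizeof(void *));
    if (NULL == tmp) {
        return -1;
    }

    /* New slots start out empty. */
    for (size_t i = arr->size; i < new_size; ++i) {
        tmp[i] = NULL;
    }

    arr->addr = tmp;
    arr->size = new_size;
    return 0;
}
```

Growing one block at a time would call realloc repeatedly on the way from 16 to, say, 512 slots; computing the final size first needs only one call and at most one copy.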
Brian Barrett
13a4bba13f Yet another dumb thing that shouldn't have been in r14261.
This commit was SVN r14263.

The following SVN revision numbers were found above:
  r14261 --> open-mpi/ompi@8a55c84d0b
2007-04-07 23:23:23 +00:00
Brian Barrett
32f0090f81 fix dumb variable scope mistake
This commit was SVN r14262.
2007-04-07 23:00:57 +00:00
Brian Barrett
8a55c84d0b Fix a number of OOB issues:
  * Remove the connect() timeout code, as it had some nasty race conditions
    when connections were established as the trigger was firing.  A better
    solution has been found for the cluster where this was needed, so just
    removing it was easiest.
  * When a fatal error (too many connection failures) occurs, set an error
    on messages in the queue even if there isn't an active message.  The
    first message to any peer will be queued without being active (and
    so will all subsequent messages until the connection is established),
    and the orteds will hang until that first message completes.  So if
    an orted can never contact its peer, it will never exit and just sit
    waiting for that message to complete.
  * Cover an interesting RST condition in the connect code.  A connection
    can complete the three-way handshake, the connector can even send
    some data, but the server side will drop the connection because it
    can't move it from the half-connected to fully-connected state because
    of space shortage in the listen backlog queue.  This causes a RST to
    be received the first time that recv() is called, which will be when waiting
    for the remote side of the OOB ack.  In this case, transition the
    connection back into a CLOSED state and try to connect again.
  * Add levels of debugging, rather than all or nothing, each building on
    the previous level.  0 (default) is hard errors.  1 is connection 
    error debugging info.  2 is all connection info.  3 is more state
    info.  4 includes all message info.
  * Add some hopefully useful comments

This commit was SVN r14261.
2007-04-07 22:33:30 +00:00
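The cumulative debug levels listed in r14261 could be modeled along these lines. Only the numeric meanings (0 through 4) come from the commit message; the enum names, the `demo_*` functions, and the use of stderr are assumptions for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Level names are hypothetical; each level includes everything below it. */
enum {
    DEMO_DBG_FAIL         = 0,  /* hard errors (default) */
    DEMO_DBG_CONNECT_FAIL = 1,  /* connection error debugging info */
    DEMO_DBG_CONNECT      = 2,  /* all connection info */
    DEMO_DBG_INFO         = 3,  /* more state info */
    DEMO_DBG_ALL          = 4   /* all message info */
};

static int demo_debug_level = DEMO_DBG_FAIL;

void demo_set_debug_level(int level)
{
    demo_debug_level = level;
}

/* Prints to stderr only if `level` is enabled; returns 1 if printed. */
int demo_debug(int level, const char *fmt, ...)
{
    assert(fmt != NULL);

    if (level > demo_debug_level) {
        return 0;   /* message is more verbose than the current setting */
    }

    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    return 1;
}
```

Because the check is a single comparison, raising the level by one adds the next category of output without removing any of the earlier ones.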
Tim Prins
df4c468bb4 fix some more minor memory leaks
This commit was SVN r14260.
2007-04-07 18:41:16 +00:00
Tim Prins
8e7765e456 Fix a gigantic memory leak. We were copying a message to send into a buffer, then never freeing the copy we made. But we were mistakenly allocating the buffer on the stack, so the memory checking tools never caught the leak. On 96 nodes, 384 processes, mpirun memory usage went from about 12M to 3M for me after this minor change...
This commit was SVN r14257.
2007-04-07 02:25:48 +00:00
Tim Prins
e058266c96 Change the ORTE datatype service in 2 ways:
1. Remove an unneeded field, bytes_avail, from orte_buffer_t. It is a calculated value, and updating it everywhere is worse than just calculating it in the one place it is actually used.
2. Change it so the default size of an orte_buffer is 128 bytes instead of 1024 bytes. We then double the size of the buffer up to 1024 bytes, then we additively increase the size by 1024 bytes at a time as was done before.

This commit was SVN r14252.
2007-04-06 19:40:29 +00:00
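The buffer growth policy from r14252 (start at 128 bytes, double up to 1024, then grow by 1024 at a time) can be sketched as a small sizing function. The constants come from the commit message; `demo_next_capacity` and its signature are hypothetical and not part of the ORTE datatype service.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_INITIAL_SIZE   128   /* initial size, from the commit message */
#define DEMO_THRESHOLD_SIZE 1024  /* switch-over point and additive step */

/* Returns the smallest capacity reachable from `current` under the growth
 * policy that can hold at least `needed` bytes. Hypothetical helper. */
size_t demo_next_capacity(size_t current, size_t needed)
{
    assert(needed > 0);

    size_t cap = (0 == current) ? DEMO_INITIAL_SIZE : current;

    while (cap < needed) {
        if (cap < DEMO_THRESHOLD_SIZE) {
            cap *= 2;                    /* geometric phase */
        } else {
            cap += DEMO_THRESHOLD_SIZE;  /* additive phase */
        }
    }
    return cap;
}
```

Small messages stay in a 128- or 256-byte buffer instead of always paying for 1024 bytes, while large buffers keep the earlier 1024-byte increments.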
George Bosilca
33bf6c6e54 Move the comment to the right place.
This commit was SVN r14237.
2007-04-05 20:36:33 +00:00
George Bosilca
5c355d0bea Always return an initialized variable. More output if we fail to read
from the shell detection child. Don't spawn orted, instead spawn what's
inside the mca_pls_rsh_component.orted.

This commit was SVN r14236.
2007-04-05 20:17:10 +00:00
George Bosilca
ef4baeb6ab Don't reset the pid, as at this point it is already set.
This commit was SVN r14235.
2007-04-05 20:13:50 +00:00
George Bosilca
8fb8363868 Correctly detect the remote shell, and the local one. Big clean-up on how we
deal with the PLS RSH. Remove support for unknown user (i.e. if the user is
not known by the system, then it shouldn't be allowed to spawn anything).

This commit was SVN r14232.
2007-04-05 19:22:26 +00:00
Josh Hursey
8fd6d4ba09 add a newline so output is cleaner/clearer
This commit was SVN r14229.
2007-04-05 17:45:03 +00:00
Ralph Castain
e95539a16a Add two new test codes - orte_loop_spawn/child - to help debug issues surrounding multiple calls to comm_spawn
This commit was SVN r14217.
2007-04-04 21:02:18 +00:00
Jeff Squyres
2cbcb4abf1 Remove the French and strip the tests down to essentials (no need for
buffer attaching/detaching, for example).

This commit was SVN r14216.
2007-04-04 15:38:23 +00:00
Ralph Castain
d5b5cd2d3c Add test code for multiple comm_spawn calls.
Add ERROR_LOG calls to more clearly document failures in the rsh launcher.

This commit was SVN r14214.
2007-04-04 13:24:39 +00:00
Jeff Squyres
fe58753a23 Add a little documentation to iof.h.
This commit was SVN r14208.
2007-04-03 18:17:35 +00:00
George Bosilca
f2a6b9394f Deal with the include spree. Protect "environ" on Windows.
Some other minor modifications in order to make it
compile [again] on Windows.

This commit was SVN r14188.
2007-04-01 16:16:54 +00:00
George Bosilca
01a4f56369 Mostly DECLSPEC cleanups and some include corrections.
This commit was SVN r14186.
2007-04-01 16:08:27 +00:00
Tim Prins
2f74160a37 Fix some more memory leaks
This commit was SVN r14175.
2007-03-30 13:43:50 +00:00
George Bosilca
d367d9017c Need the definition of opal_output_close.
This commit was SVN r14167.
2007-03-29 01:18:26 +00:00
Tim Prins
9cb455272b Fix a pile of memory leaks in ORTE.
Fix a major memory leak in the SLURM RAS, and cleanup a bit of code there.

This commit was SVN r14164.
2007-03-29 00:50:56 +00:00
Sven Stork
44ead58103 - export component structure
This commit was SVN r14139.
2007-03-26 13:46:00 +00:00
Ralph Castain
0d98264097 Fix the nolocal option on the OMPI trunk
This commit was SVN r14138.
2007-03-24 16:16:16 +00:00
Galen Shipman
48d1fa830d A race condition exists on the free list of pending connections because
OPAL_FREE_LIST_WAIT/RETURN will not use locks in a non-threaded build.
Conditionally using locks around OPAL_FREE_LIST_WAIT/RETURN in non-threaded
builds seems to fix the issue.
Tested at 4K processes and seems to work.

This commit was SVN r14135.
2007-03-23 15:19:03 +00:00
Brian Barrett
d454395b51 Need to fall back on the event listen mode if the MCA parameter said use the
listen thread, but we're not the HNP.  This is better than not starting up
any listen mode, which is what we were doing before :/

This commit was SVN r14133.
2007-03-23 13:29:18 +00:00
Jeff Squyres
bcdfbacaa4 Oops -- typo from previous commit. :-(
This commit was SVN r14130.
2007-03-23 00:51:50 +00:00
Jeff Squyres
2105f444ec Add missing header file
This commit was SVN r14129.
2007-03-23 00:47:30 +00:00
Jeff Squyres
a3dd0f2e08 Connect --nolocal up to the MCA param rmaps_base_schedule_local, as it
should be (it's a mistake that it got left out).

This commit was SVN r14127.
2007-03-22 19:29:47 +00:00
Sven Stork
6111ca1152 - Let's try to detect the default nodefile directory because it can be different
  for different sites. If we cannot detect the default then we fall back to
  the hard coded path.

This commit was SVN r14121.
2007-03-22 15:26:16 +00:00
Galen Shipman
e654604a25 remove invalid comment
This commit was SVN r14118.
2007-03-22 03:51:36 +00:00
Josh Hursey
3492fdeae3 Fix a couple of compiler warnings (errors?) caught by ICC testing at Cisco.
This commit was SVN r14080.
2007-03-20 14:12:13 +00:00
Rainer Keller
1322f9f346 - Further attributes mainly for opal/* functions, marking
   __opal_attribute_nonnull__, __opal_attribute_warn_unused_result__,
   __opal_attribute_malloc__, __opal_attribute_sentinel__ and
   __opal_attribute_format__

This commit was SVN r14078.
2007-03-20 13:01:32 +00:00
Pak Lui
803655b555 * incorporated some of Jeff's comment regarding this fix.
This commit was SVN r14070.
2007-03-19 21:59:48 +00:00
Pak Lui
da4d41e0e7 * fixed the missing fclose and eliminated the call to get_slot_count
since it is not needed

This commit was SVN r14066.
2007-03-19 17:47:30 +00:00
Rich Graham
d2e799f6b5 add some stub functions for the cnos environment.
This commit was SVN r14065.
2007-03-19 17:35:46 +00:00
Josh Hursey
101a2abd09 - Be more careful with parens
- Run the destructor *before* shutting things down.

This commit was SVN r14064.
2007-03-19 17:33:20 +00:00
Brian Barrett
ea08a555f9 Fixed a compile error on OS X 10.3 introduced with 1.1.5 / 1.2. Thanks
to Marius Schamschula for reporting the issue.

This commit was SVN r14063.
2007-03-19 17:25:54 +00:00
Josh Hursey
a181c987cc Remove some old references to ft_enable parameter that no longer exists.
This was replaced by the "-am ft-enable-cr" AMCA parameter.

This commit was SVN r14055.
2007-03-17 20:02:42 +00:00
Josh Hursey
d03073e87d Make sure to protect the finalize call so tools like ompi_info
do not segv.

This commit was SVN r14054.
2007-03-17 19:47:54 +00:00
Josh Hursey
dadca7da88 Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD).
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.

This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.

This commit closes trac:158

More details to follow.

This commit was SVN r14051.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r13912

The following Trac tickets were found above:
  Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
2007-03-16 23:11:45 +00:00
Jeff Squyres
c000ee5328 Fixes trac:921
 * Do not empty the list of in-flight frags during _close(); the OOB
   callback will still occur (_send_cb()) and try to remove the frag
   from the list, which will then result in an assert failure (debug
   builds).  
 * Add one more fix for a possible problem -- add an extra RETAIN /
   RELEASE pair on the endpoint to ensure that it is not actually
   freed before all in-flight frags have drained.

This commit was SVN r13953.

The following Trac tickets were found above:
  Ticket 921 --> https://svn.open-mpi.org/trac/ompi/ticket/921
2007-03-07 20:12:22 +00:00
Tim Prins
fe3ea0085f Fix minor memory leaks
This commit was SVN r13946.
2007-03-07 01:09:38 +00:00
Jeff Squyres
7b72ded10c Patch from Gotz Waschk to recognize zsh.
This commit was SVN r13907.
2007-03-03 01:42:03 +00:00
Li-Ta Lo
a0e5b6a27c minor clean up and treespawn support
This commit was SVN r13876.
2007-03-01 22:32:37 +00:00
Josh Hursey
0404444dbe * Added 2 new MCA parameters
  - mca_base_param_file_prefix
     (Default: NULL)
     This is the full name of the "-am" mpirun option. Used to specify a ':'
     separated list of AMCA parameter set files.
  - mca_base_param_file_path
     (Default: $SYSCONFDIR/amca-param-sets/:$CWD)
     The path to search for AMCA files with relative paths. A warning will be
     printed if the AMCA file cannot be found.

* Added a new function "mca_base_param_recache_files" that re-reads the file
configurations. This is used internally to help bootstrap the MCA system.

* Added a new orterun/mpirun command line option '-am' that is an alias for the
mca_base_param_file_prefix MCA parameter

* Exposed the opal_path_access function as it is generally useful in other
places in the code.

* New function "opal_cmd_line_make_opt_mca" which will allow you to append a
new command line option with MCA parameter identifiers to set at the same
time. Previously this could only be done at command line declaration time.

* Added a new directory under the $pkgdatadir named "amca-param-sets" where all
the 'shipped with' Open MPI AMCA parameter sets are placed. This is the first
place to search for AMCA sets with relative paths.

* An example.conf AMCA parameter set file is located in
contrib/amca-param-sets/.

* Jeff Squyres contributed an OpenIB AMCA set for benchmarking.

Note: You will need to autogen with this commit as it adds a configure param.
  Sorry :(
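
As a rough illustration of the search described for mca_base_param_file_path, the sketch below looks up a relative file name in each directory of a ':'-separated path, in order. It is not the code from this commit: `find_param_file` and its signature are hypothetical, and it uses the POSIX `access()` call directly rather than `opal_path_access`, whose interface is not shown here.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: search each directory in a ':'-separated list for
 * a readable file called `name`. On success the full path is written to
 * `out` and 0 is returned. On failure `out` is emptied and -1 is returned,
 * which is where a caller would print its "file not found" warning.
 */
int find_param_file(const char *name, const char *search_path,
                    char *out, size_t out_len)
{
    const char *dir = search_path;

    while (*dir != '\0') {
        size_t seg = strcspn(dir, ":");   /* length of this path entry */
        size_t len = seg;

        /* Drop trailing slashes so "dir/" and "dir" behave the same. */
        while (len > 1 && dir[len - 1] == '/') {
            len--;
        }
        if (len > 0) {
            int n = snprintf(out, out_len, "%.*s/%s", (int)len, dir, name);
            if (n > 0 && (size_t)n < out_len && access(out, R_OK) == 0) {
                return 0;
            }
        }
        dir += seg;
        if (*dir == ':') {
            dir++;
        }
    }
    if (out_len > 0) {
        out[0] = '\0';
    }
    return -1;
}
```

Directories are tried strictly in the order given, so the shipped amca-param-sets directory would be consulted before the current working directory when it appears first in the path.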

This commit was SVN r13867.
2007-03-01 13:39:20 +00:00
Rainer Keller
0889ebd59f - Eliminate warnings, that PGI-6.2.5 issues with -Minform=inform
This commit was SVN r13840.
2007-02-28 08:36:34 +00:00
George Bosilca
4bab882d17 These 2 ORTE_DECLSPEC are not required.
This commit was SVN r13825.
2007-02-27 15:45:40 +00:00
Sven Stork
d8a369936e - Fix more symbols that should be exported.
This commit was SVN r13824.
2007-02-27 15:17:17 +00:00
Sven Stork
a86deb460e - export required symbols
This commit was SVN r13810.
2007-02-27 09:43:32 +00:00
Tim Prins
c6f2efe4b8 These are orte functions, the structure should be named as such
This commit was SVN r13765.
2007-02-22 23:29:31 +00:00
George Bosilca
d29423b1f7 orted_globals_t should be global.
This commit was SVN r13684.
2007-02-16 18:16:06 +00:00
Brian Barrett
f6a5d58885 Rather than set the connect event timeout number to something big and hope
it's bigger than the timeout for the connect() call, just don't register
the handler by default and fall back to connect() timing out.  Should give
much happier performance on big clusters.

This commit was SVN r13639.
2007-02-13 18:36:50 +00:00
Pak Lui
085826d94a * Remove the code for putting the bogus exit status of the user proc.
Also remove the smr set_proc_state since it's covered elsewhere.

This commit was SVN r13625.
2007-02-12 23:59:27 +00:00
Brian Barrett
8b28e5b33d Allow the OOB to connect between all MPI applications during MPI_INIT
without also establishing MPI connectivity. 

This commit was SVN r13595.
2007-02-09 20:17:37 +00:00
Brian Barrett
262cbbc5c9 Back out r13593, which contained a change that shouldn't be committed.
This commit was SVN r13594.

The following SVN revision numbers were found above:
  r13593 --> open-mpi/ompi@81472363ea
2007-02-09 20:13:02 +00:00
Brian Barrett
81472363ea Allow the OOB to connect between all MPI applications during MPI_INIT
without also establishing MPI connectivity.

This commit was SVN r13593.
2007-02-09 20:11:40 +00:00
Pak Lui
2d6b3776bf * Fix the SEGV described in trac #892, where an exit_status in the 200 range
  causes strsignal to return NULL. Still trying to determine
  why exit_status is in that range.
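
A defensive pattern for this kind of failure is to check the result of `strsignal()` before formatting it. The sketch below is only an illustration under that assumption: `describe_signal` is a hypothetical helper and is not taken from the actual fix.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask for the strsignal() declaration */
#endif
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a description of a signal number without
 * passing a NULL pointer to printf-style formatting. Some platforms
 * return NULL from strsignal() for out-of-range values.
 */
void describe_signal(int signum, char *out, size_t out_len)
{
    const char *name = strsignal(signum);

    if (name == NULL) {
        snprintf(out, out_len, "signal %d (unknown)", signum);
    } else {
        snprintf(out, out_len, "signal %d (%s)", signum, name);
    }
}
```

The text inside the parentheses is platform-dependent, so callers should not parse it.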

This commit was SVN r13583.
2007-02-09 16:39:30 +00:00
Ralph Castain
5818a32245 Bring in a forgotten speed improvement for the TM launcher that was developed during SNL Tbird testing last year. Remove the redundant and slow calls to TM to resolve hostnames. Instead, read the host info from the PBS file during the RAS, and then just use that info in the PLS (rather than getting it again).
Adjust the RMAPS mapped_node object to propagate the required launch_id info now included in the ras_node object. This provides support for those few systems that don't use nodename to launch, but instead want some id (typically an index into the array of allocated nodes). This value gets set for each node in the RAS - the RMAPS just propagates it for easy launch.

This commit was SVN r13581.
2007-02-09 15:06:45 +00:00
George Bosilca
79d76b044a ORTE_DECL everything that can be used outside the base directory. I
wonder why this file is called private when it's included by all PLS ...

This commit was SVN r13573.
2007-02-09 03:16:19 +00:00
George Bosilca
7750ed22e0 Correct the Windows part of the universe detection.
This commit was SVN r13547.
2007-02-07 22:37:28 +00:00
Pak Lui
ccff0a6e65 * minor fix to correct the pid that always shows up as 0 in the abort
error message. e.g: 

  mpirun noticed that job rank 2 with PID 0 on node burl-ct-v440-4
  exited on signal 15 (Terminated).

This commit was SVN r13537.
2007-02-07 17:46:19 +00:00
Ralph Castain
890e3c7981 Reset the trunk so that the odls now sets the paffinity and sched_yield params again. The sched_yield is still overridden by any user-specified setting.
This change utilizes the new num_processors function. I also left the mods made to ompi_mpi_init and the bug fix for the default value of mpi_yield_when_idle. Note that the mods to mpi_init will not really take effect as the mca param will now *always* be set (either by user or odls). We will need those mods later, so no point in removing them now.

This commit was SVN r13519.
2007-02-06 19:51:05 +00:00
Jeff Squyres
c91fcd7fbd Fix a bunch of minor typos submitted by Bernhard Fischer.
This commit was SVN r13505.
2007-02-06 12:00:30 +00:00
Rolf vandeVaart
dcce8c739c Fix compiler warning. I am not sure how this got
past us, but thanks to Jeff Squyres for pointing it out.

This commit was SVN r13501.
2007-02-05 22:03:58 +00:00
Rolf vandeVaart
74e3b68ce8 Better document orte-clean's behavior.
This commit was SVN r13498.
2007-02-05 20:01:15 +00:00
Ralph Castain
26897a626d Add a delayed_abort test code. We seem to handle this case just fine now, but Sun reports still seeing troubles on Solaris.
This commit was SVN r13493.
2007-02-05 15:24:01 +00:00
Jeff Squyres
4e506e69e5 Add missing <sys/param.h>
This commit was SVN r13478.
2007-02-03 01:11:35 +00:00
Rolf vandeVaart
bf5113198d Update to orte-clean so it will remove files on local and
remote nodes.  It will also kill off rogue orteds and orterun
processes.  The killing of processes is ifdef'ed out for Windows
since I do not know how to do it there.  Note that this change
will require an autogen.

This commit was SVN r13477.
2007-02-03 00:25:42 +00:00
Ralph Castain
a8202742ba Fix a missing function pointer - reference ticket #854
This commit was SVN r13476.
2007-02-02 23:10:14 +00:00
Jeff Squyres
f6e7016cdd Make this test capable of running more than "-np 1". If you run with
"-np X", it will launch X parents and then MPI_COMM_SPAWN X additional
children.

This commit was SVN r13466.
2007-02-02 14:34:53 +00:00
Ralph Castain
3daf8b341b Fix the sched_yield problem for generic environments. We now determine and set sched_yield during mpi_init based on the following logical sequence:
1. if the user has specified sched_yield, we simply do what we are told

2. if they didn't specify anything, try to get the number of processors on this node. Note that we already now get the number of local procs in our job that are sharing this node - that now comes in through the proc callback and is stored in the ompi_proc_t structures.

3. if we can get the number of processors, compare that to the number of local procs from my job that are sharing my node. If the number of local procs exceeds the number of processors, then set sched_yield to true. If not, then be a hog and set sched_yield to false.

4. if we can't get the number of processors, default to conservative behavior and set sched_yield to true.

Note that I have not yet dealt with the need to dynamically adjust this setting as more processes are added via comm_spawn. So far, we are *only* looking within our own job. Given that we have now moved this logic to mpi_init (and away from the orteds), it isn't yet clear to me how a process will be informed about the number of procs in *other* jobs that are also sharing this node.

Something to continue to ponder.
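
The four-step decision above can be sketched as a small function. This is an illustration only, not the code in ompi_mpi_init: the `user_yield_t` type, the function name, and the convention that a non-positive processor count means "could not be determined" are all assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical tri-state for the user's setting; not an Open MPI type. */
typedef enum { USER_UNSET, USER_YIELD_OFF, USER_YIELD_ON } user_yield_t;

/*
 * Decide whether a process should yield when idle.
 * num_processors <= 0 means the count could not be determined (assumption).
 */
bool decide_sched_yield(user_yield_t user, int num_local_procs,
                        int num_processors)
{
    /* Step 1: an explicit user setting always wins. */
    if (user != USER_UNSET) {
        return user == USER_YIELD_ON;
    }
    /* Step 4: unknown processor count, so be conservative and yield. */
    if (num_processors <= 0) {
        return true;
    }
    /* Steps 2-3: yield only when local procs exceed the processors. */
    return num_local_procs > num_processors;
}
```

Because the comparison is a strict "exceeds", a node with exactly as many local procs as processors does not yield. As the note above says, this only accounts for procs in the caller's own job.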

This commit was SVN r13430.
2007-02-01 19:31:44 +00:00
Ralph Castain
c754523a14 Add cancel_operations to the pls module definition for tm
This commit was SVN r13416.
2007-02-01 16:52:28 +00:00
Ralph Castain
51fb746da3 Stop overriding the yield_when_idle mca param if the user has set it
This commit was SVN r13414.
2007-02-01 15:01:12 +00:00