1357 commits

Author SHA1 Message Date
Josh Hursey
596062d34b Seems that the recent changes in the sds and oob exposed some invalid
assumptions in the FT restart code for the ORTE layer.

This fixes those problems by having the RML completely shutdown and 
restart the OOB framework (instead of just the module as before).
This makes it much easier to manage and to maintain as the OOB
changes in the future.

The SDS now does communication as part of its startup procedure, so
we need to make sure we restart the RML before the SDS so that it can
communicate properly.

OOB base [close|open] used a static bool to determine if they have
been called previously or not. I needed to expose this boolean so 
that I can close() then open() the oob base in the restart procedure.
The functionality has not changed, we just now have the ability to 
open/close the framework as many times as we need to as long as we
always call them in that order. (So calling open twice in a row is still not
allowed; it is only allowed if you open(), close(), then open() again.)
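
The open/close ordering rule above can be sketched as follows. This is a minimal illustration, not the ORTE code: `demo_framework_open`, `demo_framework_close`, and the `demo_opened` flag are hypothetical names, and returning -1 on a repeated open is an assumption about what "not allowed" means.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the boolean the commit exposes;
 * this is not the real ORTE symbol. */
static bool demo_opened = false;

/* Returns 0 on success, -1 if the framework is already open. */
int demo_framework_open(void)
{
    if (demo_opened) {
        return -1;          /* open() twice in a row is rejected */
    }
    demo_opened = true;
    return 0;
}

/* Returns 0 on success, -1 if the framework is not currently open. */
int demo_framework_close(void)
{
    if (!demo_opened) {
        return -1;
    }
    demo_opened = false;    /* clears the flag so a later open() works */
    return 0;
}
```

With this guard, a restart path can call close() and then open() as many times as it needs, while two consecutive open() calls are still refused.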

Things seem to be working now.

This commit was SVN r14515.
2007-04-25 19:51:52 +00:00
Brian Barrett
4b8bb70afb A couple cleanups for the IPv6 support:
  - Make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
    so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
    of #ifs in the OOB code.
  - Fix a compiler warning in the TCP BTL due to run-time determined
    array size by making it a dynamically allocated array.
  - Fix the unpacking code of IPv4 addresses when using IPv6 support, so
    that the address is in the correct location (instead of in an IPv6
    structure, use an IPv4 structure).  Refs trac:1005.
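
The first item above, a single formatting routine that accepts a sockaddr_storage, might look roughly like the sketch below. `demo_sockaddr2str` is a hypothetical helper written for illustration; it is not the actual opal_sockaddr2str() and its signature is an assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper, not the real opal_sockaddr2str(): formats either
 * an IPv4 or an IPv6 address stored in a sockaddr_storage.
 * Returns 0 on success, -1 on an unsupported family or a short buffer. */
int demo_sockaddr2str(const struct sockaddr_storage *addr,
                      char *buf, size_t len)
{
    const void *raw;

    /* Pick the address bytes according to the family tag, so callers
     * need no #if blocks for the IPv4 and IPv6 cases. */
    switch (addr->ss_family) {
    case AF_INET:
        raw = &((const struct sockaddr_in *)addr)->sin_addr;
        break;
    case AF_INET6:
        raw = &((const struct sockaddr_in6 *)addr)->sin6_addr;
        break;
    default:
        return -1;
    }

    return inet_ntop(addr->ss_family, raw, buf, (socklen_t)len) != NULL
               ? 0 : -1;
}
```

A buffer of INET6_ADDRSTRLEN bytes is large enough for both address families.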

This commit was SVN r14514.

The following Trac tickets were found above:
  Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
2007-04-25 19:08:07 +00:00
Adrian Knoth
d1ce39de4f Move mca_btl_tcp_addr_isipv4public to opal_addr_isipv4public
This commit was SVN r14512.
2007-04-25 18:06:06 +00:00
Ralph Castain
7d0f51e6b9 Begin setting up for a change to the OOB information passing functionality - this is totally transparent at the moment (need to change computers).
This commit was SVN r14510.
2007-04-25 17:36:26 +00:00
Adrian Knoth
35fce38f43 Don't know why this line was here.
This commit was SVN r14509.
2007-04-25 12:31:13 +00:00
Ralph Castain
8517a5a3a6 cleanup a few compiler warnings
This commit was SVN r14507.
2007-04-25 11:51:18 +00:00
Adrian Knoth
868d8febfa Enable rds/hostfile to accept IPv6 addresses.
This commit was SVN r14505.
2007-04-25 06:55:58 +00:00
Jeff Squyres
c4c68e666a Merge in the ipv6 work from /tmp/ipv6-merge.
This commit was SVN r14503.
2007-04-25 01:55:40 +00:00
Jeff Squyres
321e08c605 Add some missing header files
This commit was SVN r14500.
2007-04-24 21:39:12 +00:00
Ralph Castain
18cb5c9762 Complete modifications for failed-to-start of applications. Modifications for failed-to-start of orteds coming next.
This completes the minor changes required to the PLS components. Basically, there is a small change required to the parameter list of the orted cmd functions. I caught and did it for xcpu and poe, in addition to the components listed in my email - so I think that only leaves xgrid unconverted.

The orted fail-to-start mods will also make changes in the PLS components, but those can be localized so they come in one at a time.

This commit was SVN r14499.
2007-04-24 20:53:54 +00:00
Ralph Castain
a764aa6395 Modify iof to report back more descriptive errors
This commit was SVN r14497.
2007-04-24 19:28:37 +00:00
Ralph Castain
c774f641fb Modify orterun to provide more user-friendly reporting on jobs that fail to start
This commit was SVN r14496.
2007-04-24 19:19:14 +00:00
Ralph Castain
19767802de Let the errmgr know how to deal with incomplete starts
This commit was SVN r14495.
2007-04-24 19:04:29 +00:00
Ralph Castain
ef71055cf8 Teach the odls to properly test for and report failed-to-start for application processes.
Test for system limits (where known) prior to doing things like fork and pipe since some systems aren't very nice about it when we try to exceed such limits.

This commit was SVN r14494.
2007-04-24 18:54:45 +00:00
Ralph Castain
f5ef3d795e Tell the smr how to handle failed-to-start
This commit was SVN r14488.
2007-04-24 16:23:26 +00:00
Jeff Squyres
0674bbd001 Fix segv when the shell is not recognized. Thanks to Mostyn Lewis for
noticing the problem.

This commit was SVN r14483.
2007-04-24 12:00:54 +00:00
Ralph Castain
2d04298002 Update the orted cmd xmit functions to match orted recv's. This fixes trac:1004.
This commit was SVN r14482.

The following Trac tickets were found above:
  Ticket 1004 --> https://svn.open-mpi.org/trac/ompi/ticket/1004
2007-04-24 01:58:40 +00:00
Josh Hursey
260e7612ad Fix a few interface changes introduced by r14475
This commit was SVN r14479.

The following SVN revision numbers were found above:
  r14475 --> open-mpi/ompi@18b2dca51c
2007-04-23 20:18:27 +00:00
Ralph Castain
5f94d6d791 Fix the cnos rml to match revised xcast API
This commit was SVN r14478.
2007-04-23 19:07:44 +00:00
Ralph Castain
18b2dca51c Bring in the code for routing xcast stage gate messages via the local orteds. This code is inactive unless you specifically request it via an mca param oob_xcast_mode (can be set to "linear" or "direct"). Direct mode is the old standard method where we send messages directly to each MPI process. Linear mode sends the xcast message via the orteds, with the HNP sending the message to each orted directly.
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
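
To make the difference between the two relay patterns concrete, here is a small sketch of who sends to whom. It is a textbook illustration, not the Open MPI routing code: the function names are hypothetical, and treating rank 0 as the HNP with the daemons numbered 1..n-1 is an assumption.

```c
/* Linear pattern: the root (rank 0, standing in for the HNP) sends to
 * every other rank itself; nobody else relays. Writes the destination
 * ranks into `out` (room for n - 1 entries) and returns the count. */
int demo_linear_children(int rank, int n, int *out)
{
    int count = 0;
    if (rank == 0) {
        for (int r = 1; r < n; r++) {
            out[count++] = r;
        }
    }
    return count;
}

/* Textbook binomial tree rooted at rank 0: a rank forwards to
 * rank + mask for every power of two `mask` larger than its own rank,
 * as long as the destination exists. */
int demo_binomial_children(int rank, int n, int *out)
{
    int count = 0;
    for (int mask = 1; mask < n; mask <<= 1) {
        if (mask > rank && rank + mask < n) {
            out[count++] = rank + mask;
        }
    }
    return count;
}
```

With 8 ranks, the linear root sends 7 messages itself, while the binomial root sends 3 (to ranks 1, 2, and 4) and the remaining ranks are reached by relays.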

This also involved a slight change to the oob.xcast API, so propagated that as required.

Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)

Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.

This commit was SVN r14475.
2007-04-23 18:41:04 +00:00
Ralph Castain
009be1c1b5 Reorganize the orted code for easier maintenance. Add ability to deliver xcast messages to local procs (not used at this point).
This commit was SVN r14474.
2007-04-23 18:28:20 +00:00
Ralph Castain
b260f8ee36 Enable the job_family API
This commit was SVN r14473.
2007-04-23 18:26:33 +00:00
Ralph Castain
7a57b694bb Allow caller to get session directory name without anything else
This commit was SVN r14472.
2007-04-23 18:25:36 +00:00
Ralph Castain
9cd85ef55a Add a few more error constants that will help provide more definitive output to the user
This commit was SVN r14471.
2007-04-23 18:25:03 +00:00
Brian Barrett
0a8af62c64 Fix broken build on OS X with static compiles. Everything that uses
anything in OPAL *MUST* call either opal_init() or opal_init_util().

This commit was SVN r14468.
2007-04-23 15:45:39 +00:00
Ralph Castain
477828159e Add a few test functions transferred from ORTE trunk
This commit was SVN r14467.
2007-04-23 14:43:55 +00:00
Ralph Castain
f47e7382e3 Add a new function to wake orterun up - used in failed-to-start scenarios, but can be used anytime a lower level needs to ensure orterun wakes up
This commit was SVN r14466.
2007-04-23 12:49:25 +00:00
Ralph Castain
3d4f1b86d2 Modify the name service to provide necessary support for failed-to-start scenarios. Add a new API to get_vpid_range - this should be used in place of the rmgr API of that name to avoid race conditions (will remove that API in later commit).
This commit was SVN r14465.
2007-04-23 12:48:19 +00:00
Josh Hursey
27a42f48d3 Make sure to call opal_init_util before mca_base_open().
This bug(?) became apparent due to the installdirs commit, since these tools
were not finding the proper libraries because the paths were wonky.

It all looks good now. :)

This commit was SVN r14461.
2007-04-21 22:38:15 +00:00
Jeff Squyres
5bebd24250 Bring over Brian's installdirs fixes from this afternoon (r14445).
This commit was SVN r14450.

The following SVN revision numbers were found above:
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:16:31 +00:00
Jeff Squyres
0ba47105ed Merge the /tmp/jms-installdirs-trunk branch into the trunk. This
finally brings in functionality that is already on the 1.2 branch, and
was developed and tested in the v1.2ofed branch (and other places).

Short version of new features:

 * Support for ibv_fork_init() 
 * Automatically fill in the openib BTL bandwidth value by 
   querying the HCA port 
 * Installdirs functionality 
 * Fixes to always use -I in the Fortran wrapper compilers (#924) 
 * Gleb's mpool updates 
 * Remove some kruft in btl/openib/configure.m4, therefore 
   fixing the harmless warnings noted in #665 
 * Bunches of updates to the Linux RPM spec file 

I.e., effectively the same thing that r14411 brought to the v1.2
branch.

Also effectively brought in r14432 and r14433 (some fixes on top of
the original r14411 commit to v1.2).  Still need to bring in the moral
equivalent of r14445 after this commit (fixes to installdirs).

This commit was SVN r14449.

The following SVN revision numbers were found above:
  r14411 --> open-mpi/ompi@83b31314ae
  r14432 --> open-mpi/ompi@a48f160595
  r14433 --> open-mpi/ompi@68f346d2bc
  r14445 --> open-mpi/ompi@13d366b827
2007-04-21 00:15:05 +00:00
Josh Hursey
b9da59ebc3 Fix the way we determine which sequence number to restart with.
Create a sentinel value in the metadata file to clearly indicate
that the sequence number is complete (versus in progress). This
way we do not try to restart from an invalid sequence number
which can lead to badness.
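
The sentinel idea can be sketched as below. The metadata format here ("begin N" and "done N" lines) is invented purely for illustration and is not the format Open MPI writes; `demo_last_complete_seq` is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical metadata format (NOT the real one):
 *   "begin <n>"  checkpoint with sequence number n was started
 *   "done <n>"   sentinel: checkpoint n finished completely
 * Returns the sequence number of the last sentinel in the text,
 * or -1 if no checkpoint has completed. */
int demo_last_complete_seq(const char *metadata)
{
    int last_done = -1;
    const char *line = metadata;

    while (line != NULL && *line != '\0') {
        int seq;
        /* Only sentinel lines count; a "begin" without a matching
         * "done" is an in-progress checkpoint and is ignored. */
        if (sscanf(line, "done %d", &seq) == 1) {
            last_done = seq;
        }
        line = strchr(line, '\n');
        if (line != NULL) {
            line++;         /* step past the newline */
        }
    }
    return last_done;
}
```

A restart would then use the returned value rather than the highest sequence number that merely appears in the file.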

This commit was SVN r14423.
2007-04-19 15:04:27 +00:00
Sven Stork
037b01ce9e - more symbols that need to be exported
This commit was SVN r14415.
2007-04-18 14:53:56 +00:00
Jeff Squyres
1e364218a2 Remove unused variable
This commit was SVN r14413.
2007-04-18 13:10:10 +00:00
Josh Hursey
6ee0c641fd Cleanup the output from orte-checkpoint so it is a bit more clear and references
the sequence number.

Before:
[...] Finished - Global Snapshot Reference: ompi_global_snapshot_1234.ckpt

After:
Snashot Ref.:   1 ompi_global_snapshot_1234.ckpt

This commit was SVN r14381.
2007-04-15 14:28:56 +00:00
George Bosilca
9e840fbe14 The missing orted bug is now fixed. orterun will not deadlock when
the program it tries to spawn is missing.

Description of the problem: When the rsh pls tries to spawn a local
process which is missing (such as a removed orted), orterun
deadlocks.

Description of the fix: The forked child deals with finding the
program to be executed. If it fails to find it, then instead of
calling exit (as a normal forked program is expected to do) it
continues executing along a path it was never
expected to use (back in orterun and then main). Bad things
happen, as expected. Forcing the child to use exit when it fails
to find the orted (and forcing the child to use exit everywhere
instead of return) corrects the logic of the rsh pls and makes it
behave as expected.
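
The rule that a forked child must exit rather than return can be shown with a minimal POSIX sketch. `demo_spawn` is a hypothetical helper, not the rsh PLS code, and the exit code 127 is just the conventional "command not found" value chosen for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `path` in a child process and returns its exit status,
 * or -1 if fork/wait fails or the child died from a signal. */
int demo_spawn(const char *path)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execl(path, path, (char *)NULL);
        /* Only reached if exec failed, e.g. the binary is missing.
         * The child must terminate here. Returning instead would let
         * the child keep running the parent's code path. */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with a path that does not exist returns 127 in the parent instead of leaving a second copy of the caller running.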

This commit was SVN r14377.
2007-04-14 17:36:27 +00:00
Ralph Castain
adb44c44b1 Revert prior commits from last night that involved significant change to the GPR, along with cosmetic changes to the odls_default module pending review and test.
Reverts r14328, r14329, r14331, r14333, r14335, r14338, and r14336.

This commit was SVN r14351.

The following SVN revision numbers were found above:
  r14328 --> open-mpi/ompi@d1ce4a44ca
  r14329 --> open-mpi/ompi@604e79f2d2
  r14331 --> open-mpi/ompi@b2b3417475
  r14333 --> open-mpi/ompi@8882f355b4
  r14335 --> open-mpi/ompi@10dfd534f6
  r14336 --> open-mpi/ompi@5c65c55e59
  r14338 --> open-mpi/ompi@579184cd72
2007-04-12 13:13:28 +00:00
Jeff Squyres
51f286d737 Just like r14289 on the ORTE trunk:
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search path is checking
$pkglibdir).

This commit was SVN r14345.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r14289
2007-04-12 11:19:42 +00:00
George Bosilca
1c037df7e7 Only print information if the condition is met.
This commit was SVN r14340.
2007-04-12 07:28:18 +00:00
George Bosilca
579184cd72 Roll back commit r14335; it got into the trunk too early.
This commit was SVN r14338.

The following SVN revision numbers were found above:
  r14335 --> open-mpi/ompi@10dfd534f6
2007-04-12 06:21:59 +00:00
George Bosilca
5c65c55e59 Few cleanups. The most important is getting rid of the orte_bitmap_t class
which is not used anymore in the orte code.

This commit was SVN r14336.
2007-04-12 05:50:33 +00:00
George Bosilca
10dfd534f6 Correctly remove the itag if we fail the condition. And be pedantic with the code.
This commit was SVN r14335.
2007-04-12 05:33:31 +00:00
George Bosilca
b882b7e1b3 Update the Windows ODLS.
This commit was SVN r14334.
2007-04-12 05:19:25 +00:00
George Bosilca
8882f355b4 Move these functions to their right place.
This commit was SVN r14333.
2007-04-12 05:18:23 +00:00
George Bosilca
9de6ae0753 ORTE_MODULE_DECLSPEC is not required here.
This commit was SVN r14332.
2007-04-12 05:17:03 +00:00
George Bosilca
b2b3417475 A more optimized version of the orte_gpr_replica_check_itag_list function. It strictly
follows the same behavior as before; the changes just make sure the check is done
in linear time and the memory usage is kept to a minimum.

This commit was SVN r14331.
2007-04-12 05:13:10 +00:00
George Bosilca
604e79f2d2 There is a cleanup label, so I expect to use it in all cases.
This commit was SVN r14329.
2007-04-12 05:05:36 +00:00
George Bosilca
d1ce4a44ca Fix small memory leak (only happens in debug mode).
This commit was SVN r14328.
2007-04-12 05:02:57 +00:00
George Bosilca
cad93a7693 Add more output. Fix some typos, and some small cleanups.
This commit was SVN r14327.
2007-04-12 05:01:29 +00:00
George Bosilca
0d82473b9d Enable the null IOF.
This commit was SVN r14326.
2007-04-12 05:00:05 +00:00
George Bosilca
f5478d95df Don't do anything if the array is already empty.
This commit was SVN r14325.
2007-04-12 04:58:47 +00:00
George Bosilca
e7c4f1ca64 Remove some unused code and correct the finalize function (cancel the pending
receive request).

This commit was SVN r14324.
2007-04-12 04:58:12 +00:00
George Bosilca
4a87c782c3 Release all unselected components. This is a little bit more tricky than usual,
as the IOF components lack the required finalize function. Instead rely on the
module finalize. Read the comment for more information.

This commit was SVN r14323.
2007-04-12 04:57:08 +00:00
George Bosilca
c15cd5e4ab Unload all unnecessary PLS. Once the selection process is done, we should release all
unselected PLS. This decreases the footprint of all Open MPI based processes.

This commit was SVN r14322.
2007-04-12 04:55:23 +00:00
George Bosilca
af6891f471 Fix a small typo.
This commit was SVN r14321.
2007-04-12 04:53:30 +00:00
Tim Prins
6872f21af0 remove unused variable
This commit was SVN r14306.
2007-04-11 17:15:14 +00:00
Pak Lui
e9e8dc2765 * comment out unused code
This commit was SVN r14297.
2007-04-10 22:38:34 +00:00
Josh Hursey
cd5047a9bf Refs trac:976
Collect the base 'orted' command line into a base function since most of the
PLS components were duplicating this code. Add AMCA parameter command line
component to the base set.

Add Aggregate MCA parameter support to the following PLS components:
 - gridengine
 - process
 - slurm
 - poe
 - tm

Improve support for 'rsh' component.

Did/could not support the following components:
 - bproc
 - proxy
 - xcpu
 - cnos
 - xgrid

The above components had peculiar needs that made it non-trivial to add an 
option. The authors of these components need to help in supporting this
new option.

I was only able to test the SLURM and RSH components due to system availability.
The others should work without problem.

This commit was SVN r14284.

The following Trac tickets were found above:
  Ticket 976 --> https://svn.open-mpi.org/trac/ompi/ticket/976
2007-04-10 14:23:32 +00:00
Tim Prins
1e7ff7f0fe Fix another buglet.
This commit was SVN r14270.
2007-04-09 17:54:11 +00:00
Tim Prins
2ffc02870d Reduce the memory usage of the GPR:
- Make it so that all the GPR pointer arrays are allocated initially at 16 elements instead of 512. This saves (on a 64 bit machine) approximately 4*(# procs + # nodes) KB.
- Fix up the segment prealloc function so that preallocating an existing segment is not an error, and make the areas where we do large inserts use it.

Fix the orte_pointer_array to efficiently implement setting its size. Before we just realloced the array one block at a time until the desired size was reached. Now we resize it all in one realloc.
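
The single-realloc resize can be sketched as follows. The struct and function names are hypothetical stand-ins, not the real orte_pointer_array API, and rounding the new size up to a whole number of blocks is an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a growable pointer array. */
typedef struct {
    void  **slots;   /* the pointer storage */
    size_t  size;    /* number of allocated slots */
    size_t  block;   /* growth granularity in slots, must be > 0 */
} demo_ptr_array_t;

/* Grows the array to hold at least `want` slots using one realloc,
 * instead of looping one block at a time. Returns 0 on success and
 * -1 on allocation failure (the array is left unchanged). */
int demo_ptr_array_set_size(demo_ptr_array_t *a, size_t want)
{
    if (want <= a->size) {
        return 0;                        /* already large enough */
    }

    /* Round up to a whole number of blocks. */
    size_t blocks   = (want + a->block - 1) / a->block;
    size_t new_size = blocks * a->block;

    void **p = realloc(a->slots, new_size * sizeof(void *));
    if (p == NULL) {
        return -1;
    }

    /* Only the newly added slots need to be cleared. */
    for (size_t i = a->size; i < new_size; i++) {
        p[i] = NULL;
    }
    a->slots = p;
    a->size  = new_size;
    return 0;
}
```

Starting from an empty array with a block of 16, asking for 100 slots produces one allocation of 112 slots rather than seven successive reallocations.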

This commit was SVN r14264.
2007-04-09 00:40:15 +00:00
Brian Barrett
13a4bba13f Yet another dumb thing that shouldn't have been in r14261.
This commit was SVN r14263.

The following SVN revision numbers were found above:
  r14261 --> open-mpi/ompi@8a55c84d0b
2007-04-07 23:23:23 +00:00
Brian Barrett
32f0090f81 fix dumb variable scope mistake
This commit was SVN r14262.
2007-04-07 23:00:57 +00:00
Brian Barrett
8a55c84d0b Fix a number of OOB issues:
  * Remove the connect() timeout code, as it had some nasty race conditions
    when connections were established as the trigger was firing.  A better
    solution has been found for the cluster where this was needed, so just
    removing it was easiest.
  * When a fatal error (too many connection failures) occurs, set an error
    on messages in the queue even if there isn't an active message.  The
    first message to any peer will be queued without being active (and
    so will all subsequent messages until the connection is established),
    and the orteds will hang until that first message completes.  So if
    an orted can never contact its peer, it will never exit and just sit
    waiting for that message to complete.
  * Cover an interesting RST condition in the connect code.  A connection
    can complete the three-way handshake, the connector can even send
    some data, but the server side will drop the connection because it
    can't move it from the half-connected to fully-connected state because
    of space shortage in the listen backlog queue.  This causes a RST to
    be received the first time that recv() is called, which will be when waiting
    for the remote side of the OOB ack.  In this case, transition the
    connection back into a CLOSED state and try to connect again.
  * Add levels of debugging, rather than all or nothing, each building on
    the previous level.  0 (default) is hard errors.  1 is connection
    error debugging info.  2 is all connection info.  3 is more state
    info.  4 includes all message info.
  * Add some hopefully useful comments
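
The cumulative debug levels in the list above can be expressed with a simple threshold check. The enum and function names below are hypothetical; only the numeric meanings of levels 0 through 4 come from the commit message.

```c
/* Level names are made up for this sketch; the numbers follow the
 * meanings given in the commit message. */
enum demo_oob_debug {
    DEMO_DBG_ERROR        = 0,  /* hard errors (default) */
    DEMO_DBG_CONNECT_FAIL = 1,  /* connection error debugging info */
    DEMO_DBG_CONNECT      = 2,  /* all connection info */
    DEMO_DBG_STATE        = 3,  /* more state info */
    DEMO_DBG_ALL          = 4   /* all message info */
};

/* A message is printed when its level is at or below the configured
 * verbosity, so each setting includes everything from the lower ones.
 * Returns nonzero if the message should be printed. */
int demo_should_log(int verbosity, int msg_level)
{
    return msg_level <= verbosity;
}
```

At verbosity 2, for example, hard errors and all connection messages are printed, while state and per-message output stay suppressed.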

This commit was SVN r14261.
2007-04-07 22:33:30 +00:00
Tim Prins
df4c468bb4 fix some more minor memory leaks
This commit was SVN r14260.
2007-04-07 18:41:16 +00:00
Tim Prins
8e7765e456 Fix a gigantic memory leak. We were copying a message to send into a buffer, then never freeing the copy we made. But we were mistakenly allocating the buffer on the stack, so the memory checking tools never caught the leak. On 96 nodes, 384 processes, mpirun memory usage went from about 12M to 3M for me after this minor change...
This commit was SVN r14257.
2007-04-07 02:25:48 +00:00
Tim Prins
e058266c96 Change the ORTE datatype service in 2 ways:
1. Remove an unneeded field, bytes_avail, from orte_buffer_t. It is a calculated value, and updating it everywhere is worse than just calculating it in the one place it is actually used.
2. Change it so the default size of an orte_buffer is 128 bytes instead of 1024 bytes. We then double the size of the buffer up to 1024 bytes, then we additively increase the size by 1024 bytes at a time as was done before.
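
The growth policy in item 2 can be written as a small function. `demo_next_size` and the two constants are hypothetical names; the 128-byte start, the doubling up to 1024 bytes, and the 1024-byte additive step are taken from the description above.

```c
#include <stddef.h>

#define DEMO_INITIAL_SIZE   128   /* starting capacity in bytes */
#define DEMO_THRESHOLD     1024   /* doubling stops here; also the additive step */

/* Returns the capacity to grow to from `current` bytes:
 * start at 128, double until 1024, then add 1024 at a time. */
size_t demo_next_size(size_t current)
{
    if (current == 0) {
        return DEMO_INITIAL_SIZE;
    }
    if (current < DEMO_THRESHOLD) {
        size_t doubled = current * 2;
        return doubled > DEMO_THRESHOLD ? DEMO_THRESHOLD : doubled;
    }
    return current + DEMO_THRESHOLD;
}
```

Repeated calls starting from 0 give 128, 256, 512, 1024, 2048, 3072, so small buffers stay small while large ones grow exactly as they did before the change.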

This commit was SVN r14252.
2007-04-06 19:40:29 +00:00
George Bosilca
33bf6c6e54 Move the comment to the right place.
This commit was SVN r14237.
2007-04-05 20:36:33 +00:00
George Bosilca
5c355d0bea Always return an initialized variable. More output if we fail to read
from the shell detection child. Don't spawn orted, instead spawn what's
inside the mca_pls_rsh_component.orted.

This commit was SVN r14236.
2007-04-05 20:17:10 +00:00
George Bosilca
ef4baeb6ab Don't reset the pid, as at this point it is already set.
This commit was SVN r14235.
2007-04-05 20:13:50 +00:00
George Bosilca
8fb8363868 Correctly detect the remote shell, and the local one. Big clean-up on how we
deal with the PLS RSH. Remove support for unknown user (i.e. if the user is
not known by the system, then it shouldn't be allowed to spawn anything).

This commit was SVN r14232.
2007-04-05 19:22:26 +00:00
Josh Hursey
8fd6d4ba09 add a newline so output is cleaner/clearer
This commit was SVN r14229.
2007-04-05 17:45:03 +00:00
Ralph Castain
e95539a16a Add two new test codes - orte_loop_spawn/child - to help debug issues surrounding multiple calls to comm_spawn
This commit was SVN r14217.
2007-04-04 21:02:18 +00:00
Jeff Squyres
2cbcb4abf1 Remove the French and strip the tests down to essentials (no need for
buffer attaching/detaching, for example).

This commit was SVN r14216.
2007-04-04 15:38:23 +00:00
Ralph Castain
d5b5cd2d3c Add test code for multiple comm_spawn calls.
Add ERROR_LOG calls to more clearly document failures in the rsh launcher.

This commit was SVN r14214.
2007-04-04 13:24:39 +00:00
Jeff Squyres
fe58753a23 Add a little documentation to iof.h.
This commit was SVN r14208.
2007-04-03 18:17:35 +00:00
George Bosilca
f2a6b9394f Deal with the include spree. Protect "environ" on Windows.
Some other minor modifications in order to make it
compile [again] on Windows.

This commit was SVN r14188.
2007-04-01 16:16:54 +00:00
George Bosilca
01a4f56369 Mostly DECLSPEC cleanups and some include corrections.
This commit was SVN r14186.
2007-04-01 16:08:27 +00:00
Tim Prins
2f74160a37 Fix some more memory leaks
This commit was SVN r14175.
2007-03-30 13:43:50 +00:00
George Bosilca
d367d9017c Need the definition of opal_output_close.
This commit was SVN r14167.
2007-03-29 01:18:26 +00:00
Tim Prins
9cb455272b Fix a pile of memory leaks in ORTE.
Fix a major memory leak in the SLURM RAS, and cleanup a bit of code there.

This commit was SVN r14164.
2007-03-29 00:50:56 +00:00
Sven Stork
44ead58103 - export component structure
This commit was SVN r14139.
2007-03-26 13:46:00 +00:00
Ralph Castain
0d98264097 Fix the nolocal option on the OMPI trunk
This commit was SVN r14138.
2007-03-24 16:16:16 +00:00
Galen Shipman
48d1fa830d A race condition exists on the free list of pending connections because
OPAL_FREE_LIST_WAIT/RETURN will not use locks in a non-threaded build.
Conditionally using locks around OPAL_FREE_LIST_WAIT/RETURN in the non-threaded case
seems to fix the issue.
Tested at 4K processes and seems to work.

This commit was SVN r14135.
2007-03-23 15:19:03 +00:00
Brian Barrett
d454395b51 Need to fall back on the event listen mode if the MCA parameter said use the
listen thread, but we're not the HNP.  This is better than not starting up
any listen mode, which is what we were doing before :/

This commit was SVN r14133.
2007-03-23 13:29:18 +00:00
Jeff Squyres
bcdfbacaa4 Oops -- typo from previous commit. :-(
This commit was SVN r14130.
2007-03-23 00:51:50 +00:00
Jeff Squyres
2105f444ec Add missing header file
This commit was SVN r14129.
2007-03-23 00:47:30 +00:00
Jeff Squyres
a3dd0f2e08 Connect --nolocal up to the MCA param rmaps_base_schedule_local, as it
should be (it's a mistake that it got left out).

This commit was SVN r14127.
2007-03-22 19:29:47 +00:00
Sven Stork
6111ca1152 - Let's try to detect the default nodefile directory because it can differ
  between sites. If we cannot detect the default then we fall back to
  the hard coded path.

This commit was SVN r14121.
2007-03-22 15:26:16 +00:00
Galen Shipman
e654604a25 remove invalid comment
This commit was SVN r14118.
2007-03-22 03:51:36 +00:00
Josh Hursey
3492fdeae3 Fix a couple of compiler warnings (errors?) caught by ICC testing at Cisco.
This commit was SVN r14080.
2007-03-20 14:12:13 +00:00
Rainer Keller
1322f9f346 - Further attributes mainly for opal/* functions, marking
   __opal_attribute_nonnull__, __opal_attribute_warn_unused_result__,
   __opal_attribute_malloc__, __opal_attribute_sentinel__ and
   __opal_attribute_format__

This commit was SVN r14078.
2007-03-20 13:01:32 +00:00
Pak Lui
803655b555 * incorporated some of Jeff's comment regarding this fix.
This commit was SVN r14070.
2007-03-19 21:59:48 +00:00
Pak Lui
da4d41e0e7 * fixed the missing fclose and eliminated the call to get_slot_count
since it is not needed

This commit was SVN r14066.
2007-03-19 17:47:30 +00:00
Rich Graham
d2e799f6b5 add some stub functions for the cnos environment.
This commit was SVN r14065.
2007-03-19 17:35:46 +00:00
Josh Hursey
101a2abd09 - Be more careful with parens
- Run the destructor *before* shutting things down.

This commit was SVN r14064.
2007-03-19 17:33:20 +00:00
Brian Barrett
ea08a555f9 Fixed a compile error on OS X 10.3 introduced with 1.1.5 / 1.2. Thanks
to Marius Schamschula for reporting the issue.

This commit was SVN r14063.
2007-03-19 17:25:54 +00:00
Josh Hursey
a181c987cc Remove some old references to ft_enable parameter that no longer exists.
This was replaced by the "-am ft-enable-cr" AMCA parameter.

This commit was SVN r14055.
2007-03-17 20:02:42 +00:00
Josh Hursey
d03073e87d Make sure to protect the finalize call so tools like ompi_info
do not segv.

This commit was SVN r14054.
2007-03-17 19:47:54 +00:00
Josh Hursey
dadca7da88 Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD).
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.

This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.

This commit closes trac:158

More details to follow.

This commit was SVN r14051.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r13912

The following Trac tickets were found above:
  Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
2007-03-16 23:11:45 +00:00
Jeff Squyres
c000ee5328 Fixes trac:921
 * Do not empty the list of in-flight frags during _close(); the OOB
   callback will still occur (_send_cb()) and try to remove the frag
   from the list, which will then result in an assert failure (debug
   builds).  
 * Add one more fix for a possible problem -- add an extra RETAIN /
   RELEASE pair on the endpoint to ensure that it is not actually
   freed before all in-flight frags have drained.

This commit was SVN r13953.

The following Trac tickets were found above:
  Ticket 921 --> https://svn.open-mpi.org/trac/ompi/ticket/921
2007-03-07 20:12:22 +00:00
Tim Prins
fe3ea0085f Fix minor memory leaks
This commit was SVN r13946.
2007-03-07 01:09:38 +00:00
Jeff Squyres
7b72ded10c Patch from Gotz Waschk to recognize zsh.
This commit was SVN r13907.
2007-03-03 01:42:03 +00:00
Li-Ta Lo
a0e5b6a27c minor clean up and treespawn support
This commit was SVN r13876.
2007-03-01 22:32:37 +00:00
Josh Hursey
0404444dbe * Added 2 new MCA parameters
  - mca_base_param_file_prefix
     (Default: NULL)
     This is the full name of the "-am" mpirun option. Used to specify a ':'
     separated list of AMCA parameter set files.
  - mca_base_param_file_path
     (Default: $SYSCONFDIR/amca-param-sets/:$CWD)
     The path to search for AMCA files with relative paths. A warning will be
     printed if the AMCA file cannot be found.

* Added a new function "mca_base_param_recache_files" that re-reads the file
configurations. This is used internally to help bootstrap the MCA system.

* Added a new orterun/mpirun command line option '-am' that is an alias for the
mca_base_param_file_prefix MCA parameter

* Exposed the opal_path_access function as it is generally useful in other
places in the code.

* New function "opal_cmd_line_make_opt_mca" which will allow you to append a
new command line option with MCA parameter identifiers to set at the same
time. Previously this could only be done at command line declaration time.

* Added a new directory under the $pkgdatadir named "amca-param-sets" where all
the 'shipped with' Open MPI AMCA parameter sets are placed. This is the first
place to search for AMCA sets with relative paths.

* An example.conf AMCA parameter set file is located in
contrib/amca-param-sets/.

* Jeff Squyres contributed an OpenIB AMCA set for benchmarking.

Note: You will need to autogen with this commit as it adds a configure param.
  Sorry :(

This commit was SVN r13867.
2007-03-01 13:39:20 +00:00
Rainer Keller
0889ebd59f - Eliminate warnings, that PGI-6.2.5 issues with -Minform=inform
This commit was SVN r13840.
2007-02-28 08:36:34 +00:00
George Bosilca
4bab882d17 These 2 ORTE_DECLSPEC are not required.
This commit was SVN r13825.
2007-02-27 15:45:40 +00:00
Sven Stork
d8a369936e - Fix more symbols that should be exported.
This commit was SVN r13824.
2007-02-27 15:17:17 +00:00
Sven Stork
a86deb460e - export required symbols
This commit was SVN r13810.
2007-02-27 09:43:32 +00:00
Tim Prins
c6f2efe4b8 These are orte functions, the structure should be named as such
This commit was SVN r13765.
2007-02-22 23:29:31 +00:00
George Bosilca
d29423b1f7 orted_globals_t should be global.
This commit was SVN r13684.
2007-02-16 18:16:06 +00:00
Brian Barrett
f6a5d58885 Rather than set the connect event timeout number to something big and hoping
it's bigger than the timeout for the connect() call, just don't register
the handler by default and fall back to connect() timing out.  Should give
much happier performance on big clusters.

This commit was SVN r13639.
2007-02-13 18:36:50 +00:00
Pak Lui
085826d94a * Remove the code for putting the bogus exit status of the user proc.
Also remove the smr set_proc_state since it's covered elsewhere.

This commit was SVN r13625.
2007-02-12 23:59:27 +00:00
Brian Barrett
8b28e5b33d Allow the OOB to connect between all MPI applications during MPI_INIT
without also establishing MPI connectivity. 

This commit was SVN r13595.
2007-02-09 20:17:37 +00:00
Brian Barrett
262cbbc5c9 Back out r13593, which contained a change that shouldn't be committed.
This commit was SVN r13594.

The following SVN revision numbers were found above:
  r13593 --> open-mpi/ompi@81472363ea
2007-02-09 20:13:02 +00:00
Brian Barrett
81472363ea Allow the OOB to connect between all MPI applications during MPI_INIT
without also establishing MPI connectivity.

This commit was SVN r13593.
2007-02-09 20:11:40 +00:00
Pak Lui
2d6b3776bf * fix the SEGV described in trac #892, where an exit_status in the 200 range
  causes strsignal to return NULL. Still trying to determine
  why exit_status is in that range.

This commit was SVN r13583.
2007-02-09 16:39:30 +00:00
Ralph Castain
5818a32245 Bring in a forgotten speed improvement for the TM launcher that was developed during SNL Tbird testing last year. Remove the redundant and slow calls to TM to resolve hostnames. Instead, read the host info from the PBS file during the RAS, and then just use that info in the PLS (rather than getting it again).
Adjust the RMAPS mapped_node object to propagate the required launch_id info now included in the ras_node object. This provides support for those few systems that don't use nodename to launch, but instead want some id (typically an index into the array of allocated nodes). This value gets set for each node in the RAS - the RMAPS just propagates it for easy launch.

This commit was SVN r13581.
2007-02-09 15:06:45 +00:00
George Bosilca
79d76b044a ORTE_DECL everything that can be used outside the base directory. I
wonder why this file is called private when it's included by all PLS ...

This commit was SVN r13573.
2007-02-09 03:16:19 +00:00
George Bosilca
7750ed22e0 Correct the Windows part of the universe detection.
This commit was SVN r13547.
2007-02-07 22:37:28 +00:00
Pak Lui
ccff0a6e65 * minor fix to correct the pid that always shows up as 0 in the abort
error message. e.g: 

  mpirun noticed that job rank 2 with PID 0 on node burl-ct-v440-4
  exited on signal 15 (Terminated).

This commit was SVN r13537.
2007-02-07 17:46:19 +00:00
Ralph Castain
890e3c7981 Reset the trunk so that the odls now sets the paffinity and sched_yield params again. The sched_yield is still overridden by any user-specified setting.
This change utilizes the new num_processors function. I also left the mods made to ompi_mpi_init and the bug fix for the default value of mpi_yield_when_idle. Note that the mods to mpi_init will not really take effect as the mca param will now *always* be set (either by user or odls). We will need those mods later, so no point in removing them now.

This commit was SVN r13519.
2007-02-06 19:51:05 +00:00
Jeff Squyres
c91fcd7fbd Fix a bunch of minor typos submitted by Bernhard Fischer.
This commit was SVN r13505.
2007-02-06 12:00:30 +00:00
Rolf vandeVaart
dcce8c739c Fix compiler warning. I am not sure how this got
past us, but thanks to Jeff Squyres for pointing it out.

This commit was SVN r13501.
2007-02-05 22:03:58 +00:00
Rolf vandeVaart
74e3b68ce8 Better document orte-clean's behavior.
This commit was SVN r13498.
2007-02-05 20:01:15 +00:00
Ralph Castain
26897a626d Add a delayed_abort test code. We seem to handle this case just fine now, but Sun reports still seeing troubles on Solaris.
This commit was SVN r13493.
2007-02-05 15:24:01 +00:00
Jeff Squyres
4e506e69e5 Add missing <sys/param.h>
This commit was SVN r13478.
2007-02-03 01:11:35 +00:00
Rolf vandeVaart
bf5113198d Update to orte-clean so it will remove files on local and
remote nodes.  It will also kill off rogue orteds and orterun
processes.  The killing of processes is ifdef'ed out for Windows
since I do not know how to do it there.  Note that this change
will require an autogen.

This commit was SVN r13477.
2007-02-03 00:25:42 +00:00
Ralph Castain
a8202742ba Fix a missing function pointer - reference ticket #854
This commit was SVN r13476.
2007-02-02 23:10:14 +00:00
Jeff Squyres
f6e7016cdd Make this test capable of running more than "-np 1". If you run with
"-np X", it will launch X parents and then MPI_COMM_SPAWN X additional
children.

This commit was SVN r13466.
2007-02-02 14:34:53 +00:00
Ralph Castain
3daf8b341b Fix the sched_yield problem for generic environments. We now determine and set sched_yield during mpi_init based on the following logical sequence:
1. if the user has specified sched_yield, we simply do what we are told

2. if they didn't specify anything, try to get the number of processors on this node. Note that we already now get the number of local procs in our job that are sharing this node - that now comes in through the proc callback and is stored in the ompi_proc_t structures.

3. if we can get the number of processors, compare that to the number of local procs from my job that are sharing my node. If the number of local procs exceeds the number of processors, then set sched_yield to true. If not, then be a hog and set sched_yield to false

4. if we can't get the number of processors, default to conservative behavior and set sched_yield to true.
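
The four-step decision above can be sketched as a single function. This is only an illustration: the function name, the -1 sentinels, and the parameter names are hypothetical and are not taken from the Open MPI sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not the actual ompi_mpi_init code.
 * user_choice:     -1 if the user gave no setting, otherwise 0 or 1.
 * num_processors:  processors on this node, or -1 if unknown.
 * num_local_procs: procs from our own job sharing this node. */
bool decide_sched_yield(int user_choice, int num_processors,
                        int num_local_procs)
{
    assert(num_local_procs >= 1);

    if (user_choice >= 0) {
        return user_choice != 0;     /* 1. do what we are told */
    }
    if (num_processors < 0) {
        return true;                 /* 4. unknown: be conservative */
    }
    /* 2./3. yield only when local procs exceed the processors */
    return num_local_procs > num_processors;
}
```

As the message notes, this only looks at procs within a single job; procs added later via comm_spawn are not covered by such a check.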

Note that I have not yet dealt with the need to dynamically adjust this setting as more processes are added via comm_spawn. So far, we are *only* looking within our own job. Given that we have now moved this logic to mpi_init (and away from the orteds), it isn't yet clear to me how a process will be informed about the number of procs in *other* jobs that are also sharing this node.

Something to continue to ponder.

This commit was SVN r13430.
2007-02-01 19:31:44 +00:00
Ralph Castain
c754523a14 Add cancel_operations to the pls module definition for tm
This commit was SVN r13416.
2007-02-01 16:52:28 +00:00
Ralph Castain
51fb746da3 Stop overriding the yield_when_idle mca param if the user has set it
This commit was SVN r13414.
2007-02-01 15:01:12 +00:00
George Bosilca
9f73335bdb Silence the compiler.
This commit was SVN r13381.
2007-01-31 04:24:56 +00:00
Jeff Squyres
8d872b195a Refs trac:726
Tested this functionality quite a bit more and made some fixes:

 * Print far fewer help messages
 * Fix one additional deadlock upon error
 * Change some ORTE_LOG messages to silent (because they're not
   errors)
 * Some code got re-indented, sorry...

Discussed and reviewed with Ralph.

This commit was SVN r13375.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-30 23:03:13 +00:00
Jeff Squyres
78a13bc3ea Fix the MPI_ABORT problem. We added an orte_initialized variable
yesterday and set it to "true" in orte_init().  But ompi_mpi_init()
doesn't call orte_init() -- it calls orte_init_stage1() and
orte_init_stage2(). So orte_initialized was never set to true, and
Badness happened from there (w.r.t. ompi_mpi_abort()).

This patch moves the setting of orte_initialized to orte_init_stage2()
so that everyone will always get it set properly.

It also moves setting orte_universe_info.state to RUNNING into
stage2() as well -- Ralph confirmed that that should have been there
for the same reasons that orte_initialized needs to be there.

This commit was SVN r13374.
2007-01-30 23:00:43 +00:00
Rainer Keller
061ba05439 - Fixes uncovered with the format attribute to
opal_output and opal_output_verbose

This commit was SVN r13371.
2007-01-30 20:56:31 +00:00
Rainer Keller
3669e8921e - Fix further compiler warnings regarding initialization
and shadowing variables.

This commit was SVN r13358.
2007-01-30 06:34:38 +00:00
George Bosilca
dea69e3c7c Remove one of the %s.
This commit was SVN r13357.
2007-01-30 03:56:48 +00:00
Jeff Squyres
e90b3e415b * Before this commit, if we called ompi_mpi_abort() before MPI_INIT
completed successfully, Bad Things(tm) could happen.
 * Now we explicitly check orte_initialized (a new global in ORTE
   indicating whether we are between orte_init() and orte_finalize()
   or not), and if so, react accordingly.
 * If ORTE is initialized, use orte_system_info.nodename; otherwise,
   use gethostname().
 * Add loop protection to ensure that ompi_mpi_abort() is not invoked
   multiple times recursively.
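
A minimal sketch of the loop protection and the nodename fallback described above. All names here are hypothetical; the real code uses orte_initialized, orte_system_info.nodename and gethostname(), whereas this sketch takes the fallback host name as a parameter to stay portable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool abort_in_progress = false;   /* loop-protection flag */

/* Returns 1 if this call did the abort work, 0 if it was a nested
 * (recursive) call that was ignored. */
int guarded_abort(bool runtime_initialized, const char *runtime_nodename,
                  const char *fallback_host, char *msg, size_t msg_len)
{
    assert(msg != NULL && msg_len > 0 && fallback_host != NULL);

    if (abort_in_progress) {
        return 0;                         /* already aborting: do nothing */
    }
    abort_in_progress = true;

    /* Only trust runtime state if the runtime is actually initialized. */
    const char *host = (runtime_initialized && runtime_nodename != NULL)
                           ? runtime_nodename
                           : fallback_host;
    snprintf(msg, msg_len, "abort requested on host %s", host);
    return 1;
}
```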

This commit was SVN r13354.
2007-01-29 22:01:28 +00:00
Rich Graham
f6c99d0207 set orte_odls_base.components_available to false if no odls components are
available.  Startup now works if no odls components are available.

This commit was SVN r13339.
2007-01-27 15:37:13 +00:00
Jeff Squyres
3c5c8c3c4c Refinement of Rainer's r13227 and r13228 (worked with Rainer, Ralph,
and George on these refinements):

 * Rename the static OBJ initializer macro to be
   OPAL_OBJ_STATIC_INIT(class)
 * Ensure that all static OBJ initializations get a refcount of 1
   (doesn't ''really'' matter, since they're static, it should never
   get to the point where the OBJ is DESTRUCTed, but more correct
   nonetheless)
 * Add a "magic number" to the OBJ when compiling with debug support.
   The magic number does some rudimentary support to ensure that
   you're operating on a valid OBJ (and fails an assertion if you're
   not).  Check to ensure that the memory contains the magic number
   when performing actions on OBJs.  Also remove the magic number
   when DESTRUCTing OBJs, so that if, for example, an OBJ is
   DESTRUCTed more than once, we'll fail the magic number assert.

This commit was SVN r13338.

The following SVN revision numbers were found above:
  r13227 --> open-mpi/ompi@96030de97b
  r13228 --> open-mpi/ompi@c2e9075d29
2007-01-27 13:44:03 +00:00
Jeff Squyres
974dcebf9f Finish backing out r13316 by also removing the comments that it
inserted.

This commit was SVN r13324.

The following SVN revision numbers were found above:
  r13316 --> open-mpi/ompi@35c1370a13
2007-01-26 13:09:18 +00:00
George Bosilca
668a2bd7ac Remove some debug output.
This commit was SVN r13323.
2007-01-26 08:09:22 +00:00
George Bosilca
bd7eebda83 Deal with the argv problem from r13321 for the Windows PLS.
This commit was SVN r13322.

The following SVN revision numbers were found above:
  r13321 --> open-mpi/ompi@b439e87f96
2007-01-26 07:21:07 +00:00
George Bosilca
b439e87f96 We have this one starting from r12059. We save a pointer to the argv[*] and
then we modify the argv, forcing the reallocation of the array. With luck
the saved pointer still has a meaning ... without it, execve returns with error
14 (EFAULT).
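
A sketch of why the saved pointer goes stale, and one way around it: remember the index of the slot rather than its address. The helper below is hypothetical and is not the opal_argv API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Append a copy of `arg` to a NULL-terminated, heap-allocated argv.
 * Returns 0 on success, -1 on allocation failure. */
int argv_append(char ***argv, int *argc, const char *arg)
{
    assert(argv != NULL && argc != NULL && arg != NULL);

    /* realloc may move the whole array, invalidating any saved &(*argv)[i]. */
    char **grown = realloc(*argv, (size_t)(*argc + 2) * sizeof(char *));
    if (grown == NULL) {
        return -1;
    }
    *argv = grown;
    grown[*argc] = NULL;            /* keep the array terminated on failure */

    size_t len = strlen(arg) + 1;
    char *copy = malloc(len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, arg, len);

    grown[*argc] = copy;
    grown[*argc + 1] = NULL;
    (*argc)++;
    return 0;
}
```

A caller that needs to patch an argument later would keep `int slot = argc;` before appending and use `argv[slot]` afterwards, which stays valid no matter how often the array is reallocated.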

This commit was SVN r13321.

The following SVN revision numbers were found above:
  r12059 --> open-mpi/ompi@ae79894bad
2007-01-26 07:06:52 +00:00
George Bosilca
29597cf0c5 We need to initialize the ODLS as they are the only ones to define
the ORTE_DAEMON_CMD type, which, unfortunately, is used all over
the place. Without this, we get the following errors:
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../ompi-trunk/orte/dss/dss_pack.c at line 83
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../ompi-trunk/orte/dss/dss_pack.c at line 58
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../../../ompi-trunk/orte/mca/pls/base/pls_base_orted_cmds.c at line 136

This commit was SVN r13320.
2007-01-26 04:32:15 +00:00
Ralph Castain
0905dfdfba Make sure the params.h file gets included in the tarballs
This commit was SVN r13318.
2007-01-26 03:05:30 +00:00
Rich Graham
35c1370a13 odls components are handled only by daemon procs.
This commit was SVN r13316.
2007-01-25 21:18:59 +00:00
Rich Graham
3488b394be fix typo in name of the cancel operation.
This commit was SVN r13312.
2007-01-25 19:07:27 +00:00
Jeff Squyres
580a7a108c Fix a compiler warning.
This commit was SVN r13310.
2007-01-25 17:22:01 +00:00
Tim Prins
e199bf9b64 Refs trac:801
Fix compiler warning

This commit was SVN r13308.

The following Trac tickets were found above:
  Ticket 801 --> https://svn.open-mpi.org/trac/ompi/ticket/801
2007-01-25 16:12:05 +00:00
Ralph Castain
ab5ea61100 Bring over the rest of the ctrl-c fixes. This commit includes:
1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned.

2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up.

3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded

4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond

This needs more testing before migrating to 1.2.

This commit was SVN r13304.
2007-01-25 14:17:44 +00:00
Ralph Castain
53967bd698 Fix a memory corruption problem deep inside the registry when subscriptions/triggers are processed. The create_value function will malloc space for the pointers to keyval objects, but doesn't actually allocate space for the objects themselves. When constructing the gpr_notify_data object, we forgot to OBJ_NEW the keyval objects. Since the create_value function didn't explicitly NULL those memory locations, it just so happened that there was a non-NULL address in them....which we dutifully dumped a keyval into.
This fix includes two parts: (a) we now initialize the keyval pointer locations to NULL after the malloc, and (b) we now OBJ_NEW the keyvals prior to storing info in them.

BTW, in case anyone reads this and wonders why we don't just OBJ_NEW the keyvals in create_value, the reason is simply that some places in the code use static keyvals and simply assign those addresses into the value object's array. So not everyone wants to OBJ_NEW keyvals - by not forcing it here in create_value, we give the user the flexibility to do whatever they want.
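
The two-part fix can be sketched like this. The types and function names are invented for illustration and do not match the gpr code; plain malloc stands in for OBJ_NEW.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { const char *key; int value; } demo_keyval_t;
typedef struct { size_t cnt; demo_keyval_t **keyvals; } demo_value_t;

/* Allocates only the pointer array, then NULLs every slot so a missing
 * object is detectable instead of being a stray non-NULL address. */
int demo_create_value(demo_value_t *v, size_t cnt)
{
    assert(v != NULL && cnt > 0);
    v->keyvals = malloc(cnt * sizeof *v->keyvals);
    if (v->keyvals == NULL) {
        return -1;
    }
    for (size_t i = 0; i < cnt; i++) {
        v->keyvals[i] = NULL;                 /* part (a) of the fix */
    }
    v->cnt = cnt;
    return 0;
}

/* Allocates the object itself before storing anything in it. */
int demo_set_keyval(demo_value_t *v, size_t i, const char *key, int value)
{
    assert(v != NULL);
    if (i >= v->cnt) {
        return -1;
    }
    if (v->keyvals[i] == NULL) {              /* part (b) of the fix */
        v->keyvals[i] = malloc(sizeof *v->keyvals[i]);
        if (v->keyvals[i] == NULL) {
            return -1;
        }
    }
    v->keyvals[i]->key = key;
    v->keyvals[i]->value = value;
    return 0;
}
```

Keeping allocation out of demo_create_value mirrors the design choice in the message: callers that want to point slots at static objects can still do so.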

This commit was SVN r13300.
2007-01-25 12:54:02 +00:00
George Bosilca
5711583bdf Force only one thread to come out from the
socket engine.

This commit was SVN r13298.
2007-01-25 07:36:42 +00:00
George Bosilca
dcce444ed4 When the user gives a prefix, that really means something: I expect to
start looking for the daemons using the prefix.

This commit was SVN r13297.
2007-01-25 07:35:25 +00:00
George Bosilca
950b07d860 Work around the Windows sockets model.
This commit was SVN r13294.
2007-01-25 00:19:02 +00:00
George Bosilca
3b988fcdfd Small update to the process PLS.
This commit was SVN r13293.
2007-01-25 00:17:54 +00:00
Tim Prins
4fd81b3407 Fixes trac:801
- Make it so the SLURM ras can handle different nodelist configurations
- Some code cleanup and better/more informative error messages and error handling

This commit was SVN r13271.

The following Trac tickets were found above:
  Ticket 801 --> https://svn.open-mpi.org/trac/ompi/ticket/801
2007-01-24 14:45:42 +00:00
George Bosilca
9b16827049 Add ORTE_DECLSPEC, and a few conversions.
This commit was SVN r13268.
2007-01-24 00:52:08 +00:00
George Bosilca
1e38810c2d Correctly close the sockets in a generic way.
This commit was SVN r13254.
2007-01-23 03:17:23 +00:00
Ralph Castain
46b7df5683 Fix bproc nodename to correctly assign process-to-nodename mapping
This commit was SVN r13244.
2007-01-22 19:39:00 +00:00
Jeff Squyres
6584df9262 For --prefix-like behavior, we used to modify environ directly and
then exec the "srun..." from there.  But somewhere along the line, we
switched to having a copy of environ and modifying that.  It looks
like we forgot to update the stuff for --prefix behavior.  So this
commit fixes the setenv's for PATH and LD_LIBRARY_PATH to modify the
environ copy (not environ itself) so that the values properly get
passed down to the srun environment via execve().
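
To illustrate the difference, here is a sketch of setting a variable in a private, heap-owned copy of the environment, which is the array that would then be handed to execve(). The helper name is hypothetical; the actual code uses the opal setenv utilities.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Set NAME=VALUE in a NULL-terminated, heap-owned environment copy.
 * The process-wide environ is never touched. Returns 0 or -1. */
int env_copy_set(char ***env, const char *name, const char *value)
{
    assert(env != NULL && name != NULL && value != NULL);

    size_t nlen = strlen(name);
    size_t elen = nlen + strlen(value) + 2;      /* '=' and '\0' */
    char *entry = malloc(elen);
    if (entry == NULL) {
        return -1;
    }
    snprintf(entry, elen, "%s=%s", name, value);

    size_t count = 0;
    for (char **e = *env; e != NULL && *e != NULL; e++, count++) {
        if (strncmp(*e, name, nlen) == 0 && (*e)[nlen] == '=') {
            free(*e);                 /* replace the existing entry */
            *e = entry;
            return 0;
        }
    }

    char **grown = realloc(*env, (count + 2) * sizeof(char *));
    if (grown == NULL) {
        free(entry);
        return -1;
    }
    grown[count] = entry;
    grown[count + 1] = NULL;
    *env = grown;
    return 0;
}
```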

This restores --prefix behavior in the SLURM pls.

This commit was SVN r13239.
2007-01-22 15:50:35 +00:00
George Bosilca
3169a29da4 Revert commit r13235.
This commit was SVN r13238.

The following SVN revision numbers were found above:
  r13235 --> open-mpi/ompi@2636881324
2007-01-22 06:46:58 +00:00
George Bosilca
1b92589179 Update the PLS process.
This commit was SVN r13236.
2007-01-22 05:48:25 +00:00
George Bosilca
2636881324 Remove unused variables.
This commit was SVN r13235.
2007-01-22 05:46:57 +00:00
George Bosilca
93c3e3a21f __WINDOWS__ is defined or not.
This commit was SVN r13234.
2007-01-22 05:46:30 +00:00
Rainer Keller
c2e9075d29 - Define an OPAL_CLASS_EMPTY to be used for initialization.
Similar within the dt_module for the predefined datatypes.

This commit was SVN r13228.
2007-01-21 15:52:06 +00:00
Rainer Keller
96030de97b - Initialize the size of the opal_object class.
- Use the OBJ_CLASS_INSTANCE macro to initialize classes.
   This also gets rid of several missing initialization errors.

This commit was SVN r13227.
2007-01-21 14:24:29 +00:00
Rainer Keller
125ba1acfa - Reduce the number of warnings with -Wshadow -- mainly due to
usage of index and abs in inline functions in header files.

This commit was SVN r13217.
2007-01-19 19:48:06 +00:00
Ralph Castain
b63d4ddfbf Clean up a compiler warning under bproc
This commit was SVN r13198.
2007-01-18 20:07:06 +00:00
Ralph Castain
1487e22ec8 Store the mapping mode so that it can be recovered later
This commit was SVN r13197.
2007-01-18 20:00:15 +00:00
Ralph Castain
2c46e10692 Convert this test back to the old form of xcast API
This commit was SVN r13194.
2007-01-18 19:32:43 +00:00
Ralph Castain
455e4ada9a Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes.
This commit was SVN r13191.
2007-01-18 17:15:19 +00:00
Brian Barrett
2755d5ccef Only do Windows things if we're on Windows. Need another case for when we
don't have windows and we don't have waitpid() (ie, the Cray)

This commit was SVN r13173.
2007-01-17 23:16:52 +00:00
Brian Barrett
ffe35ef6b8 Update the Cray XT3 run-time support files to compile with latest RTE changes
This commit was SVN r13172.
2007-01-17 22:47:27 +00:00
Ralph Castain
e093b5a256 Check for NULL before release, just to be safe
This commit was SVN r13162.
2007-01-17 21:29:34 +00:00
Ralph Castain
da82359446 Update some of the orte tests to sync with openrte repository
This commit was SVN r13155.
2007-01-17 16:15:37 +00:00
Jeff Squyres
6f7adfe231 Fix for the oob base open and close functions being invoked twice by
ompi_info -- once directly and once via the rml oob component.

This commit was SVN r13152.
2007-01-17 15:18:13 +00:00
Ralph Castain
cc905290e4 Fix the pernode and npernode options - the mca parameters weren't being set to correspond to the command line options
This commit was SVN r13151.
2007-01-17 14:56:22 +00:00
Jeff Squyres
3983141342 Remove this extra #if -- it wasn't necessary and was causing compiler warnings.
This commit was SVN r13146.
2007-01-17 13:53:02 +00:00
Ralph Castain
5d698dc55b Turn "off" an unimplemented command line option - we do not currently support execution without mpirun waiting for job completion.
This commit was SVN r13127.
2007-01-16 16:10:31 +00:00
Ralph Castain
f08210b3e1 Fix a double-free error
This commit was SVN r13126.
2007-01-16 15:49:16 +00:00
Jeff Squyres
e5205657cf A much better fix for #739. No configure test -- just do a simple
memcpy() instead of assigning the structs by value.
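
A minimal sketch of the approach, with a made-up struct standing in for the real ones in orte-clean and btl_openib_ini.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical struct containing a bool, as in the affected code. */
typedef struct {
    bool verbose;
    int  count;
} demo_opts_t;

void demo_opts_copy(demo_opts_t *dst, const demo_opts_t *src)
{
    assert(dst != NULL && src != NULL);
    /* Equivalent to "*dst = *src", but avoids the by-value struct copy
     * that the PGI 6.2 compilers reportedly choke on. */
    memcpy(dst, src, sizeof *dst);
}
```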

Fixes trac:739.

This commit was SVN r13081.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 14:30:32 +00:00
Jeff Squyres
add3909096 Back out 13076 and 13077 in favor of a much simpler approach.
Sorry for the configure change -- hopefully it's early enough in the
morning that it won't affect people... (new approach won't have a
configure change).

Refs trac:739.

This commit was SVN r13080.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 14:07:15 +00:00
George Bosilca
24a91fad1d OPAL_BOOL_STRUCT_COPY or OMPI_BOOL_STRUCT_COPY that's the question!
Let's minimize the disturbances and say that the configure system is right.
From now on it's OPAL_BOOL_STRUCT_COPY. This one is related to r13076 and
has to follow when r13076 goes in the 1.2.

This commit was SVN r13077.

The following SVN revision numbers were found above:
  r13076 --> open-mpi/ompi@f0932a0701
2007-01-11 05:44:48 +00:00
Jeff Squyres
f0932a0701 A workaround for a bug in the PGI 6.2 compiler series. This bug has
been fixed in the 7.0 PGI series, but is unlikely to be fixed in the
6.2 series:

 * Add a configure test looking for the bad behavior (the PGI compiler
   chokes on C code where structs containing bool's are copied by
   value)
 * Set OMPI_BOOL_STRUCT_COPY to 1 if it's ok, 0 if it's not (i.e., PGI
   6.2 series will have this value set to 0)
 * In two places in the code base -- orte-clean and btl_openib_ini.h,
   we have a struct that contains a bool that is copied by value.  In
   these two places, check OMPI_BOOL_STRUCT_COPY and if it's 1, use
   the "int" type instead of "bool".

Fixes trac:739

This commit was SVN r13076.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 02:21:26 +00:00
George Bosilca
0f68bb0bc0 Add a debug flag for the ODLS process.
This commit was SVN r13075.
2007-01-11 00:17:34 +00:00
George Bosilca
c8222b57eb The Windows PLS now is able to spawn processes locally.
This commit was SVN r13074.
2007-01-11 00:16:58 +00:00
Rolf vandeVaart
9fd5e55b50 Need to include strings.h because that is where the rindex()
function prototype lives.  Without this, we get compile 
warnings.  In addition, for 64-bit Solaris, we get a 
segmentation fault from orterun without this include.

This commit was SVN r13065.
2007-01-10 18:44:08 +00:00
Brian Barrett
03112254e7 Increase connection timeout to 600 seconds, which should always be higher than
the connect() timeout, so that we'll use that rather than our own timeout by
default.  The timeout was set low for Big Red, but causes problems for very
large clusters, as there's no way to wire them up in 10 seconds most of the
time.

This commit was SVN r13062.
2007-01-10 04:53:21 +00:00
George Bosilca
c7da2b0a9a Add the process PLS. It's only intended for Windows users.
This commit was SVN r13051.
2007-01-09 00:19:52 +00:00
Ralph Castain
950149ec50 Add another test/example program that uses the OpenRTE to dynamically spawn an application
This commit was SVN r13050.
2007-01-09 00:13:57 +00:00
George Bosilca
77452ea8ea Add a missing include and update the definition of orte rds proxy component.
This commit was SVN r13042.
2007-01-08 22:00:01 +00:00
George Bosilca
409d1b8a8d Make the universe creation functions Windows friendly again.
This commit was SVN r13041.
2007-01-08 21:58:57 +00:00
Jeff Squyres
8a289cf1cb Part 1 of the fix for ticket #726. This commit adds logic to orterun
to effect the following:

 * The first time the user hits ctrl-c, we go into the process of
   killing the ORTE job (this is not new).
 * While waiting for the job to actually terminate, if the user hits
   ctrl-c a second time, we print a warning saying "Hey, I'm still
   trying to kill the job.  If you *really* want me to die
   immediately, hit ctrl-c again within 1 second."
 * If the user hits ctrl-c a third time within 1 second, orterun quits with a
   warning about how the job may not have actually been killed.

Note that none of this logic will really work until the second part
of the fix for #726 is also committed (i.e., make pls.terminate_job()
non-blocking).  So I'm now throwing the ticket over to Ralph for the
second part of the fix...

Refs trac:726

This commit was SVN r13040.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-08 20:25:26 +00:00
Brian Barrett
e130f18cc2 Fix some compiler warnings that have slipped in lately...
This commit was SVN r13037.
2007-01-08 17:20:09 +00:00
Brian Barrett
a34e67d743 Remove unneeded PARAM_INIT_FILE variable in configure.params files used by
components that use configure.m4 for configuration or are always built. 
The macro has not been needed since moving to configure types other than
configure.stub

Fixes trac:590

This commit was SVN r13031.

The following Trac tickets were found above:
  Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
2007-01-08 03:44:22 +00:00
Jeff Squyres
a91c017f81 The constant name changed from ORTE_RML_NAME_ANY to ORTE_NAME_WILDCARD
-- update the comments/documentation to match.

This commit was SVN r13001.
2007-01-05 13:38:22 +00:00
Rolf vandeVaart
fdf44cc4ab Add the ability to not only report broken files and directories,
but remove them also.  This current set of changes will affect
nothing as no one is making use of this ability.  However, orte-clean
will be changed soon to utilize this new feature.

This commit was SVN r12996.
2007-01-04 21:48:34 +00:00
Ralph Castain
3ce9b2f6cc Remove some debugging output that mistakenly was left behind.
This commit was SVN r12984.
2007-01-04 17:24:11 +00:00
Ralph Castain
6101050ea6 Remove an abstraction barrier I thought was gone long-ago. The OOB subscription really shouldn't be defined as an OMPI subscription.
I know it's just a technicality, but it is time to address such things rather than just letting them continue to propagate. :-)

This commit was SVN r12954.
2007-01-02 16:16:50 +00:00
Ralph Castain
90f5e3fad8 Fix a buglet in the singleton startup procedure. For purposes of minimizing the xcast message, we "strip" the descriptive info on all subscription messages. This means, though, that we have to store the process name and other info so it can be retrieved in the body of the subscription data (as opposed to in the description). This wasn't being done for singletons because they don't call the RMAPS to "map" themselves.
This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it".

This change will need to migrate across to 1.2 after it "soaks" the appropriate time.

This commit was SVN r12952.
2007-01-02 16:14:44 +00:00
Rich Graham
6cb2377015 Change the allocation of the shared memory backing file. The file
is allocated on a per comm_world instance, with the lowest rank
in comm_world on the given host creating and initializing the file,
and then notifying the remaining processes via the OOB.

Reviewed: Ralph Castain, Brian Barrett
Addressing ticket #674.

This commit was SVN r12949.
2007-01-01 02:39:02 +00:00
Brian Barrett
b5057d923e fix compile error that crept in with orte changes. This just makes it
compile -- I can't check for correctness on this platform.  Someone
probably wants to do that.

This commit was SVN r12948.
2006-12-31 20:16:20 +00:00
Li-Ta Lo
6df4e80727 new XCPU PLS and SDS to work with libxcpu
This commit was SVN r12905.
2006-12-21 00:05:36 +00:00
Brian Barrett
414a87b14c On OS X, the lowest 8 bits of the exit status are for signals, so pushing
rc (which is -1 or 4 if we hit this case) resulted in an odd error that a
signal killed the proc (instead of a startup error, as is reality).
Instead, use the W_EXITCODE macro (if available) to build up an exit
code that has an error code for exit status, but does not make it look
like the process died from a signal
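
A sketch of building such a status. W_EXITCODE is not part of POSIX, so the fallback definition below is an assumption based on the conventional layout (exit code in the high byte, signal in the low bits); the function name is hypothetical.

```c
#include <assert.h>
#include <sys/wait.h>

#ifndef W_EXITCODE
/* Assumed fallback for platforms that do not provide the macro. */
#define W_EXITCODE(ret, sig) (((ret) << 8) | (sig))
#endif

/* Turn a startup error code such as -1 or 4 into a wait-style status
 * that reads as "exited with an error", not "killed by a signal". */
int startup_failure_status(int rc)
{
    int code = rc & 0xff;       /* exit codes are 8 bits wide */
    assert(code != 0);
    return W_EXITCODE(code, 0);
}
```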

This commit was SVN r12890.
2006-12-18 02:30:05 +00:00
Brian Barrett
bc6cec346f Print out the description of the signal from mpirun when a proc was aborted
by a signal if we have strsignal()

This commit was SVN r12888.
2006-12-17 20:01:11 +00:00
Ralph Castain
a0ef517550 Fix some errors in the bproc components that prevented compiling. Thought I had already done this, but either those changes were lost when I did the merge, or my old man's memory is fading....
Whaz-at??? :-)

This commit was SVN r12874.
2006-12-15 19:40:04 +00:00
Ralph Castain
1e1d0e8a89 Set the app_num attribute into the process environment so we pick it up on the other end
This commit was SVN r12868.
2006-12-15 16:43:52 +00:00
Ralph Castain
677d1260aa cleanup nicely if we don't launch
This commit was SVN r12867.
2006-12-15 14:03:53 +00:00
Ralph Castain
cbb660504c Retain the ability to run valgrind on the bproc launcher - do not call bproc_version if "nolaunch" is specified.
This commit was SVN r12866.
2006-12-15 14:01:21 +00:00
Ralph Castain
64ec238b7b Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs.
This commit was SVN r12864.
2006-12-15 02:34:14 +00:00
Brian Barrett
38c2e43ac2 Print out error string rather than errno for TCP-related errors, making it easier for both the user and us to debug BTL and OOB issues...
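
For illustration, a hypothetical formatting helper that reports the strerror() text next to the numeric errno. The message format is an assumption, not the one used in the BTL or OOB code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Format "<call> failed: <description> (errno <n>)" into buf. */
void format_socket_error(char *buf, size_t len, const char *call, int err)
{
    assert(buf != NULL && len > 0 && call != NULL);
    snprintf(buf, len, "%s failed: %s (errno %d)", call, strerror(err), err);
}
```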
This commit was SVN r12852.
2006-12-14 18:20:43 +00:00
Ralph Castain
7b8f445e13 Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer.
This commit was SVN r12840.
2006-12-13 13:49:15 +00:00
Ralph Castain
82946cb220 Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but don't actually launch the job. This lets us test all the setup portions of the code.
Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here.

This commit was SVN r12838.
2006-12-13 04:51:38 +00:00
Ralph Castain
3b064a624e For convenience, revise the orte_job_map_t object so it includes the vpid start/range values, the number of nodes, and the number of processes on each node. These values are all used in various places in the code base - we currently re-compute them multiple times. Since these values do not change and are already being computed by the RMAPS framework, we might as well just save them for re-use.
This commit was SVN r12829.
2006-12-12 16:07:23 +00:00
Ralph Castain
28ce8e5e5e Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior).
In "--npernode" operation, the "-np" command line parameter is ignored.

This commit was SVN r12826.
2006-12-12 00:54:05 +00:00
Ralph Castain
8314e8dbb9 Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode:
1. no -np provided - put one proc/node across all allocated nodes

2. -np N provided, N > #nodes - we print a pretty error message and exit

3. -np N provided, N <= #nodes - put one proc/node across N nodes
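
The three use-cases can be sketched as one function. The name and the return convention are hypothetical; in the commit itself the error case is reported through the new ORTE_ERR_SILENT constant.

```c
#include <assert.h>

/* Returns how many nodes get one process each under pernode,
 * or -1 for the user-input error in case 2.
 * np <= 0 means "-np was not provided". */
int pernode_proc_count(int np, int num_nodes)
{
    assert(num_nodes > 0);

    if (np <= 0) {
        return num_nodes;      /* 1. one proc on every allocated node */
    }
    if (np > num_nodes) {
        return -1;             /* 2. more procs requested than nodes */
    }
    return np;                 /* 3. one proc on each of N nodes */
}
```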

I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error.

This commit was SVN r12821.
2006-12-11 18:07:07 +00:00
Ralph Castain
0a5d41857a Complete next round of message size reduction: "strip" the descriptive info from the returned values. I have now added a flag to the gpr address mode (ORTE_GPR_STRIPPED) that instructs the gpr to not include segment names or tokens in the returned gpr_value_t objects.
I found only two places that were looking at the tokens:

1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.

2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "strip" the value structures of tokens and segment strings, and (c) correctly obtain process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).

This commit was SVN r12813.
2006-12-09 23:10:25 +00:00
Ralph Castain
58569546ed Fix the fix to remove compiler warning - an incorrect "\" was placed in the command string.
This commit was SVN r12805.
2006-12-08 04:17:38 +00:00
Sven Stork
78173a697a Replace the test operation "-e" with "-r" to improve portability.
Refs: #392

This commit was SVN r12790.
2006-12-07 12:14:40 +00:00
Ralph Castain
62d7826e01 Helps if we total up the correct field to get the total number of slots in the universe
This commit was SVN r12789.
2006-12-07 03:17:12 +00:00
Ralph Castain
a1153fdc8f Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attribute_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment.
Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available.

This commit was SVN r12788.
2006-12-07 03:11:20 +00:00
Brian Barrett
8f68764e5e A number of heterogeneous fixes for the dss with the new buffer options:
  * When using the load/unload interface, stash away the current buffer
    type so that it can be properly unpacked on the receiving side if
    the buffer type is other than the receiver default
  * Include type information for unsized types (bool, int, size_t,
    pid_t) so that they can be properly unpacked by the receiver
    in the heterogeneous case.
  * Restore the NON_DESC type as the default for optimized builds,
    since it looks like this fixes the known issues with the
    non-described buffers
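
A minimal sketch of why type information matters for unsized types in the heterogeneous case, written in Python rather than the dss C code. The one-byte width tag, the `_WIDTH_TO_FMT` table, and both function names are assumptions made for illustration; they do not describe the actual dss wire format.

```python
import struct

# Hypothetical mapping from the sender's integer width (in bytes) to a
# network-byte-order struct format.
_WIDTH_TO_FMT = {4: "!i", 8: "!q"}


def pack_unsized_int(value, sender_width):
    """Pack an integer preceded by a tag giving the sender's width in bytes."""
    fmt = _WIDTH_TO_FMT[sender_width]
    return struct.pack("!B", sender_width) + struct.pack(fmt, value)


def unpack_unsized_int(buf):
    """Read the width tag first, then decode using the sender's width."""
    (width,) = struct.unpack_from("!B", buf, 0)
    (value,) = struct.unpack_from(_WIDTH_TO_FMT[width], buf, 1)
    return value
```

Because the receiver reads the tag before the value, a sender whose `int` or `size_t` has a different width can still be decoded correctly.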

Refs trac:587

This commit was SVN r12784.

The following Trac tickets were found above:
  Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
2006-12-06 23:19:06 +00:00
Brian Barrett
cfeac5581a temporarily always use described buffers as the non-described causes all
kinds of problems for heterogeneous environments

This commit was SVN r12783.
2006-12-06 20:22:31 +00:00
Ralph Castain
d4bd60c9fe Restore the paffinity capability, along with all the required logic to ensure we "do the right thing" when the user gives us inaccurate information about the number of slots on a remote node.
This commit was SVN r12780.
2006-12-06 15:59:34 +00:00
Ralph Castain
b1e16fffac Add the C++ doo-hicky stuff around the odls framework definitions just in case somebody, somewhere, on some remote planet where only goats can feed needs it.
This commit was SVN r12777.
2006-12-06 13:58:04 +00:00
Ralph Castain
8ca415a0c5 Remove duplicate orte_odls declaration
This commit was SVN r12776.
2006-12-06 13:44:41 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
Tim Prins
08d5ca821f Don't get the node architecture when using the LoadLeveler RAS. It is slow (about a second for ~300 nodes) and we don't even use the value.
This commit was SVN r12758.
2006-12-05 13:47:53 +00:00
Ralph Castain
eb941d8ae2 Fix a bug that declared a node as "oversubscribed" a little early during the mapper procedure. This only affected the mapping procedure, and only if you had set the "--no-oversubscribe" flag.
Kudos to Tim Prins for finding it.

This commit was SVN r12757.
2006-12-05 13:04:27 +00:00
George Bosilca
6f28bcdc21 Remove the last set of compiler warnings from the precondition file.
This commit was SVN r12753.
2006-12-04 21:45:57 +00:00
Brian Barrett
d64fa194f1 Instead of continually screwing around with different format strings to
make this warning-proof, loop over the uint64_ts as an array of integers
and use %x.  The final string is just as random and formatted exactly
the same, so we're all good in that department.

Refs trac:655

This commit was SVN r12742.

The following Trac tickets were found above:
  Ticket 655 --> https://svn.open-mpi.org/trac/ompi/ticket/655
2006-12-04 18:07:24 +00:00
Gleb Natapov
f0132b2499 Provide parameters in a correct order (processor/oversubscribed was swapped).
This commit was SVN r12737.
2006-12-04 12:55:45 +00:00
Rainer Keller
d078bb3e8a - Revert changes and include pointers to discussion.
This commit was SVN r12736.
2006-12-03 17:05:15 +00:00
Rainer Keller
e61dd8722e - Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx
- No functional changes, just indentation and corrections to error
   output.

This commit was SVN r12734.
2006-12-03 13:59:23 +00:00
George Bosilca
a0ed53d70b Make the compilers happy.
This commit was SVN r12729.
2006-12-03 00:19:11 +00:00
Ralph Castain
4151a46871 Per Jeff's request (which made a lot of sense), setup the default buffer type to be DESCRIBED for debug/devel builds, and NON-DESC for optimized builds. The user can still select the default buffer type via mca parameter at runtime - this just sets the default default. :-)
Also, change the dss buffer type mca param to something more easily remembered (it is now "dss_buffer_type"). Heck, even I had to keep looking at the darn code to remember it.

This commit was SVN r12728.
2006-12-02 13:32:16 +00:00
George Bosilca
3fd278c522 Make the tree compile in debug mode.
This commit was SVN r12724.
2006-12-01 23:03:09 +00:00
Ralph Castain
897744cdeb Two major changes to the runtime, plus two smaller additions:
1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1".

2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial".

3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1".

4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer)
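
For readers unfamiliar with the two xcast modes named in item 2, here is a rough Python sketch of who sends to whom in each scheme. It shows the textbook binomial tree rooted at rank 0 and is not taken from the ORTE source; the function names are hypothetical.

```python
def linear_children(rank, size):
    """Linear broadcast: the root sends directly to every other process."""
    return list(range(1, size)) if rank == 0 else []


def binomial_children(rank, size):
    """Binomial tree rooted at 0: rank r forwards to r + 2**k for every
    power of two greater than r, as long as the target rank exists."""
    children = []
    mask = 1
    while mask <= rank:
        mask <<= 1          # skip powers of two that are not above rank
    while rank + mask < size:
        children.append(rank + mask)
        mask <<= 1
    return children
```

With 8 processes the linear root performs 7 sends, while the binomial root performs 3 and the rest are forwarded by intermediate ranks, so the broadcast finishes in about log2(size) rounds at the cost of depending on those intermediates.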

This commit was SVN r12722.
2006-12-01 22:30:39 +00:00
Jeff Squyres
3cf7dddd47 Fixes trac:635.
Ralph identified the problem, I tracked down ''where'' the fd was
being closed, and Brian figured out ''why'' (and the fix).

What was happening is that a remote process was closing its
stdout/stderr and therefore sending a 0-byte IOF message to mpirun.
mpirun, in turn, closed the iof endpoint associated with that stream
(i.e., stdout/stderr).  IOF does this to handle the case where
mpirun's stdin is closed -- this in turn causes stdin on all the
ORTE-started processes to be closed as well.

So the workaround here is to check that if we get a 0-byte IOF message
on a sink (indicating a remote closure), and if that sink is the
special stdout or stderr stream, don't actually close anything in the
local process.
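
The workaround described above can be sketched as follows. This is an illustrative Python model only: the `Sink` class, the string tags, and `handle_iof_message` are hypothetical and do not mirror the IOF framework's real data structures.

```python
STDOUT_TAG, STDERR_TAG = "stdout", "stderr"  # hypothetical stream tags


class Sink:
    """Toy model of a local IOF sink endpoint."""

    def __init__(self, tag):
        self.tag = tag
        self.closed = False
        self.data = []


def handle_iof_message(sink, payload):
    """Deliver payload to a sink; a 0-byte payload signals remote closure."""
    if len(payload) == 0:
        # The remote side closed its stream. Leave our own stdout/stderr
        # open; any other sink is closed as before.
        if sink.tag not in (STDOUT_TAG, STDERR_TAG):
            sink.closed = True
        return
    sink.data.append(payload)
```

Other processes may still be writing to the same stream, which is why a single remote closure must not shut down the local stdout or stderr.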

This commit was SVN r12691.

The following Trac tickets were found above:
  Ticket 635 --> https://svn.open-mpi.org/trac/ompi/ticket/635
2006-11-28 21:42:49 +00:00
Ralph Castain
0398c9e0c5 Correctly setup the sched_yield when launching processes via the orteds. This still doesn't adjust the yield schedule "on-the-fly" as more procs are dynamically added to a node - it just sets it when they are first launched.
This commit was SVN r12683.
2006-11-28 08:27:20 +00:00
George Bosilca
8df8d86b85 Complete the functions to match the expected prototype.
This commit was SVN r12680.
2006-11-28 00:44:30 +00:00
Ralph Castain
bc4e97a435 First stage in the move to a faster startup. Change the ORTE stage gate xcast into a binary tree broadcast (away from a linear broadcast). Also, removed the timing report in the gpr_proxy component that printed out the number of bytes in the compound command message as the answer was "not much" - reduces the clutter in the data.
This commit was SVN r12679.
2006-11-28 00:06:25 +00:00
Ralph Castain
652b91ee26 Remove some compiler warnings
This commit was SVN r12678.
2006-11-27 23:47:36 +00:00
Ralph Castain
9bc25f0bec Fix a potential bug in the registry where it didn't fully check a segment's name when searching for it. Will have to verify that this doesn't break other things.
Bring the bproc system close to being back online....

This commit was SVN r12659.
2006-11-23 04:17:37 +00:00
Brian Barrett
32833deff0 since orteboot, ortehalt, and ortekill were all added today (including to
configure.ac), we need to add them to SUBDIRS to make them end up in the
tarball as well...

This commit was SVN r12658.
2006-11-23 03:10:57 +00:00
Ralph Castain
deb2470ba3 Move the waitpid callback in the bproc pls *after* we store the daemon info. Otherwise, a short-lived app could terminate before we store the daemon info, causing mpirun to not terminate the daemons since the call to get_active_daemons would return a NULL list.
This commit was SVN r12656.
2006-11-22 22:49:22 +00:00
Rainer Keller
b63500f62c - Don't unlock ompi_rte_mutex unconditionally, use the macro instead.
This commit was SVN r12655.
2006-11-22 21:01:43 +00:00
Ralph Castain
b1ff5fe868 Move the name of the bproc common segment to the central schema location - avoids conflicts when bproc 3 components try to build
This commit was SVN r12654.
2006-11-22 20:23:17 +00:00