Commit graph

1501 Commits

Author SHA1 Message Date
Ralph Castain
b110a247be Fix comm_spawn (maybe).
Comm_spawn was sticking during spawn_multiple because of a problem in the dpm - the modex there is asking processes to talk to each other in an allgather_list operation, but the procs don't have the required contact info to do so. The solution here was to ensure that all parent procs have full contact info for procs in the child job.

Admittedly, this isn't the long-term answer. We would like to have the contact info given to only the parent procs that were involved in the comm_spawn. There is a way to do that, but this will suffice to keep things working until that can be implemented and tested.

This commit was SVN r17772.
2008-03-06 21:56:00 +00:00
Ralph Castain
57a72c412a Utilize Tim M's suggestion and use atomics to do the locking.
This commit was SVN r17767.
2008-03-06 21:36:32 +00:00
Ralph Castain
097cc83be2 Fix a race condition - ensure we don't call terminate in orterun more than once, even if the timeout fires while we are doing so
This commit was SVN r17766.
2008-03-06 19:35:57 +00:00
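
The race fixed in 097cc83be2 (terminate must run at most once, even if a timeout fires while it is already running) can be illustrated with a small run-once guard. This is only a sketch in Python, not the orterun code: the class name `TerminateOnce` and its methods are hypothetical, and a non-blocking lock acquire stands in for the atomic test-and-set that the 57a72c412a entry mentions.

```python
import threading


class TerminateOnce:
    """Hypothetical guard: runs the wrapped action for the first caller only."""

    def __init__(self, action):
        self._action = action
        # The lock is used as a one-shot flag and is never released.
        self._claimed = threading.Lock()

    def terminate(self):
        # A non-blocking acquire behaves like test-and-set: exactly one
        # caller sees True, every later or concurrent caller sees False.
        if not self._claimed.acquire(blocking=False):
            return False
        self._action()
        return True
```

A timeout handler and the normal shutdown path could both call `terminate()`; whichever arrives second returns `False` without running the action again.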
Ralph Castain
64d43cc44b Fix the unity routed component and direct xcast mode.
Ensure that direct xcast handles all its use-cases correctly.

Unity routed component needs to use the base recv function to properly operate.

This commit was SVN r17764.
2008-03-06 18:13:05 +00:00
Ralph Castain
ff99aa054f In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate.
This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection.

For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP.

For a proc using tree routed, the lifeline is the local daemon.

Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost.

This commit was SVN r17761.
2008-03-06 15:30:44 +00:00
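
The lifeline rules listed in ff99aa054f can be restated as a small lookup. This is a sketch in Python rather than the ORTE C code; the function names and the string labels (`"hnp"`, `"daemon"`, `"proc"`, `"local_daemon"`, `"unity"`, `"tree"`) are hypothetical and chosen only for illustration.

```python
def lifeline_for(role, routed=None):
    """Return the peer whose loss should abort this process, or None."""
    if role == "hnp":
        return None              # an HNP has no lifeline
    if role == "daemon":
        return "hnp"             # a daemon aborts if it loses the HNP
    if role == "proc":
        if routed == "unity":
            return "hnp"         # unity routed procs connect directly to the HNP
        if routed == "tree":
            return "local_daemon"
        raise ValueError(f"unknown routed component: {routed!r}")
    raise ValueError(f"unknown role: {role!r}")


def should_abort(role, routed, lost_peer):
    """True when the lost connection is this process's lifeline."""
    lifeline = lifeline_for(role, routed)
    return lifeline is not None and lost_peer == lifeline
```

Under these assumptions, a tree-routed proc that loses its local daemon aborts, while the same proc losing some other connection does not.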
Josh Hursey
0b4d9a12ce a bit more verbosity for the fun of it
This commit was SVN r17758.
2008-03-06 14:04:25 +00:00
Tim Prins
f61c2333c0 Remove unneeded field, and the two uses of it.
This commit was SVN r17757.
2008-03-06 12:46:36 +00:00
Tim Prins
d56f19c77d Fix logic error, and remove unneeded checks for invalid results.
This commit was SVN r17756.
2008-03-06 04:38:13 +00:00
Ralph Castain
6d94e7b232 Fix the debug output so it correctly reports launch state
This commit was SVN r17755.
2008-03-06 03:11:01 +00:00
Ralph Castain
3883bbee06 Fix bug - must not "free" tsd-allocated memory
This commit was SVN r17754.
2008-03-06 03:10:14 +00:00
Ralph Castain
5795427f6a Protect against situations where someone didn't fill in all the app_context fields
This commit was SVN r17752.
2008-03-05 23:55:03 +00:00
Tim Prins
5de3e1965e Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte.
Everything should work, however I am unable to compile and test the sctp BTL.

This commit was SVN r17751.
2008-03-05 22:44:35 +00:00
Tim Prins
f9916811ae Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124
The change also:
 - cleans up and simplifies the command line processing code
 - adds an error output if more than one hostfile passed for a single app context
 - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options

This commit was SVN r17750.

The following Trac tickets were found above:
  Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124
2008-03-05 22:12:27 +00:00
Rolf vandeVaart
03fdd57d5a Fix the use of --path and -x PATH so that things work properly.
Note that --path specifies extra directories where the executable
is searched for, but does not affect the PATH settings.

This commit fixes trac:1221.

This commit was SVN r17748.

The following Trac tickets were found above:
  Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221
2008-03-05 21:07:43 +00:00
Ralph Castain
4dbc352828 Per request, change name of new enviro var to OMPI_COMM_WORLD_LOCAL_SIZE
This commit was SVN r17736.
2008-03-05 14:45:26 +00:00
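
A launched process could read the environment variables named in these entries roughly as follows. This is a sketch under assumptions: the variable names come from the commit messages (`OMPI_COMM_WORLD_LOCAL_SIZE` here, `OMPI_COMM_WORLD_LOCAL_RANK` and `OMPI_UNIVERSE_SIZE` in e745c16ff1), but the helper `read_ompi_env` is hypothetical, and treating every value as a decimal integer is an assumption rather than something the log states.

```python
import os

# Names taken from the commit messages in this log.
OMPI_VARS = (
    "OMPI_COMM_WORLD_LOCAL_RANK",
    "OMPI_COMM_WORLD_LOCAL_SIZE",
    "OMPI_UNIVERSE_SIZE",
)


def read_ompi_env(environ=None):
    """Return {name: int or None} for the variables above.

    A variable maps to None when it is unset or is not a valid integer
    (assumption: the launcher exports these values as decimal integers).
    """
    env = os.environ if environ is None else environ
    result = {}
    for name in OMPI_VARS:
        raw = env.get(name)
        try:
            result[name] = int(raw) if raw is not None else None
        except ValueError:
            result[name] = None
    return result
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise outside a launched job.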
Ralph Castain
06d3145fe4 First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown.
This commit was SVN r17732.
2008-03-05 13:51:32 +00:00
Jeff Squyres
d0f5be023c Restore r17703; it was accidentally removed as part of r17704.
This commit was SVN r17728.

The following SVN revision numbers were found above:
  r17703 --> open-mpi/ompi@1bedaea79b
  r17704 --> open-mpi/ompi@8189fcc7d5
2008-03-05 12:01:37 +00:00
George Bosilca
c71f225a28 These functions should only be compiled when OPAL_ENABLE_FT == 1.
This commit was SVN r17727.
2008-03-05 05:57:13 +00:00
Josh Hursey
3b4073e32c This commit fixes the checkpoint/restart functionality on the trunk. Included in this commit are:
* Extension to the ESS framework to support C/R
 * Fixed support for {{{snapc_base_establish_global_snapshot_dir}}}
 * Fixed FileM support
 * Misc. minor code modifications

There are some outstanding visibility issues that I want to fix next.

This commit was SVN r17725.
2008-03-05 04:57:23 +00:00
Ralph Castain
edb8e32a7a Add default hostfile parameter plus --default-hostfile command line option.
Fix error message when job setup failed

This commit was SVN r17724.
2008-03-05 04:54:57 +00:00
Ralph Castain
022fc1f382 Add another MPI-related enviro variable OMPI_COMM_WORLD_NUM_LOCAL_PROCS
This commit was SVN r17723.
2008-03-05 04:53:32 +00:00
Ralph Castain
9413d6cf5d Define a default exit code for when things fail prior to a job launch - still needs work, but a start.
Fix a deadlock loop when things really, really go bad. If we timeout trying to kill the job, then it's time to bail as cleanly as possible, not go back and keep trying.

This commit was SVN r17715.
2008-03-05 01:46:30 +00:00
Ralph Castain
bf5ba58ce0 Get the count correct when the user lists the same node multiple times for -host.
This commit was SVN r17711.
2008-03-05 01:24:34 +00:00
Jeff Squyres
8189fcc7d5 Back out r17702; it went very badly.
This commit was SVN r17704.

The following SVN revision numbers were found above:
  r17702 --> open-mpi/ompi@3df754ebd7
2008-03-05 00:42:39 +00:00
Shiqing Fan
1bedaea79b Add support of orte event wait functions for Windows.
This commit was SVN r17703.
2008-03-05 00:25:23 +00:00
Ralph Castain
e745c16ff1 Modify the enviro variable names to be OMPI_...
Add two new ones: OMPI_COMM_WORLD_LOCAL_RANK and OMPI_UNIVERSE_SIZE

This commit was SVN r17694.
2008-03-04 20:16:05 +00:00
Shiqing Fan
ebf9c0441d Set the windows components invisible.
This commit was SVN r17687.
2008-03-04 17:37:17 +00:00
Shiqing Fan
ae41b5418b Update the RAS and PLM components for Windows.
These changes affect only Windows, not other platforms.

This commit was SVN r17686.
2008-03-04 17:13:01 +00:00
Ralph Castain
ffa232687a Fix xcast so it works in multi-node situations where the user specifies a particular mode to use (e.g., direct).
This commit was SVN r17682.
2008-03-03 20:07:02 +00:00
Ralph Castain
841d0e5208 Cleanup an attribute warning - not sure which one to set or where it should go, so I'll leave that to someone more familiar with "attributes".
Ensure some debugging is only enabled when have_debug is set.

This commit was SVN r17681.
2008-03-03 16:06:47 +00:00
Rich Graham
d37db14901 get the shared memory collectives working again with the new
version of orte.

This commit was SVN r17672.
2008-02-29 22:28:57 +00:00
Ralph Castain
6450962d59 Add some debugging to the message event object.
Cleanup some no-longer-used values

This commit was SVN r17671.
2008-02-29 20:10:31 +00:00
Ralph Castain
a1eef0dd50 Fix a race condition in the orted recv/process procedure.
Thx to Tim P for spotting it

This commit was SVN r17666.
2008-02-29 15:18:45 +00:00
Ralph Castain
a585923de1 Silence some minor compiler warnings
This commit was SVN r17662.
2008-02-29 02:39:39 +00:00
Tim Prins
84b2099fe8 Remove the now-unused orte_value_array. As this is the last 'class' split between orte and ompi, remove the big comment about the split in ompi_bitmap.
Also, update some properties (source files should not be executable...), and remove a couple unneeded inclusions of orte_proc_table.h

This commit was SVN r17655.
2008-02-28 21:39:42 +00:00
Ralph Castain
5e6928d710 Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message.
Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occurs outside of a recv.

Created two new macros to assist:

ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object

ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd.

Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output.

This has been tested on Mac, TM, and SLURM.

This commit was SVN r17647.
2008-02-28 19:58:32 +00:00
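
The deferral technique described in 5e6928d710 (the receive path only schedules the work, and the message is processed after the recv has returned) can be sketched with a simple callback queue. This is an illustration in Python, not the ORTE macros: `EventLoop`, `post`, `progress`, `make_recv`, and `run_chain` are hypothetical names, and a FIFO queue stands in for the event library's zero-time timer events.

```python
from collections import deque


class EventLoop:
    """Hypothetical stand-in for the event library."""

    def __init__(self):
        self._pending = deque()
        self.depth = 0        # handlers currently executing
        self.max_depth = 0    # deepest nesting ever observed

    def post(self, callback, message):
        # Comparable to creating a zero-time event: nothing runs here.
        self._pending.append((callback, message))

    def progress(self):
        # Run scheduled handlers, including ones posted while running.
        processed = 0
        while self._pending:
            callback, message = self._pending.popleft()
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)
            try:
                callback(message)
            finally:
                self.depth -= 1
            processed += 1
        return processed


def make_recv(loop, handler):
    def recv(message):
        # The receive path only schedules processing, then returns.
        loop.post(handler, message)
    return recv


def run_chain(n):
    """Handling message k triggers message k + 1, up to n."""
    loop = EventLoop()
    log = []

    def handler(message):
        log.append(message)
        if message < n:
            recv(message + 1)

    recv = make_recv(loop, handler)
    recv(1)
    loop.progress()
    return log, loop.max_depth
```

In this sketch, even though each handler causes another message to arrive, handlers never nest: `max_depth` stays at 1 because every message is processed from `progress()` rather than from inside `recv`.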
Ralph Castain
5dc64cea6a Correct logic - only issue recv and cancel it if we are an HNP
This commit was SVN r17641.
2008-02-28 15:27:16 +00:00
George Bosilca
7879e0b9c2 Be nice with parallel debugger, export this required symbol.
This commit was SVN r17637.
2008-02-28 05:59:07 +00:00
George Bosilca
9d421bea2a Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the
implementation of orte_pointer_array.

This commit was SVN r17636.
2008-02-28 05:32:23 +00:00
Ralph Castain
d70e2e8c2b Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately.
Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer

This commit was SVN r17632.
2008-02-28 01:57:57 +00:00
Gleb Natapov
da3e69101d Add missing include.
This commit was SVN r17493.
2008-02-18 14:55:02 +00:00
Galen Shipman
18d1d3b408 Add ORTE ALPS support (Cray XT CNL)
This commit was SVN r17482.
2008-02-17 19:29:06 +00:00
George Bosilca
fcab6cc0bb Fix typo.
This commit was SVN r17255.
2008-01-26 21:36:04 +00:00
Rainer Keller
9d4852cdc1 - Get rid of Wshadow warnings.
This commit was SVN r17231.
2008-01-25 14:07:38 +00:00
Sharon Melamed
025b68becf Move the carto framework to the trunk.
This commit was SVN r17177.
2008-01-23 09:20:34 +00:00
Pak Lui
413bcca4c0 Support the qrsh or qsub "-notify" option by catching the SIGUSR1/2
signals and not letting user processes exit on those signals.

This commit was SVN r17174.
2008-01-22 17:32:29 +00:00
Jeff Squyres
b6e9c99f7d Formatting fixes from Peter Breitenlohner.
This commit was SVN r17163.
2008-01-18 23:21:31 +00:00
Josh Hursey
158dda5458 Fix some overlapping code.
This commit was SVN r17067.
2008-01-08 15:40:21 +00:00
George Bosilca
eb71a634c6 Don't forget to initialize the msg_origin field.
This commit was SVN r17055.
2008-01-04 23:24:49 +00:00
George Bosilca
48f5a26e8c Cast to keep VC happy (quiet).
This commit was SVN r17054.
2008-01-04 23:13:32 +00:00