Ralph Castain
210f591f1c
Cleanup array addressing for opal_pointer_array
...
This commit was SVN r21710.
2009-07-17 22:20:30 +00:00
Ralph Castain
51a8b89a83
Treat termination of continuously operating processes as an abort
...
This commit was SVN r21709.
2009-07-17 22:20:05 +00:00
Ralph Castain
08e17b72cf
Break a circular logic loop in the cm routed module.
...
This commit was SVN r21708.
2009-07-17 18:07:35 +00:00
Ralph Castain
ef20e778b3
Ensure that output ends on an appropriate suffix tag when --tag-output or --xml are selected.
...
When we read the input buffer, we don't always get a complete printf output - we sometimes end mid stream. We still need to add the suffix and a <CR> to keep the output working right.
This commit was SVN r21706.
2009-07-17 05:02:53 +00:00
Ralph Castain
4c1eb040b0
Enable the system to keep functioning even when multiple launches are occurring simultaneously.
...
This is a bit of a hack, but it does seem to allow the system to work. A better solution is being discussed.
This commit was SVN r21705.
2009-07-17 02:28:47 +00:00
Ralph Castain
c0e85a492c
Deleted one too many lines...might be good to set the value of oldnode!
...
Thanks George.
This commit was SVN r21702.
2009-07-16 18:49:24 +00:00
George Bosilca
3e971e61f3
The system headers are supposed to be protected by #ifdef and not by #if.
...
This commit was SVN r21700.
2009-07-16 18:27:33 +00:00
George Bosilca
ed93b967f7
Remove some warnings about uninitialized values.
...
This commit was SVN r21695.
2009-07-16 17:38:09 +00:00
George Bosilca
52d013baae
Add a missing header.
...
This commit was SVN r21694.
2009-07-16 17:21:37 +00:00
Ralph Castain
007d14f238
Add a threshold reporting level to the orte notifier framework. This takes a string value:
...
"critical" - any error at or above the critical severity will be reported (i.e., only critical errors)
"warning" - any error at or above the warning severity will be reported (i.e., warning and critical errors)
"notice" - pretty much everything will be reported
Default to "critical" to keep down the chatter.
Obviously, only places that call orte_notifier will be affected - all other error reporting (e.g., via opal_output calls) is unaffected.
This commit was SVN r21693.
2009-07-16 13:31:23 +00:00
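The threshold semantics described in the notifier commit above can be sketched as follows. This is a minimal Python illustration, not the ORTE implementation: the names `SEVERITY_RANK` and `should_report` are hypothetical, and the numeric ranks are an assumption chosen only to order the three levels named in the commit message.

```python
# Hypothetical ordering of the three levels named in the commit message;
# a larger number means a more severe error. The numbers are illustrative.
SEVERITY_RANK = {"notice": 0, "warning": 1, "critical": 2}


def should_report(severity, threshold="critical"):
    """Return True if an error of this severity meets the threshold.

    The default threshold is "critical", matching the default stated
    in the commit message.
    """
    if severity not in SEVERITY_RANK or threshold not in SEVERITY_RANK:
        raise ValueError("unknown severity level")
    return SEVERITY_RANK[severity] >= SEVERITY_RANK[threshold]
```

With the default threshold only critical errors pass the check; lowering the threshold to "notice" lets all three levels through.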
Ralph Castain
ae6c36ae01
Ensure that jdata->num_procs is correct when the rank_file mapper is mapping more procs than are specified in the rank_file
...
This commit was SVN r21690.
2009-07-15 22:45:12 +00:00
Ralph Castain
e75d9b8296
Use orte_notifier to alert sys admins to checksum violations in the csum pml.
...
Add ability to store the RM's jobid string to tag the notifier message so that the sys admin knows what job had the problem.
This commit was SVN r21687.
2009-07-15 19:43:26 +00:00
Ralph Castain
90a2db25e9
Modify the errmgr callback function so it passes the proc that failed instead of only the jobid.
...
Update the cm routed module to detect and pass orted failures.
This commit was SVN r21682.
2009-07-15 11:43:33 +00:00
Ralph Castain
247ba7e90d
Use the base function to claim a slot when fault groups are not defined
...
This commit was SVN r21681.
2009-07-15 11:28:58 +00:00
Ralph Castain
7161b37c76
Ensure that the stdin channel is closed when we kill a local proc - all other channels will automatically be closed when the proc terminates
...
This commit was SVN r21680.
2009-07-15 11:28:19 +00:00
Ralph Castain
dbac602be5
Add support for the add-host and add-hostfile MPI Info keys to allow Comm_spawn users to add new hosts to those already known by mpirun.
...
Requires full testing once comm_spawn is fixed (Edgar is working on that now).
This commit was SVN r21664.
2009-07-14 14:34:11 +00:00
Ralph Castain
60edbc7220
Fix hetero operations and comm_spawn (to a point).
...
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.
Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)
This commit was SVN r21655.
2009-07-13 20:03:41 +00:00
Ralph Castain
1b418dd397
Fix segfault in comm_spawn. The underlying problem breaking comm_spawn, however, remains - the change to make modex non-blocking causes the system to fail due to the arch not getting properly set.
...
Fix for that coming shortly.
This commit was SVN r21646.
2009-07-13 15:13:06 +00:00
Ralph Castain
b97f885c00
Restore the original API to terminate individual processes instead of the entire job. This was originally removed as we didn't at that time know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
...
Continue work on the resilient mapper, completing support for fault groups.
This commit was SVN r21639.
2009-07-13 02:29:17 +00:00
Ralph Castain
e30826c6e1
Quiet some compiler warnings
...
This commit was SVN r21591.
2009-07-02 17:48:36 +00:00
Shiqing Fan
0b56a8a4d5
Enable IPv6 on Windows by default, and fix two type casts for IPv6 operations.
...
This commit was SVN r21586.
2009-07-02 14:41:03 +00:00
Ralph Castain
4adb3ed80f
Print out a more meaningful and correct error message
...
This commit was SVN r21581.
2009-07-01 20:16:15 +00:00
Ralph Castain
f832352b45
Clean up some compiler warnings
...
This commit was SVN r21577.
2009-07-01 16:51:11 +00:00
Ralph Castain
9635db373d
Ensure that we properly exit if the executable isn't found
...
This commit was SVN r21570.
2009-07-01 03:16:13 +00:00
Josh Hursey
91c4869cdc
Use {{{ORTE_*_VERSION}}} instead of {{{OMPI_*_VERSION}}} in orte
...
This commit was SVN r21558.
2009-06-29 12:53:23 +00:00
Lenny Verkhovsky
e03807a3d1
Small patch to extend the current rankfile syntax to be compliant with the orte_hosts syntax.
...
This makes it possible to claim relative hosts from the hostfile/scheduler
by using a +n# hostname, where 0 <= # < np.
Example:
cat ~/work/svn/hpc/dev/test/Rankfile/rankfile
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=2
rank 3=+n1 slot=1
This commit was SVN r21557.
2009-06-28 11:20:56 +00:00
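To illustrate how the relative `+n#` host names in the rankfile example above could be resolved, here is a rough Python sketch. It is not the rankfile mapper's code: `resolve_rankfile`, `RANK_LINE`, and the host names `nodeA`/`nodeB` are hypothetical, and the bounds check against the length of the host list is an assumption about what "relative host" means here.

```python
import re

# Matches lines such as "rank 2=+n1 slot=2" from the example rankfile.
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def resolve_rankfile(lines, hosts):
    """Map each rank to a (hostname, slot) pair.

    `hosts` is the ordered host list from the hostfile or scheduler.
    A host written as "+n#" is replaced by hosts[#]; any other host
    name is kept unchanged. The slot is kept as a string.
    """
    placement = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        rank, host, slot = int(match.group(1)), match.group(2), match.group(3)
        if host.startswith("+n"):
            index = int(host[2:])
            # Assumption: the index must refer to an existing host entry.
            if not 0 <= index < len(hosts):
                raise ValueError(f"relative host {host} is out of range")
            host = hosts[index]
        placement[rank] = (host, slot)
    return placement


rankfile = [
    "rank 0=+n0 slot=0",
    "rank 1=+n0 slot=1",
    "rank 2=+n1 slot=2",
    "rank 3=+n1 slot=1",
]
# Hypothetical two-node allocation.
placement = resolve_rankfile(rankfile, ["nodeA", "nodeB"])
```

In this sketch ranks 0 and 1 land on the first allocated host and ranks 2 and 3 on the second, whatever the hosts are actually called.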
Shiqing Fan
656ec00611
Remove the definition of _WIN32_DCOM. It only enables DCOM when _WIN32_WINNT is less than 0x400 (Win 95, 98), and we are supporting 0x502 (Win XP) and above in Open MPI. Thanks George for pointing this out.
...
This commit was SVN r21555.
2009-06-27 23:36:25 +00:00
Ralph Castain
d0a5468deb
Continue gradual cleanup of pointer array addressing.
...
This commit was SVN r21550.
2009-06-26 22:40:28 +00:00
Ralph Castain
8ccc47b152
Don't wake up mpirun too early - even though we ordered an abort, we have to wait for the procs/daemons to properly terminate, and give the kill_procs command a chance to get out.
...
This commit was SVN r21549.
2009-06-26 22:40:00 +00:00
Ralph Castain
0a92fe3739
Pull the HNP node from the right index
...
This commit was SVN r21547.
2009-06-26 21:43:09 +00:00
Ralph Castain
b96a71b62e
Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
...
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.
This commit was SVN r21545.
2009-06-26 20:54:58 +00:00
George Bosilca
f06198a999
BAD grpcomm now has the ability to execute the modex offline.
...
The MPI process prepares the send buffer and posts the collective order to the
local daemon. It then registers the callback and returns from the modex
exchange. It will only wait for modex completion when modex_recv is called.
Meanwhile, the daemon will do the allgather.
This commit was SVN r21543.
2009-06-26 20:32:31 +00:00
George Bosilca
6239e714a6
Extract the modex unpacking into its own function.
...
This commit was SVN r21542.
2009-06-26 20:29:54 +00:00
George Bosilca
760df7fcb9
Get rid of the static buffer. Instead, use the user-supplied one directly.
...
This commit was SVN r21541.
2009-06-26 19:21:27 +00:00
Shiqing Fan
b0b11e8465
Clean up things properly.
...
This commit was SVN r21539.
2009-06-26 15:20:19 +00:00
Shiqing Fan
6ba25951b4
Add a separate function for reading the remote registry, so that it can be easily reused.
...
Add two options for the plm process module: remote_env_prefix, which gets the OMPI prefix on the remote computer by reading its user environment variable (OPENMPI_HOME), and remote_reg_prefix, which is similar but reads the registry on the remote computer. Reading the remote env prefix has a higher priority than reading the reg prefix, so that users can use their own installation of OMPI.
This commit was SVN r21538.
2009-06-26 13:35:45 +00:00
Shiqing Fan
5b4b8f8899
Get rid of a bunch of compiler warnings on Windows.
...
This commit was SVN r21536.
2009-06-26 07:51:39 +00:00
Lenny Verkhovsky
efa800efea
removed orphan files in rankfile mapper
...
This commit was SVN r21532.
2009-06-25 17:14:10 +00:00
Josh Hursey
26363fcc8b
Move the end timer to the end of the launch loop for the HNP. This matches the timing in the SLURM component, and makes the timing work when enabling the 'tree' based rsh launch.
...
This commit was SVN r21528.
2009-06-25 13:28:34 +00:00
Josh Hursey
8705c7edb1
Fix {{{orte_timing}}} for the RSH component of the PLM.
...
It never set the daemon launch counter before launching.
In the output for 'total job launch time' make the message match that of the SLURM PLM for easier parsing.
This commit was SVN r21527.
2009-06-25 12:37:58 +00:00
Shiqing Fan
07f7fe1a1b
Add two CCP params that let the user direct stdout and stderr to files.
...
This commit was SVN r21521.
2009-06-25 09:03:50 +00:00
Ralph Castain
00fb79567f
Shift some setup items from orterun to the ess/hnp module so that any HNP will perform them.
...
This commit was SVN r21520.
2009-06-25 03:00:53 +00:00
George Bosilca
fabebd140f
If the daemon doesn't contain orted, we're adding the prefix_dir and bin_base twice.
...
Create a temporary full_orted_cmd to cope with this case.
This commit was SVN r21519.
2009-06-24 23:57:32 +00:00
George Bosilca
32f632ad5a
Put back the multi-word command line argument for the RSH PLM.
...
This commit was SVN r21518.
2009-06-24 23:51:53 +00:00
George Bosilca
c0750343b5
Partially revert r21513.
...
Beware of the exception for orte_launch_agent, which is handled separately
in orte_plm_base_setup_orted_cmd.
This commit was SVN r21517.
2009-06-24 23:48:14 +00:00
Ralph Castain
2e98ba3fd0
Complete implementation of regexp launch with static oob ports. Only enabled for SLURM at this time - migration to Torque coming
...
This commit was SVN r21516.
2009-06-24 20:31:26 +00:00
George Bosilca
53e76eed75
Missed an include.
...
This commit was SVN r21515.
2009-06-24 20:19:47 +00:00
George Bosilca
35033aeae9
Allow the PLM RSH to follow the routing table during the spawn, instead of the hard-coded binomial approach.
...
Now, the PLM extracts the children information from the routed component, and
the startup follows the routed topology. As a side effect, we can now launch
using linear or radix topologies, in addition to the previous binomial topology.
This commit was SVN r21514.
2009-06-24 19:56:29 +00:00
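For readers unfamiliar with the three launch topologies named in the commit above, the following Python sketch shows one common way to compute a daemon's children in each. It is illustrative only: the function names are hypothetical, daemons are assumed to be numbered 0..size-1 with the root at 0, and the actual routed components may number or shape their trees differently.

```python
def binomial_children(rank, size):
    """Children of `rank` in a binomial tree rooted at 0.

    A node's children are rank + 2^k for every power of two that is
    strictly greater than rank, as long as the result is below size.
    """
    children = []
    step = 1 << rank.bit_length()  # smallest power of two > rank
    while rank + step < size:
        children.append(rank + step)
        step <<= 1
    return children


def radix_children(rank, size, radix=2):
    """Children of `rank` in a complete radix-ary tree, level-order numbering."""
    first = radix * rank + 1
    return [child for child in range(first, first + radix) if child < size]


def linear_children(rank, size):
    """Children of `rank` in a chain: each node relays to the next one only."""
    return [rank + 1] if rank + 1 < size else []
```

In all three cases every non-root daemon has exactly one parent, so a launcher that walks the children lists starting from rank 0 reaches each daemon once.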
George Bosilca
8cb8f28d9d
When we get a report from an orted about its state, don't use the sender of the message to update the structures, but instead use the information from the URI.
...
The reason is that even the launch report messages can get routed.
Deal with the orted_cmd_line in a single location.
This commit was SVN r21513.
2009-06-24 19:51:52 +00:00
George Bosilca
84a953a2a6
Be less verbose.
...
This commit was SVN r21512.
2009-06-24 19:49:06 +00:00