1
1

2351 Коммитов

Автор SHA1 Сообщение Дата
Josh Hursey
dcedc99a6a It seems that we do no really want to set this particular variable to NULL. (causes badness in ess)
This commit was SVN r21673.
2009-07-14 19:01:43 +00:00
Josh Hursey
b03923cd72 Make sure to set these values to NULL, so that if we release an object (knowingly or not) twice then we do not fault by referencing unallocated memory.
This commit was SVN r21672.
2009-07-14 18:56:49 +00:00
Ralph Castain
dbac602be5 Add support for the add-host and add-hostfile MPI Info keys to allow Comm_spawn users to add new hosts to those already known by mpirun.
Requires full testing once comm_spawn is fixed (Edgar is working that now).

This commit was SVN r21664.
2009-07-14 14:34:11 +00:00
Ralph Castain
60edbc7220 Fix hetero operations and comm_spawn (to a point).
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.

Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)

This commit was SVN r21655.
2009-07-13 20:03:41 +00:00
George Bosilca
a934b9d975 Add the Open MPI specific part based on a patch from Manuel. Add the
sparc and alpha. A manpage patch is also included. This partially fixes
ticket #1973.

This commit was SVN r21654.
2009-07-13 20:01:12 +00:00
Ralph Castain
1b418dd397 Fix segfault in comm_spawn. The underlying problem breaking comm_spawn, however, remains - the change to make modex non-blocking causes the system to fail due to the arch not getting properly set.
Fix for that coming shortly.

This commit was SVN r21646.
2009-07-13 15:13:06 +00:00
Ralph Castain
1f147cf9c6 Don't do an automatic "phone home" if a regex was given to the orted.
This commit was SVN r21645.
2009-07-13 14:50:01 +00:00
Ralph Castain
235db33e83 Some more pointer array addressing cleanup
This commit was SVN r21644.
2009-07-13 14:49:20 +00:00
Ralph Castain
b97f885c00 Restore the original API to terminate individual processes instead of the entire job. This was originally removed as we didn't at that time know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
Continue work on the resilient mapper, completing support for fault groups.

This commit was SVN r21639.
2009-07-13 02:29:17 +00:00
Ralph Castain
50bd635200 Also require that the routed framework be initialized before attempting to use orte_show_help
This commit was SVN r21638.
2009-07-12 10:50:14 +00:00
Ralph Castain
e30826c6e1 Quiet some compiler warnings
This commit was SVN r21591.
2009-07-02 17:48:36 +00:00
Shiqing Fan
0b56a8a4d5 Enable IPv6 on Windows by default, and fix two type casts for IPv6 operations.
This commit was SVN r21586.
2009-07-02 14:41:03 +00:00
Ralph Castain
dd5e195a7d Don't treat the HNP node entry separately - this was just a holdover from the days when we didn't have the regex generator.
Ensure we get an accurate count of the number of daemons in the system.

This commit was SVN r21582.
2009-07-01 20:46:05 +00:00
Ralph Castain
4adb3ed80f Print out a more meaningful and correct error message
This commit was SVN r21581.
2009-07-01 20:16:15 +00:00
Ralph Castain
f832352b45 Clean up some compiler warnings
This commit was SVN r21577.
2009-07-01 16:51:11 +00:00
Ralph Castain
bc0fe3c6da Add some more tests for parallel IO that have caused problems in the past.
Add a README that explains how to run the ziatest for launch timing

This commit was SVN r21576.
2009-07-01 14:47:14 +00:00
Ralph Castain
9635db373d Ensure that we properly exit if the executable isn't found
This commit was SVN r21570.
2009-07-01 03:16:13 +00:00
Ralph Castain
2fbdea0273 Add a test for loop over bcast
This commit was SVN r21560.
2009-06-29 17:06:19 +00:00
Josh Hursey
91c4869cdc Use {{{ORTE_*_VERSION}}} instead of {{{OMPI_*_VERSION}}} in orte
This commit was SVN r21558.
2009-06-29 12:53:23 +00:00
Lenny Verkhovsky
e03807a3d1 small patch to extend current rankfile syntax to be compliant with orte_hosts syntax
making it possible to claim relative hosts from the hostfile/scheduler
by using +n# hostname, where  0 <= # < np
ex:
cat ~/work/svn/hpc/dev/test/Rankfile/rankfile
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=2
rank 3=+n1 slot=1

This commit was SVN r21557.
2009-06-28 11:20:56 +00:00
Shiqing Fan
656ec00611 Remove the definition of _WIN32_DCOM. It only enables DCOM when _WIN32_WINNT is less then 0x400 (Win 95, 98), and we are supporting 0x502(Win XP) and above in Open MPI. Thanks George for pointing this out.
This commit was SVN r21555.
2009-06-27 23:36:25 +00:00
Ralph Castain
d0a5468deb Continue gradual cleanup of pointer array addressing.
This commit was SVN r21550.
2009-06-26 22:40:28 +00:00
Ralph Castain
8ccc47b152 Don't wakeup mpirun too early - even though we ordered an abort, we have to wait for the procs/daemons to properly terminate, and give the kill_procs command a chance to get out.
This commit was SVN r21549.
2009-06-26 22:40:00 +00:00
Ralph Castain
2b4f051b7f Cleanup some indexing bugs so that shared memory can function
This commit was SVN r21548.
2009-06-26 22:07:25 +00:00
Ralph Castain
0a92fe3739 Pull the HNP node from the right index
This commit was SVN r21547.
2009-06-26 21:43:09 +00:00
Ralph Castain
b96a71b62e Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.

This commit was SVN r21545.
2009-06-26 20:54:58 +00:00
George Bosilca
f06198a999 BAD grpcomm now has the ability to execute the modex offline. The MPI process
prepare the send buffer, and post the collective order to the local daemon. It
then register the callback and return fromthe modex exchange. It will only 
wait for this modex completion when the modex_recv is called. Meanwhile, the
daemon will do the allgather.

This commit was SVN r21543.
2009-06-26 20:32:31 +00:00
George Bosilca
6239e714a6 Extract the modex unpacking in it's own function.
This commit was SVN r21542.
2009-06-26 20:29:54 +00:00
George Bosilca
760df7fcb9 Get rid of the static buffer. Instead use directly the user supplied one.
This commit was SVN r21541.
2009-06-26 19:21:27 +00:00
Shiqing Fan
b0b11e8465 Clean up things properly.
This commit was SVN r21539.
2009-06-26 15:20:19 +00:00
Shiqing Fan
6ba25951b4 Add a separate function for reading remote registry, so that it could be easily reused.
Add two options for plm process module, i.e. remote_env_prefix for getting OMPI prefix on remote computer by reading its user environment variable (OPENMPI_HOME), and remote_reg_prefix is similar, but it reads the registry on the remote computer. Reading remote env prefix has a higher priority than reading reg prefix, so that user can use their own installation of OMPI.

This commit was SVN r21538.
2009-06-26 13:35:45 +00:00
Shiqing Fan
5b4b8f8899 Get rid of a bunch of compiler warnings on Windows.
This commit was SVN r21536.
2009-06-26 07:51:39 +00:00
Lenny Verkhovsky
efa800efea removed orphan files in rankfile mapper
This commit was SVN r21532.
2009-06-25 17:14:10 +00:00
Ralph Castain
863e57700e Cleanup use of pointer arrays - thanks to Lenny for pointing it out.
This commit was SVN r21529.
2009-06-25 14:08:36 +00:00
Josh Hursey
26363fcc8b Move the end timer to the end of the launch loop for the HNP. This matches the timing in the SLURM component, and makes the timing work when enabling the 'tree' based rsh launch.
This commit was SVN r21528.
2009-06-25 13:28:34 +00:00
Josh Hursey
8705c7edb1 Fix {{{orte_timing}}} for the RSH component of the PLM.
It never set the daemon launch counter before launching.
In the output for 'total job launch time' make the message match that of the SLURM PLM for easier parsing.

This commit was SVN r21527.
2009-06-25 12:37:58 +00:00
Shiqing Fan
07f7fe1a1b Add two CCP params, for user specifying stdout and stderr to files.
This commit was SVN r21521.
2009-06-25 09:03:50 +00:00
Ralph Castain
00fb79567f Shift some setup items from orterun to the ess/hnp module so that any HNP will perform them.
This commit was SVN r21520.
2009-06-25 03:00:53 +00:00
George Bosilca
fabebd140f If the daemon doesn't contain orted we're adding the prefix_dir and
bin_base twice. Create a temporary full_orted_cmd to cope with this
case.

This commit was SVN r21519.
2009-06-24 23:57:32 +00:00
George Bosilca
32f632ad5a Put back the multi-word command line argument for the RSH PLM.
This commit was SVN r21518.
2009-06-24 23:51:53 +00:00
George Bosilca
c0750343b5 Partially revert 21513. Beware of the exception on the orte_launch_agent which
is treated apart in orte_plm_base_setup_orted_cmd.

This commit was SVN r21517.
2009-06-24 23:48:14 +00:00
Ralph Castain
2e98ba3fd0 Complete implementation of regexp launch with static oob ports. Only enabled for SLURM at this time - migration to Torque coming
This commit was SVN r21516.
2009-06-24 20:31:26 +00:00
George Bosilca
53e76eed75 Missed an include.
This commit was SVN r21515.
2009-06-24 20:19:47 +00:00
George Bosilca
35033aeae9 Allow the PLM RSH to follow the routing table with the spawn, instead of the hard
coded binomial approach. Now, the PLM extract the children information from the
routed component, and the startup follow the routed topology. As a side effect,
we can now launch using linear or radix topologies, in addition to the previously
binomial topology.

This commit was SVN r21514.
2009-06-24 19:56:29 +00:00
George Bosilca
8cb8f28d9d When we get a report from an orted about its state, don't use the sender of
the message to update the structures, but instead use the information from
the URI. The reason is that even the launch report messages can get routed.

Deal with the orted_cmd_line in a single location.

This commit was SVN r21513.
2009-06-24 19:51:52 +00:00
George Bosilca
84a953a2a6 Be less verbose.
This commit was SVN r21512.
2009-06-24 19:49:06 +00:00
George Bosilca
07954c388a Reactivate the orte_daemon_spin option.
Reorder the sends with the contact info, first to our parent and then
to the HNP to avoid a possible race condition.

This commit was SVN r21511.
2009-06-24 19:30:34 +00:00
Shiqing Fan
6750657a97 Change the way that we detect cluster nodes status and clean up the CCP components.
Normally, any non-windows nodes should return  NodeStatus_Unreachable, but that's not true. We have noticed that Scientific Linux cluster nodes, which are in the same subnet as the Windows nodes, return NodeStatus_PendingApproval value, and the RAS CCP takes them as Windows nodes. So let's just use the "Ready" nodes.

This commit was SVN r21510.
2009-06-24 19:15:15 +00:00
Shiqing Fan
9bdf258d3c Add an initial support for remote process launch using WMI (Windows Management Instrumentation) in the environment where Windows 2003/2008 server and HPC pack are not installed. Users in a domain will be able to use domain computers as a cluster with their domain Credentials, the computers may run Windows XP or higher, e.g. Vista, and/or Windows 7 (later), with correct settings and permissions for WMI and Windows firewall.
This commit was SVN r21508.
2009-06-24 18:59:24 +00:00
Ralph Castain
51ee170f75 Update the pidmap decode logic to handle pidmap updates for restarted processes
This commit was SVN r21506.
2009-06-24 03:05:04 +00:00