1
1
Граф коммитов

293 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
51a8b89a83 Treat termination of continuously operating processes as an abort
This commit was SVN r21709.
2009-07-17 22:20:05 +00:00
Ralph Castain
4c1eb040b0 Enable the system to keep functioning even when multiple launches are occurring simultaneously.
This is a bit of a hack, but it does seem to allow the system to work. A better solution is being discussed.

This commit was SVN r21705.
2009-07-17 02:28:47 +00:00
George Bosilca
3e971e61f3 The system headers are supposed to be protected by #ifdef and not by #if.
This commit was SVN r21700.
2009-07-16 18:27:33 +00:00
Ralph Castain
dbac602be5 Add support for the add-host and add-hostfile MPI Info keys to allow Comm_spawn users to add new hosts to those already known by mpirun.
Requires full testing once comm_spawn is fixed (Edgar is working that now).

This commit was SVN r21664.
2009-07-14 14:34:11 +00:00
Ralph Castain
60edbc7220 Fix hetero operations and comm_spawn (to a point).
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.

Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)

This commit was SVN r21655.
2009-07-13 20:03:41 +00:00
Ralph Castain
1b418dd397 Fix segfault in comm_spawn. The underlying problem breaking comm_spawn, however, remains - the change to make modex non-blocking causes the system to fail due to the arch not getting properly set.
Fix for that coming shortly.

This commit was SVN r21646.
2009-07-13 15:13:06 +00:00
Ralph Castain
b97f885c00 Restore the original API to terminate individual processes instead of the entire job. This was originally removed as we didn't at that time know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
Continue work on the resilient mapper, completing support for fault groups.

This commit was SVN r21639.
2009-07-13 02:29:17 +00:00
Ralph Castain
9635db373d Ensure that we properly exit if the executable isn't found
This commit was SVN r21570.
2009-07-01 03:16:13 +00:00
Shiqing Fan
656ec00611 Remove the definition of _WIN32_DCOM. It only enables DCOM when _WIN32_WINNT is less then 0x400 (Win 95, 98), and we are supporting 0x502(Win XP) and above in Open MPI. Thanks George for pointing this out.
This commit was SVN r21555.
2009-06-27 23:36:25 +00:00
Ralph Castain
d0a5468deb Continue gradual cleanup of pointer array addressing.
This commit was SVN r21550.
2009-06-26 22:40:28 +00:00
Ralph Castain
b96a71b62e Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.

This commit was SVN r21545.
2009-06-26 20:54:58 +00:00
Shiqing Fan
b0b11e8465 Clean up things properly.
This commit was SVN r21539.
2009-06-26 15:20:19 +00:00
Shiqing Fan
6ba25951b4 Add a separate function for reading remote registry, so that it could be easily reused.
Add two options for plm process module, i.e. remote_env_prefix for getting OMPI prefix on remote computer by reading its user environment variable (OPENMPI_HOME), and remote_reg_prefix is similar, but it reads the registry on the remote computer. Reading remote env prefix has a higher priority than reading reg prefix, so that user can use their own installation of OMPI.

This commit was SVN r21538.
2009-06-26 13:35:45 +00:00
Shiqing Fan
5b4b8f8899 Get rid of a bunch of compiler warnings on Windows.
This commit was SVN r21536.
2009-06-26 07:51:39 +00:00
Josh Hursey
26363fcc8b Move the end timer to the end of the launch loop for the HNP. This matches the timing in the SLURM component, and makes the timing work when enabling the 'tree' based rsh launch.
This commit was SVN r21528.
2009-06-25 13:28:34 +00:00
Josh Hursey
8705c7edb1 Fix {{{orte_timing}}} for the RSH component of the PLM.
It never set the daemon launch counter before launching.
In the output for 'total job launch time' make the message match that of the SLURM PLM for easier parsing.

This commit was SVN r21527.
2009-06-25 12:37:58 +00:00
Shiqing Fan
07f7fe1a1b Add two CCP params, for user specifying stdout and stderr to files.
This commit was SVN r21521.
2009-06-25 09:03:50 +00:00
George Bosilca
fabebd140f If the daemon doesn't contain orted we're adding the prefix_dir and
bin_base twice. Create a temporary full_orted_cmd to cope with this
case.

This commit was SVN r21519.
2009-06-24 23:57:32 +00:00
George Bosilca
32f632ad5a Put back the multi-word command line argument for the RSH PLM.
This commit was SVN r21518.
2009-06-24 23:51:53 +00:00
George Bosilca
c0750343b5 Partially revert 21513. Beware of the exception on the orte_launch_agent which
is treated apart in orte_plm_base_setup_orted_cmd.

This commit was SVN r21517.
2009-06-24 23:48:14 +00:00
Ralph Castain
2e98ba3fd0 Complete implementation of regexp launch with static oob ports. Only enabled for SLURM at this time - migration to Torque coming
This commit was SVN r21516.
2009-06-24 20:31:26 +00:00
George Bosilca
53e76eed75 Missed an include.
This commit was SVN r21515.
2009-06-24 20:19:47 +00:00
George Bosilca
35033aeae9 Allow the PLM RSH to follow the routing table with the spawn, instead of the hard
coded binomial approach. Now, the PLM extract the children information from the
routed component, and the startup follow the routed topology. As a side effect,
we can now launch using linear or radix topologies, in addition to the previously
binomial topology.

This commit was SVN r21514.
2009-06-24 19:56:29 +00:00
George Bosilca
8cb8f28d9d When we get a report from an orted about its state, don't use the sender of
the message to update the structures, but instead use the information from
the URI. The reason is that even the launch report messages can get routed.

Deal with the orted_cmd_line in a single location.

This commit was SVN r21513.
2009-06-24 19:51:52 +00:00
Shiqing Fan
6750657a97 Change the way that we detect cluster nodes status and clean up the CCP components.
Normally, any non-windows nodes should return  NodeStatus_Unreachable, but that's not true. We have noticed that Scientific Linux cluster nodes, which are in the same subnet as the Windows nodes, return NodeStatus_PendingApproval value, and the RAS CCP takes them as Windows nodes. So let's just use the "Ready" nodes.

This commit was SVN r21510.
2009-06-24 19:15:15 +00:00
Shiqing Fan
9bdf258d3c Add an initial support for remote process launch using WMI (Windows Management Instrumentation) in the environment where Windows 2003/2008 server and HPC pack are not installed. Users in a domain will be able to use domain computers as a cluster with their domain Credentials, the computers may run Windows XP or higher, e.g. Vista, and/or Windows 7 (later), with correct settings and permissions for WMI and Windows firewall.
This commit was SVN r21508.
2009-06-24 18:59:24 +00:00
George Bosilca
addaf7aaf8 Repair the tree spawn. The problem seems to come from the fact
that now the HNP send the messages using the routed component. In the case
of tree spawn, when a intermediary node spawn a child it doesn't know how
to forward a message to it, so when the node-map message is coming from
the HNP (as there is nothing yet in the contact/routing table) the message
is sent back the way it came. As a result the node-map message keeps jumping
between the HNP and the first level orteds.

The solution is to add a new option to the children orte_parent_uri, which
is only set when the orted is _not_ directly spawned by the HNP. When this
option is present on the argument list, the orted will add the parent to
its routing, and force the parent to update his routes (by sending the URI).
With this approach, the routing tree is build in same time as the processes
are spawned, and all messages from the HNP can be routed to the leaves.

However, this is far from an optimal solution. Right now, this so called tree
spawn, only spawn the children in a tree without doing anything about the
"connect back to the HNP" step. The HNP is flooded with reports from all the
orted. The total number of messages is higher than in the non tree startup
scheme, so we do not expect this approach to be scalable in the current
incarnation. A complete overhaul of the tree startup is required in order
improve the scalability. Stay tuned!

This commit was SVN r21504.
2009-06-23 22:10:25 +00:00
Ralph Castain
0ba845fed2 Continue development of regular expression support by implementing it for slurm launches. Works for both initial (cmd line and non-cmd line) and comm_spawn launch.
Additional work required to fully enable static port support when using cmd line regular expression launch system.

This commit was SVN r21502.
2009-06-23 20:25:38 +00:00
George Bosilca
7339530061 Remove the prototype of a non-existant function.
This commit was SVN r21500.
2009-06-23 19:50:23 +00:00
Ralph Castain
7a802a9d3a Move the plm designation to the argv from the env to support those systems not setup to pass env via rsh.
This commit was SVN r21495.
2009-06-22 18:08:45 +00:00
Ralph Castain
771ce035a5 Complete implementation of regular expression generator and parser - now handles leading zero's and suffix in node names.
This commit was SVN r21468.
2009-06-18 04:36:00 +00:00
Ralph Castain
70b8c89b44 Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it".
Also complete implementation to printout app_context objects so we see all the fields.

This commit was SVN r21408.
2009-06-10 19:01:08 +00:00
George Bosilca
77a6f27d44 Update the call to orte_plm_base_create_jobid based on the new interface.
This commit was SVN r21393.
2009-06-08 23:53:53 +00:00
Ralph Castain
0336460b0a Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes.
This commit was SVN r21385.
2009-06-06 01:08:47 +00:00
Ralph Castain
30a357bd8d Provide a "progress meter" for launch that outputs progress as we are launching, especially on large jobs. Also, provide a timeout mechanism so that we cleanly abort if we don't get a response from the next daemon in a specified time.
This commit was SVN r21359.
2009-06-02 23:52:02 +00:00
Ralph Castain
a95731fc68 Minor update to let apps set their own component selections if desired, while preserving slave behavior
This commit was SVN r21332.
2009-05-30 20:42:23 +00:00
Ralph Castain
4e0223a638 Add the ability to directly launch procs via rsh/ssh. Collect common functions in plm/base. Create a new global param to set assume_same_shell, alias'd back to plm_rsh_assume_same_shell (not deprecated).
This commit was SVN r21328.
2009-05-30 01:10:25 +00:00
Ralph Castain
f139cfd28a Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to callback to mpirun during startup, thus elimating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job).
Add a new tm ess module that exploits this capability.

Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function.

Additional fixes included:

1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with!

2. fix a duplicate free when attempting to execute a non-existent app

3. cleanup an typo in the comm utilities 

4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps

Been tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info.

This commit was SVN r21248.
2009-05-16 04:15:55 +00:00
Rainer Keller
b3daf7184a - Fix Coverity CID 1272:
Need to release src in case of error

 - Fix Coverity CID 1271:
   basename is allocated by opal_basename and later overwritten

This commit was SVN r21230.
2009-05-14 00:28:35 +00:00
Ralph Castain
c45ff0d59f Take the next step towards fully utilizing static ports for the daemons to eliminate the initial "phone home" to mpirun by modifying the orted termination procedure to eliminate the need for a full barrier-like operation. Instead, we add a "onesided" barrier to the grpcomm framework API that releases the orted once it has completed its own contribution to the barrier - i.e., the orteds now exit as the "ack" message rolls up towards mpirun instead of sending the "ack" directly to mpirun.
This causes the orteds in the routing tree to remain alive until all termination "acks" from orteds below them have passed through. Thus, if we use static ports, we no longer require a direct orted-to-mpirun connection.

Also modify the binomial routed module so it conforms to what all the other routed modules do and have all messages pass along the routing tree instead of short-circuiting between orteds. This further reduces the number of ports being opened on backend nodes.

This commit was SVN r21203.
2009-05-11 14:11:44 +00:00
Ralph Castain
c6f0499720 Some cleanups required from last night's commits to resolve some race conditions and ensure we cleanup properly. Also, remove some debug output that was unintentionally left "on" by default.
This commit was SVN r21202.
2009-05-11 14:03:07 +00:00
Ralph Castain
69cd4e9d8a Continue cleanup of opal_pointer_array references to replace direct addressing with opal_pointer_array_get_item.
Also, be much more careful and complete in recovering resources of terminated jobs.

This commit was SVN r21199.
2009-05-11 03:27:57 +00:00
Greg Koenig
60485ff95f This is a very large change to rename several #define values from
OMPI_* to OPAL_*.  This allows opal layer to be used more independent
from the whole of ompi.

NOTE: 9 "svn mv" operations immediately follow this commit.

This commit was SVN r21180.
2009-05-06 20:11:28 +00:00
Shiqing Fan
cd565923d3 Completely remove ltdl support for Windows build.
This commit was SVN r21170.
2009-05-05 18:59:13 +00:00
Josh Hursey
8b8bee04d6 It seems that some of the patches were missed in r21131. :(
This patch contains the following items:
 * Fix the flag passed to open() for the read side of the named pipe between the local and app coordinator. There is a race condition when using O_RDWR on a named pipe (not sure how that bug got in there in the first place).
 * Adjust control in the C/R thread timing
 * Clarify return code in BLCR component
 * Allow the user to adjust the max wait time for the named pipes in the FileM local coordinator by using the MCA parameter "snapc_full_max_wait_time" (Default: 20 seconds)
 * If the application terminates while there are active FileM operations, force mpirun to wait on these operations to complete.
 * Allow the user to set the local copy command (Default: cp) via MCA parameter "filem_rsh_cp"
 * Implement the ability to throttle the number of outgoing connections in FileM. At larger scales this type of explicit throttling helps prevent overwhelming the HNP machine. Default: 10, set via MCA parameter: {{{filem_rsh_max_outgoing}}}

This commit was SVN r21167.

The following SVN revision numbers were found above:
  r21131 --> open-mpi/ompi@0deb009225
2009-05-05 16:45:49 +00:00
Ralph Castain
eac027e6bc In reviewing last night's MTT of trunk, discovered that IU was still testing the old plm slurmd module. This was created solely for debugging proposed changes to the slurm module which have long since been integrated across - and thus, the slurmd module hasn't been maintained for quite some time.
Which explains why the tests using that module all failed...sigh.

My bad for not cleaning it out a long time ago.

This commit was SVN r21145.
2009-05-04 11:25:32 +00:00
Ralph Castain
4be24521aa Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans.
Minor mod to the ompi layer to use the new #define's - just one-line name replacements.

This commit was SVN r21144.
2009-05-04 11:07:40 +00:00
Shiqing Fan
5b76f583e4 No RSH support on Windows.
This commit was SVN r21120.
2009-04-30 09:01:20 +00:00
Ralph Castain
21a66c0eff Fix a couple of minor errors in slave launch support
This commit was SVN r21118.
2009-04-30 02:54:25 +00:00
Ralph Castain
1fb92e2766 Cleanup the slave launch - completed debugging. w00t!
This commit was SVN r21113.
2009-04-29 14:37:33 +00:00
Ralph Castain
e003e0447a Ensure we terminate cleanly when we do not launch
This commit was SVN r21110.
2009-04-29 14:06:12 +00:00
Ralph Castain
36671eb962 Replace missing header
This commit was SVN r21109.
2009-04-29 13:50:11 +00:00
Rainer Keller
71052deebb - Get rid of incompatible implicit declaration
Need #include string.h

This commit was SVN r21104.
2009-04-29 08:11:37 +00:00
Rainer Keller
221fb9dbca ... Delayed due to notifier commits earlier this day ...
- Delete unnecessary header files using
   contrib/check_unnecessary_headers.sh after applying
   patches, that include headers, being "lost" due to
   inclusion in one of the now deleted headers...

   In total 817 files are touched.
   In ompi/mpi/c/ header files are moved up into the actual c-file,
   where necessary (these are the only additional #include),
   otherwise it is only deletions of #include (apart from the above
   additions required due to notifier...)

 - To get different MCAs (OpenIB, TM, ALPS), an earlier version was
   successfully compiled (yesterday) on:
   Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled
   Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled
   Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled

This commit was SVN r21096.
2009-04-29 01:32:14 +00:00
Rainer Keller
6c1cce8761 - For the upcoming header cleanup commit,
several header files (previously included by header-files)
   now have to be moved "upward".
   This is mainly system headers such as string.h, stdio.h and for
   networking, but also some orte headers.

This commit was SVN r21095.
2009-04-29 00:49:23 +00:00
Ralph Castain
f3cfe32b5d Update the slave launch and cleanup procedures. Track what files have been moved to the slave node to avoid attempting to copy them multiple times on top of each other. Cleanup any pre-positioned files, kill any lingering apps, and cleanup the session directory area upon termination of the daemon.
This commit was SVN r21094.
2009-04-29 00:11:19 +00:00
Shiqing Fan
4067d739b4 Add a new CMake module for finding Windows CCP libraries, so that we can remove the library files in the source tree.
This commit was SVN r21071.
2009-04-24 16:51:31 +00:00
Shiqing Fan
d8f61695cd Remove a few .ompi_ignore, and add configure scripts, so that the source files of these components will be put into the tarball.
This commit was SVN r21070.
2009-04-24 16:49:01 +00:00
Shiqing Fan
3d4e0472d6 Add windows support files into the tarball, including .windows, CMakeLists.txt files, and CMake modules. Thanks to Jeff for testing it on Linux.
This commit was SVN r21069.
2009-04-24 16:39:33 +00:00
Ralph Castain
8b0a470543 Continue work to cleanup user options for slave launch
This commit was SVN r21003.
2009-04-14 20:05:51 +00:00
Ralph Castain
a952dca062 Fix a bug where we created the correct path to the file, but didn't use it
This commit was SVN r20990.
2009-04-14 14:17:43 +00:00
Ralph Castain
d0b50a2b9b Eliminate an annoying "not found" message when a job abnormally terminates. In this case, we can get a race condition where the job object has been removed, but updates are continuing to flow into the system.
This commit was SVN r20857.
2009-03-24 18:06:49 +00:00
George Bosilca
faca1aeeb9 Support for password structures without a specified shell or a NULL/empty shell.
This commit was SVN r20853.
2009-03-24 13:26:57 +00:00
Rainer Keller
6f808d9b05 Preparation work for another commit (after RFC):
- This patch solely _adds_ required headers and is rather localized
   The next patch (after RFC) heavily removes headers (based on script)
 - ompi/communicator/communicator.h: For sources that use
   ompi_mpi_comm_world, don't require them to include "mpi.h"
 - ompi/debuggers/ompi_common_dll.c: mca_topo_base_comm_1_0_0_t needs
   #include "ompi/mca/topo/topo.h"
 - ompi/errhandler/errhandler_predefined.h:
   ompi/communicator/communicator.h depends on this header file!
   To prevent recursion just have fwd declarations.
   #include "ompi/types.h" for fwd declarations of the main structs.
 - ompi/mca/btl/btl.h: #include "opal/types.h" for ompi_ptr_t 
 - ompi/mca/mpool/base/mpool_base_tree.c: We use ompi_free_list_t and
   ompi_rb_tree_t, so have the proper classes
 - ompi/mca/op/op.h:
   Op is pretty self-contained: Nobody up to now has done
   #include "opal/class/opal_object.h"
 - ompi/mca/osc/pt2pt/osc_pt2pt_replyreq.h:
   #include "opal/types.h" for ompi_ptr_t 
 - ompi/mca/pml/base/base.h:
   We use opal_lists  
 - ompi/mca/pml/dr/pml_dr_vfrag.h:
   #include "opal/types.h" for ompi_ptr_t
 - ompi/mca/pml/ob1/pml_ob1_hdr.h:
   #include "ompi/mca/btl/btl.h" for mca_btl_base_segment_t
 - opal/dss/dss_unpack.c:
   #include "opal/types.h"
 - opal/mca/base/base.h:
   #include "opal/util/cmd_line.h" for opal_cmd_line_t
 - orte/mca/oob/tcp/oob_tcp.c:
   #include "opal/types.h" for opal_socklen_t
 - orte/mca/oob/tcp/oob_tcp.h:
   #include "opal/threads/threads.h" for opal_thread_t
 - orte/mca/oob/tcp/oob_tcp_msg.c:
   #include "opal/types.h" 
 - orte/mca/oob/tcp/oob_tcp_peer.c:
   #include "opal/types.h"  for opal_socklen_t
 - orte/mca/oob/tcp/oob_tcp_send.c:
   #include "opal/types.h" 
 - orte/mca/plm/base/plm_base_proxy.c:
   #include "orte/util/name_fns.h" for ORTE_NAME_PRINT
 - orte/mca/rml/base/rml_base_receive.c:
   #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE
 - orte/mca/rml/oob/rml_oob_recv.c:
   #include "opal/types.h" for ompi_iov_base_ptr_t
 - orte/mca/rml/oob/rml_oob_send.c:
   #include "opal/types.h" for ompi_iov_base_ptr_t
 - orte/runtime/orte_data_server.c
   #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE
 - orte/runtime/orte_globals.h:
   #include "orte/util/name_fns.h" for ORTE_NAME_PRINT

 Tested on Linux/x86-64

This commit was SVN r20817.
2009-03-17 21:34:30 +00:00
Jeff Squyres
b5c38f74b0 Always tie the child stdin to /dev/null.
This commit was SVN r20796.
2009-03-17 03:17:50 +00:00
Jeff Squyres
27bacbee3c Per discussion on #1833, we should ''always'' dup /dev/null into ssh's
stdin.

This commit was SVN r20790.
2009-03-16 19:12:48 +00:00
George Bosilca
02bee12de8 Small cleanup.
This commit was SVN r20781.
2009-03-14 23:11:24 +00:00
Brian Barrett
716b505789 Fix for #1832. The stdin for the forked ssh wasn't being closed, so it was
"eating" all the stdin, leaving nothing for mpirun to forward on to the child
process

This commit was SVN r20776.
2009-03-13 22:34:56 +00:00
Rainer Keller
d8cf4c0fec - Get pgcc on XT to complain less:
In case we use memcmp, strlen, strup and friends include <string.h>
   Also several constants.h are not included directly
 - Let's have mca_topo_base_cart_create  return ompi-errors in
   ompi/mca/topo/base/topo_base_cart_create.c

This commit was SVN r20773.
2009-03-13 02:10:32 +00:00
Shiqing Fan
ddc82f3831 Correct the output global variables.
This commit was SVN r20745.
2009-03-06 15:31:12 +00:00
Rainer Keller
ec0ed48718 - Revert r20739
This commit was SVN r20742.

The following SVN revision numbers were found above:
  r20739 --> open-mpi/ompi@781caee0b6
2009-03-05 21:56:03 +00:00
Rainer Keller
a94438343b - Revert r20740
This commit was SVN r20741.

The following SVN revision numbers were found above:
  r20740 --> open-mpi/ompi@2a70618a77
2009-03-05 21:50:47 +00:00
Rainer Keller
2a70618a77 - Second patch, as discussed in Louisville.
Replace short macros in orte/util/name_fns.h
   to the actual fct. call.

 - Compiles on linux/x86-64

This commit was SVN r20740.
2009-03-05 21:14:18 +00:00
Rainer Keller
781caee0b6 - First of two or three patches, in orte/util/proc_info.h:
Adapt orte_process_info to orte_proc_info, and
   change orte_proc_info() to orte_proc_info_init().
 - Compiled on linux-x86-64
 - Discussed with Ralph

This commit was SVN r20739.
2009-03-05 20:36:44 +00:00
Ralph Castain
f11931306a Modify the accounting system to recycle jobids. Properly recover resources from nodes and jobs upon completion. Adjustments in several places were required to deal with sparsely populated job, node, and proc arrays as a result of this change.
Correct an error wrt how jobids were being computed. Needed to ensure that the job family field was not overrun as we increment jobids for comm_spawn.

Update the slurm plm module so it uses the new slurm termination procedure (brings trunk back into alignment with 1.3 branch).

Update the slurmd ess component so it doesn't get selected if we are running a singleton inside of a slurm allocation.

Cleanup HNP init by moving some code that had been in orte_globals.c for historical reasons into the ess hnp module, and removing the call to that code from the ess_base_std_prolog


NOTE: this change allows orte to support an infinite aggregate number of comm_spawn's, with up to 64k being alive at any one instant. HOWEVER, the MPI layer currently does -not- support re-use of jobids. I did some prototype coding to revise the ompi_proc_t structures, but the BTLs are caching their own data, and there was no readily apparent way to update it. Thus, attempts to spawn more than the 64k limit will abort to avoid causing the MPI layer to hang.

This commit was SVN r20700.
2009-03-03 16:39:13 +00:00
Rainer Keller
4c0e8e1e69 - Header orte/mca/oob/base/base.h is probably the wrong one to include
anyhow -- if oob functionality is neededm then orte/mca/oob/oob.h

   Nevertheless compiles fine with -Wimplicit-function-declaration   

This commit was SVN r20641.
2009-02-26 04:20:03 +00:00
Rainer Keller
04567d3af0 - Header orte/mca/errmgr/errmgr.h is not needed.
Once again compiles fine with -Wimplicit-function-declaration   

This commit was SVN r20640.
2009-02-26 04:05:30 +00:00
Rainer Keller
96e1b9b747 - Header orte/mca/rml/rml.h is not needed if no occurence of orte_rml
or ORTE_RML.
   As the others compiles fine with -Wimplicit-function-declaration

This commit was SVN r20639.
2009-02-26 03:52:31 +00:00
Rainer Keller
b356e90fa1 - Get rid of include orte/util/proc_info.h, if not needed
Only proc_info.h-internal include file is opal/dss/dss_types.h
 - In one case (orte/util/hnp_contact.c) had to add proc_info.h again.
 - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
   works fine, no errors.

   Again, let's have MTT the last word.

This commit was SVN r20631.
2009-02-25 03:38:00 +00:00
Rainer Keller
02599446d0 - Occurences of ORTE_PROC_MY_NAME require orte/runtime/orte_globals.h
This commit was SVN r20607.
2009-02-20 03:16:13 +00:00
Ralph Castain
76fc406b08 Modify envars passed to support new proc_info and hier expectations
This commit was SVN r20600.
2009-02-19 21:36:30 +00:00
Rolf vandeVaart
515b99b357 Under SGE, the orted should not daemonize by default.
Also create mca parameter to force daemonization (previous
behavior) which might be needed on larger clusters or
to make use of the -notify flag with qsub.

This fixes trac:1783.

This commit was SVN r20582.

The following Trac tickets were found above:
  Ticket 1783 --> https://svn.open-mpi.org/trac/ompi/ticket/1783
2009-02-18 18:02:38 +00:00
Shiqing Fan
3f6c64f2e3 Include a missing header,which was implicitly included and removed.
This commit was SVN r20563.
2009-02-16 12:38:38 +00:00
Rainer Keller
d81443cc5a - On the way to get the BTLs split out and lessen dependency on orte:
Often, orte/util/show_help.h is included, although no functionality
   is required -- instead, most often opal_output.h, or               
   orte/mca/rml/rml_types.h                                           
   Please see orte_show_help_replacement.sh commited next.            

 - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
   actually showed two *missing* #include "orte/util/show_help.h"     
   in orte/mca/odls/base/odls_base_default_fns.c and                  
   in orte/tools/orte-top/orte-top.c                                  
   Manually added these.                                              

   Let's have MTT the last word.

This commit was SVN r20557.
2009-02-14 02:26:12 +00:00
Rolf vandeVaart
ce97c27a53 Make sure we create a valid parth argument for execve.
This gets SGE working in the trunk again.

This commit was SVN r20531.
2009-02-12 18:27:40 +00:00
Ralph Castain
62dd763a8f Add ability for local slave spawns to pre-position supporting files. Update comm_spawn and comm_spawn_multiple man pages to cover new info_keys.
This commit was SVN r20527.
2009-02-12 15:56:45 +00:00
Ralph Castain
816ef9e0a3 Ensure that the rsh_agent_argv is properly initialized when assembling the SGE qrsh command
This commit was SVN r20518.
2009-02-11 18:48:44 +00:00
Rolf vandeVaart
74b2001d61 Fix builds on Solaris. Missing errno.h file.
This commit was SVN r20516.
2009-02-11 15:08:07 +00:00
Ralph Castain
e76b68e554 Replace a missing line so that the TM libs are included in dynamic builds
This commit was SVN r20514.
2009-02-11 14:40:11 +00:00
Ralph Castain
390ce219f8 Enable the slurmd plm to trigger an mpirun exit if no other daemons are in the system
This commit was SVN r20507.
2009-02-10 19:22:57 +00:00
Ralph Castain
6a7fa79a09 Cleanup debug by converting to show_help, little more work to cleanup local vs remote ops when no preload is specified
This commit was SVN r20506.
2009-02-10 19:11:24 +00:00
Ralph Castain
d1b5afd9ea If we don't pre-position the binaries, correctly setup the ssh command to execute the bootproxy
This commit was SVN r20501.
2009-02-10 18:27:10 +00:00
Shiqing Fan
2f1461419c Add a new feature for checking mca subdirectories, i.e. detecting if there is an exclude file list which indicates the files that shouldn't be added to the source list. By default, the CMake build system will simply add all source files in the required sub folders, without knowing which files have to be excluded. The first use of it is in plm/base/.windows.
And clean up the nested variable names, in order to make it readable.

This commit was SVN r20498.
2009-02-10 17:20:13 +00:00
Ralph Castain
42df4b2102 Enable the slurmd plm module for testing - only selected if specified
This commit was SVN r20495.
2009-02-09 21:16:24 +00:00
Ralph Castain
f0af389910 Enable comm_spawn of slave processes, currently only active for the rsh, slurm, and tm environments. Establish support for local rsh environments in the plm/base so that rsh of local slaves can be done by any environment that supports it. Create new orte_rsh_agent param so users can specify rsh agent from outside of rsh plm, and sym link that to the old plm_rsh_agent and pls_rsh_agent options.
Modify the orte-bootproxy to pass prefix for the remote slave to support hetero/hybrid scenarios

This commit was SVN r20492.
2009-02-09 20:44:44 +00:00
Ralph Castain
5bfd1f3fd0 Ensure we have a correct, non-zero exit status when daemons or procs abort or fail to launch
This commit was SVN r20478.
2009-02-07 00:57:17 +00:00
Ralph Castain
8924e00e4c Ensure we don't segfault if we don't know which proc failed
This commit was SVN r20474.
2009-02-06 20:04:36 +00:00
Ralph Castain
13749673ed Enable spawn of local slave processes - plm module implementation to follow
This commit was SVN r20466.
2009-02-06 15:31:33 +00:00
Ralph Castain
a6f9c1f2b1 Allocate the slots for use in the xgrid plm
This commit was SVN r20460.
2009-02-06 00:55:14 +00:00
Shiqing Fan
a20254c8a5 A few type casts, making the MS compiler silent.
This commit was SVN r20449.
2009-02-05 16:37:44 +00:00