openmpi

Author	SHA1	Message	Date
Ralph Castain	9635db373d	Ensure that we properly exit if the executable isn't found This commit was SVN r21570.	2009-07-01 03:16:13 +00:00
Josh Hursey	91c4869cdc	Use {{{ORTE__VERSION}}} instead of {{{OMPI__VERSION}}} in orte This commit was SVN r21558.	2009-06-29 12:53:23 +00:00
Lenny Verkhovsky	e03807a3d1	Small patch to extend the current rankfile syntax to be compliant with the orte_hosts syntax, making it possible to claim relative hosts from the hostfile/scheduler by using a +n# hostname, where 0 <= # < np. Example: cat ~/work/svn/hpc/dev/test/Rankfile/rankfile prints: rank 0=+n0 slot=0; rank 1=+n0 slot=1; rank 2=+n1 slot=2; rank 3=+n1 slot=1. This commit was SVN r21557.	2009-06-28 11:20:56 +00:00
Shiqing Fan	656ec00611	Remove the definition of _WIN32_DCOM. It only enables DCOM when _WIN32_WINNT is less than 0x400 (Win 95, 98), and we are supporting 0x502 (Win XP) and above in Open MPI. Thanks George for pointing this out. This commit was SVN r21555.	2009-06-27 23:36:25 +00:00
Ralph Castain	d0a5468deb	Continue gradual cleanup of pointer array addressing. This commit was SVN r21550.	2009-06-26 22:40:28 +00:00
Ralph Castain	8ccc47b152	Don't wakeup mpirun too early - even though we ordered an abort, we have to wait for the procs/daemons to properly terminate, and give the kill_procs command a chance to get out. This commit was SVN r21549.	2009-06-26 22:40:00 +00:00
Ralph Castain	0a92fe3739	Pull the HNP node from the right index This commit was SVN r21547.	2009-06-26 21:43:09 +00:00
Ralph Castain	b96a71b62e	Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected. Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes. This commit was SVN r21545.	2009-06-26 20:54:58 +00:00
George Bosilca	f06198a999	BAD grpcomm now has the ability to execute the modex offline. The MPI process prepares the send buffer and posts the collective order to the local daemon. It then registers the callback and returns from the modex exchange. It will only wait for modex completion when modex_recv is called. Meanwhile, the daemon will do the allgather. This commit was SVN r21543.	2009-06-26 20:32:31 +00:00
George Bosilca	6239e714a6	Extract the modex unpacking into its own function. This commit was SVN r21542.	2009-06-26 20:29:54 +00:00
George Bosilca	760df7fcb9	Get rid of the static buffer. Instead, use the user-supplied one directly. This commit was SVN r21541.	2009-06-26 19:21:27 +00:00
Shiqing Fan	b0b11e8465	Clean up things properly. This commit was SVN r21539.	2009-06-26 15:20:19 +00:00
Shiqing Fan	6ba25951b4	Add a separate function for reading remote registry, so that it could be easily reused. Add two options for plm process module, i.e. remote_env_prefix for getting OMPI prefix on remote computer by reading its user environment variable (OPENMPI_HOME), and remote_reg_prefix is similar, but it reads the registry on the remote computer. Reading remote env prefix has a higher priority than reading reg prefix, so that user can use their own installation of OMPI. This commit was SVN r21538.	2009-06-26 13:35:45 +00:00
Shiqing Fan	5b4b8f8899	Get rid of a bunch of compiler warnings on Windows. This commit was SVN r21536.	2009-06-26 07:51:39 +00:00
Lenny Verkhovsky	efa800efea	removed orphan files in rankfile mapper This commit was SVN r21532.	2009-06-25 17:14:10 +00:00
Josh Hursey	26363fcc8b	Move the end timer to the end of the launch loop for the HNP. This matches the timing in the SLURM component, and makes the timing work when enabling the 'tree' based rsh launch. This commit was SVN r21528.	2009-06-25 13:28:34 +00:00
Josh Hursey	8705c7edb1	Fix {{{orte_timing}}} for the RSH component of the PLM. It never set the daemon launch counter before launching. In the output for 'total job launch time' make the message match that of the SLURM PLM for easier parsing. This commit was SVN r21527.	2009-06-25 12:37:58 +00:00
Shiqing Fan	07f7fe1a1b	Add two CCP params that let the user direct stdout and stderr to files. This commit was SVN r21521.	2009-06-25 09:03:50 +00:00
Ralph Castain	00fb79567f	Shift some setup items from orterun to the ess/hnp module so that any HNP will perform them. This commit was SVN r21520.	2009-06-25 03:00:53 +00:00
George Bosilca	fabebd140f	If the daemon doesn't contain orted we're adding the prefix_dir and bin_base twice. Create a temporary full_orted_cmd to cope with this case. This commit was SVN r21519.	2009-06-24 23:57:32 +00:00
George Bosilca	32f632ad5a	Put back the multi-word command line argument for the RSH PLM. This commit was SVN r21518.	2009-06-24 23:51:53 +00:00
George Bosilca	c0750343b5	Partially revert 21513. Beware of the exception on the orte_launch_agent which is treated apart in orte_plm_base_setup_orted_cmd. This commit was SVN r21517.	2009-06-24 23:48:14 +00:00
Ralph Castain	2e98ba3fd0	Complete implementation of regexp launch with static oob ports. Only enabled for SLURM at this time - migration to Torque coming This commit was SVN r21516.	2009-06-24 20:31:26 +00:00
George Bosilca	53e76eed75	Missed an include. This commit was SVN r21515.	2009-06-24 20:19:47 +00:00
George Bosilca	35033aeae9	Allow the PLM RSH to follow the routing table with the spawn, instead of the hard-coded binomial approach. Now the PLM extracts the children information from the routed component, and the startup follows the routed topology. As a side effect, we can now launch using linear or radix topologies, in addition to the previous binomial topology. This commit was SVN r21514.	2009-06-24 19:56:29 +00:00
George Bosilca	8cb8f28d9d	When we get a report from an orted about its state, don't use the sender of the message to update the structures, but instead use the information from the URI. The reason is that even the launch report messages can get routed. Deal with the orted_cmd_line in a single location. This commit was SVN r21513.	2009-06-24 19:51:52 +00:00
George Bosilca	84a953a2a6	Be less verbose. This commit was SVN r21512.	2009-06-24 19:49:06 +00:00
Shiqing Fan	6750657a97	Change the way that we detect cluster nodes status and clean up the CCP components. Normally, any non-windows nodes should return NodeStatus_Unreachable, but that's not true. We have noticed that Scientific Linux cluster nodes, which are in the same subnet as the Windows nodes, return NodeStatus_PendingApproval value, and the RAS CCP takes them as Windows nodes. So let's just use the "Ready" nodes. This commit was SVN r21510.	2009-06-24 19:15:15 +00:00
Shiqing Fan	9bdf258d3c	Add initial support for remote process launch using WMI (Windows Management Instrumentation) in environments where Windows 2003/2008 server and HPC pack are not installed. Users in a domain will be able to use domain computers as a cluster with their domain credentials; the computers may run Windows XP or higher, e.g. Vista and/or Windows 7 (later), with correct settings and permissions for WMI and Windows firewall. This commit was SVN r21508.	2009-06-24 18:59:24 +00:00
George Bosilca	addaf7aaf8	Repair the tree spawn. The problem seems to come from the fact that the HNP now sends messages using the routed component. In the case of tree spawn, when an intermediary node spawns a child it doesn't know how to forward a message to it, so when the node-map message arrives from the HNP (as there is nothing yet in the contact/routing table) the message is sent back the way it came. As a result the node-map message keeps jumping between the HNP and the first-level orteds. The solution is to add a new option to the children, orte_parent_uri, which is only set when the orted is _not_ directly spawned by the HNP. When this option is present on the argument list, the orted will add the parent to its routing, and force the parent to update its routes (by sending the URI). With this approach, the routing tree is built at the same time as the processes are spawned, and all messages from the HNP can be routed to the leaves. However, this is far from an optimal solution. Right now, this so-called tree spawn only spawns the children in a tree without doing anything about the "connect back to the HNP" step. The HNP is flooded with reports from all the orteds. The total number of messages is higher than in the non-tree startup scheme, so we do not expect this approach to be scalable in its current incarnation. A complete overhaul of the tree startup is required in order to improve scalability. Stay tuned! This commit was SVN r21504.	2009-06-23 22:10:25 +00:00
George Bosilca	6a00481285	We know what a daemon is; there is no need to dig into the nidmap to find it out. This commit was SVN r21503.	2009-06-23 20:43:45 +00:00
Ralph Castain	0ba845fed2	Continue development of regular expression support by implementing it for slurm launches. Works for both initial (cmd line and non-cmd line) and comm_spawn launch. Additional work required to fully enable static port support when using cmd line regular expression launch system. This commit was SVN r21502.	2009-06-23 20:25:38 +00:00
George Bosilca	bca8015b94	Add the proc_get_daemon capability to the bproc launcher. This commit was SVN r21501.	2009-06-23 20:21:55 +00:00
George Bosilca	7339530061	Remove the prototype of a nonexistent function. This commit was SVN r21500.	2009-06-23 19:50:23 +00:00
George Bosilca	225f2b01c9	Don't release uninitialized objects. This commit was SVN r21499.	2009-06-23 19:47:58 +00:00
Ralph Castain	7a802a9d3a	Move the plm designation to the argv from the env to support those systems not setup to pass env via rsh. This commit was SVN r21495.	2009-06-22 18:08:45 +00:00
Ralph Castain	c199dbb241	Revert r21480 - we already did open/select the PLM on non-HNP daemons. This commit broke slave launches on Torque and SLURM as it caused the PLM to be open/selected twice. The open/select of the PLM is done in orte/mca/ess/base/ess_base_std_orted.c. It only is done when the PLM MCA param is set directing a specific PLM be selected. The function orte_plm_base_orted_append_basic_args clears the params passed to the daemon of any PLM selection passed to the HNP. Each PLM then adds a PLM directive if-and-only-if backend PLM support is desired. At present, Torque, SLURM, and rsh all specify this support and direct that the backend orted open the "rsh" PLM. This commit was SVN r21488. The following SVN revision numbers were found above: r21480 --> open-mpi/ompi@ed585bce8a	2009-06-20 03:58:00 +00:00
Camille Coti	ed585bce8a	Initialize the PLM if we are a non-HNP daemon. If we do not initialize the PLM, non-HNP daemons will not be able to use its functions. For example, RSH needs it when the tree_spawn mode is enabled: daemons call the orte_plm.remote_spawn() function to spawn their children in the deployment tree. This commit was SVN r21480.	2009-06-19 18:50:06 +00:00
Ralph Castain	771ce035a5	Complete implementation of regular expression generator and parser - now handles leading zeros and suffixes in node names. This commit was SVN r21468.	2009-06-18 04:36:00 +00:00
Jeff Squyres	2a5813ac2d	Silence a compiler warning. This commit was SVN r21459.	2009-06-17 12:26:38 +00:00
Ralph Castain	85e55c5087	Make the IOF macros match for debug vs optimized builds This commit was SVN r21453.	2009-06-16 22:30:53 +00:00
Rolf vandeVaart	633b996a0f	Add sys/wait.h so we can compile on Solaris. This commit was SVN r21451.	2009-06-16 19:48:43 +00:00
Ralph Castain	e9fc0a74fb	Silence compiler warnings This commit was SVN r21445.	2009-06-16 13:34:31 +00:00
Ralph Castain	74bd80afd9	Do not preload binaries or files if the app isn't being executed on this node This commit was SVN r21444.	2009-06-16 03:12:30 +00:00
Ralph Castain	d1dd8c2653	Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes. Have the resilient mapper also setup for new daemons in case the PLM needs them. This commit was SVN r21437.	2009-06-15 13:55:01 +00:00
Ralph Castain	c0c56e30c9	Add a missing function to the resilient mapper so it defines daemons in case they are needed This commit was SVN r21428.	2009-06-12 19:48:13 +00:00
Ralph Castain	170327e575	Reorg the rmaps components to collect shared code for byslot and bynode mapping in the base so we quit duplicating it in every mapper This commit was SVN r21424.	2009-06-12 17:52:17 +00:00
Ralph Castain	1ee4acb247	Cleanup how we handle pointer arrays in the odls base fns to avoid potential segfaults This commit was SVN r21423.	2009-06-12 17:51:23 +00:00
Ralph Castain	70b8c89b44	Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it". Also complete implementation to printout app_context objects so we see all the fields. This commit was SVN r21408.	2009-06-10 19:01:08 +00:00
Ralph Castain	f24cefe3d2	Shift the check for adequate file descriptors to after we check if this proc is part of the job to be launched - no point in doing the check more often than absolutely required This commit was SVN r21406.	2009-06-10 15:18:48 +00:00
Ralph Castain	87d7d693f0	Add a notifier call when the oob retries are exceeded so sys admins are aware of the problem This commit was SVN r21405.	2009-06-10 15:17:16 +00:00
George Bosilca	77a6f27d44	Update the call to orte_plm_base_create_jobid based on the new interface. This commit was SVN r21393.	2009-06-08 23:53:53 +00:00
Ralph Castain	86d55d7ebf	Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1sec delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s). Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn. This commit was SVN r21390.	2009-06-08 18:28:26 +00:00
Shiqing Fan	5a90b3068e	Two type casts. This commit was SVN r21388.	2009-06-07 12:51:46 +00:00
Ralph Castain	0a67bcb653	Minor cleanups This commit was SVN r21387.	2009-06-06 15:44:00 +00:00
Ralph Castain	ccf6b2cb8c	Implement the ability to register callbacks when specified error states occur This commit was SVN r21386.	2009-06-06 01:15:31 +00:00
Ralph Castain	0336460b0a	Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes. This commit was SVN r21385.	2009-06-06 01:08:47 +00:00
Ralph Castain	3815bfbba6	Provide a better error message when the oob cannot send a message after exhausting retries, and then have the proc abort so the job doesn't just hang forever. Since it could be a daemon that needs to abort, cleanup the abort sequence so the daemon can exit as cleanly as possible. This commit was SVN r21361.	2009-06-02 23:57:12 +00:00
Ralph Castain	30a357bd8d	Provide a "progress meter" for launch that outputs progress as we are launching, especially on large jobs. Also, provide a timeout mechanism so that we cleanly abort if we don't get a response from the next daemon in a specified time. This commit was SVN r21359.	2009-06-02 23:52:02 +00:00
Ralph Castain	303e3a1d39	Add a resilient mapping capability - currently maps by fault groups (if provided), still need to add the remapping capability for failed procs. This commit was SVN r21350.	2009-06-02 03:23:20 +00:00
Jeff Squyres	df7c387155	I'm (temporarily?) removing this entry because there's no .window file in this directory and it's causing "make dist" to break. Shiqing -- is there a missing file in this directory? If so, please add it and restore the EXTRA_DIST line I just removed. Thanks! This commit was SVN r21340.	2009-06-01 14:07:08 +00:00
Ralph Castain	137104b786	Initial support for CMs - needs to be pruned as CM support develops This commit was SVN r21335.	2009-05-30 20:57:23 +00:00
Ralph Castain	4a31c65126	Add initial support for CMs This commit was SVN r21334.	2009-05-30 20:55:55 +00:00
Ralph Castain	a1005a716f	Allow CM's to select the default errmgr component. Add support for error function callbacks This commit was SVN r21333.	2009-05-30 20:43:42 +00:00
Ralph Castain	a95731fc68	Minor update to let apps set their own component selections if desired, while preserving slave behavior This commit was SVN r21332.	2009-05-30 20:42:23 +00:00
Ralph Castain	4e0223a638	Add the ability to directly launch procs via rsh/ssh. Collect common functions in plm/base. Create a new global param to set assume_same_shell, alias'd back to plm_rsh_assume_same_shell (not deprecated). This commit was SVN r21328.	2009-05-30 01:10:25 +00:00
Jeff Squyres	5ea1b776f7	Remove a compiler warning about an empty format string. The proper way to have no abort message is to pass NULL (the errmanager is smart enough to handle this case and not emit any extra message). This commit was SVN r21311.	2009-05-28 13:32:37 +00:00
Ralph Castain	17485ca604	Minor mod per Greg Watson, plus some cleanups to make George smile...or at least grimace a little less! :-) This commit was SVN r21309.	2009-05-28 00:55:01 +00:00
Ralph Castain	cf3a5c44db	Modify the xml output per devel-list discussion with Greg Watson This commit was SVN r21285.	2009-05-27 00:43:54 +00:00
Jeff Squyres	e6a32f13bb	Add missing header file This commit was SVN r21273.	2009-05-26 20:57:44 +00:00
Iain Bason	e7ff2368d6	This fixes trac:1930. Emit a more informative error message when the file descriptor limit is reached during an accept() call. Also, abort when the accept fails to avoid an infinite loop. Emit a more informative error message when the help file can't be opened. This commit was SVN r21271. The following Trac tickets were found above: Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930	2009-05-26 20:03:21 +00:00
Rolf vandeVaart	db04b6ca71	This change does two things. First, do not emit error messages when delivering a signal (like STOP or CONT) to a nonexistent process. This fixes trac:1929. Also, only print one error message in the other cases. This commit was SVN r21263. The following Trac tickets were found above: Ticket 1929 --> https://svn.open-mpi.org/trac/ompi/ticket/1929	2009-05-22 14:59:27 +00:00
Ralph Castain	cc7620c210	Fix orte-ps so it properly ignores/reports stale HNPs, but continues to provide output on running ones. Add a timeout on the send side of the comm so we don't hang while trying to send the info request to the non-existent HNP. This commit was SVN r21257.	2009-05-21 02:42:21 +00:00
Jeff Squyres	154dc846b5	Remove extra ";", which caused a bazillion warnings in MTT. This commit was SVN r21255.	2009-05-20 13:16:31 +00:00
Rainer Keller	5c80033aa2	- Eliminate icc warning w/ regard to __attribute__((__format__)) on function pointers... Needed checking in opal_check_attributes.m4 This commit was SVN r21254.	2009-05-20 00:39:22 +00:00
Ralph Castain	f139cfd28a	Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to call back to mpirun during startup, thus eliminating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job). Add a new tm ess module that exploits this capability. Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function. Additional fixes included: 1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with! 2. fix a duplicate free when attempting to execute a non-existent app; 3. clean up a typo in the comm utilities; 4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps. Been tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info. This commit was SVN r21248.	2009-05-16 04:15:55 +00:00
Rainer Keller	b98a095d22	- Similar to r21229, check for return code from orte_rml_base_update_contact_info This commit was SVN r21233. The following SVN revision numbers were found above: r21229 --> open-mpi/ompi@9ad9b20847	2009-05-14 00:36:51 +00:00
Rainer Keller	9b7c4c2354	- Update comment This commit was SVN r21232.	2009-05-14 00:32:43 +00:00
Rainer Keller	b3daf7184a	- Fix Coverity CID 1272: Need to release src in case of error - Fix Coverity CID 1271: basename is allocated by opal_basename and later overwritten This commit was SVN r21230.	2009-05-14 00:28:35 +00:00
Rainer Keller	73fd329cbd	- Add the proper __opal_attribute_format__(__printf__...) to declarations. This commit was SVN r21226.	2009-05-14 00:10:59 +00:00
Ralph Castain	fd5dd9c4cb	Ensure we correctly cycle through map_by_slot when mapping leftover procs in rankfile mapper This commit was SVN r21219.	2009-05-12 15:41:55 +00:00
Shiqing Fan	9f624cd9a2	A small fix, the right one to use in orte. This commit was SVN r21213.	2009-05-12 09:53:34 +00:00
Ralph Castain	d396f0a6fc	Per the discussion on the devel list, move the binding of processes to processors from MPI_Init to process start. This involves: 1. replacing mpi_paffinity_alone with opal_paffinity_alone - for back-compatibility, I have aliased mpi_paffinity_alone to the new param name. This causes a mild abstraction break in the opal/mca/paffinity framework - per the devel discussion...live with it. :-) I also moved the ompi_xxx global variable that tracked maffinity setup so it could be properly closed in MPI_Finalize to the opal/mca/maffinity framework to avoid an abstraction break. 2. Added code to the odls/default module to perform paffinity binding and maffinity init between process fork and exec. This has been tested on IU's odin cluster and works for both MPI and non-MPI apps. 3. Revise MPI_Init to detect if affinity has already been set, and to attempt to set it if not already done. I have not tested this as I haven't yet figured out a way to do so - I couldn't get slurm to perform cpu bindings, even though it supposedly does do so. This has only been lightly tested and would definitely benefit from a wider range of evaluation... This commit was SVN r21209.	2009-05-12 02:18:35 +00:00
Ralph Castain	fa839f4a30	Fix a bug in the rankfile mapper when nooversubscribe is set This commit was SVN r21208.	2009-05-11 23:44:59 +00:00
Ralph Castain	388292aed5	Fix my old "friend" singleton comm_spawn This commit was SVN r21207.	2009-05-11 14:53:02 +00:00
Ralph Castain	c45ff0d59f	Take the next step towards fully utilizing static ports for the daemons to eliminate the initial "phone home" to mpirun by modifying the orted termination procedure to eliminate the need for a full barrier-like operation. Instead, we add a "onesided" barrier to the grpcomm framework API that releases the orted once it has completed its own contribution to the barrier - i.e., the orteds now exit as the "ack" message rolls up towards mpirun instead of sending the "ack" directly to mpirun. This causes the orteds in the routing tree to remain alive until all termination "acks" from orteds below them have passed through. Thus, if we use static ports, we no longer require a direct orted-to-mpirun connection. Also modify the binomial routed module so it conforms to what all the other routed modules do and have all messages pass along the routing tree instead of short-circuiting between orteds. This further reduces the number of ports being opened on backend nodes. This commit was SVN r21203.	2009-05-11 14:11:44 +00:00
Ralph Castain	c6f0499720	Some cleanups required from last night's commits to resolve some race conditions and ensure we cleanup properly. Also, remove some debug output that was unintentionally left "on" by default. This commit was SVN r21202.	2009-05-11 14:03:07 +00:00
Ralph Castain	10a694ea43	The current errmgr.register_callback API takes a jobid as one of its arguments. The intent was to have the errmgr check the jobid of the job being reported to it and, if it matches the jobid that was registered, call the specified callback function. Unfortunately, we assign the jobid during the plm.spawn procedure - which means it happens -after- control of the job has passed out of the range of mpirun (or whatever program is spawning the job), so it is too late for that main program to register a callback function. If the main program registers the callback -after- we return from plm.spawn, then it (a) cannot get a callback for failed-to-start, and (b) will miss the callback if a proc aborts in the time between job launch and the call to errmgr.register_callback. This commit fixes the problem by adding callback-related fields to the orte_job_t object. Thus, the main program can specify what job states should initiate a callback, what function is to be called, and what data is to be passed back by simply filling in the orte_job_t fields prior to calling plm.spawn. Also, fully implement the "copy" function for the orte_job_t object. NOTE: as a result of this change, the errmgr.register_callback API may no longer be of any value. This commit was SVN r21200.	2009-05-11 03:38:15 +00:00
Ralph Castain	69cd4e9d8a	Continue cleanup of opal_pointer_array references to replace direct addressing with opal_pointer_array_get_item. Also, be much more careful and complete in recovering resources of terminated jobs. This commit was SVN r21199.	2009-05-11 03:27:57 +00:00
Josh Hursey	d920a302f3	Some more C/R related commits that have been sitting off-trunk for a while. * Pass the sequence number of the checkpoint along with reference from the global to the local coordinator. * 'orte-restart --apponly' now just generates the app context file, and does not run with it. This provides the user the ability to edit the file before launching. * Add an OPAL_CRS_NONE state * Split the INC into three distinct parts. * Implement a restart mechanism for the 'none' component. If given a context it simply execvp()'s it. This commit was SVN r21195.	2009-05-08 20:51:13 +00:00
Josh Hursey	5d0607395d	A couple of C/R related commits that have been sitting off-trunk for a while. * Add 'orte-checkpoint -l' option that lists all checkpoints currently available on the system. * Add 'orte-restart -i' which prints information regarding the checkpoint targeted for restart. * Add ability to extract the timing metadata. * Fix show_help() in the orte-checkpoint and orte-restart tools. They should be using the opal versions instead of the orte versions (otherwise nothing is printed). This commit was SVN r21194.	2009-05-08 19:41:11 +00:00
Rainer Keller	b0754071b7	- For compilation with BLCR and --with-ft=cr, #include <string.h> This commit was SVN r21185.	2009-05-07 16:14:59 +00:00
Greg Koenig	60485ff95f	This is a very large change to rename several #define values from OMPI_* to OPAL_*. This allows the opal layer to be used more independently of the whole of ompi. NOTE: 9 "svn mv" operations immediately follow this commit. This commit was SVN r21180.	2009-05-06 20:11:28 +00:00
Shiqing Fan	cd565923d3	Completely remove ltdl support for Windows build. This commit was SVN r21170.	2009-05-05 18:59:13 +00:00
Josh Hursey	8b8bee04d6	It seems that some of the patches were missed in r21131. :( This patch contains the following items: * Fix the flag passed to open() for the read side of the named pipe between the local and app coordinator. There is a race condition when using O_RDWR on a named pipe (not sure how that bug got in there in the first place). * Adjust control in the C/R thread timing * Clarify return code in BLCR component * Allow the user to adjust the max wait time for the named pipes in the FileM local coordinator by using the MCA parameter "snapc_full_max_wait_time" (Default: 20 seconds) * If the application terminates while there are active FileM operations, force mpirun to wait on these operations to complete. * Allow the user to set the local copy command (Default: cp) via MCA parameter "filem_rsh_cp" * Implement the ability to throttle the number of outgoing connections in FileM. At larger scales this type of explicit throttling helps prevent overwhelming the HNP machine. Default: 10, set via MCA parameter: {{{filem_rsh_max_outgoing}}} This commit was SVN r21167. The following SVN revision numbers were found above: r21131 --> open-mpi/ompi@0deb009225	2009-05-05 16:45:49 +00:00
Ralph Castain	e615af8b80	Silence coverity... This commit was SVN r21149.	2009-05-04 22:22:47 +00:00
Ralph Castain	fa531a842d	Send all xml output over stdout This commit was SVN r21147.	2009-05-04 18:51:22 +00:00
Ralph Castain	eac027e6bc	In reviewing last night's MTT of trunk, discovered that IU was still testing the old plm slurmd module. This was created solely for debugging proposed changes to the slurm module which have long since been integrated across - and thus, the slurmd module hasn't been maintained for quite some time. Which explains why the tests using that module all failed...sigh. My bad for not cleaning it out a long time ago. This commit was SVN r21145.	2009-05-04 11:25:32 +00:00
Ralph Castain	4be24521aa	Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans. Minor mod to the ompi layer to use the new #define's - just one-line name replacements. This commit was SVN r21144.	2009-05-04 11:07:40 +00:00
Rainer Keller	c32516c9a3	- Include errno.h, to get MTT for sun to run through This commit was SVN r21143.	2009-05-04 09:13:16 +00:00
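
The r21144 entry above describes replacing individual booleans in orte_process_info with a 32-bit bitmap, a set of #define's for the bits, and matching test macros. A minimal sketch of that pattern follows. Every name in it (the DEMO_ prefix, the struct, the particular bits) is hypothetical and chosen for illustration only; it is not taken from the ORTE sources.

```c
#include <stdint.h>

/* Hypothetical illustration of a process-type bitmap.
 * None of these identifiers are the ones used in ORTE. */
typedef uint32_t demo_proc_type_t;

/* One bit per process type, so new types need no new struct fields. */
#define DEMO_PROC_HNP     0x0001u
#define DEMO_PROC_DAEMON  0x0002u
#define DEMO_PROC_TOOL    0x0004u
#define DEMO_PROC_MPI     0x0008u

/* Matching test macros: callers never touch the bits directly. */
#define DEMO_PROC_IS_HNP(t)     (((t) & DEMO_PROC_HNP) != 0)
#define DEMO_PROC_IS_DAEMON(t)  (((t) & DEMO_PROC_DAEMON) != 0)
#define DEMO_PROC_IS_TOOL(t)    (((t) & DEMO_PROC_TOOL) != 0)
#define DEMO_PROC_IS_MPI(t)     (((t) & DEMO_PROC_MPI) != 0)

/* Stand-in for a process-info structure holding the bitmap. */
struct demo_process_info {
    demo_proc_type_t proc_type;
};
```

With this layout a process can carry several types at once, for example `info.proc_type = DEMO_PROC_HNP | DEMO_PROC_DAEMON;`, and code that previously read a boolean would call `DEMO_PROC_IS_HNP(info.proc_type)` instead.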

1825 Commits