1
1

520 Коммитов

Автор SHA1 Сообщение Дата
George Bosilca
d66632fdc9 Reorder the nidmap encoding function. Add a check to make sure we don't write
outside the boundaries of the allocated array.

However, the problem is still there. If we have rmaps file containing only
partial information the num_procs get set to the wrong value (the number of
hosts in the rmaps file instead of the number of processes requested on the
command line).

This commit was SVN r21686.
2009-07-15 19:36:53 +00:00
Ralph Castain
60edbc7220 Fix hetero operations and comm_spawn (to a point).
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.

Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)

This commit was SVN r21655.
2009-07-13 20:03:41 +00:00
Ralph Castain
50bd635200 Also require that the routed framework be initialized before attempting to use orte_show_help
This commit was SVN r21638.
2009-07-12 10:50:14 +00:00
Ralph Castain
e30826c6e1 Quiet some compiler warnings
This commit was SVN r21591.
2009-07-02 17:48:36 +00:00
Ralph Castain
dd5e195a7d Don't treat the HNP node entry separately - this was just a holdover from the days when we didn't have the regex generator.
Ensure we get an accurate count of the number of daemons in the system.

This commit was SVN r21582.
2009-07-01 20:46:05 +00:00
Ralph Castain
f832352b45 Clean up some compiler warnings
This commit was SVN r21577.
2009-07-01 16:51:11 +00:00
Ralph Castain
2b4f051b7f Cleanup some indexing bugs so that shared memory can function
This commit was SVN r21548.
2009-06-26 22:07:25 +00:00
Ralph Castain
b96a71b62e Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.

This commit was SVN r21545.
2009-06-26 20:54:58 +00:00
Ralph Castain
863e57700e Cleanup use of pointer arrays - thanks to Lenny for pointing it out.
This commit was SVN r21529.
2009-06-25 14:08:36 +00:00
Ralph Castain
2e98ba3fd0 Complete implementation of regexp launch with static oob ports. Only enabled for SLURM at this time - migration to Torque coming
This commit was SVN r21516.
2009-06-24 20:31:26 +00:00
Ralph Castain
51ee170f75 Update the pidmap decode logic to handle pidmap updates for restarted processes
This commit was SVN r21506.
2009-06-24 03:05:04 +00:00
Ralph Castain
19062d70e4 Cleanup some of the loops in the nidmap code
This commit was SVN r21505.
2009-06-24 02:47:45 +00:00
Ralph Castain
0ba845fed2 Continue development of regular expression support by implementing it for slurm launches. Works for both initial (cmd line and non-cmd line) and comm_spawn launch.
Additional work required to fully enable static port support when using cmd line regular expression launch system.

This commit was SVN r21502.
2009-06-23 20:25:38 +00:00
Jeff Squyres
ecaa00ba73 Patch from Nadia/Bull from the opal-sos HG branch:
orte_session_dir_finalize doesn't clean the right directories.
orte_session_dir_cleanup neither.

This patch fixes several issues:

 1. orte_session_dir_cleanup():
   1. when jobid is not a wildcard, jobid is used to build the job
      session dir (instead of ORTE_LOCAL_JOBID).
   1. ORTE_SUCCESS is unconditionally returned (instead of rc that
      might have been previously set to another value).
 1. orte_session_dir_finalize():
   1. convert_jobid_to_string is not the right call to get the job
      session dir.
   1. in some places orte_process_info.top_session_dir is directly
      used, without being prefixed with the base directory.

Factorized the code sections that build the job_session_dir into a
single orte_build_job_session_dir() function that is now called by
both orte_session_dir_finalize() and orte_session_dir_cleanup().

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>

This commit was SVN r21498.
2009-06-23 16:07:41 +00:00
Ralph Castain
771ce035a5 Complete implementation of regular expression generator and parser - now handles leading zero's and suffix in node names.
This commit was SVN r21468.
2009-06-18 04:36:00 +00:00
Ralph Castain
8db7a9f9a7 Add regular expression generator to encode complete nid/pid maps - decoder to come.
This commit was SVN r21455.
2009-06-17 02:54:20 +00:00
Ralph Castain
4e0223a638 Add the ability to directly launch procs via rsh/ssh. Collect common functions in plm/base. Create a new global param to set assume_same_shell, alias'd back to plm_rsh_assume_same_shell (not deprecated).
This commit was SVN r21328.
2009-05-30 01:10:25 +00:00
Iain Bason
e7ff2368d6 This fixes trac:1930.
Emit a more informative error message when the file descriptor limit is
reached during an accept() call.  Also, abort when the accept fails to
avoid an infinite loop.

Emit a more informative error message when the help file can't be opened.

This commit was SVN r21271.

The following Trac tickets were found above:
  Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930
2009-05-26 20:03:21 +00:00
Ralph Castain
cc7620c210 Fix orte-ps so it properly ignores/reports stale HNPs, but continues to provide output on running ones. Add a timeout on the send side of the comm so we don't hang while trying to send the info request to the non-existent HNP.
This commit was SVN r21257.
2009-05-21 02:42:21 +00:00
Ralph Castain
f139cfd28a Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to callback to mpirun during startup, thus elimating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job).
Add a new tm ess module that exploits this capability.

Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function.

Additional fixes included:

1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with!

2. fix a duplicate free when attempting to execute a non-existent app

3. cleanup an typo in the comm utilities 

4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps

Been tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info.

This commit was SVN r21248.
2009-05-16 04:15:55 +00:00
Ralph Castain
484a6f58f2 Repair orte-ps by updating some of the interface code. Add ability to recover from attempting to contact non-responsive HNPs due to stale session directories. Implement the -j option. Turn "off" the -p option as it doesn't work and will take a little while to actually implement it (if anyone really cares).
This commit was SVN r21245.
2009-05-15 13:21:18 +00:00
Shiqing Fan
56866e68e9 A few typecasts.
This commit was SVN r21212.
2009-05-12 09:46:52 +00:00
Ralph Castain
c45ff0d59f Take the next step towards fully utilizing static ports for the daemons to eliminate the initial "phone home" to mpirun by modifying the orted termination procedure to eliminate the need for a full barrier-like operation. Instead, we add a "onesided" barrier to the grpcomm framework API that releases the orted once it has completed its own contribution to the barrier - i.e., the orteds now exit as the "ack" message rolls up towards mpirun instead of sending the "ack" directly to mpirun.
This causes the orteds in the routing tree to remain alive until all termination "acks" from orteds below them have passed through. Thus, if we use static ports, we no longer require a direct orted-to-mpirun connection.

Also modify the binomial routed module so it conforms to what all the other routed modules do and have all messages pass along the routing tree instead of short-circuiting between orteds. This further reduces the number of ports being opened on backend nodes.

This commit was SVN r21203.
2009-05-11 14:11:44 +00:00
Ralph Castain
08e2f8ec2d Continue some more cleanup of how we handle opal_pointer_arrays - replace direct references with opal_pointer_array_get_item
This commit was SVN r21198.
2009-05-11 03:24:49 +00:00
Greg Koenig
60485ff95f This is a very large change to rename several #define values from
OMPI_* to OPAL_*.  This allows opal layer to be used more independent
from the whole of ompi.

NOTE: 9 "svn mv" operations immediately follow this commit.

This commit was SVN r21180.
2009-05-06 20:11:28 +00:00
Ralph Castain
4be24521aa Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans.
Minor mod to the ompi layer to use the new #define's - just one-line name replacements.

This commit was SVN r21144.
2009-05-04 11:07:40 +00:00
Rainer Keller
221fb9dbca ... Delayed due to notifier commits earlier this day ...
- Delete unnecessary header files using
   contrib/check_unnecessary_headers.sh after applying
   patches, that include headers, being "lost" due to
   inclusion in one of the now deleted headers...

   In total 817 files are touched.
   In ompi/mpi/c/ header files are moved up into the actual c-file,
   where necessary (these are the only additional #include),
   otherwise it is only deletions of #include (apart from the above
   additions required due to notifier...)

 - To get different MCAs (OpenIB, TM, ALPS), an earlier version was
   successfully compiled (yesterday) on:
   Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled
   Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled
   Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled

This commit was SVN r21096.
2009-04-29 01:32:14 +00:00
Rainer Keller
6c1cce8761 - For the upcoming header cleanup commit,
several header files (previously included by header-files)
   now have to be moved "upward".
   This is mainly system headers such as string.h, stdio.h and for
   networking, but also some orte headers.

This commit was SVN r21095.
2009-04-29 00:49:23 +00:00
Ralph Castain
9c39a3edd7 Enable the passing of MCA params to dynamically spawned jobs. This creates a new info_key "ompi_param" that allows a user to specify MCA params for a dynamically spawned job.
We currently apply all of the MCA params in the parent job to the child. This commit allows a user to specify additional params for the child job, and to override any pre-existing params with the new value so they can better control behavior of the child job.

This commit was SVN r20989.
2009-04-14 14:15:49 +00:00
Ralph Castain
9f7c605166 More cleanup of pointer array usage
This commit was SVN r20981.
2009-04-13 19:06:54 +00:00
Shiqing Fan
1b97fe90fd Type casts mainly for Windows.
This commit was SVN r20967.
2009-04-09 13:34:55 +00:00
Ralph Castain
2c4e7bd5a2 Remove unused var
This commit was SVN r20966.
2009-04-09 13:18:18 +00:00
Ralph Castain
b4df8bcf85 Missed comment...
This commit was SVN r20964.
2009-04-09 03:00:57 +00:00
Ralph Castain
e9bc000f63 Correctly account for holes in the job map due to cleanup as jobs terminate
This commit was SVN r20963.
2009-04-09 02:59:23 +00:00
Ralph Castain
9c2f17eb01 Cleanup the nidmap lookup functions and add some comments explaining how we handle the nid, job, and pmap arrays. This fixes a problem we have less-than-full participation in a comm_spawn, causing holes to exist in the pmap array.
Update the slave spawn tests to properly indicate participation as being solely MPI_COMM_SELF.

This commit was SVN r20961.
2009-04-09 02:48:33 +00:00
Terry Dontje
4b43911c6a Remove superfluous spaces in manpages that were causing catman to
generate mangled windex files.  Made ompi-top.1 and ompi-iof.1 build
by default.  Also, added the orte-top synonym to the ompi-top manpage.

This commit was SVN r20915.
2009-04-01 14:40:27 +00:00
Rainer Keller
be66cc2279 - We're using uint16_t, uint32_t, and friends,
so #include <stdint.h> if we have it...

This commit was SVN r20835.
2009-03-21 01:26:27 +00:00
Rainer Keller
bff1b2a22b - Finally add the missing opal/util/output.h
for the OPAL_OUTPUT_VERBOSE macro.
 - ompi/errhandler/errhandler_predefined.h:
   Well, just the missing fwd declarations...

This commit was SVN r20820.
2009-03-17 22:37:15 +00:00
Rainer Keller
d8cf4c0fec - Get pgcc on XT to complain less:
In case we use memcmp, strlen, strup and friends include <string.h>
   Also several constants.h are not included directly
 - Let's have mca_topo_base_cart_create  return ompi-errors in
   ompi/mca/topo/base/topo_base_cart_create.c

This commit was SVN r20773.
2009-03-13 02:10:32 +00:00
Rainer Keller
0b59a59129 - Rather have 0xff instead of 0Xff...
This commit was SVN r20769.
2009-03-12 22:17:42 +00:00
Rainer Keller
ec0ed48718 - Revert r20739
This commit was SVN r20742.

The following SVN revision numbers were found above:
  r20739 --> open-mpi/ompi@781caee0b6
2009-03-05 21:56:03 +00:00
Rainer Keller
a94438343b - Revert r20740
This commit was SVN r20741.

The following SVN revision numbers were found above:
  r20740 --> open-mpi/ompi@2a70618a77
2009-03-05 21:50:47 +00:00
Rainer Keller
2a70618a77 - Second patch, as discussed in Louisville.
Replace short macros in orte/util/name_fns.h
   to the actual fct. call.

 - Compiles on linux/x86-64

This commit was SVN r20740.
2009-03-05 21:14:18 +00:00
Rainer Keller
781caee0b6 - First of two or three patches, in orte/util/proc_info.h:
Adapt orte_process_info to orte_proc_info, and
   change orte_proc_info() to orte_proc_info_init().
 - Compiled on linux-x86-64
 - Discussed with Ralph

This commit was SVN r20739.
2009-03-05 20:36:44 +00:00
Ralph Castain
f11931306a Modify the accounting system to recycle jobids. Properly recover resources from nodes and jobs upon completion. Adjustments in several places were required to deal with sparsely populated job, node, and proc arrays as a result of this change.
Correct an error wrt how jobids were being computed. Needed to ensure that the job family field was not overrun as we increment jobids for comm_spawn.

Update the slurm plm module so it uses the new slurm termination procedure (brings trunk back into alignment with 1.3 branch).

Update the slurmd ess component so it doesn't get selected if we are running a singleton inside of a slurm allocation.

Cleanup HNP init by moving some code that had been in orte_globals.c for historical reasons into the ess hnp module, and removing the call to that code from the ess_base_std_prolog


NOTE: this change allows orte to support an infinite aggregate number of comm_spawn's, with up to 64k being alive at any one instant. HOWEVER, the MPI layer currently does -not- support re-use of jobids. I did some prototype coding to revise the ompi_proc_t structures, but the BTLs are caching their own data, and there was no readily apparent way to update it. Thus, attempts to spawn more than the 64k limit will abort to avoid causing the MPI layer to hang.

This commit was SVN r20700.
2009-03-03 16:39:13 +00:00
Ralph Castain
fb1ecb7a45 Fix orted termination so we get the #@# relay out before we exit ourselves.
Minor change in the way we respond to job info requests - needed for coming change.

This commit was SVN r20698.
2009-03-03 13:38:29 +00:00
Josh Hursey
6d79a0398d Fix a bounds check that prevented some vpid resolution in certian launch scenarios.
Traced back to r20629.

This commit was SVN r20675.

The following SVN revision numbers were found above:
  r20629 --> open-mpi/ompi@dcff523244
2009-03-02 18:26:48 +00:00
Rainer Keller
04567d3af0 - Header orte/mca/errmgr/errmgr.h is not needed.
Once again compiles fine with -Wimplicit-function-declaration   

This commit was SVN r20640.
2009-02-26 04:05:30 +00:00
Ralph Castain
f3ffe48edd Remove debug output
This commit was SVN r20632.
2009-02-25 04:01:09 +00:00
Rainer Keller
b356e90fa1 - Get rid of include orte/util/proc_info.h, if not needed
Only proc_info.h-internal include file is opal/dss/dss_types.h
 - In one case (orte/util/hnp_contact.c) had to add proc_info.h again.
 - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
   works fine, no errors.

   Again, let's have MTT the last word.

This commit was SVN r20631.
2009-02-25 03:38:00 +00:00