1
1
Граф коммитов

1047 Коммитов

Автор SHA1 Сообщение Дата
Brian Barrett
f6a5d58885 Rather than set the connect event timeout number to something big and hoping
its bigger than the timeout for the connect() call, just don't register
the handler by default and fall back to connect() timing out.  Should give
much happier performance on big clusters.

This commit was SVN r13639.
2007-02-13 18:36:50 +00:00
Pak Lui
085826d94a * Remove the code for putting the bogus exit status of the user proc.
Also remove the smr set_proc_state since it's covered elsewhere.

This commit was SVN r13625.
2007-02-12 23:59:27 +00:00
Brian Barrett
8b28e5b33d Allow the OOB to connect between all MPI applications during MPI_INIT
without also establishing MPI connectivity. 

This commit was SVN r13595.
2007-02-09 20:17:37 +00:00
Brian Barrett
262cbbc5c9 Back out r13593, which contained a change that shouldn't be committed.
This commit was SVN r13594.

The following SVN revision numbers were found above:
  r13593 --> open-mpi/ompi@81472363ea
2007-02-09 20:13:02 +00:00
Brian Barrett
81472363ea Allow the OOB to connect between all MPI applications during MPI_INIT
without also establishing MPI connectivity.

This commit was SVN r13593.
2007-02-09 20:11:40 +00:00
Pak Lui
2d6b3776bf * fix the SEGV described in trac #892 that the exit_status in the 200 range
causes a strsignal to show NULL as a result. Still trying to determine
  why exit_status is in that range.

This commit was SVN r13583.
2007-02-09 16:39:30 +00:00
Ralph Castain
5818a32245 Bring in a forgotten speed improvement for the TM launcher that was developed during SNL Tbird testing last year. Remove the redundant and slow calls to TM to resolve hostnames. Instead, read the host info from the PBS file during the RAS, and then just use that info in the PLS (rather than getting it again).
Adjust the RMAPS mapped_node object to propagate the required launch_id info now included in the ras_node object. This provides support for those few systems that don't use nodename to launch, but instead want some id (typically an index into the array of allocated nodes). This value gets set for each node in the RAS - the RMAPS just propagates it for easy launch.

This commit was SVN r13581.
2007-02-09 15:06:45 +00:00
George Bosilca
79d76b044a ORTE_DECL everything that can be used outside the base directory. I
woner why this file is called private when it's included by all PLS ...

This commit was SVN r13573.
2007-02-09 03:16:19 +00:00
George Bosilca
7750ed22e0 Correct the Windows part of the universe detection.
This commit was SVN r13547.
2007-02-07 22:37:28 +00:00
Pak Lui
ccff0a6e65 * minor fix to correct the pid that always shows up as 0 in the abort
error message. e.g: 

  mpirun noticed that job rank 2 with PID 0 on node burl-ct-v440-4
  exited on signal 15 (Terminated).

This commit was SVN r13537.
2007-02-07 17:46:19 +00:00
Ralph Castain
890e3c7981 Reset the trunk so that the odls now sets the paffinity and sched_yield params again. The sched_yield is still overridden by any user-specified setting.
This change utilizes the new num_processors function. I also left the mods made to ompi_mpi_init and the bug fix for the default value of mpi_yield_when_idle. Note that the mods to mpi_init will not really take effect as the mca param will now *always* be set (either by user or odls). We will need those mods later, so no point in removing them now.

This commit was SVN r13519.
2007-02-06 19:51:05 +00:00
Jeff Squyres
c91fcd7fbd Fix a bunch of minor typos submitted by Bernhard Fischer.
This commit was SVN r13505.
2007-02-06 12:00:30 +00:00
Rolf vandeVaart
dcce8c739c Fix compiler warning. I am not sure how this got
passed us, but thanks to Jeff Squyres for pointing it out.

This commit was SVN r13501.
2007-02-05 22:03:58 +00:00
Rolf vandeVaart
74e3b68ce8 Better document orte-clean's behavior.
This commit was SVN r13498.
2007-02-05 20:01:15 +00:00
Ralph Castain
26897a626d Add a delayed_abort test code. We seem to handle this case just fine now, but Sun reports still seeing troubles on Solaris.
This commit was SVN r13493.
2007-02-05 15:24:01 +00:00
Jeff Squyres
4e506e69e5 Add missing <sys/param.h>
This commit was SVN r13478.
2007-02-03 01:11:35 +00:00
Rolf vandeVaart
bf5113198d Update to orte-clean so it will remove files on local and
remote nodes.  It will also kill off rogue orteds and orterun
processes.  The killing of processes is ifdef'ed out for Windows
since I do not know how to do it there.  Note that this change
will requite an autogen.  

This commit was SVN r13477.
2007-02-03 00:25:42 +00:00
Ralph Castain
a8202742ba Fix a missing function pointer - reference ticket #854
This commit was SVN r13476.
2007-02-02 23:10:14 +00:00
Jeff Squyres
f6e7016cdd Make this test capable of running more than "-np 1". If you run with
"-np X", it will launch X parents and then MPI_COMM_SPAWN X additional
children.

This commit was SVN r13466.
2007-02-02 14:34:53 +00:00
Ralph Castain
3daf8b341b Fix the sched_yield problem for generic environments. We now determine and set sched_yield during mpi_init based on the following logical sequence:
1. if the user has specified sched_yield, we simply do what we are told

2. if they didn't specify anything, try to get the number of processors on this node. Note that we already now get the number of local procs in our job that are sharing this node - that now comes in through the proc callback and is stored in the ompi_proc_t structures.

3. if we can get the number of processors, compare that to the number of local procs from my job that are sharing my node. If the number of local procs exceeds the number of processors, then set sched_yield to true. If not, then be a hog and set sched_yield to false

4. if we can't get the number of processors, default to conservative behavior and set sched_yield to true.

Note that I have not yet dealt with the need to dynamically adjust this setting as more processes are added via comm_spawn. So far, we are *only* looking within our own job. Given that we have now moved this logic to mpi_init (and away from the orteds), it isn't yet clear to me how a process will be informed about the number of procs in *other* jobs that are also sharing this node.

Something to continue to ponder.

This commit was SVN r13430.
2007-02-01 19:31:44 +00:00
Ralph Castain
c754523a14 Add cancel_operations to the pls module definition for tm
This commit was SVN r13416.
2007-02-01 16:52:28 +00:00
Ralph Castain
51fb746da3 Stop overriding the yield_when_idle mca param if the user has set it
This commit was SVN r13414.
2007-02-01 15:01:12 +00:00
George Bosilca
9f73335bdb Silence the compiler.
This commit was SVN r13381.
2007-01-31 04:24:56 +00:00
Jeff Squyres
8d872b195a Refs trac:726
Tested this functionality quite a bit more and made some fixes:

 * Print far fewer help messages
 * Fix one additional deadlock upon error
 * Change some ORTE_LOG messages to silent (because they're not
   errors)
 * Some code got re-indented, sorry...

Discussed and reviewed with Ralph.

This commit was SVN r13375.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-30 23:03:13 +00:00
Jeff Squyres
78a13bc3ea Fix the MPI_ABORT problem. We added an orte_initialized variable
yesterday and set it to "true" in orte_init().  But ompi_mpi_init()
doesn't call orte_init() -- it calls orte_init_stage1() and
orte_init_stage2(). So orte_initialized was never set to true, and
Badness happend from there (w.r.t. ompi_mpi_abort()).

This patch moves the setting of orte_initialized to orte_init_stage2()
so that everyone will always get it set properly.

It also moves setting orte_universe_info.state to RUNNING into
stage2() as well -- Ralph confirmed that that should have been there
for the same reasons that orte_initialized needs to be there.

This commit was SVN r13374.
2007-01-30 23:00:43 +00:00
Rainer Keller
061ba05439 - Fixes uncovered with the format attribute to
opal_output and opal_output_verbose

This commit was SVN r13371.
2007-01-30 20:56:31 +00:00
Rainer Keller
3669e8921e - Fix further compiler warnings regarding initialization
and shadowing variables.

This commit was SVN r13358.
2007-01-30 06:34:38 +00:00
George Bosilca
dea69e3c7c Remove one of the %s.
This commit was SVN r13357.
2007-01-30 03:56:48 +00:00
Jeff Squyres
e90b3e415b * Before this commit, if we called ompi_mpi_abort() before MPI_INIT
completed successfully, Bad Things(tm) could happen.
 * Now we explicitly check orte_initialized (a new global in ORTE
   indicating whether we are between orte_init() and orte_finalize()
   or not), and if so, react accordingly.
 * If ORTE is initialized, use orte_system_info.nodename; otherwise,
   use gethostname().
 * Add loop protection to ensure that ompi_mpi_abort() is not invoked
   multiple times recursively.

This commit was SVN r13354.
2007-01-29 22:01:28 +00:00
Rich Graham
f6c99d0207 set orte_odls_base.components_available to false if no odls components are
available.  Startup now works if no odls components are availble.

This commit was SVN r13339.
2007-01-27 15:37:13 +00:00
Jeff Squyres
3c5c8c3c4c Refinement of Rainer's r13227 and r13228 (worked with Rainer, Ralph,
and George on these refinements):

 * Rename the static OBJ initializer macro to be
   OPAL_OBJ_STATIC_INIT(class)
 * Ensure that all static OBJ initializations get a refcount of 1
   (doesn't ''really'' matter, since they're static, it should never
   get to the point where the OBJ is DESTRUCTed, but more correct
   nonetheless)
 * Add a "magic number" to the OBJ when compiling with debug support.
   The magic number does some rudimentary support to ensure that
   you're operating on a valid OBJ (and fails an assertion if you're
   not).  Check to ensure that the memory contains the magic number
   when performing actions of OBJ's.  Also remove the magic number
   when DESTRUCTing OBJs, so that if, for example, an OBJ is
   DESTRUCTed more than once, we'll fail the magic number assert.

This commit was SVN r13338.

The following SVN revision numbers were found above:
  r13227 --> open-mpi/ompi@96030de97b
  r13228 --> open-mpi/ompi@c2e9075d29
2007-01-27 13:44:03 +00:00
Jeff Squyres
974dcebf9f Finish backing out r13316 by also removing the comments that it
insertted.

This commit was SVN r13324.

The following SVN revision numbers were found above:
  r13316 --> open-mpi/ompi@35c1370a13
2007-01-26 13:09:18 +00:00
George Bosilca
668a2bd7ac Remove some debug output.
This commit was SVN r13323.
2007-01-26 08:09:22 +00:00
George Bosilca
bd7eebda83 Deal with the argv problem from r13321 for the Windows PLS.
This commit was SVN r13322.

The following SVN revision numbers were found above:
  r13321 --> open-mpi/ompi@b439e87f96
2007-01-26 07:21:07 +00:00
George Bosilca
b439e87f96 We have this one starting from r12059. We save a pointer to the argv[*] and
then we modify the argv, forcing the reallocation of the array. With luck
the saved pointer still have a meaning ... without execve return with error
14 (EFAULT).

This commit was SVN r13321.

The following SVN revision numbers were found above:
  r12059 --> open-mpi/ompi@ae79894bad
2007-01-26 07:06:52 +00:00
George Bosilca
29597cf0c5 We need to initialize the ODLS as they are the only one to define
the ORTE_DAEMON_CMD type. Which, unfortunately, is used all over
the place. Without this, we get error:
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../ompi-trunk/orte/dss/dss_pack.c at line 83
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../ompi-trunk/orte/dss/dss_pack.c at line 58
[msc01:12341] [0,0,0] ORTE_ERROR_LOG: Data pack failed in file ../../../../ompi-trunk/orte/mca/pls/base/pls_base_orted_cmds.c at line 136

This commit was SVN r13320.
2007-01-26 04:32:15 +00:00
Ralph Castain
0905dfdfba Make sure the params.h file gets included in the tarballs
This commit was SVN r13318.
2007-01-26 03:05:30 +00:00
Rich Graham
35c1370a13 odls components are handled only by daemon procs.
This commit was SVN r13316.
2007-01-25 21:18:59 +00:00
Rich Graham
3488b394be fix typo in name of the cancel operation.
This commit was SVN r13312.
2007-01-25 19:07:27 +00:00
Jeff Squyres
580a7a108c Fix a compiler warning.
This commit was SVN r13310.
2007-01-25 17:22:01 +00:00
Tim Prins
e199bf9b64 Refs trac:801
Fix compiler warning

This commit was SVN r13308.

The following Trac tickets were found above:
  Ticket 801 --> https://svn.open-mpi.org/trac/ompi/ticket/801
2007-01-25 16:12:05 +00:00
Ralph Castain
ab5ea61100 Bring over the rest of the ctrl-c fixes. This commit includes:
1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned.

2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we printout an appropriate error message and just give up.

3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded

4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond

This needs more testing before migrating to 1.2.

This commit was SVN r13304.
2007-01-25 14:17:44 +00:00
Ralph Castain
53967bd698 Fix a memory corruption problem deep inside the registry when subscriptions/triggers are processed. The create_value function will malloc space for the pointers to keyval objects, but doesn't actually allocate space for the objects themselves. When constructing the gpr_notify_data object, we forgot to OBJ_NEW the keyval objects. Since the create_value function didn't explicitly NULL those memory locations, it just so happened that there was a non-NULL address in them....which we dutifully dumped a keyval into.
This fix includes two parts: (a) we now initialize the keyval pointer locations to NULL after the malloc, and (b) we now OBJ_NEW the keyvals prior to storing info in them.

BTW, in case anyone reads this and wonders why we don't just OBJ_NEW the keyvals in create_value, the reason is simply that some places in the code use static keyvals and simply assign those addresses into the value object's array. So not everyone wants to OBJ_NEW keyvals - by not forcing it here in create_value, we give the user the flexibility to do whatever they want.

This commit was SVN r13300.
2007-01-25 12:54:02 +00:00
George Bosilca
5711583bdf Force only one thread to come out from the
socket engine.

This commit was SVN r13298.
2007-01-25 07:36:42 +00:00
George Bosilca
dcce444ed4 When the user give a prefix that really means something. I expect to
start looking for the daemons using the prefix.

This commit was SVN r13297.
2007-01-25 07:35:25 +00:00
George Bosilca
950b07d860 Work around the Windows sockets model.
This commit was SVN r13294.
2007-01-25 00:19:02 +00:00
George Bosilca
3b988fcdfd Small update the the process PLS.
This commit was SVN r13293.
2007-01-25 00:17:54 +00:00
Tim Prins
4fd81b3407 Fixes trac:801
- Make it so the SLURM ras can handle different nodelist configurations
- Some code cleanup and better/more informative error messages and error handling

This commit was SVN r13271.

The following Trac tickets were found above:
  Ticket 801 --> https://svn.open-mpi.org/trac/ompi/ticket/801
2007-01-24 14:45:42 +00:00
George Bosilca
9b16827049 Add ORTE_DECLSPEC, and few conversions.
This commit was SVN r13268.
2007-01-24 00:52:08 +00:00
George Bosilca
1e38810c2d Correctly close the sockets on a generic way.
This commit was SVN r13254.
2007-01-23 03:17:23 +00:00