1
1
Граф коммитов

1105 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
53967bd698 Fix a memory corruption problem deep inside the registry when subscriptions/triggers are processed. The create_value function will malloc space for the pointers to keyval objects, but doesn't actually allocate space for the objects themselves. When constructing the gpr_notify_data object, we forgot to OBJ_NEW the keyval objects. Since the create_value function didn't explicitly NULL those memory locations, it just so happened that there was a non-NULL address in them....which we dutifully dumped a keyval into.
This fix includes two parts: (a) we now initialize the keyval pointer locations to NULL after the malloc, and (b) we now OBJ_NEW the keyvals prior to storing info in them.

BTW, in case anyone reads this and wonders why we don't just OBJ_NEW the keyvals in create_value, the reason is simply that some places in the code use static keyvals and simply assign those addresses into the value object's array. So not everyone wants to OBJ_NEW keyvals - by not forcing it here in create_value, we give the user the flexibility to do whatever they want.

This commit was SVN r13300.
2007-01-25 12:54:02 +00:00
George Bosilca
5711583bdf Force only one thread to come out from the
socket engine.

This commit was SVN r13298.
2007-01-25 07:36:42 +00:00
George Bosilca
dcce444ed4 When the user give a prefix that really means something. I expect to
start looking for the daemons using the prefix.

This commit was SVN r13297.
2007-01-25 07:35:25 +00:00
George Bosilca
950b07d860 Work around the Windows sockets model.
This commit was SVN r13294.
2007-01-25 00:19:02 +00:00
George Bosilca
3b988fcdfd Small update the the process PLS.
This commit was SVN r13293.
2007-01-25 00:17:54 +00:00
Tim Prins
4fd81b3407 Fixes trac:801
- Make it so the SLURM ras can handle different nodelist configurations
- Some code cleanup and better/more informative error messages and error handling

This commit was SVN r13271.

The following Trac tickets were found above:
  Ticket 801 --> https://svn.open-mpi.org/trac/ompi/ticket/801
2007-01-24 14:45:42 +00:00
George Bosilca
9b16827049 Add ORTE_DECLSPEC, and few conversions.
This commit was SVN r13268.
2007-01-24 00:52:08 +00:00
George Bosilca
1e38810c2d Correctly close the sockets on a generic way.
This commit was SVN r13254.
2007-01-23 03:17:23 +00:00
Ralph Castain
46b7df5683 Fix bproc nodename to correctly assign process-to-nodename mapping
This commit was SVN r13244.
2007-01-22 19:39:00 +00:00
Jeff Squyres
6584df9262 For --prefix-like behavior, we used to modifiy environ directly and
then exec the "srun..." from there.  But somewhere along the line, we
switched to having a copy of environ and modifying that.  It looks
like we forgot to update the stuff for --prefix behavior.  So this
commit fixes the setenv's for PATH and LD_LIBRARY_PATH to modify the
environ copy (not environ itself) so that the values properly get
passed down to the srun environment via execve().

This restores --prefix behavior in the SLURM pls.

This commit was SVN r13239.
2007-01-22 15:50:35 +00:00
George Bosilca
3169a29da4 Revert commit r13235.
This commit was SVN r13238.

The following SVN revision numbers were found above:
  r13235 --> open-mpi/ompi@2636881324
2007-01-22 06:46:58 +00:00
George Bosilca
1b92589179 Update the PLS process.
This commit was SVN r13236.
2007-01-22 05:48:25 +00:00
George Bosilca
2636881324 Remove unused variables.
This commit was SVN r13235.
2007-01-22 05:46:57 +00:00
George Bosilca
93c3e3a21f __WINDOWS__ is defined or not.
This commit was SVN r13234.
2007-01-22 05:46:30 +00:00
Rainer Keller
c2e9075d29 - Define a OPAL_CLASS_EMPTY to be used for initialization.
Similar within the dt_module for the predefined datatypes.

This commit was SVN r13228.
2007-01-21 15:52:06 +00:00
Rainer Keller
96030de97b - Initialize the size of the opal_object class.
- Use the OBJ_CLASS_INSTANCE macro to initialize classes.
   This also gets rid of several missing initialization errors.

This commit was SVN r13227.
2007-01-21 14:24:29 +00:00
Rainer Keller
125ba1acfa - Reduce the amount of warnings with -Wshadow -- mainly due to
usage of index and abs in inline-fcts in header files.

This commit was SVN r13217.
2007-01-19 19:48:06 +00:00
Ralph Castain
b63d4ddfbf Clean up a compiler warning under bproc
This commit was SVN r13198.
2007-01-18 20:07:06 +00:00
Ralph Castain
1487e22ec8 Store the mapping mode so that it can be recovered later
This commit was SVN r13197.
2007-01-18 20:00:15 +00:00
Ralph Castain
2c46e10692 Convert this test back to the old form of xcast API
This commit was SVN r13194.
2007-01-18 19:32:43 +00:00
Ralph Castain
455e4ada9a Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes.
This commit was SVN r13191.
2007-01-18 17:15:19 +00:00
Brian Barrett
2755d5ccef Only do Windows things if we're on Windows. Need another case for when we
don't have windows and we don't have waitpid() (ie, the Cray)

This commit was SVN r13173.
2007-01-17 23:16:52 +00:00
Brian Barrett
ffe35ef6b8 Update the Cray XT3 run-time support files to compile with latest RTE changes
This commit was SVN r13172.
2007-01-17 22:47:27 +00:00
Ralph Castain
e093b5a256 Check for NULL before release, just to be safe
This commit was SVN r13162.
2007-01-17 21:29:34 +00:00
Ralph Castain
da82359446 Update some of the orte tests to sync with openrte repository
This commit was SVN r13155.
2007-01-17 16:15:37 +00:00
Jeff Squyres
6f7adfe231 Fix for the oob base open and close functions being invoked twice by
ompi_info -- once directly and once via the rml oob component.

This commit was SVN r13152.
2007-01-17 15:18:13 +00:00
Ralph Castain
cc905290e4 Fix the pernode and npernode options - the mca parameters weren't being set to correspond to the command line options
This commit was SVN r13151.
2007-01-17 14:56:22 +00:00
Jeff Squyres
3983141342 Remove this extra #if -- it wasn't necessary and was causing compiler warnings.
This commit was SVN r13146.
2007-01-17 13:53:02 +00:00
Ralph Castain
5d698dc55b Turn "off" an unimplemented command line option - we do not currently support execution without mpirun waiting for job completion.
This commit was SVN r13127.
2007-01-16 16:10:31 +00:00
Ralph Castain
f08210b3e1 Fix a double-free error
This commit was SVN r13126.
2007-01-16 15:49:16 +00:00
Jeff Squyres
e5205657cf A much better fix for #739. No configure test -- just do a simple
memcpy() instead of assigning the struct's by value.

Fixes trac:739.

This commit was SVN r13081.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 14:30:32 +00:00
Jeff Squyres
add3909096 Back out 13076 and 13077 in favor of a much simpler approach.
Sorry for the configure change -- hopefully it's early enough in the
morning that it won't affect people... (new approach won't have a
configure change).

Refs trac:739.

This commit was SVN r13080.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 14:07:15 +00:00
George Bosilca
24a91fad1d OPAL_BOOL_STRUCT_COPY or OMPI_BOOL_STRUCT_COPY that's the question!
Let's minimize the disturbances and say that the configure system is right.
From now on it's OPAL_BOOL_STRUCT_COPY. This one is related to r13076 and
has to follow when r13076 goes in the 1.2.

This commit was SVN r13077.

The following SVN revision numbers were found above:
  r13076 --> open-mpi/ompi@f0932a0701
2007-01-11 05:44:48 +00:00
Jeff Squyres
f0932a0701 A workaround for a bug in the PGI 6.2 compiler series. This bug has
been fixed in the 7.0 PGI series, but is unlikely to be fixed in the
6.2 series:

 * Add a configure test looking for the bad behavior (the PGI compiler
   chokes on C code where structs containing bool's are copied by
   value)
 * Set OMPI_BOOL_STRUCT_COPY to 1 if it's ok, 0 if it's not (i.e., PGI
   6.2 series will have this value set to 0)
 * In two places in the code base -- orte-clean and btl_openib_ini.h,
   we have a struct that contains a bool that is copied by value.  In
   these two places, check OMPI_BOOL_STRUCT_COPY and if it's 1, use
   the "int" type instead of "bool".

Fixes trac:739

This commit was SVN r13076.

The following Trac tickets were found above:
  Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
2007-01-11 02:21:26 +00:00
George Bosilca
0f68bb0bc0 Add a debug flag for the ODLS process.
This commit was SVN r13075.
2007-01-11 00:17:34 +00:00
George Bosilca
c8222b57eb The Windows PLS now is able to spawn process locally.
This commit was SVN r13074.
2007-01-11 00:16:58 +00:00
Rolf vandeVaart
9fd5e55b50 Need to include strings.h because that is where the rindex()
function prototype lives.  Without this, we get compile 
warnings.  In addition, for 64-bit Solaris, we get a 
segmentation fault from orterun without this include.

This commit was SVN r13065.
2007-01-10 18:44:08 +00:00
Brian Barrett
03112254e7 Increase connection timeout to 600 seconds, which should always be higher than
the connect() timeout, so that we'll use that rather than our own timeout by
defualt.  There timeout was set low for Big Red, but causes problems for very
large clusters, as there's no way to wire them up in 10 seconds most of the
time.

This commit was SVN r13062.
2007-01-10 04:53:21 +00:00
George Bosilca
c7da2b0a9a Add the process PLS. It's only intended for Windows users.
This commit was SVN r13051.
2007-01-09 00:19:52 +00:00
Ralph Castain
950149ec50 Add another test/example program that uses the OpenRTE to dynamically spawn an application
This commit was SVN r13050.
2007-01-09 00:13:57 +00:00
George Bosilca
77452ea8ea Add a missing include and update the definition of orte rds proxy component.
This commit was SVN r13042.
2007-01-08 22:00:01 +00:00
George Bosilca
409d1b8a8d Make the universe creation functions Windows friendly again.
This commit was SVN r13041.
2007-01-08 21:58:57 +00:00
Jeff Squyres
8a289cf1cb Part 1 of the fix for ticket #726. This commit adds logic to orteun
to effect the following:

 * The first time the user hits ctrl-c, we go into the process of
   killing the ORTE job (this is not new).
 * While waiting for the job to actually terminate, if the user hits
   ctrl-c a second time, we print a warning saying "Hey, I'm still
   trying to kill the job.  If you *really* want me to die
   immediately, hit ctrl-c again within 1 second."
 * If the user hits ctrl-c a within 1 second, orterun quits with a
   warning about how the job may not have actually been killed.

Note that none of this logic won't really work until the second part
of the fix for #726 is also committed (i.e., make pls.terminate_job()
non-blocking).  So I'm now throwing the ticket over to Ralph for the
second part of the fix...

Refs trac:726

This commit was SVN r13040.

The following Trac tickets were found above:
  Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726
2007-01-08 20:25:26 +00:00
Brian Barrett
e130f18cc2 Fix some compiler warnings that have slipped in lately...
This commit was SVN r13037.
2007-01-08 17:20:09 +00:00
Brian Barrett
a34e67d743 Remove unneeded PARAM_INIT_FILE variable in configure.params files used by
components that use configure.m4 for configuration or are always built. 
The macro has not been needed since moving to configure types other than
configure.stub

Fixes trac:590

This commit was SVN r13031.

The following Trac tickets were found above:
  Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
2007-01-08 03:44:22 +00:00
Jeff Squyres
a91c017f81 The constant name changed from ORTE_RML_NAME_ANY to ORTE_NAME_WILDCARD
-- upcate the comments/documentation to match.

This commit was SVN r13001.
2007-01-05 13:38:22 +00:00
Rolf vandeVaart
fdf44cc4ab Add the ability to not only report broken files and directories,
but remove them also.  This current set of changes will affect
nothing as no one is making use of this ability.  However, orte-clean
will be changed soon to utilize this new feature.

This commit was SVN r12996.
2007-01-04 21:48:34 +00:00
Ralph Castain
3ce9b2f6cc Remove some debugging output that mistakenly was left behind.
This commit was SVN r12984.
2007-01-04 17:24:11 +00:00
Ralph Castain
6101050ea6 Remove an abstraction barrier I thought was gone long-ago. The OOB subscription really shouldn't be defined as an OMPI subscription.
I know it's just a technicality, but it is time to address such things rather than just letting them continue to propagate. :-)

This commit was SVN r12954.
2007-01-02 16:16:50 +00:00
Ralph Castain
90f5e3fad8 Fix a buglet in the singleton startup procedure. For purposes of minimizing the xcast message, we "strip" the descriptive info on all subscription messages. This means, though, that we have to store the process name and other info so it can be retrieved in the body of the subscription data (as opposed to in the description). This wasn't being done for singletons because they don't call the RMAPS to "map" themselves.
This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it".

This change will need to migrate across to 1.2 after it "soaks" the appropriate time.

This commit was SVN r12952.
2007-01-02 16:14:44 +00:00
Rich Graham
6cb2377015 Change the allocation of the shared memory backing file. The file
is allocated on a per comm_world instance, with the lowest rank
in comm_world on the given host creating and initializing the file,
and then notifying the remaining files via the OOB.

Reviewed: Ralph Castain, Brian Barrett
Addressing ticket #674.

This commit was SVN r12949.
2007-01-01 02:39:02 +00:00
Brian Barrett
b5057d923e fix compile error that crept in with orte changes. This just makes it
compile -- I can't check for correctness on this platform.  Someone
probably wants to do that.

This commit was SVN r12948.
2006-12-31 20:16:20 +00:00
Li-Ta Lo
6df4e80727 new XCPU PLS and SDS to work with libxcpu
This commit was SVN r12905.
2006-12-21 00:05:36 +00:00
Brian Barrett
414a87b14c On OS X, the lowest 8 bits of the exit status are for signals, so pushing
rc (which is -1 or 4 if we hit this case) resulted in an odd error that a
signal killed the proc (instead of a startup error, as is reality).
Instead, use the W_EXITCODE macro (if available) to build up an exit
code that has an error code for exit status, but does not make it look
like the process died from a signal

This commit was SVN r12890.
2006-12-18 02:30:05 +00:00
Brian Barrett
bc6cec346f Print out the description of the signal from mpirun when a proc was aborted
by a signal if we have strsignal()

This commit was SVN r12888.
2006-12-17 20:01:11 +00:00
Ralph Castain
a0ef517550 Fix some errors in the bproc components that prevented compiling. Thought I had already done this, but either those changes were lost when I did the merge, or my old man's memory is fading....
Whaz-at??? :-)

This commit was SVN r12874.
2006-12-15 19:40:04 +00:00
Ralph Castain
1e1d0e8a89 Set the app_num attribute into the process environment so we pick it up on the other end
This commit was SVN r12868.
2006-12-15 16:43:52 +00:00
Ralph Castain
677d1260aa cleanup nicely if we don't launch
This commit was SVN r12867.
2006-12-15 14:03:53 +00:00
Ralph Castain
cbb660504c Retain the ability to run valgrind on the bproc launcher - do not call bproc_version if "nolaunch" is specified.
This commit was SVN r12866.
2006-12-15 14:01:21 +00:00
Ralph Castain
64ec238b7b Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs.
This commit was SVN r12864.
2006-12-15 02:34:14 +00:00
Brian Barrett
38c2e43ac2 Print out error string rather than errno for TCP-related errors, making it easier for both the user and us to debug issues with BTL and OOB issues...
This commit was SVN r12852.
2006-12-14 18:20:43 +00:00
Ralph Castain
7b8f445e13 Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer.
This commit was SVN r12840.
2006-12-13 13:49:15 +00:00
Ralph Castain
82946cb220 Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but don't actually launch the job. This lets us test all the setup portions of the code.
Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here.

This commit was SVN r12838.
2006-12-13 04:51:38 +00:00
Ralph Castain
3b064a624e For convenience, revise the orte_job_map_t object so it includes the vpid start/range values, the number of nodes, and the number of processes on each node. These values are all used in various places in the code base - we currently re-compute them multiple times. Since these values do not change and are already being computed by the RMAPS framework, we might as well just save them for re-use.
This commit was SVN r12829.
2006-12-12 16:07:23 +00:00
Ralph Castain
28ce8e5e5e Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior).
In "--npernode" operation, the "-np" command line parameter is ignored.

This commit was SVN r12826.
2006-12-12 00:54:05 +00:00
Ralph Castain
8314e8dbb9 Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode:
1. no -np provided - put one proc/node across all allocated nodes

2. -np N provided, N > #nodes - we print a pretty error message and exit

3. -np N provided, N <= #nodes - put one proc/node across N nodes

I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error.

This commit was SVN r12821.
2006-12-11 18:07:07 +00:00
Ralph Castain
0a5d41857a Complete next round of message size reduction: "strip" the descriptive info from the returned values. I have now added a flag to the gpr address mode (ORTE_GPR_STRIPPED) that instructs the gpr to not include segment names or tokens in the returned gpr_value_t objects.
I found only two places that were looking at the tokens:

1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.

2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "stripped" the value structures of tokens and segment strings, and (c) correctly obtained process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).

This commit was SVN r12813.
2006-12-09 23:10:25 +00:00
Ralph Castain
58569546ed Fix the fix to remove compiler warning - an incorrect "\" was placed in the command string.
This commit was SVN r12805.
2006-12-08 04:17:38 +00:00
Sven Stork
78173a697a Replace the test opertion "-e" with "-r" to improve the protability.
Refs: #392

This commit was SVN r12790.
2006-12-07 12:14:40 +00:00
Ralph Castain
62d7826e01 Helps if we total up the correct field to get the total number of slots in the universe
This commit was SVN r12789.
2006-12-07 03:17:12 +00:00
Ralph Castain
a1153fdc8f Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attributed_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment.
Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available.

This commit was SVN r12788.
2006-12-07 03:11:20 +00:00
Brian Barrett
8f68764e5e A number of heterogeneous fixes for the dss with the new buffer options:
* When using the load/unload interface, stash away the current buffer
    type so that it can be properly unpacked on the receiving side if
    the buffer type is other than the receiver default
  * Include type information for unsized types (bool, int, size_t,
    pid_t) so that they can be properly unpacked by the receiver
    in the heterogeneous case.
  * Restore the NON_DESC type as the default for optimized builds,
    since it looks like this fixes the known issues with the
    non-described buffers

Refs trac:587

This commit was SVN r12784.

The following Trac tickets were found above:
  Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
2006-12-06 23:19:06 +00:00
Brian Barrett
cfeac5581a temporarily always use described buffers as the non-described causes all
kinds of problems for heterogeneous environments

This commit was SVN r12783.
2006-12-06 20:22:31 +00:00
Ralph Castain
d4bd60c9fe Restore the paffinity capability, along with all the required logic to ensure we "do the right thing" when the user gives us inaccurate information about the number of slots on a remote node.
This commit was SVN r12780.
2006-12-06 15:59:34 +00:00
Ralph Castain
b1e16fffac Add the C++ doo-hicky stuff around the odls framework definitions just in case somebody, somewhere, on some remote planet where only goats can feed needs it.
This commit was SVN r12777.
2006-12-06 13:58:04 +00:00
Ralph Castain
8ca415a0c5 Remove duplicate orte_odls declaration
This commit was SVN r12776.
2006-12-06 13:44:41 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
Tim Prins
08d5ca821f Don't get the node architecture when useing the LoadLevleer RAS. It is slow (about a second for ~300 nodes) and we don't even use the value.
This commit was SVN r12758.
2006-12-05 13:47:53 +00:00
Ralph Castain
eb941d8ae2 Fix a bug that declared a node as "oversubscribed" a little early during the mapper procedure. This only affected the mapping procedure, and only if you had set the "--no-oversubscribe" flag.
Kudos to Tim Prins for finding it.

This commit was SVN r12757.
2006-12-05 13:04:27 +00:00
George Bosilca
6f28bcdc21 Remove the last set of compiler warnings from the precondition file.
This commit was SVN r12753.
2006-12-04 21:45:57 +00:00
Brian Barrett
d64fa194f1 Instead of continually screwing around with different format strings to
make this warning-proof, loop over the uint64_ts as an array of integers
and use %x.  The final string is just as random and formatted exactly
the same, so we're all good in that department.

Refs trac:655

This commit was SVN r12742.

The following Trac tickets were found above:
  Ticket 655 --> https://svn.open-mpi.org/trac/ompi/ticket/655
2006-12-04 18:07:24 +00:00
Gleb Natapov
f0132b2499 Provide parameters in a correct order (processor/oversubscribed was swapped).
This commit was SVN r12737.
2006-12-04 12:55:45 +00:00
Rainer Keller
d078bb3e8a - Revert changes and include pointers to discussion.
This commit was SVN r12736.
2006-12-03 17:05:15 +00:00
Rainer Keller
e61dd8722e - Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx
- No functional changes, just indentation and corrections to error
   output.

This commit was SVN r12734.
2006-12-03 13:59:23 +00:00
George Bosilca
a0ed53d70b Make the compilers happy.
This commit was SVN r12729.
2006-12-03 00:19:11 +00:00
Ralph Castain
4151a46871 Per Jeff's request (which made a lot of sense), setup the default buffer type to be DESCRIBED for debug/devel builds, and NON-DESC for optimized builds. The user can still select the default buffer type via mca parameter at runtime - this just sets the default default. :-)
Also, change the dss buffer type mca param to something more easily remembered (it is now "dss_buffer_type"). Heck, even I had to keep looking at the darn code to remember it.

This commit was SVN r12728.
2006-12-02 13:32:16 +00:00
George Bosilca
3fd278c522 Make the tree compile in debug mode.
This commit was SVN r12724.
2006-12-01 23:03:09 +00:00
Ralph Castain
897744cdeb Two major changes to the runtime:
1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1".

2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomiall tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial".

3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1".

4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer)

This commit was SVN r12722.
2006-12-01 22:30:39 +00:00
Jeff Squyres
3cf7dddd47 Fixes trac:635.
Ralph identified the problem, I tracked down ''where'' the fd was
being closed, and Brian figured out ''why'' (and the fix).

What was happening is that a remote process was closing its
stdout/stderr and therefore sending a 0-byte IOF message to mpirun.
mpirun, in turn, closed the iof endpoint associated with that stream
(i.e., stdout/stderr).  IOF does this to handle the case where
mpirun's stdin is closed -- this therefore causes the stdin on all the
ORTE-started processes to have their stdin's closed as well.

So the workaround here is to check that if we get a 0-byte IOF message
on a sink (indicating a remote closure), and if that sink is the
special stdout or stderr stream, don't actually close anything in the
local process.

This commit was SVN r12691.

The following Trac tickets were found above:
  Ticket 635 --> https://svn.open-mpi.org/trac/ompi/ticket/635
2006-11-28 21:42:49 +00:00
Ralph Castain
0398c9e0c5 Correctly setup the sched_yield when launching processes via the orteds. This still doesn't adjust the yield schedule "on-the-fly" as more procs are dynamically added to a node - it just sets it when they are first launched.
This commit was SVN r12683.
2006-11-28 08:27:20 +00:00
George Bosilca
8df8d86b85 Complete the functions to match the expected prototype.
This commit was SVN r12680.
2006-11-28 00:44:30 +00:00
Ralph Castain
bc4e97a435 First stage in the move to a faster startup. Change the ORTE stage gate xcast into a binary tree broadcast (away from a linear broadcast). Also, removed the timing report in the gpr_proxy component that printed out the number of bytes in the compound command message as the answer was "not much" - reduces the clutter in the data.
This commit was SVN r12679.
2006-11-28 00:06:25 +00:00
Ralph Castain
652b91ee26 Remove some compiler warnings
This commit was SVN r12678.
2006-11-27 23:47:36 +00:00
Ralph Castain
9bc25f0bec Fix a potential bug in the registry where it didn't fully check a segment's name when searching for it. Will have to verify that this doesn't break other things.
Bring the bproc system close to being back online....

This commit was SVN r12659.
2006-11-23 04:17:37 +00:00
Brian Barrett
32833deff0 since orteboot, ortehalt, and ortekill were all added today (including to
configure.ac), we need to add them to SUBDIRS to make them end up in the
tarball as well...

This commit was SVN r12658.
2006-11-23 03:10:57 +00:00
Ralph Castain
deb2470ba3 Move the waitpid callback in the bproc pls *after* we store the daemon info. Otherwise, a short-lived app could terminate before we store the daemon info, causing mpirun to not terminate the daemons since the call to get_active_daemons would return a NULL list.
This commit was SVN r12656.
2006-11-22 22:49:22 +00:00
Rainer Keller
b63500f62c - Dont unlock ompi_rte_mutex unconditionally, use the macro instead.
This commit was SVN r12655.
2006-11-22 21:01:43 +00:00
Ralph Castain
b1ff5fe868 Move the name of the bproc common segment to the central schema location - avoids conflicts when bproc 3 components try to build
This commit was SVN r12654.
2006-11-22 20:23:17 +00:00
Ralph Castain
8080034eb2 Clean up a compile issue for bproc
This commit was SVN r12653.
2006-11-22 19:50:27 +00:00
Ralph Castain
428c1f14c3 Modify the bproc components to resolve the current allocation problem
This commit was SVN r12652.
2006-11-22 19:10:58 +00:00
Ralph Castain
7f95b27141 Correctly "hide" the new orte tools - they shouldn't get compiled or seen unless you specifically go into those subdirectories and manually do a "make".
This commit was SVN r12650.
2006-11-22 14:35:16 +00:00
Sven Stork
dc116d4814 - Add missing mutex lock
This commit was SVN r12649.
2006-11-22 13:37:58 +00:00
Ralph Castain
6fca1431f3 Back out some prior commits. These commits fixed bproc so it would run, but broke several other things (singleton comm_spawn and hostfile operations have been identified so far). Since bproc is the culprit here, let's leave bproc broken for now - I'll work on a fix for that environment that doesn't impact everythig else.
This commit was SVN r12648.
2006-11-22 13:30:21 +00:00
Brian Barrett
0895f5e08d Rename OMPI_PROCESS_NAME_{HTON, NTOH} macros to ORTE_PROCESS_NAME_{HTON, NTOH}
because they are in ORTE, not OMPI.  Also, remove the ORTE_PROCESS_NAME macros
in iof base as they are duplicates of the ones that were in ns_types, which 
meant that bad things happened if you changed what an orte_process_name_t
looked like.

This commit was SVN r12646.
2006-11-22 03:03:21 +00:00
Brian Barrett
33320b7165 Rework the opal_progress interface to better support dynamic processes and at
the same time, remove some of the MPI-related options from OPAL:

  - provide mechanism to change at runtime whether sched_yield() should 
    be called when the progress engine is idle
  - provide mechanism for changing the rate at which the event engine
    is called when there are "no" users of the event engine (ie, when
    using MPI but not TCP)
  - fix some function names in the progress engine to better match
    their intended use (and remove MPI naming scheme)
  - remove progress_mpi_enable / progress_mpi_disable because 
    we can now use the functions to set the sched_yield and
    tick rate interfaces
  - rename opal_progress_events() to opal_progress_set_event_flag()
    because the first really isn't descriptive of what the function
    does and I always got confused by it

This commit was SVN r12645.
2006-11-22 02:06:52 +00:00
Ralph Castain
9f3dcd147a Round and round the mulberry bush we go...
Fix comm_spawn by singletons. orte_init does some voodoo to let the system know about localhost when we are a singleton. This includes allocating it so that any comm_spawn'd children can use their parent "allocation". Unfortunately, the fix that bproc needs (due to that smr filling up the node segment!) causes the singleton startup to fail. The fix is to just have the singleton startup force an allocation of its localhost.

Only issue here is: what happens if we are in a persistent universe? The singleton will now overwrite any prior info on slots used on localhost by other jobs (won't affect anything else). The answer, of course, is to do something more intelligent - lookup localhost on the registry and just update its info instead of overwriting it.

Something for another day (or month....or year)

This commit was SVN r12644.
2006-11-21 21:51:58 +00:00
Ralph Castain
761d4fab25 cleanup unused variables
This commit was SVN r12643.
2006-11-21 20:51:39 +00:00
Ralph Castain
a30c65ca24 Fix the allocator to make bproc happy.
We were burned again by the fact that the bproc state monitor creates entries on the node segment for  *all* the nodes in the cluster when it is opened during orte_init. As a result, the bjs allocator was never being called, and the system merrily assumed that *all* nodes in the cluster had been allocated to it.

To fix this, I removed a test that had been inserted into the allocation procedure that checked for a non-zero node segment. This was an old artifact - the RAS components already know that they are not to overwrite any existing node segment entries (at least, bproc does - I will check the others. For now, I just want to save the bproc fix on this machine).

This commit was SVN r12640.
2006-11-21 19:52:55 +00:00
Ralph Castain
050e401671 Simplify - the cellid is simply a field in the process name. Recently, we decided to just directly access it and get rid of extraneous function calls.
This commit was SVN r12638.
2006-11-21 09:59:24 +00:00
Tim Prins
2afb401e39 fix some compile warnings
This commit was SVN r12636.
2006-11-21 00:33:10 +00:00
Galen Shipman
c06b740220 a few more opps to get the cellid...
This fixes a compilation error on bproc machines.. 

This commit was SVN r12631.
2006-11-20 19:45:01 +00:00
Ralph Castain
33affed09c Bring ortehalt to a preliminary capability. It will corectly order a persistent daemon to exit cleanly. Need to now interface it to orterun, clean up a few things here and there
This commit was SVN r12626.
2006-11-18 04:47:51 +00:00
Ralph Castain
ea1e0d34c8 Fix the SLURM PLS and SDS components so they correctly communicate the name of the node. This repairs a number of comm_spawn issues seen on Odin and other SLURM machines.
This commit was SVN r12621.
2006-11-17 20:51:03 +00:00
Ralph Castain
3c5a2cd17b Cleanup a few warnings for unused variables
This commit was SVN r12620.
2006-11-17 19:32:49 +00:00
Ralph Castain
f771cc4fbd Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option.
Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution.

This commit was SVN r12619.
2006-11-17 19:06:10 +00:00
Jeff Squyres
d248f608b4 Remove svn:executable from a bunch of files.
This commit was SVN r12617.
2006-11-17 11:01:39 +00:00
Ralph Castain
ca5b4358fa Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too.
Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone.

This commit was SVN r12616.
2006-11-17 02:58:46 +00:00
Ralph Castain
5ddcb8a652 Ensure the orted kills *all* local procs when exiting. Add a little clarity to some of the debugging output
This commit was SVN r12615.
2006-11-16 21:15:25 +00:00
Ralph Castain
c1813e5c5a Extend the daemon reuse functionality to *most* of the other environments.
Note that Bproc won't support this operation, so we just ignore the --reuse-daemons directive.

I'm afraid I don't understand the POE and XGrid environments well enough to attempt the necessary modifications.

Also, please note that XGrid support has been broken on the trunk. I don't understand the code syntax well enough to make the required changes to that PLS component, so it won't compile at the moment. I'm hoping Brian has a few minutes to fix it after SC.

This commit was SVN r12614.
2006-11-16 15:11:45 +00:00
Ralph Castain
044898f4bf My eyes may be deceiving me....but I do believe these comparisons are backwards! I think we only really want to "free" these variables if they are NOT NULL - as opposed to "free"ing them if they ARE NULL.
This commit was SVN r12612.
2006-11-15 22:59:01 +00:00
Ralph Castain
a17e27dfd5 Sign of old age.....fix some compiler complaints
This commit was SVN r12611.
2006-11-15 22:28:14 +00:00
Ralph Castain
f7fc19a2ca Create the ability to re-use existing daemons. Included in the commit:
1. new functionality in the pls base to check for reusable daemons and launch upon them

2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED

3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse".

This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on.

This commit was SVN r12608.
2006-11-15 21:12:27 +00:00
Andrew Friedley
3f550c3a83 missing a semicolon...
This commit was SVN r12607.
2006-11-15 20:35:27 +00:00
Ralph Castain
437f2b044d Modify the orted command communication system in two ways:
1. use non-blocking sends to transmit commands (this was actually done in a prior commit)

2. have an "ack" message sent back from the orted when it completes the command

The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occassionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done.

Best of both worlds.

This commit was SVN r12605.
2006-11-15 15:09:28 +00:00
Ralph Castain
22a473a831 Fix a compiler complaint for too many arguments to a printf command - thanks to Tim Prins for pointing it out to me.
This commit was SVN r12604.
2006-11-15 14:52:48 +00:00
Ralph Castain
6d6cebb4a7 Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.

I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).

This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
Ralph Castain
7b4261001a Forgot to modify the orted end of the communication subsystem
This commit was SVN r12586.
2006-11-13 22:08:47 +00:00
Ralph Castain
f95e20e2e1 Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes.
Add some debugger output to the ODLS default component.

Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time.

This commit was SVN r12585.
2006-11-13 21:51:34 +00:00
Tim Prins
39bc652899 Refs trac:612
Make it so if -np was not passed and -pernode was, we map bynode

This commit was SVN r12580.

The following Trac tickets were found above:
  Ticket 612 --> https://svn.open-mpi.org/trac/ompi/ticket/612
2006-11-13 19:13:21 +00:00
Ralph Castain
4636125e2d Modify the RMGR components to allow job setup with a given jobid, and add another attribute so that we can setup triggers without launching.
Add some debugging output to the ODLS default module, and the orted.

Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure.

Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved.

Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly.

This commit was SVN r12579.
2006-11-13 18:51:18 +00:00
Jeff Squyres
bfdf801487 Replace a "sleep(1)" with a yield so that the orted can reap processes
much faster.

This commit was SVN r12575.
2006-11-13 12:45:03 +00:00
Ralph Castain
4e50cdae52 This commit accomplishes two things:
1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs.

2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system.

This commit was SVN r12558.
2006-11-11 04:03:45 +00:00
Sven Stork
9ba4c4a7ee - Add support for "sh" handling. Instead of detecting of bash we now
check for bourne shell, because bourne shell is the smallest
  common divisor for bash/ksh/sh.
- Make some shell expressions sh compatible 

This commit was SVN r12509.
2006-11-09 10:16:45 +00:00
Ralph Castain
a3be8261fb Fix a bug that had us generate an error message and abort startup when there were stale universe directories around. Now, we just ignore them.
This commit was SVN r12472.
2006-11-07 21:34:57 +00:00
Ralph Castain
ea77beca29 cleanup the TM modules in prep for T-bird tests. The TM RAS will now report time required to resolve hostnames
This commit was SVN r12449.
2006-11-06 20:56:18 +00:00
Ralph Castain
884caeb2c7 Add timing tests for the TM ras
This commit was SVN r12445.
2006-11-06 18:41:22 +00:00
Galen Shipman
68d9922f44 enable/disable connection sleep in oob_tcp.c via mca param.. on by default..
This commit was SVN r12444.
2006-11-06 18:00:46 +00:00
Ralph Castain
194bdd413b Cleanup the problem of connecting to default universes.
This commit was SVN r12438.
2006-11-06 15:28:38 +00:00
Ralph Castain
d182ae7472 Clean up a few compiler warnings courtesy of Jeff
This commit was SVN r12430.
2006-11-03 20:45:22 +00:00
Ralph Castain
c77f6c605e Update timing reports:
1. Remove timing of xcast from mpi_init

2. Add timing report from oob_xcast on how long it took to send the message

This commit was SVN r12428.
2006-11-03 18:55:05 +00:00
Ralph Castain
761c8beeb7 Add output of the compound command size to the timing reports
This commit was SVN r12423.
2006-11-03 16:11:37 +00:00
Ralph Castain
60e27c77e7 Add some additional timing reporting:
1. Added reporting points around the xcasts in MPI_Init. Note that these times will include time spent waiting for a trigger to fire, which is why the times between stage gates did NOT include these times initially. The inter-stage-gate times still do NOT include the xcast time - the xcast time is reported separately.

2. Added the process vpid on the MPI_Init timing reports for clarity.

3. Added a report from the xcast function on the HNP that outputs the number of bytes in the message being sent to the processes.

This commit was SVN r12422.
2006-11-03 16:04:40 +00:00
Ralph Castain
2472f76d89 Make sure the print_map doesn't get called if we don't do the map step in rmgr.spawn
This commit was SVN r12404.
2006-11-02 05:52:24 +00:00
Ralph Castain
dc4de57253 Per request from the Eclipse team, change the attributes that control the "flow" through the rmgr.spawn procedure from "stop after this step" to "do this step". This allows them to use the spawn procedure to complete the launch one step at a time so they can let the user see the results after each step.
Shouldn't affect anyone else.

This commit was SVN r12403.
2006-11-02 05:15:24 +00:00
Ralph Castain
233dac8bba Enable the new attributes for remote operations such as comm_spawn
This commit was SVN r12382.
2006-10-31 23:52:25 +00:00
Ralph Castain
30de73a712 Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include:
1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch"

2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command.

3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job

4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map.

Enjoy! My personal favorite is the first one - helps with debugging.

This commit was SVN r12379.
2006-10-31 22:16:51 +00:00
Ralph Castain
17c71f8d2a Fix ticket #545
Setup subscriptions to correctly return the MPI_APPNUM attribute.

Fix an unreported bug that was found. The universe size was incorrectly defined in the attributes code. As coded, it looked for size_t values and based its size computation on those numbers. Unfortunately, the node_slots value had been changed to an orte_std_cntr_t awhile back! So the universe size was never updated.

Update the hello_nodename test to check for MPI_APPNUM.

Add a definition to ns_types for ORTE_PROC_MY_NAME - just a shortcut for orte_process_info.my_name. Brought over from ORTE 2.0 as it will be used extensively there.

This commit was SVN r12377.
2006-10-31 21:29:07 +00:00
Tim Prins
894b220fbb Fixes trac:487
Give a more intelligible error message when someone passes -nolocal and the only available node is the local node.

This commit was SVN r12325.

The following Trac tickets were found above:
  Ticket 487 --> https://svn.open-mpi.org/trac/ompi/ticket/487
2006-10-26 21:46:18 +00:00
Ralph Castain
36d4511143 Bring the timing instrumentation to the trunk.
If you want to look at our launch and MPI process startup times, you can do so with two MCA params:

OMPI_MCA_orte_timing: set it to anything non-zero and you will get the launch time for different steps in the job launch procedure. The degree of detail depends on the launch environment. rsh will provide you with the average, min, and max launch time for the daemons. SLURM block launches the daemon, so you only get the time to launch the daemons and the total time to launch the job. Ditto for bproc. TM looks more like rsh. Only those four environments are currently supported - anyone interested in extending this capability to other environs is welcome to do so. In all cases, you also get the time to setup the job for launch.

OMPI_MCA_ompi_timing: set it to anything non-zero and you will get the time for mpi_init to reach the compound registry command, the time to execute that command, the time to go from our stage1 barrier to the stage2 barrier, and the time to go from the stage2 barrier to the end of mpi_init. This will be output for each process, so you'll have to compile any statistics on your own. Note: if someone develops a nice parser to do so, it would be really appreciated if you could/would share!

This commit was SVN r12302.
2006-10-25 15:27:47 +00:00
Brian Barrett
d6ff14ed61 Hand-pack the connection information for each peer rather than just
packing a sockaddr_in, as there are some endianness and padding issues
with sending a sockaddr_in.  Note that the sin_port and sin_addr are
already in network byte order, which is why we pack them as a byte
string.

Refs trac:493

This commit was SVN r12301.

The following Trac tickets were found above:
  Ticket 493 --> https://svn.open-mpi.org/trac/ompi/ticket/493
2006-10-25 15:09:30 +00:00