Commit graph

719 commits

Author SHA1 Message Date
Ralph Castain
13227e36ab This commit looks a lot bigger than it is, so relax :-)
Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off.

To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place.

I used those capabilities in two places:

1. Added an attribute list to the rmgr.spawn interface.

2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h).

So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms.

This commit was SVN r12138.
2006-10-17 16:06:17 +00:00
Rainer Keller
3f88937081 - Error logging is really not yet enabled.
- Correct the error log for orte_errmgr_base_select
 - Spelling fixes

This commit was SVN r12135.
2006-10-17 09:11:20 +00:00
Ralph Castain
16e52c5784 Fix a non-compliance issue regarding hostfiles as reported by Sun. The man page states that an entry that specifies slots_max but does not specify "slots" will have the soft limit default to the hard limit. The hostfile implementation, however, defaulted the soft limit to 1.
This fix changes that behavior to conform to the man page.

This commit was SVN r12129.
2006-10-17 00:43:12 +00:00
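
Note on 16e52c5784: the defaulting rule described in that message can be sketched as below. This is only an illustration, not the actual hostfile parser; the struct, its field names, and the fallback of 1 when neither value is given are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of one parsed hostfile entry. */
typedef struct {
    bool slots_given;      /* "slots=N" appeared on the line */
    bool slots_max_given;  /* "slots_max=N" appeared on the line */
    int  slots;            /* soft limit, valid if slots_given */
    int  slots_max;        /* hard limit, valid if slots_max_given */
} hostfile_entry_t;

/* Resolve the soft limit for an entry. */
static int resolve_soft_limit(const hostfile_entry_t *e)
{
    if (e->slots_given) {
        return e->slots;        /* explicit soft limit wins */
    }
    if (e->slots_max_given) {
        return e->slots_max;    /* man-page rule: default to the hard limit */
    }
    return 1;                   /* assumed default when nothing is specified */
}
```

Before the fix, the second branch would in effect have returned 1 as well.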
Ralph Castain
3f55d6897a Remove the memory debugging options. Fix what appears to be a typo in a help file.
This commit was SVN r12107.
2006-10-12 00:44:48 +00:00
Brian Barrett
fce5130333 Delay opening the listen socket until module init, so that we can have the
seed value have something set to true.  Allow selection of the listen
type to thread if (and only if) the process is the HNP...

This commit was SVN r12105.
2006-10-11 21:29:29 +00:00
Brian Barrett
29c91cf2f3 * Fix issue in odls_bproc where we were using vpid instead of the number of
processes launched locally for the stdio file names.  This was causing
    the expected files to not exist and bproc_vexecmove_io to fail.
  * Clean up a bunch of debugging output in the bproc pls

This commit was SVN r12102.
2006-10-11 20:34:12 +00:00
Ralph Castain
f91a95b3fe Fix the bug that caused mpirun to hang when a remote executable wasn't found using the rsh launcher. Will now test on a remote node
This commit was SVN r12095.
2006-10-11 18:43:13 +00:00
Ralph Castain
2da8245be0 Correctly propagate no-daemonize
This commit was SVN r12093.
2006-10-11 17:53:17 +00:00
Ralph Castain
e5877cc459 Add the proper valgrind params
This commit was SVN r12092.
2006-10-11 17:48:41 +00:00
George Bosilca
b56636c855 orte_pls belongs to the PLS framework, therefore it should only be defined
in pls.h.

This commit was SVN r12089.
2006-10-11 17:12:22 +00:00
Ralph Castain
27e305347c Add a couple of options to orterun that support debugging of daemons for memory corruption.
Ensure that the environment provided to local application processes isn't "polluted" by the orteds

This commit was SVN r12087.
2006-10-11 15:18:57 +00:00
Brian Barrett
f5b8f1f2f0 Work around Automake not knowing how to properly configure libtool to build
Objective C libraries

Refs trac:483

This commit was SVN r12080.

The following Trac tickets were found above:
  Ticket 483 --> https://svn.open-mpi.org/trac/ompi/ticket/483
2006-10-10 20:14:26 +00:00
Ralph Castain
699ffcf359 Restore the "bynode" mapping functionality - accidentally deleted setting of parameter
This commit was SVN r12078.
2006-10-10 19:41:22 +00:00
George Bosilca
7dadc1832d Correctly export the required functions. They are defined in a private file, but they are completely public.
This commit was SVN r12070.
2006-10-10 04:54:51 +00:00
Ralph Castain
0e9dc590b7 Fix typo that didn't make it over from testing on vogon
This commit was SVN r12068.
2006-10-09 20:37:39 +00:00
Ralph Castain
cebdc51762 Remove a debugging output
This commit was SVN r12066.
2006-10-09 02:10:52 +00:00
Ralph Castain
2e09128337 Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destructor to fail!!
Restore the OBJ_RELEASE calls to cleanup map objects.

This commit was SVN r12064.
2006-10-07 22:44:00 +00:00
Ralph Castain
98dd57b70e Add a new option to launch "pernode" - launches one process/node across all available nodes.
The other options also work correctly: "-bynode" with no -np will launch on all *slots*, mapped on a per-node basis.

This commit was SVN r12063.
2006-10-07 19:50:12 +00:00
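
Note on 98dd57b70e: a rough sketch of how the three placement policies mentioned in that message differ. The function names and the flat `slots[]` array are hypothetical and are not taken from the RMAPS code; each function fills `rank_to_node` and returns the number of processes it placed.

```c
/* byslot: fill every slot on a node before moving to the next node. */
static int map_byslot(const int *slots, int num_nodes, int nprocs,
                      int *rank_to_node)
{
    int rank = 0;
    for (int n = 0; n < num_nodes && rank < nprocs; ++n) {
        for (int s = 0; s < slots[n] && rank < nprocs; ++s) {
            rank_to_node[rank++] = n;
        }
    }
    return rank;
}

/* bynode: round-robin across nodes, skipping nodes whose slots are used up. */
static int map_bynode(const int *slots, int num_nodes, int nprocs,
                      int *rank_to_node)
{
    int rank = 0;
    for (int pass = 0; rank < nprocs; ++pass) {
        int placed = 0;
        for (int n = 0; n < num_nodes && rank < nprocs; ++n) {
            if (pass < slots[n]) {
                rank_to_node[rank++] = n;
                placed = 1;
            }
        }
        if (!placed) {
            break;  /* every node is full; stop instead of oversubscribing */
        }
    }
    return rank;
}

/* pernode: exactly one process on each node, regardless of slot counts. */
static int map_pernode(int num_nodes, int *rank_to_node)
{
    for (int n = 0; n < num_nodes; ++n) {
        rank_to_node[n] = n;
    }
    return num_nodes;
}
```

With two nodes holding 2 and 1 slots and three processes, byslot places ranks on nodes 0, 0, 1, while bynode places them on nodes 0, 1, 0.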
Jeff Squyres
efe28d62e9 Fix some compiler errors. I have *not* checked this for correctness;
but it does now compile.

This commit was SVN r12062.
2006-10-07 19:10:56 +00:00
Ralph Castain
ae79894bad Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).

Gridengine compiles but I cannot test (believe it likely will run).

Poe and xgrid compile to the extent they can without the proper include files.

This commit was SVN r12059.
2006-10-07 15:45:24 +00:00
Ralph Castain
82a023c731 Fix a typo that caused a segfault if a caller requested that we abort an array of procs (i.e., MPI_Abort when it specifies the other procs to be aborted).
Should hopefully address the recent problem seen with the BLACS AUX test as discussed on the user mailing list.

This commit was SVN r12055.
2006-10-07 01:58:11 +00:00
Ralph Castain
ee0df85ece Add a stupid, useless const since someone put it in the odls.h file.
This commit was SVN r12054.
2006-10-07 01:42:23 +00:00
George Bosilca
cda46efd2a Some missing DECLSPEC
This commit was SVN r12047.
2006-10-06 15:21:52 +00:00
George Bosilca
017af37291 Keep only the useful _DECLSPEC and _DECLSPEC some globals.
This commit was SVN r12042.
2006-10-06 07:24:34 +00:00
George Bosilca
b7579b09c7 Correctly handle the ORTE_DECLSPEC attribute.
This commit was SVN r12041.
2006-10-06 07:08:17 +00:00
George Bosilca
422ce1d3f8 If we are in the ODLS framework then the types should be called odls not pls.
This commit was SVN r12040.
2006-10-06 07:05:58 +00:00
George Bosilca
7fed79434e Windows is now able to create local processes.
This commit was SVN r12039.
2006-10-06 07:04:43 +00:00
George Bosilca
fa01b9b9aa Last step for the name reversion (from windows back to process).
This commit was SVN r12014.
2006-10-05 06:36:11 +00:00
George Bosilca
b7a793a6db Rename all the files in the process directory.
This commit was SVN r12013.
2006-10-05 06:34:30 +00:00
George Bosilca
fee0909815 Rename the Windows component.
This commit was SVN r12012.
2006-10-05 06:32:57 +00:00
George Bosilca
c79c436c8d Cleanups. Remove all __WINDOWS__ checks as this module will never
get compiled on Windows.

This commit was SVN r12011.
2006-10-05 06:17:30 +00:00
George Bosilca
dbe7f8ac32 Always return bool.
This commit was SVN r12009.
2006-10-05 05:45:18 +00:00
George Bosilca
d1e884fbf5 Make sure we always return a bool (true/false).
This commit was SVN r12002.
2006-10-05 05:27:46 +00:00
George Bosilca
3a34f9340e If the enum is defined inside the struct it will have a scope. We don't
really need that.

This commit was SVN r12001.
2006-10-05 05:27:04 +00:00
George Bosilca
090b8a9098 opal_list_is_empty return true or false ...
This commit was SVN r12000.
2006-10-05 05:26:08 +00:00
George Bosilca
03083cc1f6 Don't release the values[0] before it gets initialized.
This commit was SVN r11999.
2006-10-05 05:25:18 +00:00
George Bosilca
fd76e56279 One protection against C++ compilers is more than enough.
This commit was SVN r11998.
2006-10-05 05:24:43 +00:00
George Bosilca
ad5810e33f ORTE_DECLSPEC what needs to be ORTE_DECLSPEC.
This commit was SVN r11997.
2006-10-05 05:22:22 +00:00
Ralph Castain
faf3a558e6 Missing CR at end of file
This commit was SVN r11959.
2006-10-03 18:17:52 +00:00
Ralph Castain
cd7d87aa7b Define the map data types for dss compatibility. Setup to debug bproc
This commit was SVN r11955.
2006-10-03 17:40:00 +00:00
Ralph Castain
4e39878944 Add a "dump" capability to the DSS so one can display a single data value to an output stream.
Add some comments to the map type def in prep for building its data type support.

This commit was SVN r11947.
2006-10-03 08:40:35 +00:00
Ralph Castain
99f2986db7 Bring comm_spawn back online. Shift the trigger hosting responsibilities to the HNP.
We still have an issue with the io forwarding going through the spawning process, but that will be dealt with at a future time.

This commit was SVN r11943.
2006-10-03 02:07:58 +00:00
Ralph Castain
b269e4da9b Add missing functionality to the ns replica to support remote get_job_peers requests. Add trace commands to help try and track down remaining problem with comm_spawn.
This commit was SVN r11939.
2006-10-02 19:44:35 +00:00
Ralph Castain
9eb14425b7 The last of the debug messages that keep hiding. My apologies.
This commit was SVN r11937.
2006-10-02 18:43:32 +00:00
Ralph Castain
3fd67a038f Bring comm_spawn and persistent operations online. Still some minor problems, though - so don't use them yet, please.
This won't affect anything except those two scenarios.

This commit was SVN r11936.
2006-10-02 18:29:15 +00:00
Ralph Castain
12328395ae Missed a couple of debug statements
This commit was SVN r11935.
2006-10-02 15:46:41 +00:00
Ralph Castain
7494a7a83f Clean out some debugging statements that were inadvertently left in the commit
This commit was SVN r11933.
2006-10-02 15:03:18 +00:00
Ralph Castain
559b9b0ae8 Continue beating on comm_spawn. Setup to debug bproc.
This commit was SVN r11932.
2006-10-02 14:58:22 +00:00
Ralph Castain
65593cd67e Fix a few Cyrador warnings
This commit was SVN r11930.
2006-10-02 13:00:32 +00:00
Brian Barrett
8f7ab1c584 num_procs can be zero if something went partly wrong before. This will
cause a math exception on some platforms, so don't let that happen.

This commit was SVN r11929.
2006-10-02 01:27:22 +00:00
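
Note on 8f7ab1c584: the message does not say which computation was affected, so the following is only a generic sketch of the guard it describes. The function name and the ring-style use of `num_procs` are hypothetical.

```c
/* Return the index of the peer after `current` in a ring of num_procs
 * processes, or -1 if there are no processes.  Without the guard, the
 * modulo by zero is undefined behavior and raises SIGFPE on some platforms. */
static int next_peer(int current, int num_procs)
{
    if (num_procs <= 0) {
        return -1;  /* something went wrong earlier; report it, don't divide */
    }
    return (current + 1) % num_procs;
}
```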
Ralph Castain
121f834776 Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component.
Still need some work on the proxy component, and on job termination for persistent daemon case.

This commit was SVN r11928.
2006-10-02 00:46:31 +00:00
Brian Barrett
e464adcd51 Need to tell the daemon how many procs it will start (always 1, because of
the way we do the fake mapping thing)...

This commit was SVN r11924.
2006-10-01 23:25:22 +00:00
Brian Barrett
95ba51fbd4 * Clean up debugging output so that it's useful
  * Error message in an NSError object is localizedDescription, not
    localizedErrorReason.  The latter is a description of how the error
    can occur, which is usually nothing in XGrid frameworks.
  * Clean up silly error in finding the Kerberos Service Principal
    when using Kerberos authentication
  * Print useful error message when a connection unexpectedly closes,
    as this is usually authentication related...

This commit was SVN r11923.
2006-10-01 22:43:17 +00:00
Brian Barrett
9807a38458 Always initialize the base output stream, but only set verbose if requested.
Otherwise, the PLS components have to pay more attention to debugging streams
than the rest of the OMPI source code

This commit was SVN r11922.
2006-10-01 22:37:30 +00:00
Ralph Castain
33a46cff40 Grrrr....ok, this time actually put a *value* in the silly command buffer!
This commit was SVN r11901.
2006-09-29 21:44:11 +00:00
Ralph Castain
0625ead54a Fix comm_spawn communications
This commit was SVN r11900.
2006-09-29 21:10:48 +00:00
Ralph Castain
0411f9772e Begin instrumenting for scalability tests.
I have added a new MCA param (hey, you can't have too many!) called OMPI_MCA_orte_timing. If set to anything other than zero, the system will report out critical timing loops. At the moment, this includes three measurements:

1. Time spent going through the RDS->RAS->RMAPS, setting up triggers, etc. prior to calling the actual PLS launch function. This is reported out as time to setup job.

2. Time spent in MPI_Init from start of that function (well, right after opal_init) to the place where we send all of our info to the registry. Reported out as time from start to exec_compound_cmd

3. Time actually spent executing the compound cmd. Reported out as time to exec_compound_cmd.

A few additional timing points will be added shortly.

These may eventually be removed or (better) setup with a conditional compile flag.

This commit was SVN r11892.
2006-09-29 13:19:44 +00:00
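
Note on 0411f9772e: a minimal sketch of gating a timing report on the `OMPI_MCA_orte_timing` setting. Reading the environment variable directly, the helper names, and the use of `gettimeofday` are assumptions for illustration; the real code goes through the MCA parameter system and may measure time differently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Non-zero if OMPI_MCA_orte_timing is set to a non-zero number
 * (hypothetical helper; non-numeric values are treated as zero). */
static int timing_enabled(void)
{
    const char *v = getenv("OMPI_MCA_orte_timing");
    return v != NULL && strtol(v, NULL, 10) != 0;
}

/* Seconds elapsed between two gettimeofday() samples. */
static double elapsed_seconds(const struct timeval *start,
                              const struct timeval *stop)
{
    return (double)(stop->tv_sec - start->tv_sec)
         + (double)(stop->tv_usec - start->tv_usec) / 1e6;
}

/* Wrap an arbitrary setup step and report its duration when enabled. */
static void timed_setup(void (*setup)(void))
{
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    setup();
    gettimeofday(&stop, NULL);
    if (timing_enabled()) {
        printf("time to setup job: %.6f sec\n",
               elapsed_seconds(&start, &stop));
    }
}
```

Because the check happens at run time, the measurement code stays compiled in; the conditional compile flag the message mentions would remove it entirely.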
Ralph Castain
db6a93fa63 Fix a couple of reported issues:
1. PLS finalize was not being called. Now ensure that happens during orte_finalize.

2. Errmgr proxies were sending their messages to the wrong tag - typical cut/paste error.

This commit was SVN r11891.
2006-09-29 12:45:50 +00:00
Tim Prins
1b35e7adff cleanup
This commit was SVN r11863.
2006-09-28 13:28:48 +00:00
Brian Barrett
d00a0de716 * It appears that in their infinite wisdom, Apple removed the
__DARWIN_ALIGN_POWER define from the last release of the OS X compiler
    toolchain.  The bug in net/if.h, however, is still there.  So look
    for the hints that we're on a 64 bit Apple PowerPC instead.
  * If we don't find a buffer size that works by 10MB, we're never
    going to.  So add some code to limit the buffer size we'll try
    so that we don't fall into an infinite loop
  * Detect errors in opal_ifcount in the oob init code

Refs trac:420

This commit was SVN r11825.

The following Trac tickets were found above:
  Ticket 420 --> https://svn.open-mpi.org/trac/ompi/ticket/420
2006-09-26 16:37:04 +00:00
Brian Barrett
8943f583bf quiet some debugging output
This commit was SVN r11813.
2006-09-26 04:10:07 +00:00
Brian Barrett
d8d55a760f First attempt at Kerberos support for the XGrid process starter
refs trac:345

This commit was SVN r11812.

The following Trac tickets were found above:
  Ticket 345 --> https://svn.open-mpi.org/trac/ompi/ticket/345
2006-09-26 03:54:38 +00:00
Brian Barrett
9733c8e3bd Update XGrid RAS and PLS to the new infrastructure. Not yet super well
tested, but starting to get there...

This commit was SVN r11810.
2006-09-26 03:26:45 +00:00
Brian Barrett
3c814fdd23 fixes trac:391
Fix for double mutex free that would cause an abort condition in the orted
whenever threads were enabled.

This commit was SVN r11759.

The following Trac tickets were found above:
  Ticket 391 --> https://svn.open-mpi.org/trac/ompi/ticket/391
2006-09-22 19:24:42 +00:00
Andrew Friedley
798c19d395 Blah.. we should always return after try_connect() here, not just when we have an error.
Another fix for ticket #362.

This commit was SVN r11756.
2006-09-22 15:51:11 +00:00
Tim Prins
567676f3c1 - Formatting and minor cleanup
- made it so we now set the architecture of each node we discover
- remove debugging output

This commit was SVN r11751.
2006-09-22 13:24:32 +00:00
Tim Prins
83a7f6e4de Fix for bug #369.
LoadLeveler only sets LOADL_PROCESSOR_LIST when there are 128 or fewer tasks allocated to a job. The POE RAS relied on this variable, so I created a new RAS which uses the LoadLeveler API instead of relying on the environment variable. This still needs some testing, so for now we use the POE RAS whenever LOADL_PROCESSOR_LIST is set; otherwise we fall back on this component.

Unfortunately, this will require an autogen...

This commit was SVN r11732.
2006-09-21 00:08:49 +00:00
Andrew Friedley
8895bf7369 Fix the fix (r11718) for bug #362.
We were still waiting the entire duration of the timeout before we figured out that a connect() was successful.  Re-introduce adding the peer_send_event so that we detect immediately when a connect() completes.

Also make sure to delete the timeout event in complete_connect().

Fixed a struct timeval initialization warning reported by Jeff.

Remove an erroneous opal_output().

This commit was SVN r11724.

The following SVN revision numbers were found above:
  r11718 --> open-mpi/ompi@1b6231a9b5
2006-09-20 14:29:37 +00:00
Andrew Friedley
1b6231a9b5 Fix for running jobs that span multiple 's' partitions on IU BigRed.
Each 's' partition has its own TCP network.  It's fine to use this network for jobs that fit inside the partition, but the TCP OOB errors when trying to connect across two partitions, because there are two disjoint networks.  Each node also has another TCP network connecting ALL nodes together.

So the solution is to actually try all the available TCP interfaces on a node, instead of erroring when the first one fails.

Also, the default TCP connect() timeout is way too long (5 minutes) - use our own timeout mechanism, with the timeout value expressed as an MCA parameter.

This commit was SVN r11718.
2006-09-19 19:33:49 +00:00
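
Note on 1b6231a9b5: the idea of trying every candidate address with a bounded connect time can be sketched as follows. This is not the OOB code, which drives its timeout through the event library; the sketch substitutes a non-blocking `connect()` plus `select()`, and the function names and `timeout_sec` parameter (standing in for the MCA parameter) are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try one address; return a connected (still non-blocking) fd or -1. */
static int connect_with_timeout(const struct sockaddr_in *addr, int timeout_sec)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) == 0) {
        return fd;                      /* connected immediately */
    }
    if (errno != EINPROGRESS) {
        close(fd);
        return -1;                      /* hard failure, e.g. refused */
    }
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { timeout_sec, 0 };
    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0) {
        close(fd);
        return -1;                      /* timed out or select error */
    }
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
        close(fd);
        return -1;                      /* connect finished with an error */
    }
    return fd;
}

/* Walk all candidate addresses instead of failing on the first one. */
static int connect_any(const struct sockaddr_in *addrs, int count,
                       int timeout_sec)
{
    for (int i = 0; i < count; ++i) {
        int fd = connect_with_timeout(&addrs[i], timeout_sec);
        if (fd >= 0) {
            return fd;
        }
    }
    return -1;
}
```

Waiting for writability is what lets a successful connect be noticed as soon as it completes, rather than after the full timeout, which is the issue the follow-up commit 8895bf7369 addresses.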
Tim Prins
c4db5654fa Fix for bug #370
The POE ras did not correctly enter the number of slots per node. This fixes that.

This commit was SVN r11716.
2006-09-19 16:27:15 +00:00
Ralph Castain
977e3c5ca1 Let's see if Cyrador understands this version a little better...
This commit was SVN r11709.
2006-09-19 13:05:40 +00:00
Ralph Castain
0ad0d84afd Add two new API functions to the RMGR, and modify the "spawn" API to support the enhanced MPI-2 functionality.
No implementation backs these new APIs - just placeholders for now.

This commit was SVN r11699.
2006-09-19 01:45:05 +00:00
Ralph Castain
d7e61e40fc Quiet a few warnings from Cyrador
This commit was SVN r11686.
2006-09-18 12:40:42 +00:00
Ralph Castain
8a291afda6 Ensure the rds_private.h file gets included in the distribution
This commit was SVN r11682.
2006-09-16 11:45:02 +00:00
Ralph Castain
f906af983a Forgot to change the silly Makefile.am names - sorry Cyrador!
This commit was SVN r11670.
2006-09-15 04:52:20 +00:00
Jeff Squyres
3e239f4532 Add a missing .ompi_ignore
This commit was SVN r11666.
2006-09-15 02:36:22 +00:00
George Bosilca
4fe39a4e7d The old PLS is now called an ODLS. However, the real name is not windows but process. This
change will follow shortly...

This commit was SVN r11663.
2006-09-14 22:22:34 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00
George Bosilca
17afe7dc9f Do it the correct way, as this is normally compiled as a module.
This commit was SVN r11660.
2006-09-14 21:22:41 +00:00
George Bosilca
01c5a115b2 Don't export the POE module. Only the component has to be exported (visible).
This commit was SVN r11659.
2006-09-14 21:20:31 +00:00
Josh Hursey
908f31fe9f Fix a code clarity issue in the POE PLS.
Allow the POE RAS to be compiled for Linux as well as AIX.
The POE RAS is really a Loadleveler RAS, and IU now has
a cluster that uses Loadleveler in a Linux environment (BigRed).

This seems to be the only thing we need to do so far to run 
Open MPI on BigRed. Yay :)

This commit was SVN r11600.
2006-09-09 05:13:15 +00:00
Josh Hursey
160120b4c5 Fix a cut-n-paste error that causes the 'num_concurrent' to be
set to 1 or 0 instead of the user defined number or default (128).

This caused the PLS to deadlock when using '--debug-daemons' with
more than 2 processes. :(

svn blame says that it was broken in r11347

It is *not* a problem on v1.1 or v1.2 branches.

Bug spotted by Tim Mattox and myself.

This commit was SVN r11575.

The following SVN revision numbers were found above:
  r11347 --> open-mpi/ompi@f52c10d18e
2006-09-08 15:17:17 +00:00
Jeff Squyres
0f11584a6c * Update svn:ignore
* Remove svn:executable from non-executable files

This commit was SVN r11555.
2006-09-07 17:17:40 +00:00
Ralph Castain
9e6e9b8619 Fix a couple of variable declarations
This commit was SVN r11467.
2006-08-28 13:28:10 +00:00
George Bosilca
c2311f6e42 Don't define the yywrap function.
This commit was SVN r11459.
2006-08-28 04:11:25 +00:00
George Bosilca
693c835137 No need to cast as the returned value is already in the
expected type.

This commit was SVN r11458.
2006-08-28 04:10:43 +00:00
George Bosilca
ba1514f2e7 A slightly more Windows friendly version. Unfortunately there
is no support for SGE on Windows.

This commit was SVN r11436.
2006-08-27 04:46:43 +00:00
Pak Lui
131f0eff04 fix the verbose value.
This commit was SVN r11418.
2006-08-24 21:30:08 +00:00
Pak Lui
65a524dd0d - need to provide option for showing the grid engine's JOB_ID in case the grid engine job needs to be killed
- clean up the orted_path and debug message

This commit was SVN r11413.
2006-08-24 20:27:19 +00:00
Pak Lui
4f75dfd353 - missed the opal_os_path() for LD_LIBRARY_PATH
This commit was SVN r11410.
2006-08-24 18:58:50 +00:00
George Bosilca
9110ea2b80 Add the Windows fork component. As fork is not available on Windows, I
create a process component which uses CreateProcess to spawn the child.
Special care should be taken in order to correctly redirect the stdin,
stdout and stderr of the child process.

This commit was SVN r11405.
2006-08-24 17:51:20 +00:00
George Bosilca
0d607c1346 Use opal_os_path and OPAL_PATH_SEP to build the file path. I don't have any
machine to test, so I hope I get it right.

This commit was SVN r11398.
2006-08-24 16:20:32 +00:00
Pak Lui
5220c1ca42 - converted some tabs into spaces
This commit was SVN r11384.
2006-08-23 23:21:08 +00:00
Pak Lui
9dda057f05 - Do the changes as in r11347 for gridengine to use opal_os_path().
- Remove extra NULL argument from rsh module.

This commit was SVN r11377.

The following SVN revision numbers were found above:
  r11347 --> open-mpi/ompi@f52c10d18e
2006-08-23 20:40:01 +00:00
Jeff Squyres
715bae369c Remove extra argument - now obsoleted by the use of opal_os_path().
This commit was SVN r11366.
2006-08-23 14:32:06 +00:00
Brian Barrett
e39f0096a0 * add header file to sources list so make dist works
This commit was SVN r11357.
2006-08-23 13:31:56 +00:00
George Bosilca
c03ef692c1 And the missing header.
This commit was SVN r11348.
2006-08-23 03:33:35 +00:00
George Bosilca
f52c10d18e And ORTE is ready for prime-time. All Windows tricks are in:
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unnamed structures
- no implicit cast.

Plus a full implementation for the orte_wait functions.

This commit was SVN r11347.
2006-08-23 03:32:36 +00:00
George Bosilca
aecdfc80eb Don't forget to release the object if we detect an error.
This commit was SVN r11346.
2006-08-23 02:43:05 +00:00
Ralph Castain
c3ba1c1cc1 Fix a pack/unpack mismatch
This commit was SVN r11315.
2006-08-22 13:50:59 +00:00
Ralph Castain
73a7916946 For Ollie...fix a few names. Should help the Bproc SMR component compile.
This commit was SVN r11284.
2006-08-21 15:11:20 +00:00
George Bosilca
6afa4c6c64 Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.

This commit was SVN r11270.
2006-08-20 15:54:04 +00:00
Ralph Castain
ee04e04dd0 Attempt to cleanup the xgrid pls module
This commit was SVN r11261.
2006-08-18 21:21:31 +00:00
Ralph Castain
6bf06d4602 Fix connect-accept by cleaning up two minor bugs.
This commit was SVN r11260.
2006-08-18 21:12:03 +00:00
Ralph Castain
517d6fda49 Add the smr_private include file so it gets put in tarballs
This commit was SVN r11243.
2006-08-17 12:24:44 +00:00
Ralph Castain
8c7f0ed9ae Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system.
Other changes:

1. Remove the old xcpu components as they are not functional.

2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.

This will require an autogen/configure, I'm afraid.

This commit was SVN r11228.
2006-08-16 16:35:09 +00:00
Ralph Castain
5dfd54c778 With the branch to 1.2 made....
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).

Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).

I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).

In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...

Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.

This commit was SVN r11204.
2006-08-15 19:54:10 +00:00
Brian Barrett
cd7b138d74 propagate up errors when setting up standard input forwarding
This commit was SVN r11187.
2006-08-14 21:09:05 +00:00
Ralph Castain
d2912f03e0 Cleanup a historical naming convention problem. Move the socket_errno definitions to the OPAL layer and change the name accordingly. This cleans up some interrelationship issues as well as removing a name confusion.
This commit was SVN r11186.
2006-08-14 20:14:44 +00:00
Ralph Castain
663e25f7cb Finalize the Bproc vpid algorithm.
Bproc is now fully operational and supports oversubscribed conditions for both bynode and byslot mapping procedures.

This commit was SVN r11180.
2006-08-14 19:16:11 +00:00
Ralph Castain
285aea1c0c Update to bproc algorithm to support oversubscription - committing to move to another test environment.
Note that this may break bproc for the moment.

This commit was SVN r11178.
2006-08-14 18:34:13 +00:00
Ralph Castain
de9156552b I have confirmed that the later version of the bproc launcher does support Bproc 3, so it appears that the outdated bproc_seed launcher truly is no longer required.
This commit was SVN r11164.
2006-08-12 07:47:21 +00:00
Ralph Castain
0ccc910485 Fix the Bproc vpid computation so that, when we map by slot, adjacent processes have vpids differing by only one.
I will amend the documentation in the files shortly to explain why this was previously broken.

This commit was SVN r11162.
2006-08-11 19:41:33 +00:00
Pak Lui
8fab3d5b82 * Inadvertently removed a wrong variable during the last change.
This commit was SVN r11157.
2006-08-11 16:00:39 +00:00
Ralph Castain
59d6f1e2eb Remove ompi_ignores on gridengine components as this seems resolved - thanks Pak for quick response!
Fixed a few very minor compiler complaints in the pls_gridengine_module.c file. ISO C is less forgiving about where variables get declared.

This commit was SVN r11156.
2006-08-11 15:32:17 +00:00
Pak Lui
99a0521e44 * Fix the issue that Ralph observed in MacOS X with an invalid header file
and other warnings.

This commit was SVN r11155.
2006-08-11 15:04:51 +00:00
Ralph Castain
5fd6306c2f Add ompi_ignores until the configuration can be fixed
This commit was SVN r11154.
2006-08-11 14:11:41 +00:00
Pak Lui
08352878cc * Added in new ras and pls components to support Sun N1 Grid Engine (N1GE)
6 and its open source version as the job launchers for ORTE.

This commit was SVN r11153.
2006-08-10 21:46:52 +00:00
Ralph Castain
bd937b219d Tell xcast not to send to processes that have "aborted".
One of those fixes that has been sitting on another branch for awhile...sigh.

This commit was SVN r11142.
2006-08-09 18:23:43 +00:00
Ralph Castain
8496b6aff4 When a "fork" launch cannot find the executable, the system used to just return an error. This meant that the state of that process was never updated in the registry, leaving the counters at the incorrect levels. As a result, the triggers would never fire to indicate that the job had been aborted. This left orterun and other orteds/processes hanging.
This fix should fix the problem. I will test it on a broader range of systems forsooth...

This commit was SVN r11140.
2006-08-09 15:29:08 +00:00
Ralph Castain
ddd575d126 Ensure that the localhost gets placed on the registry with the same name as found in the system_info structure. Otherwise, we wind up with confusion in the session directory names.
This commit was SVN r11139.
2006-08-09 15:26:37 +00:00
Brian Barrett
59844f2119 Galen noticed that the soh component wasn't linking against the bproc
libraries.  Fix that issue.

This commit was SVN r11119.
2006-08-07 16:20:33 +00:00
Brian Barrett
16186978bb - Fix some compile issues in r11109
- indent / whitespace cleanup
- don't set --daemon-debug when pls debug is given, as it seems to make
  the daemons abort.

This commit was SVN r11113.

The following SVN revision numbers were found above:
  r11109 --> open-mpi/ompi@da7df6d257
2006-08-03 18:51:42 +00:00
Galen Shipman
da7df6d257 monitor bproc node state and terminate the job if a node in our job goes
down.. 

This commit was SVN r11109.
2006-08-03 05:29:49 +00:00
Josh Hursey
d1e1a68645 This commit contains the necessary changes to get "mpirun a.out" working
correctly with MPI_Comm_spawn.

The problem with MPI_Comm_spawn was that the 'parent' process was
rmgr.create'ing and then rmgr.launch'ing the children via the rmgr proxy
component. The HNP saw these commands and processed them normally, but
since we never went through the HNP's rmgr (urm component) spawn() 
logic the triggers and key/value pairs were never created. So the
children were launched correctly, but since the HNP did not
have any triggers setup, never triggered the xcast for the
children to finish orte_init().

This fix puts the trigger and key/value pair initialization in 
rmgr_urm_spawn() for the 'mpirun a.out' case, *and* in the 
rmgr_base_unpack routine that deals with the creation of the
job for the child as requested by the proxy component. This
will allow the triggers to be registered for the proxy's request
which only happens during MPI_Comm_spawn*

Small change for a lot of debugging. Notice that this reverts r11037
to its previous version, and adds a newline to handle the spawn
cases.

This commit was SVN r11046.

The following SVN revision numbers were found above:
  r11037 --> open-mpi/ompi@5813fb7d2a
2006-07-28 17:17:31 +00:00
Josh Hursey
5813fb7d2a It seems that MPI_Comm_spawn{_multiple} has been broken since r10708.
Reverting this file (changeset from commit r10708) to its previous
version fixes the problem.

This should be moved to the v1.1 branch where it is also broken.

This commit was SVN r11037.

The following SVN revision numbers were found above:
  r10708 --> open-mpi/ompi@febc143d8c
2006-07-27 21:21:10 +00:00
Brian Barrett
c744f650ba * really didn't mean for this patch (the threaded accept() code) to come in with
r10841, so revert it (and its fixes) out.  Will bring back once cleaned up from
  the code used in the tbird experiment

This commit was SVN r10991.

The following SVN revision numbers were found above:
  r10841 --> open-mpi/ompi@dfa1221c3b
2006-07-25 22:32:01 +00:00
Jeff Squyres
c2d4dfce78 Remove unused variable
This commit was SVN r10985.
2006-07-25 21:43:21 +00:00
Jeff Squyres
bdab8d744c Send a pointer to the data, not the data itself. Otherwise, we could
get a segv in some cases.

This commit was SVN r10984.
2006-07-25 21:42:44 +00:00
Ralph Castain
65acc9325a Fix a bug that crept in during the last change to support "mpirun a.out" operations. Since we now reserve a range of vpids for each app_context, we no longer need to track the rank and offset the starting vpid each time through the mapper - the name service automatically accounts for the offset when allocating the next starting vpid for the job.
This should be shifted to v1.1.

This commit was SVN r10916.
2006-07-20 21:06:15 +00:00
Ralph Castain
8bec270f90 Fix a bug noted by Jeff - we were no longer accurately recording in the registry that a process had been terminated when the user initiated the "kill" process (via ctrl-c).
Added another system-level test function for ORTE that just spins until terminated by a ctrl-c signal.

Modified orterun - added a couple of newlines to the output when abnormally terminating so the prompt always is on a new line.

This commit was SVN r10866.
2006-07-18 14:42:27 +00:00
Gleb Natapov
f15fc4ef2f include signal.h for SIGPIPE definition
This commit was SVN r10863.
2006-07-18 09:07:53 +00:00
Brian Barrett
2185c059e8 * use opal_free_list_item_t as the type of items stored in an opal_free_list_t,
rather than assuming it's an opal_list_item_t.

This commit was SVN r10860.
2006-07-17 21:51:50 +00:00
Jeff Squyres
82161d20ca Catch a SIGPIPE and allow it to be harmless. Register a no-op SIGPIPE
handler before the write() and de-register it afterwards.  Determine
if the write() succeeded or failed by the return of write().

This commit was SVN r10858.
2006-07-17 21:15:56 +00:00
George Bosilca
33a7634009 Silence the compiler.
This commit was SVN r10851.
2006-07-17 17:13:28 +00:00
Ralph Castain
404acc9f65 It's okay to call index prior to anything being put in the registry...
This commit was SVN r10848.
2006-07-17 14:31:42 +00:00
Ralph Castain
574a6f7896 Fix a bug that caused the system to crash when asked for an index of the segment names. Such a request required passing a NULL value for the segment name, but the find_seg function didn't protect itself from that value.
Thanks to James Kennedy (UCC-Ireland) for finding it.

This commit was SVN r10847.
2006-07-17 13:51:07 +00:00
Brian Barrett
dfa1221c3b * AC_CONFIG_LINKS has a minor problem in that it always uses ln -s, rather
than $(LN_S).  This causes problems with Windows and probably
  elsewhere (re: #200).  So use a slightly different trick to get the
  right header selected for the MEMCPY and TIMER components.

* Using the same trick used to solve the AC_CONFIG_LINKS problem, 
  stop using a separate header file for direct calling in the
  PML and MTL.  This lets me remove some icky code in ompi_mca.m4
  that was more fragile than I really liked.

This commit was SVN r10841.
2006-07-16 04:23:52 +00:00
Jeff Squyres
ffddfc5629 Turns out that it's a really Bad Idea(tm) to tm_spawn() and then not
keep the resulting tm_event_t that is generated because the back-end
TM library actually caches a bunch of stuff on it for internal
processing, and doesn't let go of it until tm_poll().

tm_event_t's are similar to (but slightly different than)
MPI_Requests: you can't do a million MPI_Isend()'s on a single
MPI_Request -- a) you need an array of MPI_Request's to fill and b)
you need to keep them around until all the requests have completed.

This commit was SVN r10820.
2006-07-14 22:04:41 +00:00
Rainer Keller
50b5791969 - Release best_item
- Reformat

This commit was SVN r10814.
2006-07-14 19:55:14 +00:00
Ralph Castain
7b3ced80e8 Fix a bug that has been causing inconsistent behavior on a number of platforms. Will explain more on the core-devel list.
Jeff: this needs to be back-patched to our supported prior releases. I'll try to verify how far back we need to go - my initial guess is probably all of them

This commit was SVN r10801.
2006-07-14 14:16:20 +00:00
Ralph Castain
cef1ce19d6 Restore the "sleep" delay during startup.
Since Jeff and I are going to a branch for T-bird, we have restored the trunk to its prior state to avoid any possibility of disturbing it.

This commit was SVN r10774.
2006-07-12 22:18:53 +00:00
Jeff Squyres
ef8433a60b After more discussion on the phone, it seems easier to not muck around
in special components but rather go down to a /tmp branch.  So
removing these components and I'll branch next.

This commit was SVN r10771.
2006-07-12 22:12:29 +00:00
Jeff Squyres
62c189ea1c Fix a few blanket search/replaces
This commit was SVN r10768.
2006-07-12 21:54:05 +00:00
Ralph Castain
badd3f4acb Clean up a few lingering references to "urm".
This commit was SVN r10765.
2006-07-12 21:01:21 +00:00
Jeff Squyres
36ca7497d1 Update m4 and configure files
This commit was SVN r10764.
2006-07-12 20:55:39 +00:00
Ralph Castain
9102b5af3b Remove the "sleep" delay in the oob connection procedure. This shouldn't cause any problems, especially for launches of less than 1000 processes.
Please report any abnormal behavior during launch, though, as we would like to understand what (if any) impact is seen. I couldn't see any on small jobs (the modulo functions render this number down pretty low).

This commit was SVN r10763.
2006-07-12 20:31:30 +00:00
Ralph Castain
a84898316c Create new components to support Thunderbird scalability development
This commit was SVN r10762.
2006-07-12 20:28:23 +00:00
Brian Barrett
4b70bb92db * Per ticket #112, localhost checks should check against 127.0.0.1/8, rather
than just 127.0.0.1.
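
As an illustration of the difference, a check against the whole 127.0.0.0/8 block might look like the sketch below. The function name is_ipv4_loopback is hypothetical and is not taken from the Open MPI source; the sketch assumes IPv4 only and a platform that provides inet_pton().

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an Open MPI function): returns true when the
 * dotted-quad IPv4 string lies anywhere in 127.0.0.0/8, not only when it
 * equals 127.0.0.1. Unparsable input is treated as "not loopback". */
static bool is_ipv4_loopback(const char *dotted_quad)
{
    struct in_addr addr;
    uint32_t host_order;

    if (inet_pton(AF_INET, dotted_quad, &addr) != 1) {
        return false;
    }

    /* s_addr is in network byte order; convert before masking. */
    host_order = ntohl(addr.s_addr);

    /* A /8 prefix means only the top 8 bits (the first octet) must match. */
    return (host_order & 0xFF000000u) == 0x7F000000u;
}
```

Under this check an address such as 127.0.1.1 counts as local, which a plain comparison with 127.0.0.1 would miss.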

This commit was SVN r10750.
2006-07-11 20:54:49 +00:00
Ralph Castain
11125dd67a George has a retarded compiler - but that's okay. This will quiet its warning system.
This commit was SVN r10736.
2006-07-11 15:27:02 +00:00