1
1
Граф коммитов

1482 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
45767b038c Ensure that no-daemonize is correctly set
This commit was SVN r16079.
2007-09-10 14:50:54 +00:00
Brian Barrett
cfe737d1f9 Fix some mistaken error checks -- errors will be less than zero, not
greater than zero

This commit was SVN r16008.
2007-08-29 18:52:51 +00:00
Jeff Squyres
5628084fec Fix Coverity CID 463: remove unused variable / dead code.
This commit was SVN r15999.
2007-08-29 01:30:15 +00:00
Jeff Squyres
8f10c285ef Fix Coverity CID 466: remove unused variable / dead code.
This commit was SVN r15998.
2007-08-29 01:25:03 +00:00
Brian Barrett
dcf678dbab Fix heterogeneous issue with non-blocking RML receive, where the sender
field could be in the wrong endianness

This commit was SVN r15989.
2007-08-28 20:54:52 +00:00
Josh Hursey
5a029a47bd forgot to separate the arguments
This commit was SVN r15940.
2007-08-21 19:43:41 +00:00
Josh Hursey
db79f2392e Make sure to enable C/R support for the HNP when restarting.
This commit was SVN r15931.
2007-08-19 20:43:33 +00:00
Josh Hursey
729c63cf9d Fix invalid MCA 'base' names so they appear in ompi_info.
A subset of this patch needs to be applied to v1.2

Refs trac:928

This commit was SVN r15918.

The following Trac tickets were found above:
  Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928
2007-08-18 03:05:45 +00:00
Brian Barrett
8294f6de03 The portals_utcp component doesn't actually need the POrtals libraries
and only pokes at environment variables.  So don't link in the libraries,
as that causes a whole other set of problems.

This commit was SVN r15899.
2007-08-17 03:48:39 +00:00
Andrew Friedley
2eedcd2539 Fixes trac:1047
Tie stdin to /dev/null to prevent stdin from being closed and thus making stdin not work in slurm allocations.

This commit was SVN r15892.

The following Trac tickets were found above:
  Ticket 1047 --> https://svn.open-mpi.org/trac/ompi/ticket/1047
2007-08-16 20:49:27 +00:00
Tim Prins
5a795128af Change it so that different components in orte use unique rml tags
This commit was SVN r15881.
2007-08-16 14:02:35 +00:00
Brian Barrett
fe0d1f30d5 need errno.h
This commit was SVN r15862.
2007-08-15 02:15:33 +00:00
Brian Barrett
330003361b * Free memory from asprintf
* need to compare ERANGE to errno

This commit was SVN r15860.
2007-08-14 21:12:00 +00:00
Brian Barrett
881dd0654e * Provide a hook so that a PLS can tell the orted it's starting that it
needs to override the default umask.  By default, this is not used
    since most environments do what the user would expect without any
    help.
  * Have TM use the newly added umask hook, so that processes inherit
    the user's umask from mpirun rather than the pbs_mom's umask, which
    the user has no control over.

This commit was SVN r15858.
2007-08-14 18:44:52 +00:00
Shiqing Fan
eea712f9ab - Export those components in correct way.
This commit was SVN r15804.
2007-08-08 16:20:17 +00:00
Brian Barrett
59524a9009 Fix issue where we set state to SHUTDOWN rather than CONNECTING when we
had to switch socket types.

This commit was SVN r15784.
2007-08-06 22:55:41 +00:00
Ralph Castain
eb3a97f428 Don't overwrite the local rank key
This commit was SVN r15776.
2007-08-06 16:56:23 +00:00
Shiqing Fan
d10570786c - A small fix, add missed flag parameters.
This commit was SVN r15774.
2007-08-06 16:15:38 +00:00
George Bosilca
d658a477af Update the help file to match the real name of the required argument.
This commit was SVN r15762.
2007-08-04 00:35:55 +00:00
Josh Hursey
755658694e Bring in changes to support Cray's Compute Node Linux (CNL) and
Application Level Placement Scheduler (ALPS).

This commit was tested under two Cray machines at ORNL: Jaguar (Catamount)
and Rizzo (CNL Test cage). Both machines performed as they should across
the commit.

It is likely that mor changes will follow this the work and environment
stabilizes.

Most of the infrastructure works the same for Catamount and CNL
except for a few bits. Below are the highlights:

Default IFACE Change:
 On Catamount we can use PTL_IFACE_DEFAULT, but on the CNL system we have access
 to will fail on this interface, and should be set to:
    IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS).
 So if we detect that we are running with YOD then use the former interface
 and if we detect that we are running with ALPS then use the latter.
 We will want to pursue a more elegant solution if this interface continues to 
 change across machines.

PtlGetId and cnos_register_ptlid:
 The header suggests that these should never be called when launching with YOD.
 But in the ALPS environment the cnos_barrier() will hang forever if these 
 functions are not called after PtlNIInit(). Since these functions only need to
 be called once, and the orte rmgr/cnos component is loaded before the ompi 
 common/portals componet then just call these functions once in the rmgr/cnos
 component.

cnos_barrier_init():
 This is a noop for YOD, but critical for ALPS. So be sure to call it before
 calling the first barrier in the rmgr/cnos component.

cnos_barrier vs cnos_pm_barrier:
 It is suggested the cnos_pm_barrier only be used during finalization 
 as it will indicate to the launcher (yod or aprun) that the app is about
 to complete. It was suggested that we use the regular cnos_barrier() instead.
 I want to look into this a bit more to make sure there are not adverse
 side effects. A note has been placed in the code to indicate this reasoning.

This commit was SVN r15756.
2007-08-03 19:46:38 +00:00
Jeff Squyres
106beff744 Ahem. Apparently we should be checking for ORTE_EQUAL upon return
from orte_ns.compare_fields(), not 0 (yes, they're the same [today],
but it is much better to check for symbolic names...).

This commit was SVN r15731.
2007-08-01 18:59:37 +00:00
Jeff Squyres
8d4b6c7b0d The HNP changing into an orted brought a bug in the iof svc component
to light: we weren't ack'ing properly for streams that originated (or
originated via proxy) and terminated within the HNP.  This commit
fixes that.

It also fixes a few style issues, and added some more opal_outputs for
debugging.  Also, fixed a bug where the fact that we forwarded (and
therefore might need to update the ack) was not correctly reported if
there were multiple forwards (which there are not as the system is
currently using IOF, but there could be).

Refs trac:1098 -- want to get another pair of eyes to look at this before
I close the ticket.

This commit was SVN r15730.

The following Trac tickets were found above:
  Ticket 1098 --> https://svn.open-mpi.org/trac/ompi/ticket/1098
2007-08-01 18:38:03 +00:00
Ralph Castain
066ff38d42 Ensure we read all the reported URI contact info when we fork an HNP for singleton support
This commit was SVN r15714.
2007-07-31 18:55:08 +00:00
Shiqing Fan
0f468f3668 - Remove the solution and project files, will commit them later.
This commit was SVN r15705.
2007-07-31 17:07:02 +00:00
George Bosilca
2e2bf472ff Mark the orte_abort function as noreturn and change the return value from
int to void. This function call exit at the end, so there is no way to
return from there. Apply the same thing to the errmsg_abort function and
update all components.

This commit was SVN r15704.
2007-07-31 16:09:52 +00:00
Sven Stork
855434de59 - fixes several coverty issues
- add missing initialisation for variables
  - use strncpy instead of strcpy

This commit was SVN r15683.
2007-07-30 14:44:37 +00:00
Shiqing Fan
4d7b349cdb - Add VC8 solution and project files.
- If one wants to use this solution, remember to unload the project 'orte-restart' which is currently not working for Windows.

This commit was SVN r15680.
2007-07-30 11:05:34 +00:00
Rainer Keller
2c5d07217d - Coverity: use snprintf, instead of sprintf....
This commit was SVN r15669.
2007-07-29 11:23:23 +00:00
Jeff Squyres
3858cf48c0 Stop using the deprecated ORTE_NAME_ARGS() and switch to
ORTE_NAME_PRINT().

This commit was SVN r15665.
2007-07-27 13:33:20 +00:00
Josh Hursey
acbc8ecca3 - On Cray XT systems stop the grpcomm basic component from building.
grpcomm cnos component
  - Remove the .ompi_ignore
  - add a configure.m4 that should keep it from building on any system
    other than Cray XT* (copied from rml/cnos)
  - Fix some mis-named symbols resulting from cut/paste errors.

This patch brings the Cray build back into 'working' order.

This commit was SVN r15651.
2007-07-26 20:42:06 +00:00
Jeff Squyres
188d529beb * We *do* need the LSF task ID as part of our vpid
* Accidentally had the PLS LSF using the env SDS; switch it back to
   the LSF SDS

This commit was SVN r15650.
2007-07-26 20:22:36 +00:00
Josh Hursey
e5a03e7734 - Remove Makefile.in from version control
- Add back support for cnos (copy functionality lost by moving the interface
  from the RML).
- Fix some cut/paste errors.

This commit was SVN r15646.
2007-07-26 18:52:17 +00:00
Jeff Squyres
75192de1fc LSF support is now working. W00t! May be subject to a further tweak
or two.

 * checking lsb_init() is not sufficient to know whether you're in an
   LSF job or not; you also need to check for environment variable
   markers 
 * remove lots of debugging output
 * no need for the sds lsf to call lsb_init()
 * remove some slurm-like dead code and a copy-n-paste error in the
   sds lsf

This commit was SVN r15644.
2007-07-26 18:49:29 +00:00
Jeff Squyres
8e9c71282d Add a bunch more [conditional] debugging output.
This commit was SVN r15643.
2007-07-26 18:46:46 +00:00
Rich Graham
60df8be1a7 initial code - does not even compile, but Josh is picking up on this.
This commit was SVN r15641.
2007-07-26 17:55:51 +00:00
Jeff Squyres
d0137acaa4 If --debug-daemons-file is specified, it should also imply
--debug-daemons.

This commit was SVN r15640.
2007-07-26 17:49:13 +00:00
Brian Barrett
801fffabff Don't assume things about the contact info string in the general case. There
is no need for the IP address in most cases (filem being one dubious
exception), so just publish and hand around the supposedly opaque contact
info strings

This commit was SVN r15638.
2007-07-26 16:51:41 +00:00
Sven Stork
5fd6c69019 - fix a problem showed up with the sun thread tests.
Remove unnecessary locks because functions that are calling this
  function proper lock/unlock the orted_comm_mutex. Therefore this 
  unlocks cause some imballance.

This commit was SVN r15630.
2007-07-26 11:30:27 +00:00
Brian Barrett
e537cc0871 * Add documentation for RML base code
* Move function declaration out of base.h as it isn't needed
    outside the base code

This commit was SVN r15616.
2007-07-25 16:19:29 +00:00
Brian Barrett
f06b61cff9 Don't use the OOB TCP key for contact information, remove the need to
include a not so public header file.  FIxes a compile error on the Cray.

This commit was SVN r15613.
2007-07-25 15:12:07 +00:00
George Bosilca
00796cfdab Make sure the oob_tcp_windows_progress_callback is registered
in all cases. This is now done in the oob tcp open function.
As a result, the unregistering have to be done in the close
function.

This commit was SVN r15603.
2007-07-25 05:55:14 +00:00
George Bosilca
5d8a70e434 Update the Windows ODLS.
This commit was SVN r15600.
2007-07-25 03:57:25 +00:00
George Bosilca
c961cb5749 The Windows support is now back in bussiness.
This commit was SVN r15599.
2007-07-25 03:55:34 +00:00
Brian Barrett
4e23c7c5a2 Fixes for case where IPv6 support is disabled. Fixes trac:1102.
This commit was SVN r15584.

The following Trac tickets were found above:
  Ticket 1102 --> https://svn.open-mpi.org/trac/ompi/ticket/1102
2007-07-24 17:01:39 +00:00
Josh Hursey
1b177cd029 This component is checkpointable.
This commit was SVN r15567.
2007-07-23 20:20:28 +00:00
Josh Hursey
a24e530f8e Some C/R fixes (more to come)
r15390 - Changed the paradigm in which the runtime worked by enabling the mpirun
process to become an orted and spawn processes. This broke the C/R for this 
special case as it required that the orted start the process, and that 
the hierarchy remains.
The fix was to allow the global coordinator to be a local coordinator as well
for this case.

r15528 - Changed the selection logic for the RML. This caused the application to
segv if the 'ftrm' wrapper component was selected as it tried to modify a NULL
pointer.
The fix was to move the 'module swap' code into the init() function, and swap
when passed a NULL pointer. It sounds bad, but actually cleans up the code a bit
more.

Still have to fix the 'routed' framework.

This commit was SVN r15566.

The following SVN revision numbers were found above:
  r15390 --> open-mpi/ompi@bd65f8ba88
  r15528 --> open-mpi/ompi@39a6057fc6
2007-07-23 20:13:37 +00:00
Ralph Castain
f219cc1e6e A few changes to the lsf components - mostly cleanup, no major logic changes
This commit was SVN r15563.
2007-07-23 18:38:36 +00:00
Ralph Castain
d99c764e75 Resolve a problem where the orte daemon comm functions were being accessed by mpirun while still retaining occasional reference to the orted_globals. Remove all dependence on orted_globals from the comm functions. Move those functions back into their own file to make it easier to maintain the separation. Ensure that mpirun ignores any "exit" commands being sent to daemons as it will exit on its own.
This commit was SVN r15562.
2007-07-23 18:36:33 +00:00
Ralph Castain
2017d52df0 Cleanup a few compiler warnings
This commit was SVN r15560.
2007-07-23 18:30:40 +00:00
Sven Stork
4c031de1ab - fix typo to export component structure in the case of visibility enabled
- use BEGIN/END_C_DECLS

This commit was SVN r15559.
2007-07-23 17:33:13 +00:00
Ralph Castain
db267899be Setup and use a tsd ring buffer to avoid overwriting process name outputs when print is called multiple times in same output statement.
This commit was SVN r15558.
2007-07-23 17:27:14 +00:00
Sven Stork
baf5e4b596 - add orte_config.h as first file to be included
- export required symbol

This commit was SVN r15556.
2007-07-23 15:50:55 +00:00
Ralph Castain
ef141d1fbc Ensure daemons know contact info for all other daemons. Update binomial xcast to work in revised design. Add debug output to orted so the daemon lets us know it launched (if --debug-daemons set) early on in case it fails during orte_init
This commit was SVN r15555.
2007-07-23 15:00:39 +00:00
Ralph Castain
6c800d452d Bring orte tests up to date with revised rml system.
Make first cut at fixing non-direct xcast modes

This commit was SVN r15553.
2007-07-23 13:05:34 +00:00
Jeff Squyres
78d214fec8 Oops -- didn't mean to commit the test program...
This commit was SVN r15538.
2007-07-20 20:15:51 +00:00
Jeff Squyres
2baa866026 Compiles to the new API, but doesn't quite work yet...
This commit was SVN r15537.
2007-07-20 19:49:27 +00:00
George Bosilca
1751b289ed Avoid a compiler warning about uninitialized variables.
This commit was SVN r15534.
2007-07-20 04:07:19 +00:00
George Bosilca
d1424689ce Always release the buffer (this imply the buffer has to be created
outside the special case).

This commit was SVN r15533.
2007-07-20 04:06:39 +00:00
George Bosilca
172a4fa543 Includ a missing header file.
This commit was SVN r15532.
2007-07-20 03:24:52 +00:00
Brian Barrett
9a476c7fdd The print function doesn't change the process name, so take a const...
This commit was SVN r15531.
2007-07-20 02:54:43 +00:00
Brian Barrett
5b9fa7e998 reapply r15517 and r15520, which were removed in r15527 so that I could get
the RML/OOB merge in slightly easier

This commit was SVN r15530.

The following SVN revision numbers were found above:
  r15517 --> open-mpi/ompi@41977fcc95
  r15520 --> open-mpi/ompi@9cbc9df1b8
  r15527 --> open-mpi/ompi@2d17dd9516
2007-07-20 02:34:29 +00:00
Tim Prins
9602230ad6 remove defunct file
This commit was SVN r15529.
2007-07-20 02:10:38 +00:00
Brian Barrett
39a6057fc6 A number of improvements / changes to the RML/OOB layers:
* General TCP cleanup for OPAL / ORTE
  * Simplifying the OOB by moving much of the logic into the RML
  * Allowing the OOB RML component to do routing of messages
  * Adding a component framework for handling routing tables
  * Moving the xcast functionality from the OOB base to its own framework

Includes merge from tmp/bwb-oob-rml-merge revisions:

    r15506, r15507, r15508, r15510, r15511, r15512, r15513

This commit was SVN r15528.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r15506
  r15507
  r15508
  r15510
  r15511
  r15512
  r15513
2007-07-20 01:34:02 +00:00
Brian Barrett
2d17dd9516 temporarily back our r15517 and 15520 so that I can get the RML / OOB changes
to cleanly apply

This commit was SVN r15527.

The following SVN revision numbers were found above:
  r15517 --> open-mpi/ompi@41977fcc95
2007-07-20 01:10:34 +00:00
George Bosilca
9cbc9df1b8 Export the orte_ns_base_print_name_args function.
This commit was SVN r15520.
2007-07-19 21:45:27 +00:00
Ralph Castain
41977fcc95 Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but...
Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point.

Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings.

This commit was SVN r15517.
2007-07-19 20:56:46 +00:00
Ralph Castain
2110064a9a Ensure that the LD_LIBRARY_PATH and PATH get properly set for procs locally spawned by mpirun.
This commit was SVN r15516.
2007-07-19 19:00:06 +00:00
Ralph Castain
b6c60dfc07 Bring over the extra debugging output that helped a user find his NSF mount problems. This just adds ERROR_LOG messages when the session directory creation process fails so we can see where it is happening - really helps users (and us as well) figure out what specifically went wrong.
This commit was SVN r15491.
2007-07-18 19:50:54 +00:00
Tim Prins
e41f86dfe6 add a small amount of debugging output
This commit was SVN r15483.
2007-07-18 15:20:55 +00:00
Sven Stork
c43335d671 - release the lock before we forward a message. In the threaded
build it's possible that we have to process an ack before this function
  returns. If we don't release the lock here we cause a deadlock later
  in ack processing function.

This commit was SVN r15441.
2007-07-16 14:43:24 +00:00
Ralph Castain
5121dfe7e7 With the changes to the failed-to-start logic, we need to revise the odls so it doesn't overwrite the exit status on procs that are not found. Otherwise, we lose the appropriate error message to the user.
This commit was SVN r15440.
2007-07-16 13:50:26 +00:00
Ralph Castain
cd9213a9f0 Son of a gun - how did that fix not get into r15390?
Fix the blasted iof null component so it only is selected if/when directed.

This commit was SVN r15437.

The following SVN revision numbers were found above:
  r15390 --> open-mpi/ompi@bd65f8ba88
2007-07-16 12:38:11 +00:00
Ralph Castain
d109e9a6f4 Roll in the Voltaire core/socket/etc process mapping implementation. Only change I made was to cleanup some of the diagnostic output in the odls_default component so it uses the -mca odls_base_verbose parameter.
You will not see any impact from this change unless you use the syntax described in ticket #1023. I've tried as many of the RAS components as possible and saw no problem - there may be issues with other RAS components that would not compile on any of my systems. Anything that appears should be trivial to fix.

This commit was SVN r15427.
2007-07-14 15:14:07 +00:00
Josh Hursey
eeba2cb871 Add a comment to clarify the relationship between
mca_base_cmd_line_process_args() and opal_init_util() so
we do not forget their ordering needs, and subtle relationship.

This commit was SVN r15412.
2007-07-13 19:08:05 +00:00
Sven Stork
3bb6ca6a3d - export orted symbols
This commit was SVN r15411.
2007-07-13 18:49:43 +00:00
Josh Hursey
20bbe9ee2b Bring back the AMCA fix for the orted from r15219.
This fixes C/R support for the trunk which regressed in r15390 due to the RTE
changes.

This commit was SVN r15409.

The following SVN revision numbers were found above:
  r15219 --> open-mpi/ompi@f88aa6c273
  r15390 --> open-mpi/ompi@bd65f8ba88
2007-07-13 18:15:36 +00:00
Tim Prins
d2f0806420 Fix a problem introduced by r15390 which was causing strange failures in numerous areas.
We no longer store whether we are a singleton in a MCA parameter, we now use a global constant. So all references to the MCA parameter must be removed.

This commit was SVN r15408.

The following SVN revision numbers were found above:
  r15390 --> open-mpi/ompi@bd65f8ba88
2007-07-13 17:52:16 +00:00
Ralph Castain
2bded34a1d Fix a problem observed by Brian where processes launched local to mpirun lost their environment except for MCA params.
The problem stemmed from no longer launching a local orted on the same node as mpirun. The orted would save and reuse the base environment. Mpirun didn't do that, and the odls was using the orted's globally saved environment (which wasn't being set).

This fix establishes a globally accessible base launch environment that both the orted and mpirun can utilize. Since we now use that, we don't need to pass it to the odls_launch_proc function, so remove that param from the API (and modify all components to handle the change).

This commit was SVN r15405.
2007-07-13 15:47:57 +00:00
Jeff Squyres
24a28494a6 Fix what looks like a cut-n-paste error.
This will not cause everyone to run autogen because this component is
.ompi_ignore'd for everyone except jsquyres and rhc.

This commit was SVN r15401.
2007-07-13 14:47:03 +00:00
Ralph Castain
ec79f264a6 The iof is still having problems with non-orte/ompi programs, apparently. Turn off the iof_flush when a program terminates as this will (in the case of non-orte/ompi programs) cause mpirun to hang.
This commit was SVN r15399.
2007-07-13 13:05:46 +00:00
Jeff Squyres
b20248709a Next round of LSF commits. Getting farther, but it still doesn't
fully work yet (everything is still .ompi_ignore'ed for everyone).

This commit was SVN r15398.
2007-07-13 11:57:17 +00:00
George Bosilca
52eebd706f Update the xgrid PLS to fit the current interface of the PLS.
This commit was SVN r15396.
2007-07-13 06:18:16 +00:00
Ralph Castain
bd65f8ba88 Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments.
Short description: major changes include -

1. singletons now fork/exec a local daemon to manage their operations.

2. the orte daemon code now resides in libopen-rte

3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided.

I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine.

This commit was SVN r15390.
2007-07-12 19:53:18 +00:00
Jeff Squyres
4439734a8a It compiles (too)! Getting a little further...
This commit was SVN r15383.
2007-07-12 14:46:41 +00:00
Jeff Squyres
aa2c64d66d It compiles! That's a start... :-)
This commit was SVN r15382.
2007-07-12 14:41:09 +00:00
Jeff Squyres
e51bb19fab Fix some include files
This commit was SVN r15381.
2007-07-12 14:22:47 +00:00
Ralph Castain
39013e2a18 Clean up a couple of minor typos. Bring the new bproc-related RAS components online.
This commit was SVN r15328.
2007-07-10 14:11:26 +00:00
Jeff Squyres
64083570f5 Add support for DDT parallel debugger, which required several things:
* Making some symbols and types be global (vs. static) in orterun
 * Adding a "ddt" entry in the MCA parameter orte_base_user_debugger
   default value
 * Add support for @executable@, @executable_argv@, and @single_app@
   tokens in the orte_base_user_debugger MCA parameter.
 * Added various error checks and corresponding help messages after
   finding a debugger in the PATH

Fixes trac:1081

This commit was SVN r15323.

The following Trac tickets were found above:
  Ticket 1081 --> https://svn.open-mpi.org/trac/ompi/ticket/1081
2007-07-10 12:53:48 +00:00
Ralph Castain
a1bf04f39e First cut at revamping bproc support to separate it out from LANL's configuration.
First cut at adding support for LSF

Lots of ompi_ignores so only Jeff and I will see this stuff

This commit was SVN r15321.
2007-07-10 12:43:05 +00:00
Brian Barrett
1d02b9e7b5 Fix a bunch of issues exposed by Ken Cain in getting Open MPI to work with
VxWorks.  Still some issues remaining, I'm sure.

Refs trac:1010

This commit was SVN r15320.

The following Trac tickets were found above:
  Ticket 1010 --> https://svn.open-mpi.org/trac/ompi/ticket/1010
2007-07-10 03:46:57 +00:00
Jeff Squyres
892bc38ad0 Protect against a bad free: full_line points to the full buffer. But
line may point to a few characters beyond the beginning of the buffer
(if the buffer had some extra white space padding at the beginning).
So if we want to free the buffer, free full_line, not line.

This commit was SVN r15315.
2007-07-09 19:56:16 +00:00
Brian Barrett
b27b9b5380 * Clean up the ompi_mca macro's support for different configuration
types and add STOP_AT_FIRST_PRIORITY type for framework configuration,
    which allows all components at the highest priority that succeeds to
    succeed
  * Use STOP_AT_FIRST_PRIORITY type for gpr framework, so that the null
    component isn't built when the replica and proxy components are
    available.

This commit was SVN r15286.
2007-07-04 22:00:15 +00:00
Ralph Castain
684aa1bc9f Since universe size now is an orte thing, we may as well give it some direct support. Create rmgr set/get functions so it becomes more obvious where this value is being defined and how to retrieve it. Modify the bproc pls to pass it to the app procs when launched. Modify one of the test programs to verify it has been correctly set.
This commit was SVN r15266.
2007-07-02 16:45:40 +00:00
Jeff Squyres
c796d84d61 * Integrate man pages contributed by Dirk Eddelbuettel
* Make orted.1 man page be non-descriptive because it's really an
   internal command.
 * Re-work the opal_wrapper man page logic a bit so that we can have a
   real opal_wrapper.1 installed that says "don't look here -- look at
   mpicc (etc.)"

This commit was SVN r15264.
2007-07-02 15:27:39 +00:00
Tim Prins
c46ed1d5d4 Make it so the universe size is passed through the ODLS instead of through a gpr trigger during MPI init. This matches what is currently being done with the app number.
The default odls has been updated and works fine. The process odls has been updated, but I could not verify its operation. The bproc ODLS has not been updated yet. Ralph will look at it soon.

This commit was SVN r15257.
2007-07-02 01:33:35 +00:00
Sven Stork
086624a4fe - guess that we should retain the ep instead of releasing it
This commit was SVN r15244.
2007-06-29 11:18:37 +00:00
Brian Barrett
f8fb1e9720 Fix some compile failures on Solaris 9 because it doesn't have V6ONLY.
This commit was SVN r15237.
2007-06-28 18:52:15 +00:00
Ralph Castain
e299f7039f Allow bproc operations on the head node if it was allocated for our use
This commit was SVN r15232.
2007-06-28 14:53:17 +00:00
Josh Hursey
f88aa6c273 This commit cleans up the AMCA parameter implementation a bit.
* Remove the 'opal_mca_base_param_use_amca_sets' global variable
* Harness the fact that you can (read should) call the cmd_line functions
  before initializing opal_init_util(). This pushes the MCA/GMCA/AMCA
  command line options into the environment before OPAL inits and starts
  to use these values. By putting the cmd_line parse before opal_init_util
  in orterun and orted we only parse the *MCA parameter files once, and 
  correctly (alleviating the need to 'recache' the files on init.)
* Small bits of cleanup.

This commit was SVN r15219.
2007-06-27 01:03:31 +00:00
Rainer Keller
15c03e8acc - Apply patch 31_manpages_lintian.dpatch
Thanks to Dirk Eddelbuettel <edd@debian.org>

This commit was SVN r15215.
2007-06-26 21:13:10 +00:00
Sven Stork
0edcf1d47e - export required symbol
This commit was SVN r15190.
2007-06-25 14:27:04 +00:00
Josh Hursey
84f102c343 Fix/Cleanup the Checkpoint Error propagation through the Snapc Full component.
This commit was SVN r15175.
2007-06-22 16:14:25 +00:00
Jeff Squyres
bd56dc7e5d Fixes trac:1060
Per suggestion, if we don't find a valid shell via getpwuid(), also
check the $SHELL environment variable.  Also perform a few minor
cleanups along the way.

This commit was SVN r15156.

The following Trac tickets were found above:
  Ticket 1060 --> https://svn.open-mpi.org/trac/ompi/ticket/1060
2007-06-21 11:40:42 +00:00
Josh Hursey
78df098aee If we can not checkpoint, then make sure we return an error
This commit was SVN r15151.
2007-06-20 21:05:19 +00:00
Josh Hursey
dd021e7121 Remove some leftover debugging that must have been accidentally left in
r15142.

This commit was SVN r15145.

The following SVN revision numbers were found above:
  r15142 --> open-mpi/ompi@a3998a1676
2007-06-20 14:06:13 +00:00
Josh Hursey
db59235af5 Fix an AMCA parameter regression introduced (as a side effect of) in r14449
(and, due to lack of in code documentation, in r14661).

The {{{opal_mca_base_param_use_amca_sets}}} flag tells the orted that it should
not look at the parameter files just yet since it may have an AMCA parameter
file to look at first. So we need to set this to {{{false}}} before initializing
the MCA paras, then quickly turn around and re-init them when we have the full
information.

This commit fixes trac:1058

This commit was SVN r15144.

The following SVN revision numbers were found above:
  r14449 --> open-mpi/ompi@0ba47105ed
  r14661 --> open-mpi/ompi@df86202202

The following Trac tickets were found above:
  Ticket 1058 --> https://svn.open-mpi.org/trac/ompi/ticket/1058
2007-06-20 14:00:40 +00:00
George Bosilca
a3998a1676 Allow the symbols required by TotalView to be exported even when
the visibility feature is on.

This commit was SVN r15142.
2007-06-19 22:35:23 +00:00
Josh Hursey
edb2cbd150 In r15007 the --bootproxy orted argument was removed to support daemon reuse.
The SnapC Full local Coordinator used this argument to attach to the job the
daemon would be launching. So once this option was removed C/R support broke.

This commit has the local coordinator attach to the job just before it is
launched by the ODLS module. This is a much cleaner solution, and will
eventually allow the SnapC modules to attach to multiple jobs launched 
on a single machine.

This commit fixes the C/R regression introduced in r15007.

This commit was SVN r15121.

The following SVN revision numbers were found above:
  r15007 --> open-mpi/ompi@85df3bd92f
2007-06-18 15:39:04 +00:00
Josh Hursey
5719182a4e Fix a break introduced in r14706 when RANK_KEY changed types.
This commit was SVN r15120.

The following SVN revision numbers were found above:
  r14706 --> open-mpi/ompi@d9acc93efa
2007-06-18 14:57:53 +00:00
Shiqing Fan
2a77d46117 Fix a small bug.
This commit was SVN r15119.
2007-06-18 12:50:29 +00:00
George Bosilca
55cf6fc866 Be a little bit more verbose: tell which file we have trouble with...
This commit was SVN r15115.
2007-06-17 04:59:15 +00:00
George Bosilca
99e701062a The Windows job scheduler PLS. Initial commit as I have to move to
another Windows cluster. Right now it's not in a usable state.

This commit was SVN r15113.
2007-06-17 04:54:07 +00:00
Ralph Castain
e653da1d11 Where or where did that patch go??? Ah - there it went! ;-)
Fix singleton operations - allow multiple xcasts to be queued.

This commit was SVN r15097.
2007-06-15 13:45:29 +00:00
George Bosilca
35e824377e There seems to be a subtle race condition when we fail to spawn a
child. Marking the child as failed solve the issue.

This commit was SVN r15087.
2007-06-14 22:36:47 +00:00
George Bosilca
a4d99ddef6 More synchronizations for the Windows version. The problem came from
the multiple threads accessing the OOB/registry asynchronously via the
callbacks. The quickest solution (but definitively not the cleanest) is
to serialize these callbacks in such a way that at any given time
only one thread can execute a callbacks.

This commit was SVN r15086.
2007-06-14 22:35:38 +00:00
George Bosilca
fb9ff5cc75 Don't remove the tcp events from the list, they will remove themselves
in the destructor.

This commit was SVN r15085.
2007-06-14 22:33:09 +00:00
Josh Hursey
6cdfefad87 Fix portals BTL and cnos RML.
Both were failing due to interface changes that were never 
applied to them properly.

This commit was SVN r15082.
2007-06-14 18:49:41 +00:00
Ralph Castain
fde15ac97d Bring the TM launcher online
This commit was SVN r15076.
2007-06-14 12:33:34 +00:00
George Bosilca
95a607b945 A more Windows friendly version. As the socket event will be generated
through the win dll using multiple threads, we have to insure that
the oob callbacks happens only in a synchronous way or really bad
things happens with the current design (blocking messages from a receive
callback).

This commit was SVN r15069.
2007-06-14 04:38:06 +00:00
George Bosilca
de324502bc Update the Windows wait functions. The most important change is for
the event registration, which in the case of a process dead detection
should be marked as fire once and taking long time.

This commit was SVN r15068.
2007-06-14 04:35:46 +00:00
George Bosilca
8dfa06a617 Only output when the user request it.
This commit was SVN r15067.
2007-06-14 04:33:18 +00:00
George Bosilca
13a693faa0 Update the Windows process ODLS.
This commit was SVN r15066.
2007-06-14 04:32:19 +00:00
Pak Lui
de0f1eef89 No major changes here. Just updates to remove unused code and comments.
This commit was SVN r15051.
2007-06-13 17:23:03 +00:00
Pak Lui
03a93a38c5 Added an option for daemonizing orted. The existing behavior to --no-daemonize
for gridengine is not changed.

This commit was SVN r15050.
2007-06-13 17:11:37 +00:00
Ralph Castain
5adef03179 Clean up a diagnostic so it only outputs when requested
This commit was SVN r15048.
2007-06-13 15:53:10 +00:00
George Bosilca
18c2bb0ed6 Don't forget to set the name argument before spawning the daemon.
This commit was SVN r15047.
2007-06-13 15:45:34 +00:00
Pak Lui
8e7daea11f bring inline more changes with r15007.
This commit was SVN r15044.

The following SVN revision numbers were found above:
  r15007 --> open-mpi/ompi@85df3bd92f
2007-06-13 15:30:18 +00:00
Ralph Castain
425fed95ff Bring the SGE component online
This commit was SVN r15043.
2007-06-13 15:02:47 +00:00
Rainer Keller
7e0b400f3f - Small Fix.
This commit was SVN r15037.
2007-06-13 10:43:03 +00:00
George Bosilca
278ec7fd4f I wonder how this one compiled before ... or how do I manage to
miss it ...

This commit was SVN r15032.
2007-06-12 23:24:39 +00:00
George Bosilca
9d342ccb61 Shorter warning message.
This commit was SVN r15031.
2007-06-12 23:22:09 +00:00
George Bosilca
715f6012cf The DSS pack function can use the const attribute for the src field
as it is never modified by the pack functions directly. Enforce it
all over the code base.

This commit was SVN r15026.
2007-06-12 22:47:14 +00:00
George Bosilca
649ab84654 Don't do SIGPIPE handling on Windows.
This commit was SVN r15025.
2007-06-12 22:44:39 +00:00
George Bosilca
9e89abbd57 HAVE_SYS_TYPES_H require an ifdef.
This commit was SVN r15024.
2007-06-12 22:43:18 +00:00
George Bosilca
432185d617 Forget to remove the MCA parameter corresponding to the 2 unused
fields in the RSH PLS component.

This commit was SVN r15023.
2007-06-12 22:41:38 +00:00
George Bosilca
49e7bf3069 Be a little bit more clear when we fail to identify the shell.
This commit was SVN r15022.
2007-06-12 22:40:44 +00:00
George Bosilca
5b7796dfcd Remove 2 unused fields.
This commit was SVN r15021.
2007-06-12 22:39:57 +00:00
George Bosilca
16c38cabe1 Update the Windows ODLS component.
This commit was SVN r15020.
2007-06-12 22:37:04 +00:00
George Bosilca
bf6f30a42c Make the Windows PLS component match the current requirements for
a PLS module.

This commit was SVN r15019.
2007-06-12 22:34:56 +00:00
Ralph Castain
af64009368 Bring the CNOS component of the PLS back online
This commit was SVN r15018.
2007-06-12 22:17:05 +00:00
Jeff Squyres
54064f6fa1 Fix a warning that Tim P. found this morning.
The warning was indicative of overly-complex code anyway.  So I
removed the "first" bool and simply use a sentinel value in seq_min to
indicate that nothing has changed.  Note that this is "correct enough"
for the moment -- more fixes will come in this area with tickets #1049
and/or #1051.

This commit was SVN r15013.
2007-06-12 17:30:54 +00:00
Brian Barrett
84d1512fba Add the potential for doing some basic error checking on mutexes during
single threaded builds.  In its default configuration, all this does
is ensure that there's at least a good chance of threads building
based on non-threaded development (since the variable names will be
checked).  There is also code to make sure that a "mutex" is never
"double locked" when using the conditional macro mutex operations.
This is off by default because there are a number of places in both
ORTE and OMPI where this alarm spews mega bytes of errors on a
simple test.  So we have some work to do on our path towards
thread support.

Also removed the macro versions of the non-conditional thread locks,
as the only places they were used, the author of the code intended
to use the conditional thread locks.  So now you have upper-case
macros for conditional thread locks and lowercase functions for
non-conditional locks.  Simple, right? :).

This commit was SVN r15011.
2007-06-12 16:25:26 +00:00
Ralph Castain
4e8081ed1e Cleanup a now unnecessary variable
This commit was SVN r15010.
2007-06-12 14:23:33 +00:00
Tim Prins
1467558157 Cleanup a couple warnings.
Update svn:ignore

This commit was SVN r15009.
2007-06-12 14:11:06 +00:00
Ralph Castain
85df3bd92f Bring in the generalized xcast communication system along with the correspondingly revised orted launch. I will send a message out to developers explaining the basic changes. In brief:
1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names.

2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (b) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used.

3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying.

Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed.

This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems.

This commit was SVN r15007.
2007-06-12 13:28:54 +00:00
Brian Barrett
27ad954265 Fix a couple of problems with the way we were using orte_process_name_t
structures in the system.  Instead of using memcmp, use the ns function.
This won't cause a problem as long as all three elements of the name are
ints, but if they have different sizes, alignment and padding rules
can cause memcmp() to compare padding space, which rarely holds a sane
value.

This commit was SVN r14998.
2007-06-11 19:12:11 +00:00
Brian Barrett
1d11cc4b2d Fix mis-declared variable type
This commit was SVN r14994.
2007-06-11 16:48:35 +00:00
Shiqing Fan
d9fa58dc33 Add two more arguments to call. The definition of the function has been modified with 2 additional arguments.
This commit was SVN r14990.
2007-06-11 14:27:36 +00:00
Jeff Squyres
b704ff9f4e Fix typo that accidentally always resulted in a "true" result.
This commit was SVN r14984.
2007-06-11 12:52:56 +00:00
Jeff Squyres
1ed906a78b * Forgot to update the iof null component when we revamped the IOF
framework.  Updated pointers to match current definitions.
 * Trimmed some dead wood while I was at it:
   * No need for component close function that does nothing
   * Use BEGIN/END_C_DECLS
   * Use recent MCA param register function
   * Ditch MCA param orte_iof_debug (it wasn't used anywhere)
   * Use MCA param orte_iof_override properly in the code (i.e., look
     up the value once and use the cached value later)

This commit was SVN r14981.
2007-06-11 00:58:21 +00:00