Commit graph

841 commits

Author SHA1 Message Date
Jeff Squyres
82161d20ca Catch a SIGPIPE and allow it to be harmless. Register a no-op SIGPIPE
handler before the write() and de-register it afterwards.  Determine
if the write() succeeded or failed by the return of write().

This commit was SVN r10858.
2006-07-17 21:15:56 +00:00
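A minimal sketch of the technique this commit describes, assuming a POSIX system. The helper name `write_ignoring_sigpipe` is hypothetical and not taken from the Open MPI sources; it only illustrates registering a no-op handler around `write()` and judging success by the return value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* No-op handler: its only job is to stop SIGPIPE from killing the process. */
static void noop_sigpipe(int signo) { (void)signo; }

/* Hypothetical helper: write len bytes to fd with SIGPIPE made harmless.
 * Returns write()'s result; -1 with errno == EPIPE means the reader is gone. */
static ssize_t write_ignoring_sigpipe(int fd, const void *buf, size_t len)
{
    struct sigaction noop, saved;
    ssize_t rc;
    int saved_errno;

    assert(buf != NULL);
    memset(&noop, 0, sizeof(noop));
    noop.sa_handler = noop_sigpipe;
    sigemptyset(&noop.sa_mask);

    sigaction(SIGPIPE, &noop, &saved);   /* register before the write() */
    rc = write(fd, buf, len);
    saved_errno = errno;                 /* keep write()'s errno intact */
    sigaction(SIGPIPE, &saved, NULL);    /* de-register afterwards */
    errno = saved_errno;
    return rc;
}
```

The caller inspects only the return value and `errno`; the previous SIGPIPE disposition is restored no matter how the write ended.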
Jeff Squyres
416e9de22d Fix some minor problems when handling the error cases
This commit was SVN r10854.
2006-07-17 19:21:10 +00:00
George Bosilca
33a7634009 Silence the compiler.
This commit was SVN r10851.
2006-07-17 17:13:28 +00:00
Ralph Castain
404acc9f65 It's okay to call index prior to anything being put in the registry...
This commit was SVN r10848.
2006-07-17 14:31:42 +00:00
Ralph Castain
574a6f7896 Fix a bug that caused the system to crash when asked for an index of the segment names. Such a request required passing a NULL value for the segment name, but the find_seg function didn't protect itself from that value.
Thanks to James Kennedy (UCC-Ireland) for finding it.

This commit was SVN r10847.
2006-07-17 13:51:07 +00:00
Brian Barrett
dfa1221c3b * AC_CONFIG_LINKS has a minor problem in that it always uses ln -s, rather
than $(LN_S).  This causes problems with Windows and probably
  elsewhere (re: #200).  So use a slightly different trick to get the
  right header selected for the MEMCPY and TIMER components.

* Using the same trick used to solve the AC_CONFIG_LINKS problem, 
  stop using a separate header file for direct calling in the
  PML and MTL.  This lets me remove some icky code in ompi_mca.m4
  that was more fragile than I really liked.

This commit was SVN r10841.
2006-07-16 04:23:52 +00:00
Jeff Squyres
ffddfc5629 Turns out that it's a really Bad Idea(tm) to tm_spawn() and then not
keep the resulting tm_event_t that is generated because the back-end
TM library actually caches a bunch of stuff on it for internal
processing, and doesn't let go of it until tm_poll().

tm_event_t's are similar to (but slightly different than)
MPI_Requests: you can't do a million MPI_Isend()'s on a single
MPI_Request -- a) you need an array of MPI_Request's to fill and b)
you need to keep them around until all the requests have completed.

This commit was SVN r10820.
2006-07-14 22:04:41 +00:00
Rainer Keller
50b5791969 - Release best_item
- Reformat

This commit was SVN r10814.
2006-07-14 19:55:14 +00:00
Ralph Castain
c22b0d516e Some edits to the man page for Jeff to review
This commit was SVN r10803.
2006-07-14 14:47:06 +00:00
Ralph Castain
7b3ced80e8 Fix a bug that has been causing inconsistent behavior on a number of platforms. Will explain more on the core-devel list.
Jeff: this needs to be back-patched to our supported prior releases. I'll try to verify how far back we need to go - my initial guess is probably all of them

This commit was SVN r10801.
2006-07-14 14:16:20 +00:00
Jeff Squyres
e6c9c699fe Minor changes:
- change -no_oversubscribe to -nooversubscribe (to be similar to
  -nolocal)
- Added text to orterun.1 describing slots and -nooversubscribe
Still need to add text about "mpirun a.out" functionality, and RHC
wants to make some minor edits, so committing for synchronization.

This commit was SVN r10800.
2006-07-14 14:15:03 +00:00
Ralph Castain
cef1ce19d6 Restore the "sleep" delay during startup.
Since Jeff and I are going to a branch for T-bird, we have restored the trunk to its prior state to avoid any possibility of disturbing it.

This commit was SVN r10774.
2006-07-12 22:18:53 +00:00
Jeff Squyres
ef8433a60b After more discussion on the phone, it seems easier to not muck around
in special components but rather go down to a /tmp branch.  So
removing these components and I'll branch next.

This commit was SVN r10771.
2006-07-12 22:12:29 +00:00
Jeff Squyres
62c189ea1c Fix a few blanket search/replaces
This commit was SVN r10768.
2006-07-12 21:54:05 +00:00
Ralph Castain
badd3f4acb Clean up a few lingering references to "urm".
This commit was SVN r10765.
2006-07-12 21:01:21 +00:00
Jeff Squyres
36ca7497d1 Update m4 and configure files
This commit was SVN r10764.
2006-07-12 20:55:39 +00:00
Ralph Castain
9102b5af3b Remove the "sleep" delay in the oob connection procedure. This shouldn't cause any problems, especially for launches of less than 1000 processes.
Please report any abnormal behavior during launch, though, as we would like to understand what (if any) impact is seen. I couldn't see any on small jobs (the modulo functions render this number down pretty low).

This commit was SVN r10763.
2006-07-12 20:31:30 +00:00
Ralph Castain
a84898316c Create new components to support Thunderbird scalability development
This commit was SVN r10762.
2006-07-12 20:28:23 +00:00
Jeff Squyres
f7a71772a7 Remove long-defunct "openmpi" tool from orte. It was apparently an
early generation of the orted, and is now long-dead.

This commit was SVN r10754.
2006-07-12 03:52:17 +00:00
Brian Barrett
4b70bb92db * Per ticket #112, localhost checks should check against 127.0.0.1/8, rather
than just 127.0.0.1.

This commit was SVN r10750.
2006-07-11 20:54:49 +00:00
Josh Hursey
682a6a123e - os_dirpath.c : reset the is_dir var each time through the loop.
- orte-clean.c : check to see if the base session directory is empty 
                 and delete it if it is.

- orte_universe_exists.c : Fix a downstream problem resulting from
      George's r10718 commit. Don't use the 'fulldirpath' since
      that is no longer guaranteed to be the absolute path
      to the session directory. Construct this value outside of that
      function from the prefix and frontend vars.

This commit was SVN r10741.

The following SVN revision numbers were found above:
  r10718 --> open-mpi/ompi@47eef2e002
2006-07-11 17:31:05 +00:00
Josh Hursey
5a812c8211 Fix orte-ps which George broke in r10718 by extending the orte_session_dir_get_name()
so that it does not return an error when no universe is passed to it.

Also put back in the 'Slots In Use' column as it is now working properly
per Ralph's recent ras commits. Still not sure what 'Slots Alloc' is meant
to represent, so left that as #if 0'd out for the moment.

This commit was SVN r10739.

The following SVN revision numbers were found above:
  r10718 --> open-mpi/ompi@47eef2e002
2006-07-11 16:54:07 +00:00
Ralph Castain
11125dd67a George has a retarded compiler - but that's okay. This will quiet its warning system.
This commit was SVN r10736.
2006-07-11 15:27:02 +00:00
George Bosilca
3daa063772 Make the format and the arguments match.
This commit was SVN r10734.
2006-07-11 15:10:44 +00:00
Josh Hursey
9a31060b6d Fix r10725 so that the trunk builds again.
This commit was SVN r10733.

The following SVN revision numbers were found above:
  r10725 --> open-mpi/ompi@ae222cca5b
2006-07-11 14:48:31 +00:00
Josh Hursey
2e506591c3 more pedantic cleanup. Hopefully this will make happy.
This commit was SVN r10730.
2006-07-11 13:48:28 +00:00
Josh Hursey
6309047e63 pedantic cleanup
This commit was SVN r10728.
2006-07-11 13:43:50 +00:00
Ralph Castain
ae222cca5b Include the help file so it can be accessed
This commit was SVN r10725.
2006-07-11 12:15:25 +00:00
George Bosilca
b3e5c658d2 Add the correct include file.
This commit was SVN r10721.
2006-07-11 05:50:15 +00:00
George Bosilca
47eef2e002 Use Windows specific functions to parse the list of files in a
directory.

This commit was SVN r10718.
2006-07-11 05:28:31 +00:00
George Bosilca
a8a2a60cc5 Nothing relevant, only indentation.
This commit was SVN r10717.
2006-07-11 05:27:17 +00:00
George Bosilca
523b6dcbe8 Protect the header files. Remove the directory using the OPAL
function.

This commit was SVN r10716.
2006-07-11 05:25:41 +00:00
George Bosilca
94f6cb3765 There is no SIGUSR1 or SIGUSR2 on Windows.
This commit was SVN r10715.
2006-07-11 05:24:08 +00:00
Ralph Castain
6129a5a887 Enable -host support for "mpirun a.out". You can now execute on all slots on specified nodes within your overall allocation.
This commit was SVN r10713.
2006-07-11 02:59:23 +00:00
George Bosilca
a9df5035f9 Remove unused variable.
This commit was SVN r10712.
2006-07-11 00:30:51 +00:00
Ralph Castain
febc143d8c Per LANL's stated need, add functionality that runs a.out across ALL available process slots if no num_proc is specified on the command line. However, please note the following limitation: we ONLY allow ONE application to be specified on the command line when this feature is invoked. If multiple apps are specified, the user MUST also specify the number to be launched for each and every one of them.
Update the help text to report errors when not following that rule.

Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base.

The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so.

This commit was SVN r10708.
2006-07-10 21:25:33 +00:00
Ralph Castain
3d220cbd48 This patch fixes several issues relating to comm_spawn and N1GE. In particular, it does the following:
1. Modifies the RAS framework so it correctly stores and retrieves the actual slots in use, not just those that were allocated. Although the RAS node structure had storage for the number of slots in use, it turned out that the base function for storing and retrieving that information ignored what was in the field and simply set it equal to the number of slots allocated. This has now been fixed.

2. Modified the RMAPS framework so it updates the registry with the actual number of slots used by the mapping. Note that daemons are still NOT counted in this process as daemons are NOT mapped at this time. This will be fixed in 2.0, but will not be addressed in 1.x.

3. Added a new MCA parameter "rmaps_base_no_oversubscribe" that tells the system not to oversubscribe nodes even if the underlying environment permits it. The default is to oversubscribe if needed and the underlying environment permits it. I'm sure someone may argue "why would a user do that?", but it turns out that (looking ahead to dynamic resource reservations) sometimes users won't know how many nodes or slots they've been given in advance - this just allows them to say "hey, I'd rather not run if I didn't get enough".

4. Reorganizes the RMAPS framework to more easily support multiple components. A lot of the logic in the round_robin mapper was very valuable to any component - this has been moved to the base so others can take advantage of it.

5. Added a new test program "hello_nodename" - just does "hello_world" but also prints out the name of the node it is on.

6. Made the orte_ras_node_t object a full ORTE data type so it can more easily be copied, packed, etc. This proved helpful for the RMAPS code reorganization and might be of use elsewhere too.

This commit was SVN r10697.
2006-07-10 14:10:21 +00:00
Josh Hursey
c38c47a4f5 Fix some unreachable statements. Caught by a nightly build.
This commit was SVN r10696.
2006-07-10 13:32:31 +00:00
Brian Barrett
41e144c879 Fix for ticket #92, bproc stdin being borked. The problem was that we were
using a pty for everything, which drops all buffered data on the floor when
close() is called on the daemon side, meaning EOF has some issues.  Instead,
do the same thing we do for other starters that use the fork() pls -- use
a pipe/fifo for stdin and stderr and a pty for stdout.  This is good enough
for what we need and avoids most of the issues with ptys.

This commit was SVN r10692.
2006-07-08 21:18:24 +00:00
Ralph Castain
bc7690bcb0 Fix the bproc allocator. This is just a bandaid for 1.x that will be fixed more thoroughly in 2.0.
Basically, the problem was that the allocator was grabbing everything on the cluster for which the user had access privilege. Thus, if a user had two sessions operable, each with its own allocation, mpirun in each session would grab both sets of nodes and use them. Not very polite.

This commit was SVN r10683.
2006-07-06 18:31:14 +00:00
Jeff Squyres
3d5d0959fa Remove unused variable, and therefore silence a compiler warning.
This commit was SVN r10673.
2006-07-06 10:44:04 +00:00
Josh Hursey
b1da6f8bc4 A bit more cleanup for that last patch.
* num_children should really be an int instead of size_t
  since 'size_t' is not signed and num_children can (in rare cases)
  drop below 0, and don't want it to roll around to MAX_INT or some
  such.

 * I figured out that this problem only happened to me because I use
  the pls_fork_reap_timeout MCA parameter and thus the only time that
  the code in pls_fork_module.c to waitpid is executed is if this is
  not set to 0 (I had it set to 1 to give my procs time to exit). I
  adjusted the loop from while{...} to do{...}while; so that it is
  executed at least once for consistency.

 * de-register the SIGCHLD callback for the pid before we attempt
  to kill it, so that we don't leave the door open for both the
  waitpids (the one in the callback, and the one in this function)
  to race to see who can wait on the child.

 * Move the 'thread release' to outside the for loop for a bit of an
  optimization, and always set the value to 0 since we want to 
  finish after this function.

 * Added a help message for the case when we can't send a kill()
   signal to the process. Should never happen, but all is possible
   in the wild wild west of HPC.

This commit was SVN r10666.
2006-07-05 21:38:23 +00:00
Josh Hursey
696bb4a0c0 A partial fix for the hanging orted bugs (Ticket #177)
When we force an application to terminate (via CTRL-C to mpirun)
we send an out-of-band message to the orted to reap its children.
The fork PLS was doing an internal waitpid but never releasing or
updating the information and signaling the condition variable. So
the fork PLS callback for SIGCHLD registered with the event library
and this waitpid are in a bit of a race to 'waitpid' for the children.
Since the PLS callback was the only one that handled the signal properly
when it 'won' then things were great -- as in the normal termination case.
But when it 'lost' -- as in the abnormal termination case -- the orted
never received the proper signal that its children had gone away.

We want to preserve the internal fork PLS callback since it allows
for a timeout while waiting for the child, which the event library
won't do.

This allows both to exist, and behave properly.

This was introduced in r9068.

The ticket is still open since the orteds still hang in other
situations. This is a fix for one of the causes.

This commit was SVN r10662.

The following SVN revision numbers were found above:
  r9068 --> open-mpi/ompi@c2c2daa966
2006-07-05 19:37:29 +00:00
Jeff Squyres
538965aeb0 Final merge of stuff from /tmp/tm-stuff tree (merged through
/tmp/tm-merge).  Validated by RHC.  Summary:

- Add --nolocal (and -nolocal) options to orterun
- Make some scalability improvements to the tm pls

This commit was SVN r10651.
2006-07-04 20:12:35 +00:00
Josh Hursey
5c5ce7e051 When 'mca_oob_send_callback' accesses the callback 'orte_pls_rsh_terminate_job_cb'
with an error status (< 0) then the req buffer is NULL. Put checks around the
OBJ_RELEASE(req) calls so that we don't try to release NULL :/

This commit was SVN r10641.
2006-07-03 22:44:54 +00:00
Josh Hursey
d082a63734 Add some new OPAL functionality.
After seeing the ugliness that is removing directories in the
codebase I decided to push down this to the OPAL by extending the
opal/os_create_dirpath.(c|h) to contain some more functionality.

In this process I renamed 'os_create_dirpath' to 'os_dirpath' since it
is a bit more general now.

Added a few functions to:
 - check if a directory is empty
 - check to see if the access permissions are set correctly
 - destroy the directory at the end of the dirpath
   - By using a caller callback function (a la Perl, I believe)
     for every file, the caller can have fine grained control over
     whether a specific file is deleted or not.

This simplifies things a bit for orte_session_dir_(finalize|cleanup)
as it should no longer contain any of this functionality, but uses
these functions to do the work.

From the external perspective nothing has changed, from the 
developer point of view we have some cleaner, more generic code.

This commit was SVN r10640.
2006-07-03 22:23:07 +00:00
Josh Hursey
38df31e488 A bit of cleanup in the pretty_printing, making it a bit more sane.
Since we don't properly handle connecting/disconnecting from multiple
universes, only connect to the first one (or the user specified one).
This is a bug that needs to be fixed, but involves some deep magic in
ORTE.

Print the node segment upon request (-n option). 
{{{
Node Name | Arch | Cell ID |   State | Slots | Slots Max | 
-----------------------------------------------------------
  odin001 |      |       0 | Unknown |     2 |         4 | 
  odin002 |      |       0 | Unknown |     2 |         5 | 
  odin003 |      |       0 | Unknown |     2 |         6 | 
  odin004 |      |       0 | Unknown |     2 |         7 | 
}}}

Since node_slots_alloc and node_slots_inuse are not properly updated
in the GPR don't print those values.

This commit was SVN r10633.
2006-07-03 17:11:02 +00:00
Josh Hursey
fc72eb4a01 remove a residual warning
This commit was SVN r10628.
2006-07-03 15:16:15 +00:00
Brian Barrett
0bd5acc51f * Fix for bus error in XGrid starter
This commit was SVN r10615.
2006-07-01 16:16:46 +00:00
Josh Hursey
2edf1511fd Closes ticket #173 : Split name linking up for orte/ompi shared tools.
This moves the logic to create the symbolic links for:
 - mpirun
 - mpiexec
 - ompi-ps
 - ompi-clean
and their respective man pages to the ompi level from
the orte layer.

This is a bit pedantic, but orte shouldn't be doing the
work of ompi since that is a bit of an abstraction break.

Note: need to autogen.sh to get this. Sorry :(

This commit was SVN r10602.
2006-06-30 22:01:56 +00:00
Josh Hursey
c356f4e948 forgot to init a var. Thanks Jeff for catching this
This commit was SVN r10583.
2006-06-30 14:22:58 +00:00
Ralph Castain
54018b114b Do a proper restart - fixes the inconsistent communications that were observed via the console.
This commit was SVN r10570.
2006-06-29 19:05:41 +00:00
Ralph Castain
a90f8feb35 Need to initialize the buffer in the contact_info command.
This commit was SVN r10563.
2006-06-29 14:57:10 +00:00
Josh Hursey
793bbc667a bringing over orte-clean from tmp/jjhursey-ft-cr branch
per a request.

Currently it is not working well. That will soon change
as it just needs a bit of attention and testing to
make it lots-mo-betta.

This commit was SVN r10556.
2006-06-28 22:33:54 +00:00
Josh Hursey
9c0a279522 Moved the 'orte-ps' command from the tmp/jjhursey-ft-cr branch
per a request for its functionality into the main trunk.

This command provides basic information about a running job. It
needs a bit of attention, but works fine in its current iteration.

Please play with it, and let's try to work out all the left over bugs.

Pending action for this tool:
It has been requested that the tool be changed slightly to allow
it to be called via a function call from internal libraries
(e.g. orteconsole).

This commit was SVN r10554.
2006-06-28 22:06:13 +00:00
Josh Hursey
0a931f9fad Bringing over the session directory and universe changes
from the tmp/jjhursey-ft-cr branch.

In this commit we change the way universe names are created.
Before we by default first created "default-universe" then
if there was a conflict we created "default-universe-PID"
where PID is the PID of the HNP.
Now we create "default-universe-PID" all the time (when
a default universe name is used). This makes it much 
easier when trying to find a HNP from an outside app 
(e.g. orte-ps, orteconsole, ...)

This also adds a "search" function to find all of the 
universes on the machine. This is useful in many contexts
when trying to find a persistent daemon or when trying to 
connect to a HNP.

This commit also makes orte_universe_t an opal_object_t, 
which is something that needed to happen, and only affected
the SDS in one of its base functions.


I was asked to bring this over to aid in fixing orteconsole
and orteprobe. Due to the change of orte_universe_t to 
an object orteprobe may need to be updated to reflect this 
change. Since orteprobe needs to be looked at anyway I'll
leave this to Ralph to take care of.

*Note*:
These changes do not depend upon any of the FT work (but
the FT work does depend upon them). These were brought over
to help in fixing some of the ORTE tool set that require
the functionality laid out in this patch.

Testing:
Ran the 'ibm' tests before and after this change, and all was
as well as before the change. If anyone notices additional
irregularities in the system let me know. But none are expected.

This commit was SVN r10550.
2006-06-28 21:03:31 +00:00
Brian Barrett
2cf73912e2 * fix for signal forwarding additions in bproc_orted code
This commit was SVN r10529.
2006-06-27 19:59:07 +00:00
Brian Barrett
b6663c64c7 * fix for bug #161 - add man page info for recently added features
This commit was SVN r10514.
2006-06-26 22:16:39 +00:00
Brian Barrett
86861bc1c3 * add --quiet option, and suppress a couple of the status messages in
orterun if it is actually enabled.  For ticket #129.

This commit was SVN r10497.
2006-06-26 18:21:45 +00:00
Brian Barrett
4e8abb943b * fix up signal handling code so that one function handles SIGUSR1 and
SIGUSR2.  This can be extended later if needed to include other
  signals we should forward to the user processes (TSTP and CONT,
  perhaps?)
* Since the signal handlers don't actually run in signal context, we
  can use malloc/fprintf/etc.  So clean up some of the signal handler
  code so that we don't keep message buffers around for the life of
  the process

This commit was SVN r10496.
2006-06-26 15:12:52 +00:00
Brian Barrett
9766c01e50 * Per discussion at quarterly meeting and bug #91, print out the bug
contact point when printing version and help strings

This commit was SVN r10484.
2006-06-22 19:48:27 +00:00
Sushant Sharma
76926756d0 variable ntid not being assigned any value was resulting in errors
This commit was SVN r10480.
2006-06-22 18:00:54 +00:00
Josh Hursey
58110f9fc9 Fixes Ticket #125 for both the trunk and v1.1 branch.
This commit will apply cleanly to the v1.1 branch, and should
be moved over once I get someone to verify it.

The problem is outlined in the bug. The fix was to move the
setting of the app context index (idx) before we put it in the
GPR so that it is propagated to the gpr.

The reason this hasn't bitten us before is because we init
app->idx to 0, which is correct most of the time. The exception is
MPI_Comm_spawn_multiple, in which we put in more than
one app context and thus care about correct indexing.

This was causing down the line memory corruption by overrunning
the mapping array. This commit also puts in a check to make 
sure that we error out if we ever try to do that again.

This commit was SVN r10380.
2006-06-15 22:14:07 +00:00
Sushant Sharma
ca01291aea Updated soh-xcpu component. Not going to be used for time being.
This commit was SVN r10343.
2006-06-13 23:25:46 +00:00
Sushant Sharma
b5a16b6515 Updated xcpu launcher. open-mpi no longer needs xcpu library. Launcher code is now moved within xcpu.
This commit was SVN r10342.
2006-06-13 23:21:56 +00:00
Brian Barrett
5c89dc6946 Fix for ticket #91
mpirun/orterun now has an option to print the version number.  If -V/--version
is given, it will print the version number.  If it's the only option, we
exit cleanly.  Otherwise, we continue on as if --version wasn't given
(except we've printed the version number).
--This line, and those below, will be ignored--

M    orte/tools/orterun/orterun.c
M    orte/tools/orterun/help-orterun.txt

This commit was SVN r10276.
2006-06-09 17:21:23 +00:00
Brian Barrett
17a8ccef89 * update XGrid API to match recent signal changes
This commit was SVN r10262.
2006-06-08 21:15:35 +00:00
Ralph Castain
ee5a626d25 Add ability to trap and propagate SIGUSR1/2 to remote processes. There are a number of small changes that hit a bunch of files:
1. Changed the RMGR and PLS APIs to add "signal_job" and "signal_proc" entry points. Only the "signal_job" entries are implemented - none of the components have implementations for "signal_proc" at this time. Thus, you can signal all of the procs in a job, but cannot currently signal only one specific proc.

2. Implemented those new API functions in all components except xgrid (Brian will do so very soon). Only the rsh/ssh and fork modules have been tested, however, and only under OS-X.

3. Added signal traps and callback functions for SIGUSR1/2 to orterun/mpirun that catch those signals and call the appropriate commands to propagate them out to all processes in the job.

4. Added a new test directory under the orte branch to (eventually) hold unit and system level tests for just the run-time. Since our test branch of the repository is under restricted access, people working on the RTE were continually developing their own system-level tests - thus making it hard to help diagnose problems. I have moved the more commonly-used functions here, and added one specifically for testing the SIGUSR1/2 functionality.

I will be contacting people directly to seek help with testing the changes on more environments. Other than compile issues, you should see absolutely no change in behavior on any of your systems - this additional functionality is transparent to anyone who does not issue a SIGUSR1/2 to mpirun.

Ralph

This commit was SVN r10258.
2006-06-08 18:27:17 +00:00
Jeff Squyres
4882dc0e2c Addendum to r9930: missed a chunk of the rsh pls to use the basename
of $libdir and $bindir (i.e., was correctly doing local launches, but
was still using $prefix/lib and $prefix/bin for remote launches).

[Re-]Fixes OFED bug 59.

This commit was SVN r10207.

The following SVN revision numbers were found above:
  r9930 --> open-mpi/ompi@1d6902296c
2006-06-05 21:12:36 +00:00
Brian Barrett
22cd78abb5 * add header required when debugging is not enabled
This commit was SVN r10155.
2006-06-01 01:26:52 +00:00
Josh Hursey
bb95df9bf2 Added some user friendly output to the hostfile RDS component.
This is more of a usability feature, but a very useful one. So I 
suggest that it go into the release branches.

This commit was SVN r10153.
2006-05-31 20:07:59 +00:00
Josh Hursey
2f20a38c98 This is a fix for bug Ticket #27
We were stuck in an infinite loop inside the rmaps round_robin
component when the user specified a host, then over subscribed it.
Instead of returning an error, we looped forever.

For example:
 $ cat hostfile
  A slots=2 max-slots=2
  B slots=2 max-slots=2
 $ mpirun -np 3 --hostfile hostfile --host B
  <hang>

The loop would not terminate because both host A and B are in the 
'nodes' structure as they are both allocated to the job. However,
after allocating 2 slots to host B, we remove it from the node list
leaving us with a 'nodes' structure with just A in it. Since we can't
use host A, we keep looping here until we find a node that we can use.

This patch checks to make sure that if we get into this situation where
rmaps is looping over the list a second time without finding a node
during the first pass then we know that there are no nodes left to
use, so we have a resource allocation error, and should return to the user.

This patch should be moved to all of the release branches

This commit was SVN r10131.
2006-05-31 03:42:01 +00:00
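The guard described in this commit can be sketched as below. This is an illustration under stated assumptions, not the rmaps round_robin code: `struct node`, its fields, and `map_round_robin` are hypothetical, and the sketch places one process per usable node per pass. The key idea is the same, though: a complete pass over the node list that places nothing means no resources are left, so return an error instead of looping again.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative node record; the field names are assumptions, not ORTE's. */
struct node {
    int max_slots;   /* hard limit on processes for this node */
    int in_use;      /* processes mapped so far */
    int allowed;     /* nonzero if the user's --host filter permits it */
};

/* Map num_procs processes, one per usable node per pass.
 * Returns 0 on success, or -1 as soon as a complete pass over the
 * list places nothing (a resource allocation error). */
static int map_round_robin(struct node *nodes, size_t num_nodes, int num_procs)
{
    size_t i = 0;
    int placed_this_pass = 0;
    int mapped = 0;

    assert(nodes != NULL || num_nodes == 0);
    if (num_nodes == 0) {
        return (num_procs > 0) ? -1 : 0;
    }
    while (mapped < num_procs) {
        struct node *n = &nodes[i];
        if (n->allowed && n->in_use < n->max_slots) {
            n->in_use++;
            mapped++;
            placed_this_pass = 1;
        }
        i++;
        if (i == num_nodes) {            /* end of one pass over the list */
            if (!placed_this_pass) {
                return -1;               /* nothing usable: stop looping */
            }
            i = 0;
            placed_this_pass = 0;
        }
    }
    return 0;
}
```

With the hostfile from the example (A and B, two slots each, only B permitted by --host), asking for three processes fills B and then fails on the next empty pass rather than hanging.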
Brian Barrett
7000cecf78 Fix for standard output / standard error truncation issue when in a shell
pipeline.  See lengthy comment in iof_base_endpoint.c for the details, but
the short version is that we shouldn't set O_NONBLOCK on standard I/O 
file descriptors, so we no longer do.

Closes ticket:9

This commit was SVN r9966.
2006-05-18 15:43:32 +00:00
Jeff Squyres
1d6902296c Additions to the tm, slurm, and rsh pls modules to handle the --prefix
option as discussed on the devel-core mailing list.  The Big
Difference is that instead of hard-coding the strings "/lib" and
"/bin" in to append to the prefix, we append the basename of the local
libdir and bindir.  Hence, if your libdir is $prefix/lib64, we'll
append /lib64 to construct the remote node's LD_LIBRARY_PATH (etc.).

Also appended the orterun.1 man page to include a description of
--prefix, how it is constructed, what it handles / what it does not,
etc.

This commit was SVN r9930.
2006-05-16 14:14:12 +00:00
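A rough sketch of the path construction described above, assuming plain string handling. `remote_dir_from_prefix` is a hypothetical helper, not a function from the pls modules; it appends the basename of the local directory to the prefix instead of a hard-coded "/lib" or "/bin".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<prefix>/<basename of local_dir>" in out.
 * Returns 0 on success, -1 if the result does not fit in out_len bytes.
 * Assumes local_dir has no trailing slash. */
static int remote_dir_from_prefix(const char *prefix, const char *local_dir,
                                  char *out, size_t out_len)
{
    const char *slash;
    const char *base;
    int n;

    assert(prefix != NULL && local_dir != NULL && out != NULL);
    slash = strrchr(local_dir, '/');
    base = (slash != NULL) ? slash + 1 : local_dir;
    n = snprintf(out, out_len, "%s/%s", prefix, base);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

For a local libdir of /usr/local/lib64 and a prefix of /opt/ompi, the helper yields /opt/ompi/lib64, which is the value the commit says gets added to the remote LD_LIBRARY_PATH.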
Gleb Natapov
80dfe7e39b remove newline from environment
This commit was SVN r9892.
2006-05-11 13:15:48 +00:00
Brian Barrett
1c0c84cf67 If the urm gets a request to kill itself *and* it's a singleton, just
exit out, rather than trying to have the pls exit.  Since singletons
weren't started with a pls, there's no way the pls is going to be
able to kill the process.  So just exit and save the error message.

This commit was SVN r9859.
2006-05-09 13:40:41 +00:00
Brian Barrett
b76b46bcec * fix some compile issues on Red Storm
This commit was SVN r9812.
2006-05-04 14:08:36 +00:00
Brian Barrett
9276127c0d * add some extra sauce to make sure we close down our processes properly
This commit was SVN r9807.
2006-05-04 00:38:49 +00:00
Brian Barrett
52369307f8 Add a feature to the build system that Terry from Sun and I talked about
in San Jose.  Allow the configure option --disable-binaries to build OMPI,
but not build or install the support binaries (so basically, just build
the libraries).

This commit was SVN r9777.
2006-04-29 02:16:41 +00:00
Brian Barrett
5fed99c2c2 Sending SIZE_MAX from machines with different sizeof(size_t) causes big problems,
as the smaller machine's SIZE_MAX won't be SIZE_MAX on the bigger machine, which
can lead to failures along the way -- in this case, with GPR triggers being
improperly fired.

This commit was SVN r9776.
2006-04-28 21:09:42 +00:00
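The mismatch can be reproduced with fixed-width types, as in this sketch. The function names are hypothetical, and `uint32_t`/`uint64_t` stand in for the `size_t` of a 32-bit sender and a 64-bit receiver.

```c
#include <stdint.h>

/* Value a peer with a 4-byte size_t sends when it means "SIZE_MAX". */
static uint64_t widen_32bit_size_max(void)
{
    uint32_t sender_size_max = UINT32_MAX;   /* SIZE_MAX where sizeof(size_t) == 4 */
    return (uint64_t)sender_size_max;        /* zero-extended on the receiver */
}

/* Receiver with an 8-byte size_t comparing against its own SIZE_MAX. */
static int looks_like_size_max_on_64bit(uint64_t received)
{
    return received == UINT64_MAX;
}
```

The widened value is 0x00000000FFFFFFFF, so the receiver's sentinel comparison fails. A sentinel that means the same thing on every platform has to be a fixed-width constant agreed on by both sides.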
George Bosilca
1ea3a39372 The condition was wrong. The fact that it accepts 0 length messages
is interpreted as a shutdown of the io channel on the next iteration.
Definitely not the right approach. The correct condition is
bigger than 0.

This commit was SVN r9770.
2006-04-28 04:57:07 +00:00
Jeff Squyres
bfcf3867fc Back out George's commit from earlier today; it seems to break stdout
forwarding. 

More detailed mail coming to devel-core shortly that explains.

This commit was SVN r9769.
2006-04-28 03:32:27 +00:00
Sushant Sharma
7a6e0c9ebf Fixed remote environment setup. Submitted by: Tim Woodall
This commit was SVN r9759.
2006-04-27 20:07:56 +00:00
George Bosilca
bafc16f724 We don't need the len anymore as everything is not attached to the fragment.
This commit was SVN r9758.
2006-04-27 17:35:05 +00:00
George Bosilca
5df94f812e Aren't we supposed to release the value on all possible execution paths ?
This commit was SVN r9757.
2006-04-27 17:31:01 +00:00
Tim Woodall
0a56067509 Correction to resolve a problem related to partial reads. We were making a
copy of the receive buffer based on the iovec struct that may have been updated 
during partial reads to reflect the current offset. Need to make the copy using 
the base address of the buffer.

Thanks to Sven Stork for finding this.

This should be backported to 1.0.X and 1.1.X branches.

This commit was SVN r9749.
2006-04-27 14:27:02 +00:00
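The bug pattern can be sketched as follows. `struct recv_state`, `advance`, and `copy_message` are illustrative names rather than the actual OOB data structures; the sketch assumes the receive code keeps the buffer's base address alongside an iovec cursor that moves forward on each partial read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Illustrative receive state; names are assumptions, not the OOB's. */
struct recv_state {
    char *base;          /* start of the receive buffer, never modified */
    size_t total;        /* full message length */
    struct iovec iov;    /* cursor: advanced as partial reads complete */
};

/* Account for nread bytes having arrived at the cursor. */
static void advance(struct recv_state *rs, size_t nread)
{
    assert(nread <= rs->iov.iov_len);
    rs->iov.iov_base = (char *)rs->iov.iov_base + nread;
    rs->iov.iov_len -= nread;
}

/* Copy the complete message. Uses base/total, NOT iov_base/iov_len,
 * which point past the data once reads have completed.
 * The caller frees the result. */
static char *copy_message(const struct recv_state *rs)
{
    char *copy = malloc(rs->total);
    if (copy != NULL) {
        memcpy(copy, rs->base, rs->total);
    }
    return copy;
}
```

After two partial reads, `iov.iov_len` is 0 and `iov.iov_base` points at the end of the buffer, so copying from the iovec would yield nothing useful, while copying from `base` returns the whole message.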
Tim Woodall
7a139d6cc8 - corrections to I/O forwarding - handling of incomplete writes
THESE CHANGES SHOULD BE PROPAGATED TO BOTH 1.0 and 1.1 BRANCHES

This commit was SVN r9734.
2006-04-26 15:36:06 +00:00
Tim Woodall
3e57a4ec48 remove debug code - not required
This commit was SVN r9715.
2006-04-25 19:05:57 +00:00
Brian Barrett
ce72140633 Remove dependency libraries from these Makefile.ams - the libraries will
automagically bring in the libraries through the top-level library (so
liborte automatically brings in libopal, etc.).  Otherwise, we get some
warnings on Solaris

This should go to the v1.1 branch

This commit was SVN r9666.
2006-04-20 17:53:43 +00:00
Brian Barrett
62afa63ded Initialize length to 0 instead of -1 (size_t might be unsigned and therefore
-1 is an issue).

This should go to the v1.1 branch...

This commit was SVN r9665.
2006-04-20 15:42:36 +00:00
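The reason -1 is a poor initial value for a length can be shown in a few lines. A minimal sketch with hypothetical function names: in standard C `size_t` is an unsigned type, so storing -1 in it yields the largest representable value rather than a negative sentinel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What -1 becomes once stored in a size_t: conversion to an unsigned
 * type wraps modulo 2^N, so the result is SIZE_MAX, not "minus one". */
size_t minus_one_as_size(void)
{
    return (size_t)-1;
}

/* With 0 as the initial value, "nothing stored yet" is a simple test. */
int length_is_unset(size_t len)
{
    return len == 0;
}
```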
Brian Barrett
e737b0a106 Fix a bunch of warnings the Sun compilers find:
  - The constant 1 is a signed int by default.  Explicitly say that
    it is an unsigned value so we can't overflow
  - Fix unreachable statement warnings in dss_arith by breaking out
    of switch statements instead of returning - this should have
    no impact on performance, since it's a non-conditional jump
  - A couple of the GPR files had carriage returns and were in
    DOS mode - put them in unix mode...

These should all probably go to the v1.1 branch...

This commit was SVN r9664.
2006-04-20 15:35:58 +00:00
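The first warning in r9664 concerns signed integer constants. A minimal sketch, assuming a 32-bit mask is being built (the function name `bit_mask` is hypothetical): marking the constant as unsigned keeps a shift into the top bit well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit mask with only bit `n` set (0 <= n <= 31).
 * UINT32_C(1) makes the constant unsigned; with a plain `1`,
 * shifting into bit 31 would overflow a signed int. */
uint32_t bit_mask(unsigned n)
{
    return UINT32_C(1) << n;
}
```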
Ralph Castain
95c4795157 Try a different tack...
This commit was SVN r9658.
2006-04-19 15:33:34 +00:00
Ralph Castain
93115fdaea Try again with passing the right enviro variables.
This commit was SVN r9629.
2006-04-13 18:07:22 +00:00
Ralph Castain
480af1c150 Add the missing enviro variables
This commit was SVN r9627.
2006-04-13 16:41:47 +00:00
Ralph Castain
d7456c3d89 Fix the broken Doxyfile so people can generate what little code base documentation we have :-)
Copy the Doxyfile to orte so people can generate just that documentation. Adjust the properties on that directory so it ignores the resulting doxygen output tree.

This should go over to the release branch(es)

This commit was SVN r9625.
2006-04-13 12:52:17 +00:00
Sushant Sharma
642e33fb3e xcpu launcher updated to setup the environment on remote nodes before launching jobs.
This commit was SVN r9622.
2006-04-12 22:42:41 +00:00
Ralph Castain
424900068f Update the xcpu launcher to setup the environment
This commit was SVN r9620.
2006-04-12 15:41:54 +00:00
Sushant Sharma
9fe5870862 xcpu pls component fixed so that it will compile correctly.
This commit was SVN r9617.
2006-04-11 20:27:13 +00:00
Ralph Castain
9adc16130e Proposed revision of the xcpu launcher to correctly incorporate the OpenRTE and Open MPI environment
This commit was SVN r9612.
2006-04-11 14:33:17 +00:00
Brian Barrett
f37a77dd08 * Fix potential deadlock when mpi threads are enabled and progress threads are
not.  See lengthy comment in the body of commit.

This commit was SVN r9573.
2006-04-07 18:13:35 +00:00
Sushant Sharma
26d51d5041 Cleaned up lots of dead code in xcpu soh component (soh_xcpu.c). Checked the fix submitted by Ralph Castain for completing processes in soh_xcpu. It's working fine now.
This commit was SVN r9554.
2006-04-06 16:26:25 +00:00
George Bosilca
ca75ff2569 In the case we have support for threads, then the opal library has its own
thread, which will do progress independently of MPI. So in this case we 
have to call opal_event_loop instead of opal_progress.

This commit was SVN r9551.
2006-04-06 14:31:38 +00:00
Brian Barrett
7408de0bfb When progress threads are enabled, opal_progress() doesn't call the
event library (since the event library has its own thread).  So when
we are using progress threads, we really want to call opal_event_loop()
and not opal_progress().

This commit was SVN r9549.
2006-04-06 12:58:09 +00:00
Ralph Castain
895c2ade8b Proposed fix for completing processes
This commit was SVN r9543.
2006-04-06 08:18:42 +00:00
Ralph Castain
c79c1714de Okaaayyy....let's see if this restores the "prefix" command line option. No idea what the problem was with the other option, but it isn't critical right now, so I'll figure it out later.
This commit was SVN r9542.
2006-04-06 07:53:38 +00:00
Ralph Castain
0ba8851a47 Fix the univ_exist option
This commit was SVN r9535.
2006-04-05 17:18:06 +00:00
Ralph Castain
b9bdb2125e Fix and upgrade the console to support better debugging. Activate "dump" commands to display registry content. Remove the blasted opal_output default prefix that made the dump output illegible. Properly connect to existing daemons and/or start new ones.
This commit was SVN r9528.
2006-04-04 11:05:52 +00:00
Sushant Sharma
8d5289b2b8 Corrected Makefile.am files for pls and soh xcpu-components as per Brian's suggestion.
This commit was SVN r9519.
2006-04-03 17:14:47 +00:00
Brian Barrett
4ea8790342 * Don't try to call tcgetpgrp on platforms that don't have that function
* Some more stuff to ignore / do in Red Storm build

This commit was SVN r9511.
2006-04-01 05:46:15 +00:00
Brian Barrett
2c64ab562e More fixes to try to get Red Storm port going again....
* Add a platform spec for using the portals reference implementation's
  RTE instead of our own to make local testing easier.
* Add a cnos rmgr component so that 1) we don't have to build nearly
  as many components (no need for ras,rds,pls,etc.) and 2) calls
  to MPI_ABORT() won't print error messages about not being able to
  contact the daemon.  Still need to fill in some of the terminate
  stuff with calls from cnos, but will come in time.
* Make gpr_null use the base code for creating value and keyval
  structures so that we don't segfault in ompi_mpi_init().

This commit was SVN r9510.
2006-04-01 04:54:46 +00:00
Jeff Squyres
858612fd06 Face the possibility that the child may have already died.
This commit was SVN r9508.
2006-04-01 02:23:10 +00:00
Sushant Sharma
46f84b1e8e Added xcpu component in pls and soh.
This commit was SVN r9491.
2006-03-31 02:19:52 +00:00
Ralph Castain
8ba453b866 Modify the rmgr_proxy component so it includes the automatic wire-up of stdio.
This commit was SVN r9483.
2006-03-30 19:44:28 +00:00
George Bosilca
2b3779cd6e Correct some of the casting issues. By default the compilers attach a signed type
to the defines. As our internal types (job_id and co.) are unsigned, that generates
several errors (integer overflow in expression and comparison between signed and
unsigned). Casting the defines to the correct type solves these problems.

This commit was SVN r9481.
2006-03-30 19:28:17 +00:00
Tim Woodall
637511e759 correct cleanup of callbacks
This commit was SVN r9479.
2006-03-30 16:55:02 +00:00
Brian Barrett
e0eb9a19e7 * make orte_process_name_t be of fixed size (rather than depending on the size of
size_t).  This should be the last piece of the puzzle required to get 32/64
  interoperability working for ORTE.

This commit was SVN r9476.
2006-03-30 14:59:41 +00:00
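The idea behind r9476 can be illustrated with fixed-width integer types. This is a sketch only: the struct name, field names, and field widths below are assumptions for illustration and are not taken from the ORTE source. The point is that a struct built from `uint32_t` has the same size on 32-bit and 64-bit builds, whereas one built from `size_t` does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: names and widths are illustrative only. */
typedef struct {
    uint32_t cellid;
    uint32_t jobid;
    uint32_t vpid;
} example_process_name_t;

/* Checked at compile time: identical on 32-bit and 64-bit builds. */
_Static_assert(sizeof(example_process_name_t) == 12,
               "process name must be exactly 12 bytes");

size_t example_name_size(void)
{
    return sizeof(example_process_name_t);
}
```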
Brian Barrett
6be35fb604 * Use the ORTE_<type> constants instead of internal DSS_TYPE_<type>_T constants
for the type to be packed / unpacked when dealing with sized types (like
  size_t) so that the dss_unpack code to deal with types of different sizes is
  activated.  Necessary for proper 32/64 interoperability.

This commit was SVN r9475.
2006-03-30 14:33:25 +00:00
Brian Barrett
02c8a51b76 * fix endian encoding for 64 bit numbers to use hton64
* cleanup the unpack_size_mismatch macros a little bit
* add comment about endianness of size_mismatch cleanup code so that I don't
  think I've found a bug that really isn't and lose an hour tracking it down
  again...

This commit was SVN r9458.
2006-03-29 18:58:02 +00:00
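A 64-bit host-to-network conversion of the kind r9458 refers to can be sketched portably. The code below is not Open MPI's `hton64`; `put_be64` and `get_be64` are hypothetical names that serialize a value most-significant byte first, which gives the same bytes on the wire regardless of host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Write `v` into `out` in network (big-endian) byte order. */
void put_be64(uint64_t v, unsigned char out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Read a network-order 64-bit value back into host representation. */
uint64_t get_be64(const unsigned char in[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | in[i];
    return v;
}
```

Working on bytes rather than swapping in place avoids any dependence on a compile-time endianness check.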
Brian Barrett
cf425f6289 * trick the stupid compiler (GCC in this case) into shutting up about not
being able to convert from object pointer to function pointer. 

This commit was SVN r9445.
2006-03-29 01:26:16 +00:00
Brian Barrett
99e4c89183 * some typo fixes for orterun manpage
* Install orterun manpage as mpirun.1 and mpiexec.1 as well as orterun.1

This commit was SVN r9444.
2006-03-29 01:04:43 +00:00
Jeff Squyres
07b0e559f2 Fix copyright
This commit was SVN r9443.
2006-03-29 00:53:11 +00:00
George Bosilca
50b5a02f8b Let the oob call opal_progress instead of opal_progress_event. Now, the MPI
communications will be advanced in MPI_Finalize.

This commit was SVN r9442.
2006-03-28 22:09:40 +00:00
Josh Hursey
35eb1a2970 Added a section on "Specifying Hosts" to the man page.
This commit was SVN r9432.
2006-03-27 23:46:38 +00:00
Jeff Squyres
bc96040e1c - Add Cisco copyright
- Add comment explaining why we used INT_MAX
- Update NEWS

This commit was SVN r9415.
2006-03-24 15:39:09 +00:00
Jeff Squyres
a843ce4c23 Clean up a minor memory leak
This commit was SVN r9413.
2006-03-24 15:28:42 +00:00
Brian Barrett
23f0aef07c * fix casting warning
This commit was SVN r9407.
2006-03-24 02:52:34 +00:00
Ralph Castain
08db67cdf8 Fix the app_context problem for app_files too....
Again, this should be checked by Jeff.

This commit was SVN r9393.
2006-03-23 17:55:25 +00:00
Ralph Castain
2a18ebd9e1 Fix the app_context problem.
NOTE: JEFF SHOULD CHECK THIS!

I found that orterun was not tracking the index number of the app_contexts it was creating. Hence, the app_context->idx field was always sitting at zero. This index is used by the mapper to decide which app_context to use for each process - thus, with the value of each index being zero, the mapper only used the first app_context that was created. All others were ignored.

Not sure when this might have gotten changed. Could be it was a problem that always existed, but didn't get exposed until something else was changed.

Anyway, it seems to work now - could stand further testing.

This commit was SVN r9389.
2006-03-23 16:53:11 +00:00
Ralph Castain
0552aef6bb Add some finer error checking that should help debug some recent problems with dynamic spawns.
This commit was SVN r9383.
2006-03-23 15:31:43 +00:00
Josh Hursey
22bac7ae95 a test commit. one more try
This commit was SVN r9350.
2006-03-21 00:39:29 +00:00
Josh Hursey
d64aab529f a test commit. no real changes here. Removing added char.
This commit was SVN r9349.
2006-03-21 00:37:13 +00:00
Josh Hursey
c8f9108c18 a test commit. no real changes here
This commit was SVN r9348.
2006-03-21 00:33:20 +00:00
Josh Hursey
66edc64be0 Minor comment change
This commit was SVN r9316.
2006-03-16 19:00:03 +00:00
Josh Hursey
7fcfd87cd5 Minor date change
This commit was SVN r9315.
2006-03-16 18:59:13 +00:00
Tim Woodall
564c177922 corrections for threading support
This commit was SVN r9292.
2006-03-16 00:06:48 +00:00
Tim Woodall
648ca32742 correct include
This commit was SVN r9276.
2006-03-14 14:40:52 +00:00
Brian Barrett
1f6e85af4c Let's try this again, this time with less suck.
* Don't do the .in -> .tmp -> header thing for the prefixes and versions.
  It causes some severe cleanup issues all to save 4 files from rebuilding
  when configure is run.
* Clean up some makefiles so it's clear what is being installed/disted

This commit was SVN r9260.
2006-03-12 17:56:58 +00:00
Brian Barrett
ea7b9cfc81 * Only enable SLURM support in ORTE if on a platform currently supported by
SLURM.  Currently, this includes AIX and Linux.  If the user wants to build
  SLURM on another platform, they can specify --with-slurm.
* Enable/disable the SLURM sds component using the same logic as the PLS and
  RDS components.

This commit was SVN r9259.
2006-03-12 05:32:35 +00:00
Brian Barrett
c42da09796 * Fix a small bug George noticed - if you change the prefix (or any of the
installation directories) in configure, the files that depend on this
  information are not properly rebuilt.  If you need this information,
  don't setup a -D in the Makefile.am - instead, include 
  opal/install_dirs.h.
* Use the : option in AC_CONFIG_FILES to avoid needing to expose that
  we are playing around with temporary files with our headers to avoid
  rebuilding
* Clean up the version file information a bit, and like the install 
  directory stuff, make sure that there is a dependency so that 
  ompi_info gets rebuilt properly when a version number changes.

This commit was SVN r9256.
2006-03-12 04:35:01 +00:00
Brian Barrett
3e2c51dea8 * fix some silly commenting done by a previous developer that are good for
a laugh but probably not good for usability ;)

This commit was SVN r9253.
2006-03-11 03:09:24 +00:00
Brian Barrett
d0f5f8a242 * enable pty support for platforms that do not have openpty(), but do
have pty support in general.

This commit was SVN r9250.
2006-03-11 02:35:40 +00:00
Jeff Squyres
80bc1850bf Ensure that --prefix takes precedence over /path/to/orterun
This commit was SVN r9183.
2006-02-28 14:44:40 +00:00
Jeff Squyres
88b3e6f8bd - Fix bug in orterun where --prefix didn't show up in the help output
(reported by Cisco)
- While in orterun, add a feature that multiple users have asked for:
  if you specify an absolute pathname to orterun, such as
  "/path/to/bin/orterun ...", it's equivalent to "orterun --path
  /path/to ..."

This commit was SVN r9181.
2006-02-28 11:52:12 +00:00
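The absolute-pathname feature in r9181 amounts to deriving an installation directory from the path of the executable. A minimal sketch, assuming the directory of interest is two levels above the executable (as in "/path/to/bin/orterun" giving "/path/to"); `prefix_from_argv0` is a hypothetical helper, not the orterun implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given "/path/to/bin/orterun", write "/path/to"
 * into `out`. Returns 0 on success, or -1 if the path is not absolute,
 * does not fit in `out`, or has too few components to strip two. */
int prefix_from_argv0(const char *argv0, char *out, size_t outlen)
{
    if (argv0 == NULL || argv0[0] != '/')
        return -1;

    size_t len = strlen(argv0);
    if (len + 1 > outlen)
        return -1;
    memcpy(out, argv0, len + 1);

    /* Strip the last two components: the executable name, then "bin". */
    for (int i = 0; i < 2; i++) {
        char *slash = strrchr(out, '/');
        if (slash == NULL || slash == out)
            return -1;
        *slash = '\0';
    }
    return 0;
}
```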
Rainer Keller
24f4b282f8 - Free the segment string.
This commit was SVN r9170.
2006-02-27 16:07:35 +00:00
George Bosilca
a76213f352 No group and lossy signals on Windows.
This commit was SVN r9160.
2006-02-27 05:11:44 +00:00
Jeff Squyres
28a1610453 - Add missing <netdb.h>
- Change #if to #ifdef

This commit was SVN r9146.
2006-02-26 16:06:58 +00:00
Brian Barrett
285581dff2 More endian-related cleanups:
  - moved hton64 and ntoh64 from the bunch of places it had been copied
    into one header file
  - properly set and use the btl_tcp's nbo option to put things in
    network byte order on the wire if both sides don't have the same
    endianness
  - Put the OB1 PML's headers (with a couple exceptions I need to discuss
    with Tim) in network byte order on the wire if both sides don't have
    the same endianness
  - since it was needed for the TCP BTL, move the orte_process_name_t
    HTON and NTOH macros from the TCP OOB to ns_types.h

This commit was SVN r9145.
2006-02-26 00:45:54 +00:00
Jeff Squyres
b4765b6db6 Add <netdb.h>
This commit was SVN r9142.
2006-02-25 19:04:26 +00:00
Brian Barrett
6e57e4c370 * adjust size of packed (unsized) data elements (like long, bool, size_t, etc)
to match the receiving process's setup.  sizeof(bool) is different on
  i386 OS X and PowerPC OS X, so need this to do endian testing between
  the two

This commit was SVN r9140.
2006-02-24 16:15:52 +00:00
George Bosilca
7f4d84d823 There is a duplicate in soh.h (thanks to icpc for finding it).
This commit was SVN r9131.
2006-02-23 06:46:09 +00:00
Brian Barrett
1d3132a725 * some pending Red Storm related updates to ORTE
This commit was SVN r9128.
2006-02-23 04:55:39 +00:00
Josh Hursey
93e00415d5 A bunch of edits for clarity and precision.
Still needs some work, but getting closer

This commit was SVN r9098.
2006-02-21 04:17:56 +00:00
Josh Hursey
a3712f7a65 A cleanup checkpoint:
- Explained <program> and made a consistency change in the Quick Start section.
- Change references to 'app schema' to Open MPI 'app context'
- Audit the command line arguments for --foo, -foo stuff.

This commit was SVN r9097.
2006-02-21 00:48:31 +00:00
Jeff Squyres
186704a23b A few updates
This commit was SVN r9089.
2006-02-18 04:17:18 +00:00
Jeff Squyres
22da6ef4e4 This bit accidentally got lost in the testing of the bproc/fork path
and cwd update functionality.  For bproc, we *do* need to change
directories while checking the cwd because argv[0] may be expressed as
a relative path, and therefore needs to be checked from the cwd
expressed in the app context.

This commit was SVN r9084.
2006-02-17 16:15:21 +00:00
Jeff Squyres
2c91ac861a When not in debug mode, tie stdout/stderr to dev null so we don't see
messages from orted (i.e., from the srun command).

This commit was SVN r9083.
2006-02-17 15:06:08 +00:00
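Tying an output descriptor to /dev/null, as r9083 describes for stdout/stderr, is commonly done with `open()` and `dup2()`. A minimal sketch with a hypothetical helper name; it takes the descriptor as a parameter so it can be applied to stdout, stderr, or any other descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Point `fd` at /dev/null. Returns 0 on success, -1 on failure. */
int redirect_to_devnull(int fd)
{
    int null_fd = open("/dev/null", O_WRONLY);
    if (null_fd < 0)
        return -1;
    if (dup2(null_fd, fd) < 0) {
        close(null_fd);
        return -1;
    }
    if (null_fd != fd)
        close(null_fd);     /* `fd` now holds its own reference */
    return 0;
}
```

A caller that wants the behavior from the commit message would invoke this for `STDOUT_FILENO` and `STDERR_FILENO` only when not in debug mode.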
George Bosilca
6a80c75110 The default value is not 0 as the variable is an enum !
This commit was SVN r9081.
2006-02-17 05:11:53 +00:00
George Bosilca
21749cb4c7 I suppose that the meaning of this function is just to print a job id ...
This commit was SVN r9080.
2006-02-17 05:11:07 +00:00
Josh Hursey
02c999776b Removed all of the LAM stuff.
This needs to be gone over a few more times before it is allowed to see
daylight, but has come a long way.  Some sections may be off more than a little,
but the general idea is there.

Need to audit to make sure we don't call the ORTE VHNP's daemons :)

This commit was SVN r9078.
2006-02-17 03:47:52 +00:00
Josh Hursey
2938545220 Checkpoint.
Finished adding and pruning all the Options.

Cleaned up a bunch of man syntax, so it should be 'more' readable (making the
assumption that man source is ever readable :p).

I am moving on to the "description" and "see also" sections next.

This commit was SVN r9077.
2006-02-16 23:38:03 +00:00
Jeff Squyres
81e0bd444b - Remove extraneous chdir("/tmp")
- For a local orted launch, chdir($HOME) to be consistent with what
  [we assume] will happen on the remote nodes

This commit was SVN r9073.
2006-02-16 22:14:05 +00:00
George Bosilca
ab59741df6 Access to environ is not granted for free, we have to declare it on MAC OS X
(at least with gcc 4.2).

This commit was SVN r9071.
2006-02-16 21:26:25 +00:00
Jeff Squyres
c2c2daa966 Change the behavior of orterun (mpirun, mpiexec) to search for
argv[0] and the cwd on the target node (i.e., the node where the
executable will be running in all systems except BProc, where the
searches are run on the node where orterun is invoked).
- fork pls now does cwd and argv[0] search in orted
- bproc pls does cwd and argv[0] search in orterun
- cwd behavior slightly different:
  - if user specifies a -wdir to orterun, we chdir() to there; if we
    can't for some reason, abort
  - if user does not specify a -wdir, try to chdir() to the dir where
    orterun was invoked.  If we can't for some reason (e.g., it
    doesn't exist on the target node), then try to chdir($HOME).  If
    we can't do that, then just live with whatever default directory
    we were put in.

This commit was SVN r9068.
2006-02-16 20:40:23 +00:00
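The working-directory fallback order listed in r9068 can be sketched as a single function. The names below (`choose_workdir`, the `wdir_result` enum) are hypothetical and the sketch only mirrors the order given in the message: a user-specified directory is mandatory, otherwise the launch directory is tried, then $HOME, then whatever the default is.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

enum wdir_result {
    WDIR_ABORT   = -1,  /* user gave -wdir and chdir() failed */
    WDIR_USER    = 0,   /* now in the -wdir directory */
    WDIR_LAUNCH  = 1,   /* now in the directory orterun was invoked from */
    WDIR_HOME    = 2,   /* fell back to $HOME */
    WDIR_DEFAULT = 3    /* stayed in whatever directory we started in */
};

/* `user_wdir` is the -wdir value (NULL if not given); `launch_dir` is
 * the directory where the launcher was invoked. */
enum wdir_result choose_workdir(const char *user_wdir, const char *launch_dir)
{
    if (user_wdir != NULL)
        return chdir(user_wdir) == 0 ? WDIR_USER : WDIR_ABORT;

    if (launch_dir != NULL && chdir(launch_dir) == 0)
        return WDIR_LAUNCH;

    const char *home = getenv("HOME");
    if (home != NULL && chdir(home) == 0)
        return WDIR_HOME;

    return WDIR_DEFAULT;
}
```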
Tim Woodall
039fe0ad29 change process group only in bproc case, as this is really
a workaround for a bproc4 bug

This commit was SVN r9064.
2006-02-16 16:19:37 +00:00
Tim Woodall
c07e84cf6d correct return values
This commit was SVN r9063.
2006-02-16 16:18:46 +00:00
Tim Woodall
67c985554b correct compile issue
This commit was SVN r9060.
2006-02-16 14:44:58 +00:00
Jeff Squyres
d741b7f37f We're adding some specific and complex functionality to orterun, so it
really needs to be documented (in part so that users stop asking us
how to do it!).  

This is a first cut at an orterun.1 man page.  It is 95% copied from
LAM's mpirun.1 man page -- I just edited the very top and am handing
this off to Josh to finish the first cut.  Then we'll add specific
docs about the behavior of some of the finer details.  This is not
listed in the Makefile.am yet because it's so incomplete/incorrect
(w.r.t. OMPI), so I don't want it included in the tarball or installed
[yet].

This commit was SVN r9058.
2006-02-16 13:29:37 +00:00
Jeff Squyres
018a4b98ff - Ensure that "context" is initialized to NULL
- Ensure that we don't free a NULL context
- Add a few {}'s

This commit was SVN r9055.
2006-02-16 04:09:29 +00:00
Tim Woodall
fc751171cd bproc cleanup from release branch
This commit was SVN r9054.
2006-02-16 00:16:22 +00:00
David Daniel
e82c470b32 - Change the exit status set by mpirun when an application process is
killed by a signal.  The exit status is now set to signo + 128, which
  conforms with the behavior of (almost) all shells.

This commit was SVN r9050.
2006-02-15 22:41:29 +00:00
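The shell convention mentioned in r9050 (exit status equals signal number plus 128) can be expressed with the standard wait-status macros. A minimal sketch; `shell_exit_status` is a hypothetical name, and the handling of statuses that are neither normal exits nor signal terminations is an assumption.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a wait()-style status to a shell-like exit code.
 * Returns -1 for statuses that are neither exit nor signal death. */
int shell_exit_status(int wstatus)
{
    if (WIFEXITED(wstatus))
        return WEXITSTATUS(wstatus);
    if (WIFSIGNALED(wstatus))
        return 128 + WTERMSIG(wstatus);
    return -1;
}
```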
David Daniel
ff7a2c7967 Fixes for BJS (broken since merge)
This commit was SVN r9043.
2006-02-15 01:14:50 +00:00
David Daniel
aa5c5772c2 Fixing a wayward OMPI_ERROR.
Fixing logic of a couple of error logging statements (compiler was complaining)

This commit was SVN r9042.
2006-02-15 00:09:33 +00:00
George Bosilca
8062277bae I'm confused ... Error string as well as the goto label had the same name ...
This commit was SVN r9036.
2006-02-14 17:49:14 +00:00
Ralph Castain
f5d17148c1 Clean up the references to num_env, which has been removed from app_context.
This commit was SVN r9014.
2006-02-13 21:08:35 +00:00
Ralph Castain
bc6a82839d Update these components to new dss
This commit was SVN r9004.
2006-02-13 15:28:29 +00:00
Brian Barrett
913890f534 * forgot to add new directories into DIST_SUBDIRS as well as SUBDIRS, so
tarballs were missing some directories.

This commit was SVN r8989.
2006-02-12 07:06:38 +00:00
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
  - move files out of toplevel include/ and etc/, moving them into the
    sub-projects
  - rather than including config headers with <project>/include, 
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
Ralph Castain
1abe8ef368 Well, it certainly helps triggers to fire if the respective responsible routines adjust the counters!
The INIT counter is supposed to be adjusted when the processes are mapped - this is now done correctly.

The LAUNCHED counter is supposed to be adjusted when the pls sets the process pid info into the registry and changes the state to LAUNCHED. This could probably be changed to have that function use the set_proc_soh API, but this fixes the problem for now.

Thanks to Brian for finding that the triggers were not being fired.

This commit was SVN r8948.
2006-02-09 15:39:06 +00:00
George Bosilca
4b4b70cb0f Remove compilation warning.
This commit was SVN r8942.
2006-02-08 19:52:57 +00:00
Ralph Castain
892b396d70 Ensure that standard triggers are defined for all job/process states so that users can subscribe to those they want to use. Modify the way that is done to avoid over-burdening the standard launch sequence since it doesn't need alerts from all those triggers.
This commit was SVN r8938.
2006-02-08 17:40:11 +00:00
Ralph Castain
5c750cd8b9 Checkpoint a fix for Brian's observed failure to correctly unpack byte_objects. Will continue testing on another machine.
This commit was SVN r8921.
2006-02-07 15:43:43 +00:00
George Bosilca
3bb2eadfaa Do not leave them uninitialized.
This commit was SVN r8916.
2006-02-07 06:06:58 +00:00
George Bosilca
dda0e4182f Remove unused variables
Add required include files (stdio.h for NULL definition).
Make it compile on MAC OS 10.3.

This commit was SVN r8914.
2006-02-07 05:41:31 +00:00
Brian Barrett
72de49f0ad * make the xgrid component compile again. still need to test tomorrow...
This commit was SVN r8913.
2006-02-07 04:46:00 +00:00
Ralph Castain
4b9f015c0b Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list.
This commit was SVN r8912.
2006-02-07 03:32:36 +00:00
George Bosilca
b7fa1f4664 Add signal.h to the include files to import SIGCONT.
This commit was SVN r8899.
2006-02-05 05:49:24 +00:00
Brian Barrett
03f6a8529c * Fix situation where we were unlocking a mutex we didn't own in an error
cleanup code in the signal part of the event library
* Only attempt to forward standard input if we have a controlling terminal
  (isatty() returns 1) and we are the foreground process OR we do not have
  a controlling terminal (isatty() returns 0).  If we have a controlling
  terminal, check at each SIGCONT if we should change our forwarding,
  since our foreground / background status may have changed.

  Unfortunately, there isn't a great way in the iof framework to know if
  we are capturing a starter's stdin.  Use the logic that if it's a source
  AND tagged as standard input, it's a starter's stdin.  This seems to
  work for all the common usages.

Both these need to go to the v1.0 branch.

This commit was SVN r8894.
2006-02-04 23:26:58 +00:00
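The stdin-forwarding rule in r8894 can be sketched with `isatty()` and `tcgetpgrp()`. This is an illustrative reading of the message, with a hypothetical function name: forward when the descriptor is not a terminal, or when it is a terminal and the calling process group is in the foreground.

```c
#include <assert.h>
#include <unistd.h>

/* Returns 1 if input on `fd` should be forwarded, 0 otherwise. */
int should_forward_stdin(int fd)
{
    if (!isatty(fd))
        return 1;                       /* file or pipe: always forward */
    /* Terminal: forward only while we are the foreground process group. */
    return tcgetpgrp(fd) == getpgrp();
}
```

Because foreground status can change while the process is stopped, the message suggests re-evaluating a check like this on every SIGCONT.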
Brian Barrett
7c247eea01 * Add a finalize function for iof framework and add a finalize function for
the svc component so that it can disable the rml exception callback, fixing
  a race condition in the shutdown mechanism of orte.

  This should probably go to the v1.0 branch.

This commit was SVN r8893.
2006-02-03 21:01:11 +00:00
Brian Barrett
ddda56eb0d * Don't use ptys for stdin. When a pty has close() called on it, it
discards all of the data in the pty that hasn't been read.  This was
  leading to data being discarded when files were redirected into
  mpirun and read by rank 0 of the job.  This was very "not good".

  The decision to not use ptys for stdin was made based on what Tim said
  that LA-MPI was doing.

  This needs to go to the v1.0 branch...  Tim should probably review...

This commit was SVN r8892.
2006-02-03 20:43:20 +00:00
Jeff Squyres
abc67a257f This approach is cleaner than the previous one -- use a temporary
shell variable to avoid setting the OMPI $libpath twice in
$LD_LIBRARY_PATH.  Many thanks to Glenn Morris.

This commit was SVN r8883.
2006-02-02 11:58:40 +00:00
Jeff Squyres
cc1ee11eeb Fix issues with tcsh and LD_LIBRARY_PATH when using --prefix. See
lengthy comment inside for details.  Thanks to Glenn Morris for finding
this issue and suggesting the fix.

This commit was SVN r8880.
2006-02-02 06:26:55 +00:00
Jeff Squyres
f7097d34c8 Remove some \n typos. Thanks to Glenn Morris for finding these.
This commit was SVN r8878.
2006-02-02 05:50:15 +00:00
Rainer Keller
dd13b098e1 - Simple locking fix.
This commit was SVN r8822.
2006-01-26 13:20:53 +00:00
Jeff Squyres
ed0fa9720d Incorporate fix suggested by Chris Gottbratch.
This commit was SVN r8750.
2006-01-19 15:21:53 +00:00
George Bosilca
e6e28460f1 Remove all Windows code as fork is not available on Windows. Instead a shiny new pls
will join the fun (handling process creation on windows).

This commit was SVN r8745.
2006-01-19 07:01:51 +00:00
Jeff Squyres
e6bd80b424 Per the commit message of r8514, change the search order to be "ssh :
rsh", *not* "rsh : ssh".

This commit was SVN r8736.

The following SVN revision numbers were found above:
  r8514 --> open-mpi/ompi@9c25bdc5ac
2006-01-18 22:00:34 +00:00
George Bosilca
992daf7522 Remove all unused defines from the Makefile.
This commit was SVN r8734.
2006-01-18 21:21:29 +00:00
Brian Barrett
c96f870674 * Merge of wrapper compiler updates from the bwb-wrapper-fix branch (r8690 -
r8698), with changes below:

  - Split wrapper flags into those required for each of the three projects,
    and cleaned up some cruft (including the LIBMPI_EXTRA_*FLAGS) through-
    out the build system
  - Added opal_init_util and opal_finalize_util to allow init / cleanup
    of all the opal code that doesn't require the MCA system
  - Create standalone key=value file parser, based on the one that used
    to be in the mca param parser, so that it can be shared in multiple
    places
  - Add wrapper datafiles for opal, orte, and ompi wrappers, and add
    wrapper compiler with support for all the old features

This commit was SVN r8699.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r8690
  r8698
2006-01-16 01:48:03 +00:00
George Bosilca
bf266c6109 Rollback the 8682 commit until we figure out the correct way to do it. It breaks several things
inside (like MPI_Wait* functions).

This commit was SVN r8686.
2006-01-13 22:02:40 +00:00
Rainer Keller
95f886b6ab - Protect callers of opal/ompi_condition_wait from spurious wakeups,
possible when building with pthreads.
   Compiled on Linux ia32 with and without
   --enable-progress-threads

This commit was SVN r8682.
2006-01-12 17:13:08 +00:00
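Protection against spurious wakeups, as r8682 describes, is conventionally done by re-checking a predicate in a loop around the wait. The sketch below uses plain pthreads rather than the opal/ompi condition wrappers, and `completion_t` with its functions are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;   /* predicate, guarded by `lock` */
} completion_t;

void completion_signal(completion_t *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

void completion_wait(completion_t *c)
{
    pthread_mutex_lock(&c->lock);
    /* Loop, not `if`: pthread_cond_wait() may return without a signal,
     * so the predicate is the only reliable indication of completion. */
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}
```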
David Daniel
c2ee847184 Missing header file.
This commit was SVN r8670.
2006-01-10 21:58:21 +00:00
Jeff Squyres
b2de55d72e Back out some debugging stuff from a careless r8643 commit (only
intended to include the OMPI_DEBUG_ZERO call).

These debugging statements should not have affected correctness
because the value of 78 will be overridden in the read() and the
assert()/abort() stuff will only be triggered on an error which should
never happen (i.e., the error should have been handled by the prior if
conditional).  But still, this code should not be there.

This commit was SVN r8649.

The following SVN revision numbers were found above:
  r8643 --> open-mpi/ompi@a6b869ed68
2006-01-05 14:44:10 +00:00
Jeff Squyres
a6b869ed68 Avoid a false positive in bcheck
This commit was SVN r8643.
2006-01-04 22:29:09 +00:00
David Daniel
d272e02338 Need to include fcntl.h on linux -- protected for windows.
This commit was SVN r8630.
2006-01-04 00:54:16 +00:00
George Bosilca
7a88e72c1b Add more protections around the headers.
This commit was SVN r8617.
2005-12-31 12:35:24 +00:00
George Bosilca
d91650ea85 Do not explicitly use "ln -s" as on some systems it does not work properly ...
(windows). Instead use the LN_S variable exported by the Makefile (set to
"ln -s" on all Unixes and to "cp -p" on windows).

When we remove an executable use the correct extension for its name
(add $(EXEEXT) to the name).

This commit was SVN r8616.
2005-12-31 12:33:44 +00:00
George Bosilca
3c95dd0801 No discrimination !
This commit was SVN r8613.
2005-12-31 12:20:32 +00:00
Jeff Squyres
1e93c78e2e - Rename rsh component members: argv->agent_argv, argc->agent_argc,
and path->agent_path so that it's totally clear what these are for
- make a new rsh component param for agent_param (the value from the
  MCA param)
- delay the path check for the agent until the component init -- don't
  make it fail during open, because the MCA base will print a warning
  if a component fails open() (e.g., on clusters without rsh/ssh (!),
  this component was failing noisily even though it was
  normal/expected)

This commit was SVN r8596.
2005-12-22 14:37:19 +00:00
Jeff Squyres
8fb5e506aa Arrgh -- should have included this in last commit: need to set a
variable before we += it in a Makefile.am.

This commit was SVN r8595.
2005-12-22 14:30:27 +00:00
Jeff Squyres
93b4d12d14 Add a friendly help message if no pls components are found to be
available.

This commit was SVN r8594.
2005-12-22 14:29:45 +00:00
Jeff Squyres
5a03f86818 Fix a case where it's valid to get no responses back -- return early
before invoking malloc(0).

This commit was SVN r8577.
2005-12-21 13:45:06 +00:00
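The early return described in r8577 avoids calling `malloc(0)`, whose result is implementation-defined (it may be NULL or a unique pointer). A minimal sketch, assuming the responses are an array of ints; `copy_responses` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Duplicate `count` ints from `src` into a newly allocated array.
 * Zero responses is a valid outcome: return early with *out == NULL
 * rather than calling malloc(0). Returns 0 on success, -1 on failure. */
int copy_responses(const int *src, size_t count, int **out)
{
    *out = NULL;
    if (count == 0)
        return 0;

    int *copy = malloc(count * sizeof *copy);
    if (copy == NULL)
        return -1;
    memcpy(copy, src, count * sizeof *copy);
    *out = copy;
    return 0;
}
```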
Rainer Keller
b06d79d4fe - Seems with change r7664, the mapping has slightly changed.
In case of checking for Shell with --mca pls_rsh_assume_same_shell 0
   have the node point to sensible values.

This commit was SVN r8563.

The following SVN revision numbers were found above:
  r7664 --> open-mpi/ompi@0629cdc2d7
2005-12-20 15:59:17 +00:00
Brian Barrett
a5af07cd6b fixes suggested by Ralf for supporting both Libtool 1 and 2 in Open MPI...
This commit was SVN r8538.
2005-12-19 03:10:23 +00:00
George Bosilca
f9b07f1912 Protect the includes.
This commit was SVN r8532.
2005-12-17 22:05:10 +00:00
Brian Barrett
456ba1c11f * need to declare environ on OS X
this should go to the 1.0 branch

This commit was SVN r8527.
2005-12-16 19:20:33 +00:00
Jeff Squyres
fa097c9874 Remove two components that were templated out quite a while ago and
aren't currently in use (i.e., they were never finished).  If needed,
they can be pulled out of SVN history.

This commit was SVN r8524.
2005-12-16 17:40:51 +00:00
Jeff Squyres
25b2730a34 Only allow the fork component to run when we're in an orted.
This commit was SVN r8515.
2005-12-15 21:05:26 +00:00
Jeff Squyres
9c25bdc5ac Change to the rsh pls component to have the pls_rsh_agent MCA param
now take a colon-delimited list of agents (and associated argv).  Also
change the default value to "ssh : rsh".  Hence, if we run on a
cluster that does not have ssh, we'll fall back to rsh.  If we can't
find rsh, then the rsh component will disqualify itself from
selection.

This commit was SVN r8514.
2005-12-15 20:54:24 +00:00
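Parsing a colon-delimited agent list such as "ssh : rsh" can be sketched as below. This is not the rsh component's code; `agent_at` is a hypothetical helper that extracts the entry at a given index and trims surrounding whitespace, so a caller could try index 0, then 1, and so on until one agent is found.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy the idx-th colon-separated entry of `list` into `out`, trimmed
 * of surrounding whitespace. Returns 0 on success, -1 if there is no
 * such entry or it does not fit in `out`. */
int agent_at(const char *list, int idx, char *out, size_t outlen)
{
    const char *start = list;

    for (;;) {
        const char *end = strchr(start, ':');
        if (end == NULL)
            end = start + strlen(start);

        if (idx == 0) {
            while (start < end && isspace((unsigned char)*start))
                start++;
            while (end > start && isspace((unsigned char)end[-1]))
                end--;
            size_t n = (size_t)(end - start);
            if (n + 1 > outlen)
                return -1;
            memcpy(out, start, n);
            out[n] = '\0';
            return 0;
        }
        if (*end == '\0')
            return -1;                  /* ran out of entries */
        start = end + 1;
        idx--;
    }
}
```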
Jeff Squyres
e184fd6801 Make sure that what we find is executable
This commit was SVN r8513.
2005-12-15 20:31:20 +00:00
George Bosilca
505d830b3f I missed the requirement for the mca_base_component_repository.h header.
This commit was SVN r8465.
2005-12-12 21:10:30 +00:00
George Bosilca
7d8d516a4a A bunch of fixes for Windows support.
- protection with __WINDOWS__ and not WIN32 or _WIN32
- protect all the headers

This commit was SVN r8463.
2005-12-12 20:04:00 +00:00
George Bosilca
32cecc5798 Change ERROR to subscribe_error because ERROR is predefined on Windows. I didn't spend
too much time tracking that down, I just know that cl.exe will replace it with the
"constant" string ...

This commit was SVN r8449.
2005-12-11 06:23:07 +00:00
Jeff Squyres
31336e4773 Add some missing headers / correct one installation directory
This commit was SVN r8408.
2005-12-08 04:00:52 +00:00
Jeff Squyres
6fbd321442 Fix a bunch of install locations for header files
This commit was SVN r8406.
2005-12-08 00:54:44 +00:00
Jeff Squyres
e781f55d16 Add proper prefixes into the #include statements
This commit was SVN r8404.
2005-12-08 00:05:26 +00:00
Jeff Squyres
3f27e61de6 Fix location of installed header files
This commit was SVN r8403.
2005-12-08 00:04:19 +00:00
Jeff Squyres
bd0b5acf0b Oops -- there's a second instance of OCRNL that needed to be
protected.

This commit was SVN r8374.
2005-12-02 18:24:59 +00:00
Jeff Squyres
0c9420e204 OS X 10.3 does not have OCRNL #define'd, so we need to protect its
usage 

This commit was SVN r8371.
2005-12-02 16:57:37 +00:00
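A minimal sketch of the kind of protection r8371 and r8374 describe, assuming the flag is being cleared from a termios output mode. `clear_ocrnl` is a hypothetical helper, not a function from the tree.

```c
#include <termios.h>

/*
 * Clear OCRNL from an output-mode flag word, but only on platforms
 * whose <termios.h> defines it. Elsewhere the value is returned unchanged.
 */
tcflag_t clear_ocrnl(tcflag_t oflag)
{
#ifdef OCRNL
    oflag &= ~(tcflag_t)OCRNL;
#endif
    return oflag;
}
```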
Brian Barrett
bc4d3d6fff IRIX compile fixes:
- Need to make sure that SIZE_MAX exists as a constant if stdint.h
  doesn't exist
- struct timeval is defined in unistd.h on IRIX, so need to include
  that header file wherever struct timeval is used.

This commit was SVN r8361.
2005-12-01 18:28:20 +00:00
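One common way to guarantee SIZE_MAX, shown here as a sketch rather than the change made in r8361. `HAVE_STDINT_H` is assumed to be a configure-generated macro, and `size_add_overflows` is a hypothetical helper included only to show the constant in use.

```c
#include <stddef.h>   /* size_t */

#ifdef HAVE_STDINT_H  /* assumed configure result; not taken from the commit */
#include <stdint.h>
#endif

/* Fallback: converting -1 to an unsigned type yields that type's maximum. */
#ifndef SIZE_MAX
#define SIZE_MAX ((size_t)-1)
#endif

/* Returns 1 if a + b would wrap around in size_t arithmetic, 0 otherwise. */
int size_add_overflows(size_t a, size_t b)
{
    return a > SIZE_MAX - b;
}
```

Note that the fallback contains a cast, so it works in ordinary expressions but not in `#if` preprocessor conditions.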
Tim Woodall
20e6f41fe2 allow node number as hostname for bproc
This commit was SVN r8357.
2005-12-01 17:44:08 +00:00
Brian Barrett
389e378054 * use opal_init / opal_finalize in orteprobe so that ordering doesn't get out of
sync with opal....

This commit was SVN r8341.
2005-11-30 21:40:11 +00:00
Brian Barrett
79bf8843d2 * update memory hooks interface to allow for callbacks on both allocations
and deallocations, per request from Galen and Tim

This commit was SVN r8303.
2005-11-29 04:46:14 +00:00
Tim Woodall
cf53d3e48f missing include
This commit was SVN r8295.
2005-11-28 23:13:36 +00:00
Galen Shipman
6e64e8a144 bproc fixes, these exist in the release 1.0 branch.
This commit was SVN r8292.
2005-11-28 21:10:02 +00:00
Tim Woodall
943e6f0cd5 corrections for stdin
- when eof is reached at orterun, send a 0 byte message to peer indicating eof
- on receipt of zero byte message - close corresponding file descriptor associated with the endpoint
- requires setting up ptys for stdin and stdout so that stdin can be closed independently of stdout

This commit was SVN r8264.
2005-11-28 14:58:53 +00:00
Tim Woodall
eb7cfe3ecd implement unsubscribe
This commit was SVN r8214.
2005-11-21 19:46:47 +00:00
Jeff Squyres
443d833ee9 fx2 is the serial debugger; fxp is the parallel debugger.
This commit was SVN r8211.
2005-11-21 17:00:36 +00:00
Brian Barrett
fee6409708 fix compiler warning and compiler error in totalview code...
This commit was SVN r8207.
2005-11-20 18:41:45 +00:00
Jeff Squyres
8d96c21311 Good weekend brainless activity -- implement the orterun command line
debugger scheme described in
http://www.open-mpi.org/community/lists/users/2005/10/0214.php.  This
makes our user-level debugger scheme much more vendor-independent
(although the "-tv" option will still work for backwards compatibility
-- it'll just be a synonym of "--debug").

This commit was SVN r8206.
2005-11-20 16:06:53 +00:00
Brian Barrett
20cea60b82 * fix "make distclean" error in PML
* turns out (duh!) that there was a reason that the <projectdir>dir
  variable was set in the AM conditional.  If not, stupid directories
  are created and not needed...  duh.

This commit was SVN r8205.
2005-11-20 07:41:09 +00:00
Brian Barrett
8faa1884f0 * The last of the build system optimizations. Combine the component and
component/base Makefile.am files, reducing the time configure spends
  stamping out Makefiles at the end
* Install base_impl.h file when devel-headers are being installed

This commit was SVN r8200.
2005-11-20 01:03:01 +00:00
Tim Woodall
d579e048f7 reset node name to be node number only to match
value set by allocation/mapper

This commit was SVN r8186.
2005-11-17 22:02:28 +00:00
Jeff Squyres
23ca7e1311 Ensure to return a value.
This commit was SVN r8182.
2005-11-17 14:31:42 +00:00
Brian Barrett
3e3ba49cdb should have removed the line of code, rather than #if 0'ing it out
This commit was SVN r8172.
2005-11-17 05:22:19 +00:00
Brian Barrett
f464bbbcc0 fix a couple of double-lock issues in the iof code that have crept in recently.
This should go to the v1.0 branch.

This commit was SVN r8171.
2005-11-17 01:26:00 +00:00
Tim Woodall
142b7cc682 merge from release branch
This commit was SVN r8167.
2005-11-16 17:10:49 +00:00
Tim Woodall
59d8c791d9 return fragments to free list
This commit was SVN r8121.
2005-11-11 17:48:56 +00:00
George Bosilca
c802d54696 The return type is an int. Casting it to a size_t before checking if it's bigger than zero leads to a true condition ... always ...
This commit was SVN r8114.
2005-11-11 06:34:14 +00:00
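The pitfall behind r8114 can be reproduced in a few lines. This is a sketch with hypothetical function names, not the code that was changed: a negative int converted to size_t wraps to a very large unsigned value, so the cast version is true for every nonzero return code, including errors.

```c
#include <stddef.h>

/* Buggy check: the cast turns a negative return code into a huge value. */
int has_bytes_buggy(int rc)
{
    return (size_t)rc > 0;
}

/* Fixed check: compare the signed value directly. */
int has_bytes_fixed(int rc)
{
    return rc > 0;
}
```

For a return code of -1 the buggy check reports success while the fixed check does not; both agree for 0 and for positive values.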
Brian Barrett
878676218e Rename opal/memory to opal/memoryhooks because XLC++ on Mac OS X is broken.
When compiling C++ code that includes something that looks for the C++
header file "memory" (stupid C++ headers not having .h extensions), it
goes through the header file search path, which includes $(topsrcdir)/opal,
so it finds the directory $(topsrcdir)/opal/memory/ and tries to load
that as the memory header file and all goes downhill.

This commit was SVN r8111.
2005-11-11 00:26:27 +00:00
Josh Hursey
5fa34df9ce Fix for orted / MPI_Abort problem reported from testers. They were seeing orteds
spinning in orte_iof_base_flush() when running
  intel_tests/src/MPI_Errhandler_fatal_c

When we close an endpoint by taking it out of the event handler, we need to make
sure that it fits the criteria to pass through orte_iof_base_flush(); specifically,
make sure we clean out the ep_frags list.
Note: This is more of a sanity check, since the endpoint should already be
      in this state at the point of closure.

Secondly in orte_iof_base_endpoint_read_handler(), if we determine that it is 
necessary to close the endpoint we have to "return" after doing so, otherwise
we add another frag to the endpoint which will cause it to hang in 
orte_iof_base_flush().

Bug go squish!

This commit was SVN r8109.
2005-11-11 00:09:07 +00:00