Commit graph

268 commits

Author SHA1 Message Date
Josh Hursey
af9ccdf04a need to use get_first instead of get_begin since we don't want to execute
this loop if "nodes" is an empty list. get_first, in this loop context, 
allows us to do just that, while get_begin doesn't.

This fixes a --host problem that appeared on the Linux PPC64 build.

This commit was SVN r7703.
2005-10-11 21:33:04 +00:00
Josh Hursey
8ba2900341 fixed a typo, added comments for future work
This commit was SVN r7700.
2005-10-11 20:59:31 +00:00
Ralph Castain
e1244fc160 Fix a few thread-lock things discovered by Josh. The thread locks in the registry's local notify delivery system had not been updated to reflect the design change whereby the xcast uses the notify delivery system. This has now been fixed.
Also revised the callbacks to store and utilize local variables to avoid problems where threads modify the global structures. Not sure this totally fixes the problem, but it's a shot - suggested by Josh (and Jeff, I believe).

This commit was SVN r7694.
2005-10-11 19:35:04 +00:00
Ralph Castain
a47655b3fd Add unlock/lock around the delivery of a local callback to remove thread-lock condition if the callback function attempts to re-enter the registry.
This commit was SVN r7678.
2005-10-10 02:45:50 +00:00
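
The unlock/lock pattern described in a47655b3fd can be illustrated with a minimal Python sketch. The `Registry` class, its methods, and the callback signature are hypothetical names invented for this illustration and are not the ORTE registry API. The only point shown is that a non-reentrant lock is dropped before a callback is delivered, so a callback that re-enters the registry does not block on a lock its own thread already holds.

```python
import threading


class Registry:
    """Hypothetical registry guarded by a non-reentrant lock (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self._callbacks = []

    def subscribe(self, callback):
        with self._lock:
            self._callbacks.append(callback)

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        self._lock.acquire()
        try:
            self._data[key] = value
            callbacks = list(self._callbacks)
            # Release the lock before delivering callbacks so that a
            # callback may call get()/put() without deadlocking.
            self._lock.release()
            try:
                for callback in callbacks:
                    callback(self, key)
            finally:
                # Re-acquire so the outer finally releases a held lock.
                self._lock.acquire()
        finally:
            self._lock.release()
```

If `put()` delivered callbacks while still holding `_lock`, a callback calling `get()` would block forever, because `threading.Lock` is not reentrant.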
Ralph Castain
6c839048cf Fix a typo that caused valgrind to bark on 64-bit machines. Actually was a potential source of error, so the barking was legit.
This commit was SVN r7677.
2005-10-10 02:34:26 +00:00
Josh Hursey
d5ebb5c46a fix a compiler warning
This commit was SVN r7674.
2005-10-08 17:03:12 +00:00
Jeff Squyres
0629cdc2d7 Bring back the changes from /tmp/jjhursey-rmaps. Specific merge
command:

svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps .

(where "." is a trunk checkout)

The logs from this branch are much more descriptive than I will put
here (including a *really* long description from last night).  Here's
the short version:

- fixed some broken implementations in ras and rmaps
- "orterun --host ..." now works and has clearly defined semantics
  (this was the impetus for the branch and all these fixes -- LANL had
  a requirement for --host to work for 1.0)
- there is still a little bit of cleanup left to do post-1.0 (we got
  correct functionality for 1.0 -- we did not fix bad implementations
  that still "work")
  - rds/hostfile and ras/hostfile handshaking
  - singleton node segment assignments in stage1
  - remove the default hostfile (no need for it anymore with the
    localhost ras component)
  - clean up pls components to avoid duplicate ras mapping queries
  - [possible] -bynode/-byslot being specific to a single app context 

This commit was SVN r7664.
2005-10-07 22:24:52 +00:00
Tim Woodall
3c900a7aa2 - fix a deadlock on threaded build
- update sequence number after a partial write completes

This commit was SVN r7654.
2005-10-06 21:50:58 +00:00
Tim Woodall
a79e07390a remove debug
This commit was SVN r7653.
2005-10-06 21:29:47 +00:00
Tim Woodall
2ea71064ad close all file descriptors w/ the exception of stdin/stdout/stderr
otherwise, parent's file descriptors are inherited and held open by
the child even if the parent dies

This commit was SVN r7652.
2005-10-06 21:22:36 +00:00
Tim Woodall
797922fbab - cleanup on loss of connection to peer
- generate ack if no one to forward msg to

This commit was SVN r7651.
2005-10-06 21:21:26 +00:00
Tim Woodall
3280f6e655 add facility to receive callback on disconnection from peer
This commit was SVN r7650.
2005-10-06 19:39:20 +00:00
Jeff Squyres
65698bc6be Remove compiler warning
This commit was SVN r7635.
2005-10-05 10:23:02 +00:00
Jeff Squyres
0f100d8577 - Don't overwrite rc with the return value from pls_tm_disconnect --
  it's always ORTE_SUCCESS and sometimes masks real !=ORTE_SUCCESS rc
  values.
- Add MCA param pls_tm_want_path_check.  If nonzero (the default),
  check for the orted in the PATH before each tm_spawn()'ing (doing a
  little caching so that we don't hammer on the filesystem -- remember
  all the PATH's where we successfully found the orted so that we
  don't have to query the filesystem multiple times for a PATH where
  we previously found the orted)
- Be sure to opal_argv_split() the pls_tm_orted MCA param

This commit was SVN r7625.
2005-10-04 19:38:51 +00:00
Jeff Squyres
b79c46dbf6 Downgrade the default priority to 75, just to give leeway (same as the
slurm pls).

This commit was SVN r7624.
2005-10-04 19:18:52 +00:00
Jeff Squyres
eb24fe4fd8 If the job fails to launch properly, set its state to ABORTED, which
will fire some subscriptions that will eventually result in invoking
terminate_job (i.e., terminate anything that may have been
successfully started by launch).

This commit was SVN r7622.
2005-10-04 17:19:23 +00:00
Jeff Squyres
80399aff17 Add some README's to describe what these components are for.
This commit was SVN r7618.
2005-10-04 15:14:23 +00:00
Jeff Squyres
3df0828921 Restore this PLS -- LANL needs this for some of its older clusters.
This commit was SVN r7617.
2005-10-04 15:09:38 +00:00
Jeff Squyres
7645a0fa23 This is the old bproc launcher that is ok to remove.
This commit was SVN r7583.
2005-10-02 14:58:52 +00:00
Jeff Squyres
a9f24c27bd Restore bproc -- this was *not* the old one (didn't read Tim Prins'
mail carefully -- doh!)

This commit was SVN r7582.
2005-10-02 14:57:44 +00:00
Jeff Squyres
d44fc0fa2a - Clarify the help file text a little
- Remove an extraneous \n in opal_output() output

This commit was SVN r7581.
2005-10-02 11:58:51 +00:00
Jeff Squyres
91ed790715 Add --prefix processing for the tm pls
This commit was SVN r7580.
2005-10-02 11:58:18 +00:00
Jeff Squyres
da1c096883 Remove old, outdated bproc launcher.
This commit was SVN r7579.
2005-10-02 10:45:00 +00:00
Jeff Squyres
0459678f82 Fixes to make the SLURM pls handle --prefix properly
This commit was SVN r7569.
2005-09-30 21:44:05 +00:00
Jeff Squyres
e9ec846c68 Minor change to only display the prefix debug message at most once
This commit was SVN r7568.
2005-09-30 21:43:32 +00:00
Jeff Squyres
d172088dd3 Leave it up to users to do something that we hadn't planned on. :-)
If you use --prefix and then "-x LD_LIBRARY_PATH", the rsh pls would
take great pains to ensure that PATH and LD_LIBRARY_PATH were setup
correctly on the local and remote nodes, but then the fork pls would
blithely overwrite LD_LIBRARY_PATH with what the user exported (i.e.,
most likely without our prefix).  This patch takes care of that -- the
fork pls examines the incoming environment, and if it sees PATH or
LD_LIBRARY_PATH, it re-prefixes those variables.

This commit was SVN r7566.
2005-09-30 19:14:31 +00:00
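
The re-prefixing step described in d172088dd3 might look roughly like the following Python sketch. The function name `reprefix_env` and the choice of `bin` and `lib` subdirectories under the prefix are assumptions made for illustration; the commit message only says that PATH and LD_LIBRARY_PATH are re-prefixed when they appear in the incoming environment.

```python
import os


def reprefix_env(env, prefix):
    """Return a copy of env with prefix directories first in PATH/LD_LIBRARY_PATH.

    Variables that are absent from env are left absent, mirroring the
    "if it sees PATH or LD_LIBRARY_PATH" wording of the commit message.
    The bin/lib layout under prefix is an assumption of this sketch.
    """
    result = dict(env)
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        if var in result:
            wanted = os.path.join(prefix, subdir)
            # Drop empty entries and any existing copy of the prefixed dir,
            # then put the prefixed dir at the front.
            parts = [p for p in result[var].split(os.pathsep) if p and p != wanted]
            result[var] = os.pathsep.join([wanted] + parts)
    return result
```

The input mapping is not modified, so the caller can still inspect what the user originally exported.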
Andrew Friedley
82ee2933a5 - Add an opal_show_help() to the pls fork module to explain what went wrong when the execv to start the application fails.
- Add a couple opal_show_help()'s to indicate when not enough slots/nodes are available to satisfy a request.

This commit was SVN r7555.
2005-09-30 14:30:21 +00:00
Jeff Squyres
fcef1774d5 Per advice from Ralf W., change the pkgdata declarations in
Makefile.am's to be a *slightly* more correct (and, more importantly,
less error-prone) construct.

This commit was SVN r7554.
2005-09-30 13:32:39 +00:00
Josh Hursey
d39841174d Must release the lock before entering the non-blocking recv, since
it is possible that, if the message has already arrived, the callback will
be called before recv_buffer_nb() returns. This causes deadlock
as we try to acquire the lock, but already hold it.

This was causing orterun and orteds to stall in certain situations.
Became evident when stress testing dynamics with remote nodes.

This commit was SVN r7543.
2005-09-29 14:24:11 +00:00
Jeff Squyres
de1c8fb125 - Make debug output a bit more accurate and readable
- Fix bug identified by users: --prefix may also apply on the local
  node; we need to prefix the PATH and LD_LIBRARY_PATH environment
  variables before invoking execve()

This commit was SVN r7541.
2005-09-29 12:35:43 +00:00
Josh Hursey
c11ba09655 Remove the progress engine stuff from abort. This was causing
some orted's to stall on locks in the MPI Dynamics cases. Since it
is not essential that we call these functions, they can go away.

Unlock the peer lock when aborting. Otherwise there is a potential deadlock
in do_waitall [see comment in code]. This was causing orteds to
deadlock at times when the seed had terminated. With proper interleaving
and timing the orted was deadlocking. This seems to have fixed this in 
my stress testing with MPI 2 Dynamics.

This commit was SVN r7539.
2005-09-29 05:04:43 +00:00
Josh Hursey
4cf4b4ea86 Fix for MPI 2 dynamics.
The NS replica should give out tags that are over ORTE_RML_TAG_DYNAMIC
or it will overlap with other outstanding tags. This overlap was killing
MPI_Comm_spawn when a program tried to use it multiple times (> 3).

With this fix MPI_Comm_spawn is behaving properly.
A program can call it many times in a row without problem.

NOTE: Not tested for multi-threaded build yet

(A long time debugging for a one liner... :/)

This commit was SVN r7529.
2005-09-28 03:20:43 +00:00
Josh Hursey
a23370c007 Converted some MCA parameters from the old version to the new.
Have the ras_base_schedule_policy MCA parameter working once again. Before, it
would only do slot-based allocation, even if the MCA parameter was set properly.

Currently you can specify to orterun a node allocation by either:
-mca ras_base_schedule_policy node
-bynode

and slot allocation (which is the default) by:
-mca ras_base_schedule_policy slot
-byslot

This commit was SVN r7513.
2005-09-27 02:54:15 +00:00
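
The difference between the two policies named in a23370c007 can be sketched in Python. This assumes the usual meaning of the terms: "slot" fills every slot on a node before moving to the next node, while "node" places one process per node in round-robin order. The function `map_processes` and its arguments are hypothetical and do not reflect the actual ras/rmaps code; node names are assumed to be unique.

```python
def map_processes(nodes, nprocs, policy="slot"):
    """Assign nprocs ranks to nodes; nodes is a list of (name, slots) pairs.

    Returns a list of node names indexed by rank. Illustration only.
    """
    if policy not in ("slot", "node"):
        raise ValueError("policy must be 'slot' or 'node'")
    if nprocs > sum(slots for _, slots in nodes):
        raise ValueError("not enough slots for the requested processes")

    if policy == "slot":
        # Fill each node completely before moving on.
        placement = []
        for name, slots in nodes:
            placement.extend([name] * slots)
        return placement[:nprocs]

    # policy == "node": round-robin, one process per node per pass.
    remaining = {name: slots for name, slots in nodes}
    placement = []
    while len(placement) < nprocs:
        for name, _ in nodes:
            if len(placement) == nprocs:
                break
            if remaining[name] > 0:
                remaining[name] -= 1
                placement.append(name)
    return placement
```

With two nodes of two slots each and three processes, the slot policy puts ranks 0 and 1 on the first node, whereas the node policy alternates between the nodes.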
Brian Barrett
50dc5499b4 * fix some remaining --with-btl-portals configure issues
This commit was SVN r7498.
2005-09-24 00:11:40 +00:00
Tim Woodall
c38ebe2c6a support -H,-host,--host option for rsh/ssh launch
This commit was SVN r7484.
2005-09-22 16:09:23 +00:00
Andrew Friedley
cfa09dc0e7 Fix two more missing escapes.
Sorry about breaking the tree with typos, I think this should fix all of them.

This commit was SVN r7482.
2005-09-22 16:04:46 +00:00
Tim Woodall
194150b81c someone broke this...
This commit was SVN r7478.
2005-09-22 13:47:37 +00:00
Andrew Friedley
555ae37255 Add lib{opal,orte,mpi}.la to appropriate LIBADD's, some whitespace cleanup as well.
This commit was SVN r7477.
2005-09-22 12:28:54 +00:00
Tim Woodall
84e0d89497 correction
This commit was SVN r7447.
2005-09-20 19:20:39 +00:00
Ralph Castain
2656ec93b5 Fix a typo so that stage_gate_2 gets correctly passed back to orterun...
This commit was SVN r7446.
2005-09-20 19:12:59 +00:00
Ralph Castain
5686e8119e Move the error name macro to the errmgr framework. Add a second level of tracing. Remove an obsolete file.
This commit was SVN r7445.
2005-09-20 17:09:11 +00:00
Tim Woodall
29d14281c8 use the specified host names (if provided)
This commit was SVN r7442.
2005-09-20 13:33:11 +00:00
Tim Woodall
6c885acb91 corrections to handle host specifications
This commit was SVN r7441.
2005-09-20 13:32:08 +00:00
Tim Woodall
75d9119cf3 correction
This commit was SVN r7436.
2005-09-19 21:35:39 +00:00
Tim Woodall
e1ec160858 lookup available nodes based on mapping data (if available)
This commit was SVN r7435.
2005-09-19 21:31:00 +00:00
Tim Woodall
9c334800ad merge in environ from front-end node - giving precedence
to any user supplied values. otherwise, some c library
routines behave badly (getpwuid...)

This commit was SVN r7434.
2005-09-19 21:06:05 +00:00
George Bosilca
193120d434 Fix a deadlock in the case where we have to subscribe to get information about the peer. This function is called
with the mutex locked, and the subscription calls oob_send, which calls the lookup again,
so we deadlock because the mutex is already locked. The solution is to release the mutex before
going into the subscription. The logic to remove the item when something goes
wrong with the subscription is then a little more complex.

This commit was SVN r7429.
2005-09-19 15:59:46 +00:00
George Bosilca
703e874468 Remove a race condition. If this function is called by the progress thread then it does not have to
add an event; it can call the spawn function directly. This avoids waiting on a condition that
will never get released.

This commit was SVN r7428.
2005-09-19 15:54:53 +00:00
Ralph Castain
b589a93e29 Continue to lace the trace functionality into orte...
This commit was SVN r7427.
2005-09-19 15:29:14 +00:00
Tim Woodall
09869daf8e from the list of addresses exported by the peer, attempt to
pick an address on the same subnet. if none are found, give
up and try them in order

This commit was SVN r7426.
2005-09-19 14:47:11 +00:00
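
The address selection described in 09869daf8e can be sketched with Python's standard `ipaddress` module. The function name `order_peer_addresses` and the way the local interface is passed in (as an "address/prefix" string) are assumptions for illustration; the actual OOB code is not shown in this log.

```python
import ipaddress


def order_peer_addresses(peer_addrs, local_interface):
    """Order a peer's exported addresses, preferring the local subnet.

    peer_addrs: list of address strings exported by the peer.
    local_interface: local address with prefix length, e.g. "192.168.1.20/24".
    If no peer address is on the local subnet, the original order is kept.
    """
    network = ipaddress.ip_interface(local_interface).network
    same_subnet = [a for a in peer_addrs if ipaddress.ip_address(a) in network]
    if not same_subnet:
        # Nothing matched: give up on the preference and try them in order.
        return list(peer_addrs)
    others = [a for a in peer_addrs if a not in same_subnet]
    return same_subnet + others
```

A caller would then attempt connections in the returned order and stop at the first success.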