Commit graph

8278 commits

Author SHA1 Message Date
Brian Barrett
83df0ab0a4 * s/LAM\/MPI/Open MPI/g
This commit was SVN r10693.
2006-07-09 03:41:39 +00:00
Brian Barrett
41e144c879 Fix for ticket #92, bproc stdin being borked. The problem was that we were
using a pty for everything, which drops all buffered data on the floor when
close() is called on the daemon side, meaning EOF has some issues.  Instead,
do the same thing we do for other starters that use the fork() pls -- use
a pipe/fifo for stdin and stderr and a pty for stdout.  This is good enough
for what we need and avoids most of the issues with ptys.

This commit was SVN r10692.
2006-07-08 21:18:24 +00:00
Andrew Friedley
b7e0484c37 Give up on dat_ep_query() and instead manually send our address information across the wire after connection establishment.
I've introduced a race condition - seeing occasional LOCAL_LENGTH errors on the receive side.  I think I'm mixing up eager/max somehow - will look at it more on Monday.

This commit was SVN r10690.
2006-07-07 21:48:16 +00:00
Josh Hursey
13f1f4d86e fix a typo when checking the return code
This commit was SVN r10686.
2006-07-06 20:49:09 +00:00
Galen Shipman
5085061475 don't call unpack when we received directly into the user buffer.. the
convertor doesn't handle it properly
continue peeking until we don't get anything else.. 
close the endpoint before closing the library.. 
add a blocking send that uses mx_test .. 

This commit was SVN r10684.
2006-07-06 19:54:13 +00:00
Ralph Castain
bc7690bcb0 Fix the bproc allocator. This is just a bandaid for 1.x that will be fixed more thoroughly in 2.0.
Basically, the problem was that the allocator was grabbing everything on the cluster for which the user had access privilege. Thus, if a user had two sessions operable, each with its own allocation, mpirun in each session would grab both sets of nodes and use them. Not very polite.

This commit was SVN r10683.
2006-07-06 18:31:14 +00:00
Brian Barrett
cba9b1e6b7 * the Portals MTL is now stable enough to not have it ompi ignored
This commit was SVN r10682.
2006-07-06 18:26:48 +00:00
Brian Barrett
58ce434292 * remove the broken, defunct portals PML. Not needed anymore, since we can
do the same basic thing with the MTL design

This commit was SVN r10681.
2006-07-06 18:24:08 +00:00
George Bosilca
476c9e64df Don't keep multiple copies of the datatype and count. The only one we really need
is the one provided by the user. For the buffered send the real datatype used
for the communication is always MPI_BYTE and the count can be retrieved from
the req_bytes_packed field. This will decrease the size of the request by
one pointer and one size_t (8 bytes or 16 bytes depending on the architecture).

This commit was SVN r10680.
2006-07-06 17:58:25 +00:00
Brian Barrett
b7b93e48f5 * can definitely be optimized more, but add code for calling send for MTL
components that have a blocking send implementation

This commit was SVN r10679.
2006-07-06 16:37:59 +00:00
Brian Barrett
ef6b7e170f * make mtl datatype wrapper code inline functions
This commit was SVN r10678.
2006-07-06 15:58:07 +00:00
Galen Shipman
2217fd4003 reset receive request convertor for persistent requests
We can always call unpack.. 

This commit was SVN r10677.
2006-07-06 15:13:26 +00:00
Brian Barrett
ef8c6a249b * Fix up some direct-calling issues for the PML/MTL
This commit was SVN r10676.
2006-07-06 15:12:38 +00:00
Brian Barrett
95118f83f6 * complete all outstanding Portals events before shutting down
* Remove all knowledge of PML requests from the Portals MTL

This commit was SVN r10675.
2006-07-06 14:33:29 +00:00
Brian Barrett
26eee59032 * turns out that you should only call bsend_request_alloc or
bsend_request_init, but not both.  Otherwise, you don't free
  some buffer space and end up leaking buffers and ending in
  badness
* since you only call alloc() or init(), but not both, need to 
  restore reference counting in init()

This commit was SVN r10674.
2006-07-06 14:02:51 +00:00
Jeff Squyres
3d5d0959fa Remove unused variable, and therefore silence a compiler warning.
This commit was SVN r10673.
2006-07-06 10:44:04 +00:00
Gleb Natapov
e05ec69dc4 print "flush error" only once.
This commit was SVN r10672.
2006-07-06 08:03:01 +00:00
Gleb Natapov
9b0807e547 Put pending fragment on the right waiting list.
This commit was SVN r10671.
2006-07-06 07:51:23 +00:00
George Bosilca
01a59d68da Do not generate the XFER_BEGIN and XFER_END events if the length of
the data is zero, for both the receives and the sends.

This commit was SVN r10670.
2006-07-05 23:39:13 +00:00
Brian Barrett
c793ad0a3d unpack the amount received, not the amount we had space to receive.
This commit was SVN r10669.
2006-07-05 22:31:29 +00:00
Galen Shipman
c933c0f65f unpack the length actually received, not the length posted..
This commit was SVN r10668.
2006-07-05 22:16:46 +00:00
Brian Barrett
3e29949cc8 * Fix shutdown code in utcp portals code
* make all sends long sends for now in Portals MTL
* More optimized match check

This commit was SVN r10667.
2006-07-05 21:46:45 +00:00
Josh Hursey
b1da6f8bc4 A bit more cleanup for that last patch.
* num_children should really be an int instead of size_t
  since 'size_t' is not signed and num_children can (in rare cases)
  drop below 0, and don't want it to roll around to MAX_INT or some
  such.

 * I figured out that this problem only happened to me because I use
  the pls_fork_reap_timeout MCA parameter and thus the only time that
  the code in pls_fork_module.c to waitpid is executed is if this is
  not set to 0 (I had it set to 1 to give my procs time to exit). I
  adjusted the loop from while{...} to do{...}while; so that it is
  executed at least once for consistency.

 * de-register the SIGCHLD callback for the pid before we attempt
  to kill it, so that we don't leave the door open for both the
  waitpids (the one in the callback, and the one in this function)
  to race to see who can wait on the child.

 * Move the 'thread release' to outside the for loop for a bit of an
  optimization, and always set the value to 0 since we want to 
  finish after this function.

 * Added a help message for the case when we can't send a kill()
   signal to the process. Should never happen, but all is possible
   in the wild wild west of HPC.

This commit was SVN r10666.
2006-07-05 21:38:23 +00:00
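The first point in the commit above (making num_children an int rather than a size_t) can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from pls_fork_module.c; the sketch only shows why an unsigned counter that is decremented past zero wraps to a very large value (SIZE_MAX for size_t) instead of going negative.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decrement an unsigned counter.
 * Unsigned arithmetic wraps modulo 2^N, so 0 - 1 yields SIZE_MAX. */
size_t dec_unsigned(size_t n)
{
    return n - 1;
}

/* Hypothetical helper: decrement a signed counter.
 * 0 - 1 is simply -1, so a "<= 0" test still behaves as expected. */
int dec_signed(int n)
{
    return n - 1;
}

/* Returns 1 when a signed child counter says nothing is left to wait for. */
int no_children_left(int n)
{
    return n <= 0;
}
```

With a signed counter, `no_children_left(dec_signed(0))` is true, whereas `dec_unsigned(0)` produces a huge positive count that any "children remaining" check would treat as work still pending.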
Galen Shipman
fe480cd003 change mask bits and don't call convertor if we received directly into the
user buffer.. 

This commit was SVN r10665.
2006-07-05 21:10:09 +00:00
Jeff Squyres
429c25095e Fix for bug #176.
* Fix for two problems introduced by r10661:
   1. ensure to use the key ''after'' it is initialized (sigh).
   2. handle the case where we free the attrkey before it is fully
      initialized (i.e., some other error causes us to free it).  In
      this case, don't try to remove the key from the hash map,
      because it won't exist.
 * More accurate zeroing in the keyval constructor
   (ompi_attrkey_item_constructor)
 * Widen the scope of the alock such that the attrkey destructor does
   not need to acquire it.  Instead, assume that the caller already
   has it.
 * Add a comment about why the keyval may get destroyed as the result
   of deleting an attribute (so that I don't have to figure it out
   again the next time I read this code :-) )

This commit was SVN r10664.

The following SVN revision numbers were found above:
  r10661 --> open-mpi/ompi@fdba2c9df0
2006-07-05 20:23:08 +00:00
George Bosilca
6265625983 Generate the XFER_CONTINUE PERUSE event (or the receive) before unpacking the data.
This commit was SVN r10663.
2006-07-05 19:45:00 +00:00
Josh Hursey
696bb4a0c0 A partial fix for the hanging orted bugs (Ticket #177)
When we force an application to terminate (via CTRL-C to mpirun)
we send an out-of-band message to the orted to reap its children.
The fork PLS was doing an internal waitpid but never releasing or
updating the information and signaling the condition variable. So
the fork PLS callback for SIGCHLD registered with the event library
and this waitpid are in a bit of a race to 'waitpid' for the children.
Since the PLS callback was the only one that handled the signal properly
when it 'won' then things were great -- as in the normal termination case.
But when it 'lost' -- as in the abnormal termination case -- the orted
never received the proper signal that its children had gone away.

We want to preserve the internal fork PLS callback since it allows
for a timeout while waiting for the child, which the event library
won't do.

This allows both to exist, and behave properly.

This was introduced in r9068.

The ticket is still open since the orteds still hang in other
situations. This is a fix for one of the causes.

This commit was SVN r10662.

The following SVN revision numbers were found above:
  r9068 --> open-mpi/ompi@c2c2daa966
2006-07-05 19:37:29 +00:00
Jeff Squyres
fdba2c9df0 Per the analysis in bug #184, move some assignments around to effect
thread safety.  This is likely to be only the first of multiple steps
for complete thread safety in the MPI attribute code.  All tests
[continue to] pass the intel and ibm attribute tests.

Also renamed a variable from "attr" to "attrkey" to reflect that it's
a keyval, not an attribute.

This commit was SVN r10661.
2006-07-05 17:37:17 +00:00
Brian Barrett
4ee4acb6a6 * ignore some Cray-only code when not on the Cray machine
This commit was SVN r10660.
2006-07-05 17:16:27 +00:00
Brian Barrett
043153dad3 * fix opal_list_item_t -> ompi_free_list_item_t type change
This commit was SVN r10659.
2006-07-05 17:02:16 +00:00
Rainer Keller
23d3628691 - Declare and initialize the peruse_handle_list_lock
This commit was SVN r10656.
2006-07-05 13:48:25 +00:00
Jeff Squyres
789dd47d1b Sync NEWS with 1.1.1
This commit was SVN r10654.
2006-07-05 13:23:43 +00:00
Jeff Squyres
538965aeb0 Final merge of stuff from /tmp/tm-stuff tree (merged through
/tmp/tm-merge).  Validated by RHC.  Summary:

- Add --nolocal (and -nolocal) options to orterun
- Make some scalability improvements to the tm pls

This commit was SVN r10651.
2006-07-04 20:12:35 +00:00
George Bosilca
d2bf3844e9 Include the header file which defines opal_output.
This commit was SVN r10648.
2006-07-04 06:23:01 +00:00
George Bosilca
2bdb06b549 Force the request to NULL in order to avoid complaints from the compiler.
This commit was SVN r10647.
2006-07-04 06:20:13 +00:00
George Bosilca
402a03d229 Add a .h dependency in order to remove a warning when we compile without --enable-debug.
This commit was SVN r10646.
2006-07-04 04:53:38 +00:00
George Bosilca
9ac1a6cdb3 Remove the warnings. Now they are ompi_free_list_item not opal_list_item_t.
This commit was SVN r10645.
2006-07-04 04:21:16 +00:00
Brian Barrett
7d12f9119a * make sure to include post_configure.sh in the dist tarball, so that
direct calling the ob1 pml works properly.

This commit was SVN r10644.
2006-07-04 04:03:58 +00:00
Brian Barrett
27d9e26721 Fix for ticket #179. Print a reasonable error message if we fail to parse
the compiler data file.  Also, actually fix the bug by expanding out
datarootdir before letting it get in install_dirs.h.

This commit was SVN r10643.
2006-07-04 03:00:01 +00:00
Brian Barrett
47725c9b02 * Add new PML (CM) and network drivers (MTL) for high speed
interconnects that provide matching logic in the library.
  Currently includes support for MX and some support for
  Portals
* Fix overuse of proc_pml pointer on the ompi_proc structure,
  splitting into proc_pml for pml data and proc_bml for
  the BML endpoint data
* bug fixes in bsend init code, which wasn't being used by
  the OB1 or DR PMLs...

This commit was SVN r10642.
2006-07-04 01:20:20 +00:00
Josh Hursey
5c5ce7e051 When 'mca_oob_send_callback' accesses the callback 'orte_pls_rsh_terminate_job_cb'
with an error status (< 0) then the req buffer is NULL. Put checks around the
OBJ_RELEASE(req) calls so that we don't try to release NULL :/

This commit was SVN r10641.
2006-07-03 22:44:54 +00:00
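The guard described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the Open MPI code: `sketch_req_t`, `sketch_req_new`, `sketch_release`, and `sketch_send_cb` are hypothetical stand-ins for the request buffer, OBJ_RELEASE, and the callback, and the release helper is assumed to dereference its argument unconditionally.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted request buffer. */
typedef struct sketch_req {
    int refcount;
} sketch_req_t;

/* Allocate a request holding one reference; returns NULL on failure. */
sketch_req_t *sketch_req_new(void)
{
    sketch_req_t *req = malloc(sizeof(*req));
    if (NULL != req) {
        req->refcount = 1;
    }
    return req;
}

/* Drop one reference and free at zero. Dereferences req without
 * checking it, so passing NULL here would be undefined behavior. */
void sketch_release(sketch_req_t *req)
{
    if (--req->refcount == 0) {
        free(req);
    }
}

/* Callback sketch: with an error status (< 0) the request may be NULL,
 * so only release it when it actually exists.
 * Returns 1 if a reference was released, 0 if the release was skipped. */
int sketch_send_cb(int status, sketch_req_t *req)
{
    (void)status; /* the guard depends on req, not on the status value */
    if (NULL != req) {
        sketch_release(req);
        return 1;
    }
    return 0;
}
```

Guarding on the pointer rather than on the status keeps the callback safe even if some other error path also delivers a NULL request.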
Josh Hursey
d082a63734 Add some new OPAL functionality.
After seeing the ugliness that is removing directories in the
codebase I decided to push down this to the OPAL by extending the
opal/os_create_dirpath.(c|h) to contain some more functionality.

In this process I renamed 'os_create_dirpath' to 'os_dirpath' since it
is a bit more general now.

Added a few functions to:
 - check if a directory is empty
 - check to see if the access permissions are set correctly
 - destroy the directory at the end of the dirpath
   - By using a caller callback function (a la Perl, I believe)
     for every file, the caller can have fine grained control over
     whether a specific file is deleted or not.

This simplifies things a bit for orte_session_dir_(finalize|cleanup)
as it should no longer contain any of this functionality, but uses
these functions to do the work.

From the external perspective nothing has changed, from the 
developer point of view we have some cleaner, more generic code.

This commit was SVN r10640.
2006-07-03 22:23:07 +00:00
Josh Hursey
38df31e488 A bit of cleanup in the pretty_printing, making it a bit more sane.
Since we don't properly handle connecting/disconnecting from multiple
universes, only connect to the first one (or the user specified one).
This is a bug that needs to be fixed, but involves some deep magic in
ORTE.

Print the node segment upon request (-n option). 
{{{
Node Name | Arch | Cell ID |   State | Slots | Slots Max | 
-----------------------------------------------------------
  odin001 |      |       0 | Unknown |     2 |         4 | 
  odin002 |      |       0 | Unknown |     2 |         5 | 
  odin003 |      |       0 | Unknown |     2 |         6 | 
  odin004 |      |       0 | Unknown |     2 |         7 | 
}}}

Since node_slots_alloc and node_slots_inuse are not properly updated
in the GPR don't print those values.

This commit was SVN r10633.
2006-07-03 17:11:02 +00:00
Josh Hursey
fc72eb4a01 remove a residual warning
This commit was SVN r10628.
2006-07-03 15:16:15 +00:00
Jeff Squyres
06fb5fcce0 - Added a missing original entry in the changelog
- Just use the prefix in the % files list so that we a) grab the whole
  tree and b) it removes all the directories when the RPM is removed.
  Thanks to Bernard Li for reporting the problem.

This commit was SVN r10617.
2006-07-02 12:08:48 +00:00
Brian Barrett
0bd5acc51f * Fix for bus error in XGrid starter
This commit was SVN r10615.
2006-07-01 16:16:46 +00:00
Graham Fagg
f10c21b746 corrected mca param description and algorithm count
(now to find out why I have disallowed direct calling of the bm tree)

This commit was SVN r10603.
2006-06-30 23:22:49 +00:00
Josh Hursey
2edf1511fd Closes ticket #173 : Split name linking up for orte/ompi shared tools.
This moves the logic to create the symbolic links for:
 - mpirun
 - mpiexec
 - ompi-ps
 - ompi-clean
and their respective man pages to the ompi level from
the orte layer.

This is a bit pedantic, but orte shouldn't be doing the
work of ompi since that is a bit of an abstraction break.

Note: need to autogen.sh to get this. Sorry :(

This commit was SVN r10602.
2006-06-30 22:01:56 +00:00
Graham Fagg
f64cbbe8f2 oops. some decisions used extent rather than size for decision making
yes this means it WAS possible for two nodes to choose two different algorithms
(discovered by Doug Gregor and figured out by George)
Also changed some names like size to comsize so we know which sizes we are using where
This should be updated in all versions

This commit was SVN r10601.
2006-06-30 21:49:04 +00:00
Brian Barrett
df9273587f * romio_cb_write should also be forced to enable when optimizations are
requested

This commit was SVN r10584.
2006-06-30 15:06:10 +00:00