Commit graph

68 commits

Author SHA1 Message Date
Brian Barrett
b07dfa7841 * remove unused variable in ompi_comm_get_rprocs
* don't load data into a buffer until we have the data, as
    the data contains some header information needed to
    properly load the data

This commit was SVN r12792.
2006-12-07 16:19:44 +00:00
Brian Barrett
33320b7165 Rework the opal_progress interface to better support dynamic processes and at
the same time, remove some of the MPI-related options from OPAL:

  - provide mechanism to change at runtime whether sched_yield() should 
    be called when the progress engine is idle
  - provide mechanism for changing the rate at which the event engine
    is called when there are "no" users of the event engine (ie, when
    using MPI but not TCP)
  - fix some function names in the progress engine to better match
    their intended use (and remove MPI naming scheme)
  - remove progress_mpi_enable / progress_mpi_disable because 
    we can now use the functions to set the sched_yield and
    tick rate interfaces
  - rename opal_progress_events() to opal_progress_set_event_flag()
    because the first really isn't descriptive of what the function
    does and I always got confused by it

This commit was SVN r12645.
2006-11-22 02:06:52 +00:00
Ralph Castain
6d6cebb4a7 Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.

I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).

This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
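The job "family" idea from r12597 (parent, root, and descendants of a job) can be sketched as a small parent table. This is only an illustration: every name below (`parent_of`, `family_root`, `family_collect`, and so on) is hypothetical and is not the ORTE name-service API, and the fixed-size table is an assumption made to keep the sketch short.

```c
#include <stddef.h>

/* Hypothetical illustration only; these are not ORTE name-service calls. */
#define MAX_JOBS 16
#define NO_JOB   (-1)

/* parent_of[j] is the job that spawned j, or NO_JOB for an original job. */
static int parent_of[MAX_JOBS];

static void family_init(void) {
    for (int j = 0; j < MAX_JOBS; j++) parent_of[j] = NO_JOB;
}

/* Record that `parent` spawned `child` (e.g. via comm_spawn). */
static void family_record_spawn(int parent, int child) {
    parent_of[child] = parent;
}

/* Walk up the lineage until a job with no parent is reached. */
static int family_root(int job) {
    while (parent_of[job] != NO_JOB) job = parent_of[job];
    return job;
}

/* Return 1 if `job` descends from `ancestor`, directly or transitively. */
static int family_is_descendant(int job, int ancestor) {
    for (int p = parent_of[job]; p != NO_JOB; p = parent_of[p])
        if (p == ancestor) return 1;
    return 0;
}

/* Collect `job` plus all of its descendants into `out`, which is the set
   a terminate request with an "include descendants" attribute would
   need to act on. Returns the number of jobs written. */
static size_t family_collect(int job, int *out, size_t cap) {
    size_t n = 0;
    for (int j = 0; j < MAX_JOBS && n < cap; j++)
        if (j == job || family_is_descendant(j, job)) out[n++] = j;
    return n;
}
```

With job 0 spawning jobs 1 and 3, and job 1 spawning job 2, `family_root(2)` yields 0 and `family_collect(1, ...)` yields jobs 1 and 2.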
Ralph Castain
9204747930 Add timing info to comm_spawn - timing collected and reported when OMPI_MCA_ompi_timing = 1 (or something other than zero).
This commit was SVN r12381.
2006-10-31 23:32:39 +00:00
George Bosilca
06563b5dec Last set of explicit conversions. We are now close to zero warnings on
all platforms. The only exceptions (and I will not deal with them
anytime soon) are on Windows:
- the write functions which require the length to be an int when it's
  a size_t on all UNIX variants.
- all iovec manipulation functions where the iov_len is again an int
  when it's a size_t on most of the UNIXes.
As these only happen on Windows, I think we're set for now :)

This commit was SVN r12215.
2006-10-20 03:57:44 +00:00
Ralph Castain
d0eb7d7216 Complete the attribute management functions.
Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system.

Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn).

This commit was SVN r12179.
2006-10-18 20:02:16 +00:00
Ralph Castain
f4a458532b This doesn't totally resolve the comm_spawn problem, but it helps a little. I'll continue working on it and hope to resolve it completely shortly. The issue primarily centers on where to start mapping the child job's processes, and how to deal with oversubscription that might result. At the moment, I am trying to resolve the first issue first (hey, that even sounds right!).
This change does a couple of things:

1. Since the USE_PARENT_ALLOC attribute is a directive regarding allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file.

2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface).

This commit was SVN r12164.
2006-10-18 14:01:44 +00:00
Ralph Castain
13227e36ab This commit looks a lot bigger than it is, so relax :-)
Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off.

To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place.

I used those capabilities in two places:

1. Added an attribute list to the rmgr.spawn interface.

2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h).

So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms.

This commit was SVN r12138.
2006-10-17 16:06:17 +00:00
Ralph Castain
1f7a5da3ce Bring singleton comm_spawn online.
This commit was SVN r12081.
2006-10-10 23:59:48 +00:00
Edgar Gabriel
ec55acd8f4 orte_rml.send_buffer returns the number of bytes sent or a negative value if
something went wrong. A positive number > 0 is, however, a correct value (in
contrast to orte_rml.recv_buffer, which really returns ORTE_SUCCESS or an
error code).

Note: this part of the code is correct on the 1.1 and 1.2 branches, so there is no need to move
this change to the release branches.

This commit was SVN r11897.
2006-09-29 20:28:45 +00:00
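The return-value convention described in r11897 (byte count on success, negative value on failure) can be illustrated as follows. This is a sketch under stated assumptions: `demo_send_buffer`, `DEMO_SUCCESS`, and `DEMO_ERROR` are hypothetical stand-ins, not the real orte_rml interface or ORTE constants.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the real orte_rml API or ORTE constants. */
#define DEMO_SUCCESS 0
#define DEMO_ERROR   (-1)

/* Mimics the described convention: bytes sent on success,
   a negative value on failure. */
static int demo_send_buffer(const void *buf, size_t len) {
    if (buf == NULL) return DEMO_ERROR;
    return (int)len;
}

/* Buggy check: treats any nonzero value, including a valid byte
   count, as an error. */
static int send_failed_buggy(int rc) {
    return rc != DEMO_SUCCESS;
}

/* Correct check for this convention: only a negative value is a failure. */
static int send_failed(int rc) {
    return rc < 0;
}
```

Comparing the result against a success constant is only appropriate for calls that return a status code rather than a byte count.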
George Bosilca
645790dd9c Pedantic...
This commit was SVN r11731.
2006-09-20 22:20:10 +00:00
George Bosilca
688a16ea78 A long-awaited patch. Get rid of the comm->c_pml_procs. It was (and that was
long ago) supposed to be used as a cache for accessing the PML procs. But in
all of the PMLs the PML proc contains only one field, i.e. a pointer to the ompi_proc.
This pointer can be accessed using the c_remote_group easily. Therefore, there is no
point in keeping the PML procs around. Slim fast commit ...

This commit was SVN r11730.
2006-09-20 22:14:46 +00:00
George Bosilca
20459bd982 Remove the HIDDEN flag. It is not used anywhere.
This commit was SVN r11729.
2006-09-20 20:57:10 +00:00
Ralph Castain
0ad0d84afd Add two new API functions to the RMGR, and modify the "spawn" API to support the enhanced MPI-2 functionality.
No implementation backs these new APIs - just placeholders for now.

This commit was SVN r11699.
2006-09-19 01:45:05 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00
George Bosilca
3f0a7cad9e The last patch for Windows support. Mostly casting and conversion to C++ friendly headers.
This commit was SVN r11400.
2006-08-24 16:38:08 +00:00
Ralph Castain
6d27fee3a2 Silence Cyrador...who had a valid complaint.
This commit was SVN r11282.
2006-08-21 14:26:11 +00:00
Ralph Castain
6bf06d4602 Fix connect-accept by cleaning up two minor bugs.
This commit was SVN r11260.
2006-08-18 21:12:03 +00:00
Ralph Castain
8c7f0ed9ae Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system.
Other changes:

1. Remove the old xcpu components as they are not functional.

2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.

This will require an autogen/configure, I'm afraid.

This commit was SVN r11228.
2006-08-16 16:35:09 +00:00
Ralph Castain
5dfd54c778 With the branch to 1.2 made....
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).

Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).

I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).

In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...

Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.

This commit was SVN r11204.
2006-08-15 19:54:10 +00:00
Ralph Castain
62e70e6b3a Enable the use of "prefix" for comm_spawn child processes. With this patch:
1. comm_spawn processes by default will inherit the "--prefix" from their parent job. Thus, the "--prefix" provided on the command line will be propagated automatically to any children.

2. application programs can override the default by providing their own "ompi_prefix" in the MPI_Info parameter passed to comm_spawn

This commit was SVN r11143.
2006-08-09 20:48:51 +00:00
Jeff Squyres
7f372b4e1f No functional changes -- only re-indent some portions of the code to
make it consistent with the indenting in the rest of the file
(otherwise it was quite difficult to understand -- saw this while I
was reviewing 11039).

This commit was SVN r11042.
2006-07-28 15:47:16 +00:00
David Daniel
45894aecee Adding support for MPI_Comm_spawn() to use the 'host' key in an MPI_Info
object if provided.

The associated value is a comma-separated list of hosts -- which must be
in the initial allocation -- and is used to populate the application
context map.

This commit was SVN r11039.
2006-07-27 23:45:33 +00:00
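The constraint stated in r11039 (the 'host' value is a comma-separated list, and every host must be in the initial allocation) can be sketched as a small check. The helper `hosts_in_allocation` and the host names in the usage note are hypothetical and are not taken from the Open MPI sources.

```c
#include <string.h>

/* Hypothetical helper, not Open MPI code: returns 1 if every host in the
   comma-separated list `hosts` appears in `alloc`, else 0. */
static int hosts_in_allocation(const char *hosts,
                               const char *const alloc[], size_t nalloc) {
    const char *p = hosts;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the next host name */
        int found = 0;
        for (size_t i = 0; i < nalloc; i++) {
            /* Compare lengths first so "node" does not match "node1". */
            if (strlen(alloc[i]) == len && strncmp(alloc[i], p, len) == 0) {
                found = 1;
                break;
            }
        }
        if (!found) return 0;
        p += len;
        if (*p == ',') p++;             /* skip the separator */
    }
    return 1;
}
```

For an allocation of node1, node2, and node3, the list "node1,node3" passes and "node1,node9" is rejected.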
Jeff Squyres
942f9e8f8d Fixes for ticket:14. Lengthy discussion is on that ticket and in a
comment in ompi_comm_invalid() in
source:/trunk/ompi/communicator/communicator.h.

Short version:
- ompi_comm_invalid() returns TRUE for MPI_COMM_NULL
- therefore MPI_COMM_C2F needs to explicitly check for MPI_COMM_NULL
  (because it uses ompi_comm_invalid())
- make ~20 MPI functions only call ompi_comm_invalid() instead of
  calling ompi_comm_invalid() *and* checking for MPI_COMM_NULL (~40 MPI
  functions already only called ompi_comm_invalid() -- we should be
  consistent)
- similar issue for ompi_win_invalid(), so I added a cross-referencing
  comment in win.h and fixed MPI_WIN_SET_NAME to only call
  ompi_win_invalid() (and not check for MPI_WIN_NULL)

This commit was SVN r9970.
2006-05-18 18:05:46 +00:00
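The ordering issue summarized in r9970 (a validity test that returns TRUE for the null handle, so a handle-conversion routine must check for the null handle first) can be sketched with miniature types. Everything here is hypothetical: `demo_comm_t`, its fields, the index values, and the function names are illustrative and are not the Open MPI definitions.

```c
#include <stddef.h>

/* Hypothetical miniature types; not the Open MPI definitions. */
typedef struct {
    int flags;    /* state bits */
    int f_index;  /* illustrative Fortran-side handle value */
} demo_comm_t;

#define DEMO_FLAG_FREED 0x1

/* Stand-in for a predefined null communicator handle. */
static demo_comm_t demo_comm_null = { 0, 2 };

/* Validity test that, as in the commit message, reports the null
   handle as invalid. */
static int demo_comm_invalid(const demo_comm_t *c) {
    return c == NULL || c == &demo_comm_null || (c->flags & DEMO_FLAG_FREED);
}

/* Conversion that must accept the null handle: it checks for it
   explicitly before applying the validity test. */
static int demo_comm_c2f(const demo_comm_t *c) {
    if (c == &demo_comm_null) return demo_comm_null.f_index;
    if (demo_comm_invalid(c)) return -1;
    return c->f_index;
}
```

Functions that should reject the null handle can call the validity test alone, which is the consistency the commit aims for.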
Edgar Gabriel
8c49f14dce fix a bug in the intercomm-split allgather emulation function.
This commit was SVN r9806.
2006-05-03 21:41:10 +00:00
Edgar Gabriel
f962ba2d89 fix the handling of the 'high' argument in Intercomm_merge. The logic
was unfortunately exactly the opposite way round.

This commit was SVN r9803.
2006-05-03 14:43:52 +00:00
George Bosilca
88037b456e We have nice macros for checking ...
This commit was SVN r9670.
2006-04-20 19:54:41 +00:00
George Bosilca
29fbf9e296 Add more information on the default name of the communicator. We will be
able to know how the communicator was created and from which parent.

This commit was SVN r9649.
2006-04-16 01:34:34 +00:00
Jeff Squyres
82d590629d After extensive conversations about this...
- My original patch stands: MPI_FINALIZE directly invokes the
  attribute callbacks on MPI_COMM_SELF
- We added some user-level checks to ensure that they don't call
  MPI_FINALIZE twice (this isn't really required, but it will prevent
  whacky segv's -- they'll at least get a nice error message)
- Removed the attribute callbacks on MPI_COMM_SELF from
  ompi_mpi_comm_finalize (i.e., we just moved them from
  ompi_mpi_comm_finalize to ompi_mpi_finalize -- we just moved this
  process up earlier in the MPI_FINALIZE sequence of events)
- Because there were so many conversations about this, here's the
  rationale:
  - MPI-2:4.8 says that we have to MPI_COMM_FREE MPI_COMM_SELF so that
    the attribute callbacks are invoked.
  - After considerable discussion, we came to the conclusion that
    FREE'ing COMM_SELF is not the issue -- calling the callbacks is
    the issue.
  - So it is sufficient for MPI_FINALIZE to directly invoke these
    attribute callbacks
  - The attribute callbacks are *not* invoked on other communicators
    because said communicators are not MPI_COMM_FREE'ed

This commit was SVN r9628.
2006-04-13 17:00:36 +00:00
George Bosilca
686cc9ef54 First cut of PERUSE. Right now we support all the Peruse definitions from
version 1.12. Since everything related to windows and files was removed in 2.0,
I prefer to add the complete files so that I have a trace in SVN for later.

This commit was SVN r9373.
2006-03-23 05:00:55 +00:00
Rainer Keller
9e1c5716b6 - opal_cube_dim does not return an error
This commit was SVN r9196.
2006-03-04 13:47:24 +00:00
Brian Barrett
2eb76ff0cd * finish the TEG/UNIQ/PTL removal
This commit was SVN r9118.
2006-02-23 00:39:01 +00:00
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
  - move files out of toplevel include/ and etc/, moving them into the
    sub-projects
  - rather than including config headers with <project>/include,
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h,
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
Ralph Castain
892b396d70 Ensure that standard triggers are defined for all job/process states so that users can subscribe to those they want to use. Modify the way that is done to avoid over-burdening the standard launch sequence since it doesn't need alerts from all those triggers.
This commit was SVN r8938.
2006-02-08 17:40:11 +00:00
Ralph Castain
4b9f015c0b Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list.
This commit was SVN r8912.
2006-02-07 03:32:36 +00:00
George Bosilca
6fb4ce5e2e Some dependency cleanups (they were on hold for a while).
This commit was SVN r8425.
2005-12-09 05:14:18 +00:00
Brian Barrett
d60c7695d3 * need to declare environ on OS X
* work around fact that num_env is a size_t.  Thankfully, OS X compiler
  caught this one.

This commit was SVN r8180.
2005-11-17 08:19:47 +00:00
Brian Barrett
028d1d179a push OMPI_* environment variables to spawned processes, similar to what we
do for mpirun/orterun.  This will allow -mca btl foo,self to work as 
expected when doing MPI_COMM_SPAWN and friends.

This should be pushed to the v1.0 branch

This commit was SVN r8170.
2005-11-16 22:20:33 +00:00
Edgar Gabriel
b3d3552900 Fix for a problem Brian pointed out with cartesian communicators: in
comm_fill_rest there is no need for calling ompi_set_group_rank, since
we know already the rank of the process in the new comm. In case the
process was not part of the new communicator (rank = MPI_UNDEFINED)
calling this function caused a segfault on some platforms.

This commit was SVN r8060.
2005-11-09 21:00:58 +00:00
Jeff Squyres
42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00
Rainer Keller
d6120d32d6 - Only minor white-space changes, to clean up
This commit was SVN r7843.
2005-10-24 10:36:16 +00:00
Brian Barrett
1302cb4072 The next in a long line of crazed build system changes from Brian. This was
originally suggested by Ralf Wildenhues, to try to speed autogen, configure,
and make (and possibly even make install).  Use automake's include directive
to drastically reduce the number of Makefile files (although the number of
Makefile.am files is the same - most are just included in a top-level
Makefile.am).  Also use an Automake SUBDIRs feature to eliminate the
dynamic-mca tree, which was no longer really needed.  This makes adding
a framework easier (since you don't have to remember the dynamic-mca
tree) and makes building faster (as make doesn't have to recurse through
the dynamic-mca tree)

This commit was SVN r7777.
2005-10-17 00:21:10 +00:00
Jeff Squyres
84feccd3d5 This is something I forgot to commit from long ago -- already
discussed and cleared with Edgar.

Ensure that only processes who will be in the new communicator call
the coll selection function.  It is pointless (and Bad in some cases)
for processes who are not in the new communicator to try to select a
coll module for the new communicator.

This commit was SVN r7573.
2005-10-01 11:57:17 +00:00
Josh Hursey
e825b4522f Upon further investigation the fix in r7537 was an anomaly of zero'ing out the
bits to expose the low bits being set. We were casting from a size_t to a void*
which is not good when working with big endian machines.

This fix makes MPI 2 dynamics work on PPC 64 (tested with a Linux OS).

This commit was SVN r7538.

The following SVN revision numbers were found above:
  r7537 --> open-mpi/ompi@fd45714c03
2005-09-28 23:50:42 +00:00
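Why r7537 only appeared to work, and why the problem shows up on big-endian machines, can be illustrated with a byte-level sketch. The commit message only mentions a size_t being passed through a void*; the assumption made here is that the 8-byte value was then read as a narrower 4-byte integer. The helper names are hypothetical, and byte order is simulated explicitly so the result does not depend on the host.

```c
#include <stdint.h>

/* Lay out a 64-bit value in little-endian byte order. */
static void store_le64(uint64_t v, unsigned char out[8]) {
    for (int i = 0; i < 8; i++) out[i] = (unsigned char)(v >> (8 * i));
}

/* Lay out a 64-bit value in big-endian byte order. */
static void store_be64(uint64_t v, unsigned char out[8]) {
    for (int i = 0; i < 8; i++) out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Interpret the first 4 bytes as a little-endian 32-bit integer, which is
   what reading an 8-byte object through a 4-byte pointer does on a
   little-endian host. */
static uint32_t first4_le(const unsigned char b[8]) {
    return ((uint32_t)b[0]) | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* The same read on a big-endian host: the first 4 bytes are the HIGH
   half of the 64-bit value. */
static uint32_t first4_be(const unsigned char b[8]) {
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | ((uint32_t)b[3]);
}
```

For a small value such as 42, the little-endian read recovers 42 while the big-endian read sees only the zeroed high half, so the mismatch goes unnoticed on little-endian machines.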
Josh Hursey
fd45714c03 For some reason we have to initialize this variable or bad things happen in the
comm->c_coll.coll_bcast of the rnamebuflen.

This fixes the threaded MPI 2 Dynamics stuff. Should be working great now! Yay!

This commit was SVN r7537.
2005-09-28 22:30:41 +00:00
Josh Hursey
75419313f7 check the return code and do something reasonable, instead of progressing and hanging on error
This commit was SVN r7531.
2005-09-28 06:13:51 +00:00
Tim Woodall
9279e4f882 use sync send to ensure message is received before exiting
This commit was SVN r7374.
2005-09-14 21:28:17 +00:00
George Bosilca
e3a8489dd0 Replace ompi_proc_t by struct ompi_proc_t to remove all dependencies on proc.h
This commit was SVN r7326.
2005-09-12 21:51:56 +00:00
Brian Barrett
15d48945c6 * fix communicator.h so that tree compiles again - needs to know what an ompi_proc_t is
This commit was SVN r7323.
2005-09-12 21:34:26 +00:00
George Bosilca
5caeb0295a Correct the includes and some indentation.
This commit was SVN r7322.
2005-09-12 20:36:04 +00:00