1
1
Граф коммитов

247 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
7102d7c5f7 ick - brain is fried. take that test out as it isnt needed on a regular basis
This commit was SVN r27875.
2013-01-19 14:48:31 +00:00
Ralph Castain
38786457cb Add new test
This commit was SVN r27874.
2013-01-19 14:46:23 +00:00
Ralph Castain
c1690f403e Remove non-existent file
This commit was SVN r27730.
2012-12-29 02:21:50 +00:00
Ralph Castain
68329b516c Cleanup stale test codes
This commit was SVN r27729.
2012-12-28 16:52:51 +00:00
Jeff Squyres
f779b1ded9 Put back the static-library-detection stuff from r27668, with some
additional functionality.  Rationale (refs trac:3422):

 * Normal MPI applications only ever use the MPI API. Hence, -lmpi is
   sufficient (they'll never directly call ORTE or OPAL
   functions). This is arguably the most common case.
 * That being said, we do have some test programs (e.g., those in
   orte/test/mpi) that call MPI functions but also call ORTE/OPAL
   functions. I've also written the occasional MPI test program that
   calls opal_output, for example (there even might be a few tests in
   the IBM test suite that directly call ORTE/OPAL functions).
   * Even though this is not a common case, these applications should
     also compile/link with mpicc.
   * So we should add a --openmpi:linkall option that will also link
     in whatever is necessary to call ORTE/OPAL functions
   * Yes, we could hard-code "-lopen-rte -lopen-pal" in Makefiles, but
     we do reserve the right to change those library names and/or add
     others someday, so it's better to abstract out the names and let
     the wrapper supply whatever is necessary.
 * ORTE programs, however, are different. They almost always call OPAL
   functions (e.g., if they want to send a message, they must use the
   OPAL DSS). As such, it seems like the ORTE programs should always
   link in OPAL.

Therefore:

 * Add undocumented --openmpi:linkall flag to the wrapper compilers.
   See the comment in opal_wrapper.c for an explanation of what it
   does.  This flag is only intended for Open MPI developers -- not
   end users.  That's why it's undocumented.
 * Update orte/test/mpi/Makefile.am to add --openmpi:linkall
 * Make ortecc/ortec++'s wrapper data text files always explicitly
   link in libopen-pal

This commit was SVN r27670.

The following SVN revision numbers were found above:
  r27668 --> open-mpi/ompi@cf845897aa

The following Trac tickets were found above:
  Ticket 3422 --> https://svn.open-mpi.org/trac/ompi/ticket/3422
2012-12-13 22:31:37 +00:00
Ralph Castain
e11f32038a Add an MCA param to retain all aliases based on IP addrs for node names so that procs can look them up by interface, if desired. If the param is set, pass aliases around to all daemons and procs for local use
This commit was SVN r27619.
2012-11-16 04:04:29 +00:00
Ralph Castain
fe6dfad625 Update DFS to support multi-node operations
This commit was SVN r27594.
2012-11-12 02:54:53 +00:00
Ralph Castain
bd887f7f56 Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations.
This commit was SVN r27587.
2012-11-10 14:09:12 +00:00
Ralph Castain
a080de188f Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists.
This commit was SVN r27516.
2012-10-29 23:11:30 +00:00
Ralph Castain
4e52a15e70 Provide for sync on seek and close DFS operations. Eliminate an unnecessary wake-up timer when using ORTE progress thread
This commit was SVN r27500.
2012-10-26 15:49:04 +00:00
Ralph Castain
df642f1508 Add an API to get a remote file's size. Separate dfs cmds from returned data messages so daemons don't get confused.
This commit was SVN r27487.
2012-10-25 22:23:08 +00:00
Ralph Castain
094d6f3143 Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them.
Add a set of URI create/parse utilities

This commit was SVN r27483.
2012-10-25 17:15:17 +00:00
Ralph Castain
90d7b5fdca Update test
This commit was SVN r27354.
2012-09-20 02:51:27 +00:00
Ralph Castain
5f7a5c4793 Update test to include all keys
This commit was SVN r27311.
2012-09-12 05:02:51 +00:00
Ralph Castain
cd8aff675b Update test
This commit was SVN r27303.
2012-09-11 20:32:43 +00:00
Ralph Castain
0e1dbe8711 Remove non-existent files
This commit was SVN r27136.
2012-08-25 01:29:17 +00:00
Ralph Castain
c8b511d18a Remove stale tests
This commit was SVN r27126.
2012-08-24 02:22:11 +00:00
Ralph Castain
3c13176aa7 Remove test code
This commit was SVN r27114.
2012-08-22 21:36:54 +00:00
Ralph Castain
335c0eafcf Add a filem test program and set ignores
This commit was SVN r27069.
2012-08-16 17:46:46 +00:00
Jeff Squyres
96f640a762 Add new "opal_hotel" class. Abstractly speaking, this class does the
following:

 * Provides a fixed number of resource slots (i.e., "hotel rooms").
 * Allows one thing to occupy a resource slot at a time (i.e., each
   hotel room can have an occupant check in to that room).
 * Resource slots can be vacated at any time (i.e., occupants can
   voluntarily check out of their hotel room).
 * Resource slots can be occupied for a specific maximum amount of
   time.  If that time expires, the occupant is forcibly evicted and
   the upper layer is notified via (libevent) callback (i.e., the maid
   will kick an occupant of out of their room when their reservation
   is over).

This class can be to be used for things like retransmission schemes
for unreliable transports.  For example, a message sent on an
unreliable transport can be checked in to a hotel room.  If an ACK for
that message is received, the message can be checked out.  But if the
ACK is never received, the message will eventually be evicted from its
room and the upper layer will be notified that the message failed to
check out in time (i.e., that an ACK for that message was not received
in time).

Code using this class is currently being developed off-trunk, but will
be coming to SVN soon.

This commit was SVN r27067.
2012-08-16 17:29:55 +00:00
Ralph Castain
c90b7380c1 Sigh - of course, they changed the name of the silly MPI_Info object in the final standard, but not in the proposal. So change to the new MPI_INFO_ENV name. Also, don't set unknown values to "N/A", but just leave them unset.
This commit was SVN r27012.
2012-08-12 05:00:57 +00:00
Ralph Castain
cb48fd52d4 Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions:
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context

Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.

This commit was SVN r27005.
2012-08-12 01:28:23 +00:00
Ralph Castain
6ee35e4977 Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well
This commit was SVN r26931.
2012-07-31 21:21:50 +00:00
Ralph Castain
9680c52f5e Add mrplus examples to tarball
This commit was SVN r26643.
2012-06-23 02:40:46 +00:00
Ralph Castain
40c2fc5f55 Update the tests, add a couple
This commit was SVN r26379.
2012-05-02 19:00:05 +00:00
Ralph Castain
8f7bf3344a Update test
This commit was SVN r26370.
2012-05-01 18:38:44 +00:00
Ralph Castain
4542070cf2 Add event priority inversion test
This commit was SVN r26369.
2012-05-01 16:42:22 +00:00
Ralph Castain
f68487016c Add test code from Terry. Properly terminate if we don't abort on non-zero exit
This commit was SVN r26271.
2012-04-16 16:44:23 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
534d70025f Cleanup the detection of process binding during mpi_init. There are several cases that need to be checked:
1. no binding support - indicated by a negative return code from get_cpubind

2. binding supported, but not bound - the bitset returned by get_cpubind is the same as the available cpuset

3. binding supported and bound - bitset from get_cpubind is a subset of available cpuset

4. only one cpu is available - in this case, get_cpubind matches the available cpuset, but we are effectively bound

This commit was SVN r25957.
2012-02-17 21:18:53 +00:00
Ralph Castain
10f94efbda Add binding output to test
This commit was SVN r25955.
2012-02-17 16:48:01 +00:00
Ralph Castain
15facc4ba6 Fix comm_spawn yet again...add another test
This commit was SVN r25579.
2011-12-06 20:15:40 +00:00
Ralph Castain
6310361532 At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement

The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.

In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:

1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.

2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.

3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.

As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.

This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
Samuel Gutierrez
3ea59cce96 minor cleanup to getenv_pmi.c.
This commit was SVN r25449.
2011-11-07 03:18:07 +00:00
Samuel Gutierrez
e03bc93fb7 only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code.
This commit was SVN r25447.
2011-11-06 17:28:40 +00:00
Ralph Castain
198e001554 Add another test
This commit was SVN r25415.
2011-11-02 15:59:16 +00:00
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Wesley Bland
09274cd047 Make sure that the epoch is initialized everywhere so we don't get weird output
during valgrind. This shouldn't have caused any problems with any actual
execution. Just extra warnings in valgrind.

This commit was SVN r25015.
2011-08-08 15:11:55 +00:00
Ralph Castain
7b9f958dcf Add some missing error strings. Update test to show silent errors
This commit was SVN r25010.
2011-08-08 04:21:02 +00:00
Ralph Castain
590ac70e88 Add a simple test program for error string output
This commit was SVN r25007.
2011-08-07 21:32:25 +00:00
Ralph Castain
8853e0e80a Fix regular expression analyzer for slurmd - use a slurm-specific version
Fix multi-node routing for daemon startup when static ports are not set

This commit was SVN r24898.
2011-07-13 22:49:56 +00:00
Ralph Castain
1ee7c39982 Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.

Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.

This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Wesley Bland
e1ba09ad51 Add a resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
138928fcf4 Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces.
This commit was SVN r24666.
2011-04-29 18:46:40 +00:00
Ralph Castain
80ef1af8ba Add psm key generator program
This commit was SVN r24197.
2010-12-30 20:54:58 +00:00
Ralph Castain
58e711a412 Update a test and add two new ones for testing event lib thread support
This commit was SVN r24051.
2010-11-13 15:39:28 +00:00
Ralph Castain
703684e071 Output the mca params for debug purposes
This commit was SVN r24042.
2010-11-11 20:06:29 +00:00
Ralph Castain
a47b33678b Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi.
Use threads to help process multicast messages.

This commit was SVN r24009.
2010-11-08 19:09:23 +00:00
Ralph Castain
bf665692c3 Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system.
This commit was SVN r24004.
2010-11-07 23:29:52 +00:00
Ralph Castain
9ea2b196ce Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions.
Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component.

Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work.

This commit was SVN r23966.
2010-10-28 15:22:46 +00:00
Ralph Castain
86c7365e8e Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem.
Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution.

This commit was SVN r23943.
2010-10-26 02:41:42 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
Ralph Castain
248320b91a Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started.
Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them.

Add a test provided by Philippe that tests this ability.

This commit was SVN r23438.
2010-07-20 04:22:45 +00:00
Ralph Castain
f3d90dfb8d Fully restore fault recovery, both at the individual process and daemon level.
NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes.

This commit was SVN r23335.
2010-07-01 19:45:43 +00:00
Ralph Castain
6cbe947810 Modify the multicast scheme so that applications have separate input and output channels to avoid cross-talk. Update the multicast test to conform.
This commit was SVN r23271.
2010-06-15 03:50:31 +00:00
Ralph Castain
bb602694e6 Add a new example program, update cisco platform file
This commit was SVN r23262.
2010-06-09 18:21:06 +00:00
Ralph Castain
d80c90c7b9 Include missing tests
This commit was SVN r23244.
2010-06-07 14:15:00 +00:00
Ralph Castain
ab6e06f5b3 Reorganize the rmcast code to capture common code elements. Increase max msg size for spread and udp transports. Cleanup the spread configuration doc.
This commit was SVN r23207.
2010-05-25 22:36:57 +00:00
Ralph Castain
88f5217a12 Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired).
Add new mca params to test:

orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch
orte_debugger_test_attach: Test debugger colaunch after debugger attachment

To test co-launch at job start, just set the orte_debugger_test_daemon param.

To test co-launch upon attach:
set orte_debugger_test_daemon
set orte_debugger_test_attach=1
set orte_enable_debug_cospawn_while_running=1
set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching

Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon.

This commit was SVN r23144.
2010-05-14 18:44:49 +00:00
Ralph Castain
8e7faf9119 Add a new test for the db framework, fix some minor bugs in the daemon module
This commit was SVN r23085.
2010-05-04 02:38:11 +00:00
Ralph Castain
9dfb5c7c62 Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set
This commit was SVN r23082.
2010-05-03 04:11:03 +00:00
Ralph Castain
c62418d76d Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why.
Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status.

This commit was SVN r23071.
2010-04-29 19:58:44 +00:00
Ralph Castain
55889934d8 After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon.
Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly.

This commit was SVN r23045.
2010-04-27 03:39:32 +00:00
Josh Hursey
e4f2d03d28 ErrMgr Framework redesign to better support fault tolerance development activities.
Explained in more detail in the following RFC:
  http://www.open-mpi.org/community/lists/devel/2010/03/7589.php

This commit was SVN r22872.
2010-03-23 21:28:02 +00:00
Ralph Castain
0b9552cd4e Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly.
This commit was SVN r22870.
2010-03-23 20:47:41 +00:00
Ralph Castain
9a5fdbb622 Continue development of reliable multicast
This commit was SVN r22616.
2010-02-14 19:20:56 +00:00
Ralph Castain
09763ec711 Since we modified ORTE to declare that any process that terminates after calling "init" while at least one other process has not yet called "init" is an error, we have to ensure that non-MPI ORTE apps (i.e., apps that call orte_init but not mpi_init) include a barrier in orte_init. Otherwise, fast ORTE apps almost always wind up triggering the "abnormal termination" condition.
The barrier is protected with a test to ensure that MPI apps don't execute it and wind up doing two barriers during their init.

This commit was SVN r22378.
2010-01-07 06:58:01 +00:00
Ralph Castain
ef1bfaa823 Add the ability to track how many times a process has been restarted, and to communicate that value to a process when it is restarted in case it needs to take action when it is restarted as opposed to being started for the first time.
This commit was SVN r22377.
2010-01-07 01:19:44 +00:00
Ralph Castain
06d1f2cfe2 Add some new tests to the ORTE collection
This commit was SVN r22328.
2009-12-17 19:30:57 +00:00
Ralph Castain
4026a9c873 Update all the tests to the new orte_init API
This commit was SVN r22263.
2009-12-04 04:31:06 +00:00
Ralph Castain
4a82dd9a45 Add message sequence numbers to multicast messages, tracked by channel
This commit was SVN r22262.
2009-12-04 04:17:44 +00:00
Ralph Castain
ae3e9f2aee Update the spin.c test
This commit was SVN r22259.
2009-12-03 04:46:31 +00:00
Ralph Castain
93ebed48b1 Update the multicast test. Some cleanups to the basic rmcast module
This commit was SVN r22257.
2009-12-03 04:30:58 +00:00
Ralph Castain
a0d5c80ce0 Add a new framework for discovering local resource information such as cpu type/model, #cpus, available physical memory, etc. Two initial components (darwin and linux) are provided. This is needed to support bootstrap operations where daemons are started at node boot, and applications where initial knowledge of cpu identification is needed to guide framework component selection.
Add orte configuration option to control the use of the framework in the system. Although the code will build, it will not be active unless configured with --enable-bootstrap.

If bootstrap is enabled and the new opal_sysinfo framework can successfully determine the cpu model, pass that info to the application as an MCA param to support some work at Sun.

Also, have daemons report back the resources they find to guide process mapping in bootstrap operations (i.e., where the daemon starts at node boot as opposed to being launched at application start).

Adjust some platform files to enable these capabilities.

This commit was SVN r22244.
2009-11-30 23:11:25 +00:00
Ralph Castain
92733b13d9 Add a couple of new tests to the orte system.
Modify the job_complete check so we don't kill jobs when a single proc was terminated by ORTE command via plm.terminate_procs

Still dies gracefully with a ctrl-c, and behaves as before when using plm.terminate_job

This commit was SVN r22227.
2009-11-20 01:47:49 +00:00
Ralph Castain
840766a894 Update the rmcast APIs to include tag params and reorder them to look like their rml cousins
This commit was SVN r22218.
2009-11-17 15:58:59 +00:00
Ralph Castain
a2f3a47b92 Update the orte_mcast test
This commit was SVN r22214.
2009-11-11 22:11:19 +00:00
Ralph Castain
7afd65d631 Add a couple of test programs
This commit was SVN r22137.
2009-10-24 01:00:38 +00:00
Ralph Castain
b7a0125bb7 Add a test for the new opal if.c functions. Modify the multicast test
This commit was SVN r22081.
2009-10-09 15:25:18 +00:00
Ralph Castain
26bb6e8f79 Add a couple of non-orte multicast tests
This commit was SVN r22001.
2009-09-23 05:24:22 +00:00
Ralph Castain
c3f9096fd9 Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast.
This commit was SVN r21988.
2009-09-22 00:58:29 +00:00
Ralph Castain
82af6ee940 Update test
This commit was SVN r21987.
2009-09-22 00:55:02 +00:00
Ralph Castain
0e528e994f Revert last commit - went to wrong repo!
Didn't we just have that happen the other day too? :-)

This commit was SVN r21878.
2009-08-25 13:06:14 +00:00
Ralph Castain
15d12b240b Sync to r21876
This commit was SVN r21877.

The following SVN revision numbers were found above:
  r21876 --> open-mpi/ompi@ef970293f0
2009-08-25 13:04:12 +00:00
Ralph Castain
511fe5da8b Minor cleanup - get reported iters right in test
This commit was SVN r21819.
2009-08-14 03:33:59 +00:00
Ralph Castain
9da6b46e7d Add several options to the sendrecv_blaster to make it more powerful
This commit was SVN r21817.
2009-08-14 03:12:43 +00:00
Ralph Castain
007cbe74f4 Include the sendrecv blaster in the tarball
This commit was SVN r21790.
2009-08-11 02:42:47 +00:00
Rainer Keller
76469ea64a - Change the property of a few files, that obviously
don't need to be svn:executable...

This commit was SVN r21786.
2009-08-11 01:40:00 +00:00
Ralph Castain
c66a5a9504 Add another test that just blasts the system with MPI_Sendrecv to myself commands of varying sizes
This commit was SVN r21748.
2009-07-31 14:57:03 +00:00
Ralph Castain
ef20e778b3 Ensure that output ends on an appropriate suffix tag when --tag-output or --xml are selected.
When we read the input buffer, we don't always get a complete printf output - we sometimes end mid stream. We still need to add the suffix and a <CR> to keep the output working right.

This commit was SVN r21706.
2009-07-17 05:02:53 +00:00
Ralph Castain
bc0fe3c6da Add some more tests for parallel IO that have caused problems in the past.
Add a README that explains how to run the ziatest for launch timing

This commit was SVN r21576.
2009-07-01 14:47:14 +00:00
Ralph Castain
2fbdea0273 Add a test for loop over bcast
This commit was SVN r21560.
2009-06-29 17:06:19 +00:00
Ralph Castain
70b8c89b44 Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it".
Also complete implementation to printout app_context objects so we see all the fields.

This commit was SVN r21408.
2009-06-10 19:01:08 +00:00
Ralph Castain
86d55d7ebf Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1sec delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s).
Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn.

This commit was SVN r21390.
2009-06-08 18:28:26 +00:00
Greg Koenig
60485ff95f This is a very large change to rename several #define values from
OMPI_* to OPAL_*.  This allows opal layer to be used more independent
from the whole of ompi.

NOTE: 9 "svn mv" operations immediately follow this commit.

This commit was SVN r21180.
2009-05-06 20:11:28 +00:00
Ralph Castain
4be24521aa Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans.
Minor mod to the ompi layer to use the new #define's - just one-line name replacements.

This commit was SVN r21144.
2009-05-04 11:07:40 +00:00
Ralph Castain
7b420e32b6 Add some missing tests
This commit was SVN r21140.
2009-05-01 18:35:22 +00:00
Ralph Castain
dfb2146430 Perform the ziatest as a C program instead of a script - less trouble that way.
This commit was SVN r21132.
2009-04-30 18:43:26 +00:00
Ralph Castain
ce206df568 Fix the danged ziatest - thx to Jeff, the mighty perl guru!
This commit was SVN r21116.
2009-04-29 17:58:35 +00:00
Rainer Keller
221fb9dbca ... Delayed due to notifier commits earlier this day ...
- Delete unnecessary header files using
   contrib/check_unnecessary_headers.sh after applying
   patches, that include headers, being "lost" due to
   inclusion in one of the now deleted headers...

   In total 817 files are touched.
   In ompi/mpi/c/ header files are moved up into the actual c-file,
   where necessary (these are the only additional #include),
   otherwise it is only deletions of #include (apart from the above
   additions required due to notifier...)

 - To get different MCAs (OpenIB, TM, ALPS), an earlier version was
   successfully compiled (yesterday) on:
   Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled
   Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled
   Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled

This commit was SVN r21096.
2009-04-29 01:32:14 +00:00