openmpi/orte/test/mpi
Ralph Castain 6310361532 At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement

The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.

In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:

1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.

2. Both cpus/rank and stride have been removed. Removal of the latter was demanded by those who didn't understand the purpose behind it - and I agreed, as the users who requested it are no longer using it. The former was removed temporarily pending implementation.

3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
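
The oversubscription default described in item 1 can be sketched as a small decision function. This is only an illustration under stated assumptions: the type and function names below (alloc_source_t, oversub_override_t, oversubscribe_allowed, placement_ok) are hypothetical and are not ORTE identifiers, and the real mapper takes much more into account.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not taken from the ORTE code. */
typedef enum {
    ALLOC_HOSTFILE_OR_DASH_HOST, /* nodes named via hostfile or dash-host */
    ALLOC_RESOURCE_MANAGER       /* nodes allocated by a resource manager (RM) */
} alloc_source_t;

typedef enum {
    OVERRIDE_NONE,   /* no flag given: fall back to the default */
    OVERRIDE_ALLOW,  /* user explicitly allows oversubscription */
    OVERRIDE_FORBID  /* user explicitly forbids oversubscription */
} oversub_override_t;

/* An explicit override always wins; otherwise the default depends on
 * where the allocation came from. */
static bool oversubscribe_allowed(alloc_source_t src, oversub_override_t ovr)
{
    if (ovr == OVERRIDE_ALLOW) {
        return true;
    }
    if (ovr == OVERRIDE_FORBID) {
        return false;
    }
    return src == ALLOC_HOSTFILE_OR_DASH_HOST;
}

/* A node is oversubscribed when it runs more processes than it has slots.
 * Returns true if placing nprocs on a node with the given slots is acceptable. */
static bool placement_ok(int nprocs, int slots,
                         alloc_source_t src, oversub_override_t ovr)
{
    if (nprocs <= slots) {
        return true;
    }
    return oversubscribe_allowed(src, ovr);
}
```

With these assumptions, placing 8 processes on a 4-slot node is accepted for a hostfile allocation and rejected for an RM allocation unless the user overrides the default.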

As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.

This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
..
spawn-problem Add some new tests to the ORTE collection 2009-12-17 19:30:57 +00:00
abort.c Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. 2006-09-14 21:29:51 +00:00
accept.c Repair the MPI-2 dynamic operations. This includes: 2008-07-03 17:53:37 +00:00
bad_exit.c These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. 2007-10-05 19:48:23 +00:00
bcast_loop.c Add a test for loop over bcast 2009-06-29 17:06:19 +00:00
cell_spawn.c Update the slave launch and cleanup procedures. Track what files have been moved to the slave node to avoid attempting to copy them multiple times on top of each other. Cleanup any pre-positioned files, kill any lingering apps, and cleanup the session directory area upon termination of the daemon. 2009-04-29 00:11:19 +00:00
concurrent_spawn.c Commit a few missing header files, etc. 2008-09-24 15:41:42 +00:00
connect.c Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory. 2008-04-16 14:27:42 +00:00
crisscross.c Add new test that stresses MPI send/recv 2008-09-09 15:47:31 +00:00
debugger.c Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired). 2010-05-14 18:44:49 +00:00
delayed_abort.c Add a delayed_abort test code. We seem to handle this case just fine now, but Sun reports still seeing troubles on Solaris. 2007-02-05 15:24:01 +00:00
early_abort.c Add some new tests to the ORTE collection 2009-12-17 19:30:57 +00:00
hello_barrier.c Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
hello_nodename.c Commit a few missing header files, etc. 2008-09-24 15:41:42 +00:00
hello_output.c - On the way to get the BTLs split out and lessen dependency on orte: 2009-02-14 02:26:12 +00:00
hello_show_help.c - On the way to get the BTLs split out and lessen dependency on orte: 2009-02-14 02:26:12 +00:00
hello.c Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
hello++.cc Add a couple of test programs 2009-10-24 01:00:38 +00:00
hellof90.f90 Add a couple of test programs 2009-10-24 01:00:38 +00:00
intercomm_create.c Add another test 2011-11-02 15:59:16 +00:00
loop_child.c Update a couple of tests 2009-03-01 15:32:32 +00:00
loop_spawn.c Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1sec delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s). 2009-06-08 18:28:26 +00:00
makedata.pl Add scripty-foo to make the data files. Revamp the data files to be 2008-10-24 13:35:47 +00:00
Makefile Add another test 2011-11-02 15:59:16 +00:00
Makefile.include Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started. 2010-07-20 04:22:45 +00:00
mpi_barrier.c Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
mpi_no_op.c With the branch to 1.2 made.... 2006-08-15 19:54:10 +00:00
mpi_spin.c Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes. 2006-11-13 21:51:34 +00:00
multi_abort.c One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged". 2008-03-20 13:54:11 +00:00
parallel_r8.c - Change the property of a few files, that obviously 2009-08-11 01:40:00 +00:00
parallel_r64.c - Change the property of a few files, that obviously 2009-08-11 01:40:00 +00:00
parallel_w8.c Add some more tests for parallel IO that have caused problems in the past. 2009-07-01 14:47:14 +00:00
parallel_w64.c - Change the property of a few files, that obviously 2009-08-11 01:40:00 +00:00
pubsub.c Fix ompi-server so it works with unity routed module - still not working with tree routing. 2008-04-04 19:17:28 +00:00
read_write.c Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not. 2008-10-25 14:38:06 +00:00
reduce-hang.c Cleanup the nidmap lookup functions and add some comments explaining how we handle the nid, job, and pmap arrays. This fixes a problem where we have less-than-full participation in a comm_spawn, causing holes to exist in the pmap array. 2009-04-09 02:48:33 +00:00
segv.c Commit a few missing header files, etc. 2008-09-24 15:41:42 +00:00
sendrecv_blaster.c Revert last commit - went to wrong repo! 2009-08-25 13:06:14 +00:00
shell_hello Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. 2008-02-28 01:57:57 +00:00
simple_spawn.c At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: 2011-11-15 03:40:11 +00:00
singleton_client_server.c Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started. 2010-07-20 04:22:45 +00:00
sio.c Ensure that output ends on an appropriate suffix tag when --tag-output or --xml are selected. 2009-07-17 05:02:53 +00:00
slave_spawn.c Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it". 2009-06-10 19:01:08 +00:00
slave.c Add new slave spawn test programs 2009-02-09 20:45:11 +00:00
spawn_multiple.c Commit a few missing header files, etc. 2008-09-24 15:41:42 +00:00
ziaprobe.c Perform the ziatest as a C program instead of a script - less trouble that way. 2009-04-30 18:43:26 +00:00
ziatest.c Perform the ziatest as a C program instead of a script - less trouble that way. 2009-04-30 18:43:26 +00:00
ziatest.README Add some more tests for parallel IO that have caused problems in the past. 2009-07-01 14:47:14 +00:00