openmpi

Author	SHA1	Message	Date
Ralph Castain	534d70025f	Cleanup the detection of process binding during mpi_init. There are several cases that need to be checked: 1. no binding support - indicated by a negative return code from get_cpubind 2. binding supported, but not bound - the bitset returned by get_cpubind is the same as the available cpuset 3. binding supported and bound - bitset from get_cpubind is a subset of available cpuset 4. only one cpu is available - in this case, get_cpubind matches the available cpuset, but we are effectively bound This commit was SVN r25957.	2012-02-17 21:18:53 +00:00
Ralph Castain	10f94efbda	Add binding output to test This commit was SVN r25955.	2012-02-17 16:48:01 +00:00
Ralph Castain	15facc4ba6	Fix comm_spawn yet again...add another test This commit was SVN r25579.	2011-12-06 20:15:40 +00:00
Ralph Castain	6310361532	At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation. In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions: 1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior. 2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation. 3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so. As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes. This commit was SVN r25476.	2011-11-15 03:40:11 +00:00
Ralph Castain	198e001554	Add another test This commit was SVN r25415.	2011-11-02 15:59:16 +00:00
Ralph Castain	248320b91a	Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started. Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them. Add a test provided by Philippe that tests this ability. This commit was SVN r23438.	2010-07-20 04:22:45 +00:00
Ralph Castain	88f5217a12	Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired). Add new mca params to test: orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch orte_debugger_test_attach: Test debugger colaunch after debugger attachment To test co-launch at job start, just set the orte_debugger_test_daemon param. To test co-launch upon attach: set orte_debugger_test_daemon set orte_debugger_test_attach=1 set orte_enable_debug_cospawn_while_running=1 set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon. This commit was SVN r23144.	2010-05-14 18:44:49 +00:00
Ralph Castain	06d1f2cfe2	Add some new tests to the ORTE collection This commit was SVN r22328.	2009-12-17 19:30:57 +00:00
Ralph Castain	7afd65d631	Add a couple of test programs This commit was SVN r22137.	2009-10-24 01:00:38 +00:00
Ralph Castain	0e528e994f	Revert last commit - went to wrong repo! Didn't we just have that happen the other day too? :-) This commit was SVN r21878.	2009-08-25 13:06:14 +00:00
Ralph Castain	15d12b240b	Sync to r21876 This commit was SVN r21877. The following SVN revision numbers were found above: r21876 --> open-mpi/ompi@ef970293f0	2009-08-25 13:04:12 +00:00
Ralph Castain	511fe5da8b	Minor cleanup - get reported iters right in test This commit was SVN r21819.	2009-08-14 03:33:59 +00:00
Ralph Castain	9da6b46e7d	Add several options to the sendrecv_blaster to make it more powerful This commit was SVN r21817.	2009-08-14 03:12:43 +00:00
Ralph Castain	007cbe74f4	Include the sendrecv blaster in the tarball This commit was SVN r21790.	2009-08-11 02:42:47 +00:00
Rainer Keller	76469ea64a	- Change the property of a few files, that obviously don't need to be svn:executable... This commit was SVN r21786.	2009-08-11 01:40:00 +00:00
Ralph Castain	c66a5a9504	Add another test that just blasts the system with MPI_Sendrecv to myself commands of varying sizes This commit was SVN r21748.	2009-07-31 14:57:03 +00:00
Ralph Castain	ef20e778b3	Ensure that output ends on an appropriate suffix tag when --tag-output or --xml are selected. When we read the input buffer, we don't always get a complete printf output - we sometimes end mid stream. We still need to add the suffix and a <CR> to keep the output working right. This commit was SVN r21706.	2009-07-17 05:02:53 +00:00
Ralph Castain	bc0fe3c6da	Add some more tests for parallel IO that have caused problems in the past. Add a README that explains how to run the ziatest for launch timing This commit was SVN r21576.	2009-07-01 14:47:14 +00:00
Ralph Castain	2fbdea0273	Add a test for loop over bcast This commit was SVN r21560.	2009-06-29 17:06:19 +00:00
Ralph Castain	70b8c89b44	Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it". Also complete implementation to printout app_context objects so we see all the fields. This commit was SVN r21408.	2009-06-10 19:01:08 +00:00
Ralph Castain	86d55d7ebf	Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1sec delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s). Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn. This commit was SVN r21390.	2009-06-08 18:28:26 +00:00
Ralph Castain	7b420e32b6	Add some missing tests This commit was SVN r21140.	2009-05-01 18:35:22 +00:00
Ralph Castain	dfb2146430	Perform the ziatest as a C program instead of a script - less trouble that way. This commit was SVN r21132.	2009-04-30 18:43:26 +00:00
Ralph Castain	ce206df568	Fix the danged ziatest - thx to Jeff, the mighty perl guru! This commit was SVN r21116.	2009-04-29 17:58:35 +00:00
Ralph Castain	f3cfe32b5d	Update the slave launch and cleanup procedures. Track what files have been moved to the slave node to avoid attempting to copy them multiple times on top of each other. Cleanup any pre-positioned files, kill any lingering apps, and cleanup the session directory area upon termination of the daemon. This commit was SVN r21094.	2009-04-29 00:11:19 +00:00
Ralph Castain	76b6ae3b29	Fix the ziatest to report correct times This commit was SVN r21059.	2009-04-23 01:12:56 +00:00
Ralph Castain	3c3c306ee4	Simplify test This commit was SVN r21058.	2009-04-22 23:03:04 +00:00
Ralph Castain	9c2f17eb01	Cleanup the nidmap lookup functions and add some comments explaining how we handle the nid, job, and pmap arrays. This fixes a problem when we have less-than-full participation in a comm_spawn, causing holes to exist in the pmap array. Update the slave spawn tests to properly indicate participation as being solely MPI_COMM_SELF. This commit was SVN r20961.	2009-04-09 02:48:33 +00:00
Ralph Castain	4af623076d	Add a test for hanging in a loop over mpi_reduce This commit was SVN r20798.	2009-03-17 13:57:23 +00:00
Ralph Castain	47cfccbb49	Update a couple of tests This commit was SVN r20668.	2009-03-01 15:32:32 +00:00
Rainer Keller	d81443cc5a	- On the way to get the BTLs split out and lessen dependency on orte: Often, orte/util/show_help.h is included, although no functionality is required -- instead, most often opal_output.h, or orte/mca/rml/rml_types.h Please see orte_show_help_replacement.sh committed next. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration actually showed two missing #include "orte/util/show_help.h" in orte/mca/odls/base/odls_base_default_fns.c and in orte/tools/orte-top/orte-top.c Manually added these. Let's let MTT have the last word. This commit was SVN r20557.	2009-02-14 02:26:12 +00:00
Ralph Castain	91bc5346eb	Update cell example This commit was SVN r20528.	2009-02-12 16:36:11 +00:00
Ralph Castain	62dd763a8f	Add ability for local slave spawns to pre-position supporting files. Update comm_spawn and comm_spawn_multiple man pages to cover new info_keys. This commit was SVN r20527.	2009-02-12 15:56:45 +00:00
Ralph Castain	b408cbd8c1	Crumby - get the make tarball correct! Earlier commit was from intermediate state... This commit was SVN r20504.	2009-02-10 18:33:32 +00:00
Ralph Castain	7216c5b104	Add a new test to demonstrate how to use slave spawn on hybrid machines. Add some of the orte test programs to the tarball to help diagnose user problems and provide examples This commit was SVN r20503.	2009-02-10 18:28:58 +00:00
Ralph Castain	26806c3fdd	Add new slave spawn test programs This commit was SVN r20493.	2009-02-09 20:45:11 +00:00
Ralph Castain	91ada6c323	Ensure we avoid overflows, handle the odd number of nodes case This commit was SVN r20171.	2008-12-31 01:11:57 +00:00
Ralph Castain	b012ed6c94	Add a somewhat unique launch time test This commit was SVN r20170.	2008-12-30 21:42:51 +00:00
Ralph Castain	71dcf61f9b	Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not. Modify read_write test to allow specification of rank to read stdin. IOF now validated to work for arbitrary rank as stdin target. Not validated to work for multiple simultaneous ranks reading stdin (untested). This commit was SVN r19804.	2008-10-25 14:38:06 +00:00
Jeff Squyres	d96b78fee1	If the script is there, there's no real reason to have these files in the repo. This commit was SVN r19795.	2008-10-24 13:42:26 +00:00
Jeff Squyres	0a741d7f81	Add scripty-foo to make the data files. Revamp the data files to be non-uniform in content as a slightly better test. This commit was SVN r19794.	2008-10-24 13:35:47 +00:00
Ralph Castain	c56cdac379	Finish cleanup of stdin. Set non-stdio file descriptors to non-blocking (thanks to Jeff for catching that one). Handle writes that result in "would have blocked" errno. This commit was SVN r19793.	2008-10-24 01:42:58 +00:00
Ralph Castain	6100d88ded	Cleanup the new IOF: 1. remove some stale files that were overlooked in original commit 2. add a test program and data to stress iof for stdin 3. cleanup a debug statement that caused memory corruption when reading large files 4. some minor cleanups to correctly handle xon/xoff scenarios This commit was SVN r19792.	2008-10-23 19:11:05 +00:00
Ralph Castain	6e5d844c36	Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: 1. completely and cleanly separates responsibilities between the HNP, orted, and tool components. 2. removes all wireup messaging during launch and shutdown. 3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol. 4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0. 5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none". 6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout. 7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output" This is not intended for the 1.3 release as it is a major change requiring considerable soak time. This commit was SVN r19767.	2008-10-18 00:00:49 +00:00
Jeff Squyres	dbb932b619	Remove the missing app mpi_after_finalize from the Makefile. This commit was SVN r19687.	2008-10-06 14:35:15 +00:00
Jeff Squyres	78a25cf116	Commit a few missing header files, etc. This commit was SVN r19626.	2008-09-24 15:41:42 +00:00
Ralph Castain	20ece3cb86	Add new test that stresses MPI send/recv This commit was SVN r19530.	2008-09-09 15:47:31 +00:00
Ralph Castain	ba5498cdc6	Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 This commit was SVN r18804.	2008-07-03 17:53:37 +00:00
Ralph Castain	2cc8b2c51f	Add yet another test, this one for proper error behavior when someone calls an MPI function after calling MPI_Finalize. Add a minor debug that outputs the orterun exit status to stderr when orte_debug is set. This commit was SVN r18622.	2008-06-09 19:21:20 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00


76 Commits