openmpi

Author	SHA1	Message	Date
Ralph Castain	a080de188f	Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists. This commit was SVN r27516.	2012-10-29 23:11:30 +00:00
Ralph Castain	4e52a15e70	Provide for sync on seek and close DFS operations. Eliminate an unnecessary wake-up timer when using ORTE progress thread This commit was SVN r27500.	2012-10-26 15:49:04 +00:00
Ralph Castain	df642f1508	Add an API to get a remote file's size. Separate dfs cmds from returned data messages so daemons don't get confused. This commit was SVN r27487.	2012-10-25 22:23:08 +00:00
Ralph Castain	094d6f3143	Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them. Add a set of URI create/parse utilities This commit was SVN r27483.	2012-10-25 17:15:17 +00:00
Ralph Castain	90d7b5fdca	Update test This commit was SVN r27354.	2012-09-20 02:51:27 +00:00
Ralph Castain	5f7a5c4793	Update test to include all keys This commit was SVN r27311.	2012-09-12 05:02:51 +00:00
Ralph Castain	cd8aff675b	Update test This commit was SVN r27303.	2012-09-11 20:32:43 +00:00
Ralph Castain	0e1dbe8711	Remove non-existent files This commit was SVN r27136.	2012-08-25 01:29:17 +00:00
Ralph Castain	c8b511d18a	Remove stale tests This commit was SVN r27126.	2012-08-24 02:22:11 +00:00
Ralph Castain	3c13176aa7	Remove test code This commit was SVN r27114.	2012-08-22 21:36:54 +00:00
Ralph Castain	335c0eafcf	Add a filem test program and set ignores This commit was SVN r27069.	2012-08-16 17:46:46 +00:00
Jeff Squyres	96f640a762	Add new "opal_hotel" class. Abstractly speaking, this class does the following: * Provides a fixed number of resource slots (i.e., "hotel rooms"). * Allows one thing to occupy a resource slot at a time (i.e., each hotel room can have an occupant check in to that room). * Resource slots can be vacated at any time (i.e., occupants can voluntarily check out of their hotel room). * Resource slots can be occupied for a specific maximum amount of time. If that time expires, the occupant is forcibly evicted and the upper layer is notified via (libevent) callback (i.e., the maid will kick an occupant out of their room when their reservation is over). This class can be used for things like retransmission schemes for unreliable transports. For example, a message sent on an unreliable transport can be checked in to a hotel room. If an ACK for that message is received, the message can be checked out. But if the ACK is never received, the message will eventually be evicted from its room and the upper layer will be notified that the message failed to check out in time (i.e., that an ACK for that message was not received in time). Code using this class is currently being developed off-trunk, but will be coming to SVN soon. This commit was SVN r27067.	2012-08-16 17:29:55 +00:00
Ralph Castain	c90b7380c1	Sigh - of course, they changed the name of the silly MPI_Info object in the final standard, but not in the proposal. So change to the new MPI_INFO_ENV name. Also, don't set unknown values to "N/A", but just leave them unset. This commit was SVN r27012.	2012-08-12 05:00:57 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job "first_rank" - the MPI rank of the first process in each app_context "np" - the number of procs in each app_context Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
Ralph Castain	6ee35e4977	Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well This commit was SVN r26931.	2012-07-31 21:21:50 +00:00
Ralph Castain	9680c52f5e	Add mrplus examples to tarball This commit was SVN r26643.	2012-06-23 02:40:46 +00:00
Ralph Castain	40c2fc5f55	Update the tests, add a couple This commit was SVN r26379.	2012-05-02 19:00:05 +00:00
Ralph Castain	8f7bf3344a	Update test This commit was SVN r26370.	2012-05-01 18:38:44 +00:00
Ralph Castain	4542070cf2	Add event priority inversion test This commit was SVN r26369.	2012-05-01 16:42:22 +00:00
Ralph Castain	f68487016c	Add test code from Terry. Properly terminate if we don't abort on non-zero exit This commit was SVN r26271.	2012-04-16 16:44:23 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Ralph Castain	534d70025f	Cleanup the detection of process binding during mpi_init. There are several cases that need to be checked: 1. no binding support - indicated by a negative return code from get_cpubind 2. binding supported, but not bound - the bitset returned by get_cpubind is the same as the available cpuset 3. binding supported and bound - bitset from get_cpubind is a subset of available cpuset 4. only one cpu is available - in this case, get_cpubind matches the available cpuset, but we are effectively bound This commit was SVN r25957.	2012-02-17 21:18:53 +00:00
Ralph Castain	10f94efbda	Add binding output to test This commit was SVN r25955.	2012-02-17 16:48:01 +00:00
Ralph Castain	15facc4ba6	Fix comm_spawn yet again...add another test This commit was SVN r25579.	2011-12-06 20:15:40 +00:00
Ralph Castain	6310361532	At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation. In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions: 1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior. 2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation. 3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so. As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes. This commit was SVN r25476.	2011-11-15 03:40:11 +00:00
Samuel Gutierrez	3ea59cce96	minor cleanup to getenv_pmi.c. This commit was SVN r25449.	2011-11-07 03:18:07 +00:00
Samuel Gutierrez	e03bc93fb7	only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code. This commit was SVN r25447.	2011-11-06 17:28:40 +00:00
Ralph Castain	198e001554	Add another test This commit was SVN r25415.	2011-11-02 15:59:16 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Wesley Bland	09274cd047	Make sure that the epoch is initialized everywhere so we don't get weird output during valgrind. This shouldn't have caused any problems with any actual execution. Just extra warnings in valgrind. This commit was SVN r25015.	2011-08-08 15:11:55 +00:00
Ralph Castain	7b9f958dcf	Add some missing error strings. Update test to show silent errors This commit was SVN r25010.	2011-08-08 04:21:02 +00:00
Ralph Castain	590ac70e88	Add a simple test program for error string output This commit was SVN r25007.	2011-08-07 21:32:25 +00:00
Ralph Castain	8853e0e80a	Fix regular expression analyzer for slurmd - use a slurm-specific version Fix multi-node routing for daemon startup when static ports are not set This commit was SVN r24898.	2011-07-13 22:49:56 +00:00
Ralph Castain	1ee7c39982	Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long. Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again. Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines. This commit was SVN r24860.	2011-07-07 18:54:30 +00:00
Wesley Bland	e1ba09ad51	Add a resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Ralph Castain	138928fcf4	Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces. This commit was SVN r24666.	2011-04-29 18:46:40 +00:00
Ralph Castain	80ef1af8ba	Add psm key generator program This commit was SVN r24197.	2010-12-30 20:54:58 +00:00
Ralph Castain	58e711a412	Update a test and add two new ones for testing event lib thread support This commit was SVN r24051.	2010-11-13 15:39:28 +00:00
Ralph Castain	703684e071	Output the mca params for debug purposes This commit was SVN r24042.	2010-11-11 20:06:29 +00:00
Ralph Castain	a47b33678b	Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi. Use threads to help process multicast messages. This commit was SVN r24009.	2010-11-08 19:09:23 +00:00
Ralph Castain	bf665692c3	Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system. This commit was SVN r24004.	2010-11-07 23:29:52 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Ralph Castain	248320b91a	Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started. Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them. Add a test provided by Philippe that tests this ability. This commit was SVN r23438.	2010-07-20 04:22:45 +00:00
Ralph Castain	f3d90dfb8d	Fully restore fault recovery, both at the individual process and daemon level. NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes. This commit was SVN r23335.	2010-07-01 19:45:43 +00:00
Ralph Castain	6cbe947810	Modify the multicast scheme so that applications have separate input and output channels to avoid cross-talk. Update the multicast test to conform. This commit was SVN r23271.	2010-06-15 03:50:31 +00:00
Ralph Castain	bb602694e6	Add a new example program, update cisco platform file This commit was SVN r23262.	2010-06-09 18:21:06 +00:00
Ralph Castain	d80c90c7b9	Include missing tests This commit was SVN r23244.	2010-06-07 14:15:00 +00:00
Ralph Castain	ab6e06f5b3	Reorganize the rmcast code to capture common code elements. Increase max msg size for spread and udp transports. Cleanup the spread configuration doc. This commit was SVN r23207.	2010-05-25 22:36:57 +00:00
Ralph Castain	88f5217a12	Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired). Add new mca params to test: orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch orte_debugger_test_attach: Test debugger colaunch after debugger attachment To test co-launch at job start, just set the orte_debugger_test_daemon param. To test co-launch upon attach: set orte_debugger_test_daemon set orte_debugger_test_attach=1 set orte_enable_debug_cospawn_while_running=1 set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon. This commit was SVN r23144.	2010-05-14 18:44:49 +00:00
Ralph Castain	8e7faf9119	Add a new test for the db framework, fix some minor bugs in the daemon module This commit was SVN r23085.	2010-05-04 02:38:11 +00:00
Ralph Castain	9dfb5c7c62	Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set This commit was SVN r23082.	2010-05-03 04:11:03 +00:00
Ralph Castain	c62418d76d	Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why. Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status. This commit was SVN r23071.	2010-04-29 19:58:44 +00:00
Ralph Castain	55889934d8	After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon. Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly. This commit was SVN r23045.	2010-04-27 03:39:32 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Ralph Castain	0b9552cd4e	Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly. This commit was SVN r22870.	2010-03-23 20:47:41 +00:00
Ralph Castain	9a5fdbb622	Continue development of reliable multicast This commit was SVN r22616.	2010-02-14 19:20:56 +00:00
Ralph Castain	09763ec711	Since we modified ORTE to declare that any process that terminates after calling "init" while at least one other process has not yet called "init" is an error, we have to ensure that non-MPI ORTE apps (i.e., apps that call orte_init but not mpi_init) include a barrier in orte_init. Otherwise, fast ORTE apps almost always wind up triggering the "abnormal termination" condition. The barrier is protected with a test to ensure that MPI apps don't execute it and wind up doing two barriers during their init. This commit was SVN r22378.	2010-01-07 06:58:01 +00:00
Ralph Castain	ef1bfaa823	Add the ability to track how many times a process has been restarted, and to communicate that value to a process when it is restarted in case it needs to take action when it is restarted as opposed to being started for the first time. This commit was SVN r22377.	2010-01-07 01:19:44 +00:00
Ralph Castain	06d1f2cfe2	Add some new tests to the ORTE collection This commit was SVN r22328.	2009-12-17 19:30:57 +00:00
Ralph Castain	4026a9c873	Update all the tests to the new orte_init API This commit was SVN r22263.	2009-12-04 04:31:06 +00:00
Ralph Castain	4a82dd9a45	Add message sequence numbers to multicast messages, tracked by channel This commit was SVN r22262.	2009-12-04 04:17:44 +00:00
Ralph Castain	ae3e9f2aee	Update the spin.c test This commit was SVN r22259.	2009-12-03 04:46:31 +00:00
Ralph Castain	93ebed48b1	Update the multicast test. Some cleanups to the basic rmcast module This commit was SVN r22257.	2009-12-03 04:30:58 +00:00
Ralph Castain	a0d5c80ce0	Add a new framework for discovering local resource information such as cpu type/model, #cpus, available physical memory, etc. Two initial components (darwin and linux) are provided. This is needed to support bootstrap operations where daemons are started at node boot, and applications where initial knowledge of cpu identification is needed to guide framework component selection. Add orte configuration option to control the use of the framework in the system. Although the code will build, it will not be active unless configured with --enable-bootstrap. If bootstrap is enabled and the new opal_sysinfo framework can successfully determine the cpu model, pass that info to the application as an MCA param to support some work at Sun. Also, have daemons report back the resources they find to guide process mapping in bootstrap operations (i.e., where the daemon starts at node boot as opposed to being launched at application start). Adjust some platform files to enable these capabilities. This commit was SVN r22244.	2009-11-30 23:11:25 +00:00
Ralph Castain	92733b13d9	Add a couple of new tests to the orte system. Modify the job_complete check so we don't kill jobs when a single proc was terminated by ORTE command via plm.terminate_procs Still dies gracefully with a ctrl-c, and behaves as before when using plm.terminate_job This commit was SVN r22227.	2009-11-20 01:47:49 +00:00
Ralph Castain	840766a894	Update the rmcast APIs to include tag params and reorder them to look like their rml cousins This commit was SVN r22218.	2009-11-17 15:58:59 +00:00
Ralph Castain	a2f3a47b92	Update the orte_mcast test This commit was SVN r22214.	2009-11-11 22:11:19 +00:00
Ralph Castain	7afd65d631	Add a couple of test programs This commit was SVN r22137.	2009-10-24 01:00:38 +00:00
Ralph Castain	b7a0125bb7	Add a test for the new opal if.c functions. Modify the multicast test This commit was SVN r22081.	2009-10-09 15:25:18 +00:00
Ralph Castain	26bb6e8f79	Add a couple of non-orte multicast tests This commit was SVN r22001.	2009-09-23 05:24:22 +00:00
Ralph Castain	c3f9096fd9	Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast. This commit was SVN r21988.	2009-09-22 00:58:29 +00:00
Ralph Castain	82af6ee940	Update test This commit was SVN r21987.	2009-09-22 00:55:02 +00:00
Ralph Castain	0e528e994f	Revert last commit - went to wrong repo! Didn't we just have that happen the other day too? :-) This commit was SVN r21878.	2009-08-25 13:06:14 +00:00
Ralph Castain	15d12b240b	Sync to r21876 This commit was SVN r21877. The following SVN revision numbers were found above: r21876 --> open-mpi/ompi@ef970293f0	2009-08-25 13:04:12 +00:00
Ralph Castain	511fe5da8b	Minor cleanup - get reported iters right in test This commit was SVN r21819.	2009-08-14 03:33:59 +00:00
Ralph Castain	9da6b46e7d	Add several options to the sendrecv_blaster to make it more powerful This commit was SVN r21817.	2009-08-14 03:12:43 +00:00
Ralph Castain	007cbe74f4	Include the sendrecv blaster in the tarball This commit was SVN r21790.	2009-08-11 02:42:47 +00:00
Rainer Keller	76469ea64a	- Change the property of a few files, that obviously don't need to be svn:executable... This commit was SVN r21786.	2009-08-11 01:40:00 +00:00
Ralph Castain	c66a5a9504	Add another test that just blasts the system with MPI_Sendrecv to myself commands of varying sizes This commit was SVN r21748.	2009-07-31 14:57:03 +00:00
Ralph Castain	ef20e778b3	Ensure that output ends on an appropriate suffix tag when --tag-output or --xml are selected. When we read the input buffer, we don't always get a complete printf output - we sometimes end mid stream. We still need to add the suffix and a <CR> to keep the output working right. This commit was SVN r21706.	2009-07-17 05:02:53 +00:00
Ralph Castain	bc0fe3c6da	Add some more tests for parallel IO that have caused problems in the past. Add a README that explains how to run the ziatest for launch timing This commit was SVN r21576.	2009-07-01 14:47:14 +00:00
Ralph Castain	2fbdea0273	Add a test for loop over bcast This commit was SVN r21560.	2009-06-29 17:06:19 +00:00
Ralph Castain	70b8c89b44	Fix slave spawn, which was hanging because the local daemon never saw the slave job report - it doesn't do it in the normal way, and so the slave launch system itself has to "fake it". Also complete implementation to printout app_context objects so we see all the fields. This commit was SVN r21408.	2009-06-10 19:01:08 +00:00
Ralph Castain	86d55d7ebf	Fix tight loops over comm_spawn by checking to see if the system has enough child procs and file descriptors available before attempting to launch. If not, introduce a 1sec delay and then test again. This provides a chance for the orted to complete processing of proc terminations from other children, hopefully creating room for the new proc(s). Update the loop_spawn test to remove a sleep so that it runs at max speed, letting the new code catch when we overrun ourselves and wait for room to be cleared for the next comm_spawn. This commit was SVN r21390.	2009-06-08 18:28:26 +00:00
Greg Koenig	60485ff95f	This is a very large change to rename several #define values from OMPI_* to OPAL_*. This allows opal layer to be used more independent from the whole of ompi. NOTE: 9 "svn mv" operations immediately follow this commit. This commit was SVN r21180.	2009-05-06 20:11:28 +00:00
Ralph Castain	4be24521aa	Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans. Minor mod to the ompi layer to use the new #define's - just one-line name replacements. This commit was SVN r21144.	2009-05-04 11:07:40 +00:00
Ralph Castain	7b420e32b6	Add some missing tests This commit was SVN r21140.	2009-05-01 18:35:22 +00:00
Ralph Castain	dfb2146430	Perform the ziatest as a C program instead of a script - less trouble that way. This commit was SVN r21132.	2009-04-30 18:43:26 +00:00
Ralph Castain	ce206df568	Fix the danged ziatest - thx to Jeff, the mighty perl guru! This commit was SVN r21116.	2009-04-29 17:58:35 +00:00
Rainer Keller	221fb9dbca	... Delayed due to notifier commits earlier this day ... - Delete unnecessary header files using contrib/check_unnecessary_headers.sh after applying patches, that include headers, being "lost" due to inclusion in one of the now deleted headers... In total 817 files are touched. In ompi/mpi/c/ header files are moved up into the actual c-file, where necessary (these are the only additional #include), otherwise it is only deletions of #include (apart from the above additions required due to notifier...) - To get different MCAs (OpenIB, TM, ALPS), an earlier version was successfully compiled (yesterday) on: Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled This commit was SVN r21096.	2009-04-29 01:32:14 +00:00
Ralph Castain	f3cfe32b5d	Update the slave launch and cleanup procedures. Track what files have been moved to the slave node to avoid attempting to copy them multiple times on top of each other. Cleanup any pre-positioned files, kill any lingering apps, and cleanup the session directory area upon termination of the daemon. This commit was SVN r21094.	2009-04-29 00:11:19 +00:00
Ralph Castain	76b6ae3b29	Fix the ziatest to report correct times This commit was SVN r21059.	2009-04-23 01:12:56 +00:00
Ralph Castain	3c3c306ee4	Simplify test This commit was SVN r21058.	2009-04-22 23:03:04 +00:00
Ralph Castain	1f7281957f	Need this one too... This commit was SVN r21027.	2009-04-16 02:23:57 +00:00
Ralph Castain	9c2f17eb01	Cleanup the nidmap lookup functions and add some comments explaining how we handle the nid, job, and pmap arrays. This fixes a problem when we have less-than-full participation in a comm_spawn, causing holes to exist in the pmap array. Update the slave spawn tests to properly indicate participation as being solely MPI_COMM_SELF. This commit was SVN r20961.	2009-04-09 02:48:33 +00:00
Ralph Castain	4af623076d	Add a test for hanging in a loop over mpi_reduce This commit was SVN r20798.	2009-03-17 13:57:23 +00:00
Rainer Keller	ec0ed48718	- Revert r20739 This commit was SVN r20742. The following SVN revision numbers were found above: r20739 --> open-mpi/ompi@781caee0b6	2009-03-05 21:56:03 +00:00
Rainer Keller	a94438343b	- Revert r20740 This commit was SVN r20741. The following SVN revision numbers were found above: r20740 --> open-mpi/ompi@2a70618a77	2009-03-05 21:50:47 +00:00
Rainer Keller	2a70618a77	- Second patch, as discussed in Louisville. Replace short macros in orte/util/name_fns.h to the actual fct. call. - Compiles on linux/x86-64 This commit was SVN r20740.	2009-03-05 21:14:18 +00:00
Rainer Keller	781caee0b6	- First of two or three patches, in orte/util/proc_info.h: Adapt orte_process_info to orte_proc_info, and change orte_proc_info() to orte_proc_info_init(). - Compiled on linux-x86-64 - Discussed with Ralph This commit was SVN r20739.	2009-03-05 20:36:44 +00:00
Ralph Castain	47cfccbb49	Update a couple of tests This commit was SVN r20668.	2009-03-01 15:32:32 +00:00
Rainer Keller	4c0e8e1e69	- Header orte/mca/oob/base/base.h is probably the wrong one to include anyhow -- if oob functionality is needed, then orte/mca/oob/oob.h Nevertheless compiles fine with -Wimplicit-function-declaration This commit was SVN r20641.	2009-02-26 04:20:03 +00:00
Rainer Keller	04567d3af0	- Header orte/mca/errmgr/errmgr.h is not needed. Once again compiles fine with -Wimplicit-function-declaration This commit was SVN r20640.	2009-02-26 04:05:30 +00:00
Rainer Keller	96e1b9b747	- Header orte/mca/rml/rml.h is not needed if no occurrence of orte_rml or ORTE_RML. As the others compiles fine with -Wimplicit-function-declaration This commit was SVN r20639.	2009-02-26 03:52:31 +00:00
Rainer Keller	b356e90fa1	- Get rid of include orte/util/proc_info.h, if not needed Only proc_info.h-internal include file is opal/dss/dss_types.h - In one case (orte/util/hnp_contact.c) had to add proc_info.h again. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration works fine, no errors. Again, let's have MTT the last word. This commit was SVN r20631.	2009-02-25 03:38:00 +00:00
Ralph Castain	5dc4a2b1e0	Add missing include file This commit was SVN r20603.	2009-02-19 21:40:31 +00:00
Rainer Keller	d81443cc5a	- On the way to get the BTLs split out and lessen dependency on orte: Often, orte/util/show_help.h is included, although no functionality is required -- instead, most often opal_output.h, or orte/mca/rml/rml_types.h Please see orte_show_help_replacement.sh committed next. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration actually showed two missing #include "orte/util/show_help.h" in orte/mca/odls/base/odls_base_default_fns.c and in orte/tools/orte-top/orte-top.c Manually added these. Let's have MTT the last word. This commit was SVN r20557.	2009-02-14 02:26:12 +00:00
Ralph Castain	91bc5346eb	Update cell example This commit was SVN r20528.	2009-02-12 16:36:11 +00:00
Ralph Castain	62dd763a8f	Add ability for local slave spawns to pre-position supporting files. Update comm_spawn and comm_spawn_multiple man pages to cover new info_keys. This commit was SVN r20527.	2009-02-12 15:56:45 +00:00
Ralph Castain	b408cbd8c1	Crumby - get the make tarball correct! Earlier commit was from intermediate state... This commit was SVN r20504.	2009-02-10 18:33:32 +00:00
Ralph Castain	7216c5b104	Add a new test to demonstrate how to use slave spawn on hybrid machines. Add some of the orte test programs to the tarball to help diagnose user problems and provide examples This commit was SVN r20503.	2009-02-10 18:28:58 +00:00
Ralph Castain	26806c3fdd	Add new slave spawn test programs This commit was SVN r20493.	2009-02-09 20:45:11 +00:00
Ralph Castain	91ada6c323	Ensure we avoid overflows, handle the odd number of nodes case This commit was SVN r20171.	2008-12-31 01:11:57 +00:00
Ralph Castain	b012ed6c94	Add a somewhat unique launch time test This commit was SVN r20170.	2008-12-30 21:42:51 +00:00
Ralph Castain	7e3ddb09d3	As requested by Aurelien at the July design meeting - long time coming, but finally got around to it. Enable one mpirun to act as the server for another mpirun when doing MPI_Publish_name and its associated operations. The user is responsible, of course, for ensuring that the mpirun acting as a server outlives any mpiruns using it in that capacity. Add a cmd line option to mpirun --report-pid that prints out mpirun's pid. Allow the --ompi-server option to now take pid:# (or PID:#) of the mpirun to be used as the server, and then look that pid up by searching the local mpirun contact infos for it. This commit was SVN r20102.	2008-12-10 17:10:39 +00:00
Ralph Castain	1ace83c470	Enable modex-less launch. Consists of: 1. minor modification to include two new opal MCA params: (a) opal_profile: outputs what components were selected by each framework currently enabled for most, but not all, frameworks (b) opal_profile_file: name of file that contains profile info required for modex 2. introduction of two new tools: (a) ompi-probe: MPI process that simply calls MPI_Init/Finalize with opal_profile set. Also reports back the rml IP address for all interfaces on the node (b) ompi-profiler: uses ompi-probe to create the profile_file, also reports out a summary of what framework components are actually being used to help with configuration options 3. modification of the grpcomm basic component to utilize the profile file in place of the modex where possible 4. modification of orterun so it properly sees opal mca params and handles opal_profile correctly to ensure we don't get its profile 5. similar mod to orted as for orterun 6. addition of new test that calls orte_init followed by calls to grpcomm.barrier This is all completely benign unless actively selected. At the moment, it only supports modex-less launch for openib-based systems. Minor mod to the TCP btl would be required to enable it as well, if people are interested. Similarly, anyone interested in enabling other BTL's for modex-less operation should let me know and I'll give you the magic details. This seems to significantly improve scalability provided the file can be locally located on the nodes. I'm looking at an alternative means of disseminating the info (perhaps in launch message) as an option for removing that constraint. This commit was SVN r20098.	2008-12-09 23:49:02 +00:00
Ralph Castain	f54fda489e	This is a first step towards supporting fully-routed OOB communications: 1. remove direct routed module (hooray!) 2. add radix tree routed module (binomial remains default) 3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess 4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support 5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability 6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon) Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm. This will require an autogen... This commit was SVN r19866.	2008-10-31 21:10:00 +00:00
Ralph Castain	71dcf61f9b	Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not. Modify read_write test to allow specification of rank to read stdin. IOF now validated to work for arbitrary rank as stdin target. Not validated to work for multiple simultaneous ranks reading stdin (untested). This commit was SVN r19804.	2008-10-25 14:38:06 +00:00
Jeff Squyres	d96b78fee1	If the script is there, there's no real reason to have these files in the repo. This commit was SVN r19795.	2008-10-24 13:42:26 +00:00
Jeff Squyres	0a741d7f81	Add scripty-foo to make the data files. Revamp the data files to be non-uniform in content as a slightly better test. This commit was SVN r19794.	2008-10-24 13:35:47 +00:00
Ralph Castain	c56cdac379	Finish cleanup of stdin. Set non-stdio file descriptors to non-blocking (thanks to Jeff for catching that one). Handle writes that result in "would have blocked" errno. This commit was SVN r19793.	2008-10-24 01:42:58 +00:00
Ralph Castain	6100d88ded	Cleanup the new IOF: 1. remove some stale files that were overlooked in original commit 2. add a test program and data to stress iof for stdin 3. cleanup a debug statement that caused memory corruption when reading large files 4. some minor cleanups to correctly handle xon/xoff scenarios This commit was SVN r19792.	2008-10-23 19:11:05 +00:00
Ralph Castain	6e5d844c36	Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: 1. completely and cleanly separates responsibilities between the HNP, orted, and tool components. 2. removes all wireup messaging during launch and shutdown. 3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol. 4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0. 5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none". 6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout. 7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output" This is not intended for the 1.3 release as it is a major change requiring considerable soak time. This commit was SVN r19767.	2008-10-18 00:00:49 +00:00
Jeff Squyres	dbb932b619	Remove the missing app mpi_after_finalize from the Makefile. This commit was SVN r19687.	2008-10-06 14:35:15 +00:00
Ralph Castain	15c47a2473	Revise the daemon collective system to handle comm_spawn patterns that cross into new nodes that are not direct children on the routing tree of the HNP. Refers to ticket #1548. Although this appears to fix the problem, the ticket will be held open pending further test prior to transition to the 1.3 branch. This commit was SVN r19674.	2008-10-02 20:08:27 +00:00
Jeff Squyres	78a25cf116	Commit a few missing header files, etc. This commit was SVN r19626.	2008-09-24 15:41:42 +00:00
Ralph Castain	20ece3cb86	Add new test that stresses MPI send/recv This commit was SVN r19530.	2008-09-09 15:47:31 +00:00
Ralph Castain	063837a413	Add oob and iof stress tests This commit was SVN r19404.	2008-08-26 03:02:46 +00:00
Ralph Castain	ba5498cdc6	Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. 
minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 This commit was SVN r18804.	2008-07-03 17:53:37 +00:00
Ralph Castain	2cc8b2c51f	Add yet another test, this one for proper error behavior when someone calls an MPI function after calling MPI_Finalize. Add a minor debug that outputs the orterun exit status to stderr when orte_debug is set. This commit was SVN r18622.	2008-06-09 19:21:20 +00:00
Ralph Castain	11692ca98e	Update tests to flag that these are non-MPI apps This commit was SVN r18620.	2008-06-09 18:48:21 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Ralph Castain	0da811ce79	Initial work on xml support - allocation and job map outputs completed. More to come. This commit was SVN r18587.	2008-06-04 20:53:12 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not their opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing to messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	7b91f8baff	Cleanup and fix bugs in the MPI dynamics section. Modify the dpm API so it properly takes ports instead of process names (as correctly identified by Aurelien). Fix race conditions in the use of ompi-server. Fix incompatibilities between the mpi bindings and the dpm implementation that could cause segfaults due to uninitialized memory. Fix the ompi-server -h cmd line option so it actually tells you something! Add two new testing codes to the orte/test/mpi area: accept and connect. This commit was SVN r18176.	2008-04-16 14:27:42 +00:00
Ralph Castain	7c7304466c	Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts. Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message. Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends. This commit was SVN r18143.	2008-04-14 18:26:08 +00:00
Ralph Castain	95d7e177c6	Not really a test, but a useful tool for testing computation of binomial trees This commit was SVN r18113.	2008-04-09 21:58:42 +00:00
Ralph Castain	5e6dc24e62	Fix ompi-server so it works with unity routed module - still not working with tree routing. Cleanup debug flag so it activates debugging on the data server code itself This commit was SVN r18080.	2008-04-04 19:17:28 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Ralph Castain	6bb139e4f2	One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged". Modify the multi_abort function so they all return different exit codes - allows us to tell which one was being reported. This commit was SVN r17895.	2008-03-20 13:54:11 +00:00
Ralph Castain	2ed0e60321	Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize. Jeff is preparing an (undoubtedly lengthy) explanation/matrix of how these codes are determined for the OMPI FAQ. This commit was SVN r17879.	2008-03-19 19:00:51 +00:00
Ralph Castain	629b95a2fe	Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway. So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked. Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion. This commit was SVN r17843.	2008-03-17 17:58:59 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Ralph Castain	6f498c0964	Since we no longer set the APP_NUM attribute in the case of a singleton, protect the code so we don't crash in that case. This commit was SVN r16461.	2007-10-16 16:17:48 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component. This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done: As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in. The incoming changes revamp these procedures in three ways: 1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. 
At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step. The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic. Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure. 2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed. The size of this data has been reduced in three ways: (a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. 
Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose. (b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction. (c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using. While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly. 3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. 
All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k * 50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future. There are a few minor additional changes in the commit that I'll just note in passing: * propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details. * requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details. * cleanup of some stale header files This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
Ralph Castain	6c800d452d	Bring orte tests up to date with revised rml system. Make first cut at fixing non-direct xcast modes This commit was SVN r15553.	2007-07-23 13:05:34 +00:00


289 commits