openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	5f7a5c4793	Update test to include all keys This commit was SVN r27311.	2012-09-12 05:02:51 +00:00
Ralph Castain	cd8aff675b	Update test This commit was SVN r27303.	2012-09-11 20:32:43 +00:00
Ralph Castain	0e1dbe8711	Remove non-existent files This commit was SVN r27136.	2012-08-25 01:29:17 +00:00
Ralph Castain	c8b511d18a	Remove stale tests This commit was SVN r27126.	2012-08-24 02:22:11 +00:00
Ralph Castain	3c13176aa7	Remove test code This commit was SVN r27114.	2012-08-22 21:36:54 +00:00
Ralph Castain	335c0eafcf	Add a filem test program and set ignores This commit was SVN r27069.	2012-08-16 17:46:46 +00:00
Jeff Squyres	96f640a762	Add new "opal_hotel" class. Abstractly speaking, this class does the following: * Provides a fixed number of resource slots (i.e., "hotel rooms"). * Allows one thing to occupy a resource slot at a time (i.e., each hotel room can have an occupant check in to that room). * Resource slots can be vacated at any time (i.e., occupants can voluntarily check out of their hotel room). * Resource slots can be occupied for a specific maximum amount of time. If that time expires, the occupant is forcibly evicted and the upper layer is notified via (libevent) callback (i.e., the maid will kick an occupant of out of their room when their reservation is over). This class can be to be used for things like retransmission schemes for unreliable transports. For example, a message sent on an unreliable transport can be checked in to a hotel room. If an ACK for that message is received, the message can be checked out. But if the ACK is never received, the message will eventually be evicted from its room and the upper layer will be notified that the message failed to check out in time (i.e., that an ACK for that message was not received in time). Code using this class is currently being developed off-trunk, but will be coming to SVN soon. This commit was SVN r27067.	2012-08-16 17:29:55 +00:00
Ralph Castain	c90b7380c1	Sigh - of course, they changed the name of the silly MPI_Info object in the final standard, but not in the proposal. So change to the new MPI_INFO_ENV name. Also, don't set unknown values to "N/A", but just leave them unset. This commit was SVN r27012.	2012-08-12 05:00:57 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job "first_rank" - the MPI rank of the first process in each app_context "np" - the number of procs in each app_context Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
Ralph Castain	6ee35e4977	Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well This commit was SVN r26931.	2012-07-31 21:21:50 +00:00
Ralph Castain	9680c52f5e	Add mrplus examples to tarball This commit was SVN r26643.	2012-06-23 02:40:46 +00:00
Ralph Castain	40c2fc5f55	Update the tests, add a couple This commit was SVN r26379.	2012-05-02 19:00:05 +00:00
Ralph Castain	8f7bf3344a	Update test This commit was SVN r26370.	2012-05-01 18:38:44 +00:00
Ralph Castain	4542070cf2	Add event priority inversion test This commit was SVN r26369.	2012-05-01 16:42:22 +00:00
Ralph Castain	f68487016c	Add test code from Terry. Properly terminate if we don't abort on non-zero exit This commit was SVN r26271.	2012-04-16 16:44:23 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Ralph Castain	534d70025f	Cleanup the detection of process binding during mpi_init. There are several cases that need to be checked: 1. no binding support - indicated by a negative return code from get_cpubind 2. binding supported, but not bound - the bitset returned by get_cpubind is the same as the available cpuset 3. binding supported and bound - bitset from get_cpubind is a subset of available cpuset 4. only one cpu is available - in this case, get_cpubind matches the available cpuset, but we are effectively bound This commit was SVN r25957.	2012-02-17 21:18:53 +00:00
Ralph Castain	10f94efbda	Add binding output to test This commit was SVN r25955.	2012-02-17 16:48:01 +00:00
Ralph Castain	15facc4ba6	Fix comm_spawn yet again...add another test This commit was SVN r25579.	2011-12-06 20:15:40 +00:00
Ralph Castain	6310361532	At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation. In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions: 1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior. 2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation. 3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so. As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes. This commit was SVN r25476.	2011-11-15 03:40:11 +00:00
Samuel Gutierrez	3ea59cce96	minor cleanup to getenv_pmi.c. This commit was SVN r25449.	2011-11-07 03:18:07 +00:00
Samuel Gutierrez	e03bc93fb7	only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code. This commit was SVN r25447.	2011-11-06 17:28:40 +00:00
Ralph Castain	198e001554	Add another test This commit was SVN r25415.	2011-11-02 15:59:16 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Wesley Bland	09274cd047	Make sure that the epoch is initialized everywhere so we don't get weird output during valgrind. This shouldn't have caused any problems with any actual execution. Just extra warnings in valgrind. This commit was SVN r25015.	2011-08-08 15:11:55 +00:00
Ralph Castain	7b9f958dcf	Add some missing error strings. Update test to show silent errors This commit was SVN r25010.	2011-08-08 04:21:02 +00:00
Ralph Castain	590ac70e88	Add a simple test program for error string output This commit was SVN r25007.	2011-08-07 21:32:25 +00:00
Ralph Castain	8853e0e80a	Fix regular expression analyzer for slurmd - use a slurm-specific version Fix multi-node routing for daemon startup when static ports are not set This commit was SVN r24898.	2011-07-13 22:49:56 +00:00
Ralph Castain	1ee7c39982	Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long. Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again. Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines. This commit was SVN r24860.	2011-07-07 18:54:30 +00:00
Wesley Bland	e1ba09ad51	Add a resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Ralph Castain	138928fcf4	Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces. This commit was SVN r24666.	2011-04-29 18:46:40 +00:00
Ralph Castain	80ef1af8ba	Add psm key generator program This commit was SVN r24197.	2010-12-30 20:54:58 +00:00
Ralph Castain	58e711a412	Update a test and add two new ones for testing event lib thread support This commit was SVN r24051.	2010-11-13 15:39:28 +00:00
Ralph Castain	703684e071	Output the mca params for debug purposes This commit was SVN r24042.	2010-11-11 20:06:29 +00:00
Ralph Castain	a47b33678b	Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi. Use threads to help process multicast messages. This commit was SVN r24009.	2010-11-08 19:09:23 +00:00
Ralph Castain	bf665692c3	Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system. This commit was SVN r24004.	2010-11-07 23:29:52 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Ralph Castain	248320b91a	Enable connect_accept between multiple singleton jobs without the presence of an external rendezvous agent (e.g., ompi-server). This also enables connect_accept between processes in more than two jobs regardless of how they were started. Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them. Add a test provided by Philippe that tests this ability. This commit was SVN r23438.	2010-07-20 04:22:45 +00:00
Ralph Castain	f3d90dfb8d	Fully restore fault recovery, both at the individual process and daemon level. NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes. This commit was SVN r23335.	2010-07-01 19:45:43 +00:00
Ralph Castain	6cbe947810	Modify the multicast scheme so that applications have separate input and output channels to avoid cross-talk. Update the multicast test to conform. This commit was SVN r23271.	2010-06-15 03:50:31 +00:00
Ralph Castain	bb602694e6	Add a new example program, update cisco platform file This commit was SVN r23262.	2010-06-09 18:21:06 +00:00
Ralph Castain	d80c90c7b9	Include missing tests This commit was SVN r23244.	2010-06-07 14:15:00 +00:00
Ralph Castain	ab6e06f5b3	Reorganize the rmcast code to capture common code elements. Increase max msg size for spread and udp transports. Cleanup the spread configuration doc. This commit was SVN r23207.	2010-05-25 22:36:57 +00:00
Ralph Castain	88f5217a12	Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired). Add new mca params to test: orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch orte_debugger_test_attach: Test debugger colaunch after debugger attachment To test co-launch at job start, just set the orte_debugger_test_daemon param. To test co-launch upon attach: set orte_debugger_test_daemon set orte_debugger_test_attach=1 set orte_enable_debug_cospawn_while_running=1 set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon. This commit was SVN r23144.	2010-05-14 18:44:49 +00:00
Ralph Castain	8e7faf9119	Add a new test for the db framework, fix some minor bugs in the daemon module This commit was SVN r23085.	2010-05-04 02:38:11 +00:00
Ralph Castain	9dfb5c7c62	Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set This commit was SVN r23082.	2010-05-03 04:11:03 +00:00
Ralph Castain	c62418d76d	Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why. Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status. This commit was SVN r23071.	2010-04-29 19:58:44 +00:00
Ralph Castain	55889934d8	After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon. Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly. This commit was SVN r23045.	2010-04-27 03:39:32 +00:00

1 2 3 4

184 Коммитов