openmpi

Author	SHA1	Message	Date
George Bosilca	d8fe05264b	Fix recursion in include files (Coverity fix 156). This commit was SVN r19181.	2008-08-06 13:50:01 +00:00
Ralph Castain	63c33a9c32	Some minor updates to the locking system changes. Remove obsolete locks. Ensure the trigger event objects do not get deconstructed until the very end to avoid possible problems due to race conditions. Route all orted abnormal term tests through the trigger. This commit was SVN r19172.	2008-08-06 11:31:06 +00:00
Shiqing Fan	bb90ad793a	- Move the entire OBJ_CLASS_INSTANCE of orte_trigger_event_t into #if blocks, so that windows can have its own destructor for socket. Thanks to Ralph. - The modification for handling windows socket will first be applied to windows branch. This commit was SVN r19170.	2008-08-06 09:42:48 +00:00
Ralph Castain	be02211b4f	Modify the wakeup system to make it more Windows-friendly. This allows Shiqing to consolidate the Windows-specific modifications into one location, and generalizes the wakeup procedure in case we hit other system-specific requirements. This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for awhile to verify. This commit was SVN r19159.	2008-08-05 15:09:29 +00:00
Ralph Castain	7342a6f1da	Per the July technical meeting: During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file) before the server is running and ready. At that time, we discussed creating a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun. Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!): --wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again. --server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for twice this time. Default is 10 seconds, which should be plenty of time. This has only lightly been tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case. This commit was SVN r19152.	2008-08-04 20:29:50 +00:00
Jeff Squyres	017a4acceb	Missed these during r19141 This commit was SVN r19151. The following SVN revision numbers were found above: r19141 --> open-mpi/ompi@b83ee7d82a	2008-08-04 20:10:55 +00:00
Jeff Squyres	b83ee7d82a	* Fix a problem with VPATH builds if the destination directory didn't already exist * s/top_srcdir/top_builddir/ in a bunch of places; left over from the previous man page generation system This commit was SVN r19141.	2008-08-04 15:17:50 +00:00
Ralph Castain	381d10833a	Remove comments in documentation about the "-nw" option to mpirun as this option doesn't exist, and probably never will This commit was SVN r19137.	2008-08-04 14:38:37 +00:00
Ralph Castain	35a86b3347	Establish an MCA param "orte_allocation_required" so that a system can require the user have an RM-provided allocation in order to run. This helps prevent the problem where a user forgets to get an allocation on an RM-managed cluster, and then executes mpirun on the head node - thus causing all of their mpi procs to launch on the head node, usually bringing it to its knees. Since OMPI allows mpirun to default to the local node, and since users want to retain the option to co-locate procs with mpirun, we needed another param to block this error case. This commit was SVN r19135.	2008-08-04 14:25:19 +00:00
Rainer Keller	0d08866786	- Declare functions in lex-files as extern "C" {} to get rid of warnings. This commit was SVN r19132.	2008-08-04 11:49:01 +00:00
Ralph Castain	5b2f53a069	One more quick fix - ensure we are looking at the value and not its pointer This commit was SVN r19123.	2008-08-01 23:39:55 +00:00
Jeff Squyres	26c7daf16a	Fix typo This commit was SVN r19121.	2008-08-01 21:30:53 +00:00
Dan Lacher	9175da1e02	Putback for all changes to automate man page updates to strings of versions, dates and build names. Fixes trac:1387 Big thanks to Jeff and Brian for help and oversight. This commit was SVN r19120. The following Trac tickets were found above: Ticket 1387 --> https://svn.open-mpi.org/trac/ompi/ticket/1387	2008-08-01 21:14:37 +00:00
Ralph Castain	21cd4b9df8	Add pls_rsh_agent synonym to the PLM rsh component This commit was SVN r19119.	2008-08-01 20:15:42 +00:00
Jeff Squyres	4bdc093746	Fixes trac:1361: mainly add new internal MCA parameter that orterun will set when it launches under debuggers using the --debug option. This commit was SVN r19116. The following Trac tickets were found above: Ticket 1361 --> https://svn.open-mpi.org/trac/ompi/ticket/1361	2008-07-31 22:11:46 +00:00
Jeff Squyres	5818eca234	Also make sure that the new INTERNAL channel doesn't close the endpoint and/or the real stderr fd in the HNP. This commit was SVN r19113.	2008-07-31 21:26:58 +00:00
Ralph Castain	2ee493c3f9	Fix some FT code to reflect change in session_dir interface This commit was SVN r19106.	2008-07-31 14:53:18 +00:00
Ralph Castain	a62b2a0150	Per the July technical meeting: Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways. Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument. For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on your cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work. This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do. To solve this problem, we did the following: 1. removed all mca params from the individual plms for the launch agent 2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted". 3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values 4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry. 5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct. This commit was SVN r19097.	2008-07-30 18:26:24 +00:00
Ralph Castain	01a7259a7d	This fixes ticket #1426 - mpirun is cleaning up ALL session dirs Mpirun - and the orteds - were doing their best to whack all session dirs on their nodes just in case there was something lingering due to an abnormal termination. Unfortunately, they were -too- good at it. They were whacking all session directories under the user's name, even those from other mpiruns! This adds another layer to the session dir tree so that we can denote which jobs come from our own job family, and restricts the cleanup operation to only session dirs from within our own job family. So we'll still cleanup anything due to our own mpirun, but won't whack any other mpirun from this user. Call it being polite... This commit was SVN r19083.	2008-07-29 18:58:35 +00:00
Ralph Castain	d45d728e8e	Allow debuggers to attach to a running mpirun by -always- setting up the MPIR_Proctable. Only wait for MPIR_Breakpoint and hold MPI procs if we are launching under a debugger. This commit was SVN r19079.	2008-07-29 17:39:16 +00:00
George Bosilca	a4d905db4a	Allow xgrid to compile. This commit was SVN r19076.	2008-07-29 13:24:08 +00:00
Ralph Castain	1210a96d82	Ensure a value gets defined before used...thanks Jeff This commit was SVN r19075.	2008-07-29 13:08:45 +00:00
Jeff Squyres	0af7ac53f2	Fixes trac:1392, #1400 * add "register" function to mca_base_component_t * converted coll:basic and paffinity:linux and paffinity:solaris to use this function * we'll convert the rest over time (I'll file a ticket once all this is committed) * add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier * new mca_base_component_t size: 196 bytes * new mca_base_component_data_2_0_0_t size: 36 bytes * MCA base version bumped to v2.0 * '''We now refuse to load components that are not MCA v2.0.x''' * all MCA frameworks versions bumped to v2.0 * be a little more explicit about version numbers in the MCA base * add big comment in mca.h about versioning philosophy This commit was SVN r19073. The following Trac tickets were found above: Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392	2008-07-28 22:40:57 +00:00
Ralph Castain	1a77b15523	Modify the handling of hostfiles to allow them to subdivide allocations. Utilize the "slots_alloc" field of the orte_node_t object - which had previously been unused - to track the #slots allocated to a given app_context. Let the hostfile filtering action utilize the #slots field to modify the allocated slots for each app_context. This commit was SVN r19066.	2008-07-28 15:10:40 +00:00
Ralph Castain	3107545709	Ensure that ORTE processes such as mpirun and orted never inadvertently bind themselves to cores. Change the mca param name used by the rank_file mapper to get user directives on slot lists to be different from that used by MPI procs to discover their binding. Add a cmd line option to orterun to make it easier for a user to specify the slot list (basically, hide the mca param name). Discussed and reviewed with Lenny and Jeff. This commit was SVN r19062.	2008-07-28 14:18:36 +00:00
Ralph Castain	0735d6f1c2	This commit fixes ticket #1414 Cleanup the logic in the odls for when processes terminate. It turns out that we were only going through the kill_proc logic once instead of looping over all local children when we ordered a daemon to kill its local procs. This went unnoticed for some time as for most systems the local procs were terminated anyway when the daemon terminated due to the parent/child relationship. Solaris is apparently different - the children are not automatically terminated when the parent dies. As a result, it acts as a detector for this bug. Mucho thanks to Rolf V. for his help in debugging - and to IM for letting me follow his gdb progress in quasi real-time! This commit was SVN r19044.	2008-07-26 02:54:43 +00:00
Ralph Castain	d5a916d350	Fix a problem reported by IBM: nolocal and bynode combined to map byslot. Problem actually was that any time multiple mapping policy directives were provided, we would only map byslot due to incorrect if statement conditions. Thanks to Kris Davis for his patience while we tracked this down! This commit was SVN r19039.	2008-07-25 17:50:46 +00:00
Ralph Castain	718cceddaa	Ensure that we only launch procs on the HNP if that node is actually included in the allocation. This commit was SVN r19038.	2008-07-25 17:13:22 +00:00
Ralph Castain	cb93775cca	Just for the AR - remove unnecessary typecast This commit was SVN r19034.	2008-07-25 15:30:37 +00:00
Thomas Herault	28dc80b67e	Deal with the SIGCHLD issue in LSF. lsb_launch tampers with SIGCHLD signal handler. We are forced to reinstall our own signal handler after a call to this function. This commit fixes trac:1356. This commit was SVN r19033. The following Trac tickets were found above: Ticket 1356 --> https://svn.open-mpi.org/trac/ompi/ticket/1356	2008-07-25 15:23:23 +00:00
Ralph Castain	7e6e104fc3	Add more debugging to the RML when it fails to find a route - specifically, have it print a stacktrace so we can figure out where it came from. This commit was SVN r19032.	2008-07-25 15:01:41 +00:00
Ralph Castain	42c134cb32	Silence stupid compiler warning - and a certain someone who keeps reminding me of it... :-) This commit was SVN r19031.	2008-07-25 14:01:06 +00:00
Ralph Castain	a1d296ae03	This commit fixes ticket #1410 Fix a few bugs in the mappers: 1. Ensure that bynode with no -np fills all available slots - it just does so with the ranks set bynode instead of byslot 2. fix --nolocal behavior so it works correctly in all cases. We still have to test the host's name using opal_ifislocal in the mapper because the name returned by gethostname to orte_process_info.hostname can be an FQDN, but a hostfile may contain a non-FQDN version. 3. Add missing --nolocal logic to the seq mapper Oversubscribed mapping seemed to be working okay without repair, so I couldn't verify my own bug report in that regard. Also included are some preliminary changes to support the modified hostfile behavior, which will be committed shortly: 1. removed the totally useless "allocate" field in the orte_node_t object since every node is automatically allocated for use - and everything ignored the field anyway 2. correctly initialize the slots_alloc field when the allocation is read This commit was SVN r19030.	2008-07-25 13:35:12 +00:00
Ralph Castain	fdb2408bf2	Rename the osx paffinity component the "posix" component since it really has nothing osx specific in it - it is just a generic posix call to determine #processors. Set the priority low so that both linux and solaris components override it if they build. It shouldn't build in Windows at all. Modify the odls to remove a (size_t) typecast in front of the num_processors variable just in case it is returned negative. This usually is accompanied by an opal_error, so this shouldn't make any difference - but it is more technically correct. This commit was SVN r19008.	2008-07-24 01:54:51 +00:00
Lenny Verkhovsky	b4d54dda57	Fixed possible segfault when using RANKFILE, but not all ranks assigned. Fixed allocation of all ranks when using RANKFILE, but not all ranks assigned. Aborting a little earlier if using RANKFILE, but np wasn't specified. Clean mca_rmaps_rank_file_component.debug. This commit was SVN r19004.	2008-07-23 17:44:02 +00:00
Shiqing Fan	0646cd2491	- Move wait object instance code out of the #ifdef block, so that systems with waitpid and Windows can both use it. Thanks to Ralph. This commit was SVN r19003.	2008-07-23 16:20:42 +00:00
Ralph Castain	e3c3d28bf1	Add some more debugging to tell us how many processors were found when setting sched_yield This commit was SVN r18999.	2008-07-23 15:28:51 +00:00
Thomas Herault	b6affd35e9	Small typos for LSF compilation and update Makefile.am This commit was SVN r18998.	2008-07-23 14:42:26 +00:00
Ralph Castain	83e7c19d33	Remove deprecated function - this was incorporated into the paffinity framework a long time ago. Fortunately, nobody was actually using it! This commit was SVN r18990.	2008-07-23 03:43:31 +00:00
Ralph Castain	dbc35b60f6	Okay, one last time - get the xml output of the map correct...sigh. This commit was SVN r18988.	2008-07-23 02:45:08 +00:00
Ralph Castain	76f2659527	Very minor cleanup to slurm support This commit was SVN r18987.	2008-07-23 02:35:03 +00:00
Ralph Castain	1f665425e7	Fix some compile problems in the LSF support This commit was SVN r18986.	2008-07-23 02:34:41 +00:00
Rolf vandeVaart	ed4920ba5f	Fix a couple problems with orte-clean. Also add a new --debug flag to help developers figure out possible future issues. This fixes trac:1335. This commit was SVN r18979. The following Trac tickets were found above: Ticket 1335 --> https://svn.open-mpi.org/trac/ompi/ticket/1335	2008-07-22 17:41:06 +00:00
Ralph Castain	26cfac94e6	Fix a formatting problem with xml output of map This commit was SVN r18976.	2008-07-22 13:14:02 +00:00
Ralph Castain	a4f0fa6e3a	Update the routed framework to: 1. add a new API delete_route(orte_process_name_t*) to delete the specified proc from the routing table 2. modify update_route so that it actually updates pre-existing routes instead of only adding routing info to the end of the hash table This fixes ticket #1403 This commit was SVN r18970.	2008-07-21 21:37:09 +00:00
Jeff Squyres	54dbd95243	Fix some component version numbers to be the same as the OMPI release This commit was SVN r18965.	2008-07-21 20:05:29 +00:00
Ralph Castain	3137ed9255	Update the manpages for comm_spawn(_multiple) - add man page to explain host/hostfile behavior This commit was SVN r18961.	2008-07-21 17:58:12 +00:00
George Bosilca	bcac9a0540	Remove a warning about using map when it is not initialized. This commit was SVN r18957.	2008-07-21 14:35:05 +00:00
Jeff Squyres	750ea30961	So apparently my clever fix in r18873 was not good -- apparently, we can have a pub_endpoint and a sub_endpoint that are not equal but go to the same place (fd). I didn't think that that was possible. :-\ So just use a bool to track whether we have forwarded the fragment at all; if we have, then don't forward to the sub_endpoint. IOF is going to be re-written for v1.4. This commit was SVN r18950. The following SVN revision numbers were found above: r18873 --> open-mpi/ompi@773c92a6eb	2008-07-18 20:04:26 +00:00
Ralph Castain	6135943382	Update the paffinity call in the ODLS so we retrieve the number of processors on the local node, thus allowing us to correctly set the sched_yield parameter. This commit was SVN r18946.	2008-07-18 19:19:16 +00:00

1819 commits