openmpi

Author	SHA1	Message	Date
Wesley Bland	e1ba09ad51	Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Josh Hursey	20339a7900	Minor coding style and indentation fixes. This commit was SVN r24764.	2011-06-09 14:16:06 +00:00
Ralph Castain	f3cae3d6f3	Cleanup the handling of if_include and if_exclude arguments based on CIDR notation. Fix a bug in the new code that prevented the system from correctly matching addresses. Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide. Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result. Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages. NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing. This commit was SVN r24755.	2011-06-07 02:09:11 +00:00
Ralph Castain	1491d52bd7	Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3"). This commit was SVN r24747.	2011-06-05 19:16:42 +00:00
Ralph Castain	c3df95dd13	Prevent failure due to race condition during abnormal term This commit was SVN r24712.	2011-05-19 21:27:05 +00:00
Thomas Herault	fb3fd8fd0e	Items belonging to peer_send_queue are mca_oob_tcp_msg_t *, which are obtained through an opal_freelist. They shouldn't be released, but returned to the freelist. This commit was SVN r24679.	2011-05-03 21:03:09 +00:00
George Bosilca	d2502b14f9	Destruct the OOB TCP internal objects. This commit was SVN r24503.	2011-03-10 00:40:54 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Abhishek Kulkarni	87d2c9b31d	Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) * Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload. * Add notifier calls for C/R events and process status events in SnapC and ErrMgr components. * Fix a bug where the SnapC states and process states collide before being thrown out over the notifier. This commit was SVN r24251.	2011-01-13 20:13:49 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Nathan Hjelm	e7bfbe1d1a	added missing object initialization/destruction of mca_oob_tcp_component.tcp_listen_thread_event This commit was SVN r23958.	2010-10-26 22:09:37 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Ralph Castain	dd959f5ab6	Silence an idiotic warning This commit was SVN r23819.	2010-09-30 17:54:13 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Ralph Castain	4ecd9a0bbe	Protect against an obscure race condition that AFAICT only occurs when we are in a loop waiting to recv a message from a peer who is then killed by signal. This commit was SVN r23662.	2010-08-25 15:35:01 +00:00
Ralph Castain	f1a00c9a21	Per Jeff's inquiry, play chicken and don't assume herror exists everywhere. This commit was SVN r23656.	2010-08-24 20:46:41 +00:00
Ralph Castain	3b3cd67d07	If we are using static ports and cannot resolve a hostname, then see if the proc is on the local host. If so, then attempt to use a loopback interface to complete the connection. Only implemented for IPv4 because the if.c code has been so hashed I couldn't figure out how to do this cleanly for all cases. This commit was SVN r23647.	2010-08-24 14:14:59 +00:00
Ralph Castain	099c3aad97	Fix a major foopah that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0. This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all. Thanks to Dong Ahn (LLNL) for catching this problem! Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm. This commit was SVN r23300.	2010-06-24 05:13:53 +00:00
Ralph Castain	7c43d6c0f5	Don't drop a core file when we abort due to a lost connection This commit was SVN r23199.	2010-05-22 18:09:40 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns back the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Ralph Castain	306533fdb8	Replace a missing line that shutdown a peer that failed comm. This commit was SVN r23120.	2010-05-12 18:09:35 +00:00
Ralph Castain	4bd25f587c	Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so. Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort. This commit was SVN r23112.	2010-05-11 00:34:12 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Josh Hursey	e9b5162d79	Fix the configure logic for --with-ft so that it properly takes a comma separated list. Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those. The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed. This commit was SVN r22824.	2010-03-12 23:57:50 +00:00
Jeff Squyres	f65eebf53d	More changes for NetBSD. Thanks to Aleksej Saushev for this patch. This commit was SVN r22680.	2010-02-22 15:05:09 +00:00
Shiqing Fan	ad763c327d	Restore several linked libraries that were deleted by mistake in r22405. This commit was SVN r22415. The following SVN revision numbers were found above: r22405 --> open-mpi/ompi@872a4047ba	2010-01-14 21:50:42 +00:00
Shiqing Fan	872a4047ba	Fix the bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions. In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically, but in CMake 2.8 this behavior has changed, i.e. it only adds the dependencies without linking, which causes linking errors at compilation time. This commit was SVN r22405.	2010-01-14 18:10:20 +00:00
Ralph Castain	c877b1a5f8	Silence a compiler warning about no format This commit was SVN r21951.	2009-09-08 15:03:14 +00:00
Ralph Castain	509cc0553c	When directly launched by an RM, flag that a process is operating without daemons - i.e., standalone. Provide an error string for the new socket_not_available error. Use errmgr.abort to exit when we cannot get a socket, and ensure that the slurmd module returns the proper exit status for slurm 2.0 This commit was SVN r21868.	2009-08-22 02:58:20 +00:00
Ralph Castain	7370235c3e	Create a more specific error code for when specific sockets are not available. Ensure that slurm 2.0 gets the expected error return if the process can't start for that reason so it can take corrective action. This commit was SVN r21867.	2009-08-21 21:28:15 +00:00
Ralph Castain	7183179f56	Provide native integration with SLURM 2.0's OMPI support This commit was SVN r21865.	2009-08-21 18:03:34 +00:00
Shiqing Fan	bce2f44154	Update related .windows files with proper compiling properties, in order to have a successful DSO build. This commit was SVN r21805.	2009-08-12 08:55:58 +00:00
Shiqing Fan	0b56a8a4d5	Enable IPv6 on Windows by default, and fix two type casts for IPv6 operations. This commit was SVN r21586.	2009-07-02 14:41:03 +00:00
Ralph Castain	4adb3ed80f	Print out a more meaningful and correct error message This commit was SVN r21581.	2009-07-01 20:16:15 +00:00
Ralph Castain	0ba845fed2	Continue development of regular expression support by implementing it for slurm launches. Works for both initial (cmd line and non-cmd line) and comm_spawn launch. Additional work required to fully enable static port support when using cmd line regular expression launch system. This commit was SVN r21502.	2009-06-23 20:25:38 +00:00
Ralph Castain	87d7d693f0	Add a notifier call when the oob retries are exceeded so sys admins are aware of the problem This commit was SVN r21405.	2009-06-10 15:17:16 +00:00
Ralph Castain	3815bfbba6	Provide a better error message when the oob cannot send a message after exhausting retries, and then have the proc abort so the job doesn't just hang forever. Since it could be a daemon that needs to abort, cleanup the abort sequence so the daemon can exit as cleanly as possible. This commit was SVN r21361.	2009-06-02 23:57:12 +00:00
Jeff Squyres	5ea1b776f7	Remove a compiler warning about an empty format string. The proper way to have no abort message is to pass NULL (the errmanager is smart enough to handle this case and not emit any extra message). This commit was SVN r21311.	2009-05-28 13:32:37 +00:00
Jeff Squyres	e6a32f13bb	Add missing header file This commit was SVN r21273.	2009-05-26 20:57:44 +00:00
Iain Bason	e7ff2368d6	This fixes trac:1930. Emit a more informative error message when the file descriptor limit is reached during an accept() call. Also, abort when the accept fails to avoid an infinite loop. Emit a more informative error message when the help file can't be opened. This commit was SVN r21271. The following Trac tickets were found above: Ticket 1930 --> https://svn.open-mpi.org/trac/ompi/ticket/1930	2009-05-26 20:03:21 +00:00
Ralph Castain	f139cfd28a	Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to callback to mpirun during startup, thus eliminating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job). Add a new tm ess module that exploits this capability. Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function. Additional fixes included: 1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with! 2. fix a duplicate free when attempting to execute a non-existent app 3. clean up a typo in the comm utilities 4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps Been tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info. This commit was SVN r21248.	2009-05-16 04:15:55 +00:00
Greg Koenig	60485ff95f	This is a very large change to rename several #define values from OMPI_* to OPAL_*. This allows opal layer to be used more independent from the whole of ompi. NOTE: 9 "svn mv" operations immediately follow this commit. This commit was SVN r21180.	2009-05-06 20:11:28 +00:00
Shiqing Fan	cd565923d3	Completely remove ltdl support for Windows build. This commit was SVN r21170.	2009-05-05 18:59:13 +00:00
Ralph Castain	4be24521aa	Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans. Minor mod to the ompi layer to use the new #define's - just one-line name replacements. This commit was SVN r21144.	2009-05-04 11:07:40 +00:00
Ralph Castain	0b9116b1e3	Don't really need all those if statements...duh. Cleanup the code a bit. This commit was SVN r21139.	2009-05-01 17:11:44 +00:00
Ralph Castain	d98fc311e9	Restore the ability to specify a range of dynamic ports for use by the TCP OOB module. The range can now be specified as any combination of ranges (e.g., 1-5,8,10,21-30). The system will error out if you attempt to specify both static and dynamic ports. This commit was SVN r21138.	2009-05-01 15:57:36 +00:00
Rainer Keller	221fb9dbca	... Delayed due to notifier commits earlier this day ... - Delete unnecessary header files using contrib/check_unnecessary_headers.sh after applying patches, that include headers, being "lost" due to inclusion in one of the now deleted headers... In total 817 files are touched. In ompi/mpi/c/ header files are moved up into the actual c-file, where necessary (these are the only additional #include), otherwise it is only deletions of #include (apart from the above additions required due to notifier...) - To get different MCAs (OpenIB, TM, ALPS), an earlier version was successfully compiled (yesterday) on: Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled This commit was SVN r21096.	2009-04-29 01:32:14 +00:00
Shiqing Fan	3d4e0472d6	Add windows support files into the tarball, including .windows, CMakeLists.txt files, and CMake modules. Thanks to Jeff for testing it on Linux. This commit was SVN r21069.	2009-04-24 16:39:33 +00:00
Rainer Keller	64dcd85ba1	- This one was missing This commit was SVN r20818.	2009-03-17 22:02:51 +00:00
Rainer Keller	6f808d9b05	Preparation work for another commit (after RFC): - This patch solely _adds_ required headers and is rather localized The next patch (after RFC) heavily removes headers (based on script) - ompi/communicator/communicator.h: For sources that use ompi_mpi_comm_world, don't require them to include "mpi.h" - ompi/debuggers/ompi_common_dll.c: mca_topo_base_comm_1_0_0_t needs #include "ompi/mca/topo/topo.h" - ompi/errhandler/errhandler_predefined.h: ompi/communicator/communicator.h depends on this header file! To prevent recursion just have fwd declarations. #include "ompi/types.h" for fwd declarations of the main structs. - ompi/mca/btl/btl.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/mpool/base/mpool_base_tree.c: We use ompi_free_list_t and ompi_rb_tree_t, so have the proper classes - ompi/mca/op/op.h: Op is pretty self-contained: Nobody up to now has done #include "opal/class/opal_object.h" - ompi/mca/osc/pt2pt/osc_pt2pt_replyreq.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/pml/base/base.h: We use opal_lists - ompi/mca/pml/dr/pml_dr_vfrag.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/pml/ob1/pml_ob1_hdr.h: #include "ompi/mca/btl/btl.h" for mca_btl_base_segment_t - opal/dss/dss_unpack.c: #include "opal/types.h" - opal/mca/base/base.h: #include "opal/util/cmd_line.h" for opal_cmd_line_t - orte/mca/oob/tcp/oob_tcp.c: #include "opal/types.h" for opal_socklen_t - orte/mca/oob/tcp/oob_tcp.h: #include "opal/threads/threads.h" for opal_thread_t - orte/mca/oob/tcp/oob_tcp_msg.c: #include "opal/types.h" - orte/mca/oob/tcp/oob_tcp_peer.c: #include "opal/types.h" for opal_socklen_t - orte/mca/oob/tcp/oob_tcp_send.c: #include "opal/types.h" - orte/mca/plm/base/plm_base_proxy.c: #include "orte/util/name_fns.h" for ORTE_NAME_PRINT - orte/mca/rml/base/rml_base_receive.c: #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE - orte/mca/rml/oob/rml_oob_recv.c: #include "opal/types.h" for ompi_iov_base_ptr_t - orte/mca/rml/oob/rml_oob_send.c: #include "opal/types.h" for ompi_iov_base_ptr_t - orte/runtime/orte_data_server.c #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE - orte/runtime/orte_globals.h: #include "orte/util/name_fns.h" for ORTE_NAME_PRINT Tested on Linux/x86-64 This commit was SVN r20817.	2009-03-17 21:34:30 +00:00
Rainer Keller	ec0ed48718	- Revert r20739 This commit was SVN r20742. The following SVN revision numbers were found above: r20739 --> open-mpi/ompi@781caee0b6	2009-03-05 21:56:03 +00:00
Rainer Keller	a94438343b	- Revert r20740 This commit was SVN r20741. The following SVN revision numbers were found above: r20740 --> open-mpi/ompi@2a70618a77	2009-03-05 21:50:47 +00:00
Rainer Keller	2a70618a77	- Second patch, as discussed in Louisville. Replace short macros in orte/util/name_fns.h to the actual fct. call. - Compiles on linux/x86-64 This commit was SVN r20740.	2009-03-05 21:14:18 +00:00
Rainer Keller	781caee0b6	- First of two or three patches, in orte/util/proc_info.h: Adapt orte_process_info to orte_proc_info, and change orte_proc_info() to orte_proc_info_init(). - Compiled on linux-x86-64 - Discussed with Ralph This commit was SVN r20739.	2009-03-05 20:36:44 +00:00
George Bosilca	af9c2e10a3	Really cycle when we have several IP addresses. This commit was SVN r20705.	2009-03-03 19:29:03 +00:00
Rainer Keller	96e1b9b747	- Header orte/mca/rml/rml.h is not needed if there is no occurrence of orte_rml or ORTE_RML. As with the others, compiles fine with -Wimplicit-function-declaration This commit was SVN r20639.	2009-02-26 03:52:31 +00:00
Rainer Keller	b356e90fa1	- Get rid of include orte/util/proc_info.h, if not needed Only proc_info.h-internal include file is opal/dss/dss_types.h - In one case (orte/util/hnp_contact.c) had to add proc_info.h again. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration works fine, no errors. Again, let's have MTT the last word. This commit was SVN r20631.	2009-02-25 03:38:00 +00:00
Ralph Castain	6151f7b60c	Enable static ports for application procs during self-bootstrap for non-daemon environments by letting them select what port to use based on node rank and attempting to connect to the peer on that port Note that this assumes non-shared nodes...but only takes effect if there is no prior knowledge of how to talk to the specified peer. Thus, all daemon-based environments are unaffected. This commit was SVN r20598.	2009-02-19 21:33:46 +00:00
Rainer Keller	d81443cc5a	- On the way to get the BTLs split out and lessen dependency on orte: Often, orte/util/show_help.h is included, although no functionality is required -- instead, most often opal_output.h, or orte/mca/rml/rml_types.h Please see orte_show_help_replacement.sh committed next. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration actually showed two missing #include "orte/util/show_help.h" in orte/mca/odls/base/odls_base_default_fns.c and in orte/tools/orte-top/orte-top.c Manually added these. Let's have MTT the last word. This commit was SVN r20557.	2009-02-14 02:26:12 +00:00
Jeff Squyres	91d302fd67	A bunch of minor ORTE valgrind-inspired memory leak cleanups (reviewed by Ralph). This commit was SVN r20544.	2009-02-13 04:14:10 +00:00
Ralph Castain	df3446faf1	Procs don't need to check for other job families to update routes - now that the direct routing module is gone, they always route through their daemons anyway, so save a couple of unnecessary steps. This commit was SVN r20429.	2009-02-04 22:49:57 +00:00
George Bosilca	c359762c2d	We're supposed to read a string and not an int ... This commit was SVN r20421.	2009-02-04 15:51:31 +00:00
Ralph Castain	debf128e53	Ensure the static port array is correctly checked for size This commit was SVN r20393.	2009-01-31 03:46:42 +00:00
Ralph Castain	5e6d3ba289	Initial implementation of static ports. Provide an mca param to specify static port ranges to the OOB - can provide any combination of comma-separated values and ranges. Daemons will use the first port in the range, MPI procs will use the other ports in the range assuming that they know their node rank in time and enough ports were specified. NOTE: this capability only works under specific conditions. I will outline more about this in a note to devel as the remainder of the implementation progresses. For now, the only environment where this works is slurm. The linear routed module has also been adjusted to work with static ports so that all messaging flows strictly through the topology, including the initial daemon callback - thus limiting the number of sockets opened by mpirun. This commit was SVN r20390.	2009-01-30 18:31:43 +00:00
Ralph Castain	253a54df12	Shutdown the socket before closing for cleaner termination. This commit was SVN r20283.	2009-01-15 18:06:01 +00:00
Shiqing Fan	a5281f0434	- 1/4 commit for Windows Visual Studio and CCP support: CMakeLists and .windows files. In contribs preconfigured and precompiled parts. This commit was SVN r20108.	2008-12-10 20:59:20 +00:00
Ralph Castain	55f52d7a4b	Ensure we know how to route to a different job family when it connects to us This commit was SVN r19885.	2008-11-03 14:25:14 +00:00
Ralph Castain	f54fda489e	This is a first step towards supporting fully-routed OOB communications: 1. remove direct routed module (hooray!) 2. add radix tree routed module (binomial remains default) 3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess 4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support 5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability 6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon) Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm. This will require an autogen... This commit was SVN r19866.	2008-10-31 21:10:00 +00:00
Josh Hursey	88aa45dd52	Commit to bring online OpenIB, MX, and shared memory support for Open MPI's checkpoint/restart functionality. Some tuning is still needed, but basic functionality is in place. There is still a problem with OpenIB and threads (external to C/R functionality). It has been reported in Ticket #1539 Additionally: * Fix a file cleanup bug in CRS Base. * Fix a possible deadlock in the TCP ft_event function * Add a mca_base_param_deregister() function to MCA base * Add whole process checkpoint timers * Add support for BTL: OpenIB, MX, Shared Memory * Add support Mpool: rdma, sm * Sundry bounds checking and cleanup in some scattered functions This commit was SVN r19756.	2008-10-16 15:09:00 +00:00
Shiqing Fan	3d4e89a5cd	- Remove the unused code introduced with r19480, which was for serializing tcp events on Windows and not successful. This commit was SVN r19747. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r19480	2008-10-15 08:39:30 +00:00
Shiqing Fan	8b60c755c2	- Bring r19742 into trunk. - Unify the Windows and the others way of handling callbacks. Thanks to George. - This will let Windows use the same callbacks as Linux does, which works also. This commit was SVN r19746. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r19742	2008-10-15 08:14:24 +00:00
Ralph Castain	0cc2e724f8	Separate var declaration from use to remove compiler warnings in non-debug builds This commit was SVN r19675.	2008-10-03 13:40:31 +00:00
Shiqing Fan	04ee20a880	- Mainly type casts. Microsoft VC++ compiler is too strict. This commit was SVN r19517.	2008-09-08 15:39:30 +00:00
Rainer Keller	d57ef70149	- Store the result of the 1-byte read... and assert, in case of error checking -- we don't return errors here anyway. Fixes Coverity CID 981 This commit was SVN r19259.	2008-08-12 18:00:38 +00:00
George Bosilca	d8fe05264b	Fix recursion in include files (Coverity fix 156). This commit was SVN r19181.	2008-08-06 13:50:01 +00:00
Jeff Squyres	0af7ac53f2	Fixes trac:1392, #1400 * add "register" function to mca_base_component_t * converted coll:basic and paffinity:linux and paffinity:solaris to use this function * we'll convert the rest over time (I'll file a ticket once all this is committed) * add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier * new mca_base_component_t size: 196 bytes * new mca_base_component_data_2_0_0_t size: 36 bytes * MCA base version bumped to v2.0 * '''We now refuse to load components that are not MCA v2.0.x''' * all MCA frameworks versions bumped to v2.0 * be a little more explicit about version numbers in the MCA base * add big comment in mca.h about versioning philosophy This commit was SVN r19073. The following Trac tickets were found above: Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392	2008-07-28 22:40:57 +00:00
Jeff Squyres	54dbd95243	Fix some component version numbers to be the same as the OMPI release This commit was SVN r18965.	2008-07-21 20:05:29 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Jeff Squyres	d3795d7a34	Fix CID 987: remove unused variable. This commit was SVN r18598.	2008-06-05 20:17:02 +00:00
George Bosilca	25ae9c12e6	Silence a few warnings. This commit was SVN r18568.	2008-06-03 19:58:40 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	e5e542ddcf	Clarify an error message This commit was SVN r18533.	2008-05-29 12:20:24 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not the opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
Josh Hursey	2c736873bb	Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors. The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shut down and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge. The fix is to have the OMPI level shut down tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit. Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it. * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level. * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components. * Update ft_event functions in PML and BML to handle the new restart state. * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging. This commit was SVN r18276.	2008-04-24 17:54:22 +00:00
Josh Hursey	cc83d41ad9	Merge in tmp/jjh-scratch {{{ svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch . }}} Contains: * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart. * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P. * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry * Some other sundry cleanup items all dealing with C/R functionality in the trunk. This commit was SVN r18241.	2008-04-23 00:17:12 +00:00
Adrian Knoth	84e4013530	Always declare oob_tcp_disable_family, no matter if --disable-ipv6 is set. This commit was SVN r18164.	2008-04-16 09:31:15 +00:00
Adrian Knoth	0ddfff4ffe	Added new oob-tcp parameter oob_tcp_disable_family. Like btl_tcp_disable_family, this parameter more or less disables a whole address family. Though the sockets are still created, the corresponding information isn't added to the connection strings. Likewise, we don't try to connect to addresses matching the disabled address family. This is particularly important for multidomain clusters, where IPv4 is often filtered (firewalled), sometimes by simply dropping the packets instead of rejecting them (thus causing a connection timeout instead of a quick "no route to host"). This commit was SVN r18163.	2008-04-16 09:22:00 +00:00
Ralph Castain	11c6773c83	Commit a patch from Brian that fixes potential segfaults in systems where IPv6 include files are found, but the kernel doesn't actually support IPv6. This commit was SVN r18106.	2008-04-09 12:53:24 +00:00
Adrian Knoth	a56b9b1df1	Fix broken build with --disable-ipv6. This commit was SVN r18071.	2008-04-02 10:53:48 +00:00
Ralph Castain	39c2680e9a	Silence warning This commit was SVN r18057.	2008-04-01 13:42:16 +00:00
Ralph Castain	3e8846d685	Some code cleanups from Brian to clarify port selection and opening logic This commit was SVN r18055.	2008-04-01 12:39:02 +00:00
Ralph Castain	60d931217f	Modify the routed framework to allow greater control/flexibility over response to lost routes and initial wireup of jobs as required by several soon-to-come new modules. Specifically, add two new APIs: 1. lost_route: allows the OOB to report that a connection has failed, thereby giving the routed module an opportunity to respond appropriately to its topology. Creating the API also allows each routed component to hold its own definition of "lifeline" - in some cases, this may be a single connection, but in others it may be multiple connections. Some modules may choose to re-route messaging if the lifeline or any other connection is lost, while others may choose to abort the job. Both the tree and unity modules retain the current behavior and abort the job if the lifeline connection is lost, while ignoring other lost connections. 2. get_wireup_info: returns (in a provided buffer) info required to wireup connections for the specified job. Some routed modules do not need to return any info as they can wireup via alternative means, while some need to xchg data with their peers. If info is inserted into the buffer, the plm_base_launch_apps function will xcast the contents to the specified job. The commit also removes the "lifeline" entry from the orte_process_info struct (and the associated ORTE_PROC_MY_LIFELINE definition) as the lifeline info is now contained within the respective routed module. This commit was SVN r17969.	2008-03-26 01:00:24 +00:00
Ralph Castain	f8642e9390	Add debug to tell us when we opened a socket and to whom This commit was SVN r17911.	2008-03-21 15:47:47 +00:00
Ralph Castain	19ffdfef42	Add some debugging output to tell us what interfaces were considered and used by OOB This commit was SVN r17909.	2008-03-21 15:35:40 +00:00
Ralph Castain	27a73ad9ee	Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message. This has been a long-time problem. I tried to reduce the problem by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out. What was observed was that all the orteds were correctly reporting that they are leaving, but the HNP is able to exit before the orteds, thus closing the orteds lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while. The HNP doesn't have that problem as there is no SM file there! So it gets out first. What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!! This commit was SVN r17893.	2008-03-20 13:30:51 +00:00

304 commits