openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	90cfd139cf	Cleanup error - need an "and" instead of an "or" This commit was SVN r29037.	2013-08-16 21:41:59 +00:00
Ralph Castain	e4e678e234	Per the RFC and discussion on the devel list, update the RTE-MPI error handling interface. There are a few differences in the code from the original RFC that came out of the discussion - I've captured those in the following writeup George and I were talking about ORTE's error handling the other day in regards to the right way to deal with errors in the updated OOB. Specifically, it seemed a bad idea for a library such as ORTE to be aborting the job on its own prerogative. If we lose a connection or cannot send a message, then we really should just report it upwards and let the application and/or upper layers decide what to do about it. The current code base only allows a single error callback to exist, which seemed unduly limiting. So, based on the conversation, I've modified the errmgr interface to provide a mechanism for registering any number of error handlers (this replaces the current "set_fault_callback" API). When an error occurs, these handlers will be called in order until one responds that the error has been "resolved" - i.e., no further action is required - by returning OMPI_SUCCESS. The default MPI layer error handler is specified to go "last" and calls mpi_abort, so the current "abort" behavior is preserved unless other error handlers are registered. In the register_callback function, I provide an "order" param so you can specify "this callback must come first" or "this callback must come last". Seemed to me that we will probably have different code areas registering callbacks, and one might require it go first (the default "abort" will always require it go last). So you can append and prepend, or go first. Note that only one registration can declare itself "first" or "last", and since the default "abort" callback automatically takes "last", that one isn't available. :-) The errhandler callback function passes an opal_pointer_array of structs, each of which contains the name of the proc involved (which can be yourself for internal errors) and the error code. This is a change from the current fault callback which returned an opal_pointer_array of just process names. Rationale is that you might need to see the cause of the error to decide what action to take. I realize that isn't a requirement for remote procs, but remember that we will use the SAME interface to report RTE errors internal to the proc itself. In those cases, you really do need to see the error code. It is legal to pass a NULL for the pointer array (e.g., when reporting an internal failure without error code), so handlers must be prepared for that possibility. If people find that too burdensome, we can remove it. Should we ever decide to create a separate callback path for internal errors vs remote process failures, or if we decide to do something different based on experience, then we can adjust this API. This commit was SVN r28852.	2013-07-19 01:08:53 +00:00
Nathan Hjelm	db9bce0926	Add destructors for MPI_T error codes This commit was SVN r28618.	2013-06-12 14:58:14 +00:00
Nathan Hjelm	e48bd9809e	Add useful messages for MPI_T error codes This commit was SVN r28584.	2013-06-04 23:18:44 +00:00
Jeff Squyres	cfd023bcda	Ensure that MPI_LASTUSEDCODE is always >= MPI_ERR_LASTCODE. Thanks to Dimitry Dontsov for reporting the issue and providing a patch. This commit was SVN r28225.	2013-03-27 01:32:27 +00:00
Jeff Squyres	d9fed36793	This destruction logic was flat-out wrong. Harmless, but wrong. Replace it with something much smaller and simpler. This commit was SVN r28154.	2013-03-07 20:37:33 +00:00
Brian Barrett	312f37706e	In talking about this with Jeff and Ralph, we don't actually need ompi_show_help, because opal_show_help is replaced with an aggregating version when using ORTE, so there's no reason to directly call orte_show_help. This commit was SVN r28051.	2013-02-12 21:10:11 +00:00
Brian Barrett	f42783ae1a	Move the RTE framework change into the trunk. With this change, all non-CR runtime code goes through one of the rte, dpm, or pubsub frameworks. This commit was SVN r27934.	2013-01-27 23:25:10 +00:00
Josh Hursey	28681deffa	Backout the ORCA commit. :( There is a linking issue on Mac OSX that needs to be addressed before this is able to come back into the trunk. This commit was SVN r26676.	2012-06-27 01:28:28 +00:00
Josh Hursey	542330e3a7	Commit of ORCA: Open MPI Runtime Collaborative Abstraction This is a runtime interposition project that sits between the OMPI and ORTE layers in Open MPI. The project is described on the wiki: https://svn.open-mpi.org/trac/ompi/wiki/Runtime_Interposition And on this email thread: http://www.open-mpi.org/community/lists/devel/2012/06/11109.php This commit was SVN r26670.	2012-06-26 21:42:16 +00:00
Jeff Squyres	5fb48bafda	Fixes trac:3093: when cleaning up, only reset items in the ompi_mpi_errcodes array to NULL if the index is not MPI_UNDEFINED. Thanks to Alexey Bayduraev for the patch. This commit was SVN r26446. The following Trac tickets were found above: Ticket 3093 --> https://svn.open-mpi.org/trac/ompi/ticket/3093	2012-05-16 15:26:32 +00:00
Jeff Squyres	253444c6d0	== Highlights == 1. New mpifort wrapper compiler: you can utilize mpif.h, use mpi, and use mpi_f08 through this one wrapper compiler 1. mpif77 and mpif90 still exist, but are sym links to mpifort and may be removed in a future release 1. The mpi module has been re-implemented and is significantly "mo' bettah" 1. The mpi_f08 module offers many, many improvements over mpif.h and the mpi module This stuff is coming from a VERY long-lived mercurial branch (3 years!); it'll almost certainly take a few SVN commits and a bunch of testing before I get it correctly committed to the SVN trunk. == More details == Craig Rasmussen and I have been working with the MPI-3 Fortran WG and Fortran J3 committees for a long, long time to make a prototype MPI-3 Fortran bindings implementation. We think we're at a stable enough state to bring this stuff back to the trunk, with the goal of including it in OMPI v1.7. Special thanks go out to everyone who has been incredibly patient and helpful to us in this journey: * Rolf Rabenseifner/HLRS (mastermind/genius behind the entire MPI-3 Fortran effort) * The Fortran J3 committee * Tobias Burnus/gfortran * Tony !Goetz/Absoft * Terry !Donte/Oracle * ...and probably others whom I'm forgetting :-( There's still opportunities for optimization in the mpi_f08 implementation, but by and large, it is as far along as it can be until Fortran compilers start implementing the new F08 dimension(..) syntax. Note that gfortran is currently unsupported for the mpi_f08 module and the new mpi module. gfortran users will a) fall back to the same mpi module implementation that is in OMPI v1.5.x, and b) not get the new mpi_f08 module. The gfortran maintainers are actively working hard to add the necessary features to support both the new mpi_f08 module and the new mpi module implementations. This will take some time. As mentioned above, ompi/mpi/f77 and ompi/mpi/f90 no longer exist. All the fortran bindings implementations have been collated under ompi/mpi/fortran; each implementation has its own subdirectory: {{{ ompi/mpi/fortran/ base/ - glue code mpif-h/ - what used to be ompi/mpi/f77 use-mpi-tkr/ - what used to be ompi/mpi/f90 use-mpi-ignore-tkr/ - new mpi module implementation use-mpi-f08/ - new mpi_f08 module implementation }}} There's also a prototype 6-function-MPI implementation under use-mpi-f08-desc that emulates the new F08 dimension(..) syntax that isn't fully available in Fortran compilers yet. We did that to prove it to ourselves that it could be done once the compilers fully support it. This directory/implementation will likely eventually replace the use-mpi-f08 version. Other things that were done: * ompi_info grew a few new output fields to describe what level of Fortran support is included * Existing Fortran examples in examples/ were renamed; new mpi_f08 examples were added * The old Fortran MPI libraries were renamed: * libmpi_f77 -> libmpi_mpifh * libmpi_f90 -> libmpi_usempi * The configury for Fortran was consolidated and significantly slimmed down. Note that the F77 env variable is now IGNORED for configure; you should only use FC. Example: {{{ shell$ ./configure CC=icc CXX=icpc FC=ifort ... }}} All of this work was done in a Mercurial branch off the SVN trunk, and hosted at Bitbucket. This branch has got to be one of OMPI's longest-running branches. Its first commit was Tue Apr 07 23:01:46 2009 -0400 -- it's over 3 years old! :-) We think we've pulled in all relevant changes from the OMPI trunk (e.g., Fortran implementations of the new MPI-3 MPROBE stuff for mpif.h, use mpi, and use mpi_f08, and the recent Fujitsu Fortran patches). I anticipate some instability when we bring this stuff into the trunk, simply because it touches a LOT of code in the MPI layer in the OMPI code base. We'll try our best to make it as pain-free as possible, but please bear with us when it is committed. This commit was SVN r26283.	2012-04-18 15:57:29 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Josh Hursey	5af13d0d86	Adjust patch in r26172 to only set the MPI_ERROR field in the status object returned from MPI_Waitall instead of using the internal req_status object to carry it around. Note that the previous patch allowed the following test to -pass-: ompi-tests/mpich_tester/mpich_pt2pt/truncmult.c This patch makes that test -fail- due to the assumption that MPI_Wait will update the status.MPI_ERROR field. In Open MPI we do not do this, so the MPI_ERROR field being inspected will remain set to MPI_ERR_PENDING. See comments in req_wait.c for why we do this. If we change the test to not inspect the MPI_ERROR field after calling MPI_Wait successfully, then the test would pass correctly with this patch. This change was made per discussion on the below email thread: http://www.open-mpi.org/community/lists/devel/2012/03/10753.php This commit was SVN r26177. The following SVN revision numbers were found above: r26172 --> open-mpi/ompi@03a33417d5	2012-03-22 14:09:19 +00:00
Josh Hursey	03a33417d5	Add support for MPI_ERR_PENDING - Per MPI 2.2 p 60 Tested with: ompi-tests/mpich_tester/mpich_pt2pt/truncmult.c This commit was SVN r26172.	2012-03-21 17:46:15 +00:00
George Bosilca	a9511779a6	Combined patch from Fujitsu. Fixes a collections of typos over the code and man pages. cmr:v1.4:reviewer=jsquyres and cmr:v1.5:reviewer=jsquyres This commit was SVN r25781.	2012-01-26 04:22:00 +00:00
George Bosilca	efd88e10d7	Cleanup the error codes. Get rid of all the useless ones, and mark the distinction between ORTE and OMPI errors. This commit was SVN r25323.	2011-10-19 03:51:53 +00:00
Wesley Bland	e1ba09ad51	Add a resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Josh Hursey	0eb3b3b7b0	Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9335.php This commit was SVN r24775.	2011-06-15 13:10:13 +00:00
Jeff Squyres	79cf382ff3	Fix a few issues with error messages: * If something goes wrong during ompi_mpi_init, don't erroneously report that it is illegal to invoke MPI_INIT* before MPI_INIT * Aggregate help messages when possible when something goes wring during ompi_mpi_init This commit was SVN r24492.	2011-03-07 16:45:45 +00:00
Rolf vandeVaart	3d9b05ba2b	Fix bug introduced by r23463. We now handle positive error codes correctly again. Also fix a typo. Reviewed by Jeff Squyres. This commit was SVN r23531. The following SVN revision numbers were found above: r23463 --> open-mpi/ompi@2af3e6e5ae	2010-07-28 19:19:27 +00:00
Jeff Squyres	2af3e6e5ae	Minor updates: * ompi_errcode_get_mpi_code() already checks for >0 error codes; the checks in OMPI_ERRHANDLER_INVOKE, OMPI_ERRHANDLER_CHECK, and OMPI_ERRHANDLER_RETURN were superfluous. * Ensure to use/return an OPAL_SOS-decoded value in ompi_errcode_get_mpi_code(). * Symbols beginning with !__ technically belong in the compiler namespace; we shouldn't be using those. * Other minor style updates in ompi_errcode_get_mpi_code(). This commit was SVN r23463.	2010-07-21 16:27:08 +00:00
Abhishek Kulkarni	c63c4d6892	Fix bugs where (OMPI_ERROR == ) checks cannot be converted to (OMPI_SUCCESS != ) since the return codes are overloaded to return an "index" on success. The fix is to just check if the return value is positive or not, since all the SOS encoded errors are always negative. The real fix (as Ralph points out) is to change these functions (opal_pointer_array_add and mca_base_param*) to return the index as a pointer. This commit was SVN r23173.	2010-05-18 20:54:11 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns back the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Jeff Squyres	8f824177cd	Closes trac:2158. Special-case the before MPI_INIT / after MPI_FINALIZE error messages so that they can be a bit more clear than the general "an error occurred" messages that are displayed in the middle of MPI jobs. This is not really a "bug fix", but it is helpful for usability. I leave it up to the v1.4 RM's to decide if they want it for the 1.4 series or not. This commit was SVN r22382. The following Trac tickets were found above: Ticket 2158 --> https://svn.open-mpi.org/trac/ompi/ticket/2158	2010-01-07 18:16:39 +00:00
George Bosilca	db010ebc6d	We need a format to print a string. This commit was SVN r22132.	2009-10-23 02:43:13 +00:00
Jeff Squyres	c78df0d1b4	Fixes trac:2060: MPI-2.2 ticket 7, convert some function pointer typedefs from "MPI__errhandler_fn" to "MPI__errhandler_function" (and their corresponding C++ types, too). Also updated the corresponding man pages, and marked the typedefs to the now-deprecated types as deprecated. This commit was SVN r22122. The following Trac tickets were found above: Ticket 2060 --> https://svn.open-mpi.org/trac/ompi/ticket/2060	2009-10-22 16:50:45 +00:00
Rainer Keller	b572dc3591	- As discussed revert r21330, Fortran-configure info should not end up in OPAL - Will post an updated patch for the OMPI_ALIGNMENT_ parts (within C). This commit was SVN r21342. The following SVN revision numbers were found above: r21330 --> open-mpi/ompi@95596d1814	2009-06-01 19:02:34 +00:00
Rainer Keller	95596d1814	- Move alignment and size output generated by configure-tests into the OPAL namespace, eliminating cases like opal/util/arch.c testing for ompi_fortran_logical_t. As this is processor- and compiler-related information (e.g. does the compiler/architecture support REAL*16) this should have been on the OPAL layer. - Unifies f77 code using MPI_Flogical instead of opal_fortran_logical_t - Tested locally (Linux/x86-64) with mpich and intel testsuite but would like to get this week-ends MTT output - PLEASE NOTE: configure-internal macro-names and ompi_cv_ variables have not been changed, so that external platform (not in contrib/) files still work. This commit was SVN r21330.	2009-05-30 15:54:29 +00:00
Rainer Keller	4b9aecb188	- Fix a few buglet's, "suppported", ": ", and stay below 63 characters in length... This commit was SVN r21286.	2009-05-27 02:04:54 +00:00
Rainer Keller	c868ec21b0	- Fix Coverity CID 30; If there is no errhandler, can't reference it... Return early. This commit was SVN r21187.	2009-05-07 17:05:15 +00:00
Rainer Keller	221fb9dbca	... Delayed due to notifier commits earlier this day ... - Delete unnecessary header files using contrib/check_unnecessary_headers.sh after applying patches, that include headers, being "lost" due to inclusion in one of the now deleted headers... In total 817 files are touched. In ompi/mpi/c/ header files are moved up into the actual c-file, where necessary (these are the only additional #include), otherwise it is only deletions of #include (apart from the above additions required due to notifier...) - To get different MCAs (OpenIB, TM, ALPS), an earlier version was successfully compiled (yesterday) on: Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled This commit was SVN r21096.	2009-04-29 01:32:14 +00:00
Rainer Keller	bff1b2a22b	- Finally add the missing opal/util/output.h for the OPAL_OUTPUT_VERBOSE macro. - ompi/errhandler/errhandler_predefined.h: Well, just the missing fwd declarations... This commit was SVN r20820.	2009-03-17 22:37:15 +00:00
Rainer Keller	6f808d9b05	Preparation work for another commit (after RFC): - This patch solely _adds_ required headers and is rather localized The next patch (after RFC) heavily removes headers (based on script) - ompi/communicator/communicator.h: For sources that use ompi_mpi_comm_world, don't require them to include "mpi.h" - ompi/debuggers/ompi_common_dll.c: mca_topo_base_comm_1_0_0_t needs #include "ompi/mca/topo/topo.h" - ompi/errhandler/errhandler_predefined.h: ompi/communicator/communicator.h depends on this header file! To prevent recursion just have fwd declarations. #include "ompi/types.h" for fwd declarations of the main structs. - ompi/mca/btl/btl.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/mpool/base/mpool_base_tree.c: We use ompi_free_list_t and ompi_rb_tree_t, so have the proper classes - ompi/mca/op/op.h: Op is pretty self-contained: Nobody up to now has done #include "opal/class/opal_object.h" - ompi/mca/osc/pt2pt/osc_pt2pt_replyreq.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/pml/base/base.h: We use opal_lists - ompi/mca/pml/dr/pml_dr_vfrag.h: #include "opal/types.h" for ompi_ptr_t - ompi/mca/pml/ob1/pml_ob1_hdr.h: #include "ompi/mca/btl/btl.h" for mca_btl_base_segment_t - opal/dss/dss_unpack.c: #include "opal/types.h" - opal/mca/base/base.h: #include "opal/util/cmd_line.h" for opal_cmd_line_t - orte/mca/oob/tcp/oob_tcp.c: #include "opal/types.h" for opal_socklen_t - orte/mca/oob/tcp/oob_tcp.h: #include "opal/threads/threads.h" for opal_thread_t - orte/mca/oob/tcp/oob_tcp_msg.c: #include "opal/types.h" - orte/mca/oob/tcp/oob_tcp_peer.c: #include "opal/types.h" for opal_socklen_t - orte/mca/oob/tcp/oob_tcp_send.c: #include "opal/types.h" - orte/mca/plm/base/plm_base_proxy.c: #include "orte/util/name_fns.h" for ORTE_NAME_PRINT - orte/mca/rml/base/rml_base_receive.c: #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE - orte/mca/rml/oob/rml_oob_recv.c: #include "opal/types.h" for ompi_iov_base_ptr_t - orte/mca/rml/oob/rml_oob_send.c: #include "opal/types.h" for ompi_iov_base_ptr_t - orte/runtime/orte_data_server.c #include "opal/util/output.h" for OPAL_OUTPUT_VERBOSE - orte/runtime/orte_globals.h: #include "orte/util/name_fns.h" for ORTE_NAME_PRINT Tested on Linux/x86-64 This commit was SVN r20817.	2009-03-17 21:34:30 +00:00
Rainer Keller	d8cf4c0fec	- Get pgcc on XT to complain less: In case we use memcmp, strlen, strup and friends include <string.h> Also several constants.h are not included directly - Let's have mca_topo_base_cart_create return ompi-errors in ompi/mca/topo/base/topo_base_cart_create.c This commit was SVN r20773.	2009-03-13 02:10:32 +00:00
Rainer Keller	ec0ed48718	- Revert r20739 This commit was SVN r20742. The following SVN revision numbers were found above: r20739 --> open-mpi/ompi@781caee0b6	2009-03-05 21:56:03 +00:00
Rainer Keller	781caee0b6	- First of two or three patches, in orte/util/proc_info.h: Adapt orte_process_info to orte_proc_info, and change orte_proc_info() to orte_proc_info_init(). - Compiled on linux-x86-64 - Discussed with Ralph This commit was SVN r20739.	2009-03-05 20:36:44 +00:00
Terry Dontje	0178b6c45f	Added padding to predefined handle structures to maintain library version to version compatibility. This commit was SVN r20627.	2009-02-24 17:17:33 +00:00
Rainer Keller	d81443cc5a	- On the way to get the BTLs split out and lessen dependency on orte: Often, orte/util/show_help.h is included, although no functionality is required -- instead, most often opal_output.h, or orte/mca/rml/rml_types.h Please see orte_show_help_replacement.sh commited next. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration actually showed two missing #include "orte/util/show_help.h" in orte/mca/odls/base/odls_base_default_fns.c and in orte/tools/orte-top/orte-top.c Manually added these. Let's have MTT the last word. This commit was SVN r20557.	2009-02-14 02:26:12 +00:00
Jeff Squyres	ffc5d8877f	Fix a problem where we're accidentally initializing the wrong errhandler (should be initializing _errors_throw_exceptions, not _are_fatal). This bug was not a huge tragedy because the only real problem is that _are_fatal has the wrong string name with it (because MPI::Init fixes up the _errors_throw_exceptions later). This commit was SVN r20458.	2009-02-05 21:36:10 +00:00
Jeff Squyres	bbfac2dfb5	Based on a review by Ralph, no need to call getpid() or gethostname(); we already have them in orte_process_info. Refs trac:1523. This commit was SVN r19615. The following Trac tickets were found above: Ticket 1523 --> https://svn.open-mpi.org/trac/ompi/ticket/1523	2008-09-23 20:04:34 +00:00
Jeff Squyres	4c558ed637	Enable aggregation checking for "*** An error occurred..." MPI layer help messages so that users only see the message once instead of N times when their MPI app crashes. Note that there is a tradeoff here -- we now call malloc in this particular "show the error" code path. This shouldn't usually be a problem, because the errors typically displayed through this mechanism are MPI API argument problems (e.g., sending a negative count to MPI_SEND), and not memory errors. But such API argument errors could be a consequence of of a prior memory error, so there's a nonzero chance that the error failure will fail to print because malloc failed. In this case, the user can disable help message aggregation (via the orte_base_want_aggregate MCA parameter) and we'll fall back to the no-malloc code path (but without aggregation). Note that we won't aggregate before MPI_INIT or after MPI_FINALIZE. So if you call an MPI function before MPI_INIT / after MPI_FINALIZE, you'll still see the error message N times. Nothing we can do about that; we need ORTE to do the aggregation properly (which is obviously unavailable before MPI_INIT / after MPI_FINALIZE). This commit was SVN r19611.	2008-09-23 17:19:24 +00:00
Jeff Squyres	d6696c46a6	Oops -- sometimes we actually pass NULL for the error_code. Make sure to handle that nicely without segv'ing. This commit was SVN r19603.	2008-09-22 17:41:39 +00:00
Jeff Squyres	5fd742e769	Add in the standardized way to notify a debugger if the MPI job is about to abort. Fixes trac:1509. This commit was SVN r19596. The following Trac tickets were found above: Ticket 1509 --> https://svn.open-mpi.org/trac/ompi/ticket/1509	2008-09-20 11:34:37 +00:00
Jeff Squyres	262f865e77	Fix CID 387: be safer about string handling when using fixed-length strings This commit was SVN r19174.	2008-08-06 12:15:49 +00:00
Rainer Keller	4712a73db5	- Help the compiler, by noting that errors are unlikely. This commit was SVN r19138.	2008-08-04 14:50:27 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not thei opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). * Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so you if explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent view the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing to messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. = New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Josh Hursey	99144db970	Improve checkpoint/restart support by allowing a checkpoint to progress when the process is not in the MPI library. This involves creating a separate thread for polling for a checkpoint request. This thread is active when the MPI process is not in the MPI library, and paused when the MPI process is in the library. Some MPI C interface files saw some spacing changes to conform to the coding standards of Open MPI. Changed MPI C interface files to use {{{OPAL_CR_ENTER_LIBRARY()}}} and {{{OPAL_CR_EXIT_LIBRARY()}}} instead of just {{{OPAL_CR_TEST_CHECKPOINT_READY()}}}. This will allow the checkpoint/restart system more flexibility in how it is to behave. Fixed the configure check for {{{--enable-ft-thread}}} so it has a know dependance on {{{--enable-mpi-thread}}} (and/or {{{--enable-progress-thread}}}). Added a line for Checkpoint/Restart support to {{{ompi_info}}}. Added some options to choose at runtime whether or not to use the checkpoint polling thread. By default, if the user asked for it to be compiled in, then it is used. But some users will want the ability to toggle its use at runtime. There are still some places for improvement, but the feature works correctly. As always with Checkpoint/Restart, it is compiled out unless explicitly asked for at configure time. Further, if it was configured in, then it is not used unless explicitly asked for by the user at runtime. This commit was SVN r17516.	2008-02-19 22:15:52 +00:00
Tim Prins	f722cc0fa2	Fixes trac:1216 Add a missing break to the outer switch statement of ompi_errhandler_invoke. While I'm there, remove a couple of TABs. This commit was SVN r17489. The following Trac tickets were found above: Ticket 1216 --> https://svn.open-mpi.org/trac/ompi/ticket/1216	2008-02-18 13:03:39 +00:00

1 2

86 Коммитов