openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Adrian Reber	c08e234af7	FT: fix compilation using --with-ft (5/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. With the changes introduced in the previous patches in this series some goto constructs for cleanup are no longer necessary and removed.	2015-03-11 14:23:33 +01:00
Adrian Reber	1c5a8df724	FT: fix compilation using --with-ft (2/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. The FT code used barrier mechanisms which have been removed with `aec5cd08bd`. This patch replaces all those different barriers with opal_pmix.fence(NULL, 0); I am not sure this is completely correct but at least a starting point for a review.	2015-03-11 14:23:33 +01:00
Adrian Reber	f45dd069bd	FT: fix compilation using --with-ft (1/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. This first patch moves orte_cr_continue_like_restart from ORTE to opal_cr_continue_like_restart in OPAL. This only leaves three calls from OPAL to ORTE in the FT code. As it is not yet 100% clear how to handle these calls the code orte_sstore.set_attr() has been #ifdef'd out for now.	2015-03-11 14:23:33 +01:00
Nathan Hjelm	1b564f62bd	Revert "Merge pull request #275 from hjelmn/btlmod" This reverts commit `ccaecf0fd6`, reversing changes made to `6a19bf85dd`.	2014-11-19 23:22:43 -07:00
Nathan Hjelm	c61e017177	pml: updates to reflect member changes in mca_btl_base_descriptor_t and mca_btl_base_module_t structures	2014-11-19 11:33:02 -07:00
Ralph Castain	552c9ca5a0	George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-) WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic. This commit was SVN r32317.	2014-07-26 00:47:28 +00:00
Nathan Hjelm	a14e0f10d4	Per RFC: Remove des_src and des_dst members from the mca_btl_base_segment_t and replace them with des_local and des_remote This change also updates the BTL version to 3.0.0. This commit does not represent the final version of BTL 3.0.0. More changes are coming. In making this change I updated all of the BTLs as well as BTL user's to use the new structure members. Please evaluate your component to ensure the changes are correct. RFC text: This is the first of several BTL interface changes I am proposing for the 1.9/2.0 release series. What: Change naming of btl descriptor members. I propose we change des_src and des_dst (and their associated counts) to be des_local and des_remote. For receive callbacks the des_local member will be used to communicate the segment information to the callback. The proposed change will include updating all of the doxygen in btl.h as well as updating all BTLs and BTL users to use the new naming scheme. Why: My btl usage makes use of both put and get operations on the same descriptor. With the current naming scheme I need to ensure that there is consistency beteen the segments described in des_src and des_dst depending on whether a put or get operation is executed. Additionally, the current naming prevents BTLs that do not require prepare/RMA matched operations (do not set MCA_BTL_FLAGS_RDMA_MATCHED) from executing multiple simultaneous put AND get operations. At the moment the descriptor can only be used with one or the other. The naming change makes it easier for BTL users to setup/modify descriptors for RMA operations as the local segment and remote segment are always in the same member field. The only issue I forsee with this change is that it will require a little more work to move BTL fixes to the 1.8 release series. This commit was SVN r32196.	2014-07-10 16:31:15 +00:00
Brian Barrett	16a1166884	Remove the proc_pml and proc_bml fields from ompi_proc_t and replace with a configure-time dynamic allocation of flags. The net result for platforms which only support BTL-based communication is a reduction of 8*nprocs bytes per process. Platforms which support both MTLs and BTLs will not see a space reduction, but will now be able to safely run both the MTL and BTL side-by-side, which will prove useful. This commit was SVN r29100.	2013-08-30 16:54:55 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. * the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. * reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Brian Barrett	312f37706e	In talking about this with Jeff and Ralph, we don't actually need ompi_show_help, because opal_show_help is replaced with an aggregating version when using ORTE, so there's no reason to directly call orte_show_help. This commit was SVN r28051.	2013-02-12 21:10:11 +00:00
Brian Barrett	f42783ae1a	Move the RTE framework change into the trunk. With this change, all non-CR runtime code goes through one of the rte, dpm, or pubsub frameworks. This commit was SVN r27934.	2013-01-27 23:25:10 +00:00
Abhishek Kulkarni	1878f276cd	Replace the pattern while(flag) { opal_progress() }; in the C/R code with the ORTE_WAIT_FOR_COMPLETION macro. This commit was SVN r26794.	2012-07-13 23:31:56 +00:00
Abhishek Kulkarni	5c58a1c9c1	Fix C/R support in the trunk. Among other things, this patch deals with the following issues: * fix ompi-checkpoint argument parsing * ompi-restart -showme prints an extraneous "Restarted child with PID" message. Move around the debug statement to avoid this. * fixes for the state machine changes This commit was SVN r26770.	2012-07-09 23:34:13 +00:00
Josh Hursey	28681deffa	Backout the ORCA commit. :( There is a linking issue on Mac OSX that needs to be addressed before this is able to come back into the trunk. This commit was SVN r26676.	2012-06-27 01:28:28 +00:00
Josh Hursey	542330e3a7	Commit of ORCA: Open MPI Runtime Collaborative Abstraction This is a runtime interposition project that sits between the OMPI and ORTE layers in Open MPI. The project is described on the wiki: https://svn.open-mpi.org/trac/ompi/wiki/Runtime_Interposition And on this email thread: http://www.open-mpi.org/community/lists/devel/2012/06/11109.php This commit was SVN r26670.	2012-06-26 21:42:16 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Brian Barrett	25d48e22fa	Implementation of the MPI-3 Matched Probe functionality. Currently only implemented in the OB1 PML, will return NOT_SUPPORTED in other PMLs. This commit was SVN r25865.	2012-02-06 17:35:21 +00:00
Rolf vandeVaart	f65309364b	Add closing comment to #ifdefs. No code changes. This commit was SVN r24083.	2010-11-23 21:42:20 +00:00
Rolf vandeVaart	9c57108e20	Change #ifdef to #if to match OMPI coding conventions. This commit was SVN r24067.	2010-11-17 20:34:15 +00:00
Rolf vandeVaart	148ed00dd1	Some more refactoring in the BFO PML. Getting it as close to OB1 PML as possible. This commit was SVN r23920.	2010-10-22 18:13:35 +00:00
Rolf vandeVaart	70fe48698c	Change some of the bfo code to be more like the ob1 code. Create some new macros and functions to handle some differences. This commit was SVN r23913.	2010-10-19 17:46:51 +00:00
Rolf vandeVaart	59e3fa8ed3	Some more formatting fixes and code refactoring. All these changes are in the bfo so this has no affect on ob1. This commit was SVN r23815.	2010-09-29 13:46:45 +00:00
Rolf vandeVaart	0331889495	Some more spaces, tabs, include file ordering changes. No real code changes here. This commit was SVN r23789.	2010-09-22 13:48:22 +00:00
Rolf vandeVaart	77560269f2	More fixes of spaces, tabs, and ordering of include files to make the 3 PMLs the same where they are the same. No real code changes. This commit was SVN r23779.	2010-09-20 21:22:33 +00:00
Josh Hursey	e12ca48cd9	A number of C/R enhancements per RFC below: http://www.open-mpi.org/community/lists/devel/2010/07/8240.php Documentation: http://osl.iu.edu/research/ft/ Major Changes: -------------- * Added C/R-enabled Debugging support. Enabled with the --enable-crdebug flag. See the following website for more information: http://osl.iu.edu/research/ft/crdebug/ * Added Stable Storage (SStore) framework for checkpoint storage * 'central' component does a direct to central storage save * 'stage' component stages checkpoints to central storage while the application continues execution. * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) * Added Compression (compress) framework to support * Add two new ErrMgr recovery policies * {{{crmig}}} C/R Process Migration * {{{autor}}} C/R Automatic Recovery * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) * {{{OMPI_CR_Restart}}} * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) * {{{OMPI_CR_Quiesce_start}}} * {{{OMPI_CR_Quiesce_checkpoint}}} * {{{OMPI_CR_Quiesce_end}}} * {{{OMPI_CR_self_register_checkpoint_callback}}} * {{{OMPI_CR_self_register_restart_callback}}} * {{{OMPI_CR_self_register_continue_callback}}} * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. * Add a progress meter to: * FileM rsh (filem_rsh_process_meter) * SnapC full (snapc_full_progress_meter) * SStore stage (sstore_stage_progress_meter) * Added 2 new command line options to ompi-restart * --showme : Display the full command line that would have been exec'ed. * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) * Deprecated some MCA params: * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared * snapc_base_store_in_place deprecated, replaced with different components of SStore * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref * snapc_base_establish_global_snapshot_dir deprecated, never well supported * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem Minor Changes: -------------- * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. * Fixes trac:2208 : Honor various TMPDIR varaibles instead of forcing {{{/tmp}}} * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. * Cleanup the CRS framework and components to work with the SStore framework. * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). * Add 'quiesce' hook to CRCP for a future enhancement. * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. * Add optional application level INC callbacks (registered through the CR MPI Ext interface). * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. * {{{opal-restart}}} also support local decompression before restarting * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata * {{{orte-restart}}} now uses the SStore framework to work with the metadata * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary\|files] options. * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. * Improve the checks for 'already checkpointing' error path. * A a recovery output timer, to show how long it takes to restart a job * Do a better job of cleaning up the old session directory on restart. * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. This commit was SVN r23587. The following Trac tickets were found above: Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924 Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097 Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161 Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192 Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208 Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342 Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413	2010-08-10 20:51:11 +00:00
Rolf vandeVaart	19d007a6fc	New PML to support failover between openib BTLs. openib BTL changes coming soon. This commit was SVN r23385.	2010-07-13 10:46:20 +00:00

27 Коммитов