1
1
Граф коммитов

66 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
780c93ee57 Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
Elena
03fc809bc9 This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory. 2014-11-06 16:01:19 +02:00
Ralph Castain
1ae34da5e5 Add an attributes parameter to the dstore.open function so we can pass directives to the active storage component. This can, for example, include the backing file info for a new shared memory segment. 2014-10-10 12:13:25 -07:00
Ralph Castain
fd6a044b7f Cleanup some cruft resulting from the move of the btl's to opal. We had created the ability to delay modex operations, which included a need to delay retrieving hostname info for remote procs. This allowed us to not retrieve the modex info until first message unless required - the hostname is generally only required for debug and error messages.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
2014-10-03 16:02:57 -06:00
Ralph Castain
dfb952fa78 [Contribution from Artem - moved it to svn from git for him]
Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup.

This commit was SVN r32738.
2014-09-15 18:00:46 +00:00
Ralph Castain
8f1b9b463e Fix shared memory operations - need to pass the local topology and cpusets of all local peers so we can properly compute relative locality for them. Also need to set default locality to "on node" in case where cpusets are not passed because procs are not bound.
This commit was SVN r32577.
2014-08-22 05:17:51 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “lmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Gilles Gouaillardet
f24699623f check-help-strings cleanup
This commit was SVN r32495.
2014-08-11 03:25:22 +00:00
Ralph Castain
daeb9b6c4f Some more cleanups. Remove direct references to ORTE by changing OMPI_CAST_ORTE_NAME -> OMPI_CAST_RTE_NAME. Ensure that ORTE tools (mpirun, orted, tools) set the OPAL proc structure fields so OPAL knows what is going on and uses the correct print functions (still need to fix the problem for non-MPI apps). Properly return uint32_t from the opal utilities instead of int32_t as that is what the ORTE process name fields contain.
Thanks to Gilles for pointing out some of the discrepancies.

This commit was SVN r32398.
2014-08-01 14:44:11 +00:00
Ralph Castain
1d8c061687 Fix a race condition that could result in assert failures during finalize. Ensure we shutdown the orte progress thread prior to finalizing the rml/oob frameworks so that no async operations are executing during destruct of the base-level lists and objects.
cmr=v1.7.5:reviewer=jsquyres:subject=fix race condition in finalize

This commit was SVN r30641.
2014-02-08 22:04:19 +00:00
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in  https://bitbucket.org/rhc/ompi-oob2


WHAT:    Rewrite of ORTE OOB

WHY:       Support asynchronous progress and a host of other features

WHEN:    Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
Ralph Castain
27e3e382d5 No need for ORTE tools to use orte progress thread
This commit was SVN r28445.
2013-05-04 21:13:20 +00:00
Nathan Hjelm
c041156f60 Update ORTE frameworks to use the MCA framework system.
This commit was SVN r28240.
2013-03-27 21:14:43 +00:00
Ralph Castain
81a8e21939 Need to have the event thread running during init/finalize, but we still have a problem with cleanup - so comment out the event_base_free for now.
This commit was SVN r27738.
2013-01-03 02:16:57 +00:00
Ralph Castain
cada035f38 Fix the segfault problem in the orteds - turns out it only occurred with progress threads enabled. Ensure the thread gets started at the right time (at the end of init), although the event base gets created earlier. Remove the finalize event as we can instead use the loopbreak call to exit the event loop.
This commit was SVN r27721.
2012-12-25 19:30:18 +00:00
Ralph Castain
da6428a822 Add an MCA param to help debug the ORTE progress thread
This commit was SVN r27614.
2012-11-15 15:54:38 +00:00
Ralph Castain
fefec03e78 Enable all ORTE tools to use progress threads if they are enabled
This commit was SVN r27593.
2012-11-12 02:54:09 +00:00
Ralph Castain
4e52a15e70 Provide for sync on seek and close DFS operations. Eliminate an unnecessary wake-up timer when using ORTE progress thread
This commit was SVN r27500.
2012-10-26 15:49:04 +00:00
Ralph Castain
79e36413c2 There was some discussion of this at an earlier time, but we never got around to doing it - so make orte behave more like a regular library, counting the number of times init is called, and executing finalize when all those are exhausted.
This commit was SVN r27484.
2012-10-25 18:39:37 +00:00
Ralph Castain
9506ac1617 Remove debug
This commit was SVN r26592.
2012-06-11 20:02:53 +00:00
Ralph Castain
269cb2b8d9 Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base.
This commit was SVN r26591.
2012-06-11 19:59:53 +00:00
Ralph Castain
83d69b6c95 Enable the ORTE progress thread for apps (not needed in the tools as they already continuously loop in the event lib). This appears to be working, at least for MPI apps that only use shared memory (a simple "hello"). More testing is required to identify where problems will occur - this is only intended to allow further development.
In order to use the progress thread, you must configure with:

--enable-orte-progress-threads --enable-event-thread-support

This commit was SVN r26457.
2012-05-20 15:14:43 +00:00
Jeff Squyres
2ba10c37fe Per RFC, bring in the following changes:
* Remove paffinity, maffinity, and carto frameworks -- they've been
   wholly replaced by hwloc.
 * Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
 * Update sm, smcuda, wv, and openib components to no longer use carto.
   Instead, use hwloc data.  There are still optimizations possible in
   the sm/smcuda BTLs (i.e., making multiple mpools).  Also, the old
   carto-based code found out how many NUMA nodes were ''available''
   -- not how many were used ''in this job''.  The new hwloc-using
   code computes the same value -- it was not updated to calculate how
   many NUMA nodes are used ''by this job.''
   * Note that I cannot compile the smcuda and wv BTLs -- I ''think''
     they're right, but they need to be verified by their owners.
 * The openib component now does a bunch of stuff to figure out where
   "near" OpenFabrics devices are.  '''THIS IS A CHANGE IN DEFAULT
   BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
   (I do not have a NUMA machine with an OpenFabrics device that is a
   non-uniform distance from multiple different NUMA nodes).
 * Completely rewrite the OMPI_Affinity_str() routine from the
   "affinity" mpiext extension.  This extension now understands
   hyperthreads; the output format of it has changed a bit to reflect
   this new information.
 * Bunches of minor changes around the code base to update names/types
   from maffinity/paffinity-based names to hwloc-based names.
 * Add some helper functions into the hwloc base, mainly having to do
   with the fact that we have the hwloc data reporting ''all''
   topology information, but sometimes you really only want the
   (online | available) data.

This commit was SVN r26391.
2012-05-07 14:52:54 +00:00
Ralph Castain
48de3a2501 Minor fixes so orte_progress_thread can work
This commit was SVN r26248.
2012-04-06 15:50:49 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
891027c10d Cleanup error reports
This commit was SVN r25420.
2011-11-02 18:34:19 +00:00
George Bosilca
749b63c09d Provide a generic fix for the termination issue instead of r25248. The
termination condition is to be checked at the daemon/HNP level not down
in the routing.

This commit was SVN r25313.

The following SVN revision numbers were found above:
  r25248 --> open-mpi/ompi@b42ccc89b8
2011-10-18 03:07:37 +00:00
Ralph Castain
f1a5a26ba0 Minor cleanups
This commit was SVN r25289.
2011-10-14 18:46:03 +00:00
Ralph Castain
b42ccc89b8 Although this didn't solve the earlier termination problem, the code will be required once we get connection terminations properly detected. If a daemon (or HNP) is trying to terminate, then we need to check for termination conditions whenever a route is lost - when all child connections are gone, then we are free to finalize.
This commit was SVN r25248.
2011-10-10 21:41:49 +00:00
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Abhishek Kulkarni
afbe3e99c6 * Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with
(OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
 SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
 back the native error code.

* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
  (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
  decode 'ret' to get the native error code.

This commit was SVN r23162.
2010-05-17 23:08:56 +00:00
Ralph Castain
efbb5c9b7c Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base:
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.

* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.

* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.

* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons

* update the routed modules to reflect that process state is updated in the errmgr

* ensure that the orted's open the errmgr and select their appropriate module

* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place

* define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired.

* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios

This commit was SVN r23023.
2010-04-23 04:44:41 +00:00
Ralph Castain
c88fe1ea54 Create a new mca parameter to control creation of session directories. Defaults to true so that the current behavior of always creating them is preserved. If set to false (0), then don't create session directories. Helps in those environments where session directories are a problem.
Tell the sm btl that it cannot run if no session directories were created.

This commit was SVN r22756.
2010-03-02 15:18:33 +00:00
Jeff Squyres
16b100219d A patch from UTK to allow orte_init(), opal_init(), and associated
friends also receive &argc and &argv (George asked Jeff to Ralph to
review before committing).  The thought is that passing argv and argc
to opal/orte_init be useful to other projects outside of OMPI that are
using OPAL and/or ORTE (especially in conjunction with some other
bootstrapping code where it is helpful to modify argv).  It's such a
small thing that it's easy to apply here to make others' lives a
little easier.

Ask George for more details; I'm just the messenger.  :-)

Judging by the copyrights on this patch, it's been around for a
while.  :-)

This commit was SVN r22260.
2009-12-04 00:51:15 +00:00
Greg Koenig
60485ff95f This is a very large change to rename several #define values from
OMPI_* to OPAL_*.  This allows opal layer to be used more independent
from the whole of ompi.

NOTE: 9 "svn mv" operations immediately follow this commit.

This commit was SVN r21180.
2009-05-06 20:11:28 +00:00
Ralph Castain
4be24521aa Modify the orte_process_info structure to handle a broader range of process types by replacing the individual booleans with a 32-bit bitmap. Use a set of #define's to define the individual bits, and a set of matching macros to test for them. Update the orte code base to use the macros instead of the booleans.
Minor mod to the ompi layer to use the new #define's - just one-line name replacements.

This commit was SVN r21144.
2009-05-04 11:07:40 +00:00
Ralph Castain
5fa3b38d3c Revert r21097 as this results in multiple instantiations of global variables. Instead, fix the problem by including orte_globals.h in the orte_init.c.
Since I already had some changes in there, add in the rmaps rank_file changes - should work okay, but not fully tested.

This commit was SVN r21099.

The following SVN revision numbers were found above:
  r21097 --> open-mpi/ompi@88ae934c26
2009-04-29 02:13:14 +00:00
Rainer Keller
ec0ed48718 - Revert r20739
This commit was SVN r20742.

The following SVN revision numbers were found above:
  r20739 --> open-mpi/ompi@781caee0b6
2009-03-05 21:56:03 +00:00
Rainer Keller
781caee0b6 - First of two or three patches, in orte/util/proc_info.h:
Adapt orte_process_info to orte_proc_info, and
   change orte_proc_info() to orte_proc_info_init().
 - Compiled on linux-x86-64
 - Discussed with Ralph

This commit was SVN r20739.
2009-03-05 20:36:44 +00:00
Rainer Keller
d81443cc5a - On the way to get the BTLs split out and lessen dependency on orte:
Often, orte/util/show_help.h is included, although no functionality
   is required -- instead, most often opal_output.h, or               
   orte/mca/rml/rml_types.h                                           
   Please see orte_show_help_replacement.sh commited next.            

 - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration
   actually showed two *missing* #include "orte/util/show_help.h"     
   in orte/mca/odls/base/odls_base_default_fns.c and                  
   in orte/tools/orte-top/orte-top.c                                  
   Manually added these.                                              

   Let's have MTT the last word.

This commit was SVN r20557.
2009-02-14 02:26:12 +00:00
Brian Barrett
cbd6749c22 Move the lock initialization back to orte_init so that the finalize lock
is properly initialized and available in all cases (like ompi_info, where
the ess is never actually initialized).  Fixes trac:1364.

This commit was SVN r18733.

The following Trac tickets were found above:
  Ticket 1364 --> https://svn.open-mpi.org/trac/ompi/ticket/1364
2008-06-25 03:18:37 +00:00
Ralph Castain
b118779c08 It is okay for us to init the ORTE mca params multiple times. Indeed, it is absolutely required by orterun as the first time has to be done prior to parsing the command line, which means that the mca values haven't been parsed yet!
Add ability for sys admins to prohibit putting session directories under specified locations. Thus, they can now protect parallel file systems from foolish user mistakes.

This commit was SVN r18721.
2008-06-24 17:50:56 +00:00
Ralph Castain
0532d799d6 Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm.
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.

This commit was SVN r18664.
2008-06-18 03:15:56 +00:00
Ralph Castain
9613b3176c Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.

I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.

This commit was SVN r18619.
2008-06-09 14:53:58 +00:00
Ralph Castain
0da811ce79 Initial work on xml support - allocation and job map outputs completed. More to come.
This commit was SVN r18587.
2008-06-04 20:53:12 +00:00
Ralph Castain
828ae26d90 ORTE-level MCA params are defined in several places. Ompi_info cannot call orte_init due to an issue with the memory allocator, thus making it impossible for ompi_info to display all of the ORTE-level MCA params.
By consolidating them all into one function, ompi_info can call that function and register the desired variables. This also requires, however, that ompi_info call orte_output_init to avoid generating tons of error messages, so make that adjustment too. 

Fixes ticket #1314

In addition, orte_output has a race condition issue whereby calls to orte_output/verbose can occur prior to either the RML being defined/setup, or the HNP being defined. This latter occurs during the initialization of the orte_process_info structure. In both cases, there is no way orte_output can send the output to the HNP. Hence, the message must be simply output locally.

Fixes ticket #1315

This commit was SVN r18524.
2008-05-28 13:29:58 +00:00
Terry Dontje
ef7ac86929 created opal_version_string and orte_version_string to match the ompi changes
made in r18345 for ompi_version_string.  This was done per request from Jeff 
Squyres to maintain consistency and to remove some warnings caused by the 
non-use of some static const char.

This commit was SVN r18461.

The following SVN revision numbers were found above:
  r18345 --> open-mpi/ompi@8dd0421015
2008-05-20 12:13:19 +00:00
Jeff Squyres
e7ecd56bd2 This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.

= ORTE Job-Level Output Messages =

Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):

 * orte_output(): (and corresponding friends ORTE_OUTPUT,
   orte_output_verbose, etc.)  This function sends the output directly
   to the HNP for processing as part of a job-specific output
   channel.  It supports all the same outputs as opal_output()
   (syslog, file, stdout, stderr), but for stdout/stderr, the output
   is sent to the HNP for processing and output.  More on this below.
 * orte_show_help(): This function is a drop-in-replacement for
   opal_show_help(), with two differences in functionality:
   1. the rendered text help message output is sent to the HNP for
      display (rather than outputting directly into the process' stderr
      stream)
   1. the HNP detects duplicate help messages and does not display them
      (so that you don't see the same error message N times, once from
      each of your N MPI processes); instead, it counts "new" instances
      of the help message and displays a message every ~5 seconds when
      there are new ones ("I got X new copies of the help message...")

opal_show_help and opal_output still exist, but they only output in
the current process.  The intent for the new orte_* functions is that
they can apply job-level intelligence to the output.  As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.

=== New code ===

For ORTE and OMPI programmers, here's what you need to do differently
in new code:

 * Do not include opal/util/show_help.h or opal/util/output.h.
   Instead, include orte/util/output.h (this one header file has
   declarations for both the orte_output() series of functions and
   orte_show_help()).
 * Effectively s/opal_output/orte_output/gi throughout your code.
   Note that orte_output_open() takes a slightly different argument
   list (as a way to pass data to the filtering stream -- see below),
   so you if explicitly call opal_output_open(), you'll need to
   slightly adapt to the new signature of orte_output_open().
 * Literally s/opal_show_help/orte_show_help/.  The function signature
   is identical.

=== Notes ===

 * orte_output'ing to stream 0 will do similar to what
   opal_output'ing did, so leaving a hard-coded "0" as the first
   argument is safe.
 * For systems that do not use ORTE's RML or the HNP, the effect of
   orte_output_* and orte_show_help will be identical to their opal
   counterparts (the additional information passed to
   orte_output_open() will be lost!).  Indeed, the orte_* functions
   simply become trivial wrappers to their opal_* counterparts.  Note
   that we have not tested this; the code is simple but it is quite
   possible that we mucked something up.

= Filter Framework =

Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr.  The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations.  The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc.  This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).

Filtering is not active by default.  Filter components must be
specifically requested, such as:

{{{
$ mpirun --mca filter xml ...
}}}

There can only be one filter component active.

= New MCA Parameters =

The new functionality described above introduces two new MCA
parameters:

 * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
   help messages will be aggregated, as described above.  If set to 0,
   all help messages will be displayed, even if they are duplicates
   (i.e., the original behavior).
 * '''orte_base_show_output_recursions''': An MCA parameter to help
   debug one of the known issues, described below.  It is likely that
   this MCA parameter will disappear before v1.3 final.

= Known Issues =

 * The XML filter component is not complete.  The current output from
   this component is preliminary and not real XML.  A bit more work
   needs to be done to configure.m4 search for an appropriate XML
   library/link it in/use it at run time.
 * There are possible recursion loops in the orte_output() and
   orte_show_help() functions -- e.g., if RML send calls orte_output()
   or orte_show_help().  We have some ideas how to fix these, but
   figured that it was ok to commit before feature freeze with known
   issues.  The code currently contains sub-optimal workarounds so
   that this will not be a problem, but it would be good to actually
   solve the problem rather than have hackish workarounds before v1.3 final.

This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
Ralph Castain
dc7f45dafd Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure.
Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code.

This commit was SVN r17926.
2008-03-23 23:10:15 +00:00
Ralph Castain
629b95a2fe Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation.
Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway.

So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked.

Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion.

This commit was SVN r17843.
2008-03-17 17:58:59 +00:00