2005-09-01 01:07:30 +00:00
|
|
|
/*
|
2007-03-16 23:11:45 +00:00
|
|
|
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
2005-11-05 19:57:48 +00:00
|
|
|
* University Research and Technology
|
|
|
|
* Corporation. All rights reserved.
|
2007-04-12 05:01:29 +00:00
|
|
|
* Copyright (c) 2004-2007 The University of Tennessee and The University
|
2005-11-05 19:57:48 +00:00
|
|
|
* of Tennessee Research Foundation. All rights
|
|
|
|
* reserved.
|
2005-09-01 01:07:30 +00:00
|
|
|
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
|
2004-11-28 20:09:25 +00:00
|
|
|
* University of Stuttgart. All rights reserved.
|
2005-03-24 12:43:37 +00:00
|
|
|
* Copyright (c) 2004-2005 The Regents of the University of California.
|
|
|
|
* All rights reserved.
|
Fix a number of OOB issues:
* Remove the connect() timeout code, as it had some nasty race conditions
when connections were established as the trigger was firing. A better
solution has been found for the cluster where this was needed, so just
removing it was easiest.
* When a fatal error (too many connection failures) occurs, set an error
on messages in the queue even if there isn't an active message. The
first message to any peer will be queued without being active (and
so will all subsequent messages until the connection is established),
and the orteds will hang until that first message completes. So if
an orted can never contact its peer, it will never exit and just sit
waiting for that message to complete.
* Cover an interesting RST condition in the connect code. A connection
can complete the three-way handshake, the connector can even send
some data, but the server side will drop the connection because it
can't move it from the half-connected to fully-connected state because
of space shortage in the listen backlog queue. This causes a RST to
be received the first time that recv() is called, which will be when waiting
for the remote side of the OOB ack. In this case, transition the
connection back into a CLOSED state and try to connect again.
* Add levels of debugging, rather than all or nothing, each building on
the previous level. 0 (default) is hard errors. 1 is connection
error debugging info. 2 is all connection info. 3 is more state
info. 4 includes all message info.
* Add some hopefully useful comments
This commit was SVN r14261.
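The debug levels listed in this commit message can be read as a threshold check: a message tagged with level N is emitted when the configured verbosity is at least N, so each level includes everything below it. The following is a minimal sketch of that idea, not the component's actual code. The enum names and `oob_debug_enabled()` are hypothetical; only the meanings of levels 0 through 4 come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical names; only the meanings of levels 0-4 come from the commit message.
enum oob_debug_level {
    OOB_DBG_ERROR     = 0,  // hard errors (default)
    OOB_DBG_CONN_FAIL = 1,  // connection error debugging info
    OOB_DBG_CONN_ALL  = 2,  // all connection info
    OOB_DBG_STATE     = 3,  // more state info
    OOB_DBG_MSG_ALL   = 4   // all message info
};

// A message is emitted when the configured verbosity reaches its level,
// so raising the verbosity only ever adds output and never removes any.
bool oob_debug_enabled(int verbosity, enum oob_debug_level msg_level)
{
    assert(verbosity >= 0);
    return verbosity >= (int)msg_level;
}
```

With this shape, a call site guards its output with something like `if (oob_debug_enabled(verbosity, OOB_DBG_CONN_ALL))`, and a verbosity of 2 prints hard errors, connection errors, and all connection info, but no state or message detail.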
2007-04-07 22:33:30 +00:00
|
|
|
* Copyright (c) 2006-2007 Los Alamos National Security, LLC.
|
|
|
|
* All rights reserved.
|
2004-11-22 01:38:40 +00:00
|
|
|
* $COPYRIGHT$
|
2005-09-01 01:07:30 +00:00
|
|
|
*
|
2004-11-22 01:38:40 +00:00
|
|
|
* Additional copyrights may follow
|
2005-09-01 01:07:30 +00:00
|
|
|
*
|
2004-07-01 14:49:54 +00:00
|
|
|
* $HEADER$
|
2006-03-11 03:09:24 +00:00
|
|
|
*
|
|
|
|
* In windows, many of the socket functions return an EWOULDBLOCK
|
2007-04-12 05:01:29 +00:00
|
|
|
* instead of things like EAGAIN, EINPROGRESS, etc. It has been
|
|
|
|
* verified that this will not conflict with other error codes that
|
|
|
|
* are returned by these functions under UNIX/Linux environments
|
2004-07-01 14:49:54 +00:00
|
|
|
*/
|
|
|
|
|
2006-02-12 01:33:29 +00:00
|
|
|
#include "orte_config.h"
|
2008-02-28 01:57:57 +00:00
|
|
|
#include "orte/types.h"
|
2005-03-14 20:57:21 +00:00
|
|
|
|
2004-10-22 16:06:05 +00:00
|
|
|
#ifdef HAVE_UNISTD_H
|
2004-08-02 21:24:00 +00:00
|
|
|
#include <unistd.h>
|
2004-10-22 16:06:05 +00:00
|
|
|
#endif
|
|
|
|
#ifdef HAVE_SYS_TYPES_H
|
2004-08-02 21:24:00 +00:00
|
|
|
#include <sys/types.h>
|
2004-10-22 16:06:05 +00:00
|
|
|
#endif
|
2004-08-02 21:24:00 +00:00
|
|
|
#include <fcntl.h>
|
2004-10-22 16:06:05 +00:00
|
|
|
#ifdef HAVE_NETINET_IN_H
|
2004-08-16 19:39:54 +00:00
|
|
|
#include <netinet/in.h>
|
2004-10-22 16:06:05 +00:00
|
|
|
#endif
|
|
|
|
#ifdef HAVE_ARPA_INET_H
|
2004-08-16 19:39:54 +00:00
|
|
|
#include <arpa/inet.h>
|
2004-10-22 16:06:05 +00:00
|
|
|
#endif
|
2007-07-20 01:34:02 +00:00
|
|
|
#ifdef HAVE_NETDB_H
|
|
|
|
#include <netdb.h>
|
2007-04-25 01:55:40 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
#include "opal/util/error.h"
|
2006-08-14 20:14:44 +00:00
|
|
|
#include "opal/opal_socket_errno.h"
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done in configure.m4 to search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
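The duplicate detection described for orte_show_help() above can be pictured as a small table on the HNP: the first sighting of a help message is displayed, later copies are only counted, and a periodic timer reports and resets the count. The following is a minimal sketch under stated assumptions, not the actual ORTE implementation. All names (`help_table`, `help_record`, `help_drain`) and the idea of keying on a single string are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define HELP_MAX_ENTRIES 32   // fixed-size table keeps the sketch short
#define HELP_KEY_LEN     128  // keys longer than this are compared on their prefix

struct help_entry {
    char key[HELP_KEY_LEN];   // hypothetical key identifying one help message
    int  pending;             // duplicates seen since the last summary
};

struct help_table {
    struct help_entry entries[HELP_MAX_ENTRIES];
    int count;
};

// Returns true if the message should be displayed now (first sighting, or the
// table is full and it cannot be tracked); returns false if it is a duplicate
// that was only counted.
bool help_record(struct help_table *t, const char *key)
{
    assert(t != NULL && key != NULL);
    for (int i = 0; i < t->count; i++) {
        if (0 == strncmp(t->entries[i].key, key, HELP_KEY_LEN - 1)) {
            t->entries[i].pending++;
            return false;
        }
    }
    if (t->count < HELP_MAX_ENTRIES) {
        struct help_entry *e = &t->entries[t->count++];
        strncpy(e->key, key, HELP_KEY_LEN - 1);
        e->key[HELP_KEY_LEN - 1] = '\0';
        e->pending = 0;
    }
    return true;
}

// Intended to be called from a periodic timer (the commit message mentions
// roughly every 5 seconds): returns how many new copies of `key` arrived
// since the last call and resets that counter.
int help_drain(struct help_table *t, const char *key)
{
    assert(t != NULL && key != NULL);
    for (int i = 0; i < t->count; i++) {
        if (0 == strncmp(t->entries[i].key, key, HELP_KEY_LEN - 1)) {
            int n = t->entries[i].pending;
            t->entries[i].pending = 0;
            return n;
        }
    }
    return 0;
}
```

A zero-initialized `struct help_table` is a valid empty table. Setting orte_base_help_aggregate to 0 would correspond to skipping `help_record()` entirely and displaying every message.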
2008-05-13 20:00:55 +00:00
|
|
|
#include "orte/util/output.h"
|
2005-07-04 01:36:20 +00:00
|
|
|
#include "opal/util/if.h"
|
2007-05-17 01:17:59 +00:00
|
|
|
#include "opal/util/net.h"
|
2007-03-16 23:11:45 +00:00
|
|
|
#include "opal/class/opal_hash_table.h"
|
2008-02-28 01:57:57 +00:00
|
|
|
|
2006-02-12 01:33:29 +00:00
|
|
|
#include "orte/mca/errmgr/errmgr.h"
|
2007-07-20 01:34:02 +00:00
|
|
|
#include "orte/mca/rml/rml.h"
|
2008-02-28 01:57:57 +00:00
|
|
|
#include "orte/util/name_fns.h"
|
2008-05-13 20:00:55 +00:00
|
|
|
#include "orte/util/output.h"
|
2008-02-28 01:57:57 +00:00
|
|
|
#include "orte/runtime/orte_globals.h"
|
|
|
|
|
|
|
|
#include "orte/mca/oob/tcp/oob_tcp.h"
|
2004-07-01 14:49:54 +00:00
|
|
|
|
2007-06-14 22:35:38 +00:00
|
|
|
#if defined(__WINDOWS__)
|
|
|
|
static opal_mutex_t windows_callback;
|
|
|
|
#endif /* defined(__WINDOWS__) */
|
|
|
|
|
2004-09-30 15:09:29 +00:00
|
|
|
/*
|
|
|
|
* Data structure for accepting connections.
|
|
|
|
*/
|
|
|
|
struct mca_oob_tcp_event_t {
|
2005-07-03 16:22:16 +00:00
|
|
|
opal_list_item_t item;
|
2005-07-03 23:09:55 +00:00
|
|
|
opal_event_t event;
|
2004-09-30 15:09:29 +00:00
|
|
|
};
|
|
|
|
typedef struct mca_oob_tcp_event_t mca_oob_tcp_event_t;
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2004-09-30 15:09:29 +00:00
|
|
|
static void mca_oob_tcp_event_construct(mca_oob_tcp_event_t* event)
|
|
|
|
{
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
|
2005-07-03 16:22:16 +00:00
|
|
|
opal_list_append(&mca_oob_tcp_component.tcp_events, &event->item);
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
2004-09-30 15:09:29 +00:00
|
|
|
}
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2004-09-30 15:09:29 +00:00
|
|
|
static void mca_oob_tcp_event_destruct(mca_oob_tcp_event_t* event)
|
|
|
|
{
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
|
2005-07-03 16:22:16 +00:00
|
|
|
opal_list_remove_item(&mca_oob_tcp_component.tcp_events, &event->item);
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
2004-09-30 15:09:29 +00:00
|
|
|
}
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2004-09-30 15:09:29 +00:00
|
|
|
OBJ_CLASS_INSTANCE(
|
|
|
|
mca_oob_tcp_event_t,
|
2005-07-03 16:22:16 +00:00
|
|
|
opal_list_item_t,
|
2004-09-30 15:09:29 +00:00
|
|
|
mca_oob_tcp_event_construct,
|
|
|
|
mca_oob_tcp_event_destruct);
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2004-09-30 15:09:29 +00:00
|
|
|
/*
|
|
|
|
* Local utility functions
|
|
|
|
*/
|
2004-08-02 21:24:00 +00:00
|
|
|
|
2008-04-09 12:53:24 +00:00
|
|
|
static int mca_oob_tcp_create_listen(int *target_sd, unsigned short *port, uint16_t af_family);
|
2006-09-14 21:29:51 +00:00
|
|
|
static int mca_oob_tcp_create_listen_thread(void);
|
2004-08-02 21:24:00 +00:00
|
|
|
static void mca_oob_tcp_recv_handler(int sd, short flags, void* user);
|
2007-04-25 01:55:40 +00:00
|
|
|
static void mca_oob_tcp_accept(int incoming_sd);
|
2004-08-02 21:24:00 +00:00
|
|
|
|
2006-09-14 21:29:51 +00:00
|
|
|
OBJ_CLASS_INSTANCE(
|
|
|
|
mca_oob_tcp_pending_connection_t,
|
|
|
|
opal_free_list_item_t,
|
|
|
|
NULL,
|
|
|
|
NULL);
|
|
|
|
|
2007-07-20 01:34:02 +00:00
|
|
|
OBJ_CLASS_INSTANCE(mca_oob_tcp_device_t, opal_list_item_t, NULL, NULL);
|
|
|
|
|
2007-03-16 23:11:45 +00:00
|
|
|
int mca_oob_tcp_output_handle = 0;
|
2004-09-01 23:07:40 +00:00
|
|
|
|
|
|
|
|
2004-07-01 14:49:54 +00:00
|
|
|
/*
|
|
|
|
* Struct of function pointers and all that to let us be initialized
|
|
|
|
*/
|
2004-08-02 00:24:22 +00:00
|
|
|
mca_oob_tcp_component_t mca_oob_tcp_component = {
|
|
|
|
{
|
2004-07-01 14:49:54 +00:00
|
|
|
{
|
|
|
|
MCA_OOB_BASE_VERSION_1_0_0,
|
|
|
|
"tcp", /* MCA module name */
|
2004-08-19 19:34:37 +00:00
|
|
|
1, /* MCA component major version */
|
|
|
|
0, /* MCA component minor version */
|
|
|
|
0, /* MCA component release version */
|
|
|
|
mca_oob_tcp_component_open, /* component open */
|
|
|
|
mca_oob_tcp_component_close /* component close */
|
2004-07-01 14:49:54 +00:00
|
|
|
},
|
|
|
|
{
|
2007-03-16 23:11:45 +00:00
|
|
|
/* The component is checkpoint ready */
|
|
|
|
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
2004-07-01 14:49:54 +00:00
|
|
|
},
|
2005-09-01 01:07:30 +00:00
|
|
|
mca_oob_tcp_component_init
|
2004-08-02 00:24:22 +00:00
|
|
|
}
|
2004-07-01 14:49:54 +00:00
|
|
|
};
|
|
|
|
|
2007-07-20 01:34:02 +00:00
|
|
|
mca_oob_t mca_oob_tcp = {
|
|
|
|
mca_oob_tcp_init,
|
|
|
|
mca_oob_tcp_fini,
|
|
|
|
|
2004-08-16 19:39:54 +00:00
|
|
|
mca_oob_tcp_get_addr,
|
Not as bad as this all may look. Tim and I made a significant change to the way we handle the startup of the oob, the seed, etc. We have made it backwards-compatible so that mpirun2 and singleton operations remain working. We had to adjust the name server and gpr as well, plus the process_info structure.
This also includes a checkpoint update to openmpi.c and ompid.c. I have re-enabled the ompid compile.
This latter raises an important point. The trunk compiles the programs like ompid just fine under Linux. It also does just fine for OSX under the dynamic libraries. However, we are seeing errors when compiling under OSX for the static case - the linker seems to have trouble resolving some variable names, even though linker diagnostics show the variables as being defined. Thus, a warning to Mac users that you may have to locally turn things off if you are trying to do static compiles. We ask, however, that you don't commit those changes that turn things off for everyone else - instead, let's try to figure out why the static compile is having a problem, and let everyone else continue to work.
Thanks
Ralph
This commit was SVN r2534.
2004-09-08 03:59:06 +00:00
|
|
|
mca_oob_tcp_set_addr,
|
2007-07-20 01:34:02 +00:00
|
|
|
|
|
|
|
mca_oob_tcp_get_new_name,
|
2004-09-08 17:02:24 +00:00
|
|
|
mca_oob_tcp_ping,
|
2007-07-20 01:34:02 +00:00
|
|
|
|
2004-07-01 14:49:54 +00:00
|
|
|
mca_oob_tcp_send_nb,
|
2007-07-20 01:34:02 +00:00
|
|
|
|
2004-08-10 21:02:36 +00:00
|
|
|
mca_oob_tcp_recv_nb,
|
2004-09-30 15:09:29 +00:00
|
|
|
mca_oob_tcp_recv_cancel,
|
2007-07-20 01:34:02 +00:00
|
|
|
|
|
|
|
mca_oob_tcp_ft_event
|
2004-07-01 14:49:54 +00:00
|
|
|
};
|
|
|
|
|
2007-07-25 05:55:14 +00:00
|
|
|
#if defined(__WINDOWS__)
|
|
|
|
static int oob_tcp_windows_progress_callback( void )
|
|
|
|
{
|
|
|
|
opal_list_item_t* item;
|
|
|
|
mca_oob_tcp_msg_t* msg;
|
|
|
|
int event_count = 0;
|
|
|
|
|
|
|
|
/* Only one thread at a time is allowed to execute callbacks */
|
|
|
|
if( !opal_mutex_trylock(&windows_callback) )
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
|
|
|
|
while(NULL !=
|
|
|
|
(item = opal_list_remove_first(&mca_oob_tcp_component.tcp_msg_completed))) {
|
|
|
|
msg = (mca_oob_tcp_msg_t*)item;
|
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
|
|
|
msg->msg_cbfunc( msg->msg_rc,
|
|
|
|
&msg->msg_peer,
|
|
|
|
msg->msg_uiov,
|
|
|
|
msg->msg_ucnt,
|
|
|
|
msg->msg_hdr.msg_tag,
|
|
|
|
msg->msg_cbdata);
|
|
|
|
event_count++;
|
|
|
|
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
|
|
|
|
MCA_OOB_TCP_MSG_RETURN(msg);
|
|
|
|
}
|
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
|
|
|
|
|
|
|
opal_mutex_unlock(&windows_callback);
|
|
|
|
|
|
|
|
return event_count;
|
|
|
|
}
|
|
|
|
#endif /* defined(__WINDOWS__) */
|
|
|
|
|
2004-07-01 14:49:54 +00:00
|
|
|
/*
|
2004-07-12 22:46:57 +00:00
|
|
|
* Initialize global variables used w/in this module.
|
2004-07-01 14:49:54 +00:00
|
|
|
*/
|
2004-08-19 19:34:37 +00:00
|
|
|
int mca_oob_tcp_component_open(void)
|
2004-07-01 14:49:54 +00:00
|
|
|
{
|
2008-04-01 12:39:02 +00:00
|
|
|
int tmp, value = 0;
|
2007-09-12 18:16:53 +00:00
|
|
|
char *listen_type, *str = NULL;
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2005-12-12 20:04:00 +00:00
|
|
|
#ifdef __WINDOWS__
|
2004-11-02 13:14:34 +00:00
|
|
|
WSADATA win_sock_data;
|
|
|
|
if (WSAStartup(MAKEWORD(2,2), &win_sock_data) != 0) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output (0, "mca_oob_tcp_component_open: failed to initialise windows sockets: error %d\n", WSAGetLastError());
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_ERROR;
|
2004-11-02 13:14:34 +00:00
|
|
|
}
|
|
|
|
#endif
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2007-03-16 23:11:45 +00:00
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"verbose",
|
|
|
|
"Verbose level for the OOB tcp component",
|
|
|
|
false, false,
|
|
|
|
0,
|
|
|
|
&value);
|
2008-06-03 14:24:01 +00:00
|
|
|
mca_oob_tcp_output_handle = orte_output_open(NULL);
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output_set_verbosity(mca_oob_tcp_output_handle, value);
|
2007-03-16 23:11:45 +00:00
|
|
|
|
2005-07-03 16:22:16 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peer_list, opal_list_t);
|
2005-10-25 13:48:08 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peers, opal_hash_table_t);
|
2005-07-03 16:52:32 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peer_names, opal_hash_table_t);
|
2005-07-02 16:46:27 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peer_free, opal_free_list_t);
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_msgs, opal_free_list_t);
|
2005-07-03 22:45:48 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_lock, opal_mutex_t);
|
2005-07-03 16:22:16 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_events, opal_list_t);
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_msg_post, opal_list_t);
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_msg_recv, opal_list_t);
|
2005-10-25 13:48:08 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_msg_completed, opal_list_t);
|
2005-07-03 22:45:48 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_match_lock, opal_mutex_t);
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_match_cond, opal_condition_t);
|
2007-07-20 01:34:02 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_available_devices, opal_list_t);
|
2008-04-01 12:39:02 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_listen_thread, opal_thread_t);
|
2006-09-14 21:29:51 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_pending_connections, opal_list_t);
|
Fix a number of OOB issues:
* Remove the connect() timeout code, as it had some nasty race conditions
when connections were established as the trigger was firing. A better
solution has been found for the cluster where this was needed, so just
removing it was easiest.
* When a fatal error (too many connection failures) occurs, set an error
on messages in the queue even if there isn't an active message. The
first message to any peer will be queued without being active (and
so will all subsequent messages until the connection is established),
and the orteds will hang until that first message completes. So if
an orted can never contact its peer, it will never exit and just sit
waiting for that message to complete.
* Cover an interesting RST condition in the connect code. A connection
can complete the three-way handshake, the connector can even send
some data, but the server side will drop the connection because it
can't move it from the half-connected to fully-connected state because
of space shortage in the listen backlog queue. This causes a RST to
be received the first time that recv() is called, which will be when waiting
for the remote side of the OOB ack. In this case, transition the
connection back into a CLOSED state and try to connect again.
* Add levels of debugging, rather than all or nothing, each building on
the previous level. 0 (default) is hard errors. 1 is connection
error debugging info. 2 is all connection info. 3 is more state
info. 4 includes all message info.
* Add some hopefully useful comments
This commit was SVN r14261.
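The debug levels listed above can be sketched as a simple threshold check. This is only an illustration, not the component's actual code: every `sketch_*` and `SKETCH_*` name below is hypothetical, and the only level name that appears in the source itself is `OOB_TCP_DEBUG_CONNECT`.

```c
#include <stdio.h>

/* Hypothetical names mirroring the five levels in the commit message. */
enum sketch_debug_level {
    SKETCH_DEBUG_ERROR        = 0, /* hard errors (default) */
    SKETCH_DEBUG_CONNECT_FAIL = 1, /* connection error debugging info */
    SKETCH_DEBUG_CONNECT      = 2, /* all connection info */
    SKETCH_DEBUG_STATE        = 3, /* more state info */
    SKETCH_DEBUG_MSG          = 4  /* all message info */
};

/* Stand-in for the value of the "debug" MCA parameter. */
static int sketch_debug = SKETCH_DEBUG_ERROR;

/* Each level builds on the previous one, so a ">=" test is enough. */
static int sketch_debug_on(int level)
{
    return sketch_debug >= level;
}

/* Print msg to stderr if the configured level allows it.
 * Returns 1 if the message was printed, 0 if it was suppressed. */
static int sketch_log(int level, const char *msg)
{
    if (!sketch_debug_on(level)) {
        return 0;
    }
    fprintf(stderr, "[oob_tcp:%d] %s\n", level, msg);
    return 1;
}
```

With `sketch_debug` set to 2, a level-1 message is printed and a level-3 message is suppressed; with the default of 0, only hard errors get through.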
2007-04-07 22:33:30 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_connections_return, opal_list_t);
|
2008-04-01 12:39:02 +00:00
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_connections_lock, opal_mutex_t);
|
2004-08-02 21:24:00 +00:00
|
|
|
|
|
|
|
/* register oob module parameters */
|
2007-05-24 13:01:55 +00:00
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"peer_limit",
|
|
|
|
"Maximum number of peer connections to simultaneously maintain (-1 = infinite)",
|
|
|
|
false, false, -1,
|
|
|
|
&mca_oob_tcp_component.tcp_peer_limit);
|
|
|
|
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"peer_retries",
|
|
|
|
"Number of times to try shutting down a connection before giving up",
|
|
|
|
false, false, 60,
|
|
|
|
&mca_oob_tcp_component.tcp_peer_retries);
|
|
|
|
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"debug",
|
|
|
|
"Debug verbosity level (0 = hard errors only, 1-4 = increasingly verbose)",
|
|
|
|
false, false, 0,
|
|
|
|
&mca_oob_tcp_component.tcp_debug);
|
|
|
|
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"sndbuf",
|
|
|
|
"TCP socket send buffering size (in bytes)",
|
|
|
|
false, false, 128 * 1024,
|
|
|
|
&mca_oob_tcp_component.tcp_sndbuf);
|
|
|
|
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"rcvbuf",
|
|
|
|
"TCP socket receive buffering size (in bytes)",
|
|
|
|
false, false, 128 * 1024,
|
|
|
|
&mca_oob_tcp_component.tcp_rcvbuf);
|
2007-05-24 12:52:26 +00:00
|
|
|
|
|
|
|
mca_base_param_reg_string(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"if_include",
|
|
|
|
"Comma-delimited list of TCP interfaces to use",
|
|
|
|
false, false, NULL,
|
|
|
|
&mca_oob_tcp_component.tcp_include);
|
|
|
|
mca_base_param_reg_string(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"include",
|
|
|
|
"Obsolete synonym for oob_tcp_if_include",
|
|
|
|
true, false, NULL, &str);
|
|
|
|
if (NULL != str) {
|
|
|
|
if (NULL == mca_oob_tcp_component.tcp_include) {
|
|
|
|
mca_oob_tcp_component.tcp_include = str;
|
|
|
|
} else {
|
|
|
|
free(str);
|
2007-09-12 18:16:53 +00:00
|
|
|
str = NULL; /* reset to NULL so we can use it again later */
|
2007-05-24 12:52:26 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
mca_base_param_reg_string(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"if_exclude",
|
|
|
|
"Comma-delimited list of TCP interfaces to exclude",
|
|
|
|
false, false, NULL,
|
|
|
|
&mca_oob_tcp_component.tcp_exclude);
|
|
|
|
mca_base_param_reg_string(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"exclude",
|
|
|
|
"Obsolete synonym for oob_tcp_if_exclude",
|
|
|
|
true, false, NULL, &str);
|
|
|
|
if (NULL != str) {
|
|
|
|
if (NULL == mca_oob_tcp_component.tcp_exclude) {
|
|
|
|
mca_oob_tcp_component.tcp_exclude = str;
|
|
|
|
} else {
|
|
|
|
free(str);
|
2007-09-12 18:16:53 +00:00
|
|
|
str = NULL; /* reset to NULL so we can use it again later */
|
2007-05-24 12:52:26 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-11-06 18:00:46 +00:00
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"connect_sleep",
|
2008-04-01 12:39:02 +00:00
|
|
|
"Enable (1) / disable (0) random sleep for "
|
|
|
|
"connection wireup.",
|
2006-11-06 18:00:46 +00:00
|
|
|
false,
|
|
|
|
false,
|
|
|
|
1,
|
|
|
|
&mca_oob_tcp_component.connect_sleep);
|
|
|
|
|
2006-09-14 21:29:51 +00:00
|
|
|
mca_base_param_reg_string(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"listen_mode",
|
2008-04-01 12:39:02 +00:00
|
|
|
"Mode for HNP to accept incoming connections: "
|
|
|
|
"event, listen_thread.",
|
2006-09-14 21:29:51 +00:00
|
|
|
false,
|
|
|
|
false,
|
|
|
|
"event",
|
|
|
|
&listen_type);
|
2006-09-19 19:33:49 +00:00
|
|
|
|
2006-10-11 21:29:29 +00:00
|
|
|
if (0 == strcmp(listen_type, "event")) {
|
2006-09-14 21:29:51 +00:00
|
|
|
mca_oob_tcp_component.tcp_listen_type = OOB_TCP_EVENT;
|
|
|
|
} else if (0 == strcmp(listen_type, "listen_thread")) {
|
|
|
|
mca_oob_tcp_component.tcp_listen_type = OOB_TCP_LISTEN_THREAD;
|
|
|
|
} else {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
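The duplicate detection described for orte_show_help() can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the HNP's actual implementation: the fixed-size table, the "file:topic" key, and the names `help_aggregate` and `help_flush` are all hypothetical, and the ~5 second timer is left to the caller.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_TOPICS 32
#define SKETCH_MAX_KEY    128

struct sketch_help_entry {
    char key[SKETCH_MAX_KEY]; /* "file:topic" identifying one help message */
    int  pending;             /* duplicates seen since the last report */
};

static struct sketch_help_entry sketch_seen[SKETCH_MAX_TOPICS];
static int sketch_num_seen = 0;

/* Returns 1 if the message should be displayed now (first time seen),
 * or 0 if it is a duplicate that was only counted.  If the table is
 * full, unknown messages are always displayed. */
static int help_aggregate(const char *file, const char *topic)
{
    char key[SKETCH_MAX_KEY];
    int i;

    snprintf(key, sizeof(key), "%s:%s", file, topic);
    for (i = 0; i < sketch_num_seen; ++i) {
        if (0 == strcmp(sketch_seen[i].key, key)) {
            sketch_seen[i].pending++;
            return 0;
        }
    }
    if (sketch_num_seen < SKETCH_MAX_TOPICS) {
        /* key and the table slot have the same size, so this fits */
        strcpy(sketch_seen[sketch_num_seen].key, key);
        sketch_seen[sketch_num_seen].pending = 0;
        sketch_num_seen++;
    }
    return 1;
}

/* Meant to be called periodically.  Reports and resets the duplicate
 * counts; returns the total number of duplicates reported. */
static int help_flush(FILE *out)
{
    int i, total = 0;

    for (i = 0; i < sketch_num_seen; ++i) {
        if (sketch_seen[i].pending > 0) {
            fprintf(out, "%d more copies of help message %s\n",
                    sketch_seen[i].pending, sketch_seen[i].key);
            total += sketch_seen[i].pending;
            sketch_seen[i].pending = 0;
        }
    }
    return total;
}
```

The first call for a given file and topic returns 1 (display it); later calls return 0 and only bump the counter, which the next `help_flush()` reports and clears.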
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
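To give a rough idea of what tagging a message could look like, here is a small sketch that wraps one text message in an XML element and escapes the characters that would otherwise break the markup. The function name `xml_wrap` and the tag names are assumptions made for illustration; they are not taken from the filter component, whose output is described below as preliminary.

```c
#include <stdio.h>
#include <string.h>

/* Write "<tag>escaped text</tag>" into buf.
 * Returns 0 on success, -1 if buf is too small. */
static int xml_wrap(const char *tag, const char *text, char *buf, size_t len)
{
    size_t used;
    const char *p;
    int n;

    n = snprintf(buf, len, "<%s>", tag);
    if (n < 0 || (size_t) n >= len) {
        return -1;
    }
    used = (size_t) n;

    for (p = text; '\0' != *p; ++p) {
        char one[2];
        const char *rep;

        one[0] = *p;
        one[1] = '\0';
        switch (*p) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  rep = one;     break;
        }
        n = snprintf(buf + used, len - used, "%s", rep);
        if (n < 0 || (size_t) n >= len - used) {
            return -1;
        }
        used += (size_t) n;
    }

    n = snprintf(buf + used, len - used, "</%s>", tag);
    if (n < 0 || (size_t) n >= len - used) {
        return -1;
    }
    return 0;
}
```

A tool reading the HNP's output could then tell a help message from user stdout by the enclosing tag rather than by guessing from the text.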
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "Invalid value for oob_tcp_listen_mode parameter: %s",
|
2006-09-14 21:29:51 +00:00
|
|
|
listen_type);
|
|
|
|
return ORTE_ERROR;
|
|
|
|
}
|
|
|
|
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"listen_thread_max_queue",
|
2008-04-01 12:39:02 +00:00
|
|
|
"High water mark for queued accepted socket "
|
|
|
|
"list size. Used only when listen_mode is "
|
|
|
|
"listen_thread.",
|
2006-09-14 21:29:51 +00:00
|
|
|
false,
|
|
|
|
false,
|
|
|
|
10,
|
|
|
|
&mca_oob_tcp_component.tcp_copy_max_size);
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
2008-04-01 12:39:02 +00:00
|
|
|
"listen_thread_wait_time",
|
|
|
|
"Time in milliseconds to wait before "
|
|
|
|
"actively checking for new connections when "
|
|
|
|
"listen_mode is listen_thread.",
|
2006-09-14 21:29:51 +00:00
|
|
|
false,
|
|
|
|
false,
|
|
|
|
10,
|
|
|
|
&tmp);
|
2008-04-01 12:39:02 +00:00
|
|
|
mca_oob_tcp_component.tcp_listen_thread_tv.tv_sec = tmp / (1000);
|
|
|
|
mca_oob_tcp_component.tcp_listen_thread_tv.tv_usec = (tmp % 1000) * 1000;
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
mca_oob_tcp_component.tcp_listen_thread_num_sockets = 0;
|
|
|
|
mca_oob_tcp_component.tcp_listen_thread_sds[0] = -1;
|
|
|
|
mca_oob_tcp_component.tcp_listen_thread_sds[1] = -1;
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2007-10-26 16:36:51 +00:00
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"port_min_v4", "Starting port allowed (IPv4)",
|
|
|
|
false, false,
|
|
|
|
0,
|
|
|
|
&mca_oob_tcp_component.tcp_port_min);
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"port_range_v4", "Range of allowed ports (IPv4)",
|
|
|
|
false, false,
|
|
|
|
64*1024 - 1 - mca_oob_tcp_component.tcp_port_min,
|
|
|
|
&mca_oob_tcp_component.tcp_port_range);
|
2008-04-16 09:31:15 +00:00
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"disable_family", "Disable IPv4 (4) or IPv6 (6)",
|
|
|
|
false, false,
|
|
|
|
0,
|
|
|
|
&mca_oob_tcp_component.disable_family);
|
2007-10-26 16:36:51 +00:00
|
|
|
#if OPAL_WANT_IPV6
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"port_min_v6", "Starting port allowed (IPv6)",
|
|
|
|
false, false,
|
|
|
|
0,
|
|
|
|
&mca_oob_tcp_component.tcp6_port_min);
|
|
|
|
mca_base_param_reg_int(&mca_oob_tcp_component.super.oob_base,
|
|
|
|
"port_range_v6", "Range of allowed ports (IPv6)",
|
|
|
|
false, false,
|
|
|
|
64*1024 - 1 - mca_oob_tcp_component.tcp6_port_min,
|
|
|
|
&mca_oob_tcp_component.tcp6_port_range);
|
|
|
|
mca_oob_tcp_component.tcp6_listen_sd = -1;
|
|
|
|
#endif /* OPAL_WANT_IPV6 */
|
|
|
|
|
2004-08-02 22:16:35 +00:00
|
|
|
/* initialize state */
|
2006-09-14 21:29:51 +00:00
|
|
|
mca_oob_tcp_component.tcp_shutdown = false;
|
2004-08-02 22:16:35 +00:00
|
|
|
mca_oob_tcp_component.tcp_listen_sd = -1;
|
2004-09-30 15:09:29 +00:00
|
|
|
mca_oob_tcp_component.tcp_match_count = 0;
|
2007-03-16 23:11:45 +00:00
|
|
|
|
2007-06-14 04:38:06 +00:00
|
|
|
#if defined(__WINDOWS__)
|
2007-07-25 05:55:14 +00:00
|
|
|
/* Register the libevent callback which will trigger the OOB
|
|
|
|
* completion callbacks. */
|
|
|
|
OBJ_CONSTRUCT(&windows_callback, opal_mutex_t);
|
|
|
|
opal_progress_register(oob_tcp_windows_progress_callback);
|
|
|
|
#endif /* defined(__WINDOWS__) */
|
2007-06-14 22:35:38 +00:00
|
|
|
|
2007-07-25 05:55:14 +00:00
|
|
|
return ORTE_SUCCESS;
|
2007-06-14 04:38:06 +00:00
|
|
|
}
|
2004-07-01 14:49:54 +00:00
|
|
|
|
2004-07-12 22:46:57 +00:00
|
|
|
/*
|
|
|
|
* Cleanup of global variables used by this module.
|
|
|
|
*/
|
|
|
|
|
2004-08-19 19:34:37 +00:00
|
|
|
int mca_oob_tcp_component_close(void)
|
2004-07-01 14:49:54 +00:00
|
|
|
{
|
2007-07-20 01:34:02 +00:00
|
|
|
opal_list_item_t *item;
|
|
|
|
|
2007-06-14 04:38:06 +00:00
|
|
|
#if defined(__WINDOWS__)
|
|
|
|
opal_progress_unregister(oob_tcp_windows_progress_callback);
|
2007-07-20 01:34:02 +00:00
|
|
|
OBJ_DESTRUCT( &windows_callback );
|
2004-11-02 13:14:34 +00:00
|
|
|
WSACleanup();
|
2007-06-14 04:38:06 +00:00
|
|
|
#endif /* defined(__WINDOWS__) */
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2004-09-30 15:09:29 +00:00
|
|
|
/* cleanup resources */
|
2007-07-20 01:34:02 +00:00
|
|
|
while (NULL != (item = opal_list_remove_first(&mca_oob_tcp_component.tcp_available_devices))) {
|
|
|
|
OBJ_RELEASE(item);
|
|
|
|
}
|
2007-04-07 22:33:30 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_connections_lock);
|
2007-04-07 22:33:30 +00:00
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_connections_return);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_pending_connections);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_listen_thread);
|
2008-04-01 12:39:02 +00:00
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_available_devices);
|
2007-04-07 22:33:30 +00:00
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_match_cond);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_match_lock);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_msg_completed);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_msg_recv);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_msg_post);
|
2004-09-30 15:09:29 +00:00
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_events);
|
2004-08-02 21:24:00 +00:00
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_lock);
|
2007-04-07 22:33:30 +00:00
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_msgs);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peer_free);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peer_names);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peers);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peer_list);
|
2007-03-16 23:11:45 +00:00
|
|
|
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_SUCCESS;
|
2004-07-01 14:49:54 +00:00
|
|
|
}
|
|
|
|
|
2004-07-12 22:46:57 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* Called by mca_oob_tcp_accept() and mca_oob_tcp_thread_handler() on
|
|
|
|
* a socket that has been accepted. This call finishes processing the
|
|
|
|
* socket, including setting socket options and registering for the
|
|
|
|
* OOB-level connection handshake. Used by both the threaded and
|
|
|
|
* event listen modes.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
mca_oob_tcp_create_connection(const int accepted_fd,
|
|
|
|
const struct sockaddr *addr)
|
|
|
|
{
|
|
|
|
mca_oob_tcp_event_t* event;
|
|
|
|
|
|
|
|
/* setup socket options */
|
|
|
|
mca_oob_tcp_set_socket_options(accepted_fd);
|
|
|
|
|
|
|
|
/* log the accept */
|
|
|
|
if (mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_CONNECT) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_accept: %s:%d\n",
|
2008-04-01 12:39:02 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
|
|
|
opal_net_get_hostname(addr),
|
|
|
|
opal_net_get_port(addr));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* wait for receipt of peer's process identifier to complete this connection */
|
|
|
|
event = OBJ_NEW(mca_oob_tcp_event_t);
|
|
|
|
opal_event_set(&event->event, accepted_fd, OPAL_EV_READ, mca_oob_tcp_recv_handler, event);
|
|
|
|
opal_event_add(&event->event, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2004-08-02 21:24:00 +00:00
|
|
|
/*
|
2008-04-01 12:39:02 +00:00
|
|
|
* Called by mca_oob_tcp_recv_handler() when the TCP listen socket
|
|
|
|
* has pending connection requests. Accept incoming requests and
|
|
|
|
* queue for completion of the connection handshake. Will not be
|
|
|
|
* called when listen_mode is listen_thread.
|
2004-07-12 22:46:57 +00:00
|
|
|
*/
|
2007-04-25 01:55:40 +00:00
|
|
|
static void mca_oob_tcp_accept(int incoming_sd)
|
2004-07-12 22:46:57 +00:00
|
|
|
{
|
2004-08-02 21:24:00 +00:00
|
|
|
while(true) {
|
2007-07-20 01:34:02 +00:00
|
|
|
struct sockaddr_storage addr;
|
|
|
|
opal_socklen_t addrlen = sizeof(struct sockaddr_storage);
|
2004-09-30 15:09:29 +00:00
|
|
|
int sd;
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2007-04-25 01:55:40 +00:00
|
|
|
sd = accept(incoming_sd, (struct sockaddr*)&addr, &addrlen);
|
2004-08-02 21:24:00 +00:00
|
|
|
if(sd < 0) {
|
2007-04-25 01:55:40 +00:00
|
|
|
if(opal_socket_errno == EINTR) {
|
2004-08-02 21:24:00 +00:00
|
|
|
continue;
|
2007-04-25 01:55:40 +00:00
|
|
|
}
|
|
|
|
if(opal_socket_errno != EAGAIN && opal_socket_errno != EWOULDBLOCK) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace
on the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
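The duplicate suppression described in the list above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names help_record and help_flush, the fixed-size table, and the "file:topic" key are hypothetical and are not taken from the ORTE source. The timer and the message transport to the HNP are omitted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16
#define KEY_LEN 128

/* One entry per distinct help message seen so far (hypothetical layout). */
typedef struct {
    char key[KEY_LEN];   /* "file:topic" identifying the message */
    int  pending;        /* duplicates seen since the last summary */
} help_entry_t;

static help_entry_t table[MAX_TOPICS];
static int n_entries = 0;

/* Returns 1 if the message is new and should be displayed now,
 * 0 if it was only counted as a duplicate, -1 if the table is full. */
int help_record(const char *file, const char *topic)
{
    char key[KEY_LEN];
    int i;

    assert(file != NULL && topic != NULL);
    snprintf(key, sizeof(key), "%s:%s", file, topic);
    for (i = 0; i < n_entries; i++) {
        if (0 == strcmp(table[i].key, key)) {
            table[i].pending++;
            return 0;
        }
    }
    if (MAX_TOPICS == n_entries) {
        return -1;
    }
    strcpy(table[n_entries].key, key);   /* key is already bounded by KEY_LEN */
    table[n_entries].pending = 0;
    n_entries++;
    return 1;
}

/* Intended to run from a periodic timer: report the duplicates counted
 * since the last call and reset the counters. Returns the total reported. */
int help_flush(FILE *out)
{
    int i, total = 0;

    for (i = 0; i < n_entries; i++) {
        if (table[i].pending > 0) {
            fprintf(out, "%d more copies of help message \"%s\"\n",
                    table[i].pending, table[i].key);
            total += table[i].pending;
            table[i].pending = 0;
        }
    }
    return total;
}
```

Counting per key rather than per sender keeps the table small regardless of how many processes report the same problem.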
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave much as opal_output'ing
did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
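The fallback behavior in the last note can be pictured as a wrapper that renders the message and then picks a destination. The sketch below is an assumption-laden illustration, not the ORTE code: job_output, route_t, and the have_hnp flag are hypothetical names, and the "send to the HNP" step is left as a comment.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

typedef enum { ROUTE_LOCAL, ROUTE_HNP } route_t;

/* Render the message into buf, then decide where it goes. `have_hnp`
 * stands in for "ORTE's RML and an HNP are available" (hypothetical). */
route_t job_output(int have_hnp, char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;

    assert(buf != NULL && len > 0 && fmt != NULL);
    va_start(ap, fmt);
    vsnprintf(buf, len, fmt, ap);
    va_end(ap);

    if (!have_hnp) {
        /* Fallback: behave like a plain per-process output call. */
        fprintf(stderr, "%s\n", buf);
        return ROUTE_LOCAL;
    }
    /* A real implementation would hand buf to the messaging layer here. */
    return ROUTE_HNP;
}
```

Keeping the rendering step identical on both paths is what makes the wrapper "trivial" when no HNP is present.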
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
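As a rough picture of what a filter component does to a single message, the sketch below wraps a text message in a tag and escapes the characters that would break XML. It is a guess at the general shape only: xml_filter, the append helper, and the tag names are hypothetical, and the commit message itself notes that the real component does not yet emit real XML.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append src to the NUL-terminated string in dst (capacity cap bytes).
 * Returns 0 on success, -1 if it would not fit. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst), n = strlen(src);

    if (used + n + 1 > cap) {
        return -1;
    }
    memcpy(dst + used, src, n + 1);
    return 0;
}

/* Write "<tag>escaped msg</tag>" into out. Returns 0, or -1 if out is too small. */
int xml_filter(const char *tag, const char *msg, char *out, size_t cap)
{
    char one[2] = {0, 0};
    const char *p;

    assert(tag != NULL && msg != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    if (append(out, cap, "<") || append(out, cap, tag) || append(out, cap, ">")) {
        return -1;
    }
    for (p = msg; *p != '\0'; p++) {
        const char *rep;
        switch (*p) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  one[0] = *p; rep = one; break;
        }
        if (append(out, cap, rep)) {
            return -1;
        }
    }
    if (append(out, cap, "</") || append(out, cap, tag) || append(out, cap, ">")) {
        return -1;
    }
    return 0;
}
```

A tool reading the HNP's output could then tell message kinds apart by tag, for example a hypothetical "stderr" tag versus a "help" tag.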
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done so that configure.m4 searches for an appropriate XML
library, links it in, and uses it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
                orte_output(0, "mca_oob_tcp_accept: accept() failed: %s (%d).",
2006-12-14 18:20:43 +00:00
                            strerror(opal_socket_errno), opal_socket_errno);
2007-04-25 01:55:40 +00:00
            }
2004-08-02 21:24:00 +00:00
            return;
        }
2005-03-18 23:40:08 +00:00

2008-04-01 12:39:02 +00:00
        mca_oob_tcp_create_connection(sd, (struct sockaddr*) &addr);
2004-08-02 21:24:00 +00:00
    }
}
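The accept loop above follows a common pattern for non-blocking listen sockets: retry on EINTR, stop quietly on EAGAIN/EWOULDBLOCK, and treat anything else as an error. A stand-alone sketch of that pattern using only POSIX calls is shown below. The names drain_accept_queue and close_accepted are hypothetical, and the callback stands in for mca_oob_tcp_create_connection().

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Example callback (hypothetical): takes ownership of the fd and closes it. */
static void close_accepted(int sd)
{
    close(sd);
}

/* Accept every connection currently pending on a non-blocking listen socket.
 * Returns the number accepted, or -1 on an unexpected accept() error. */
int drain_accept_queue(int listen_sd, void (*on_accept)(int sd))
{
    int count = 0;

    assert(on_accept != NULL);
    for (;;) {
        struct sockaddr_storage addr;
        socklen_t addrlen = sizeof(addr);
        int sd = accept(listen_sd, (struct sockaddr *)&addr, &addrlen);

        if (sd < 0) {
            if (EINTR == errno) {
                continue;            /* interrupted by a signal: retry */
            }
            if (EAGAIN == errno || EWOULDBLOCK == errno) {
                return count;        /* queue is empty: done for now */
            }
            return -1;               /* real failure: let the caller report it */
        }
        on_accept(sd);
        count++;
    }
}
```

Unlike the original, this sketch returns a count so the caller can distinguish "nothing pending" from a failure; that is a design choice of the sketch, not of the source.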
2007-07-20 01:34:02 +00:00

2004-08-02 21:24:00 +00:00
/*
2008-04-01 12:39:02 +00:00
 * Create a listen socket in the specified af_family and bind to all
 * interfaces.
 *
 * At one time, this also registered a callback with the event library
 * for when connections were received on the listen socket. This is
 * no longer the case -- the caller must register any events required.
 *
 * Called by both the threaded and event based listen modes.
2004-08-02 21:24:00 +00:00
 */
2008-04-01 12:39:02 +00:00
static int
2008-04-09 12:53:24 +00:00
mca_oob_tcp_create_listen(int *target_sd, unsigned short *target_port, uint16_t af_family)
2004-08-02 21:24:00 +00:00
{
2008-06-03 19:58:40 +00:00
    int flags, index, range = 0, port=0;
2007-07-20 01:34:02 +00:00
    struct sockaddr_storage inaddr;
    opal_socklen_t addrlen;
2004-08-12 13:29:37 +00:00

2004-08-02 21:24:00 +00:00
    /* create a listen socket for incoming connections */
2007-04-25 01:55:40 +00:00
    *target_sd = socket(af_family, SOCK_STREAM, 0);
    if(*target_sd < 0) {
        if (EAFNOSUPPORT != opal_socket_errno) {
2008-05-13 20:00:55 +00:00
            orte_output(0,"mca_oob_tcp_component_init: socket() failed: %s (%d)",
2007-04-25 01:55:40 +00:00
                        strerror(opal_socket_errno), opal_socket_errno);
        }
        return ORTE_ERR_IN_ERRNO;
2004-08-02 21:24:00 +00:00
    }
2005-10-31 16:21:11 +00:00

    /* setup socket options */
2007-04-25 01:55:40 +00:00
    mca_oob_tcp_set_socket_options(*target_sd);
2005-10-31 16:21:11 +00:00

2008-04-01 12:39:02 +00:00
    /* Set some more options and fill in the family / address
       information. Set to bind to the any address */
2007-04-25 01:55:40 +00:00
#if OPAL_WANT_IPV6
2007-10-26 16:36:51 +00:00
    {
        struct addrinfo hints, *res = NULL;
        int error;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = af_family;
        hints.ai_socktype = SOCK_STREAM;
        hints.ai_flags = AI_PASSIVE;
2007-04-25 01:55:40 +00:00

2007-10-26 16:36:51 +00:00
        if ((error = getaddrinfo(NULL, "0", &hints, &res))) {
2008-05-13 20:00:55 +00:00
            orte_output (0,
2007-10-26 16:36:51 +00:00
                         "mca_oob_tcp_create_listen: unable to resolve. %s\n",
                         gai_strerror (error));
            return ORTE_ERROR;
        }
2007-04-25 01:55:40 +00:00

2007-10-26 16:36:51 +00:00
        memcpy (&inaddr, res->ai_addr, res->ai_addrlen);
        addrlen = res->ai_addrlen;
        freeaddrinfo (res);
2007-04-25 01:55:40 +00:00

2007-06-28 18:52:15 +00:00
#ifdef IPV6_V6ONLY
2007-10-26 16:36:51 +00:00
        /* in case of AF_INET6, disable v4-mapped addresses */
        if (AF_INET6 == af_family) {
            /* IPV6_V6ONLY must be set to 1 to restrict the socket to IPv6;
               a value of 0 would allow v4-mapped addresses */
            int flg = 1;
            if (setsockopt (*target_sd, IPPROTO_IPV6, IPV6_V6ONLY,
                            &flg, sizeof (flg)) < 0) {
2008-05-13 20:00:55 +00:00
                orte_output(0,
2007-10-26 16:36:51 +00:00
                            "mca_oob_tcp_create_listen: unable to disable v4-mapped addresses\n");
            }
2007-04-25 01:55:40 +00:00
        }
2007-07-20 01:34:02 +00:00
#endif /* IPV6_V6ONLY */
2008-01-01 11:02:38 +00:00
    }
2007-04-25 01:55:40 +00:00
#else
2007-10-26 16:36:51 +00:00
    if (AF_INET != af_family) {
2007-07-24 17:01:39 +00:00
        return ORTE_ERROR;
    }
2007-10-26 16:36:51 +00:00
    ((struct sockaddr_in*) &inaddr)->sin_family = af_family;
    ((struct sockaddr_in*) &inaddr)->sin_addr.s_addr = INADDR_ANY;
    addrlen = sizeof(struct sockaddr_in);
2007-04-25 01:55:40 +00:00
#endif
2004-08-12 13:29:37 +00:00

2008-04-01 12:39:02 +00:00
    /* Disable reusing ports */
    flags = 0;
    if (setsockopt (*target_sd, SOL_SOCKET, SO_REUSEADDR, (void*)&flags, sizeof(flags)) < 0) {
2008-05-13 20:00:55 +00:00
        orte_output(0, "mca_oob_tcp_create_listen: unable to unset the "
2008-04-01 12:39:02 +00:00
                    "SO_REUSEADDR option (%s:%d)\n",
                    strerror(opal_socket_errno), opal_socket_errno);
        CLOSE_THE_SOCKET(*target_sd);
        return ORTE_ERROR;
2004-08-02 21:24:00 +00:00
    }
2004-08-12 13:29:37 +00:00

2008-04-01 12:39:02 +00:00
    /* If an explicit range of ports was given, find the first open
       port in the range. Otherwise, tcp_port_min will be 0, which
       means "pick any port" */
    if (AF_INET == af_family) {
2007-10-26 20:15:28 +00:00
        range = mca_oob_tcp_component.tcp_port_range;
        port = mca_oob_tcp_component.tcp_port_min;
2008-04-01 12:39:02 +00:00
    }
2007-10-26 16:36:51 +00:00
#if OPAL_WANT_IPV6
2008-04-01 12:39:02 +00:00
    if (AF_INET6 == af_family) {
        range = mca_oob_tcp_component.tcp6_port_range;
        port = mca_oob_tcp_component.tcp6_port_min;
    }
2007-10-26 16:36:51 +00:00
#endif /* OPAL_WANT_IPV6 */

2008-04-30 19:49:53 +00:00
#if 0
    /* flag whether or not static ports are in use so that other
     * parts of ORTE can act appropriately
     * LEAVE OFF FOR MOMENT PENDING FURTHER TEST
     */
    if (0 != port) {
        orte_static_ports = true;
    }
#endif

2008-04-01 12:39:02 +00:00
    for (index = 0; index < range; index++ ) {
        if (AF_INET == af_family) {
2007-10-26 16:36:51 +00:00
            ((struct sockaddr_in*) &inaddr)->sin_port = port + index;
2008-04-01 12:39:02 +00:00
        } else if (AF_INET6 == af_family) {
            ((struct sockaddr_in6*) &inaddr)->sin6_port = port + index;
        } else {
            return ORTE_ERROR;
2007-10-26 16:36:51 +00:00
        }
2008-04-01 12:39:02 +00:00

        if(bind(*target_sd, (struct sockaddr*)&inaddr, addrlen) < 0) {
            if( (EADDRINUSE == opal_socket_errno) || (EADDRNOTAVAIL == opal_socket_errno) ) {
                continue;
            }
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already made the search-and-replace
on the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
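The duplicate suppression described above for orte_show_help() can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual HNP code: the names help_agg_t, help_agg_record(), and help_agg_flush() are hypothetical, messages are assumed to be identified by a "file:topic" key, and the periodic (~5 second) timer that would call the flush function is left out.

```c
#include <stdio.h>
#include <string.h>

#define HELP_MAX_TOPICS 32
#define HELP_KEY_LEN    128

/* One entry per distinct help message, keyed by "file:topic" (hypothetical). */
typedef struct {
    char key[HELP_KEY_LEN];
    int  pending;               /* duplicates seen since the last summary */
} help_entry_t;

typedef struct {
    help_entry_t entries[HELP_MAX_TOPICS];
    int          count;
} help_agg_t;

/* Returns 1 if the message is new and should be displayed now,
 * 0 if it is a duplicate (only counted), -1 if the table is full. */
static int help_agg_record(help_agg_t *agg, const char *file, const char *topic)
{
    char key[HELP_KEY_LEN];
    snprintf(key, sizeof(key), "%s:%s", file, topic);

    for (int i = 0; i < agg->count; i++) {
        if (0 == strcmp(agg->entries[i].key, key)) {
            agg->entries[i].pending++;
            return 0;
        }
    }
    if (agg->count >= HELP_MAX_TOPICS) {
        return -1;
    }
    /* key was already truncated to HELP_KEY_LEN by snprintf() */
    strcpy(agg->entries[agg->count].key, key);
    agg->entries[agg->count].pending = 0;
    agg->count++;
    return 1;
}

/* Meant to be called from a periodic timer: reports and resets the
 * duplicate counters, and returns the total number of duplicates flushed. */
static int help_agg_flush(help_agg_t *agg, FILE *out)
{
    int total = 0;
    for (int i = 0; i < agg->count; i++) {
        if (agg->entries[i].pending > 0) {
            fprintf(out, "%d more copies of help message %s\n",
                    agg->entries[i].pending, agg->entries[i].key);
            total += agg->entries[i].pending;
            agg->entries[i].pending = 0;
        }
    }
    return total;
}
```

In this sketch the first copy of a message is shown immediately, and later copies only bump a counter that the next flush summarizes, which is the behavior the text above describes for the HNP.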
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component written in the filter
framework creates an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
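To make the recursion issue concrete, here is a minimal sketch of one common way to break such a loop: a re-entrancy flag that diverts nested calls to local output. It is not the workaround actually used in this commit (the message above does not describe it); guarded_output(), noisy_send(), and send_fn_t are hypothetical names, and the flag is a plain global, so the sketch is not thread-safe.

```c
#include <stdio.h>
#include <stdbool.h>

static bool in_output = false;  /* true while guarded_output() is active */
static int  dropped   = 0;      /* nested calls diverted to local output */

/* Hypothetical stand-in for a transport send routine (e.g. an RML send). */
typedef void (*send_fn_t)(const char *msg);

/* Returns 1 if the message went through the transport, 0 if the call was
 * re-entrant and the message was printed locally instead. */
static int guarded_output(send_fn_t send, const char *msg)
{
    if (in_output) {
        /* Already inside an output call: do not re-enter the transport. */
        dropped++;
        fprintf(stderr, "[local] %s\n", msg);
        return 0;
    }
    in_output = true;
    send(msg);
    in_output = false;
    return 1;
}

/* Example transport that logs from inside the send path, which would
 * recurse without the guard above. */
static void noisy_send(const char *msg)
{
    guarded_output(noisy_send, "diagnostic from inside send");
    printf("sent: %s\n", msg);
}
```

Calling guarded_output(noisy_send, "hello") sends "hello" once; the diagnostic emitted from inside the send path is printed locally rather than triggering another send.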
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "bind() failed: %s (%d)",
|
2008-04-01 12:39:02 +00:00
|
|
|
strerror(opal_socket_errno),
|
|
|
|
opal_socket_errno );
|
|
|
|
CLOSE_THE_SOCKET(*target_sd);
|
|
|
|
return ORTE_ERROR;
|
2007-10-26 16:36:51 +00:00
|
|
|
}
|
2008-04-01 12:39:02 +00:00
|
|
|
goto socket_binded;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (AF_INET == af_family ) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "bind() failed: no port available in the range [%d..%d]",
|
2008-04-01 12:39:02 +00:00
|
|
|
mca_oob_tcp_component.tcp_port_min,
|
|
|
|
mca_oob_tcp_component.tcp_port_min + range);
|
|
|
|
}
|
2007-10-26 16:36:51 +00:00
|
|
|
#if OPAL_WANT_IPV6
|
2008-04-01 12:39:02 +00:00
|
|
|
if (AF_INET6 == af_family) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "bind6() failed: no port available in the range [%d..%d]",
|
2008-04-01 12:39:02 +00:00
|
|
|
mca_oob_tcp_component.tcp6_port_min,
|
|
|
|
mca_oob_tcp_component.tcp6_port_min + range);
|
2007-10-26 16:36:51 +00:00
|
|
|
}
|
2008-04-01 12:39:02 +00:00
|
|
|
#endif /* OPAL_WANT_IPV6 */
|
|
|
|
|
|
|
|
CLOSE_THE_SOCKET(*target_sd);
|
|
|
|
return ORTE_ERROR;
|
|
|
|
|
|
|
|
socket_binded:
|
|
|
|
/* resolve assigned port */
|
2007-07-20 01:34:02 +00:00
|
|
|
if (getsockname(*target_sd, (struct sockaddr*)&inaddr, &addrlen) < 0) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "mca_oob_tcp_create_listen: getsockname(): %s (%d)",
|
2006-12-14 18:20:43 +00:00
|
|
|
strerror(opal_socket_errno), opal_socket_errno);
|
2007-10-26 16:36:51 +00:00
|
|
|
CLOSE_THE_SOCKET(*target_sd);
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_ERROR;
|
2004-08-02 21:24:00 +00:00
|
|
|
}
|
2007-07-20 01:34:02 +00:00
|
|
|
|
2007-04-25 01:55:40 +00:00
|
|
|
if (AF_INET == af_family) {
|
2008-04-09 12:53:24 +00:00
|
|
|
*target_port = ((struct sockaddr_in*) &inaddr)->sin_port;
|
|
|
|
} else {
|
|
|
|
*target_port = ((struct sockaddr_in6*) &inaddr)->sin6_port;
|
2007-10-26 16:36:51 +00:00
|
|
|
}
|
|
|
|
|
2004-08-02 21:24:00 +00:00
|
|
|
/* setup listen backlog to maximum allowed by kernel */
|
2007-04-25 01:55:40 +00:00
|
|
|
if(listen(*target_sd, SOMAXCONN) < 0) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "mca_oob_tcp_component_init: listen(): %s (%d)",
|
2006-12-14 18:20:43 +00:00
|
|
|
strerror(opal_socket_errno), opal_socket_errno);
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_ERROR;
|
2004-08-02 21:24:00 +00:00
|
|
|
}
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2004-08-02 21:24:00 +00:00
|
|
|
/* set socket up to be non-blocking, otherwise accept could block */
|
2007-04-25 01:55:40 +00:00
|
|
|
if((flags = fcntl(*target_sd, F_GETFL, 0)) < 0) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "mca_oob_tcp_component_init: fcntl(F_GETFL) failed: %s (%d)",
|
2006-12-14 18:20:43 +00:00
|
|
|
strerror(opal_socket_errno), opal_socket_errno);
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_ERROR;
|
2004-08-02 21:24:00 +00:00
|
|
|
} else {
|
|
|
|
flags |= O_NONBLOCK;
|
2007-04-25 01:55:40 +00:00
|
|
|
if(fcntl(*target_sd, F_SETFL, flags) < 0) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
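The duplicate handling described above can be sketched in C roughly as follows. This is only an illustration of the counting idea, not the HNP's actual code: the names help_msg_arrived() and flush_duplicate_counts(), the "file:topic" key format, and the fixed-size table are all assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SEEN 64   /* arbitrary table size chosen for the sketch */

/* Hypothetical record of one distinct help message. */
struct seen_msg {
    char key[128];    /* assumed key format: "file:topic" */
    int  new_copies;  /* duplicates counted since the last summary */
};

static struct seen_msg seen[MAX_SEEN];
static int num_seen = 0;

/* Returns 1 if the message should be displayed now (first time it is
 * seen), or 0 if it was only counted as a duplicate. */
static int help_msg_arrived(const char *key)
{
    int i;

    assert(NULL != key);
    for (i = 0; i < num_seen; ++i) {
        if (0 == strcmp(seen[i].key, key)) {
            seen[i].new_copies++;
            return 0;
        }
    }
    /* First sighting: remember it if there is room in the table. */
    if (num_seen < MAX_SEEN) {
        snprintf(seen[num_seen].key, sizeof(seen[num_seen].key), "%s", key);
        seen[num_seen].new_copies = 0;
        num_seen++;
    }
    return 1;
}

/* Meant to be called from a periodic timer (the text above says roughly
 * every 5 seconds).  Prints one summary line per message that gained
 * duplicates, resets the counters, and returns the total reported. */
static int flush_duplicate_counts(FILE *out)
{
    int i, total = 0;

    for (i = 0; i < num_seen; ++i) {
        if (seen[i].new_copies > 0) {
            fprintf(out, "%d more copies of help message \"%s\"\n",
                    seen[i].new_copies, seen[i].key);
            total += seen[i].new_copies;
            seen[i].new_copies = 0;
        }
    }
    return total;
}
```

In this sketch the first arrival of a given key is shown immediately and later arrivals are only counted; a timer callback would then call flush_duplicate_counts(stderr) to emit the periodic summary.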
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "mca_oob_tcp_component_init: fcntl(F_SETFL) failed: %s (%d)",
|
2006-12-14 18:20:43 +00:00
|
|
|
strerror(opal_socket_errno), opal_socket_errno);
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_ERROR;
|
2004-08-02 21:24:00 +00:00
|
|
|
}
|
|
|
|
}
|
2005-09-01 01:07:30 +00:00
|
|
|
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_SUCCESS;
|
2004-08-02 21:24:00 +00:00
|
|
|
}
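The function above follows the usual two-step fcntl() pattern: read the descriptor's current status flags, then write them back with O_NONBLOCK added. A stand-alone sketch of that pattern, assuming plain POSIX; the helper name set_nonblocking() is hypothetical, and it reports errors with fprintf() rather than orte_output():

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the original file): put a descriptor
 * into non-blocking mode.  Returns 0 on success, -1 on failure. */
static int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);

    /* Fetch the current file status flags so they are preserved. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        fprintf(stderr, "fcntl(F_GETFL) failed: %s (%d)\n",
                strerror(errno), errno);
        return -1;
    }

    /* Add O_NONBLOCK and write the combined flags back. */
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        fprintf(stderr, "fcntl(F_SETFL) failed: %s (%d)\n",
                strerror(errno), errno);
        return -1;
    }
    return 0;
}
```

Reading the flags first matters because F_SETFL replaces the whole set of status flags; passing O_NONBLOCK alone would clear any other flag (for example O_APPEND) already set on the descriptor.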
|
2004-07-12 22:46:57 +00:00
|
|
|
|
2004-07-10 01:27:07 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/*
|
|
|
|
* The listen thread created when listen_mode is threaded. Accepts
|
|
|
|
* incoming connections and places them in a queue for further
|
|
|
|
* processing. Finishing the accepted connection is done in the main
|
|
|
|
* thread to maintain thread safety even when the event library and
|
|
|
|
* most of ORTE is in single threaded mode.
|
|
|
|
*
|
|
|
|
* Runs until mca_oob_tcp_component.tcp_shutdown is set to true.
|
|
|
|
*/
|
|
|
|
static void*
|
|
|
|
mca_oob_tcp_listen_thread(opal_object_t *obj)
|
2006-09-14 21:29:51 +00:00
|
|
|
{
|
2008-04-01 12:39:02 +00:00
|
|
|
int rc, count, i, max, accepted_connections, need_write;
|
|
|
|
opal_socklen_t addrlen = sizeof(struct sockaddr_storage);
|
2006-09-14 21:29:51 +00:00
|
|
|
opal_free_list_item_t *fl_item;
|
2008-04-01 12:39:02 +00:00
|
|
|
mca_oob_tcp_pending_connection_t *pending_connection;
|
2006-09-14 21:29:51 +00:00
|
|
|
struct timeval timeout;
|
|
|
|
fd_set readfds;
|
2008-04-01 12:39:02 +00:00
|
|
|
opal_list_t local_accepted_list;
|
|
|
|
opal_free_list_t pending_connections_fl;
|
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&local_accepted_list, opal_list_t);
|
|
|
|
OBJ_CONSTRUCT(&pending_connections_fl, opal_free_list_t);
|
|
|
|
opal_free_list_init(&pending_connections_fl,
|
|
|
|
sizeof(mca_oob_tcp_pending_connection_t),
|
|
|
|
OBJ_CLASS(mca_oob_tcp_pending_connection_t),
|
|
|
|
16, /* initial number */
|
|
|
|
-1, /* maximum number */
|
|
|
|
16); /* increment to grow by */
|
2006-09-14 21:29:51 +00:00
|
|
|
|
|
|
|
while (false == mca_oob_tcp_component.tcp_shutdown) {
|
|
|
|
count = 0;
|
|
|
|
|
|
|
|
FD_ZERO(&readfds);
|
2008-04-01 12:39:02 +00:00
|
|
|
max = -1;
|
|
|
|
for (i = 0 ; i < mca_oob_tcp_component.tcp_listen_thread_num_sockets ; ++i) {
|
|
|
|
int sd = mca_oob_tcp_component.tcp_listen_thread_sds[i];
|
|
|
|
FD_SET(sd, &readfds);
|
|
|
|
max = (sd > max) ? sd : max;
|
|
|
|
}
|
|
|
|
/* XXX - FIX ME - should really slowly back this off as
|
|
|
|
connections are done. Will reduce amount of polling in the
|
|
|
|
HNP after the initial connection storm. Would also require
|
|
|
|
some type of wakeup mechanism for when shutdown happens */
|
|
|
|
timeout.tv_sec = mca_oob_tcp_component.tcp_listen_thread_tv.tv_sec;
|
|
|
|
timeout.tv_usec = mca_oob_tcp_component.tcp_listen_thread_tv.tv_usec;
|
|
|
|
|
|
|
|
/* Block in a select for a short (10ms) amount of time to give
|
|
|
|
the other thread a chance to do some work. If a connection
|
|
|
|
comes in, we'll get woken up right away. */
|
|
|
|
rc = select(max + 1, &readfds, NULL, NULL, &timeout);
|
2006-09-14 21:29:51 +00:00
|
|
|
if (rc < 0) {
|
|
|
|
if (EAGAIN != opal_socket_errno && EINTR != opal_socket_errno) {
|
|
|
|
perror("select");
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* Spin accepting connections until either our queue is full
|
|
|
|
or all active listen sockets do not have any incoming
|
|
|
|
connections */
|
|
|
|
do {
|
|
|
|
accepted_connections = 0;
|
|
|
|
for (i = 0 ; i < mca_oob_tcp_component.tcp_listen_thread_num_sockets ; ++i) {
|
|
|
|
int sd = mca_oob_tcp_component.tcp_listen_thread_sds[i];
|
|
|
|
|
|
|
|
/* make sure we have space for an accepted connection */
|
|
|
|
if (opal_list_get_size(&local_accepted_list) >=
|
|
|
|
(size_t) mca_oob_tcp_component.tcp_copy_max_size) {
|
|
|
|
goto recover;
|
2006-09-14 21:29:51 +00:00
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* Can't wait because our thread is the only one that
|
|
|
|
can put things back in the free list */
|
|
|
|
OPAL_FREE_LIST_GET(&pending_connections_fl, fl_item, rc);
|
|
|
|
if (NULL == fl_item) goto recover;
|
|
|
|
|
|
|
|
pending_connection = (mca_oob_tcp_pending_connection_t*) fl_item;
|
|
|
|
pending_connection->fd = accept(sd,
|
|
|
|
(struct sockaddr*)&(pending_connection->addr),
|
|
|
|
&addrlen);
|
|
|
|
if (pending_connection->fd < 0) {
|
|
|
|
OPAL_FREE_LIST_RETURN(&pending_connections_fl, fl_item);
|
|
|
|
if (mca_oob_tcp_component.tcp_shutdown) goto done;
|
|
|
|
|
|
|
|
if (opal_socket_errno != EAGAIN &&
|
|
|
|
opal_socket_errno != EWOULDBLOCK) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "mca_oob_tcp_accept: accept() failed: %s (%d).",
|
2008-04-01 12:39:02 +00:00
|
|
|
strerror(opal_socket_errno), opal_socket_errno);
|
|
|
|
CLOSE_THE_SOCKET(pending_connection->fd);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_CONNECT) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0,
|
2008-04-01 12:39:02 +00:00
|
|
|
"%s mca_oob_tcp_listen_thread: new connection: "
|
|
|
|
"(%d, %d) %s:%d\n",
|
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
|
|
|
pending_connection->fd, opal_socket_errno,
|
|
|
|
opal_net_get_hostname((struct sockaddr*) &pending_connection->addr),
|
|
|
|
opal_net_get_port((struct sockaddr*) &pending_connection->addr));
|
|
|
|
}
|
|
|
|
|
|
|
|
opal_list_append(&local_accepted_list, (opal_list_item_t*) pending_connection);
|
|
|
|
accepted_connections++;
|
|
|
|
}
|
|
|
|
} while (accepted_connections > 0);
|
|
|
|
|
|
|
|
/* recover from a loop of accepting resources. Give any new
|
|
|
|
connections to the main thread and reap any available
|
|
|
|
connection fragments */
|
|
|
|
recover:
|
|
|
|
need_write = 0;
|
|
|
|
if (0 != opal_list_get_size(&local_accepted_list) ||
|
|
|
|
0 != opal_list_get_size(&mca_oob_tcp_component.tcp_connections_return)) {
|
|
|
|
opal_mutex_lock(&mca_oob_tcp_component.tcp_connections_lock);
|
|
|
|
/* copy local accepted list into shared list */
|
|
|
|
if (0 != opal_list_get_size(&local_accepted_list)) {
|
|
|
|
opal_list_join(&mca_oob_tcp_component.tcp_pending_connections,
|
|
|
|
opal_list_get_end(&mca_oob_tcp_component.tcp_pending_connections),
|
|
|
|
&local_accepted_list);
|
2006-09-14 21:29:51 +00:00
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* If the pending connection list is now at high
|
|
|
|
watermark, will signal the other thread */
|
|
|
|
if (opal_list_get_size(&mca_oob_tcp_component.tcp_pending_connections) >=
|
|
|
|
(size_t) mca_oob_tcp_component.tcp_copy_max_size) {
|
|
|
|
need_write = 1;
|
|
|
|
}
|
|
|
|
/* As an optimization, we could probably copy into a local
|
|
|
|
list, exit the lock, then free the pending connections,
|
|
|
|
but I'm not convinced that would be any faster */
|
|
|
|
while (NULL != (fl_item = (opal_free_list_item_t*)
|
|
|
|
opal_list_remove_first(&mca_oob_tcp_component.tcp_connections_return))) {
|
|
|
|
OPAL_FREE_LIST_RETURN(&pending_connections_fl, fl_item);
|
2006-09-14 21:29:51 +00:00
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
opal_mutex_unlock(&mca_oob_tcp_component.tcp_connections_lock);
|
2006-09-14 21:29:51 +00:00
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
if (need_write) {
|
|
|
|
char buf[1] = { '\0' };
|
|
|
|
#ifdef HAVE_PIPE
|
|
|
|
write(mca_oob_tcp_component.tcp_connections_pipe[1], buf, 1);
|
|
|
|
#endif
|
2006-09-14 21:29:51 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
done:
|
|
|
|
OBJ_DESTRUCT(&local_accepted_list);
|
|
|
|
OBJ_DESTRUCT(&pending_connections_fl);
|
|
|
|
|
2006-09-14 21:29:51 +00:00
|
|
|
return NULL;
|
|
|
|
}
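The polling step at the top of the listen thread's loop (build an fd_set from all listen sockets, track the highest descriptor, then block in select() with a short timeout) can be sketched on its own as follows. This assumes plain POSIX; the helper wait_readable() and its ready[] output array are hypothetical. Unlike the thread above, which simply tries accept() on every socket after select() returns, the sketch reports which descriptors were flagged.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>   /* pipe(), write(): only needed by the usage described below */

/* Hypothetical helper: wait up to timeout_usec microseconds for any of
 * the n descriptors in fds[] to become readable.  On return, ready[i]
 * is 1 if fds[i] is readable and 0 otherwise.  Returns the number of
 * readable descriptors, 0 on timeout, or -1 on error (see errno). */
static int wait_readable(const int *fds, int n, long timeout_usec, int *ready)
{
    fd_set readfds;
    struct timeval tv;
    int i, rc, max = -1;

    FD_ZERO(&readfds);
    for (i = 0; i < n; ++i) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &readfds);
        if (fds[i] > max) {
            max = fds[i];
        }
        ready[i] = 0;
    }

    /* select() may modify the timeout, so rebuild it for every call. */
    tv.tv_sec  = timeout_usec / 1000000L;
    tv.tv_usec = timeout_usec % 1000000L;

    rc = select(max + 1, &readfds, NULL, NULL, &tv);
    if (rc <= 0) {
        return rc;   /* timeout (0) or error (-1) */
    }

    for (i = 0; i < n; ++i) {
        ready[i] = FD_ISSET(fds[i], &readfds) ? 1 : 0;
    }
    return rc;
}
```

One way to try it is with the read ends of two pipe() pairs: the call times out and returns 0 while both pipes are empty, and returns 1 with the matching ready[] entry set once a byte has been written to one of them.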
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Handler for accepting connections from the listen thread. Called by
|
|
|
|
* timer or pipe signal.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
mca_oob_tcp_accept_thread_handler(int sd, short flags, void* user)
|
2006-09-14 21:29:51 +00:00
|
|
|
{
|
2008-04-01 12:39:02 +00:00
|
|
|
/* probably more efficient to use the user pointer for this rather
|
|
|
|
than always recreating the list. Future work. */
|
|
|
|
opal_list_t local_accepted_list;
|
|
|
|
opal_list_t local_return_list;
|
|
|
|
mca_oob_tcp_pending_connection_t *new_connection;
|
|
|
|
struct timeval tv;
|
|
|
|
|
|
|
|
if (mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_INFO) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s in accept_thread_handler: %d",
|
2008-04-01 12:39:02 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), flags);
|
|
|
|
}
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
OBJ_CONSTRUCT(&local_accepted_list, opal_list_t);
|
|
|
|
OBJ_CONSTRUCT(&local_return_list, opal_list_t);
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* read the byte waiting - if we don't have pipe, this can't ever
|
|
|
|
happen, so no need for yet another #if */
|
|
|
|
if (OPAL_EV_READ == flags) {
|
|
|
|
char buf[1];
|
|
|
|
read(sd, buf, 1);
|
2006-09-14 21:29:51 +00:00
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* Copy in all pending connections. opal_list_join is O(1), so
|
|
|
|
this is pretty cheap. Size is pretty friendly to thread
|
|
|
|
safety, and join will properly handle the case where the list
|
|
|
|
magically got shorter. */
|
|
|
|
if (0 != opal_list_get_size(&mca_oob_tcp_component.tcp_pending_connections)) {
|
|
|
|
opal_mutex_lock(&mca_oob_tcp_component.tcp_connections_lock);
|
|
|
|
opal_list_join(&local_accepted_list,
|
|
|
|
opal_list_get_end(&local_accepted_list),
|
|
|
|
&mca_oob_tcp_component.tcp_pending_connections);
|
|
|
|
opal_mutex_unlock(&mca_oob_tcp_component.tcp_connections_lock);
|
|
|
|
}
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* process all the connections */
|
|
|
|
while (NULL != (new_connection = (mca_oob_tcp_pending_connection_t*)
|
|
|
|
opal_list_remove_first(&local_accepted_list))) {
|
|
|
|
mca_oob_tcp_create_connection(new_connection->fd,
|
|
|
|
(struct sockaddr*) &(new_connection->addr));
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
opal_list_append(&local_return_list, (opal_list_item_t*) new_connection);
|
2006-09-14 21:29:51 +00:00
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* Copy all processed connections into the return list */
|
|
|
|
if (0 != opal_list_get_size(&local_return_list)) {
|
|
|
|
opal_mutex_lock(&mca_oob_tcp_component.tcp_connections_lock);
|
|
|
|
opal_list_join(&mca_oob_tcp_component.tcp_connections_return,
|
|
|
|
opal_list_get_end(&mca_oob_tcp_component.tcp_connections_return),
|
|
|
|
&local_return_list);
|
|
|
|
opal_mutex_unlock(&mca_oob_tcp_component.tcp_connections_lock);
|
|
|
|
}
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
OBJ_DESTRUCT(&local_accepted_list);
|
|
|
|
OBJ_DESTRUCT(&local_return_list);
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
tv.tv_sec = mca_oob_tcp_component.tcp_listen_thread_tv.tv_sec;
|
|
|
|
tv.tv_usec = mca_oob_tcp_component.tcp_listen_thread_tv.tv_usec;
|
|
|
|
#ifdef HAVE_PIPE
|
|
|
|
opal_event_set(&mca_oob_tcp_component.tcp_listen_thread_event,
|
|
|
|
mca_oob_tcp_component.tcp_connections_pipe[0],
|
|
|
|
OPAL_EV_READ,
|
|
|
|
mca_oob_tcp_accept_thread_handler, NULL);
|
|
|
|
#else
|
|
|
|
opal_event_set(&mca_oob_tcp_component.tcp_listen_thread_event,
|
|
|
|
-1, 0,
|
|
|
|
mca_oob_tcp_accept_thread_handler, NULL);
|
|
|
|
#endif
|
|
|
|
opal_event_add(&mca_oob_tcp_component.tcp_listen_thread_event, &tv);
|
|
|
|
}
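The handler above holds the connections lock only long enough to splice the pending list into a local list, then processes the entries without the lock. A minimal sketch of that pattern follows. It uses a hand-rolled singly linked list and a pthread mutex as stand-ins for opal_list_t and opal_mutex_t; every name in it (struct node, list_join, enqueue_pending, drain_pending, record_fd) is hypothetical and not part of Open MPI. Unlike the handler, the sketch always takes the lock rather than peeking at the list size first.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for opal_list_item_t / opal_list_t. */
struct node { struct node *next; int fd; };
struct list { struct node *head, *tail; size_t size; };

static void list_append(struct list *l, struct node *n)
{
    n->next = NULL;
    if (NULL != l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
    ++l->size;
}

/* Move every node of src to the end of dst in O(1); src ends up empty. */
static void list_join(struct list *dst, struct list *src)
{
    if (NULL == src->head) return;
    if (NULL != dst->tail) dst->tail->next = src->head; else dst->head = src->head;
    dst->tail = src->tail;
    dst->size += src->size;
    src->head = src->tail = NULL;
    src->size = 0;
}

static struct node *list_remove_first(struct list *l)
{
    struct node *n = l->head;
    if (NULL == n) return NULL;
    l->head = n->next;
    if (NULL == l->head) l->tail = NULL;
    n->next = NULL;
    --l->size;
    return n;
}

static struct list pending = { NULL, NULL, 0 };
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer side (the listen thread in the original design). */
void enqueue_pending(struct node *n)
{
    assert(NULL != n);
    pthread_mutex_lock(&pending_lock);
    list_append(&pending, n);
    pthread_mutex_unlock(&pending_lock);
}

/* Consumer side: lock only for the O(1) splice, then work unlocked.
 * Returns the number of entries handled. */
size_t drain_pending(void (*handle)(int fd))
{
    struct list local = { NULL, NULL, 0 };
    struct node *n;
    size_t handled = 0;

    pthread_mutex_lock(&pending_lock);
    list_join(&local, &pending);
    pthread_mutex_unlock(&pending_lock);

    while (NULL != (n = list_remove_first(&local))) {
        handle(n->fd);
        ++handled;
    }
    return handled;
}

/* Example handler for the sketch: remembers the last descriptor seen. */
int last_fd = -1;
void record_fd(int fd) { last_fd = fd; }
```

Because the splice is constant time, the producer is blocked only briefly no matter how many connections are queued.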
|
2006-09-14 21:29:51 +00:00
|
|
|
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/*
|
|
|
|
* Create the actual listen thread. Should only be called once.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
mca_oob_tcp_create_listen_thread(void)
|
|
|
|
{
|
|
|
|
struct timeval tv;
|
2006-09-14 21:29:51 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
#ifdef HAVE_PIPE
|
|
|
|
if (pipe(mca_oob_tcp_component.tcp_connections_pipe) < 0) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not their opal_* functions.
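The duplicate handling described for orte_show_help() can be illustrated with a minimal, self-contained C sketch. It is not the ORTE implementation: the names help_aggregate() and help_flush(), the fixed-size table, and the wording of the summary line are all assumptions made for illustration. The first copy of a (file, topic) pair is printed, later copies are only counted, and a periodic flush reports how many copies arrived since the last report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 32            /* hypothetical capacity for this sketch */

/* One entry per distinct (file, topic) help message seen so far. */
struct help_entry {
    char file[64];               /* longer names are truncated here */
    char topic[64];
    int  pending;                /* duplicates seen since the last flush */
};

static struct help_entry seen[MAX_TOPICS];
static int num_seen = 0;

/* Returns 1 if the message was displayed, 0 if it was a suppressed
 * duplicate. If the table is full, new messages are printed untracked. */
int help_aggregate(const char *file, const char *topic, const char *text)
{
    int i;

    assert(NULL != file && NULL != topic && NULL != text);
    for (i = 0; i < num_seen; ++i) {
        if (0 == strcmp(seen[i].file, file) &&
            0 == strcmp(seen[i].topic, topic)) {
            ++seen[i].pending;   /* count it, but do not print it again */
            return 0;
        }
    }
    if (num_seen < MAX_TOPICS) {
        snprintf(seen[num_seen].file, sizeof(seen[num_seen].file), "%s", file);
        snprintf(seen[num_seen].topic, sizeof(seen[num_seen].topic), "%s", topic);
        seen[num_seen].pending = 0;
        ++num_seen;
    }
    fprintf(stderr, "%s\n", text);
    return 1;
}

/* Meant to be called from a periodic timer (the text above mentions
 * roughly every 5 seconds). Returns the number of duplicates reported. */
int help_flush(void)
{
    int i, total = 0;

    for (i = 0; i < num_seen; ++i) {
        if (seen[i].pending > 0) {
            fprintf(stderr, "%d more copies of help message %s / %s\n",
                    seen[i].pending, seen[i].file, seen[i].topic);
            total += seen[i].pending;
            seen[i].pending = 0;
        }
    }
    return total;
}
```

Setting an aggregation switch to "off" would amount to bypassing the lookup and always printing, which matches the original per-process behavior.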
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like opal_output'ing to
stream 0 did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
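To make the idea of a single active filter concrete, here is a small C sketch. It does not reflect the real filter framework's interface, which is not shown here: the enum msg_kind classes, the filter_fn signature, and the set_filter()/emit() helpers are all hypothetical. The tagging filter also does no escaping of special characters, so its output is only XML-like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical message classes, loosely following the text above. */
enum msg_kind { MSG_HELP, MSG_INFRA, MSG_STDOUT, MSG_STDERR };

/* A filter renders msg into out and returns the snprintf-style length. */
typedef int (*filter_fn)(enum msg_kind kind, const char *msg,
                         char *out, size_t outlen);

static filter_fn active_filter = NULL;   /* at most one filter is active */

static const char *kind_tag(enum msg_kind kind)
{
    switch (kind) {
    case MSG_HELP:   return "help";
    case MSG_INFRA:  return "infra";
    case MSG_STDOUT: return "stdout";
    case MSG_STDERR: return "stderr";
    }
    return "unknown";
}

/* Example filter: wrap each message in a tag naming its class. */
int tag_filter(enum msg_kind kind, const char *msg, char *out, size_t outlen)
{
    const char *tag = kind_tag(kind);
    return snprintf(out, outlen, "<%s>%s</%s>", tag, msg, tag);
}

/* Selecting a filter replaces any previous one; NULL disables filtering. */
void set_filter(filter_fn fn)
{
    active_filter = fn;
}

/* Render a message for final output, passing it through the active
 * filter if there is one, or copying it unchanged otherwise. */
int emit(enum msg_kind kind, const char *msg, char *out, size_t outlen)
{
    assert(NULL != msg && NULL != out && outlen > 0);
    if (NULL == active_filter) {
        return snprintf(out, outlen, "%s", msg);
    }
    return active_filter(kind, msg, out, outlen);
}
```

With no filter selected the message passes through untouched, mirroring the "not active by default" behavior described above.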
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "mca_oob_tcp_create_listen_thread: pipe failed: %d", errno);
|
2006-09-14 21:29:51 +00:00
|
|
|
return ORTE_ERROR;
|
|
|
|
}
|
2008-04-01 12:39:02 +00:00
|
|
|
#endif
|
2006-09-14 21:29:51 +00:00
|
|
|
|
|
|
|
/* start the listen thread */
|
|
|
|
mca_oob_tcp_component.tcp_listen_thread.t_run = mca_oob_tcp_listen_thread;
|
|
|
|
mca_oob_tcp_component.tcp_listen_thread.t_arg = NULL;
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* register event for read and timeout */
|
|
|
|
tv.tv_sec = mca_oob_tcp_component.tcp_listen_thread_tv.tv_sec;
|
|
|
|
tv.tv_usec = mca_oob_tcp_component.tcp_listen_thread_tv.tv_usec;
|
|
|
|
#ifdef HAVE_PIPE
|
|
|
|
opal_event_set(&mca_oob_tcp_component.tcp_listen_thread_event,
|
|
|
|
mca_oob_tcp_component.tcp_connections_pipe[0],
|
|
|
|
OPAL_EV_READ,
|
|
|
|
mca_oob_tcp_accept_thread_handler, NULL);
|
|
|
|
#else
|
|
|
|
opal_event_set(&mca_oob_tcp_component.tcp_listen_thread_event,
|
|
|
|
-1, 0,
|
|
|
|
mca_oob_tcp_accept_thread_handler, NULL);
|
|
|
|
#endif
|
|
|
|
opal_event_add(&mca_oob_tcp_component.tcp_listen_thread_event, &tv);
|
|
|
|
|
2006-09-14 21:29:51 +00:00
|
|
|
return opal_thread_start(&mca_oob_tcp_component.tcp_listen_thread);
|
|
|
|
}
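When HAVE_PIPE is defined, the function above registers the read end of a pipe with the event library together with a timeout, so the handler runs either when the listen thread writes a byte or when the timer expires. The sketch below shows the same wake-up idea with plain POSIX calls. It uses select() as a stand-in for the opal_event_* API, and the names wake_init(), wake_notify(), and wake_wait() are hypothetical.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

static int wake_pipe[2] = { -1, -1 };

/* Create the pipe once; returns 0 on success, -1 on failure. */
int wake_init(void)
{
    return pipe(wake_pipe);
}

/* Producer side: write one byte per notification. */
int wake_notify(void)
{
    char byte = 0;
    return (1 == write(wake_pipe[1], &byte, 1)) ? 0 : -1;
}

/* Consumer side: wait until notified or until the timeout expires.
 * Returns 1 if a notification byte was consumed, 0 on timeout,
 * and -1 on error. */
int wake_wait(long timeout_usec)
{
    fd_set rfds;
    struct timeval tv;
    char buf[1];
    int rc;

    assert(wake_pipe[0] >= 0);
    FD_ZERO(&rfds);
    FD_SET(wake_pipe[0], &rfds);
    tv.tv_sec  = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;

    rc = select(wake_pipe[0] + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0) return -1;
    if (0 == rc) return 0;          /* timer expired, nothing to read */
    return (1 == read(wake_pipe[0], buf, 1)) ? 1 : -1;
}
```

The timeout path corresponds to the non-pipe branch above, where the event is registered with no file descriptor and fires on the timer alone.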
|
|
|
|
|
|
|
|
|
2004-07-01 14:49:54 +00:00
|
|
|
/*
|
2005-05-05 16:31:40 +00:00
|
|
|
* Handle probe
|
2004-07-01 14:49:54 +00:00
|
|
|
*/
|
2005-05-05 16:31:40 +00:00
|
|
|
static void mca_oob_tcp_recv_probe(int sd, mca_oob_tcp_hdr_t* hdr)
|
2004-08-02 21:24:00 +00:00
|
|
|
{
|
2005-05-19 16:16:19 +00:00
|
|
|
unsigned char* ptr = (unsigned char*)hdr;
|
2005-05-05 16:31:40 +00:00
|
|
|
size_t cnt = 0;
|
|
|
|
|
2005-05-19 16:16:19 +00:00
|
|
|
hdr->msg_type = MCA_OOB_TCP_PROBE;
|
2005-05-05 16:31:40 +00:00
|
|
|
hdr->msg_dst = hdr->msg_src;
|
2008-02-28 01:57:57 +00:00
|
|
|
hdr->msg_src = *ORTE_PROC_MY_NAME;
|
2005-05-19 16:16:19 +00:00
|
|
|
MCA_OOB_TCP_HDR_HTON(hdr);
|
|
|
|
|
2005-05-05 16:31:40 +00:00
|
|
|
while(cnt < sizeof(mca_oob_tcp_hdr_t)) {
|
|
|
|
int retval = send(sd, (char *)ptr+cnt, sizeof(mca_oob_tcp_hdr_t)-cnt, 0);
|
|
|
|
if(retval < 0) {
|
2006-08-14 20:14:44 +00:00
|
|
|
if(opal_socket_errno != EINTR && opal_socket_errno != EAGAIN && opal_socket_errno != EWOULDBLOCK) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s-%s mca_oob_tcp_peer_recv_probe: send() failed: %s (%d)\n",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
2007-07-20 02:34:29 +00:00
|
|
|
ORTE_NAME_PRINT(&(hdr->msg_src)),
|
2006-12-14 18:20:43 +00:00
|
|
|
strerror(opal_socket_errno),
|
2006-08-14 20:14:44 +00:00
|
|
|
opal_socket_errno);
|
2006-08-23 03:32:36 +00:00
|
|
|
CLOSE_THE_SOCKET(sd);
|
2005-05-05 16:31:40 +00:00
|
|
|
return;
|
2004-09-09 19:21:34 +00:00
|
|
|
}
|
2005-05-05 16:31:40 +00:00
|
|
|
continue;
|
2004-09-01 23:07:40 +00:00
|
|
|
}
|
2005-05-05 16:31:40 +00:00
|
|
|
cnt += retval;
|
2004-08-02 21:24:00 +00:00
|
|
|
}
|
2006-08-23 03:32:36 +00:00
|
|
|
CLOSE_THE_SOCKET(sd);
|
2005-05-05 16:31:40 +00:00
|
|
|
}
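The probe reply above is written with a loop that keeps calling send() until the whole header has gone out, retrying on EINTR, EAGAIN, and EWOULDBLOCK and giving up on any other error. A minimal standalone version of that loop is sketched here. The helper name send_all() is hypothetical, and it checks plain errno rather than Open MPI's opal_socket_errno macro.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Send exactly len bytes from buf on socket sd.
 * Returns 0 on success, or -1 on a fatal error with errno set. */
int send_all(int sd, const void *buf, size_t len)
{
    const unsigned char *ptr = buf;
    size_t cnt = 0;

    assert(NULL != buf || 0 == len);
    while (cnt < len) {
        ssize_t rc = send(sd, ptr + cnt, len - cnt, 0);
        if (rc < 0) {
            /* Transient conditions: try again. On a non-blocking
             * socket this spins until the socket becomes writable. */
            if (EINTR == errno || EAGAIN == errno || EWOULDBLOCK == errno) {
                continue;
            }
            return -1;              /* anything else is fatal */
        }
        cnt += (size_t) rc;         /* a short write just advances cnt */
    }
    return 0;
}
```

Closing the socket is left to the caller in this sketch, whereas the function above closes it on both the error and the success path.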
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
|
2005-05-05 16:31:40 +00:00
|
|
|
/*
|
2008-04-01 12:39:02 +00:00
|
|
|
* Complete the OOB-level handshake to establish a connection with
|
|
|
|
* another peer. Called when the remote peer replies with his process
|
|
|
|
* identifier. Used in both the threaded and event listen modes.
|
2005-05-05 16:31:40 +00:00
|
|
|
*/
|
|
|
|
static void mca_oob_tcp_recv_connect(int sd, mca_oob_tcp_hdr_t* hdr)
|
|
|
|
{
|
|
|
|
mca_oob_tcp_peer_t* peer;
|
|
|
|
int flags;
|
|
|
|
int cmpval;
|
2004-08-02 21:24:00 +00:00
|
|
|
|
|
|
|
/* now set socket up to be non-blocking */
|
|
|
|
if((flags = fcntl(sd, F_GETFL, 0)) < 0) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_recv_handler: fcntl(F_GETFL) failed: %s (%d)",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), strerror(opal_socket_errno), opal_socket_errno);
|
2004-08-02 21:24:00 +00:00
|
|
|
} else {
|
|
|
|
flags |= O_NONBLOCK;
|
|
|
|
if(fcntl(sd, F_SETFL, flags) < 0) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_recv_handler: fcntl(F_SETFL) failed: %s (%d)",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), strerror(opal_socket_errno), opal_socket_errno);
|
2004-08-02 21:24:00 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-02-28 01:57:57 +00:00
|
|
|
/* check for invalid name - if this is true, then we have an error
|
2004-08-25 17:39:08 +00:00
|
|
|
*/
|
2008-02-28 01:57:57 +00:00
|
|
|
cmpval = orte_util_compare_name_fields(ORTE_NS_CMP_ALL, &hdr->msg_src, ORTE_NAME_INVALID);
|
|
|
|
if (cmpval == OPAL_EQUAL) {
|
|
|
|
ORTE_ERROR_LOG(ORTE_ERR_VALUE_OUT_OF_BOUNDS);
|
|
|
|
return;
|
2004-08-25 17:39:08 +00:00
|
|
|
}
|
|
|
|
|
2004-08-02 21:24:00 +00:00
|
|
|
/* lookup the corresponding process */
|
2005-05-05 16:31:40 +00:00
|
|
|
peer = mca_oob_tcp_peer_lookup(&hdr->msg_src);
|
2004-08-02 21:24:00 +00:00
|
|
|
if(NULL == peer) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not thei opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so you if explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent view the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_recv_handler: unable to locate peer",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
|
2006-08-23 03:32:36 +00:00
|
|
|
CLOSE_THE_SOCKET(sd);
|
2004-08-02 21:24:00 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
/* is the peer instance willing to accept this connection */
|
|
|
|
if(mca_oob_tcp_peer_accept(peer, sd) == false) {
|
Fix a number of OOB issues:
* Remove the connect() timeout code, as it had some nasty race conditions
when connections were established as the trigger was firing. A better
solution has been found for the cluster where this was needed, so just
removing it was easiest.
* When a fatal error (too many connection failures) occurs, set an error
on messages in the queue even if there isn't an active message. The
first message to any peer will be queued without being active (and
so will all subsequent messages until the connection is established),
and the orteds will hang until that first message completes. So if
an orted can never contact its peer, it will never exit and just sit
waiting for that message to complete.
* Cover an interesting RST condition in the connect code. A connection
can complete the three-way handshake, the connector can even send
some data, but the server side will drop the connection because it
can't move it from the half-connected to fully-connected state because
of space shortage in the listen backlog queue. This causes a RST to
be received the first time that recv() is called, which will be when waiting
for the remote side of the OOB ack. In this case, transition the
connection back into a CLOSED state and try to connect again.
* Add levels of debugging, rather than all or nothing, each building on
the previous level. 0 (default) is hard errors. 1 is connection
error debugging info. 2 is all connection info. 3 is more state
info. 4 includes all message info.
* Add some hopefully useful comments
This commit was SVN r14261.
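The cumulative debug levels described in the commit message above can be sketched as follows. This is a minimal illustration, not the component's code: only `OOB_TCP_DEBUG_CONNECT_FAIL` appears in the source, and its numeric value here is inferred from the message ("1 is connection error debugging info"). The other level names, `debug_level`, `debug_enabled()`, and `debug_log()` are hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>

/* Level names other than OOB_TCP_DEBUG_CONNECT_FAIL are hypothetical;
 * the numeric values follow the description in the commit message. */
enum {
    OOB_TCP_DEBUG_ERROR        = 0, /* hard errors (default)        */
    OOB_TCP_DEBUG_CONNECT_FAIL = 1, /* connection error debugging   */
    OOB_TCP_DEBUG_CONNECT      = 2, /* all connection info          */
    OOB_TCP_DEBUG_STATE        = 3, /* more state info              */
    OOB_TCP_DEBUG_MSG          = 4  /* all message info             */
};

/* Hypothetical stand-in for mca_oob_tcp_component.tcp_debug. */
static int debug_level = OOB_TCP_DEBUG_ERROR;

/* Each level builds on the previous one, so a ">=" comparison is enough. */
static int debug_enabled(int level)
{
    return debug_level >= level;
}

/* Print to stderr only when the requested level is enabled.
 * Returns the number of characters written, or 0 when suppressed. */
static int debug_log(int level, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (!debug_enabled(level)) {
        return 0;
    }
    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);
    va_end(ap);
    return n;
}
```

With `debug_level` set to 2, messages at levels 0 through 2 are printed and messages at levels 3 and 4 are dropped, which matches the "each building on the previous level" wording.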
2007-04-07 22:33:30 +00:00
|
|
|
if(mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_CONNECT_FAIL) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
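To make the idea of tagging each message by origin concrete, here is a rough sketch of what a filter step could look like. It is an assumption-laden illustration and not the filter framework's interface: `msg_class_t`, `tag_for()`, `filter_xml()`, and the tag names are all hypothetical, chosen from the message kinds listed above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical message classes, taken from the examples in the text. */
typedef enum { MSG_HELP, MSG_INFRA, MSG_STDOUT, MSG_STDERR } msg_class_t;

/* Hypothetical tag names; the real component may use different ones. */
static const char *tag_for(msg_class_t c)
{
    switch (c) {
    case MSG_HELP:   return "help";
    case MSG_INFRA:  return "infra";
    case MSG_STDOUT: return "stdout";
    case MSG_STDERR: return "stderr";
    }
    return "unknown";
}

/* Wrap msg in <tag>...</tag>, escaping '&', '<' and '>'.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int filter_xml(msg_class_t c, const char *msg, char *out, size_t outlen)
{
    const char *tag = tag_for(c);
    size_t used;
    int n = snprintf(out, outlen, "<%s>", tag);

    if (n < 0 || (size_t)n >= outlen) {
        return -1;
    }
    used = (size_t)n;

    for (; *msg != '\0'; ++msg) {
        char one[2] = { *msg, '\0' };
        const char *rep = one;
        size_t len;

        if (*msg == '&')      rep = "&amp;";
        else if (*msg == '<') rep = "&lt;";
        else if (*msg == '>') rep = "&gt;";

        len = strlen(rep);
        if (used + len >= outlen) {
            return -1; /* keep at least one byte for the terminator */
        }
        memcpy(out + used, rep, len);
        used += len;
    }

    n = snprintf(out + used, outlen - used, "</%s>", tag);
    if (n < 0 || (size_t)n >= outlen - used) {
        return -1;
    }
    return 0;
}
```

For example, passing `MSG_STDERR` and the text `a<b` would produce `<stderr>a&lt;b</stderr>`, so a third-party tool reading the HNP's output can tell which stream a line came from.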
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
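The duplicate-suppression behavior controlled by orte_base_help_aggregate can be pictured with the sketch below. It is only a model of the described behavior (show the first copy, count later copies, report the count periodically); the table layout and the names `help_record()`, `help_flush()`, and `help_aggregate` are hypothetical and do not come from the ORTE sources.

```c
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16
#define TOPIC_LEN  64

/* One entry per distinct help message seen so far (hypothetical layout). */
typedef struct {
    char topic[TOPIC_LEN];
    int  pending; /* duplicates received since the last flush */
} help_entry_t;

static help_entry_t seen[MAX_TOPICS];
static int num_seen = 0;

/* Stand-in for the orte_base_help_aggregate parameter (1 = aggregate). */
static int help_aggregate = 1;

/* Returns 1 if the message should be displayed now, 0 if it is a
 * duplicate that was only counted. */
static int help_record(const char *topic)
{
    int i;

    if (!help_aggregate) {
        return 1; /* original behavior: always display */
    }
    for (i = 0; i < num_seen; ++i) {
        if (strcmp(seen[i].topic, topic) == 0) {
            seen[i].pending++;
            return 0;
        }
    }
    if (num_seen < MAX_TOPICS) {
        snprintf(seen[num_seen].topic, TOPIC_LEN, "%s", topic);
        seen[num_seen].pending = 0;
        num_seen++;
    }
    return 1;
}

/* Meant to be called from a periodic timer (about every 5 seconds in the
 * description above). Returns how many new copies arrived and resets it. */
static int help_flush(const char *topic)
{
    int i;

    for (i = 0; i < num_seen; ++i) {
        if (strcmp(seen[i].topic, topic) == 0) {
            int n = seen[i].pending;
            seen[i].pending = 0;
            return n;
        }
    }
    return 0;
}
```

If the same message arrives three times, the first call reports "display" and the next two are counted; a later flush reports two new copies. Setting `help_aggregate` to 0 makes every call report "display".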
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to have configure.m4 search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s-%s mca_oob_tcp_recv_handler: "
|
2007-07-20 02:34:29 +00:00
|
|
|
"rejected connection from %s connection state %d",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
2007-07-20 02:34:29 +00:00
|
|
|
ORTE_NAME_PRINT(&(peer->peer_name)),
|
|
|
|
ORTE_NAME_PRINT(&(hdr->msg_src)),
|
2004-09-09 21:57:45 +00:00
|
|
|
peer->peer_state);
|
|
|
|
}
|
2006-08-23 03:32:36 +00:00
|
|
|
CLOSE_THE_SOCKET(sd);
|
2004-08-02 21:24:00 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
|
2005-05-05 16:31:40 +00:00
|
|
|
/*
|
|
|
|
* Event callback when there is data available on the registered
|
2008-04-01 12:39:02 +00:00
|
|
|
* socket to recv. This is called for the listen sockets to accept an
|
|
|
|
* incoming connection, on new sockets trying to complete the software
|
|
|
|
* connection process, and for probes. Data on an established
|
|
|
|
* connection is handled elsewhere.
|
2005-05-05 16:31:40 +00:00
|
|
|
*/
|
|
|
|
static void mca_oob_tcp_recv_handler(int sd, short flags, void* user)
|
|
|
|
{
|
|
|
|
mca_oob_tcp_hdr_t hdr;
|
|
|
|
mca_oob_tcp_event_t* event = (mca_oob_tcp_event_t *)user;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
/* accept new connections on the listen socket */
|
2007-10-26 16:36:51 +00:00
|
|
|
if( (mca_oob_tcp_component.tcp_listen_sd == sd)
|
|
|
|
#if OPAL_WANT_IPV6
|
|
|
|
|| (mca_oob_tcp_component.tcp6_listen_sd == sd)
|
|
|
|
#endif /* OPAL_WANT_IPV6 */
|
|
|
|
) {
|
2007-04-25 01:55:40 +00:00
|
|
|
mca_oob_tcp_accept(sd);
|
2005-05-05 16:31:40 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
OBJ_RELEASE(event);
|
|
|
|
|
2006-01-04 22:29:09 +00:00
|
|
|
/* Some mem checkers don't realize that hdr is guaranteed to be
|
|
|
|
fully filled in during the read(), below :-( */
|
|
|
|
OMPI_DEBUG_ZERO(hdr);
|
|
|
|
|
2005-05-05 16:31:40 +00:00
|
|
|
/* recv the process identifier */
|
|
|
|
while((rc = recv(sd, (char *)&hdr, sizeof(hdr), 0)) != sizeof(hdr)) {
|
|
|
|
if(rc >= 0) {
|
2007-04-07 22:33:30 +00:00
|
|
|
if(mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_CONNECT_FAIL) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_recv_handler: peer closed connection",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
|
2005-05-05 16:31:40 +00:00
|
|
|
}
|
2006-08-23 03:32:36 +00:00
|
|
|
CLOSE_THE_SOCKET(sd);
|
2005-05-05 16:31:40 +00:00
|
|
|
return;
|
|
|
|
}
|
2006-08-14 20:14:44 +00:00
|
|
|
if(opal_socket_errno != EINTR) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_recv_handler: recv() failed: %s (%d)\n",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), strerror(opal_socket_errno), opal_socket_errno);
|
2006-08-23 03:32:36 +00:00
|
|
|
CLOSE_THE_SOCKET(sd);
|
2005-05-05 16:31:40 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
MCA_OOB_TCP_HDR_NTOH(&hdr);
|
|
|
|
|
|
|
|
/* dispatch based on message type */
|
|
|
|
switch(hdr.msg_type) {
|
|
|
|
case MCA_OOB_TCP_PROBE:
|
|
|
|
mca_oob_tcp_recv_probe(sd, &hdr);
|
|
|
|
break;
|
|
|
|
case MCA_OOB_TCP_CONNECT:
|
|
|
|
mca_oob_tcp_recv_connect(sd, &hdr);
|
|
|
|
break;
|
|
|
|
default:
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_recv_handler: invalid message type: %d\n",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), hdr.msg_type);
|
2006-08-23 03:32:36 +00:00
|
|
|
CLOSE_THE_SOCKET(sd);
|
2005-05-05 16:31:40 +00:00
|
|
|
break;
|
|
|
|
}
|
2007-03-16 23:11:45 +00:00
|
|
|
|
2005-05-05 16:31:40 +00:00
|
|
|
}
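The handler above retries recv() only on EINTR and treats any short read as a closed connection. For comparison, here is a sketch of a helper that keeps reading until the whole header has arrived. This is a variation for illustration, not what mca_oob_tcp_recv_handler does; `recv_exact()` is a hypothetical name, and it assumes a blocking POSIX stream socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read exactly len bytes from sd into buf.
 * Returns 1 on success, 0 if the peer closed before len bytes arrived,
 * and -1 on any error other than EINTR. */
static int recv_exact(int sd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t rc = recv(sd, p + got, len - got, 0);

        if (rc > 0) {
            got += (size_t)rc;   /* partial read: keep going */
        } else if (rc == 0) {
            return 0;            /* orderly shutdown by the peer */
        } else if (errno != EINTR) {
            return -1;           /* real error; caller closes the socket */
        }
        /* rc < 0 with EINTR: interrupted by a signal, retry */
    }
    return 1;
}
```

A caller would close the socket on a 0 or -1 result and convert the header to host byte order on 1, mirroring the CLOSE_THE_SOCKET() and MCA_OOB_TCP_HDR_NTOH() steps in the handler.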
|
|
|
|
|
|
|
|
|
2004-08-02 21:24:00 +00:00
|
|
|
/*
|
2008-04-01 12:39:02 +00:00
|
|
|
* Component initialization - create a module and initialize the
|
|
|
|
* static resources associated with that module.
|
|
|
|
*
|
|
|
|
* Also initializes the list of devices that will be used/supported by
|
|
|
|
* the module, using the if_include and if_exclude variables. This is
|
|
|
|
* the only place that this sorting should occur -- all other places
|
|
|
|
* should use the tcp_available_devices list. This is a change from
|
|
|
|
* previous versions of this component.
|
2004-08-02 21:24:00 +00:00
|
|
|
*/
|
2005-03-14 20:57:21 +00:00
|
|
|
mca_oob_t* mca_oob_tcp_component_init(int* priority)
|
2004-07-01 14:49:54 +00:00
|
|
|
{
|
2007-05-31 02:29:44 +00:00
|
|
|
int i;
|
2007-07-20 01:34:02 +00:00
|
|
|
bool found_local = false;
|
|
|
|
bool found_nonlocal = false;
|
2007-05-31 02:29:44 +00:00
|
|
|
|
2004-08-28 01:15:19 +00:00
|
|
|
*priority = 1;
|
2004-11-02 13:14:34 +00:00
|
|
|
|
2004-09-30 21:23:10 +00:00
|
|
|
/* are there any interfaces? */
|
2006-09-26 16:37:04 +00:00
|
|
|
if(opal_ifcount() <= 0)
|
2004-09-30 21:23:10 +00:00
|
|
|
return NULL;
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* Which interfaces should we use? Start by building a list of
|
|
|
|
all devices that meet the requirements of the if_include and
|
|
|
|
if_exclude list. This might include local and non-local
|
|
|
|
interfaces mixed together. After that sorting is done, if there
|
|
|
|
is a mix of devices, we go through the devices that survived
|
|
|
|
the initial sort and remove all the local devices (since we
|
|
|
|
have non-local devices to use). */
|
2007-05-31 02:29:44 +00:00
|
|
|
for (i = opal_ifbegin() ; i > 0 ; i = opal_ifnext(i)) {
|
|
|
|
char name[32];
|
2007-07-20 01:34:02 +00:00
|
|
|
mca_oob_tcp_device_t *dev;
|
|
|
|
|
2007-05-31 02:29:44 +00:00
|
|
|
opal_ifindextoname(i, name, sizeof(name));
|
2008-03-21 15:35:40 +00:00
|
|
|
|
2007-05-31 02:29:44 +00:00
|
|
|
if (mca_oob_tcp_component.tcp_include != NULL &&
|
|
|
|
strstr(mca_oob_tcp_component.tcp_include,name) == NULL) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we have already done the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done for configure.m4 to search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
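The duplicate suppression described in the commit message above can be sketched in isolation. This is an illustrative sketch only, not the ORTE implementation: `help_report()`, `help_flush()`, the fixed-size table and the string topic keys are all assumptions made for the example, and the periodic (roughly 5 second) timer that would call `help_flush()` is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16

/* Hypothetical bookkeeping for one distinct help message. */
typedef struct {
    char topic[64];   /* identifies one help message */
    int pending;      /* duplicates seen since the last flush */
} help_entry_t;

static help_entry_t seen[MAX_TOPICS];
static int num_seen = 0;

/* Returns 1 if the message should be displayed now (first sighting),
 * 0 if it was only counted as a duplicate, -1 if the table is full. */
static int help_report(const char *topic)
{
    assert(NULL != topic);
    for (int i = 0; i < num_seen; i++) {
        /* Compare only as many bytes as the table can store. */
        if (0 == strncmp(seen[i].topic, topic, sizeof(seen[i].topic) - 1)) {
            seen[i].pending++;
            return 0;
        }
    }
    if (num_seen == MAX_TOPICS) {
        return -1;
    }
    strncpy(seen[num_seen].topic, topic, sizeof(seen[num_seen].topic) - 1);
    seen[num_seen].topic[sizeof(seen[num_seen].topic) - 1] = '\0';
    seen[num_seen].pending = 0;
    num_seen++;
    return 1;
}

/* Meant to be called periodically: prints one summary line per topic
 * that gathered new duplicates, resets the counters, and returns the
 * total number of duplicates that were summarized. */
static int help_flush(FILE *out)
{
    int total = 0;
    for (int i = 0; i < num_seen; i++) {
        if (seen[i].pending > 0) {
            fprintf(out, "%d more copies of help message \"%s\"\n",
                    seen[i].pending, seen[i].topic);
            total += seen[i].pending;
            seen[i].pending = 0;
        }
    }
    return total;
}
```

In this sketch the first report of a topic is shown immediately and later reports only bump a counter, which is the behaviour the message attributes to the HNP when aggregation is enabled.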
            ORTE_OUTPUT_VERBOSE((1, mca_oob_tcp_output_handle,
                                 "%s oob:tcp:init rejecting interface %s",
                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), name));
            continue;
        }
        if (mca_oob_tcp_component.tcp_exclude != NULL &&
            strstr(mca_oob_tcp_component.tcp_exclude,name) != NULL) {
            ORTE_OUTPUT_VERBOSE((1, mca_oob_tcp_output_handle,
                                 "%s oob:tcp:init rejecting interface %s",
                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), name));
            continue;
        }

        dev = OBJ_NEW(mca_oob_tcp_device_t);
        dev->if_index = i;
        ORTE_OUTPUT_VERBOSE((1, mca_oob_tcp_output_handle,
                             "%s oob:tcp:init setting up interface %s",
                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), name));

        opal_ifindextoaddr(i, (struct sockaddr*) &dev->if_addr, sizeof(struct sockaddr_storage));
        if(opal_net_islocalhost((struct sockaddr*) &dev->if_addr)) {
            dev->if_local = true;
            found_local = true;
        } else {
            dev->if_local = false;
            found_nonlocal = true;
        }

        opal_list_append(&mca_oob_tcp_component.tcp_available_devices,
                         &dev->super);
    }
    if (found_local && found_nonlocal) {
        opal_list_item_t *item;
        for (item = opal_list_get_first(&mca_oob_tcp_component.tcp_available_devices) ;
             item != opal_list_get_end(&mca_oob_tcp_component.tcp_available_devices) ;
             item = opal_list_get_next(item)) {
            mca_oob_tcp_device_t *dev = (mca_oob_tcp_device_t*) item;
            if (dev->if_local) {
                item = opal_list_remove_item(&mca_oob_tcp_component.tcp_available_devices,
                                             item);
            }
        }
    }

    if (opal_list_get_size(&mca_oob_tcp_component.tcp_available_devices) == 0) {
        return NULL;
    }

    /* initialize data structures */
    opal_hash_table_init(&mca_oob_tcp_component.tcp_peers, 128);
    opal_hash_table_init(&mca_oob_tcp_component.tcp_peer_names, 128);

    opal_free_list_init(&mca_oob_tcp_component.tcp_peer_free,
        sizeof(mca_oob_tcp_peer_t),
        OBJ_CLASS(mca_oob_tcp_peer_t),
        8, /* initial number */
        mca_oob_tcp_component.tcp_peer_limit, /* maximum number */
        8); /* increment to grow by */

    opal_free_list_init(&mca_oob_tcp_component.tcp_msgs,
        sizeof(mca_oob_tcp_msg_t),
        OBJ_CLASS(mca_oob_tcp_msg_t),
        8, /* initial number */
        -1, /* maximum number */
        8); /* increment to grow by */

    /* initialize event library */
    memset(&mca_oob_tcp_component.tcp_recv_event, 0, sizeof(opal_event_t));
#if OPAL_WANT_IPV6
    memset(&mca_oob_tcp_component.tcp6_recv_event, 0, sizeof(opal_event_t));
#endif /* OPAL_WANT_IPV6 */
    return &mca_oob_tcp;
}
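
The interface selection performed in the loop above can be sketched on its own. This is a minimal, hypothetical sketch rather than the component's code: `iface_accepted()` is an invented helper and the interface names are made up. It mirrors the two `strstr()` tests, which treat the include and exclude settings as plain strings and therefore match by substring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): returns true if the
 * interface `name` passes the include/exclude tests. Either list may
 * be NULL, meaning "no restriction". */
static bool iface_accepted(const char *include, const char *exclude,
                           const char *name)
{
    assert(NULL != name);

    /* An include list rejects every name that does not occur in it. */
    if (NULL != include && NULL == strstr(include, name)) {
        return false;
    }
    /* An exclude list rejects every name that occurs in it. */
    if (NULL != exclude && NULL != strstr(exclude, name)) {
        return false;
    }
    return true;
}
```

Because the match is by substring, `iface_accepted("eth10", NULL, "eth1")` is true in this sketch even though only `eth10` was listed. A token-by-token comparison of the comma-separated list would avoid that, at the cost of a little more parsing.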


/*
 * Attempt to resolve peer name.
 */
int mca_oob_tcp_resolve(mca_oob_tcp_peer_t* peer)
{
    mca_oob_tcp_addr_t* addr = NULL;

    /* if the address is already cached - simply return it */
    OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
    opal_hash_table_get_value_uint64(&mca_oob_tcp_component.tcp_peer_names,
        orte_util_hash_name(&peer->peer_name), (void**)&addr);
    OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
    if(NULL != addr) {
        mca_oob_tcp_peer_resolved(peer, addr);
        return ORTE_SUCCESS;
    }

Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
This commit was SVN r14711.
2007-05-21 18:31:28 +00:00
    /* if we don't know it, then report unknown - don't try to go get it */
    return ORTE_ERR_ADDRESSEE_UNKNOWN;
}
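
The lookup-or-fail behaviour of `mca_oob_tcp_resolve()` can be illustrated with a self-contained sketch. Everything here is a stand-in: the fixed-size table, `cache_put()`, `cache_resolve()` and the `ERR_ADDRESSEE_UNKNOWN` constant are hypothetical and only mimic the shape of the hash-table lookup above. The point is that an unknown peer yields an immediate error instead of a blocking wait.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8
#define SUCCESS 0
#define ERR_ADDRESSEE_UNKNOWN (-1)  /* stand-in for ORTE_ERR_ADDRESSEE_UNKNOWN */
#define ERR_CACHE_FULL (-2)         /* hypothetical, for this sketch only */

typedef struct {
    uint64_t key;      /* hashed peer name */
    const char *addr;  /* contact info; NULL marks an empty slot */
} cache_entry_t;

static cache_entry_t cache[CACHE_SLOTS];

/* Record (or update) the contact info for a peer. */
static int cache_put(uint64_t key, const char *addr)
{
    assert(NULL != addr);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (NULL == cache[i].addr || cache[i].key == key) {
            cache[i].key = key;
            cache[i].addr = addr;
            return SUCCESS;
        }
    }
    return ERR_CACHE_FULL;
}

/* Consult the cache only; never wait for the entry to appear. */
static int cache_resolve(uint64_t key, const char **addr_out)
{
    assert(NULL != addr_out);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (NULL != cache[i].addr && cache[i].key == key) {
            *addr_out = cache[i].addr;
            return SUCCESS;
        }
    }
    *addr_out = NULL;
    return ERR_ADDRESSEE_UNKNOWN;
}
```

A caller that gets `ERR_ADDRESSEE_UNKNOWN` can decide for itself whether to retry later or give up, which is what lets a launcher stop waiting on a daemon that never started.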


/*
 * Ready the TCP module for connections. This includes creating
 * listen sockets for both IPv4 and IPv6 and (possibly) starting the
 * connection listen thread.
 */
int mca_oob_tcp_init(void)
{
    orte_jobid_t jobid;
    int rc;
    int randval = orte_process_info.num_procs;

    if (0 == randval) randval = 10;

    /* random delay to stagger connections back to seed */
#if defined(__WINDOWS__)
    if(1 == mca_oob_tcp_component.connect_sleep) {
        Sleep((ORTE_PROC_MY_NAME->vpid % randval % 1000) * 100);
    }
#elif defined(HAVE_USLEEP)
    if(1 == mca_oob_tcp_component.connect_sleep) {
        usleep((ORTE_PROC_MY_NAME->vpid % randval % 1000) * 1000);
    }
#endif

    /* get my jobid */
Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
    jobid = ORTE_PROC_MY_NAME->jobid;

    /* Fix up the listen type. This is the first call into the OOB in
       which the orte_process_info.hnp field is reliably set. The
       listen_mode should only be listen_thread for the HNP -- all
       others should use the traditional event library. */
    if (!orte_process_info.hnp) {
        mca_oob_tcp_component.tcp_listen_type = OOB_TCP_EVENT;
    }

    /* Create an IPv4 listen socket and either register with the event
       engine or give to the listen thread */
    rc = mca_oob_tcp_create_listen(&mca_oob_tcp_component.tcp_listen_sd,
                                   &mca_oob_tcp_component.tcp_listen_port,
                                   AF_INET);
    if (ORTE_SUCCESS != rc) {
        /* Don't complain if just not supported unless want connect debugging */
        if (EAFNOSUPPORT != opal_socket_errno ||
            mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_CONNECT) {
            orte_output(0,
                        "mca_oob_tcp_init: unable to create IPv4 listen socket: %s\n",
                        opal_strerror(rc));
        }
        mca_oob_tcp_component.tcp_listen_sd = -1;
        mca_oob_tcp_component.tcp_listen_port = 0;
    } else {
        if (OOB_TCP_LISTEN_THREAD == mca_oob_tcp_component.tcp_listen_type) {
            int idx = mca_oob_tcp_component.tcp_listen_thread_num_sockets++;
            mca_oob_tcp_component.tcp_listen_thread_sds[idx] =
                mca_oob_tcp_component.tcp_listen_sd;
        } else {
            opal_event_set(&mca_oob_tcp_component.tcp_recv_event,
                           mca_oob_tcp_component.tcp_listen_sd,
                           OPAL_EV_READ|OPAL_EV_PERSIST,
                           mca_oob_tcp_recv_handler,
                           0);
            opal_event_add(&mca_oob_tcp_component.tcp_recv_event, 0);
        }
    }

    /* Create an IPv6 listen socket (if IPv6 is enabled, of course)
       and either register with the event engine or give to the listen
       thread */
#if OPAL_WANT_IPV6
    rc = mca_oob_tcp_create_listen(&mca_oob_tcp_component.tcp6_listen_sd,
                                   &mca_oob_tcp_component.tcp6_listen_port,
                                   AF_INET6);
    if (ORTE_SUCCESS != rc) {
        /* Don't complain if just not supported unless want connect debugging */
        if (EAFNOSUPPORT != opal_socket_errno ||
            mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_CONNECT) {
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will behave similarly to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done for configure.m4 to search for an appropriate XML
library, link it in, and use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
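The duplicate-suppression behavior that the commit message above attributes to orte_show_help() can be illustrated with a small, standalone sketch. This is not the ORTE implementation: the names `help_table_t`, `help_submit`, `help_flush`, and `MAX_TOPICS` are hypothetical, and the real code runs on the HNP and is driven by a timer rather than by explicit calls. The sketch only shows the idea of displaying a message once, counting later copies, and reporting the count periodically.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 16   /* hypothetical capacity for this sketch */
#define TOPIC_LEN  64   /* hypothetical maximum key length, including NUL */

/* One entry per distinct help message seen so far. */
typedef struct {
    char topic[TOPIC_LEN];  /* key identifying the help message */
    int  new_copies;        /* duplicates counted since the last report */
} help_entry_t;

typedef struct {
    help_entry_t entries[MAX_TOPICS];
    int          count;
} help_table_t;

/* Returns 1 if the message should be displayed now (first sighting),
 * 0 if it was a duplicate and was only counted, -1 if the table is full. */
static int help_submit(help_table_t *t, const char *topic)
{
    assert(NULL != t && NULL != topic);
    for (int i = 0; i < t->count; ++i) {
        /* Compare only the stored prefix so over-long keys still match. */
        if (0 == strncmp(t->entries[i].topic, topic, TOPIC_LEN - 1)) {
            t->entries[i].new_copies++;
            return 0;
        }
    }
    if (t->count >= MAX_TOPICS) {
        return -1;
    }
    snprintf(t->entries[t->count].topic, TOPIC_LEN, "%s", topic);
    t->entries[t->count].new_copies = 0;
    t->count++;
    return 1;
}

/* Meant to be called periodically (the commit message mentions roughly
 * every 5 seconds). Reports and resets the duplicate counters, and
 * returns the total number of duplicates reported. */
static int help_flush(help_table_t *t, FILE *out)
{
    int total = 0;
    assert(NULL != t && NULL != out);
    for (int i = 0; i < t->count; ++i) {
        if (t->entries[i].new_copies > 0) {
            fprintf(out, "%d new copies of help message \"%s\"\n",
                    t->entries[i].new_copies, t->entries[i].topic);
            total += t->entries[i].new_copies;
            t->entries[i].new_copies = 0;
        }
    }
    return total;
}
```

A caller would display the rendered text only when `help_submit()` returns 1 and would invoke `help_flush()` from a periodic timer. Setting the aggregation parameter to 0, as described above, would correspond to bypassing the table entirely.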
|
|
|
orte_output(0,
|
2008-04-09 12:53:24 +00:00
|
|
|
"mca_oob_tcp_init: unable to create IPv6 listen socket: %s\n",
|
|
|
|
opal_strerror(rc));
|
|
|
|
}
|
|
|
|
mca_oob_tcp_component.tcp6_listen_sd = -1;
|
|
|
|
mca_oob_tcp_component.tcp6_listen_port = 0;
|
2008-04-01 12:39:02 +00:00
|
|
|
} else {
|
2008-04-09 12:53:24 +00:00
|
|
|
if (OOB_TCP_LISTEN_THREAD == mca_oob_tcp_component.tcp_listen_type) {
|
|
|
|
int idx = mca_oob_tcp_component.tcp_listen_thread_num_sockets++;
|
|
|
|
mca_oob_tcp_component.tcp_listen_thread_sds[idx] =
|
|
|
|
mca_oob_tcp_component.tcp6_listen_sd;
|
|
|
|
} else {
|
|
|
|
opal_event_set(&mca_oob_tcp_component.tcp6_recv_event,
|
|
|
|
mca_oob_tcp_component.tcp6_listen_sd,
|
|
|
|
OPAL_EV_READ|OPAL_EV_PERSIST,
|
|
|
|
mca_oob_tcp_recv_handler,
|
|
|
|
0);
|
|
|
|
opal_event_add(&mca_oob_tcp_component.tcp6_recv_event, 0);
|
|
|
|
}
|
2008-04-01 12:39:02 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2008-04-09 12:53:24 +00:00
|
|
|
if (mca_oob_tcp_component.tcp_listen_sd < 0
|
|
|
|
#if OPAL_WANT_IPV6
|
|
|
|
&& mca_oob_tcp_component.tcp6_listen_sd < 0
|
|
|
|
#endif
|
|
|
|
) {
|
|
|
|
return ORTE_ERR_NOT_SUPPORTED;
|
|
|
|
}
|
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* Finish up by either printing a nice message (event library) or
|
|
|
|
initializing the listen thread (listen thread) */
|
|
|
|
if (OOB_TCP_LISTEN_THREAD == mca_oob_tcp_component.tcp_listen_type) {
|
|
|
|
rc = mca_oob_tcp_create_listen_thread();
|
|
|
|
if (ORTE_SUCCESS != rc) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "Unable to create listen thread: %d\n", rc);
|
2008-04-01 12:39:02 +00:00
|
|
|
return rc;
|
2006-10-11 21:29:29 +00:00
|
|
|
}
|
2008-04-01 12:39:02 +00:00
|
|
|
|
Fix a number of OOB issues:
* Remove the connect() timeout code, as it had some nasty race conditions
when connections were established as the trigger was firing. A better
solution has been found for the cluster where this was needed, so just
removing it was easiest.
* When a fatal error (too many connection failures) occurs, set an error
on messages in the queue even if there isn't an active message. The
first message to any peer will be queued without being active (and
so will all subsequent messages until the connection is established),
and the orteds will hang until that first message completes. So if
an orted can never contact its peer, it will never exit and just sit
waiting for that message to complete.
* Cover an interesting RST condition in the connect code. A connection
can complete the three-way handshake, the connector can even send
some data, but the server side will drop the connection because it
can't move it from the half-connected to fully-connected state because
of space shortage in the listen backlog queue. This causes a RST to
be received the first time that recv() is called, which will be when waiting
for the remote side of the OOB ack. In this case, transition the
connection back into a CLOSED state and try to connect again.
* Add levels of debugging, rather than all or nothing, each building on
the previous level. 0 (default) is hard errors. 1 is connection
error debugging info. 2 is all connection info. 3 is more state
info. 4 includes all message info.
* Add some hopefully useful comments
This commit was SVN r14261.
2007-04-07 22:33:30 +00:00
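The cumulative debug levels described in the commit message above (each level includes everything from the levels below it) amount to a single `>=` comparison against the configured level, which is the pattern visible in the `tcp_debug >= OOB_TCP_DEBUG_INFO` checks in this file. The sketch below restates that idea in isolation. The enumerator names and the helpers `dbg_enabled` and `dbg_printf` are hypothetical; only the numeric meanings 0 through 4 come from the commit message.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical names; the numeric meanings follow the commit message. */
typedef enum {
    DBG_ERROR        = 0,  /* hard errors only (default) */
    DBG_CONNECT_FAIL = 1,  /* connection error debugging info */
    DBG_CONNECT      = 2,  /* all connection info */
    DBG_INFO         = 3,  /* more state info */
    DBG_ALL          = 4   /* all message info */
} dbg_level_t;

/* A message is emitted when the configured level is at least the
 * level the message was tagged with, so higher settings are supersets. */
static int dbg_enabled(int configured, dbg_level_t msg_level)
{
    return configured >= (int)msg_level;
}

/* Prints the message if its level is enabled. Returns the number of
 * characters written, or 0 when the message was filtered out. */
static int dbg_printf(FILE *out, int configured, dbg_level_t msg_level,
                      const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(NULL != out && NULL != fmt);
    if (!dbg_enabled(configured, msg_level)) {
        return 0;
    }
    va_start(ap, fmt);
    n = vfprintf(out, fmt, ap);
    va_end(ap);
    return n;
}
```

Because the check is a plain integer comparison, adding a new level later only requires picking its position in the ordering; existing call sites keep working unchanged.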
|
|
|
if (mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_INFO) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s accepting connections via listen thread",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
|
2007-03-23 13:29:18 +00:00
|
|
|
}
|
|
|
|
} else {
|
2007-04-07 22:33:30 +00:00
|
|
|
if (mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_INFO) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s accepting connections via event library",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
|
2007-03-23 13:29:18 +00:00
|
|
|
}
|
2006-10-11 21:29:29 +00:00
|
|
|
}
|
|
|
|
|
Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
This commit was SVN r14711.
2007-05-21 18:31:28 +00:00
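The "arith" operation described in the commit message above is a read-modify-write that must appear atomic to other users of the same location. The following is only an in-process analogy using C11 atomics, not the GPR API: `registry_arith`, `arith_op_t`, and the use of an `atomic_int` as the "registry location" are all assumptions made for illustration. It shows the get, modify, put sequence collapsed into one step via a compare-and-swap retry loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical operation codes for this sketch. */
typedef enum { ARITH_ADD, ARITH_SUB, ARITH_MUL, ARITH_DIV } arith_op_t;

/* Atomically replaces *loc with (*loc op operand).
 * Returns 0 on success (storing the new value in *result if non-NULL),
 * or -1 for an invalid request such as division by zero.
 * Integer overflow is not handled in this sketch. */
static int registry_arith(atomic_int *loc, arith_op_t op, int operand,
                          int *result)
{
    int old_val;
    int new_val;

    assert(NULL != loc);
    if (ARITH_DIV == op && 0 == operand) {
        return -1;
    }

    old_val = atomic_load(loc);              /* the "get" */
    do {
        switch (op) {                        /* the "modify" */
        case ARITH_ADD: new_val = old_val + operand; break;
        case ARITH_SUB: new_val = old_val - operand; break;
        case ARITH_MUL: new_val = old_val * operand; break;
        case ARITH_DIV: new_val = old_val / operand; break;
        default:        return -1;
        }
        /* The "put": succeeds only if nobody changed *loc meanwhile.
         * On failure old_val is refreshed and the loop retries. */
    } while (!atomic_compare_exchange_weak(loc, &old_val, new_val));

    if (NULL != result) {
        *result = new_val;
    }
    return 0;
}
```

In a distributed registry the same guarantee would have to be provided by the registry server serializing requests; the retry loop here is just the shared-memory counterpart of that serialization.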
|
|
|
return ORTE_SUCCESS;
|
|
|
|
}
|
2004-09-01 23:07:40 +00:00
|
|
|
|
2007-05-21 18:31:28 +00:00
|
|
|
|
2004-08-02 21:24:00 +00:00
|
|
|
/*
|
|
|
|
* Module cleanup.
|
|
|
|
*/
|
2004-08-19 19:34:37 +00:00
|
|
|
int mca_oob_tcp_fini(void)
|
2004-07-01 14:49:54 +00:00
|
|
|
{
|
2005-07-03 16:22:16 +00:00
|
|
|
opal_list_item_t *item;
|
2008-04-01 12:39:02 +00:00
|
|
|
void *data;
|
2007-07-25 05:55:14 +00:00
|
|
|
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
|
2005-07-03 23:09:55 +00:00
|
|
|
opal_event_disable(); /* disable event processing */
|
2004-09-30 15:09:29 +00:00
|
|
|
|
2008-04-01 12:39:02 +00:00
|
|
|
/* shut down the listening system */
|
|
|
|
if (OOB_TCP_LISTEN_THREAD == mca_oob_tcp_component.tcp_listen_type) {
|
|
|
|
mca_oob_tcp_component.tcp_shutdown = true;
|
|
|
|
opal_thread_join(&mca_oob_tcp_component.tcp_listen_thread, &data);
|
|
|
|
opal_event_del(&mca_oob_tcp_component.tcp_listen_thread_event);
|
|
|
|
} else {
|
|
|
|
if (mca_oob_tcp_component.tcp_listen_sd >= 0) {
|
2006-09-14 21:29:51 +00:00
|
|
|
opal_event_del(&mca_oob_tcp_component.tcp_recv_event);
|
|
|
|
}
|
2008-04-02 10:53:48 +00:00
|
|
|
#if OPAL_WANT_IPV6
|
2008-04-01 12:39:02 +00:00
|
|
|
if (mca_oob_tcp_component.tcp6_listen_sd >= 0) {
|
|
|
|
opal_event_del(&mca_oob_tcp_component.tcp6_recv_event);
|
|
|
|
}
|
2008-04-02 10:53:48 +00:00
|
|
|
#endif
|
2008-04-01 12:39:02 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* close listen socket */
|
|
|
|
if (mca_oob_tcp_component.tcp_listen_sd >= 0) {
|
|
|
|
CLOSE_THE_SOCKET(mca_oob_tcp_component.tcp_listen_sd);
|
2004-09-30 15:09:29 +00:00
|
|
|
mca_oob_tcp_component.tcp_listen_sd = -1;
|
2004-08-05 19:37:48 +00:00
|
|
|
}
|
2008-04-01 12:39:02 +00:00
|
|
|
#if OPAL_WANT_IPV6
|
|
|
|
if (mca_oob_tcp_component.tcp6_listen_sd >= 0) {
|
|
|
|
CLOSE_THE_SOCKET(mca_oob_tcp_component.tcp6_listen_sd);
|
|
|
|
mca_oob_tcp_component.tcp6_listen_sd = -1;
|
|
|
|
}
|
|
|
|
#endif /* OPAL_WANT_IPV6 */
|
2004-08-12 13:29:37 +00:00
|
|
|
|
|
|
|
/* cleanup all peers */
|
2005-07-03 16:22:16 +00:00
|
|
|
for(item = opal_list_remove_first(&mca_oob_tcp_component.tcp_peer_list);
|
2004-09-30 15:09:29 +00:00
|
|
|
item != NULL;
|
2005-07-03 16:22:16 +00:00
|
|
|
item = opal_list_remove_first(&mca_oob_tcp_component.tcp_peer_list)) {
|
2004-09-30 15:09:29 +00:00
|
|
|
mca_oob_tcp_peer_t* peer = (mca_oob_tcp_peer_t*)item;
|
|
|
|
MCA_OOB_TCP_PEER_RETURN(peer);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* delete any pending events */
|
2007-06-14 22:33:09 +00:00
|
|
|
for( item = opal_list_get_first(&mca_oob_tcp_component.tcp_events);
|
|
|
|
item != opal_list_get_end(&mca_oob_tcp_component.tcp_events);
|
|
|
|
item = opal_list_get_first(&mca_oob_tcp_component.tcp_events) ) {
|
2004-09-30 15:09:29 +00:00
|
|
|
mca_oob_tcp_event_t* event = (mca_oob_tcp_event_t*)item;
|
2005-07-03 23:09:55 +00:00
|
|
|
opal_event_del(&event->event);
|
2004-09-30 15:09:29 +00:00
|
|
|
OBJ_RELEASE(event);
|
2004-08-06 17:23:37 +00:00
|
|
|
}
|
2004-09-30 15:09:29 +00:00
|
|
|
|
2005-07-03 23:09:55 +00:00
|
|
|
opal_event_enable();
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
2006-02-12 01:33:29 +00:00
|
|
|
return ORTE_SUCCESS;
|
2004-07-01 14:49:54 +00:00
|
|
|
}
|
2004-07-12 22:46:57 +00:00
|
|
|
|
2004-09-30 15:09:29 +00:00
|
|
|
|
2004-08-09 23:07:53 +00:00
|
|
|
/*
|
2004-08-05 15:30:36 +00:00
|
|
|
* Compare two process names for equality.
|
|
|
|
*
|
|
|
|
* @param n1 Process name 1.
|
|
|
|
* @param n2 Process name 2.
|
|
|
|
* @return (-1 for n1<n2, 0 for equality, 1 for n1>n2)
|
|
|
|
*
|
|
|
|
* Note that the definition of < or > is somewhat arbitrary -
|
|
|
|
* just needs to be consistently applied to maintain an ordering
|
|
|
|
* when process names are used as indices.
|
Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
|
|
|
*
|
|
|
|
* Currently, this function is ONLY used in one place - in oob_tcp_send.c to
|
|
|
|
* determine if the recipient of the message-to-be-sent is ourselves. Hence,
|
|
|
|
* this comparison is okay to be LITERAL and can/should use the
|
|
|
|
* orte_util_compare_name_fields function
|
2004-08-05 15:30:36 +00:00
|
|
|
*/
|
|
|
|
|
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
int mca_oob_tcp_process_name_compare(const orte_process_name_t* n1, const orte_process_name_t* n2)
|
2004-08-05 15:30:36 +00:00
|
|
|
{
|
2008-02-28 01:57:57 +00:00
|
|
|
return orte_util_compare_name_fields(ORTE_NS_CMP_ALL, n1, n2);
|
2004-08-05 15:30:36 +00:00
|
|
|
}
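The comment above only requires that the ordering be applied consistently. A minimal sketch of such a three-way comparison follows, assuming a simplified two-field name; `demo_name_t` and `demo_name_compare` are hypothetical stand-ins for illustration, not the real `orte_process_name_t` or `orte_util_compare_name_fields`.

```c
#include <stdint.h>

/* Hypothetical stand-in for a process name (assumption: two integer fields). */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_name_t;

/*
 * Three-way compare: order by jobid first, then by vpid.
 * Returns -1 if a < b, 0 if equal, 1 if a > b.
 */
static int demo_name_compare(const demo_name_t *a, const demo_name_t *b)
{
    if (a->jobid != b->jobid) {
        return (a->jobid < b->jobid) ? -1 : 1;
    }
    if (a->vpid != b->vpid) {
        return (a->vpid < b->vpid) ? -1 : 1;
    }
    return 0;
}
```

Comparing field by field, rather than subtracting, avoids overflow on unsigned values and keeps the result within the documented -1/0/1 range.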
|
|
|
|
|
2004-08-16 19:39:54 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Return local process address as a URI string.
|
|
|
|
*/
|
|
|
|
|
|
|
|
char* mca_oob_tcp_get_addr(void)
|
|
|
|
{
|
2007-09-11 11:28:43 +00:00
|
|
|
char *contact_info = (char *) malloc(opal_list_get_size(&mca_oob_tcp_component.tcp_available_devices) * 128 + 1);
|
2004-08-16 19:39:54 +00:00
|
|
|
char *ptr = contact_info;
|
2007-07-20 01:34:02 +00:00
|
|
|
opal_list_item_t *item;
|
2004-08-16 19:39:54 +00:00
|
|
|
if (NULL == contact_info) return NULL;
*ptr = 0;
|
|
|
|
|
2007-07-20 01:34:02 +00:00
|
|
|
for (item = opal_list_get_first(&mca_oob_tcp_component.tcp_available_devices) ;
|
|
|
|
item != opal_list_get_end(&mca_oob_tcp_component.tcp_available_devices) ;
|
|
|
|
item = opal_list_get_next(item)) {
|
|
|
|
mca_oob_tcp_device_t *dev = (mca_oob_tcp_device_t*) item;
|
|
|
|
|
|
|
|
if (ptr != contact_info) {
|
2004-08-16 19:39:54 +00:00
|
|
|
ptr += sprintf(ptr, ";");
|
|
|
|
}
|
2007-04-25 19:08:07 +00:00
|
|
|
|
2008-04-16 09:22:00 +00:00
|
|
|
if (dev->if_addr.ss_family == AF_INET &&
|
|
|
|
4 != mca_oob_tcp_component.disable_family) {
|
2007-07-20 01:34:02 +00:00
|
|
|
ptr += sprintf(ptr, "tcp://%s:%d", opal_net_get_hostname((struct sockaddr*) &dev->if_addr),
|
2007-04-25 01:55:40 +00:00
|
|
|
ntohs(mca_oob_tcp_component.tcp_listen_port));
|
2007-10-26 16:36:51 +00:00
|
|
|
}
|
|
|
|
#if OPAL_WANT_IPV6
|
2008-04-16 09:22:00 +00:00
|
|
|
if (dev->if_addr.ss_family == AF_INET6 &&
|
|
|
|
6 != mca_oob_tcp_component.disable_family) {
|
2007-07-20 01:34:02 +00:00
|
|
|
ptr += sprintf(ptr, "tcp6://%s:%d", opal_net_get_hostname((struct sockaddr*) &dev->if_addr),
|
2007-04-25 01:55:40 +00:00
|
|
|
ntohs(mca_oob_tcp_component.tcp6_listen_port));
|
|
|
|
}
|
2007-10-26 16:36:51 +00:00
|
|
|
#endif /* OPAL_WANT_IPV6 */
|
2004-08-16 19:39:54 +00:00
|
|
|
}
|
|
|
|
return contact_info;
|
|
|
|
}
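The function above builds a `;`-separated list of `tcp://host:port` entries with unbounded `sprintf` calls into a buffer sized at 128 bytes per device. A bounded variant might look like the sketch below; `demo_build_contact` and its parameters are hypothetical, and it takes plain host strings rather than the component's device list.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Write "tcp://h1:port;tcp://h2:port;..." into buf.
 * Returns 0 on success, -1 if the arguments are invalid or the
 * result would not fit in len bytes (buf contents are then partial).
 */
static int demo_build_contact(const char *const hosts[], size_t n,
                              unsigned port, char *buf, size_t len)
{
    size_t used = 0;
    size_t i;

    if (NULL == buf || 0 == len) {
        return -1;
    }
    buf[0] = '\0';

    for (i = 0; i < n; ++i) {
        /* Prefix every entry except the first with the separator. */
        int w = snprintf(buf + used, len - used, "%stcp://%s:%u",
                         (0 == i) ? "" : ";", hosts[i], port);
        if (w < 0 || (size_t) w >= len - used) {
            return -1;  /* encoding error or output truncated */
        }
        used += (size_t) w;
    }
    return 0;
}
```

Emitting the separator together with the entry means a skipped device cannot leave a stray `;` in the string.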
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Parse a URI string into an IP address and port number.
|
|
|
|
*/
|
|
|
|
|
2007-07-20 01:34:02 +00:00
|
|
|
int
|
|
|
|
mca_oob_tcp_parse_uri(const char* uri, struct sockaddr* inaddr)
|
2004-08-16 19:39:54 +00:00
|
|
|
{
|
2007-07-20 01:34:02 +00:00
|
|
|
char *dup_uri = strdup(uri);
|
|
|
|
char *host, *port;
|
|
|
|
uint16_t af_family = AF_UNSPEC;
|
|
|
|
int ret;
|
2007-04-25 11:51:18 +00:00
|
|
|
#if OPAL_WANT_IPV6
|
2007-07-20 01:34:02 +00:00
|
|
|
struct addrinfo hints, *res;
|
2007-04-25 11:51:18 +00:00
|
|
|
#endif
|
2007-07-20 01:34:02 +00:00
|
|
|
|
|
|
|
if (NULL == dup_uri) return ORTE_ERR_OUT_OF_RESOURCE;
|
|
|
|
|
|
|
|
if (strncmp(dup_uri, "tcp6://", strlen("tcp6://")) == 0) {
|
2007-04-25 01:55:40 +00:00
|
|
|
#if OPAL_WANT_IPV6
|
|
|
|
af_family = AF_INET6;
|
2007-07-20 01:34:02 +00:00
|
|
|
host = dup_uri + strlen("tcp6://");
|
|
|
|
#else
|
|
|
|
ret = ORTE_ERR_NOT_SUPPORTED;
|
|
|
|
goto cleanup;
|
|
|
|
#endif
|
|
|
|
} else if (strncmp(dup_uri, "tcp://", strlen("tcp://")) == 0) {
|
|
|
|
af_family = AF_INET;
|
|
|
|
host = dup_uri + strlen("tcp://");
|
2007-04-25 01:55:40 +00:00
|
|
|
} else {
|
2007-07-20 01:34:02 +00:00
|
|
|
ret = ORTE_ERR_BAD_PARAM;
|
|
|
|
goto cleanup;
|
2004-08-16 19:39:54 +00:00
|
|
|
}
|
2007-04-25 01:55:40 +00:00
|
|
|
|
2007-07-20 01:34:02 +00:00
|
|
|
/* mutate the host string so that the port number is not in the
|
|
|
|
same string as the host. */
|
|
|
|
port = strrchr(host, ':');
|
|
|
|
if (NULL == port) {
|
|
|
|
ret = ORTE_ERR_BAD_PARAM;
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
*port = '\0';
|
|
|
|
port++;
|
|
|
|
|
|
|
|
switch (af_family) {
|
|
|
|
case AF_INET:
|
|
|
|
memset(inaddr, 0, sizeof(struct sockaddr_in));
|
|
|
|
break;
|
|
|
|
case AF_INET6:
|
|
|
|
memset(inaddr, 0, sizeof(struct sockaddr_in6));
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
ret = ORTE_ERR_BAD_PARAM;
|
|
|
|
goto cleanup;
|
2004-08-16 19:39:54 +00:00
|
|
|
}
|
|
|
|
|
2007-04-25 01:55:40 +00:00
|
|
|
#if OPAL_WANT_IPV6
|
2007-07-20 01:34:02 +00:00
|
|
|
memset(&hints, 0, sizeof(hints));
|
|
|
|
hints.ai_family = af_family;
|
|
|
|
hints.ai_socktype = SOCK_STREAM;
|
|
|
|
ret = getaddrinfo (host, NULL, &hints, &res);
|
|
|
|
|
|
|
|
if (ret) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already make the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output (0, "oob_tcp_parse_uri: Could not resolve %s. [Error: %s]\n",
|
2007-07-20 01:34:02 +00:00
|
|
|
host, gai_strerror (ret));
|
|
|
|
ret = ORTE_ERR_BAD_PARAM;
|
|
|
|
goto cleanup;
|
2007-04-25 01:55:40 +00:00
|
|
|
}
|
2007-07-20 01:34:02 +00:00
|
|
|
|
|
|
|
if (res->ai_family != af_family) {
|
2007-04-25 01:55:40 +00:00
|
|
|
/* should never happen */
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output (0, "oob_tcp_parse_uri: getaddrinfo returned wrong af_family for %s",
|
2007-07-20 01:34:02 +00:00
|
|
|
host);
|
|
|
|
freeaddrinfo(res);
ret = ORTE_ERROR;
|
|
|
|
goto cleanup;
|
2007-04-25 01:55:40 +00:00
|
|
|
}
|
2007-07-20 01:34:02 +00:00
|
|
|
|
|
|
|
memcpy(inaddr, res->ai_addr, res->ai_addrlen);
|
|
|
|
freeaddrinfo(res);
|
2007-04-25 01:55:40 +00:00
|
|
|
#else
|
2007-07-24 17:01:39 +00:00
|
|
|
if (AF_INET == af_family) {
|
|
|
|
struct sockaddr_in *in = (struct sockaddr_in*) inaddr;
|
|
|
|
in->sin_family = af_family;
|
|
|
|
in->sin_addr.s_addr = inet_addr(host);
|
|
|
|
if (in->sin_addr.s_addr == INADDR_NONE || in->sin_addr.s_addr == INADDR_ANY) {
|
|
|
|
ret = ORTE_ERR_BAD_PARAM;
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
} else {
|
2007-07-20 01:34:02 +00:00
|
|
|
ret = ORTE_ERR_BAD_PARAM;
|
|
|
|
goto cleanup;
|
2004-08-16 19:39:54 +00:00
|
|
|
}
|
2007-04-25 01:55:40 +00:00
|
|
|
#endif
|
2007-07-20 01:34:02 +00:00
|
|
|
|
|
|
|
switch (af_family) {
|
|
|
|
case AF_INET:
|
|
|
|
((struct sockaddr_in*) inaddr)->sin_port = htons(atoi(port));
|
|
|
|
break;
|
|
|
|
case AF_INET6:
|
|
|
|
((struct sockaddr_in6*) inaddr)->sin6_port = htons(atoi(port));
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
ret = ORTE_ERR_BAD_PARAM;
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = ORTE_SUCCESS;
|
|
|
|
|
|
|
|
cleanup:
|
|
|
|
if (NULL != dup_uri) free(dup_uri);
|
|
|
|
return ret;
|
2004-08-16 19:39:54 +00:00
|
|
|
}
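The parsing steps above (match the scheme prefix, split host from port at the last `:`, then convert the port) can be sketched without any name resolution. `demo_split_uri` is a hypothetical helper for illustration; unlike the function above it copies the host out instead of mutating a duplicate, and it validates the port with `strtol` rather than `atoi`.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "tcp://host:port" or "tcp6://host:port".
 * Returns 4 or 6 for the family implied by the scheme, -1 on any error.
 */
static int demo_split_uri(const char *uri, char *host, size_t host_len,
                          unsigned short *port)
{
    const char *rest;
    const char *colon;
    char *end;
    long value;
    size_t n;
    int family;

    if (strncmp(uri, "tcp6://", 7) == 0) {
        family = 6;
        rest = uri + 7;
    } else if (strncmp(uri, "tcp://", 6) == 0) {
        family = 4;
        rest = uri + 6;
    } else {
        return -1;
    }

    /* The last ':' separates the port; IPv6 hosts contain earlier colons. */
    colon = strrchr(rest, ':');
    if (NULL == colon) {
        return -1;
    }
    n = (size_t) (colon - rest);
    if (0 == n || n >= host_len) {
        return -1;
    }
    memcpy(host, rest, n);
    host[n] = '\0';

    /* Reject empty, non-numeric, or out-of-range ports. */
    value = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || *end != '\0' || value < 1 || value > 65535) {
        return -1;
    }
    *port = (unsigned short) value;
    return family;
}
```

Searching from the right with `strrchr` is what lets an address such as `fe80::1` survive intact as the host part.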
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
Not as bad as this all may look. Tim and I made a significant change to the way we handle the startup of the oob, the seed, etc. We have made it backwards-compatible so that mpirun2 and singleton operations remain working. We had to adjust the name server and gpr as well, plus the process_info structure.
This also includes a checkpoint update to openmpi.c and ompid.c. I have re-enabled the ompid compile.
This latter raises an important point. The trunk compiles the programs like ompid just fine under Linux. It also does just fine for OSX under the dynamic libraries. However, we are seeing errors when compiling under OSX for the static case - the linker seems to have trouble resolving some variable names, even though linker diagnostics show the variables as being defined. Thus, a warning to Mac users that you may have to locally turn things off if you are trying to do static compiles. We ask, however, that you don't commit those changes that turn things off for everyone else - instead, let's try to figure out why the static compile is having a problem, and let everyone else continue to work.
Thanks
Ralph
This commit was SVN r2534.
2004-09-08 03:59:06 +00:00
|
|
|
* Set up the address in the cache. Note that this could be called multiple
|
|
|
|
* times if a given destination exports multiple addresses.
|
2004-08-16 19:39:54 +00:00
|
|
|
*/
|
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
int mca_oob_tcp_set_addr(const orte_process_name_t* name, const char* uri)
|
2004-08-16 19:39:54 +00:00
|
|
|
{
|
2007-07-20 01:34:02 +00:00
|
|
|
struct sockaddr_storage inaddr;
|
2008-03-05 22:44:35 +00:00
|
|
|
mca_oob_tcp_addr_t* addr = NULL;
|
|
|
|
mca_oob_tcp_peer_t* peer = NULL;
|
2004-08-16 19:39:54 +00:00
|
|
|
int rc;
|
2007-07-20 01:34:02 +00:00
|
|
|
if((rc = mca_oob_tcp_parse_uri(uri, (struct sockaddr*) &inaddr)) != ORTE_SUCCESS) {
|
2004-08-16 19:39:54 +00:00
|
|
|
return rc;
|
2007-07-20 01:34:02 +00:00
|
|
|
}
|
2004-08-16 19:39:54 +00:00
|
|
|
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
|
2008-03-05 22:44:35 +00:00
|
|
|
opal_hash_table_get_value_uint64(&mca_oob_tcp_component.tcp_peer_names,
|
|
|
|
orte_util_hash_name(name), (void**)&addr);
|
2004-09-01 23:07:40 +00:00
|
|
|
if(NULL == addr) {
|
|
|
|
addr = OBJ_NEW(mca_oob_tcp_addr_t);
|
2004-09-08 03:59:06 +00:00
|
|
|
addr->addr_name = *name;
|
2008-03-05 22:44:35 +00:00
|
|
|
opal_hash_table_set_value_uint64(&mca_oob_tcp_component.tcp_peer_names,
|
|
|
|
orte_util_hash_name(&addr->addr_name), addr);
|
2004-08-16 19:39:54 +00:00
|
|
|
}
|
2007-07-20 01:34:02 +00:00
|
|
|
rc = mca_oob_tcp_addr_insert(addr, (struct sockaddr*) &inaddr);
|
2008-03-05 22:44:35 +00:00
|
|
|
opal_hash_table_get_value_uint64(&mca_oob_tcp_component.tcp_peers,
|
|
|
|
orte_util_hash_name(&addr->addr_name),
|
|
|
|
(void**)&peer);
|
2005-03-14 20:57:21 +00:00
|
|
|
if(NULL != peer) {
|
|
|
|
mca_oob_tcp_peer_resolved(peer, addr);
|
|
|
|
}
|
2005-07-03 22:45:48 +00:00
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
2004-09-01 23:07:40 +00:00
|
|
|
return rc;
|
2004-08-16 19:39:54 +00:00
|
|
|
}
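The function above follows a get-or-create pattern: look up the cache entry for a hashed name, create it on first use, then append the new address, so repeated calls for one destination accumulate addresses. A rough sketch of that pattern, assuming a small fixed-size table in place of the OPAL hash table, is given below; all `demo_*` names and the size limits are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define DEMO_MAX_PEERS 8
#define DEMO_MAX_ADDRS 4

typedef struct {
    uint64_t key;                    /* hashed process name */
    int      used;                   /* nonzero once the slot is taken */
    size_t   naddrs;
    uint32_t addrs[DEMO_MAX_ADDRS];  /* IPv4 addresses, host byte order */
} demo_entry_t;

typedef struct {
    demo_entry_t entries[DEMO_MAX_PEERS];
} demo_cache_t;

/* Find the entry for key, creating it on first use; NULL if the table is full. */
static demo_entry_t *demo_cache_get_or_create(demo_cache_t *c, uint64_t key)
{
    demo_entry_t *free_slot = NULL;
    size_t i;

    for (i = 0; i < DEMO_MAX_PEERS; ++i) {
        if (c->entries[i].used && c->entries[i].key == key) {
            return &c->entries[i];
        }
        if (!c->entries[i].used && NULL == free_slot) {
            free_slot = &c->entries[i];
        }
    }
    if (NULL != free_slot) {
        free_slot->used = 1;
        free_slot->key = key;
        free_slot->naddrs = 0;
    }
    return free_slot;
}

/* Append one address for key; may be called several times per peer. */
static int demo_cache_add_addr(demo_cache_t *c, uint64_t key, uint32_t addr)
{
    demo_entry_t *e = demo_cache_get_or_create(c, key);

    if (NULL == e || e->naddrs >= DEMO_MAX_ADDRS) {
        return -1;
    }
    e->addrs[e->naddrs++] = addr;
    return 0;
}
```

The sketch omits locking; the function above performs its lookup and insert while holding `tcp_lock`.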
|
|
|
|
|
2007-03-16 23:11:45 +00:00
|
|
|
|
|
|
|
/* Dummy function for when we are not using FT. */
|
|
|
|
#if OPAL_ENABLE_FT == 0
|
|
|
|
int mca_oob_tcp_ft_event(int state) {
|
|
|
|
return ORTE_SUCCESS;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
int mca_oob_tcp_ft_event(int state) {
|
|
|
|
int exit_status = ORTE_SUCCESS;
|
2008-04-23 00:17:12 +00:00
|
|
|
opal_list_item_t *item;
|
2007-03-16 23:11:45 +00:00
|
|
|
|
|
|
|
if(OPAL_CRS_CHECKPOINT == state) {
|
|
|
|
/*
|
|
|
|
* Disable event processing while we are working
|
|
|
|
*/
|
|
|
|
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_lock);
|
|
|
|
opal_event_disable();
|
|
|
|
}
|
|
|
|
else if(OPAL_CRS_CONTINUE == state) {
|
|
|
|
/*
|
|
|
|
* Resume event processing
|
|
|
|
*/
|
|
|
|
opal_event_enable();
|
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
|
|
|
}
|
|
|
|
else if(OPAL_CRS_RESTART == state) {
|
2008-04-23 00:17:12 +00:00
|
|
|
/*
|
|
|
|
* Clean out cached connection information
|
|
|
|
* Select pieces of finalize/init
|
|
|
|
*/
|
|
|
|
for(item = opal_list_remove_first(&mca_oob_tcp_component.tcp_peer_list);
|
|
|
|
item != NULL;
|
|
|
|
item = opal_list_remove_first(&mca_oob_tcp_component.tcp_peer_list)) {
|
|
|
|
mca_oob_tcp_peer_t* peer = (mca_oob_tcp_peer_t*)item;
|
|
|
|
/* JJH: Use the below command for debugging restarts with invalid sockets
|
|
|
|
* mca_oob_tcp_peer_dump(peer, "RESTART CLEAN")
|
|
|
|
*/
|
|
|
|
MCA_OOB_TCP_PEER_RETURN(peer);
|
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peer_free);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peer_names);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peers);
|
|
|
|
OBJ_DESTRUCT(&mca_oob_tcp_component.tcp_peer_list);
|
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peer_list, opal_list_t);
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peers, opal_hash_table_t);
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peer_names, opal_hash_table_t);
|
|
|
|
OBJ_CONSTRUCT(&mca_oob_tcp_component.tcp_peer_free, opal_free_list_t);
|
|
|
|
|
2007-04-25 19:51:52 +00:00
|
|
|
/*
|
|
|
|
* Resume event processing
|
|
|
|
*/
|
|
|
|
opal_event_enable();
|
2007-03-16 23:11:45 +00:00
|
|
|
OPAL_THREAD_UNLOCK(&mca_oob_tcp_component.tcp_lock);
|
|
|
|
}
|
|
|
|
else if(OPAL_CRS_TERM == state ) {
|
|
|
|
;
|
|
|
|
}
|
|
|
|
else {
|
|
|
|
;
|
|
|
|
}
|
|
|
|
|
|
|
|
return exit_status;
|
|
|
|
}
|
|
|
|
#endif
|
2007-07-20 01:34:02 +00:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
int
|
|
|
|
mca_oob_tcp_get_new_name(orte_process_name_t* name)
|
|
|
|
{
|
|
|
|
mca_oob_tcp_peer_t* peer = mca_oob_tcp_peer_lookup(ORTE_PROC_MY_HNP);
|
|
|
|
mca_oob_tcp_msg_t* msg;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if(NULL == peer)
|
|
|
|
return ORTE_ERR_UNREACH;
|
|
|
|
|
|
|
|
MCA_OOB_TCP_MSG_ALLOC(msg, rc);
|
|
|
|
if(NULL == msg) {
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
if(mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_ALL) {
|
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s-%s mca_oob_tcp_get_new_name: starting\n",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
2007-07-20 02:34:29 +00:00
|
|
|
ORTE_NAME_PRINT(&(peer->peer_name)));
|
2007-07-20 01:34:02 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* fill in the ping header; it is converted to network byte order below */
|
|
|
|
msg->msg_hdr.msg_type = MCA_OOB_TCP_PING;
|
|
|
|
msg->msg_hdr.msg_size = 0;
|
|
|
|
msg->msg_hdr.msg_tag = 0;
|
|
|
|
msg->msg_hdr.msg_src = *ORTE_NAME_INVALID;
|
|
|
|
msg->msg_hdr.msg_dst = *ORTE_PROC_MY_HNP;
|
|
|
|
|
|
|
|
MCA_OOB_TCP_HDR_HTON(&msg->msg_hdr);
|
|
|
|
rc = mca_oob_tcp_peer_send(peer, msg);
|
|
|
|
if(rc != ORTE_SUCCESS) {
|
|
|
|
if (rc != ORTE_ERR_ADDRESSEE_UNKNOWN) {
|
|
|
|
MCA_OOB_TCP_MSG_RETURN(msg);
|
|
|
|
}
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
mca_oob_tcp_msg_wait(msg, &rc);
|
|
|
|
|
|
|
|
if (ORTE_SUCCESS == rc) {
|
2008-02-28 01:57:57 +00:00
|
|
|
*name = *ORTE_PROC_MY_NAME;
|
2007-07-20 01:34:02 +00:00
|
|
|
if(mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_ALL) {
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (the search-and-replace has already been done
across the existing ORTE / OMPI code):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
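The duplicate detection described above can be sketched in a few lines of standalone C. This is only an illustration of the idea, not the code in this commit: the names hnp_receive_help() and hnp_flush_duplicates(), the fixed-size table, and the use of the (file, topic) pair as the duplicate key are all assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 32   /* hypothetical table size */

/* One entry per distinct (file, topic) pair seen so far. */
struct help_entry {
    char file[64];
    char topic[64];
    int  pending;       /* duplicates counted since the last summary */
};

static struct help_entry seen[MAX_TOPICS];
static int n_seen = 0;

/* Hypothetical HNP-side receive hook.
 * Returns 1 if the message was displayed, 0 if it was only counted
 * as a duplicate, -1 if the table is full. Names longer than 63
 * characters are truncated in this sketch. */
int hnp_receive_help(const char *file, const char *topic, const char *text)
{
    for (int i = 0; i < n_seen; i++) {
        if (0 == strcmp(seen[i].file, file) &&
            0 == strcmp(seen[i].topic, topic)) {
            seen[i].pending++;
            return 0;
        }
    }
    if (MAX_TOPICS == n_seen) {
        return -1;
    }
    snprintf(seen[n_seen].file, sizeof(seen[n_seen].file), "%s", file);
    snprintf(seen[n_seen].topic, sizeof(seen[n_seen].topic), "%s", topic);
    seen[n_seen].pending = 0;
    n_seen++;
    fputs(text, stderr);            /* first copy is shown immediately */
    return 1;
}

/* Would be driven by a periodic timer (roughly every 5 seconds in the
 * description above). Returns the number of duplicates reported. */
int hnp_flush_duplicates(void)
{
    int total = 0;
    for (int i = 0; i < n_seen; i++) {
        if (seen[i].pending > 0) {
            fprintf(stderr, "%d more copies of help message %s / %s\n",
                    seen[i].pending, seen[i].file, seen[i].topic);
            total += seen[i].pending;
            seen[i].pending = 0;
        }
    }
    return total;
}
```

With this shape, N processes reporting the same error produce one full message plus a periodic count, rather than N copies.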
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 behaves much like opal_output'ing to
stream 0 did, so leaving a hard-coded "0" as the first argument is
safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
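The fallback described in the last note can be pictured as a variadic wrapper that renders the message once and then picks a route. The sketch below is standalone C under stated assumptions: job_output(), have_hnp, and last_route are hypothetical names, and the "send to HNP" branch only prints a marker instead of performing a real RML send.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag: true when a route to the HNP is available. */
static bool have_hnp = false;

/* Recorded only so the behavior can be observed in this sketch. */
enum route { ROUTE_LOCAL, ROUTE_HNP };
static enum route last_route = ROUTE_LOCAL;

/* Hypothetical job-level output call. Renders the text, then either
 * forwards it (stand-in for a send to the HNP) or writes it locally,
 * which is the "trivial wrapper" case. Returns the rendered length,
 * or a negative value on a formatting error. */
int job_output(const char *fmt, ...)
{
    char buf[256];
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    if (n < 0) {
        return n;
    }

    if (have_hnp) {
        last_route = ROUTE_HNP;
        printf("[to HNP] %s", buf);   /* real code would pack and send */
    } else {
        last_route = ROUTE_LOCAL;
        fputs(buf, stderr);           /* same effect as local output */
    }
    return n;
}
```

Callers use the same call in both environments; only the routing decision inside the wrapper changes.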
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing of messages before they are sent to their final
destinations. The first component written for the filter framework
creates an XML stream, segregating the messages into different XML
tags. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
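To make the idea of a filter concrete, here is a standalone C sketch of a function that wraps one message in a tag chosen by its kind and escapes the characters that would break the markup. It is not the component in this commit: the enum, the tag names, and xml_filter() itself are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical message kinds, following the categories listed above. */
enum msg_kind { MSG_HELP, MSG_INFRA, MSG_STDOUT, MSG_STDERR };

/* Hypothetical tag names; the real component may use different ones. */
static const char *tag_for(enum msg_kind kind)
{
    switch (kind) {
    case MSG_HELP:   return "help";
    case MSG_INFRA:  return "infrastructure";
    case MSG_STDOUT: return "stdout";
    default:         return "stderr";
    }
}

/* Writes <tag>escaped text</tag> into out (NUL-terminated).
 * Returns the output length, or -1 if it does not fit in cap bytes. */
int xml_filter(enum msg_kind kind, const char *text, char *out, size_t cap)
{
    const char *tag = tag_for(kind);
    size_t pos;
    int n = snprintf(out, cap, "<%s>", tag);

    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    pos = (size_t)n;

    for (const char *p = text; *p != '\0'; p++) {
        char one[2] = { *p, '\0' };
        const char *rep = one;
        size_t len;

        if ('<' == *p)      rep = "&lt;";
        else if ('>' == *p) rep = "&gt;";
        else if ('&' == *p) rep = "&amp;";

        len = strlen(rep);
        if (pos + len >= cap) {
            return -1;              /* keep room for the terminator */
        }
        memcpy(out + pos, rep, len);
        pos += len;
    }

    n = snprintf(out + pos, cap - pos, "</%s>", tag);
    if (n < 0 || (size_t)n >= cap - pos) {
        return -1;
    }
    return (int)(pos + (size_t)n);
}
```

A third-party tool reading the HNP's output could then tell a help message from user stdout by the enclosing tag alone.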
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done in configure.m4 to search for an appropriate XML
library and link it in, and in the component to use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
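The recursion problem in the second item can be illustrated with a simple re-entrancy guard. This is a generic sketch, not the workaround actually in the code: guarded_output(), transport_send(), and the counters are hypothetical, and a plain static depth counter like this is not thread safe.

```c
#include <stdio.h>

static int depth = 0;            /* > 0 while an output call is in flight */
static int local_fallbacks = 0;  /* re-entrant calls handled locally */

static void transport_send(const char *msg);

/* Hypothetical output entry point. A call that arrives while another
 * is already in progress is written locally instead of being sent
 * again, which breaks the loop. Returns 1 if sent, 0 if handled
 * locally. */
int guarded_output(const char *msg)
{
    if (depth > 0) {
        local_fallbacks++;
        fputs(msg, stderr);
        return 0;
    }
    depth++;
    transport_send(msg);
    depth--;
    return 1;
}

/* Stand-in for a send path that itself logs, as in the RML example
 * above. Without the guard this would recurse without bound. */
static void transport_send(const char *msg)
{
    guarded_output("transport: sending\n");
    fputs(msg, stdout);
}
```

Counting the fallbacks gives a cheap way to see how often the loop would have occurred, similar in spirit to a debugging parameter that reports recursions.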
This commit was SVN r18434.
2008-05-13 20:00:55 +00:00
|
|
|
orte_output(0, "%s mca_oob_tcp_get_new_name: done\n",
|
2008-02-28 01:57:57 +00:00
|
|
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
|
2007-07-20 01:34:02 +00:00
|
|
|
        }
    }

    return rc;
}
|