We commonly see messages on the users list where a peer has hung up
because it has crashed. Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
It looks like one help message was accidentally pasted in the middle
of another. Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
In some rare cases when a process receives the connect ack while
locally updating the peer endpoint structure, we could drop the
incomming connect ack due to the fact that the send handler is
protected with a try lock (on the endpoint) and our initial send
event was not persistent. Making the send event persistent solves
all issues.
This commit fixes a race between a thread calling the tcp btl's
add_procs and a thread processing an incomming connection. The race
occured because the add_procs thread adds a newly created proc object
to the hash table *before* the object is fully initialized. The
connection thread then attempts to use the object before the endpoints
array on the object has beeen allocation. The fix is to only add the
proc to the hash table after it has been completely initialized.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
These macros should really be named OPAL_SUMMARY_*; they're used in
all projects, and therefore should be in the lowest later project (OPAL).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Added mca parameter to turn progress thread on/off
Add a flag to check if we have btl progress thread.
Added macro for ob1 matching lock.
Update the AUTHORS file.
All BTL-only operations (basically all data movements
with the exception of the matching operation) can now
be handled for the TCP BTL by a progress thread.
This commit adds two m4 macros: OPAL_SUMMARY_ADD, OPAL_SUMMARY_PRINT.
OPAL_SUMMARY_ADD adds an item to a section in the summary. For example
OPAL_SUMMARY_ADD([[Transports]],[[Foo]],...,[yes]) will add the
following to the summary:
Transports
-----------------------
Foo: yes
With this commit two sections are added: Transports, Resource Managers.
The OPAL_SUMMARY_PRINT macro is called after AC_OUTPUT and prints out
some information about the build (version, projects, etc) and then
the summarys sections. It will additionally print a warning if
internal debugging is enabled.
Example output:
Open MPI configuration:
-----------------------
Version: 3.0.0 a1
Build Open Platform Abstration project: yes
Build Open Runtime project: yes
Build Open MPI project: yes
Build Open SHMEM project: no
MPI C++ bindings (deprecated): no
MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Debug build: yes
Transports
-----------------------
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
KNEM Shared Memory: no
Linux CMA IPC: no
Mellanox MXM: no
Open UCX: no
OpenFabrics libfabric: no
OpenFabrics Verbs: no
portals4: no
QLogic Infinipath (PSM): no
tcp: yes
XPMEM Shared Memory: no
Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Slurm: yes
Torque: yes
INTERNAL DEBUGGING IS ENABLED. DO NOT USE THIS BUILD FOR PERFORMANCE MEASUREMENTS!
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit makes two changes to the tcp btl:
- If a tcp proc does not exist when handling a new connection create
a new proc and use it. The current implementation uses the
opal_proc_by_name() function to get the opal_proc_t then calls
add_procs on all btl modules. It may be sufficient to just call
add_procs until an endpoint is created so this may change somewhat.
- In add_procs add a check for an existing endpoint before creating
one.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates each non-compliant btl to send the
MCA_BTL_FLAGS_SEND flag in the btl_flags field if send is
supported. This fixes a problem identified after the latest bml/r2
update which excplicitly checks for the send flag.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Also add another (superflous but symmetric) continue statement.
This missing "continue" statement allows IPv4 "private network"
matches to fall through and allow IPv6 matches to be made -- thereby
overriding the IPv4 match that was already made.
Fixes#585 (although several of the other issues identified on #585
still exist, the primary / initial bug that was reported there is now
fixed).
Use of the old ompi_free_list_t and ompi_free_list_item_t is
deprecated. These classes will be removed in a future commit.
This commit updates the entire code base to use opal_free_list_t and
opal_free_list_item_t.
Notes:
OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ())
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.