Jeff Squyres
72704441a2
URLs: update URLs for GitHub
2014-10-01 14:44:09 -07:00
Ralph Castain
84810b80fd
Cover the remaining code paths for Java apps to define class path
...
Refs trac:4926
This commit was SVN r32823.
The following Trac tickets were found above:
Ticket 4926 --> https://svn.open-mpi.org/trac/ompi/ticket/4926
2014-09-30 22:27:03 +00:00
Ralph Castain
040a69c38b
Fix the classpath so it includes the local directory and Java programs can find the application class
...
cmr=v1.8.4:reviewer=jsquyres
This commit was SVN r32817.
2014-09-30 16:35:12 +00:00
Ralph Castain
4320457394
Fix the debug output - you can't print the cpuset pointer using the %p format without generating warnings
...
This commit was SVN r32811.
2014-09-29 17:10:38 +00:00
Howard Pritchard
f8ac8bb6b0
remove improper use of hwloc_bitmap_free
...
When using the native aprun launcher, it was observed that
there were frequent memory corruption errors occurring either
during a PMI kvs-fence operation, or at MPI termination during
opal cleanup of allocated objects. This was especially bad
when using
aprun --c none
In some cases, the application would even just hang in finalize
if using ptmalloc, owing to some kind of infinite loop in
cleanup of small blocks, etc.
It turns out that the problem was in orte_ess_base_proc_binding's
improper use of opal_hwloc_base_get_available_cpus. The cpuset
(bitmap) returned from that function is not meant to be freed
by the caller.
This problem is likely never observed when using the mpirun launcher
as there's an early exit if the OMPI_MCA_orte_bound_at_launch
environment variable is set.
This commit was SVN r32809.
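
The ownership rule described above (the returned cpuset belongs to the callee, so the caller must not free it) can be sketched as follows. This is only an illustration: cpuset_t, get_available_cpus and dup_cpuset are hypothetical stand-ins, not the hwloc or Open MPI API, and the real function's caching behavior is assumed rather than taken from the source.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an hwloc bitmap; not the real type. */
typedef struct {
    unsigned long bits;
} cpuset_t;

/* Storage owned by the callee, e.g. a cached set of available cpus. */
static cpuset_t cached_available = { 0xFUL };

/* Returns a borrowed pointer into callee-owned storage.
 * The caller must NOT free it. */
const cpuset_t *get_available_cpus(void)
{
    return &cached_available;
}

/* A caller that needs a set it can modify or free makes its own copy.
 * The returned pointer is caller-owned and is released with free(). */
cpuset_t *dup_cpuset(const cpuset_t *src)
{
    cpuset_t *copy = malloc(sizeof(*copy));
    if (copy != NULL) {
        *copy = *src;
    }
    return copy;
}
```

Freeing the borrowed pointer would corrupt the allocator state or the cache, which is consistent with the late, hard-to-localize failures the commit message describes.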
2014-09-29 16:10:37 +00:00
Gilles Gouaillardet
9661e4537f
oob/tcp: fix a race condition
...
Mimic the btl/tcp protocol to solve the race condition that happens
when two peers try to connect to each other at the same time
cmr=v1.8.4:reviewer=rhc
This commit was SVN r32799.
2014-09-26 06:54:30 +00:00
Ralph Castain
17846411c3
Now that we have an ORTE thread running in apps, we can't just call "exit"
...
during RTE abort as that is happening in a thread, and (at least in some
environments) doesn't result in the main thread being immediately
terminated. Instead, we wind up going thru orte_finalize in the main
thread, which isn't what we want.
So replace the call to "exit" with the "quick exit" variant "_exit", which
causes the entire process to exit immediately.
(custom patch has been posted for 1.8.3)
This commit was SVN r32780.
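
One observable difference between exit and _exit is that exit runs the handlers registered with atexit, while _exit terminates the process without running them. The sketch below demonstrates that in a forked child; it does not reproduce the ORTE thread scenario, and cleanup_ran_in_child and cleanup_handler are hypothetical names used only for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe the child uses to report that cleanup ran. */
static int report_fd = -1;

static void cleanup_handler(void)
{
    const char mark = 'x';
    ssize_t rc = write(report_fd, &mark, 1);
    (void)rc;
}

/* Forks a child that registers an atexit handler and then terminates
 * with _exit (use_quick_exit != 0) or exit (use_quick_exit == 0).
 * Returns 1 if the handler ran, 0 if it did not, -1 on error. */
int cleanup_ran_in_child(int use_quick_exit)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: report through the pipe if the handler is invoked. */
        close(fds[0]);
        report_fd = fds[1];
        atexit(cleanup_handler);
        if (use_quick_exit) {
            _exit(0);   /* immediate termination, no atexit handlers */
        }
        exit(0);        /* runs atexit handlers before terminating */
    }

    /* Parent: read returns 0 (EOF) if the child wrote nothing. */
    close(fds[1]);
    char buf;
    ssize_t n = read(fds[0], &buf, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : 0;
}
```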
2014-09-23 22:51:10 +00:00
Howard Pritchard
1508a01325
Fixes to enable mpirun to work again on Cray
...
The ess pmi module was not handling aprun launched
daemons. All daemons were thinking they were vpid 1.
Also, turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.
Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.
Have the ESS pmi component open the pmix framework and
select a pmix component.
This commit was SVN r32773.
2014-09-23 15:37:26 +00:00
Gilles Gouaillardet
5fa2b6c59c
oob/tcp: fix a race condition
...
Refs trac:4909
This commit was SVN r32754.
The following Trac tickets were found above:
Ticket 4909 --> https://svn.open-mpi.org/trac/ompi/ticket/4909
2014-09-18 08:17:25 +00:00
Ralph Castain
3a437cbdb3
Silence set-but-not-used warning when timing isn't enabled
...
This commit was SVN r32749.
2014-09-17 00:40:10 +00:00
Ralph Castain
414f4e9783
Try to provide a real hostname for the remote host to aid in debugging
...
Refs trac:4908
This commit was SVN r32748.
The following Trac tickets were found above:
Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-17 00:39:49 +00:00
Jeff Squyres
9dc49c5f92
oob_tcp_connection: print "<unknown>" instead of "NULL"
...
"NULL" doesn't mean anything to the user, and is somewhat confusing
to see in an error message. "<unknown>" at least indicates that
there's an error, and we know who the peer is.
This commit was SVN r32747.
2014-09-16 22:47:57 +00:00
Ralph Castain
09aecea55a
Can't use show_help as the RML has already been enabled, but we haven't successfully connected back to the HNP. So use opal_output instead and hardwire the message.
...
Refs trac:4908
This commit was SVN r32746.
The following Trac tickets were found above:
Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908
2014-09-16 22:21:02 +00:00
Ralph Castain
4bbc9a28d6
Try to resolve the simultaneous connection problem by being a little more careful about the choice of returned status when a connection is refused. As before, have the higher vpid of the two peers retry the connection, while the lower one waits. This can happen in a couple of places, so try to hit them all. Since this is hard to test, will ask Gilles to give it a try since he's the one who is seeing it.
...
cmr=v1.8.3:reviewer=rhc
This commit was SVN r32744.
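
The tie-break described above (the higher vpid retries, the lower one waits) can be sketched as a single comparison. This is a minimal illustration, not the oob/tcp code: should_retry_connect is a hypothetical name, and vpids are assumed to be unsigned 32-bit integers that differ between the two peers.

```c
#include <stdbool.h>
#include <stdint.h>

/* When both peers connect at the same time and one attempt is refused,
 * exactly one side should retry. Comparing vpids gives both sides the
 * same answer without any extra communication: the higher vpid retries,
 * the lower vpid waits for the incoming connection. */
bool should_retry_connect(uint32_t my_vpid, uint32_t peer_vpid)
{
    return my_vpid > peer_vpid;
}
```

Because the rule is symmetric, the two peers always reach opposite decisions, so they cannot both retry or both wait.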
2014-09-16 18:59:36 +00:00
Ralph Castain
a74428513d
Provide a better help message when we are unable to complete a connection due to a firewall.
...
cmr=v1.8.3:reviewer=jsquyres
This commit was SVN r32743.
2014-09-16 16:28:29 +00:00
Ralph Castain
dfb952fa78
[Contribution from Artem - moved it to svn from git for him]
...
Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup.
This commit was SVN r32738.
2014-09-15 18:00:46 +00:00
Jeff Squyres
e95ed94a94
plm_rsh_module.c: output to the framework output
...
Trivial fix from r32686: don't output to stream 0, but rather to
orte_plm_base_framework.framework_output (this is the way it was
before r32686). In reality, this is going to end up being stream 0,
anyway, but we might as well be pedantically correct...
Refs trac:4897.
This commit was SVN r32726.
The following SVN revision numbers were found above:
r32686 --> open-mpi/ompi@4df1aa63f7
The following Trac tickets were found above:
Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-13 00:46:35 +00:00
Ralph Castain
0445052a1c
Check for multiple declarations of a given MCA param and error out if detected as that can create an ambiguous definition of the param value.
...
Refs trac:4897
This commit was SVN r32719.
The following Trac tickets were found above:
Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-12 22:21:30 +00:00
Ralph Castain
9e7e90265f
Temporarily make the direct grpcomm component the default until we can debug the other modules
...
This commit was SVN r32707.
2014-09-11 14:47:54 +00:00
Ralph Castain
4eb6291334
Avoid conflicts when multiple collectives are underway in ORTE by giving each grpcomm component its own RML tag and posting persistent receives. We use the signature anyway to determine which collective the received message is addressing, so there is no need to post non-persistent receives.
...
This commit was SVN r32703.
2014-09-10 17:36:16 +00:00
Ralph Castain
ea11e63f59
Per patch from Tetsuya, allow the user to bind-to none when specifying multiple pe's/rank as requested by Reuti. This allows the user to reserve multiple "slots" in the allocation for each process while mapping, but not to bind the process to specific processing elements on the node.
...
Reviewed by rhc, so RM-approved to go across to v1.8.3
cmr=v1.8.3:reviewer=ompi-gk1.8
This commit was SVN r32701.
2014-09-10 15:52:18 +00:00
Ralph Castain
e671620ac7
Per request from Jeff: tune up the help messages for binding options
...
Refs trac:4898
This commit was SVN r32691.
The following Trac tickets were found above:
Ticket 4898 --> https://svn.open-mpi.org/trac/ompi/ticket/4898
2014-09-09 22:39:22 +00:00
Gilles Gouaillardet
63209eac5b
orte/util: use ORTE_JOB_FAMILY and ORTE_LOCAL_JOBID macros
...
This commit was SVN r32688.
2014-09-09 05:13:00 +00:00
Ralph Castain
4207b4c4ad
Improve the --bind-to help message to better indicate the default options under various values of np. Remove the warning message if the user doesn't specify a binding policy and we are overloaded
...
cmr=v1.8.3:reviewer=jsquyres
This commit was SVN r32687.
2014-09-08 21:03:51 +00:00
Ralph Castain
4df1aa63f7
Since we've run into the situation where someone puts a script wrapper around a launcher such as srun, we need to always protect MCA cmd line params with quotes. This means we also need to protect the backend from quotes coming into the system as part of a value, or else the parser gets confused.
...
So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes.
cmr=v1.8.3:reviewer=jsquyres
This commit was SVN r32686.
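
The two halves of this change, wrapping a value in quotes on the launch side and stripping outer quotes in the backend parser, can be sketched as below. Both helpers are hypothetical and are not the functions added by the commit; the sketch assumes only a single matching pair of outer quotes is removed.

```c
#include <stdio.h>
#include <string.h>

/* Wraps value in double quotes, writing the result to out.
 * Returns 0 on success, -1 if out is too small. */
int wrap_in_quotes(const char *value, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "\"%s\"", value);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Removes one matching pair of leading/trailing quotes in place.
 * Values without such a pair are left unchanged. */
char *strip_outer_quotes(char *value)
{
    size_t len = strlen(value);
    if (len >= 2 &&
        (value[0] == '"' || value[0] == '\'') &&
        value[len - 1] == value[0]) {
        memmove(value, value + 1, len - 2);
        value[len - 2] = '\0';
    }
    return value;
}
```

Stripping on the receiving side means a value arrives unchanged whether or not an intermediate wrapper script preserved the quotes.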
2014-09-08 20:38:46 +00:00
Ralph Castain
6323b226c7
Bring over some updates from the PMIx branch - mostly just minor cleanups. Make the direct grpcomm component no longer be the default. For now, we seem to be having problems with non-blocking fence operations, so make them not be the default under any scenario (e.g., when sm is the only btl in operation).
...
This commit was SVN r32673.
2014-09-06 19:19:44 +00:00
Ralph Castain
94ffca4901
Correct the cutoff point for full modex operation as it is based on the number of nodes in the system, not the number of procs in the signature.
...
This commit was SVN r32666.
2014-09-03 17:28:12 +00:00
Ralph Castain
2bfb18e004
Resolve some race conditions when async pmix modex modes are invoked. Since calls to "get" data can come both locally and remotely before data for a given proc has actually been received, we have to track all requests that cannot be immediately fulfilled and provide the data once it has been received.
...
This commit was SVN r32664.
2014-09-02 20:04:17 +00:00
Ralph Castain
4d186e6402
Properly protect the MCA parameters being registered by the OOB/TCP component when IPv6 is enabled
...
cmr=v1.8.3:reviewer=jsquyres
This commit was SVN r32662.
2014-09-02 14:53:00 +00:00
Ralph Castain
f2b26bde4c
Resolve a race condition that could cause us to hang during abnormal terminations due to multi-counting num_terminated
...
This commit was SVN r32660.
2014-09-02 00:32:52 +00:00
Ralph Castain
e49ca05f11
Remove unused variable
...
This commit was SVN r32651.
2014-08-31 03:11:50 +00:00
Ralph Castain
5cdbc00136
Re-enable the usock oob component. Ensure the TCP component promotes messages for other procs to the OOB base so that other components have a chance to send the relay. Seems to be passing MTT, so let's see how it works for others.
...
This commit was SVN r32650.
2014-08-30 19:33:46 +00:00
Ralph Castain
a2085a5916
Fix the PSM transport key generator to match prior releases
...
This commit was SVN r32649.
2014-08-30 00:48:25 +00:00
Ralph Castain
cb0739dfd4
Update the regex to resolve a bug
...
This commit was SVN r32647.
2014-08-29 22:24:20 +00:00
Ralph Castain
8faabed2cd
Add some further initialization and protection for zero-byte messages
...
This commit was SVN r32644.
2014-08-29 17:24:55 +00:00
Ralph Castain
2b225e3776
Cleanup a race condition regarding marking that waitpid_fired. We should always mark it as fired when we enter the wait_local_proc routine, and also mark it as no longer alive if iof_complete has also been found. If other places in the code also update those flags, there is no harm done.
...
This commit was SVN r32643.
2014-08-29 17:03:31 +00:00
Ralph Castain
730e28349e
Some minor uninitialized variable cleanups
...
This commit was SVN r32629.
2014-08-29 02:21:13 +00:00
Ralph Castain
fafdbeec0c
Cleanup and enable the new daemon collective modules for more scalable operations. Thanks to Nadezhda Kogteva (Mellanox) for doing them.
...
This commit was SVN r32624.
2014-08-28 20:35:35 +00:00
Ralph Castain
731a878ff3
Add a bunch of debug to help track down the problem, and eventually find another place where comparison of signatures was incorrectly performed - use the dss compare operation to be consistent and safe
...
This commit was SVN r32620.
2014-08-27 19:52:20 +00:00
Ralph Castain
5fb7c7d23b
Don't explicitly add the hostname to the data fetch when we already cached a remote blob
...
This commit was SVN r32619.
2014-08-27 16:18:05 +00:00
Ralph Castain
3c24770bce
Protect debug printing on backend nodes
...
This commit was SVN r32618.
2014-08-27 16:17:28 +00:00
Ralph Castain
b87b69e977
Ensure the nodes get added to the job map on the remote nodes, add some debug to grpcomm daemon array construction
...
This commit was SVN r32617.
2014-08-27 16:16:46 +00:00
Ralph Castain
842aaf6167
Correctly end mapping oversubscribed nodes round-robin byslot
...
cmr=v1.8.3:reviewer=rhc
This commit was SVN r32616.
2014-08-27 16:15:18 +00:00
Gilles Gouaillardet
2679629a12
pmix: fix compilation when configured with --without-hwloc
...
This commit was SVN r32604.
2014-08-26 08:31:05 +00:00
Ralph Castain
1221e8a96f
Compare the full signature - thanks to Gilles for identifying the problem
...
This commit was SVN r32595.
2014-08-25 14:52:06 +00:00
Ralph Castain
5a13cdb739
Fix a race condition caused by a bad attribute flag that created an OR instead of an AND condition check
...
This commit was SVN r32587.
2014-08-22 22:48:16 +00:00
Ralph Castain
039b7acfb5
Fix the quoting algorithm so only rsh command lines get quoted values
...
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r32586.
2014-08-22 22:47:38 +00:00
Ralph Castain
f00af81c1d
Little more cleanup under the abort cases cited by Gilles. All seem to be working now
...
This commit was SVN r32585.
2014-08-22 19:57:57 +00:00
Ralph Castain
b1a7375192
Fix the "unreachable" message so it outputs the correct hostname for the remote proc. Cleanup some of the pmix stuff when running corner cases of errors
...
This commit was SVN r32584.
2014-08-22 19:20:45 +00:00
Ralph Castain
6ff2a60829
Handle the non-blocking fence case correctly, and ensure we always at least pass back the hostname of the process whose info is being requested so that the ompi_proc_t can correctly initialize it when we are in a non-blocking fence with np < cutoff scenario
...
This commit was SVN r32578.
2014-08-22 14:26:24 +00:00