Ralph Castain
9c66c4f439
Correctly implement --disable-oshmem and --without-orte so we don't build the disabled section of code. Fix a bunch of code rot in the PMI rte component, and add several missing headers when building --without-orte.
...
NOTE: I transferred the oshmem-disabled-by-default from the 1.7 branch to the trunk to minimize future disruption if/when we change that option.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31006.
2014-03-11 22:02:40 +00:00
Adrian Reber
49173ccd61
add debug output for the ft_event handler
...
This commit was SVN r30990.
2014-03-11 15:39:16 +00:00
Adrian Reber
7304b700e1
Fix the newly added FT event state when compiling --with-ft
...
This commit was SVN r30988.
2014-03-11 13:20:08 +00:00
Ralph Castain
8e080fb95e
Need a slightly different header
...
This commit was SVN r30986.
2014-03-11 03:03:12 +00:00
Ralph Castain
2cd1cfc7fe
Remove this ignore for now
...
This commit was SVN r30985.
2014-03-11 03:02:13 +00:00
Ralph Castain
103a5c6df1
Output the bindings if ess verbosity is high enough
...
Refs trac:4356
This commit was SVN r30982.
The following Trac tickets were found above:
Ticket 4356 --> https://svn.open-mpi.org/trac/ompi/ticket/4356
2014-03-11 01:21:14 +00:00
Ralph Castain
176b326c27
Add a comment to make Jeff happier...
...
Refs trac:4340
This commit was SVN r30980.
The following Trac tickets were found above:
Ticket 4340 --> https://svn.open-mpi.org/trac/ompi/ticket/4340
2014-03-10 23:02:04 +00:00
Ralph Castain
081669b440
When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings
This commit was SVN r30968.
2014-03-10 15:53:07 +00:00
Adrian Reber
b51733c456
fix "warning: 'sstore_stage_select' defined but not used"
...
In the function sstore_stage_select() the local variables
were set up and defined. Unfortunately this function was
never called. This patch moves variable set up to the
sstore_stage_register() function and checks the return
values of the variable initialization.
This commit was SVN r30958.
2014-03-06 16:53:27 +00:00
Ralph Castain
7a44af375c
Add an FT event state and set the state machine to callback to the OOB base ft event when activated
...
This commit was SVN r30950.
2014-03-06 02:44:29 +00:00
Ralph Castain
9793909988
Correct the constant we check for an error. Thanks to George for noticing it.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30949.
2014-03-06 02:21:27 +00:00
Ralph Castain
fc2dd6ac48
Per Jeff's request, add a more detailed comment as to why we are turning off the warning at this time.
...
Refs trac:4339
This commit was SVN r30948.
The following Trac tickets were found above:
Ticket 4339 --> https://svn.open-mpi.org/trac/ompi/ticket/4339
2014-03-06 02:17:25 +00:00
Ralph Castain
c9465d97b4
Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test.
...
Thanks to Paul Kapinos for reporting the problem.
cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition
This commit was SVN r30942.
2014-03-05 04:38:17 +00:00
Ralph Castain
a2b539c763
Per the telecon, silence the warning for 1.7.5 to give us time to consider a better permanent solution
...
Refs trac:4339
This commit was SVN r30941.
The following Trac tickets were found above:
Ticket 4339 --> https://svn.open-mpi.org/trac/ompi/ticket/4339
2014-03-05 03:02:29 +00:00
Ralph Castain
50c30d62ca
Repair builds without hwloc
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30940.
2014-03-05 02:48:15 +00:00
Adrian Reber
e5bef82ee1
OPAL_ENABLE_FT_CR: remove compiler warnings
...
When compiling --with-ft there are a few compiler warnings about
unused variables. This patch fixes those compiler warnings.
This commit was SVN r30927.
2014-03-04 15:28:07 +00:00
Ralph Castain
da4cb39683
If we can't find a route to communicate, emit an error message rather than just exiting with a non-zero status
...
cmr=v1.7.5:reviewer=jsquyres:subject=print error if cannot communicate
This commit was SVN r30922.
2014-03-04 04:57:53 +00:00
Ralph Castain
0ac97761cc
Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
...
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= parameter. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.
In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.
Also clean up a couple of issues in the mapping/binding system:
* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed
* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch
* minor cleanup to the warning message when oversubscribed and binding was requested
cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system
This commit was SVN r30909.
2014-03-03 16:46:37 +00:00
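A minimal sketch of the hostfile slot default described in r30909 above, written in Python for brevity and not taken from the ORTE source. The function name resolve_slots, its parameters, and the dict-based hardware counts are all hypothetical; the only behavior it tries to mirror is "use the sensed #cores (or #hwt's if use_hwthreads_as_cpus is set) only for hosts listed without slot information".

```python
def resolve_slots(hostfile_entries, cores, hwthreads, use_hwthreads_as_cpus=False):
    """Return {hostname: slots} for a parsed hostfile (hypothetical helper).

    hostfile_entries: list of (hostname, slots) pairs; slots is None when the
        hostfile line carried no slots= value.
    cores, hwthreads: dicts mapping hostname to the sensed hardware counts.
    """
    resolved = {}
    for host, slots in hostfile_entries:
        if slots is not None:
            # The user gave an explicit count: never override it.
            resolved[host] = slots
        elif use_hwthreads_as_cpus:
            # No slot info: fall back to the number of hardware threads.
            resolved[host] = hwthreads[host]
        else:
            # No slot info: fall back to the number of cores.
            resolved[host] = cores[host]
    return resolved
```

For example, a hostfile that lists node1 with slots=4 and node2 with no slot count would keep 4 slots on node1 and take node2's count from the sensed hardware.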
Ralph Castain
88b0e0cc6d
Allow the user to turn off the oversubscribed-binding warning if overload-allowed has been provided
...
Refs trac:4317
This commit was SVN r30892.
The following Trac tickets were found above:
Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 17:55:53 +00:00
Ralph Castain
4a645f0342
Add detection of oversubscription with binding requested - if binding requested to core or hwt, warn and do not bind or else we will hurt performance. Also, if no binding directive was given, turn off the default binding
...
Refs trac:4317
This commit was SVN r30888.
The following Trac tickets were found above:
Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 16:08:52 +00:00
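The decision described in r30888 above can be sketched roughly as follows. This is an illustration in Python, not the ORTE implementation: the function name choose_binding, the string policy names, and the "core" default are assumptions made for the example.

```python
def choose_binding(requested, oversubscribed, default="core"):
    """Return (binding, warn) for a node (hypothetical helper).

    requested: the user's binding directive, or None if none was given.
    oversubscribed: True when more procs than slots are mapped to the node.
    default: assumed default policy applied when no directive is given.
    """
    if not oversubscribed:
        # Nothing special to do: honor the directive or use the default.
        return (requested if requested is not None else default, False)
    if requested is None:
        # No directive was given: silently turn off the default binding.
        return ("none", False)
    if requested in ("core", "hwthread"):
        # Binding this tightly while oversubscribed would hurt performance:
        # warn and do not bind.
        return ("none", True)
    # Coarser directives (e.g. socket) are left alone in this sketch.
    return (requested, False)
```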
Ralph Castain
8500247c7b
Fix the by-obj mapper in the case where slots are not specified, and so we are in a perpetual oversubscribed state
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30887.
2014-02-28 05:21:46 +00:00
Ralph Castain
a4c3d0a5a0
Add some more debug to the by-obj mapper
...
This commit was SVN r30884.
2014-02-28 02:52:53 +00:00
Ralph Castain
d109c523b9
Per patch from Tetsuya Mishima, complete the overhaul of the round-robin mappers
...
Refs trac:4296
This commit was SVN r30861.
The following Trac tickets were found above:
Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-27 00:43:53 +00:00
Ralph Castain
61a21e4f31
Based on Tetsuya's patch, with some changes, correct the case of map-by node where multiple cpus/rank are requested and result in a non-integer match with num slots. Also correct tests for binding policy given to use the proper macro.
...
Refs trac:4296
This commit was SVN r30857.
The following Trac tickets were found above:
Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-26 18:12:23 +00:00
Ralph Castain
b880aa46bd
Update the map-by obj and map-by obj:span mappers to correct for errors in computing carryover across the nodes. Be a little less complex in the algorithm so it is easier to follow and debug.
...
Refs trac:4296
This commit was SVN r30826.
The following Trac tickets were found above:
Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-25 23:32:43 +00:00
Joshua Ladd
9ea9bec4ad
Addressing Jeff's comments:
...
1. Changed rng_buff_t --> opal_rng_buff_t
2. All global variables obey the prefix rule
3. Old code has been removed
4. Found a couple of unnecessary includes
Refs trac:4298
This commit was SVN r30807.
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
2014-02-24 23:18:35 +00:00
Joshua Ladd
e39d9f4080
Per the RFC schedule, add an additive lagged Fibonacci parallel random number generator to OPAL. To use it, include the following header in your code: opal/util/alfg.h. See ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c for an example of how to seed with opal_srand and invoke the generator with opal_rand. This should be added to
...
cmr=v1.7.5:reviewer=rhc:subject=Add an OPAL RNG
This commit was SVN r30801.
2014-02-23 21:41:38 +00:00
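For readers unfamiliar with the technique named in r30801 above, here is a generic additive lagged Fibonacci generator in Python. It is only a sketch of the recurrence x[n] = (x[n - j] + x[n - k]) mod 2^m; the lags (24, 55), the 32-bit width, the LCG-based seeding, and the class name are assumptions for illustration and are not claimed to match opal/util/alfg.h or the opal_srand/opal_rand interface.

```python
from collections import deque


class AdditiveLaggedFibonacci:
    """x[n] = (x[n - short_lag] + x[n - long_lag]) mod 2**bits."""

    def __init__(self, seed, short_lag=24, long_lag=55, bits=32):
        if not 0 < short_lag < long_lag:
            raise ValueError("need 0 < short_lag < long_lag")
        self.short_lag = short_lag
        self.mask = (1 << bits) - 1
        # Fill the initial window with a simple linear congruential generator.
        state = seed & self.mask
        window = []
        for _ in range(long_lag):
            state = (1664525 * state + 1013904223) & self.mask
            window.append(state)
        # Force one odd value so the sequence is not confined to even numbers.
        window[0] |= 1
        # The deque always holds the last long_lag values, oldest first.
        self.window = deque(window, maxlen=long_lag)

    def next(self):
        # window[0] is x[n - long_lag]; window[-short_lag] is x[n - short_lag].
        value = (self.window[0] + self.window[-self.short_lag]) & self.mask
        self.window.append(value)  # maxlen drops the oldest entry
        return value
```

Two generators built from the same seed produce the same stream, which is what makes per-process seeding useful for reproducible runs.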
Ralph Castain
c8112c1086
Loadbalancing across nodes (i.e., map-by node) wasn't working correctly - the algorithm relied on the nodes being defined in descending order of slots, or the number of slots remaining to be assigned being only one/node. Regardless, it didn't work for the case where nodes were defined in ascending order of slots.
...
Tetsuya's proposed patch didn't solve the problem for me, but it did correct the case where cpus/proc > 1. The final patch requires that we loop over the assignment algo until all procs are assigned or all nodes are filled - any remaining procs are then handled in the cleanup loop.
cmr=v1.7.5:reviewer=rhc:subject=fix map-by node for different cases
This commit was SVN r30798.
2014-02-22 16:39:41 +00:00
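The loop structure described in r30798 above (repeat the assignment pass until all procs are assigned or all nodes are filled, then handle leftovers in a cleanup loop) can be sketched as below. This is a simplified Python illustration with a hypothetical function name; it assumes one cpu per proc and that leftover procs are spread round-robin, neither of which is confirmed by the commit message.

```python
def map_by_node(slots, num_procs):
    """Return the number of procs placed on each node (hypothetical helper).

    slots: list with the slot count of each node, in any order.
    num_procs: total number of procs to place.
    """
    if not slots:
        raise ValueError("need at least one node")
    assigned = [0] * len(slots)
    remaining = num_procs

    # Main loop: one proc per node per pass, skipping nodes that are full.
    # Repeat until every proc is placed or no node has a free slot.
    while remaining > 0:
        progressed = False
        for node, capacity in enumerate(slots):
            if remaining == 0:
                break
            if assigned[node] < capacity:
                assigned[node] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # all nodes are filled

    # Cleanup loop: place any leftover procs round-robin (oversubscribing).
    node = 0
    while remaining > 0:
        assigned[node] += 1
        remaining -= 1
        node = (node + 1) % len(slots)
    return assigned
```

Because each pass walks every node, the result does not depend on whether the nodes are listed in ascending or descending order of slots.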
Adrian Reber
f17ec1ab10
ESS/BASE: orte-restart needs sstore
...
Running orte-restart requires an initialized sstore.
This opens the sstore component for FT builds just like
the snapc component.
This commit was SVN r30796.
2014-02-21 21:23:26 +00:00
Ralph Castain
0319d5fb19
Seeing some errors coming out of MTT on this component, so turn it off for now and will debug later
...
This commit was SVN r30789.
2014-02-21 16:31:52 +00:00
Mike Dubman
8d4592a94b
rmaps/mindist: better error message
...
better error message when there is only one socket available
fixed by Elena, reviewed by Miked
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30787.
2014-02-21 11:38:35 +00:00
Ralph Castain
5520d6971b
We do have to track the origin of messages sent over usock as the daemon does route them back down, and we need to get the "sender" info correct. Also do a better job of dealing with simultaneous connections to avoid binding to a used socket.
...
Refs trac:4280
This commit was SVN r30781.
The following Trac tickets were found above:
Ticket 4280 --> https://svn.open-mpi.org/trac/ompi/ticket/4280
2014-02-20 17:27:05 +00:00
Ralph Castain
63803f5e61
Fix the leader data for PMI direct-launch as well
...
This commit was SVN r30778.
2014-02-20 01:41:19 +00:00
Ralph Castain
418ca60776
Since we don't know the name of the local leader, store that info under our own name :-)
...
This commit was SVN r30777.
2014-02-20 01:39:52 +00:00
Ralph Castain
262c927778
Define a new key and store the process name of the local_rank=0 process on each node so that the MPI layer can retrieve it as desired.
...
This commit was SVN r30759.
2014-02-18 00:32:58 +00:00
Adrian Reber
6b45d475e9
Fix compiler warnings when compiling with --with-ft
...
With fault tolerance enabled, different functions
are selected during compilation. Most of the ft
code is #ifdef'd out. This #ifdef's more code out
so that compiler warnings like
warning: unused variable 'item' [-Wunused-variable]
opal_list_item_t *item;
are removed.
This commit was SVN r30747.
2014-02-17 10:53:44 +00:00
Ralph Castain
c3df744a3b
Shift the orte_db_localrank key to the opal level. Add the job and proc-level session directory names to the database using opal_db keys.
...
This commit was SVN r30746.
2014-02-17 01:40:56 +00:00
Ralph Castain
ea0217c337
Remove unused file and minimize the usock uri contribution (add explanation as to why)
...
Refs trac:4280
This commit was SVN r30744.
The following Trac tickets were found above:
Ticket 4280 --> https://svn.open-mpi.org/trac/ompi/ticket/4280
2014-02-16 22:37:30 +00:00
Ralph Castain
a91d358c48
Add/modify a couple of tests
...
This commit was SVN r30743.
2014-02-16 20:54:34 +00:00
Ralph Castain
d42f4be8a4
Add unix socket component to OOB - no longer require active network for local operations. Demonstrate inter-transport crossover.
...
VERY tentatively schedule this for 1.7.5 - only to be applied if we see no troubles AND the branch is ready in advance.
cmr=v1.7.5:reviewer=rhc:subject=Add unix socket component to OOB
This commit was SVN r30742.
2014-02-16 20:54:12 +00:00
Ralph Castain
14bb7a117c
Fix bugs in the oob base - ensure we get the components in high-to-low priority, and that we correctly track reachability via all components. Adjust the priority of the tcp component to leave headroom for others
...
Refs trac:267
This commit was SVN r30740.
The following Trac tickets were found above:
Ticket 267 --> https://svn.open-mpi.org/trac/ompi/ticket/267
2014-02-16 03:19:08 +00:00
Ralph Castain
509d5d82b0
Add some verbiage requested by Jeff, change the param level to something...?
...
Refs trac:4275
This commit was SVN r30736.
The following Trac tickets were found above:
Ticket 4275 --> https://svn.open-mpi.org/trac/ompi/ticket/4275
2014-02-15 15:11:05 +00:00
Ralph Castain
3f9db36e0d
Make Jeff smile - pretty-up the indentation
...
Refs trac:4267
This commit was SVN r30733.
The following Trac tickets were found above:
Ticket 4267 --> https://svn.open-mpi.org/trac/ompi/ticket/4267
2014-02-14 23:25:48 +00:00
Ralph Castain
91f90058ce
Add missing options and cleanup the code a bit. Default to by-slot ranking if a non-hardware option isn't given. Thanks to Tetsuya Mishima for the assist.
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30725.
2014-02-14 10:23:16 +00:00
Ralph Castain
fd9b301a8b
Check equality instead of bit-mask - thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30722.
2014-02-14 02:34:42 +00:00
Ralph Castain
4e1c07cbf2
If we are given a TCP oob address that doesn't match any active module, it is still possible that we could route to the address if a router is in the system. No harm in trying, so arbitrarily pick the first connection in the active module list and assign the peer to it. If that module can't reach it, we'll follow the usual failover mechanism until finally concluding that nobody can get there.
...
cmr=v1.7.5:reviewer=jsquyres:subject=handle non-matching addresses
This commit was SVN r30719.
2014-02-13 23:37:22 +00:00
Ralph Castain
449cd8f3d7
Update a couple of fields, add a scheduler field to proc_info
...
This commit was SVN r30718.
2014-02-13 23:30:04 +00:00
Ralph Castain
fc6101b508
Handle "localhost" better
...
Refs trac:4263
This commit was SVN r30702.
The following Trac tickets were found above:
Ticket 4263 --> https://svn.open-mpi.org/trac/ompi/ticket/4263
2014-02-12 20:30:39 +00:00
Ralph Castain
a8a9801a0b
Ensure an orted exits with non-zero status if it is unable to send a message. Add more diagnostic messages to the OOB set_addr code
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30701.
2014-02-12 19:44:01 +00:00
Ralph Castain
1473dde6ea
Okay, once again be caught by the blasted hwloc inability to cleanly handle caches. Protect the calls to get_depth by first checking to see if it is a "cache", then use a cache-specific function to get the stupid data. Very, very irritating.
...
cmr=v1.7.5:reviewer=jsquyres:subject=treat caches as something different yet again
This commit was SVN r30693.
2014-02-12 01:45:06 +00:00