Joshua Ladd
e39d9f4080
Per the RFC schedule, add an additive lagged Fibonacci parallel random number generator to OPAL. To use it, include the following header in your code: opal/util/alfg.h. See ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c for an example of how to seed with opal_srand and invoke the generator with opal_rand. This should be added to
...
cmr=v1.7.5:reviewer=rhc:subject=Add an OPAL RNG
This commit was SVN r30801.
2014-02-23 21:41:38 +00:00
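For readers unfamiliar with the technique, here is a minimal sketch of the additive lagged Fibonacci recurrence x[n] = x[n-j] + x[n-k] mod 2^32. It is not the code in opal/util/alfg.h: the names alfg_t, alfg_seed and alfg_next, the lags (24, 55) and the LCG used for seeding are all assumptions chosen for illustration, and the sketch covers only a single serial stream, not the parallel aspect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is NOT the opal/util/alfg.h API. */
#define ALFG_LONG_LAG  55   /* k: number of state words kept */
#define ALFG_SHORT_LAG 24   /* j: short lag, 0 < j < k */

typedef struct {
    uint32_t state[ALFG_LONG_LAG]; /* circular buffer of the last k outputs */
    int pos;                       /* index of the oldest word, x[n-k] */
} alfg_t;

/* Fill the state from a seed with a simple LCG (an assumption; any
 * reasonable initializer works). At least one word must be odd, otherwise
 * every output would be even. */
static void alfg_seed(alfg_t *g, uint32_t seed)
{
    assert(g != NULL);
    for (int i = 0; i < ALFG_LONG_LAG; i++) {
        seed = seed * 1664525u + 1013904223u;
        g->state[i] = seed;
    }
    g->state[0] |= 1u;
    g->pos = 0;
}

/* Produce x[n] = x[n-j] + x[n-k]; unsigned overflow gives the mod 2^32. */
static uint32_t alfg_next(alfg_t *g)
{
    int k_idx = g->pos;
    int j_idx = (g->pos + ALFG_LONG_LAG - ALFG_SHORT_LAG) % ALFG_LONG_LAG;
    uint32_t x = g->state[j_idx] + g->state[k_idx];

    g->state[k_idx] = x;                    /* newest word replaces the oldest */
    g->pos = (g->pos + 1) % ALFG_LONG_LAG;
    return x;
}
```

Two generators seeded with the same value produce identical streams, which is the property a seed/draw pair such as opal_srand/opal_rand would be expected to provide.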
Ralph Castain
c8112c1086
Load balancing across nodes (i.e., map-by node) wasn't working correctly - the algorithm relied on the nodes being defined in descending order of slots, or the number
...
of slots remaining to be assigned being only one/node. Regardless, it didn't work for the case where nodes were defined in ascending order of slots.
Tetsuya's proposed patch didn't solve the problem for me, but it did correct the case where cpus/proc > 1. The final patch requires that we loop over the assignment
algo until all procs are assigned or all nodes are filled - any remaining procs are then handled in the cleanup loop.
cmr=v1.7.5:reviewer=rhc:subject=fix map-by node for different cases
This commit was SVN r30798.
2014-02-22 16:39:41 +00:00
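The approach described above (repeat the assignment pass until all procs are placed or all nodes are full, then hand leftovers to a cleanup loop) can be sketched roughly as follows. This is not the ORTE rmaps code: map_by_node and its arguments are hypothetical, and the assumption that the cleanup loop oversubscribes nodes round-robin is mine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT the ORTE mapper. Places nprocs processes across
 * nodes, one per node per pass, honoring each node's free slots.
 * assigned[i] receives the number of procs placed on node i.
 * Returns how many procs were left for the cleanup loop. */
static int map_by_node(const int *slots, int *assigned, size_t nnodes, int nprocs)
{
    int remaining = nprocs;
    int placed_this_pass;
    int leftover;

    assert(slots != NULL && assigned != NULL && nnodes > 0 && nprocs >= 0);
    for (size_t i = 0; i < nnodes; i++) {
        assigned[i] = 0;
    }

    /* Main loop: keep making passes until every proc is placed or a whole
     * pass finds no node with a free slot. Node order does not matter. */
    do {
        placed_this_pass = 0;
        for (size_t i = 0; i < nnodes && remaining > 0; i++) {
            if (assigned[i] < slots[i]) {
                assigned[i]++;
                remaining--;
                placed_this_pass++;
            }
        }
    } while (remaining > 0 && placed_this_pass > 0);

    /* Cleanup loop (assumed behavior): spread what is left round-robin,
     * oversubscribing the nodes. */
    leftover = remaining;
    for (size_t i = 0; remaining > 0; i = (i + 1) % nnodes) {
        assigned[i]++;
        remaining--;
    }
    return leftover;
}
```

With slots listed in ascending order, for example {1, 2, 4} and five procs, the passes yield {1, 2, 2} with nothing left for cleanup.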
Adrian Reber
f17ec1ab10
ESS/BASE: orte-restart needs sstore
...
Running orte-restart requires an initialized sstore.
This opens the sstore component for FT builds just like
the snapc component.
This commit was SVN r30796.
2014-02-21 21:23:26 +00:00
Ralph Castain
0319d5fb19
Seeing some errors coming out of MTT on this component, so turn it off for now; will debug later
...
This commit was SVN r30789.
2014-02-21 16:31:52 +00:00
Mike Dubman
8d4592a94b
rmaps/mindist: better error message
...
better error message when there is only one socket available
fixed by Elena, reviewed by Miked
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r30787.
2014-02-21 11:38:35 +00:00
Ralph Castain
5520d6971b
We do have to track the origin of messages sent over usock as the daemon does route them back down, and we need to get the "sender" info correct. Also do a better job of dealing with simultaneous connections to avoid binding to a used socket.
...
Refs trac:4280
This commit was SVN r30781.
The following Trac tickets were found above:
Ticket 4280 --> https://svn.open-mpi.org/trac/ompi/ticket/4280
2014-02-20 17:27:05 +00:00
Ralph Castain
63803f5e61
Fix the leader data for PMI direct-launch as well
...
This commit was SVN r30778.
2014-02-20 01:41:19 +00:00
Ralph Castain
418ca60776
Since we don't know the name of the local leader, store that info under our own name :-)
...
This commit was SVN r30777.
2014-02-20 01:39:52 +00:00
Ralph Castain
262c927778
Define a new key and store the process name of the local_rank=0 process on each node so that the MPI layer can retrieve it as desired.
...
This commit was SVN r30759.
2014-02-18 00:32:58 +00:00
Adrian Reber
6b45d475e9
Fix compiler warnings when compiling with --with-ft
...
With fault tolerance enabled, different functions
are selected during compilation. Most of the FT
code is #ifdef'd out. This change #ifdef's out more
code so that compiler warnings like
warning: unused variable 'item' [-Wunused-variable]
opal_list_item_t *item;
are removed.
This commit was SVN r30747.
2014-02-17 10:53:44 +00:00
Ralph Castain
c3df744a3b
Shift the orte_db_localrank key to the opal level. Add the job and proc-level session directory names to the database using opal_db keys.
...
This commit was SVN r30746.
2014-02-17 01:40:56 +00:00
Ralph Castain
ea0217c337
Remove unused file and minimize the usock uri contribution (add explanation as to why)
...
Refs trac:4280
This commit was SVN r30744.
The following Trac tickets were found above:
Ticket 4280 --> https://svn.open-mpi.org/trac/ompi/ticket/4280
2014-02-16 22:37:30 +00:00
Ralph Castain
a91d358c48
Add/modify a couple of tests
...
This commit was SVN r30743.
2014-02-16 20:54:34 +00:00
Ralph Castain
d42f4be8a4
Add unix socket component to OOB - no longer require active network for local operations. Demonstrate inter-transport crossover.
...
VERY tentatively schedule this for 1.7.5 - only to be applied if we see no troubles AND the branch is ready in advance.
cmr=v1.7.5:reviewer=rhc:subject=Add unix socket component to OOB
This commit was SVN r30742.
2014-02-16 20:54:12 +00:00
Ralph Castain
14bb7a117c
Fix bugs in the oob base - ensure we get the components in high-to-low priority, and that we correctly track reachability via all components. Adjust the priority of the tcp component to leave headroom for others
...
Refs trac:267
This commit was SVN r30740.
The following Trac tickets were found above:
Ticket 267 --> https://svn.open-mpi.org/trac/ompi/ticket/267
2014-02-16 03:19:08 +00:00
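Ordering components from high to low priority can be illustrated with a plain qsort comparator. The component_t type and the priority values used with it are hypothetical; the real OOB base keeps its components in OPAL list structures rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a framework component entry. */
typedef struct {
    const char *name;
    int priority;
} component_t;

/* Comparator for descending priority. Written with comparisons rather than
 * subtraction so it cannot overflow. */
static int by_priority_desc(const void *a, const void *b)
{
    const component_t *ca = a;
    const component_t *cb = b;
    return (cb->priority > ca->priority) - (cb->priority < ca->priority);
}

/* Sort an array of components so the highest priority comes first. */
static void sort_components(component_t *c, size_t n)
{
    assert(c != NULL || n == 0);
    qsort(c, n, sizeof(*c), by_priority_desc);
}
```

Leaving gaps between priority values (the "headroom" mentioned above) lets a new component slot in between existing ones without renumbering.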
Ralph Castain
509d5d82b0
Add some verbiage requested by Jeff, change the param level to something...?
...
Refs trac:4275
This commit was SVN r30736.
The following Trac tickets were found above:
Ticket 4275 --> https://svn.open-mpi.org/trac/ompi/ticket/4275
2014-02-15 15:11:05 +00:00
Ralph Castain
3f9db36e0d
Make Jeff smile - pretty-up the indentation
...
Refs trac:4267
This commit was SVN r30733.
The following Trac tickets were found above:
Ticket 4267 --> https://svn.open-mpi.org/trac/ompi/ticket/4267
2014-02-14 23:25:48 +00:00
Ralph Castain
91f90058ce
Add missing options and cleanup the code a bit. Default to by-slot ranking if a non-hardware option isn't given. Thanks to Tetsuya Mishima for the assist.
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30725.
2014-02-14 10:23:16 +00:00
Ralph Castain
fd9b301a8b
Check equality instead of bit-mask - thanks to Tetsuya Mishima for reporting it
...
cmr=v1.7.5:reviewer=ompi-gk1.7
This commit was SVN r30722.
2014-02-14 02:34:42 +00:00
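Why an equality test can be needed where a bit-mask test looks natural: when codes are sequential values rather than one-bit flags, a mask test matches unrelated codes. The policy names and values below are hypothetical and only illustrate the class of bug, not the actual ORTE definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy codes: sequential values, NOT one-bit flags. */
enum {
    POLICY_BYSLOT   = 1,
    POLICY_BYNODE   = 2,
    POLICY_BYSOCKET = 3   /* binary 11 shares a bit with BYNODE (binary 10) */
};

/* Buggy: true for any code that has the BYNODE bit set, e.g. BYSOCKET. */
static bool is_bynode_mask(int policy)
{
    return (policy & POLICY_BYNODE) != 0;
}

/* Correct for sequential codes: matches BYNODE only. */
static bool is_bynode_eq(int policy)
{
    return policy == POLICY_BYNODE;
}
```

Here is_bynode_mask(POLICY_BYSOCKET) returns true because 3 & 2 is nonzero, while is_bynode_eq(POLICY_BYSOCKET) returns false.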
Ralph Castain
4e1c07cbf2
If we are given a TCP oob address that doesn't match any active module, it is still possible that we could route to the address if a router is in the system. No harm in trying, so arbitrarily pick the first connection in the active module list and assign the peer to it. If that module can't reach it, we'll follow the usual failover mechanism until finally concluding that nobody can get there.
...
cmr=v1.7.5:reviewer=jsquyres:subject=handle non-matching addresses
This commit was SVN r30719.
2014-02-13 23:37:22 +00:00
Ralph Castain
449cd8f3d7
Update a couple of fields, add a scheduler field to proc_info
...
This commit was SVN r30718.
2014-02-13 23:30:04 +00:00
Ralph Castain
fc6101b508
Handle "localhost" better
...
Refs trac:4263
This commit was SVN r30702.
The following Trac tickets were found above:
Ticket 4263 --> https://svn.open-mpi.org/trac/ompi/ticket/4263
2014-02-12 20:30:39 +00:00
Ralph Castain
a8a9801a0b
Ensure an orted exits with non-zero status if it is unable to send a message. Add more diagnostic messages to the OOB set_addr code
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30701.
2014-02-12 19:44:01 +00:00
Ralph Castain
1473dde6ea
Okay, once again caught by the blasted hwloc inability to cleanly handle caches. Protect the calls to get_depth by first checking to see if it is a "cache", then use a cache-specific function to get the stupid data. Very, very irritating.
...
cmr=v1.7.5:reviewer=jsquyres:subject=treat caches as something different yet again
This commit was SVN r30693.
2014-02-12 01:45:06 +00:00
Ralph Castain
1565816988
Do a little better job of cleaning up the session directory left by mpirun by ensuring we delete the event associated with debugger attachment and unlinking the pipe used for that purpose. Also, we no longer leave "abort" files around, so remove that check when deleting session directory trees
...
cmr=v1.7.5:reviewer=jsquyres:subject=cleanup session directories better
This commit was SVN r30689.
2014-02-11 22:16:17 +00:00
Ralph Castain
fa7b686ccc
Provide better messages when we don't find any included interfaces, and/or don't find any interfaces for use by OOB.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30675.
2014-02-11 19:29:03 +00:00
Ralph Castain
b566cd5e30
Protect against no modifiers
...
Refs trac:4117
This commit was SVN r30672.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:34:37 +00:00
Ralph Castain
6fa34407bf
Handle modifiers to the --map-by dist option
...
Refs trac:4117
This commit was SVN r30671.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:19:05 +00:00
Ralph Castain
4781ea71b6
Correct the handling of various map/bind combinations when pe=N is given. Thanks to Elena Elkina for reporting it.
...
Refs trac:4117
This commit was SVN r30663.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 03:05:26 +00:00
Ralph Castain
707e51d786
Check for --cpus-per-proc earlier, before the correct option can be processed. Thanks to Tetsuya Mishima for reporting it.
...
Refs trac:4117
This commit was SVN r30662.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 02:53:53 +00:00
Ralph Castain
d66d2f5fb3
It is just fine to map by node or slot and bind, so ensure the switch statement includes those options. Thanks to Tetsuya Mishima for pointing it out.
...
Refs trac:4240
This commit was SVN r30661.
The following Trac tickets were found above:
Ticket 4240 --> https://svn.open-mpi.org/trac/ompi/ticket/4240
2014-02-11 02:52:01 +00:00
Ralph Castain
a49e0db8dd
We haven't supported a c++ wrapper for ORTE in quite some time
...
cmr=v1.7.5:reviewer=ompi-gk1.7:subject=remove c++ cruft
This commit was SVN r30653.
2014-02-10 17:16:30 +00:00
Ralph Castain
1a12325094
Rats - need to include bydist in the mapping list
...
Refs trac:4117
This commit was SVN r30649.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-09 16:17:05 +00:00
Ralph Castain
0dc5f50d27
Add a plm component for local-only operation that doesn't require rsh/ssh to be installed. Requested by Fedora packagers for testing purposes.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Add a plm component for local-only operation
This commit was SVN r30645.
2014-02-09 15:53:10 +00:00
Ralph Castain
ca0c806662
Resolve the problem of binding in inverted topologies - check the relative depth of the map and bind objects in the topology, and let that determine whether we bind downward or upwards.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Resolve the problem of binding in inverted topologies
This commit was SVN r30643.
2014-02-09 05:30:17 +00:00
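The depth comparison can be sketched as below, assuming hwloc-style depths where the root is depth 0 and depth grows toward the leaves. The bind_dir_t type and bind_direction function are hypothetical names, not ORTE symbols.

```c
#include <assert.h>

/* Hypothetical result type for the direction decision. */
typedef enum { BIND_DOWNWARD, BIND_UPWARD, BIND_IN_PLACE } bind_dir_t;

/* Depths follow the hwloc convention: root is 0, larger means deeper.
 * If the bind object sits below the map object we search downward from the
 * mapped location; if it sits above (an "inverted" topology relative to the
 * usual expectation) we walk upward instead. */
static bind_dir_t bind_direction(unsigned map_depth, unsigned bind_depth)
{
    if (bind_depth > map_depth) {
        return BIND_DOWNWARD;
    }
    if (bind_depth < map_depth) {
        return BIND_UPWARD;
    }
    return BIND_IN_PLACE;
}
```

Deciding from the depths found in the actual topology, instead of from a fixed ordering of object types, is what lets the same code handle machines where the two levels appear in the opposite order.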
Ralph Castain
0ee38353ba
In case there are stale session directories around, do a purge of the relevant session directory tree when an orted, HNP, or singleton starts. This won't help in the case of direct-launched apps, but it's the best we can do.
...
cmr=v1.7.5:reviewer=jsquyres:subject=purge stale session dirs at startup
This commit was SVN r30642.
2014-02-09 02:10:31 +00:00
Ralph Castain
1d8c061687
Fix a race condition that could result in assert failures during finalize. Ensure we shutdown the orte progress thread prior to finalizing the rml/oob frameworks so that no async operations are executing during destruct of the base-level lists and objects.
...
cmr=v1.7.5:reviewer=jsquyres:subject=fix race condition in finalize
This commit was SVN r30641.
2014-02-08 22:04:19 +00:00
Ralph Castain
5b8e1180cf
Update a test
...
This commit was SVN r30640.
2014-02-08 22:00:12 +00:00
Ralph Castain
a94920276d
Fix singleton MPI_Abort. Singletons no longer immediately start an HNP, but only launch one when they need it for comm_spawn. So there isn't anyone to send the "abort" report to, and thus we just exit after emitting our message.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Fix singleton MPI_Abort
This commit was SVN r30635.
2014-02-08 18:15:07 +00:00
Ralph Castain
bc7cc09749
After a lot of pain, I've managed to resolve the problem of conflicting mapping directives caused by mismatched MCA params - i.e., where someone has one variant of an MCA param (e.g., rmaps_base_mapping_policy) in their default MCA param file, and then specifies another variant (e.g., --npernode) on the command line. I can't fully resolve the problem as there is no way to know precisely what the user meant - we can only guess which param was really intended since the MCA param system
...
can't apply its normal precedence rules.
So...print a big "deprecated" warning for the old params and error out if a conflict is detected. I know that isn't what people really wanted, but it's the best we
can do. If only the old style param is given, then process it after the warning.
Extend the current map-by param to add support for ppr and cpus-per-proc, adding the latter to the list of allowed modifiers using "pe=n" for processing elements/proc. Thus, you can map-by socket:pe=2,oversubscribe to map by socket, binding 2 processing elements/process, with oversubscription allowed. Or you can map-by ppr:2:socket:pe=4 to map two processes to every socket in the allocation, binding each process to 4 processing elements.
For those wondering, a processing element is defined as a hwthread if --use-hwthreads-as-cpus is given, or else as a core.
Refs trac:4117
This commit was SVN r30620.
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-07 21:25:40 +00:00
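To make the directive syntax concrete, here is a small parser sketch for strings of the form [ppr:N:]object[:modifier[,modifier...]], covering only the two modifiers named above (pe=n and oversubscribe). The mapby_t type and parse_mapby function are hypothetical; the real option handling lives in the rmaps base and accepts more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of a map-by directive. */
typedef struct {
    char object[32];     /* e.g. "socket" */
    int  ppr;            /* procs per resource, 0 if not given */
    int  pe;             /* processing elements per proc, 0 if not given */
    bool oversubscribe;
} mapby_t;

/* Parse "[ppr:N:]object[:mod[,mod...]]". Returns 0 on success, -1 on error. */
static int parse_mapby(const char *spec, mapby_t *out)
{
    char buf[128];
    char *p, *mods, *tok;

    assert(spec != NULL && out != NULL);
    if (strlen(spec) >= sizeof(buf)) {
        return -1;
    }
    strcpy(buf, spec);
    memset(out, 0, sizeof(*out));

    p = buf;
    if (strncmp(p, "ppr:", 4) == 0) {          /* optional "ppr:N:" prefix */
        char *end;
        long n = strtol(p + 4, &end, 10);
        if (end == p + 4 || *end != ':' || n <= 0) {
            return -1;
        }
        out->ppr = (int)n;
        p = end + 1;
    }

    mods = strchr(p, ':');                      /* split object from modifiers */
    if (mods != NULL) {
        *mods++ = '\0';
    }
    if (*p == '\0' || strlen(p) >= sizeof(out->object)) {
        return -1;
    }
    strcpy(out->object, p);

    for (tok = (mods != NULL ? strtok(mods, ",") : NULL);
         tok != NULL; tok = strtok(NULL, ",")) {
        if (strncmp(tok, "pe=", 3) == 0) {
            out->pe = atoi(tok + 3);
            if (out->pe <= 0) {
                return -1;
            }
        } else if (strcmp(tok, "oversubscribe") == 0) {
            out->oversubscribe = true;
        } else {
            return -1;                          /* unknown modifier */
        }
    }
    return 0;
}
```

Parsing "socket:pe=2,oversubscribe" gives object "socket", pe 2 and oversubscribe set; parsing "ppr:2:socket:pe=4" gives ppr 2, object "socket" and pe 4.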
Ralph Castain
c617d66d98
Paul Hargrove has pointed out that some big SMP systems (e.g., from SGI) configure Torque differently - instead of listing each node name once/slot in the nodefile, they list the node only once and set an envar to indicate the number of procs/node being allocated. Add an MCA param users can set to indicate we are in such an environment, and then use the envar to set the slots. Error out if the mode flag is given, but (a) we don't find the PBS_PPN envar, or (b) we find a node actually listed more than once in the PBS_Nodefile.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Support SMP mode in Torque
This commit was SVN r30568.
2014-02-05 15:51:17 +00:00
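The two error conditions for SMP mode can be sketched as a single check. The function smp_slots_per_node is hypothetical and takes the node names and the PBS_PPN value as arguments so it can be exercised without a Torque allocation; it is not the ras/tm component code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: given the node names read from the nodefile and the
 * value of the PBS_PPN environment variable (NULL if unset), return the
 * slots to assign to each node, or -1 on error. */
static int smp_slots_per_node(const char *const *nodes, size_t nnodes,
                              const char *ppn_env)
{
    char *end;
    long ppn;

    assert(nodes != NULL || nnodes == 0);

    /* (a) the envar must exist and hold a positive integer */
    if (ppn_env == NULL) {
        return -1;
    }
    ppn = strtol(ppn_env, &end, 10);
    if (end == ppn_env || *end != '\0' || ppn <= 0) {
        return -1;
    }

    /* (b) in SMP mode each node may be listed only once */
    for (size_t i = 0; i < nnodes; i++) {
        for (size_t j = i + 1; j < nnodes; j++) {
            if (strcmp(nodes[i], nodes[j]) == 0) {
                return -1;
            }
        }
    }
    return (int)ppn;
}
```

A caller would pass getenv("PBS_PPN") as the last argument after reading the file named by PBS_NODEFILE.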
Ralph Castain
1326ed704f
Per the RFC discussed here:
...
http://www.open-mpi.org/community/lists/devel/2014/01/13789.php
add support for async modex when requested.
cmr=v1.7.5:reviewer=jsquyres:subject=Add async modex support
This commit was SVN r30565.
2014-02-05 14:39:27 +00:00
Ralph Castain
230336b6a8
Upgrade the security framework to avoid multiple hits against the global security server. Add support for future case where mpirun assigns a global security credential for a given run, though we need to work out how to handle connect-accept from other mpiruns in that case. Remove a bunch of duplicate code in the OOB by consolidating the connection handshake code.
...
Refs trac:4221
This commit was SVN r30554.
The following Trac tickets were found above:
Ticket 4221 --> https://svn.open-mpi.org/trac/ompi/ticket/4221
2014-02-04 14:47:04 +00:00
Adrian Reber
fde1040d2f
Use unique collective ids for the checkpoint/restart code
...
This commit was SVN r30552.
2014-02-04 14:03:05 +00:00
Ralph Castain
5980b7e042
Add a security framework for authenticating connections - we will add LDAP, Kerberos, and Keystone support in the next month. For now, just put a placeholder "basic" module that does the minimum.
...
Wire the security check into ORTE's OOB handshake, and add a "version" check to ensure that both ends are from the same ORTE version. If not, report the mismatch and refuse the connection.
Fixes trac:4171
cmr=v1.7.5:reviewer=jsquyres:subject=Add a security framework for authenticating connections
This commit was SVN r30551.
The following Trac tickets were found above:
Ticket 4171 --> https://svn.open-mpi.org/trac/ompi/ticket/4171
2014-02-04 01:38:45 +00:00
Ralph Castain
e43589ed84
Fix warning - thanks to Paul Hargrove for reporting it
...
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30548.
2014-02-03 23:51:45 +00:00
Ralph Castain
993198cfba
Fix lost message problem - if multiple messages are queued before the connection is formed, we lost all but the first one. Ensure that all messages get properly queued prior to completing the connection
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix lost message problem
This commit was SVN r30516.
2014-01-31 05:30:51 +00:00
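The fix described above amounts to queuing every message sent before the connection completes and draining the whole queue once it does. A minimal sketch, with a hypothetical peer_t type and integer message ids standing in for real OOB messages:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16
#define MAX_SENT    64

/* Hypothetical peer state; the sent[] array stands in for the wire. */
typedef struct {
    bool   connected;
    int    pending[MAX_PENDING]; /* FIFO of ids waiting for the connection */
    size_t npending;
    int    sent[MAX_SENT];       /* ids delivered so far, in order */
    size_t nsent;
} peer_t;

static void deliver(peer_t *p, int id)
{
    assert(p->nsent < MAX_SENT);
    p->sent[p->nsent++] = id;
}

/* Send now if connected; otherwise append to the queue. The bug class being
 * illustrated is keeping only ONE pending slot, which drops later messages. */
static int peer_send(peer_t *p, int id)
{
    assert(p != NULL);
    if (p->connected) {
        deliver(p, id);
        return 0;
    }
    if (p->npending == MAX_PENDING) {
        return -1;               /* queue full */
    }
    p->pending[p->npending++] = id;
    return 0;
}

/* Called when the connection completes: flush every queued message. */
static void peer_connected(peer_t *p)
{
    assert(p != NULL);
    p->connected = true;
    for (size_t i = 0; i < p->npending; i++) {
        deliver(p, p->pending[i]);
    }
    p->npending = 0;
}
```

Messages queued before the connection are delivered in their original order, and later sends go straight through.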
Ralph Castain
2bc9fd30ee
Orcm sends heartbeats to its daemons, but ORTE needs to continue sending them to the HNP
...
This commit was SVN r30514.
2014-01-31 01:56:01 +00:00
Ralph Castain
193cceb483
Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps.
...
Yippee skippee...
This commit was SVN r30513.
2014-01-30 23:50:14 +00:00
Rolf vandeVaart
f7055de78e
Stop listening thread and wait for it to terminate.
...
This commit was SVN r30507.
2014-01-30 20:37:15 +00:00