1
1

334 Коммитов

Автор SHA1 Сообщение Дата
Gilles Gouaillardet
456baeb71b rmaps/base: fix misc memory leaks
as reported by Coverity with CIDs 1196751, 1196754, 1196755 and 1269866
2015-03-02 15:31:11 +09:00
Jeff Squyres
398ae15533 rmaps_base_frame: remove dead code
This was CID 1196641
2015-02-24 15:24:11 -05:00
Howard Pritchard
bf89131f9e add owner files to opa/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Ralph Castain
116fcaff2c Start adding support for cmd line options to orte-submit 2015-02-10 12:13:21 -08:00
Ralph Castain
b757b3f452 Ensure that the #nodes in the job map gets properly updated when using the sequential mapper. Provide some further diagnostic info to help understand the problem when encountered. 2014-12-08 08:03:53 -08:00
Ralph Castain
3f9d9ae8b6 Provide tighter LSF integration by correctly handling scenarios where the user has asked LSF to assign bindings. Fix a couple of typos in lex parser definitions. Tell hostfile parser to ignore binding designations in hostfiles. Add an attribute to indicate that cpusets were provided as physical cpu ids.
Once validated, a version of this will be backported to the v1.8.4 release.
2014-11-30 11:50:31 -08:00
Ralph Castain
ea11e63f59 Per patch from Tetsuya, allow the user to bind-to none when specifying multiple pe's/rank as requested by Reuti. This allows the user to reserve multiple "slots" in the allocation for each process while mapping, but not to bind the process to specific processing elements on the node.
Reviewed by rhc, so RM-approved to go across to v1.8.3

cmr=v1.8.3:reviewer=ompi-gk1.8

This commit was SVN r32701.
2014-09-10 15:52:18 +00:00
Ralph Castain
4207b4c4ad Improve the --bind-to help message to better indicate the default options under various values of np. Remove the warning message if the user doesn't specify a binding policy and we are overloaded
cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32687.
2014-09-08 21:03:51 +00:00
Ralph Castain
024572cb6c Sigh - I promised to remove these deprecation warnings back in June. My apologies to Dave Goodell and others who requested it.
cmr=v1.8.2:reviewer=dgoodell:subject=remove deprecation warnings for pernode, npernode, and npersocket

This commit was SVN r32552.
2014-08-19 19:40:20 +00:00
Gilles Gouaillardet
c3c364a262 check-help-strings cleanup
This commit was SVN r32494.
2014-08-11 03:22:05 +00:00
Ralph Castain
149810f02c Per request from Jeff, slightly modify the show_help message as the precise name of the NUMA-containing packages differs based on OS and distro
cmr=v1.8.2:reviewer=jsquyres:subject=modify show_help message

This commit was SVN r32122.
2014-07-02 14:46:00 +00:00
Ralph Castain
8fca77c3d3 Protect the binding policy setting so it builds when --without-hwloc
Refs trac:4742

This commit was SVN r32085.

The following Trac tickets were found above:
  Ticket 4742 --> https://svn.open-mpi.org/trac/ompi/ticket/4742
2014-06-25 18:13:54 +00:00
Ralph Castain
5f6be06b54 Per request from Gilles and discussion at devel conference, have the --oversubscribe option automatically set both oversubscribe and overload-allowed properties as this is likely what the user intended.
cmr=v1.8.2:reviewer=rhc:subject=automatically set oversub/load

This commit was SVN r32072.
2014-06-24 18:11:39 +00:00
Ralph Castain
9a47e45a09 <laugh> ensure we really compare the things we want to compare
This commit was SVN r32055.
2014-06-19 20:54:25 +00:00
Ralph Castain
e65538e91b Add some defensive programming, fix a typo
This commit was SVN r32054.
2014-06-19 20:52:13 +00:00
Ralph Castain
65275d6326 Add a little more info to the warning message - i.e., that the likely cause of the problem is missing libnumactl and/or libnumactl-devel
cmr=v1.8.2:reviewer=miked:subject=improve memory binding failure message

This commit was SVN r32030.
2014-06-18 19:20:28 +00:00
Ralph Castain
3f04d50cb0 Per the ticket, resolve our handling of overload conditions to provide a more consistent response. If we are overloaded (i.e., attempting to bind more processes to a location than the number of cpus under that location), then we consider the following conditions:
(a) default binding policy is in effect. In this case, we will emit a
warning and default to not binding unless the user provided the
"oversubscribe" or "overload" modifier to the "bind-to" option.

(b) user-specified binding policy is in effect. In this case, we will
error out unless the user provided the "oversubscribe" or "overload"
modifier to the "bind-to" option as we cannot meet the directive.

Either "bind-to" modifier (oversubscribe or overload) will be accepted for
now - in 1.9, we will deprecate the "overload" term in favor of
"oversubscribe".

Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy.

Closes trac:4345

cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions

This commit was SVN r32005.

The following Trac tickets were found above:
  Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345
2014-06-14 15:38:32 +00:00
Ralph Castain
56c3575c0e Can't emit an error for an unrecognized mapping policy modifier as the ppr policy relies on not doing so.
This commit was SVN r31998.
2014-06-13 20:10:09 +00:00
Ralph Castain
3ed282bf44 Per patch from Tetsuya, correct the cpus-per-proc logic so we correctly detect when the user is attempting to bind too low for that option
Refs trac:4702

This commit was SVN r31988.

The following Trac tickets were found above:
  Ticket 4702 --> https://svn.open-mpi.org/trac/ompi/ticket/4702
2014-06-13 16:32:52 +00:00
Ralph Castain
06dbfa3098 Make the cpus-per-proc equivalent a little more intuitive:
* allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc.

* if users specify a pe's/proc but no policy, default to --map-by NUMA to ensure we have access to multiple cpus for the request. This won't guarantee we have access to enough to meet the request, but gives us a chance. In addition, we know that binding a proc to multiple cpus will work best if those cpus are all in the same NUMA, so this provides some degree of optimized behavior.

Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31967.
2014-06-08 20:26:59 +00:00
Ralph Castain
638c24f655 Correct the bind-in-place algorithm to better handle comm_spawn. If the location identified by the mapper is already occupied by procs from another job, then we need to shift either right or left until we find an unoccupied location where we can be bound. If nothing is available, then check for the overload flag (and bind us in the original location if provided), or see if this was the default binding policy instead of one specified by the user - if so, then just don't bind this process.
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31959.
2014-06-06 12:36:14 +00:00
Ralph Castain
f1978fba7c Cleanup a set of typos on the orte_get_attribute call
This commit was SVN r31942.
2014-06-03 20:36:38 +00:00
Ralph Castain
8736a1c138 Per RFC:
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php

Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).

This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Ralph Castain
5602156a1c Use the correct abstraction layer name for the data dirs
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
6545e6e9a8 Add one more check for failed mapping that rarely occurs, but results in a hang when it does
cmr=v1.8.2:reviewer=rhc

This commit was SVN r31598.
2014-05-02 10:35:14 +00:00
Ralph Castain
61d94fcee2 Fix the sequential mapper - it was out-of-sync with the hostfile changes, and we missed the "seq" policy when parsing the --map-by option. Thanks to Bill Chen for reporting it
cmr=v1.8.1:reviewer=jsquyres

This commit was SVN r31333.
2014-04-08 03:38:25 +00:00
Jeff Squyres
82e104719a hwloc/rmaps base: Add missing help message.
Also, add missing ORTE_ERROR_LOG in the other case where this error
message is used (i.e., ORTE_ERROR_LOG was used in the one place, so
let's also use it in the other place).

This commit was SVN r31321.
2014-04-07 15:39:54 +00:00
Ralph Castain
3fdcaeab97 Fix a problem where we need to abort due to a mapping failure, but we are in a managed environment and thus the orteds have not wired up. Thus, if we send the exit message across the routed network, the remote daemons won't have a way to relay the message along - and we won't exit.
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.

cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup

This commit was SVN r31304.
2014-04-02 04:17:55 +00:00
Ralph Castain
390645ac2a Per patch from Tetsuya Mishima, do a nicer job of warning the user that we need to map to a higher level to get the number of requested cpus/rank. Also, change the mapping policy to "byslot" when falling back to that option.
cmr=v1.8:reviewer=rhc

This commit was SVN r31196.
2014-03-24 15:47:29 +00:00
Ralph Castain
081669b440 When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it
cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings

This commit was SVN r30968.
2014-03-10 15:53:07 +00:00
Ralph Castain
50c30d62ca Repair builds without hwloc
cmr=v1.7.5:reviewer=jsquyres

This commit was SVN r30940.
2014-03-05 02:48:15 +00:00
Ralph Castain
0ac97761cc Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named.
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= paramater. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.

In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.

Also cleanup some a couple of issues in the mapping/binding system:

* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed

* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch

* minor cleanup to the warning message when oversubscribed and binding was requested

cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system

This commit was SVN r30909.
2014-03-03 16:46:37 +00:00
Ralph Castain
88b0e0cc6d Allow the user to turn off the oversubscribed-binding warning if overload-allowed has been provided
Refs trac:4317

This commit was SVN r30892.

The following Trac tickets were found above:
  Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 17:55:53 +00:00
Ralph Castain
4a645f0342 Add detection of oversubscription with binding requested - if binding requested to core or hwt, warn and do not bind or else we will hurt performance. Also, if no binding directive was given, turn off the default binding
Refs trac:4317

This commit was SVN r30888.

The following Trac tickets were found above:
  Ticket 4317 --> https://svn.open-mpi.org/trac/ompi/ticket/4317
2014-02-28 16:08:52 +00:00
Ralph Castain
61a21e4f31 Based on Tetsuya's patch, with some changes, correct the case of map-by node where multiple cpus/rank are requested and result in a non-integer match with num slots. Also correct tests for binding policy given to use the proper macro.
Refs trac:4296

This commit was SVN r30857.

The following Trac tickets were found above:
  Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296
2014-02-26 18:12:23 +00:00
Ralph Castain
c8112c1086 Loadbalancing across nodes (i.e., map-by node) wasn't working correctly - the algorithm relied on the nodes being defined in descending order of slots, or the numbe
r of slots remaing to be assigned being only one/node. Regardless, it didn't work for the case where nodes were defined in ascending order of slots.

Tetsuya's proposed patch didn't solve the problem for me, but it did correct the case where cpus/proc > 1. The final patch requires that we loop over the assignment
 algo until all procs are assigned or all nodes are filled - any remaining procs are then handled in the cleanup loop.

cmr=v1.7.5:reviewer=rhc:subject=fix map-by node for different cases

This commit was SVN r30798.
2014-02-22 16:39:41 +00:00
Ralph Castain
91f90058ce Add missing options and cleanup the code a bit. Default to by-slot ranking if a non-hardware option isn't given. Thanks to Tetsuya Mishima for the assist.
cmr=v1.7.5:reviewer=ompi-gk1.7

This commit was SVN r30725.
2014-02-14 10:23:16 +00:00
Ralph Castain
fd9b301a8b Check equality instead of bit-mask - thanks to Tetsuya Mishima for reporting it
cmr=v1.7.5:reviewer=ompi-gk1.7

This commit was SVN r30722.
2014-02-14 02:34:42 +00:00
Ralph Castain
1473dde6ea Okay, once again be caught by the blasted hwloc inability to cleanly handle caches. Protect the calls to get_depth by first checking to see if it is a "cache", then use a cache-specific function to get the stupid data. Very, very irritating.
cmr=v1.7.5:reviewer=jsquyres:subject=treat caches as something different yet again

This commit was SVN r30693.
2014-02-12 01:45:06 +00:00
Ralph Castain
b566cd5e30 Protect against no modifiers
Refs trac:4117

This commit was SVN r30672.

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:34:37 +00:00
Ralph Castain
6fa34407bf Handle modifiers to the --map-by dist option
Refs trac:4117

This commit was SVN r30671.

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 17:19:05 +00:00
Ralph Castain
4781ea71b6 Correct the handling of various map/bind combinations when pe=N is given. Thanks to Elena Elkina for reporting it.
Refs trac:4117

This commit was SVN r30663.

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 03:05:26 +00:00
Ralph Castain
707e51d786 Check for --cpus-per-proc earlier, before the correct option can be processed. Thanks to Tetsuya Mishima for reporting it.
Refs trac:4117

This commit was SVN r30662.

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-11 02:53:53 +00:00
Ralph Castain
d66d2f5fb3 It is just fine to map by node or slot and bind, so ensure the switch statement includes those options. Thanks to Tatsuya Mishima for point it out.
Refs trac:4240

This commit was SVN r30661.

The following Trac tickets were found above:
  Ticket 4240 --> https://svn.open-mpi.org/trac/ompi/ticket/4240
2014-02-11 02:52:01 +00:00
Ralph Castain
1a12325094 Rats - need to include bydist in the mapping list
Refs trac:4117

This commit was SVN r30649.

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-09 16:17:05 +00:00
Ralph Castain
ca0c806662 Resolve the problem of binding in inverted topologies - check the relative depth of the map and bind objects in the topology, and let that determine whether we bind downward or upwards.
cmr=v1.7.5:reviewer=jsquyres:subject=Resolve the problem of binding in inverted topologies

This commit was SVN r30643.
2014-02-09 05:30:17 +00:00
Ralph Castain
bc7cc09749 After a lot of pain, I've managed to resolve the problem of conflicting mapping directives caused by mismatched MCA params - i.e., where someone has one variant of an MCA param (e.g., rmaps_base_mapping_policy) in their default MCA param file, and then specifies another variant (e.g., --npernode) on the command line. I can't fully resolve the problem as there is no way to know precisely what the user meant - we can only guess which param was really intended since the MCA param system
can't apply its normal precedence rules.

So...print a big "deprecated" warning for the old params and error out if a conflict is detected. I know that isn't what people really wanted, but it's the best we
 can do. If only the old style param is given, then process it after the warning.

Extend the current map-by param to add support for ppr and cpus-per-proc, adding the latter to the list of allowed modifiers using "pe=n" for processing elements/proc. Thus, you can map-by socket:pe=2,oversubscribe to map by socket, binding 2 processing elements/process, with oversubscription allowed. Or you can map-by ppr:2:socket:pe=4 to map two processes to every socket in the allocation, binding each process to 4 processing elements.

For those wondering, a processing element is defined as a hwthread if --use-hwthreads-as-cpus is given, or else as a core.

Refs trac:4117

This commit was SVN r30620.

The following Trac tickets were found above:
  Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
2014-02-07 21:25:40 +00:00
Ralph Castain
e43589ed84 Fix warning - thanks to Paul Hargrove for reporting it
cmr=v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r30548.
2014-02-03 23:51:45 +00:00
Ralph Castain
410a3afa7b Fix --without-hwloc operations - must default to map-by slot in that scenario
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30474.
2014-01-29 16:54:05 +00:00
Ralph Castain
42eb0bbe1b Fix --without-hwloc builds
cmr=v1.7.4:reviewer=jsquyres

This commit was SVN r30462.
2014-01-28 17:10:32 +00:00