1
1

586 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
2af677b1cf Ensure that we don't bind-by-default in an oversubscribed condition
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-15 07:58:52 -08:00
Josh Hursey
f6337f9eae Merge pull request #2047 from jjhursey/topic/mixed-host2
orte: !FQDN implementation to use opal_net_isaddr
2016-09-06 13:08:54 -05:00
Ralph Castain
f85dcaee2a Fixes CID 1369067 and CID 1196684
Fixes CID 1369648

    Fixes CID 1372409
2016-09-06 08:43:15 -07:00
Joshua Hursey
fe937d1e82 orte: !FQDN implementation to use opal_net_isaddr
* Switch to use opal_net_isaddr() for checking if a name is an IP
   address - as it is a bit cleaner, and uses common functionality.
2016-09-02 13:31:49 -05:00
Joshua Hursey
d26dd2c20e orte: Expand the application of !orte_keep_fqdn_hostnames
* Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when
   it is set to false.
 * If that parameter is set to false (default) then short hostnames
   (e.g., `node01`) will match with the long hostnames (e.g.,
   `node01.mycluster.org`). This allows a user (or resource manager)
    to mix the use of short and long hostnames.
  - Note that this mechanism does _not_ perform a DNS lookup, but
    instead strips off the FQDN by truncating the hostname string at
    the first `.` character (when not an IP address).
     - By default (`false`) the following is true:
       `node01 == node01.mycluster.org == node01.bogus.com`
       since we use `node01` as the hostname.
2016-08-26 16:09:04 -05:00
rhc54
19b0f4db9f Merge pull request #1995 from rhc54/topic/pe-per-rank
Change the behavior of cpus-per-rank.
2016-08-25 14:38:12 -05:00
Ralph Castain
440eae90ec Correct the binding algorithm to decouple it from oversubscribe.
Oversubscribe stipulates that we allow more procs on the node than assigned slots - it has nothing to do with the number of available pe's. Let overload directives handle the pe situation.
2016-08-24 21:17:22 -07:00
Ralph Castain
7de4d6922b Change the behavior of cpus-per-rank. We previously counted each cpu against the #slots. However, IBM has pointed out that "slot" is equated to the number of processes allowed to run on each node, and not the number of cpus on the node. This has been a continuing source of confusion, so make the distinction a "hard" one.
Each process occupies a "slot". We automatically set #slots = #cpus if nothing else is told to us. If you want to run more procs and slots, you must tell us to allow oversubscription.

A process can utilize multiple pe's if that option is given. If you try to bind more than one proc to a given pe, then we will error out unless you tell us to allow overloading.
2016-08-22 15:54:41 -07:00
Ralph Castain
ddd0d05de3 Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assing them. 2016-07-12 09:27:02 -07:00
Gilles Gouaillardet
5f565dfec3 configury: clean the flex generated .c files 2016-06-01 11:13:31 +09:00
Ralph Castain
30aaf785a8 Fix the dist mapper option 2016-05-23 23:20:33 -07:00
Ralph Castain
58dd41facf Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>.
Remove debug, let orterun send terminate cmd to DVM

Recover the DVM support
2016-05-06 13:14:03 -07:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Ralph Castain
e6ad1ad621 Up-port of change for 2.x: if user directs oversubscribe, then do not bind as we will otherwise overload resources 2016-04-28 13:21:10 -07:00
Ralph Castain
1fa236b26c Ensure that we exit with a non-zero status when oversubscribe fails 2016-04-14 05:51:10 -07:00
Ralph Castain
437f5b4289 Fix map-by node and do-not-launch 2016-04-13 09:21:19 -07:00
Ralph Castain
503e1274a9 Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the #found processing elements
Ensure the returned exit status is non-zero if we fail to map

If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported.

If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules.

If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect.

With that one correction, this now passes all the given use-cases on that spreadsheet.

Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then no-np shall fill things up.

Fixes #1344
2016-03-29 11:21:57 -07:00
Ralph Castain
cdd3dc99ca Correct the binding for the --map-by node case - we should still use our default binding algorithms 2016-03-23 09:55:24 -07:00
Ralph Castain
6e68d758b9 Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files).
ck

Remove stale debug

Fix a segfault if no subscribers are present
2016-02-18 16:30:37 -08:00
Jeff Squyres
60ffe713b8 common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-01-16 03:53:14 -08:00
Gilles Gouaillardet
4c43fb2a50 orte_rmaps_base_map_job: set OPAL_BIND_ALLOW_OVERLOAD when needed 2016-01-13 17:13:36 +09:00
Ralph Castain
f53d3c7a18 Silence warning 2015-12-30 10:16:58 -08:00
Gilles Gouaillardet
352b05a552 rmaps: warn if oversubscribing when manually setting the number of hosts
This is a port of the v1.10 series one-off open-mpi/ompi-release@8c5ce45ab6
2015-12-28 10:38:57 +09:00
Ralph Castain
7cc5879bdd Fix the default slot mapping in rank file mapper 2015-12-21 09:47:27 -08:00
Jeff Squyres
3e308f41f7 rmaps base help: update binding error messages
Due to user confusion, update the show-help messages displayed when
processor and/or memory binding fails.  Thanks to Dave Love
(@loveshack) for the initial suggestion.

Fixes open-mpi/ompi#1087
2015-12-14 13:02:41 -05:00
Ralph Castain
24419b6523 Fix relative node syntax for dash-host option 2015-10-31 19:00:46 -07:00
Igor Ivanov
489f27f8e9 orte/mca/rmaps: Improve orte_rmaps_dist_device help message
See: https://github.com/open-mpi/ompi/issues/953
2015-10-09 17:58:07 +03:00
Gilles Gouaillardet
0445484820 ras: remove orte_ras_proc_t and associated code 2015-09-30 08:52:52 +09:00
Gilles Gouaillardet
7cc14ee6f6 orte/rmaps: silence warning 2015-09-29 16:05:52 +09:00
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
Ralph Castain
ed93154e43 Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line. 2015-07-07 12:52:16 -07:00
Ralph Castain
7455802a36 Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy 2015-07-07 10:13:10 -07:00
Nathan Hjelm
ee36d813dc Merge pull request #657 from hjelmn/c99
more c99 updates
2015-06-25 11:21:09 -06:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Howard Pritchard
e49a37c034 ownership: update ownership files
per discussions at OMPI devel workshop

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-25 10:04:42 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
869b2891c4 When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary 2015-06-17 09:20:08 -07:00
Gilles Gouaillardet
b72e9288bc rmaps: fix a misc memory leak
as reported by Coverity with CID 1269887
2015-06-17 11:17:55 +09:00
Ralph Castain
ff92781ec4 Replace hwloc191 with hwloc1110
Fix hwloc compile. Ignore LAMA mapper due to deprecated hwloc functions
2015-06-13 10:11:45 -07:00
Gilles Gouaillardet
2e384a3b65 initialize common symbols from orte
A few uninitialized common symbols are remaining (generated by flex) :
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
0bb73645f0 Silence Coverity warning 2015-04-30 20:49:28 -07:00
Ralph Castain
7d1980ba83 Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param. 2015-04-30 20:35:23 -07:00
Ralph Castain
e26e7ad736 Better support automated tests for map, rank, and bind options 2015-04-30 14:01:13 -07:00
Ralph Castain
9104e81958 When --map-by node, we should be unbound. Also remove dead code due to copy/paste error. 2015-04-23 20:35:54 -07:00
Ralph Castain
5003be5c5c If the user specifies a --map-by <foo> option, then default to bind-to <foo> unless they specify a bind-to option. If they map-by slot/node, then use the default policy based on num_procs. 2015-04-23 13:30:21 -07:00
Nathan Hjelm
45e053dbce orte: use C99 subobject naming for component initialization
This commit helps future-proof orte components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
Nathan Hjelm
3436f2917d Merge pull request #449 from hjelmn/mca_base_update
mca/base update
2015-04-16 08:41:48 -06:00
Ralph Castain
9c6d452d6b If we are using HT cpus and have <= 2 procs, then map-by hwthread by default 2015-04-11 21:18:05 -07:00
Ralph Castain
033418f62a Correct a typo that reversed the default binding pattern. Ensure we default bind to hwthread if user specified --use-hwthread-cpus if nprocs <= 2, and bind to hwthread if told to do so. 2015-04-10 15:58:35 -07:00
Elena
1e913c76c4 changed mindist mapping policy specifier from map-bt dist:device,modifiers to --map-by dist:modifiers -mca rmaps_dist_device device 2015-04-01 15:07:35 +03:00