1
1

4411 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
2da5651869 Restore orted hnp_uri cmd line option
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-18 13:24:03 -08:00
Ralph Castain
e56ee1e06a Remove the remaining cruft from dual oob transport
* When we moved to allowing dual rml/oob transports, we added a bunch of
stuff that is no longer needed. Remove it so as to simplify the
messaging system.

* Fix the routed/radix component so it correctly returns the parent's
vpid

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:12:31 -08:00
Ralph Castain
01e9aca40f Add topology support for hetero systems
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:25 -08:00
Gilles Gouaillardet
88ac05fca6 misc fixes
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-08 11:11:25 -08:00
Ralph Castain
125d236173 Move from the use of regex to compression
We've been fighting the battle of trying to create a regex generator and
parser that can handle arbitrary hostname schemes - without long-term
success. The worst of it is that there is no way of checking to see if
the computed regex is correct short of parsing it and doing a
character-by-character comparison with the original string. Ugh...there
has to be a better solution.

One option is to investigate using 3rd-party regex libraries as
those are coming from communities whose sole focus is resolving that
problem. However, someone would need to spend the time to investigate
it, and we'd have to find a license-friendly implementation.

Another option is to quit beating our heads against the wall and just
compress the information. It won't be as much of a reduction, but we
also won't keep hitting scenarios where things break. In this case, it
seems that "perfection" is definitely the enemy of "good enough".

This PR implements the compression option while retaining the
possibility of people adding regex-generating components. The
compression code used in ORTE is consolidated into the opal/compress
framework. That framework currently held bzip and gzip components for
use in compressing checkpoint files - since we no longer support C/R, I
have .opal_ignore'd those components.

However, I have left the original framework APIs alone in case someone
ever decides to redo C/R. The APIs of interest here are added to the
framework - specifically, the "compress_block" and "decompress_block"
functions. I then moved the ORTE zlib compression code into a new
component in this framework.

Unfortunately, the framework currently is a single-select one - i.e.,
only one active component at a time. Since I .opal_ignore'd the other
two and made the priority of zlib high, this isn't a problem. However,
if someone wants to re-enable bzip/gzip or add another component, they
might need to transition opal/compress to a multi-select framework.

Included changes:

* Consolidate the compression code into the opal/compress framework

* Move the ORTE zlib compression code into a new opal/compress/zlib
  component

* Ignore the bzip and gzip components in opal/compress framework

* Add a "compress_base_limit" MCA param to set the threshold above which
  we compress data - defaults to 4096 bytes

* Delete stale brucks and rcd components from orte/grpcomm framework

* Delete the orte/regx framework

* Update the launch system to use opal/compress instead of string regex

* Provide a default module if no zlib is available

* Fix some misc multi-node issues

* Properly generate the nidmap in response to a "connection warmup"
  message so the remote daemon knows the children it needs to launch.

* Remove stale references to orte_node_regex

* opal_byte_object_t's are not OPAL objects - properly release allocated
  memory.

* Set the topology

* Currently only handling homogeneous case

* Update the compress framework files to conform

* Consolidate open/close into one "frame" file. Ensure we open/close the
  framework

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:14 -08:00
Ralph Castain
fcbc7ea298
Merge pull request #6306 from karasevb/regx_host_ordering_fix
regex: fixed host ordering for different prefixes
2019-02-08 11:09:55 -08:00
Ralph Castain
8794077520 Remove stale rml/ofi component
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-30 12:41:50 -08:00
Boris Karasev
46e38b9193 regx: fixed the order of hosts for ranges with different prefixes
Example:
For the list of hosts `a01,b00,a00` a regex is generated:
`a[2:1.0],b[2:0]`, where `a`-hosts prefixes moved to the begining,
it breaks the hosts ordering.
This commit fixes regex for that case to `a[2:1],b[2:0],a[2:0]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 15:06:30 +06:00
Boris Karasev
1967e41a71 regx/reverse: fixed adding an empty range for no numerical hostnames
Example:
For the nodelist `jjss,jjss0000001,jjss0000003,jjss0000002` a regular
expression was `jjss[0:0],jjss[7:1,3,2]` that led to incorrect unpacking
the first host as `jjs0`. This commit fixes an adding empty range for
not numeric hostnames. Here is the fixed regex for this exapmle:
`jjss,jjss[7:1,3,2]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 09:41:00 +06:00
Jason Williams
98d81a5f7a Adding changes for issue #6303 for branch master.
Signed-off-by: Jason Williams <uberlinuxguy@gmail.com>
2019-01-26 10:49:47 -05:00
Howard Pritchard
b46e15535a orte: shutdown
be more careful about closing framewworks as part of
orte_finalize.  Owing to recent restructuring in opal to handle
finalize in a more general fashion, the missing framework
closes were causing meltdowns as the mca vars subsystem
was cleaning itself up.

This problem was recently reported by Siegmar:

https://www.mail-archive.com/users@lists.open-mpi.org//msg32946.html

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-01-04 11:04:13 -07:00
Ralph Castain
b19e5edf76 Correct parsing of ppr directives
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2019-01-02 09:03:13 -08:00
Jeff Squyres
f96c04244d odls_base_default_fns.c: put the free() in the right place
Fixes CID 1441826.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-12-22 06:40:05 -08:00
Ralph Castain
d728380741 If job is fully described, there will be no ppn string to unpack
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-12-17 16:13:55 -08:00
Ralph Castain
c86fede9df Fix typo for rmaps_base_oversubscribe
Causes the MCA param to be ignored, while the cmd line option still
works.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-29 07:27:46 -08:00
KAWASHIMA Takahiro
8e7d874e14 ess/pmi: Fix --enable-timing compilation error
This commit fixes an compilation error when configured
with `--enable-timing`.

Procedures in the function `orte_ess_base_app_setup`
in `orte/mca/ess/base/ess_base_std_app.c` are moved
to `orte/mca/ess/pmi/ess_pmi_module.c`
and `orte/mca/ess/singleton/ess_singleton_module.c`
in the recent commit 57f6b94fa5.

In `ess_pmi_module.c`, the first argument of the
`OPAL_TIMING_ENV_NEXT` macro should have been adapted
to the destination function but was not.

In `ess_singleton_module.c`, `OPAL_TIMING_ENV_INIT`
was not used in the destination function originally.
So `OPAL_TIMING_ENV_NEXT` cannot be used in the function.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-11-12 16:10:48 +09:00
Jeff Squyres
e9bf318dcb orte-rmaps-base: slightly amend help message
Follow on to 430c659908: clarify the help message and fix one typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-11-08 14:21:47 -08:00
Jeff Squyres
430c659908 orte-rmaps-base: update out-of-slots show_help message
Update the show_help message for when there are not enough slots to
run an application.

Also, remove a bunch of copies of this message in various show_help
text files that aren't used/referred to anywhere in the code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-11-08 15:02:57 -05:00
Aurélien Bouteiller
43bd232fd0
Resolve a recursive destruct on the iof proct in finalize
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-10-31 16:38:42 -04:00
Aurelien Bouteiller
348bf8e13f
Prevent errmgr invokation from crashing in finalize
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-10-31 16:28:04 -04:00
Aurélien Bouteiller
2820aef551
Correctly propagate the oversubscribe flag to the spawnees
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-10-23 23:02:36 -04:00
Ralph Castain
1bd772e8eb Remove the stale orte-dvm code
Users should migrate to https://github.com/pmix/prrte

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-17 15:11:38 -07:00
Ralph Castain
647a760b7e Ensure SIGCHLD is unblocked
Thanks to @hjelmn for debugging it and providing the patch

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
2018-10-15 21:03:17 -07:00
Jeff Squyres
a85bad37df orte: strncpy() -> opal_string_copy()
Fairly straightforward conversion of strncpy() calls to
opal_string_copy().

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:10:20 -07:00
Ralph Castain
f5a6b7f1e9 Fix -H operations for multi-app case
Correctly aggregate slots across -H entries from each app. Take into
account any -H entry when computing nprocs when no value was given.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-11 09:30:01 -07:00
Gilles Gouaillardet
1ef45b7f1d plm/tm: add a missing include file
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-09 11:08:36 +09:00
Ralph H Castain
fc81d0d519 Replace asprintf with opal_asprintf
Silence the flood of warnings from ORTE

Signed-off-by: Ralph H Castain <rhc@open-mpi.org>
2018-10-06 19:32:37 +00:00
Ralph H Castain
51acbf738e Fix map-by node for comm_spawn
Do not reorder the available host list as this causes the head node process assignment to differ from those computed on the other nodes

Signed-off-by: Ralph H Castain <rhc@open-mpi.org>
2018-10-06 15:58:45 +00:00
Ralph Castain
44afb59a01
Merge pull request #5838 from rhc54/topic/ev
Correctly notify upon process failure
2018-10-04 04:56:05 -07:00
Ralph Castain
86702b71bc Correctly notify upon process failure
We only need to pass a custom range if the target is a single process.
Otherwise, we let the range be "session".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 20:19:42 -07:00
Ralph Castain
57f6b94fa5 Cleanup race condition in finalize
See https://github.com/open-mpi/ompi/issues/5798#issuecomment-426545893
for a lengthy explanation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 09:42:59 -07:00
Ralph Castain
cfdd08d309 Remove stale ORTE code
Functionality moved to PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-02 11:55:36 -07:00
Jeff Squyres
6bb356ab87 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-26 12:15:21 -07:00
Ralph Castain
45f23ca5c9 Update mapping system
Correctly transfer job-level mapping directives for dynamically spawned
jobs to the mapping system.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-26 10:00:09 -07:00
Ralph Castain
7facb3f3e9 Pickup and deploy network-specific envars
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 09:24:19 -07:00
Ralph Castain
f18954d2d5 Update ORTE to allocate network resources
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 08:54:39 -07:00
Gilles Gouaillardet
893270caee orte: send error messages to stderr.
When a job terminates normally but with a non zero exit code,
display the error message to stderr.

Thanks Emre Brookes for the bug report.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-09-12 17:03:55 +09:00
Ralph Castain
bc1d13ffbe Remove the orte_enable_instant_on MCA param
We have adequate protection to ensure that we only utilize the PMIx
features related to "instant on" when they are available, so this param
is no longer required and causes confusion.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-10 09:20:26 -07:00
Gilles Gouaillardet
d234caef74 ess/hnp: plug a memory leak in rte_finalize()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:17 +09:00
Nathan Hjelm
98172163e6 odls/alps: resolve hang when launching with mpirun on Crays
This commit removes some code that protected the odls/alps component
from closing alps file descriptors. For some unknown reason leaving
these file descriptors open causes can cause an orted to hang when
launching apps.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-28 16:24:56 -06:00
Boris Karasev
beb0697f24 Fixed copyrights of prev commit.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-27 09:50:11 +03:00
Boris Karasev
e5291ccc34 Fixed the NUMA obj detection for hwloc ver >= 2.0.0
Since version hwloc 2.0.0 has a new organization of NUMA nodes on the
topology tree. This commit adds the detection of local NUMA object for
hwloc => 2.0.0, which fixes the procs bindings policy for rmaps mindist
component.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-24 19:11:52 +03:00
Boris Karasev
57683366ca pmix: added check for pmix fence status
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-06 15:01:57 +06:00
Ralph Castain
bcdb1f45ac Fix the multiple pe/proc option
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.

Thanks to @tonyreina for the error report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-25 18:47:39 -07:00
Ralph Castain
6b6e63a346 Control inheritance of launch directives by child jobs
Do not have child jobs inherit launch directives unless requested to do so. This affects the map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, and cpus-per-rank directives. Values provided in the spawn call always take precedence - if a particular value isn't specified, then the ORTE defaults will be used if inheritance is not requested, and the values specified by MCA param will be used if inheritance is set.

Always inherit oversubscribe for now as otherwise MTT will break

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-10 15:12:05 -07:00
Ralph Castain
c6ee8d2f5c
Merge pull request #5300 from hjelmn/goodbye_oobud
oob/ud: remove as it has bitrotted
2018-07-09 12:40:04 -07:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00
Ralph Castain
3b2390e5d5 Silence coverity warnings, remove/ignore build product
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-25 08:01:28 -07:00
Jeff Squyres
4603852740 orterun: use consistent CLI option name for --bind-to
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different).  For example:

    mpirun --bind-to cpu-list:ordered ...

Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.

Also add some minor updates to the orterun.1in man page for
clarification.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-21 08:22:00 -07:00
Ralph Castain
d2838139e4 Update man and help output for new binding option
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-21 06:36:11 -07:00