Commit graph

5854 commits

Author SHA1 Message Date
Ralph Castain
2da5651869 Restore orted hnp_uri cmd line option
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-18 13:24:03 -08:00
Ralph Castain
e56ee1e06a Remove the remaining cruft from dual oob transport
* When we moved to allowing dual rml/oob transports, we added a bunch of
stuff that is no longer needed. Remove it so as to simplify the
messaging system.

* Fix the routed/radix component so it correctly returns the parent's
vpid

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:12:31 -08:00
Gilles Gouaillardet
b80210c36a orte/util: strdup() in orte_util_decode_nidmap() since opal_argv_free() will free()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-08 11:11:25 -08:00
Gilles Gouaillardet
78152aec85 orte/nidmap: do not use compressed when uninitialized
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-08 11:11:25 -08:00
Ralph Castain
1ee6c185f7 Remove stale code
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:25 -08:00
Ralph Castain
01e9aca40f Add topology support for hetero systems
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:25 -08:00
Gilles Gouaillardet
88ac05fca6 misc fixes
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-08 11:11:25 -08:00
Ralph Castain
125d236173 Move from the use of regex to compression
We've been fighting the battle of trying to create a regex generator and
parser that can handle arbitrary hostname schemes - without long-term
success. The worst of it is that there is no way of checking to see if
the computed regex is correct short of parsing it and doing a
character-by-character comparison with the original string. Ugh...there
has to be a better solution.

One option is to investigate using 3rd-party regex libraries as
those are coming from communities whose sole focus is resolving that
problem. However, someone would need to spend the time to investigate
it, and we'd have to find a license-friendly implementation.

Another option is to quit beating our heads against the wall and just
compress the information. It won't be as much of a reduction, but we
also won't keep hitting scenarios where things break. In this case, it
seems that "perfection" is definitely the enemy of "good enough".

This PR implements the compression option while retaining the
possibility of people adding regex-generating components. The
compression code used in ORTE is consolidated into the opal/compress
framework. That framework currently held bzip and gzip components for
use in compressing checkpoint files - since we no longer support C/R, I
have .opal_ignore'd those components.

However, I have left the original framework APIs alone in case someone
ever decides to redo C/R. The APIs of interest here are added to the
framework - specifically, the "compress_block" and "decompress_block"
functions. I then moved the ORTE zlib compression code into a new
component in this framework.

Unfortunately, the framework currently is a single-select one - i.e.,
only one active component at a time. Since I .opal_ignore'd the other
two and made the priority of zlib high, this isn't a problem. However,
if someone wants to re-enable bzip/gzip or add another component, they
might need to transition opal/compress to a multi-select framework.

Included changes:

* Consolidate the compression code into the opal/compress framework

* Move the ORTE zlib compression code into a new opal/compress/zlib
  component

* Ignore the bzip and gzip components in opal/compress framework

* Add a "compress_base_limit" MCA param to set the threshold above which
  we compress data - defaults to 4096 bytes

* Delete stale brucks and rcd components from orte/grpcomm framework

* Delete the orte/regx framework

* Update the launch system to use opal/compress instead of string regex

* Provide a default module if no zlib is available

* Fix some misc multi-node issues

* Properly generate the nidmap in response to a "connection warmup"
  message so the remote daemon knows the children it needs to launch.

* Remove stale references to orte_node_regex

* opal_byte_object_t's are not OPAL objects - properly release allocated
  memory.

* Set the topology

* Currently only handling homogeneous case

* Update the compress framework files to conform

* Consolidate open/close into one "frame" file. Ensure we open/close the
  framework

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-02-08 11:11:14 -08:00
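
The threshold behaviour described in the commit above can be sketched in Python. This is an illustration only, not the opal/compress code: the function names echo the "compress_block" and "decompress_block" names from the message, but the signatures, the returned tuple, and the use of Python's zlib module are assumptions. Only the 4096-byte default comes from the commit.

```python
import zlib

# Default threshold taken from the commit message ("compress_base_limit"), in bytes.
COMPRESS_BASE_LIMIT = 4096


def compress_block(data: bytes, limit: int = COMPRESS_BASE_LIMIT):
    """Return (compressed, payload). Hypothetical signature for illustration.

    Data at or below the limit is passed through untouched; larger data
    is compressed with zlib.
    """
    if len(data) <= limit:
        return False, data
    return True, zlib.compress(data)


def decompress_block(compressed: bool, payload: bytes) -> bytes:
    """Undo compress_block using the flag it returned."""
    return zlib.decompress(payload) if compressed else payload


if __name__ == "__main__":
    # A made-up node list, large enough to cross the threshold.
    nidmap = b",".join(b"node%05d" % i for i in range(2000))
    flag, payload = compress_block(nidmap)
    print(len(nidmap), len(payload), flag)
    assert decompress_block(flag, payload) == nidmap
```

Carrying an explicit flag alongside the payload is one simple way for the receiver to know whether to decompress; the real framework may signal this differently.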
Ralph Castain
fcbc7ea298
Merge pull request #6306 from karasevb/regx_host_ordering_fix
regex: fixed host ordering for different prefixes
2019-02-08 11:09:55 -08:00
Ralph Castain
8794077520 Remove stale rml/ofi component
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-30 12:41:50 -08:00
Boris Karasev
46e38b9193 regx: fixed the order of hosts for ranges with different prefixes
Example:
For the host list `a01,b00,a00`, the generated regex was
`a[2:1.0],b[2:0]`: the `a`-prefixed hosts were moved to the beginning,
which breaks the host ordering.
This commit fixes the regex for that case to `a[2:1],b[2:0],a[2:0]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 15:06:30 +06:00
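
The ordering fix above can be illustrated with a small Python encoder. It is a sketch under assumptions, not the regx component: it assumes the notation is `prefix[width:comma-separated values]`, as the examples in the commit message suggest, and the function name `encode_hosts` is hypothetical. Start-end ranges are not handled. The key point is that only adjacent hosts with the same prefix and digit width are merged, so the original order survives.

```python
import re


def encode_hosts(hosts):
    """Encode a host list as prefix[width:values] groups, preserving order."""
    groups = []  # each entry: [prefix, width or None, [numeric values]]
    for host in hosts:
        m = re.fullmatch(r"(.*?)(\d+)", host)
        if m is None:
            # No numeric suffix: emit the name as-is, with no range.
            groups.append([host, None, []])
            continue
        prefix, digits = m.group(1), m.group(2)
        last = groups[-1] if groups else None
        if last and last[0] == prefix and last[1] == len(digits):
            # Same prefix and width as the previous host: extend that group.
            last[2].append(int(digits))
        else:
            # Otherwise start a new group, even if the prefix was seen earlier.
            groups.append([prefix, len(digits), [int(digits)]])

    parts = []
    for prefix, width, values in groups:
        if width is None:
            parts.append(prefix)
        else:
            parts.append(f"{prefix}[{width}:{','.join(map(str, values))}]")
    return ",".join(parts)


if __name__ == "__main__":
    print(encode_hosts(["a01", "b00", "a00"]))
```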
Boris Karasev
1967e41a71 regx/reverse: fixed adding an empty range for no numerical hostnames
Example:
For the nodelist `jjss,jjss0000001,jjss0000003,jjss0000002` the regular
expression was `jjss[0:0],jjss[7:1,3,2]`, which led to the first host
being unpacked incorrectly as `jjs0`. This commit stops adding an empty
range for non-numeric hostnames. The fixed regex for this example is:
`jjss,jjss[7:1,3,2]`

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 09:41:00 +06:00
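
To show how the fixed expression from the commit above expands back into hostnames, here is a minimal Python decoder. It is a sketch that assumes the `prefix[width:comma-separated values]` notation seen in the commit's examples; `decode_hosts` is a hypothetical name, start-end ranges are not handled, and this is not the regx parser itself.

```python
import re

# A token is a bare name, optionally followed by [width:v1,v2,...].
_TOKEN = re.compile(r"([^\[\],]+)(?:\[(\d+):([^\]]*)\])?")


def decode_hosts(expr):
    """Expand prefix[width:values] groups into a list of hostnames."""
    hosts = []
    for m in _TOKEN.finditer(expr):
        prefix, width, values = m.groups()
        if width is None:
            # Non-numeric hostname: no range to expand.
            hosts.append(prefix)
            continue
        for value in values.split(","):
            # Zero-pad each value to the declared digit width.
            hosts.append(f"{prefix}{int(value):0{int(width)}d}")
    return hosts


if __name__ == "__main__":
    print(decode_hosts("jjss,jjss[7:1,3,2]"))
```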
Boris Karasev
d1ad90f47e regx/test: update regex test
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2019-01-30 09:40:59 +06:00
Jason Williams
98d81a5f7a Adding changes for issue #6303 for branch master.
Signed-off-by: Jason Williams <uberlinuxguy@gmail.com>
2019-01-26 10:49:47 -05:00
Howard Pritchard
b46e15535a orte: shutdown
Be more careful about closing frameworks as part of
orte_finalize.  Owing to recent restructuring in opal to handle
finalize in a more general fashion, the missing framework
closes were causing meltdowns as the mca vars subsystem
was cleaning itself up.

This problem was recently reported by Siegmar:

https://www.mail-archive.com/users@lists.open-mpi.org//msg32946.html

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-01-04 11:04:13 -07:00
Ralph Castain
b19e5edf76 Correct parsing of ppr directives
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2019-01-02 09:03:13 -08:00
Jeff Squyres
f96c04244d odls_base_default_fns.c: put the free() in the right place
Fixes CID 1441826.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-12-22 06:40:05 -08:00
Ralph Castain
d728380741 If job is fully described, there will be no ppn string to unpack
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-12-17 16:13:55 -08:00
Gilles Gouaillardet
a152aa215e cleanup: remove the unused (and unexpanded) {ORTE,OMPI}_WANT_REPO_REV macro
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-06 13:13:13 +09:00
Ralph Castain
c86fede9df Fix typo for rmaps_base_oversubscribe
Causes the MCA param to be ignored, while the cmd line option still
works.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-29 07:27:46 -08:00
Ralph Castain
f609542bbf Implement process set name support
Add the --pset option for app_contexts so the user can provide a string
name for each app_context. Use the new PMIx pset attribute to store the
names in the PMIx local storage for retrieval

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-27 11:07:58 -08:00
Jeff Squyres
dbe064af97
Merge pull request #5653 from bmwiedemann/userhost
Allow to override build user and host
2018-11-26 17:48:37 -05:00
KAWASHIMA Takahiro
8e7d874e14 ess/pmi: Fix --enable-timing compilation error
This commit fixes a compilation error when configured
with `--enable-timing`.

Procedures in the function `orte_ess_base_app_setup`
in `orte/mca/ess/base/ess_base_std_app.c` are moved
to `orte/mca/ess/pmi/ess_pmi_module.c`
and `orte/mca/ess/singleton/ess_singleton_module.c`
in the recent commit 57f6b94fa5.

In `ess_pmi_module.c`, the first argument of the
`OPAL_TIMING_ENV_NEXT` macro should have been adapted
to the destination function but was not.

In `ess_singleton_module.c`, `OPAL_TIMING_ENV_INIT`
was not used in the destination function originally.
So `OPAL_TIMING_ENV_NEXT` cannot be used in the function.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-11-12 16:10:48 +09:00
Jeff Squyres
e9bf318dcb orte-rmaps-base: slightly amend help message
Follow on to 430c659908: clarify the help message and fix one typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-11-08 14:21:47 -08:00
Jeff Squyres
430c659908 orte-rmaps-base: update out-of-slots show_help message
Update the show_help message for when there are not enough slots to
run an application.

Also, remove a bunch of copies of this message in various show_help
text files that aren't used/referred to anywhere in the code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-11-08 15:02:57 -05:00
Gilles Gouaillardet
72eb53e064 test: remove obsolete tests from orte/test/mpi
Those tests were likely built on a previous Open MPI version
and cannot even build.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-11-07 13:23:40 +09:00
Aurélien Bouteiller
43bd232fd0
Resolve a recursive destruct on the iof proct in finalize
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-10-31 16:38:42 -04:00
Aurelien Bouteiller
348bf8e13f
Prevent errmgr invocation from crashing in finalize
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-10-31 16:28:04 -04:00
Ralph Castain
05ac8fa71c Remove stale defunct tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-30 08:48:16 -07:00
Ralph Castain
2cb271716b Provide deprecation warning of MPIR debugger
If we detect that we are being debugged by an MPIR-based debugger, then
print a warning that OMPI's MPIR support has been deprecated and will be
removed in a subsequent release.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-25 07:59:07 -07:00
Aurélien Bouteiller
2820aef551
Correctly propagate the oversubscribe flag to the spawnees
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-10-23 23:02:36 -04:00
Bernhard M. Wiedemann
bc23993dea Allow to override build user and host
using the standard $USER and $HOSTNAME environment variables
to make reproducible builds possible.
See https://reproducible-builds.org/ for why this is good.

This helps improve issue #3759

Signed-off-by: Bernhard M. Wiedemann <bwiedemann@suse.de>
2018-10-20 09:27:00 -04:00
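
The idea in the commit above, letting `$USER` and `$HOSTNAME` override the detected build identity, can be sketched in Python. This is a hypothetical illustration rather than the actual build-system change: `build_identity` and its fallback calls are assumptions.

```python
import getpass
import os
import socket


def build_identity(env=None):
    """Return (user, host) for build metadata.

    Values from the environment win, so a reproducible build can pin
    them; detection is only used when a variable is unset or empty.
    """
    env = os.environ if env is None else env
    user = env.get("USER") or getpass.getuser()
    host = env.get("HOSTNAME") or socket.gethostname()
    return user, host


if __name__ == "__main__":
    # Pinning both values makes the recorded identity independent of the machine.
    print(build_identity({"USER": "builder", "HOSTNAME": "buildhost"}))
```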
Ralph Castain
1bd772e8eb Remove the stale orte-dvm code
Users should migrate to https://github.com/pmix/prrte

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-17 15:11:38 -07:00
Ralph Castain
647a760b7e Ensure SIGCHLD is unblocked
Thanks to @hjelmn for debugging it and providing the patch

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
2018-10-15 21:03:17 -07:00
Jeff Squyres
a85bad37df orte: strncpy() -> opal_string_copy()
Fairly straightforward conversion of strncpy() calls to
opal_string_copy().

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:10:20 -07:00
Ralph Castain
f5a6b7f1e9 Fix -H operations for multi-app case
Correctly aggregate slots across -H entries from each app. Take into
account any -H entry when computing nprocs when no value was given.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-11 09:30:01 -07:00
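
A rough Python sketch of the slot aggregation described in the commit above follows. It assumes each app contributes one comma-separated `-H` value, that every mention of a host counts as one slot, and that a `host:N` entry counts as N slots. The function name is hypothetical and this is not the ORTE implementation.

```python
from collections import Counter


def aggregate_slots(host_args):
    """Sum slots per host across the -H values of all apps."""
    slots = Counter()
    for arg in host_args:  # one -H value per app
        for entry in arg.split(","):
            name, _, count = entry.partition(":")
            # "host:N" adds N slots; a bare "host" adds one.
            slots[name] += int(count) if count else 1
    return dict(slots)


if __name__ == "__main__":
    # Two apps: the first names n1 and n2, the second asks for two slots on n1.
    print(aggregate_slots(["n1,n2", "n1:2"]))
```

Under the same assumptions, an app launched without an explicit process count could take the sum of the slots in its own `-H` value as its default.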
Gilles Gouaillardet
5803385d44 util/hostfile: fix a double free error
As reported at https://stackoverflow.com/questions/52707242/mpirun-segmentation-fault-whenever-i-use-a-hostfile
mpirun crashes when the hostfile contains a "user@host" line.
The root cause is that username was not strdup'ed and so was free'd twice, by opal_argv_free() and free()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-09 11:09:17 +09:00
Gilles Gouaillardet
1ef45b7f1d plm/tm: add a missing include file
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-09 11:08:36 +09:00
Ralph H Castain
fc81d0d519 Replace asprintf with opal_asprintf
Silence the flood of warnings from ORTE

Signed-off-by: Ralph H Castain <rhc@open-mpi.org>
2018-10-06 19:32:37 +00:00
Ralph H Castain
51acbf738e Fix map-by node for comm_spawn
Do not reorder the available host list as this causes the head node process assignment to differ from those computed on the other nodes

Signed-off-by: Ralph H Castain <rhc@open-mpi.org>
2018-10-06 15:58:45 +00:00
Ralph Castain
44afb59a01
Merge pull request #5838 from rhc54/topic/ev
Correctly notify upon process failure
2018-10-04 04:56:05 -07:00
Ralph Castain
86702b71bc Correctly notify upon process failure
We only need to pass a custom range if the target is a single process.
Otherwise, we let the range be "session".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 20:19:42 -07:00
Ralph Castain
57f6b94fa5 Cleanup race condition in finalize
See https://github.com/open-mpi/ompi/issues/5798#issuecomment-426545893
for a lengthy explanation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 09:42:59 -07:00
Ralph Castain
cfdd08d309 Remove stale ORTE code
Functionality moved to PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-02 11:55:36 -07:00
Jeff Squyres
6bb356ab87 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-26 12:15:21 -07:00
Ralph Castain
45f23ca5c9 Update mapping system
Correctly transfer job-level mapping directives for dynamically spawned
jobs to the mapping system.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-26 10:00:09 -07:00
Jeff Squyres
3970b06134 misc: passing a bool to va_start() is undefined
According to clang on MacOS, passing a bool parameter -- which
undergoes default parameter promotion -- to va_start() results in
undefined behavior.  So just change these params to int and avoid the
issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
Nathan Hjelm
1071d72130
Merge pull request #5445 from hjelmn/asm_type
Update opal to use C11 atomics if available
2018-09-14 12:32:56 -06:00
Nathan Hjelm
fe6528b0d5 opal/atomic: always use C11 atomics if available
This commit disables the use of both the builtin and hand-written
atomics if proper C11 atomic support is detected. This is the first
step towards requiring the availability of C11 atomics for the C
compiler used to build Open MPI.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:51:05 -06:00
Ralph Castain
7facb3f3e9 Pickup and deploy network-specific envars
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 09:24:19 -07:00