1
1

696 Коммитов

Автор SHA1 Сообщение Дата
Joseph Schuchart
320a9a1660 OPAL: fix string buffer allocation for large env variables
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
2020-10-30 14:28:23 +01:00
NARIBAYASHI Akira
3dc3bbc1b1 opal/util: Fix typo
Signed-off-by: NARIBAYASHI Akira <a.naribayashi@fujitsu.com>
2020-10-08 16:13:09 +09:00
Brian Barrett
9ffac85650 build: Move libevent to a 3rd-party package
With Open MPI 5.0, the decision was made to stop building
3rd-party packages, such as Libevent, HWLOC, PMIx, and PRRTE as
MCA components and instead 1) start relying on external libraries
whenever possible and 2) Open MPI builds the 3rd party
libraries (if needed) as independent libraries, rather than
linked into libopen-pal.

This patch moves libevent from an MCA framework to a stand-alone
library built outside of OPAL.  A wrapper in opal/util is provided
to minimize the unnecessary changes in the rest of the code.  When
using the internal Libevent, it will be installed as a stand-alone
libevent.a, instead of bundled in OPAL.  Any pre-installed version
of Libevent at or after 2.0.21 is preferred over the internal
version.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2020-10-01 16:55:58 +00:00
Jeff Squyres
f5cb1a49b1
Merge pull request #7897 from cniethammer/cmd_line_fixes
Minor fix in cmd line parser help
2020-08-08 07:13:22 -04:00
Tomislav Janjusic
d809f6ba27 New TSD API interface fix for various components
Co-authored by: Artem Polykaov <artemp@mellanox.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:40 +03:00
Tomislav Janjusic
cba5a0e117 Rename tsd interface function calls
Co-authored by: Artem Polykaov <artemp@mellanox.com>
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
2020-07-24 18:29:07 +03:00
Aurélien Bouteiller
3cd85a9ec5
Add the initial_errhandler info key to MPI_INFO_ENV and populate the
value from prun populated paremeters

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>

Allow errhandlers to invoke the initial error handler before MPI_INIT

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>

Indentation

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2020-07-16 03:10:32 -04:00
Jeff Squyres
e547e2d3ea
Merge pull request #7905 from cniethammer/integer_shift_fix
Fix shifting signed integer
2020-07-02 14:35:40 -04:00
Christoph Niethammer
8483ec41ab Fix shifting signed integer
Shifting signed int will lead to undefined behavior in case of 1<<31 here

Signed-off-by: Christoph Niethammer niethammer@hlrs.de
2020-07-02 17:43:45 +02:00
Christoph Niethammer
6ac111c6dc Fix initialization warning reported by clang-10
Delete duplicate initializer for cpuset in opal_process_info.

Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-07-02 15:15:54 +02:00
Christoph Niethammer
7f6263cdd4 Update out of date documentation for opal_cmd_line_create()
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-06-30 22:04:55 +02:00
Christoph Niethammer
f9c42c96d6 Fix wrong enum type in default command line option category assignment
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
2020-06-30 22:04:48 +02:00
Gilles Gouaillardet
c450b21405 opal/util: fix opal_str_to_bool()
correctly use strlen(char *) instead of sizeof(char *)

Thanks Georg Geiser for reporting this issue.

Refs. open-mpi/ompi#7772

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2020-05-30 20:47:41 +09:00
Boris Karasev
fb9eca55cf sys limits: fixed soft limit setting if it is less than hard limit
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2020-05-14 10:54:16 +07:00
Ralph Castain
f608575eec
Remove references to numa_rank
Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-05-01 13:32:29 -07:00
Ralph Castain
bd29ab0ae9
Update dpm to handle deprecation of MPI_Info keys
Deprecate the current OMPI-specific MPI_Info key definitions for
MPI_Comm_spawn and replace them with their PMIx equivalents. Issue a
deprecation/conversion warning as this is done. Also issue deprecation
warnings for options such as "ompi_non_mpi" that are no longer used.

Handle both cases where the user might pass either the PMIx attribute
name itself (e.g., "PMIX_MAPBY") or the string value of the attribute
(e.g., PMIX_MAPBY, which translates to "pmix.mapby"). This can only be
done for PMIx v4 and above, so protect that code.

Silence a couple of Coverity warnings and add a test along the way.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-29 14:56:38 -07:00
Ralph Castain
6d29bbfde8
Cleanup heterogeneous builds
Consolidate the ompi_process_info and opal_process_info structs to
remove duplicate storage and conversion issues. Unwind some interweaving
of include files using opal.h. Silence a couple of warnings.

For now, set the arch to local if PMIX_ARCH is not found.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-04-22 12:46:27 -07:00
Nikola Dancejic
3637443454 adding NUMA_RANK to process metadata
adding PMIX_NUMA_RANK info to process metadata so that the local NUMA
rank can be accessed through the opal_process_info object.

Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
2020-04-20 22:02:47 +00:00
Howard Pritchard
f136a20cae
Merge pull request #6578 from hppritcha/topic/thread_framework2
Implement a MCA framework for threads
2020-03-27 15:55:48 -06:00
Noah Evans
ee3517427e Add threads framework
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobot:

https://github.com/pmodels/argobots

https://github.com/Qthreads/qthreads

The default threading model is pthreads.  Alternate thread models are
specificed at configure time using the --with-threads=X option.

The framework is static.  The theading model to use is selected at
Open MPI configure/build time.

mca/threads: implement Argobots threading layer

config: fix thread configury

- Add double quotations
- Change Argobot to Argobots
config: implement Argobots check

If the poll time is too long, MPI hangs.

This quick fix just sets it to 0, but it is not good for the
Pthreads version. Need to find a good way to abstract it.

Note that even 1 (= 1 millisecond) causes disastrous performance
degradation.

rework threads MCA framework configury

It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.

qthreads module
some argobots cleanup

Signed-off-by: Noah Evans <noah.evans@gmail.com>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2020-03-27 10:15:45 -06:00
Ralph Castain
33ab928e1b ompi_proc_t size reduction: part 1
We currently save the hostname of a proc when we create the ompi_proc_t for it. This was originally done because the only method we had for discovering the host of a proc was to include that info in the modex, and we had to therefore store it somewhere proc-local. Obviously, this ccarried a memory penalty for storing all those strings, and so we added a "cutoff" parameter so that we wouldn't collect hostnames above a certain number of procs.

Unfortunately, this still results in an 8-byte/proc memory cost as we have a char* pointer in the opal_proc_t that is contained in the ompi_proc_t so that we can store the hostname of the other procs if we fall below the cutoff. At scale, this can consume a fair amount of memory.

With the switch to relying on PMIx, there is no longer a need to cache the proc hostnames. Using the "optional" feature of PMIx_Get, we restrict the retrieval to be purely proc-local - i.e., we retrieve the info either via shared memory or from within the proc-internal hash storage (depending upon the active PMIx components). Thus, the retrieval of a hostname is purely a local operation involving no communication.

All RM's are required to provide a complete hostname map of all procs at startup. Thus, we have full access to all hostnames without including them in a modex or having to cache them on each proc. This allows us to remove the char* pointer from the opal_proc_t, saving us 8-bytes/proc.

Unfortunately, PMIx_Get does not currently support the return of a static pointer to memory. Thus, even though PMIx has the hostname in its memory, it can only return a malloc'd version of it. I have therefore ensured that the return from opal_get_proc_hostname is consistently malloc'd and free'd wherever used. This shouldn't be a burden as the hostname is only used in one of two circumstances:

(a) in an error message
(b) in a verbose output for debugging purposes

Thus, there should be no performance penalty associated with the malloc/free requirement. PMIx will eventually be returning static pointers, and so we can eventually simplify this method and return a "const char*" - but as noted, this really isn't an issue even today.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-03-23 12:49:44 -07:00
Artem Polyakov
7c17a38c96 timings: Fix timings when 'prefix' is used
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2020-03-07 09:36:43 -08:00
Ralph Castain
829fd478b3
Create a hack to protect against non-integer jobids
If someone gives us a namespace that doesn't easily translate to an
integer, we have to create a mechanism for working around the
disconnect. PRRTE has been updated to give us a flag so we know we were
"natively" launched. If we don't see it, then fall back to generating a
hash of the nspace as our jobid. We then have to translate back/forth
between nspace and jobid using a lookup table.

Probably not the right long-term solution, but hopefully helps get us
thru for a bit.

Includes update of PRRTE pointer

Signed-off-by: Ralph Castain <rhc@pmix.org>
2020-02-21 06:04:55 -08:00
Gilles Gouaillardet
174e967dbc
Remove ORTE project
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.

remove orte: misc fixes

 - UCX fixes
 - VPATH issue
 - oshmem fixes
 - remove useless definition
 - Add PRRTE submodule
 - Get autogen.pl to traverse PRRTE submodule
 - Remove stale orcm reference
 - Configure embedded PRRTE
 - Correctly pass the prefix to PRRTE
 - Correctly set the OMPI_WANT_PRRTE am_conditional
 - Move prrte configuration to the end of OMPI's configure.ac
 - Make mpirun a symlink to prun, when available
 - Fix makedist with --no-orte/--no-prrte option
 - Add a `--no-prrte` option which is the same as the legacy
   `--no-orte` option.
 - Remove embedded PMIx tarball. Replace it with new submodule
   pointing to OpenPMIx master repo's master branch
 - Some cleanup in PRRTE integration and add config summary entry
 - Correctly set the hostname
 - Fix locality
 - Fix singleton operations
 - Fix support for "tune" and "am" options

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2020-02-07 18:20:06 -08:00
Austen Lauria
df9745a251
Merge pull request #7299 from awlauria/fix_warnings
Fix some compiler warnings.
2020-01-17 11:33:41 -05:00
Charles Shereda
cbc6feaab2 Created opal_gethostname() as safer gethostname substitute.
The opal_gethostname() function provides a more robust mechanism
to retrieve the hostname than gethostname(), which can return
results that are not null-terminated, and which can vary in its
behavior from system to system.

opal_gethostname() just returns the value in opal_process_info.nodename;
this is populated in opal_init_gethostname() inside opal_init.c.

-Changed all gethostname calls in opal subtree to opal_gethostname
-Changed all gethostname calls in orte subtree to opal_gethostname
-Changed all gethostname calls in ompi subdir to opal_gethostname
-Changed all gethostname calls in oshmem subdir to opal_gethostname
-Changed opal_if.c in test subdir to use opal_gethostname
-Changed opal_init.c to include opal_init_gethostname. This function
 returns an int and directly sets opal_process_info.nodename per
 jsquyres' modifications.

Relates to open-mpi#6801

Signed-off-by: Charles Shereda <cpshereda@lanl.gov>
2020-01-13 08:52:17 -08:00
Austen Lauria
b65ec27307 Fix some compiler warnings.
Silence unused variables, incompatible pointer types,
un-initialized variables, and signed/unsigned comparisons.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2020-01-10 13:10:53 -05:00
Josh Hursey
f828990bcf
Merge pull request #7192 from gvallee/contiaing_typo
Fix typo in comment: contiaing -> containing
2019-11-25 14:40:08 -06:00
Geoffroy Vallee
127573cf44
Fix typo in comment: contiaing -> containing
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 11:06:41 -05:00
Geoffroy Vallee
98de17c6da
Fix a type in comments: insertted -> inserted
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 00:36:06 -05:00
Austen Lauria
0d4004cc3c Fix miscellaneous compiler warnings.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-10-01 16:27:25 -04:00
William Zhang
ca70b703a1 opal/util: Add function to get vertex data from bipartite graph
Currently, there is no function that allows the user to retrieve the
data they have stored in a vertex easily. Using the internal macros and
knowledge of the structures, the new function will return a pointer to
the user provided vertex data.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-09-11 18:28:07 +00:00
William Zhang
4ebb37a26c opal/util: Change opal/util/if.h macro IF_NAMESIZE to OPAL_IF_NAMESIZE
Due to IF_NAMESIZE being a reused and conditionally defined macro,
issues could arise from macro mismatches. In particular, in cases where
opal/util/if.h is included, but net/if.h is not, IF_NAMESIZE will be 32.
If net/if.h is included on Linux systems, IF_NAMESIZE will be 16. This
can cause a mismatch when using the same macro on a system. Thus
different parts of the code can have differring ideas on the size of a
structure containing a char name[IF_NAMESIZE]. To avoid this error case,
we avoid reusing the IF_NAMESIZE macro and instead define our own as
OPAL_IF_NAMESIZE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-07-29 21:24:39 +00:00
Gilles Gouaillardet
d2876fa4fd opal/util: revamp opal_output_verbose()
A typical parameter of opal_output_verbose() is ORTE_NAME_PRINT(...),
which is an expensive macro.
Most of the time, this is unnecessary since the verbosity level is too high.

Make opal_output_verbose() a macro so such arguments are only evaluated if the
verbosity is low enough.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-07-09 11:24:31 +09:00
Jeff Squyres
8b60cc8039 opal/util: remove opal_strncpy()
This function is not used anywhere anymore.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-02-13 10:21:39 -08:00
Jeff Squyres
9b88e60fc8 info_get: ensure to copy all requested characters
When querying an info value, copy out exactly as many characters as
the caller asked for -- do not artificially truncate the target just
to ensure that it is \0-terminated.

Specifically: do not use opal_string_copy() to copy info values,
because opal_string_copy() will guarantee to \0-terminate the target,
even if it means truncating the target.  E.g., if the caller calls
opal_info_get_nolock() with valuelen=5, opal_string_copy() will return
"1234\0" -- which is wrong.  This commit fixes the behavior to return
"12345".

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-12-19 13:01:45 -08:00
Nathan Hjelm
0edfd328f8 opal: clean up init/finalize
This commit contains the following changes:

 - Remove the unused opal_test_init/opal_test_finalize
   functions. These functions are not used by anything in the code
   base or MTT. Tests use opal_init_util/opal_finalize_util instead.

 - Get rid of gotos in opal_init_util and opal_init. Replaced them
   with a cleaner solution.

 - Automatically register cleanup functions in init functions. The
   cleanup functions are executed in the reverse order of the
   initialization functions. The cleanup functions are run in
   opal_finalize_util() before tearing down the class system.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-18 14:37:04 -07:00
Jeff Squyres
290d1c0534 opal/os_path: fix minor string overrun
This is a follow on to 54ca3310ea35b7dc857afc59e00ffbb8f57e15ec to fix
a minor bug identified by Coverity.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-17 08:57:30 -07:00
Jeff Squyres
cef6cf0ac5 opal: update some string handling
Make various string handling locations a little more robust.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:04:28 -07:00
Jeff Squyres
e6de42c379 opal/util/info.c: fix compiler warning
Remove unused variable.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:03:28 -07:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Brian Barrett
c087365ead opal/util: Always support BSD behavior of asprintf
Open MPI's developers like to assume that asprintf() always sets
the ptr to NULL on error, but the standard (and Linux glibc) do
not guarantee this.  As a result, we're making opal_asprintf()
always available for developers, which will guarantee that
ptr is set to NULL on error.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Geoff Paulsen
56a55cdb4b
Merge pull request #5786 from jsquyres/pr/string-madness
Replace strcpy() and strncpy() with (new) opal_string_copy()
2018-10-04 16:12:46 -05:00
Brian Barrett
a25df3f29e opal: Remove outdated MacOS workaround
Remove the pack/unpack pragma around net/if.h on MacOS, which
was added to fix a bug in MacOS X 10.4.x on 64-bit platforms.
The bug was fixed in Mac OS X 10.5.0 and, sometime in the last
11 years, compilers started emitting warnings about the fact
that the Apple header stomped over the pragma pack settings
from the workaround.  We already don't support versions of MacOS
earlier than 10.5, so there's no point in keeping the workaround.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:46 -04:00
Jeff Squyres
293a938d29 string_copy: assert() fail if the copy is too long
Add a hueristic: if the string copy is "too long", fail an assert().
This is based on the premise that Open MPI doesn't do large string
copies.  So if we see a dest_len that is over a certain threshhold
(currently set at 128K), this is likely a programmer error, and on
debug builds, we should fail an assert().  In production builds, it
will work just fine (assuming that it's not a programmer error).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 13:34:15 -07:00
Jeff Squyres
28e51ad4d4 opal: convert from strncpy() -> opal_string_copy()
In many cases, this was a simple string replace.  In a few places, it
entailed:

1. Updating some comments and removing now-redundant foo[size-1]='\0'
   statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
   wants the entire destination buffer length).

This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 11:56:18 -07:00
Jeff Squyres
e1cf8757ae opal/util/string_copy: add "safe" string copy function
Do essentially the same thing as strncpy(3), but a) ensure to always
terminate the destination buffer with a \0, and b) do not \0-pad to
the right.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 10:48:04 -07:00
Jeff Squyres
38fefcf1cf opal/util/Makefile.am: fix whitespace (remove tabs)
No code or logic changes

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 09:43:03 -07:00
Jeff Squyres
3970b06134 misc: passing a bool to va_start() is undefined
According to clang on MacOS, passing a bool parameter -- which
undergoes default parameter promotion -- to va_start() results in
undefined behavior.  So just change these params to int and avoid the
issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
Gilles Gouaillardet
7556dd0abb opal/util: plug a memory leak in the opal_infosubscriber_t destructor
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:17 +09:00