Paul Hargrove pointed out that Stevens tells us that we should
FD_GETFL before FD_SETFL. And so we shall.
Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.
Refs trac:4550
This commit was SVN r31513.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
To accompany r31092 and r310924, also ensure to create a new process
group in the child right after the orted forks. Add trivial configury
to ensure that we have setpgid, and only do the setpgid/getpgid if we
have setpgid.
Without this commit, killing the entire process group can do
unexpected things (e.g., kill the orted, mpirun, and even mpirun's
parent!).
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31132.
The following SVN revision numbers were found above:
r31092 --> open-mpi/ompi@99c9ecaed0
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r310924
The problem arises when a hostfile is used, and the user provides host names without specifying the slots= paramater. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default.
In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information.
Also cleanup some a couple of issues in the mapping/binding system:
* ensure we only override the binding directive if we are oversubscribed *and* overload is not allowed
* ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch
* minor cleanup to the warning message when oversubscribed and binding was requested
cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system
This commit was SVN r30909.
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff.
cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings
This commit was SVN r29978.
Provide some nice error messages if we fail to set the limits. Since the user had to specifically request we set the limit, treat failure as an error-out situation.
This commit was SVN r28288.
Features:
- Support for an override parameter file (openmpi-mca-param-override.conf).
Variable values in this file can not be overridden by any file or environment
value.
- Support for boolean, unsigned, and unsigned long long variables.
- Support for true/false values.
- Support for enumerations on integer variables.
- Support for MPIT scope, verbosity, and binding.
- Support for command line source.
- Support for setting variable source via the environment using
OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
- Cleaner API.
- Support for variable groups (equivalent to MPIT categories).
Notes:
- Variables must be created with a backing store (char **, int *, or bool *)
that must live at least as long as the variable.
- Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
mca_base_var_set_value() to change the value.
- String values are duplicated when the variable is registered. It is up to
the caller to free the original value if necessary. The new value will be
freed by the mca_base_var system and must not be freed by the user.
- Variables with constant scope may not be settable.
- Variable groups (and all associated variables) are deregistered when the
component is closed or the component repository item is freed. This
prevents a segmentation fault from accessing a variable after its component
is unloaded.
- After some discussion we decided we should remove the automatic registration
of component priority variables. Few component actually made use of this
feature.
- The enumerator interface was updated to be general enough to handle
future uses of the interface.
- The code to generate ompi_info output has been moved into the MCA variable
system. See mca_base_var_dump().
opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system
This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.
This commit was SVN r28236.
It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occuring before the command line parser changes but it appears to be resolved now.
This commit was SVN r27527.
The following SVN revision numbers were found above:
r27451 --> open-mpi/ompi@d59034e6ef
r27456 --> open-mpi/ompi@ecdbf34937
readable by normal human beings, vs. having a bitmap of physical
PU's). Use the new hwloc base prettyprint functions to generate the
output.
This commit was SVN r26533.
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations.
Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way.
This commit was SVN r25497.
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-(
This commit was SVN r25285.
This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change.
Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation.
This commit was SVN r23764.
extravaganza.
= Short version =
This commit does several things, but the short version is that it
re-orients the error message creation of the ODLS default module to
generate error strings in the child process for errors that occur
after the fork but before the exec (such errors are ''usually''
related to paffinity). A show_help string is rendered in the child
and then IPC'ed up to the parent, who displays the string through
normal ORTE show_help aggregation mechanisms. We also broke up the
ginormous paffinity-setting logic into a few separate functions, both
to help us understand the code, and hopefully to ease future
maintenance.
The logic for the ODLS default binding should not have changed -- this
is mainly a code reshuffle and improvement on error reporting.
= Rationale =
The reasoning for this commit is complex. As mentioned above, it's
the first step in some paffinity cleanup. Here's the line of dominoes
that must fall (in this order):
1. Add hwloc paffinity component (already done).
1. While testing hwloc, we discovered that the error reporting from
the ODLS default module was abysmal. So we fixed it.
1. Further, we reorganized the code in the odsl_default_module.c a bit
to help our understanding of it.
1. We also discovered a few bugs in the original ODLS default module
logic that existed before this code shuffle; separate tickets
will be filed to fix them.
1. Next up will be some improvements to paffinity / odls default to
make the act of binding to a core ensure to bind to ''all''
hardware threads contained in that core (similar for sockets:
binding to a socket will bind to ''all'' hardware threads in that
socket).
1. Next will be improvements to paffinity to expose binding to
hardware threads through the paffinity framework API.
1. Finally, we'll expose these binding controls to the user (e.g.,
through mpirun command line arguments, MCA parameters, etc.).
This commit represents the first few bullets; the last 4 bullets are
being worked on right now, but there is no definite timeline for
completion.
= Miscelaneous =
A few points worth mentioning:
* We have tested this new code a bunch; we're pretty sure it behaves
just like the trunk -- but with better / more precise error
reporting. More testing is needed on a wider array of platforms,
however.
* A big comment at the top of odls_default_module.c explains the
(new) general scheme for the error reporting.
* The error reporting in the parent process is now really dumb;
almost all the intelligence about creating error messages is in the
child.
* The show_help file was renamed to be more consistent with other
help files (help-odls-default.txt -> help-orte-odls-default.txt)
* Removed the use of sched_yield() because of recent changes in the
Linux 2.6.3x kernels. We already had an #else clause for
select()'ing for 1us if we didn't have sched_yield() -- that is now
the only code path. This is not a performance-critical section of
the code, so this shouldn't be controversial.
* Replaced the macro-based error reporting with function-based
reporting. It's a bit more bulky, but it helped us understand the
code and saved us multiple times with compile-time parameter
checking, etc.
* Cleaned up the use of several show_help messages to ensure that
they mapped to real messages in help*.txt files.
This commit was SVN r23652.
* Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic
OPAL_ERR_NOT_SUPPORTED case.
* When odls_default detects that processor affinity is not supported,
it prints a specific message about it, and then it suppressed a
generic HNP help message that would normally follow it (i.e., it's
easier to have the "processor affinity is not supported" show_help
message last).
* Use some symbolic names in odls_default instead of fixed int's,
just for slight readability improvements in the code.
* Introduce orte_show_help_suppress(), which gives the ability to
suppress any future showings of any arbitrary show_help() message.
This is useful if you display message X and want to suppress
message Y. This suppression *only* works in environments where
orte_show_help() does coalescing.
This commit was SVN r23249.
* If < 0, it's an OPAL_ERR_* value
* If >= 0, it's the actual output value of the function
This is problematic for the OPAL_SOS stuff. This commit changes those
functions to always return OPAL_* statuses and send the output value
back through output parameters (like 95% of the rest of the code
base). This avoids the confusion with OPAL_SOS stuff and makes
paffinity work again (e.g., mpirun --bind-to-core ...).
I updated all paffinitiy modules for the new function signatures, and
bumped the paffinity API version up to 2.0.1. I don't think the
version change will matter, though, because we'll be introducing
support for hardware threads soon, which will either bump the
paffinity version again or we'll replace paffinity with
a new framework.
This commit was SVN r23197.
(OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
back the native error code.
* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
(OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
decode 'ret' to get the native error code.
This commit was SVN r23162.
It is okay to not have a paffinity module IF you aren't using paffinity anyway. So don't error out of MPI_Init because a paffinity module wasn't selected.
Cleanup error reporting in the odls default module to (once and for all!) eliminate messages originating in the fork'd process. Create some new error codes to allow us to pass enough info back to the parent process to provide useful error messages.
This commit was SVN r23106.
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.
2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.
This support must be enabled by configuring --enable-sensors.
ompi_info and orte-info have been updated to include the new framework.
Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.
Implementation continues.
This commit was SVN r23043.