Short version: remove opal_paffinity_alone and restore
mpi_paffinity_alone. ORTE makes various information available for the
MPI layer to decide what it wants to do in terms of processor
affinity.
Details:
* remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone
MCA param
* move opal_paffinity_slot_list param registration to paffinity base
* ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that
succeeds use that. If no slot list was set, see if
mpi_paffinity_alone was set. If so, bind this process to its Node
Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you
COMM_SPAWN to a host in this ORTE universe that already has procs
on it, the NLR for the new job will start at N (not 0). So this is
slightly better than mpi_paffinity_alone in the v1.2 series.
* If a slot list is specified *and* mpi_paffinity_alone is set, we
display an error and abort.
* Remove calls from rmaps/rank_file component to register and look up
opal_paffinity MCA params.
* Remove code in orte/odls that set affinities - instead, have them
just pass a slot_list if it exists.
* Clean up the orte/odls code that determined
oversubscribed/want_processor, as these were just opposites of each
other.
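The binding decision described above for ompi_mpi_init() can be sketched roughly as follows. This is only an illustration of the ordering, not the actual Open MPI code: the enum, the function name choose_binding(), and its arguments are hypothetical stand-ins for the real slot list and mpi_paffinity_alone lookups.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes; the real code uses Open MPI's own
 * error constants and paffinity calls. */
typedef enum {
    BIND_NONE,       /* neither parameter set: leave the process unbound */
    BIND_SLOT_LIST,  /* bind according to the user-supplied slot list */
    BIND_NODE_RANK,  /* bind to the processor matching the Node Local Rank */
    BIND_CONFLICT    /* both set: display an error and abort */
} bind_action_t;

/* Decide what to do given the two user-visible settings. */
bind_action_t choose_binding(const char *slot_list, bool paffinity_alone)
{
    bool have_slot_list = (slot_list != NULL && slot_list[0] != '\0');

    if (have_slot_list && paffinity_alone) {
        return BIND_CONFLICT;
    }
    if (have_slot_list) {
        return BIND_SLOT_LIST;
    }
    if (paffinity_alone) {
        return BIND_NODE_RANK;
    }
    return BIND_NONE;
}
```

The conflict case is checked first so that setting both a slot list and mpi_paffinity_alone never silently picks one of them.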
This commit was SVN r18874.
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
meat of it was commented out long ago, anyway (because of the way it
was written, it violates OPAL<->OMPI abstraction barriers); we never
ended up using the MPI keyval MCA parameter stuff. So just delete it.
This commit was SVN r18860.
Lenny and I went back and forth on whether we should simply register
another "mpi_paffinity_alone" MCA param and then try to figure out
which one was set in ompi_mpi_init, but there was difficulty in
figuring out what to do. So it seemed like the Right Thing to do was
to implement what was committed in r18770; then we could tell where
MCA parameters were set from and you could do Better Things (this is
also useful in the openib BTL, where parameters can be set either via
MCA parameter or via an INI file).
But after that was done, it seemed only a few steps further to
actually implement two new features in the MCA params area:
* Synonyms (where one MCA param name is a synonym for another)
* Allow MCA params and/or their synonyms to be marked as "deprecated"
(printing out warnings if they are used)
These features have actually long been discussed/desired, and I had
some time in airports and airplanes recently where I could work on
this stuff on a standalone laptop. So I did it. :-)
This commit introduces these two new features, and then uses them to
register mpi_paffinity_alone as a non-deprecated synonym for
opal_paffinity_alone. A few other random points in this commit:
* Add a few error checks for conditions that were not checked before
* Correct some comments in mca_base_params.h
* Add a few comments in strategic places
* ompi_info now prints additional information:
* for any MCA parameter that has synonyms, it lists all the
synonyms
* synonyms are also output as 1st-class MCA params, but with an
additional attribute indicating that they have a "parent"
* every MCA param name (both "real" and "synonym") will output an
attribute indicating whether it is deprecated or not. A synonym
is deprecated if it itself is marked as deprecated (via the
mca_base_param_regist_syn() or mca_base_param_register_syn_name()
functions) or if its "parent" MCA parameter is deprecated
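The deprecation rule for synonyms can be sketched as below. This is a minimal illustration, not the real MCA parameter data structures: param_entry_t and param_is_deprecated() are hypothetical names invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record for one MCA parameter name. */
typedef struct param_entry {
    const char *name;
    bool deprecated;                   /* flag given at registration time */
    const struct param_entry *parent;  /* NULL for a "real" parameter */
} param_entry_t;

/* A name is deprecated if it is marked deprecated itself, or if it is
 * a synonym whose "parent" parameter is marked deprecated. */
bool param_is_deprecated(const param_entry_t *p)
{
    if (p == NULL) {
        return false;
    }
    if (p->deprecated) {
        return true;
    }
    return p->parent != NULL && p->parent->deprecated;
}
```

With this rule, a non-deprecated synonym such as mpi_paffinity_alone stays non-deprecated only as long as its parent does.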
This commit was SVN r18859.
The following SVN revision numbers were found above:
r18770 --> open-mpi/ompi@8efe67e08c
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
I promoted the ''none'' component to a full component, and updated the other components to reflect this code movement. The ''none'' component is the default component unless the user requests '''-am ft-enable-cr''' to auto-select a component. There is an MCA parameter to show a warning if the application requested an FT enabled job, but the ''none'' component was selected ({{{crs_none_select_warning}}}).
This temporarily fixes the problem mentioned in r18739. The full fix will entail working on ticket #1291.
Thanks to Ethan from Sun for finding this bug.
This commit was SVN r18840.
The following SVN revision numbers were found above:
r18739 --> open-mpi/ompi@a003fa7a50
an MCA parameter's value came from. Note that the actual value of the
parameter is irrelevant. For example, if a value was specified in an
MCA parameter file that happened to be the same as the default value
specified when the parameter was registered, the returned location
will indicate that the value was set from the file.
Possible answers:
* '''MCA_BASE_PARAM_SOURCE_DEFAULT:''' no user-specified values were
found, so the default value was used
* '''MCA_BASE_PARAM_SOURCE_ENV:''' the value came from the
environment (which also means the mpirun/orterun command line!)
* '''MCA_BASE_PARAM_SOURCE_FILE:''' the value came from a file (or the
Windows registry)
* '''MCA_BASE_PARAM_SOURCE_KEYVAL:''' the value came from a keyval
(can currently never happen)
* '''MCA_BASE_PARAM_SOURCE_OVERRIDE:''' the value came from an MCA
param API "set" function
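The point that the source depends on where a value was found, not on what the value is, can be sketched as follows. Everything here is hypothetical: the SRC_* constants merely mirror the names listed above, param_values_t and param_lookup() are made up for illustration, and the lookup order (override, then environment, then file, then default) is an assumption of this sketch. The keyval source is left out since it can currently never happen.

```c
#include <stddef.h>

/* Illustrative stand-ins for the MCA_BASE_PARAM_SOURCE_* values. */
typedef enum {
    SRC_DEFAULT,
    SRC_ENV,
    SRC_FILE,
    SRC_OVERRIDE
} param_source_t;

/* Hypothetical view of one parameter; NULL means "not set there". */
typedef struct {
    const char *default_value;
    const char *env_value;
    const char *file_value;
    const char *override_value;
} param_values_t;

/* Return the effective value and report where it came from.
 * The values themselves are never compared. */
const char *param_lookup(const param_values_t *p, param_source_t *source)
{
    if (p->override_value != NULL) {
        *source = SRC_OVERRIDE;
        return p->override_value;
    }
    if (p->env_value != NULL) {
        *source = SRC_ENV;
        return p->env_value;
    }
    if (p->file_value != NULL) {
        *source = SRC_FILE;
        return p->file_value;
    }
    *source = SRC_DEFAULT;
    return p->default_value;
}
```

A file entry that repeats the registered default is still reported as coming from the file, because only the presence of a value at each level is checked.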
This commit was SVN r18770.
bother to check to see whether they exist or not. Specifically, this
will not cause an error:
{{{
shell$ mpirun --mca btl ^does_not_exist ...
}}}
but neither will this:
{{{
shell$ mpirun --mca btl ^sm ...
}}}
(where the sm BTL ''does'' exist)
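The difference between the two kinds of lists can be sketched like this. It assumes, as the neighboring commits suggest, that explicitly requested components are checked for existence while "^" exclusion lists are not; selection_is_valid() and component_exists() are hypothetical helpers, not functions from the MCA base.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if the first `len` characters of `name` match an entry
 * in the NULL-terminated `available` array. */
static bool component_exists(const char *name, size_t len,
                             const char *const *available)
{
    for (size_t i = 0; available[i] != NULL; ++i) {
        if (strlen(available[i]) == len &&
            strncmp(available[i], name, len) == 0) {
            return true;
        }
    }
    return false;
}

/* Validate a comma-separated selection string such as "sm,tcp" or "^sm". */
bool selection_is_valid(const char *spec, const char *const *available)
{
    if (spec[0] == '^') {
        return true;  /* exclusion list: names are never checked */
    }
    const char *p = spec;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");
        if (len > 0 && !component_exists(p, len, available)) {
            return false;  /* a requested component does not exist */
        }
        p += len;
        if (*p == ',') {
            ++p;
        }
    }
    return true;
}
```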
This commit was SVN r18760.
The following Trac tickets were found above:
Ticket 1365 --> https://svn.open-mpi.org/trac/ompi/ticket/1365
Make sure that if we ask for the 'none' component (which is not a 'real' component, but a component in crs/base) then we do not fail out of the box when using tools. We check for the {{{OPAL_ERR_NOT_FOUND}}} error.
Also make sure that component_open() returns {{{OPAL_ERR_NOT_FOUND}}} when it cannot find a value instead of {{{OPAL_ERROR}}} which means something quite a bit different.
C/R is working but the tools still print the warning below every time they are run:
{{{
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: odin.cs.indiana.edu
Framework: crs
Component: none
--------------------------------------------------------------------------
}}}
I'll have to figure out a workaround for this warning (maybe work on the {{{MCA_NULL}}} Ticket #1291).
This commit was SVN r18739.
The following SVN revision numbers were found above:
r18707 --> open-mpi/ompi@bdaaf01d8a
but not anything in libc. Which causes an incorrect answer for
AC_CHECK_FUNCS. Work around that by also checking for the
#define.
This commit was SVN r18730.
components. If they are not found / able to be opened, a warning will
be printed and the mca_base_component_find() will return
OPAL_ERR_NOT_FOUND. It is the upper-layer's responsibility to handle
this error appropriately.
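One way an upper layer might handle that return code is sketched below. The SKETCH_* constants are placeholders (the real OPAL error codes have their own values), and handle_find_result() with its `optional` flag is a hypothetical helper invented for this example.

```c
#include <stdbool.h>

/* Placeholder return codes for illustration only. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR = -1,
    SKETCH_ERR_NOT_FOUND = -2
};

/* Map the result of a component search to what the caller should do.
 * `optional` marks a framework that can run without the requested
 * component, so "not found" is tolerated there; every other error
 * is passed through unchanged. */
int handle_find_result(int rc, bool optional)
{
    if (rc == SKETCH_ERR_NOT_FOUND && optional) {
        return SKETCH_SUCCESS;
    }
    return rc;
}
```

Keeping "not found" distinct from a generic error is what lets the caller make this choice at all.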
This commit was SVN r18707.
The following Trac tickets were found above:
Ticket 1338 --> https://svn.open-mpi.org/trac/ompi/ticket/1338
Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.
This commit was SVN r18664.
don't have MPOL_MF_MOVE). I know that this is a configure change in
the middle of the US workday, but this compile problem is preventing
work on several kinds of systems (e.g., RHEL4).
This commit was SVN r18659.
is still to use the C based wrapper compilers (which have many more features
and are better tested). The Perl compilers are enabled with the option
--enable-script-wrapper-compilers, which also ignores the option
--disable-binaries (i.e., --enable-script-wrapper-compilers --disable-binaries
will result in Perl-based wrapper compilers being installed, but no other
binaries being installed).
This commit was SVN r18655.
a standalone library named libopenmpi-malloc. Users wanting to
use leave_pinned with ptmalloc2 will now need to link the library
into their application explicitly. All other users will use the
libc-provided allocator instead of Open MPI's ptmalloc2. This change
may be overridden with the configure option enable-ptmalloc2-internal.
- The leave_pinned options will now default to using mallopt on
Linux in the cases where ptmalloc2 was not linked in. mallopt
will also only be available if munmap can be intercepted (the
default whenever Open MPI is not compiled with
--without-memory-manager).
- Open MPI will now complain and refuse to use leave_pinned if
no memory intercept / mallopt option is available.
This commit was SVN r18654.
After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach.
I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive.
This commit was SVN r18619.
Add a new function to opal_progress that tells us our recursion depth to support that solution.
Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death...
Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang.
After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them.
Thank you Jeff! :-/
This commit was SVN r18611.
to modify the callback array (add or remove), make sure we don't call
the same callback twice if it gets removed in another thread.
This commit was SVN r18608.
orte-checkpoint/orte-restart do not seem to totally like orte_output, so revert them to opal_output for now. Since we have no need for the additional complexity of orte_output we can drop it for now and revisit this if anyone needs it later.
It seems that if you set the verbose level on an output handle and then try to call a normal orte_output() on it, the message will *not* be printed. This is the same for opal_output, and seems incorrect to me because it stops some error messages from being printed out if you do not directly specify opal_output(0, ...). Maybe someone should take a look at this.
orte-checkpoint would segv if passed an incorrect PID. Fixed the return code so it errors out properly.
Thanks to Eric Roman for bringing this to my attention.
This commit was SVN r18583.
If Valgrind is requested but the wrong version is supplied, print error messages and stop.
Save the CPPFLAGS in opal_memchecker_valgrind_CPPFLAGS, which could be used in
Makefile.am.
Many thanks to Jeff.
This commit was SVN r18573.
that can *sometimes* cause problems with "make -j [N>1] install".
Make sure to create the target directory before we copy stuff into it --
read the thread starting here for more details:
http://www.open-mpi.org/community/lists/devel/2008/06/4080.php
This commit was SVN r18570.
1. If we don't have the topology information, don't bother trying to
create cross-referencing information
1. Ensure that we only check for valid processor IDs
This commit was SVN r18462.