- Revert the $2, which was correct.
- It fixes the problem that the memchecker valgrind component could be
compiled and is required, but could not be selected.
This commit was SVN r18906.
The following SVN revision numbers were found above:
r18899 --> open-mpi/ompi@0b1b96b598
Short version: remove opal_paffinity_alone and restore
mpi_paffinity_alone. ORTE makes various information available for the
MPI layer to decide what it wants to do in terms of processor
affinity.
Details:
* remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone
MCA param
* move opal_paffinity_slot_list param registration to paffinity base
* ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that
succeeds use that. If no slot list was set, see if
mpi_paffinity_alone was set. If so, bind this process to its Node
Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you
COMM_SPAWN to a host in this ORTE universe that already has procs
on it, the NLR for the new job will start at N (not 0). So this is
slightly better than mpi_paffinity_alone in the v1.2 series.
* If a slot list is specified *and* mpi_paffinity_alone is set, we
display an error and abort.
* Remove calls from rmaps/rank_file component to register and lookup
opal_paffinity mca params.
* Remove code in orte/odls that set affinities - instead, have them
just pass a slot_list if it exists.
* Clean up the orte/odls code that determined
oversubscribed/want_processor, as these were just opposites of each
other.
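The decision order described in the list above can be sketched roughly as follows. This is only an illustration: `bind_decision_t`, `choose_binding()` and its arguments are hypothetical names invented here, not the code in ompi_mpi_init().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome codes; none of these names come from Open MPI. */
typedef enum {
    BIND_NONE,        /* no binding requested */
    BIND_SLOT_LIST,   /* bind according to the slot list */
    BIND_NODE_LOCAL,  /* bind to the processor matching the Node Local Rank */
    BIND_CONFLICT     /* both were requested: show an error and abort */
} bind_decision_t;

/*
 * Decide how to bind a process.
 *   slot_list       - slot list string, or NULL / "" if none was given
 *   paffinity_alone - value of the mpi_paffinity_alone parameter
 *   node_local_rank - ORTE-maintained slot ID on this node
 *   cpu_out         - set to the target CPU when BIND_NODE_LOCAL is returned
 */
static bind_decision_t choose_binding(const char *slot_list,
                                      bool paffinity_alone,
                                      int node_local_rank,
                                      int *cpu_out)
{
    bool have_slot_list = (slot_list != NULL && slot_list[0] != '\0');

    if (have_slot_list && paffinity_alone) {
        return BIND_CONFLICT;          /* the two requests contradict each other */
    }
    if (have_slot_list) {
        return BIND_SLOT_LIST;         /* an explicit slot list is used as given */
    }
    if (paffinity_alone) {
        *cpu_out = node_local_rank;    /* NLR doubles as the processor ID */
        return BIND_NODE_LOCAL;
    }
    return BIND_NONE;
}
```

Because the NLR of a COMM_SPAWNed job starts at N on a host that already has N procs, passing it as `node_local_rank` avoids binding the new procs on top of the existing ones.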
This commit was SVN r18874.
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
Short version: when the HNP launches VPID 0 on the same node as
itself, the STDIN IOF endpoint will have both a pub and a sub on it.
We need to ensure that incoming messages are forwarded only ''once''
(not twice, as was happening). A lengthy comment in the code explains
this in more detail.
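As a rough illustration of the idea (not the actual IOF code), forwarding can be keyed on the endpoint rather than on its roles, so an endpoint registered as both pub and sub still receives a message once. `entry_t`, `role_t` and `forward_once()` are hypothetical names used only for this sketch.

```c
#include <stddef.h>

/* Hypothetical roles and registration entries; not the real IOF types. */
typedef enum { ROLE_PUB, ROLE_SUB } role_t;

typedef struct {
    int    fd;     /* identifies the endpoint */
    role_t role;   /* role this entry was registered with */
} entry_t;

/*
 * Collect the endpoints a message should be delivered to, listing each
 * endpoint at most once even if it appears with several roles.
 * Returns the number of unique endpoints written to 'delivered'.
 */
static size_t forward_once(const entry_t *entries, size_t n,
                           int *delivered, size_t max_delivered)
{
    size_t count = 0;

    for (size_t i = 0; i < n; ++i) {
        int seen = 0;
        /* Has this endpoint already been scheduled for delivery? */
        for (size_t j = 0; j < count; ++j) {
            if (delivered[j] == entries[i].fd) {
                seen = 1;
                break;
            }
        }
        if (!seen && count < max_delivered) {
            delivered[count++] = entries[i].fd;
        }
    }
    return count;
}
```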
This commit was SVN r18873.
The following Trac tickets were found above:
Ticket 1135 --> https://svn.open-mpi.org/trac/ompi/ticket/1135
If IBCM was explicitly specified with the exclude/include parameter,
the OpenIB BTL will enable verbose reporting of the "/dev/infiniband/ucm"
error; otherwise the error will not be reported.
This commit was SVN r18868.
meat of it was commented out long ago, anyway (because of the way it
was written, it violates OPAL<->OMPI abstraction barriers); we never
ended up using the MPI keyval MCA parameter stuff. So just delete it.
This commit was SVN r18860.
Lenny and I went back and forth on whether we should simply register
another "mpi_paffinity_alone" MCA param and then try to figure out
which one was set in ompi_mpi_init, but there was difficulty in
figuring out what to do. So it seemed like the Right Thing to do was
to implement what was committed in r18770; then we could tell where
MCA parameters were set from and you could do Better Things (this is
also useful in the openib BTL, where parameters can be set either via
MCA parameter or via an INI file).
But after that was done, it seemed only a few steps further to
actually implement two new features in the MCA params area:
* Synonyms (where one MCA param name is a synonym for another)
* Allow MCA params and/or their synonyms to be marked as "deprecated"
(printing out warnings if they are used)
These features have actually long been discussed/desired, and I had
some time in airports and airplanes recently where I could work on
this stuff on a standalone laptop. So I did it. :-)
This commit introduces these two new features, and then uses them to
register mpi_paffinity_alone as a non-deprecated synonym for
opal_paffinity_alone. A few other random points in this commit:
* Add a few error checks for conditions that were not checked before
* Correct some comments in mca_base_params.h
* Add a few comments in strategic places
* ompi_info now prints additional information:
* for any MCA parameter that has synonyms, it lists all the
synonyms
* synonyms are also output as 1st-class MCA params, but with an
additional attribute indicating that they have a "parent"
* all MCA param names (both "real" and "synonym") will output an
attribute indicating whether they are deprecated or not. A synonym
is deprecated if it itself is marked as deprecated (via the
mca_base_param_regist_syn() or mca_base_param_register_syn_name()
functions) or if its "parent" MCA parameter is deprecated
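The synonym and deprecation rules described above can be illustrated with a small stand-alone sketch. It is not the mca_base_param implementation: `registry_t`, `reg_param()`, `reg_synonym()`, `is_deprecated()` and `lookup_value()` are hypothetical names, and the fixed-size table is an assumption made for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARAMS 16

/* Hypothetical parameter record; not the real MCA param structure. */
typedef struct {
    const char *name;
    int         parent;      /* index of the "real" param, or -1 if real */
    bool        deprecated;
    int         value;       /* only meaningful for real params */
} param_t;

typedef struct {
    param_t items[MAX_PARAMS];
    int     count;
} registry_t;

/* Register a real parameter; returns its index or -1 if the table is full. */
static int reg_param(registry_t *r, const char *name, int value, bool deprecated)
{
    if (r->count >= MAX_PARAMS) return -1;
    r->items[r->count] = (param_t){ name, -1, deprecated, value };
    return r->count++;
}

/* Register a synonym for an existing real parameter. */
static int reg_synonym(registry_t *r, int parent, const char *name, bool deprecated)
{
    if (r->count >= MAX_PARAMS || parent < 0 || parent >= r->count) return -1;
    if (r->items[parent].parent != -1) return -1;  /* no synonyms of synonyms here */
    r->items[r->count] = (param_t){ name, parent, deprecated, 0 };
    return r->count++;
}

static int find_param(const registry_t *r, const char *name)
{
    for (int i = 0; i < r->count; ++i) {
        if (strcmp(r->items[i].name, name) == 0) return i;
    }
    return -1;
}

/* A synonym is deprecated if it is marked so itself or if its parent is. */
static bool is_deprecated(const registry_t *r, int idx)
{
    const param_t *p = &r->items[idx];
    if (p->deprecated) return true;
    return p->parent >= 0 && r->items[p->parent].deprecated;
}

/* Look up a value by real name or synonym; returns 0 on success. */
static int lookup_value(const registry_t *r, const char *name, int *value_out)
{
    int idx = find_param(r, name);
    if (idx < 0) return -1;
    if (r->items[idx].parent >= 0) idx = r->items[idx].parent;
    *value_out = r->items[idx].value;
    return 0;
}
```

In this sketch, registering "mpi_paffinity_alone" as a non-deprecated synonym of "opal_paffinity_alone" makes both names resolve to the same value.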
This commit was SVN r18859.
The following SVN revision numbers were found above:
r18770 --> open-mpi/ompi@8efe67e08c
The following Trac tickets were found above:
Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383
Add comments to both orterun and orted code explaining why we take a snapshot of the local environment and apply it to the local procs when they are spawned.
This commit was SVN r18842.
I promoted the ''none'' component to a full component, and updated the other components to reflect this code movement. The ''none'' component is the default component unless the user requests '''-am ft-enable-cr''' to auto-select a component. There is an MCA parameter to show a warning if the application requested an FT enabled job, but the ''none'' component was selected ({{{crs_none_select_warning}}}).
This temporarily fixes the problem mentioned in r18739. The full fix will entail working on ticket #1291.
Thanks to Ethan from Sun for finding this bug.
This commit was SVN r18840.
The following SVN revision numbers were found above:
r18739 --> open-mpi/ompi@a003fa7a50
first to the trunk. So, here is the trunk checkin:
The call to orte_show_help() to notify truncation of the max_inline value
was missing the want_error_header boolean, which eventually results in
a SEGV. This change corrects the call with the bool set to true.
This commit was SVN r18839.
Actually, the problem was that we were simply -adding- any environment MCA params to whatever had been found on the cmd line. Thus, MCA param directives given in both places wound up duplicated in the environment. Some shells took the first one in the environ array - others took the last! So we could get completely different behavior based on the whims of the shell.
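A minimal sketch of merging without duplicates is shown here. It assumes that a command-line setting should win over an environment setting of the same name, which this message does not state explicitly; `kv_t` and `merge_params()` are hypothetical names, not the code that was changed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name/value pair for one MCA param directive. */
typedef struct {
    const char *name;
    const char *value;
} kv_t;

/* Return 1 if 'name' is already present in the first 'n' entries of 'list'. */
static int has_name(const kv_t *list, size_t n, const char *name)
{
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(list[i].name, name) == 0) return 1;
    }
    return 0;
}

/*
 * Merge cmd-line and environment params into 'out' so that each name
 * appears exactly once. Cmd-line entries are copied first, so they take
 * precedence (an assumption of this sketch).
 * Returns the number of entries written, or -1 if 'out' is too small.
 */
static int merge_params(const kv_t *cmdline, size_t n_cmd,
                        const kv_t *env, size_t n_env,
                        kv_t *out, size_t out_cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n_cmd; ++i) {
        if (has_name(out, count, cmdline[i].name)) continue;
        if (count == out_cap) return -1;
        out[count++] = cmdline[i];
    }
    for (size_t i = 0; i < n_env; ++i) {
        if (has_name(out, count, env[i].name)) continue;  /* skip duplicates */
        if (count == out_cap) return -1;
        out[count++] = env[i];
    }
    return (int)count;
}
```

With only one entry per name in the result, the outcome no longer depends on whether a shell picks the first or the last duplicate in the environ array.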
This commit fixes trac:1373
This commit was SVN r18836.
The following Trac tickets were found above:
Ticket 1373 --> https://svn.open-mpi.org/trac/ompi/ticket/1373
These are mostly long additions to comments to document what is going on and why, and how/where it may be revised in the future. Just a couple of small, but important, changes to the code itself.
This commit was SVN r18827.
see how the next gen panasas stuff does in terms of warnings; we can
always re-merge this later if we want to. It's just easier if we have
as little OMPI-specific code as possible (particularly when we know
that the panasas code has some big changes coming).
This commit was SVN r18823.
The following SVN revision numbers were found above:
r17543 --> open-mpi/ompi@b4ec81a9fd
already, and we're just about to do a ROMIO version refresh -- so the
less OMPI-specific code we have (e.g., indenting and whatnot), the
better.
Refs trac:1370.
This commit was SVN r18821.
The following SVN revision numbers were found above:
r16691 --> open-mpi/ompi@8dca19cb3b
r16693 --> open-mpi/ompi@037a533752
The following Trac tickets were found above:
Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
ROMIO in Open MPI (the new version of ROMIO will make this patch
defunct, and David Daniel has confirmed that no one at LANL is using
this functionality, anyway).
Refs trac:1370.
This commit was SVN r18819.
The following Trac tickets were found above:
Ticket 1370 --> https://svn.open-mpi.org/trac/ompi/ticket/1370
hierarch disables itself now if the pml module used is *not* ob1. The reason
is that the multi-level hierarchy detection algorithm checks the names of the
btl modules used. If there are no btl's, we would segfault.
Furthermore, three minor changes:
- the 2-level hierarchy detection is now the default (sm vs. everything else
in the world).
- add udapl to the list of protocols checked for by the multi-level hierarch detection
- some of the verbose statements of hierarch were inaccurate. Fixed those comments/messages.
This commit was SVN r18817.
With help from Brian, modify the ompi/proc/proc.c code to be more thread-safe. Remove the list operations from the ompi_proc_t constructor and destructor. Insert list appends to ompi_proc_init and ompi_proc_find_and_add as required, and protect those with thread locks. Let only the ompi_proc_finalize function actually remove objects from the ompi_proc_list.
Clean up a few places where functions might return without releasing a thread lock. Ensure the ompi_proc_world also does an OBJ_RETAIN so that the reference count on any subsequently released object is correct.
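The locking pattern described above can be sketched as follows. This is not the ompi/proc/proc.c code: `proc_t`, `proc_list`, `proc_lock`, `proc_find_and_add()` and `proc_finalize()` are hypothetical stand-ins, and the sketch inserts at the list head rather than appending, purely for brevity.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical proc record; not the real ompi_proc_t. */
typedef struct proc {
    int          id;
    int          refcount;
    struct proc *next;
} proc_t;

static proc_t *proc_list = NULL;
static pthread_mutex_t proc_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Find a proc by id, or create it and add it to the list.
 * All list manipulation happens under the lock, and every return path
 * goes through the single unlock at the end.
 */
static proc_t *proc_find_and_add(int id, int *is_new)
{
    proc_t *result = NULL;

    *is_new = 0;
    pthread_mutex_lock(&proc_lock);

    for (proc_t *p = proc_list; p != NULL; p = p->next) {
        if (p->id == id) {
            result = p;
            goto unlock;
        }
    }

    result = malloc(sizeof(*result));
    if (result == NULL) {
        goto unlock;               /* allocation failed: still unlock */
    }
    result->id = id;
    result->refcount = 1;
    result->next = proc_list;      /* list insertion is done here, not in a constructor */
    proc_list = result;
    *is_new = 1;

unlock:
    pthread_mutex_unlock(&proc_lock);
    return result;
}

/* Only finalize removes objects from the list; returns how many were freed. */
static int proc_finalize(void)
{
    int freed = 0;

    pthread_mutex_lock(&proc_lock);
    while (proc_list != NULL) {
        proc_t *next = proc_list->next;
        free(proc_list);
        proc_list = next;
        ++freed;
    }
    pthread_mutex_unlock(&proc_lock);
    return freed;
}
```

Routing every exit through one unlock label is one way to avoid the "return without unlocking" problem; it is a design choice of this sketch, not necessarily what the commit did.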
This commit was SVN r18816.
1. repair of the linear and direct routed modules
2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and to correctly handle a failure to parse the provided ompi-server URI
3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature
4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes.
Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB
5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept.
Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules.
6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch
7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch
8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies
9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc.
10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon
There is still more cleanup to be done for efficiency, but this at least works.
Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn.
Fixes ticket #1256
This commit was SVN r18804.