http://www.open-mpi.org/community/lists/devel/2014/05/14822.php
Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).
This commit was SVN r31916.
grpcomm: fix memory leaks
We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.
oob,rml: fix memory leaks
This commit fixes several leaks:
- Both the oob/base and oob/tcp were leaking objects on their peer
hash tables. Iterate on the hash tables and free any objects.
- Leaked sent messages because of missing OBJ_RELEASE. I placed the
release in ORTE_RML_SEND_COMPLETE to catch all the possible
paths.
ess/base: close the state framework
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31776.
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
Only thing left is ompilibdir being treated similar to what we dif for srcdir/builddir. Coming soon.
This commit was SVN r31678.
Fixes trac:4596
Reviewed by rhc, RM-approved
cmr=v1.8.2:reviewer=ompi-gk1.8
This commit was SVN r31626.
The following Trac tickets were found above:
Ticket 4596 --> https://svn.open-mpi.org/trac/ompi/ticket/4596
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
Child processes now look clean; I can't find any more fd's that are
leaking from the parent to children.
Refs trac:4550
This commit was SVN r31515.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.
Change the priority of comm_failure and job_termination events to ensure we process final messages prior to terminating. Check for termination conditions when processing proc termination events as we may order proc termination when the daemon gets an exit command, but we can't see the proc actually terminate until we get out of that message event.
Jeff: probably easiest to review this by testing. I tested it under both Slurm and rsh on v1.7.5 as well as trunk
cmr=v1.7.5:reviewer=jsquyres:subject=resolve event priorities during VM shutdown
This commit was SVN r31042.
Running orte-restart requires an initialized sstore.
This opens the sstore component for FT builds just like
the snapc component.
This commit was SVN r30796.
The function
int orte_snapc_base_select(bool seed, bool app);
wants to know if it called by an application or not. Therefore
it expects as second paremeter 'bool app'. It used to be
'!ORTE_PROC_IS_DAEMON' which is not always correct if it is
a tool or a HNP. This patch changes it to ORTE_PROC_IS_APP, which
has the correct information if it is an application.
This commit was SVN r30404.
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
Thanks to Dave Love and Ashley Pittman for pointing out the problem.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix tool communications with mpirun
This commit was SVN r29959.
The following Trac tickets were found above:
Ticket 3963 --> https://svn.open-mpi.org/trac/ompi/ticket/3963
includes various fixes all over the C/R code which are
hard to group like the other patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values
Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)
This commit was SVN r29922.
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
Fix two problems that surfaced when using direct launch under SLURM:
1. locally store our own data because some BTLs want to retrieve
it during add_procs rather than use what they have internally
2. cleanup MPI_Abort so it correctly passes the error status all
the way down to the actual exit. When someone implemented the
"abort_peers" API, they left out the error status. So we lost
it at that point and *always* exited with a status of 1. This
forces a change to the API to include the status.
cmr:v1.7.3:reviewer=jsquyres:subject=Fix MPI_Abort and modex_recv for direct launch
This commit was SVN r29405.
So we now allow singletons to start on their own, only spawning an HNP when initiating an operation that actually requires it.
cmr:v1.7.4:reviewer=jsquyres
This commit was SVN r29354.
However, tools such as mpirun don't need it, and definitely shouldn't be using it. Ditto for procs launched by mpirun.
We used to have a way of dealing with this - we had the PMI component check to see if the process was the HNP or was launched by an HNP. Sadly, moving the OPAL db framework removed
that ability as OPAL has no notion of HNPs or proc type.
So add a boolean flag to the db_base_select API that allows us to restrict selection to "local" components. This gives the PMI component the ability to reject itself as required. W
e then need to pass that param into the ess_base_std_app call so it can pass it all down.
This commit was SVN r29341.