The alps ras and plm components were broken by recent changes in ORTE. This
commit resolves those issues.
Changes:
- Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's
PMI implementation which does not define (for some reason) PMI2_SUCCESS. We
had previously just used PMI_SUCCESS.
- Add a missing definition and fix a typo in pml_alps_module.
- launch_id is no longer available in the orte_node_t structure. Use the
attribute lookup to get the value.
- Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use
opal_list_sort instead (O(n log n)).
This commit was SVN r32076.
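As a rough illustration of the sorting change, the sketch below orders node records by launch id with a comparator-driven library sort. It is not the code from this commit: node_sketch_t is a hypothetical stand-in for the real node structure, and the standard qsort() stands in for opal_list_sort, which works on an opal_list_t. The C standard does not promise a complexity for qsort(), although common implementations are O(n log n).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a node record; NOT the real orte_node_t. */
typedef struct {
    const char *name;
    int launch_id;   /* assumed to have been fetched via the attribute lookup */
} node_sketch_t;

/* qsort-style comparator: ascending launch_id, written to avoid the
 * integer overflow that "a - b" could cause. */
static int compare_launch_id(const void *a, const void *b)
{
    const node_sketch_t *na = (const node_sketch_t *)a;
    const node_sketch_t *nb = (const node_sketch_t *)b;
    return (na->launch_id > nb->launch_id) - (na->launch_id < nb->launch_id);
}

/* Sort the node array in place by launch_id. */
static void sort_nodes_by_launch_id(node_sketch_t *nodes, size_t count)
{
    assert(nodes != NULL);
    qsort(nodes, count, sizeof(nodes[0]), compare_launch_id);
}
```

Calling sort_nodes_by_launch_id(nodes, count) replaces a hand-written nested-loop ordering with a single comparator plus one library call.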
This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose.
Refs trac:4717
This commit was SVN r32063.
The following Trac tickets were found above:
Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
(a) default binding policy is in effect. In this case, we will emit a
warning and default to not binding unless the user provided the
"oversubscribe" or "overload" modifier to the "bind-to" option.
(b) user-specified binding policy is in effect. In this case, we will
error out unless the user provided the "oversubscribe" or "overload"
modifier to the "bind-to" option as we cannot meet the directive.
Either "bind-to" modifier (oversubscribe or overload) will be accepted for
now - in 1.9, we will deprecate the "overload" term in favor of
"oversubscribe".
Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy.
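The two cases above amount to a small decision table. The sketch below is one way to write it down, assuming the cases apply when binding would overload a resource (the "overload conditions" named in the CMR subject). The enum, the function name, and the three boolean inputs are all hypothetical and do not come from the ORTE source.

```c
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch. */
typedef enum {
    BIND_PROCEED,       /* bind as requested */
    BIND_WARN_NO_BIND,  /* case (a): emit a warning and do not bind */
    BIND_ERROR          /* case (b): cannot meet the directive, error out */
} bind_action_t;

/* would_overload:        binding as requested would overload a resource
 * policy_user_specified: the user gave an explicit binding policy
 * overload_allowed:      the "oversubscribe" or "overload" modifier was given */
static bind_action_t resolve_overload(bool would_overload,
                                      bool policy_user_specified,
                                      bool overload_allowed)
{
    if (!would_overload || overload_allowed) {
        return BIND_PROCEED;
    }
    return policy_user_specified ? BIND_ERROR : BIND_WARN_NO_BIND;
}
```

Treating both modifier spellings as one overload_allowed flag mirrors the statement that either term is accepted for now.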
Closes trac:4345
cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions
This commit was SVN r32005.
The following Trac tickets were found above:
Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345
* allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc.
* if users specify a number of pe's/proc but no policy, default to --map-by NUMA so that multiple cpus are available to satisfy the request. This doesn't guarantee that enough cpus are available to meet the request, but it gives us a chance. In addition, binding a proc to multiple cpus works best when those cpus are all in the same NUMA domain, so this provides some degree of optimized behavior.
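A minimal sketch of the modifier-only form follows, assuming a simplified "[policy][:pe=N]" grammar. The real option accepts more policies and modifiers than this, and the struct, function name, and lowercase "numa" spelling are hypothetical choices made for the illustration.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parse result for this sketch. */
typedef struct {
    char policy[32];   /* mapping policy name; empty means "use the default" */
    int  pes_per_proc; /* 0 when no pe= modifier was given */
} map_by_sketch_t;

/* Parse "[policy][:pe=N]". Returns 0 on success, -1 on malformed input. */
static int parse_map_by(const char *spec, map_by_sketch_t *out)
{
    const char *colon = strchr(spec, ':');
    size_t policy_len = colon ? (size_t)(colon - spec) : strlen(spec);

    if (policy_len >= sizeof(out->policy)) {
        return -1;
    }
    memcpy(out->policy, spec, policy_len);
    out->policy[policy_len] = '\0';
    out->pes_per_proc = 0;

    if (colon != NULL) {
        /* Only the pe=N modifier is handled in this sketch. */
        if (strncmp(colon + 1, "pe=", 3) != 0) {
            return -1;
        }
        char *end;
        long pe = strtol(colon + 4, &end, 10);
        if (end == colon + 4 || *end != '\0' || pe <= 0 || pe > INT_MAX) {
            return -1;
        }
        out->pes_per_proc = (int)pe;
    }

    /* pe's/proc given without a policy: fall back to NUMA mapping. */
    if (out->pes_per_proc > 0 && out->policy[0] == '\0') {
        strcpy(out->policy, "numa");
    }
    return 0;
}
```

With this sketch, ":pe=3" yields the NUMA fallback with 3 cpus/proc, while "socket:pe=2" keeps the policy the user asked for.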
Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31967.
This should eliminate the connectivity issues that have been reported, and will make maintenance of this component much easier.
cmr=v1.8.2:reviewer=jsquyres:subject=simplify the OOB/TCP component
This commit was SVN r31956.
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php
Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has set up the system to support it (or you run as root).
This commit was SVN r31916.
This commit is a slightly better workaround to prevent messages of
the form:
[unset]:_pmi_alps_get_apid:alps_app_lli_put_request failed
[unset]:_pmi_alps_get_appLayout:pmi_alps_get_apid returned with error: Bad file descriptor
It works by completely disabling PMI in the application process when using
mpirun. This should not be an issue for any apps.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31882.
Two leaks are fixed in this commit:
- Do not leak btl component list items.
- Do not leak the nodename when decoding the pidmap.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31779.
grpcomm: fix memory leaks
We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.
oob,rml: fix memory leaks
This commit fixes several leaks:
- Both the oob/base and oob/tcp were leaking objects on their peer
hash tables. Iterate on the hash tables and free any objects.
- Leaked sent messages because of missing OBJ_RELEASE. I placed the
release in ORTE_RML_SEND_COMPLETE to catch all the possible
paths.
ess/base: close the state framework
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31776.
mca_base_var_register (..., MCA_BASE_VAR_TYPE_STRING, ...)
will dup() the orte_set_slots string, so there is no need
to do this in the first place.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31773.
This patch fixes four memory leaks in orte/util/nidmap.c :
- hwloc_get_root_obj(opal_hwloc_topology)->userdata was never freed
- even if bo->bytes is freed in the decode, bo was not freed
- a job list is populated but never used nor freed
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31770.
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
The only thing left is for ompilibdir to be treated similarly to what we did for srcdir/builddir. Coming soon.
This commit was SVN r31678.
Fixes trac:4596
Reviewed by rhc, RM-approved
cmr=v1.8.2:reviewer=ompi-gk1.8
This commit was SVN r31626.
The following Trac tickets were found above:
Ticket 4596 --> https://svn.open-mpi.org/trac/ompi/ticket/4596
The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes.
Refs trac:4571
This commit was SVN r31589.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM.
This commit was SVN r31557.
Child processes now look clean; I can't find any more fd's that are
leaking from the parent to children.
Refs trac:4550
This commit was SVN r31515.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
Paul Hargrove pointed out that Stevens tells us that we should
F_GETFD before F_SETFD. And so we shall.
Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.
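The two-step sequence is the usual fcntl() idiom: read the current descriptor flags, then write them back with FD_CLOEXEC added, so that any other descriptor flags are preserved. The helper below is a sketch of that idiom under a hypothetical name. It is not the body of opal_fd_set_cloexec(), which is not shown here.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper illustrating the get-then-set idiom.
 * Returns 0 on success, -1 if either fcntl() call fails. */
static int set_cloexec_sketch(int fd)
{
    /* Step 1: fetch the existing descriptor flags. */
    int flags = fcntl(fd, F_GETFD, 0);
    if (flags == -1) {
        return -1;
    }
    /* Step 2: write them back with close-on-exec added. */
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
        return -1;
    }
    return 0;
}
```

Wrapping both steps in one function keeps callers from setting FD_CLOEXEC blindly and clobbering flags they did not know about.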
Refs trac:4550
This commit was SVN r31513.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
This pipe is used to communicate between threads in this process.
Mark both fd as close-on-exec so that children don't inherit this
pipe.
Refs trac:4550
This commit was SVN r31512.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
Make sure the debugger attach fifo is marked as close-on-exec so that
children procs don't inherit it. For example, if you salloc a SLURM
allocation and run "mpirun ..." in there (i.e., mpirun is running on
the head node, and launching on to back-end nodes), the forked srun's
will inherit this fd if it is still open.
Refs trac:4550
This commit was SVN r31499.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
Add some verbiage about how mpirun now defaults to disallowing running
as root, but you can use the --allow-run-as-root option to override
this default behavior.
Refs trac:4536
This commit was SVN r31477.
The following Trac tickets were found above:
Ticket 4536 --> https://svn.open-mpi.org/trac/ompi/ticket/4536
Prior to r29058, this same logic was in place (i.e., ensure that the
extra fd to /dev/null is closed). It looks like it was accidentally
removed in the ORTE conversion to the state machine in r29058.
This ''might'' have something to do with many hangs that we're seeing
in Cisco MTT with jobs that exhibit failure (e.g., call MPI_ABORT)...?
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31469.
The following SVN revision numbers were found above:
r29058 --> open-mpi/ompi@a200e4f865
The C99 usage to initialize via struct member names was already there,
but commented out. This commit doesn't fix any known problem; it
simply uncomments the C99 code, because it's safer/better.
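For readers unfamiliar with the C99 form, here is a small sketch of initialization by struct member name. The struct and its fields are hypothetical and are not the structure touched by this commit.

```c
#include <stddef.h>

/* Hypothetical component-like struct, for illustration only. */
typedef struct {
    int version;
    const char *name;
    int (*open)(void);
    int (*close)(void);
} component_sketch_t;

static int sketch_open(void)
{
    return 0;
}

/* C99 designated initializers: each value is tied to a member name, so
 * reordering or adding struct fields cannot silently shift values into
 * the wrong slot. Members not named here are zero-initialized. */
static const component_sketch_t sketch_component = {
    .version = 1,
    .name    = "sketch",
    .open    = sketch_open,
    /* .close is not named, so it is initialized to NULL */
};
```

With positional initialization, inserting a new field in the middle of the struct would shift every later value; with named members the compiler matches each value by name.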
This commit was SVN r31425.
This commit adds a workaround for messages printed by the Cray PMI library
when launching using mpirun. We are still talking with Cray to find a
better fix but this will silence the warnings for now.
cmr=v1.8.1:reviewer=manjugv
This commit was SVN r31352.
See the ticket for more details.
cmr=v1.8.1:reviewer=rhc:ticket=4489
This commit was SVN r31351.
The following Trac tickets were found above:
Ticket 4489 --> https://svn.open-mpi.org/trac/ompi/ticket/4489
Cray moved the apstat command on CLE 5.x to /opt/cray/alps/../bin and
moved a configuration file. This commit adds support for both of these
changes.
cmr=v1.8.1
This commit was SVN r31329.
Also, add missing ORTE_ERROR_LOG in the other case where this error
message is used (i.e., ORTE_ERROR_LOG was used in the one place, so
let's also use it in the other place).
This commit was SVN r31321.
If we are aborting, then set the flags so the HNP directly sends an exit command to each daemon. Make it the halt_vm command so the remote daemon doesn't try to relay it, but instead just exits without waiting for its routed children to exit first.
cmr=v1.8.1:reviewer=jsquyres:subject=fix hangs due to abort prior to daemon wireup
This commit was SVN r31304.
Add the -mca base_env_list "var1=val1 var2=val2 ..." MCA parameter. It can be set in MCA param files
or on the mpirun command line with -am app.conf, and sets environment variables for the ranks through the MCA mechanism.
fixed by Elena, reviewed by Miked
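To make the "var1=val1 var2=val2 ..." format concrete, the sketch below splits such a list and applies each pair with setenv(). It assumes a space-separated list exactly as written above and sets variables in the calling process. The function name is hypothetical, and the real implementation passes the values on to the launched ranks rather than to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: apply "var1=val1 var2=val2 ..." to the current
 * environment. Returns the number of variables set, or -1 on a
 * malformed entry or allocation/setenv failure. */
static int apply_env_list(const char *list)
{
    char *copy = strdup(list);   /* strtok_r modifies its input */
    if (copy == NULL) {
        return -1;
    }

    int count = 0;
    char *save = NULL;
    for (char *tok = strtok_r(copy, " ", &save); tok != NULL;
         tok = strtok_r(NULL, " ", &save)) {
        char *eq = strchr(tok, '=');
        if (eq == NULL || eq == tok) {   /* need a non-empty name and '=' */
            free(copy);
            return -1;
        }
        *eq = '\0';                      /* split into name and value */
        if (setenv(tok, eq + 1, 1) != 0) {
            free(copy);
            return -1;
        }
        count++;
    }

    free(copy);
    return count;
}
```

Note that entries before a malformed one have already been applied when the function returns -1; a stricter version would validate the whole list first.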
cmr=v1.8.1:reviewer=ompi-rm1.8
This commit was SVN r31302.
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.
I've had this sitting around for a while; now seems like as good a
time as any to commit it.
This commit was SVN r31271.
Without this patch running ring_c with the usnic BTL under valgrind will
cause the orteds to segfault.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Ralph Castain <rhc@open-mpi.org>
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31161.
To accompany r31092 and r310924, also make sure to create a new process
group in the child right after the orted forks. Add trivial configury
to ensure that we have setpgid, and only do the setpgid/getpgid if we
have setpgid.
Without this commit, killing the entire process group can do
unexpected things (e.g., kill the orted, mpirun, and even mpirun's
parent!).
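The pattern described here can be sketched as follows: right after fork(), the child moves itself into a new process group, guarded by a configure-style macro. HAVE_SETPGID is assumed to be the macro that configure defines (it is defined locally so the sketch builds on its own), and the function name and exit-code convention are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: configure defines HAVE_SETPGID when setpgid() exists. */
#ifndef HAVE_SETPGID
#define HAVE_SETPGID 1
#endif

/* Fork a child that places itself in its own process group.
 * Returns 0 if the child ended up leading its own group, 1 if it did
 * not, and -1 if fork() or waitpid() failed. */
static int fork_child_in_own_pgrp(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
#if HAVE_SETPGID
        /* Child: become leader of a new group, so that a later kill of
         * that group cannot reach the parent's group. */
        setpgid(0, 0);
        _exit(getpgid(0) == getpid() ? 0 : 1);
#else
        _exit(0);   /* no setpgid available: nothing to do */
#endif
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

In a real launcher the child would go on to exec the application instead of exiting; the exit code here only reports whether the new group was created.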
cmr=v1.7.5:reviewer=rhc
This commit was SVN r31132.
The following SVN revision numbers were found above:
r31092 --> open-mpi/ompi@99c9ecaed0
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r310924
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.