fix will be included in hwloc 1.1.2.
Brad -- can you verify that this fixes the issue for you?
Fixes trac:2732.
This commit was SVN r24450.
The following Trac tickets were found above:
Ticket 2732 --> https://svn.open-mpi.org/trac/ompi/ticket/2732
filenames -- don't include the project name ("opal")
* Don't link maffinity/hwloc and paffinity/hwloc against the common
hwloc in the static build case (because this will result in
duplicate symbols)
This commit was SVN r24447.
hopefully, this now compiles for libnuma 0.9.x and libnuma 2.0.x.
Fixes for the strategy discussed in the commit message for r24442
(i.e., check against numa_get_mems_allowed(), which only exists in
libnuma 2.0.x) and the new new new plan on #2698 coming in a separate
commit.
This commit was SVN r24443.
The following SVN revision numbers were found above:
r24442 --> open-mpi/ompi@90a8fe4aad
(with libnuma-2.0.4 / LIBNUMA_API_VERSION 2): numa_get_run_node_mask
returns a struct bitmask *.
Whether it's a good idea to blindly pass that on to
numa_set_membind() is another matter: one might want to match against
the list returned by numa_get_mems_allowed(), which may be set by the
outside environment.
Refs trac:2698.
This commit was SVN r24442.
The following SVN revision numbers were found above:
r24421 --> open-mpi/ompi@31510e683b
The following Trac tickets were found above:
Ticket 2698 --> https://svn.open-mpi.org/trac/ompi/ticket/2698
Update the CMake script for checking MCA subdirs.
Add Windows support for __attribute__ packed structures.
Define usleep and posix_memalign with equivalent Windows functions.
Also a few minor fixes and type casts.
This commit was SVN r24429.
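For illustration only, here is a minimal sketch of what such Windows shims could look like. The names compat_usleep, compat_memalign, compat_aligned_free and struct sk_wire_header are hypothetical and do not come from the commit. The sketch assumes Sleep() and _aligned_malloc() as the Windows counterparts and #pragma pack as the MSVC substitute for the packed attribute; the commit itself may have chosen differently.

```c
#if !defined(_WIN32) && !defined(_DEFAULT_SOURCE)
#define _DEFAULT_SOURCE 1   /* ask glibc to declare usleep() */
#endif

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <malloc.h>

/* Hypothetical shim: Sleep() takes milliseconds, so round microseconds up. */
int compat_usleep(unsigned int usec)
{
    Sleep((usec + 999u) / 1000u);
    return 0;
}

/* Hypothetical shim: same calling convention as posix_memalign(). */
int compat_memalign(void **out, size_t alignment, size_t size)
{
    assert(out != NULL);
    *out = _aligned_malloc(size, alignment);
    return (*out != NULL) ? 0 : ENOMEM;
}

/* Memory from _aligned_malloc() must be released with _aligned_free(). */
void compat_aligned_free(void *ptr)
{
    _aligned_free(ptr);
}
#else
#include <unistd.h>

int compat_usleep(unsigned int usec)
{
    return usleep(usec);
}

int compat_memalign(void **out, size_t alignment, size_t size)
{
    assert(out != NULL);
    return posix_memalign(out, alignment, size);
}

void compat_aligned_free(void *ptr)
{
    free(ptr);
}
#endif

/* Packed layout: a pragma on MSVC, the GCC-style attribute elsewhere. */
#ifdef _MSC_VER
#pragma pack(push, 1)
struct sk_wire_header { uint8_t tag; uint32_t length; };
#pragma pack(pop)
#else
struct sk_wire_header { uint8_t tag; uint32_t length; } __attribute__((packed));
#endif
```

The separate free wrapper matters on Windows, where passing an _aligned_malloc() pointer to free() is not valid.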
what memory node the process is running on (which is guaranteed to be
a good answer because maffinity won't be invoked unless the process is
already bound to a specific processor), and then bind our memory to
that.
Refs trac:2698.
This commit was SVN r24421.
The following SVN revision numbers were found above:
r24290 --> open-mpi/ompi@afa654746c
The following Trac tickets were found above:
Ticket 2698 --> https://svn.open-mpi.org/trac/ompi/ticket/2698
OMPI supports multiple different repository systems (SVN, hg, git).
But the VERSION file has listed "want_svn" and "svn_r" as fields, even
though the actual repo system and version may not be SVN.
So search/replace those fields (and derivative values that come from
those fields) with "want_repo_rev" and "repo_rev", respectively.
This commit was SVN r24405.
Add some new proc/job states
Rename a constant to reflect a coming change: remove the arbitrary difference, in terms of the number of restarts allowed, between restarting a proc locally and relocating it to another node.
Add pretty-print of signals for "proc aborted due to signal" reports.
This commit was SVN r24378.
The following SVN revision numbers were found above:
r24371 --> open-mpi/ompi@93d28a5792
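As a rough sketch of the signal pretty-printing idea, the report can carry the signal's description next to its number. The helper name format_signal_report and the exact message layout are hypothetical; the sketch assumes the POSIX strsignal() function, which may not be what the commit uses.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask libc to declare strsignal() */
#endif

#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a readable abort report into buf.
 * Returns the snprintf() result (the length the full message needs). */
int format_signal_report(char *buf, size_t len, int signo)
{
    assert(buf != NULL && len > 0);

    /* strsignal() returns a description of the signal; the wording is
     * platform- and locale-dependent. */
    const char *name = strsignal(signo);
    if (name == NULL) {
        name = "unknown signal";
    }
    return snprintf(buf, len, "proc aborted due to signal %d (%s)",
                    signo, name);
}
```

A caller would pass the signal number extracted from the child's exit status, for example via WTERMSIG().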
This means that the converters (opal_err2str, orte_err2str) can now
return NULL as a "silent error". The return value of opal_err2str_fn_t
is the status of the operation (OPAL_SUCCESS or OPAL_ERROR).
This fixes the "Unknown error" message issues on the trunk.
This commit was SVN r24371.
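The convention described above can be sketched as follows. Everything here is hypothetical (the sk_ names, the error codes and the two example layers); the callback signature is an assumption, not a copy of opal_err2str_fn_t. The point is only that the status return says whether a converter recognized the code, while a NULL string on success means "print nothing".

```c
#include <assert.h>
#include <stddef.h>

enum { SK_SUCCESS = 0, SK_ERROR = -1 };

/* Assumed shape: the status is returned, the string comes back via *str. */
typedef int (*sk_err2str_fn_t)(int errnum, const char **str);

/* Hypothetical lower-layer converter. */
int sk_base_err2str(int errnum, const char **str)
{
    assert(str != NULL);
    switch (errnum) {
    case -1:   *str = "Error";           return SK_SUCCESS;
    case -2:   *str = "Out of resource"; return SK_SUCCESS;
    case -100: *str = NULL;              return SK_SUCCESS; /* silent error */
    default:   *str = NULL;              return SK_ERROR;   /* not my code */
    }
}

/* Hypothetical upper-layer converter covering a different code range. */
int sk_rte_err2str(int errnum, const char **str)
{
    assert(str != NULL);
    if (errnum == -200) {
        *str = "Daemon not found";
        return SK_SUCCESS;
    }
    *str = NULL;
    return SK_ERROR;
}

/* Ask each converter in turn; fall back only if nobody claims the code. */
const char *sk_strerror(int errnum)
{
    static const sk_err2str_fn_t converters[] = {
        sk_base_err2str, sk_rte_err2str
    };
    size_t i;

    for (i = 0; i < sizeof(converters) / sizeof(converters[0]); ++i) {
        const char *s = NULL;
        if (converters[i](errnum, &s) == SK_SUCCESS) {
            return s;   /* may be NULL: recognized, but stay silent */
        }
    }
    return "Unknown error";
}
```

With the status separated from the string, a recognized-but-silent code no longer falls through to the "Unknown error" fallback.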
Temporarily remove hwloc's internal version of myriexpress.h. It is
causing a problem when compiling Open MPI with MX support because
hwloc uses AC_CONFIG_HEADER in hwloc's hwloc.m4 to generate
opal/mca/paffinity/hwloc/hwloc/include/hwloc/config.h.
AC_CONFIG_HEADER apparently has the (undocumented) side effect of
adding -I$(top_builddir)/opal/mca/paffinity/hwloc/hwloc/include/hwloc
to OMPI's compilation flags. Hence, when the OMPI MX components are
compiled and #include "myriexpress.h" (or <myriexpress.h>) they see
hwloc's myriexpress.h before the system one. Badness ensues.
This removal is temporary because we need to figure out a better
solution. But for now, OMPI is not using hwloc's myriexpress.h file --
so it's safe to remove. I'll push this issue upstream to hwloc to
figure out a better solution...
This commit was SVN r24354.
The following Trac tickets were found above:
Ticket 2690 --> https://svn.open-mpi.org/trac/ompi/ticket/2690
Short Version:
--------------
Event engine needs to be flushed so it does not use old/stale file descriptors.
Long Version:
-------------
The problem was that the restarted process was waiting for the socket to the local daemon to finish establishing during the 'sync' operation. The core problem was that the daemon was sending a header of 36 bytes, but the restarted process only received 35 bytes of the message. So the restarted process became stuck waiting for the last byte to arrive.
After many hours of digging, I figured out that the event engine was using the same file descriptor for its evsig_cb functionality (to signal itself when a signal arrives). So when the daemon wrote to the new fd, the event engine was stealing the first byte (*shakes fist at event engine*) before the recv() could be posted.
The solution is to use the event_reinit() function on restart to re-establish the now-stale file descriptors in the event engine. This seems to have fixed the problem.
A few other minor things:
-------------------------
* Add a check to make sure the event engine is balanced in its init/finalize
* Add the opal_event_base_close() to the BLCR restart exec function (still not 100% sure it is needed, but there it is).
This commit was SVN r24296.
last December. :-(
Add new MCA param: maffinity_libnuma_policy. Thanks to David
Singleton for the suggestion. Here's the help text about it:
{{{
MCA maffinity: parameter "maffinity_libnuma_policy" (current value:
<loose>, data source: default value)
Binding policy that determines what happens if memory
is unavailable on the local NUMA node. A value of
"strict" means that the memory allocation will fail;
a value of "loose" means that the memory allocation
will spill over to another NUMA node.
}}}
This commit was SVN r24290.
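To make the two policy values concrete, here is a small sketch of how a component might interpret the parameter string. The sk_ names are hypothetical, and the sketch deliberately stops short of any libnuma call, since the message does not say how the component applies the policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    SK_POLICY_LOOSE,    /* spill over to another NUMA node */
    SK_POLICY_STRICT,   /* fail the allocation instead */
    SK_POLICY_INVALID
} sk_policy_t;

/* Hypothetical parser for the parameter value; NULL means "not set",
 * which maps to the documented default of "loose". */
sk_policy_t sk_parse_policy(const char *value)
{
    if (value == NULL || strcmp(value, "loose") == 0) {
        return SK_POLICY_LOOSE;
    }
    if (strcmp(value, "strict") == 0) {
        return SK_POLICY_STRICT;
    }
    return SK_POLICY_INVALID;
}

/* Returns 1 if an allocation should fail under the given policy when the
 * local NUMA node has no memory left, 0 if it may proceed elsewhere. */
int sk_allocation_fails(sk_policy_t policy, int local_node_has_memory)
{
    assert(policy != SK_POLICY_INVALID);
    if (local_node_has_memory) {
        return 0;
    }
    return policy == SK_POLICY_STRICT;
}
```

Rejecting unknown values explicitly, rather than silently treating them as "loose", makes a mistyped parameter visible to the user.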
either direct link to these basic predefined types, or a combination of them.
Anyway, the first items in the datatype list belong to OPAL, the second round
are MPI datatypes created by composing basic OPAL datatypes, and the last
batch are mapped datatypes (direct correspondence between an OMPI datatype and
an OPAL one such as int -> int32_t).
Modify the op to fit this new scheme.
This commit was SVN r24247.
the module to use the new hwloc bitmap API (the cpuset API is both
clunkier and deprecated), which simplified a few things.
This commit was SVN r24217.