hwloc component weren't reverse applied to the external hwloc
component. Additionaly, if we add stuff to LDFLAGS/LIBS, we also may
need to append (DY)LD_LIBRARY_PATH (here in this configure process
only), otherwise future configure tests may fail because they can't
find libhwloc.so (e.g., if you --with-hwloc=/some/path, we need to add
/some/path/lib to (DY)LD_LIBRARY_PATH).
This commit was SVN r26082.
The following Trac tickets were found above:
Ticket 3043 --> https://svn.open-mpi.org/trac/ompi/ticket/3043
* fixed some bugs where "unknown" tokens were allowed on the command
line (which should really only be used for ortertun).
* if an unknown token is encountered, print a short error to stderr
and quit with a nonzero exit status
* if we don't find the right number of parameters to an option, print
a short error to stderr and quit with a nonzero exit status
* when --help is given, print the help message to stdout (not stderr)
and quit with a zero exit status
* added --showme:help option to the wrapper compilers
* updated docs in opal/util/cmd_line.h
* other small/miscellaneous CLI parsing bugs in various tools
I won't bore you with what we did before. :-) Here's some examples
of what the new behavior looks like:
{{{
% ompi_info --bogus
ompi_info: Error: unknown option "--bogus"
Type 'ompi_info --help' for usage.
% ompi_info --param bogus
ompi_info: Error: option "--param" did not have enough parameters (2)
Type 'ompi_info --help' for usage.
%
}}}
This commit was SVN r26072.
This commit was SVN r26037.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r4340
This commit was SVN r25987.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r4319
This commit was SVN r25986.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r4314
Easiest solution is to set the include in a POST_CONFIG macro based on
whether the configure system says the component was selected or not.
This commit was SVN r25968.
1. no binding support - indicated by a negative return code from get_cpubind
2. binding supported, but not bound - the bitset returned by get_cpubind is the same as the available cpuset
3. binding supported and bound - bitset from get_cpubind is a subset of available cpuset
4. only one cpu is available - in this case, get_cpubind matches the available cpuset, but we are effectively bound
This commit was SVN r25957.
in the tarball. Thanks to Paul Hargrove for the fix.
This commit was SVN r25952.
The following Trac tickets were found above:
Ticket 2951 --> https://svn.open-mpi.org/trac/ompi/ticket/2951
makes the test a bit more difficult, per discussions with Paul
Hargrove when going through the release process for hwloc 1.3.2.
This "improved" test is already on OMPI v1.4 -- not sure why it's not
on the trunk or OMPI v1.5 branch. It looks like the extra fprintf got
lost in the translation from AC_TRY_LINK to AC_LINK_IFELSE.
This commit was SVN r25909.
paffinity hwloc was returning "NOT_SUPPORTED" when the real problem
was that the underlying hwloc simply hadn't been initialized yet. So
let's clearly delineate this case: return OPAL_ERR_NOT_INITIALIZED if
the underlying hwloc is not initialized.
This commit was SVN r25902.
into account when configuring the components of the faramework. If
--without-memory-manager was given, then we really don't want any
memory managers to be used.
This commit was SVN r25807.
The following Trac tickets were found above:
Ticket 2844 --> https://svn.open-mpi.org/trac/ompi/ticket/2844
problem on SuSE 10 (which might be related to Oracle's dual-bitness
builds, but we aren't completely sure yet).
So just turn it off for now, and bring this over to v1.5. Find a
proper fix (that enables pci support properly) for trunk/v1.7 later.
This commit was SVN r25769.
The following Trac tickets were found above:
Ticket 2952 --> https://svn.open-mpi.org/trac/ompi/ticket/2952
This commit was SVN r25759.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r4094
This commit was SVN r25707.
The following SVN revision numbers were found above:
r4102 --> open-mpi/ompi@8961ca568d
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r4104
.ompi_ignored to allow other developers to test with it. It is
expected that we'll remove the .ompi_ignore here soon, and
simultaneously remove the hwloc 1.2.2ompi component.
There was one very minor patch added to stock hwloc 1.3.1 in
hwloc/config/hwloc.m4:
{{{
--- hwloc-1.3.1/config/hwloc.m4 2011-12-14
+++ ompi3/opal/mca/hwloc/hwloc131/hwloc/config/hwloc.m4
@@ -583,6 +583,7 @@
])
fi
AC_SUBST(HWLOC_PCI_LIBS)
+ HWLOC_LIBS="$HWLOC_LIBS $HWLOC_PCI_LIBS"
# If we asked for pci support but couldn't deliver, fail
AS_IF([test "$enable_pci" = "yes" -a "$hwloc_pci_happy" = "no"],
[AC_MSG_WARN([Specified --enable-pci switch, but could
not])
}}}
This will be pushed upstream to hwloc.
This commit was SVN r25706.
No need for it to be in the base (we mistakenly thought it was used in
multiple shmem components).
This commit was SVN r25662.
The following SVN revision numbers were found above:
r25652 --> open-mpi/ompi@7e223b5799
* Fixes trac:2329 : Improves the error message, and ensures opal-restart will not segv in opal_finalize.
This commit was SVN r25586.
The following Trac tickets were found above:
Ticket 2329 --> https://svn.open-mpi.org/trac/ompi/ticket/2329
to -I the opal/config directory so that autoregen can find our .m4
files.
This commit was SVN r25564.
The following SVN revision numbers were found above:
r25511 --> open-mpi/ompi@5751c45916
supposed to. I.e., half-baked/not complete stuff.
This commit backs out all of r25545. Sorry folks!
This commit was SVN r25546.
The following SVN revision numbers were found above:
r25545 --> open-mpi/ompi@7f9ae11faf
to make MPI_IN_PLACE (and other sentinel Fortran constants) work on OS
X, we need to use the following compiler (linker) flag:
-Wl,-commons,use_dylibs
So if we're compiling on OS X, test to see if that flag works with the
compiler. If so, add it to the wrapper FFLAGS and FCFLAGS (note that
per a future update, we'll only have one Fortran compiler anyway).
Fixes trac:1982.
This commit was SVN r25545.
The following Trac tickets were found above:
Ticket 1982 --> https://svn.open-mpi.org/trac/ompi/ticket/1982
have timer support
* Default timer size should be a long, not an int. Int will roll over way
too fast, with no performance benifit on 64 bit machines...
This commit was SVN r25501.
the previous pack was larger than the allowed space in the pack buffer
bad things happened. Thanks to Yuki MATSUMOTO and Takahiro Kawashima
for the bug report and the patch.
This commit was SVN r25488.
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
http://www.open-mpi.org/community/lists/devel/2011/10/9878.php
I am making a final decision to decide the behavior of what happens
when an MCA parameter is re-registered and changes types. In
developer builds (i.e., OPAL_ENABLE_DEBUG==1), a show_help message
will be displayed. In all builds, an error status will be returned.
Specifically, the logic looks like this:
{{{
if (detect_re-registration_with_type_change) {
#if OPAL_ENABLE_DEBUG
opal_show_help(...);
#endif
return OPAL_ERR_VALUE_OUT_OF_BOUNDS;
}
}}}
If someone would like to change this behavior, they are welcome to do
so. :-) I am committing this so that ''some'' action occurs (rather
than talking about the issue and then nothing happens).
This commit was SVN r25432.
handled properly when MCA parameters are re-registered and their types
change. Specifically, this case was broken:
1. Register an int MCA param with a non-zero default value
1. Re-register the same MCA param as a string with a NULL default value
The 2nd step would cause a segv because the first int default value
wasn't being reset properly. Here's sample code that shows the issue:
{{{
{
int ibogus;
char *sbogus;
opal_init(&argc, &argv);
mca_base_param_reg_int_name("type", "name", "help", false, false, 3, &ibogus);
printf("Ibogus: %d\n", ibogus);
mca_base_param_reg_string_name("type", "name", "help", false, false, NULL, &sbogus);
printf("Sbogus: %s\n", (NULL == sbogus) ? "NULL" : sbogus);
exit(0);
}
}}}
This commit fixes the problem from the sample code above as well as
the a similar issue for file-set MCA params and override values. It
also resets default values for MCA params initially registered as a
string but then re-registered as an int.
This commit was SVN r25392.
Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times.
Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h
Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun.
This commit was SVN r25331.
calling main():
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
xxx:~/openmpi-1.5.4/COMPILE-intel-12.1.0> which ompi_info
~/openmpi-1.5.4/COMPILE-intel-12.1.0/usr/bin/ompi_info
xxx:~/openmpi-1.5.4/COMPILE-intel-12.1.0> ompi_info
Segmentation fault
xxx:~/openmpi-1.5.4/COMPILE-intel-12.1.0> gdb usr/bin/ompi_info
...
(gdb) run
Starting program:
...
Program received signal SIGSEGV, Segmentation fault.
opal_memory_ptmalloc2_int_malloc (av=0x7ffff7fe83d8, bytes=4102) at
../../../../../opal/mca/memory/linux/malloc.c:4080
4080 /* remove from unsorted list */
(gdb) where
#0 opal_memory_ptmalloc2_int_malloc (av=0x7ffff7fe83d8, bytes=4102) at
../../../../../opal/mca/memory/linux/malloc.c:4080
#1 0x00007ffff7c232b9 in opal_memory_linux_malloc_hook
(sz=140737354040280, caller=0x1006) at
../../../../../opal/mca/memory/linux/hooks.c:687
#2 0x0000003dd96a6871 in __alloc_dir () from /lib64/libc.so.6
#3 0x0000003ddfa053cd in ?? () from /usr/lib64/libnuma.so.1
#4 0x0000003dd8e0e445 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
A lot of combinations and trials have been done, yet to no avail.
Intel v11.0 worked...
Thanks to Hubert Haberstock (Intel) providing the hint in:
http://software.intel.com/en-us/forums/showthread.php?t=87132
This was tested on openmpi-1.5.4 and therefore should
cmr:v1.5
This commit was SVN r25290.
zeroes);
if so, use it for bit-operations like opal_cube_dim and opal_hibit.
Implement two versions of power-of-two.
In case of opal_next_poweroftwo, this reduces the average execution
time from 83 cycles to 4 cycles (Intel Nehalem, icc, -O2, inlining,
measured rdtsc, with loop over 2^27 values).
Numbers for other functions are similar (but of course heavily depend
on the usage, e.g. opal_hibit() with a start of 4 does not save
much). The bsr instruction on AMD Opteron is also not as fast.
- Replace various places where the next power-of-two is computed.
Tested on Intel Nehalem Cluster with openib, compilers GNU-4.6.1 and
Intel-12.0.4 using mpi_testsuite -t "Collective" with 128 processes.
This commit was SVN r25270.
since the 1.4 release (and configure would abort when run with sparc v8), but
the code was left in place. Sparc v9 (32 or 64 bit) are still supported
targets.
This commit was SVN r25258.
1 otherwise. It was doing the opposite, so this patch fixes the
return values. All uses (all in ORTE) used the actual return values,
not the documented values, so fix them as well.
This commit was SVN r25257.
match similar stuff in the event framework; only add CPPFLAGS /
LDFLAGS / LIBS / and WRAPPER_EXTRA_* of the same for the one, single,
winning component (because this framework is compile-time,
one-of-many).
This commit was SVN r25199.
only became evident when there was more than one event component.
The libevent2013 component is still ompi_ignore'd for most developers.
This commit was SVN r25198.
Since hwloc has a dynamic bitmap size, it could actually have bits set
that will not fit in the paffinity mask. We already made sure that we
didn't overrun the paffinity mask; now also set the return value to
OPAL_ERR_VALUE_OUT_OF_BOUNDS (wow, we really thought of everything
with those error codes, eh?) if the hwloc bitmap has bits set higher
than what will fit into the paffinity bitmask.
This commit was SVN r25179.
The following Trac tickets were found above:
Ticket 2854 --> https://svn.open-mpi.org/trac/ompi/ticket/2854
* change components from setting <framework>_base_include to
opal_<framework>_<component>_include; the framework m4 will figure
out the winning component and pick the right "include" shell
variable. Ditto for the other shell variables (cppflags, ldflags,
etc.).
* misc fixes to hwloc/external
* add a bunch of missing "opal_" prefixes to shell variables
* add a few more / update a few comments in framework m4's
This commit was SVN r25174.
maffinity_base_bind_failure_action MCA params to the hwloc base
(hwloc_base_alloc_polocy and hwloc_base_bind_failure_action). Since
these MCA parameters were never on a release branch, I'm just
moving/renaming them outright and not leaving aliases to the old
names.
Note that some upper layer needs to call
opal_hwloc_base_set_process_membind_policy() to set the
set-by-MCA-param process-wide memory affinity policy. We can't do
this automatically during hwloc_base_open() because, for reasons
described elsewhere, opal_hwloc_topology is not automatically filled
during hwloc_base_open() (in short: potential scalability issues when
launching many MPI processes simultaneously on a single machine, for
example).
This commit was SVN r25156.
paffinity/hwloc components were still calling hwloc_topology_init/load
themselves, and not using the opal_hwloc_topology. Doh!
This commit fixes that -- these 2 components no longer have their own
copy of the topology tree; they just use opal_hwloc_topology.
This commit was SVN r25151.
the command line, hwloc is just like any other external dependency
in OMPI: if we find it, we'll use it. If we don't find it, we'll
ignore it. See comments in opal/mca/hwloc/configure.m4 for an
explanation.
* Fix some copy-n-paste errors in opal/mca/hwloc/configure.m4
w.r.t. flags coming in from the winning component.
* Add another line in ompi_info's output about whether hwloc support
is included or not.
This commit was SVN r25134.
To enable the epochs and the resilient orte code, use the configure flag:
--enable-resilient-orte
This will define both:
ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE
This commit was SVN r25093.
specify btl_tcp_if_include because btl_tcp_if_exclude is defaulted to
the loopback devices.
This commit does a few things:
* Introduce a new OPAL MCA base function:
mca_base_param_check_exclusive_string(). It checks to see that the
''user'' does not set two MCA parameters that are mutually
exclusive by checking the source of those MCS param values.
* Use the above function in many BTLs (and the OOB TCP) to ensure
that <foo>_if_include and <foo>_if_exclude are not both specified
''by the user''.
* Re-arrange many of these BTLs to move their MCA registration code
into a separate component_register() function (vs. the
component_open() function).
This code has been nominally reviewed and checked by Ralph, George,
Terry, and Shiqing.
This commit was SVN r25043.
The following SVN revision numbers were found above:
r24976 --> open-mpi/ompi@8f4ac54336
included a missing ARM directory).
This commit was SVN r24923.
The following SVN revision numbers were found above:
r24875 --> open-mpi/ompi@ceabe91484
It was agreed that opal_init_util had wound up being used in unintended ways, which raised the problem of getting reference counts to work right. However, fixing it would involve more pain than it was worth - and so long as the other layers are made to behave similarly, I have no preference either way.
Complete implementation will follow - for now, this just reverts the prior changes.
This commit was SVN r24886.
The following SVN revision numbers were found above:
r24862 --> open-mpi/ompi@aa92e0c4eb
r24864 --> open-mpi/ompi@a5062385c2