There were mistakes in the Makefiles for the ugni btl and
mca/common/ugni that prevented the ugni btl from being
used unless one happened to set the --disable-dlopen option
on the config line.
This commit fixes this problem.
This enables IBM's software iWARP provider. With this driver you can
run iWARP/RDMA over any ethernet NIC. Useful for testing OMPI RDMA
logic without requiring an expensive RDMA adapter/infrastructure.
The Soft iWARP code is at: https://www.gitorious.org/softiwarp
Avoid a problem with double-derefence of a variable macro name (i.e.,
a macro with part of its name from an AC_SUBST, such as
```$(foo@BAR@baz)```.
In what might be a bug in Automake 1.14.1, if you do a pattern like
this:
```makefile
lib_LTLIBRARIES = lib@A_PREFIX@a_lib.la
noinst_LTLIBRARIES = lib@A_PREFIX@a_noinst.la
lib@A_PREFIX@a_lib_la_SOURCES = a.c
lib@A_PREFIX@a_noinst_la_SOURCES = $(lib@A_PREFIX@a_lib_la_SOURCES)
```
Then in the resulting Makefile, the value of
```$(lib@A_PREFIX@a_lib_la_OBJECTS)``` will be *blank* (when it really
should be ```a.o```).
To workaround this potential bug, I've simply avoided doing
double-derefences like this, and effectively set the second
```_SOURCES``` line equal to ```a.c``` (just like the first
```_SOURCES``` line).
Fixes#250.
These two macros set the MCA prefix and MCA cmd line id,
respectively. Specifically, MCA parameters will be named
PREFIX<foo> in the environment, and the cmd line will use
-ID foo bar.
These macros must be called during configure.ac and a value
supplied. In the case of Open MPI, the values given are
PREFIX=OMPI_MCA_ and ID=mca.
Other projects (such as ORCM) will call these macros with
their own unique values. For example, ORCM uses PREFIX=ORCM_MCA_
and ID=omca
This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL. For example,
when running OMPI applications under ORCM, we need the MCA params passed
to the ORCM daemons to be separated from those recognized by the OMPI application.
At some point we added a sanity check to the btl base to ensure that
the btl flags match the available functions (this prevents user's from
specifying get or put when no function exists). This check was
disabling get for the sm btl since at the time of the check there is
no btl_get function. The simplest fix is to set a dummy value to btl_get
that will be overwritten with the proper value on btl initialization.
Closes#239.
If there are no usnic BTL modules, then just avoid sending any modex
message at all (other BTLs do this; it's safe to do).
The change is smaller than it looks: I added a "if 0 ==..." check at
the top to return immediately if there are no BTL modules. Then I
removed some now-unnecessary conditionals and un-indented as
appropriate.
Fixes#248
These two macros set the prefix for the OPAL and ORTE libraries,
respectively. Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.
These macros must be called, even if the prefix argument is empty.
The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix. For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.
This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL. For example,
when running MPI applications under ORTE, if the ORTE and OPAL
libraries between OMPI and ORCM are not identical (which, because they
are released at different times, are likely to be different), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
This commit makes the folowing changes:
- Add support for the knem single-copy mechanism. Initially vader will only
support the synchronous copy mode. Asynchronous copy support may be added
int the future.
- Improve Linux cross memory attach (CMA) when using restrictive ptrace
settings. This will allow Open MPI to use CMA without modifying the system
settings to support ptrace attach (see /etc/sysctl.d/10-ptrace.conf).
- Allow runtime selection of the single copy mechanism. The default behavior
is to use the best available. The priority list of single-copy mehanisms is
as follows: xpmem, cma, and knem.
- Allow disabling support for kernel-assisted single copy.
- Some tuning and bug fixes.
Restore the functionality to error out (and show a helpful message) if
knem support is requested by is either not compiled in or cannot be
activated.
Thanks to Gus Correa for bringing the matter to our attention.
Somehow, this MCA param was accidentally dropped after v1.6.5. Thanks
to Gus Correa for bringing this matter to our attention.
Also moving some MCA params down from level 9 to levels 4/5.
of preregistered buffers
Before this change eager gets we retried on each progress loop. This commit
modifies the protocol to only retry eager gets when another eager get has
completed. This commit also cleans up some callback code that is no longer
needed.
The GNI_RDMAMODE_FENCE bit was a left over from
async progress work that is not needed at this point
in the gni BTL. Removing the bit also allows
for the removal of the GNI_CDM_MODE_BTE_SINGLE_CHANNEL
bit from the GNI_CdmCreate call.
This commit adds initial ugni thread safety support.
With this commit, sun thread tests (excepting MPI-2 RMA)
pass with various process counts and threads/process.
Also osu_latency_mt passes.
With --enable-memchecker builds, use calloc(3) for OBJ_NEW instead of
malloc(3). This cuts down on a lot of valgrind/memory checker false
positive output.
Also make a minor change in the valgrind configure.m4; have it assign
0xf to a char. The prior assignment (of 0xff) was warning about an
overflow. This didn't really matter, but we might as well make the
test not have a gratuitious warning in it.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
Remove configure.params support: configure.params hasn't been used in
years.
Also remove autogen.subdirs support; those should really be handled by
their respective Makefile.am's.
Per discussions with pmix folks, it was determined that
the way the cray pmi pmix component was computing the
PMIX_NODE_RANK attribute for a process was incorrect.
This commit fixes the problem.
This commit was SVN r32810.
This is a large update that does the following:
- Only allocate fast boxes for a peer if a send count threshold
has been reached (default: 16). This will greatly reduce the memory
usage with large numbers of local peers.
- Improve performance by limiting the number of fast boxes that can
be allocated per peer (default: 32). This will reduce the amount
of time spent polling for fast box messages.
- Provide new MCA variables to configure the size, maximum count,
and send count thresholds for fast boxes allocations.
- Updated buffer design to increase the range of message sizes that
can be sent with a fast box.
- Add thread protection around fast box allocation (locks). When
spin locks are available this should be updated to use spin locks.
- Various fixes and cleanup.
This commit was SVN r32774.
The ess pmi module was not handling aprun launched
daemons. All daemons were thinking they were vpid 1.
Also, turns out that on cray systems using MOM nodes
for launched jobs, just detecting whether or not a
process is in a PAGG container is not sufficient.
Crank up the priority of the alps PLM component in the
event that the configure detected the presence of both
slurm and alps.
Have the ESS pmi component open the pmix framework and
select a pmix component.
This commit was SVN r32773.
Add a debugging check that ensures that the registered storage is
aligned appropriately for the type that is specified.
When we know that the storage is properly aligned, we can cast the
mbv_storage to the appropriate type and then simply do the assignment.
We used to do this assignment via a union, but clang's
-fsanitizer=alignment complained about this.
This commit was SVN r32716.
The use of restrict in the return type qualifier for mca_btl_vader_reserve_fbox
is being ignored by gnu compiler. for newer gcc, one sees this warning only
with -Wignored-qualifiers set, but for older variants of gcc it was reported
that numerous warning messages about this ignored qualifier were being
generated as vader is being compiled.
The warning reported by gcc is
btl_vader_fbox.h:53:47: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
static inline mca_btl_vader_fbox_t * restrict mca_btl_vader_reserve_fbox (struct mca_btl_base_endpoint_t *ep, const size_t size)
This commit was SVN r32714.
The question of whether or not the openib BTL supports loopback is a separate question. It may be more appropriate to make the modex be PMIX_GLOBAL for cases where openib can support loopback so someone can run without a shared memory component. I'll leave that decision to the IB vendors.
This commit was SVN r32702.
So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes.
cmr=v1.8.3:reviewer=jsquyres
This commit was SVN r32686.
- do not have mca_btl_openib_component.default_recv_qps point to the stack
- do not reset mca_btl_openib_component.default_recv_qps in btl_openib_component_open
cmr=v1.8.3:reviewer=miked
This commit was SVN r32642.
opal_process_name_t is an uint64_t which is not equivalent to
an unsigned long on 32 bits systems.
this is now parsed as an unsigned long long.
This commit was SVN r32592.
Per #4874, code review revealed a possible race condition in the
module struct and the connectivity agent. Move the setup of the
connectivity agent listener until the module struct has been fully
setup.
This commit was SVN r32573.
When no components were able to be found, btl_base_select() was
showing the wrong help message -- one that indicated that a specific
component could not be found. And it left off a string argument, so
the end of the help message was garbage.
This commit creates a new help message for this case and updates the
show_help call to use the new message.
This commit was SVN r32572.
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “lmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
Thanks to Ashley Pittman for pointing the modex should now be zero'ed
cmr=v1.8.2:ticket=trac:4871
This commit was SVN r32568.
The following Trac tickets were found above:
Ticket 4871 --> https://svn.open-mpi.org/trac/ompi/ticket/4871
PGI compilers 2013 and older do not support the following syntax :
mca_btl_scif_modex_t modex = {.port_id = mca_btl_scif_module.port_id};
so split it on two lines
cmr=v1.8.2:reviewer=hjelmn
This commit was SVN r32555.
It turns out that we ''can'' get to the endpoint destructor with the
endpoint still on the "endpoints needing ACKs" list. So if it's on
the list, remove it first, and then DESTRUCT the opal_list_item_t.
This prevents an assert() fail in debug builds. We'd like to let this
soak over the weekend.
cmr=v1.8.2:reviewer=dgoodell
This commit was SVN r32546.
Rarely -- but it happens -- the connectivity client gets ECONNREFUSED
because the connectivity agent listen() backlog is too small. Rather
than put in a loop on the client side, take the simple way out for
now: increase the backlog size to an arbitrarily-large number.
Reviewed by Dave Goodell.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32543.
Instead of waiting to destroy the connectivity agent during component
shutdown, have the module shutdown send an "unlisten" command to the
cagent that will tell it to stop listening on a given interface.
This commit was SVN r32536.
1. After we receive N abnormally-short messages (meaning: corrupted),
print a show_help message about it. N defaults to 25. N can be set
to 0 disable the message via btl_usnic_max_short_packets.
1. If we receive a completion error for something other than a
receive, display a show_help message.
Reviewed by Dave Goodell.
CMR'ing to v1.8.3, but it will require a custom patch because of the
OMPI->OPAL BTL move.
cmr=v1.8.3
This commit was SVN r32522.
There were remaining references to size of MPI_COMM_WORLD,
etc. in ugni btl which prevented building of opal library
following btl move to opal.
Eventually another mechanism will need to be found for
providing hints to BTLs about how to setup internal
resources relevant to max. number possible endpoints, etc.
Conflicts:
opal/mca/btl/ugni/btl_ugni_component.c
This commit was SVN r32507.
Show an example of using the btl_usnic_connectivity_map option. Also,
mention that another reason for the "total connectivity failure" may
be due to asymmetric / unexpected routing.
Reviewed by Dave Goodell.
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32465.
The contrib/check-help-strings.pl gets confused if the topic is an
inline logic check, so separate it into two calls to show_help.
This commit was SVN r32455.
In core library portions of the configury (e.g., top-level
configure.ac itself), we were calling AC_CHECK_LIB and
OPAL_CHECK_FUNC_LIB to check for various libraries.
'''SIDENOTE:''' It turns out that modern Autoconf has AC_SEARCH_LIBS,
which does just about exactly what OPAL_CHECK_FUNC_LIB does. So this
commit effectively replaces OPAL_CHECK_FUNC_LIB with AC_SEARCH_LIBS.
However, we never bothered to add these found libraries to the wrapper
compiler list of libraries used for static linking (doh!). We've been
getting lucky for quite a while that components were adding the same
libraries to their wrapper compiler LIBS list.
This is problematic, however, if we don't build some of these
components. For example, Paul Hargrove noticed that if he configured
with --disable-shared --enable-static --disable-io-romio, ROMIO was no
longer adding some libraries to the wrapper LIBS list -- libraries
that just happened to also be needed by core OPAL/ORTE/OMPI layers.
The solution is not to use AC_CHECK_LIB or OPAL_CHECK_FUNC_LIB, but
use a pair of new macros:
* OPAL_SEARCH_LIBS_CORE: a wrapper around AC_SEARCH_LIBS. If we add
something to $LIBS, then also add it to the wrapper list of static
libraries. This is the main piece of functionality that was
wrong/missing.
* OPAL_SEARCH_LIBS_COMPONENT: similar to OPAL_SEARCH_LIBS_CORE, but
instead of directly adding it to the wrapper list of static
libaries, add it to <framework>_<component>_LIBS (which eventually
gets slurped up into the wrapper list of static libraries. See the
lengthy comment in config/opal_setup_wrappers.m4 near the beginning
of OPAL_SETUP_WRAPPER_INIT() for a more detailed explanation).
Most components did this correctly already, but one or two weren't
right, so I implemented this second macro quite similar to the
first and put it everywhere we already used AC_SEARCH_LIBS or
OPAL_CHECK_FUNC_LIB.
This needs to soak for a day or two on the trunk before moving to the
v1.8 branch.
Refs trac:4834
cmr=v1.8.2:reviewer=ggouaillardet
This commit was SVN r32447.
The following Trac tickets were found above:
Ticket 4834 --> https://svn.open-mpi.org/trac/ompi/ticket/4834
These messages were committed in the v1.8 branch in r32341, but were
never committed to the trunk (because we were waiting for the OPAL BTL
move). This commit brings the trunk and v1.8 help messages in line
with each other.
This commit was SVN r32445.
The following SVN revision numbers were found above:
r32341 --> open-mpi/ompi@5e752b4aba
Ensure that the connectivity checker agent only uses pointers from the
client that is the same process as the agent.
Not necessary for the v1.8 branch -- this is a trunk/v1.9-only problem.
This commit was SVN r32438.
Make the del_procs, module finalize, and endpoint destructors be the
same between trunk and v1.8, with one exception: the very beginning of
v1.8 module_finalize calls del_procs for each proc to simulate/pretend
the trunk/v1.9 PML behavior of calling del_procs before module_finalize.
This commit was SVN r32437.
Fixes an assertion failure in --enable-debug builds and SEGVs in normal
builds.
I'm not 100% sure I like this model, but it at least seems to be
consistent. Some variation on this scheme will need to be adapted to
the trunk, where usnic_del_procs() is called by the PML instead of
internally in usnic_finalize().
A related bug (but with different mechanics) is #4832.
This commit was SVN r32424.
Per patch from aryzhikh, initialize a copule of fields in the openib ini struct
Hard to understand why this has sat there for so long...
cmr=v1.8.2:reviewer=rhc:subject=initialize a copule of fields in the openib ini struct
This commit was SVN r32417.
The following Trac tickets were found above:
Ticket 4377 --> https://svn.open-mpi.org/trac/ompi/ticket/4377
Previously, the connectivity agent was pretty dumb: it took whatever
pings it got and ACKed them. Then we added an agent check to ensured
that the ping actually came from the source interface that it said it
came from. Now we add another check such that when a ping is received
on interface X that corresponds to usnic module Y, we ensure that the
source interface of the ping is on the all_endpoints list for module Y
(i.e., module Y expects to be able to talk to that peer interface).
This detects cases where peers have come to different conclusions
about which interfaces should be used to communicate (which is bad!).
This usually reflects a network misconfiguration.
Fixes CSCuq05389.
This commit was SVN r32383.
Ensure that incoming "ping" messages came from the IP address that
they think they came from. If they don't, drop them (because it is
probably routing error), which will likely eventually cause the
connectivity checker to timeout, and therefore cause the job to abort.
This commit was SVN r32368.
Update compat.h to only handle compatability between v1.7/v1.8 and
v1.9/2.0 (i.e., the current trunk). Remove what seems to be the last
vestiages of OMPI/ORTE pollution in the now-OPAL-ized usnic BTL.
Currently use a hard-coded constant for the MCW size (i.e.,
MPI_COMM_WORLD size) for some initialization values in the v1.9/2.0
series; still need to figure out something better there.
This commit was SVN r32365.
Adapt to moving down to OPAL: find a PML-registered error callback,
and use that when we don't have a module context and we need to abort.
Failing that, just call exit().
This commit was SVN r32362.
If --with-usnic is specified and we can't build the usnic BTL, abort.
If --without-usnic is specified, gracefully skip building the usnic
BTL. If neither is specified, do the OMPI-default behavior: try to
configure/build the usnic BTL, and if we can't, skip it.
Fixes CSCuq13889.
This commit was SVN r32349.
The only user of this code was coll/sm. I implemented a basic replacement
for the removed code. This gets the trunk compiling again with
--disable-dlopen.
This commit was SVN r32333.
common/ofacm is only used by the iboffload code in ompi. This code does
not currently work so it is safe to ignore these components until it is
fixed.
This commit was SVN r32331.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
new automake requires subdirs-object directive, to resolve this:
09:43:37 automake: warning: possible forward-incompatibility.
09:43:37 automake: At least a source file is in a subdirectory, but the 'subdir-objects'
09:43:37 automake: automake option hasn't been enabled. For now, the corresponding output
09:43:37 automake: object file(s) will be placed in the top-level directory. However,
09:43:37 automake: this behaviour will change in future Automake versions: they will
09:43:37 automake: unconditionally cause object files to be placed in the same subdirectory
09:43:37 automake: of the corresponding sources.
09:43:37 automake: You are advised to start using 'subdir-objects' option throughout your
09:43:37 automake: project, to avoid future incompatibilities.
09:43:37 tools/otfmerge/Makefile.common:13: warning: source file '$(OTFMERGESRCDIR)/otfmerge.c' is in a subdirectory,
09:43:37 tools/otfmerge/Makefile.common:13: but option 'subdir-objects' is disabled
cmr=v1.8.2:reviewer=ompi-rm1.8
This commit was SVN r32225.
new flag to ompi_info that allows a user to print all MCA variables of a specific type.
--type version_string
This command will print all MCA variables of type version_string.
This feature was developed by Elena Shipunova and was reviewed by Josh Ladd.
This commit was SVN r32166.
mpirun ... -x env_foo1=val1 -x env_foo2 -x env_foo3=val3 should now be expressed as
mpirun ... -mca mca_base_env_list env_foo1=val1+env_foo2+env_foo3=val3.
The motivation for doing this is so that a list of environment variables may be set via standard MCA mechanisms such as mca parameter files, amca lists, etc.
This feature was developed by Elena Shipunova and was reviewed by Josh Ladd.
This commit was SVN r32163.
After discussion with Nathan, change the ERR_VALUE_OUT_OF_BOUNDS to be
ERR_NOT_FOUND, for two reasons:
1. It's consistent with other uses of ERR_NOT_FOUND in the code.
1. In this case, we could have just looked up a variable that is
basically a "hole" -- e.g., var indexes 0, 1, 2 are valid, and 4,
5, and 5 are valid, but 3 is invalid (e.g., 3 was de-registered --
remember that MPI_T explicitly does not allow re-using indexes).
So returning ERR_OUT_OF_BOUNDS seems weird -- returning
ERR_NOT_FOUND seems a bit more natural.
cmr=v1.8.2:ticket=trac:4587
This commit was SVN r32158.
The following Trac tickets were found above:
Ticket 4587 --> https://svn.open-mpi.org/trac/ompi/ticket/4587
The alps ras and plm components were broken by recent changes in ORTE. This
commit resolves those issues.
Changes:
- Define PMI2_SUCCESS if it isn't defined. This fixes a problem with Cray's
PMI implementation which does not define (for some reason) PMI2_SUCCESS. We
had previously just used PMI_SUCCESS.
- Add missing definition and a typo in pml_alps_module.
- launch_id is no longer available in the orte_node_t structure. Use the
attribute lookup to get the value.
- Do not use an O(n^2) sorting algorithm when putting alps nodes in order. Use
opal_list_sort instead (O(nlogn)).
This commit was SVN r32076.
(a) default binding policy is in effect. In this case, we will emit a
warning and default to not binding unless the user provided the
"oversubscribe" or "overload" modifier to the "bind-to" option.
(b) user-specified binding policy is in effect. In this case, we will
error out unless the user provided the "oversubscribe" or "overload"
modifier to the "bind-to" option as we cannot meet the directive.
Either "bind-to" modifier (oversubscribe or overload) will be accepted for
now - in 1.9, we will deprecate the "overload" term in favor of
"oversubscribe".
Also added the ability to accept a --bind-to modifier without specifying the binding policy itself so a user can specify overload-allowed with the default policy.
Closes trac:4345
cmr=v1.8.2:reviewer=rhc:subject=resolve handling of overload conditions
This commit was SVN r32005.
The following Trac tickets were found above:
Ticket 4345 --> https://svn.open-mpi.org/trac/ompi/ticket/4345
* allow users to specify just a modifier for map-by instead of requiring that they also specify a policy. Thus, we now accept --map-by :pe=3 as indicating that we should use the default mapping policy, but bind 3 cpus/proc.
* if users specify a pe's/proc but no policy, default to --map-by NUMA to ensure we have access to multiple cpus for the request. This won't guarantee we have access to enough to meet the request, but gives us a chance. In addition, we know that binding a proc to multiple cpus will work best if those cpus are all in the same NUMA, so this provides some degree of optimized behavior.
Per a request from Jeff, define "oversubscribe" for binding as a synonym for the "overload" modifier.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31967.
Two leaks are fixed by this commit:
- opal_dss.lookup_data_type returns an allocated string. Free it.
- opal_ifaddrtokindex was leaking a struct addrinfo. Ensure that is
released before returning.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31777.
Either we need to ensure that all events are deleted (which may prove difficult), or find another way to release the base memory.
This commit was SVN r31775.
The following SVN revision numbers were found above:
r31765 --> open-mpi/ompi@75fb6dbba9
This fixes a series of leaks identified by valgrind. Since
opal_event_base_open was creating the event base it follows that
opal_event_base_close should release the event base.
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31765.
Bring down 3aa0ed6 from the hwloc v1.7 branch: Stevens says we should
GETFD before we SETFD, so we do
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31683.
top_ompi_srcdir -> OMPI_TOP_SRCDIR
top_ompi_builddir -> OMPI_TOP_BUILDDIR
We also split the srcdir/builddir flags according to their local tree (e.g., OPAL_TOP_SRCDIR), and tied them all together in configure.ac. Renamed ompi_ignore and ompi_unignore to be opal_<foo> as these are agnostic markers.
Only thing left is ompilibdir being treated similar to what we dif for srcdir/builddir. Coming soon.
This commit was SVN r31678.
- This line, and those below, will be ignored--
M opal/mca/event/libevent2021/configure.m4
M opal/mca/hwloc/hwloc172/configure.m4
M configure.ac
M config/opal_setup_libltdl.m4
M config/opal_check_visibility.m4
M config/opal_setup_cc.m4
This commit was SVN r31637.
Several fixes:
- I was allowing an MPI_T_cvar_handle to be created for an invalid
variable. Fixed this by checking if the variable is valid in
mca_base_var_get.
- Use a better error code when the caller tries to create an unbound
pvar handle for a bound variable.
- Return the verbosity level in MPI_T_cvar_get_info.
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31576.
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
Make sure that an internal, long-lived hwloc fd is marked as
close-on-exec so that children don't inherit it. This patch is
committed upstream in the hwloc master and v1.9 branches as 7489287
and b654e19, respectively. The patch applied here is the exact same
logic, but the surrounding code changed slightly since the hwloc v1.7
series, so the patch doesn't apply cleanly.
Refs trac:4550
This commit was SVN r31511.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
Aside from comments, these help files are empty -- so just remove
them. Also remove a few inclusions of show_help.h, since it isn't
used in those source files.
This commit was SVN r31328.
The contents of this file were superceded by mca-help-var.txt.
Indeed, this file was already removed from Makefile.am; it looks like
it was accidentally left in SVN.
This commit was SVN r31327.
Also:
* fix formatting error at top (make the emacs format line begin
with a comment character)
* Remove blank lines between topics and replace them with empty
comment lines (so that the help messages are not displayed with a
trailing blank line)
This commit was SVN r31324.
Also, add missing ORTE_ERROR_LOG in the other case where this error
message is used (i.e., ORTE_ERROR_LOG was used in the one place, so
let's also use it in the other place).
This commit was SVN r31321.
This type of help message is now handled by the MCA var system itself
(via the MCA_BASE_VAR_FLAG_ENVIRONMENT_ONLY flag at var_register
time).
This commit was SVN r31318.
add -mca base_env_list "var1=val1 var2=val2 ..." mca parameter that can be used in mca param files
or with -am app.conf mpirun commandline to set rank env variables with mca mechanism
fixed by Elena, reviewed by Miked
cmr=v1.8.1:reviewer=ompi-rm1.8
This commit was SVN r31302.
Also, since I put some of the macros for these silent/verbose rules up
in the top-level Makefile.man-page-rules file, I renamed it to
Makefile.ompi-rules.
I've had this sitting around for a while; now seems like as good a
time as any to commit it.
This commit was SVN r31271.
on some linux distro (sles11sp2) csh fails to parse $LS_COLORS and borks with error:
Unknown colorls variable `mh'.
The workaround is to unset LS_COLORS before calling to csh script
reviewed by Jeff
cmr=v1.8:reviewer=ompi-rm1.8
This commit was SVN r31244.
an enumerator's mca_base_var_enum_sfv_fn_t can be NULL.
cmr=v1.7.5:ticket=trac:4398:reviewer=ompi-gk1.7
This commit was SVN r31085.
The following Trac tickets were found above:
Ticket 4398 --> https://svn.open-mpi.org/trac/ompi/ticket/4398
In the OPAL_ENABLE_FT_CR code path there used to be a variable
'mca_base_component_distill_checkpoint_ready' which got removed.
The FT code was not compiling and while trying to get it to compile
again the old variable was #ifdef'd out. This re-introduces the
variable with a new name 'opal_base_distill_checkpoint_ready'
and enables the code previously #ifdef'd out.
This removes the last hack introduced to get the FT code to compile
again.
This commit was SVN r30928.
1. Changed rng_buff_t --> opal_rng_buff_t
2. All global variables obey the prefix rule
3. Old code has been removed
4. Found a couple of unnecessary includes
Refs trac:4298
This commit was SVN r30807.
The following Trac tickets were found above:
Ticket 4298 --> https://svn.open-mpi.org/trac/ompi/ticket/4298
This adds the code to actually checkpoint a process using CRIU
with the necessary variables to control the behaviour.
Right now only --np 1 is supported and --mca oob tcp.
Following parameters are supported:
* crs_criu_log: name of the log file
* crs_criu_log_level: verbosity level in the log file
* crs_criu_tcp_established: C/R established TCP connections
* crs_criu_shell_job: C/R shell jobs
* crs_criu_ext_unix_sk: allow external unix connections
* crs_criu_leave_running: leave tasks in running state after checkpoint
This commit was SVN r30772.
* Remove redundant/unnecessary uses of $2
* Change a bunch of logic from negative to positive
* Use OPAL_VAR_SCOPE_PUSH/POP to help reduce env var usage
* Only use "" in test statements with strings that require sanitization
* Removed redundant AC_MSG_WARN/ERROR. There's now only one check at
the bottom for whether the component is "good" or not. We'll
AC_MSG_WARN/ERROR in that one location.
Thanks to Jeff Squyres for this patch.
This commit was SVN r30739.
To be able to checkpoint/restart using criu (criu.org) a new
CRS component is added which is based on criu. This first commit
provides the minimal set of functions and configure script options
to enable --with-criu and link against libcriu.so.
No actual checkpoint/restart functionality is yet implemented.
This is only the framework which needs to be filled with the
actual functionality.
This commit was SVN r30666.