converting an opal_process_name_t means the loss of one bit,
it was decided to restrict the local job id to 15 bits, so the
useful information of an opal_process_name_t can fit in 63 bits.
This commit changes the OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK calls in
ompi/proc to opal_mutex_lock/opal_mutex_unlock. This will allow
multi-threaded BTLs the ability to creat ompi_proc_t's without having
to set opal_using_threads. There should be no performance hits as none
of the lock points are in the critical path.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds two new functions:
- ompi_proc_get_allocated - Returns all procs in the current job that
have already been allocated. This is used in init/finalize to
determine which procs to pass to add_procs/del_procs.
- ompi_proc_world_size - returns the number of processes in
MPI_COMM_WORLD. This may be removed in favor of callers just
looking at ompi_process_info.
The behavior of ompi_proc_world has been restored to return
ompi_proc_t's for all processes in the current job. The use of this
function is discouraged.
Code that was using ompi_proc_world() has been updated to make use of
the new functions to avoid the memory overhead of ompi_comm_world ().
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Most functionality of oshmem_proc duplicates ompi_proc. In addition
to that, Current logic does not allow to do oshmem initialization
w/o ompi startup.
So this refactoring allows to avoid code duplication, decrease used
memory and make oshmem support easier.
Now oshmem_proc is transparent ompi_proc structure, that can be
extended by oshmem specific data.
Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>
The assumption that the high bit is not in use in pointers on any of our
supported platforms was incorrect. A better assumption is that all
ompi_proc_t pointers will be at least 2-byte aligned. This allows us
to use the low bit. To do this we drop the highest bit of the
opal_process_name_t jobid (hope this is ok) and use the low bit to
indicate the proc is really a sentinel.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit modifies the ompi_group_t union/difference code to compare/copy the
raw group values. This will either be a ompi_proc_t or a sentinel value. This
commit also adds helper functions to convert between opal process names and
sentinel values.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
This commit adds an opal hash table to keep track of mapping between
process identifiers and ompi_proc_t's. This hash table is used by the
ompi_proc_by_name() function to lookup (in O(1) time) a given
process. This can be used by a BTL or other component to get a
ompi_proc_t when handling an incoming message from an as yet unknown
peer.
Additionally, this commit adds a new MCA variable to control the new
add_procs behavior: mpi_add_procs_cutoff. If the number of ranks in
the process falls below the threshold a ompi_proc_t is created for
every process. If the number of ranks is above the threshold then a
ompi_proc_t is only created for the local rank. The code needed to
generate additional ompi_proc_t's for a communicator is not yet
complete.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
* redefine orte_process_name_t so it can be converted
between host and network format as an opal_identifier_t
aka uint64_t by the OPAL layer.
* correctly send OPAL_DSTORE_ARCH key
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “lmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
This conservative fixes tries to fetch info from both
opal_dstore_nonpeer and opal_dstore_peer.
This is required is task A spawns tasks B and C.
B was previously unable to find info from C, this caused locality
info not being set and a hang in coll/ml init.
no CMR is required since v1.8 uses a unique dstore
This commit was SVN r31923.
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
This provides full locality - i.e., not just node-level, but all the way down to whatever common binding level exists between the procs.
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31106.
NOTE: I transferred the oshmem-disabled-by-default from the 1.7 branch to the trunk to minimize future disruption if/when we change that option.
cmr=v1.8:reviewer=jsquyres
This commit was SVN r31006.