inserted in the ompi_proc_list as soon as it is created and it
is removed only upon the call to the destructor. In ompi_proc_finalize
we loop over all procs in ompi_proc_finalize and release them once.
However, as a proc is not removed from this list right away, we
decrease the ref count for each proc until it reach zero and the
proc is finally removed. Thus, we cannot clean the BML/BTL after
the call the ompi_proc_finalize.
A quick fix is to delay the call to ompi_proc_finalize until all
other frameworks have been finalized, and then the behavior
depicted above will give the expected outcome.
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
Update the VERSION file scheme:
* Remove "want_repo_rev".
* Add "tarball_version".
All values are now always included (major, minor, release, greek,
repo_rev). However, configure.ac now runs "opal_get_version.sh
... --tarball", which will return the value of tarball_version (if it
is non-empty) or the "full" version string (i.e.,
"major.minor.releasegreek").
Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup.
This commit was SVN r32738.
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “lmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
also replase the OMPI_CAST_RTE_NAME macro with
an inline function if OPAL_ENABLE_DEBUG, so we can
get warnings from the compiler if ampersand is missing.
Thanks to Paul Hargrove for reporting the bugs
This commit was SVN r32408.
communication library should use to initialize itself.
Ralph will champion this change back with an RFC if there is a realistic
need/use case from the community.
This commit was SVN r32361.
The following SVN revision numbers were found above:
r32355 --> open-mpi/ompi@c903917f47
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch. which required modification
s to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunti
ng.
This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't curren
tly have a way of tagging the collective to a specific thread. From the comment in the code:
* In order to allow scalable
* generation of collective id's, they are formed as:
*
* top 32-bits are the jobid of the procs involved in
* the collective. For collectives across multiple jobs
* (e.g., in a connect_accept), the daemon jobid will
* be used as the id will be issued by mpirun. This
* won't cause problems because daemons don't use the
* collective_id
*
* bottom 32-bits are a rolling counter that recycles
* when the max is hit. The daemon will cleanup each
* collective upon completion, so this means a job can
* never have more than 2**32 collectives going on at
* a time. If someone needs more than that - they've got
* a problem.
*
* Note that this means (for now) that RTE-level collectives
* cannot be done by individual threads - they must be
* done at the overall process level. This is required as
* there is no guaranteed ordering for the collective id's,
* and all the participants must agree on the id of the
* collective they are executing. So if thread A on one
* process asks for a collective id before thread B does,
* but B asks before A on another process, the collectives will
* be mixed and not result in the expected behavior. We may
* find a way to relax this requirement in the future by
* adding a thread context id to the jobid field (maybe taking the
* lower 16-bits of that field).
This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives.
This commit was SVN r32203.
I recently found a case where ompi_mpi_abort() segv's:
{{{
$ mpirun --mca btl non_existent_btl_name ...
}}}
In this case, the BML init fails because we have no paths to any
peers. It calls ompi_mpi_abort(), but this is before ompi_comm_self
has been setup. ompi_mpi_abort() assumes that if the comm parameter
is != NULL, it can be used. But since we aborted so early in
MPI_INIT, that's a false assumption.
(note that this isn't happening on v1.8 because the check for
INIT/FINALIZE in ompi_mpi_abort() is a little different. Hence: this
is a trunk issue -- at least for now)
When fixing this problem, I noticed a few other problems in ompi_mpi_abort():
* the group access was incorrect (it didn't use accessor functions)
* it wasn't clear that ORTE's ompi_rte_abort_peers() returns
NOT_IMPLEMENTED and falls through down to ompi_rte_abort()
* the check for my proc in the communicator was a little more
complicated than necessary
* the logic for checking for aborts early in MPI_INIT wasn't right
* some comments were stale
* the hostname output in error messages would be NULL if MPI_FINALIZE
had been invoked
* it was possible to abort, but still exit with a 0 status
This commit fixes all of the above problems, and makes the logic a
little more straightforward. Thanks to Ralph Castain and George
Bosilca for the assists with this patch.
This commit was SVN r32125.
Based on extensive discussions before/at the June 2014 developer's
meeting, put a lengthy comment explaining a second reason why we
''must'' use an RTE barrier during MPI_FINALIZE and
MPI_COMM_DISCONNECT (i.e., unreliable transports). Slightly explain
more the original reason why we do this, too (BTLs can lie/buffer a
message without actually injecting it on the network).
This commit was SVN r32095.
This commit fixes leaks of bml endpoints in finalize. A summary of the
bugs/fixes is below.
1) ompi_mpi_finalize used ompi_proc_all to get the list of procs but
never released the reference to them (ompi_proc_all called
OBJ_RETAIN on all the procs returned). When calling del_procs at
finalize it should suffice to call ompi_proc_world which does not
increment the reference count.
2) del_procs is called BEFORE ompi_comm_finalize. This leaves the
references to the procs from calling the pml_add_comm
function. The fix is to reorder the calls to do omp_comm_finalize,
del_procs, pml_finalize instead of del_procs, pml_finalize,
ompi_comm_finalize.
3) The check in del_procs in r2 checked for a reference count of
1. This is incorrect. At this point there should be 2 references:
1 from ompi_proc, and another from the add_procs. The fix is to
change this check to look for a reference count of 22. This check
makes me extremely uncomforable as nothing will call del_procs if
the reference count of a procs is not 2 when del_procs is
called. Maybe there should be an assert since this is a developer
error IMHO.
cmr=v1.8.2:reviewer=bosilca
This commit was SVN r31782.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
This commit fixes three leaks:
- bml/r2: fix leak of del_procs in mca_bml_r2_del_procs
- Release the modex data in btl/scif, btl/ugni, and btl/vader
- ompi_mpi_finalize: close the allocator framework
cmr=v1.8.2:reviewer=jsquyres
This commit was SVN r31778.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
So track that the rte has reached that point, and only emit the new message if it is accurate.
Note that we still generate a TON of output for a minor error:
Ralphs-iMac:examples rhc$ mpirun -n 3 -mca btl sm ./hello_c
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[50239,1],2]) is on host: Ralphs-iMac
Process 2 ([[50239,1],2]) is on host: Ralphs-iMac
BTLs attempted: sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[50239,1],2]
Exit code: 1
--------------------------------------------------------------------------
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[Ralphs-iMac.local:23227] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[Ralphs-iMac.local:23227] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
Ralphs-iMac:examples rhc$
Hopefully, we can agree on a way to reduce this verbage!
This commit was SVN r31686.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
Also added some missing values and sentinels.
cmr=v1.8:ticket=trac:4470
This commit was SVN r31263.
The following SVN revision numbers were found above:
r31260 --> open-mpi/ompi@69036437b7
The following Trac tickets were found above:
Ticket 4470 --> https://svn.open-mpi.org/trac/ompi/ticket/4470
NOTE: launch performance will be absolutely awful if you do this with BTLs that aren't configured to modex_recv on first message!
Even with "modex on demand", we still have to do a barrier in place of the modex - we simply don't move any data around, which does reduce the time impact. The barrier is required to ensure that the other proc has in fact registered all its BTL info and therefore is prepared to hand over a complete data package. Otherwise, you may not get the info you need. In addition, the shared memory BTL can fail to properly rendezvous as it expects the barrier to be in place.
This behavior will *only* take effect under the following conditions:
1. launched via mpirun
2. #procs is greater than ompi_hostname_cutoff, which defaults to UINT32_MAX
3. mca param rte_orte_direct_modex is set to 1. At the moment, we are having problems getting this param to register properly, so only the first two conditions are in effect. Still, the bottom line is you have to *want* this behavior to get it.
The planned next evolution of this will be to make the direct modex be non-blocking - this will require two fixes:
1. if the remote proc doesn't have the required info, then let it delay its response until it does. This means we need a way for the MPI layer to tell the RTE "I am done entering modex data".
2. adjust the SM rendezvous logic to loop until the required file has been created
Creating a placeholder to bring this over to 1.7.5 when ready.
cmr=v1.7.5:reviewer=hjelmn:subject=Enable direct modex at scale
This commit was SVN r30259.
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
* automatically retrieve the hostname (and all RTE info) for all procs during MPI_Init if nprocs < cutoff
* if nprocs > cutoff, retrieve the hostname (and all RTE info) for a proc upon the first call to modex_recv for that proc. This would provide the hostname for debugging purposes as we only report errors on messages, and so we must have called modex_recv to get the endpoint info
* BTLs are not to call modex_recv until they need the endpoint info for first message - i.e., not during add_procs so we don't call it for every process in the job, but only those with whom we communicate
My understanding is that only some BTLs have been modified to meet that third requirement, but those include the Cray ones where jobs are big enough that launch times were becoming an issue. Other BTLs would hopefully be modified as time went on and interest in using them at scale arose. Meantime, those BTLs would call modex_recv on every proc, and we would therefore be no worse than the prior behavior.
This commit revises the MPI-RTE interface to pass the ompi_proc_t instead of the ompi_process_name_t for the proc so that the hostname can be easily inserted. I have advised the ORNL folks of the change.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix thread deadlock
This commit was SVN r29931.
The following SVN revision numbers were found above:
r29917 --> open-mpi/ompi@1a972e2c9d
Reworked ompi_info tool to be close with orte_info implementation.
ompi_info_register_types(), ompi_info_close_components() and
ompi_info_show_ompi_version() are moved to runtime/ompi_info_support.c.
Added runtime/oshmem_info_support layer that exports following api to be
used into oshmem_info tool as
oshmem_info_register_types()
oshmem_info_register_framework_params()
oshmem_info_close_components()
oshmem_info_show_oshmem_version()
These functions call ompi_info_support related interfaces as long as
Oshmem supports Open MPI/SHMEM combination.
Now orte_info/ompi_info/oshmem_info have identical implementation approach.
Possible improvement:
OSHMEM processing of --config option is the same as OMPI`s (code is duplicated).
Probably list of info_support interfaces can be extended by xxx_info_do_config().
developed by Igor, reviewed by miked
This commit was SVN r29429.
Fix two problems that surfaced when using direct launch under SLURM:
1. locally store our own data because some BTLs want to retrieve
it during add_procs rather than use what they have internally
2. cleanup MPI_Abort so it correctly passes the error status all
the way down to the actual exit. When someone implemented the
"abort_peers" API, they left out the error status. So we lost
it at that point and *always* exited with a status of 1. This
forces a change to the API to include the status.
cmr:v1.7.3:reviewer=jsquyres:subject=Fix MPI_Abort and modex_recv for direct launch
This commit was SVN r29405.
the fortran handle. Use a seperate opal_pointer_array to keep track of
the fortran handles of communicators.
This commit also fixes a bug in ompi_comm_idup where the newcomm was not
set until after the operation completed.
cmr=v1.7.4:reviewer=jsquyres:ticket=trac:3796
This commit was SVN r29342.
The following Trac tickets were found above:
Ticket 3796 --> https://svn.open-mpi.org/trac/ompi/ticket/3796
MPI_Comm_idup.
As part of this work I implemented a basic request scheduler in
ompi/comm/comm_request.c. This scheduler might be useful for more
than just communicator requests and could be moved to ompi/request
if there is a demand. Otherwise I will leave it where it is.
Added a non-blocking version of ompi_comm_set to support ompi_comm_idup.
The call makes a recursive call to comm_dup and a non-blocking version
was needed. To simplify the code the blocking version calls the nonblocking
version and waits on the resulting request if one exists.
cmr=v1.7.4:reviewer=jsquyres:ticket=trac:3796
This commit was SVN r29334.
The following Trac tickets were found above:
Ticket 3796 --> https://svn.open-mpi.org/trac/ompi/ticket/3796
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***
Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.
***************************************************************************************
I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.
The code is in https://bitbucket.org/rhc/ompi-oob2
WHAT: Rewrite of ORTE OOB
WHY: Support asynchronous progress and a host of other features
WHEN: Wed, August 21
SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:
* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)
* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients
* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort
* only one transport (i.e., component) can be "active"
The revised OOB resolves these problems:
* async progress is used for all application processes, with the progress thread blocking in the event library
* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")
* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.
* a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.
* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object
* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions
* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel
* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport
* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active
* all blocking send/recv APIs have been removed. Everything operates asynchronously.
KNOWN LIMITATIONS:
* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline
* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker
* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways
* obviously, not every error path has been tested nor necessarily covered
* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways
* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC
This commit was SVN r29058.
So shift the cutoff param to the MPI layer, and have it solely determine whether or not we call modex_recv on the hostname. If comm_world is of size greater than the cutoff, then we don't automatically retrieve the hostname when we build the ompi_proc_t for a process - instead, we fill the hostname entry on first call to modex_recv for that process.
The param is now "ompi_hostname_cutoff=N", where N=number of procs for cutoff.
Refs trac:3729
This commit was SVN r29056.
The following Trac tickets were found above:
Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729
* add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit.
* remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL"
* modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded
* removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base
* added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames
This commit was SVN r29052.
George and I were talking about ORTE's error handling the other day in regards to the right way to deal with errors in the updated OOB. Specifically, it seemed a bad idea for a library such as ORTE to be aborting the job on its own prerogative. If we lose a connection or cannot send a message, then we really should just report it upwards and let the application and/or upper layers decide what to do about it.
The current code base only allows a single error callback to exist, which seemed unduly limiting. So, based on the conversation, I've modified the errmgr interface to provide a mechanism for registering any number of error handlers (this replaces the current "set_fault_callback" API). When an error occurs, these handlers will be called in order until one responds that the error has been "resolved" - i.e., no further action is required - by returning OMPI_SUCCESS. The default MPI layer error handler is specified to go "last" and calls mpi_abort, so the current "abort" behavior is preserved unless other error handlers are registered.
In the register_callback function, I provide an "order" param so you can specify "this callback must come first" or "this callback must come last". Seemed to me that we will probably have different code areas registering callbacks, and one might require it go first (the default "abort" will always require it go last). So you can append and prepend, or go first. Note that only one registration can declare itself "first" or "last", and since the default "abort" callback automatically takes "last", that one isn't available. :-)
The errhandler callback function passes an opal_pointer_array of structs, each of which contains the name of the proc involved (which can be yourself for internal errors) and the error code. This is a change from the current fault callback which returned an opal_pointer_array of just process names. Rationale is that you might need to see the cause of the error to decide what action to take. I realize that isn't a requirement for remote procs, but remember that we will use the SAME interface to report RTE errors internal to the proc itself. In those cases, you really do need to see the error code. It is legal to pass a NULL for the pointer array (e.g., when reporting an internal failure without error code), so handlers must be prepared for that possibility. If people find that too burdensome, we can remove it.
Should we ever decide to create a separate callback path for internal errors vs remote process failures, or if we decide to do something different based on experience, then we can adjust this API.
This commit was SVN r28852.
This patch reshape the way we deal with topologies completely. Where
our topologies were mainly storage components (they were not capable
of creating the new communicator), the new version is built around a
[possibly] common representation (in mca/topo/topo.h), but the functions
to attach and retrieve the topological information are specific to each
component. As a result the ompi_create_cart and ompi_create_graph functions
become useless and have been removed.
In addition to adding the internal infrastructure to manage the topology
information, it updates the MPI interface, and the debuggers support and
provides all Fortran interfaces.
This commit was SVN r28687.
Notes:
- This commit also eliminates the need for an available components list in use
in several frameworks. None of the code in question was making use of the
priority field of the priority component list item so these extra lists were
removed.
- Cleaned up selection code in several frameworks to sort lists using opal_list_sort.
- Cleans up the ompi/orte-info functions. Expose the functions that construct the
list of params so they can be used elsewhere.
patches for mtl/portals4 from brian
missed a few output variables in openib
This commit was SVN r28241.
Features:
- Support for an override parameter file (openmpi-mca-param-override.conf).
Variable values in this file can not be overridden by any file or environment
value.
- Support for boolean, unsigned, and unsigned long long variables.
- Support for true/false values.
- Support for enumerations on integer variables.
- Support for MPIT scope, verbosity, and binding.
- Support for command line source.
- Support for setting variable source via the environment using
OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
- Cleaner API.
- Support for variable groups (equivalent to MPIT categories).
Notes:
- Variables must be created with a backing store (char **, int *, or bool *)
that must live at least as long as the variable.
- Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
mca_base_var_set_value() to change the value.
- String values are duplicated when the variable is registered. It is up to
the caller to free the original value if necessary. The new value will be
freed by the mca_base_var system and must not be freed by the user.
- Variables with constant scope may not be settable.
- Variable groups (and all associated variables) are deregistered when the
component is closed or the component repository item is freed. This
prevents a segmentation fault from accessing a variable after its component
is unloaded.
- After some discussion we decided we should remove the automatic registration
of component priority variables. Few component actually made use of this
feature.
- The enumerator interface was updated to be general enough to handle
future uses of the interface.
- The code to generate ompi_info output has been moved into the MCA variable
system. See mca_base_var_dump().
opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system
This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.
This commit was SVN r28236.
At the same time, fix a minor issue where the init hook was being called twice, once by the libc malloc and once by our malloc by removing the call from our malloc.
This commit was SVN r28202.
library to multiple libraries that are implicitly sucked into the executable
as a dependency of libmpi. The initialize hook isn't visible to libc on some
linux distributions when it's in libopal and libopal isn't explicity linked
into the executable. The fix is to have a duplicate initialize hook in
libmpi as well as libopal. *sigh*.
This commit was SVN r28164.
ompi_show_help, because opal_show_help is replaced with an
aggregating version when using ORTE, so there's no reason to
directly call orte_show_help.
This commit was SVN r28051.
actually care if opal_pointer_array is limited to handle_max already passes
that in as the max_size during init, so don't need it there. The arch
constant was a bit more difficult, so pass that in during MPI init and
leave empty otherwise.
This is to help with the effort to allow building ompi against an external
opal or orte.
This commit was SVN r27817.
* Minor man page tweaks
* Use existing ompi_mpi_thread_requested global
This commit was SVN r27308.
The following Trac tickets were found above:
Ticket 3309 --> https://svn.open-mpi.org/trac/ompi/ticket/3309
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context
Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.
This commit was SVN r27005.
aren't separated out into individual commits; they represent a few
months of work in the Mercurial branch, and it seemed error-prone to
try to break them up into multiple SVN commits.
* Remove 2nd overloaded interfaces for MPI_TESTALL, MPI_TESTSOME,
MPI_WAITALL, and MPI_WAITSOME in the "mpi" module implementations
(because we're not allowed to have them, anyway -- it causes
complications in the profiling interface). This forced an MPI-2.2
errata in the MPI Forum; we applied the errata here (the array of
statuses parameter could not have a specific dimension specified in
the dummy argument). Fixes trac:3166.
* Similarly, fix type for MPI_ARGVS_NULL in Fortran
* Add MPI_3.0 function MPI_F_SYNC_REG (Fortran interfaces only).
* Add MPI-3.0 MPI_MESSAGE_NO_PROC in the mpi_f08 module.
* Added mpi_f08 handle comparison operators, per MPI-3.0 addendum to
the F08 proposal at the last Forum meeting.
* Added missing type(MPI_File) and type(Message) in mpi_f08 module.
* Fix --disable-mpi-io configure switch with all Fortran interfaces
* Re-factor the Fortran header files to be fundamentally simpler and
easier to maintain. Fortran constant values in the header files
are now generated by a script named mpif-values.pl during
autogen.pl (they were previously generated by mpif-common.pl, but
it was quite a bit more subtle/complex). A second commit will
follow this one to update svn:ignore values (just to ensure we
don't muck up the first commit with the SVN client getting confused
by the changed ignore values and new/changed files).
* Fix some dependencies for compile ordering in
ompi/mpi/fortran/use-mpi-ignore-tkr/Makefile.am.
* Fix bad wording in several places (.m4 file name, ompi_info output,
etc.): we previoulsy said "F08 assumed shape" when we really meant
"F08 assumed rank" (for Fortran gurus, those are very different
things).
* Removed the GREEK/SVN version string from mpif.h. It really had no
purpose being there.
Still to be done:
* Handling of 2D array of strings in MPI_COMM_SPAWN_MULTIPLE still
isn't right yet. Not sure how many people really care about this
:-), but it is still broken.
This commit was SVN r26997.
The following Trac tickets were found above:
Ticket 3166 --> https://svn.open-mpi.org/trac/ompi/ticket/3166
Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.
Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.
This commit was SVN r26678.
1. they modified the OMPI-ORTE interface, which is something I promised to avoid doing unless absolutely necessary, and
2. the framework ident is already in the component name key provided to the modex db. What is missing is the project ident, but as Jeff and I discussed last week, we really need to add that field to the component struct anyway to avoid multi-project collisions on framework names. That will be done over the next couple of weeks as a separate effort.
This commit was SVN r26613.
The following SVN revision numbers were found above:
r26600 --> open-mpi/ompi@5ba4deff07
r26601 --> open-mpi/ompi@0e3094c318