- Remove an old comment from crcp_base_fns.c
- Let ob1 have its very own ft_event function (which I'll fill in shortly)
- Make sure ob1 finalizes the bsend stuff so we don't leave a bunch of memory sitting around
- PML base - destruct the array upon finalize. Shrink the include search so it stops after finding a match
This commit was SVN r14222.
when on the Cray machine (aka when the NULL GPR is in use).
This commit was SVN r13638.
The following SVN revision numbers were found above:
r13582 --> open-mpi/ompi@041beeb1b6
I found only two places that were looking at the tokens:
1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.
2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "stripped" the value structures of tokens and segment strings, and (c) correctly obtained process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).
This commit was SVN r12813.
* Do not add new procs to the global list during modex callback or
when sharing orte names during accept/connect. For modex, we
cache the modex info for later, in case that proc ever does get
added to the global proc list. For accept/connect orte name
exchange between the roots, we only need the orte name, so no
need to add a proc structure anyway. The procs will be added
to the global process list during the proc exchange later in
the wireup process
* Rename proc_get_namebuf and proc_get_proclist to proc_pack
and proc_unpack and extend them to include all information
needed to build that proc struct on a remote node (which
includes ORTE name, architecture, and hostname). Change
unpack to call pml_add_procs for the entire list of new
procs at once, rather than one at a time.
* Remove ompi_proc_find_and_add from the public proc
interface and make it a private function. This function
would add a half-created proc to the global proc list, so
making it harder to call is a good thing.
This means that there's only two ways to add new procs into the global proc list at this time: During MPI_INIT via the call to ompi_proc_init, where my job is added to the list and via ompi_proc_unpack using a buffer from a packed proc list sent to us by someone else. Currently, this is enough to implement MPI semantics. We can extend the interface more if we like, but that may require HNP communication to get the remote proc information and I wanted to avoid that if at all possible.
Refs trac:564
This commit was SVN r12798.
The following Trac tickets were found above:
Ticket 564 --> https://svn.open-mpi.org/trac/ompi/ticket/564
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).
Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).
I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).
In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...
Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.
This commit was SVN r11204.
- move files out of toplevel include/ and etc/, moving it into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h
mpi.h, and mpif.h)
This commit was SVN r8985.
any NICs to use
* Make mvapi, gm, and mx components all publish information, even if there
are no NICs available so that modex_recv doesn't hang. If there are no
NICs available, don't set the reachable bit, but don't do anything
to fail. This unfortunately doesn't cover the hangs that will result if
different procs load different sets of components, but it's a start
This commit was SVN r7550.
1. Added OMPI_PROC_ARCH as a defined registry key and added the code so that the architecture info gets properly transmitted across all processes using the startup message.
2. Added an OMPI_MODEX_KEY definition and removed the hard-coded "modex" key from pml_modex_exchange
This commit was SVN r7129.
Here's the huge registry check-in you've all been waiting for with baited breath. The revised version sends a single message to all processes at the various stage gates, thus making the startup much more scalable. I could provide you with all the tawdry details, but won't for now - you are welcome to ask, though, and I'll merrily bore your ears to tears.
In addition, the commit contains the following:
1. set the ignore properties on ompi/debuggers and orte/mca/pls/poe
2. Added simplified subscribe and put functions to the registry's API. I have also converted all of the ompi functions that registered subscriptions to the new API, and caught their associated put's as well.
In a follow-on commit, I'll be adding support for George's hetero arch registry subscription (wanted to get this one in first).
This commit was SVN r7118.
Change all the places where they are used to fit the new name.
Remove the code to check the remote arch from the PML. We will have a GPR mechanism
in ompi_mpi_initialize to do that.
This commit was SVN r6750.
1. Modify the registry to eliminate redundant data copying for startup messages.
2. Revise the subscription/trigger system to avoid redundant storage of triggers and subscriptions. This dramatically reduces the search time when a registry action occurs - to illustrate the point, there are now only a handful of triggers on the system for each job. Before, there were a handful of triggers for each PROCESS in the job, all of which had to be checked every time something happened on the registry. This is much, much faster now.
3. Update all subscriptions to the new format. There are now "named" subscriptions - this allows you to "name" a subscription that all the processes will be using. The first one to hit the registry actually defines the subscription. From then on, any subsequent "subscribes" to the same name just cause that process to "attach" to the existing subscription. This keeps the number of subscriptions being tracked by the registry to a minimum, while ensuring that each process still gets notified.
4. Do the same for triggers.
Also fixed a duplicate subscription problem that was causing people to receive data equal to the number of processes times the data they should have received from a trigger/subscription. Sorry about that... :-( ...but it's all better now!
Uncovered a situation where the modex data seems to be getting entered on the registry a second time - the latter time coming after the compound command has been "fired", thereby causing all the subscriptions to fire. Asked Tim and Jeff to look into this.
Second phase of the changes will involve modifying the xcast system so that the same message gets sent to all processes. This will further reduce the message traffic, and - once we have a true "broadcast" version of xcast - really speed things up and improve scalability.
This commit was SVN r6542.