fix (r14749) and then backed it out (r14753).
As we are unable to send more than a 32-bit length over TCP in one go, there
is no reason to have a uint64 length in the header. This reduces the size
of the TCP header.
This commit was SVN r14755.
The following SVN revision numbers were found above:
r14749 --> open-mpi/ompi@48c026ce6b
r14753 --> open-mpi/ompi@28ed850b4c
r14703 for the point-to-point component.
* Associate the list of long message requests to poll with the
component, not the individual modules
* add progress thread that sits on the OMPI request structure
and wakes up at the appropriate time to poll the message
list to move long messages asynchronously.
* Instead of calling opal_progress() all over the place, move
to using the condition variables like the rest of the project.
This has the advantage of moving it slightly further toward
being thread safe.
* Fix a problem with the passive side of unlock where it could
go recursive and cause all kinds of problems, especially
when progress threads are used. Instead, have two parts of
passive unlock -- one to start the unlock, and another to
complete the lock and send the ack back. The data moving
code trips the second at the right time.
This commit was SVN r14751.
The following SVN revision numbers were found above:
r14703 --> open-mpi/ompi@2b4b754925
mca_btl_tcp_hdr_t struct and remove the need for the heterogeneous
padding by changing the type of the "size" member to be uint32_t
(vs. uint64_t). The value would never be greater than 32 bits anyway,
so having the type be uint64_t was wasteful.
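A minimal sketch of why narrowing the field can shrink the header. The two structs below are hypothetical layouts for illustration only; the field names other than "size" are assumptions and are not copied from mca_btl_tcp_hdr_t. On common 64-bit ABIs the 8-byte alignment of a uint64_t member forces padding before it, which goes away once the member is a uint32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header with a 64-bit size field (illustrative fields). */
struct hdr_with_u64 {
    uint8_t  type;
    uint16_t count;
    uint64_t size;   /* typically aligned to 8 bytes, so padding precedes it */
};

/* Same hypothetical header with the size field narrowed to 32 bits. */
struct hdr_with_u32 {
    uint8_t  type;
    uint16_t count;
    uint32_t size;   /* typically aligned to 4 bytes */
};

/* Number of bytes saved per header by narrowing the size field. */
static size_t hdr_bytes_saved(void)
{
    assert(sizeof(struct hdr_with_u32) <= sizeof(struct hdr_with_u64));
    return sizeof(struct hdr_with_u64) - sizeof(struct hdr_with_u32);
}
```

The exact saving depends on the compiler and ABI, so the sketch computes it with sizeof rather than assuming a fixed number.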
This commit was SVN r14749.
Rename the oob_tcp_include and oob_tcp_exclude MCA parameters to be
oob_tcp_if_include and oob_tcp_if_exclude (to match the convention
with btl_tcp_if_[in|ex]clude). Keep "hidden" synonyms oob_tcp_include
and oob_tcp_exclude in case anyone is actually using them (and some
users undoubtedly are), but do not have them show up in ompi_info
--param output. Instead, the new "oob_tcp_if_*" names will show up in
ompi_info output.
This commit was SVN r14746.
wireup. For small clusters or clusters with decent ARP lookup and
connect times, this will have marginal impact. For systems with either
bad ARP lookup times or long connect times, increasing this number
to something much closer to SOMAXCONN (128 on most modern machines) will
result in a faster OOB wireup. Don't set higher than SOMAXCONN or you
can end up with lots of connect() retries and we'll end up slower.
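A minimal sketch of the advice above, assuming the value is an integer limit supplied by the user. The helper name clamp_connect_limit and its argument are hypothetical and are not part of the OOB code; only SOMAXCONN (from <sys/socket.h>) is a real constant.

```c
#include <assert.h>
#include <sys/socket.h>

/* Hypothetical helper: keep a user-requested limit within [1, SOMAXCONN]. */
static int clamp_connect_limit(int requested)
{
    int limit = requested;

    if (limit < 1) {
        limit = 1;              /* at least one connection must be allowed */
    }
    if (limit > SOMAXCONN) {
        limit = SOMAXCONN;      /* going higher only causes connect() retries */
    }
    assert(limit >= 1 && limit <= SOMAXCONN);
    return limit;
}
```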
This commit was SVN r14742.
* Remove unused declaration
* remove unused variable warning when not using progress threads
* If we're using progress threads, we want to lock, not trylock
when in progress, since it was called from the wakeup thread
and not the progress function
This commit was SVN r14739.
to do while(...) { } then we can't change the variables in the ...
atomically, but should do it while holding the module lock.
* Fix dumb communicator creation error when we don't create the progress
stuff (because a window already exists), where we would accidentally
jump to the error case.
This commit was SVN r14715.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
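The get/modify/put equivalence can be sketched as follows. This is not the GPR API: reg_entry_t, arith_op_t, and reg_arith are hypothetical names, and a single mutex-protected value stands in for a registry location. The point is only that the fetch, the arithmetic, and the store happen under one lock, so no other update can slip in between them.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Hypothetical operation codes for the sketch. */
typedef enum { ARITH_ADD, ARITH_SUB, ARITH_MUL, ARITH_DIV } arith_op_t;

/* Hypothetical stand-in for one registry location. */
typedef struct {
    pthread_mutex_t lock;
    long value;
} reg_entry_t;

/* Apply "value = value <op> operand" atomically.
 * Returns 0 on success, -1 on an invalid request (value is left unchanged). */
static int reg_arith(reg_entry_t *entry, arith_op_t op, long operand)
{
    int rc = 0;

    assert(entry != NULL);
    pthread_mutex_lock(&entry->lock);      /* "get" begins here */
    switch (op) {
    case ARITH_ADD: entry->value += operand; break;
    case ARITH_SUB: entry->value -= operand; break;
    case ARITH_MUL: entry->value *= operand; break;
    case ARITH_DIV:
        if (operand == 0) {
            rc = -1;                       /* refuse to divide by zero */
        } else {
            entry->value /= operand;
        }
        break;
    default:
        rc = -1;
        break;
    }
    pthread_mutex_unlock(&entry->lock);    /* "put" is complete here */
    return rc;
}
```

A caller that decrements a trigger counter would call reg_arith(&entry, ARITH_SUB, 1) instead of issuing a separate get and put.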
This commit was SVN r14711.
To be precise, given this hypothetical launching pattern:
host1: vpids 0, 2, 4, 6
host2: vpids 1, 3, 5, 7
The local_rank for these procs would be:
host1: vpid 0 -> local_rank 0, vpid 2 -> local_rank 1, vpid 4 -> local_rank 2, vpid 6 -> local_rank 3
host2: vpid 1 -> local_rank 0, vpid 3 -> local_rank 1, vpid 5 -> local_rank 2, vpid 7 -> local_rank 3
and the number of local procs on each node would be four. If vpid 0 then does a comm_spawn of one process on host1, the values for the parent job remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1, since it is in a separate jobid.
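The example above can be reproduced with a small sketch. It assumes each job is described by an array mapping vpid to a node index; the function names and that array are hypothetical and are not taken from the SDS components. Within a job, a proc's local_rank is the number of lower vpids on the same node, and the number of local procs is the count of vpids on that node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: local rank of `vpid` within one job.
 * node_of[i] is the node index on which vpid i runs. */
static int local_rank_of(const int *node_of, int nprocs, int vpid)
{
    int rank = 0;

    assert(node_of != NULL && vpid >= 0 && vpid < nprocs);
    for (int i = 0; i < vpid; ++i) {
        if (node_of[i] == node_of[vpid]) {
            ++rank;             /* a lower vpid shares this node */
        }
    }
    return rank;
}

/* Hypothetical helper: number of procs of the same job on vpid's node. */
static int num_local_procs(const int *node_of, int nprocs, int vpid)
{
    int count = 0;

    assert(node_of != NULL && vpid >= 0 && vpid < nprocs);
    for (int i = 0; i < nprocs; ++i) {
        if (node_of[i] == node_of[vpid]) {
            ++count;
        }
    }
    return count;
}
```

Because the arrays are per job, a child job spawned onto host1 gets its own one-element array and therefore local_rank 0 and one local proc, independent of the parent.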
I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future.
This commit was SVN r14706.
* Combine polling of the long requests and buffer requests into
one type, and in one place
* Associate the list of requests to poll with the component, not
the individual modules
* add progress thread that sits on the OMPI request structure
and wakes up at the appropriate time to poll the message
list. Not the best, but without some asynchronous notification
from the PML that a given set of requests has completed, there
isn't a much better option
* Instead of calling opal_progress() all over the place, move
to using the condition variables like the rest of the project.
This has the advantage of moving it slightly further toward
being thread safe
* Fix a problem with the passive side of unlock where it could
go recursive and cause all kinds of problems, especially
when progress threads are used. Instead, have two parts of
passive unlock -- one to start the unlock, and another to
complete the lock and send the ack back. The data moving
code trips the second at the right time.
This commit was SVN r14703.