components that use configure.m4 for configuration or are always built.
The macro has not been needed since moving to configure types other than
configure.stub
Fixes trac:590
This commit was SVN r13031.
The following Trac tickets were found above:
Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
George wrote the initial patch, I extended it slightly and am responsible for all bugs found.
Refs trac:587
This commit was SVN r13023.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
This is somewhat limited currently for expample, if you have 3 ports on Node A and 5 ports
on Node B then the peers will use 3 ports to communicate with each other.
This is on a subnet basis, so for any pair of nodes we take the
intersection of the available ports within a subnet.
We use subnets to determine reachability for lazy connection establishment. So
if Node A and Node B each have two HCA's (on seperate networks) then the
subnet's must be distinct, otherwise we will try to wire up HCA's on seperate
networks.
This commit was SVN r12978.
* Make sure that the pval always writes to the correct portion of the
lval. This only matters on 32 bit big endian machines.
* On 32 bit machines when assigning to pval, the other 4 bytes of lval
weren't being written, which could lead to bogus data
We use macros so that there aren't casts all over the code and the pval
assignment can occur to the correct 4 bytes. Refs trac:587
This commit was SVN r12974.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
completion of the RDMA operation associated with the fragment. The
PML will call the BML free which in turn will call the BTL free. The MX
BTL will not release the fragment if it not tagged with 0xff.
This commit was SVN r12947.
the data was buffered by the MX library. If it's the case then we declare
the send as completed and disable the completion event for the mx request.
This commit was SVN r12935.
protocol over the MX BTL. Now, we have only one matching, the one in Open
MPI.
The problem is that when the unexpected handler is triggered, not all the
message is on the host memory. In the best case we get one MX fragment (internal
MX fragment), in the worst we get NULL. The only way to fit this with the
design of the PML is to force the eager protocol at the MX internal fragment
size, and to limit the send/receive protocol at the same size. Tests show
the outcome is not far from optimal (if the pipeline depth is increased
a little bit).
Set MX_PIPELINE_LOG in order to allow MX to use internal fragments of 4K.
This commit was SVN r12930.
performance on a 2G Myrinet card, as it look like pipelining the messages
by 1M is faster than a simple send/receive. However, when using a 10G card
the send/receive will limit the maximum bandwidth to 2.5Gbs. The reason is
the scarce bus resources that have to be shared between the Myrinet hardware
and the memcpy operation. The PUT protocol remove the memcpy, we now have a
true zero-copy mechanism. But, there is no pipelining yet as it look like the
RDMA pipeline somehow disappeared from the OB1 PML ...
This commit was SVN r12925.
which can cause segfaults on shutdown. Calling mx_finalize() isn't enough
to shutdown the thread, so must close endpoints as well.
Refs trac:513
This commit was SVN r12908.
The following Trac tickets were found above:
Ticket 513 --> https://svn.open-mpi.org/trac/ompi/ticket/513
if the remote architecture differs from the local architecture and the
btl doesn't support heterogeneous transport.
Refs trac:587
This commit was SVN r12879.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
udapl/openib/vapi/gm mpools a deprecated. rdma mpool has parameter that allows
to limit its size mpool_rdma_rcache_size_limit (default is 0 - unlimited).
This commit was SVN r12878.
Add ability for ini files to recognize "use_eager_rdma" flag. Set the
default to "no" (because we should assume that HCAs cannot support the
property necessary for using RDMA for eager messages -- that the last
byte of the message is guaranteed to be written to memory last --
unless proven otherwise. For example, iWARP cards apparently do not
provide this guarantee), and then set all Mellanox and IBM HCAs to
override the default to enable this behavior on these cards.
This commit was SVN r12851.
The following Trac tickets were found above:
Ticket 366 --> https://svn.open-mpi.org/trac/ompi/ticket/366
usually is ok on little-endian systems, as the upper 32 bits will likely
be ignored, but on 32-bit big-endian systems, lval is complete junk.
Use ival if 32 bit mode, lval if 64.
Mixing of 32 and 64 bit architectures won't work without more changes.
This commit was SVN r12802.
r12714) for supporting compilers / architectures with different
padding rules.
This commit was SVN r12749.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r12491
r12714
hits the buffer on the other side. For this kind of BTLs we need to send
FIN through the same BTL, PUT was performed with so network will handle
ordering for us. If we will use another BTL, receiver can get FIN before
data will hit the buffer and complete request prematurely. We mark such
problematic BTLs with MCA_BTL_FLAGS_FAKE_RDMA flag (this kind of RDMA
is really fake, because the real one guaranties that sender will see the
completion only after receiver's NIC confirmed that all the data was
received).
This commit was SVN r12732.
- consistent error message when something fails (via BTL_ERROR macro)
- decrease the number of jumps.
- cleanup some parts of the code.
This commit was SVN r12719.
because they are in ORTE, not OMPI. Also, remove the ORTE_PROCESS_NAME macros
in iof base as they are duplicates of the ones that were in ns_types, which
meant that bad things happened if you changed what an orte_process_name_t
looked like.
This commit was SVN r12646.
the same time, remove some of the MPI-related options from OPAL:
- provide mechanism to change at runtime whether sched_yield() should
be called when the progress engine is idle
- provide mechanism for changing the rate at which the event engine
is called when there are "no" users of the event engine (ie, when
using MPI but not TCP)
- fix some function names in the progress engine to better match
their intended use (and remove MPI naming scheme)
- remove progress_mpi_enable / progress_mpi_disable because
we can now use the functions to set the sched_yield and
tick rate interfaces
- rename opal_progress_events() to opal_progress_set_event_flag()
because the first really isn't descriptive of what the function
does and I always got confused by it
This commit was SVN r12645.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
The fix is to set opcode to SEND at the entrance to the send function before
checking credits and putting fragment to the pending list. We do the same thing
in put/get functions i.e setting opcode at the entrance to the function.
This commit was SVN r12559.
mca_btl_openib_endpoint_connect_eager_rdma() is called recursively. He also
noticed that orte_pointer_array_add() can't fail because we allocate max number
of elements at init time. So just remove error handling and locking. No locking
- no deadlocks.
This commit was SVN r12388.
something is going wrong down in the code it is removed from the array. So add
mutex to prevent concurrent access to the array from different threads.
This commit was SVN r12385.
What's happening is that we're holding openib_btl->eager_rdma_lock when
we call mca_btl_openib_endpoint_send_eager_rdma() on
btl_openib_endpoint.c:1227. This in turn calls
mca_btl_openib_endpoint_send() on line 1179. Then, if the endpoint
state isn't MCA_BTL_IB_CONNECTED or MCA_BTL_IB_FAILED, we call
opal_progress(), where we eventually try to lock
openib_btl->eager_rdma_lock at btl_openib_component.c:997.
The fix removes this lock altogether. Instead we atomically set local RDMA
pointer to prevent other threads to create rdma buffer for the same endpoint.
And we increment eager_rdma_buffers_count atomically thus polling thread doesn't
need lock around it.
This commit was SVN r12369.
allocation logic is completely done outside the data-type engine (in the PML) there is
no need for any special case inside the data-type engine. There is less arguments for
the ompi_convertor_pack and ompi_convertor_unpack as well (the last field free_after is
not required anymore as there is no memory allocated in the engine itself). This change
affect all components using datatypes. I test most of them, but it might happens that I
miss some ... If it's the case please let me know (don't shoot the pianist!!).
This commit was SVN r12331.
all platforms. The only exceptions (and I will not deal with them
anytime soon) are on Windows:
- the write functions which require the length to be an int when it's
a size_t on all UNIX variants.
- all iovec manipulation functions where the iov_len is again an int
when it's a size_t on most of the UNIXes.
As these only happens on Windows, so I think we're set for now :)
This commit was SVN r12215.
The UD BTL isn't gone - the latest version is in my afriedle-ud branch. This version on the trunk was very old, ompi_ignore'd, lacked performance, and probably contained bugs. The maintained version on my branch is working solid, and will eventually come back, but not for v1.2.
This commit was SVN r12144.
constrained:
* Make sure we always have a number of eager fragments available
that scales with the number of processes communicating with
a given proc over shared memory
* Use FREE_LIST_GET instead of FREE_LIST_WAIT to return an
error to the PML when resource exhaustion occurs
* Don't dereference the frag during alloc unless we're sure
it's not NULL
Reviewed by: Galen
Refs trac:413
This commit was SVN r12053.
The following Trac tickets were found above:
Ticket 413 --> https://svn.open-mpi.org/trac/ompi/ticket/413
GIDs (there can be more than one) and not GIDs of the HCA on the network. Entry
zero always have to be initialized so we use it, and warn user if there is more
then one port active and default subnet is configured on at least one of them.
This commit was SVN r11815.
bunch of code changed indenting level and some code got moved out of
one function and made into its own subroutine.
- Gleb pointed out that I wasn't taking into account values from the
default section of the INI file (and not finding values in the INI
file is not an error).
- I incorrectly thought that 0x5ad was Mellanox's vendor ID. Turns
out that 0x5ad is Cisco's ID, while 0x2c9 is Mellanox.
Specifically, Cisco burns its own firmware into the HCA which
replaces the vendor ID, although the part ID stays the same. So
it's Mellanox hardware with Cisco firmware. And apparently several
of us do that. :-) So I expanded the concept of the vendor_id in
the INI file to allow for lists of vendor IDs.
- Along with that, I updated the default INI file to list all the IB
vendors (that I am aware of -- certainly open to putting more data
in there from other vendors) who overwrite Mellanox's vendor_id with
their own for the part numbers that we have on file.
This commit was SVN r11506.
In order to provide backwards compatability the framework versions are bumped
and the handler registeration function is at the end of the btl struct.
Testing done on sm, openib, and gm..
This commit was SVN r11256.
the upperlayer assynchronously although there are some issues with this.. such
as there are multiple consumers of the btl's.. who get's the
This commit was SVN r11232.
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the retun status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
implemented entirely on top of the PML. This allows us to have a
one-sided interface even when we are using the CM PML and MTLs for
point-to-point transport (and therefore not using the BML/BTLs)
* Old pt2pt component was renamed "rdma", as it will soon be having
real RDMA support added to it.
Work was done in a temporary branch. Commit is the result of the
merge command:
svn merge -r10862:11099 https://svn.open-mpi.org/svn/ompi/tmp/bwb-osc-pt2pt
This commit was SVN r11100.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r10862
r11099
of send/receives outstanding.
Use ibv_cq_resize if available after initial creation of completion queue if
cq_size is too small (based on number of peers).
This commit was SVN r11053.
shared memory segments
* make sure to properly unlink the collectives sm bootstrap area at
shutdown
* Add missing / in the path for the mpool shared memory segment
* make sure to release the common_mmap structure in the SM btl
after unlinking the file during shutdown
This commit was SVN r10886.
I've introduced a race condition - seeing occasional LOCAL_LENGTH errors on the receive side. I think I'm mixing up eager/max somehow - will look at it more on monday.
This commit was SVN r10690.
interconnects that provide matching logic in the library.
Currently includes support for MX and some support for
Portals
* Fix overuse of proc_pml pointer on the ompi_proc structuer,
splitting into proc_pml for pml data and proc_bml for
the BML endpoint data
* bug fixes in bsend init code, which wasn't being used by
the OB1 or DR PMLs...
This commit was SVN r10642.
mpi_leave_pinned when multiple OpenIB HCA ports are found.
Specifically, if mpi_leave_pinned == 1 and ultiple HCA ports are
found, the MCA parameter btl_openib_max_btls is set to 1. If the MCA
parameter btl_openib_warn_leave_pinned_multi_port is true, emit a
warning that this happened (having an MCA parameter to control the
warning allows users/sysadmins to turn it off instead of being nagged
for every run).
This commit was SVN r10521.
time.
UD is connectionless, and as long as peers are statically assigned to QPs,
there is no reason to set up the adressing information lazily.
Lots of code was axed, as endpoints no longer have state. Removed a
number of other elements in the endpoint struct to make it as lightweight
as possible.
I was able to remove an entire function call/branch in the send path,
which I believe is the main contributor to a 2us drop in NetPIPE latency.
Some whitespace cleanups as well.
Passes IBM test suite, and all but certain intel tests that were failing
before the change, over ob1 PML.
This commit was SVN r10494.
Moved a lot of the module-specific init from the component init to the module init.
Try keeping a pointer to reduce indexing, didn't seem to help - leaving in place
for now.
This commit was SVN r10485.
Playing around with OPAL_LIKELY/UNLIKELY, no real gains yet.
Reworked progress() to process many WC's at a time, as well
as immediately repost groups of receive buffers.
This commit was SVN r10481.
1. ompi/mca/btl/udapl/btl_udapl_proc.c should be including
btl_udapl_endpoint.h for mca_btl_udapl_proc_insert function.
2. btl_udapl_endpoint.c it looks like you are using
&endpoint->endpoint_lock when you should use &ep->endpoint_lock in a
OPAL_THREAD_LOCK call.
3. btl_udapl_frag.h has a couple opal_list_item_t's that should be
ompi_free_list_item_t in the _FRAG_ALLOC_{EAGER,MAX} macros.
This commit was SVN r10442.
mpi_leave_pinned when multiple OpenIB HCA ports are found.
Specifically, if mpi_leave_pinned == 1 and ultiple HCA ports are
found, the MCA parameter btl_openib_max_btls is set to 1. If the MCA
parameter btl_openib_warn_leave_pinned_multi_port is true, emit a
warning that this happened (having an MCA parameter to control the
warning allows users/sysadmins to turn it off instead of being nagged
for every run).
This commit was SVN r10424.
send/receive queues if there is something available in the short message
rdma queues, then we have to poll *ALL* the rdma queues before exiting,
or we aren't fair about frag reception and fall into degenerate matching
cases.
This commit was SVN r10410.
- Added some basic flow control to limit number of posted sends.
- Merged endpoint send/recv lock into single endpoint lock.
- Set the LMR triplet length in the send path, not at allocation time.
This has to be done because upper layers might send less than the
amount allocated.
- Alter the tie-breaker if statement protecting the second call
to dat_ep_connect(). The logic was reversed compared to the tie-
breaker for the first dat_ep_connect(), making it possible for
3 or more processes to form a deadlock loop.
- Some asserts were added for debugging purposes.. leaving them
in place for now.
This commit was SVN r10317.
btl_openib_ib_mtu and btl_mvapi_ib_mtu MCA params by showing the valid
values what what they represent (got a question about this from Cisco
testing engineers).
This commit was SVN r10277.
UD is the Unreliable Datagram transport for Infiniband, specifically OpenIB. This BTL is derived from the existing openib BTL, which is RC (Reliable Connection) based.
Still a work in progress, as there is a lot of work left to do. Specifically, performance, scalability, and flow control need to be addressed.
Currently I'm playing around with different methods for handling receive buffers, as well as profiling to figure out where the time is going.
This commit was SVN r10271.
support for progress threads, so we shouldn't build them or try to use
them when support for progress threads has been requested. The TCP, GM,
SELF, and SM BTLs should have progress thread support, so they aren't
disabled. The Portals BTL isn't compiled on platforms with threads,
so it doens't need to be updated.
This commit was SVN r10156.
Do this rather than the my_list pointer because we need to do some
things that are somewhat special because we pre-pin eager fragments but
not send fragments. Also makes a couple ideas I have slightly easier to
play around with.
This commit was SVN r10127.
Instead of figuring out which free list the fragment belongs to based on size
we simply store a pointer to the list which it belongs in the fragment.
This was reviewed by Brian and should hit all the branches.
This commit was SVN r10072.
Trying to remember what I did here.. eager/max messages should work now, no RDMA yet. A number of other fixes and cleanups.
I do know of two problems:
Bad stuff happens when flooded with send frags too quickly - the BTL doesn't handle flow control.
Certain IBM tests turn up a length assertion in the datatype engine - needs more investigation.
This commit was SVN r10070.
derefence through it. It is legal for endpoint_addr to be NULL in the
destructor because if btl_tcp_add_procs() -> btl_tcp_proc_insert()
returns UNREACH, then endpoint_addr will be NULL and we'll OBJ_RELEASE
it.
This commit was SVN r9940.
apparently don't work properly: r9869, r9868 (sm btl alignment issues)
This commit was SVN r9936.
The following SVN revision numbers were found above:
r9868 --> open-mpi/ompi@9b985c3216
r9869 --> open-mpi/ompi@adedf511fb
easier to do event accounting that way
* greatly increase receive event and buffer sizes. We're still about half
of what Cray defaults to, so I don't feel bad about the increases
* Implement a pre-pinning optimization for eager fragments - will be
pinned on first use and left pinned for the life of the fragment
* Since we can't have two receive frag callbacks fired at the same time,
don't have receive free list - just keep one receive fragment in the
module. Saves a big free list and all that interaction.
This commit was SVN r9915.
right now. Some testing on large NUMA machines should be done in order
to make sure that we need to export this variable out to the MCA layer.
This commit was SVN r9868.
the memory descriptor is closing itself, so that it actually works properly ;). I think I
was just getting lucky and not sending enough short messages with the reference impl.
This commit was SVN r9748.
* It appears that Cray's SeaStar has some horrible performance for iovecs - IN_pLACE
was actually slower than copying into eager frags. Ugh. And we don't even pre-pin
eager frags yet!
This commit was SVN r9738.
but causes resource exhaustion on the Red Storm implementation. Sigh...
This commit was SVN r9686.
The following SVN revision numbers were found above:
r9005 --> open-mpi/ompi@20d06e889e
- Some initial work on prepare_src
- Move some fragment initialization around
- Fix a union casting issue on picky compilers, identified by Don Kerr
- Other small cleanups/bugfixes
This commit was SVN r9662.
we send our local addr_t OOB. Remote side then matches endpoints and calls
dat_ep_connect(). Everything should be the same as before from here, except
that client/server roles are reversed.
- Properly set our buffer size when posting receives. When the frag used to
transfer address information is recycled by the free list, the wrong buffer
size was being used, which caused buffer overflow errors.
- Finally put the uDAPL error handling stuff in the mpool component.
- Remove a few more OPAL_OUTPUTs.
This commit was SVN r9569.
Not much got tested that wasn't already - I've uncovered a connection
establishment deadlock and wanted to get these changes committed before I
attack it.
The big changes:
- Moved much of the connection code from btl_udapl_component.c to
btl_udapl_endpoint.c.
- Cleaned up initialization of various fragment members.
- MCA_BTL_UDAPL_ERROR macro, which is compiled in/out appropriately.
This commit was SVN r9496.
- Grab the mpool_registration in _frag_common_constructor()
- Save the LMR context in the segment key
- No need for cookie variables - can just cast the frag
- No need to memcpy() data when recv'ing
- Add an LMR triplet to the fragment structure and initialize it
in btl_udapl_alloc().
- Whitespace/typo fixes, remove some opal_output() calls
Looks like I can use triplets describing sub-regions of registered LMR's. So I
do this - prior to this patch I was sending the entire free list memory over,
which isn't correct :)
Back to an earlier problem - when sending address information right after
connection establishment, the receiving end receives a DTO completion event and
appears to have good data. But the sending end never receives a DTO completion
event indicating the send completed, and never completes the client side of the
connection.
This commit was SVN r9386.
In short, I'm very close to having connection establishment and eager send/recv working.
Part of the connection process involves sending address information from the
client to server. For some reason, I am never receiving an event indicating
completetion of the send on the client side. Otherwise, connection
establishment is working and eager send/recv should be trivial from here.
Some more detailed changes:
- Send partially implemented, just handles starting up new connections.
- Several support functions implemented for establishing connection. Client
side code went in btl_udapl_endpoint.c, server side in btl_udapl_component.c
- Frags list and send/recv locks added to the endpoint structure.
- BTL sets up a public service point, which listens for new connections.
Steps over ports that are already bound, iterating through a range of ports.
- Remove any traces of recv frags, don't think I need them after all.
- Pieces of component_progress() implemented for connection establishment.
- Frags have two new types for connection establishment - CONN_SEND and
CONN_RECV.
- Many other minor cleanups not affecting functionality
This commit was SVN r9345.
flag, new flags to be included when convertor is initialized
- modified pml/btl module defs and added stub functions for diagnostic
output routines to dump state of queues / endpoints
- updates to data reliability pml
This commit was SVN r9329.
- initial support for gm progress thread
- corrected threading issue in pml
- added polling progress for a configurable number of cycles to wait for threaded case
This commit was SVN r9188.
- moved hton64 and ntoh64 from the bunch of places it had been copied
into one header file
- properly set and use the btl_tcp's nbo option to put things in
network byte order on the wire if both sides don't have the same
endianness
- Put the OB1 PML's headers (with a couple exceptions I need to discuss
with Tim) in network byte order on the wire if both sides don't have
the same endianness
- since it was needed for the TCP BTL, move the orte_process_name_t
HTON and NTOH macros from the TCP OOB to ns_types.h
This commit was SVN r9145.
when running an MPI job spanning a node that has two TCP NICs and a
node that has one TCP NIC. Previously, for the 2 NIC/module process,
we would return the first peer IP address if we couldn't find a subnet
match with any of the peer's published IP addresses -- this was to
support running OMPI across subnet boundaries. Changed the behavior
to only do that behavior if the IP address we're trying to match is
public (i.e., not 10.x.y.z, 192.168.x.y, or 172.16.x.y) *and* any of
the remote peer's addresses are public (working on the assumption that
if we both have public addresses, they're routable to each other).
This definitely will not work in all scenarios, such as when we go to
WAN kinds of executions, and will need to be revisited at that time.
This commit was SVN r9119.