MCA_BTL_SM_FRAG_SEND) and status success/fail in low bits of pointers we
are passing through circular buffer. The rank that receives ACK doesn't need
to look into data it received and this is a big win since this data is not in
the cache of the rank's CPU. (Note that we can use low bits of pointers because
free_list always return pointers aligned at least to cache line size).
This commit was SVN r13922.
allocated from mpool memory (which is registered memory for RDMA transports)
This is not a problem for a small jobs, but for a big number of ranks an
amount of waisted memory is big.
This commit was SVN r13921.
- mca_base_param_file_prefix
(Default: NULL)
This is the fullname of the "-am" mpirun option. Used to specify a ':'
separated list of AMCA parameter set files.
- mca_base_param_file_path
(Default: $SYSCONFDIR/amca-param-sets/:$CWD)
The path to search for AMCA files with relative paths. A warning will be
printed if the AMCA file cannot be found.
* Added a new function "mca_base_param_recache_files" the re-reads the file
configurations. This is used internally to help bootstrap the MCA system.
* Added a new orterun/mpirun command line option '-am' that aliases for the
mca_base_param_file_prefix MCA parameter
* Exposed the opal_path_access function as it is generally useful in other
places in the code.
* New function "opal_cmd_line_make_opt_mca" which will allow you to append a
new command line option with MCA parameter identifiers to set at the same
time. Previously this could only be done at command line declaration time.
* Added a new directory under the $pkgdatadir named "amca-param-sets" where all
the 'shipped with' Open MPI AMCA parameter sets are placed. This is the first
place to search for AMCA sets with relative paths.
* An example.conf AMCA parameter set file is located in
contrib/amca-param-sets/.
* Jeff Squyres contributed an OpenIB AMCA set for benchmarking.
Note: You will need to autogen with this commit as it adds a configure param.
Sorry :(
This commit was SVN r13867.
and another for dst descriptors. This provide partial solution to OB1 protocol
deadlock problem. We can limit number of RDMA descriptors (by setting
btl_openib_free_list_max to something different from -1) and if we will be
lucky to hit this limit before we fail to register more memory the protocol
will not deadlock. When we had only one list for src/dst descriptors we
deadlocked when we reached max limit for the list.
This commit was SVN r13844.
- Implement the BML/r2 finialize funciton
- Cleanup the btl close routine
- Wire up a pml_base_verbose MCA parameter so you can actually watch the PML selection logic if you really want to.
- Fix a potental segfault in the selection logic.
ompi_pointer_array_get_item() may return NULL, so we have to check for it
This commit was SVN r13734.
The following SVN revision numbers were found above:
r2 --> open-mpi/ompi@58fdc18855
middle of a struct). Now we properly define and name the union
outside the struct and simply create an instance of it inside the
struct.
This commit was SVN r13709.
buffer fails. If cb is already allocated, but it is full and allocation of
additional cb fails, we spin waiting for receiver to free space in existing
cb.
This commit was SVN r13635.
open MTL MX and BTL MX and initialize them at the same time. The problem is
that both call mx_init and mx_finalize, solution is to add an external entity
that does the init and finalize (based on ref counting).
This commit was SVN r13576.
* have the mpool size be based on MCW, not num procs
in other jobs we know about. Solves the problem of
the spawned job having a much bigger than needed
sm file
* Can't assume that "me" is in the list of procs
passed to addprocs, so need to use slightly different
logic and not go through all of add procs unless
there's a proc in my job that isn't me.
This seems to greatly improve the situation, although
there still seems to be more of a slowdown through
MPI_INIT for the children (if there are more than one
child) than MPI_INIT for the parent if there are 'n'
children compared to 'n' parents. Hopefully that
made sense ;)
This commit was SVN r13417.
not the component. This potentially allows for a mix of HCAs that
support eager RDMA and those who do not on a port-by-port basis.
This commit was SVN r13242.
* If the text to cite where the problem occurred is "\n", prettyprint
somethign a little nicer so that it's clear that we're talking
about the end of line
* Add a missing help message ("ini file:unknown field"), and display
it a little better (i.e., show the erroneous field, not a
misleading "end of line" marker)
* It's "OpenIB", not "Open IB"
This commit was SVN r13241.
and convert it to a pointer when finding the destination addr.
Refs trac:587
This commit was SVN r13116.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
we are looking at subnet_id's and we are counting active ports per subnet.
move subnet count out of procs loop,, no need to do it there...
This commit was SVN r13105.
memcpy() instead of assigning the struct's by value.
Fixes trac:739.
This commit was SVN r13081.
The following Trac tickets were found above:
Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
Sorry for the configure change -- hopefully it's early enough in the
morning that it won't affect people... (new approach won't have a
configure change).
Refs trac:739.
This commit was SVN r13080.
The following Trac tickets were found above:
Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
Let's minimize the disturbances and say that the configure system is right.
From now on it's OPAL_BOOL_STRUCT_COPY. This one is related to r13076 and
has to follow when r13076 goes in the 1.2.
This commit was SVN r13077.
The following SVN revision numbers were found above:
r13076 --> open-mpi/ompi@f0932a0701
been fixed in the 7.0 PGI series, but is unlikely to be fixed in the
6.2 series:
* Add a configure test looking for the bad behavior (the PGI compiler
chokes on C code where structs containing bool's are copied by
value)
* Set OMPI_BOOL_STRUCT_COPY to 1 if it's ok, 0 if it's not (i.e., PGI
6.2 series will have this value set to 0)
* In two places in the code base -- orte-clean and btl_openib_ini.h,
we have a struct that contains a bool that is copied by value. In
these two places, check OMPI_BOOL_STRUCT_COPY and if it's 1, use
the "int" type instead of "bool".
Fixes trac:739
This commit was SVN r13076.
The following Trac tickets were found above:
Ticket 739 --> https://svn.open-mpi.org/trac/ompi/ticket/739
it may contain garbage and we will try to unregister it later in btl_free().
This commit was SVN r13054.
The following Trac tickets were found above:
Ticket 729 --> https://svn.open-mpi.org/trac/ompi/ticket/729
components that use configure.m4 for configuration or are always built.
The macro has not been needed since moving to configure types other than
configure.stub
Fixes trac:590
This commit was SVN r13031.
The following Trac tickets were found above:
Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
George wrote the initial patch, I extended it slightly and am responsible for all bugs found.
Refs trac:587
This commit was SVN r13023.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
This is somewhat limited currently for expample, if you have 3 ports on Node A and 5 ports
on Node B then the peers will use 3 ports to communicate with each other.
This is on a subnet basis, so for any pair of nodes we take the
intersection of the available ports within a subnet.
We use subnets to determine reachability for lazy connection establishment. So
if Node A and Node B each have two HCA's (on seperate networks) then the
subnet's must be distinct, otherwise we will try to wire up HCA's on seperate
networks.
This commit was SVN r12978.
* Make sure that the pval always writes to the correct portion of the
lval. This only matters on 32 bit big endian machines.
* On 32 bit machines when assigning to pval, the other 4 bytes of lval
weren't being written, which could lead to bogus data
We use macros so that there aren't casts all over the code and the pval
assignment can occur to the correct 4 bytes. Refs trac:587
This commit was SVN r12974.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
completion of the RDMA operation associated with the fragment. The
PML will call the BML free which in turn will call the BTL free. The MX
BTL will not release the fragment if it not tagged with 0xff.
This commit was SVN r12947.
the data was buffered by the MX library. If it's the case then we declare
the send as completed and disable the completion event for the mx request.
This commit was SVN r12935.
protocol over the MX BTL. Now, we have only one matching, the one in Open
MPI.
The problem is that when the unexpected handler is triggered, not all the
message is on the host memory. In the best case we get one MX fragment (internal
MX fragment), in the worst we get NULL. The only way to fit this with the
design of the PML is to force the eager protocol at the MX internal fragment
size, and to limit the send/receive protocol at the same size. Tests show
the outcome is not far from optimal (if the pipeline depth is increased
a little bit).
Set MX_PIPELINE_LOG in order to allow MX to use internal fragments of 4K.
This commit was SVN r12930.
performance on a 2G Myrinet card, as it look like pipelining the messages
by 1M is faster than a simple send/receive. However, when using a 10G card
the send/receive will limit the maximum bandwidth to 2.5Gbs. The reason is
the scarce bus resources that have to be shared between the Myrinet hardware
and the memcpy operation. The PUT protocol remove the memcpy, we now have a
true zero-copy mechanism. But, there is no pipelining yet as it look like the
RDMA pipeline somehow disappeared from the OB1 PML ...
This commit was SVN r12925.
which can cause segfaults on shutdown. Calling mx_finalize() isn't enough
to shutdown the thread, so must close endpoints as well.
Refs trac:513
This commit was SVN r12908.
The following Trac tickets were found above:
Ticket 513 --> https://svn.open-mpi.org/trac/ompi/ticket/513
if the remote architecture differs from the local architecture and the
btl doesn't support heterogeneous transport.
Refs trac:587
This commit was SVN r12879.
The following Trac tickets were found above:
Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
udapl/openib/vapi/gm mpools a deprecated. rdma mpool has parameter that allows
to limit its size mpool_rdma_rcache_size_limit (default is 0 - unlimited).
This commit was SVN r12878.
Add ability for ini files to recognize "use_eager_rdma" flag. Set the
default to "no" (because we should assume that HCAs cannot support the
property necessary for using RDMA for eager messages -- that the last
byte of the message is guaranteed to be written to memory last --
unless proven otherwise. For example, iWARP cards apparently do not
provide this guarantee), and then set all Mellanox and IBM HCAs to
override the default to enable this behavior on these cards.
This commit was SVN r12851.
The following Trac tickets were found above:
Ticket 366 --> https://svn.open-mpi.org/trac/ompi/ticket/366
usually is ok on little-endian systems, as the upper 32 bits will likely
be ignored, but on 32-bit big-endian systems, lval is complete junk.
Use ival if 32 bit mode, lval if 64.
Mixing of 32 and 64 bit architectures won't work without more changes.
This commit was SVN r12802.
r12714) for supporting compilers / architectures with different
padding rules.
This commit was SVN r12749.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r12491
r12714
hits the buffer on the other side. For this kind of BTLs we need to send
FIN through the same BTL, PUT was performed with so network will handle
ordering for us. If we will use another BTL, receiver can get FIN before
data will hit the buffer and complete request prematurely. We mark such
problematic BTLs with MCA_BTL_FLAGS_FAKE_RDMA flag (this kind of RDMA
is really fake, because the real one guaranties that sender will see the
completion only after receiver's NIC confirmed that all the data was
received).
This commit was SVN r12732.
- consistent error message when something fails (via BTL_ERROR macro)
- decrease the number of jumps.
- cleanup some parts of the code.
This commit was SVN r12719.
because they are in ORTE, not OMPI. Also, remove the ORTE_PROCESS_NAME macros
in iof base as they are duplicates of the ones that were in ns_types, which
meant that bad things happened if you changed what an orte_process_name_t
looked like.
This commit was SVN r12646.
the same time, remove some of the MPI-related options from OPAL:
- provide mechanism to change at runtime whether sched_yield() should
be called when the progress engine is idle
- provide mechanism for changing the rate at which the event engine
is called when there are "no" users of the event engine (ie, when
using MPI but not TCP)
- fix some function names in the progress engine to better match
their intended use (and remove MPI naming scheme)
- remove progress_mpi_enable / progress_mpi_disable because
we can now use the functions to set the sched_yield and
tick rate interfaces
- rename opal_progress_events() to opal_progress_set_event_flag()
because the first really isn't descriptive of what the function
does and I always got confused by it
This commit was SVN r12645.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
The fix is to set opcode to SEND at the entrance to the send function before
checking credits and putting fragment to the pending list. We do the same thing
in put/get functions i.e setting opcode at the entrance to the function.
This commit was SVN r12559.
mca_btl_openib_endpoint_connect_eager_rdma() is called recursively. He also
noticed that orte_pointer_array_add() can't fail because we allocate max number
of elements at init time. So just remove error handling and locking. No locking
- no deadlocks.
This commit was SVN r12388.
something is going wrong down in the code it is removed from the array. So add
mutex to prevent concurrent access to the array from different threads.
This commit was SVN r12385.
What's happening is that we're holding openib_btl->eager_rdma_lock when
we call mca_btl_openib_endpoint_send_eager_rdma() on
btl_openib_endpoint.c:1227. This in turn calls
mca_btl_openib_endpoint_send() on line 1179. Then, if the endpoint
state isn't MCA_BTL_IB_CONNECTED or MCA_BTL_IB_FAILED, we call
opal_progress(), where we eventually try to lock
openib_btl->eager_rdma_lock at btl_openib_component.c:997.
The fix removes this lock altogether. Instead we atomically set local RDMA
pointer to prevent other threads to create rdma buffer for the same endpoint.
And we increment eager_rdma_buffers_count atomically thus polling thread doesn't
need lock around it.
This commit was SVN r12369.
allocation logic is completely done outside the data-type engine (in the PML) there is
no need for any special case inside the data-type engine. There is less arguments for
the ompi_convertor_pack and ompi_convertor_unpack as well (the last field free_after is
not required anymore as there is no memory allocated in the engine itself). This change
affect all components using datatypes. I test most of them, but it might happens that I
miss some ... If it's the case please let me know (don't shoot the pianist!!).
This commit was SVN r12331.
all platforms. The only exceptions (and I will not deal with them
anytime soon) are on Windows:
- the write functions which require the length to be an int when it's
a size_t on all UNIX variants.
- all iovec manipulation functions where the iov_len is again an int
when it's a size_t on most of the UNIXes.
As these only happens on Windows, so I think we're set for now :)
This commit was SVN r12215.
The UD BTL isn't gone - the latest version is in my afriedle-ud branch. This version on the trunk was very old, ompi_ignore'd, lacked performance, and probably contained bugs. The maintained version on my branch is working solid, and will eventually come back, but not for v1.2.
This commit was SVN r12144.
constrained:
* Make sure we always have a number of eager fragments available
that scales with the number of processes communicating with
a given proc over shared memory
* Use FREE_LIST_GET instead of FREE_LIST_WAIT to return an
error to the PML when resource exhaustion occurs
* Don't dereference the frag during alloc unless we're sure
it's not NULL
Reviewed by: Galen
Refs trac:413
This commit was SVN r12053.
The following Trac tickets were found above:
Ticket 413 --> https://svn.open-mpi.org/trac/ompi/ticket/413
GIDs (there can be more than one) and not GIDs of the HCA on the network. Entry
zero always have to be initialized so we use it, and warn user if there is more
then one port active and default subnet is configured on at least one of them.
This commit was SVN r11815.
bunch of code changed indenting level and some code got moved out of
one function and made into its own subroutine.
- Gleb pointed out that I wasn't taking into account values from the
default section of the INI file (and not finding values in the INI
file is not an error).
- I incorrectly thought that 0x5ad was Mellanox's vendor ID. Turns
out that 0x5ad is Cisco's ID, while 0x2c9 is Mellanox.
Specifically, Cisco burns its own firmware into the HCA which
replaces the vendor ID, although the part ID stays the same. So
it's Mellanox hardware with Cisco firmware. And apparently several
of us do that. :-) So I expanded the concept of the vendor_id in
the INI file to allow for lists of vendor IDs.
- Along with that, I updated the default INI file to list all the IB
vendors (that I am aware of -- certainly open to putting more data
in there from other vendors) who overwrite Mellanox's vendor_id with
their own for the part numbers that we have on file.
This commit was SVN r11506.
In order to provide backwards compatability the framework versions are bumped
and the handler registeration function is at the end of the btl struct.
Testing done on sm, openib, and gm..
This commit was SVN r11256.
the upperlayer assynchronously although there are some issues with this.. such
as there are multiple consumers of the btl's.. who get's the
This commit was SVN r11232.
addition to my design and testing, it was conceptually approved by
Gil, Gleb, Pasha, Brad, and Galen. Functionally [probably somewhat
lightly] tested by Galen. We may still have to shake out some bugs
during the next few months, but it seems to be working for all the
cases that I can throw at it.
Here's a summary of the changes from that branch:
* Move MCA parameter registration to a new file (btl_openib_mca.c):
* Properly check the retun status of registering MCA params
* Check for valid values of MCA parameters
* Make help strings better
* Otherwise, the only default value of an MCA param that was
changed was max_btls; it went from 4 to -1 (meaning: use all
available)
* Properly prototyped internal functions in _component.c
* Made a bunch of functions static that didn't need to be public
* Renamed to remove "mca_" prefix from static functions
* Call new MCA param registration function
* Call new INI file read/lookup/finalize functions
* Updated a bunch of macros to be "BTL_" instead of "ORTE_"
* Be a little more consistent with return values
* Handle -1 for the max_btls MCA param
* Fixed a free() that should have been an OBJ_RELEASE()
* Some re-indenting
* Added INI-file parsing
* New flex file: btl_openib_ini.l
* New default HCA params .ini file (probably to be expanded over
time by other HCA vendors)
* Added more show_help messages for parsing problems
* Read in INI files and cache the values for later lookup
* When component opens an HCA, lookup to see if any corresponding
values were found in the INI files (ID'ed by the HCA vendor_id
and vendor_part_id)
* Added btl_openib_verbose MCA param that shows what the INI-file
stuff does (e.g., shows which MTU your HCA ends up using)
* Added btl_openib_hca_param_files as a colon-delimited list of INI
files to check for values during startup (in order,
left-to-right, just like the MCA base directory param).
* MTU is currently the only value supported in this framework.
* It is not a fatal error if we don't find params for the HCA in
the INI file(s). Instead, just print a warning. New MCA param
btl_openib_warn_no_hca_params_found can be used to disable
printing the warning.
* Add MTU to peer negotiation when making a connection
* Exchange maximum MTU; select the lesser of the two
This commit was SVN r11182.
implemented entirely on top of the PML. This allows us to have a
one-sided interface even when we are using the CM PML and MTLs for
point-to-point transport (and therefore not using the BML/BTLs)
* Old pt2pt component was renamed "rdma", as it will soon be having
real RDMA support added to it.
Work was done in a temporary branch. Commit is the result of the
merge command:
svn merge -r10862:11099 https://svn.open-mpi.org/svn/ompi/tmp/bwb-osc-pt2pt
This commit was SVN r11100.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r10862
r11099
of send/receives outstanding.
Use ibv_cq_resize if available after initial creation of completion queue if
cq_size is too small (based on number of peers).
This commit was SVN r11053.
shared memory segments
* make sure to properly unlink the collectives sm bootstrap area at
shutdown
* Add missing / in the path for the mpool shared memory segment
* make sure to release the common_mmap structure in the SM btl
after unlinking the file during shutdown
This commit was SVN r10886.
I've introduced a race condition - seeing occasional LOCAL_LENGTH errors on the receive side. I think I'm mixing up eager/max somehow - will look at it more on monday.
This commit was SVN r10690.
interconnects that provide matching logic in the library.
Currently includes support for MX and some support for
Portals
* Fix overuse of proc_pml pointer on the ompi_proc structuer,
splitting into proc_pml for pml data and proc_bml for
the BML endpoint data
* bug fixes in bsend init code, which wasn't being used by
the OB1 or DR PMLs...
This commit was SVN r10642.