613 Commits

Author SHA1 Message Date
George Bosilca
e8bd985870 Add more output when calls to the MX library fail.
Move the connection status from the proc into the endpoint.

This commit was SVN r12924.
2006-12-24 22:34:48 +00:00
George Bosilca
14dc72f595 Allow the user to change the MX flags.
This commit was SVN r12923.
2006-12-24 22:21:00 +00:00
George Bosilca
dbe2798638 Allow MX to handle shared memory and self communications. By default these features
are disabled (btl_mx_shared_mem and btl_mx_self, respectively, have to be set in order
to activate them).

This commit was SVN r12922.
2006-12-24 22:18:41 +00:00
Brian Barrett
7880353fcc Need to close every endpoint we open, or the MX progress thread doesn't die,
which can cause segfaults on shutdown.  Calling mx_finalize() isn't enough
to shutdown the thread, so must close endpoints as well.

Refs trac:513

This commit was SVN r12908.

The following Trac tickets were found above:
  Ticket 513 --> https://svn.open-mpi.org/trac/ompi/ticket/513
2006-12-21 18:13:22 +00:00
Gleb Natapov
484c6a2c1a Use OPAL_ALIGN() macro to align length. The address returned by mpool_alloc is now
properly aligned, so there is no need to align it again.

This commit was SVN r12899.
2006-12-19 08:34:48 +00:00
Brian Barrett
2ab65eb521 Remove some debugging output that was #if 0'ed out but shouldn't have been
committed into the trunk anyway

This commit was SVN r12897.
2006-12-19 02:34:41 +00:00
Brian Barrett
b448b4e47e More heterogeneous fixes. Don't set reachability bit on a remote proc
if the remote architecture differs from the local architecture and the
btl doesn't support heterogeneous transport.

Refs trac:587

This commit was SVN r12879.

The following Trac tickets were found above:
  Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
2006-12-17 17:27:08 +00:00
Gleb Natapov
190e7a27cd Merge with gleb-mpool branch. All RDMA components now use the same mpool (rdma).
The udapl/openib/vapi/gm mpools are deprecated. The rdma mpool has a parameter,
mpool_rdma_rcache_size_limit, that limits its size (default is 0, meaning unlimited).

This commit was SVN r12878.
2006-12-17 12:26:41 +00:00
Brian Barrett
0653dc3f24 Pad headers to eliminate heterogeneous issues. Add conversion functions
for switching endianness of headers.  Galen is going to add the code to
use the endian stuff...

This commit was SVN r12876.
2006-12-17 00:50:59 +00:00
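
As a rough illustration of the header padding and endian conversion described in 0653dc3f24, the sketch below shows one way such a header could look. The struct layout and all `example_*` names are hypothetical and are not taken from the Open MPI sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wire header. The explicit padding bytes make the layout
 * identical on every compiler instead of relying on implicit padding. */
typedef struct {
    uint8_t  type;
    uint8_t  padding[3];
    uint32_t length;
    uint64_t offset;
} example_hdr_t;

/* Compile-time check that no hidden padding was added. */
static_assert(sizeof(example_hdr_t) == 16, "unexpected header padding");

/* Reverse the byte order of a 32-bit value. */
uint32_t example_swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v >> 8) & 0x0000FF00u)  | (v >> 24);
}

/* Reverse the byte order of a 64-bit value by swapping both halves. */
uint64_t example_swap64(uint64_t v)
{
    return ((uint64_t)example_swap32((uint32_t)v) << 32) |
           example_swap32((uint32_t)(v >> 32));
}

/* Convert every multi-byte field of the header to the other endianness. */
void example_hdr_swap(example_hdr_t *hdr)
{
    hdr->length = example_swap32(hdr->length);
    hdr->offset = example_swap64(hdr->offset);
}
```

A receiver would call a function like `example_hdr_swap()` only when the sender's byte order differs from its own; single-byte fields such as `type` need no conversion.
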
Brad Benton
18da4c40d3 Set the QP's static rate from the associated MCA parameter, rather
than just defaulting to 0.

Fixes trac:675

This commit was SVN r12855.

The following Trac tickets were found above:
  Ticket 675 --> https://svn.open-mpi.org/trac/ompi/ticket/675
2006-12-14 19:42:24 +00:00
Brian Barrett
38c2e43ac2 Print out error string rather than errno for TCP-related errors, making it easier for both the user and us to debug BTL and OOB issues...
This commit was SVN r12852.
2006-12-14 18:20:43 +00:00
Jeff Squyres
0ca8cb35b7 Fixes trac:366
Add ability for ini files to recognize "use_eager_rdma" flag.  Set the
default to "no" (because we should assume that HCAs cannot support the
property necessary for using RDMA for eager messages -- that the last
byte of the message is guaranteed to be written to memory last --
unless proven otherwise.  For example, iWARP cards apparently do not
provide this guarantee), and then set all Mellanox and IBM HCAs to
override the default to enable this behavior on these cards.

This commit was SVN r12851.

The following Trac tickets were found above:
  Ticket 366 --> https://svn.open-mpi.org/trac/ompi/ticket/366
2006-12-14 15:52:13 +00:00
George Bosilca
80bc0c8868 Allow the MX to survive if we are unable to connect to a peer. The PML will
try to find another route.

This commit was SVN r12837.
2006-12-13 01:12:07 +00:00
Brad Benton
337116d5fd Added IBM eHCA vendor and part id info
This commit was SVN r12827.
2006-12-12 14:12:39 +00:00
Jeff Squyres
e70ef98ea6 Update the help message to be a bit more specific and refer to the web
FAQ.

This commit was SVN r12812.
2006-12-09 15:13:03 +00:00
Patrick Geoffray
58c6f8c8e1 Copyright update. Thanks to Jeff for reminding me.
This commit was SVN r12803.
2006-12-07 23:55:00 +00:00
Patrick Geoffray
6e09b0c23f lval is not defined when pval is assigned on 32-bit systems. This
is usually OK on little-endian systems, as the upper 32 bits will likely
be ignored, but on 32-bit big-endian systems, lval is complete junk.
Use ival in 32-bit mode, lval in 64-bit mode.

Mixing of 32 and 64 bit architectures won't work without more changes.

This commit was SVN r12802.
2006-12-07 23:34:04 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
Brian Barrett
441432950f Merge in changes from the bwb-heterogeneous temp branch (r12491 -
r12714) for supporting compilers / architectures with different
padding rules.

This commit was SVN r12749.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r12491
  r12714
2006-12-04 20:11:42 +00:00
Gleb Natapov
30ca7457b4 Some BTLs (e.g. TCP) can report put/get completion before the data actually
hits the buffer on the other side. For such BTLs we need to send FIN through
the same BTL the PUT was performed with, so the network handles ordering
for us. If we use another BTL, the receiver can get FIN before the data
hits the buffer and complete the request prematurely. We mark such
problematic BTLs with the MCA_BTL_FLAGS_FAKE_RDMA flag (this kind of RDMA
is really fake, because the real one guarantees that the sender will see the
completion only after the receiver's NIC has confirmed that all the data was
received).

This commit was SVN r12732.
2006-12-03 10:12:09 +00:00
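
The FIN routing rule from 30ca7457b4 can be sketched roughly as follows. This is only an illustration: the flag value, the `example_btl_t` type, and `example_pick_fin_btl()` are hypothetical stand-ins, not the actual PML code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MCA_BTL_FLAGS_FAKE_RDMA; the real value is
 * defined in the Open MPI headers and is not reproduced here. */
#define EXAMPLE_BTL_FLAGS_FAKE_RDMA 0x1u

typedef struct {
    const char *name;
    uint32_t    flags;
} example_btl_t;

/* Choose the BTL that carries the FIN message.
 * If the PUT went over a "fake RDMA" BTL, local completion does not mean
 * the data has arrived, so FIN must follow the data on the same BTL and
 * rely on that network's ordering. Otherwise any BTL may be used. */
const example_btl_t *example_pick_fin_btl(const example_btl_t *put_btl,
                                          const example_btl_t *preferred)
{
    assert(put_btl != NULL && preferred != NULL);
    if (put_btl->flags & EXAMPLE_BTL_FLAGS_FAKE_RDMA) {
        return put_btl;
    }
    return preferred;
}
```
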
George Bosilca
658879232b Several small improvements:
- consistent error message when something fails (via BTL_ERROR macro)
- decrease the number of jumps.
- cleanup some parts of the code.

This commit was SVN r12719.
2006-12-01 21:48:06 +00:00
Pavel Shamis
f08bc818c4 Remove unused variables from
mca_btl_openib_progress_thread.

This commit was SVN r12709.
2006-11-30 18:28:45 +00:00
George Bosilca
59cfee0cd2 Use the MX infinite timeout by default. The user can modify it using an MCA
parameter.

This commit was SVN r12670.
2006-11-27 20:18:58 +00:00
Brian Barrett
0895f5e08d Rename OMPI_PROCESS_NAME_{HTON, NTOH} macros to ORTE_PROCESS_NAME_{HTON, NTOH}
because they are in ORTE, not OMPI.  Also, remove the ORTE_PROCESS_NAME macros
in iof base as they are duplicates of the ones that were in ns_types, which 
meant that bad things happened if you changed what an orte_process_name_t
looked like.

This commit was SVN r12646.
2006-11-22 03:03:21 +00:00
Brian Barrett
33320b7165 Rework the opal_progress interface to better support dynamic processes and at
the same time, remove some of the MPI-related options from OPAL:

  - provide mechanism to change at runtime whether sched_yield() should 
    be called when the progress engine is idle
  - provide mechanism for changing the rate at which the event engine
    is called when there are "no" users of the event engine (ie, when
    using MPI but not TCP)
  - fix some function names in the progress engine to better match
    their intended use (and remove MPI naming scheme)
  - remove progress_mpi_enable / progress_mpi_disable because 
    we can now use the functions to set the sched_yield and
    tick rate interfaces
  - rename opal_progress_events() to opal_progress_set_event_flag()
    because the first really isn't descriptive of what the function
    does and I always got confused by it

This commit was SVN r12645.
2006-11-22 02:06:52 +00:00
Ralph Castain
6d6cebb4a7 Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.

I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).

This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
George Bosilca
139f9cf3d0 Make sure we disable the MX shared memory when we use the MX BTL.
This commit was SVN r12587.
2006-11-13 22:17:06 +00:00
Gleb Natapov
9933a6f469 The previous fix doesn't cover the case when the opcode is changed in the put/get functions.
The fix is to set the opcode to SEND at the entrance to the send function, before
checking credits and putting the fragment on the pending list. We do the same thing
in the put/get functions, i.e. set the opcode at the entrance to the function.

This commit was SVN r12559.
2006-11-11 07:51:06 +00:00
Gleb Natapov
7e03b83d23 Reset opcode field to SEND. It is checked later in pending progress function.
This commit was SVN r12531.
2006-11-10 06:17:00 +00:00
George Bosilca
eab1776e9a Explicit casts for our friendly Windows environment...
This commit was SVN r12496.
2006-11-08 17:02:46 +00:00
George Bosilca
3d0df2cf29 Allow the MX BTL to finish small sends quicker. Once the mx_isend is posted, if
the message size is less than 4K, check for message completion and, if the message
has completed, call the callback.

This commit was SVN r12453.
2006-11-06 23:12:01 +00:00
Gleb Natapov
b4fd2d7d50 Fix warnings from progress thread patch.
This commit was SVN r12434.
2006-11-06 12:34:56 +00:00
Pavel Shamis
566667ac61 Adding progress thread support to OpenIB BTL.
Reviewed by Gleb.

This commit was SVN r12411.
2006-11-02 16:15:21 +00:00
Gleb Natapov
4c784b6403 As Andrew Friedley pointed out, my previous patch may cause deadlock if
mca_btl_openib_endpoint_connect_eager_rdma() is called recursively. He also
noticed that orte_pointer_array_add() can't fail because we allocate max number
of elements at init time. So just remove error handling and locking. No locking
 - no deadlocks.

This commit was SVN r12388.
2006-11-01 15:53:33 +00:00
Gleb Natapov
b5714d698a Fix compilation with GM version smaller than 2.0. Fix compilation warnings.
This commit was SVN r12386.
2006-11-01 10:26:15 +00:00
Gleb Natapov
aac695a51f eager_rdma_buffers update is not atomic. A buffer is added to the array and, if
something goes wrong later in the code, it is removed from the array. So add a
mutex to prevent concurrent access to the array from different threads.

This commit was SVN r12385.
2006-11-01 07:27:32 +00:00
Andrew Friedley
48c5117476 Fix some signedness warnings on threaded builds introduced by r12369
This commit was SVN r12376.

The following SVN revision numbers were found above:
  r12369 --> open-mpi/ompi@d7375ec102
2006-10-31 17:29:25 +00:00
Gleb Natapov
d7375ec102 Fix deadlock reported by Andrew Friedley:
What's happening is that we're holding openib_btl->eager_rdma_lock when
we call mca_btl_openib_endpoint_send_eager_rdma() on
btl_openib_endpoint.c:1227.  This in turn calls
mca_btl_openib_endpoint_send() on line 1179.  Then, if the endpoint
state isn't MCA_BTL_IB_CONNECTED or MCA_BTL_IB_FAILED, we call
opal_progress(), where we eventually try to lock
openib_btl->eager_rdma_lock at btl_openib_component.c:997.

The fix removes this lock altogether. Instead we atomically set the local RDMA
pointer to prevent other threads from creating an RDMA buffer for the same endpoint.
We also increment eager_rdma_buffers_count atomically, so the polling thread doesn't
need a lock around it.

This commit was SVN r12369.
2006-10-31 09:54:52 +00:00
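
The lock-free approach described in d7375ec102 can be sketched with C11 atomics as below. This is an assumption-laden illustration: the original 2006 code predates C11 and would have used Open MPI's own atomic primitives, and the `example_*` names and struct are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical endpoint holding only the field relevant to the sketch. */
typedef struct {
    _Atomic(void *) eager_rdma_local;  /* NULL until a buffer is claimed */
} example_endpoint_t;

/* Counter read by the polling thread; updated atomically, so no lock. */
atomic_int example_buffers_count;

/* Try to install buf as the endpoint's local eager RDMA buffer.
 * The compare-and-swap succeeds for exactly one thread per endpoint,
 * so no mutex is held while other code (e.g. a progress loop) runs. */
bool example_claim_eager_rdma(example_endpoint_t *ep, void *buf)
{
    void *expected = NULL;
    assert(buf != NULL);
    if (!atomic_compare_exchange_strong(&ep->eager_rdma_local,
                                        &expected, buf)) {
        return false;  /* another thread already set the pointer */
    }
    atomic_fetch_add(&example_buffers_count, 1);
    return true;
}
```

Because no lock is taken, a recursive entry into the progress engine cannot block on a lock the same thread already holds, which is the deadlock the commit message describes.
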
Gleb Natapov
1b152dfe09 On 64-bit platforms, if the high 32 bits of the buf address are not zero, they are
trimmed by an incorrect bitwise AND. Fix it by expanding the mask to 64 bits.

This commit was SVN r12368.
2006-10-31 07:33:35 +00:00
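
The class of bug fixed in 1b152dfe09 can be reproduced with a minimal sketch. The function names below are hypothetical, and the actual expression in the openib BTL may have looked different; the point is only that a mask computed in 32 bits zero-extends when ANDed with a 64-bit address.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy variant: the mask is computed in 32 bits. When it is ANDed with
 * a 64-bit address it zero-extends (e.g. to 0x00000000FFFFFFC0), which
 * clears the upper 32 bits of the address. */
uint64_t example_align_down_32bit_mask(uint64_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1u)) == 0);
    uint32_t mask = ~(alignment - 1u);
    return addr & mask;
}

/* Fixed variant: widen to 64 bits before inverting, so the mask keeps
 * all high bits set (e.g. 0xFFFFFFFFFFFFFFC0). */
uint64_t example_align_down_64bit_mask(uint64_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1u)) == 0);
    uint64_t mask = ~((uint64_t)alignment - 1u);
    return addr & mask;
}
```

With the address 0x00007F0012345678 and an alignment of 64, the first function returns 0x12345640, while the second returns 0x00007F0012345640.
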
George Bosilca
126a68dc9a Big datatype commit. Remove all unused features of the datatype engine. As the memory
allocation logic is done completely outside the data-type engine (in the PML), there is
no need for any special case inside the data-type engine. There are also fewer arguments for
ompi_convertor_pack and ompi_convertor_unpack (the last field, free_after, is
no longer required, as no memory is allocated in the engine itself). This change
affects all components using datatypes. I tested most of them, but I might have
missed some ... If that's the case, please let me know (don't shoot the pianist!!).

This commit was SVN r12331.
2006-10-26 23:11:26 +00:00
George Bosilca
a1a4f7c422 Reset the segment pointer once we release the self fragment.
This commit was SVN r12330.
2006-10-26 23:07:14 +00:00
George Bosilca
be8516e0d7 More indentation.
This commit was SVN r12329.
2006-10-26 23:06:15 +00:00
George Bosilca
83dfd36c1f Indentations.
This commit was SVN r12328.
2006-10-26 23:05:41 +00:00
George Bosilca
91ab093e96 Cleanup. No extern required for the function prototypes.
This commit was SVN r12327.
2006-10-26 23:03:12 +00:00
Sven Stork
3563f15fde - Fix a bug in the descriptor handling code. The self BTL was mixing the different
kinds of descriptors (e.g. putting an rdma descriptor in the eager free-list).
  This is part 1 of the fix for ticket #246.

This commit was SVN r12291.
2006-10-25 08:45:29 +00:00
George Bosilca
d7268557a8 Complete the SM BTL changes. Now all displacements are ptrdiff_t and there are
no warnings about signed/unsigned issues.

This commit was SVN r12234.
2006-10-20 19:28:12 +00:00
George Bosilca
c86214f420 Fix the SM BTL issues. The problem seems to come from the fact that
the maximum number of nodes on the SM file should be signed, as we use
-1 to make it unlimited.

This commit was SVN r12227.
2006-10-20 17:25:53 +00:00
George Bosilca
06563b5dec Last set of explicit conversions. We are now close to the zero warnings on
all platforms. The only exceptions (and I will not deal with them
anytime soon) are on Windows:
- the write functions which require the length to be an int when it's
  a size_t on all UNIX variants.
- all iovec manipulation functions where the iov_len is again an int
  when it's a size_t on most of the UNIXes.
As these only happen on Windows, I think we're set for now :)

This commit was SVN r12215.
2006-10-20 03:57:44 +00:00
George Bosilca
3db5c0487d typos.
This commit was SVN r12168.
2006-10-18 17:12:25 +00:00
George Bosilca
640178c4b3 Grepping through the source files I found these calls to the data-type engine
with the wrong type of arguments.

This commit was SVN r12148.
2006-10-17 21:05:04 +00:00