Commit graph

8769 commits

Author SHA1 Message Date
Ralph Castain
8314e8dbb9 Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode:
1. no -np provided - put one proc/node across all allocated nodes

2. -np N provided, N > #nodes - we print a pretty error message and exit

3. -np N provided, N <= #nodes - put one proc/node across N nodes

I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error.

This commit was SVN r12821.
2006-12-11 18:07:07 +00:00
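
The three pernode cases described in 8314e8dbb9 can be sketched as a small decision function. This is only an illustration under stated assumptions: `pernode_count`, its parameters, and the `SKETCH_*` status values are hypothetical names, and only the idea of a "silent" error code comes from the commit message.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes for this sketch; the commit names
 * ORTE_ERR_SILENT, but the numeric values here are made up. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_SILENT = -1 };

/*
 * Decide how many nodes receive one process each under "pernode".
 * np_given is 0 when the user passed no -np; otherwise np holds N.
 */
static int pernode_count(int np_given, int np, int num_nodes, int *num_procs)
{
    assert(num_nodes > 0 && num_procs != NULL);

    if (!np_given) {
        /* Case 1: no -np, one proc on every allocated node. */
        *num_procs = num_nodes;
        return SKETCH_SUCCESS;
    }
    if (np > num_nodes) {
        /* Case 2: user error. Report it once and return a code that
         * tells callers not to log an additional error chain. */
        fprintf(stderr, "pernode: %d processes requested, only %d nodes allocated\n",
                np, num_nodes);
        return SKETCH_ERR_SILENT;
    }
    /* Case 3: one proc on each of the first N nodes. */
    *num_procs = np;
    return SKETCH_SUCCESS;
}
```

A caller would treat `SKETCH_ERR_SILENT` as a failure but skip its usual error logging, since the message has already been shown to the user.
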
Ralph Castain
0a5d41857a Complete next round of message size reduction: "strip" the descriptive info from the returned values. I have now added a flag to the gpr address mode (ORTE_GPR_STRIPPED) that instructs the gpr to not include segment names or tokens in the returned gpr_value_t objects.
I found only two places that were looking at the tokens:

1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.

2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "strip" the value structures of tokens and segment strings, and (c) correctly obtain process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).

This commit was SVN r12813.
2006-12-09 23:10:25 +00:00
Jeff Squyres
e70ef98ea6 Update the help message to be a bit more specific and refer to the web
FAQ.

This commit was SVN r12812.
2006-12-09 15:13:03 +00:00
Jeff Squyres
c7282855e7 Fixes trac:659
This commit fixes several aspects regarding MPI conformance of requests.

 * Eliminate the last argument of ompi_errhandler_request_invoke(); we
   ''always'' want to invoke the back-end exception handler with the
   real error code.
 * Make it clear in comments that we only invoke the ''first''
   exception in a given array of requests, even if there's more than
   one request with a non-MPI_SUCCESS value for MPI_ERROR.
 * Defer the freeing of requests upon exception in the back-end
   functions to MPI_WAIT* and MPI_TEST* until later; the requests are
   kept so that we know what handler to invoke when we actually invoke
   the exception.  After figuring that out, ''then'' we free requests
   with pending exceptions on them.
 * Clean up return codes from the back-end MPI_TEST* and MPI_WAIT*
   functions.
 * Slightly modify ompi_errcode_get_mpi_code() to return unity if it
   receives an MPI error code (vs. an OMPI error code).

This commit was SVN r12810.

The following Trac tickets were found above:
  Ticket 659 --> https://svn.open-mpi.org/trac/ompi/ticket/659
2006-12-09 14:20:08 +00:00
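
The "first exception only" rule from c7282855e7 can be illustrated with a short scan over an array of requests. This is a sketch, not the Open MPI code: `sketch_request_t`, `first_failed`, and `SKETCH_SUCCESS` are hypothetical stand-ins for the real request type and `MPI_SUCCESS`.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS 0   /* stand-in for MPI_SUCCESS */

typedef struct {
    int error;             /* plays the role of the request's MPI_ERROR field */
} sketch_request_t;

/*
 * Return the index of the first request whose error field is not
 * SKETCH_SUCCESS, or -1 if every request succeeded. Later failures
 * are deliberately ignored: only the first one selects the handler.
 */
static int first_failed(const sketch_request_t *reqs, size_t count)
{
    assert(reqs != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        if (reqs[i].error != SKETCH_SUCCESS) {
            return (int)i;
        }
    }
    return -1;
}
```

Under this scheme the requests must stay allocated until the scan has run, which matches the commit's point about deferring the free until the handler has been chosen.
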
Ralph Castain
58569546ed Fix the fix to remove compiler warning - an incorrect "\" was placed in the command string.
This commit was SVN r12805.
2006-12-08 04:17:38 +00:00
Jeff Squyres
568d20de37 Add Myricom to the aggregated list.
This commit was SVN r12804.
2006-12-08 00:11:51 +00:00
Patrick Geoffray
58c6f8c8e1 Copyright update. Thanks to Jeff for reminding me.
This commit was SVN r12803.
2006-12-07 23:55:00 +00:00
Patrick Geoffray
6e09b0c23f lval is not defined when pval is assigned on 32-bit systems. This
usually is OK on little-endian systems, as the upper 32 bits will likely
be ignored, but on 32-bit big-endian systems, lval is complete junk.
Use ival in 32-bit mode, lval in 64-bit mode.

Mixing of 32 and 64 bit architectures won't work without more changes.

This commit was SVN r12802.
2006-12-07 23:34:04 +00:00
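
The problem fixed in 6e09b0c23f can be sketched with a small union. The member names `ival`, `lval`, and `pval` come from the commit message; the union type, the helper `ptr_to_u64`, and the choice to zero the union first are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical union modelled on the member names in the commit message. */
typedef union {
    uint32_t ival;
    uint64_t lval;
    void    *pval;
} sketch_ptr_t;

/*
 * Convert a pointer to an integer through the union.
 * With 4-byte pointers, storing pval writes only the first 4 bytes, so
 * reading lval would pick up 4 bytes that were never written. On a
 * big-endian machine those 4 bytes are the low-order half of lval and
 * the pointer lands in the high-order half, so the value is wrong.
 */
static uint64_t ptr_to_u64(void *p)
{
    sketch_ptr_t u;

    u.lval = 0;   /* extra safety in this sketch: define every byte first */
    u.pval = p;

    if (sizeof(void *) == sizeof(uint32_t)) {
        return u.ival;                  /* 32-bit mode */
    }
    assert(sizeof(void *) == sizeof(uint64_t));
    return u.lval;                      /* 64-bit mode */
}
```

As the commit notes, this only keeps a single architecture consistent; exchanging such values between 32-bit and 64-bit peers would need further changes.
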
Brian Barrett
98884e45e4 Clean up the way procs are added to the global process list after MPI_INIT:
  * Do not add new procs to the global list during modex callback or
    when sharing orte names during accept/connect.  For modex, we
    cache the modex info for later, in case that proc ever does get
    added to the global proc list.  For accept/connect orte name
    exchange between the roots, we only need the orte name, so no
    need to add a proc structure anyway.  The procs will be added
    to the global process list during the proc exchange later in 
    the wireup process
  * Rename proc_get_namebuf and proc_get_proclist to proc_pack
    and proc_unpack and extend them to include all information
    needed to build that proc struct on a remote node (which
    includes ORTE name, architecture, and hostname).  Change
    unpack to call pml_add_procs for the entire list of new
    procs at once, rather than one at a time.
  * Remove ompi_proc_find_and_add from the public proc
    interface and make it a private function.  This function
    would add a half-created proc to the global proc list, so
    making it harder to call is a good thing.

This means that there are only two ways to add new procs into the global proc list at this time: during MPI_INIT via the call to ompi_proc_init, where my job is added to the list, and via ompi_proc_unpack using a buffer from a packed proc list sent to us by someone else.  Currently, this is enough to implement MPI semantics.  We can extend the interface more if we like, but that may require HNP communication to get the remote proc information and I wanted to avoid that if at all possible.

Refs trac:564

This commit was SVN r12798.

The following Trac tickets were found above:
  Ticket 564 --> https://svn.open-mpi.org/trac/ompi/ticket/564
2006-12-07 19:56:54 +00:00
Brian Barrett
b07dfa7841 * remove unused variable in ompi_comm_get_rprocs
* don't load data into a buffer until we have the data, as
    the data contains some header information needed to
    properly load the data

This commit was SVN r12792.
2006-12-07 16:19:44 +00:00
Sven Stork
78173a697a Replace the test operation "-e" with "-r" to improve portability.
Refs: #392

This commit was SVN r12790.
2006-12-07 12:14:40 +00:00
Ralph Castain
62d7826e01 Helps if we total up the correct field to get the total number of slots in the universe
This commit was SVN r12789.
2006-12-07 03:17:12 +00:00
Ralph Castain
a1153fdc8f Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attribute_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment.
Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available.

This commit was SVN r12788.
2006-12-07 03:11:20 +00:00
Brian Barrett
41a70a8f01 indent, this time with the right coding standards...
This commit was SVN r12787.
2006-12-07 00:24:01 +00:00
Brian Barrett
f9ec8d6f2a reindent file to make it easier to deal with...
This commit was SVN r12786.
2006-12-07 00:21:25 +00:00
Brian Barrett
8f68764e5e A number of heterogeneous fixes for the dss with the new buffer options:
  * When using the load/unload interface, stash away the current buffer
    type so that it can be properly unpacked on the receiving side if
    the buffer type is other than the receiver default
  * Include type information for unsized types (bool, int, size_t,
    pid_t) so that they can be properly unpacked by the receiver
    in the heterogeneous case.
  * Restore the NON_DESC type as the default for optimized builds,
    since it looks like this fixes the known issues with the
    non-described buffers

Refs trac:587

This commit was SVN r12784.

The following Trac tickets were found above:
  Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587
2006-12-06 23:19:06 +00:00
Brian Barrett
cfeac5581a temporarily always use described buffers as the non-described causes all
kinds of problems for heterogeneous environments

This commit was SVN r12783.
2006-12-06 20:22:31 +00:00
Ralph Castain
d4bd60c9fe Restore the paffinity capability, along with all the required logic to ensure we "do the right thing" when the user gives us inaccurate information about the number of slots on a remote node.
This commit was SVN r12780.
2006-12-06 15:59:34 +00:00
Ralph Castain
b1e16fffac Add the C++ doo-hicky stuff around the odls framework definitions just in case somebody, somewhere, on some remote planet where only goats can feed needs it.
This commit was SVN r12777.
2006-12-06 13:58:04 +00:00
Ralph Castain
8ca415a0c5 Remove duplicate orte_odls declaration
This commit was SVN r12776.
2006-12-06 13:44:41 +00:00
Jeff Squyres
122b8553ef Sync against 1.1 tree
This commit was SVN r12767.
2006-12-05 19:15:56 +00:00
Edgar Gabriel
1359ba9b13 Rewriting much of the errorcode and errorclass code, since
- we have to be able to attach a string to an error class, not just to an
  error code
- according to MPI-2 the attribute MPI_LASTUSEDCODE has to be updated
  every time you add a new code or a new class. Thus, you have to have a single
  list for both.

Thus, we got rid of the error_class structure. In the error-code structure, we
can distinguish whether we are dealing with an error code or an error class by
looking at the err->code element of the structure. If its value is
MPI_UNDEFINED, the corresponding entry is a class; otherwise it is an error code. All
predefined error codes have the code and the class field set to the same
value.

The test MPI_Add_error_class1 passes now.

Fixes trac:418

This commit was SVN r12764.

The following Trac tickets were found above:
  Ticket 418 --> https://svn.open-mpi.org/trac/ompi/ticket/418
2006-12-05 19:07:02 +00:00
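
The single-list design from 1359ba9b13 can be sketched as follows. Everything here is hypothetical naming for illustration (`sketch_errcode_t`, `sketch_add_class`, `sketch_add_code`, the fixed table size, and the `SKETCH_UNDEFINED` value standing in for `MPI_UNDEFINED`); only the rule "code equal to the undefined marker means the entry is a class" is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_UNDEFINED   (-32766)   /* stand-in for MPI_UNDEFINED */
#define SKETCH_MAX_ENTRIES 64         /* arbitrary limit for the sketch */

typedef struct {
    int code;              /* SKETCH_UNDEFINED marks the entry as a class */
    int cls;               /* class this entry belongs to */
    const char *string;    /* optional text, allowed for classes and codes */
} sketch_errcode_t;

/* One table for both kinds of entry, so one counter tracks the last
 * used value (the role MPI_LASTUSEDCODE plays in the commit). */
static sketch_errcode_t table[SKETCH_MAX_ENTRIES];
static int last_used = -1;

static int sketch_add_class(void)
{
    if (last_used + 1 >= SKETCH_MAX_ENTRIES) {
        return -1;
    }
    ++last_used;
    table[last_used].code = SKETCH_UNDEFINED;
    table[last_used].cls = last_used;
    table[last_used].string = NULL;
    return last_used;
}

static int sketch_add_code(int cls)
{
    if (last_used + 1 >= SKETCH_MAX_ENTRIES || cls < 0 || cls > last_used) {
        return -1;
    }
    ++last_used;
    table[last_used].code = last_used;
    table[last_used].cls = cls;
    table[last_used].string = NULL;
    return last_used;
}

static int sketch_is_class(int idx)
{
    assert(idx >= 0 && idx <= last_used);
    return table[idx].code == SKETCH_UNDEFINED;
}
```

Because classes and codes draw from the same counter, adding either kind advances `last_used`, which is the behavior the commit attributes to MPI-2.
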
Jeff Squyres
73696be5ac Add bullets about library name changes (because some users are
undoubtedly not using the wrapper compilers) and one-sided support
fixes.

This commit was SVN r12763.
2006-12-05 18:41:40 +00:00
Brian Barrett
6f8b366acb Rename liborte to libopen-rte and libopal to libopen-pal per telecon today
and bug #632.

Refs trac:632

This commit was SVN r12762.

The following Trac tickets were found above:
  Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632
2006-12-05 18:27:24 +00:00
Dan Lacher
ba16157f7e Removed local zone only requirement from solaris packages
Submitted by: Dan Lacher
Reviewed by: Rolfv Vandavaart

This commit was SVN r12759.
2006-12-05 14:37:44 +00:00
Tim Prins
08d5ca821f Don't get the node architecture when using the LoadLeveler RAS. It is slow (about a second for ~300 nodes) and we don't even use the value.
This commit was SVN r12758.
2006-12-05 13:47:53 +00:00
Ralph Castain
eb941d8ae2 Fix a bug that declared a node as "oversubscribed" a little early during the mapper procedure. This only affected the mapping procedure, and only if you had set the "--no-oversubscribe" flag.
Kudos to Tim Prins for finding it.

This commit was SVN r12757.
2006-12-05 13:04:27 +00:00
George Bosilca
6f28bcdc21 Remove the last set of compiler warnings from the precondition file.
This commit was SVN r12753.
2006-12-04 21:45:57 +00:00
Brian Barrett
441432950f Merge in changes from the bwb-heterogeneous temp branch (r12491 -
r12714) for supporting compilers / architectures with different
padding rules.

This commit was SVN r12749.

The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
  r12491
  r12714
2006-12-04 20:11:42 +00:00
Brian Barrett
d64fa194f1 Instead of continually screwing around with different format strings to
make this warning-proof, loop over the uint64_ts as an array of integers
and use %x.  The final string is just as random and formatted exactly
the same, so we're all good in that department.

Refs trac:655

This commit was SVN r12742.

The following Trac tickets were found above:
  Ticket 655 --> https://svn.open-mpi.org/trac/ompi/ticket/655
2006-12-04 18:07:24 +00:00
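
The formatting approach in d64fa194f1 can be sketched like this. Note the differences from the commit: this version splits each `uint64_t` with shifts rather than viewing it as an array of integers, and the helper name `key_to_hex`, the two-element key, and the zero-padded `%08x` layout are assumptions, since the actual format string is not shown in the log. It also assumes `unsigned int` is at least 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a key held in two uint64_t values as 32 hex digits.
 * Each value is split into 32-bit halves so that plain "%08x" works,
 * which avoids the platform-dependent 64-bit length modifiers that
 * trigger format warnings. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if the buffer is too small.
 */
static int key_to_hex(const uint64_t key[2], char *out, size_t out_len)
{
    size_t used = 0;

    assert(key != NULL && out != NULL);

    for (int i = 0; i < 2; ++i) {
        unsigned int halves[2] = {
            (unsigned int)(key[i] >> 32),
            (unsigned int)(key[i] & 0xffffffffu)
        };
        for (int j = 0; j < 2; ++j) {
            int n = snprintf(out + used, out_len - used, "%08x", halves[j]);
            if (n < 0 || (size_t)n >= out_len - used) {
                return -1;
            }
            used += (size_t)n;
        }
    }
    return (int)used;
}
```

Splitting with shifts produces the same digits on little-endian and big-endian hosts, which a raw reinterpretation of the bytes would not.
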
Gleb Natapov
f0132b2499 Provide parameters in the correct order (processor/oversubscribed were swapped).
This commit was SVN r12737.
2006-12-04 12:55:45 +00:00
Rainer Keller
d078bb3e8a - Revert changes and include pointers to discussion.
This commit was SVN r12736.
2006-12-03 17:05:15 +00:00
Rainer Keller
6f8f28f40f - Get rid of inline definition, otherwise static-compilation fails.
This commit was SVN r12735.
2006-12-03 14:52:17 +00:00
Rainer Keller
e61dd8722e - Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx
- No functional changes, just indentation and corrections to error
   output.

This commit was SVN r12734.
2006-12-03 13:59:23 +00:00
Gleb Natapov
30ca7457b4 Some BTLs (e.g., TCP) can report put/get completion before the data actually
hits the buffer on the other side. For this kind of BTL we need to send
FIN through the same BTL that the PUT was performed with, so the network will handle
ordering for us. If we use another BTL, the receiver can get FIN before
the data hits the buffer and complete the request prematurely. We mark such
problematic BTLs with the MCA_BTL_FLAGS_FAKE_RDMA flag (this kind of RDMA
is really fake, because the real one guarantees that the sender will see the
completion only after the receiver's NIC has confirmed that all the data was
received).

This commit was SVN r12732.
2006-12-03 10:12:09 +00:00
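
The rule from 30ca7457b4 for choosing where to send FIN can be sketched as below. The flag name echoes `MCA_BTL_FLAGS_FAKE_RDMA` from the commit, but the bit value, the `sketch_btl_t` type, and the `fin_btl` helper are hypothetical and not the actual PML code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value; the commit names MCA_BTL_FLAGS_FAKE_RDMA,
 * but the bit chosen here is made up for the sketch. */
#define SKETCH_BTL_FLAGS_FAKE_RDMA 0x1u

typedef struct {
    const char  *name;
    unsigned int flags;
} sketch_btl_t;

/*
 * Pick the BTL on which to send FIN after a PUT over put_btl.
 * preferred is whatever BTL the sender would otherwise choose.
 */
static const sketch_btl_t *fin_btl(const sketch_btl_t *put_btl,
                                   const sketch_btl_t *preferred)
{
    assert(put_btl != NULL && preferred != NULL);

    if (put_btl->flags & SKETCH_BTL_FLAGS_FAKE_RDMA) {
        /* Local completion does not prove remote delivery, so FIN must
         * follow the data on the same path and rely on its ordering. */
        return put_btl;
    }
    /* Real RDMA: completion implies the data arrived, any BTL will do. */
    return preferred;
}
```
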
Gleb Natapov
39c930b160 The bug-fixing part of r12720 introduces a much more serious bug than it fixes.
It calls mca_pml_ob1_send_fin_btl(), which may fail, and doesn't check the return
code. This breaks all RDMA transports even when only one BTL is used. Revert
it for now; I am working on a real fix for the problem (I hope).

This commit was SVN r12731.

The following SVN revision numbers were found above:
  r12720 --> open-mpi/ompi@3e3689320b
2006-12-03 08:55:59 +00:00
Gleb Natapov
65d7ad4581 The "bug fix" from r12721 reverts part of r12433, which fixed a
regression from v1.1 and was reviewed and put into the v1.2 branch. So revert this part
of r12721.

This commit was SVN r12730.

The following SVN revision numbers were found above:
  r12433 --> open-mpi/ompi@82f7c0dd69
  r12721 --> open-mpi/ompi@3edd850d2e
2006-12-03 08:29:55 +00:00
George Bosilca
a0ed53d70b Make the compilers happy.
This commit was SVN r12729.
2006-12-03 00:19:11 +00:00
Ralph Castain
4151a46871 Per Jeff's request (which made a lot of sense), setup the default buffer type to be DESCRIBED for debug/devel builds, and NON-DESC for optimized builds. The user can still select the default buffer type via mca parameter at runtime - this just sets the default default. :-)
Also, change the dss buffer type mca param to something more easily remembered (it is now "dss_buffer_type"). Heck, even I had to keep looking at the darn code to remember it.

This commit was SVN r12728.
2006-12-02 13:32:16 +00:00
George Bosilca
3fd278c522 Make the tree compile in debug mode.
This commit was SVN r12724.
2006-12-01 23:03:09 +00:00
Ralph Castain
897744cdeb Two major changes to the runtime (items 1 and 2), plus two smaller additions (items 3 and 4):
1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1".

2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial".

3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1".

4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer)

This commit was SVN r12722.
2006-12-01 22:30:39 +00:00
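
For readers unfamiliar with the two broadcast shapes mentioned in 897744cdeb, here is a sketch of a textbook binomial tree rooted at rank 0. It is not the xcast implementation; the function names and the rank numbering are assumptions. In the linear method the root sends to each of the other n - 1 ranks itself, whereas in the binomial tree every rank that already holds the message forwards it, so the number of rounds grows with log2(n).

```c
#include <assert.h>

/*
 * Children of `rank` in a binomial broadcast tree rooted at rank 0
 * over n ranks. In the round with distance `mask`, every rank below
 * `mask` already holds the message and sends it to rank + mask.
 * Stores up to max_out children in out and returns the total count.
 */
static int binomial_children(int rank, int n, int *out, int max_out)
{
    int count = 0;

    assert(n > 0 && n <= (1 << 30) && rank >= 0 && rank < n);

    for (int mask = 1; mask < n; mask <<= 1) {
        if (mask > rank && rank + mask < n) {
            if (count < max_out) {
                out[count] = rank + mask;
            }
            ++count;
        }
    }
    return count;
}

/* Number of rounds the binomial tree needs to reach all n ranks. */
static int binomial_rounds(int n)
{
    int rounds = 0;

    assert(n > 0 && n <= (1 << 30));

    for (int reach = 1; reach < n; reach <<= 1) {
        ++rounds;
    }
    return rounds;
}
```

With 8 ranks, rank 0 sends to ranks 1, 2, and 4 over three rounds, while a linear broadcast has the root perform all 7 sends.
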
George Bosilca
3edd850d2e Some indentation and code arrangement. However, there is a bug fix. Force the PUT
protocol to always obey btl_max_rdma_size.

This commit was SVN r12721.
2006-12-01 22:26:14 +00:00
George Bosilca
3e3689320b Some indentation and one BIG fix. Avoid race conditions in the PUT RDMA
protocol when multiple NICs are available between 2 peers. The fix forces
the FIN message to take exactly the same path as the fragment it describes
(i.e., same path means same BTL). Otherwise, the FIN can be received by
the peer before the RDMA completes and the request will get freed
too early.

This commit was SVN r12720.
2006-12-01 21:52:07 +00:00
George Bosilca
658879232b Several small improvements:
- consistent error message when something fails (via BTL_ERROR macro)
- decrease the number of jumps.
- cleanup some parts of the code.

This commit was SVN r12719.
2006-12-01 21:48:06 +00:00
Brian Barrett
657168a74c A couple of Window related error handling cleanups:
  * Always invoke the error handler on MPI_COMM_WORLD for
    invalid windows (except in win_create, which should 
    instead be on the given communicator).
  * Allow get_errhandler in addition to set_errhandler
    on MPI_WIN_NULL

Refs trac:647

This commit was SVN r12718.

The following Trac tickets were found above:
  Ticket 647 --> https://svn.open-mpi.org/trac/ompi/ticket/647
2006-12-01 20:05:06 +00:00
Brian Barrett
2a09fa2d9d * silence compiler warning
This commit was SVN r12717.
2006-12-01 20:01:53 +00:00
Brian Barrett
bfbc281e93 Fix slow startup issue with the MX MTL. The problem is caused by mx_connect() being a one-sided operation from the API level, but not being an interrupting call when the target is not entering the MX library. So if most of the processes exit mtl_mx_add_procs() and enter the stage gate 2 barrier, the other processes can only progress their mx_connect() calls when the targets enter the mx library. Because the event library is in EV_ONELOOP mode, this only happens once a second. The mx progress thread (hidden in the MX library) also only wakes up once a second, so mx_connect calls can take a second to complete.
The temporary solution is to switch into EV_NONBLOCK mode earlier (right after the mx_connect loop) so that there isn't a giant slowdown when processes enter the stage gate 2 barrier before other processes.  They will now not block in the event library for any period of time, which appears to give a 50% speedup when running at > 64 procs.

Refs trac:645

This commit was SVN r12713.

The following Trac tickets were found above:
  Ticket 645 --> https://svn.open-mpi.org/trac/ompi/ticket/645
2006-12-01 02:49:01 +00:00
Bill D'Amico
3e11c76d4c Fixes trac:482
OMPI_ARRAY_INT_2_LOGICAL had an array bounds error - fixed this and the
analogous error in OMPI_ARRAY_LOGICAL_2_INT.

This commit was SVN r12712.

The following Trac tickets were found above:
  Ticket 482 --> https://svn.open-mpi.org/trac/ompi/ticket/482
2006-11-30 19:48:18 +00:00
Pavel Shamis
f08bc818c4 Remove unused variables from
mca_btl_openib_progress_thread.

This commit was SVN r12709.
2006-11-30 18:28:45 +00:00
Jeff Squyres
384caeacf4 More fixes similar to r12684 -- fix bad lvalues in assignments for the
case where sizeof(INTEGER) > sizeof(int).

This commit was SVN r12707.

The following SVN revision numbers were found above:
  r12684 --> open-mpi/ompi@e2c605f32a
2006-11-30 16:41:56 +00:00