Commit Graph

1230 Commits

Author SHA1 Message Date
Tim Prins
31a3430c85 Fix threaded builds broken by r14914
This commit was SVN r14933.

The following SVN revision numbers were found above:
  r14914 --> open-mpi/ompi@983fd3432a
2007-06-06 22:30:34 +00:00
Ralph Castain
e0e4163f53 Remove pithy comment
This commit was SVN r14930.
2007-06-06 20:26:52 +00:00
George Bosilca
3b7f3e5565 Keep the unknown shell string.
This commit was SVN r14929.
2007-06-06 20:24:42 +00:00
George Bosilca
29dd535c01 Remove all references to the orte_bitmap as well as the files.
This commit was SVN r14928.
2007-06-06 20:24:07 +00:00
George Bosilca
fbb46f0ee7 A faster search without the bitmap. Remove all references to the orte_bitmap.
This commit was SVN r14926.
2007-06-06 20:23:14 +00:00
George Bosilca
24eae5c1ec We have a goto label for cleanup so make sure we always use it. This way we ensure
that the locks are correctly released in all cases.

This commit was SVN r14925.
2007-06-06 20:20:52 +00:00
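The single-exit cleanup pattern that 24eae5c1ec refers to can be sketched as follows. This is a minimal illustration only: `table_lock` and `copy_under_lock` are hypothetical names, and a POSIX mutex stands in for whatever lock the original code uses.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical lock guarding some shared table (an assumption for this sketch). */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copies src into a freshly allocated buffer while holding the lock.
 * Returns 0 on success and -1 on error. Every path leaves through the
 * cleanup label, so the lock is released exactly once in all cases. */
static int copy_under_lock(const char *src, char **out)
{
    int rc = -1;
    char *buf = NULL;

    pthread_mutex_lock(&table_lock);

    if (src == NULL || out == NULL) {
        goto cleanup;   /* never return directly while holding the lock */
    }
    buf = malloc(strlen(src) + 1);
    if (buf == NULL) {
        goto cleanup;
    }
    strcpy(buf, src);
    *out = buf;
    rc = 0;

cleanup:
    pthread_mutex_unlock(&table_lock);
    return rc;
}
```

Routing every error path through one label keeps the unlock in a single place, which is easier to audit than an unlock before each early return.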
George Bosilca
28c9d0758b These functions are supposed to be static so there is no reason to have them declared in the header file.
This commit was SVN r14924.
2007-06-06 20:18:37 +00:00
George Bosilca
b047ed75d7 Don't forget to free the temporary buffer.
This commit was SVN r14923.
2007-06-06 20:17:27 +00:00
Ralph Castain
ea0c03fd7a Revert out r14910. Turns out that the GPR *has* to be able to deal with NULL data values. We fixed this a long time ago on the "put" side, but never dealt with it for "get" - hence, we could "put" ORTE_UNDEF'd attributes in a mapping policy, but couldn't retrieve them. This is why you only encountered the error on comm_spawn and not during the original launch of a job.
This correctly repairs the problem by enabling the GPR's "get" function to correctly handle NULL data values.

This commit was SVN r14916.

The following SVN revision numbers were found above:
  r14910 --> open-mpi/ompi@0757467d77
2007-06-06 18:34:54 +00:00
Ralph Castain
983fd3432a Fix singleton comm_spawn. Ensure that singletons start the RML receive function so they can receive RML updates during xconnect procedures once any comm_spawn'd children start. Since singletons only use the RMGR/URM component, update that component to also hold us until xconnect is completed (if it is invoked) before returning to the caller.
This commit was SVN r14914.
2007-06-06 17:39:23 +00:00
Ralph Castain
bbb60e37c0 Remove stale define
This commit was SVN r14913.
2007-06-06 17:19:43 +00:00
Ralph Castain
0757467d77 Fix comm_spawn. The problem stems from our use of the existence of an attribute as equivalent to a boolean "true" - in other words, we only confirm the existence of an attribute on a list to indicate something as opposed to looking at its specific value. Hence, we create the attribute with a type of ORTE_UNDEF - which is fine...until we then attempt to store/retrieve that attribute from the registry. In that case, the DSS barks because it treats ORTE_UNDEF as an error.
The only place where we attempt to store/retrieve attributes is in the RMAPS framework in support of comm_spawn. So this is where things broke down. The fix was simply to say "if the attribute data type is ORTE_UNDEF, then treat it like a boolean with value true". Trivial fix - solves problem.

This commit was SVN r14910.
2007-06-06 15:16:22 +00:00
Brian Barrett
508da4e959 OS X apparently really doesn't like shared libraries with unresolvable
symbols in them and environ is defined only in the final application
(probably in crt1.o).  Apple provides a function for getting at the
environment, so use that instead if it's available.

This commit was SVN r14857.
2007-06-05 03:03:59 +00:00
Brian Barrett
34fea87819 * Only need to do the opal_progress_event_users_increment() once between
    OPAL and ORTE.  Since we now do opal_progress_init(), we do it
    there.  Fixes a performance issue introduced in r14773.
  * While trying to find the above, noticed that we did the reference
    counting for the init in init_util and for finalize in fini.  That
    isn't right, so make them both in the non-util versions.

This commit was SVN r14830.

The following SVN revision numbers were found above:
  r14773 --> open-mpi/ompi@1e678c3f55
2007-06-01 02:43:46 +00:00
Brian Barrett
11ae30333d strmode is a standard BSD function defined in string.h
This commit was SVN r14828.
2007-05-31 22:47:40 +00:00
Rolf vandeVaart
ec963d02c7 Fix debug timing output so that proper algorithm is printed for
each type of xcast (direct, linear, binomial).

This commit was SVN r14822.
2007-05-31 18:27:54 +00:00
Brian Barrett
e4b369c93e Properly handle case where user instructs the oob to not use all non-localhost
interfaces

This commit was SVN r14815.
2007-05-31 02:29:44 +00:00
George Bosilca
f8f71b9ba0 Correct a threaded problem and make sure we only free what was allocated.
This commit was SVN r14803.
2007-05-30 18:50:29 +00:00
George Bosilca
905570a6d2 Call opal_show_help with the expected number of arguments.
This commit was SVN r14802.
2007-05-30 18:49:43 +00:00
George Bosilca
6afbc02052 The idea behind this patch is to decrease the number of strcmp calls used in the replica
by using a small hash function before doing the strcmp. The hash key for each
registry entry is computed when it is added to the registry. When we're doing a
query, instead of comparing the 2 strings we first check if the hash keys match,
and if they do match then we compare the 2 strings in order to make sure we
eliminate collisions from our answers.

There is some benefit in terms of performance. It's hardly visible for a few
processes, but it starts showing up when the number of processes increases. In fact
the number of strcmp calls in the trace file decreases drastically. The main reason it
works well is that most of the keys start with basically the same chars
(such as orte-blahblah), which turns each strcmp into a loop over a few chars.

This commit was SVN r14791.
2007-05-29 18:40:07 +00:00
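The hash-before-strcmp lookup described in 6afbc02052 can be sketched roughly as below. The type `entry_t`, the functions `hash_name`, `entry_init` and `find_entry`, and the djb2-style hash are all assumptions made for illustration; the commit message does not say which hash function the replica actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical registry entry: the hash is computed once, when the entry is added. */
typedef struct {
    const char *name;
    uint32_t    hash;
} entry_t;

/* Small string hash (djb2-style). An assumption, not the replica's real function. */
static uint32_t hash_name(const char *s)
{
    uint32_t h = 5381u;
    assert(s != NULL);
    while (*s != '\0') {
        h = h * 33u + (unsigned char)*s++;
    }
    return h;
}

/* Store the name together with its precomputed hash. */
static void entry_init(entry_t *e, const char *name)
{
    e->name = name;
    e->hash = hash_name(name);
}

/* Returns the index of the entry whose name equals key, or -1 if none.
 * The integer comparison filters out most candidates; strcmp runs only
 * when the hashes match, which also rules out hash collisions. */
static int find_entry(const entry_t *tab, int n, const char *key)
{
    uint32_t kh = hash_name(key);
    for (int i = 0; i < n; i++) {
        if (tab[i].hash == kh && strcmp(tab[i].name, key) == 0) {
            return i;
        }
    }
    return -1;
}
```

The saving comes from keys that share a long common prefix: a plain strcmp has to walk that prefix for every entry, while the hash check is a single integer comparison.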
Ralph Castain
a2964f429e Fix a compiler warning - strncmp returns an int, so you have to compare to 0 instead of NULL.
This commit was SVN r14790.
2007-05-29 18:02:10 +00:00
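The warning fixed in a2964f429e comes from treating the result of strncmp as a pointer. A minimal sketch of the correct comparison, using a hypothetical helper `has_prefix`:

```c
#include <stdbool.h>
#include <string.h>

/* Returns true when s begins with prefix. strncmp returns an int,
 * and 0 means the first n characters are equal, so the result is
 * compared against 0, not against NULL. */
static bool has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}
```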
Anya Tatashina
de676d717b Ref Trac #1032; added support for full path launching with TotalView
This commit was SVN r14789.
2007-05-29 17:39:11 +00:00
Tim Prins
f95442dec9 better error output
This commit was SVN r14788.
2007-05-29 16:53:40 +00:00
Tim Prins
b4e3ad8da0 Fixup tests for recent api changes
cleanup a ton of warnings, include proper files

fix orte_ring, it had a deadlock in it...

fix the abort test so it can be used with less than 4 processes

This commit was SVN r14787.
2007-05-29 16:22:50 +00:00
Josh Hursey
1e678c3f55 per conversation with Ralph and Jeff take out the opal_init_only logic.
This commit moves the initialization/finalization of opal_event and opal_progress
to opal_init/finalize. These were previously init/final in ORTE which is an
abstraction violation. After talking about it we concluded that there are no
ordering issues that require these to be init/final in ORTE instead of OPAL.

I ran the IBM test suite against this commit and it didn't turn up any new
failures so I think it is good to go.

Let us know if this causes problems.

This commit was SVN r14773.
2007-05-24 21:54:58 +00:00
Josh Hursey
e8b85faf28 Fix for the invalid arguments case. we were not finalizing cleanly.
This commit was SVN r14770.
2007-05-24 21:27:06 +00:00
Josh Hursey
a296ef5487 Checkpoint/restart fix:
Still recovering from interface changes.

This commit was SVN r14769.
2007-05-24 21:12:34 +00:00
Rainer Keller
a665b7a20d - Getting rid of "missing initializer" warnings
This commit was SVN r14766.
2007-05-24 19:19:52 +00:00
Rainer Keller
7d84de8510 - now the formatting (just getting rid of spaces at the end)....
This commit was SVN r14764.
2007-05-24 19:10:32 +00:00
Rainer Keller
ff3cfc0011 - Get rid of "set but never used" warning
This commit was SVN r14763.
2007-05-24 19:07:45 +00:00
Jeff Squyres
379a4ec5e2 While we're editing MCA params in the oob tcp component, ditch the use
of the deprecated MCA param API for registering MCA parameters and
update to the current API.

This commit was SVN r14747.
2007-05-24 13:01:55 +00:00
Jeff Squyres
839c1db95c Fix something that has been bugging me for a while:
Rename the oob_tcp_include and oob_tcp_exclude MCA parameters to be
oob_tcp_if_include and oob_tcp_if_exclude (to match the convention
with btl_tcp_if_[in|ex]clude).  Keep "hidden" synonyms oob_tcp_include
and oob_tcp_exclude in case anyone is actually using them (and some
users undoubtedly are), but do not have them show up in ompi_info
--param output.  Instead, the new "oob_tcp_if_*" names will show up in
ompi_info output.

This commit was SVN r14746.
2007-05-24 12:52:26 +00:00
George Bosilca
7485c920c4 Remove the useless comment as it does not reflect the reality anymore.
This commit was SVN r14737.
2007-05-23 18:55:02 +00:00
Ralph Castain
02f6e6ab3e Slight touchup to make it pretty
This commit was SVN r14734.
2007-05-23 16:39:18 +00:00
Ralph Castain
3fc227286f Be sure to NULL terminate the list of keys...
This commit was SVN r14733.
2007-05-23 16:35:03 +00:00
Ralph Castain
cffea274b3 Fix the hostfile system. Enable the hostfile RDS to properly setup the name service. Modify the name service so that a caller can provide a valid cellid and update its site info for later retrieval.
This commit was SVN r14732.
2007-05-23 16:31:44 +00:00
Sven Stork
ed72cbcaec - Fix a deadlock problem for threaded builds. We have to release the lock
  before we wait for the callback, because the callback will try to lock
  the lock again. (shows up in debug+threaded builds)

This commit was SVN r14731.
2007-05-23 16:11:50 +00:00
Josh Hursey
a010ff6e6a Some updates from the interface change to orte_init
This commit was SVN r14729.
2007-05-23 14:44:23 +00:00
Ralph Castain
e6ff7757ab Modify the new DSS xfer and copy functions so they only xfer/copy the unpacked portion of a buffer's payload. This allows for more rapid transfer of data during message relay without requiring any knowledge of what is in the buffer.
Begin work on restoring binomial message distribution method.

This commit was SVN r14728.
2007-05-23 14:06:32 +00:00
Ralph Castain
b582d98d4a Fix singleton comm_spawn so it can see available resources
This commit was SVN r14719.
2007-05-22 13:29:07 +00:00
Ralph Castain
5b0abf520b Don't update our own contact info
This commit was SVN r14718.
2007-05-22 13:28:23 +00:00
Ralph Castain
677eb5e4bc Ensure the singleton wakes up when comm_spawn fails
This commit was SVN r14714.
2007-05-21 20:13:31 +00:00
Ralph Castain
b771cfcce3 Fix compile problem
This commit was SVN r14713.
2007-05-21 20:11:03 +00:00
Ralph Castain
4fff584a68 Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.

Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.

Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.

With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.

Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".

This commit was SVN r14711.
2007-05-21 18:31:28 +00:00
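The "arith" operation introduced in 4fff584a68 is described as an atomic get, modify, put on a registry value. The sketch below only illustrates that idea: `arith_op_t`, `reg_lock` and `reg_arith` are hypothetical names, a plain `long` guarded by a POSIX mutex stands in for a registry location, and none of it reflects the real GPR signature.

```c
#include <pthread.h>

/* Hypothetical operation codes (not the GPR's actual constants). */
typedef enum { ARITH_ADD, ARITH_SUB, ARITH_MUL, ARITH_DIV } arith_op_t;

/* Hypothetical lock standing in for the registry's own serialization. */
static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Applies op with the given operand to *value as one locked step, so no
 * other thread can observe or change the value between the read and the
 * write. Returns 0 on success and -1 on a bad operation or division by
 * zero, in which case *value is left unchanged. */
static int reg_arith(long *value, arith_op_t op, long operand)
{
    int rc = 0;

    pthread_mutex_lock(&reg_lock);
    switch (op) {
    case ARITH_ADD: *value += operand; break;
    case ARITH_SUB: *value -= operand; break;
    case ARITH_MUL: *value *= operand; break;
    case ARITH_DIV:
        if (operand == 0) {
            rc = -1;
        } else {
            *value /= operand;
        }
        break;
    default:
        rc = -1;
        break;
    }
    pthread_mutex_unlock(&reg_lock);
    return rc;
}
```

Doing the update in one call avoids the race that a separate "get" followed by a "put" would allow when several callers adjust the same trigger value.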
Ralph Castain
c9c2922706 Drat - commit missing file.
This commit was SVN r14710.
2007-05-21 17:20:56 +00:00
Ralph Castain
3288ce0462 Cleanup the SDS components and move some common code to the base.
Modify the seed and singleton SDS code so it returns the right local rank and num_local_procs.

This commit was SVN r14707.
2007-05-21 14:57:58 +00:00
Ralph Castain
d9acc93efa Compute and pass the local_rank and local number of procs (in that proc's job) on the node.
To be precise, given this hypothetical launching pattern:

host1: vpids 0, 2, 4, 6
host2: vpids 1, 3, 5, 7

The local_rank for these procs would be:

host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3
host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3

and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid.

I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future.

This commit was SVN r14706.
2007-05-21 14:30:10 +00:00
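The local_rank numbering described in d9acc93efa can be reproduced with a short sketch. It assumes a single job whose vpid-to-node mapping is given as an array; `compute_local_ranks` and its parameters are hypothetical names, and the quadratic loop is chosen for clarity rather than taken from the SDS code.

```c
#include <assert.h>
#include <stddef.h>

/* node_of[v] is the node index hosting vpid v (all procs in the same job).
 * On return, local_rank[v] is the number of lower vpids on the same node,
 * and num_local_procs[v] is the total number of procs on that node. */
static void compute_local_ranks(const int *node_of, size_t nprocs,
                                int *local_rank, int *num_local_procs)
{
    assert(node_of != NULL && local_rank != NULL && num_local_procs != NULL);
    for (size_t v = 0; v < nprocs; v++) {
        int before = 0;
        int total = 0;
        for (size_t u = 0; u < nprocs; u++) {
            if (node_of[u] == node_of[v]) {
                total++;
                if (u < v) {
                    before++;
                }
            }
        }
        local_rank[v] = before;
        num_local_procs[v] = total;
    }
}
```

With the mapping from the commit message (even vpids on host1, odd vpids on host2), vpid 4 gets local_rank 2 and every proc sees four local procs. A comm_spawn'd child belongs to a different jobid, so it would be numbered in a separate call.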
Jeff Squyres
97248d6bc6 Add another test to check multiple, concurrent COMM_SPAWN's.
This commit was SVN r14701.
2007-05-19 19:02:24 +00:00
Jeff Squyres
47ba3db3b8 Add a simple MPI_COMM_SPAWN_MULTIPLE test.
This commit was SVN r14700.
2007-05-19 02:30:53 +00:00
Ralph Castain
180c96bb8f Clear an erroneous error message pending a more complete fix
This commit was SVN r14698.
2007-05-18 14:44:27 +00:00