1
1
Граф коммитов

195 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
c1690f403e Remove non-existent file
This commit was SVN r27730.
2012-12-29 02:21:50 +00:00
Ralph Castain
68329b516c Cleanup stale test codes
This commit was SVN r27729.
2012-12-28 16:52:51 +00:00
Jeff Squyres
f779b1ded9 Put back the static-library-detection stuff from r27668, with some
additional functionality.  Rationale (refs trac:3422):

 * Normal MPI applications only ever use the MPI API. Hence, -lmpi is
   sufficient (they'll never directly call ORTE or OPAL
   functions). This is arguably the most common case.
 * That being said, we do have some test programs (e.g., those in
   orte/test/mpi) that call MPI functions but also call ORTE/OPAL
   functions. I've also written the occasional MPI test program that
   calls opal_output, for example (there even might be a few tests in
   the IBM test suite that directly call ORTE/OPAL functions).
   * Even though this is not a common case, these applications should
     also compile/link with mpicc.
   * So we should add a --openmpi:linkall option that will also link
     in whatever is necessary to call ORTE/OPAL functions
   * Yes, we could hard-code "-lopen-rte -lopen-pal" in Makefiles, but
     we do reserve the right to change those library names and/or add
     others someday, so it's better to abstract out the names and let
     the wrapper supply whatever is necessary.
 * ORTE programs, however, are different. They almost always call OPAL
   functions (e.g., if they want to send a message, they must use the
   OPAL DSS). As such, it seems like the ORTE programs should always
   link in OPAL.

Therefore:

 * Add undocumented --openmpi:linkall flag to the wrapper compilers.
   See the comment in opal_wrapper.c for an explanation of what it
   does.  This flag is only intended for Open MPI developers -- not
   end users.  That's why it's undocumented.
 * Update orte/test/mpi/Makefile.am to add --openmpi:linkall
 * Make ortecc/ortec++'s wrapper data text files always explicitly
   link in libopen-pal

This commit was SVN r27670.

The following SVN revision numbers were found above:
  r27668 --> open-mpi/ompi@cf845897aa

The following Trac tickets were found above:
  Ticket 3422 --> https://svn.open-mpi.org/trac/ompi/ticket/3422
2012-12-13 22:31:37 +00:00
Ralph Castain
e11f32038a Add an MCA param to retain all aliases based on IP addrs for node names so that procs can look them up by interface, if desired. If the param is set, pass aliases around to all daemons and procs for local use
This commit was SVN r27619.
2012-11-16 04:04:29 +00:00
Ralph Castain
fe6dfad625 Update DFS to support multi-node operations
This commit was SVN r27594.
2012-11-12 02:54:53 +00:00
Ralph Castain
bd887f7f56 Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations.
This commit was SVN r27587.
2012-11-10 14:09:12 +00:00
Ralph Castain
a080de188f Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists.
This commit was SVN r27516.
2012-10-29 23:11:30 +00:00
Ralph Castain
4e52a15e70 Provide for sync on seek and close DFS operations. Eliminate an unnecessary wake-up timer when using ORTE progress thread
This commit was SVN r27500.
2012-10-26 15:49:04 +00:00
Ralph Castain
df642f1508 Add an API to get a remote file's size. Separate dfs cmds from returned data messages so daemons don't get confused.
This commit was SVN r27487.
2012-10-25 22:23:08 +00:00
Ralph Castain
094d6f3143 Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them.
Add a set of URI create/parse utilities

This commit was SVN r27483.
2012-10-25 17:15:17 +00:00
Ralph Castain
90d7b5fdca Update test
This commit was SVN r27354.
2012-09-20 02:51:27 +00:00
Ralph Castain
5f7a5c4793 Update test to include all keys
This commit was SVN r27311.
2012-09-12 05:02:51 +00:00
Ralph Castain
cd8aff675b Update test
This commit was SVN r27303.
2012-09-11 20:32:43 +00:00
Ralph Castain
0e1dbe8711 Remove non-existent files
This commit was SVN r27136.
2012-08-25 01:29:17 +00:00
Ralph Castain
c8b511d18a Remove stale tests
This commit was SVN r27126.
2012-08-24 02:22:11 +00:00
Ralph Castain
3c13176aa7 Remove test code
This commit was SVN r27114.
2012-08-22 21:36:54 +00:00
Ralph Castain
335c0eafcf Add a filem test program and set ignores
This commit was SVN r27069.
2012-08-16 17:46:46 +00:00
Jeff Squyres
96f640a762 Add new "opal_hotel" class. Abstractly speaking, this class does the
following:

 * Provides a fixed number of resource slots (i.e., "hotel rooms").
 * Allows one thing to occupy a resource slot at a time (i.e., each
   hotel room can have an occupant check in to that room).
 * Resource slots can be vacated at any time (i.e., occupants can
   voluntarily check out of their hotel room).
 * Resource slots can be occupied for a specific maximum amount of
   time.  If that time expires, the occupant is forcibly evicted and
   the upper layer is notified via (libevent) callback (i.e., the maid
   will kick an occupant of out of their room when their reservation
   is over).

This class can be to be used for things like retransmission schemes
for unreliable transports.  For example, a message sent on an
unreliable transport can be checked in to a hotel room.  If an ACK for
that message is received, the message can be checked out.  But if the
ACK is never received, the message will eventually be evicted from its
room and the upper layer will be notified that the message failed to
check out in time (i.e., that an ACK for that message was not received
in time).

Code using this class is currently being developed off-trunk, but will
be coming to SVN soon.

This commit was SVN r27067.
2012-08-16 17:29:55 +00:00
Ralph Castain
c90b7380c1 Sigh - of course, they changed the name of the silly MPI_Info object in the final standard, but not in the proposal. So change to the new MPI_INFO_ENV name. Also, don't set unknown values to "N/A", but just leave them unset.
This commit was SVN r27012.
2012-08-12 05:00:57 +00:00
Ralph Castain
cb48fd52d4 Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions:
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context

Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.

This commit was SVN r27005.
2012-08-12 01:28:23 +00:00
Ralph Castain
6ee35e4977 Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well
This commit was SVN r26931.
2012-07-31 21:21:50 +00:00
Ralph Castain
9680c52f5e Add mrplus examples to tarball
This commit was SVN r26643.
2012-06-23 02:40:46 +00:00
Ralph Castain
40c2fc5f55 Update the tests, add a couple
This commit was SVN r26379.
2012-05-02 19:00:05 +00:00
Ralph Castain
8f7bf3344a Update test
This commit was SVN r26370.
2012-05-01 18:38:44 +00:00
Ralph Castain
4542070cf2 Add event priority inversion test
This commit was SVN r26369.
2012-05-01 16:42:22 +00:00
Ralph Castain
f68487016c Add test code from Terry. Properly terminate if we don't abort on non-zero exit
This commit was SVN r26271.
2012-04-16 16:44:23 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
534d70025f Cleanup the detection of process binding during mpi_init. There are several cases that need to be checked:
1. no binding support - indicated by a negative return code from get_cpubind

2. binding supported, but not bound - the bitset returned by get_cpubind is the same as the available cpuset

3. binding supported and bound - bitset from get_cpubind is a subset of available cpuset

4. only one cpu is available - in this case, get_cpubind matches the available cpuset, but we are effectively bound

This commit was SVN r25957.
2012-02-17 21:18:53 +00:00
Ralph Castain
10f94efbda Add binding output to test
This commit was SVN r25955.
2012-02-17 16:48:01 +00:00
Ralph Castain
15facc4ba6 Fix comm_spawn yet again...add another test
This commit was SVN r25579.
2011-12-06 20:15:40 +00:00
Ralph Castain
6310361532 At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement

The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.

In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:

1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.

2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.

3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.

As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.

This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
Samuel Gutierrez
3ea59cce96 minor cleanup to getenv_pmi.c.
This commit was SVN r25449.
2011-11-07 03:18:07 +00:00
Samuel Gutierrez
e03bc93fb7 only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code.
This commit was SVN r25447.
2011-11-06 17:28:40 +00:00
Ralph Castain
198e001554 Add another test
This commit was SVN r25415.
2011-11-02 15:59:16 +00:00
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Wesley Bland
09274cd047 Make sure that the epoch is initialized everywhere so we don't get weird output
during valgrind. This shouldn't have caused any problems with any actual
execution. Just extra warnings in valgrind.

This commit was SVN r25015.
2011-08-08 15:11:55 +00:00
Ralph Castain
7b9f958dcf Add some missing error strings. Update test to show silent errors
This commit was SVN r25010.
2011-08-08 04:21:02 +00:00
Ralph Castain
590ac70e88 Add a simple test program for error string output
This commit was SVN r25007.
2011-08-07 21:32:25 +00:00
Ralph Castain
8853e0e80a Fix regular expression analyzer for slurmd - use a slurm-specific version
Fix multi-node routing for daemon startup when static ports are not set

This commit was SVN r24898.
2011-07-13 22:49:56 +00:00
Ralph Castain
1ee7c39982 Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.

Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.

This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Wesley Bland
e1ba09ad51 Add a resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
138928fcf4 Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces.
This commit was SVN r24666.
2011-04-29 18:46:40 +00:00
Ralph Castain
80ef1af8ba Add psm key generator program
This commit was SVN r24197.
2010-12-30 20:54:58 +00:00
Ralph Castain
58e711a412 Update a test and add two new ones for testing event lib thread support
This commit was SVN r24051.
2010-11-13 15:39:28 +00:00
Ralph Castain
703684e071 Output the mca params for debug purposes
This commit was SVN r24042.
2010-11-11 20:06:29 +00:00
Ralph Castain
a47b33678b Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi.
Use threads to help process multicast messages.

This commit was SVN r24009.
2010-11-08 19:09:23 +00:00
Ralph Castain
bf665692c3 Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system.
This commit was SVN r24004.
2010-11-07 23:29:52 +00:00
Ralph Castain
9ea2b196ce Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions.
Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component.

Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work.

This commit was SVN r23966.
2010-10-28 15:22:46 +00:00
Ralph Castain
86c7365e8e Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem.
Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution.

This commit was SVN r23943.
2010-10-26 02:41:42 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00