1
1
Граф коммитов

124 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
d59034e6ef MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions.
cmr:v1.7

This commit was SVN r27451.
2012-10-17 20:17:37 +00:00
Ralph Castain
c82cfecc1c Cleanup comm_spawn for the multi-node case where at least one new process isn't spawned on every node. Avoid the complexities of trying to execute a daemon collective across the dynamic spawn as it becomes too hard to ensure that all daemons participate or are accounted for - instead, use a less scalable but workable solution of sending the data directly between the participating procs. Ensure that singletons get their collectives properly defined at startup so the spawned "HNP" is ready for them.
As a secondary cleanup, the HNP doesn't need to update its nidmap during an xcast as it already has an up-to-date picture of the situation. So just dump that data and move along.

This commit was SVN r27318.
2012-09-12 11:31:36 +00:00
Ralph Castain
fa6a18f05a If the orted HNP is spawned by a singleton, then it needs to harvest all the MCA params from its environment to ensure they are passed on to any subsequently spawned daemons. Otherwise, the singleton could be directed to select options that other apps miss.
This commit was SVN r27221.
2012-09-04 01:10:26 +00:00
Ralph Castain
38ce23db43 Add some protection to allow NULL bytes in byte objects and NULL strings to be handled cleanly in nidmaps and modex entries. Ensure there is a valid nidmap available for the HNP to pass down to any local procs when it is operating alone.
This commit was SVN r27188.
2012-08-31 01:07:36 +00:00
Ralph Castain
98580c117b Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine.
Remove some stale configure.m4's we no longer need.

Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.

This commit was SVN r27165.
2012-08-28 21:20:17 +00:00
Ralph Castain
ed4b354846 Ensure we pass along user-specified mca params from the cmd line when doing a tree spawn, but don't extend the cmd line with duplicates or things that shouldn't be there
This commit was SVN r27117.
2012-08-22 21:41:50 +00:00
Ralph Castain
0dfe29b1a6 Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required.
Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.

Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.

This commit was SVN r26678.
2012-06-27 14:53:55 +00:00
Ralph Castain
e9591f2563 Fix tree spawn in the rsh/qrsh environment
This commit was SVN r26631.
2012-06-21 21:29:28 +00:00
Ralph Castain
96c778656a Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP.
Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build.

This commit was SVN r26606.
2012-06-15 10:15:07 +00:00
Ralph Castain
a526afae92 Ensure we always cleanup local procs, no matter how we exited.
This commit was SVN r26454.
2012-05-18 23:37:40 +00:00
Ralph Castain
b2f77bf08f Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications.
Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well.

This commit was SVN r26380.
2012-05-02 21:00:22 +00:00
Ralph Castain
c5da4f24d7 Fix stupid singletons - get the pidmap message correct
This commit was SVN r26378.
2012-05-02 17:48:02 +00:00
Ralph Castain
14d5525fb1 Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified.
This commit was SVN r26261.
2012-04-10 19:08:54 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
a309c53bf2 Set the lifeline when we are tree spawning under rsh so that the orted can self-terminate when its parent dies
This commit was SVN r25655.
2011-12-15 15:29:53 +00:00
George Bosilca
25476c7e54 buffer is not yet initialized, so there is no reason to release it.
This commit was SVN r25549.
2011-11-29 23:50:18 +00:00
Jeff Squyres
6fbbfd0f7a Gah! r25545 acidentally included ''waaaay'' more stuff than it was
supposed to.  I.e., half-baked/not complete stuff.

This commit backs out all of r25545.  Sorry folks!

This commit was SVN r25546.

The following SVN revision numbers were found above:
  r25545 --> open-mpi/ompi@7f9ae11faf
2011-11-29 23:24:52 +00:00
Jeff Squyres
7f9ae11faf Per http://www.open-mpi.org/community/lists/users/2011/11/17862.php,
to make MPI_IN_PLACE (and other sentinel Fortran constants) work on OS
X, we need to use the following compiler (linker) flag:

    -Wl,-commons,use_dylibs 

So if we're compiling on OS X, test to see if that flag works with the
compiler.  If so, add it to the wrapper FFLAGS and FCFLAGS (note that
per a future update, we'll only have one Fortran compiler anyway).

Fixes trac:1982.  

This commit was SVN r25545.

The following Trac tickets were found above:
  Ticket 1982 --> https://svn.open-mpi.org/trac/ompi/ticket/1982
2011-11-29 23:05:54 +00:00
Ralph Castain
b475421c16 As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn.
This commit was SVN r25507.
2011-11-26 02:33:05 +00:00
Ralph Castain
6310361532 At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement

The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.

In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:

1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.

2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.

3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.

As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.

This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
Ralph Castain
d28dd55d33 Minimize the amount of topology info returned by the daemons. Most clusters, especially at scale, use the same node topology on every node, so there is no re
ason to return the topology from every daemon. Borrow a page from the --hetero-apps page and let users indicate that the node topology differs by adding a --
hetero-nodes option to mpirun. If the option is set, then every daemon returns topology info. If not set, then only daemon vpid=1 returns it.

We always want one daemon to return the topology as the head node is often different from the compute nodes. Having one daemon return the compute node topolo
gy allows us to detect any such difference. All compute nodes are then set to the same topology.

This commit was SVN r25408.
2011-11-01 18:43:10 +00:00
Ralph Castain
24a46f2acb These were missed by prior commit - need to remove lingering references to OPAL_HWLOC_HAVE_XML
This commit was SVN r25272.
2011-10-12 16:54:03 +00:00
Ralph Castain
92c7372e20 Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves.
Remove the sysinfo framework as hwloc replaces that functionality.

This commit was SVN r25124.
2011-09-11 19:02:24 +00:00
Wesley Bland
4e7ff0bd5e By popular demand the epoch code is now disabled by default.
To enable the epochs and the resilient orte code, use the configure flag:

--enable-resilient-orte

This will define both:

ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE

This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Ralph Castain
1ad110d2e9 After a nice, calm, rational discussion between Brian, Jeff, and myself, we decided to revert r24864 and r24862 to restore the reference counters in opal_init/finalize. The rationale was that we should instead change orte_init/finalize to also use reference counters to support multi-embedded libraries. Jeff and Brian will discuss proposing a similar change to mpi_init/finalize to the MPI Forum so that all three libraries will behave in similar manners.
It was agreed that opal_init_util had wound up being used in unintended ways, which raised the problem of getting reference counts to work right. However, fixing it would involve more pain than it was worth - and so long as the other layers are made to behave similarly, I have no preference either way.

Complete implementation will follow - for now, this just reverts the prior changes.

This commit was SVN r24886.

The following SVN revision numbers were found above:
  r24862 --> open-mpi/ompi@aa92e0c4eb
  r24864 --> open-mpi/ompi@a5062385c2
2011-07-12 17:07:41 +00:00
Ralph Castain
a5062385c2 Fix singletons
This commit was SVN r24864.
2011-07-08 14:38:33 +00:00
Jeff Squyres
3affb8403e Remove extra output
This commit was SVN r24863.
2011-07-08 13:01:30 +00:00
Ralph Castain
1ee7c39982 Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.

Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.

This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Wesley Bland
e1ba09ad51 Add a resilience to ORTE. Allows the runtime to continue after a process (or
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.

Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php

This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
b47ec2ee87 Remove lingering references to opal_profile option
This commit was SVN r24709.
2011-05-18 18:27:29 +00:00
George Bosilca
80fe617cd2 If we don't release the OPAL utils explicitly there will be a memory leak.
This commit was SVN r24505.
2011-03-10 00:42:28 +00:00
Ralph Castain
9ea2b196ce Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions.
Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component.

Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work.

This commit was SVN r23966.
2010-10-28 15:22:46 +00:00
Ralph Castain
86c7365e8e Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem.
Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution.

This commit was SVN r23943.
2010-10-26 02:41:42 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
Terry Dontje
b74ef351b7 Added new solaris sysinfo module. Also added code to assign
orte_local_chip_type and orte_local_chip_model in MPI processes it the
appropriate sysinfo module found the values on the machine.

This commit was SVN r23581.
2010-08-09 19:28:56 +00:00
Ralph Castain
3f7f8df40f Fix singleton comm_spawn by having the host orted create a map object for the singleton job. Even though this isn't filled in, the pidmap function will ignore any job that doesn't have a map object. Thus, this ensures that the singleton's location is included in the pidmap, thereby allowing messages to be routed to/from it.
This commit was SVN r23430.
2010-07-18 02:48:17 +00:00
Ralph Castain
12cd07c9a9 Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases.
The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang.

This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is:

* pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place

* modified the errmgr to directly call the new routines when termination is detected

* removed the grpcomm.onesided_barrier and its associated RML tag

* add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree

* use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero

Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial.

This commit was SVN r23429.
2010-07-17 21:03:27 +00:00
Abhishek Kulkarni
afbe3e99c6 * Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with
(OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
 SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
 back the native error code.

* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
  (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
  decode 'ret' to get the native error code.

This commit was SVN r23162.
2010-05-17 23:08:56 +00:00
Ralph Castain
2ff1ae13e1 Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set.
This commit was SVN r23102.
2010-05-05 00:48:43 +00:00
Ralph Castain
4308922f59 Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun
Cleanup tool name determination for CM

This commit was SVN r22980.
2010-04-15 18:10:50 +00:00
Shiqing Fan
96b20a29b5 An easy solution to make singleton work on Windows.
This commit was SVN r22952.
2010-04-10 16:30:59 +00:00
Ralph Castain
0b9552cd4e Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly.
This commit was SVN r22870.
2010-03-23 20:47:41 +00:00
Josh Hursey
e9b5162d79 Fix the configure logic for --with-ft so that it properly takes a comma separated list.
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.

The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.

This commit was SVN r22824.
2010-03-12 23:57:50 +00:00
Ralph Castain
f2c65dc70f Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic.
Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits.

This commit was SVN r22783.
2010-03-05 13:22:12 +00:00
Ralph Castain
cd1efbb41e Try and do a better job of cleanup in abnormal termination. Ensure the daemons whack session directories prior to disabling signal traps. Ensure that the HNP and daemons all cleanup when they are doing an internal abort.
This commit was SVN r22755.
2010-03-02 14:51:23 +00:00
Ralph Castain
b692645772 Remote daemons should -always- whack any lingering session directories when exiting
This commit was SVN r22749.
2010-03-02 05:28:53 +00:00
Ralph Castain
c2aba2a6d7 Pack the resource key correctly
This commit was SVN r22541.
2010-02-03 18:21:27 +00:00
Ralph Castain
f66b6cae23 Enable the boot of an orted "virtual machine". Modify the mapper framework to allow mapping of only daemons. Remove the cm ras module as no longer required. Modify the orted code to always send back node arch info. Remove the "--enable-bootstrap" configure option as this feature will now always be available.
This commit was SVN r22480.
2010-01-25 22:25:13 +00:00
Ralph Castain
2517799102 Report, but ignore, SIGPIPE events. The odls already resets this signal handler when spawning local procs.
This commit was SVN r22463.
2010-01-21 05:01:06 +00:00
Jeff Squyres
16b100219d A patch from UTK to allow orte_init(), opal_init(), and associated
friends also receive &argc and &argv (George asked Jeff to Ralph to
review before committing).  The thought is that passing argv and argc
to opal/orte_init be useful to other projects outside of OMPI that are
using OPAL and/or ORTE (especially in conjunction with some other
bootstrapping code where it is helpful to modify argv).  It's such a
small thing that it's easy to apply here to make others' lives a
little easier.

Ask George for more details; I'm just the messenger.  :-)

Judging by the copyrights on this patch, it's been around for a
while.  :-)

This commit was SVN r22260.
2009-12-04 00:51:15 +00:00