
14561 commits

Author SHA1 Message Date
Ralph Castain
10e410f454 Update cisco platform files
This commit was SVN r23087.
2010-05-04 02:39:00 +00:00
Ralph Castain
8c5f442ee0 Fix some bugs in the spread rmcast component
This commit was SVN r23086.
2010-05-04 02:38:37 +00:00
Ralph Castain
8e7faf9119 Add a new test for the db framework, fix some minor bugs in the daemon module
This commit was SVN r23085.
2010-05-04 02:38:11 +00:00
Ralph Castain
9dfb5c7c62 Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set
This commit was SVN r23082.
2010-05-03 04:11:03 +00:00
Ralph Castain
d3fda5d3b9 Update cisco platform files
This commit was SVN r23081.
2010-05-03 04:08:43 +00:00
Ralph Castain
5103ffead6 Continue work on local recovery for orteds
This commit was SVN r23080.
2010-05-03 04:08:13 +00:00
Ralph Castain
bcff0d6301 Some minor cleanup in the rmcast framework, ensure that a default multicast group is always defined for each app
This commit was SVN r23079.
2010-05-03 04:07:14 +00:00
Ralph Castain
f994a7edf4 Add recovery data to the jobdat object
This commit was SVN r23078.
2010-05-03 04:06:13 +00:00
Ralph Castain
323224b84b Allow the restart data to be retrieved from a job's app->env, or default to that given to orterun
This commit was SVN r23077.
2010-05-03 04:05:19 +00:00
Ralph Castain
3f262bf0b6 Add a new reliable multicast component based on the "spread" library
Thanks to Srini Nariangadu (Cisco) for the contribution!

This commit was SVN r23076.
2010-05-02 17:29:41 +00:00
Ralph Castain
c93af95351 Since there is no defined behavior for the case where all application procs exit normally, but some or all have non-zero returns, just output a warning telling the user how many procs meet that criterion. Let the return code of mpirun in that scenario reflect any errors in OMPI/ORTE itself.
Clearly a temporary solution until a defined behavior can be established.

This commit was SVN r23075.
2010-04-30 19:01:10 +00:00
Matthias Jurenz
29f02d88c6 - fixed buffer overflow which occurred if marker or event comment records were generated
- fixed bug in MPI-I/O tracing: tracking MPI file handles even if MPI_File_open isn't recorded
- fixed compiler warnings in the PAPI component
- incremented version number to 5.8.2

This commit was SVN r23074.
2010-04-30 10:04:26 +00:00
Ralph Castain
c62418d76d Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why.
Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status.

This commit was SVN r23071.
2010-04-29 19:58:44 +00:00
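The exit-status rule described in r23071 (take the first non-zero exit status, i.e. the one from the lowest rank) can be illustrated with a small sketch. This is not the ORTE `check_complete` code: the `proc_exit_t` record and the `first_nonzero_exit` function are hypothetical names introduced only for illustration, and the sketch assumes each process reports a rank and an integer exit code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process record; this is NOT the ORTE proc object. */
typedef struct {
    int rank;       /* rank of the process within the job */
    int exit_code;  /* exit status reported for this process */
} proc_exit_t;

/*
 * Return the exit code of the lowest-ranked process that exited with a
 * non-zero status, or 0 if every process returned 0.
 * The array does not need to be sorted by rank.
 */
int first_nonzero_exit(const proc_exit_t *procs, size_t nprocs)
{
    int found = 0;      /* set once any non-zero status has been seen */
    int best_rank = 0;  /* rank of the best candidate so far */
    int status = 0;     /* exit code of the best candidate so far */

    assert(procs != NULL || nprocs == 0);

    for (size_t i = 0; i < nprocs; ++i) {
        if (procs[i].exit_code == 0) {
            continue;   /* clean exits never decide the result */
        }
        if (!found || procs[i].rank < best_rank) {
            found = 1;
            best_rank = procs[i].rank;
            status = procs[i].exit_code;
        }
    }
    return status;
}
```

A launcher following this rule would use the returned value as its own exit code when the job otherwise terminated normally.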
Jeff Squyres
71cbe1a69f Bump up to hwloc v1.0rc3
This commit was SVN r23070.
2010-04-29 15:59:01 +00:00
Jeff Squyres
b6e401a512 Fix minor typo.
This commit was SVN r23067.
2010-04-29 11:45:25 +00:00
Ralph Castain
11f710b9cd Add some news items re the trunk
This commit was SVN r23066.
2010-04-29 01:28:32 +00:00
Ralph Castain
5689bd6d30 Remove debug
This commit was SVN r23062.
2010-04-28 22:14:23 +00:00
Jeff Squyres
1667121bc4 Add more text about shared library versioning
This commit was SVN r23061.
2010-04-28 13:08:04 +00:00
Jeff Squyres
f0d1fdf3b1 Add note about xgrid support being broken.
This commit was SVN r23060.
2010-04-28 12:48:20 +00:00
Jeff Squyres
85ce4125bd Updates from 1.4 branch SVN commit log.
This commit was SVN r23058.
2010-04-28 12:39:45 +00:00
Ralph Castain
319758e3e0 Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params:
1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis

2. orte_max_local_restarts - default max number of local restarts, can be overridden

3. orte_max_global_restarts - default max number of relocates, can be overridden

Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code.

This commit was SVN r23057.
2010-04-28 04:06:57 +00:00
Jeff Squyres
f064056a07 We don't need all this stuff in OMPI.
This commit was SVN r23056.
2010-04-28 00:31:15 +00:00
Jeff Squyres
28046f601e Fix typos.
This commit was SVN r23055.
2010-04-28 00:07:04 +00:00
Ralph Castain
67568e7dec Fix a warning
This commit was SVN r23049.
2010-04-27 15:59:24 +00:00
Ralph Castain
401fb22847 Fix the pack/unpack of process state values by getting the actual packing size to match the new value size.
Improve the debug output by printing the string

This commit was SVN r23048.
2010-04-27 15:58:04 +00:00
Ralph Castain
efc7d3e2f6 Update the LSF PLM module to ensure that the path is correctly passed.
Many thanks to Teng Lin for the error report - and the patch!

This commit was SVN r23047.
2010-04-27 03:46:04 +00:00
Ralph Castain
3434296836 Ensure we don't have a trailing separator on the end of our tmpdir as (a) it really looks weird, and (b) some exotic systems interpret that as indicating the rest of the path is to be treated as absolute. Makes for very strange and interesting behavior...
This commit was SVN r23046.
2010-04-27 03:40:44 +00:00
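A minimal sketch of the kind of cleanup r23046 describes, assuming a POSIX-style `/` separator and a writable, NUL-terminated buffer. The function name `strip_trailing_sep` is hypothetical and is not taken from the Open MPI source.

```c
#include <assert.h>
#include <string.h>

/*
 * Remove trailing '/' characters from `path` in place.
 * A lone "/" is kept so that the root directory stays valid.
 * Assumption: '/' is the path separator (POSIX-style paths).
 */
void strip_trailing_sep(char *path)
{
    assert(path != NULL);

    size_t len = strlen(path);
    while (len > 1 && path[len - 1] == '/') {
        path[--len] = '\0';  /* shorten the string by one character */
    }
}
```

Applied to `"/tmp/ompi/"`, the buffer becomes `"/tmp/ompi"`, so later concatenation with a separator does not produce a doubled `//`.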
Ralph Castain
55889934d8 After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon.
Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly.

This commit was SVN r23045.
2010-04-27 03:39:32 +00:00
Ralph Castain
b1577c4fcd Update cisco platform files to include sensor support
This commit was SVN r23044.
2010-04-26 22:16:48 +00:00
Ralph Castain
b9893aacc5 Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here:
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.

2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.

This support must be enabled by configuring --enable-sensors.

ompi_info and orte-info have been updated to include the new framework.

Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and the various ERRMGR framework globals are now organized into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.

Implementation continues.

This commit was SVN r23043.
2010-04-26 22:15:57 +00:00
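The file-activity check described in r23043 (report when size, access time and modification time all fail to change over a given number of samples) could be sketched roughly as follows. This is an illustration under assumptions, not the ORTE sensor module: the types `file_sample_t` and `file_watch_t` and both functions are hypothetical, and the caller is assumed to obtain each sample itself (for example from `stat()`) and to handle notification.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical snapshot of the attributes listed in the commit message. */
typedef struct {
    long long size;   /* file size in bytes */
    time_t    atime;  /* last access time */
    time_t    mtime;  /* last modification time */
} file_sample_t;

/* Hypothetical watcher state for one file. */
typedef struct {
    file_sample_t last;  /* most recent sample seen */
    int  unchanged;      /* consecutive samples identical to the previous one */
    int  limit;          /* number of unchanged samples that counts as a stall */
    bool primed;         /* false until the first sample has been stored */
} file_watch_t;

void file_watch_init(file_watch_t *w, int limit)
{
    assert(w != NULL && limit > 0);
    w->unchanged = 0;
    w->limit = limit;
    w->primed = false;
}

/*
 * Feed one sample. Returns true once the file has shown no change for
 * `limit` consecutive samples; the caller would then raise the alarm.
 */
bool file_watch_update(file_watch_t *w, file_sample_t s)
{
    assert(w != NULL);

    if (!w->primed) {
        w->primed = true;   /* first sample only establishes a baseline */
    } else if (s.size == w->last.size &&
               s.atime == w->last.atime &&
               s.mtime == w->last.mtime) {
        w->unchanged++;     /* nothing moved since the previous sample */
    } else {
        w->unchanged = 0;   /* any change resets the stall counter */
    }
    w->last = s;
    return w->unchanged >= w->limit;
}
```

Counting consecutive unchanged samples, rather than elapsed time, keeps the threshold independent of the sampling rate, which the commit message says is set separately through an MCA parameter.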
Jeff Squyres
2fe1bc043d Bump up to hwloc 1.0rc2
This commit was SVN r23042.
2010-04-26 21:57:51 +00:00
Ralph Castain
13a7338289 Ensure we get past the '=' in the parameter
This commit was SVN r23039.
2010-04-26 20:46:50 +00:00
Jeff Squyres
d261c0fe80 mpi_portable_platform.h should not be ignored.
This commit was SVN r23037.
2010-04-26 12:59:09 +00:00
Ralph Castain
43a89bbace Extend the process and job states by adding values for exceeding sensor bounds. This changes the job state field to 32-bit to also provide room for future expansion.
This commit was SVN r23036.
2010-04-26 12:36:40 +00:00
Ralph Castain
aa5b9eb2ed Update ignore properties
This commit was SVN r23035.
2010-04-26 12:23:34 +00:00
Ralph Castain
e1b9f400ba Add some new utilities that support searching an environ string list (not just our own environ) for specific MCA params and returning their value. Helpful when a daemon needs to check an app_context's environ for params that can impact how the daemon launches and/or interacts with it, but don't pertain to the daemon's own environ.
This commit was SVN r23034.
2010-04-26 03:35:09 +00:00
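As a rough illustration of the lookup r23034 describes, the sketch below searches an arbitrary NULL-terminated environ-style list for an MCA parameter. It assumes the conventional `OMPI_MCA_<name>=<value>` environment form; the function name `find_mca_param` is hypothetical and is not the utility added by the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed environment prefix for MCA parameters. */
#define MCA_ENV_PREFIX "OMPI_MCA_"

/*
 * Search a NULL-terminated list of "NAME=value" strings for
 * "OMPI_MCA_<param>=value". Returns a pointer to the value inside the
 * matching entry (not a copy), or NULL if the parameter is absent.
 */
const char *find_mca_param(char *const *env, const char *param)
{
    assert(param != NULL);
    if (env == NULL) {
        return NULL;
    }

    const size_t plen = strlen(MCA_ENV_PREFIX);
    const size_t nlen = strlen(param);

    for (size_t i = 0; env[i] != NULL; ++i) {
        const char *e = env[i];
        if (strncmp(e, MCA_ENV_PREFIX, plen) != 0) {
            continue;                    /* not an MCA variable */
        }
        if (strncmp(e + plen, param, nlen) != 0) {
            continue;                    /* a different parameter */
        }
        if (e[plen + nlen] != '=') {
            continue;                    /* longer name sharing this prefix */
        }
        return e + plen + nlen + 1;      /* step past the '=' */
    }
    return NULL;
}
```

Because the list is passed in explicitly, the same function works on a process's own `environ` or on the environment stored for an application it is about to launch.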
Ralph Castain
e3164d2ac1 For the autogen-challenged (i.e., Jeff), create a new ORTE constant that tells the user that a required module was not found. Update the errmgr select function to output the error if no module is found.
This commit was SVN r23032.
2010-04-24 01:39:26 +00:00
Ralph Castain
b62ebb8b94 Enable these to build so others can more easily begin implementation
This commit was SVN r23031.
2010-04-23 22:46:33 +00:00
George Bosilca
321213e779 Fix segmentation fault on heterogeneous architectures. Don't mess with the
ompi_ptr_t by translating into void*. Instead keep it as an ompi_ptr_t all
the way. Thanks to Timur Magomedov for helping to track down this issue and
test the patch.

cmr:v1.4
cmr:v1.5

This commit was SVN r23030.
2010-04-23 15:14:55 +00:00
Ralph Castain
9c69900bbc Ensure we get the jdata object when updating proc state so we don't segv if report-progress is set. Thanks to Sam for reporting it!
This commit was SVN r23029.
2010-04-23 15:14:21 +00:00
George Bosilca
95394456ca No Windows EOF.
This commit was SVN r23024.
2010-04-23 05:51:29 +00:00
Ralph Castain
efbb5c9b7c Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base:
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.

* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.

* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.

* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons

* update the routed modules to reflect that process state is updated in the errmgr

* ensure that the orted's open the errmgr and select their appropriate module

* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place

* define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired.

* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios

This commit was SVN r23023.
2010-04-23 04:44:41 +00:00
Ralph Castain
f711c4713f Add threading support to odin platform file
This commit was SVN r23022.
2010-04-23 04:31:04 +00:00
Jeff Squyres
bc0987e5bd Fix more typos.
This commit was SVN r23021.
2010-04-22 16:39:40 +00:00
Shiqing Fan
077f6e6398 Type casts for building dynamic Fortran libraries.
And export correct function names.

This commit was SVN r23020.
2010-04-22 15:48:27 +00:00
Jeff Squyres
668c1c7f17 Fix two mistakes in the README. Thanks to 3rd party reviewers!
This commit was SVN r23019.
2010-04-22 15:35:00 +00:00
Jeff Squyres
359464a144 Add an "affinity" Open MPI extension (also describe the
--enable-mpi-ext configure switch in the top-level README file).

See Josh's excellent wiki page about OMPI extensions:

    https://svn.open-mpi.org/trac/ompi/wiki/MPIExtensions

This extension exposes a new API to MPI applications: 

{{{
int OMPI_Affinity_str(char ompi_bound[OMPI_AFFINITY_STRING_MAX],
                      char current_binding[OMPI_AFFINITY_STRING_MAX],
                      char exists[OMPI_AFFINITY_STRING_MAX]);
}}}

It returns 3 things.  Each is a prettyprint string describing sets of
processors in terms of sockets and cores:

 1. What Open MPI bound this process to.  If Open MPI didn't bind this
    process, the prettyprint string says so.
 2. What this process is currently bound to.  If the process is
    unbound, the prettyprint string says so.  This string is a
    separate OUT parameter to detect the case where some other entity
    bound the process (potentially after Open MPI bound it).
 3. What processors are available in the system, mainly for reference.

This commit was SVN r23018.
2010-04-21 17:28:08 +00:00
Jeff Squyres
ea8b0ea569 Add a new function in the paffinity base:
opal_paffinity_base_cset2str().  This function basically makes a
prettyprint string out of an opal_paffinity_base_cset_t.

This commit was SVN r23017.
2010-04-21 17:26:36 +00:00
Shiqing Fan
d1e66bdd01 Use variables instead of hard-coded compiler flags, in order to support various C/C++ compilers on Windows.
This commit was SVN r23016.
2010-04-21 12:45:00 +00:00
Shiqing Fan
e539322807 Move definitions to the main config file.
This commit was SVN r23015.
2010-04-21 09:17:10 +00:00