Ralph Castain
43a89bbace
Extend the process and job states by adding values for exceeding sensor bounds. This changes the job state field to 32-bit to also provide room for future expansion.
...
This commit was SVN r23036.
2010-04-26 12:36:40 +00:00
Ralph Castain
e3164d2ac1
For the autogen-challenged (i.e., Jeff), create a new ORTE constant that tells the user that a required module was not found. Update the errmgr select function to output the error if no module is found.
...
This commit was SVN r23032.
2010-04-24 01:39:26 +00:00
Ralph Castain
b62ebb8b94
Enable these to build so others can more easily begin implementation
...
This commit was SVN r23031.
2010-04-23 22:46:33 +00:00
Ralph Castain
9c69900bbc
Ensure we get the jdata object when updating proc state so we don't segv if report-progress is set. Thanks to Sam for reporting it!
...
This commit was SVN r23029.
2010-04-23 15:14:21 +00:00
George Bosilca
95394456ca
No Windows EOF.
...
This commit was SVN r23024.
2010-04-23 05:51:29 +00:00
Ralph Castain
efbb5c9b7c
Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base:
...
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.
* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.
* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.
* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons
* update the routed modules to reflect that process state is updated in the errmgr
* ensure that the orteds open the errmgr and select their appropriate module
* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place
* define a global orte_comm function to send messages from orteds to the HNP so that others can overlay the standard RML methods, if desired.
* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios
This commit was SVN r23023.
2010-04-23 04:44:41 +00:00
Ralph Castain
86228aee38
Provide two new opal paffinity utilities for printing a hex representation of the cpu set and parsing that string back into a cpu set on the other end. Also add a new MCA param for passing the cpu set applied to a process during launch down to that process so it can know what we attempted to do.
...
All to be used in some new MPI extensions provided by Jeff so that users can easily query their binding situation.
This commit was SVN r22998.
2010-04-19 22:16:35 +00:00
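The print/parse round trip described in r22998 can be sketched as follows. This is only an illustration: the commit message does not give the real OPAL function names, the cpu set type, or the string format, so `demo_cpuset_t`, `demo_cpuset_to_hex`, `demo_cpuset_from_hex`, and the colon-separated 32-bit word layout are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cpu set: 4 x 32 bits = 128 CPUs. */
#define DEMO_CPUSET_WORDS 4

typedef struct {
    uint32_t bits[DEMO_CPUSET_WORDS];
} demo_cpuset_t;

/* Write the set as colon-separated, zero-padded hex words.
 * Returns 0 on success, -1 if the buffer is too small. */
int demo_cpuset_to_hex(const demo_cpuset_t *set, char *buf, size_t len)
{
    /* Each word needs 8 hex digits plus one separator or the final NUL. */
    const size_t need = (size_t)DEMO_CPUSET_WORDS * 9;
    assert(set != NULL && buf != NULL);
    if (len < need) {
        return -1;
    }
    for (size_t i = 0; i < DEMO_CPUSET_WORDS; i++) {
        const size_t off = i * 9;
        snprintf(buf + off, len - off, "%08" PRIx32 "%s", set->bits[i],
                 (i + 1 < DEMO_CPUSET_WORDS) ? ":" : "");
    }
    return 0;
}

/* Parse a string produced by demo_cpuset_to_hex back into a set.
 * Returns 0 on success, -1 on malformed input. */
int demo_cpuset_from_hex(const char *str, demo_cpuset_t *set)
{
    const char *p = str;
    assert(str != NULL && set != NULL);
    for (size_t i = 0; i < DEMO_CPUSET_WORDS; i++) {
        char *end = NULL;
        errno = 0;
        unsigned long v = strtoul(p, &end, 16);
        if (end == p || errno != 0 || v > UINT32_MAX) {
            return -1;
        }
        set->bits[i] = (uint32_t)v;
        if (i + 1 < DEMO_CPUSET_WORDS) {
            if (*end != ':') {
                return -1;      /* expected a separator between words */
            }
            p = end + 1;
        } else if (*end != '\0') {
            return -1;          /* trailing garbage after the last word */
        }
    }
    return 0;
}
```

In a setup like this, the launcher would hand the string to the child (for example through a parameter or environment variable), and the child would parse it to learn which binding was attempted.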
Ralph Castain
4d06125a33
Establish a method by which a process knows if it has been bound by mpirun. This helps resolve a problem where a process gets "bound" to all available resources, which looks to the opal paffinity system as "not bound". This can cause mpi_init to attempt to "bind" the process itself, causing unintended behavior.
...
This commit was SVN r22985.
2010-04-17 01:58:26 +00:00
Ralph Castain
41428e6b61
Issue a warning if a requested binding operation results in processes being bound to all available processors, which is the equivalent of not being bound at all.
...
See the following email thread for further details:
http://www.open-mpi.org/community/lists/devel/2010/04/7745.php
This commit was SVN r22984.
2010-04-17 01:02:41 +00:00
Ralph Castain
6c379fed2e
IOF components should not assume they will be selected when queried - thus, they should not perform init functions until after selection. Create init/finalize entry points for that purpose, and have select init the module after it has been selected.
...
This commit was SVN r22982.
2010-04-16 18:51:27 +00:00
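The query/select/init split described in r22982 can be illustrated with a small sketch. It is not the ORTE IOF code: the `demo_component_t` layout, the priority convention, and every `demo_*` name are assumptions chosen only to show that `query` stays free of side effects and `init` runs solely on the winner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component descriptor with separate query and init steps. */
typedef struct {
    const char *name;
    int (*query)(int *priority);  /* report a priority; no side effects */
    int (*init)(void);            /* called only if this component wins */
    int (*finalize)(void);        /* called at shutdown for the winner  */
} demo_component_t;

/* Counters so a caller can observe which init functions actually ran. */
int demo_a_inits = 0;
int demo_b_inits = 0;

int demo_a_query(int *priority) { *priority = 10; return 0; }
int demo_a_init(void)           { demo_a_inits++; return 0; }
int demo_a_finalize(void)       { return 0; }

int demo_b_query(int *priority) { *priority = 50; return 0; }
int demo_b_init(void)           { demo_b_inits++; return 0; }
int demo_b_finalize(void)       { return 0; }

const demo_component_t demo_components[] = {
    { "a", demo_a_query, demo_a_init, demo_a_finalize },
    { "b", demo_b_query, demo_b_init, demo_b_finalize },
};

/* Query every component, pick the highest priority, then init only the
 * winner. Returns NULL if nothing is selectable or the winner's init fails. */
const demo_component_t *demo_select(const demo_component_t *comps, size_t n)
{
    const demo_component_t *best = NULL;
    int best_pri = -1;

    assert(comps != NULL);
    for (size_t i = 0; i < n; i++) {
        int pri = 0;
        if (comps[i].query == NULL || comps[i].query(&pri) != 0) {
            continue;           /* component declined to run */
        }
        if (pri > best_pri) {
            best_pri = pri;
            best = &comps[i];
        }
    }
    if (best != NULL && best->init != NULL && best->init() != 0) {
        return NULL;
    }
    return best;
}
```

Calling `demo_select(demo_components, 2)` initializes only component "b"; component "a" is queried but never set up, which is the behavior the commit asks of components that are not selected.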
Ralph Castain
2ecc9fc2b3
Additional diag output
...
This commit was SVN r22981.
2010-04-16 14:48:37 +00:00
Ralph Castain
4308922f59
Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun
...
Cleanup tool name determination for CM
This commit was SVN r22980.
2010-04-15 18:10:50 +00:00
Ralph Castain
c3e4c40cdf
Add another multicast tag for updating state
...
This commit was SVN r22979.
2010-04-15 18:08:53 +00:00
Ralph Castain
eeccf2f15c
Some minor changes to support vm's
...
This commit was SVN r22975.
2010-04-14 01:20:43 +00:00
Ralph Castain
854dc12fc0
Add a tag for multicasting IOF messages
...
This commit was SVN r22972.
2010-04-13 22:51:26 +00:00
Ralph Castain
81329c637e
Indentation corrections
...
This commit was SVN r22971.
2010-04-13 17:47:34 +00:00
Ralph Castain
8da781af84
Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools
...
This commit was SVN r22958.
2010-04-12 22:33:09 +00:00
Shiqing Fan
96b20a29b5
An easy solution to make singleton work on Windows.
...
This commit was SVN r22952.
2010-04-10 16:30:59 +00:00
Ralph Castain
d3ed4e68b7
Utilize an unused mapping policy bit to define a policy that uses only existing alive daemons to support virtual machines and restarting processes on already-active nodes
...
This commit was SVN r22951.
2010-04-10 05:02:47 +00:00
Ralph Castain
4f8279df3d
Enable substitution of the communication calls in the orted when sending messages back to the HNP by creating a function for this purpose and saving the pointer to it in orte_odls_base. Higher level libraries can then override the default function to use their own method.
...
This commit was SVN r22950.
2010-04-09 18:50:10 +00:00
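The override mechanism described in r22950 amounts to routing sends through a stored function pointer. A minimal sketch follows, assuming a simplified signature; the real field name inside orte_odls_base and the real argument list are not stated in the commit message, so `demo_odls_base`, `demo_send_fn_t`, and the other `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for "send this buffer back to the HNP". */
typedef int (*demo_send_fn_t)(const void *buf, size_t len, int tag);

/* Hypothetical stand-in for the base structure that stores the pointer. */
typedef struct {
    demo_send_fn_t comm;
} demo_odls_base_t;

int demo_default_calls = 0;
int demo_overlay_calls = 0;

/* Default path: in the real code this would go through the standard RML. */
int demo_default_send(const void *buf, size_t len, int tag)
{
    (void)buf; (void)len; (void)tag;
    demo_default_calls++;
    return 0;
}

/* Replacement that a higher-level library could install instead. */
int demo_overlay_send(const void *buf, size_t len, int tag)
{
    (void)buf; (void)len; (void)tag;
    demo_overlay_calls++;
    return 0;
}

/* The default function is installed at startup. */
demo_odls_base_t demo_odls_base = { demo_default_send };

/* Daemon code never names a transport; it always calls through the pointer. */
int demo_report_to_hnp(const void *buf, size_t len, int tag)
{
    assert(demo_odls_base.comm != NULL);
    return demo_odls_base.comm(buf, len, tag);
}
```

A higher-level library would set `demo_odls_base.comm = demo_overlay_send;` once during its own setup, after which every `demo_report_to_hnp` call uses its transport without any change to the calling code.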
Terry Dontje
282a537cf7
This commit fixes 2370, by having the solaris paffinity module return error codes for get_physical_processor_id and having odls_default_fork_local_proc check get_physical_processor_id for OPAL_ERROR
...
This commit was SVN r22948.
2010-04-09 15:10:46 +00:00
Brad Benton
101b896f2e
IBM has approved the release of the LoadLeveler sample code under the BSD license. Consequently, a more restrictive licensing clause that was originally associated with the LoadLeveler sample code documentation and replicated in a comment block in this file has been removed.
This commit was SVN r22947.
2010-04-08 19:41:44 +00:00
Terry Dontje
929c58e38d
This commit fixes trac:2073
...
This commit was SVN r22946.
The following Trac tickets were found above:
Ticket 2073 --> https://svn.open-mpi.org/trac/ompi/ticket/2073
2010-04-08 18:17:44 +00:00
Ralph Castain
75e99e6118
Do a better job of selecting cm ess component, handle tool and daemon issues
...
This commit was SVN r22942.
2010-04-07 18:59:21 +00:00
Ralph Castain
f1fc344336
Add some diagnostics
...
This commit was SVN r22941.
2010-04-07 18:58:17 +00:00
Ralph Castain
8e29a6858a
Properly handle the case when a daemon is given both parts of its name
...
This commit was SVN r22935.
2010-04-06 22:41:18 +00:00
Ralph Castain
a1e82e9d05
Per discussion with Josh, cleanup the errmgr API by creating separate modules for the public vs internal APIs. This mirrors the architecture used in other frameworks that had similar requirements.
...
Remove the orcm errmgr module - moving to the orcm code base so it can utilize orcm communications and not interfere with ompi-related operations.
This commit was SVN r22931.
2010-04-05 22:59:21 +00:00
Ralph Castain
1caba7af2f
Fix a bunch of compiler warnings reported by Jeff
...
This commit was SVN r22930.
2010-04-03 00:20:19 +00:00
Ralph Castain
84c7973df8
Update the #procs in the job prior to assigning vpids for each app_context.
...
This commit was SVN r22929.
2010-04-03 00:03:35 +00:00
Ralph Castain
6b43b76f9d
Some updates required for generating a LAM-style virtual machine. Retain the local node if requested. Properly setup the daemon job map for a VM launch.
...
This commit was SVN r22928.
2010-04-03 00:03:01 +00:00
Ralph Castain
ed0f42fa49
Fix a bug courtesy of Jeff - since check_job_complete removes the child object and releases it, preserve the pointer to the next item on the list prior to working with it
...
This commit was SVN r22924.
2010-04-02 07:08:34 +00:00
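The bug class fixed in r22924 (touching a list item after the call that released it) is usually avoided by saving the next pointer first. The sketch below uses a plain singly linked list rather than the actual OPAL list class, and `demo_child_t`, `demo_push`, and `demo_reap_dead` are hypothetical names, so treat it as an illustration of the pattern only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical child record on a singly linked list. */
typedef struct demo_child {
    struct demo_child *next;
    int pid;
    int alive;
} demo_child_t;

/* Push a new child on the front of the list; returns the new head
 * (or the old head unchanged if allocation fails). */
demo_child_t *demo_push(demo_child_t *head, int pid, int alive)
{
    demo_child_t *c = malloc(sizeof *c);
    if (c == NULL) {
        return head;
    }
    c->next = head;
    c->pid = pid;
    c->alive = alive;
    return c;
}

/* Remove and free every dead child. Returns the number removed. */
int demo_reap_dead(demo_child_t **head)
{
    int removed = 0;
    demo_child_t **link;
    demo_child_t *item;

    assert(head != NULL);
    link = head;
    item = *head;
    while (item != NULL) {
        /* Save the successor BEFORE item can be freed; reading
         * item->next after free() would be a use-after-free. */
        demo_child_t *next = item->next;
        if (!item->alive) {
            *link = next;       /* unlink the dead child */
            free(item);
            removed++;
        } else {
            link = &item->next; /* keep it and advance the link slot */
        }
        item = next;
    }
    return removed;
}
```

The same idea applies when the removal happens inside a callee, as with check_job_complete in the commit: the caller cannot trust the current item after the call, so it captures the next pointer beforehand.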
Josh Hursey
62f8d3c471
r22885 missed a few symbol updates when it changed ompi_want_ft to opal_want_ft
...
This commit was SVN r22916.
The following SVN revision numbers were found above:
r22885 --> open-mpi/ompi@522a23d6a3
2010-03-30 16:47:39 +00:00
Ralph Castain
f6bfaa76ba
Add some debug output to job_complete. If no session dirs were created, then we cannot check for the abort file, which wouldn't have been created anyway
...
This commit was SVN r22903.
2010-03-29 23:21:03 +00:00
Ralph Castain
2603bd8a47
Eliminate a race condition (first reported by Josh) when deliberately killing procs. Need to cancel the waitpid callback for the proc, then properly flag it as dead (both not-alive and waitpid-fired) so that the system cleans up properly.
...
This commit was SVN r22900.
2010-03-28 16:08:05 +00:00
Ralph Castain
4f9db20d94
Couple of minor cleanups
...
This commit was SVN r22899.
2010-03-28 15:41:27 +00:00
Ralph Castain
522a23d6a3
A few changes to the FT-related configure options:
...
1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option
2. add want-ft=orcm so we can build the orcm errmgr component
3. cleanup the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved
This commit was SVN r22885.
2010-03-25 22:53:48 +00:00
Josh Hursey
e4f2d03d28
ErrMgr Framework redesign to better support fault tolerance development activities.
...
Explained in more detail in the following RFC:
http://www.open-mpi.org/community/lists/devel/2010/03/7589.php
This commit was SVN r22872.
2010-03-23 21:28:02 +00:00
Ralph Castain
0b9552cd4e
Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly.
...
This commit was SVN r22870.
2010-03-23 20:47:41 +00:00
Ralph Castain
62e751a95c
Add a tag
...
This commit was SVN r22862.
2010-03-22 15:46:00 +00:00
Ralph Castain
d49f93b743
Cleanup the initialization handshake for multicast apps
...
This commit was SVN r22855.
2010-03-19 20:15:01 +00:00
Ralph Castain
74bd4adc6b
Add some diagnostics, correctly check for existing channel
...
This commit was SVN r22854.
2010-03-19 08:21:01 +00:00
Ralph Castain
abbdc2b527
Pass the job family to tools that need to connect to specific HNPs
...
This commit was SVN r22853.
2010-03-19 04:01:33 +00:00
Ralph Castain
a479e6c320
Provide the sender's name for blocking recv's
...
This commit was SVN r22852.
2010-03-19 04:00:34 +00:00
Ralph Castain
e291fc2c69
With Jeff's help, get the libraries to link as required.
...
Update ompi_info and orte-info to include the new framework.
Fix some selection logic and a typo'd variable name
Still remains ompi_ignored until we complete testing
This commit was SVN r22848.
2010-03-18 02:12:59 +00:00
Ralph Castain
3cd96928a9
Use the OMPI_CHECK_PACKAGE macro to check both header file and library existence before building the component.
...
Still haven't gotten the right libraries linked in...so add ompi_ignore/unignore until we get it all fully integrated.
This commit was SVN r22843.
2010-03-17 00:46:12 +00:00
Ralph Castain
ffd5be6aa1
Add a new framework to ORTE for saving and recovering state information. Two components are included that use the db or dbm library for storing the data, with a distributed hash table component coming later.
...
Note that each of these components will only be selected if specifically requested - otherwise, a "NULL" component will be used. The framework is only opened by the HNP and orteds, though neither is currently coded to save/restore state
This commit was SVN r22839.
2010-03-16 20:59:48 +00:00
Rainer Keller
814fb9399f
- Further patches for support on NetBSD (and DragonFly) by Aleksej Saushev.
Don't use bash or bashisms in shell scripts.
We should use POSIX setpgid(0,0), which is equivalent to setpgrp().
This commit was SVN r22829.
2010-03-15 05:33:42 +00:00
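The portability point in r22829 can be checked with a short program: a freshly forked child that calls POSIX `setpgid(0, 0)` becomes the leader of a new process group whose ID equals its own PID. The helper name `demo_child_becomes_group_leader` is hypothetical; `fork`, `setpgid`, `getpgrp`, `getpid`, and `waitpid` are standard POSIX calls.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, have it call setpgid(0, 0), and report whether the child
 * ended up leading its own process group. Returns 0 on success, -1 otherwise. */
int demo_child_becomes_group_leader(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;                      /* fork failed */
    }
    if (pid == 0) {
        /* Child: pid 0 means "this process", pgid 0 means "use my PID". */
        if (setpgid(0, 0) != 0) {
            _exit(2);
        }
        /* The process group ID should now equal the child's own PID. */
        _exit(getpgrp() == getpid() ? 0 : 1);
    }
    /* Parent: collect the child's verdict. */
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Unlike `setpgrp()`, whose argument list differs between BSD and System V derived systems, `setpgid` has a single POSIX-defined signature, which is presumably why the commit prefers it.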
Josh Hursey
e9b5162d79
Fix the configure logic for --with-ft so that it properly takes a comma separated list.
...
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.
The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.
This commit was SVN r22824.
2010-03-12 23:57:50 +00:00
Josh Hursey
b43d621f30
Remove an errant '$' in the configure.m4 files. Was causing problems with configure.
...
This commit was SVN r22821.
2010-03-12 20:08:22 +00:00
Ralph Castain
7105207b1c
If we only have one app participating in an all_gather (which lies under a modex as well), then we need to ensure that the returned buffer has the proper packing order so it can be unpacked correctly.
...
This commit was SVN r22815.
2010-03-10 19:22:06 +00:00