Ralph Castain
522a23d6a3
A few changes to the FT-related configure options:
...
1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option
2. add want-ft=orcm so we can build the orcm errmgr component
3. cleanup the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved
This commit was SVN r22885.
2010-03-25 22:53:48 +00:00
Josh Hursey
e4f2d03d28
ErrMgr Framework redesign to better support fault tolerance development activities.
...
Explained in more detail in the following RFC:
http://www.open-mpi.org/community/lists/devel/2010/03/7589.php
This commit was SVN r22872.
2010-03-23 21:28:02 +00:00
Ralph Castain
0b9552cd4e
Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly.
...
This commit was SVN r22870.
2010-03-23 20:47:41 +00:00
Josh Hursey
9e967a3a9b
Revert this change, since even in the CR case we want to reset this var to NULL.
...
Thanks to Jeff for the catch.
This commit was SVN r22868.
2010-03-23 19:55:21 +00:00
Shiqing Fan
9591680ec0
One of the binaries was generated from a wrong source.
...
This commit was SVN r22865.
2010-03-23 09:56:11 +00:00
Ralph Castain
62e751a95c
Add a tag
...
This commit was SVN r22862.
2010-03-22 15:46:00 +00:00
Ralph Castain
d49f93b743
Cleanup the initialization handshake for multicast apps
...
This commit was SVN r22855.
2010-03-19 20:15:01 +00:00
Ralph Castain
74bd4adc6b
Add some diagnostics, correctly check for existing channel
...
This commit was SVN r22854.
2010-03-19 08:21:01 +00:00
Ralph Castain
abbdc2b527
Pass the job family to tools that need to connect to specific HNPs
...
This commit was SVN r22853.
2010-03-19 04:01:33 +00:00
Ralph Castain
a479e6c320
Provide the sender's name for blocking recv's
...
This commit was SVN r22852.
2010-03-19 04:00:34 +00:00
Ralph Castain
8fb71c0fe6
Add some helpful defined values
...
This commit was SVN r22850.
2010-03-19 03:59:29 +00:00
Ralph Castain
e291fc2c69
With Jeff's help, get the libraries to link as required.
...
Update ompi_info and orte-info to include the new framework.
Fix some selection logic and a typo'd variable name
Still remains ompi_ignored until we complete testing
This commit was SVN r22848.
2010-03-18 02:12:59 +00:00
Ralph Castain
3cd96928a9
Use the OMPI_CHECK_PACKAGE macro to check both header file and library existence before building the component.
...
Still haven't gotten the right libraries linked in...so add ompi_ignore/unignore until we get it all fully integrated.
This commit was SVN r22843.
2010-03-17 00:46:12 +00:00
Ralph Castain
b400b84162
Merge in the modified thread configure option branch per today's telecon.
...
Remove the --enable-progress-threads option as this is no longer functional, and hardcode OPAL_ENABLE_PROGRESS_THREADS to 0.
Replace the --enable-mpi-threads option with --enable-mpi-thread-multiple, as its meaning is clearer. This option automatically turns "on" opal thread support if it wasn't already specified. If the user specifies --disable-opal-multi-threads --enable-mpi-thread-multiple, we error out with a message.
Add a new --enable-opal-multi-threads option that turns "on" opal thread support without doing anything wrt mpi-thread-multiple
This commit was SVN r22841.
2010-03-16 23:10:50 +00:00
Ralph Castain
ffd5be6aa1
Add a new framework to ORTE for saving and recovering state information. Two components are included that use the db or dbm library for storing the data, with a distributed hash table component coming later.
...
Note that each of these components will only be selected if specifically requested - otherwise, a "NULL" component will be used. The framework is only opened by the HNP and orteds, though neither is currently coded to save/restore state
This commit was SVN r22839.
2010-03-16 20:59:48 +00:00
Rainer Keller
814fb9399f
- Further patches for support on NetBSD (and DragonFly) by Aleksej Saushev.
...
Don't use bash or bashisms in shell scripts.
We should use POSIX setpgid(0,0), which is equivalent to setpgrp().
This commit was SVN r22829.
2010-03-15 05:33:42 +00:00
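A minimal sketch of the setpgid(0,0) idiom mentioned in this commit. The helper name `enter_own_process_group` is hypothetical and not taken from the patch. The portability concern, as far as I know, is that BSD-derived systems have historically declared setpgrp() with two arguments, while setpgid() has a single POSIX signature.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make the calling process the leader of its own
 * process group. Returns 0 on success, -1 on failure (errno is set by
 * setpgid). */
static int enter_own_process_group(void)
{
    /* Already a group leader (for example a session leader, for which
     * setpgid would fail with EPERM): nothing to do. */
    if (getpgrp() == getpid()) {
        return 0;
    }

    /* pid 0 means "the calling process"; pgid 0 means "use the
     * caller's own pid as the new process group id". */
    if (setpgid(0, 0) != 0) {
        return -1;
    }

    assert(getpgrp() == getpid());
    return 0;
}
```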
Josh Hursey
e9b5162d79
Fix the configure logic for --with-ft so that it properly takes a comma separated list.
...
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.
The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.
This commit was SVN r22824.
2010-03-12 23:57:50 +00:00
Josh Hursey
b43d621f30
Remove an errant '$' in the configure.m4 files. Was causing problems with configure.
...
This commit was SVN r22821.
2010-03-12 20:08:22 +00:00
Ralph Castain
c16cd10bb2
Save the username, if specified, for each node
...
This commit was SVN r22817.
2010-03-11 15:24:18 +00:00
Ralph Castain
7105207b1c
If we only have one app participating in an all_gather (which lies under a modex as well), then we need to ensure that the returned buffer has the proper packing order so it can be unpacked correctly.
...
This commit was SVN r22815.
2010-03-10 19:22:06 +00:00
Ralph Castain
7ebf72b4aa
Trivial cleanup
...
This commit was SVN r22813.
2010-03-10 18:24:38 +00:00
Ralph Castain
7fd7b7a8cc
Fix the load_balance mapper so that it sets the #procs in the job before attempting to compute vpids
...
This commit was SVN r22812.
2010-03-10 17:52:19 +00:00
Ralph Castain
17936e6e5f
Ensure we cleanly terminate if an executable cannot be found
...
This commit was SVN r22805.
2010-03-10 16:45:08 +00:00
Josh Hursey
b73237c92a
Identify the process sending the update in the verbose message (helps debugging of process control).
...
This commit was SVN r22804.
2010-03-10 00:23:24 +00:00
Shiqing Fan
49502af2ba
fix the type cast.
...
This commit was SVN r22800.
2010-03-09 10:02:50 +00:00
Ralph Castain
4355134991
Let the vm launcher specify the mapping policy
...
This commit was SVN r22797.
2010-03-08 19:13:21 +00:00
Ralph Castain
bfa39d7f7e
Update the seq mapper to support lists from -host. Reorg the dash_host code to provide an ordered list as required by the seq mapper
...
This commit was SVN r22795.
2010-03-08 09:54:49 +00:00
Ralph Castain
9e7f621a98
Port Brad's paffinity change to the 1.4 branch over to the trunk so we don't lose it going forward.
...
This commit was SVN r22794.
2010-03-07 18:44:22 +00:00
Ralph Castain
2a0f7e95ee
Don't double account for the killed local proc - only adjust num_local_procs when the proc actually dies.
...
This commit was SVN r22787.
2010-03-05 13:53:18 +00:00
Ralph Castain
b2e24693c4
Check the return status when we forward stdin and remove the recipient when they are no longer alive
...
This commit was SVN r22786.
2010-03-05 13:41:28 +00:00
Ralph Castain
577eef1491
Pretty-print the recvd command for debug purposes
...
This commit was SVN r22785.
2010-03-05 13:38:20 +00:00
Ralph Castain
cdae19cf7b
Add a convenience macro to make a job family
...
This commit was SVN r22784.
2010-03-05 13:35:09 +00:00
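The commit message does not show the macro itself. The following sketch assumes a layout in which a 32-bit job ID carries a 16-bit job family in its upper half and a 16-bit local job ID in its lower half; both that layout and every `EXAMPLE_`/`example_` name are assumptions made for illustration, not the definitions added by this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [ 16-bit job family | 16-bit local job id ] */
typedef uint32_t example_jobid_t;

/* Extract the two halves of a job id. */
#define EXAMPLE_JOB_FAMILY(jobid)  (((jobid) >> 16) & 0x0000ffffu)
#define EXAMPLE_LOCAL_JOBID(jobid) ((jobid) & 0x0000ffffu)

/* Convenience macro: place a family value in the upper half. */
#define EXAMPLE_MAKE_JOB_FAMILY(family) \
    ((example_jobid_t)(((uint32_t)(family) << 16) & 0xffff0000u))

/* Combine a family and a local job id into a full job id. */
static example_jobid_t example_make_jobid(uint32_t family, uint32_t local)
{
    assert(family <= 0xffffu && local <= 0xffffu);
    return EXAMPLE_MAKE_JOB_FAMILY(family) | (local & 0x0000ffffu);
}
```

A tool that needs to reach a specific HNP could then compare `EXAMPLE_JOB_FAMILY(a)` with `EXAMPLE_JOB_FAMILY(b)` to decide whether two job IDs belong to the same family.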
Ralph Castain
f2c65dc70f
Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic.
...
Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits.
This commit was SVN r22783.
2010-03-05 13:22:12 +00:00
Ralph Castain
ef6c432e22
Fix a nasty bug where we would hang if an application trapped signals such as SIGTERM - a permissible thing to do. In such cases, we removed the process from the waitpid system and then sent it a SIGTERM. If the application trapped that and attempted to cleanly terminate, it would send us a sync message - and the daemon would then add it back to its local child list, causing both the daemon and the process to hang.
...
In this revision, we let the process terminate/exit however it can, and then pick it up via the usual waitpid.
This commit was SVN r22781.
2010-03-05 04:14:56 +00:00
Shiqing Fan
db747e4390
Remove the old timing parameter and use orte_timing instead. Thanks to Rainer.
...
This commit was SVN r22775.
2010-03-04 15:00:03 +00:00
Ralph Castain
c88fe1ea54
Create a new mca parameter to control creation of session directories. Defaults to true so that the current behavior of always creating them is preserved. If set to false (0), then don't create session directories. Helps in those environments where session directories are a problem.
...
Tell the sm btl that it cannot run if no session directories were created.
This commit was SVN r22756.
2010-03-02 15:18:33 +00:00
Ralph Castain
cd1efbb41e
Try and do a better job of cleanup in abnormal termination. Ensure the daemons whack session directories prior to disabling signal traps. Ensure that the HNP and daemons all cleanup when they are doing an internal abort.
...
This commit was SVN r22755.
2010-03-02 14:51:23 +00:00
Ralph Castain
b692645772
Remote daemons should -always- whack any lingering session directories when exiting
...
This commit was SVN r22749.
2010-03-02 05:28:53 +00:00
Ralph Castain
69fe5ca69b
Correctly compute bynode mapping, even in the presence of a $#$%#@^$ rankfile
...
This commit was SVN r22748.
2010-03-02 05:21:42 +00:00
Ralph Castain
bef06d52bc
Silence compiler warning
...
This commit was SVN r22747.
2010-03-01 21:04:26 +00:00
Ralph Castain
5514d9c673
Fix the stupid rankfile mapper again, hopefully not breaking everything else to accommodate it. Looks like the round-robin mappers still work, at least...
...
This commit was SVN r22746.
2010-03-01 20:40:47 +00:00
Ralph Castain
96590b9fad
Filter multicast messages to avoid cross-job confusion
...
This commit was SVN r22729.
2010-02-28 18:22:56 +00:00
Ralph Castain
359dc5cad3
Complete the app_idx change by cleaning up warnings in mappers
...
This commit was SVN r22728.
2010-02-27 18:14:27 +00:00
Ralph Castain
2541aa98ab
Change the app_idx type to uint32_t to support users who use large numbers of app_contexts. Set it up as a new typedef so we can change it later without as much effort.
...
This commit was SVN r22727.
2010-02-27 17:37:34 +00:00
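The "new typedef" approach described in this commit might look roughly like the sketch below. The names `example_app_idx_t`, `EXAMPLE_APP_IDX_MAX`, `EXAMPLE_APP_IDX_FMT`, and `format_app_idx` are placeholders rather than the identifiers introduced by the commit. The point is that the width, maximum, and print format are defined in one place, so a later change of type touches only these few lines.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder names: one typedef plus companion macros, so the
 * underlying type can change later without editing every user. */
typedef uint32_t example_app_idx_t;
#define EXAMPLE_APP_IDX_MAX UINT32_MAX
#define EXAMPLE_APP_IDX_FMT PRIu32

/* Format an app_context index using the companion format macro.
 * Returns the number of characters snprintf would have written. */
static int format_app_idx(char *buf, size_t len, example_app_idx_t idx)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "app[%" EXAMPLE_APP_IDX_FMT "]", idx);
}
```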
Ralph Castain
6c0d7940c7
Add a new MCA param (and corresponding mpirun cmd line option) to output the debugger proctable info after launch. The output is just the job map with the process pid included, so you get a node-by-node list of the process ranks on that node and their pids.
...
Works for initial launch and comm_spawn. xml and non-xml output is available
This commit was SVN r22725.
2010-02-27 08:32:25 +00:00
Shiqing Fan
4a3f42d159
Correctly initialize the CCP command line buffer.
...
This commit was SVN r22721.
2010-02-26 15:53:00 +00:00
Ralph Castain
c6448587fe
It is okay to not select an rmcast module
...
This commit was SVN r22719.
2010-02-26 02:39:04 +00:00
Ralph Castain
b89a21f0fa
Grrr....cleanup the new module
...
This commit was SVN r22711.
2010-02-25 06:08:04 +00:00
Ralph Castain
8954700845
No, we don't have a .windows file...
...
This commit was SVN r22710.
2010-02-25 02:18:54 +00:00
Ralph Castain
18c7aaff08
Update the grpcomm framework to be more thread-friendly.
...
Modify the orte configure options to specify --enable-multicast such that it directs components to build or not instead of littering the code base with #if's. Remove those #if's where they used to occur.
Add a new grpcomm "mcast" module to support multicast operations. Still some work required to properly perform daemon collectives for comm_spawn operations. New module only builds when --enable-multicast is provided, and when specifically selected.
This commit was SVN r22709.
2010-02-25 01:11:29 +00:00