1
1
Граф коммитов

3151 Коммитов

Автор SHA1 Сообщение Дата
Abhishek Kulkarni
197ec7586d Add a new "file" component to the notifier framework.
When this component is selected, the notification messages are sent to a file. 
The file can be a plain file or stdout or stderr.

The MCA parameter "notifier_file_name" can be used to specify the suffix of the file the notification messages should be sent to.
The default suffix is "wdc" and the file full name is "output-wdc".

This commit was SVN r23155.
2010-05-17 22:39:52 +00:00
Ralph Castain
7e6985edbf Cleanup warnings
This commit was SVN r23150.
2010-05-16 20:23:26 +00:00
Ralph Castain
de85049477 Cleanup warnings
This commit was SVN r23149.
2010-05-16 20:22:18 +00:00
Ralph Castain
88f5217a12 Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired).
Add new mca params to test:

orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch
orte_debugger_test_attach: Test debugger colaunch after debugger attachment

To test co-launch at job start, just set the orte_debugger_test_daemon param.

To test co-launch upon attach:
set orte_debugger_test_daemon
set orte_debugger_test_attach=1
set orte_enable_debug_cospawn_while_running=1
set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching

Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon.

This commit was SVN r23144.
2010-05-14 18:44:49 +00:00
Ralph Castain
b9f0615727 Correct some logic for tracking launch progress
This commit was SVN r23122.
2010-05-12 18:39:10 +00:00
Ralph Castain
7ce34223f1 Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further cleanup error reporting to cover new cases.
Implement process migration when failed nodes are detected. Some testing still required

This commit was SVN r23121.
2010-05-12 18:11:58 +00:00
Ralph Castain
306533fdb8 Replace a missing line that shutdown a peer that failed comm.
This commit was SVN r23120.
2010-05-12 18:09:35 +00:00
Ralph Castain
871f445848 Ignore nodes that are "down" when generating maps
This commit was SVN r23119.
2010-05-12 18:08:40 +00:00
Ralph Castain
4bd25f587c Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to t
he errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so.

Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort.

This commit was SVN r23112.
2010-05-11 00:34:12 +00:00
Ralph Castain
86033074ce Grr...it's a job state, not a proc state
This commit was SVN r23110.
2010-05-07 16:50:10 +00:00
Ralph Castain
70bef5ca95 It is possible for the jdata param to be NULL, so protect the verbose output in that case
This commit was SVN r23109.
2010-05-07 16:48:44 +00:00
Shiqing Fan
8ea8563462 Include the new component config file for Windows into tarball.
This commit was SVN r23108.
2010-05-07 14:30:12 +00:00
Ralph Castain
d6a1d7a082 Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes.
This commit was SVN r23107.
2010-05-07 14:04:55 +00:00
Ralph Castain
d4f56cff61 More cleanup on paffinity....groan
It is okay to not have a paffinity module IF you aren't using paffinity anyway. So don't error out of MPI_Init because a paffinity module wasn't selected.

Cleanup error reporting in the odls default module to (once and for all!) eliminate messages originating in the fork'd process. Create some new error codes to allow us to pass enough info back to the parent process to provide useful error messages.

This commit was SVN r23106.
2010-05-06 20:57:17 +00:00
Jeff Squyres
477201e161 Fix "make dist" breakage
This commit was SVN r23105.
2010-05-06 18:47:20 +00:00
Shiqing Fan
76726f6094 Update a few module configuration for Windows.
This commit was SVN r23104.
2010-05-05 12:22:04 +00:00
Ralph Castain
2ff1ae13e1 Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set.
This commit was SVN r23102.
2010-05-05 00:48:43 +00:00
Ralph Castain
99f223210d Add some contributed examples of how to start and configure the spread library. Do a little more cleanup on the spread module, and ensure that it isn't selected if spread isn't running.
This commit was SVN r23101.
2010-05-04 23:44:00 +00:00
Ralph Castain
43005b88e0 Catch one more spot...thanks to Sam for reporting it
This commit was SVN r23094.
2010-05-04 17:21:15 +00:00
Ralph Castain
f4ae2885e2 Add new error constant
This commit was SVN r23090.
2010-05-04 13:44:33 +00:00
Ralph Castain
cd569f8a79 Restore the global restart capability
This commit was SVN r23089.
2010-05-04 02:40:29 +00:00
Ralph Castain
3ca0b4138b Let the nidmap functions update a new orte_process_info field as to the number of daemons in the system
This commit was SVN r23088.
2010-05-04 02:40:09 +00:00
Ralph Castain
8c5f442ee0 Fix some bugs in the spread rmcast component
This commit was SVN r23086.
2010-05-04 02:38:37 +00:00
Ralph Castain
8e7faf9119 Add a new test for the db framework, fix some minor bugs in the daemon module
This commit was SVN r23085.
2010-05-04 02:38:11 +00:00
Ralph Castain
9dfb5c7c62 Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set
This commit was SVN r23082.
2010-05-03 04:11:03 +00:00
Ralph Castain
5103ffead6 Continue work on local recovery for orteds
This commit was SVN r23080.
2010-05-03 04:08:13 +00:00
Ralph Castain
bcff0d6301 Some minor cleanup in the rmcast framework, ensure that a default multicast group is always defined for each app
This commit was SVN r23079.
2010-05-03 04:07:14 +00:00
Ralph Castain
f994a7edf4 Add recovery data to the jobdat object
This commit was SVN r23078.
2010-05-03 04:06:13 +00:00
Ralph Castain
323224b84b Allow the restart data to be retrieved from a job's app->env or default that given to orterun
This commit was SVN r23077.
2010-05-03 04:05:19 +00:00
Ralph Castain
3f262bf0b6 Add a new reliable multicast component based on the "spread" library
Thanks to Srini Nariangadu (Cisco) for the contribution!

This commit was SVN r23076.
2010-05-02 17:29:41 +00:00
Ralph Castain
c93af95351 Since there is no defined behavior for the case where all application procs exit normally, but some or all have non-zero returns, just output a warning telling the user how many procs meet that criteria. Let the return code of mpirun in that scenario reflect any errors in OMPI/ORTE itself.
Clearly a temporary solution until a defined behavior can be established.

This commit was SVN r23075.
2010-04-30 19:01:10 +00:00
Ralph Castain
c62418d76d Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why.
Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status.

This commit was SVN r23071.
2010-04-29 19:58:44 +00:00
Ralph Castain
5689bd6d30 Remove debug
This commit was SVN r23062.
2010-04-28 22:14:23 +00:00
Ralph Castain
319758e3e0 Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params:
1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis

2. orte_max_local_restarts - default max number of local restarts, can be overridden

3. orte_max_global_restarts - default max number of relocates, can be overridden

Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code.

This commit was SVN r23057.
2010-04-28 04:06:57 +00:00
Ralph Castain
67568e7dec Fix a warning
This commit was SVN r23049.
2010-04-27 15:59:24 +00:00
Ralph Castain
401fb22847 Fix the pack/unpack of process state values by getting the actual packing size to match the new value size.
Improve the debug output by printing the string

This commit was SVN r23048.
2010-04-27 15:58:04 +00:00
Ralph Castain
efc7d3e2f6 Update the LSF PLM module to ensure that the path is correctly passed.
Many thanks to Teng Lin for the error report - and the patch!

This commit was SVN r23047.
2010-04-27 03:46:04 +00:00
Ralph Castain
3434296836 Ensure we don't have a trailing separator on the end of our tmpdir as (a) it really looks weird, and (b) some exotic systems interpret that as indicating the rest of the path is to be treated as absolute. Makes for very strange and interesting behavior...
This commit was SVN r23046.
2010-04-27 03:40:44 +00:00
Ralph Castain
55889934d8 After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon.
Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly.

This commit was SVN r23045.
2010-04-27 03:39:32 +00:00
Ralph Castain
b9893aacc5 Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here:
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.

2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.

This support must be enabled by configuring --enable-sensors.

ompi_info and orte-info have been updated to include the new framework.

Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.

Implementation continues.

This commit was SVN r23043.
2010-04-26 22:15:57 +00:00
Ralph Castain
43a89bbace Extend the process and job states by adding values for exceeding sensor bounds. This changes the job state field to 32-bit to also provide room for future expansion.
This commit was SVN r23036.
2010-04-26 12:36:40 +00:00
Ralph Castain
e3164d2ac1 For the autogen-challenged (i.e., Jeff), create a new ORTE constant that tells the user that a required module was not found. Update the errmgr select function to output the error if no module is found.
This commit was SVN r23032.
2010-04-24 01:39:26 +00:00
Ralph Castain
b62ebb8b94 Enable these to build so others can more easily begin implementation
This commit was SVN r23031.
2010-04-23 22:46:33 +00:00
Ralph Castain
9c69900bbc Ensure we get the jdata object when updating proc state so we don't segv if report-progress is set. Thanks to Sam for reporting it!
This commit was SVN r23029.
2010-04-23 15:14:21 +00:00
George Bosilca
95394456ca No Windows EOF.
This commit was SVN r23024.
2010-04-23 05:51:29 +00:00
Ralph Castain
efbb5c9b7c Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base:
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.

* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.

* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.

* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons

* update the routed modules to reflect that process state is updated in the errmgr

* ensure that the orted's open the errmgr and select their appropriate module

* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place

* define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired.

* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios

This commit was SVN r23023.
2010-04-23 04:44:41 +00:00
Shiqing Fan
d1e66bdd01 Use variables instead of hard-coded compiler flags, in order to support various C/C++ compilers on Windows.
This commit was SVN r23016.
2010-04-21 12:45:00 +00:00
Ralph Castain
86228aee38 Provide two new opal paffinity utilities for printing a hex representation of the cpu set and parsing that string back into a cpu set on the other end. Also add a new MCA param for passing the cpu set applied to a process during launch down to that process so it can know what we attempted to do.
All to be used in some new MPI extensions provided by Jeff so that users can easily query their binding situation.

This commit was SVN r22998.
2010-04-19 22:16:35 +00:00
Ralph Castain
4d06125a33 Establish a method by which a process knows if it has been bound by mpirun. This helps resolve a problem where a process gets "bound" to all available resources, which looks to the opal paffinity system as "not bound". This can cause mpi_init to attempt to "bind" the process itself, causing unintended behavior.
This commit was SVN r22985.
2010-04-17 01:58:26 +00:00
Ralph Castain
41428e6b61 Issue a warning if a requested binding operation results in processes being bound to all available processes, which is the equivalent of not being bound at all.
See the following email thread for further details:

http://www.open-mpi.org/community/lists/devel/2010/04/7745.php

This commit was SVN r22984.
2010-04-17 01:02:41 +00:00
Ralph Castain
6c379fed2e IOF components should not assume they will be selected when queried - thus, they should not perform init functions until after selection. Create init/finalize entry points for that purpose, and have select init the module after it has been selected.
This commit was SVN r22982.
2010-04-16 18:51:27 +00:00
Ralph Castain
2ecc9fc2b3 Additional diag output
This commit was SVN r22981.
2010-04-16 14:48:37 +00:00
Ralph Castain
4308922f59 Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun
Cleanup tool name determination for CM

This commit was SVN r22980.
2010-04-15 18:10:50 +00:00
Ralph Castain
c3e4c40cdf Add another multicast tag for updating state
This commit was SVN r22979.
2010-04-15 18:08:53 +00:00
Ralph Castain
eeccf2f15c Some minor changes to support vm's
This commit was SVN r22975.
2010-04-14 01:20:43 +00:00
Ralph Castain
854dc12fc0 Add a tag for multicasting IOF messages
This commit was SVN r22972.
2010-04-13 22:51:26 +00:00
Ralph Castain
81329c637e Indentation corrections
This commit was SVN r22971.
2010-04-13 17:47:34 +00:00
Ralph Castain
8da781af84 Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools
This commit was SVN r22958.
2010-04-12 22:33:09 +00:00
Shiqing Fan
96b20a29b5 An easy solution to make singleton work on Windows.
This commit was SVN r22952.
2010-04-10 16:30:59 +00:00
Ralph Castain
d3ed4e68b7 Utilize a non-used mapping policy bit to define a policy that uses only existing alive daemons to support virtual machines and restarting processes on already-active nodes
This commit was SVN r22951.
2010-04-10 05:02:47 +00:00
Ralph Castain
4f8279df3d Enable substitution of the communication calls in the orted when sending messages back to the HNP by creating a function for this purpose and saving the pointer to it in orte_odls_base. Higher level libraries can then override the default function to use their own method.
This commit was SVN r22950.
2010-04-09 18:50:10 +00:00
Ralph Castain
c32f046d7c Tiny cleanup - when the user kills us with a ctrl-c, there really isn't a need to tell him "your procs died and we don't know why". Just shaddup and die.
This commit was SVN r22949.
2010-04-09 18:47:35 +00:00
Terry Dontje
282a537cf7 This commit fixes 2370, by having the solaris paffinity module return error codes for get_physical_processor_id and having odls_default_fork_local_proc check get_physical_processor_id for OPAL_ERROR
This commit was SVN r22948.
2010-04-09 15:10:46 +00:00
Brad Benton
101b896f2e IBM has approved the release of the LoadLeveler sample code under the
BSD license.  Consequently, a more restrictive licensing clause that was 
originally associated with the LoadLeveler sample code documentation and
replicated in a comment block in this file has been removed.

This commit was SVN r22947.
2010-04-08 19:41:44 +00:00
Terry Dontje
929c58e38d This commit fixes trac:2073
This commit was SVN r22946.

The following Trac tickets were found above:
  Ticket 2073 --> https://svn.open-mpi.org/trac/ompi/ticket/2073
2010-04-08 18:17:44 +00:00
Ralph Castain
75e99e6118 Do a better job of selecting cm ess component, handle tool and daemon issues
This commit was SVN r22942.
2010-04-07 18:59:21 +00:00
Ralph Castain
f1fc344336 Add some diagnostics
This commit was SVN r22941.
2010-04-07 18:58:17 +00:00
Ralph Castain
8e29a6858a Properly handle the case when a daemon is given both parts of its name
This commit was SVN r22935.
2010-04-06 22:41:18 +00:00
Ralph Castain
a1e82e9d05 Per discussion with Josh, cleanup the errmgr API by creating separate modules for the public vs internal APIs. This mirrors the architecture used in other frameworks that had similar requirements.
Remove the orcm errmgr module - moving to the orcm code base so it can utilize orcm communications and not interfere with ompi-related operations.

This commit was SVN r22931.
2010-04-05 22:59:21 +00:00
Ralph Castain
1caba7af2f Fix a bunch of compiler warnings reported by Jeff
This commit was SVN r22930.
2010-04-03 00:20:19 +00:00
Ralph Castain
84c7973df8 Update the #procs in the job prior to assigning vpids for each app_context.
This commit was SVN r22929.
2010-04-03 00:03:35 +00:00
Ralph Castain
6b43b76f9d Some updates required for generating a LAM-style virtual machine. Retain the local node if requested. Properly setup the daemon job map for a VM launch.
This commit was SVN r22928.
2010-04-03 00:03:01 +00:00
Ralph Castain
de6679dbd3 Truly respect the -quiet option. Make it an mca param so someone doesn't have to put it solely on the cmd line. Tell show_help to shaddup as well.
This commit was SVN r22926.
2010-04-02 14:19:38 +00:00
Ralph Castain
25a9a195b0 When the user requests "quiet", mpirun isn't supposed to output "helpful error messages" - so shaddup!
This commit was SVN r22925.
2010-04-02 07:13:11 +00:00
Ralph Castain
ed0f42fa49 Fix a bug courtesy of Jeff - since check_job_complete removes the child object and releases it, preserve the pointer to the next item on the list prior to working with it
This commit was SVN r22924.
2010-04-02 07:08:34 +00:00
Jeff Squyres
c57a8fba5a Improve the help message when mpirun cannot find an executable.
Refs trac:2035

This commit was SVN r22922.

The following Trac tickets were found above:
  Ticket 2035 --> https://svn.open-mpi.org/trac/ompi/ticket/2035
2010-04-01 13:26:29 +00:00
Ralph Castain
871a9e0df4 Track process heartbeats with time_t, be a little less restrictive on who can retrieve an orte_job_t object
This commit was SVN r22921.
2010-03-31 19:20:06 +00:00
Josh Hursey
62f8d3c471 r22885 missed a few symbol updates when it changed ompi_want_ft to opal_want_ft
This commit was SVN r22916.

The following SVN revision numbers were found above:
  r22885 --> open-mpi/ompi@522a23d6a3
2010-03-30 16:47:39 +00:00
Ralph Castain
f6bfaa76ba Add some debug output to job_complete. If no session dirs were created, then cannot check for abort file - which wouldn't be created anyway
This commit was SVN r22903.
2010-03-29 23:21:03 +00:00
Ralph Castain
24c3b4f849 Add the sysinfo framework to the "info" tools, especially since the odls_base_open function calls it!
This commit was SVN r22901.
2010-03-29 20:47:29 +00:00
Ralph Castain
2603bd8a47 Eliminate a race condition (first reported by Josh) when deliberately killing procs. Need to cancel the waitpid callback for the proc, then properly flag it as dead (both not-alive and waitpid-fired) so that the system cleans up properly.
This commit was SVN r22900.
2010-03-28 16:08:05 +00:00
Ralph Castain
4f9db20d94 Couple of minor cleanups
This commit was SVN r22899.
2010-03-28 15:41:27 +00:00
Ralph Castain
1bf9684ebb Don't include jobs in the nidmap if they aren't mapped jobs
This commit was SVN r22886.
2010-03-25 22:54:57 +00:00
Ralph Castain
522a23d6a3 A few changes to the FT-related configure options:
1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option

2. add want-ft=orcm so we can build the orcm errmgr component

3. cleanup the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved

This commit was SVN r22885.
2010-03-25 22:53:48 +00:00
Josh Hursey
e4f2d03d28 ErrMgr Framework redesign to better support fault tolerance development activities.
Explained in more detail in the following RFC:
  http://www.open-mpi.org/community/lists/devel/2010/03/7589.php

This commit was SVN r22872.
2010-03-23 21:28:02 +00:00
Ralph Castain
0b9552cd4e Expand the ESS framework's API to include a new function "query_sys_info" that allows the caller to retrieve key-value pairs of info on the local system capabilities (e.g., cpu type/model). Have each daemon and the HNP "sense" that information and provide it to their local procs to avoid having every proc querying the system directly.
This commit was SVN r22870.
2010-03-23 20:47:41 +00:00
Josh Hursey
9e967a3a9b Revert this change, since even in the CR case we want to reset this var to NULL.
Thanks to Jeff for the catch.

This commit was SVN r22868.
2010-03-23 19:55:21 +00:00
Shiqing Fan
9591680ec0 One of the binaries was generated from a wrong source.
This commit was SVN r22865.
2010-03-23 09:56:11 +00:00
Ralph Castain
62e751a95c Add a tag
This commit was SVN r22862.
2010-03-22 15:46:00 +00:00
Ralph Castain
d49f93b743 Cleanup the initialization handshake for multicast apps
This commit was SVN r22855.
2010-03-19 20:15:01 +00:00
Ralph Castain
74bd4adc6b Add some diagnostics, correctly check for existing channel
This commit was SVN r22854.
2010-03-19 08:21:01 +00:00
Ralph Castain
abbdc2b527 Pass the job family to tools that need to connect to specific HNPs
This commit was SVN r22853.
2010-03-19 04:01:33 +00:00
Ralph Castain
a479e6c320 Provide the sender's name for blocking recv's
This commit was SVN r22852.
2010-03-19 04:00:34 +00:00
Ralph Castain
8fb71c0fe6 Add some helpful defined values
This commit was SVN r22850.
2010-03-19 03:59:29 +00:00
Ralph Castain
e291fc2c69 With Jeff's help, get the libraries to link as required.
Update ompi_info and orte-info to include the new framework.

Fix some selection logic and a typo'd variable name

Still remains ompi_ignored until we complete testing

This commit was SVN r22848.
2010-03-18 02:12:59 +00:00
Ralph Castain
3cd96928a9 Use the OMPI_CHECK_PACKAGE macro to check both header file and library existence before building the component.
Still haven't gotten the right libraries linked in...so add ompi_ignore/unignore until we get it all fully integrated.

This commit was SVN r22843.
2010-03-17 00:46:12 +00:00
Ralph Castain
b400b84162 Merge in the modified thread configure option branch per today's telecon.
Remove the --enable-progress-threads option as this is no longer functional, and hardcode OPAL_ENABLE_PROGRESS_THREADS to 0.

Replace the --enable-mpi-threads option with --enable-mpi-thread-multiple as this is clearer as to meaning. This option automatically turns "on" opal thread support if it wasn't already so specified. If the user specifies --disable-opal-multi-threads --enable-mpi-thread-multiple, we will error out with a message

Add a new --enable-opal-multi-threads option that turns "on" opal thread support without doing anything wrt mpi-thread-multiple

This commit was SVN r22841.
2010-03-16 23:10:50 +00:00
Ralph Castain
ffd5be6aa1 Add a new framework to ORTE for saving and recovering state information. Two components are included that use the db or dbm library for storing the data, with a distributed hash table component coming later.
Note that each of these components will only be selected if specifically requested - otherwise, a "NULL" component will be used.  The framework is only opened by the HNP and orteds, though neither is currently coded to save/restore state

This commit was SVN r22839.
2010-03-16 20:59:48 +00:00
Rainer Keller
814fb9399f - Further patches for support on NetBSD (and DragonFly) by
Aleksej Saushev.
   Dont use bash or bashism in shell scripts
   We should use Posix' setpgid(0,0), which is equivalent to setpgrp().

This commit was SVN r22829.
2010-03-15 05:33:42 +00:00
Josh Hursey
e9b5162d79 Fix the configure logic for --with-ft so that it properly takes a comma separated list.
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.

The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.

This commit was SVN r22824.
2010-03-12 23:57:50 +00:00
Josh Hursey
b43d621f30 Remove an errant '$' in the configure.m4 files. Was causing problems with configure.
This commit was SVN r22821.
2010-03-12 20:08:22 +00:00
Ralph Castain
c16cd10bb2 Save the username, if specified, for each node
This commit was SVN r22817.
2010-03-11 15:24:18 +00:00
Ralph Castain
7105207b1c If we only have one app participating in a all_gather (which lies under a modex as well), then we need to ensure that the returned buffer has the proper packing order so it can be unpacked correctly.
This commit was SVN r22815.
2010-03-10 19:22:06 +00:00
Ralph Castain
7ebf72b4aa Trivial cleanup
This commit was SVN r22813.
2010-03-10 18:24:38 +00:00
Ralph Castain
7fd7b7a8cc Fix the load_balance mapper so that it sets the #procs in the job before attempting to compute vpids
This commit was SVN r22812.
2010-03-10 17:52:19 +00:00
Ralph Castain
17936e6e5f Ensure we cleanly terminate if an executable cannot be found
This commit was SVN r22805.
2010-03-10 16:45:08 +00:00
Josh Hursey
b73237c92a Identify the process sending the update in the verbose message (helps debugging of process control).
This commit was SVN r22804.
2010-03-10 00:23:24 +00:00
Shiqing Fan
49502af2ba fix the type cast.
This commit was SVN r22800.
2010-03-09 10:02:50 +00:00
Ralph Castain
4355134991 Let the vm launcher specify the mapping policy
This commit was SVN r22797.
2010-03-08 19:13:21 +00:00
Ralph Castain
bfa39d7f7e Update the seq mapper to support lists from -host. Reorg the dash_host code to provide an ordered list as required by the seq mapper
This commit was SVN r22795.
2010-03-08 09:54:49 +00:00
Ralph Castain
9e7f621a98 Port Brad's paffinity change to the 1.4 branch over to the trunk so we don't lose it going forward.
This commit was SVN r22794.
2010-03-07 18:44:22 +00:00
Ralph Castain
2a0f7e95ee Don't double account for the killed local proc - only adjust num_local_procs when the proc actually dies.
This commit was SVN r22787.
2010-03-05 13:53:18 +00:00
Ralph Castain
b2e24693c4 Check the return status when we forward stdin and remove the recipient when they are no longer alive
This commit was SVN r22786.
2010-03-05 13:41:28 +00:00
Ralph Castain
577eef1491 Pretty-print the recvd command for debug purposes
This commit was SVN r22785.
2010-03-05 13:38:20 +00:00
Ralph Castain
cdae19cf7b Add a convenience macro to make a job family
This commit was SVN r22784.
2010-03-05 13:35:09 +00:00
Ralph Castain
f2c65dc70f Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic.
Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits.

This commit was SVN r22783.
2010-03-05 13:22:12 +00:00
Ralph Castain
ef6c432e22 Fix a nasty bug where we would hang if an application trapped signals such as SIGTERM - a permissible thing to do. In such cases, we removed the process from the waitpid system and then sent it a SIGTERM. If the application trapped that and attempted to cleanly terminate, it would send us a sync message - and the daemon would then add it back to its local child list, causing both the daemon and the process to hang.
In this revision, we let the process terminate/exit however it can, and then pick it up via the usual waitpid.

This commit was SVN r22781.
2010-03-05 04:14:56 +00:00
Shiqing Fan
db747e4390 Remove the old timing parameter but using orte_timing instead. Thanks for Rainer.
This commit was SVN r22775.
2010-03-04 15:00:03 +00:00
Ralph Castain
c88fe1ea54 Create a new mca parameter to control creation of session directories. Defaults to true so that the current behavior of always creating them is preserved. If set to false (0), then don't create session directories. Helps in those environments where session directories are a problem.
Tell the sm btl that it cannot run if no session directories were created.

This commit was SVN r22756.
2010-03-02 15:18:33 +00:00
Ralph Castain
cd1efbb41e Try and do a better job of cleanup in abnormal termination. Ensure the daemons whack session directories prior to disabling signal traps. Ensure that the HNP and daemons all cleanup when they are doing an internal abort.
This commit was SVN r22755.
2010-03-02 14:51:23 +00:00
Ralph Castain
b692645772 Remote daemons should -always- whack any lingering session directories when exiting
This commit was SVN r22749.
2010-03-02 05:28:53 +00:00
Ralph Castain
69fe5ca69b Correctly compute bynode mapping, even in the presence of a $#$%#@^$ rankfile
This commit was SVN r22748.
2010-03-02 05:21:42 +00:00
Ralph Castain
bef06d52bc Silence compiler warning
This commit was SVN r22747.
2010-03-01 21:04:26 +00:00
Ralph Castain
5514d9c673 Fix the stupid rankfile mapper again, hopefully not breaking everything else to accommodate it. Looks like the round-robin mappers still work, at least...
This commit was SVN r22746.
2010-03-01 20:40:47 +00:00
Ralph Castain
96590b9fad Filter multicast messages to avoid cross-job confusion
This commit was SVN r22729.
2010-02-28 18:22:56 +00:00
Ralph Castain
359dc5cad3 Complete the app_idx change by cleaning up warnings in mappers
This commit was SVN r22728.
2010-02-27 18:14:27 +00:00
Ralph Castain
2541aa98ab Change the app_idx type to uint32_t to support users who use large numbers of app_contexts. Set it up as a new typedef so we can change it later without as much effort.
This commit was SVN r22727.
2010-02-27 17:37:34 +00:00
Ralph Castain
6c0d7940c7 Add a new MCA param (and corresponding mpirun cmd line option) to output the debugger proctable info after launch. The output is just the job map with the process pid included, so you get a node-by-node list of the process ranks on that node and thier pids.
Works for initial launch and comm_spawn. xml and non-xml output is available

This commit was SVN r22725.
2010-02-27 08:32:25 +00:00
Shiqing Fan
4a3f42d159 Correctly initialize the CCP command line buffer.
This commit was SVN r22721.
2010-02-26 15:53:00 +00:00
Ralph Castain
c6448587fe It is okay to not select an rmcast module
This commit was SVN r22719.
2010-02-26 02:39:04 +00:00
Ralph Castain
b89a21f0fa Grrr....cleanup the new module
This commit was SVN r22711.
2010-02-25 06:08:04 +00:00
Ralph Castain
8954700845 No, we don't have a .windows file...
This commit was SVN r22710.
2010-02-25 02:18:54 +00:00
Ralph Castain
18c7aaff08 Update the grpcomm framework to be more thread-friendly.
Modify the orte configure options to specify --enable-multicast such that it directs components to build or not instead of littering the code base with #if's. Remove those #if's where they used to occur.

Add a new grpcomm "mcast" module to support multicast operations. Still some work required to properly perform daemon collectives for comm_spawn operations. New module only builds when --enable-multicast is provided, and when specifically selected.

This commit was SVN r22709.
2010-02-25 01:11:29 +00:00
Jeff Squyres
af6f1f4b00 Add pkg-config(1) config files to Open MPI. Additionally, fix a minor
bug: libmpi_f90 had libmpi.la in its LIBADD instead of libmpi_f77.la.

Fixes trac:2244.

This commit was SVN r22704.

The following Trac tickets were found above:
  Ticket 2244 --> https://svn.open-mpi.org/trac/ompi/ticket/2244
2010-02-24 18:46:06 +00:00
Shiqing Fan
44fe33452c Use this option only on Windows, so protect it with #ifdef __WINDOWS__.
This commit was SVN r22701.
2010-02-24 08:50:03 +00:00
Jeff Squyres
d9b6b5af0c This commit converts us to the "one big libmpi" scheme that has been
discussed extensively.  See
https://svn.open-mpi.org/trac/ompi/ticket/2092 and the RFC thread
http://www.open-mpi.org/community/lists/devel/2010/02/7447.php.

Specifically:

 * Create LT convenience libraries for OPAL and ORTE if the layer
   above them is being created (use the already-defined
   AM_CONDITIONALs to know if the project above us is being built).
 * ORTE slurps in the LT convenience library for OPAL; OMPI slurps in
   the LT convenience library for ORTE.
 * Wrapper compilers now only -l one library (e.g., ortecc only does
   -lopen-ret, and mpicc only does -lmpi).

This commit was SVN r22691.
2010-02-23 22:20:01 +00:00
Shiqing Fan
7a5a5ce024 Add a global option for inputing the head node name for Windows CCP trough command line.
This commit was SVN r22688.
2010-02-23 19:42:51 +00:00
Jeff Squyres
f65eebf53d More changes for NetBSD. Thanks to Aleksej Saushev for this patch.
This commit was SVN r22680.
2010-02-22 15:05:09 +00:00
Ralph Castain
65a8ab4267 Cleanup the kill_procs command. Send a SIGTERM initially to allow C/R operations, and to be polite. Correctly update proc state if there is a problem so we don't hang.
The change to just using SIGKILL was originally done due to problems whereby waitpid thought a proc had died, but it hadn't. We'll continue debugging that problem separately, but SIGTERM is required for C/R to work properly.

This commit was SVN r22674.
2010-02-21 19:35:32 +00:00
Ralph Castain
2be03b4fb6 Cleanup a few bugs in the rmcast subsystem
This commit was SVN r22650.
2010-02-18 01:54:45 +00:00
Shiqing Fan
08ffdbe987 Changes for portable platform headers. Commit it on behalf of Ralph.
This commit was SVN r22619.
2010-02-15 22:14:59 +00:00
Ralph Castain
9a5fdbb622 Continue development of reliable multicast
This commit was SVN r22616.
2010-02-14 19:20:56 +00:00
Ralph Castain
58a1151566 Ensure the man page gets into the tarball
This commit was SVN r22613.
2010-02-13 02:39:10 +00:00
Ralph Castain
dc6de3e9b5 Add an "orte-info" tool to report ORTE and OPAL configuration and parameter info ala "ompi_info". Temporary solution until refactoring of the "info" system can be undertaken.
This commit was SVN r22608.
2010-02-11 23:02:14 +00:00
Josh Hursey
ec4498c258 update copyright
This commit was SVN r22605.
2010-02-11 14:03:36 +00:00
Josh Hursey
4513440df9 The ordering of the appfile matters. The ranks must be in increasing rank
order, so that they get assigned the same ORTE name on restart.

cmr:v1.4.2
cmr:v1.5

This commit was SVN r22586.
2010-02-09 00:31:24 +00:00
Eugene Loh
924bc10ceb Add description of the sequential mapper (-mca rmaps seq) to the
mpirun man page.

This commit was SVN r22567.
2010-02-06 06:22:29 +00:00
Josh Hursey
a3583b8f57 Fix --bynode option to remember for subsequent jobs where it left off last time.
Add a ''map_bynode'' info key to determine if the job to be started by comm_spawn* should be mapped by node or by slot. Default is to map according to the default policy set when the parent job was started.

cmr:v1.5.1

This commit was SVN r22564.
2010-02-05 15:37:49 +00:00
Iain Bason
28f03a2d86 Suspend/resume enhancements:
Have orte call setpgrp after forking (but before exec) when
orte_forward_job_control is set. Then have it send signals to the
child's process group.  This allows suspending jobs that fork.

If a SIGTSTP arrives before the processes have been launched, then
record it and suspend them right after launching.

This commit was SVN r22557.
2010-02-04 15:47:20 +00:00
Shiqing Fan
bbcf1f71c4 Remove a incorrect callback, which was based on the old source base without WMI. This makes no harm on 32 bit Windows, but it seems causing exceptions on 64 bit Windows sometimes. What this callback does is just waiting on the given pid which actually is a remote pid, so it won't work as expected.
cmr:v1.4.2
cmr:v1.5

This commit was SVN r22549.
2010-02-04 10:47:28 +00:00
Ralph Castain
c2aba2a6d7 Pack the resource key correctly
This commit was SVN r22541.
2010-02-03 18:21:27 +00:00
Ralph Castain
16b7bc7a82 Sigh...get the order right to match unpack
This commit was SVN r22539.
2010-02-03 15:50:43 +00:00
Ralph Castain
e88627a7ca Ensure we don't go through rml open/select more than once.
Open the rml to get the uri when bootstrapping daemons

This commit was SVN r22538.
2010-02-03 15:38:32 +00:00
Ralph Castain
cb1007b5a9 Pass back the number of daemons in the system
This commit was SVN r22537.
2010-02-03 14:31:16 +00:00
Shiqing Fan
bdc13dacb1 A type cast.
This commit was SVN r22520.
2010-01-31 20:22:22 +00:00
Jeff Squyres
007a6c7b99 Per #2201, move the user arguments up to be the first set of argv
after the compiler argv tokens.  

Not closing #2201 yet; there's still discussion on that ticket about
whether we want to do more or not.

Refs trac:2201
cmr:v1.4.2 
cmr:v1.5

This commit was SVN r22513.

The following Trac tickets were found above:
  Ticket 2201 --> https://svn.open-mpi.org/trac/ompi/ticket/2201
2010-01-29 22:51:35 +00:00
Josh Hursey
4f56b4b133 revert mistaken commit r22511
This commit was SVN r22512.

The following SVN revision numbers were found above:
  r22511 --> open-mpi/ompi@a16cb86181
2010-01-29 16:30:16 +00:00
Josh Hursey
a16cb86181 copyright fixes
This commit was SVN r22511.
2010-01-29 16:28:41 +00:00
Ralph Castain
7badff9d2d Okay to return no available nodes for mapping when launching daemons - just means there is nothing to do
This commit was SVN r22509.
2010-01-28 22:58:28 +00:00
Ralph Castain
86dd1d41af Handle zero-length iovecs in multicast messages
This commit was SVN r22507.
2010-01-28 15:29:43 +00:00
Ralph Castain
f66b6cae23 Enable the boot of an orted "virtual machine". Modify the mapper framework to allow mapping of only daemons. Remove the cm ras module as no longer required. Modify the orted code to always send back node arch info. Remove the "--enable-bootstrap" configure option as this feature will now always be available.
This commit was SVN r22480.
2010-01-25 22:25:13 +00:00
Josh Hursey
b749ecbab8 This commit fixes trac:2190.
Originally the patch was to improve the error message, but when digging into the code I found a subtle bug. If the daemon does not tell the HNP what CRS component it used, then the HNP tries to figure it out from the metadata (this is an uncommon case). The path the HNP used was not complete, so it was unable to find the metadata information. This patch fixes this by adding the 'snapshot_reference' to the 'snapshot_location' which completes the path for this search.

cmr:v1.4 (needs a custom patch)

cmr:v1.5

This commit was SVN r22479.

The following Trac tickets were found above:
  Ticket 2190 --> https://svn.open-mpi.org/trac/ompi/ticket/2190
2010-01-25 20:28:38 +00:00
Josh Hursey
44497a0567 forgot to update the copyright refs trac:2194
This commit was SVN r22478.

The following Trac tickets were found above:
  Ticket 2194 --> https://svn.open-mpi.org/trac/ompi/ticket/2194
2010-01-25 19:24:09 +00:00
Josh Hursey
36ab4a60b3 Improve the error message for when ompi-checkpoint cannot find a HNP to connect to.
This commit fixes trac:2189.

cmr:v1.4 (needs custom patch)
cmr:v1.5

This commit was SVN r22477.

The following Trac tickets were found above:
  Ticket 2189 --> https://svn.open-mpi.org/trac/ompi/ticket/2189
2010-01-25 19:17:43 +00:00
Ralph Castain
e4bf33dcab Just a slight efficiency improvement - why check a flag twice?
This commit was SVN r22472.
2010-01-23 03:57:56 +00:00
Shiqing Fan
c29a668e37 Remove flex.exe and its license file from the tarball.
cmr:v1.4
cmr:v1.5

This commit was SVN r22469.
2010-01-22 16:40:13 +00:00
Ralph Castain
00b493e10d Make the sigpipe message be a verbose output
This commit was SVN r22464.
2010-01-22 03:46:11 +00:00
Ralph Castain
2517799102 Report, but ignore, SIGPIPE events. The odls already resets this signal handler when spawning local procs.
This commit was SVN r22463.
2010-01-21 05:01:06 +00:00
Shiqing Fan
4836e8878a Update a few more CMake scripts.
This commit was SVN r22454.
2010-01-19 17:34:55 +00:00
Ralph Castain
3fe5e3e142 Propagate the user's callback data during non-blocking sends
This commit was SVN r22432.
2010-01-15 20:02:47 +00:00
Shiqing Fan
ad763c327d Restore several linked libraries that were deleted by mistake in r22405.
This commit was SVN r22415.

The following SVN revision numbers were found above:
  r22405 --> open-mpi/ompi@872a4047ba
2010-01-14 21:50:42 +00:00
Shiqing Fan
872a4047ba Fix the bug that caused by ADD_DEPENDENCIES() from different version of CMake.
In CMake 2.6 and earlier, this function add dependencies for targets and also link the target libraries automatically, but in CMake 2.8,this behavior has been changed, i.e. it will only add the dependencies but no link, which will cause linking errors at compilation time.

This commit was SVN r22405.
2010-01-14 18:10:20 +00:00
Ralph Castain
cec840f6b9 The ability to add procs to a running job was unfortunately borked when we added the detection of a proc exiting before calling init. Re-enable it here, ensuring that procs that are being restarted and/or added to a job do -not- call barrier during orte_init.
This commit was SVN r22404.
2010-01-14 17:59:42 +00:00
Shiqing Fan
0259fa0b9c Correct a few variable names.
This commit was SVN r22401.
2010-01-14 10:55:15 +00:00
Ralph Castain
adb2430e24 Missed one place, of course
This commit was SVN r22400.
2010-01-13 23:11:44 +00:00
Ralph Castain
c782c98433 Rename the "basic" rmcast component "udp" to more accurately reflect its operation
This commit was SVN r22399.
2010-01-13 23:01:25 +00:00
Ralph Castain
237eb4e8df For some strange reason, every so often it appears possible for the event library to trip the read event on a socket, yet have the read itself yield an error. If/when that happens, report the error and continue on.
This happens rarely, but it does seem to happen.

This commit was SVN r22398.
2010-01-13 19:23:28 +00:00
Ralph Castain
ae1719306b Fix a bug in non-blocking sends
This commit was SVN r22395.
2010-01-13 05:37:36 +00:00
Ralph Castain
b35486d945 The CM ess module needs to open the sysinfo framework and select modules prior to when others need it. Thus, setup a flag to avoid multiple open/select within that framework.
This commit was SVN r22393.
2010-01-12 22:03:49 +00:00
Ralph Castain
48486df4fe Cleanup some diagnostics
This commit was SVN r22389.
2010-01-12 01:25:19 +00:00
Ralph Castain
9f3ccebeaa We need to barrier for orte apps when the job is initially started, but we must not do the barrier when a proc is restarted as the other procs in the job won't know to participate.
This commit was SVN r22388.
2010-01-10 02:21:30 +00:00
Ralph Castain
16b16c5cb8 Fix a silly typo
This commit was SVN r22387.
2010-01-09 15:34:49 +00:00
Ralph Castain
e0afc30708 Since the decision on whether or not to use script wrapper compilers is a configure-time option, we need the wrapper compiler script to be in the tarball
This commit was SVN r22386.
2010-01-09 01:08:10 +00:00
Ralph Castain
add84178ef Fix a silly typo that prevented tcp multicast messages from being delivered
This commit was SVN r22384.
2010-01-08 20:30:27 +00:00
Brian Barrett
86d8356b13 Updates to allow OMPI to build on Cray XT platforms running Catamount
This commit was SVN r22381.
2010-01-07 18:14:03 +00:00
Ralph Castain
09763ec711 Since we modified ORTE to declare that any process that terminates after calling "init" while at least one other process has not yet called "init" is an error, we have to ensure that non-MPI ORTE apps (i.e., apps that call orte_init but not mpi_init) include a barrier in orte_init. Otherwise, fast ORTE apps almost always wind up triggering the "abnormal termination" condition.
The barrier is protected with a test to ensure that MPI apps don't execute it and wind up doing two barriers during their init.

This commit was SVN r22378.
2010-01-07 06:58:01 +00:00
Ralph Castain
ef1bfaa823 Add the ability to track how many times a process has been restarted, and to communicate that value to a process when it is restarted in case it needs to take action when it is restarted as opposed to being started for the first time.
This commit was SVN r22377.
2010-01-07 01:19:44 +00:00
Ralph Castain
a12de9d1e8 Oh, the pain one little word can make...sigh.
This commit was SVN r22364.
2010-01-05 23:29:56 +00:00
Ralph Castain
5faf857840 Add a new tag for pnp/multicast send of direct messages
This commit was SVN r22352.
2009-12-31 20:34:58 +00:00
Ralph Castain
b3a58f8b83 Pass the correct address when packing iovec bytes for multicast.
Thanks to Rick Payne for the correction.

This commit was SVN r22351.
2009-12-30 20:59:31 +00:00
Ralph Castain
bb7aa9797f Not sure why the nightly tarball had a problem as this wasn't changed, but comment out the orte wrapper compiler man pages for now
This commit was SVN r22350.
2009-12-30 02:52:29 +00:00
Ralph Castain
89a6131032 Check the return status code on all dss operations within the rmcast modules
This commit was SVN r22349.
2009-12-30 01:45:31 +00:00
Ralph Castain
fad1ba15b0 Move the test for case-sensitive file system from ompi to opal so that all layers can have that knowledge.
Use that for the orte wrapper compilers

This commit was SVN r22348.
2009-12-29 23:26:45 +00:00
Ralph Castain
d6003e3369 Update the mpirun man page to include info on the "!" option to the xterm cmd line option that holds the xterm window open after process termination so the user can see the process' output.
This commit was SVN r22341.
2009-12-26 16:01:03 +00:00
Ralph Castain
50074f0770 Remove unused (and uninitialized) variable
This commit was SVN r22340.
2009-12-24 01:36:47 +00:00
Ralph Castain
aaf1119f40 Garrr...ensure we accurately know when to update the contact info so we don't do it incorrectly as procs terminate, thus causing the system to think that perfectly good apps are incorrectly terminating.
Thanks to George for pointing out the problem

This commit was SVN r22332.
2009-12-17 20:40:21 +00:00
Ralph Castain
db2cbd3166 Okay, okay - do it at destruct time too.
This commit was SVN r22331.
2009-12-17 20:08:49 +00:00
Ralph Castain
a56e09c874 Per suggestion from Josh, init the sender field of the msg_packet object to INVALID
This commit was SVN r22330.
2009-12-17 20:03:35 +00:00
Ralph Castain
8ab962411c Detect the scenario where one or more procs fail to call orte/ompi_init while others in the job do. This scenario can cause the job to hang as MPI_Init contains a barrier operation that will not complete. Although ORTE does not contain such a barrier, it still will be considered as an error scenario so that we can detect the MPI case - otherwise, ORTE has no knowledge of OMPI and wouldn't know how to differentiate the use-cases.
Take advantage of the changes to update the routed_base_receive code to avoid message overlap.

This commit was SVN r22329.
2009-12-17 19:39:53 +00:00
Ralph Castain
06d1f2cfe2 Add some new tests to the ORTE collection
This commit was SVN r22328.
2009-12-17 19:30:57 +00:00
Josh Hursey
313acba4ce Move the mca_base_is_component_required() functionality to mca/base per suggestion so that it can be reused in other components.
This commit was SVN r22327.
2009-12-17 15:12:26 +00:00
Josh Hursey
a418a7dc43 Make sure to look in not only the env var, but also {{{orte_routed_base_components}}} to confirm that this is the only component available, and intended for selection.
This commit was SVN r22323.
2009-12-16 20:17:26 +00:00
Josh Hursey
646f90a90a Small fix for a egde case
This commit was SVN r22322.
2009-12-16 18:06:05 +00:00
George Bosilca
a2310808f1 Santa's back! Fix all warnings about the deprecated usage of
stringWithCString as well as the casting issue between NSInteger and
%d. The first is solved by using stringWithUTF8String, which apparently
will always give the right answer (sic). The second is fixed as suggested
by Apple by casting the NSInteger (hint: which by definition is large
enough to hold a pointer) to a long and use %ld in the printf.

This commit was SVN r22317.
2009-12-16 00:06:37 +00:00
Ralph Castain
9acec283af Add a new TCP module to the reliable multicast framework. This module uses ORTE's grpcomm.xcast functionality to "fake" multicasts for environments where regular multicast isn't reliable.
Modify the startup logic to allow for this use-case.

This commit was SVN r22310.
2009-12-15 01:18:27 +00:00
Ralph Castain
0ffa4f2f0c Ensure we cancel the lingering recv in the allgather code to avoid having incorrect counters.
Thanks to Damien for spotting the problem.

This commit was SVN r22301.
2009-12-14 13:21:56 +00:00
Josh Hursey
4357159ac9 Make sure to check for the NO_CKPT state while waiting. This means that the target was not able to checkpoint [ever | at this time]. So {{{ompi-checkpoint}}} should exit after printing the error message, instead of hanging and waiting.
Will need to be moved to v1.5 and v1.4. v1.4 will require a custom patch, but should apply cleanly to v1.5. CMRs to follow.

This commit was SVN r22289.
2009-12-09 16:01:33 +00:00
George Bosilca
501d1cc4ad Set default values to avoid using these variables uninitialized.
This commit was SVN r22279.
2009-12-08 18:42:22 +00:00
Ralph Castain
e3a2e66ec2 Add limits on rmcast seq numbers
This commit was SVN r22269.
2009-12-05 01:20:14 +00:00
Jeff Squyres
9fffd30660 Fix a typo in the man page. Thanks to Jeremiah Willcock for pointing
it out.

This commit was SVN r22268.
2009-12-04 21:10:50 +00:00
Ralph Castain
4026a9c873 Update all the tests to the new orte_init API
This commit was SVN r22263.
2009-12-04 04:31:06 +00:00
Ralph Castain
4a82dd9a45 Add message sequence numbers to multicast messages, tracked by channel
This commit was SVN r22262.
2009-12-04 04:17:44 +00:00
Jeff Squyres
16b100219d A patch from UTK to allow orte_init(), opal_init(), and associated
friends also receive &argc and &argv (George asked Jeff to Ralph to
review before committing).  The thought is that passing argv and argc
to opal/orte_init be useful to other projects outside of OMPI that are
using OPAL and/or ORTE (especially in conjunction with some other
bootstrapping code where it is helpful to modify argv).  It's such a
small thing that it's easy to apply here to make others' lives a
little easier.

Ask George for more details; I'm just the messenger.  :-)

Judging by the copyrights on this patch, it's been around for a
while.  :-)

This commit was SVN r22260.
2009-12-04 00:51:15 +00:00
Ralph Castain
ae3e9f2aee Update the spin.c test
This commit was SVN r22259.
2009-12-03 04:46:31 +00:00
Ralph Castain
4ec9c4b532 Do a better job of ensuring session directories are removed when procs abnormally terminate and/or we order "kill local procs"
This commit was SVN r22258.
2009-12-03 04:46:17 +00:00
Ralph Castain
93ebed48b1 Update the multicast test. Some cleanups to the basic rmcast module
This commit was SVN r22257.
2009-12-03 04:30:58 +00:00
Ralph Castain
66efa05a53 Don't cancel the recv unless it was issued or else we generate an error whenever we launch an app without having to launch daemons (e.g., a completely local launch to mpirun)
This commit was SVN r22256.
2009-12-03 04:28:43 +00:00
Ralph Castain
3a72ee9dca Fix a bug reported by Rainer whereby we could free and reuse an address if the user specified the tmp dir base. After discussing with Josh, we also removed the code that had us retry creation of the session dir (using default values) if the user-specified value didn't work for some reason. Adhering to OMPI standard practices, we abort if the user-specified value doesn't work.
This commit was SVN r22255.
2009-12-03 01:57:35 +00:00
George Bosilca
7bf1d7a1c4 A more asynchronous startup over rsh/ssh.
This commit was SVN r22253.
2009-12-02 20:29:32 +00:00
Ralph Castain
a0d5c80ce0 Add a new framework for discovering local resource information such as cpu type/model, #cpus, available physical memory, etc. Two initial components (darwin and linux) are provided. This is needed to support bootstrap operations where daemons are started at node boot, and applications where initial knowledge of cpu identification is needed to guide framework component selection.
Add orte configuration option to control the use of the framework in the system. Although the code will build, it will not be active unless configured with --enable-bootstrap.

If bootstrap is enabled and the new opal_sysinfo framework can successfully determine the cpu model, pass that info to the application as an MCA param to support some work at Sun.

Also, have daemons report back the resources they find to guide process mapping in bootstrap operations (i.e., where the daemon starts at node boot as opposed to being launched at application start).

Adjust some platform files to enable these capabilities.

This commit was SVN r22244.
2009-11-30 23:11:25 +00:00
Ralph Castain
e38a0eab9f Remove the fddp and sensor frameworks - relocated to new cluster mgr project
This commit was SVN r22240.
2009-11-27 22:14:47 +00:00
Rainer Keller
70a69e796f - Get rid of a small nuisance: after installation of the
alps-resid script, set it to exec, to allow:

   export OMPI_ALPS_RESID=`$OMPI/share/openmpi/ras-alps-command.sh`

This commit was SVN r22234.
2009-11-25 19:01:33 +00:00
Ralph Castain
9a6d5697a8 Protect against NULL input - I'm -sure- no one will do it, but...well, actually, they did. :-/
This commit was SVN r22232.
2009-11-25 15:13:21 +00:00
Ralph Castain
c1206139dd Ensure the thread-safe data buffers are initialized prior to use
This commit was SVN r22231.
2009-11-25 15:12:45 +00:00
Ralph Castain
92733b13d9 Add a couple of new tests to the orte system.
Modify the job_complete check so we don't kill jobs when a single proc was terminated by ORTE command via plm.terminate_procs

Still dies gracefully with a ctrl-c, and behaves as before when using plm.terminate_job

This commit was SVN r22227.
2009-11-20 01:47:49 +00:00
Ralph Castain
5e031d9ded Let a restarted process have access to all known nodes instead of only those already in its prior job map
This commit was SVN r22225.
2009-11-19 19:45:11 +00:00
Ralph Castain
852e5d9ee0 Add some diag output
This commit was SVN r22224.
2009-11-19 19:43:36 +00:00
Ralph Castain
a401f05ea3 Add some diagnostics to chase down forced termination of procs. Ensure that procs are removed from the local data list upon termination
This commit was SVN r22223.
2009-11-19 19:43:10 +00:00
Ralph Castain
3921069230 Ensure we completely cleanout the old nidmap info
This commit was SVN r22222.
2009-11-19 19:42:15 +00:00
Ralph Castain
8dc08e304f No longer require name passed separately
This commit was SVN r22221.
2009-11-19 19:41:41 +00:00
Ralph Castain
1a44b84b25 If a process is in certain states (e.g., polling for messages in the event lib), then it can blissfully ignore SIGTERM when we try to order it to die. Unfortunately, the OS thinks the process actually did die, leading us to leave orphaned procs around.
The only sure way to kill the thing is with SIGKILL. After hours spent trying to debug this bizarre situation with a reliable reproducer, I finally tracked it down and fixed it.

Go figure...I sure can't.

This commit was SVN r22220.
2009-11-19 17:25:15 +00:00
Shiqing Fan
11ad25fa77 A few windows fixes:
Add a missing value for the configure file. 
Fix the bug that generating wrong svn version number.
Correct the wrong string length of the headnode name.

cmr:v1.5
cmr:v1.3.4

This commit was SVN r22219.
2009-11-18 09:43:47 +00:00
Ralph Castain
840766a894 Update the rmcast APIs to include tag params and reorder them to look like their rml cousins
This commit was SVN r22218.
2009-11-17 15:58:59 +00:00
Ralph Castain
aea1ab3bd6 Remove diagnostic
This commit was SVN r22216.
2009-11-11 22:16:15 +00:00
Ralph Castain
a2f3a47b92 Update the orte_mcast test
This commit was SVN r22214.
2009-11-11 22:11:19 +00:00
Ralph Castain
6496ce7212 Expand the reliable multicast APIs to support sending/recving of iovecs
This commit was SVN r22213.
2009-11-11 22:10:35 +00:00
Rainer Keller
366bd96c88 - Allow to work without xt-catamount module on Jaguar,
reducing the amount of components, that up to now needed to be
   deselected.

This commit was SVN r22205.
2009-11-09 14:26:24 +00:00
Shiqing Fan
6f8d0a1ab8 Update a few CMake scripts.
Add Program Database (pdb) files for installation for debug build.

This commit was SVN r22188.
2009-11-03 10:40:58 +00:00
Rainer Keller
f121e46db1 - Finalize ornl_configure
This commit was SVN r22178.
2009-11-01 03:25:57 +00:00
Rainer Keller
7dfe709ac1 - Initialize n before usage.
This commit was SVN r22169.
2009-10-29 15:52:53 +00:00
Terry Dontje
c6ebc7c341 rename macros ompi_check_optflags and ompi_make_stripped_flags based on comments in #2072
This commit was SVN r22151.
2009-10-28 10:51:59 +00:00
Terry Dontje
6df802424d remove duplicate setting of CFLAGS_WITHOUT_OPTFLAGS and special case DEBUGGER_FLAGS for intel compiler
This commit was SVN r22143.
2009-10-26 18:41:53 +00:00
Ralph Castain
13d86e100b Courtesy of Ralph and Jeff:
Continue the reorganization of the configure system. Move files from the main config directory to their appropriate level-specific config directories. Modify the configure system to correctly handle compiler detection, test, and setup so that all things pertaining to opal and orte are done at the lower level, with the ompi configure system only looking at mpi-specific options.

Ensure the wrapper compilers for orte and ompi only get built when appropriate. Add support for c++ to the orte wrapper compilers, both script and non-script versions.

This commit was SVN r22138.
2009-10-24 01:04:35 +00:00
Ralph Castain
7afd65d631 Add a couple of test programs
This commit was SVN r22137.
2009-10-24 01:00:38 +00:00
Jeff Squyres
02db4f5146 Terry pointed out that ORTE also needs the "totalview" flags, and
therefore the m4 test really belongs on orte/config.  Thank Terry!

Additionally, I took the opprotunity to rename the variable so that
"TOTALVIEW" is not in the name anymore (because it applies to all
variables, not just Totalview).

This commit was SVN r22134.
2009-10-23 13:00:59 +00:00
Tim Mattox
4acfbe6554 Unfortunately, the typo's that r22129 tried to fix were not
as simple as I or Ralph had hoped.  This should be the real fix,
or very close to it.  I can now see both the sensor and rmcast
information from ompi_info when configured
with --enable-monitoring --enable_multicast

This commit was SVN r22131.

The following SVN revision numbers were found above:
  r22129 --> open-mpi/ompi@02ff00dfb5
2009-10-23 02:38:51 +00:00
Jeff Squyres
e0e20870e1 Generated files should not be in SVN.
This commit was SVN r22126.
2009-10-22 17:57:05 +00:00
Pavel Shamis
7425255be5 Fixing compilation failure. Adding missing include.
This commit was SVN r22119.
2009-10-21 16:28:40 +00:00
Ralph Castain
c33866f0df Per Tim Mattox (again, via my branch):
Add a script wrapper compiler version of ortecc for use when in cross-compile scenarios

This commit was SVN r22115.
2009-10-20 23:46:46 +00:00
Ralph Castain
ee82d42a1c Add a new sensor component that pulls data via an external shared memory interface
Only builds when the appropriate library is present

This commit was SVN r22114.
2009-10-20 23:45:35 +00:00
Ralph Castain
214e26b539 Per Jeff (this work was done on a branch of mine, so I will do the commit):
Re-enable "./autogen.sh -no-ompi" again. If you -no-ompi, the entire OMPI
configury is skipped and the entire ompi/ subtree is not built. There's
some simple m4-isms that prune out the relevant parts.

I added ompi/config/, orte/config/, and opal/config/ directories. I moved a
bunch of m4 files from the top-level config/ dir into ompi/config/, and a few
into orte/config/.

Note that all 3 <project>/config directories have a config_files.m4 file. This
file contains the AC_CONFIG_FILES list for that project. The AC_CONFIG_FILES
call cannot be in an AC_DEFUN macro and conditionally called -- if it is
included at all, Autoconf will process it. Hence, these config_files.m4 files
don't AC_DEFUN -- they just have AC_CONFIG_FILES. m4_ifdef() is used to
conditionally include the files or not.

I moved a bunch of obvious OMPI-only m4 files from config/ to ompi/config/,
but I'm sure that there's more that could go. A ticket will be filed with
thoughts on future work in this area.

This commit was SVN r22113.
2009-10-20 23:44:20 +00:00
Ralph Castain
f1f156d57b Make rmaps base open function play nicely with ompi_info
This commit was SVN r22111.
2009-10-20 07:28:23 +00:00
Ralph Castain
ff9d72b3ab Add a new multicast tag for collecting ps data
This commit was SVN r22107.
2009-10-16 04:21:22 +00:00
Terry Dontje
13907781b2 missed adding report-uri option
This commit was SVN r22106.
2009-10-15 18:05:24 +00:00
Ralph Castain
49ce2b4342 Add a new interface to the rmcast framework to query the output channel for the proc
This commit was SVN r22105.
2009-10-15 17:47:42 +00:00
Terry Dontje
c96af5654c correct options and wording that were dropped in the last change due to committing v1.3 manpage to the trunk
This commit was SVN r22104.
2009-10-15 15:03:21 +00:00
Ralph Castain
99c67183d2 Minor cleanups, mainly to ensure we correctly block on blocking sends
This commit was SVN r22102.
2009-10-15 02:39:15 +00:00
Ralph Castain
2f91a4833b Have the trigger event return the event itself in the callback function so it can be reset, if desired
This commit was SVN r22101.
2009-10-15 02:35:53 +00:00
Ralph Castain
2665825693 Correct an error that causes the system to "bounce" when we order a job killed. We didn't used to discriminate between a process being ordered to die, and a process that was aborted by an external signal. Unfortunately, that means the error mgr gets called and told a process abnormally aborted when we order termination, thus causing the errmgr to send out a "kill procs" command again.
Wouldn't be so bad, except...the errmgr orders the termination of ALL procs, which kills any other job that should have been left alone.

Add a new proc and job state indicating "killed_by_cmd" so we can tell the difference between a proc/job that was deliberately terminated by us vs one that is killed by external signal.

This change was tested to ensure it didn't interfere with ctrl-c operation (it doesn't - we order termination of all jobs when we get a ctrl-c).

This commit was SVN r22100.
2009-10-14 22:49:56 +00:00
Ralph Castain
18960a9c5a Refactor the multicast support so the data type objects can be accessed beyond just the one component
Ensure that the local node is included in the allocation prior to bootstrap discovery

This commit was SVN r22099.
2009-10-14 17:43:40 +00:00
Terry Dontje
0a8645a411 This commit fixes trac:2017
This commit was SVN r22098.

The following Trac tickets were found above:
  Ticket 2017 --> https://svn.open-mpi.org/trac/ompi/ticket/2017
2009-10-14 11:40:47 +00:00
Ralph Castain
bc869636be Reset the verbosity levels to suppress debug output
This commit was SVN r22095.
2009-10-13 15:29:38 +00:00
Ralph Castain
e501589b3b Cleanup the bootstrap procedure for multiple daemons starting up
This commit was SVN r22094.
2009-10-13 15:14:54 +00:00
Ralph Castain
c25dd14440 Correctly set the multicast interface, cleanup a comment
This commit was SVN r22093.
2009-10-13 15:14:28 +00:00
Ralph Castain
d8d80d6f1a Closes trac:2054. Check if a user specifies more cpus-per-rank than there are cpus in a socket - if so, politely tell them "you are stupid" and abort.
This commit was SVN r22091.

The following Trac tickets were found above:
  Ticket 2054 --> https://svn.open-mpi.org/trac/ompi/ticket/2054
2009-10-13 04:19:07 +00:00
Ralph Castain
1475d34c13 Ensure we default to byslot mapping
This commit was SVN r22090.
2009-10-11 23:50:42 +00:00
Ralph Castain
67625d5df6 Esure that mca params have a chance to be used in the standard hierarchy - in this case, allowing a choice of mapping policy to be made at the rmaps mca level.
This commit was SVN r22089.
2009-10-11 03:44:39 +00:00
Ralph Castain
84cc847be8 Next phase of auto-wireup using multicast. Enable use of multicast groups to separate comm from different application groups. Have the orted bootstrap message go to a different rml tag so the node can be added to the pool.
This commit was SVN r22083.
2009-10-10 01:19:56 +00:00
Ralph Castain
40e2299fa7 Test to ensure that num_procs was provided for the resilient mapper - it cannot be used with options like npernode.
Cleanup the show_help text file

This commit was SVN r22082.
2009-10-09 15:26:23 +00:00
Ralph Castain
b7a0125bb7 Add a test for the new opal if.c functions. Modify the multicast test
This commit was SVN r22081.
2009-10-09 15:25:18 +00:00
Shiqing Fan
7dff65cbc9 Clean up a little bit.
Add an option for setting up the job name.

This commit was SVN r22053.
2009-10-06 07:52:43 +00:00
Ralph Castain
dcab61ad83 Restore the prior default rank assignment scheme for round-robin mappers. Ensure that each app_context has sequential vpids.
This commit was SVN r22048.
2009-10-02 03:16:18 +00:00
Ralph Castain
a15c58c583 Fix the proc assignment into the job data object during assignment of vpids as comm_spawned procs were being overwritten by their parents with the same vpid.
Add a little debug output when updating proc state

This commit was SVN r22042.
2009-10-01 13:44:34 +00:00
Ralph Castain
51f64aaf96 Add a new ras module to support bootstrap operations. Additional functionality may eventually be required in the component, but for now all it does is provide a mechanism for ensuring that other allocations don't confuse the system.
Only active if specifically directed to use it

This commit was SVN r22040.
2009-09-30 23:30:24 +00:00
Ralph Castain
1d7ab97c84 Update the multicast framework to allow specification of different message scopes per various RFCs. Redefine the API a little to utilize channel numbers without worrying about the specifics of their addressing
This commit was SVN r22037.
2009-09-30 14:40:43 +00:00
Ralph Castain
19a13039aa Strange - remove duplicated comment
This commit was SVN r22035.
2009-09-30 14:38:13 +00:00
Ralph Castain
105ef7eeaf Turn off a debug by properly setting the verbosity value
This commit was SVN r22033.
2009-09-30 06:47:48 +00:00
George Bosilca
a6a6df0c48 MPIR_Breakpoint should be externally visible in order to allow the
debugger to find the symbol in our libs.

This commit was SVN r22032.
2009-09-30 04:07:55 +00:00
Ralph Castain
5a24d6f60e Remove an option that the orteds don't actually support...
This commit was SVN r22027.
2009-09-29 02:08:27 +00:00
Ralph Castain
c749fefbd0 Instead of an odls-base mca param, make report_bindings a global param so that we can (a) detect it was set in the plm, and then (b) ensure it gets passed along to remote orteds so they will comply with the request.
This commit was SVN r22021.
2009-09-28 03:17:15 +00:00
Ralph Castain
47c9a5409e Ensure that tools init the multicast channel correctly
This commit was SVN r22020.
2009-09-28 03:15:51 +00:00
Ralph Castain
ef0fd8b8d1 Return an error code if the job failed to start
This commit was SVN r22019.
2009-09-26 03:34:58 +00:00
Ralph Castain
e337fa686e Correct handling of pointer array indexing
This commit was SVN r22018.
2009-09-26 03:33:55 +00:00
Ralph Castain
6fa2e81491 Correct handling of pointer array indexing
This commit was SVN r22017.
2009-09-26 03:33:26 +00:00
Ralph Castain
709b36efb4 Cleanup auto-wireup and enable tools to "discover" the HNP via multicast
This commit was SVN r22012.
2009-09-25 01:00:09 +00:00
Abhishek Kulkarni
2af7657db1 A few changes to the FTB notifier interface:
- add an orte ftb notifier help file for more verbose error messages
- check if we can connect to the FTB during component->query and close
  the component, if we cannot.
- make the ftb component interface methods static.
- add mca parameters to set override the default subscription style and
  priority.

This commit was SVN r22011.
2009-09-24 23:56:41 +00:00
Ralph Castain
3167f0a0a0 Complete the next round of the multicast framework development. Needs further polish, upgrade to handle message fragmentation - but good enough for auto-bootstrap of orteds.
Teach the ess cm module to bootstrap orted launch

This commit was SVN r22006.
2009-09-23 20:57:49 +00:00
Josh Hursey
c9bd045cff move {{{ess_env_ft_event_update_process_info}}} into SnapC {{{snapc_full_app_ft_event_update_process_info}}} where it should have been all along.
This commit was SVN r22004.
2009-09-23 18:29:13 +00:00
Josh Hursey
a6ee73156c Add a verbose debug options. And add some error prints in the ESS' ft_event code.
This commit was SVN r22003.
2009-09-23 17:05:49 +00:00
Josh Hursey
2769091261 Fix for the stalled scenario in which 'options' might be reset to NULL inadvertently.
Thanks to MTT for picking this up.

This commit was SVN r22002.
2009-09-23 13:26:48 +00:00
Ralph Castain
26bb6e8f79 Add a couple of non-orte multicast tests
This commit was SVN r22001.
2009-09-23 05:24:22 +00:00
Ralph Castain
dff0d01673 Yet another paffinity cleanup...sigh.
1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings

2. when you try to bind to more cores than we have, generate a not-enough-processors error message

3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it.

This commit was SVN r21996.
2009-09-22 18:44:53 +00:00
Josh Hursey
5406fdfb80 Add support for sending SIGSTOP the MPI job after the checkpoint is taken (uses a BLCR feature for the option).
This commit looks larger than it really is since it includes a fair amount of code cleanup.

The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. Basic use case below (note that the checkpoint generated is useable as usual if the stopped application is terminated).
{{{
shell 1) mpirun -np 2 -am ft-enable-cr my-app
... running ...

shell 2) ompi-checkpoint --stop -v MPIRUN_PID
[localhost:001300] [  0.00 /   0.20]                 Requested - ...
[localhost:001300] [  0.00 /   0.20]                   Pending - ...
[localhost:001300] [  0.01 /   0.21]                   Running - ...
[localhost:001300] [  1.01 /   1.22]                   Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt

shell 2) killall -CONT mpirun

... Application Continues execution in shell 1 ...
}}}

Other items in this commit are mostly cleanup that has been sitting off-trunk for too long:
 * Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that could be passed to the CRS. Currently only TERM and STOP, but this makes adding others ''much'' easier.
 * Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since it served a redundant purpose with the new options type.
 * Lay some basic ground work for some future features.

This commit was SVN r21995.

The following SVN revision numbers were found above:
  r20391 --> open-mpi/ompi@0704b98668
2009-09-22 18:26:12 +00:00
Ralph Castain
8da3aa8d5c Some (hopefully final!) adjustments and corrections to the paffinity support:
1. default -npersocket to force -bind-to-socket

2. if we cannot get a value for cores/socket, try using #logical cpus. otherwise, default to 1 core

3. add missing error message for not-enough-processors

4. since we no longer loop through orte_register_params twice, put the auto-detect of
   topology info in the rte_init for hnp and std_orted

5. fix bind-to-core, bysocket combination

This commit was SVN r21992.
2009-09-22 15:41:03 +00:00
Ralph Castain
12613352eb Add missing header file
This commit was SVN r21990.
2009-09-22 13:07:57 +00:00
Ralph Castain
2210989e2d Update the cm ess module to support orted bootstrap. Continue work towards bootstrap capability.
This commit was SVN r21989.
2009-09-22 02:16:40 +00:00
Ralph Castain
c3f9096fd9 Add a reliable multicast framework, with an initial basic module. This is configured out unless specifically requested via --enable-multicast.
This commit was SVN r21988.
2009-09-22 00:58:29 +00:00
Ralph Castain
82af6ee940 Update test
This commit was SVN r21987.
2009-09-22 00:55:02 +00:00
Terry Dontje
0ccf2d87b6 rename do-not-bind to bind-to-none and clean up an error message
This commit was SVN r21980.
2009-09-21 17:00:02 +00:00
Terry Dontje
13be2d2a00 correct mistype in odle should be odls call to orte_show_help
This commit was SVN r21979.
2009-09-21 13:22:37 +00:00
Ralph Castain
7138fd131f Final cleanup on new paffinity "if-avail" messages, plus fix one bug reported by Terry
This commit was SVN r21978.
2009-09-19 17:43:21 +00:00
Ralph Castain
2028017554 Modify the paffinity system to handle binding directives that are "soft" - i.e., when someone directs that we bind if the system supports it. This allows community members to distribute OMPI with default MCA param files that direct general binding policies, without having the distributed software fail if the system cannot support those policies.
The new options work by adding an ":if-avail" qualifier to the "bind-to-socket" and "bind-to-core" MCA params. If the system does not support this capability, the job will launch anyway. Without the qualifier, the job will abort with an error message indicating that the required functionality is not supported on this system.

This commit was SVN r21975.
2009-09-18 19:48:42 +00:00
Ralph Castain
98a4450df6 Fix the seq mapper by initializing the proc object to NULL before claiming a slot for it
This commit was SVN r21969.
2009-09-17 05:18:37 +00:00
Ralph Castain
ae31af7dec Enable monitoring if configured to do so. Update the sensor framework
This commit was SVN r21964.
2009-09-09 21:00:27 +00:00
Ralph Castain
5fb3d13c24 Cleanup some pointer array addressing
This commit was SVN r21963.
2009-09-09 20:59:17 +00:00
Ralph Castain
e554fc282d Add some diagnostic output when daemons die
This commit was SVN r21960.
2009-09-09 18:16:50 +00:00
Ralph Castain
c20d977a30 Report the allocate event, if requested
This commit was SVN r21959.
2009-09-09 17:47:58 +00:00
Ralph Castain
2688ad2c9f Ensure the odls_types are included when referencing the APIs
This commit was SVN r21958.
2009-09-09 17:47:13 +00:00
Ralph Castain
cb7f608006 Remove debug output
This commit was SVN r21957.
2009-09-09 17:46:28 +00:00
Ralph Castain
51b13b3d5c A few minor cleanups in where threads are unlocked.
Reset mpirun's exit code when we restart failed procs

This commit was SVN r21955.
2009-09-09 05:31:06 +00:00
Ralph Castain
8ae4b55d16 Enable a new command line option to --report-events that instructs mpirun to RML-report specific events during job life to the requestor.
This commit was SVN r21954.
2009-09-09 05:28:45 +00:00
Ralph Castain
c877b1a5f8 Silence a compiler warning about no format
This commit was SVN r21951.
2009-09-08 15:03:14 +00:00
Ralph Castain
81b8bc5b54 Silence a compiler warning about no format
This commit was SVN r21950.
2009-09-08 15:02:48 +00:00
Ralph Castain
142036f2c0 Issue an error message and abort if the user requests a number of processes that conflicts with nperxxx directives when evaluated against available resources
This commit was SVN r21949.
2009-09-07 03:36:10 +00:00
Ralph Castain
ca09e8f604 Minor modification required to allow opal_paffinity_alone to default to bind-to-core
This commit was SVN r21948.
2009-09-05 15:24:26 +00:00
Ralph Castain
17444243f7 Correct the bit mask to properly set the binding policy
This commit was SVN r21934.
2009-09-03 17:58:23 +00:00
Jeff Squyres
e1fe03ad44 Minor grammar fixes, and use "#" for separating lines, not blank lines.
This commit was SVN r21931.
2009-09-03 07:02:21 +00:00
Ralph Castain
0421a49844 Update the xml support to allow -xml-file foo whereby we redirect all xml formatted output (and ONLY xml formatted output) to a specified file
This commit was SVN r21930.
2009-09-02 18:03:10 +00:00
Ralph Castain
d3d34f8f15 Correct a bug in the assignment of node index value. Ensure we set the app number so that MPI attributes get set correctly.
This commit was SVN r21927.
2009-09-02 01:15:44 +00:00
Ralph Castain
50ca27c1c8 Ensure that procs launched natively by slurm do not mistakenly identify themselves as daemons to the system
This commit was SVN r21926.
2009-09-01 17:57:15 +00:00
Lenny Verkhovsky
2a594fec6c added help message to rankfile mapper when failed if using alias instead of full hostname
This commit was SVN r21919.
2009-09-01 11:17:32 +00:00
Ralph Castain
59645c5c8e Per direction from the slurm team, change the envar we look at to get our allocation
This commit was SVN r21915.
2009-08-30 15:57:27 +00:00
Ralph Castain
0394a4884d Setup cpus-per-proc and cpus-per-rank as synonyms, both in mca params and on mpirun cmd line
This commit was SVN r21914.
2009-08-30 14:30:36 +00:00
Ralph Castain
ef4cdeeb69 Fix round-robin mapping when bind-to-socket in cases where #procs > #sockets and #cores
This commit was SVN r21913.
2009-08-29 03:36:21 +00:00
Ralph Castain
433673c64f Report bindings in all cases, including external bindings and slot lists
This commit was SVN r21911.
2009-08-28 13:58:46 +00:00
Ralph Castain
3ef028ca23 Trap mpirun error messages in xml format
This commit was SVN r21910.
2009-08-28 02:46:15 +00:00
Ralph Castain
59f08dd2ff Support the combination of npersocket and bind-to-core
This commit was SVN r21909.
2009-08-28 02:31:26 +00:00
Shiqing Fan
fb777134cf Adjust the command string length.
This commit was SVN r21905.
2009-08-27 13:42:55 +00:00
Ralph Castain
2d27bc9824 Default npersocket to bind-to-socket unless otherwise directed
This commit was SVN r21904.
2009-08-27 13:21:14 +00:00
Shiqing Fan
bdb45b25c6 Skip a few more signal events on Windows for orted, for the signal events don't work for select on Windows.
This commit was SVN r21903.
2009-08-27 13:17:13 +00:00
Shiqing Fan
ffd55631bc Deal with the case when the prefix is NULL.
This commit was SVN r21902.
2009-08-27 13:11:18 +00:00
Ralph Castain
01ba0eaa47 Correctly handle npersocket corner cases
This commit was SVN r21901.
2009-08-27 11:25:48 +00:00
Ralph Castain
12d6595c41 Do not deprecate the rankfile mca param.
This commit was SVN r21900.
2009-08-27 10:40:51 +00:00
Shiqing Fan
4119497c5a Use select() for Windows events by default.
For some historical reasons, we didn't use select() for the Windows events. Now it could be merged back to have a better performance on Windows.

This commit was SVN r21899.
2009-08-27 08:11:56 +00:00
Shiqing Fan
6e57772ebd Remove a few redundant security functions, and use USERDOMAIN environment variable for the domain name.
This commit was SVN r21898.
2009-08-27 08:00:49 +00:00
Shiqing Fan
1b6db85988 Complete the support for building on UNC path.
This commit was SVN r21897.
2009-08-27 07:57:26 +00:00
Ralph Castain
5e710928a5 Revise the new binding system slightly:
1. finalize the logic for properly respecting externally assigned bindings. Thanks to Chris Samuel for his help with this. Still needs some acid testing, but appears to now work.

2. remove the double-logic of requiring opal_paffinity_alone AND bind-to-foo. If the user specifies bind-to-foo, trust her and just do it.

This commit was SVN r21885.
2009-08-26 02:01:49 +00:00
Ralph Castain
2016a3180b Silence compiler warnings about uninitialized variables
This commit was SVN r21883.
2009-08-26 01:56:39 +00:00
Ralph Castain
9ad33a4688 Silence compiler warning about uninitialized variable
This commit was SVN r21882.
2009-08-26 01:56:11 +00:00
Ralph Castain
0e528e994f Revert last commit - went to wrong repo!
Didn't we just have that happen the other day too? :-)

This commit was SVN r21878.
2009-08-25 13:06:14 +00:00
Ralph Castain
15d12b240b Sync to r21876
This commit was SVN r21877.

The following SVN revision numbers were found above:
  r21876 --> open-mpi/ompi@ef970293f0
2009-08-25 13:04:12 +00:00
Ralph Castain
0bd12e99ff Fix typo - we want to detect bindings, not set them in the mask.
Thanks to Chris Samuel for finding it!

This commit was SVN r21875.
2009-08-25 11:53:05 +00:00
Ralph Castain
509cc0553c When directly launched by an RM, flag that a process is operating without daemons - i.e., standalone. Provide an error string for the new socket_not_available error. Use errmgr.abort to exit when we cannot get a socket, and ensure that the slurmd module returns the proper exit status for slurm 2.0
This commit was SVN r21868.
2009-08-22 02:58:20 +00:00
Ralph Castain
7370235c3e Create a more specific error code for when specific sockets are not available. Ensure that slurm 2.0 gets the expected error return if the process can't start for that reason so it can take corrective action.
This commit was SVN r21867.
2009-08-21 21:28:15 +00:00
Ralph Castain
7183179f56 Provide native integration with SLURM 2.0's OMPI support
This commit was SVN r21865.
2009-08-21 18:03:34 +00:00
Ralph Castain
35f8b68de6 Note to self: save all changes before committing
This commit was SVN r21863.
2009-08-21 12:54:29 +00:00
Ralph Castain
535408d6c2 Answer a Jeff-ism and check malloc for NULL return - for all xml formatting errors, revert to at least showing the non-xml formatted message
This commit was SVN r21862.
2009-08-21 12:41:54 +00:00
Shiqing Fan
baa81a6525 Add a missing header to compile on Windows.
This commit was SVN r21861.
2009-08-21 07:20:21 +00:00
Ralph Castain
2e0bd04755 Ensure that show_help messages are properly xml formatted
This commit was SVN r21858.
2009-08-20 19:23:26 +00:00
Rainer Keller
8e1b23779f - Replace combinations of
#if defined (c_plusplus)
          defined (__cplusplus)
   followed by
      extern "C" {
   and the closing counterpart by BEGIN_C_DECLS and END_C_DECLS.

   Notable exceptions are:
    - opal/include/opal_config_bottom.h:
      This is our generated code, that itself defines BEGIN_C_DECL and
      END_C_DECL
    - ompi/mpi/cxx/mpicxx.h:
      Here we do not include opal_config_bottom.h:                                 
    - Belongs to external code:                                                    
      opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.c        
      opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.h        
    - opal/include/opal/prefetch.h:
      Has C++ specific macros that are protected:                                  

    - Had #if ... } #endif  _and_ END_C_DECLS (aka end up with 2x
      END_C_DECLS)
      ompi/mca/btl/openib/btl_openib.h
    - opal/event/event.h has #ifdef __cplusplus as BEGIN_C_DECLS...
    - opal/win32/ompi_process.h: had extern "C"\n {...
      opal/win32/ompi_process.h: dito
    - ompi/mca/btl/pcie/btl_pcie_lex.l: needed to add *_C_DECLS
      ompi/mpi/f90/test/align_c.c: dito
    - ompi/debuggers/msgq_interface.h: used #ifdef __cplusplus
    - ompi/mpi/f90/xml/common-C.xsl: Amend

   Tested on linux using --with-openib and --with-mx

   The following do not contain either opal_config.h, orte_config.h or
   ompi_config.h
   (but possibly other header files, that include one of the above):
      ompi/mca/bml/r2/bml_r2_ft.h
      ompi/mca/btl/gm/btl_gm_endpoint.h
      ompi/mca/btl/gm/btl_gm_proc.h
      ompi/mca/btl/mx/btl_mx_endpoint.h
      ompi/mca/btl/ofud/btl_ofud_endpoint.h
      ompi/mca/btl/ofud/btl_ofud_frag.h
      ompi/mca/btl/ofud/btl_ofud_proc.h
      ompi/mca/btl/openib/btl_openib_mca.h
      ompi/mca/btl/portals/btl_portals_endpoint.h
      ompi/mca/btl/portals/btl_portals_frag.h
      ompi/mca/btl/sctp/btl_sctp_endpoint.h
      ompi/mca/btl/sctp/btl_sctp_proc.h
      ompi/mca/btl/tcp/btl_tcp_endpoint.h
      ompi/mca/btl/tcp/btl_tcp_ft.h
      ompi/mca/btl/tcp/btl_tcp_proc.h
      ompi/mca/btl/template/btl_template_endpoint.h
      ompi/mca/btl/template/btl_template_proc.h
      ompi/mca/btl/udapl/btl_udapl_eager_rdma.h
      ompi/mca/btl/udapl/btl_udapl_endpoint.h
      ompi/mca/btl/udapl/btl_udapl_mca.h
      ompi/mca/btl/udapl/btl_udapl_proc.h
      ompi/mca/mtl/mx/mtl_mx_endpoint.h
      ompi/mca/mtl/mx/mtl_mx.h
      ompi/mca/mtl/psm/mtl_psm_endpoint.h
      ompi/mca/mtl/psm/mtl_psm.h
      ompi/mca/pml/cm/pml_cm_component.h
      ompi/mca/pml/csum/pml_csum_comm.h
      ompi/mca/pml/dr/pml_dr_comm.h
      ompi/mca/pml/dr/pml_dr_component.h
      ompi/mca/pml/dr/pml_dr_endpoint.h
      ompi/mca/pml/dr/pml_dr_recvfrag.h
      ompi/mca/pml/example/pml_example.h
      ompi/mca/pml/ob1/pml_ob1_comm.h
      ompi/mca/pml/ob1/pml_ob1_component.h
      ompi/mca/pml/ob1/pml_ob1_endpoint.h
      ompi/mca/pml/ob1/pml_ob1_rdmafrag.h
      ompi/mca/pml/ob1/pml_ob1_recvfrag.h
      ompi/mca/pml/v/pml_v_output.h
      opal/include/opal/prefetch.h
      opal/mca/timer/aix/timer_aix.h
      opal/util/qsort.h
      test/support/components.h

This commit was SVN r21855.

The following SVN revision numbers were found above:
  r2 --> open-mpi/ompi@58fdc18855
2009-08-20 11:42:18 +00:00
Rainer Keller
3f742fc35b - add missing #include
This commit was SVN r21854.
2009-08-20 11:20:53 +00:00
Rainer Keller
9a0b6ef71e - For now fixup our headers for known issue of Sun Studio 12.1 compiler
"...", line XXX: warning: linker scope was specified more than once:

This commit was SVN r21853.
2009-08-20 11:12:45 +00:00
Ralph Castain
40fc0b6367 Silence compiler warning
This commit was SVN r21850.
2009-08-20 04:57:23 +00:00
Ralph Castain
c3c642aa0d Add two new frameworks for sensing and predicting faults. This is just the bare-bones plumbing for now - will instantiate soon.
No ess modules reference these frameworks yet, so they are completely inactive.

This commit was SVN r21847.
2009-08-20 04:27:16 +00:00
Ralph Castain
646a3500a7 Correctly account for number of procs in the job
This commit was SVN r21843.
2009-08-20 00:07:38 +00:00
Ralph Castain
e66a0be796 First attempt at making OMPI respect external bindings. Detect any external bindings on the daemons, and use that to determine which sockets/cores to bind to.
I have no machine which allows me to do external binding, so I will have to ask others to test the new logic. However, I did verify that these changes don't break the existing logic when no external bindings were present.

This commit was SVN r21842.
2009-08-19 19:29:15 +00:00
Ralph Castain
d3313f732a Flush the mpirun xml tag output to maintain ordering
This commit was SVN r21836.
2009-08-18 21:20:55 +00:00
Ralph Castain
3796cc568a Add a --report-bindings cmd line option to mpirun
This commit was SVN r21829.
2009-08-18 17:10:23 +00:00
Ralph Castain
dbb3cbe3dd Fix a reported problem when specifying orte_launch_agent - if only one word was given, we inadvertently appended a "NULL" to the end of the cmd.
This commit was SVN r21827.
2009-08-18 14:57:34 +00:00
Ralph Castain
a489f3fc6c Surround the entire xml output from an mpirun with <mpirun> tags.
This commit was SVN r21826.
2009-08-18 03:15:29 +00:00
Ralph Castain
511fe5da8b Minor cleanup - get reported iters right in test
This commit was SVN r21819.
2009-08-14 03:33:59 +00:00
Ralph Castain
3f3b46495e Add some error checking to the tm launcher
This commit was SVN r21818.
2009-08-14 03:13:02 +00:00
Ralph Castain
9da6b46e7d Add several options to the sendrecv_blaster to make it more powerful
This commit was SVN r21817.
2009-08-14 03:12:43 +00:00
Ralph Castain
0005e6e834 Correct a couple of bugs in the rank_file mapper that were incorrectly assigning vpids.
Add a capability to parse the rankfile to extract node information in place of requiring both hostfile and rankfile for non-RM managed environments. The rankfile is -only- parsed for this IF the hostfile and -host options are not given. Otherwise, those are used to establish allocation info as we did before this commit.

This commit was SVN r21815.
2009-08-13 16:08:43 +00:00
Rainer Keller
cea3d68ef6 - Fix reference counting of daemons killed.
This commit was SVN r21810.
2009-08-12 14:04:50 +00:00
Shiqing Fan
bce2f44154 Update related .windows files with proper compiling properties, in order to have a successful DSO build.
This commit was SVN r21805.
2009-08-12 08:55:58 +00:00
Josh Hursey
843a61f7eb Add another missing header for FT that got lost in the shuffle yesterday.
This commit was SVN r21794.
2009-08-11 13:32:27 +00:00
Ralph Castain
1dc12046f1 Modify the OMPI paffinity and mapping system to support socket-level mapping and binding. Mostly refactors existing code, with modifications to the odls_default module to support the new capabilities.
Adds several new mpirun options:

* -bysocket - assign ranks on a node by socket. Effectively load balances the procs assigned to a node across the available sockets. Note that ranks can still be bound to a specific core within the socket, or to the entire socket - the mapping is independent of the binding.

* -bind-to-socket - bind each rank to all the cores on the socket to which they are assigned.

* -bind-to-core - currently the default behavior (maintained from prior default)

* -npersocket N - launch N procs for every socket on a node. Note that this implies we know how many sockets are on a node. Mpirun will determine its local values. These can be overridden by provided values, either via MCA param or in a hostfile

Similar features/options are provided at the board level for multi-board nodes.

Documentation to follow...

This commit was SVN r21791.
2009-08-11 02:51:27 +00:00
Ralph Castain
007cbe74f4 Include the sendrecv blaster in the tarball
This commit was SVN r21790.
2009-08-11 02:42:47 +00:00
Rainer Keller
76469ea64a - Change the property of a few files, that obviously
don't need to be svn:executable...

This commit was SVN r21786.
2009-08-11 01:40:00 +00:00
Rainer Keller
784b9b9f5b - Based and updated from Ken's patch: since CLE-2.1 does not offer
the BATCH_PARTITION_ID anymore, use the ras-alps-command.sh script to
   figure out the jobs ID to query from ALPS.

   Gracefully report errors, update the help file and parse the sysconfig file

This commit was SVN r21772.
2009-08-07 01:15:09 +00:00
Ralph Castain
c66a5a9504 Add another test that just blasts the system with MPI_Sendrecv to myself commands of varying sizes
This commit was SVN r21748.
2009-07-31 14:57:03 +00:00
George Bosilca
0bf381e931 This patch try to solve a issue on Leopard. The supposedly global
variables that are not initialized and are declared in a file that
doesn't export any globally visible function are marked as
non-initialized constants, i.e. uninitialized common symbols. For some
obscure reasons, they get removed from the object files on Mac OS X.

So far I found two solution to this problem. One require the addition
of "-c" to the linker command, the second one (corresponding to this
patch) force them to became a common initialized symbol.

This commit was SVN r21739.
2009-07-28 17:06:16 +00:00
Jeff Squyres
c7376ae053 Start using Libtool's shared library versioning scheme. See lengthy
note in VERSION file.

NOTE: the versions will ''always'' be 0:0:0 on the SVN trunk and
developer branches.  They will only have meaningful values (starting
with 0:0:0 in 1.3.4) on release branches.  Only RM's will modify these
values immediately preceeding a release.

This commit was SVN r21729.
2009-07-23 21:35:17 +00:00
Ralph Castain
55e7365e7a Jeff correctly pointed out that char values > 127 also don't print. Adjust the xml output to handle those too.
Thanks Jeff!

This commit was SVN r21727.
2009-07-22 13:28:27 +00:00
Shiqing Fan
3e24d3df70 An ORTE event fix for Windows, i.e. using socket pairs instead of pipes on Windows.
This commit was SVN r21726.
2009-07-22 07:39:52 +00:00
Ralph Castain
6c85d954f3 Use a conditioned wait to serialize launches when they come from multiple sources (e.g., an orte application that spawns multiple jobs).
This commit was SVN r21718.
2009-07-20 01:51:29 +00:00
Ralph Castain
1a5f7245c8 Create a new message handling method for serializing responses. Place recvd messages on a list, using a file descriptor and the event library to trigger processing. This is identical in design to what is used in the IOF.
Use it first in the plm_base_receive to serialize multiple comm_spawn and update_proc requests.

This commit was SVN r21717.
2009-07-19 18:07:04 +00:00
Ralph Castain
1d74ab6e3c Cleanup some pointer array addressing and ensure we always exit the function cleanly
This commit was SVN r21716.
2009-07-19 18:05:04 +00:00
Ralph Castain
c3ce908515 Shift a debug output to come at a better place
This commit was SVN r21715.
2009-07-19 17:56:48 +00:00
Ralph Castain
1a5a591424 Clarify some comments
This commit was SVN r21714.
2009-07-19 17:56:19 +00:00
Ralph Castain
43d532cfd3 Minor tweak to xml ouput
This commit was SVN r21713.
2009-07-19 17:45:01 +00:00
Ralph Castain
2f515c8357 Per request of the Eclipse team, further modify the xml output to "escape" all non-printable characters
This commit was SVN r21712.
2009-07-17 22:45:32 +00:00
Ralph Castain
210f591f1c Cleanup array addressing for opal_pointer_array
This commit was SVN r21710.
2009-07-17 22:20:30 +00:00
Ralph Castain
51a8b89a83 Treat termination of continuously operating processes as an abort
This commit was SVN r21709.
2009-07-17 22:20:05 +00:00
Ralph Castain
08e17b72cf Break a circular logic loop in the cm routed module.
This commit was SVN r21708.
2009-07-17 18:07:35 +00:00
Ralph Castain
ef20e778b3 Ensure that output ends on an appropriate suffix tag when --tag-output or --xml are selected.
When we read the input buffer, we don't always get a complete printf output - we sometimes end mid stream. We still need to add the suffix and a <CR> to keep the output working right.

This commit was SVN r21706.
2009-07-17 05:02:53 +00:00
Ralph Castain
4c1eb040b0 Enable the system to keep functioning even when multiple launches are occurring simultaneously.
This is a bit of a hack, but it does seem to allow the system to work. A better solution is being discussed.

This commit was SVN r21705.
2009-07-17 02:28:47 +00:00
Ralph Castain
c0e85a492c Deleted one too many lines...might be good to set the value of oldnode!
Thanks George.

This commit was SVN r21702.
2009-07-16 18:49:24 +00:00
George Bosilca
3e971e61f3 The system headers are supposed to be protected by #ifdef and not by #if.
This commit was SVN r21700.
2009-07-16 18:27:33 +00:00
George Bosilca
ed93b967f7 Remove some warnings about uninitialized values.
This commit was SVN r21695.
2009-07-16 17:38:09 +00:00
George Bosilca
52d013baae Add a missing header.
This commit was SVN r21694.
2009-07-16 17:21:37 +00:00
Ralph Castain
007d14f238 Add a threshold reporting level to the orte notifier framework. This takes a string value:
"critical" - any error at or above the critical severity will be reported (i.e., only critical errors)
"warning" - any error at or above the warning severity will be reported (i.e., warning and critical errors)
"notice" - pretty much everything will be reported

Default to "critical" to keep down the chatter.

Obviously, only places that call orte_notifier will be affected - all other error reporting (e.g., via opal_output calls) is unaffected.

This commit was SVN r21693.
2009-07-16 13:31:23 +00:00
Ralph Castain
ae6c36ae01 Ensure that jdata->num_procs is correct when the rank_file mapper is mapping more procs than are specified in the rank_file
This commit was SVN r21690.
2009-07-15 22:45:12 +00:00
George Bosilca
dc9370598f This looks more like the correct solution. We only pack the known information, so
we can now deal with partial mapping without segfaulting.

This commit was SVN r21688.
2009-07-15 20:06:45 +00:00
Ralph Castain
e75d9b8296 Use orte_notifier to alert sys admins to checksum violations in the csum pml.
Add ability to store the RM's jobid string to tag the notifier message so that the sys admin knows what job had the problem.

This commit was SVN r21687.
2009-07-15 19:43:26 +00:00
George Bosilca
d66632fdc9 Reorder the nidmap encoding function. Add a check to make sure we don't write
outside the boundaries of the allocated array.

However, the problem is still there. If we have rmaps file containing only
partial information the num_procs get set to the wrong value (the number of
hosts in the rmaps file instead of the number of processes requested on the
command line).

This commit was SVN r21686.
2009-07-15 19:36:53 +00:00
Ralph Castain
90a2db25e9 Modify the errmgr callback function so it passes the proc that failed instead of only the jobid.
Update the cm routed module to detect and pass orted failures.

This commit was SVN r21682.
2009-07-15 11:43:33 +00:00
Ralph Castain
247ba7e90d Use the base function to claim a slot when fault groups are not defined
This commit was SVN r21681.
2009-07-15 11:28:58 +00:00
Ralph Castain
7161b37c76 Ensure that the stdin channel is closed when we kill a local proc - all other channels will automatically be closed when the proc terminates
This commit was SVN r21680.
2009-07-15 11:28:19 +00:00
Josh Hursey
dcedc99a6a It seems that we do no really want to set this particular variable to NULL. (causes badness in ess)
This commit was SVN r21673.
2009-07-14 19:01:43 +00:00
Josh Hursey
b03923cd72 Make sure to set these values to NULL, so that if we release an object (knowingly or not) twice then we do not fault by referencing unallocated memory.
This commit was SVN r21672.
2009-07-14 18:56:49 +00:00
Ralph Castain
dbac602be5 Add support for the add-host and add-hostfile MPI Info keys to allow Comm_spawn users to add new hosts to those already known by mpirun.
Requires full testing once comm_spawn is fixed (Edgar is working that now).

This commit was SVN r21664.
2009-07-14 14:34:11 +00:00
Ralph Castain
60edbc7220 Fix hetero operations and comm_spawn (to a point).
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.

Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)

This commit was SVN r21655.
2009-07-13 20:03:41 +00:00
George Bosilca
a934b9d975 Add the Open MPI specific part based on a patch from Manuel. Add the
sparc and alpha. A manpage patch is also included. This partially fixes
ticket #1973.

This commit was SVN r21654.
2009-07-13 20:01:12 +00:00
Ralph Castain
1b418dd397 Fix segfault in comm_spawn. The underlying problem breaking comm_spawn, however, remains - the change to make modex non-blocking causes the system to fail due to the arch not getting properly set.
Fix for that coming shortly.

This commit was SVN r21646.
2009-07-13 15:13:06 +00:00
Ralph Castain
1f147cf9c6 Don't do an automatic "phone home" if a regex was given to the orted.
This commit was SVN r21645.
2009-07-13 14:50:01 +00:00
Ralph Castain
235db33e83 Some more pointer array addressing cleanup
This commit was SVN r21644.
2009-07-13 14:49:20 +00:00
Ralph Castain
b97f885c00 Restore the original API to terminate individual processes instead of the entire job. This was originally removed as we didn't at that time know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
Continue work on the resilient mapper, completing support for fault groups.

This commit was SVN r21639.
2009-07-13 02:29:17 +00:00
Ralph Castain
50bd635200 Also require that the routed framework be initialized before attempting to use orte_show_help
This commit was SVN r21638.
2009-07-12 10:50:14 +00:00
Ralph Castain
e30826c6e1 Quiet some compiler warnings
This commit was SVN r21591.
2009-07-02 17:48:36 +00:00
Shiqing Fan
0b56a8a4d5 Enable IPv6 on Windows by default, and fix two type casts for IPv6 operations.
This commit was SVN r21586.
2009-07-02 14:41:03 +00:00
Ralph Castain
dd5e195a7d Don't treat the HNP node entry separately - this was just a holdover from the days when we didn't have the regex generator.
Ensure we get an accurate count of the number of daemons in the system.

This commit was SVN r21582.
2009-07-01 20:46:05 +00:00
Ralph Castain
4adb3ed80f Print out a more meaningful and correct error message
This commit was SVN r21581.
2009-07-01 20:16:15 +00:00
Ralph Castain
f832352b45 Clean up some compiler warnings
This commit was SVN r21577.
2009-07-01 16:51:11 +00:00
Ralph Castain
bc0fe3c6da Add some more tests for parallel IO that have caused problems in the past.
Add a README that explains how to run the ziatest for launch timing

This commit was SVN r21576.
2009-07-01 14:47:14 +00:00
Ralph Castain
9635db373d Ensure that we properly exit if the executable isn't found
This commit was SVN r21570.
2009-07-01 03:16:13 +00:00
Ralph Castain
2fbdea0273 Add a test for loop over bcast
This commit was SVN r21560.
2009-06-29 17:06:19 +00:00
Josh Hursey
91c4869cdc Use {{{ORTE_*_VERSION}}} instead of {{{OMPI_*_VERSION}}} in orte
This commit was SVN r21558.
2009-06-29 12:53:23 +00:00
Lenny Verkhovsky
e03807a3d1 small patch to extend current rankfile syntax to be compliant with orte_hosts syntax
making it possible to claim relative hosts from the hostfile/scheduler
by using +n# hostname, where  0 <= # < np
ex:
cat ~/work/svn/hpc/dev/test/Rankfile/rankfile
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=2
rank 3=+n1 slot=1

This commit was SVN r21557.
2009-06-28 11:20:56 +00:00
Shiqing Fan
656ec00611 Remove the definition of _WIN32_DCOM. It only enables DCOM when _WIN32_WINNT is less then 0x400 (Win 95, 98), and we are supporting 0x502(Win XP) and above in Open MPI. Thanks George for pointing this out.
This commit was SVN r21555.
2009-06-27 23:36:25 +00:00
Ralph Castain
d0a5468deb Continue gradual cleanup of pointer array addressing.
This commit was SVN r21550.
2009-06-26 22:40:28 +00:00
Ralph Castain
8ccc47b152 Don't wakeup mpirun too early - even though we ordered an abort, we have to wait for the procs/daemons to properly terminate, and give the kill_procs command a chance to get out.
This commit was SVN r21549.
2009-06-26 22:40:00 +00:00
Ralph Castain
2b4f051b7f Cleanup some indexing bugs so that shared memory can function
This commit was SVN r21548.
2009-06-26 22:07:25 +00:00
Ralph Castain
0a92fe3739 Pull the HNP node from the right index
This commit was SVN r21547.
2009-06-26 21:43:09 +00:00
Ralph Castain
b96a71b62e Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected.
Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes.

This commit was SVN r21545.
2009-06-26 20:54:58 +00:00
George Bosilca
f06198a999 BAD grpcomm now has the ability to execute the modex offline. The MPI process
prepare the send buffer, and post the collective order to the local daemon. It
then register the callback and return fromthe modex exchange. It will only 
wait for this modex completion when the modex_recv is called. Meanwhile, the
daemon will do the allgather.

This commit was SVN r21543.
2009-06-26 20:32:31 +00:00
George Bosilca
6239e714a6 Extract the modex unpacking in it's own function.
This commit was SVN r21542.
2009-06-26 20:29:54 +00:00
George Bosilca
760df7fcb9 Get rid of the static buffer. Instead use directly the user supplied one.
This commit was SVN r21541.
2009-06-26 19:21:27 +00:00
Shiqing Fan
b0b11e8465 Clean up things properly.
This commit was SVN r21539.
2009-06-26 15:20:19 +00:00
Shiqing Fan
6ba25951b4 Add a separate function for reading remote registry, so that it could be easily reused.
Add two options for plm process module, i.e. remote_env_prefix for getting OMPI prefix on remote computer by reading its user environment variable (OPENMPI_HOME), and remote_reg_prefix is similar, but it reads the registry on the remote computer. Reading remote env prefix has a higher priority than reading reg prefix, so that user can use their own installation of OMPI.

This commit was SVN r21538.
2009-06-26 13:35:45 +00:00
Shiqing Fan
5b4b8f8899 Get rid of a bunch of compiler warnings on Windows.
This commit was SVN r21536.
2009-06-26 07:51:39 +00:00
Lenny Verkhovsky
efa800efea removed orphan files in rankfile mapper
This commit was SVN r21532.
2009-06-25 17:14:10 +00:00
Ralph Castain
863e57700e Cleanup use of pointer arrays - thanks to Lenny for pointing it out.
This commit was SVN r21529.
2009-06-25 14:08:36 +00:00
Josh Hursey
26363fcc8b Move the end timer to the end of the launch loop for the HNP. This matches the timing in the SLURM component, and makes the timing work when enabling the 'tree' based rsh launch.
This commit was SVN r21528.
2009-06-25 13:28:34 +00:00
Josh Hursey
8705c7edb1 Fix {{{orte_timing}}} for the RSH component of the PLM.
It never set the daemon launch counter before launching.
In the output for 'total job launch time' make the message match that of the SLURM PLM for easier parsing.

This commit was SVN r21527.
2009-06-25 12:37:58 +00:00
Shiqing Fan
07f7fe1a1b Add two CCP params, for user specifying stdout and stderr to files.
This commit was SVN r21521.
2009-06-25 09:03:50 +00:00
Ralph Castain
00fb79567f Shift some setup items from orterun to the ess/hnp module so that any HNP will perform them.
This commit was SVN r21520.
2009-06-25 03:00:53 +00:00
George Bosilca
fabebd140f If the daemon doesn't contain orted we're adding the prefix_dir and
bin_base twice. Create a temporary full_orted_cmd to cope with this
case.

This commit was SVN r21519.
2009-06-24 23:57:32 +00:00
George Bosilca
32f632ad5a Put back the multi-word command line argument for the RSH PLM.
This commit was SVN r21518.
2009-06-24 23:51:53 +00:00
George Bosilca
c0750343b5 Partially revert 21513. Beware of the exception on the orte_launch_agent which
is treated apart in orte_plm_base_setup_orted_cmd.

This commit was SVN r21517.
2009-06-24 23:48:14 +00:00
Ralph Castain
2e98ba3fd0 Complete implementation of regexp launch with static oob ports. Only enabled for SLURM at this time - migration to Torque coming
This commit was SVN r21516.
2009-06-24 20:31:26 +00:00
George Bosilca
53e76eed75 Missed an include.
This commit was SVN r21515.
2009-06-24 20:19:47 +00:00
George Bosilca
35033aeae9 Allow the PLM RSH to follow the routing table with the spawn, instead of the hard
coded binomial approach. Now, the PLM extract the children information from the
routed component, and the startup follow the routed topology. As a side effect,
we can now launch using linear or radix topologies, in addition to the previously
binomial topology.

This commit was SVN r21514.
2009-06-24 19:56:29 +00:00
George Bosilca
8cb8f28d9d When we get a report from an orted about its state, don't use the sender of
the message to update the structures, but instead use the information from
the URI. The reason is that even the launch report messages can get routed.

Deal with the orted_cmd_line in a single location.

This commit was SVN r21513.
2009-06-24 19:51:52 +00:00
George Bosilca
84a953a2a6 Be less verbose.
This commit was SVN r21512.
2009-06-24 19:49:06 +00:00
George Bosilca
07954c388a Reactivate the orte_daemon_spin option.
Reorder the sends with the contact info, first to our parent and then
to the HNP to avoid a possible race condition.

This commit was SVN r21511.
2009-06-24 19:30:34 +00:00
Shiqing Fan
6750657a97 Change the way that we detect cluster nodes status and clean up the CCP components.
Normally, any non-windows nodes should return  NodeStatus_Unreachable, but that's not true. We have noticed that Scientific Linux cluster nodes, which are in the same subnet as the Windows nodes, return NodeStatus_PendingApproval value, and the RAS CCP takes them as Windows nodes. So let's just use the "Ready" nodes.

This commit was SVN r21510.
2009-06-24 19:15:15 +00:00
Shiqing Fan
9bdf258d3c Add an initial support for remote process launch using WMI (Windows Management Instrumentation) in the environment where Windows 2003/2008 server and HPC pack are not installed. Users in a domain will be able to use domain computers as a cluster with their domain Credentials, the computers may run Windows XP or higher, e.g. Vista, and/or Windows 7 (later), with correct settings and permissions for WMI and Windows firewall.
This commit was SVN r21508.
2009-06-24 18:59:24 +00:00
Ralph Castain
51ee170f75 Update the pidmap decode logic to handle pidmap updates for restarted processes
This commit was SVN r21506.
2009-06-24 03:05:04 +00:00