Commit graph

2433 commits

Author SHA1 Message Date
Ralph Castain
888472f671 Do not release recv as the calling function needs that data and will release it later
This commit was SVN r24557.
2011-03-22 18:44:56 +00:00
Ralph Castain
30981de200 Minor cleanups courtesy of Nysal - thanks!
This commit was SVN r24552.
2011-03-22 13:48:58 +00:00
Ralph Castain
c1396b278c Resolve the rsh confusion by splitting the initial search for a launch agent from the actual setup of the launch agent values in the plm base globals. Have each aspiring rsh-clone call lookup to see if their desired launch agent is available - if not, then reject that plm component.
If so, then setup the actual launch agent values only when the module init function is called.

This resolves the current conflict between the rsh and rshd components. Hopefully, it may avoid future problems in this area -provided- any new uses of rsh-like launchers abide by the lookup-and-then-setup rule.

This commit was SVN r24550.
2011-03-22 02:23:09 +00:00
Ralph Castain
d17b50e1ff Add the appropriate hooks to tell Totalview to display the user's main program upon startup. Apparently, this hook got lost somewhere after the 1.2 series :-(
Thanks to David Turner and the TV folks for passing this along.

This commit was SVN r24549.
2011-03-21 17:40:58 +00:00
Ralph Castain
795ca2cff2 Complete implementation of the multicast-based grpcomm module
This commit was SVN r24548.
2011-03-20 01:18:06 +00:00
Eugene Loh
2770a12beb Continue clean up of thread options started in r22841, 22842, and 22849.
No need for any CMRs to 1.5... that was already done in CMR 2728.

This commit was SVN r24545.

The following SVN revision numbers were found above:
  r22841 --> open-mpi/ompi@b400b84162
2011-03-18 21:36:35 +00:00
Ralph Castain
ee68cd102c Fix the hier grpcomm module so modex results in correct data. The prior implementation stored the modex data as node-based attributes. This worked fine for BTL's such as openib where the interfaces were associated with the node. However, BTL's such as TCP have interfaces associated with a specific process, not a node. Thus, store the data in the modex database so it is correctly indexed.
This commit was SVN r24536.
2011-03-17 02:22:23 +00:00
Ralph Castain
d5dfe05521 Remove stale code associated with OPAL_THREADS_HAVE_DIFFERENT_PIDS. In the past, we have supported the case of really, really old Linux kernels where threads have different pids. However, when we updated the event library, we didn't also update that support code. In addition, when we dropped progress thread support, we didn't remove areas of the code that could no longer be compiled (i.e., were protected by "if progress thread && if have different pids").
There was no compelling reason to support such old kernels. Accordingly, convert the test to print a nice error message indicating we no longer support old kernels (but indicate that earlier OMPI versions do) and error out. Remove all code that was protected by "if have different pids" since it can no longer be compiled.

This commit was SVN r24531.
2011-03-15 21:05:03 +00:00
Ralph Castain
de092af8ef Add a little more debug
This commit was SVN r24526.
2011-03-14 18:43:49 +00:00
Ralph Castain
dc6f616599 Enable VM launch.
For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there.

Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework:

(a) a check when asked to map a job to see if it is the daemon job, and

(b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing.

In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner.

This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven.

This commit was SVN r24524.
2011-03-12 22:50:53 +00:00
Ralph Castain
80265b472e Avoid direct reference of pointer_array elements
This commit was SVN r24523.
2011-03-12 20:18:51 +00:00
Ralph Castain
df82e4cd36 Plug a memory leak
This commit was SVN r24521.
2011-03-12 15:37:33 +00:00
Ralph Castain
1297acde13 George raised some valid concerns about the extensibility of the revised rmaps framework. Address those by:
1. removing the enum of mapper values

2. change the req_mapper and last_mapper fields to char* so they can hold the component name instead of a mapper flag

3. revise the selection logic in the mapper components to reflect the change. Components now look for their name in the req_mapper field, or to see if other criteria (e.g., npernode) are set that mandate their doing the mapping

Several MCA params resided in the rmaps base for historical reasons - they have been in the base since at least the original 1.2 release (and perhaps earlier). However, George correctly pointed out that they really should reside in their respective components. Accordingly, move them to the components, but register synonyms to the old names to avoid breaking backward compatibility.

These revisions retain the current functionality of allowing comm_spawn'd jobs to use different mappers than the original job, and for the errmgr to utilize the resilient mapper to recover processes regardless of how they were originally mapped.

Given the large number of possible combinations, I am sure that someone will find a corner-case combination of values and selection criteria that causes either no mapper to be selected, or one other than the intended one to be used. No one can test all the ways people will use this system, so I expect debugging to continue for a while.

The ability of comm_spawn'd jobs to exploit this functionality relies on changes to the orte_dpm component - this will be committed separately.

This commit was SVN r24520.
2011-03-12 05:30:09 +00:00
Samuel Gutierrez
830c7c66dc fixes CID #1667
This commit was SVN r24518.
2011-03-12 03:09:01 +00:00
Ralph Castain
e6a76cc923 Fixes CID #1954
This commit was SVN r24516.
2011-03-11 23:00:27 +00:00
Samuel Gutierrez
2a2319d23a when orte_timing is enabled, always record daemon launch start time before starting the real work.
This commit was SVN r24513.
2011-03-11 00:09:23 +00:00
George Bosilca
7f34a28c8f Correct a comment.
This commit was SVN r24504.
2011-03-10 00:41:41 +00:00
George Bosilca
d2502b14f9 Destruct the OOB TCP internal objects.
This commit was SVN r24503.
2011-03-10 00:40:54 +00:00
Ralph Castain
3b4421d8e3 Separately track requested and last-used mapper so we don't lose that info
This commit was SVN r24502.
2011-03-09 18:51:36 +00:00
Jeff Squyres
06d5c59115 Fix a few valgrind-reported memory leaks
This commit was SVN r24498.
2011-03-08 17:37:28 +00:00
Jeff Squyres
0586612bd5 Fix another minor memory leak
This commit was SVN r24495.
2011-03-08 15:46:13 +00:00
Ralph Castain
63f38e38bb Fix ompi-server: remove extra command flag in buffer being sent to mpirun, ensure that tools route messages thru a remote HNP
This commit was SVN r24491.
2011-03-05 17:12:46 +00:00
George Bosilca
9bbe00bdc3 Set the return code from the processes upstream.
This commit was SVN r24483.
2011-03-03 00:02:21 +00:00
George Bosilca
c6a5f9706a Thomas's patch: Assume we won't fail unless notified by a child.
This commit was SVN r24482.
2011-03-02 23:50:01 +00:00
Josh Hursey
62bba1bf12 Name the enum so that it shows up as an actual symbol in gdb, instead of just a number.
This commit was SVN r24472.
2011-03-01 21:00:03 +00:00
Nysal Jan
4030111478 Add missing copyright and fix the year
This commit was SVN r24446.
2011-02-23 15:52:06 +00:00
Nysal Jan
42a73bb887 POE is supported on both AIX and Linux. Build POE PLM only if we find the poe binary. Fix hostfile creation and POE command line arguments.
This commit was SVN r24444.
2011-02-23 15:38:41 +00:00
Ralph Castain
f014284f91 Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance.
This commit was SVN r24417.
2011-02-20 18:46:21 +00:00
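The node-selection rule described in f014284f91 can be pictured with a short C sketch. It is only an illustration under stated assumptions: the `node_t` record, the `hosts_peer` flag, and the `pick_restart_node` function are hypothetical names, not the resilient mapper's actual data structures, and "prior node" is taken here to mean the node the proc was last running on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node record, for illustration only. */
typedef struct {
    const char *name;
    bool hosts_peer;   /* a peer of the recovering proc already runs here */
} node_t;

/*
 * Pick a node for a recovering proc.
 * The search starts at the node after the prior one and wraps around,
 * so the prior node itself is never chosen (no ricochet), and nodes
 * that already host a peer are skipped.
 * Returns the index of the chosen node, or -1 if no node qualifies.
 */
int pick_restart_node(const node_t *nodes, size_t n_nodes, size_t prior)
{
    for (size_t step = 1; step < n_nodes; step++) {
        size_t idx = (prior + step) % n_nodes;
        if (!nodes[idx].hosts_peer) {
            return (int)idx;
        }
    }
    return -1;
}
```

Walking the list from the prior node onward, rather than load balancing, matches the "move failed procs to next on list" wording of 7b35ada7fc; a real mapper would presumably also weigh slot counts and node health.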
Ralph Castain
a8cf19a7bc Ensure heartbeat only started once and only for daemon job
This commit was SVN r24416.
2011-02-18 20:33:54 +00:00
Ralph Castain
ef56e6d78b Helps to move the pointer
This commit was SVN r24414.
2011-02-18 14:01:25 +00:00
Ralph Castain
7b35ada7fc Fix ricochet effect - move failed procs to next on list instead of loadbalancing
This commit was SVN r24413.
2011-02-18 13:11:55 +00:00
Ralph Castain
b98a2917ff Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it.
This commit was SVN r24412.
2011-02-18 02:48:12 +00:00
Ralph Castain
51cf0a16c3 Some minor cleanups to support VM and CM operations
This commit was SVN r24408.
2011-02-16 23:03:08 +00:00
Ralph Castain
9b48c07599 CM daemons handle their own output
This commit was SVN r24407.
2011-02-16 23:02:23 +00:00
Ralph Castain
65ba6af44d Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM.
Have each mapper flag it did the map so we can see who did it later.

Ensure procs are flagged as "ready to launch".

This commit was SVN r24406.
2011-02-16 23:01:57 +00:00
Jeff Squyres
3f4d4886f2 Minor update for something that has been bugging me for quite a while:
OMPI supports multiple different repository systems (SVN, hg, git).
But the VERSION file has listed "want_svn" and "svn_r" as fields, even
though the actual repo system and version may not be SVN.

So search/replace those fields (and derivative values that come from
those fields) with "want_repo_rev" and "repo_rev", respectively.

This commit was SVN r24405.
2011-02-16 22:53:23 +00:00
Ralph Castain
a32a7d9a82 Update heartbeat system
This commit was SVN r24404.
2011-02-16 18:50:51 +00:00
Ralph Castain
9b38525d1e Remove unused include files
This commit was SVN r24394.
2011-02-16 00:32:47 +00:00
Ralph Castain
5120e6aec3 Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param.
When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests.

Also remove the stale and never completed topo mapper.

This commit was SVN r24393.
2011-02-15 23:24:31 +00:00
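The priority-ordered selection described in 5120e6aec3 can be sketched in C as follows. This is a minimal illustration, not the rmaps framework's API: the `job_t` and `mapper_t` types, the `MAP_TAKE_NEXT` return code, the two mapper functions, and the priority values are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes: a mapper either maps the job or passes. */
enum { MAP_SUCCESS = 0, MAP_TAKE_NEXT = 1, MAP_NO_MAPPER = -1 };

/* Hypothetical job description; npernode > 0 means "N procs per node". */
typedef struct {
    int npernode;
    const char *mapped_by;   /* name of the mapper that did the map */
} job_t;

typedef struct {
    const char *name;
    int priority;            /* higher value is offered the job first */
    int (*map_job)(job_t *job);
} mapper_t;

/* Round-robin passes on npernode requests, which it cannot map. */
static int round_robin_map(job_t *job)
{
    if (job->npernode > 0) {
        return MAP_TAKE_NEXT;
    }
    job->mapped_by = "round_robin";
    return MAP_SUCCESS;
}

/* Load-balance accepts only npernode requests. */
static int load_balance_map(job_t *job)
{
    if (job->npernode <= 0) {
        return MAP_TAKE_NEXT;
    }
    job->mapped_by = "load_balance";
    return MAP_SUCCESS;
}

/* Active mappers, listed in descending priority order. */
static const mapper_t mappers[] = {
    { "round_robin",  100, round_robin_map },
    { "load_balance",  80, load_balance_map },
};

/* Offer the job to each mapper in turn until one does not pass. */
int map_job(job_t *job)
{
    assert(job != NULL);
    for (size_t i = 0; i < sizeof(mappers) / sizeof(mappers[0]); i++) {
        int rc = mappers[i].map_job(job);
        if (rc != MAP_TAKE_NEXT) {
            return rc;
        }
    }
    return MAP_NO_MAPPER;
}
```

With this layout, a job with `npernode` set falls through the round-robin entry and is picked up by the load-balance entry, while a plain job is mapped by the first, highest-priority module.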
Ralph Castain
c1da94a444 Dead children have no pid
This commit was SVN r24387.
2011-02-15 13:30:51 +00:00
Ralph Castain
a3607ff35d Make it easier to send a kill-local-procs command for an arbitrary number of procs
This commit was SVN r24386.
2011-02-15 13:26:11 +00:00
Ralph Castain
bf1cff3711 Plug a couple of additional memory leaks - try to highlight a little better that strings returned from reg_string_name must be freed by caller
This commit was SVN r24383.
2011-02-14 20:58:22 +00:00
Ralph Castain
a9dca25ca5 Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do.
Cleanup the heartbeat handling so it is associated with the proc, not a node.

Cleanup handling of recovery options so that defaults do not override user values iff they are provided.

This commit was SVN r24382.
2011-02-14 20:49:12 +00:00
Ralph Castain
b5de068533 Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address.
Add some new proc/job states

Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed.

Add pretty-print of signals for "proc aborted due to signal" reports.

This commit was SVN r24378.

The following SVN revision numbers were found above:
  r24371 --> open-mpi/ompi@93d28a5792
2011-02-14 19:29:09 +00:00
Abhishek Kulkarni
93d28a5792 Change opal_err2str_fn_t to return the error string as an argument.
This means that the converters (opal_err2str, orte_err2str) can now
return NULL as a "silent error". The return value of opal_err2str_fn_t
is the status of the operation (OPAL_SUCCESS or OPAL_ERROR).

This fixes the "Unknown error" message issues on the trunk.

This commit was SVN r24371.
2011-02-13 16:09:17 +00:00
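The calling convention described in 93d28a5792 (status as the return value, error text as an output argument, NULL text meaning "silent") can be sketched in C. The names below (`err2str_fn_t`, `demo_err2str`, `demo_strerror`, and the `DEMO_*` constants) are hypothetical stand-ins; the log does not show the real OPAL signature, so treat this as an assumption-laden illustration.

```c
#include <stddef.h>

/* Hypothetical status codes and error values, for illustration only. */
enum { DEMO_SUCCESS = 0, DEMO_ERROR = -1 };
enum { DEMO_ERR_TIMEOUT = -15, DEMO_ERR_SILENT = -43 };

/* Converter: the return value is the status, the text is an out-param. */
typedef int (*err2str_fn_t)(int errnum, const char **errmsg);

static int demo_err2str(int errnum, const char **errmsg)
{
    switch (errnum) {
    case DEMO_ERR_TIMEOUT:
        *errmsg = "Timeout";
        return DEMO_SUCCESS;
    case DEMO_ERR_SILENT:
        *errmsg = NULL;        /* known code, deliberately no text */
        return DEMO_SUCCESS;
    default:
        *errmsg = NULL;
        return DEMO_ERROR;     /* this converter does not know the code */
    }
}

/* Returns the text to print, or NULL when nothing should be printed. */
const char *demo_strerror(err2str_fn_t conv, int errnum)
{
    const char *msg = NULL;
    if (conv(errnum, &msg) != DEMO_SUCCESS) {
        return "Unknown error";
    }
    return msg;                /* may be NULL: a "silent" error */
}
```

Separating the status from the string lets a caller tell "no converter recognized this code" apart from "recognized, but print nothing", which a single returned pointer cannot express.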
Ralph Castain
33b68132cc Update the rmcast framework
This commit was SVN r24370.
2011-02-12 16:52:03 +00:00
Josh Hursey
a9335ea423 Make sure to initialize the 'update_state' function for the default module.
This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB.

This commit was SVN r24365.
2011-02-08 20:42:32 +00:00
Nysal Jan
3a8d251daa vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog
This commit was SVN r24339.
2011-02-02 10:05:57 +00:00
Josh Hursey
fa3f6485d8 Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around.
This commit was SVN r24326.
2011-01-31 19:09:47 +00:00
Josh Hursey
5b58ff0663 Fix a C/R checkpoint->restart->checkpoint->restart case.
The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint.

This seems to be fixed now. Thanks to Alex Brick for the bug report.

This commit was SVN r24325.
2011-01-28 21:25:14 +00:00
Josh Hursey
8ec85c6b8f Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally.
I want to thank Hugo Meyer for reporting this/these bugs.

Notes:
 * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation).
 * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function.
 * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery).
 * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup.
 * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. Printing both can be misleading.

This commit was SVN r24317.
2011-01-27 20:40:23 +00:00
Josh Hursey
81fd41f811 Return an informative error message if the user requests a migration of a job that is not capable of it.
C/R Functionality cleanup

This commit was SVN r24307.
2011-01-26 15:36:34 +00:00
Josh Hursey
8f45fcb429 More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now.
* Fix the checkpoint-restart-checkpoint case, which would previously reject the checkpoint of the newly restarted process. Re-enabling checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation).
 * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer.
 * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process.
 * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function orte_errmgr_base_update_state() that will prevent the ompi-checkpoint command from segfaulting in this error scenario.

This commit was SVN r24306.
2011-01-26 14:56:35 +00:00
Nathan Hjelm
8a3179cdcb removed c99 test code
This commit was SVN r24297.
2011-01-25 23:02:35 +00:00
Nathan Hjelm
e2126512a9 test c99 struct initialization with mtt. remove on jan 20, 2011
This commit was SVN r24271.
2011-01-19 22:21:21 +00:00
Abhishek Kulkarni
fd7ef7a1f1 Fixes broken trunk compile: call process status notify
only when ft-enable-cr is selected.

This commit was SVN r24255.
2011-01-14 18:37:07 +00:00
Abhishek Kulkarni
87d2c9b31d A few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/)
* Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload.
* Add notifier calls for C/R events and process status events in SnapC and ErrMgr components.
* Fix a bug where the SnapC states and process states collide before being thrown out over the notifier.

This commit was SVN r24251.
2011-01-13 20:13:49 +00:00
Ralph Castain
b09f57b03d Update the multicast subsystem - ported from Cisco branch
This commit was SVN r24246.
2011-01-13 01:54:05 +00:00
Terry Dontje
f3aaa885a3 corrected a couple places in orte where it said cpu_model when it should have been cpu_type.
This commit was SVN r24221.
2011-01-11 19:56:26 +00:00
Abhishek Kulkarni
11ffa854ff Update the FTB notifier
* fix indentation issues
 * update the name of one of the fault events published to the FTB (per the FTB MPI standard)

This commit was SVN r24213.
2011-01-10 18:58:31 +00:00
Nathan Hjelm
c082d05ecb Reset the timer on MPIR_being_debugged only if MPIR_being_debugged is not set. Fix typo in return code.
This commit was SVN r24187.
2010-12-20 21:00:49 +00:00
Ralph Castain
2dc5cbb483 Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago.
This commit was SVN r24153.
2010-12-05 15:58:21 +00:00
Rolf vandeVaart
b67d3398da It is convention to have orte_config.h included at top of file.
This commit was SVN r24146.
2010-12-03 16:13:31 +00:00
Shiqing Fan
f43862420c Convert the bad dos line endings to unix style for all windows related files.
This commit was SVN r24137.
2010-12-02 12:08:08 +00:00
Ralph Castain
aaad8ae891 Remove unused var
This commit was SVN r24136.
2010-12-02 02:38:13 +00:00
Ralph Castain
f9ffff59f8 Ensure clean termination of threads and tcp multicast
This commit was SVN r24134.
2010-12-02 00:23:42 +00:00
Nathan Hjelm
75605faa75 added support for reattaching a debugger using the MPIR_attach_fifo
This commit was SVN r24132.
2010-12-01 20:13:58 +00:00
Ralph Castain
ad814f26cd One more time, into the breach!
Restore the use of override_oversubscribe to indicate that the data source for resources on the backend nodes used in mapping is unreliable. In this situation (e.g., data came from hostfile, or we are just using localhost because nothing was provided), we don't trust the oversubscribe condition passed by the mapper. Instead, we check locally to ensure we set sched_yield correctly.

This commit was SVN r24130.
2010-12-01 15:15:26 +00:00
Ralph Castain
eba65e97f3 Extend the rmcast APIs to allow enable/disable of comm, required for clean termination by upper layer users.
Point the recv thread event base to the right place so it can wakeup when required.

Add a new error code for "comm disabled" when attempting to communicate after disabling comm.

This commit was SVN r24129.
2010-12-01 13:41:19 +00:00
Ralph Castain
9224302c10 Remove debug
This commit was SVN r24128.
2010-12-01 13:12:24 +00:00
Ralph Castain
30c37ea536 Ensure that the oversubscribed condition of nodes is accurately reported by the mapper, and that the results are communicated and used by the backend orteds when setting sched_yield on local procs. Restores prior behavior that was somehow lost along the way.
Includes a patch from Damien Guinier to fix vpid assignments when cpus-per-task is specified.

This commit was SVN r24126.
2010-12-01 12:51:39 +00:00
Ralph Castain
85a974b0de Better check for NULL before using the value
This commit was SVN r24122.
2010-12-01 04:48:50 +00:00
Ralph Castain
c56185887b Change the event base "wakeup" support to enable the passing of events to the central thread for add/del. Add a macro OPAL_UPDATE_EVBASE for this purpose as it will likely be widely used.
Update the ORTE thread support to utilize this capability. Update the rmcast framework to track the change.

This commit was SVN r24121.
2010-12-01 04:26:43 +00:00
Ralph Castain
0441e81882 Oops - ensure that multicast msgs get circulated properly with the tcp module
This commit was SVN r24118.
2010-11-30 21:13:53 +00:00
Ralph Castain
d20c023348 Checkpoint the threading support for multicast - will be revised shortly, but this version currently works.
This commit was SVN r24117.
2010-11-30 17:30:16 +00:00
Ralph Castain
71669720a3 Just get the output once on sigpipe error, and include the fd
This commit was SVN r24092.
2010-11-25 15:32:48 +00:00
Ralph Castain
30c635fd4d Don't endlessly output sigpipe errors. Count the number of times we trap it, and abort if we get more than 10 of them.
This commit was SVN r24091.
2010-11-25 15:25:24 +00:00
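The "count and give up" behavior described in 30c635fd4d amounts to a bounded counter, sketched below in C. The function name `sigpipe_trapped` and the constant `MAX_SIGPIPE_TRAPS` are hypothetical; only the threshold of 10 comes from the commit message, and how the real code traps the error is not shown in this log.

```c
#include <stdbool.h>

/* Threshold taken from the commit message: abort after more than 10. */
#define MAX_SIGPIPE_TRAPS 10

static int sigpipe_count = 0;

/*
 * Call once each time a SIGPIPE error is trapped.
 * Returns true when the caller should stop reporting and abort.
 */
bool sigpipe_trapped(void)
{
    sigpipe_count++;
    return sigpipe_count > MAX_SIGPIPE_TRAPS;
}
```

The first ten calls return false, so the error can still be reported; the eleventh returns true, which is the point at which the caller would abort instead of printing again.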
Rolf vandeVaart
09fdd5cc23 Include fcntl.h, not sys/fcntl.h so we get the definition
of the open system call.  That is what man page says to do.
Fixes warning on Solaris.

This commit was SVN r24073.
2010-11-19 17:40:02 +00:00
Ethan Mallove
66f2301170 Just plain "Grid Engine" instead of "Sun Grid Engine"
This commit was SVN r24068.
2010-11-18 19:30:04 +00:00
Abhishek Kulkarni
78a67654d4 add notifier events for process migration
This commit was SVN r24058.
2010-11-16 17:57:44 +00:00
Abhishek Kulkarni
6e6ccae082 Update the checkpoint notification events that we throw out over the FTB with a payload embedded in {}
This commit was SVN r24057.
2010-11-16 17:55:57 +00:00
Jeff Squyres
e4744b4ed5 Per http://www.open-mpi.org/community/lists/devel/2010/11/8671.php,
change a bunch of OMPI_<foo> names to OPAL_<foo>.

This commit was SVN r24046.
2010-11-12 23:22:11 +00:00
Ralph Castain
bb521c6b7e Properly count local procs to set oversubscribed condition
This commit was SVN r24037.
2010-11-10 21:59:35 +00:00
Ralph Castain
021bd77bf1 Don't free the event base if we aren't using progress threads
This commit was SVN r24036.
2010-11-10 21:58:58 +00:00
Ralph Castain
57257ab9b4 Use the right event base if threads are disabled. Always update the seq num
This commit was SVN r24034.
2010-11-10 21:26:04 +00:00
Ralph Castain
cbb758c4fb Allow mcast threads to be disabled
This commit was SVN r24032.
2010-11-10 20:16:41 +00:00
Ralph Castain
22e40d92a0 Cleanup thread termination
This commit was SVN r24031.
2010-11-10 19:36:44 +00:00
Ralph Castain
01347926d1 Be a little more thorough about cleaning up during finalize
This commit was SVN r24014.
2010-11-09 14:56:27 +00:00
Shiqing Fan
d3701ccba8 type casts.
This commit was SVN r24013.
2010-11-09 09:17:22 +00:00
Ralph Castain
f2f41d1ca9 Be nice to those who don't enable-multicast...poor wretches.
This commit was SVN r24011.
2010-11-09 05:08:55 +00:00
Ralph Castain
a47b33678b Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi.
Use threads to help process multicast messages.

This commit was SVN r24009.
2010-11-08 19:09:23 +00:00
Ralph Castain
bf665692c3 Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system.
This commit was SVN r24004.
2010-11-07 23:29:52 +00:00
Abhishek Kulkarni
e0660101d3 Throw notifier events for checkpointing status (success or failure)
This commit was SVN r24003.
2010-11-07 22:12:09 +00:00
Abhishek Kulkarni
8cd3759f21 use the saved value of PID, saving some calls to getpid()
This commit was SVN r24002.
2010-11-07 22:09:49 +00:00
Abhishek Kulkarni
d1a4cc33dd Update the FTB notifier wrt events decided by the CIFTS working group
This commit was SVN r24001.
2010-11-07 22:06:32 +00:00
Abhishek Kulkarni
ac2768ca7c LOG_SYSLOG is a syslog facility. take it off the syslog options
This commit was SVN r24000.
2010-11-06 22:05:45 +00:00
Ralph Castain
875a6d61a4 Return correct status code
This commit was SVN r23969.
2010-10-29 00:43:50 +00:00
Ralph Castain
9ea2b196ce Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions.
Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component.

Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work.

This commit was SVN r23966.
2010-10-28 15:22:46 +00:00
Ralph Castain
c13b0bb668 Update some debugger attachment code per LLNL request
This commit was SVN r23965.
2010-10-28 03:06:20 +00:00
Brian Barrett
3ed00ba148 More fixes to make OMPI compile with minimal ORTE support again
This commit was SVN r23962.
2010-10-27 20:40:39 +00:00
Shiqing Fan
199df1eadf Rename a few var names.
This commit was SVN r23959.
2010-10-27 11:52:57 +00:00
Nathan Hjelm
e7bfbe1d1a added missing object initialization/destruction of mca_oob_tcp_component.tcp_listen_thread_event
This commit was SVN r23958.
2010-10-26 22:09:37 +00:00
Shiqing Fan
a3d9c91ff7 Exclude stdbool.h for Windows, and use the definition in opal. Migrate the socket pair support from libevent. Fix other minor things and make it compile.
This commit was SVN r23951.
2010-10-26 14:53:50 +00:00
Ralph Castain
894230b121 This stuff is soooo out-of-date that a complete rewrite would be required - thankfully, nobody cares
This commit was SVN r23944.
2010-10-26 06:22:31 +00:00
Ralph Castain
86c7365e8e Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem.
Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution.

This commit was SVN r23943.
2010-10-26 02:41:42 +00:00
Ralph Castain
fc46dfa78a Remove stale code
This commit was SVN r23942.
2010-10-26 02:37:56 +00:00
George Bosilca
5882290cdd We need a default value or the compiler will whine.
This commit was SVN r23940.
2010-10-25 19:05:45 +00:00
Abhishek Kulkarni
c671ec52d1 Fix broken trunk compile after the libevent changes.
This commit was SVN r23929.
2010-10-25 14:11:48 +00:00
Ralph Castain
fceabb2498 Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.

Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.

Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.

I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:

1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)

2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.

There is also the issue of the new libevent "no recursion" behavior. As I described in a recent email, we will have to discuss this and figure out what, if anything, we need to do.

This commit was SVN r23925.
2010-10-24 18:35:54 +00:00
Ralph Castain
8a37b3071a Cleanup mpirx co-spawn of debugger daemons. Resolve block-until-both-ends-are-opened behavior for fifos. Add some debugging output. Make the fifo event be non-persistent and read the value in the fifo to avoid spinning on the event.
This commit was SVN r23922.
2010-10-22 20:46:07 +00:00
Ralph Castain
2c1a658232 Fix debugger attach
This commit was SVN r23921.
2010-10-22 20:07:24 +00:00
Ralph Castain
1e93437cd4 To help with debugging, add a new mca param that instructs ORTE_ERROR_LOG to output "silent" errors. Helps to track down silent errors that don't have an associated error message (e.g., via show_help).
This commit was SVN r23893.
2010-10-16 03:29:47 +00:00
Brian Barrett
9febaa475e * Add shell of functionality required for supporting Portals4
* Update places where orte-free builds have failed

This commit was SVN r23891.
2010-10-14 22:49:09 +00:00
Ralph Castain
37f566bf1e Cancel recvs when finalizing
This commit was SVN r23871.
2010-10-07 22:02:12 +00:00
Jeff Squyres
eaab8d0062 * Ensure to set paffinity_enabled in all cases
* Ensure to set the mask value before we use it

This commit was SVN r23861.
2010-10-07 15:48:49 +00:00
Ralph Castain
92d69effc9 Read the data from the fifo to clear the event so it doesn't immediately re-trigger
This commit was SVN r23856.
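
To illustrate the behavior this fix addresses, here is a minimal sketch in Python (not the ORTE code; the helper names `is_readable` and `drain` are hypothetical). It assumes a readiness-style event: a pipe or fifo keeps reporting "readable" for as long as unread bytes remain, so a handler that never reads the data would be called again immediately.

```python
import os
import select


def is_readable(fd):
    """Return True if a read on fd would not block right now (zero-timeout poll)."""
    ready, _, _ = select.select([fd], [], [], 0)
    return bool(ready)


def drain(fd):
    """Read and discard everything currently buffered on fd; return the byte count."""
    os.set_blocking(fd, False)  # so an empty pipe raises instead of blocking
    total = 0
    while True:
        try:
            data = os.read(fd, 4096)
        except BlockingIOError:
            break  # nothing left to read, writer still open
        if not data:
            break  # writer closed: end of file
        total += len(data)
    return total
```

Polling the descriptor twice without reading reports it ready both times; only after `drain()` does the readiness go away.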
2010-10-07 02:41:02 +00:00
Ralph Castain
871b685e89 Ensure all debugger interface symbols are present in orterun
This commit was SVN r23823.
2010-09-30 21:36:00 +00:00
Josh Hursey
c8692198a2 Fix an issue migrating/autorecovering processes that mask SIGTERM using the C/R functionality.
I did not want to make this change globally since there could be good reason to keep the check before calling SIGKILL that I am not seeing at the moment.

This commit was SVN r23821.
2010-09-30 20:55:12 +00:00
Ralph Castain
dd959f5ab6 Silence an idiotic warning
This commit was SVN r23819.
2010-09-30 17:54:13 +00:00
Jeff Squyres
73bcc4a36b Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers
This commit was SVN r23801.

The following SVN revision numbers were found above:
  r23764 --> open-mpi/ompi@40a2bfa238
2010-09-24 22:53:28 +00:00
Ralph Castain
3631e4e936 Revert remaining svn kruft from r23764
This commit was SVN r23786.

The following SVN revision numbers were found above:
  r23764 --> open-mpi/ompi@40a2bfa238
2010-09-22 01:11:40 +00:00
Ralph Castain
40a2bfa238 WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues.
This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change.

Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation.

This commit was SVN r23764.
2010-09-17 23:04:06 +00:00
Abhishek Kulkarni
a143622b54 Remove unused code. Notifier events are aggregated on a per-event basis by the HNP notifier.
This commit was SVN r23711.
2010-09-02 16:00:22 +00:00
Ralph Castain
f75437f5a3 Add the ability to receive notifier output when job completes. Set the notification level to INFO for normal job completion, and to ALERT for abnormal termination.
This commit was SVN r23710.
2010-09-02 14:42:41 +00:00
Ralph Castain
b982f908e8 Fixed some newly-induced warnings
This commit was SVN r23694.
2010-08-31 14:51:19 +00:00
Rainer Keller
97511912ec - Fix up several functions that cannot return
- Add one instance where we do not use a parameter in a function
- Fix a buglet in commit r23689, where the attribute-for-function ptrs
   was applied.

This commit was SVN r23690.

The following SVN revision numbers were found above:
  r23689 --> open-mpi/ompi@5eb571c458
2010-08-31 12:21:13 +00:00
Rainer Keller
5eb571c458 - As suggested in CMR #2558, attribute-macros should be
tested on function pointers and assigned accordingly,
   instead of using the pre-processor in the header files.

   A functional change is (re-) specifying __opal_attribute_noreturn__
   on orte_errmgr_base_abort(): All modules in the errmgr framework
   either use this function, or define their own abort function,
   which sets __opal_attribute_noreturn__.
   This attribute was taken out with the errmgr overhaul in r22872.

This commit was SVN r23689.

The following SVN revision numbers were found above:
  r22872 --> open-mpi/ompi@e4f2d03d28
2010-08-31 10:28:51 +00:00
Ralph Castain
b81358815c Add some debug
This commit was SVN r23686.
2010-08-29 13:45:10 +00:00
Ralph Castain
554aede041 Fix a situation where we were unlocking a thread that isn't locked for the main launch - it is only used for dynamic spawns.
This commit was SVN r23682.
2010-08-28 14:03:17 +00:00
Rainer Keller
4abcf5a0d7 - The Sun-compiler 12 update 1 complains about noreturn-attributes
assigned to function-declarations.
   Check this case and mark the currently only case existing in trunk.

   Thanks to Paul Hargrove for bringing this up.

   Let's test the svn commit msg CMR:v1.5

This commit was SVN r23676.
2010-08-27 09:18:30 +00:00
Rainer Keller
12ed573e5e - Include <strings.h> for rindex(3).
Thanks to Paul Hargrove.

   Please CMR:v1.5

This commit was SVN r23671.
2010-08-26 13:42:36 +00:00
Shiqing Fan
9911797867 Rename a few odls help files for Windows installation.
This commit was SVN r23668.
2010-08-26 09:31:18 +00:00
Ralph Castain
2e223abe33 Restore the auto-poll method for detecting debugger attachment, but only in the mpirx debugger module and only if the corresponding rate mca param is set.
Guess we missed it before, but add the debugger framework to the orte-info and ompi_info tools

This commit was SVN r23667.
2010-08-25 22:52:33 +00:00
Ralph Castain
4ecd9a0bbe Protect against an obscure race condition that AFAICT only occurs when we are in a loop waiting to recv a message from a peer who is then killed by signal.
This commit was SVN r23662.
2010-08-25 15:35:01 +00:00
Ralph Castain
f1a00c9a21 Per Jeff's inquiry, play chicken and don't assume herror exists everywhere.
This commit was SVN r23656.
2010-08-24 20:46:41 +00:00
Jeff Squyres
207ca2d928 This commit is the first of several steps in a paffinity makeover
extravaganza.

= Short version =

This commit does several things, but the short version is that it
re-orients the error message creation of the ODLS default module to
generate error strings in the child process for errors that occur
after the fork but before the exec (such errors are ''usually''
related to paffinity).  A show_help string is rendered in the child
and then IPC'ed up to the parent, who displays the string through
normal ORTE show_help aggregation mechanisms.  We also broke up the
ginormous paffinity-setting logic into a few separate functions, both
to help us understand the code, and hopefully to ease future
maintenance.

The logic for the ODLS default binding should not have changed -- this
is mainly a code reshuffle and improvement on error reporting.

= Rationale =

The reasoning for this commit is complex.  As mentioned above, it's
the first step in some paffinity cleanup.  Here's the line of dominoes
that must fall (in this order):

 1. Add hwloc paffinity component (already done).
 1. While testing hwloc, we discovered that the error reporting from
    the ODLS default module was abysmal.  So we fixed it.
 1. Further, we reorganized the code in odls_default_module.c a bit
    to help our understanding of it.
 1. We also discovered a few bugs in the original ODLS default module
    logic that existed before this code shuffle; separate tickets
    will be filed to fix them.
 1. Next up will be some improvements to paffinity / odls default so
    that binding to a core binds to ''all'' hardware threads contained
    in that core (similarly for sockets: binding to a socket will bind
    to ''all'' hardware threads in that socket).
 1. Next will be improvements to paffinity to expose binding to
    hardware threads through the paffinity framework API.
 1. Finally, we'll expose these binding controls to the user (e.g.,
    through mpirun command line arguments, MCA parameters, etc.).

This commit represents the first few bullets; the last 4 bullets are
being worked on right now, but there is no definite timeline for
completion. 

= Miscellaneous =

A few points worth mentioning:

 * We have tested this new code a bunch; we're pretty sure it behaves
   just like the trunk -- but with better / more precise error
   reporting.  More testing is needed on a wider array of platforms,
   however. 
 * A big comment at the top of odls_default_module.c explains the
   (new) general scheme for the error reporting.
 * The error reporting in the parent process is now really dumb;
   almost all the intelligence about creating error messages is in the
   child.
 * The show_help file was renamed to be more consistent with other
   help files (help-odls-default.txt -> help-orte-odls-default.txt)
 * Removed the use of sched_yield() because of recent changes in the
   Linux 2.6.3x kernels.  We already had an #else clause for
   select()'ing for 1us if we didn't have sched_yield() -- that is now
   the only code path.  This is not a performance-critical section of
   the code, so this shouldn't be controversial.
 * Replaced the macro-based error reporting with function-based
   reporting.  It's a bit more bulky, but it helped us understand the
   code and saved us multiple times with compile-time parameter
   checking, etc.
 * Cleaned up the use of several show_help messages to ensure that
   they mapped to real messages in help*.txt files.

This commit was SVN r23652.
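
The error-reporting scheme described above (the child renders the message and sends it up to the parent over IPC) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code in odls_default_module.c: the function name `launch` and the `pre_exec` hook are hypothetical, and a plain string stands in for a rendered show_help message.

```python
import os


def launch(argv, pre_exec=None):
    """Fork and exec argv; return the child's error text ('' if the exec succeeded).

    pre_exec is an optional callable run in the child after the fork but
    before the exec (a stand-in for steps such as processor binding).
    """
    # Descriptors from os.pipe() are non-inheritable (close-on-exec), so a
    # successful exec closes the child's write end and the parent sees EOF.
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: build the error message here, where the failure is known.
        os.close(read_fd)
        try:
            if pre_exec is not None:
                pre_exec()
            os.execvp(argv[0], argv)
        except Exception as exc:
            os.write(write_fd, f"{argv[0]}: {exc}".encode())
        finally:
            os._exit(1)  # reached only if the exec did not happen
    # Parent: deliberately simple, it just relays whatever the child wrote.
    os.close(write_fd)
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks).decode()
```

The design point is the one made in the commit message: the parent needs no knowledge of what can go wrong between fork and exec, because the child does all the message construction.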
2010-08-24 19:38:29 +00:00
Ralph Castain
3b3cd67d07 If we are using static ports and cannot resolve a hostname, then see if the proc is on the local host. If so, then attempt to use a loopback interface to complete the connection. Only implemented for IPv4 because the if.c code has been so hashed I couldn't figure out how to do this cleanly for all cases.
This commit was SVN r23647.
2010-08-24 14:14:59 +00:00
Ralph Castain
7608513158 Cleanup the code and add some comments to make it easier to understand. Add a bozo error check
This commit was SVN r23639.
2010-08-24 04:46:59 +00:00
Ralph Castain
2886da5669 Ensure that the local daemon vpid gets defined so that the locality procedures work when using the ess generic module.
This commit was SVN r23638.
2010-08-24 04:38:21 +00:00
Josh Hursey
4ffc2d6f68 fix a couple of missed prefixes
This commit was SVN r23629.
2010-08-19 13:26:33 +00:00
Josh Hursey
fabd5cc153 Simplification of the ErrMgr framework by removing the 'stack'/composite functionality.
The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably.

Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component.

So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning.

This commit was SVN r23628.
2010-08-19 13:09:20 +00:00
Josh Hursey
77792c937d When we checkpoint with the --stop option, be sure to write out all the metadata before clearing the storage handle.
Here we tried to write out the session directory marker after we sync the directory, which happens early in the case of --stop.

Thanks to Ananda Mudar for noticing the bug.

This commit was SVN r23627.
2010-08-18 20:44:03 +00:00
Brian Barrett
13c827dda8 Make trunk compile on Red Storm again
This commit was SVN r23622.
2010-08-17 21:51:38 +00:00
Ralph Castain
bbf84fd92b Refine the protection from cross-dvm communications
This commit was SVN r23615.
2010-08-16 16:33:39 +00:00
Ralph Castain
930f7adb0f Check the return status and report any error
This commit was SVN r23611.
2010-08-13 15:04:59 +00:00
Ralph Castain
4491a0e5dc Add a channel for reporting errors, fix a bug in the tcp module
This commit was SVN r23610.
2010-08-13 15:04:22 +00:00
Ralph Castain
ace1f60429 Rename an mca param to something more intuitive and set its default to 0 so the module only runs if a non-zero value is provided
This commit was SVN r23609.
2010-08-13 15:03:45 +00:00
Rainer Keller
fc4cb0c0c1 - Allow changing ALPS run command
- Fix misnomer

This commit was SVN r23601.
2010-08-12 14:41:35 +00:00
Ralph Castain
5715a5b421 Let VM-based mappings include the updated nidmap
This commit was SVN r23596.
2010-08-11 21:04:28 +00:00
Josh Hursey
e12ca48cd9 A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes: 
-------------- 
 * Added C/R-enabled Debugging support. 
   Enabled with the --enable-crdebug flag. See the following website for more information: 
   http://osl.iu.edu/research/ft/crdebug/ 
 * Added Stable Storage (SStore) framework for checkpoint storage 
   * 'central' component does a direct to central storage save 
   * 'stage' component stages checkpoints to central storage while the application continues execution. 
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) 
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) 
 * Added Compression (compress) framework to support 
 * Add two new ErrMgr recovery policies 
   * {{{crmig}}} C/R Process Migration 
   * {{{autor}}} C/R Automatic Recovery 
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component 
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) 
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) 
   * {{{OMPI_CR_Restart}}} 
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) 
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) 
   * {{{OMPI_CR_Quiesce_start}}} 
   * {{{OMPI_CR_Quiesce_checkpoint}}} 
   * {{{OMPI_CR_Quiesce_end}}} 
   * {{{OMPI_CR_self_register_checkpoint_callback}}} 
   * {{{OMPI_CR_self_register_restart_callback}}} 
   * {{{OMPI_CR_self_register_continue_callback}}} 
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
 * Add a progress meter to: 
   * FileM rsh (filem_rsh_process_meter) 
   * SnapC full (snapc_full_progress_meter) 
   * SStore stage (sstore_stage_progress_meter) 
 * Added 2 new command line options to ompi-restart 
   * --showme : Display the full command line that would have been exec'ed. 
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) 
 * Deprecated some MCA params: 
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir 
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir 
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared 
   * snapc_base_store_in_place deprecated, replaced with different components of SStore 
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref 
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported 
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem 

Minor Changes: 
-------------- 
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. 
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components 
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. 
 * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. 
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. 
 * Cleanup the CRS framework and components to work with the SStore framework. 
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). 
 * Add 'quiesce' hook to CRCP for a future enhancement. 
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. 
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface). 
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. 
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. 
 * {{{opal-restart}}} also supports local decompression before restarting
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata 
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata 
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. 
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. 
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. 
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. 
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. 
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. 
 * Improve the checks for 'already checkpointing' error path. 
 * Add a recovery output timer, to show how long it takes to restart a job
 * Do a better job of cleaning up the old session directory on restart. 
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment).
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. 

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-10 20:51:11 +00:00