openmpi

Author	SHA1	Message	Date
Ralph Castain	39369f8807	Remove stale fields from global objects - have been moved to the layer that actually uses them This commit was SVN r24644.	2011-04-28 00:20:49 +00:00
Ralph Castain	8858d9a40e	Add a marker for other layers to use in defining data types This commit was SVN r24643.	2011-04-28 00:19:35 +00:00
Ralph Castain	9988b97b97	Extend/update how we handle process stats. Add the ability to collect node-level stats separate from the process stats. Update the process stat memory fields to report in MBytes instead of KBytes as I can't find any process that runs in KBytes nowadays. Rename the memusage sensor plugin to "resusage" as it will soon be updated to include full process stat monitoring. Extend the heartbeat sensor to report node and process stats in the heartbeat. Store the process and node stats in their respective orte_xxx_t object. This commit was SVN r24629.	2011-04-21 22:55:45 +00:00
Ralph Castain	5f64b830f9	Ensure we only kill threads once This commit was SVN r24620.	2011-04-18 14:47:09 +00:00
Ralph Castain	8014e7432c	Send recovery defined flag in app_contexts, include recovery flags in debug prints This commit was SVN r24619.	2011-04-18 14:46:42 +00:00
Ralph Castain	89501e6e24	Don't try to politely end threads when abnormally terminating as we can hang if the thread is in a stuck callback. This commit was SVN r24618.	2011-04-18 12:21:47 +00:00
Ralph Castain	3a28556472	Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question. Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI. This commit was SVN r24614.	2011-04-14 15:04:21 +00:00
Ralph Castain	30fb002524	Take the first small step towards rationalizing rsh support. Create a new "rshbase" component that contains a simple rsh module - no tree spawn, uses all the base functions for launch support. Extend the base rsh support functions to include those functions in common across all rsh modules. Only a minor change made to the current rsh module to avoid a naming conflict. Otherwise, left it alone to avoid creating conflicts with other external work. The current rsh module remains the default for rsh/ssh support, and continues to contain the support for SGE and Loadleveler. This commit was SVN r24593.	2011-03-30 01:15:07 +00:00
Nysal Jan	866ae8b43a	Close the file descriptor This commit was SVN r24580.	2011-03-29 08:42:49 +00:00
Nysal Jan	c8c6b0edab	Improve LoadLeveler integration with Open MPI. Add support for LL native rsh agent - llspawn This commit was SVN r24579.	2011-03-29 07:46:59 +00:00
Nathan Hjelm	8634b6394f	fixed plm/tm component This commit was SVN r24577.	2011-03-25 22:20:15 +00:00
Ralph Castain	d7e029cb40	Convert heartbeat to multicast basis This commit was SVN r24570.	2011-03-24 19:05:39 +00:00
Ralph Castain	90698a2c02	Ensure that blocking recvs wait until the data is actually recvd This commit was SVN r24558.	2011-03-22 18:45:54 +00:00
Ralph Castain	888472f671	Do not release recv as the calling function needs that data and will release it later This commit was SVN r24557.	2011-03-22 18:44:56 +00:00
Ralph Castain	30981de200	Minor cleanups courtesy of Nysal - thanks! This commit was SVN r24552.	2011-03-22 13:48:58 +00:00
Ralph Castain	c1396b278c	Resolve the rsh confusion by splitting the initial search for a launch agent from the actual setup of the launch agent values in the plm base globals. Have each aspiring rsh-clone call lookup to see if their desired launch agent is available - if not, then reject that plm component. If so, then setup the actual launch agent values only when the module init function is called. This resolves the current conflict between the rsh and rshd components. Hopefully, it may avoid future problems in this area -provided- any new uses of rsh-like launchers abide by the lookup-and-then-setup rule. This commit was SVN r24550.	2011-03-22 02:23:09 +00:00
Ralph Castain	d17b50e1ff	Add the appropriate hooks to tell Totalview to display the user's main program upon startup. Apparently, this hook got lost somewhere after the 1.2 series :-( Thanks to David Turner and the TV folks for passing this along. This commit was SVN r24549.	2011-03-21 17:40:58 +00:00
Ralph Castain	795ca2cff2	Complete implementation of the multicast-based grpcomm module This commit was SVN r24548.	2011-03-20 01:18:06 +00:00
Ralph Castain	fa40f5d7c3	Fix bad formatting This commit was SVN r24547.	2011-03-20 01:17:29 +00:00
Ralph Castain	281116ddc5	A max_restarts value of -1 is now valid and indicates infinite restarts, so correct the validity check This commit was SVN r24546.	2011-03-20 01:17:00 +00:00
Eugene Loh	2770a12beb	Continue clean up of thread options started in r22841, 22842, and 22849. No need for any CMRs to 1.5... that was already done in CMR 2728. This commit was SVN r24545. The following SVN revision numbers were found above: r22841 --> open-mpi/ompi@b400b84162	2011-03-18 21:36:35 +00:00
Ralph Castain	ee68cd102c	Fix the hier grpcomm module so modex results in correct data. The prior implementation stored the modex data as node-based attributes. This worked fine for BTL's such as openib where the interfaces were associated with the node. However, BTL's such as TCP have interfaces associated with a specific process, not a node. Thus, store the data in the modex database so it is correctly indexed. This commit was SVN r24536.	2011-03-17 02:22:23 +00:00
Ralph Castain	d5dfe05521	Remove stale code associated with OPAL_THREADS_HAVE_DIFFERENT_PIDS. In the past, we have supported the case of really, really old Linux kernels where threads have different pids. However, when we updated the event library, we didn't also update that support code. In addition, when we dropped progress thread support, we didn't remove areas of the code that could no longer be compiled (i.e., were protected by "if progress thread && if have different pids"). There was no compelling reason to support such old kernels. Accordingly, convert the test to print a nice error message indicating we no longer support old kernels (but indicate that earlier OMPI versions do) and error out. Remove all code that was protected by "if have different pids" since it can no longer be compiled. This commit was SVN r24531.	2011-03-15 21:05:03 +00:00
Ralph Castain	de092af8ef	Add a little more debug This commit was SVN r24526.	2011-03-14 18:43:49 +00:00
Ralph Castain	ebabe9c83a	Forgot that Terry wanted to control the vm launch with an mca param - set one up for that purpose This commit was SVN r24525.	2011-03-13 00:46:42 +00:00
Ralph Castain	dc6f616599	Enable VM launch. For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there. Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework: (a) a check when asked to map a job to see if it is the daemon job, and (b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing. In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner. This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven. This commit was SVN r24524.	2011-03-12 22:50:53 +00:00
Ralph Castain	80265b472e	Avoid direct reference of pointer_array elements This commit was SVN r24523.	2011-03-12 20:18:51 +00:00
Ralph Castain	df82e4cd36	Plug a memory leak This commit was SVN r24521.	2011-03-12 15:37:33 +00:00
Ralph Castain	1297acde13	George raised some valid concerns about the extensibility of the revised rmaps framework. Address those by: 1. removing the enum of mapper values 2. change the req_mapper and last_mapper fields to char* so they can hold the component name instead of a mapper flag 3. revise the selection logic in the mapper components to reflect the change. Components now look for their name in the req_mapper field, or to see if other criteria (e.g., npernode) are set that mandate their doing the mapping Several MCA params resided in the rmaps base for historical reasons - they have been in the base since at least the original 1.2 release (and perhaps earlier). However, George correctly pointed out that they really should reside in their respective components. Accordingly, move them to the components, but register synonyms to the old names to avoid breaking backward compatibility. These revisions retain the current functionality of allowing comm_spawn'd jobs to use different mappers than the original job, and for the errmgr to utilize the resilient mapper to recover processes regardless of how they were originally mapped. Given the large number of possible combinations, I am sure that someone will find a corner-case combination of values and selection criteria that cause either no mapper to be selected, or one other than the intended to be used. No one can test all the ways people will use this system, so I expect debugging to continue for awhile. The ability of comm_spawn'd jobs to exploit this functionality relies on changes to the orte_dpm component - this will be committed separately. This commit was SVN r24520.	2011-03-12 05:30:09 +00:00
Samuel Gutierrez	830c7c66dc	fixes CID #1667 This commit was SVN r24518.	2011-03-12 03:09:01 +00:00
Ralph Castain	e6a76cc923	Fixes CID #1954 This commit was SVN r24516.	2011-03-11 23:00:27 +00:00
Ralph Castain	2ccd514b9a	Add version string to app This commit was SVN r24514.	2011-03-11 20:38:37 +00:00
Samuel Gutierrez	2a2319d23a	when orte_timing is enabled, always record daemon launch start time before starting the real work. This commit was SVN r24513.	2011-03-11 00:09:23 +00:00
Ralph Castain	f9a9fac76b	Minor typo This commit was SVN r24506.	2011-03-10 16:09:31 +00:00
George Bosilca	80fe617cd2	If we don't release the OPAL utils explicitly there will be a memory leak. This commit was SVN r24505.	2011-03-10 00:42:28 +00:00
George Bosilca	7f34a28c8f	Correct a comment. This commit was SVN r24504.	2011-03-10 00:41:41 +00:00
George Bosilca	d2502b14f9	Destruct the OOB TCP internal objects. This commit was SVN r24503.	2011-03-10 00:40:54 +00:00
Ralph Castain	3b4421d8e3	Separately track requested and last-used mapper so we don't lose that info This commit was SVN r24502.	2011-03-09 18:51:36 +00:00
Jeff Squyres	06d5c59115	Fix a few valgrind-reported memory leaks This commit was SVN r24498.	2011-03-08 17:37:28 +00:00
Jeff Squyres	0586612bd5	Fix another minor memory leak This commit was SVN r24495.	2011-03-08 15:46:13 +00:00
Jeff Squyres	79cf382ff3	Fix a few issues with error messages: * If something goes wrong during ompi_mpi_init, don't erroneously report that it is illegal to invoke MPI_INIT* before MPI_INIT * Aggregate help messages when possible when something goes wrong during ompi_mpi_init This commit was SVN r24492.	2011-03-07 16:45:45 +00:00
Ralph Castain	63f38e38bb	Fix ompi-server: remove extra command flag in buffer being sent to mpirun, ensure that tools route messages thru a remote HNP This commit was SVN r24491.	2011-03-05 17:12:46 +00:00
Ralph Castain	d764e7a398	We want uid/gid support at the individual application level. Ensure the values get initialized and packed/unpacked for transfer. This commit was SVN r24489.	2011-03-04 18:46:43 +00:00
George Bosilca	9bbe00bdc3	Set the return code from the processes upstream. This commit was SVN r24483.	2011-03-03 00:02:21 +00:00
George Bosilca	c6a5f9706a	Thomas's patch: Assume we won't fail unless notified by a child. This commit was SVN r24482.	2011-03-02 23:50:01 +00:00
Josh Hursey	62bba1bf12	Name the enum so that it represents as an actual symbol in gdb, instead of just a number. This commit was SVN r24472.	2011-03-01 21:00:03 +00:00
Nysal Jan	4030111478	Add missing copyright and fix the year This commit was SVN r24446.	2011-02-23 15:52:06 +00:00
Nysal Jan	42a73bb887	POE is supported on both AIX and Linux. Build POE PLM only if we find the poe binary. Fix hostfile creation and POE command line arguments. This commit was SVN r24444.	2011-02-23 15:38:41 +00:00
Ralph Castain	f014284f91	Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance. This commit was SVN r24417.	2011-02-20 18:46:21 +00:00
Ralph Castain	a8cf19a7bc	Ensure heartbeat only started once and only for daemon job This commit was SVN r24416.	2011-02-18 20:33:54 +00:00
Ralph Castain	ef56e6d78b	Helps to move the pointer This commit was SVN r24414.	2011-02-18 14:01:25 +00:00
Ralph Castain	7b35ada7fc	Fix ricochet effect - move failed procs to next on list instead of loadbalancing This commit was SVN r24413.	2011-02-18 13:11:55 +00:00
Ralph Castain	b98a2917ff	Add an API to the errmgr so that apps can register for a callback to warn them of an impending migration - this gives apps a chance to cleanly terminate prior to being migrated for external reasons (e.g., impending failures). The timeout provided indicates to the daemon how long it should wait before proceeding to kill/migrate the process - if the process fails to exit before that time, the daemon will kill it. This commit was SVN r24412.	2011-02-18 02:48:12 +00:00
Ralph Castain	a0f6e153c7	Add missing fields to copy of app_context object This commit was SVN r24411.	2011-02-17 23:55:05 +00:00
Ralph Castain	51cf0a16c3	Some minor cleanups to support VM and CM operations This commit was SVN r24408.	2011-02-16 23:03:08 +00:00
Ralph Castain	9b48c07599	CM daemons handle their own output This commit was SVN r24407.	2011-02-16 23:02:23 +00:00
Ralph Castain	65ba6af44d	Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM. Have each mapper flag it did the map so we can see who did it later. Ensure procs are flagged as "ready to launch". This commit was SVN r24406.	2011-02-16 23:01:57 +00:00
Jeff Squyres	3f4d4886f2	Minor update for something that has been bugging me for quite a while: OMPI supports multiple different repository systems (SVN, hg, git). But the VERSION file has listed "want_svn" and "svn_r" as fields, even though the actual repo system and version may not be SVN. So search/replace those fields (and derivative values that come from those fields) with "want_repo_rev" and "repo_rev", respectively. This commit was SVN r24405.	2011-02-16 22:53:23 +00:00
Ralph Castain	a32a7d9a82	Update heartbeat system This commit was SVN r24404.	2011-02-16 18:50:51 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Ralph Castain	c1da94a444	Dead children have no pid This commit was SVN r24387.	2011-02-15 13:30:51 +00:00
Ralph Castain	a3607ff35d	Make it easier to send a kill-local-procs command for an arbitrary number of procs This commit was SVN r24386.	2011-02-15 13:26:11 +00:00
Ralph Castain	bf1cff3711	Plug a couple of additional memory leaks - try to highlight a little better that strings returned from reg_string_name must be freed by caller This commit was SVN r24383.	2011-02-14 20:58:22 +00:00
Ralph Castain	a9dca25ca5	Remove the distinction between local and global restarts - leave it up to the error strategy to decide which to do. Cleanup the heartbeat handling so it is associated with the proc, not a node. Cleanup handling of recovery options so that defaults do not override user values iff they are provided. This commit was SVN r24382.	2011-02-14 20:49:12 +00:00
Ralph Castain	6ee69c670e	Add a minor utility This commit was SVN r24379.	2011-02-14 19:45:59 +00:00
Ralph Castain	b5de068533	Clean up an error in r24371 - can't use a const parameter as target in asprintf as it changes the value of the address. Add some new proc/job states Rename a constant to reflect coming change - remove the arbitrary difference between restarting a proc locally and relocating it to another node in terms of the number of restarts allowed. Add pretty-print of signals for "proc aborted due to signal" reports. This commit was SVN r24378. The following SVN revision numbers were found above: r24371 --> open-mpi/ompi@93d28a5792	2011-02-14 19:29:09 +00:00
Abhishek Kulkarni	93d28a5792	Change opal_err2str_fn_t to return the error string as an argument. This means that the converters (opal_err2str, orte_err2str) can now return NULL as a "silent error". The return value of opal_err2str_fn_t is the status of the operation (OPAL_SUCCESS or OPAL_ERROR). This fixes the "Unknown error" message issues on the trunk. This commit was SVN r24371.	2011-02-13 16:09:17 +00:00
Ralph Castain	33b68132cc	Update the rmcast framework This commit was SVN r24370.	2011-02-12 16:52:03 +00:00
Josh Hursey	a9335ea423	Make sure to initialize the 'update_state' function for the default module. This will prevent tools from segfaulting if the mpirun process goes away suddenly while they are trying to communicate with it over the OOB. This commit was SVN r24365.	2011-02-08 20:42:32 +00:00
Nysal Jan	3a8d251daa	vsyslog is not included in SUSv3. Add a check for platforms that do not have vsyslog This commit was SVN r24339.	2011-02-02 10:05:57 +00:00
Josh Hursey	fa3f6485d8	Make sure to define the region of time in which the migration is occurring so that the automatic recovery does not jump in the middle when we are moving processes around. This commit was SVN r24326.	2011-01-31 19:09:47 +00:00
Josh Hursey	5b58ff0663	Fix a C/R checkpoint->restart->checkpoint->restart case. The problem is that the SStore components were not flushing the old, stale checkpoint information. As a result the checkpoint was writing into the wrong directory, which produced an invalid checkpoint. This seems to be fixed now. Thanks to Alex Brick for the bug report. This commit was SVN r24325.	2011-01-28 21:25:14 +00:00
Jeff Squyres	ec3d18dc9f	As noted on the mailing list by Gabriele Fatigati (http://www.open-mpi.org/community/lists/users/2011/01/15427.php), the --tv (and friends) switches to mpirun would effectively munge the orterun command line together and then split it apart again before exec'ing the underlying debugger. We would therefore lose multi-token argv[x] values and split them into multiple tokens. For example: mpirun --tv -np 2 a.out "foo bar" would get launched with "foo" and "bar" as separate arguments; not one argument. This was due to the underlying code joining the argv into a single string and then re-splitting it. This commit removed the argv join; it now does the parsing and re-jiggering of the argv by only looking at each individual argv item; multi-word tokens like "foo bar" will never be split into separate tokens. This commit was SVN r24322.	2011-01-28 13:01:06 +00:00
Josh Hursey	8ec85c6b8f	Fixes the C/R Automatic Recovery feature when the HNP is also hosting processes locally. I want to thank Hugo Meyer for reporting this/these bugs. Notes: * Moved over a patch from the stabilization branch that makes sure we close the peer socket in the OOB TCP component fully during shutdown (after the de-registration sync). It also ensures that we free the rml_uri only after we are done communicating with the peer (in the odls_base deregister sync operation). * When an error is detected while delivering messages, we really want to bail out of the loop since the error manager is likely mutating the orte_local_children data structure, so it is no longer safe to iterate over in the orte_odls_base_default_deliver_message() function. * When the HNP is hosting processes make sure it accounts for processes that may have failed locally in the ErrMgr HNP component by decrementing the num_local_procs. This makes it match the orted ErrMgr component accounting. This is what was causing the modex to fail (the number of participants was wrong on a rolling recovery). * The crmig and autor features of the hnp ErrMgr component now check for the jobid from both the 'job' parameter and from the process name (since one may be there and not the other). This caused some additional error messages during startup. * If we fail to migrate (e.g., due to invalid node specification), print only the error message, not the error and success messages. This can be misleading. This commit was SVN r24317.	2011-01-27 20:40:23 +00:00
Josh Hursey	81fd41f811	Return an informative error message if the user requests a migration of a job that is not capable of it. C/R Functionality cleanup This commit was SVN r24307.	2011-01-26 15:36:34 +00:00
Josh Hursey	8f45fcb429	More fixes for the C/R support. Fixes a couple bugs with the migration and autor features. The C/R functionality should be fully working now. * Fix the checkpoint-restart-checkpoint case, which would previously reject the checkpoint of the newly restarted process. Re-enabling checkpointing once the application has fully restarted fixes this issue (make sure to set is_app_checkpointable to true on restart confirmation). * In the case of an invalid checkpoint, do not try to access the SStore datastore as it will be using a dummy handler, and return NULL strings. mpirun was segfaulting in the error case because it was trying to convert the seq_num from a string to an integer. * Make sure to initialize the timer event in the Automatic Recovery section of the HNP errmgr, per the libevent update. This caused a segfault when attempting to recover a failed process. * If ompi-checkpoint loses connection to the HNP/mpirun the TCP socket will fail and call the ErrMgr update_state function. This commit adds a dummy function {{{orte_errmgr_base_update_state()}}} that will prevent the ompi-checkpoint command from segfaulting in this error scenario. This commit was SVN r24306.	2011-01-26 14:56:35 +00:00
Nathan Hjelm	8a3179cdcb	removed c99 test code This commit was SVN r24297.	2011-01-25 23:02:35 +00:00
Josh Hursey	e4d13d338f	Fix a couple of compiler warnings This commit was SVN r24295.	2011-01-25 22:22:32 +00:00
George Bosilca	09f645f9a9	There is no need for the byte variable. This commit was SVN r24293.	2011-01-24 22:41:04 +00:00
Nathan Hjelm	e2126512a9	test c99 struct initialization with mtt. remove on jan 20, 2011 This commit was SVN r24271.	2011-01-19 22:21:21 +00:00
Abhishek Kulkarni	fd7ef7a1f1	Fixes broken trunk compile: call process status notify only when ft-enable-cr is selected. This commit was SVN r24255.	2011-01-14 18:37:07 +00:00
Abhishek Kulkarni	87d2c9b31d	Few fault tolerance updates related to the CIFTS project (http://www.mcs.anl.gov/research/cifts/) * Improve the FTB notifier to publish (C/R, process/communication failure) events to the FTB with the OMPI jobid as the associated payload. * Add notifier calls for C/R events and process status events in SnapC and ErrMgr components. * Fix a bug where the SnapC states and process states collide before being thrown out over the notifier. This commit was SVN r24251.	2011-01-13 20:13:49 +00:00
Ralph Castain	b09f57b03d	Update the multicast subsystem - ported from Cisco branch This commit was SVN r24246.	2011-01-13 01:54:05 +00:00
Terry Dontje	f3aaa885a3	corrected a couple places in orte where it said cpu_model when it should have been cpu_type. This commit was SVN r24221.	2011-01-11 19:56:26 +00:00
Abhishek Kulkarni	11ffa854ff	Update the FTB notifier * fix indentation issues * update the name of one of the fault events published to the FTB (per the FTB MPI standard) This commit was SVN r24213.	2011-01-10 18:58:31 +00:00
Ralph Castain	80ef1af8ba	Add psm key generator program This commit was SVN r24197.	2010-12-30 20:54:58 +00:00
Nathan Hjelm	c082d05ecb	Reset the timer on MPIR_being_debugged only if MPIR_being_debugged is not set. Fix typo in return code. This commit was SVN r24187.	2010-12-20 21:00:49 +00:00
Jeff Squyres	a525e70f46	Convert "opal_show_help" to be a global variable pointer. It is statically initialized to the real back-end OPAL show_help function. During orte_show_help_init(), the variable is re-assigned with the value of the back-end ORTE show_help function (the one that does error message aggregation). Therefore, anything that calls opal_show_help() after a certain point in orte_init() will have their show_help messages be aggregated. w00t! Even code down in OPAL -- that has no knowledge of ORTE -- will have their messages aggregated. '''Double w00t!''' During orte_show_help_finalize(), we restore the original pointer value so that if something calls opal_show_help() after orte_finalize(), it'll still work properly (but it won't be aggregated). This commit was SVN r24185.	2010-12-16 23:00:25 +00:00
Jeff Squyres	de97962aac	Fixes trac:2651. Fix off-by-one error when /dev/urandom doesn't exist. Thanks to "pth" for the patch. This commit was SVN r24170. The following Trac tickets were found above: Ticket 2651 --> https://svn.open-mpi.org/trac/ompi/ticket/2651	2010-12-14 14:52:51 +00:00
Ralph Castain	b251a59cdf	Cleanup nidmap finalize This commit was SVN r24164.	2010-12-11 16:42:06 +00:00
Ralph Castain	2dc5cbb483	Remove stale code and API from the RML/OOB frameworks. Stopped using this code years ago. This commit was SVN r24153.	2010-12-05 15:58:21 +00:00
Rolf vandeVaart	b67d3398da	It is convention to have orte_config.h included at top of file. This commit was SVN r24146.	2010-12-03 16:13:31 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	aaad8ae891	Remove unused var This commit was SVN r24136.	2010-12-02 02:38:13 +00:00
Ralph Castain	f9ffff59f8	Ensure clean termination of threads and tcp multicast This commit was SVN r24134.	2010-12-02 00:23:42 +00:00
Nathan Hjelm	75605faa75	added support for reattaching a debugger using the MPIR_attach_fifo This commit was SVN r24132.	2010-12-01 20:13:58 +00:00
Ralph Castain	ad814f26cd	One more time, into the breach! Restore the use of override_oversubscribe to indicate that the data source for resources on the backend nodes used in mapping is unreliable. In this situation (e.g., data came from hostfile, or we are just using localhost because nothing was provided), we don't trust the oversubscribe condition passed by the mapper. Instead, we check locally to ensure we set sched_yield correctly. This commit was SVN r24130.	2010-12-01 15:15:26 +00:00
Ralph Castain	eba65e97f3	Extend the rmcast APIs to allow enable/disable of comm, required for clean termination by upper layer users. Point the recv thread event base to the right place so it can wakeup when required. Add a new error code for "comm disabled" when attempting to communicate after disabling comm. This commit was SVN r24129.	2010-12-01 13:41:19 +00:00
Ralph Castain	9224302c10	Remove debug This commit was SVN r24128.	2010-12-01 13:12:24 +00:00
Ralph Castain	4f5625d699	Not totally necessary, but good form - init the oversubscribed field in the orte_nid_t object This commit was SVN r24127.	2010-12-01 12:58:37 +00:00
Ralph Castain	30c37ea536	Ensure that the oversubscribed condition of nodes is accurately reported by the mapper, and that the results are communicated and used by the backend orteds when setting sched_yield on local procs. Restores prior behavior that was somehow lost along the way. Includes a patch from Damien Guinier to fix vpid assignments when cpus-per-task is specified. This commit was SVN r24126.	2010-12-01 12:51:39 +00:00
Ralph Castain	85a974b0de	Better check for NULL before using the value This commit was SVN r24122.	2010-12-01 04:48:50 +00:00
Ralph Castain	c56185887b	Change the event base "wakeup" support to enable the passing of events to the central thread for add/del. Add a macro OPAL_UPDATE_EVBASE for this purpose as it will likely be widely used. Update the ORTE thread support to utilize this capability. Update the rmcast framework to track the change. This commit was SVN r24121.	2010-12-01 04:26:43 +00:00
Ralph Castain	963336ee5a	Remove the test for libevent internal threads This commit was SVN r24120.	2010-12-01 04:24:10 +00:00
Ralph Castain	0441e81882	Oops - ensure that multicast msgs get circulated properly with the tcp module This commit was SVN r24118.	2010-11-30 21:13:53 +00:00
Ralph Castain	d20c023348	Checkpoint the threading support for multicast - will be revised shortly, but this version currently works. This commit was SVN r24117.	2010-11-30 17:30:16 +00:00
Ralph Castain	0465605a9c	Cleanup condition check for a param so it doesn't show if not usable. This commit was SVN r24116.	2010-11-30 17:28:53 +00:00
Ralph Castain	09f02b3087	Update the ORTE thread acquire/release/wakeup macros to trigger release from event_loop so that conditions can be checked. Add macro versions of condition_wait and friends for debug use. This commit was SVN r24115.	2010-11-30 17:27:58 +00:00
Ralph Castain	d2547e84a3	MPI procs never use orte progress threads This commit was SVN r24093.	2010-11-29 03:52:46 +00:00
Ralph Castain	71669720a3	Just get the output once on sigpipe error, and include the fd This commit was SVN r24092.	2010-11-25 15:32:48 +00:00
Ralph Castain	30c635fd4d	Don't endlessly output sigpipe errors. Count the number of times we trap it, and abort if we get more than 10 of them. This commit was SVN r24091.	2010-11-25 15:25:24 +00:00
Ralph Castain	b9b2d101dc	Add an mca param to indicate if orte progress threads are to be enabled. Error out if this is given and libevent thread support was not built. This commit was SVN r24089.	2010-11-24 23:28:00 +00:00
Rolf vandeVaart	1d62542c23	Fix another Sun Studio warning. jobid and vpid need to be uint32_t. This commit was SVN r24074.	2010-11-19 18:12:46 +00:00
Rolf vandeVaart	09fdd5cc23	Include fcntl.h, not sys/fcntl.h so we get the definition of the open system call. That is what man page says to do. Fixes warning on Solaris. This commit was SVN r24073.	2010-11-19 17:40:02 +00:00
Shiqing Fan	358b4a5cba	Add an option to enable the debug postfix for executables. This commit was SVN r24070.	2010-11-19 15:54:13 +00:00
Ethan Mallove	66f2301170	Just plain "Grid Engine" instead of "Sun Grid Engine" This commit was SVN r24068.	2010-11-18 19:30:04 +00:00
Abhishek Kulkarni	78a67654d4	add notifier events for process migration This commit was SVN r24058.	2010-11-16 17:57:44 +00:00
Abhishek Kulkarni	6e6ccae082	Update the checkpoint notification events that we throw out over the FTB with a payload embedded in {} This commit was SVN r24057.	2010-11-16 17:55:57 +00:00
Ralph Castain	58e711a412	Update a test and add two new ones for testing event lib thread support This commit was SVN r24051.	2010-11-13 15:39:28 +00:00
Jeff Squyres	e4744b4ed5	Per http://www.open-mpi.org/community/lists/devel/2010/11/8671.php , change a bunch of OMPI_<foo> names to OPAL_<foo>. This commit was SVN r24046.	2010-11-12 23:22:11 +00:00
Ralph Castain	703684e071	Output the mca params for debug purposes This commit was SVN r24042.	2010-11-11 20:06:29 +00:00
Shiqing Fan	c03ea1a5f3	A cleaner way to build on Windows. It's not possible to combine two shared libraries on Windows, so we have to do it a bit differently. First generate a small event static library by just linking the object files, and link it into other libraries that need the libevent API. This commit was SVN r24039.	2010-11-11 12:02:54 +00:00
Ralph Castain	bb521c6b7e	Properly count local procs to set oversubscribed condition This commit was SVN r24037.	2010-11-10 21:59:35 +00:00
Ralph Castain	021bd77bf1	Don't free the event base if we aren't using progress threads This commit was SVN r24036.	2010-11-10 21:58:58 +00:00
Ralph Castain	9c72737414	Send the recovery flag This commit was SVN r24035.	2010-11-10 21:26:28 +00:00
Ralph Castain	57257ab9b4	Use the right event base if threads are disabled. Always update the seq num This commit was SVN r24034.	2010-11-10 21:26:04 +00:00
Ralph Castain	cbb758c4fb	Allow mcast threads to be disabled This commit was SVN r24032.	2010-11-10 20:16:41 +00:00
Ralph Castain	22e40d92a0	Cleanup thread termination This commit was SVN r24031.	2010-11-10 19:36:44 +00:00
Ralph Castain	f5e50abab2	Make class visible This commit was SVN r24022.	2010-11-09 19:07:45 +00:00
Nathan Hjelm	986265fc6e	fixed crash in orte-ps caused by calls to OBJ_RELEASE on an opal_event_t object. This commit was SVN r24020.	2010-11-09 18:41:43 +00:00
Shiqing Fan	482a621e31	Change the behavior of exporting/importing symbols on Windows, so that to fit the new build procedure, i.e. import statically linked opal/orte libraries for other libraries/binaries. There are several use cases when creating dll libraries: 1. create DLL A, export symbols of A, import nothing (A normally is OPAL) should define _USRDLL , A_EXPORT 2. create DLL B, export symbols of B, import A.lib (B could be ORTE, OMPI or other ompi tools) should define _USRDLL, B_EXPORT 3. create DLL C, import B.dll (C could be external libs or apps) should define B_IMPORT This commit was SVN r24016.	2010-11-09 16:13:30 +00:00
Ralph Castain	01347926d1	Be a little more thorough about cleaning up during finalize This commit was SVN r24014.	2010-11-09 14:56:27 +00:00
Shiqing Fan	d3701ccba8	type casts. This commit was SVN r24013.	2010-11-09 09:17:22 +00:00
Shiqing Fan	7bac326920	Fix Windows build, add custom command to generate static libraries (opal and orte) for shared build. This commit was SVN r24012.	2010-11-09 08:32:45 +00:00
Ralph Castain	f2f41d1ca9	Be nice to those who don't enable-multicast...poor wretches. This commit was SVN r24011.	2010-11-09 05:08:55 +00:00
Ralph Castain	a47b33678b	Add orte-level thread support to avoid some of the opal_if_threads protection used solely for ompi. Use threads to help process multicast messages. This commit was SVN r24009.	2010-11-08 19:09:23 +00:00
Ralph Castain	bf665692c3	Update the rmcast callback function API to return message sequence number. Update orte_mcast test to stress the system. This commit was SVN r24004.	2010-11-07 23:29:52 +00:00
Abhishek Kulkarni	e0660101d3	Throw notifier events for checkpointing status (success or failure) This commit was SVN r24003.	2010-11-07 22:12:09 +00:00
Abhishek Kulkarni	8cd3759f21	use the saved value of PID, saving some calls to getpid() This commit was SVN r24002.	2010-11-07 22:09:49 +00:00
Abhishek Kulkarni	d1a4cc33dd	Update the FTB notifier wrt events decided by the CIFTS working group This commit was SVN r24001.	2010-11-07 22:06:32 +00:00
Abhishek Kulkarni	ac2768ca7c	LOG_SYSLOG is a syslog facility. take it off the syslog options This commit was SVN r24000.	2010-11-06 22:05:45 +00:00
Abhishek Kulkarni	132c8d1b00	removing some unneeded calls to ORTE_ERROR_LOG This commit was SVN r23999.	2010-11-06 22:00:18 +00:00
Brian Barrett	a94caae625	More lightweight build changes This commit was SVN r23989.	2010-11-03 15:39:10 +00:00
Shiqing Fan	505efbaa27	Update the CMake scripts, solve a few export symbols for Windows. This commit was SVN r23976.	2010-11-02 16:39:27 +00:00
Ralph Castain	875a6d61a4	Return correct status code This commit was SVN r23969.	2010-10-29 00:43:50 +00:00
Ralph Castain	9ea2b196ce	Convert the opal_event framework to use direct function calls instead of hiding functions behind function pointers. Eliminate the opal_object_t abstraction of libevent's event struct so it can be directly passed to the libevent functions. Note: the ompi_check_libfca.m4 file had to be modified to avoid it stomping on global CPPFLAGS and the like. The file was also relocated to the ompi/config directory as it pertains solely to an ompi-layer component. Forgive the mid-day configure change, but I know Shiqing is working the windows issues and don't want to cause him unnecessary redo work. This commit was SVN r23966.	2010-10-28 15:22:46 +00:00
Ralph Castain	c13b0bb668	Update some debugger attachment code per LLNL request This commit was SVN r23965.	2010-10-28 03:06:20 +00:00
Brian Barrett	3ed00ba148	More fixes to make OMPI compile with minimal ORTE support again This commit was SVN r23962.	2010-10-27 20:40:39 +00:00
Shiqing Fan	199df1eadf	Rename a few var names. This commit was SVN r23959.	2010-10-27 11:52:57 +00:00
Nathan Hjelm	e7bfbe1d1a	added missing object initialization/destruction of mca_oob_tcp_component.tcp_listen_thread_event This commit was SVN r23958.	2010-10-26 22:09:37 +00:00
Shiqing Fan	a3d9c91ff7	Exclude stdbool.h for Windows, and use the definition in opal. Immigrate the socket pair support from libevent. Fix other minor things and make it compile. This commit was SVN r23951.	2010-10-26 14:53:50 +00:00
Ralph Castain	ebb4962072	fix typo This commit was SVN r23949.	2010-10-26 14:48:20 +00:00
Ralph Castain	8d5045de16	Make what is going on more obvious This commit was SVN r23948.	2010-10-26 14:39:09 +00:00
Ralph Castain	7f103c8a9d	Supply missing argument This commit was SVN r23945.	2010-10-26 07:24:55 +00:00
Ralph Castain	894230b121	This stuff is soooo out-of-date that a complete rewrite would be required - thankfully, nobody cares This commit was SVN r23944.	2010-10-26 06:22:31 +00:00
Ralph Castain	86c7365e8e	Clean up a few initialization issues - don't think these are impacting the shared memory situation as it didn't fix the problem. Setup the event API to support multiple bases in preparation for splitting the OMPI and ORTE events. Holding here pending shared memory resolution. This commit was SVN r23943.	2010-10-26 02:41:42 +00:00
Ralph Castain	fc46dfa78a	Remove stale code This commit was SVN r23942.	2010-10-26 02:37:56 +00:00
George Bosilca	17379a6097	tmp is on the stack, therefore storing its address for later usage is FORBIDDEN. Set it to NULL for now, and hope the cbdata is never used by the timer callbacks. This commit was SVN r23941.	2010-10-25 19:07:13 +00:00
George Bosilca	5882290cdd	We need a default value or the compiler will whine. This commit was SVN r23940.	2010-10-25 19:05:45 +00:00
Ralph Castain	aaec8ec426	Fix orte-ps so it correctly reports out on processes within a job This commit was SVN r23933.	2010-10-25 17:53:53 +00:00
Abhishek Kulkarni	c671ec52d1	Fix broken trunk compile after the libevent changes. This commit was SVN r23929.	2010-10-25 14:11:48 +00:00
Ralph Castain	fceabb2498	Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac. This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects. Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems. Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct. I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things: 1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new) 2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it. There is also the issue of the new libevent "no recursion" behavior. As I described on a recent email, we will have to discuss this and figure out what, if anything, we need to do. This commit was SVN r23925.	2010-10-24 18:35:54 +00:00
Ralph Castain	8a37b3071a	Cleanup mpirx co-spawn of debugger daemons. Resolve block-until-both-ends-are-opened behavior for fifos. Add some debugging output. Make the fifo event be non-persistent and read the value in the fifo to avoid spinning on the event. This commit was SVN r23922.	2010-10-22 20:46:07 +00:00
Ralph Castain	2c1a658232	Fix debugger attach This commit was SVN r23921.	2010-10-22 20:07:24 +00:00
Jeff Squyres	4322a78e60	Update wrapper compiler scripts to search for perl during configure, per request from BSD maintainers. This commit was SVN r23914.	2010-10-19 22:45:54 +00:00
Ralph Castain	1e93437cd4	To help with debugging, add a new mca param that instructs ORTE_ERROR_LOG to output "silent" errors. Helps to track down silent errors that don't have an associated error message (e.g., via show_help). This commit was SVN r23893.	2010-10-16 03:29:47 +00:00
Brian Barrett	9febaa475e	* Add shell of functionality required for supporting Portals4 * Update places where orte-free builds have failed This commit was SVN r23891.	2010-10-14 22:49:09 +00:00
Jeff Squyres	c891ed34e2	More verbatim escaping. This commit was SVN r23873.	2010-10-07 22:26:51 +00:00
Ralph Castain	37f566bf1e	Cancel recvs when finalizing This commit was SVN r23871.	2010-10-07 22:02:12 +00:00
Jeff Squyres	eaab8d0062	* Ensure to set paffinity_enabled in all cases * Ensure to set the mask value before we use it This commit was SVN r23861.	2010-10-07 15:48:49 +00:00
Ralph Castain	92d69effc9	Read the data from the fifo to clear the event so it doesn't immediately re-trigger This commit was SVN r23856.	2010-10-07 02:41:02 +00:00
Ralph Castain	871b685e89	Ensure all debugger interface symbols are present in orterun This commit was SVN r23823.	2010-09-30 21:36:00 +00:00
Josh Hursey	c8692198a2	Fix an issue migrating/autorecovering processes that mask SIGTERM using the C/R functionality. I did not want to make this change globally since there could be good reason to keep the check before calling SIGKILL that I am not seeing at the moment. This commit was SVN r23821.	2010-09-30 20:55:12 +00:00
Ralph Castain	dd959f5ab6	Silence an idiotic warning This commit was SVN r23819.	2010-09-30 17:54:13 +00:00
Ralph Castain	162734731d	Sigh - fix one more place where r23764 had a problem This commit was SVN r23816. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-29 20:47:52 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Ralph Castain	b76278085e	Sigh. Since 0,0 is a valid process, this just doesn't work (duh) This commit was SVN r23795.	2010-09-23 15:03:05 +00:00
Ralph Castain	7aed412798	Add definitions to tell us when jobid and vpid haven't been defined This commit was SVN r23793.	2010-09-23 01:33:42 +00:00
Ralph Castain	3631e4e936	Revert remaining svn kruft from r23764 This commit was SVN r23786. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-22 01:11:40 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Ethan Mallove	8052c9dd46	Prevent SEGV when using odls_base_verbose This commit was SVN r23758.	2010-09-15 20:17:50 +00:00
Ralph Castain	e96b5f486f	Reorganize the opal interface code in opal/util/if.c per prior emails and telecon discussions. Move the interface discovery code into a framework so that configuration logic can separate it out (instead of the prior #if-#else confusion). All interface APIs for accessing the info remain unchanged in opal/util/if.c. This has been tested on Mac, Linux, and NetBSD. Nobody else seemed interested in testing it, so there may be some future problems revealed as people try it on other OSs. This commit was SVN r23743.	2010-09-13 01:58:51 +00:00
Abhishek Kulkarni	a143622b54	Remove unused code. Notifier events are aggregated on a per-event basis by the HNP notifier. This commit was SVN r23711.	2010-09-02 16:00:22 +00:00
Ralph Castain	f75437f5a3	Add the ability to receive notifier output when job completes. Set the notification level to INFO for normal job completion, and to ALERT for abnormal termination. This commit was SVN r23710.	2010-09-02 14:42:41 +00:00
Rolf vandeVaart	14e7bcc383	Create new entries in the wrapper data files so the administrator can specify compiler flags that get inserted into the command before the user's flags. These flags can be specified at configure time. Reviewed by Jeff Squyres. This fixes ticket #2474. This commit was SVN r23709.	2010-09-02 10:47:55 +00:00
Ralph Castain	b982f908e8	Fixed some newly-induced warnings This commit was SVN r23694.	2010-08-31 14:51:19 +00:00
Rainer Keller	97511912ec	- Fixup several functions, that cannot return - Add one instance where we do not use a parameter in a function - Fix a buglet in commit r23689, where the attribute-for-function ptrs was applied. This commit was SVN r23690. The following SVN revision numbers were found above: r23689 --> open-mpi/ompi@5eb571c458	2010-08-31 12:21:13 +00:00
Rainer Keller	5eb571c458	- As suggested in CMR #2558, attribute-macros should be tested on function pointers and assigned accordingly, instead of using the pre-processor in the header files. A functional change is (re-) specifying __opal_attribute_noreturn__ on orte_errmgr_base_abort(): All modules in the errmgr framework either use this function, or define their own abort function, which sets __opal_attribute_noreturn__. This attribute was taken out with the errmgr overhaul in r22872. This commit was SVN r23689. The following SVN revision numbers were found above: r22872 --> open-mpi/ompi@e4f2d03d28	2010-08-31 10:28:51 +00:00
Ralph Castain	b81358815c	Add some debug This commit was SVN r23686.	2010-08-29 13:45:10 +00:00
Ralph Castain	554aede041	Fix a situation where we were unlocking a thread that isn't locked for the main launch - it is only used for dynamic spawns. This commit was SVN r23682.	2010-08-28 14:03:17 +00:00
Rainer Keller	4abcf5a0d7	- The Sun-compiler 12 update 1 complains about noreturn-attributes assigned to function-declarations. Check this case and mark the currently only case existing in trunk. Thanks to Paul Hargrove for bringing this up. Let's test the svn commit msg CMR:v1.5 This commit was SVN r23676.	2010-08-27 09:18:30 +00:00
Rainer Keller	12ed573e5e	- Include <strings.h> for rindex(3). Thanks to Paul Hargrove. Please CMR:v1.5 This commit was SVN r23671.	2010-08-26 13:42:36 +00:00
Shiqing Fan	9911797867	Rename a few odls help files for Windows installation. This commit was SVN r23668.	2010-08-26 09:31:18 +00:00
Ralph Castain	2e223abe33	Restore the auto-poll method for detecting debugger attachment, but only in the mpirx debugger module and only if the corresponding rate mca param is set. Guess we missed it before, but add the debugger framework to the orte-info and ompi_info tools This commit was SVN r23667.	2010-08-25 22:52:33 +00:00
Ralph Castain	f72cdc4160	Update the compare_name_fields function to allow the caller to specify that wildcard values are to be treated as wildcards This commit was SVN r23663.	2010-08-25 15:35:41 +00:00
Ralph Castain	4ecd9a0bbe	Protect against an obscure race condition that AFAICT only occurs when we are in a loop waiting to recv a message from a peer who is then killed by signal. This commit was SVN r23662.	2010-08-25 15:35:01 +00:00
Ralph Castain	f1a00c9a21	Per Jeff's inquiry, play chicken and don't assume herror exists everywhere. This commit was SVN r23656.	2010-08-24 20:46:41 +00:00
Jeff Squyres	207ca2d928	This commit is the first of several steps in a paffinity makeover extravaganza. = Short version = This commit does several things, but the short version is that it re-orients the error message creation of the ODLS default module to generate error strings in the child process for errors that occur after the fork but before the exec (such errors are ''usually'' related to paffinity). A show_help string is rendered in the child and then IPC'ed up to the parent, who displays the string through normal ORTE show_help aggregation mechanisms. We also broke up the ginormous paffinity-setting logic into a few separate functions, both to help us understand the code, and hopefully to ease future maintenance. The logic for the ODLS default binding should not have changed -- this is mainly a code reshuffle and improvement on error reporting. = Rationale = The reasoning for this commit is complex. As mentioned above, it's the first step in some paffinity cleanup. Here's the line of dominoes that must fall (in this order): 1. Add hwloc paffinity component (already done). 1. While testing hwloc, we discovered that the error reporting from the ODLS default module was abysmal. So we fixed it. 1. Further, we reorganized the code in the odls_default_module.c a bit to help our understanding of it. 1. We also discovered a few bugs in the original ODLS default module logic that existed before this code shuffle; separate tickets will be filed to fix them. 1. Next up will be some improvements to paffinity / odls default to make the act of binding to a core ensure to bind to ''all'' hardware threads contained in that core (similar for sockets: binding to a socket will bind to ''all'' hardware threads in that socket). 1. Next will be improvements to paffinity to expose binding to hardware threads through the paffinity framework API. 1. Finally, we'll expose these binding controls to the user (e.g., through mpirun command line arguments, MCA parameters, etc.). This commit represents the first few bullets; the last 4 bullets are being worked on right now, but there is no definite timeline for completion. = Miscellaneous = A few points worth mentioning: * We have tested this new code a bunch; we're pretty sure it behaves just like the trunk -- but with better / more precise error reporting. More testing is needed on a wider array of platforms, however. * A big comment at the top of odls_default_module.c explains the (new) general scheme for the error reporting. * The error reporting in the parent process is now really dumb; almost all the intelligence about creating error messages is in the child. * The show_help file was renamed to be more consistent with other help files (help-odls-default.txt -> help-orte-odls-default.txt) * Removed the use of sched_yield() because of recent changes in the Linux 2.6.3x kernels. We already had an #else clause for select()'ing for 1us if we didn't have sched_yield() -- that is now the only code path. This is not a performance-critical section of the code, so this shouldn't be controversial. * Replaced the macro-based error reporting with function-based reporting. It's a bit more bulky, but it helped us understand the code and saved us multiple times with compile-time parameter checking, etc. * Cleaned up the use of several show_help messages to ensure that they mapped to real messages in help*.txt files. This commit was SVN r23652.	2010-08-24 19:38:29 +00:00
Jeff Squyres	2c03554fe7	Add new function: orte_show_help_norender(). It is exactly the same as orte_show_help(), but it takes a fully-rendered string instead of a varargs list that must be rendered. This function is useful in cases where one entity renders the "show help" string and a different entity sends the string via the normal orte "show help" mechanisms for aggregation, etc. Example usage: errors occur in the ODLS after forking but before exec'ing. In such cases, it makes sense for the child process to render the "show help" string because it has all the details about the error. But the child process can't call orte_show_help() itself because it is not an ORTE process -- it can't OOB send the message to the HNP, etc. After rendering the help string, the child sends the rendered string to its parent via normal IPC (e.g., via a pipe) and the parent can then invoke orte_show_help_norender() with the ready-to-go string. The message then displays out via the normal mechanisms (i.e., out via the HNP, aggregated/coalesced, etc.). This commit was SVN r23651.	2010-08-24 19:12:57 +00:00

3236 commits