openmpi

Author	SHA1	Message	Date
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). 
Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Ralph Castain	f28448702a	Eliminate malloc by utilizing /proc/self/fd - optimization	2015-09-22 07:24:54 -07:00
Ralph Castain	92ae386a34	As Jeff proposed, change the check to test whether the filename's first character is a digit	2015-09-21 08:22:58 -07:00
Ralph Castain	c167acc5a7	Cleanup the odls "close file descriptor" commit to conform to OMPI coding standards and remove memory leaks	2015-09-19 20:46:36 -07:00
Piotr Lesnicki	1dd5487fae	odls: close only used file descriptors at fork/exec	2015-09-18 16:44:57 +02:00
Jeff Squyres	31b329e585	odls default: ensure to initialize opts This fixes CID 71127.	2015-08-12 05:27:37 -07:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Ralph Castain	ea35e47228	Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail. Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time. We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later. This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.	2015-05-29 14:37:14 -07:00
Elena	03fc809bc9	This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory.	2014-11-06 16:01:19 +02:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “lmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. 
This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Ralph Castain	5dbf4a62c4	Cleanup: we were accidentally killing ourselves (bad idea) Refs trac:4717 This commit was SVN r32022. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 20:38:42 +00:00
Ralph Castain	42bf7466fc	This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces very long delays at scale. Just kill the procs and move on. Refs trac:4717 This commit was SVN r32019. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 17:57:51 +00:00
Ralph Castain	8736a1c138	Per RFC: http://www.open-mpi.org/community/lists/devel/2014/05/14822.php Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root). This commit was SVN r31916.	2014-06-01 16:14:10 +00:00
Ralph Castain	87d809eefe	Add a new "run-time controls" framework for setting controls on processes. Initially, just move the process binding code there under a new "hwloc" component. Additional components to support cgroups, power settings, etc. to follow This commit was SVN r31633.	2014-05-05 19:22:06 +00:00
Jeff Squyres	e1655ae68d	opal/util/fd.c: add new convenience function for setting FD_CLOEXEC Paul Hargrove pointed out that Stevens tells us that we should F_GETFD before F_SETFD. And so we shall. Make a new convenience function to do this (opal_fd_set_cloexec()), just so that we don't have to litter this 2-step process throughout the code. Refs trac:4550 This commit was SVN r31513. The following Trac tickets were found above: Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550	2014-04-24 13:04:49 +00:00
Jeff Squyres	3da579139b	More corrections w.r.t. process groups To accompany r31092 and r310924, also ensure to create a new process group in the child right after the orted forks. Add trivial configury to ensure that we have setpgid, and only do the setpgid/getpgid if we have setpgid. Without this commit, killing the entire process group can do unexpected things (e.g., kill the orted, mpirun, and even mpirun's parent!). cmr=v1.7.5:reviewer=rhc This commit was SVN r31132. The following SVN revision numbers were found above: r31092 --> open-mpi/ompi@99c9ecaed0 The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r310924	2014-03-18 21:31:01 +00:00
Ralph Castain	545ac7dc58	Remove the job_control_forwarding logic as we want any signal to go to all members of the process group Refs trac:4404 This commit was SVN r31094. The following Trac tickets were found above: Ticket 4404 --> https://svn.open-mpi.org/trac/ompi/ticket/4404	2014-03-17 22:45:33 +00:00
Ralph Castain	99c9ecaed0	Ensure that we send the specified signal to the entire process group of each member of the pid provided to us. This ensures that any children spawned by our children also see the signal cmr=v1.7.5:reviewer=jsquyres This commit was SVN r31092.	2014-03-17 22:12:15 +00:00
Ralph Castain	081669b440	When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings This commit was SVN r30968.	2014-03-10 15:53:07 +00:00
Ralph Castain	c9465d97b4	Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test. Thanks to Paul Kapinos for reporting the problem. cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition This commit was SVN r30942.	2014-03-05 04:38:17 +00:00
Ralph Castain	0ac97761cc	Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named. The problem arises when a hostfile is used, and the user provides host names without specifying the slots= parameter. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default. In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information. Also clean up a couple of issues in the mapping/binding system: * ensure we only override the binding directive if we are oversubscribed and overload is not allowed * ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch * minor cleanup to the warning message when oversubscribed and binding was requested cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system This commit was SVN r30909.	2014-03-03 16:46:37 +00:00
Ralph Castain	61a21e4f31	Based on Tetsuya's patch, with some changes, correct the case of map-by node where multiple cpus/rank are requested and result in a non-integer match with num slots. Also correct tests for binding policy given to use the proper macro. Refs trac:4296 This commit was SVN r30857. The following Trac tickets were found above: Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296	2014-02-26 18:12:23 +00:00
Ralph Castain	193cceb483	Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps. Yippee skippee... This commit was SVN r30513.	2014-01-30 23:50:14 +00:00
Ralph Castain	fb9e427320	One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do not bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound. Refs trac:4077 This commit was SVN r30200. The following Trac tickets were found above: Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077	2014-01-09 22:39:34 +00:00
Ralph Castain	f179f2086b	Do a better job of reporting bindings - if someone gives a spec that binds us to all processors, then we are effectively unbound and should report it clearly instead of outputting a long line of B's. cmr=v1.7.4:reviewer=jsquyres:subject=Do a better job of reporting bindings This commit was SVN r30179.	2014-01-09 16:16:16 +00:00
Ralph Castain	55cd65b149	Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along. Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff. cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings This commit was SVN r29978.	2013-12-19 16:31:45 +00:00
Jeff Squyres	349ee654c1	Fix some --without-hwloc compile errors. Also remove one assigned-but-not-used variable assignment. This commit was SVN r28321.	2013-04-10 15:08:31 +00:00
Ralph Castain	1f011bef99	Cleanup the updated sys limits capability. Fix a few copy/paste bugs (my bad). Shift the limit set to the ODLS default module so that we set the limits for all apps, even those that don't call opal_init. Leave it in opal_init as well to support direct-launch apps, but ensure we only set the limits once by removing the envar after launch by ODLS. Provide some nice error messages if we fail to set the limits. Since the user had to specifically request we set the limit, treat failure as an error-out situation. This commit was SVN r28288.	2013-04-04 16:00:17 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file cannot be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char **, int *, or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few components actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump(). 
opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Ralph Castain	347df93cd4	Handle the case of someone specifying a directory for the application. Ensure we get a non-zero exit status and clarify the error message. cmr:v1.7 This commit was SVN r28119.	2013-02-27 01:36:21 +00:00
Ralph Castain	b9897267ef	Cleanup report-bindings so it always reports the actual binding instead of what was requested. Ensure we don't report twice if it is an MPI process being launched. This commit was SVN r28057.	2013-02-14 17:24:28 +00:00
Nathan Hjelm	2acd0f83de	Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter". It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occurring before the command line parser changes but it appears to be resolved now. This commit was SVN r27527. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-30 19:45:18 +00:00
Ralph Castain	e6014bf2e1	Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter This commit was SVN r27477. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-24 18:38:44 +00:00
Nathan Hjelm	d59034e6ef	MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions. cmr:v1.7 This commit was SVN r27451.	2012-10-17 20:17:37 +00:00
Ralph Castain	d5279b0dc8	Make an attempt to protect hwloc cset2str from segfaulting in weird scenario This commit was SVN r27361.	2012-09-23 16:51:51 +00:00
Jeff Squyres	0b8849e2c4	Make "mpirun --report-bindings" have a user-friendly output (i.e., readable by normal human beings, vs. having a bitmap of physical PU's). Use the new hwloc base prettyprint functions to generate the output. This commit was SVN r26533.	2012-06-01 16:35:31 +00:00
Jeff Squyres	2ba10c37fe	Per RFC, bring in the following changes: * Remove paffinity, maffinity, and carto frameworks -- they've been wholly replaced by hwloc. * Move ompi_mpi_init() affinity-setting/checking code down to ORTE. * Update sm, smcuda, wv, and openib components to no longer use carto. Instead, use hwloc data. There are still optimizations possible in the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old carto-based code found out how many NUMA nodes were ''available'' -- not how many were used ''in this job''. The new hwloc-using code computes the same value -- it was not updated to calculate how many NUMA nodes are used ''by this job.'' * Note that I cannot compile the smcuda and wv BTLs -- I ''think'' they're right, but they need to be verified by their owners. * The openib component now does a bunch of stuff to figure out where "near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors (I do not have a NUMA machine with an OpenFabrics device that is a non-uniform distance from multiple different NUMA nodes). * Completely rewrite the OMPI_Affinity_str() routine from the "affinity" mpiext extension. This extension now understands hyperthreads; the output format of it has changed a bit to reflect this new information. * Bunches of minor changes around the code base to update names/types from maffinity/paffinity-based names to hwloc-based names. * Add some helper functions into the hwloc base, mainly having to do with the fact that we have the hwloc data reporting ''all'' topology information, but sometimes you really only want the (online | available) data. This commit was SVN r26391.	2012-05-07 14:52:54 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Ralph Castain	b3aabf1565	Cleanup the --without-hwloc build. Thanks to Paul Hargrove for reporting it broken. This commit was SVN r25931.	2012-02-15 11:08:57 +00:00
Ralph Castain	6310361532	At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation. In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions: 1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior. 2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation. 3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so. As Jeff stated, it is impossible to fully test a change of this size. 
I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes. This commit was SVN r25476.	2011-11-15 03:40:11 +00:00
Ralph Castain	054c485dcf	Cleanup a race condition and an unreliable method that caused us to not properly handle procs that trapped sigterm for cleanup purposes while ORTE was trying to kill them. Thanks to Rick Payne and Ian Wells of Cisco for spending weeks chasing this down. Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-( This commit was SVN r25285.	2011-10-14 15:39:54 +00:00
George Bosilca	454519842e	Report bindings if requested. This commit was SVN r24743.	2011-06-02 17:17:10 +00:00
Jeff Squyres	eaab8d0062	* Ensure to set paffinity_enabled in all cases * Ensure to set the mask value before we use it This commit was SVN r23861.	2010-10-07 15:48:49 +00:00
Jeff Squyres	207ca2d928	This commit is the first of several steps in a paffinity makeover extravaganza. = Short version = This commit does several things, but the short version is that it re-orients the error message creation of the ODLS default module to generate error strings in the child process for errors that occur after the fork but before the exec (such errors are ''usually'' related to paffinity). A show_help string is rendered in the child and then IPC'ed up to the parent, who displays the string through normal ORTE show_help aggregation mechanisms. We also broke up the ginormous paffinity-setting logic into a few separate functions, both to help us understand the code, and hopefully to ease future maintenance. The logic for the ODLS default binding should not have changed -- this is mainly a code reshuffle and improvement on error reporting. = Rationale = The reasoning for this commit is complex. As mentioned above, it's the first step in some paffinity cleanup. Here's the line of dominoes that must fall (in this order): 1. Add hwloc paffinity component (already done). 1. While testing hwloc, we discovered that the error reporting from the ODLS default module was abysmal. So we fixed it. 1. Further, we reorganized the code in the odls_default_module.c a bit to help our understanding of it. 1. We also discovered a few bugs in the original ODLS default module logic that existed before this code shuffle; separate tickets will be filed to fix them. 1. Next up will be some improvements to paffinity / odls default to make the act of binding to a core ensure to bind to ''all'' hardware threads contained in that core (similar for sockets: binding to a socket will bind to ''all'' hardware threads in that socket). 1. Next will be improvements to paffinity to expose binding to hardware threads through the paffinity framework API. 1. Finally, we'll expose these binding controls to the user (e.g., through mpirun command line arguments, MCA parameters, etc.). 
This commit represents the first few bullets; the last 4 bullets are being worked on right now, but there is no definite timeline for completion. = Miscellaneous = A few points worth mentioning: * We have tested this new code a bunch; we're pretty sure it behaves just like the trunk -- but with better / more precise error reporting. More testing is needed on a wider array of platforms, however. * A big comment at the top of odls_default_module.c explains the (new) general scheme for the error reporting. * The error reporting in the parent process is now really dumb; almost all the intelligence about creating error messages is in the child. * The show_help file was renamed to be more consistent with other help files (help-odls-default.txt -> help-orte-odls-default.txt) * Removed the use of sched_yield() because of recent changes in the Linux 2.6.3x kernels. We already had an #else clause for select()'ing for 1us if we didn't have sched_yield() -- that is now the only code path. This is not a performance-critical section of the code, so this shouldn't be controversial. * Replaced the macro-based error reporting with function-based reporting. It's a bit more bulky, but it helped us understand the code and saved us multiple times with compile-time parameter checking, etc. * Cleaned up the use of several show_help messages to ensure that they mapped to real messages in help*.txt files. This commit was SVN r23652.	2010-08-24 19:38:29 +00:00
Ralph Castain	7190415977	Fix JEFF's mistake - we cannot use orte_show_help if execv fails because we already closed all the file descriptors! This commit was SVN r23334.	2010-07-01 19:41:26 +00:00
Jeff Squyres	f1a7b5cc33	Make "processor affinity not supported" error message a little better: * Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic OPAL_ERR_NOT_SUPPORTED case. * When odls_default detects that processor affinity is not supported, it prints a specific message about it, and then it suppresses a generic HNP help message that would normally follow it (i.e., it's easier to have the "processor affinity is not supported" show_help message last). * Use some symbolic names in odls_default instead of fixed int's, just for slight readability improvements in the code. * Introduce orte_show_help_suppress(), which gives the ability to suppress any future showings of any arbitrary show_help() message. This is useful if you display message X and want to suppress message Y. This suppression only works in environments where orte_show_help() does coalescing. This commit was SVN r23249.	2010-06-08 20:16:07 +00:00
Jeff Squyres	fec7918eea	Some paffinity functions had their return status overloaded: * If < 0, it's an OPAL_ERR_* value * If >= 0, it's the actual output value of the function This is problematic for the OPAL_SOS stuff. This commit changes those functions to always return OPAL_* statuses and send the output value back through output parameters (like 95% of the rest of the code base). This avoids the confusion with OPAL_SOS stuff and makes paffinity work again (e.g., mpirun --bind-to-core ...). I updated all paffinity modules for the new function signatures, and bumped the paffinity API version up to 2.0.1. I don't think the version change will matter, though, because we'll be introducing support for hardware threads soon, which will either bump the paffinity version again or we'll replace paffinity with a new framework. This commit was SVN r23197.	2010-05-21 16:55:28 +00:00


168 commits
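Commits 1dd5487fae, c167acc5a7, 92ae386a34 and f28448702a change fork/exec so that only descriptors that are actually open get closed. Those descriptors are found by listing /proc/self/fd and accepting only entries whose name starts with a digit. The rule can be sketched as below.

This is a minimal Python sketch, not the ODLS C code. The os module stands in for the POSIX calls. The names select_fds_to_close() and close_inherited_fds() are hypothetical. It assumes Linux for /proc/self/fd and falls back to probing every descriptor below SC_OPEN_MAX elsewhere.

```python
import errno
import os


def select_fds_to_close(entries, keep=frozenset()):
    """Pick descriptors to close from a listing of /proc/self/fd.

    Only names whose first character is a digit count as descriptors, so
    "." and ".." (which readdir() returns in C) are skipped. Descriptors
    0-2 and anything in `keep` are left open.
    """
    fds = []
    for name in entries:
        if not name[:1].isdigit():
            continue
        fd = int(name)
        if fd > 2 and fd not in keep:
            fds.append(fd)
    return sorted(fds)


def close_inherited_fds(keep=frozenset()):
    """Close every open descriptor above 2 except those in `keep`.

    Meant to run in the child between fork() and exec().
    """
    try:
        entries = os.listdir("/proc/self/fd")        # Linux: only open fds
    except OSError:
        # No /proc: fall back to probing the whole descriptor table.
        entries = [str(fd) for fd in range(os.sysconf("SC_OPEN_MAX"))]
    for fd in select_fds_to_close(entries, keep):
        try:
            os.close(fd)
        except OSError as exc:
            # listdir() has already closed the descriptor it used to read
            # the directory, so EBADF is expected for that entry.
            if exc.errno != errno.EBADF:
                raise
```

The selection is split out as a pure function. That keeps the digit-prefix rule testable without closing anything in the calling process.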
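The centralized listening thread from commit ea35e47228 only reaps accept() calls from several registered listening sockets and hands each new connection off. Because of that, the backlog keeps draining even while handshakes are slow.

The sketch below is a hypothetical Python rendering of that idea. The class name, the tags, the polling interval and the queue hand-off are all assumptions; per the commit message, ORTE passes accepted connections to its event library instead.

```python
import queue
import select
import socket
import threading


class ListenerThread:
    """Hypothetical sketch: one thread reaps accept() on every registered
    listening socket and queues each new connection. Handshakes and all
    further processing are left to whoever drains `accepted`."""

    def __init__(self, poll_interval=0.1):
        self.accepted = queue.Queue()      # (tag, connected socket) pairs
        self._listeners = {}               # listening socket -> tag
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self._thread = threading.Thread(target=self._run, daemon=True)

    def register(self, sock, tag):
        """Add a listening socket; `tag` tells the consumer who owns it."""
        sock.setblocking(False)
        with self._lock:
            self._listeners[sock] = tag

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            with self._lock:
                listeners = dict(self._listeners)
            if not listeners:
                self._stop.wait(self._poll_interval)
                continue
            ready, _, _ = select.select(list(listeners), [], [], self._poll_interval)
            for sock in ready:
                while True:                # drain everything in the backlog
                    try:
                        conn, _addr = sock.accept()
                    except BlockingIOError:
                        break
                    conn.setblocking(True)
                    self.accepted.put((listeners[sock], conn))
```

Draining each ready socket completely on every wakeup is what keeps the kernel backlog from filling up. A burst of clients can arrive faster than one connection per loop.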
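Commit e1655ae68d wraps the two-step close-on-exec update in opal_fd_set_cloexec(). For FD_CLOEXEC, the fcntl commands involved are F_GETFD and F_SETFD; F_GETFL and F_SETFL manage the separate file-status flags.

A Python sketch of the same read-modify-write follows, under the hypothetical name set_cloexec(). Python 3.4 and later already create descriptors non-inheritable, so real Python code would normally just call os.set_inheritable(). This function only mirrors the fcntl sequence.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark `fd` close-on-exec without clobbering its other descriptor flags.

    Read the current flags with F_GETFD first, then OR in FD_CLOEXEC and
    write them back with F_SETFD. fcntl.fcntl() raises OSError on failure.
    """
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
```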
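Commits 99c9ecaed0 and 3da579139b rely on two pieces. First, the child becomes leader of its own process group right after fork. Second, signals go to the whole group, so grandchildren see them too. A minimal Python sketch under those assumptions follows.

spawn_in_own_group() and signal_group() are hypothetical names. Calling setpgid() in both parent and child is a common idiom to close the startup race, not necessarily what the orted does.

```python
import os
import signal


def spawn_in_own_group(argv):
    """fork()+exec() argv with the child leading a new process group.

    Both sides call setpgid(); whichever runs first creates the group, so
    it exists before this function returns.
    """
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)            # child: become leader of a new group
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)               # only reached if setpgid/exec failed
    try:
        os.setpgid(pid, pid)            # parent: same call, closes the race
    except OSError:
        pass                            # child already exec'd (EACCES) or exited
    return pid


def signal_group(pid, sig):
    """Send `sig` to every member of the group led by `pid` (pgid == pid)."""
    os.killpg(pid, sig)
```

Because the child sits in its own group, signalling that group cannot reach the launcher or the launcher's parent. The commit message warns that this could happen without setpgid.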
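The slot-assignment defaults described in commit 0ac97761cc can be illustrated with a small parser. This is a hypothetical Python sketch, not the ORTE allocator. Its assumptions are:

- Hostfiles use the `hostname slots=N` syntax, with `#` comments.
- Each host appears only once.
- A caller-supplied cores_of() replaces hardware detection. It could equally return hardware-thread counts when use_hwthreads_as_cpus is set.
- Each mention of a node in -host counts as one slot.

```python
def parse_hostfile(lines, cores_of):
    """Return {host: slots} for hostfile lines such as "node01 slots=4".

    A line with no slots= field gets one slot per core, as reported by
    the caller-supplied cores_of(host). Assumes each host is listed once.
    """
    slots = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        host, *fields = line.split()
        count = None
        for field in fields:
            key, _, value = field.partition("=")
            if key == "slots":
                count = int(value)
        slots[host] = count if count is not None else cores_of(host)
    return slots


def parse_dash_host(spec):
    """-host a,a,b -> {"a": 2, "b": 1}: one slot each time a node is named."""
    slots = {}
    for host in spec.split(","):
        slots[host] = slots.get(host, 0) + 1
    return slots
```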
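Commit 207ca2d928 renders error messages in the child, after fork but before exec, and sends them up to the parent. Commit 7190415977 shows that this channel must survive the child closing its other descriptors.

One common way to build such a channel is a pipe whose write end is close-on-exec. A successful exec closes it silently, so EOF with no data means success. The Python sketch below assumes that design. spawn_reporting_exec_errors() is a hypothetical name, and the message format is illustrative, not ORTE's show_help text.

```python
import os


def spawn_reporting_exec_errors(argv):
    """fork()+exec() argv and return (pid, error_text or None).

    The child reports exec failures over a pipe whose write end is
    close-on-exec: a successful exec closes it with nothing written, so the
    parent reads EOF and gets no message; a failed exec writes one first.
    """
    r, w = os.pipe()
    os.set_inheritable(w, False)       # explicit: exec must close the write end
    pid = os.fork()
    if pid == 0:                       # child
        try:
            os.close(r)
            try:
                os.execvp(argv[0], argv)
            except OSError as exc:     # render the message here, in the child
                os.write(w, f"cannot exec {argv[0]}: {exc.strerror}".encode())
        finally:
            os._exit(127)              # reached only if exec failed
    os.close(w)                        # parent keeps only the read end
    chunks = []
    while True:
        data = os.read(r, 4096)
        if not data:                   # EOF: exec succeeded or child exited
            break
        chunks.append(data)
    os.close(r)
    message = b"".join(chunks).decode()
    return pid, (message or None)
```

The parent stays deliberately simple: it only displays whatever text arrives. This matches the commit's split, where the error-message logic lives in the child.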