openmpi

Author	SHA1	Message	Date
Ralph Castain	abedb97be4	Resolve race condition when procs call MPI_Abort. Since we go thru the errmgr instead of the normal proc termination routines, we need to ensure we mark that the proc has fired its waitpid and is no longer alive. Otherwise, the local daemon won't terminate because it thinks there is still a local proc alive and we hang. Thanks to Gilles for tracking it down. cmr=v1.8.2:reviewer=rhc This commit was SVN r32460.	2014-08-08 15:58:49 +00:00
Ralph Castain	6c5e592785	Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. Update the orte/test/mpi/coll_test.c test to show the revised example. This commit was SVN r32234. The following SVN revision numbers were found above: r32203 --> open-mpi/ompi@a523dba41d r32210 --> open-mpi/ompi@2ce11ed5c4 r32222 --> open-mpi/ompi@d55f16db50	2014-07-15 03:48:00 +00:00
Ralph Castain	a523dba41d	NOTE: this modifies the MPI-RTE interface. We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch, which required modifications to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunting. This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't currently have a way of tagging the collective to a specific thread. From the comment in the code: In order to allow scalable generation of collective id's, they are formed as follows. The top 32-bits are the jobid of the procs involved in the collective. For collectives across multiple jobs (e.g., in a connect_accept), the daemon jobid will be used as the id will be issued by mpirun. This won't cause problems because daemons don't use the collective_id. The bottom 32-bits are a rolling counter that recycles when the max is hit. The daemon will cleanup each collective upon completion, so this means a job can never have more than 2^32 collectives going on at a time. If someone needs more than that - they've got a problem. Note that this means (for now) that RTE-level collectives cannot be done by individual threads - they must be done at the overall process level. This is required as there is no guaranteed ordering for the collective id's, and all the participants must agree on the id of the collective they are executing. So if thread A on one process asks for a collective id before thread B does, but B asks before A on another process, the collectives will be mixed and not result in the expected behavior. We may find a way to relax this requirement in the future by adding a thread context id to the jobid field (maybe taking the lower 16-bits of that field). This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives. This commit was SVN r32203.	2014-07-10 18:53:12 +00:00
Ralph Castain	356e7ea904	Move all collective id's into the attributes and let the job pack/unpack take care of them instead of singling them out. Add the envars just prior to forking the children instead of into the launch message itself. Remove a few #if CR as the attributes functionality can handle this condition now. This commit was SVN r32133.	2014-07-03 15:58:13 +00:00
Adrian Reber	cabf1d4e68	use the orte attributes in the FT code to fix compile errors This commit was SVN r32093.	2014-06-26 03:19:17 +00:00
Ralph Castain	34e5573988	Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination. This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose. Refs trac:4717 This commit was SVN r32063. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-21 17:09:02 +00:00
Ralph Castain	5dbf4a62c4	Cleanup: we were accidentally killing ourselves (bad idea) Refs trac:4717 This commit was SVN r32022. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 20:38:42 +00:00
Ralph Castain	42bf7466fc	This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces very long delays at scale. Just kill the procs and move on. Refs trac:4717 This commit was SVN r32019. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 17:57:51 +00:00
Ralph Castain	8e768de317	Really should check our own node for oversubscription, not the HNP cmr=v1.8.2:reviewer=jsquyres This commit was SVN r31946.	2014-06-04 03:09:02 +00:00
Ralph Castain	8736a1c138	Per RFC: http://www.open-mpi.org/community/lists/devel/2014/05/14822.php Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root). This commit was SVN r31916.	2014-06-01 16:14:10 +00:00
Ralph Castain	5602156a1c	Use the correct abstraction layer name for the data dirs This commit was SVN r31684.	2014-05-08 14:32:24 +00:00
Ralph Castain	34988ba2a2	Cleanup the MPI_Abort detection Refs trac:4576 This commit was SVN r31561. The following Trac tickets were found above: Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576	2014-04-30 00:51:59 +00:00
Ralph Castain	e05b88fd18	Take another stab at resolving the "called-abort" requirement without getting stuck. Return to "drop a turd" mode, perhaps with a little more intelligence behind it. Don't worry about catching it if session dirs weren't created cmr=v1.8.2:reviewer=jsquyres:subject=cleanup MPI_Abort hangs This commit was SVN r31543.	2014-04-29 17:29:46 +00:00
Ralph Castain	a368e84e70	Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pickup crosstalk in components This commit was SVN r31403.	2014-04-15 21:48:24 +00:00
Ralph Castain	c9465d97b4	Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test. Thanks to Paul Kapinos for reporting the problem. cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition This commit was SVN r30942.	2014-03-05 04:38:17 +00:00
Ralph Castain	0ac97761cc	Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named. The problem arises when a hostfile is used, and the user provides host names without specifying the slots= parameter. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default. In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information. Also clean up a couple of issues in the mapping/binding system: * ensure we only override the binding directive if we are oversubscribed and overload is not allowed * ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch * minor cleanup to the warning message when oversubscribed and binding was requested cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system This commit was SVN r30909.	2014-03-03 16:46:37 +00:00
Adrian Reber	fde1040d2f	Use unique collective ids for the checkpoint/restart code This commit was SVN r30552.	2014-02-04 14:03:05 +00:00
Brian Barrett	8b778903d8	Fix longstanding issue with our multi-project support. Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is always set to {datadir,libdir,includedir}/openmpi. This will keep us from having help files in prefix/share/open-rte when building without Open MPI, but in prefix/share/openmpi when building with Open MPI. This commit was SVN r30140.	2014-01-07 22:11:15 +00:00
Ralph Castain	d049731911	Add pubsub pmi component to list of components to avoid when indirect launch used Refs trac:4032 This commit was SVN r30083. The following Trac tickets were found above: Ticket 4032 --> https://svn.open-mpi.org/trac/ompi/ticket/4032	2013-12-25 16:25:37 +00:00
Ralph Castain	81df8d09ca	Avoid use of PMI components when launched via mpirun as this is just unnecessary overhead that can cause confusion. cmr=v1.7.4:reviewer=miked:subject=Avoid use of PMI components when launched via mpirun This commit was SVN r30078.	2013-12-24 16:32:31 +00:00
Ralph Castain	f4f2287958	Singletons currently start out by spawning an HNP - this is required solely in the cases where the singleton subsequently calls MPI_Comm_spawn or publishes port info without support from an external orte-server. In all other cases, the HNP is of no value and can actually be a detriment by creating additional overhead on the node. This is particularly concerning for async operations where processes may begin as singletons and then dynamically wireup to perform pt2pt communications. So we now allow singletons to start on their own, only spawning an HNP when initiating an operation that actually requires it. cmr:v1.7.4:reviewer=jsquyres This commit was SVN r29354.	2013-10-04 02:58:26 +00:00
Ralph Castain	e8697de521	Deal with PGI compilers on the Mac by initializing a global variable. cmr:v1.6.6:reviewer=jsquyres cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29129.	2013-09-05 21:40:50 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file cannot be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char *, int *, or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few components actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump(). 
opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Ralph Castain	cf9796accd	Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes This commit was SVN r28134.	2013-02-28 01:35:55 +00:00
Ralph Castain	347df93cd4	Handle the case of someone specifying a directory for the application. Ensure we get a non-zero exit status and clarify the error message. cmr:v1.7 This commit was SVN r28119.	2013-02-27 01:36:21 +00:00
Ralph Castain	c0b670bea8	I guess some profiling tools and debuggers require that the argv[0] of each rank be unique so they can create a filename based on that value. For those obscure cases, provide an mpirun cmd line option that indexes each argv[0] by rank This commit was SVN r28064.	2013-02-15 20:20:49 +00:00
Ralph Castain	2504da1ac9	Remove stale code - message arrival time doesn't really mean much anymore. This commit was SVN r27905.	2013-01-24 23:02:02 +00:00
Ralph Castain	bd887f7f56	Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations. This commit was SVN r27587.	2012-11-10 14:09:12 +00:00
Nathan Hjelm	bdedd8b0d3	Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1. Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close as they are now symmetric in their functionality. This commit was SVN r27570.	2012-11-06 19:09:26 +00:00
Nathan Hjelm	2acd0f83de	Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter". It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occurring before the command line parser changes but it appears to be resolved now. This commit was SVN r27527. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-30 19:45:18 +00:00
Ralph Castain	e6014bf2e1	Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter This commit was SVN r27477. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-24 18:38:44 +00:00
Nathan Hjelm	d59034e6ef	MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions. cmr:v1.7 This commit was SVN r27451.	2012-10-17 20:17:37 +00:00
Ralph Castain	285a3b168d	Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds. This commit was SVN r27445.	2012-10-14 03:31:32 +00:00
George Bosilca	6ec41400b3	Fix the error message in case a daemon does not succeed at killing the local offspring. This commit was SVN r27362.	2012-09-24 15:25:21 +00:00
Ralph Castain	c82cfecc1c	Cleanup comm_spawn for the multi-node case where at least one new process isn't spawned on every node. Avoid the complexities of trying to execute a daemon collective across the dynamic spawn as it becomes too hard to ensure that all daemons participate or are accounted for - instead, use a less scalable but workable solution of sending the data directly between the participating procs. Ensure that singletons get their collectives properly defined at startup so the spawned "HNP" is ready for them. As a secondary cleanup, the HNP doesn't need to update its nidmap during an xcast as it already has an up-to-date picture of the situation. So just dump that data and move along. This commit was SVN r27318.	2012-09-12 11:31:36 +00:00
Ralph Castain	387f657fc2	Nuts - forgot to include this with the MPI Ticket 313 stuff. Set some of the envars needed for MPI_INFO_ENV This commit was SVN r27304.	2012-09-11 20:35:09 +00:00
Ralph Castain	38ce23db43	Add some protection to allow NULL bytes in byte objects and NULL strings to be handled cleanly in nidmaps and modex entries. Ensure there is a valid nidmap available for the HNP to pass down to any local procs when it is operating alone. This commit was SVN r27188.	2012-08-31 01:07:36 +00:00
Ralph Castain	1b659de132	Get staged execution working on multi-node setups. Improve efficiency by only remapping if all procs not yet mapped in the job. This commit was SVN r27181.	2012-08-29 20:35:52 +00:00
Ralph Castain	98580c117b	Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine. Remove some stale configure.m4's we no longer need. Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages. This commit was SVN r27165.	2012-08-28 21:20:17 +00:00
Ralph Castain	d6cbff6d4e	Since the preload flags are at the app_context level, we need to link only those files/exe's that pertain to each app_context to the corresponding procs. Also, gain a little optimization by checking to ensure we only send files once - this probably won't work when daemons are created on-the-fly, but that's for some other day This commit was SVN r27134.	2012-08-24 16:16:30 +00:00
Ralph Castain	e0c39c94e8	Complete the cleanup of the preload files system. Remove the dest_dir option as moving things to arbitrary locations - especially absolute paths - can prove disastrous. Remove the preload_libs option as these can be treated as just files. Cleanup some of the pack/unpack code as the dss handles NULL strings just fine. Deal a little better with absolute paths, noting that tar now strips the leading '/' for us (showing my age as it didn't used to do so). Remove the odls_base_state.c file as that code is now covered by the new broadcast form of preload_files. This commit was SVN r27127.	2012-08-24 02:28:29 +00:00
Ralph Castain	b4a544ad2a	Per discussion with Josh, use the --preload-xxx cmd line options to broadcast files to all nodes. Add --set-cwd-to-session-dir option to start procs in their session directories. Add OMPI_FILE_LOCATION envar to tell procs where their prepositioned files went. This commit was SVN r27125.	2012-08-23 21:28:05 +00:00
Ralph Castain	7237a938bf	Extend the filem interface to support prepositioning and linking required local files for execution. Create a new "raw" module that uses xcast to send the files to all nodes as this is faster than doing an scp in a linear pattern This commit was SVN r27118.	2012-08-22 21:43:20 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job "first_rank" - the MPI rank of the first process in each app_context "np" - the number of procs in each app_context Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
Ralph Castain	0dfe29b1a6	Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required. Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework. Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one. This commit was SVN r26678.	2012-06-27 14:53:55 +00:00
Ralph Castain	078a4667e4	Some more cleanup on direct routed when daemons are involved This commit was SVN r26594.	2012-06-11 23:46:22 +00:00
Ralph Castain	d6279fc971	Fix the debugger daemon launch support to fit the new state machine. Treat debugger daemons just like any other job, except that we map them only to nodes where an app process currently exists (as opposed to every node in the system). Trigger breakpoint and rank0 release only after the debugger daemons are in position. This commit was SVN r26556.	2012-06-06 02:01:23 +00:00
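
The collective id layout described in commit a523dba41d (jobid in the top 32 bits, a rolling counter in the bottom 32 bits) can be illustrated with a small sketch. This is not the ORTE implementation: the type and function names below (`coll_id_t`, `coll_id_gen_t`, `coll_id_next`, and so on) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type names; the real ORTE typedefs may differ. */
typedef uint32_t jobid_t;
typedef uint64_t coll_id_t;

/* Per-job generator state: the job's id plus a rolling 32-bit counter. */
typedef struct {
    jobid_t  jobid;
    uint32_t counter;
} coll_id_gen_t;

/* Pack the jobid into the top 32 bits and the counter into the bottom 32. */
static inline coll_id_t coll_id_pack(jobid_t jobid, uint32_t counter)
{
    return ((coll_id_t)jobid << 32) | (coll_id_t)counter;
}

/* Recover the jobid (top 32 bits) from a collective id. */
static inline jobid_t coll_id_jobid(coll_id_t id)
{
    return (jobid_t)(id >> 32);
}

/* Recover the counter (bottom 32 bits) from a collective id. */
static inline uint32_t coll_id_counter(coll_id_t id)
{
    return (uint32_t)(id & 0xFFFFFFFFu);
}

/* Hand out the next id for this job. The counter is unsigned, so it
 * wraps back to 0 after reaching its maximum value. */
static inline coll_id_t coll_id_next(coll_id_gen_t *gen)
{
    assert(gen != NULL);
    coll_id_t id = coll_id_pack(gen->jobid, gen->counter);
    gen->counter++;
    return id;
}
```

Because the sketch keeps one counter per process and is not thread-aware, it has the same restriction the commit message describes: two threads requesting ids in different orders on different processes would disagree about which id belongs to which collective.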


310 commits
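
Commit a200e4f865 notes that OOB reachability is determined solely on the basis of a shared subnet address/mask. A minimal sketch of that kind of check for IPv4 is shown below; the helper names (`ipv4`, `prefix_to_mask`, `same_subnet`) are hypothetical and are not taken from the Open MPI sources, and addresses are assumed to be in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: build a host-order IPv4 address from its four octets. */
static inline uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Convert a prefix length (0..32) to a netmask. Prefix 0 is handled
 * separately because shifting a 32-bit value by 32 is undefined in C. */
static inline uint32_t prefix_to_mask(unsigned prefix)
{
    assert(prefix <= 32u);
    return prefix == 0u ? 0u : (uint32_t)(0xFFFFFFFFu << (32u - prefix));
}

/* A peer is considered reachable through a local interface when both
 * addresses fall in the same subnet under that interface's mask. */
static inline bool same_subnet(uint32_t local, uint32_t mask, uint32_t peer)
{
    return (local & mask) == (peer & mask);
}
```

As the commit message itself points out, a check of this form cannot account for peers that are reachable only through a gateway, which is why it is listed among the known limitations.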