openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	abedb97be4	Resolve race condition when procs call MPI_Abort. Since we go thru the errmgr instead of the normal proc termination routines, we need to ensure we mark that the proc has fired its waitpid and is no longer alive. Otherwise, the local daemon won't terminate because it thinks there is still a local proc alive and we hang. Thanks to Gilles for tracking it down. cmr=v1.8.2:reviewer=rhc This commit was SVN r32460.	2014-08-08 15:58:49 +00:00
Ralph Castain	6c5e592785	Revert r32222, r32210, and r32203 as they created a problem when daemon collectives did not involve app procs on every node. Instead, modify the ompi/mca/rte/orte/rte_orte.h to add a new function that allows apps to request new daemon collective ids for use in barrier and modex operations. This will only appear in ORTE-based installations, but it is only being used by a couple of researchers at the moment. Update the orte/test/mpi/coll_test.c test to show the revised example. This commit was SVN r32234. The following SVN revision numbers were found above: r32203 --> open-mpi/ompi@a523dba41d r32210 --> open-mpi/ompi@2ce11ed5c4 r32222 --> open-mpi/ompi@d55f16db50	2014-07-15 03:48:00 +00:00
Ralph Castain	a523dba41d	NOTE: this modifies the MPI-RTE interface We have been getting several requests for new collectives that need to be inserted in various places of the MPI layer, all in support of either checkpoint/restart or various research efforts. Until now, this would require that the collective id's be generated at launch. which required modification s to ORTE and other places. We chose not to make collectives reusable as the race conditions associated with resetting collective counters are daunti ng. This commit extends the collective system to allow self-generation of collective id's that the daemons need to support, thereby allowing developers to request any number of collectives for their work. There is one restriction: RTE collectives must occur at the process level - i.e., we don't curren tly have a way of tagging the collective to a specific thread. From the comment in the code: * In order to allow scalable * generation of collective id's, they are formed as: * * top 32-bits are the jobid of the procs involved in * the collective. For collectives across multiple jobs * (e.g., in a connect_accept), the daemon jobid will * be used as the id will be issued by mpirun. This * won't cause problems because daemons don't use the * collective_id * * bottom 32-bits are a rolling counter that recycles * when the max is hit. The daemon will cleanup each * collective upon completion, so this means a job can * never have more than 2*32 collectives going on at a time. If someone needs more than that - they've got * a problem. * * Note that this means (for now) that RTE-level collectives * cannot be done by individual threads - they must be * done at the overall process level. This is required as * there is no guaranteed ordering for the collective id's, * and all the participants must agree on the id of the * collective they are executing. So if thread A on one * process asks for a collective id before thread B does, * but B asks before A on another process, the collectives will * be mixed and not result in the expected behavior. We may * find a way to relax this requirement in the future by * adding a thread context id to the jobid field (maybe taking the * lower 16-bits of that field). This commit includes a test program (orte/test/mpi/coll_test.c) that cycles 100 times across barrier and modex collectives. This commit was SVN r32203.	2014-07-10 18:53:12 +00:00
Ralph Castain	356e7ea904	Move all collective id's into the attributes and let the job pack/unpack take care of them instead of singling them out. Add the envars just prior to forking the children instead of into the launch message itself. Remove a few #if CR as the attributes functionality can handle this condition now. This commit was SVN r32133.	2014-07-03 15:58:13 +00:00
Adrian Reber	cabf1d4e68	use the orte attributes in the FT code to fix compile errors This commit was SVN r32093.	2014-06-26 03:19:17 +00:00
Ralph Castain	34e5573988	Resolve the MTT timeout problem. This appears to have largely been caused by missing sigchld notifications, thus causing the daemons to believe that not all procs had exited. Let comm failure also serve as notification of process termination, and add appropriate flags/attributes to avoid multiple reporting of proc termination. This won't transition cleanly to the 1.8 series, and may represent too much change, so we'll have to (a) evaluate whether or not to bring it over (once it demonstrates that it does indeed solve the problem), and (b) develop a custom patch for that purpose. Refs trac:4717 This commit was SVN r32063. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-21 17:09:02 +00:00
Ralph Castain	5dbf4a62c4	Cleanup: we were accidentally killing ourselves (bad idea) Refs trac:4717 This commit was SVN r32022. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 20:38:42 +00:00
Ralph Castain	42bf7466fc	This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces very long delays at scale. Just kill the procs and move on. Refs trac:4717 This commit was SVN r32019. The following Trac tickets were found above: Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717	2014-06-17 17:57:51 +00:00
Ralph Castain	8e768de317	Really should check our own node for oversubscription, not the HNP cmr=v1.8.2:reviewer=jsquyres This commit was SVN r31946.	2014-06-04 03:09:02 +00:00
Ralph Castain	8736a1c138	Per RFC: http://www.open-mpi.org/community/lists/devel/2014/05/14822.php Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root). This commit was SVN r31916.	2014-06-01 16:14:10 +00:00
Ralph Castain	5602156a1c	Use the correct abstraction layer name for the data dirs This commit was SVN r31684.	2014-05-08 14:32:24 +00:00
Ralph Castain	11faab1091	The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees. This commit was SVN r31679.	2014-05-08 02:01:35 +00:00
Ralph Castain	87d809eefe	Add a new "run-time controls" framework for setting controls on processes. Initially, just move the process binding code there under a new "hwloc" component. Additional components to support cgroups, power settings, etc. to follow This commit was SVN r31633.	2014-05-05 19:22:06 +00:00
Ralph Castain	34988ba2a2	Cleanup the MPI_Abort detection Refs trac:4576 This commit was SVN r31561. The following Trac tickets were found above: Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576	2014-04-30 00:51:59 +00:00
Ralph Castain	1f0efe62a4	Minor cleanup - remove unused RML tag Refs trac:4576 This commit was SVN r31545. The following Trac tickets were found above: Ticket 4576 --> https://svn.open-mpi.org/trac/ompi/ticket/4576	2014-04-29 17:34:17 +00:00
Ralph Castain	e05b88fd18	Take another stab at resolving the "called-abort" requirement without getting stuck. Return to "drop a turd" mode, perhaps with a little more intelligence behind it. Don't worry about catching it if session dirs weren't created cmr=v1.8.2:reviewer=jsquyres:subject=cleanup MPI_Abort hangs This commit was SVN r31543.	2014-04-29 17:29:46 +00:00
Jeff Squyres	e1655ae68d	opal/util/fd.c: add new convenience function for setting FD_CLOEXEC Paul Hargrove pointed out that Stevens tells us that we should FD_GETFL before FD_SETFL. And so we shall. Make a new convenience function to do this (opal_fd_set_cloexec()), just so that we don't have to litter this 2-step process throughout the code. Refs trac:4550 This commit was SVN r31513. The following Trac tickets were found above: Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550	2014-04-24 13:04:49 +00:00
Ralph Castain	a368e84e70	Per the RFC, remove the sensor framework from the ORTE code area, relocating it offsite to the ORCM code area. Also update some ignores to ensure we don't pickup crosstalk in components This commit was SVN r31403.	2014-04-15 21:48:24 +00:00
Jeff Squyres	3da579139b	More corrections w.r.t. process groups To accompany r31092 and r310924, also ensure to create a new process group in the child right after the orted forks. Add trivial configury to ensure that we have setpgid, and only do the setpgid/getpgid if we have setpgid. Without this commit, killing the entire process group can do unexpected things (e.g., kill the orted, mpirun, and even mpirun's parent!). cmr=v1.7.5:reviewer=rhc This commit was SVN r31132. The following SVN revision numbers were found above: r31092 --> open-mpi/ompi@99c9ecaed0 The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r310924	2014-03-18 21:31:01 +00:00
Ralph Castain	545ac7dc58	Remove the job_control_forwarding logic as we want any signal to go to all members of the process group Refs trac:4404 This commit was SVN r31094. The following Trac tickets were found above: Ticket 4404 --> https://svn.open-mpi.org/trac/ompi/ticket/4404	2014-03-17 22:45:33 +00:00
Ralph Castain	99c9ecaed0	Ensure that we send the specified signal to the entire process group of each member of the pid provided to us. This ensures that any children spawned by our children also see the signal cmr=v1.7.5:reviewer=jsquyres This commit was SVN r31092.	2014-03-17 22:12:15 +00:00
Ralph Castain	081669b440	When pretty-printing binding info, we need to pass the topology down to the routine as the mapper isn't always working with the local topology - otherwise, we get an erroneous help message. Thanks to Tetsuya Mishima for reporting it cmr=v1.7.5:reviewer=rhc:subject=fix pretty-print of bindings This commit was SVN r30968.	2014-03-10 15:53:07 +00:00
Ralph Castain	c9465d97b4	Resolve a race condition when responding to a SIGTERM to ensure that any final message from the application is correctly output. Remove a duplicate command, reduce the priority of the daemon exit command to MSG so that the IOF will have a chance to output cached messages. Update the signal trapping test. Thanks to Paul Kapinos for reporting the problem. cmr=v1.7.5:reviewer=jsquyres:subject=resolve a race condition This commit was SVN r30942.	2014-03-05 04:38:17 +00:00
Ralph Castain	0ac97761cc	Now that we are binding by default, the issue of #slots and what to do when oversubscribed has become a bit more complicated. This isn't a problem in managed environments as we are always provided an accurate assignment for the #slots, or when -host is used to define the allocation since we automatically assume one slot for every time a node is named. The problem arises when a hostfile is used, and the user provides host names without specifying the slots= paramater. In these cases, we assign slots=1, but automatically allow oversubscription since that number isn't confirmed. We then provide a separate parameter by which the user can direct that we assign the number of slots based on the sensed hardware - e.g., by telling us to set the #slots equal to the #cores on each node. However, this has been set to "off" by default. In order to make this a little less complex for the user, set the default such that we automatically set #slots equal to #cores (or #hwt's if use_hwthreads_as_cpus has been set) only for those cases where the user provides names in a hostfile but does not provide slot information. Also cleanup some a couple of issues in the mapping/binding system: * ensure we only override the binding directive if we are oversubscribed and overload is not allowed * ensure that the MPI procs don't attempt to bind themselves if they are launched by an orted as any binding directive (no matter what it was) would have been serviced by the orted on launch * minor cleanup to the warning message when oversubscribed and binding was requested cmr=v1.7.5:reviewer=rhc:subject=update mapping/binding system This commit was SVN r30909.	2014-03-03 16:46:37 +00:00
Ralph Castain	61a21e4f31	Based on Tetsuya's patch, with some changes, correct the case of map-by node where multiple cpus/rank are requested and result in a non-integer match with num slots. Also correct tests for binding policy given to use the proper macro. Refs trac:4296 This commit was SVN r30857. The following Trac tickets were found above: Ticket 4296 --> https://svn.open-mpi.org/trac/ompi/ticket/4296	2014-02-26 18:12:23 +00:00
Adrian Reber	fde1040d2f	Use unique collective ids for the checkpoint/restart code This commit was SVN r30552.	2014-02-04 14:03:05 +00:00
Ralph Castain	193cceb483	Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps. Yippee skippee... This commit was SVN r30513.	2014-01-30 23:50:14 +00:00
Ralph Castain	fb9e427320	One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do not bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound. Refs trac:4077 This commit was SVN r30200. The following Trac tickets were found above: Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077	2014-01-09 22:39:34 +00:00
Ralph Castain	f179f2086b	Do a better job of reporting bindings - if someone gives a spec that binds us to all processors, then we are effectively unbound and should report it clearly instead of outputting a long line of B's. cmr=v1.7.4:reviewer=jsquyres:subject=Do a better job of reporting bindings This commit was SVN r30179.	2014-01-09 16:16:16 +00:00
Brian Barrett	8b778903d8	Fix longstanding issue with our multi-project support. Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is always set to {datadir,libdir,includedir}/openmpi. This will keep us from having help files in prefix/share/open-rte when building without Open MPI, but in prefix/share/openmpi when building with Open MPI. This commit was SVN r30140.	2014-01-07 22:11:15 +00:00
Ralph Castain	d049731911	Add pubsub pmi component to list of components to avoid when indirect launch used Refs trac:4032 This commit was SVN r30083. The following Trac tickets were found above: Ticket 4032 --> https://svn.open-mpi.org/trac/ompi/ticket/4032	2013-12-25 16:25:37 +00:00
Ralph Castain	81df8d09ca	Avoid use of PMI components when launched via mpirun as this is just unnecessary overhead that can cause confusion. cmr=v1.7.4:reviewer=miked:subject=Avoid use of PMI components when launched via mpirun This commit was SVN r30078.	2013-12-24 16:32:31 +00:00
Ralph Castain	55cd65b149	Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along. Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff. cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings This commit was SVN r29978.	2013-12-19 16:31:45 +00:00
Ralph Castain	f4f2287958	Singletons currently start out by spawning an HNP - this is required solely in the cases where the singleton subsequently calls MPI_Comm_spawn or publishes port info without support from an external orte-server. In all other cases, the HNP is of no value and can actually be a detriment by creating additional overhead on the node. This is particularly concerning for async operations where processes may begin as singletons and then dynamically wireup to perform pt2pt communications. So we now allow singletons to start on their own, only spawning an HNP when initiating an operation that actually requires it. cmr:v1.7.4:reviewer=jsquyres This commit was SVN r29354.	2013-10-04 02:58:26 +00:00
Ralph Castain	e8697de521	Deal with PGI compilers on the Mac by initializing a global variable. cmr:v1.6.6:reviewer=jsquyres cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29129.	2013-09-05 21:40:50 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. * the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. * reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Jeff Squyres	349ee654c1	Fix some --without-hwloc compile errors. Also remove one assigned-but-not-used variable assignment. This commit was SVN r28321.	2013-04-10 15:08:31 +00:00
Ralph Castain	1f011bef99	Cleanup the updated sys limits capability. Fix a few copy/paste bugs (my bad). Shift the limit set to the ODLS default module so that we sete the limits for all apps, even those that don't call opal_init. Leave it in opal_init as well to support direct-launch apps, but ensure we only set the limits once by removing the envar after launch by ODLS. Provide some nice error messages if we fail to set the limits. Since the user had to specifically request we set the limit, treat failure as an error-out situation. This commit was SVN r28288.	2013-04-04 16:00:17 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file can not be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char *, int , or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few component actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump(). opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Ralph Castain	cf9796accd	Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes This commit was SVN r28134.	2013-02-28 01:35:55 +00:00
Ralph Castain	347df93cd4	Handle the case of someone specifying a directory for the application. Ensure we get a non-zero exit status and clarify the error message. cmr:v1.7 This commit was SVN r28119.	2013-02-27 01:36:21 +00:00
Ralph Castain	8d2fa3693b	First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package. This commit was SVN r28116.	2013-02-26 20:44:56 +00:00
Ralph Castain	c0b670bea8	I guess some profiling tools and debuggers require that the argv[0] of each rank be unique so they can create a filename based on that value. For those obscure cases, provide an mpirun cmd line option that indexes each argv[0] by rank This commit was SVN r28064.	2013-02-15 20:20:49 +00:00
Ralph Castain	b9897267ef	Cleanup report-bindings so it always reports the actual binding instead of what was requested. Ensure we don't report twice if it is an MPI process being launched. This commit was SVN r28057.	2013-02-14 17:24:28 +00:00
Ralph Castain	2504da1ac9	Remove stale code - message arrival time doesn't really mean much anymore. This commit was SVN r27905.	2013-01-24 23:02:02 +00:00
Ralph Castain	bd887f7f56	Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations. This commit was SVN r27587.	2012-11-10 14:09:12 +00:00
Nathan Hjelm	bdedd8b0d3	Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1. Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the symantics of mca_base_components_open/close as they are now symetric in their functionality. This commit was SVN r27570.	2012-11-06 19:09:26 +00:00
Nathan Hjelm	2acd0f83de	Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter". It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occuring before the command line parser changes but it appears to be resolved now. This commit was SVN r27527. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-30 19:45:18 +00:00
Ralph Castain	e6014bf2e1	Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter This commit was SVN r27477. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-24 18:38:44 +00:00
Nathan Hjelm	d59034e6ef	MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions. cmr:v1.7 This commit was SVN r27451.	2012-10-17 20:17:37 +00:00
Ralph Castain	285a3b168d	Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds. This commit was SVN r27445.	2012-10-14 03:31:32 +00:00
George Bosilca	6ec41400b3	Fix the error message in case a daemon does not succeed at killing the local offspring. This commit was SVN r27362.	2012-09-24 15:25:21 +00:00
Ralph Castain	d5279b0dc8	Make an attempt to protect hwloc cset2str from segfaulting in weird scenario This commit was SVN r27361.	2012-09-23 16:51:51 +00:00
Ralph Castain	c82cfecc1c	Cleanup comm_spawn for the multi-node case where at least one new process isn't spawned on every node. Avoid the complexities of trying to execute a daemon collective across the dynamic spawn as it becomes too hard to ensure that all daemons participate or are accounted for - instead, use a less scalable but workable solution of sending the data directly between the participating procs. Ensure that singletons get their collectives properly defined at startup so the spawned "HNP" is ready for them. As a secondary cleanup, the HNP doesn't need to update its nidmap during an xcast as it already has an up-to-date picture of the situation. So just dump that data and move along. This commit was SVN r27318.	2012-09-12 11:31:36 +00:00
Ralph Castain	387f657fc2	Nuts - forgot to include this with the MPI Ticket 313 stuff. Set some of the envars needed for MPI_INFO_ENV This commit was SVN r27304.	2012-09-11 20:35:09 +00:00
Ralph Castain	38ce23db43	Add some protection to allow NULL bytes in byte objects and NULL strings to be handled cleanly in nidmaps and modex entries. Ensure there is a valid nidmap available for the HNP to pass down to any local procs when it is operating alone. This commit was SVN r27188.	2012-08-31 01:07:36 +00:00
Ralph Castain	1b659de132	Get staged execution working on multi-node setups. Improve efficiency by only remapping if all procs not yet mapped in the job. This commit was SVN r27181.	2012-08-29 20:35:52 +00:00
Ralph Castain	98580c117b	Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine. Remove some stale configure.m4's we no longer need. Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages. This commit was SVN r27165.	2012-08-28 21:20:17 +00:00
Ralph Castain	d6cbff6d4e	Since the preload flags are at the app_context level, we need to link only those files/exe's that pertain to each app_context to the corresponding procs. Also, gain a little optimization by checking to ensure we only send files once - this probably won't work when daemons are created on-the-fly, but that's for some other day This commit was SVN r27134.	2012-08-24 16:16:30 +00:00
Ralph Castain	e0c39c94e8	Complete the cleanup of the preload files system. Remove the dest_dir option as moving things to arbitrary locations - especially absolute paths - can prove disastrous. Remove the preload_libs option as these can be treated as just files. Cleanup some of the pack/unpack code as the dss handles NULL strings just fine. Deal a little better with absolute paths, noting that tar now strips the leading '/' for us (showing my age as it didn't used to do so). Remove the odls_base_state.c file as that code is now covered by the new broadcast form of preload_files. This commit was SVN r27127.	2012-08-24 02:28:29 +00:00
Ralph Castain	b4a544ad2a	Per discussion with Josh, use the --preload-xxx cmd line options to broadcast files to all nodes. Add --set-cwd-to-session-dir option to start procs in their session directories. Add OMPI_FILE_LOCATION envar to tell procs where their prepositioned files went. This commit was SVN r27125.	2012-08-23 21:28:05 +00:00
Ralph Castain	7237a938bf	Extend the filem interface to support prepositioning and linking required local files for execution. Create a new "raw" module that uses xcast to send the files to all nodes as this is faster than doing an scp in a linear pattern This commit was SVN r27118.	2012-08-22 21:43:20 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job "first_rank" - the MPI rank of the first process in each app_context "np" - the number of procs in each app_context Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
Ralph Castain	0dfe29b1a6	Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required. Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework. Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one. This commit was SVN r26678.	2012-06-27 14:53:55 +00:00
Ralph Castain	078a4667e4	Some more cleanup on direct routed when daemons are involved This commit was SVN r26594.	2012-06-11 23:46:22 +00:00
Ralph Castain	d6279fc971	Fix the debugger daemon launch support to fit the new state machine. Treat debugger daemons just like any other job, except that we map them only to nodes where an app process currently exists (as opposed to every node in the system). Trigger breakpoint and rank0 release only after the debugger daemons are in position. This commit was SVN r26556.	2012-06-06 02:01:23 +00:00
Jeff Squyres	0b8849e2c4	Make "mpirun --report-bindings" have a user-friendly output (i.e., readable by normal human beings, vs. having a bitmap of physical PU's). Use the new hwloc base prettyprint functions to generate the output. This commit was SVN r26533.	2012-06-01 16:35:31 +00:00
Jeff Squyres	cab31eafce	Revert r26413: it was causing too much confusion. When an MPI proc exits with status 77, the whole job will be killed, but mpirun will still return an exit status of 77, so MTT will report it as a skip anyway. This commit was SVN r26445. The following SVN revision numbers were found above: r26413 --> open-mpi/ompi@02aa36f2e5	2012-05-16 14:45:58 +00:00
Jeff Squyres	02aa36f2e5	ORTE defaults to killing the entire job when any process exits with a nonzero status (we polled other MPI implementations since one one in the OMPI community had a concrete opinion on what behavior to do here -- all other MPI's seem to adhere to this behavior, too). This commit adds an MCA parameter that allows us to tell ORTE to ''not'' kill jobs when a process exits with a status of 77, meaning the GNU testing standard of "this test was skipped". In all the OMPI tests, all procs will either return 77 or not. So if they all return 77, mpirun won't consider it an error, but will still return an exit status of 77 (so that MTT can know that the test was cleanly skipped). This commit was SVN r26413.	2012-05-08 21:49:05 +00:00
Ralph Castain	70a106fa71	Fix binding on remote nodes - need to pass the binding bitmap! This commit was SVN r26403.	2012-05-08 03:52:39 +00:00
Jeff Squyres	2ba10c37fe	Per RFC, bring in the following changes: * Remove paffinity, maffinity, and carto frameworks -- they've been wholly replaced by hwloc. * Move ompi_mpi_init() affinity-setting/checking code down to ORTE. * Update sm, smcuda, wv, and openib components to no longer use carto. Instead, use hwloc data. There are still optimizations possible in the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old carto-based code found out how many NUMA nodes were ''available'' -- not how many were used ''in this job''. The new hwloc-using code computes the same value -- it was not updated to calculate how many NUMA nodes are used ''by this job.'' * Note that I cannot compile the smcuda and wv BTLs -- I ''think'' they're right, but they need to be verified by their owners. * The openib component now does a bunch of stuff to figure out where "near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors (I do not have a NUMA machine with an OpenFabrics device that is a non-uniform distance from multiple different NUMA nodes). * Completely rewrite the OMPI_Affinity_str() routine from the "affinity" mpiext extension. This extension now understands hyperthreads; the output format of it has changed a bit to reflect this new information. * Bunches of minor changes around the code base to update names/types from maffinity/paffinity-based names to hwloc-based names. * Add some helper functions into the hwloc base, mainly having to do with the fact that we have the hwloc data reporting ''all'' topology information, but sometimes you really only want the (online \| available) data. This commit was SVN r26391.	2012-05-07 14:52:54 +00:00
Ralph Castain	b2f77bf08f	Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications. Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well. This commit was SVN r26380.	2012-05-02 21:00:22 +00:00
Ralph Castain	c5da4f24d7	Fix stupid singletons - get the pidmap message correct This commit was SVN r26378.	2012-05-02 17:48:02 +00:00
Ralph Castain	289f9f41ec	From long-term discussions, have the daemons use the node_t and proc_t structs and arrays instead of the pidmap and nidmap arrays. Sets the stage for future work. This commit was SVN r26359.	2012-04-29 00:10:01 +00:00
Ralph Castain	4d16790836	Fix collectives for jobs running across partial allocations This commit was SVN r26267.	2012-04-13 00:38:47 +00:00
Ralph Castain	d3dfba3872	Fix the scenario where an MPI error handler causes a proc to exit after finalize, but with non-zero status to indicate an error occurred. This commit was SVN r26265.	2012-04-11 02:23:46 +00:00
Ralph Castain	14d5525fb1	Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified. This commit was SVN r26261.	2012-04-10 19:08:54 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Ralph Castain	ca3ff58c76	Ensure we get a non-zero exit status when we can't find the specified fork agent. Output a better error message, and ensure we don't multiply report the problem. This commit was SVN r26191.	2012-03-24 00:49:38 +00:00
Ralph Castain	b3aabf1565	Cleanup the --without-hwloc build. Thanks to Paul Hargrove for reporting it broken. This commit was SVN r25931.	2012-02-15 11:08:57 +00:00
Ralph Castain	91977444af	Silence warnings This commit was SVN r25929.	2012-02-15 03:42:27 +00:00
Ralph Castain	9b59d8de6f	This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build. Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations. Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way. This commit was SVN r25497.	2011-11-22 21:24:35 +00:00
Ralph Castain	6310361532	At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation. In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions: 1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior. 2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation. 3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so. As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes. This commit was SVN r25476.	2011-11-15 03:40:11 +00:00
Ralph Castain	12a589130a	Add some debug This commit was SVN r25395.	2011-10-29 15:07:58 +00:00
Ralph Castain	84713d5a84	Fix singletons again - must have been broken for a very long time, which only shows how little anyone cares about this capability. This commit was SVN r25332.	2011-10-19 20:19:08 +00:00
Ralph Castain	054c485dcf	Cleanup a race condition and an unreliable method that caused us to not properly handle procs that trapped sigterm for cleanup purposes while ORTE was trying to kill them. Thanks to Rick Payne and Ian Wells of Cisco for spending weeks chasing this down. Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-( This commit was SVN r25285.	2011-10-14 15:39:54 +00:00
Ralph Castain	ca7638553f	Remove stale code This commit was SVN r25133.	2011-09-12 23:00:41 +00:00
Ralph Castain	92c7372e20	Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves. Remove the sysinfo framework as hwloc replaces that functionality. This commit was SVN r25124.	2011-09-11 19:02:24 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Ralph Castain	1c08a4006c	Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes. This commit was SVN r25064.	2011-08-18 16:24:45 +00:00
Wesley Bland	09274cd047	Make sure that the epoch is initialized everywhere so we don't get weird output during valgrind. This shouldn't have caused any problems with any actual execution. Just extra warnings in valgrind. This commit was SVN r25015.	2011-08-08 15:11:55 +00:00
Ralph Castain	8014e3429e	Don't double-count procs as they are launched This commit was SVN r25011.	2011-08-08 06:05:23 +00:00
Ralph Castain	4083dc617f	Fix computation of number of required files and file descriptors - it only depends on the total number of local procs, not on the number of procs in the entire job! This commit was SVN r25008.	2011-08-08 04:09:40 +00:00
Ralph Castain	8b3c562b84	Adjust verbosity levels to make it easier to debug at scale This commit was SVN r25006.	2011-08-07 21:14:21 +00:00
Ralph Castain	d603c79ab4	Fix the FAILED_TO_START scenario so orted doesn't segfault This commit was SVN r25002.	2011-08-05 20:29:50 +00:00
Ralph Castain	42b125ef35	Move the debug so it more accurately reports This commit was SVN r24961.	2011-07-29 20:48:46 +00:00
Ralph Castain	6c879f87fb	Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node. This commit was SVN r24956.	2011-07-27 19:37:17 +00:00
Eugene Loh	921852e1e5	Clean up the computations of num_procs_alive. Do some code refactoring to improve readability and to compute num_procs_alive correctly and to remove the use of loop iteration variables for two loops nested one inside another (causing MPI_Comm_spawn_multiple to fail). This commit was SVN r24903.	2011-07-14 20:10:48 +00:00

1 2 3 4 5 ...

527 Коммитов