openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	449cd8f3d7	Update a couple of fields, add a scheduler field to proc_info This commit was SVN r30718.	2014-02-13 23:30:04 +00:00
Ralph Castain	1d8c061687	Fix a race condition that could result in assert failures during finalize. Ensure we shutdown the orte progress thread prior to finalizing the rml/oob frameworks so that no async operations are executing during destruct of the base-level lists and objects. cmr=v1.7.5:reviewer=jsquyres:subject=fix race condition in finalize This commit was SVN r30641.	2014-02-08 22:04:19 +00:00
Adrian Reber	fde1040d2f	Use unique collective ids for the checkpoint/restart code This commit was SVN r30552.	2014-02-04 14:03:05 +00:00
Ralph Castain	193cceb483	Okay, since a certain other RM out there made a fuss about being able to lock their daemons to specified cores, offer the same option here. The MCA param orte_daemon_cores can be used to specify which core(s) you want the orte daemons to use. This will have no bearing on the application procs - unbound will remain unbound, and binding directives will be applied to the apps. Yippee skippee... This commit was SVN r30513.	2014-01-30 23:50:14 +00:00
Ralph Castain	80497d73cf	Need to mark the daemon as alive so that exit commands are properly routed during abnormal terminations. Also, remove stale references to the "selected oob component" as we no longer require only one component be selected cmr=v1.7.4:reviewer=jsquyres This commit was SVN r30162.	2014-01-08 22:35:48 +00:00
Brian Barrett	8b778903d8	Fix longstanding issue with our multi-project support. Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is always set to {datadir,libdir,includedir}/openmpi. This will keep us from having help files in prefix/share/open-rte when building without Open MPI, but in prefix/share/openmpi when building with Open MPI. This commit was SVN r30140.	2014-01-07 22:11:15 +00:00
Ralph Castain	9c768df8b8	Resolve an unexpected behavior in hostfile allocations. Now that we filter allocations to determine what will be used for mapping, let the initial global pool be the union of nodes from all sources (default hostfile, hostfiles, and dash-hosts). Each app will filter down to only those specified for it using its own hostfile and dash-host options. cmr=v1.7.4:reviewer=jsquyres:subject=Resolve an unexpected behavior in hostfile allocations This commit was SVN r30040.	2013-12-21 01:38:27 +00:00
Ralph Castain	71b52fe861	Ensure that comm_spawn'd procs get user-specified forwarded envars Thanks to Tim Miller for reporting the regression from the 1.6 series cmr=v1.7.4:reviewer=jsquyres:subject=Ensure that comm_spawn'd procs get user-specified forwarded envars This commit was SVN r30012.	2013-12-20 14:47:35 +00:00
Ralph Castain	d44e4a311f	Per request from Dave Goodell, add support for MPIEXEC_TIMEOUT - if set in the environment, terminate the job after the specified number of seconds has passed. Equivalent to MPICH functionality. cmr=v1.7.4:reviewer=dgoodell:subject=add support for MPIEXEC_TIMEOUT This commit was SVN r29831.	2013-12-07 01:58:32 +00:00
Ralph Castain	7480beb7f0	Per request from Nathan, add an offset value to the job struct so we can construct a "global rank" that spans multiple jobs during dynamic launch operations. Store a new ORTE_DB_GLOBAL_RANK value for each process in the database, and ensure that we share our own value during connect_accept so both sides can see it. This isn't being used yet - just enabling Nathan to do what he needs. *** NOTE: any use of the OMPI_DB_GLOBAL_RANK database key must be protected by #ifdef OMPI_DB_GLOBAL_RANK as not all RTE's will define this key. *** This commit was SVN r29708.	2013-11-14 17:01:43 +00:00
Ralph Castain	f1e510154c	Revise the launch timeout detection so we don't mistakenly declare "failed to start". Recognize that timeout is at the per-job level, and define the timeout param as a total value instead of seconds/daemon as it otherwise can get to be an enormous (and useless) number. Resolves problems in loop_spawn where the timer was incorrectly firing and killing the overall job. cmr=v1.7.4:reviewer=hjelmn This commit was SVN r29661.	2013-11-11 23:50:40 +00:00
Ralph Castain	e35ad23176	Correctly compute usage for dynamic spawns when binding is invoked. Ensure we correctly account for existing process usage on each node when computing bindings during dynamic spawns. cmr=v1.7.4:reviewer=hjelmn:subject=Correctly compute usage for dynamic spawns when binding is invoked This commit was SVN r29649.	2013-11-10 00:38:01 +00:00
Ralph Castain	604970a1a2	Initialize orte_coprocessors hash table to NULL. Delay coprocessor detection on HNP until after node topology final definition in case rmaps changes it. Minor spacing change. Refs trac:3847 This commit was SVN r29504. The following Trac tickets were found above: Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847	2013-10-24 00:08:47 +00:00
Ralph Castain	b12167abef	Per a good suggestion from Jeff, make the coprocessor mapping more scalable by using a hash table to cache the coprocessor list, and then do a single pass thru the nodes at the end to assign hostid's. Refs trac:3847 This commit was SVN r29439. The following Trac tickets were found above: Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847	2013-10-14 22:01:48 +00:00
Ralph Castain	24c811805f	************************************************************** This change contains a non-mandatory modification of the MPI-RTE interface. Anyone wishing to support coprocessors such as the Xeon Phi may wish to add the required definition and underlying support ************************************************************** Add locality support for coprocessors such as the Intel Xeon Phi. Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host. So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following: 1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board 2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions 3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future. 4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time. 5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored. 6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set. cmr:v1.7.4:reviewer=hjelmn This commit was SVN r29435.	2013-10-14 16:52:58 +00:00
Ralph Castain	f4f2287958	Singletons currently start out by spawning an HNP - this is required solely in the cases where the singleton subsequently calls MPI_Comm_spawn or publishes port info without support from an external orte-server. In all other cases, the HNP is of no value and can actually be a detriment by creating additional overhead on the node. This is particularly concerning for async operations where processes may begin as singletons and then dynamically wireup to perform pt2pt communications. So we now allow singletons to start on their own, only spawning an HNP when initiating an operation that actually requires it. cmr:v1.7.4:reviewer=jsquyres This commit was SVN r29354.	2013-10-04 02:58:26 +00:00
Ralph Castain	d565a76814	Do some cleanup of the way we handle modex data. Identify data that needs to be shared with peers in my job vs data that needs to be shared with non-peers - no point in sharing extra data. When we share data with some process(es) from another job, we cannot know in advance what info they have or lack, so we have to share everything just in case. This limits the optimization we can do for things like comm_spawn. Create a new required key in the OMPI layer for retrieving a "node id" from the database. ALL RTE'S MUST DEFINE THIS KEY. This allows us to compute locality in the MPI layer, which is necessary when we do things like intercomm_create. cmr:v1.7.4:reviewer=rhc:subject=Cleanup handling of modex data This commit was SVN r29274.	2013-09-27 00:37:49 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. * the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. * reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Ralph Castain	16c5b30a1f	Since the calls to "PMI get" scale by number of procs (not nodes), it makes more sense to have the MCA param be the cutoff based on number of procs. Also, it occurred to me that this shouldn't impact the nidmap process as that is built and circulated when we launch via mpirun, not during direct launch. So shift the cutoff param to the MPI layer, and have it solely determine whether or not we call modex_recv on the hostname. If comm_world is of size greater than the cutoff, then we don't automatically retrieve the hostname when we build the ompi_proc_t for a process - instead, we fill the hostname entry on first call to modex_recv for that process. The param is now "ompi_hostname_cutoff=N", where N=number of procs for cutoff. Refs trac:3729 This commit was SVN r29056. The following Trac tickets were found above: Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729	2013-08-22 03:40:26 +00:00
Ralph Castain	45e695928f	As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time: * add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit. * remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL" * modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded * removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base * added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames This commit was SVN r29052.	2013-08-20 18:59:36 +00:00
Ralph Castain	bebe852057	Add new info key for publish that allows user to designate that the port is to be unique - i.e., to return an error if that service has already been published. Default is to overwrite This commit was SVN r29028.	2013-08-14 04:21:17 +00:00
Nathan Hjelm	841ed962f6	fix MCA variable and component system leaks cmr=v1.7.3:reviewer=rhc This commit was SVN r29011.	2013-08-09 19:50:28 +00:00
Nathan Hjelm	ebbb32120a	MCA/base: variable system updates - Use an enumerator to handle bool values. - Fix a leak in the variable enumerator. - Fix a leak in an orte parameter. This commit was SVN r28949.	2013-07-25 15:42:01 +00:00
Ralph Castain	563bf60fb8	Fix an ordering problem that crept in due to the change in MCA param system. One MCA param would set a value, and then we did a hard reset of that value before testing another MCA param, thus removing a critical value for proper operation of the first param. cmr:v1.7.3:reviewer=brbarret This commit was SVN r28788.	2013-07-14 16:22:13 +00:00
Jeff Squyres	b9ca8e3cd1	Tweaked the help message a bit (this is the end result of iterating on the message in email between Mike, Ralph, Jeff). Add this to CMR #3642 and #3643. This commit was SVN r28662.	2013-06-20 13:19:23 +00:00
Ralph Castain	13665bffe8	Per an off-list discussion, it appears possible for a system to report failure when executing getpwuid. There are several reasons for this error to occur, most notably if the system uses a network-based authentication protocol (e.g., NIS) and that sytem gets overwhelmed when we launch on a lot of nodes. There is no good way to recover from this scenario, and from past experience, using the user's name in the session directory (as opposed to the uid) is very helpful when things go wrong. So print a help message when this happens (it is extremely rare, but has happened at least once now) and return an error. cmr:v1.7.3,reviewer=jsquyres cmr:v1.6.5,reviewer=jsquyres This commit was SVN r28658.	2013-06-20 04:30:42 +00:00
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Ralph Castain	93ba4247f8	remove extra paren when --without-hwloc This commit was SVN r28530.	2013-05-15 21:31:45 +00:00
Ralph Castain	5296099ecb	Fix the cpus-per-rank when binding to hwthreads. Add cpus-per-rank to diag printout Thanks to Elena for reporting the problem This commit was SVN r28508.	2013-05-14 20:17:50 +00:00
Ralph Castain	27e3e382d5	No need for ORTE tools to use orte progress thread This commit was SVN r28445.	2013-05-04 21:13:20 +00:00
Nathan Hjelm	bccf8c657a	Per RFC add initial support for the MPI 3.0 tools interface. Current MPI_T support: - Full cvar interface. - Full categories interface. - No pvar support at this time. This commit was SVN r28376.	2013-04-24 15:59:23 +00:00
Ralph Castain	45af6cf59e	The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem. This commit was SVN r28308.	2013-04-08 23:34:16 +00:00
Ralph Castain	1b5b9afa85	Silence warnings This commit was SVN r28244.	2013-03-27 21:56:46 +00:00
Nathan Hjelm	9d4a26f47d	Update OMPI frameworks to use the MCA framework system. Notes: - This commit also eliminates the need for an available components list in use in several frameworks. None of the code in question was making use of the priority field of the priority component list item so these extra lists were removed. - Cleaned up selection code in several frameworks to sort lists using opal_list_sort. - Cleans up the ompi/orte-info functions. Expose the functions that construct the list of params so they can be used elsewhere. patches for mtl/portals4 from brian missed a few output variables in openib This commit was SVN r28241.	2013-03-27 21:17:31 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file can not be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char *, int , or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few component actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump(). opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Ralph Castain	317915225c	Finish the binding cleanup by removing the no-longer-used binding level scheme. This proved to be fallible as there is no guarantee that the hierarchy it used matched physical reality of the machine (e.g., is L3 "above" the socket or not). Still have to complete the ppr update, but get the rest of it correct. This commit was SVN r28223.	2013-03-26 20:09:49 +00:00
Ralph Castain	6ee32767d4	Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument. This commit was SVN r28219.	2013-03-26 18:27:50 +00:00
Ralph Castain	147c6ff9e7	Clean out the cruft leftover from the use_common_ports experiment cmr:v1.7 This commit was SVN r28184.	2013-03-20 15:07:43 +00:00
Ralph Castain	a4b6fb241f	Remove all remaining vestiges of the Windows integration This commit was SVN r28137.	2013-02-28 17:31:47 +00:00
Ralph Castain	cf9796accd	Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes This commit was SVN r28134.	2013-02-28 01:35:55 +00:00
Ralph Castain	bd9265c560	Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data. A few changes were required to support this move: 1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution). 2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time. 3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you seperately specify the "store" and "fetch" ordering. 4. the ORTE grpcomm and ess/pmi components, and the nidmap code, were updated to work with the new db framework and to specify internal/global storage options. No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework. This commit was SVN r28112.	2013-02-26 17:50:04 +00:00
Ralph Castain	c0b670bea8	I guess some profiling tools and debuggers require that the argv[0] of each rank be unique so they can create a filename based on that value. For those obscure cases, provide an mpirun cmd line option that indexes each argv[0] by rank This commit was SVN r28064.	2013-02-15 20:20:49 +00:00
Ralph Castain	a591fbf06f	Add initial support for dynamic allocations. At this time, only Slurm supports the new capability, which will be included in an upcoming release. Add hooks for supporting dynamic allocation and deallocation to support application-driven requests and fault recovery operations. This commit was SVN r27879.	2013-01-20 00:33:42 +00:00
Ralph Castain	c96cc2d5a0	In order to properly connect to debuggers like STAT, we need to get the hostname in its unstripped version for the MPIR_proctab. Unfortunately, we need a stripped version for Cray's alps launcher. So when we are stripping the hostname prefix, retain alias hostnames and add the ability to specify an alias to use in the proctab. This commit was SVN r27863.	2013-01-18 05:00:05 +00:00
Ralph Castain	97ee683275	Track parentage of procs This commit was SVN r27758.	2013-01-08 04:41:12 +00:00
Nathan Hjelm	6a9ab9b221	Change orte_startup_timeout to be in seconds and remove the 10 second maximum This commit was SVN r27741.	2013-01-03 23:56:34 +00:00
Ralph Castain	81a8e21939	Need to have the event thread running during init/finalize, but we still have a problem with cleanup - so comment out the event_base_free for now. This commit was SVN r27738.	2013-01-03 02:16:57 +00:00
Ralph Castain	64da742d5f	Remove the orte_finalize_event variable - no longer needed This commit was SVN r27722.	2012-12-25 19:33:20 +00:00
Ralph Castain	cada035f38	Fix the segfault problem in the orteds - turns out it only occurred with progress threads enabled. Ensure the thread gets started at the right time (at the end of init), although the event base gets created earlier. Remove the finalize event as we can instead use the loopbreak call to exit the event loop. This commit was SVN r27721.	2012-12-25 19:30:18 +00:00
Ralph Castain	c8e34813b6	THIS IS A TEMPORARY FIX - do not finalize opal as the parameter system has been broken and will segfault when finalized. THIS PATCH MUST BE REMOVED WHEN THE PARAMETER SYSTEM HAS BEEN FIXED. This commit was SVN r27720.	2012-12-24 18:42:19 +00:00
Ralph Castain	e11f32038a	Add an MCA param to retain all aliases based on IP addrs for node names so that procs can look them up by interface, if desired. If the param is set, pass aliases around to all daemons and procs for local use This commit was SVN r27619.	2012-11-16 04:04:29 +00:00
Ralph Castain	da6428a822	Add an MCA param to help debug the ORTE progress thread This commit was SVN r27614.	2012-11-15 15:54:38 +00:00
Ralph Castain	5241925b62	Add the dfs to ompi_info This commit was SVN r27613.	2012-11-15 15:54:07 +00:00
Ralph Castain	fefec03e78	Enable all ORTE tools to use progress threads if they are enabled This commit was SVN r27593.	2012-11-12 02:54:09 +00:00
Ralph Castain	bd887f7f56	Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations. This commit was SVN r27587.	2012-11-10 14:09:12 +00:00
Nathan Hjelm	a754674fd7	Per the specification for putenv (http://pubs.opengroup.org/onlinepubs/009604599/functions/putenv.html ) the string given to putenv becomes part of the environment. The string must not be changed or freed. cmr:v1.7 This commit was SVN r27578.	2012-11-09 16:33:14 +00:00
Ralph Castain	3812979315	Remove stale file - we removed the size functions awhile back This commit was SVN r27551.	2012-11-01 03:36:51 +00:00
Ralph Castain	a080de188f	Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists. This commit was SVN r27516.	2012-10-29 23:11:30 +00:00
Ralph Castain	4e52a15e70	Provide for sync on seek and close DFS operations. Eliminate an unnecessary wake-up timer when using ORTE progress thread This commit was SVN r27500.	2012-10-26 15:49:04 +00:00
Ralph Castain	35e5e5b512	Set the orte_event_base to the opal_event_base in ompi_info - we aren't doing anything with progress threads anyway This commit was SVN r27488.	2012-10-25 22:36:08 +00:00
Ralph Castain	79e36413c2	There was some discussion of this at an earlier time, but we never got around to doing it - so make orte behave more like a regular library, counting the number of times init is called, and executing finalize when all those are exhausted. This commit was SVN r27484.	2012-10-25 18:39:37 +00:00
Ralph Castain	4028ce7a5d	Silence warnings by making types match This commit was SVN r27446.	2012-10-14 03:45:28 +00:00
Ralph Castain	285a3b168d	Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds. This commit was SVN r27445.	2012-10-14 03:31:32 +00:00
Jeff Squyres	fb2e543a57	Refs trac:3275. We ran into a case where the OMPI SVN trunk grew a new acceptable MCA parameter value, but this new value was not accepted on the v1.6 branch (hwloc_base_mem_bind_failure_action -- on the trunk it accepts the value "silent", but on the older v1.6 branch, it doesn't). If you set "hwloc_base_mem_bind_failure_action=silent" in the default MCA params file and then accidentally ran with the v1.6 branch, every OMPI executable (including ompi_info) just failed because hwloc_base_open() would say "hey, 'silent' is not a valid value for hwloc_base_mem_bind_failure_action!". Kaboom. The only problem is that it didn't give you any indication of where this value was being set. Quite maddening, from a user perspective. So we changed the ompi_info handles this case. If any framework open function return OMPI_ERR_BAD_PARAM (either because its base MCA params got a bad value or because one of its component register/open functions return OMPI_ERR_BAD_PARAM), ompi_info will stop, print out a warning that it received and error, and then dump out the parameters that it has received so far in the framework that had a problem. At a minimum, this will show the user the MCA param that had an error (it's usually the last one), and ''where it was set from'' (so that they can go fix it). We updated ompi_info to check for O???_ERR_BAD_PARAM from each from the framework opens. Also updated the doxygen docs in mca.h for this O???_BAD_PARAM behavior. And we noticed that mca.h had MCA_SUCCESS and MCA_ERR_??? codes. Why? I think we used them in exactly one place in the code base (mca_base_components_open.c). So we deleted those and just used the normal OPAL_* codes instead. While we were doing this, we also cleaned up a little memory management during ompi_info/orte-info/opal-info finalization. Valgrind still reports a truckload of memory still in use at ompi_info termination, but they mostly look to be components not freeing memory/resources properly (and outside the scope of this fix). This commit was SVN r27306. The following Trac tickets were found above: Ticket 3275 --> https://svn.open-mpi.org/trac/ompi/ticket/3275	2012-09-11 20:47:24 +00:00
Ralph Castain	a0ffeb205a	Add an orted component for staged operations and rename the staged component to "staged_hnp". This commit was SVN r27305.	2012-09-11 20:35:46 +00:00
Ralph Castain	d772e0fc3d	Add an option to treat dash-host specifications as "requested, but not required". So-called "soft" location requests can allow an application to execute even if the ideal allocation isn't available. This commit was SVN r27242.	2012-09-05 18:42:09 +00:00
Ralph Castain	fde83a44ab	This confusion has been around for awhile, caused by a long-ago decision to track slots allocated to a specific job as opposed to allocated to the overall mpirun instance. We eliminated that quite a while ago, but never consolidated the "slots_alloc" and "slots" fields in orte_node_t. As a result, confusion has grown in the code base as to which field to look at and/or update. So (finally) consolidate these two fields into one "slots" field. Add a field in orte_job_t to indicate when all the procs for a job will be launched together, so that staged operations can know when MPI operations are allowed. This commit was SVN r27239.	2012-09-05 01:30:39 +00:00
Ralph Castain	bae5dab916	If (and only if) a user requests, set the default number of slots on any node to the number of objects of the specified type. This only takes effect in an unmanaged environment - i.e., if an external resource manager assigns us a number of slots, then that is what we use. However, if we are using a hostfile, then the user may or may not have given us a value for the number of slots on each node. For those nodes (and only those nodes) where the user does not specify a slot count, we will set the number of slots according to their direction: either to the number of cores, numas, sockets, or hwthreads. Otherwise, the slot count is set to 1. Note that the default behavior remains unchanged: in the absence of any value for #slots, and in the absence of any directive to set #slots, we will set #slots=1. This commit was SVN r27236.	2012-09-04 20:58:26 +00:00
Ralph Castain	11de735e8a	Complete the revamp of hostfile support in non-managed environments. Working at the app level, ensure that we utilize only those nodes specified for that app, but fall back to the default hostfile (if available) for those with no specification, further falling back to the local host if the default hostfile is not present or is empty. This commit was SVN r27230.	2012-09-04 16:34:05 +00:00
Ralph Castain	1b659de132	Get staged execution working on multi-node setups. Improve efficiency by only remapping if all procs not yet mapped in the job. This commit was SVN r27181.	2012-08-29 20:35:52 +00:00
Ralph Castain	98580c117b	Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine. Remove some stale configure.m4's we no longer need. Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages. This commit was SVN r27165.	2012-08-28 21:20:17 +00:00
Ralph Castain	e0c39c94e8	Complete the cleanup of the preload files system. Remove the dest_dir option as moving things to arbitrary locations - especially absolute paths - can prove disastrous. Remove the preload_libs option as these can be treated as just files. Cleanup some of the pack/unpack code as the dss handles NULL strings just fine. Deal a little better with absolute paths, noting that tar now strips the leading '/' for us (showing my age as it didn't used to do so). Remove the odls_base_state.c file as that code is now covered by the new broadcast form of preload_files. This commit was SVN r27127.	2012-08-24 02:28:29 +00:00
Ralph Castain	b4a544ad2a	Per discussion with Josh, use the --preload-xxx cmd line options to broadcast files to all nodes. Add --set-cwd-to-session-dir option to start procs in their session directories. Add OMPI_FILE_LOCATION envar to tell procs where their prepositioned files went. This commit was SVN r27125.	2012-08-23 21:28:05 +00:00
Ralph Castain	e4d82b8912	Turn off the common port by default by now until we get rollup working properly on ALL platforms This commit was SVN r27060.	2012-08-15 22:13:04 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job "first_rank" - the MPI rank of the first process in each app_context "np" - the number of procs in each app_context Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error is someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
Ralph Castain	431d5361ed	For those who really preferred our prior mode of operation that mapped procs and only launched daemons on the nodes that had procs on them, introduce the "novm" state machine component. This recreates the old mode of operation by re-ordering the launch sequence so that we allocate, then map, and then launch daemons only on the reqd nodes (instead of across the entire allocation). This commit was SVN r26946.	2012-08-03 16:30:05 +00:00
Shiqing Fan	660188307c	fix an export declaration name This commit was SVN r26895.	2012-07-27 13:26:24 +00:00
Shiqing Fan	8c4a3e1269	correct the symbol dllexports for windows build This commit was SVN r26827.	2012-07-22 08:54:50 +00:00
Ralph Castain	e335de3564	Refactor ompi_info, splitting it into parts according to the layer involved. Thus, we call down to the opal layer to get those frameworks and components, and down to the orte layer to get those. Still some abstraction breaks, but they mostly involve renaming of OMPI_foo labels that have been around since before we split the build system by layer. This commit was SVN r26695.	2012-06-28 18:23:34 +00:00
Ralph Castain	0dfe29b1a6	Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required. Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework. Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one. This commit was SVN r26678.	2012-06-27 14:53:55 +00:00
Ralph Castain	b990c65a53	Remove another antiquated dss function - the 'size' API isn't used anywhere since the GPR went away This commit was SVN r26646.	2012-06-25 13:33:45 +00:00
Ralph Castain	abe7dd8274	Cleanup the dss by removing unused functions This commit was SVN r26644.	2012-06-23 21:20:09 +00:00
Ralph Castain	019857b616	Ensure that we don't attempt to use common ports if --disable-static was specified. This commit was SVN r26620.	2012-06-20 03:14:11 +00:00
Ralph Castain	9b026c6695	For now, run MTT with the use_common_port option enabled. This would be the desirable scenario for users, especially at scale, so let's see if it creates any issues. This commit was SVN r26609.	2012-06-15 15:46:38 +00:00
Ralph Castain	96c778656a	Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP. Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build. This commit was SVN r26606.	2012-06-15 10:15:07 +00:00
Ralph Castain	9506ac1617	Remove debug This commit was SVN r26592.	2012-06-11 20:02:53 +00:00
Ralph Castain	269cb2b8d9	Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base. This commit was SVN r26591.	2012-06-11 19:59:53 +00:00
Ralph Castain	0442a807c0	Default the OOB to the "ud" component IFF the HNP finds itself on a node with a supported Infiniband device. Ensure that the daemons all pick the matching component by dictating the selection via mca param on the orted cmd line. This commit was SVN r26582.	2012-06-08 01:23:08 +00:00
Ralph Castain	d6279fc971	Fix the debugger daemon launch support to fit the new state machine. Treat debugger daemons just like any other job, except that we map them only to nodes where an app process currently exists (as opposed to every node in the system). Trigger breakpoint and rank0 release only after the debugger daemons are in position. This commit was SVN r26556.	2012-06-06 02:01:23 +00:00
Ralph Castain	9883f42caf	Add missing commit This commit was SVN r26501.	2012-05-28 02:20:20 +00:00
Ralph Castain	be6ed9c2df	Allow partial use of allocations by specifying the max number of daemons (i.e., max VM size) for the job This commit was SVN r26499.	2012-05-27 16:48:19 +00:00
Ralph Castain	83d69b6c95	Enable the ORTE progress thread for apps (not needed in the tools as they already continuously loop in the event lib). This appears to be working, at least for MPI apps that only use shared memory (a simple "hello"). More testing is required to identify where problems will occur - this is only intended to allow further development. In order to use the progress thread, you must configure with: --enable-orte-progress-threads --enable-event-thread-support This commit was SVN r26457.	2012-05-20 15:14:43 +00:00
Jeff Squyres	dab7d36a81	Fix location of the default hostfile. Thanks to Götz Waschk for identifying the problem. This commit was SVN r26441.	2012-05-15 16:13:39 +00:00
Jeff Squyres	2ba10c37fe	Per RFC, bring in the following changes: * Remove paffinity, maffinity, and carto frameworks -- they've been wholly replaced by hwloc. * Move ompi_mpi_init() affinity-setting/checking code down to ORTE. * Update sm, smcuda, wv, and openib components to no longer use carto. Instead, use hwloc data. There are still optimizations possible in the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old carto-based code found out how many NUMA nodes were ''available'' -- not how many were used ''in this job''. The new hwloc-using code computes the same value -- it was not updated to calculate how many NUMA nodes are used ''by this job.'' * Note that I cannot compile the smcuda and wv BTLs -- I ''think'' they're right, but they need to be verified by their owners. * The openib component now does a bunch of stuff to figure out where "near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors (I do not have a NUMA machine with an OpenFabrics device that is a non-uniform distance from multiple different NUMA nodes). * Completely rewrite the OMPI_Affinity_str() routine from the "affinity" mpiext extension. This extension now understands hyperthreads; the output format of it has changed a bit to reflect this new information. * Bunches of minor changes around the code base to update names/types from maffinity/paffinity-based names to hwloc-based names. * Add some helper functions into the hwloc base, mainly having to do with the fact that we have the hwloc data reporting ''all'' topology information, but sometimes you really only want the (online \| available) data. This commit was SVN r26391.	2012-05-07 14:52:54 +00:00
Ralph Castain	b2f77bf08f	Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications. Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well. This commit was SVN r26380.	2012-05-02 21:00:22 +00:00
Ralph Castain	47a5e30095	Ensure debug output levels if we are debugging This commit was SVN r26358.	2012-04-29 00:03:28 +00:00
Ralph Castain	f3e3704c9e	Per request from Brian, enable mapping of stddiag output (output from opal_output calls) to stderr of the local process. This allows you to obtain that output in a local window (for example, when using xterm for each process) instead of having it automatically forwarded to mpirun. Turn this on automatically whenever someone uses the -xterm option, and to be set manually using the orte_map_stddiag_to_stderr mca param. This commit was SVN r26352.	2012-04-27 14:39:34 +00:00
Ralph Castain	3b5b185c86	Don't double free timer events This commit was SVN r26341.	2012-04-25 17:36:12 +00:00
Ralph Castain	ddfbde587f	Change the default to "abort" the job when any process exits with a non-zero status. Add the required code to ensure the orted tells the HNP about the problem. This commit was SVN r26270.	2012-04-13 21:19:46 +00:00

1 2 3 4 5 ...

601 Коммитов