openmpi

Author	SHA1	Message	Date
Ralph Castain	42eb0bbe1b	Fix --without-hwloc builds cmr=v1.7.4:reviewer=jsquyres This commit was SVN r30462.	2014-01-28 17:10:32 +00:00
Ralph Castain	84a0ab3a75	Ah @$#!$#% - missed one last help message that needs to be corrected. cmr=v1.7.4:reviewer=jsquyres:subject=correct help message This commit was SVN r30449.	2014-01-28 04:03:24 +00:00
Ralph Castain	941bfd4604	Final cleanup of cpus-per-proc for 1.7.4 - provide better checking for cpus-per-proc and mismatched mapping/binding directives, and provide error messages telling the user what to do to get it right. cmr=v1.7.4:reviewer=jsquyres This commit was SVN r30438.	2014-01-27 22:40:51 +00:00
Jeff Squyres	7768828d2d	Addendum to r30298: tweak the wording of the help messages a bit. Refs trac:4117. Please use this commit rather than the patch attached to the ticket; the patch had a few mistakes in the tweaked wording. This commit was SVN r30362. The following SVN revision numbers were found above: r30298 --> open-mpi/ompi@58479399c3 The following Trac tickets were found above: Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117	2014-01-22 12:17:14 +00:00
Ralph Castain	58479399c3	As per RFC and telecon, deprecate cmd line options and their corresponding MCA params for old-style mapping and binding directives cmr=v1.7.5:reviewer=jsquyres:subject=deprecate old-style mapping and binding directives This commit was SVN r30298.	2014-01-15 14:48:39 +00:00
Ralph Castain	fb9e427320	One last corner case - when encountering an overload condition (e.g., by comm_spawning more procs than we have cores) and we are using the default binding policy, do not bind the new procs to anything as this can cause major problems. Instead, let the spawn succeed since the user didn't specifically ask to be bound, and leave the new procs as unbound. Refs trac:4077 This commit was SVN r30200. The following Trac tickets were found above: Ticket 4077 --> https://svn.open-mpi.org/trac/ompi/ticket/4077	2014-01-09 22:39:34 +00:00
Ralph Castain	24e990e747	Fix comm_spawn for oversubscribed systems by correctly computing the number of available slots cmr=v1.7.4:reviewer=jsquyres:subject=Fix comm_spawn for oversubscribed systems This commit was SVN r30197.	2014-01-09 20:33:48 +00:00
Ralph Castain	9fcb46d85a	Correctly detect and handle oversubscription for comm_spawn cmr=v1.7.4:reviewer=jsquyres:subject=Correctly detect and handle oversubscription for comm_spawn This commit was SVN r30186.	2014-01-09 18:27:51 +00:00
Ralph Castain	6e5fedeb04	Oops - add verbose output to inform that cannot default bind due to no cores detected Refs trac:4074 This commit was SVN r30185. The following Trac tickets were found above: Ticket 4074 --> https://svn.open-mpi.org/trac/ompi/ticket/4074	2014-01-09 18:17:14 +00:00
Ralph Castain	7e4748a0f1	Handle the case of nodes that do not report cores, and thus our default binding policy will fail even though binding is supported by defaulting to not binding on those nodes. Thanks to Paul Hargrove for reporting the problem on NetBSD. cmr=v1.7.4:reviewer=jsquyres:subject=Handle the case of nodes that do not report cores This commit was SVN r30180.	2014-01-09 16:27:58 +00:00
Ralph Castain	bf453a2575	Reference the correct variable...sigh Refs trac:4059 This commit was SVN r30163. The following Trac tickets were found above: Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059	2014-01-08 22:36:39 +00:00
Ralph Castain	e724d0d12d	Ensure comm_spawn'd jobs get treated the same wrt setting default mapping directives Refs trac:4059 This commit was SVN r30158. The following Trac tickets were found above: Ticket 4059 --> https://svn.open-mpi.org/trac/ompi/ticket/4059	2014-01-08 15:16:22 +00:00
Ralph Castain	fb650aed0c	Fix how we transfer mapping directives to the job, ensuring that directives that can be given outside of a mapping policy (e.g., oversubscribe and no-use-local) are retained. cmr=v1.7.4:reviewer=jsquyres:subject=Fix how we transfer mapping directives to the job This commit was SVN r30155.	2014-01-08 04:25:43 +00:00
Brian Barrett	8b778903d8	Fix longstanding issue with our multi-project support. Rather than using pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is always set to {datadir,libdir,includedir}/openmpi. This will keep us from having help files in prefix/share/open-rte when building without Open MPI, but in prefix/share/openmpi when building with Open MPI. This commit was SVN r30140.	2014-01-07 22:11:15 +00:00
Mike Dubman	40aadab85f	re-enable map-by dist after last refactoring in rmaps, map-by dist:hca was disabled. reverting it back found/fixed by Elena, reviewed by miked cmr=v1.7.4:reviewer=ompi-rm1.7 This commit was SVN r30118.	2014-01-04 20:44:41 +00:00
Ralph Castain	d5a5caa7e0	Restore the bycore mpirun option for backward compatibility Refs trac:4044 cmr=v1.7.4:reviewer=jsquyres This commit was SVN r30103. The following Trac tickets were found above: Ticket 4044 --> https://svn.open-mpi.org/trac/ompi/ticket/4044	2014-01-02 04:16:43 +00:00
George Bosilca	38cbaeaa82	Try to impose a little bit of consistency on how we parse lists of modules by enforcing the use of OPAL list accessors. This commit was SVN r30045.	2013-12-21 23:23:33 +00:00
Ralph Castain	31248c0985	Correctly add support for the "env" MPI_Info key during comm_spawn, update the "map-by", "rank-by", and "bind-to" Info key behaviors to match the new mapping/ranking/binding system, and update all docs and comments to match. Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node. Refs trac:4003 This commit was SVN r30033. The following Trac tickets were found above: Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003	2013-12-20 20:42:39 +00:00
Ralph Castain	55cd65b149	Don't warn about binding (process and/or memory) if the node cannot do it or if we would overload, but it wasn't specifically requested by the user (i.e., it is the result of the default policy). Instead, just don't bind and quietly move along. Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff. cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings This commit was SVN r29978.	2013-12-19 16:31:45 +00:00
Ralph Castain	c5956e7b8c	Convert debug output to opal_output_verbose Thanks to Tetsuya Mishima for reporting it cmr=v1.7.4:reviewer=jsquyres This commit was SVN r29969.	2013-12-19 00:36:15 +00:00
Ralph Castain	ab4636c47b	Per email on devel list, change the default rank-by to slot unless map-by <obj> is specified, in which case use rank-by <obj> Refs trac:3977 This commit was SVN r29945. The following Trac tickets were found above: Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977	2013-12-18 00:48:50 +00:00
Ralph Castain	53cd00fe16	By setting a default mapping/ranking/binding policy that wasn't "none", we introduced a problem for users of the Mac and any other machine where sockets aren't defined and/or binding is not supported. Fix that by checking to see if the user specified the failing policy - if not, then fall back to the old map/rank by slot and no binding. Refs trac:3977 This commit was SVN r29933. The following Trac tickets were found above: Ticket 3977 --> https://svn.open-mpi.org/trac/ompi/ticket/3977	2013-12-17 14:50:10 +00:00
Ralph Castain	8b6d117541	Per the OMPI devel conference that changed our default behaviors: * default to bind-to core * map-by slot if np=2 * map-by socket (balance across sockets on each node) if np > 2 * map-by <obj> will imply rank-by <obj> by default (leave default binding as above) Fix a bug in the map-by <obj> mapper where we incorrectly compute the #procs to assign if the #slots > #procs cmr=v1.7.4:reviewer=jsquyres:subject=Update default binding and mapping values This commit was SVN r29919.	2013-12-15 17:25:54 +00:00
Ralph Castain	7480beb7f0	Per request from Nathan, add an offset value to the job struct so we can construct a "global rank" that spans multiple jobs during dynamic launch operations. Store a new ORTE_DB_GLOBAL_RANK value for each process in the database, and ensure that we share our own value during connect_accept so both sides can see it. This isn't being used yet - just enabling Nathan to do what he needs. *** NOTE: any use of the OMPI_DB_GLOBAL_RANK database key must be protected by #ifdef OMPI_DB_GLOBAL_RANK as not all RTE's will define this key. *** This commit was SVN r29708.	2013-11-14 17:01:43 +00:00
Ralph Castain	e35ad23176	Correctly compute usage for dynamic spawns when binding is invoked. Ensure we correctly account for existing process usage on each node when computing bindings during dynamic spawns. cmr=v1.7.4:reviewer=hjelmn:subject=Correctly compute usage for dynamic spawns when binding is invoked This commit was SVN r29649.	2013-11-10 00:38:01 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Ralph Castain	7a21661785	Silence a warning when --without-hwloc is used This commit was SVN r28783.	2013-07-13 17:17:17 +00:00
Dave Goodell	3741d62308	fix --without-hwloc build failure All builds since r28682 configured with '--without-hwloc' fail at "make" time without this fix. Reviewed by rhc@ This commit was SVN r28769. The following SVN revision numbers were found above: r28682 --> open-mpi/ompi@446e33a5d8	2013-07-12 17:21:14 +00:00
Ralph Castain	62378209f0	Even if we don't find the default hostfile, and nothing else was provided, then use all the known nodes. cmr:v1.7.3:#3653:reviewer=jsquyres cmr:v1.6.6:#3654:reviewer=jsquyres This commit was SVN r28718.	2013-07-03 22:31:32 +00:00
Ralph Castain	443a6802b9	If the default hostfile is empty, we need to pickup all the known nodes, not just the head node. cmr:v1.7.3:reviewer=jsquyres cmr:v1.6.6:reviewer=jsquyres This commit was SVN r28717.	2013-07-03 22:25:51 +00:00
Ralph Castain	446e33a5d8	There are cases where we want to use the novm state machine, but the backend node topology differs from that where mpirun is executing. In those cases, we can wind up thinking we are oversubscribed because the head node has fewer cores than the compute nodes. To resolve this situation, add the ability to specify a backend topology file that mpirun shall use for its mapping operations. Create a new "set_topology" function in opal hwloc to support it. This commit was SVN r28682.	2013-06-27 03:04:50 +00:00
Ralph Castain	a51a0a8c48	Fix uninitialized var This commit was SVN r28652.	2013-06-18 22:41:47 +00:00
Jeff Squyres	6d173af329	This commit introduces a new "mindist" ORTE RMAPS mapper, as well as some relevant updates/new functionality in the opal/mca/hwloc and orte/mca/rmaps bases. This work was mainly developed by Mellanox, with a bunch of advice from Ralph Castain, and some minor advice from Brice Goglin and Jeff Squyres. Even though this is mainly Mellanox's work, Jeff is committing only for logistical reasons (he holds the hg+svn combo tree, and can therefore commit it directly back to SVN). ----- Implemented distance-based mapping algorithm as a new "mindist" component in the rmaps framework. It allows mapping processes by NUMA due to PCI locality information as reported by the BIOS - from the closest to device to furthest. To use this algorithm, specify: {{{mpirun --map-by dist:<device_name>}}} where <device_name> can be mlx5_0, ib0, etc. There are two modes provided: 1. bynode: load-balancing across nodes 1. byslot: go through slots sequentially (i.e., the first nodes are more loaded) These options are regulated by the optional ''span'' modifier; the command line parameter looks like: {{{mpirun --map-by dist:<device_name>,span}}} So, for example, if there are 2 nodes, each with 8 cores, and we'd like to run 10 processes, the mindist algorithm will place 8 processes to the first node and 2 to the second by default. But if you want to place 5 processes to each node, you can add a span modifier in your command line to do that. If there are two NUMA nodes on the node, each with 4 cores, and we run 6 processes, the mindist algorithm will try to find the NUMA closest to the specified device, and if successful, it will place 4 processes on that NUMA but leaving the remaining two to the next NUMA node. You can also specify the number of cpus per MPI process. This option is handled so that we map as many processes to the closest NUMA as we can (number of available processors at the NUMA divided by number of cpus per rank) and then go on with the next closest NUMA. 
The default binding option for this mapping is bind-to-numa. It works if you don't specify any binding policy. But if you specified binding level that was "lower" than NUMA (i.e hwthread, core, socket) it would bind to whatever level you specify. This commit was SVN r28552.	2013-05-22 13:04:40 +00:00
Ralph Castain	5296099ecb	Fix the cpus-per-rank when binding to hwthreads. Add cpus-per-rank to diag printout Thanks to Elena for reporting the problem This commit was SVN r28508.	2013-05-14 20:17:50 +00:00
Ralph Castain	427b6b0b47	Fix the verbosity of yet another framework...sigh. This commit was SVN r28481.	2013-05-13 14:36:32 +00:00
Ralph Castain	5d7a93c032	Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed. Interesting how many includes had to be fixed here and there to fill in missing dependencies :-) This commit was SVN r28411.	2013-04-29 17:02:37 +00:00
Ralph Castain	252147fba6	Cleanup error message if unknown host is given in -host and -hostfile options This commit was SVN r28262.	2013-03-28 16:52:10 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file can not be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char *, int , or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few component actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump(). 
opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Ralph Castain	256414121e	Protect the cpus-per-rank MCA param registration so that --without-hwloc will build This commit was SVN r28232.	2013-03-27 14:53:30 +00:00
Ralph Castain	317915225c	Finish the binding cleanup by removing the no-longer-used binding level scheme. This proved to be fallible as there is no guarantee that the hierarchy it used matched physical reality of the machine (e.g., is L3 "above" the socket or not). Still have to complete the ppr update, but get the rest of it correct. This commit was SVN r28223.	2013-03-26 20:09:49 +00:00
Ralph Castain	6ee32767d4	Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument. This commit was SVN r28219.	2013-03-26 18:27:50 +00:00
Ralph Castain	2f43989d22	Add debug and handle the use-case where someone (a) uses a hostfile while in a managed allocation to sub-allocate runs, and (b) includes the HNP's node in one of those hostfiles. cmr:v1.7 This commit was SVN r28203.	2013-03-22 00:53:33 +00:00
Ralph Castain	cf9796accd	Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes This commit was SVN r28134.	2013-02-28 01:35:55 +00:00
Ralph Castain	112f8eedb1	Handle the case where rankfile is providing the allocation This commit was SVN r27971.	2013-01-29 20:37:58 +00:00
Nathan Hjelm	bdedd8b0d3	Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1. Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close as they are now symmetric in their functionality. This commit was SVN r27570.	2012-11-06 19:09:26 +00:00
Ralph Castain	285a3b168d	Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds. This commit was SVN r27445.	2012-10-14 03:31:32 +00:00
Ralph Castain	d95025f53a	Ensure we clear the usage numbers when binding on multiple nodes so we don't "carry over" info from one node to the next. Use the same tracking mechanism for binding upwards and in-place to avoid doing a bunch of mallocs. Refs trac:3322 This commit was SVN r27356. The following Trac tickets were found above: Ticket 3322 --> https://svn.open-mpi.org/trac/ompi/ticket/3322	2012-09-20 15:16:06 +00:00
Ralph Castain	a3060cdd15	Fix the bind_downward code - it was incorrectly looking across the entire node instead of only looking below the locale to which the proc had been assigned. In other words, if the proc was mapped to a core, then the only hwthreads that should be considered for binding are those directly below that core. The binding algo was incorrectly looking at ALL hwthreads in that scenario, causing the proc to be bound to an HT outside of the mapped location. This now results in the procs being bound within their assigned location. It also causes us to use only the 0th HT on a core unless --use-hwthread-cpus has been specified (in which case, we use all the HTs in a core). Bind to core binds you to all HTs regardless - the --use-hwthread-cpus only impacts the oversubscribed determination and when binding to HT. cmr:v1.7 This commit was SVN r27342.	2012-09-14 22:01:19 +00:00
Ralph Castain	ca40cb5f1c	Fix comm_spawn by mpirun This commit was SVN r27285.	2012-09-10 17:09:25 +00:00
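
The "mindist" commit (6d173af329) describes two placement modes: filling slots sequentially by default, and load-balancing across nodes when the span modifier is given. The arithmetic in its worked example (2 nodes with 8 cores each, 10 processes) can be illustrated with a minimal Python sketch. This is only a simplified model of the counts described in the commit message, not the actual rmaps code; the function names `place_by_slot` and `place_with_span` are hypothetical, and the round-robin used for the span case is an assumption about how the balancing could be done.

```python
def place_by_slot(slots_per_node, nprocs):
    """Fill each node's slots in order before moving to the next node."""
    placement = []
    remaining = nprocs
    for slots in slots_per_node:
        take = min(slots, remaining)
        placement.append(take)
        remaining -= take
    if remaining:
        raise ValueError("not enough slots for the requested processes")
    return placement


def place_with_span(slots_per_node, nprocs):
    """Balance processes across nodes by assigning them round-robin."""
    if nprocs > sum(slots_per_node):
        raise ValueError("not enough slots for the requested processes")
    placement = [0] * len(slots_per_node)
    node = 0
    placed = 0
    while placed < nprocs:
        # Skip nodes that are already full.
        if placement[node] < slots_per_node[node]:
            placement[node] += 1
            placed += 1
        node = (node + 1) % len(slots_per_node)
    return placement


if __name__ == "__main__":
    print(place_by_slot([8, 8], 10))    # default: first node is filled first
    print(place_with_span([8, 8], 10))  # span: load is balanced across nodes
```

The same sequential fill also models the NUMA example in that commit: with two NUMA nodes of 4 cores each and 6 processes, `place_by_slot([4, 4], 6)` puts 4 processes on the closest NUMA node and the remaining 2 on the next one.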

285 Commits