openmpi

Author	SHA1	Message	Date
Ralph Castain	fc3b777ef5	Cleanup a variable that isn't used if pmi2 support is available Refs trac:3683 This commit was SVN r28841. The following Trac tickets were found above: Ticket 3683 --> https://svn.open-mpi.org/trac/ompi/ticket/3683	2013-07-18 17:19:13 +00:00
Ralph Castain	92c6b806b9	Based on a patch submitted by Piotr Lesnicki of Bull, cleanup the PMI2 support. This has not been tested yet on multiple environments (e.g., Cray), so it needs more evaluation prior to moving to the 1.7 branch. cmr:v1.7.3:reviewer=rhc This commit was SVN r28837.	2013-07-18 14:46:07 +00:00
Ralph Castain	10ca1c1b04	Turns out that there was exactly ONE place in all of the OMPI code base that still referred to OPAL_TRACE, though a few places retained the include file for no reason. So no point in letting this sit as it is clearly an unused "feature". This commit was SVN r28789.	2013-07-14 18:57:20 +00:00
Ralph Castain	7a21661785	Silence a warning when --without-hwloc is used This commit was SVN r28783.	2013-07-13 17:17:17 +00:00
Dave Goodell	3741d62308	fix --without-hwloc build failure All builds since r28682 configured with '--without-hwloc' fail at "make" time without this fix. Reviewed by rhc@ This commit was SVN r28769. The following SVN revision numbers were found above: r28682 --> open-mpi/ompi@446e33a5d8	2013-07-12 17:21:14 +00:00
Jeff Squyres	baa3182794	Per RFC (http://www.open-mpi.org/community/lists/devel/2013/07/12534.php), remove a bunch of dead code. This commit was SVN r28756.	2013-07-11 17:34:28 +00:00
Ralph Castain	028f5ee7a6	Cleanup some bitrot from moving the db framework to opal and from the new mca param system This commit was SVN r28741.	2013-07-09 14:37:08 +00:00
Ralph Castain	62378209f0	Even if we don't find the default hostfile, and nothing else was provided, then use all the known nodes. cmr:v1.7.3:#3653:reviewer=jsquyres cmr:v1.6.6:#3654:reviewer=jsquyres This commit was SVN r28718.	2013-07-03 22:31:32 +00:00
Ralph Castain	443a6802b9	If the default hostfile is empty, we need to pick up all the known nodes, not just the head node. cmr:v1.7.3:reviewer=jsquyres cmr:v1.6.6:reviewer=jsquyres This commit was SVN r28717.	2013-07-03 22:25:51 +00:00
Ralph Castain	446e33a5d8	There are cases where we want to use the novm state machine, but the backend node topology differs from that where mpirun is executing. In those cases, we can wind up thinking we are oversubscribed because the head node has fewer cores than the compute nodes. To resolve this situation, add the ability to specify a backend topology file that mpirun shall use for its mapping operations. Create a new "set_topology" function in opal hwloc to support it. This commit was SVN r28682.	2013-06-27 03:04:50 +00:00
Joshua Ladd	0b5c1f2ea8	Add 'generic' support for PMI2 (previously, we checked for PMI2 only on Cray systems.) If your resource manager (e.g. SLURM) has support for PMI2, then the --with-pmi configure flag will enable its usage. If you don't have PMI2, then you will fall back to regular old PMI1. This patch was submitted by Ralph Castain and reviewed and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd This commit was SVN r28666.	2013-06-21 15:28:14 +00:00
Nathan Hjelm	299d5b3dd7	Fix two debugger attach bugs. - orte_debugger_init_after_spawn was not being called for debuggers that use the MPIR_attach_fifo to co-locate debugger daemons. - MPIR_Breakpoint was not getting called if a debugger reattached. Add a job state (ORTE_JOB_STATE_DEBUGGER_DETACH) to reset mpir_breakpoint_fired to false when a debugger detaches to ensure MPIR_Breakpoint is called if another debugger attaches. Tested with STAT 2.0/launchmon 1.0. cmr:v1.7 This commit was SVN r28665.	2013-06-20 16:18:05 +00:00
Ralph Castain	a51a0a8c48	Fix uninitialized var This commit was SVN r28652.	2013-06-18 22:41:47 +00:00
George Bosilca	b4ebc417a1	Correctly register the component MCA parameters. Few cleanups in the includes. This commit was SVN r28645.	2013-06-15 16:05:09 +00:00
Nathan Hjelm	518d1fe200	Fix two typos that prevented alps direct launch from working This commit was SVN r28628.	2013-06-13 17:04:08 +00:00
Joshua Ladd	61ffb47573	Minor fix for the min-dist mapping algorithm: we need to call 'get_nbobjs_by_type' first, before we get the sorted list of nodes - we need to add node objects and fill them in the summary object for the current topology. This patch was submitted by Elena Elkina and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd This commit was SVN r28578.	2013-05-31 15:19:59 +00:00
Jeff Squyres	6d173af329	This commit introduces a new "mindist" ORTE RMAPS mapper, as well as some relevant updates/new functionality in the opal/mca/hwloc and orte/mca/rmaps bases. This work was mainly developed by Mellanox, with a bunch of advice from Ralph Castain, and some minor advice from Brice Goglin and Jeff Squyres. Even though this is mainly Mellanox's work, Jeff is committing only for logistical reasons (he holds the hg+svn combo tree, and can therefore commit it directly back to SVN). ----- Implemented distance-based mapping algorithm as a new "mindist" component in the rmaps framework. It allows mapping processes by NUMA due to PCI locality information as reported by the BIOS - from the closest to device to furthest. To use this algorithm, specify: {{{mpirun --map-by dist:<device_name>}}} where <device_name> can be mlx5_0, ib0, etc. There are two modes provided: 1. bynode: load-balancing across nodes 1. byslot: go through slots sequentially (i.e., the first nodes are more loaded) These options are regulated by the optional ''span'' modifier; the command line parameter looks like: {{{mpirun --map-by dist:<device_name>,span}}} So, for example, if there are 2 nodes, each with 8 cores, and we'd like to run 10 processes, the mindist algorithm will place 8 processes to the first node and 2 to the second by default. But if you want to place 5 processes to each node, you can add a span modifier in your command line to do that. If there are two NUMA nodes on the node, each with 4 cores, and we run 6 processes, the mindist algorithm will try to find the NUMA closest to the specified device, and if successful, it will place 4 processes on that NUMA but leaving the remaining two to the next NUMA node. You can also specify the number of cpus per MPI process. This option is handled so that we map as many processes to the closest NUMA as we can (number of available processors at the NUMA divided by number of cpus per rank) and then go on with the next closest NUMA. 
The default binding option for this mapping is bind-to-numa. It applies if you don't specify any binding policy. But if you specify a binding level that is "lower" than NUMA (i.e., hwthread, core, socket), it will bind to whatever level you specify. This commit was SVN r28552.	2013-05-22 13:04:40 +00:00
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Ralph Castain	e100b8d165	don't need the return value, but should check for error This commit was SVN r28534.	2013-05-16 15:15:02 +00:00
Jeff Squyres	128cc27417	Minor type fix (they're both enums/ints, so the compiler previously silently cast them). This commit was SVN r28532.	2013-05-16 00:47:37 +00:00
Ralph Castain	3a372a65b8	Mapping policies must be tested as equalities as they are values, not bitmasks This commit was SVN r28526.	2013-05-15 13:45:00 +00:00
Ralph Castain	29e4b0cc50	Cannot test equality on mapping directives as it is a bitmask This commit was SVN r28525.	2013-05-15 13:41:49 +00:00
Ralph Castain	04b11accd3	Silence a few warnings This commit was SVN r28515.	2013-05-14 21:58:40 +00:00
Ralph Castain	5296099ecb	Fix the cpus-per-rank when binding to hwthreads. Add cpus-per-rank to diag printout Thanks to Elena for reporting the problem This commit was SVN r28508.	2013-05-14 20:17:50 +00:00
Ralph Castain	427b6b0b47	Fix the verbosity of yet another framework...sigh. This commit was SVN r28481.	2013-05-13 14:36:32 +00:00
Jeff Squyres	456df1c9f7	Remove redundant opal_output() messages from the module; the called functions will now show_help() their own error messages if something goes wrong (per r28470). This commit was SVN r28471. The following SVN revision numbers were found above: r28470 --> open-mpi/ompi@2ff95a7739	2013-05-10 15:12:07 +00:00
Jeff Squyres	2ff95a7739	Proper show_help error messages for LAMA. This commit was SVN r28470.	2013-05-10 15:06:25 +00:00
Ralph Castain	f15fe5045e	Ensure that debugger connect can occur by getting the rml contact info updated before calling init_after_spawn cmr:v1.7.3,reviewer=jsquyres This commit was SVN r28455.	2013-05-06 22:00:45 +00:00
Ralph Castain	c52b94af8b	Revert r28453 and r28452 - wrong fix This commit was SVN r28454. The following SVN revision numbers were found above: r28452 --> open-mpi/ompi@756ee4b5e0 r28453 --> open-mpi/ompi@6da24143a2	2013-05-06 21:52:17 +00:00
Ralph Castain	6da24143a2	Minor performance improvement This commit was SVN r28453.	2013-05-06 20:27:16 +00:00
Ralph Castain	756ee4b5e0	Update the rml_uri for each proc so debuggers can attach This commit was SVN r28452.	2013-05-06 20:18:14 +00:00
Ralph Castain	707d0e653a	Must use equal and not & comparison for mapping directives This commit was SVN r28451.	2013-05-06 15:07:12 +00:00
Ralph Castain	a0a6412545	Do a little cleanup on abnormal termination procedure - don't keep submitting forced exit events (one will do), no need to reset the abnormal termination pipe event in orterun, etc. This commit was SVN r28450.	2013-05-05 17:39:45 +00:00
Jeff Squyres	42a9a4c62c	After examining a '''lot''' of MTT output with Ralph, fix the cause of many, many MTT timeouts when running jobs under SLURM: send the right command at the end to cause remote orteds to shut down. This commit was SVN r28438.	2013-05-02 00:23:53 +00:00
Ralph Castain	5d7a93c032	Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed. Interesting how many includes had to be fixed here and there to fill in missing dependencies :-) This commit was SVN r28411.	2013-04-29 17:02:37 +00:00
Ralph Castain	3a354c4ea3	Cleanup the verbose output channel name This commit was SVN r28391.	2013-04-24 23:44:02 +00:00
Ralph Castain	c5e1a7dc65	fix typo This commit was SVN r28390.	2013-04-24 23:37:59 +00:00
Nathan Hjelm	55decca2b7	routed/debruijn: if the next hop for a message is unknown forward the message to the parent process This commit was SVN r28384.	2013-04-24 19:05:25 +00:00
Ralph Castain	2e8946db0a	Add some debug output This commit was SVN r28371.	2013-04-23 23:11:22 +00:00
Ralph Castain	b6377a2138	Properly remove the object from the list prior to releasing it when an error is encountered This commit was SVN r28370.	2013-04-23 22:44:52 +00:00
Ralph Castain	1a54e52da4	Per conversation with Josh H., remove the stale rsh component from the filem framework. Let the "raw" component do the job. This commit was SVN r28338.	2013-04-16 20:39:48 +00:00
Jeff Squyres	349ee654c1	Fix some --without-hwloc compile errors. Also remove one assigned-but-not-used variable assignment. This commit was SVN r28321.	2013-04-10 15:08:31 +00:00
Ralph Castain	45af6cf59e	The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem. This commit was SVN r28308.	2013-04-08 23:34:16 +00:00
Ralph Castain	76426285f0	Cannot retain opal_buffer_t, so use a copy This commit was SVN r28302.	2013-04-07 23:02:59 +00:00
Ralph Castain	698b4ad6e7	Fix the parameter handling so no-tree-spawn isn't getting reversed This commit was SVN r28300.	2013-04-07 15:48:25 +00:00
Ralph Castain	1f011bef99	Cleanup the updated sys limits capability. Fix a few copy/paste bugs (my bad). Shift the limit set to the ODLS default module so that we set the limits for all apps, even those that don't call opal_init. Leave it in opal_init as well to support direct-launch apps, but ensure we only set the limits once by removing the envar after launch by ODLS. Provide some nice error messages if we fail to set the limits. Since the user had to specifically request we set the limit, treat failure as an error-out situation. This commit was SVN r28288.	2013-04-04 16:00:17 +00:00
Ralph Castain	252147fba6	Cleanup error message if unknown host is given in -host and -hostfile options This commit was SVN r28262.	2013-03-28 16:52:10 +00:00
Ralph Castain	e6ae088813	Cleanup error outputs when a daemon fails to start This commit was SVN r28261.	2013-03-28 16:51:19 +00:00
Ralph Castain	21ee48de57	Add missing static declaration This commit was SVN r28247.	2013-03-27 21:59:17 +00:00
Ralph Castain	1b5b9afa85	Silence warnings This commit was SVN r28244.	2013-03-27 21:56:46 +00:00
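
The message for commit 6d173af329 in the table above gives worked examples of how the "mindist" mapper spreads processes: sequential filling by default, and load-balancing when the span modifier is given. The following is a minimal Python sketch of that arithmetic only. It is not the Open MPI implementation; the function name `distribute` and its parameters are hypothetical, and it assumes the slot lists are already ordered from closest to furthest.

```python
def distribute(num_procs, slots, span=False):
    """Return how many processes land on each location (node or NUMA domain).

    Hypothetical illustration of the placement arithmetic described in the
    commit message; `slots` lists the free cores per location, ordered from
    closest to furthest.
    """
    if num_procs > sum(slots):
        raise ValueError("more processes than available slots")

    counts = [0] * len(slots)
    remaining = num_procs

    if span:
        # Load-balance: hand out one process at a time, round-robin,
        # skipping locations that are already full.
        i = 0
        while remaining > 0:
            if counts[i] < slots[i]:
                counts[i] += 1
                remaining -= 1
            i = (i + 1) % len(slots)
    else:
        # Default: fill each location completely before moving to the next.
        for i, free in enumerate(slots):
            take = min(free, remaining)
            counts[i] = take
            remaining -= take

    return counts


# Examples taken from the commit message:
print(distribute(10, [8, 8]))             # 2 nodes x 8 cores, default
print(distribute(10, [8, 8], span=True))  # same, with the span modifier
print(distribute(6, [4, 4]))              # 2 NUMA domains x 4 cores
```

With the numbers from the commit message, the default case places 8 and 2 processes on the two nodes, the span case places 5 and 5, and the NUMA case places 4 on the closest domain and 2 on the next.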


2986 commits