openmpi

Author	SHA1	Message	Date
Ralph Castain	4fc4a8346b	Fix a couple of minor issues. Ensure usock isn't used if the session dirs aren't setup. Protect an oddball case where orte_xml_fp is NULL.	2014-10-09 20:58:46 -07:00
Elena	e319c95267	fixes for grpcomm rcd/brucks algorithms	2014-10-09 06:12:26 +02:00
Howard Pritchard	d2bb8d8829	remove alps ess component The alps ess component is obsolete. It relies on header files only present in very old CLE (Cray Linux) 3.X for the Cray XT series. As support for these systems is being dropped starting with release 1.9, this code is being removed.	2014-10-03 13:17:33 -06:00
hpp	8ded59ce0f	fix alps plm to allow explicit host placement It turns out that the alps plm code was developed only on cray systems that were running batch schedulers. However, for bring up and development systems, it's not at all uncommon for there to be no batch scheduler, and thus to orte it appears that orte_num_allocated_nodes is always zero. This forces a user using mpirun on such a system to always specify a host list: mpirun -n 4 -N 1 -host 32,45,68 .... just to get the job to run, but then since the -L argument for aprun is never built, the app always runs on the first batch of nodes that aprun finds available.	2014-10-02 10:42:01 -06:00
Ralph Castain	4320457394	Fix the debug output - you can't print the cpuset pointer using the %p format without generating warnings This commit was SVN r32811.	2014-09-29 17:10:38 +00:00
Howard Pritchard	f8ac8bb6b0	remove improper use of hwloc_bitmap_free When using the native aprun launcher, it was observed that there were frequent memory corruption errors occurring either during a PMI kvs-fence operation, or at mpi termination during opal cleanup of allocated objects. This was especially bad when using aprun --c none. In some cases, the application would even just hang in finalize if using ptmalloc, owing to some kind of infinite loop in cleanup of small blocks, etc. It turns out that the problem was in orte_ess_base_proc_binding's improper use of opal_hwloc_base_get_available_cpus. The cpuset (bitmap) returned from that function is not meant to be freed by the caller. This problem is likely never observed when using the mpirun launcher as there's an early exit if the OMPI_MCA_orte_bound_at_launch environment variable is set. This commit was SVN r32809.	2014-09-29 16:10:37 +00:00
Gilles Gouaillardet	9661e4537f	oob/tcp: fix a race condition Mimic the btl/tcp protocol to solve the race condition that happens when two peers try to connect to each other at the same time cmr=v1.8.4:reviewer=rhc This commit was SVN r32799.	2014-09-26 06:54:30 +00:00
Ralph Castain	17846411c3	Now that we have an ORTE thread running in apps, we can't just call "exit" during RTE abort as that is happening in a thread, and (at least in some environments) doesn't result in the main thread being immediately terminated. Instead, we wind up going thru orte_finalize in the main thread, which isn't what we want. So replace the call to "exit" with the "quick exit" variant "_exit", which causes the entire process to exit immediately. (custom patch has been posted for 1.8.3) This commit was SVN r32780.	2014-09-23 22:51:10 +00:00
Howard Pritchard	1508a01325	Fixes to enable mpirun to work again on Cray The ess pmi module was not handling aprun launched daemons. All daemons were thinking they were vpid 1. Also, turns out that on cray systems using MOM nodes for launched jobs, just detecting whether or not a process is in a PAGG container is not sufficient. Crank up the priority of the alps PLM component in the event that the configure detected the presence of both slurm and alps. Have the ESS pmi component open the pmix framework and select a pmix component. This commit was SVN r32773.	2014-09-23 15:37:26 +00:00
Gilles Gouaillardet	5fa2b6c59c	oob/tcp: fix a race condition Refs trac:4909 This commit was SVN r32754. The following Trac tickets were found above: Ticket 4909 --> https://svn.open-mpi.org/trac/ompi/ticket/4909	2014-09-18 08:17:25 +00:00
Ralph Castain	3a437cbdb3	Silence set-but-not-used warning when timing isn't enabled This commit was SVN r32749.	2014-09-17 00:40:10 +00:00
Ralph Castain	414f4e9783	Try to provide a real hostname for the remote host to aid in debugging Refs trac:4908 This commit was SVN r32748. The following Trac tickets were found above: Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908	2014-09-17 00:39:49 +00:00
Jeff Squyres	9dc49c5f92	oob_tcp_connection: print "<unknown>" instead of "NULL" "NULL" doesn't mean anything to the user, and is somewhat confusing to see in an error message. "<unknown>" at least indicates that there's an error, and we know who the peer is. This commit was SVN r32747.	2014-09-16 22:47:57 +00:00
Ralph Castain	09aecea55a	Can't use show_help as the RML has already been enabled, but we haven't successfully connected back to the HNP. So use opal_output instead and hardwire the message. Refs trac:4908 This commit was SVN r32746. The following Trac tickets were found above: Ticket 4908 --> https://svn.open-mpi.org/trac/ompi/ticket/4908	2014-09-16 22:21:02 +00:00
Ralph Castain	4bbc9a28d6	Try to resolve the simultaneous connection problem by being a little more careful about the choice of returned status when a connection is refused. As before, have the higher vpid of the two peers retry the connection, while the lower one waits. This can happen in a couple of places, so try to hit them all. Since this is hard to test, will ask Gilles to give it a try since he's the one who is seeing it. cmr=v1.8.3:reviewer=rhc This commit was SVN r32744.	2014-09-16 18:59:36 +00:00
Ralph Castain	a74428513d	Provide a better help message when we are unable to complete a connection due to a firewall. cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32743.	2014-09-16 16:28:29 +00:00
Ralph Castain	dfb952fa78	[Contribution from Artem - moved it to svn from git for him] Replace our old, clunky timing setup with a much nicer one that is only available if configured with --enable-timing. Add a tool for profiling clock differences between the nodes so you can get more precise timing measurements. I'll ask Artem to update the Github wiki with full instructions on how to use this setup. This commit was SVN r32738.	2014-09-15 18:00:46 +00:00
Jeff Squyres	e95ed94a94	plm_rsh_module.c: output to the framework output Trivial fix from r32686: don't output to stream 0, but rather to orte_plm_base_framework.framework_output (this is the way it was before r32686). In reality, this is going to end up being stream 0, anyway, but we might as well be pedantically correct... Refs trac:4897. This commit was SVN r32726. The following SVN revision numbers were found above: r32686 --> open-mpi/ompi@4df1aa63f7 The following Trac tickets were found above: Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897	2014-09-13 00:46:35 +00:00
Ralph Castain	9e7e90265f	Temporarily make the direct grpcomm component the default until we can debug the other modules This commit was SVN r32707.	2014-09-11 14:47:54 +00:00
Ralph Castain	4eb6291334	Avoid conflicts when multiple collectives are underway in ORTE by giving each grpcomm component its own RML tag and posting persistent receives. We use the signature anyway to determine which collective the received message is addressing, so there is no need to post non-persistent receives. This commit was SVN r32703.	2014-09-10 17:36:16 +00:00
Ralph Castain	ea11e63f59	Per patch from Tetsuya, allow the user to bind-to none when specifying multiple pe's/rank as requested by Reuti. This allows the user to reserve multiple "slots" in the allocation for each process while mapping, but not to bind the process to specific processing elements on the node. Reviewed by rhc, so RM-approved to go across to v1.8.3 cmr=v1.8.3:reviewer=ompi-gk1.8 This commit was SVN r32701.	2014-09-10 15:52:18 +00:00
Ralph Castain	4207b4c4ad	Improve the --bind-to help message to better indicate the default options under various values of np. Remove the warning message if the user doesn't specify a binding policy and we are overloaded cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32687.	2014-09-08 21:03:51 +00:00
Ralph Castain	4df1aa63f7	Since we've run into the situation where someone puts a script wrapper around a launcher such as srun, we need to always protect MCA cmd line params with quotes. This means we also need to protect the backend from quotes coming into the system as part of a value, or else the parser gets confused. So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes. cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32686.	2014-09-08 20:38:46 +00:00
Ralph Castain	6323b226c7	Bring over some updates from the PMIx branch - mostly just minor cleanups. Make the direct grpcomm component no longer be the default. For now, we seem to be having problems with non-blocking fence operations, so make them not be the default under any scenario (e.g., when sm is the only btl in operation). This commit was SVN r32673.	2014-09-06 19:19:44 +00:00
Ralph Castain	4d186e6402	Properly protect the MCA parameters being registered by the OOB/TCP component when IPv6 is enabled cmr=v1.8.3:reviewer=jsquyres This commit was SVN r32662.	2014-09-02 14:53:00 +00:00
Ralph Castain	f2b26bde4c	Resolve a race condition that could cause us to hang during abnormal terminations due to multi-counting num_terminated This commit was SVN r32660.	2014-09-02 00:32:52 +00:00
Ralph Castain	e49ca05f11	Remove unused variable This commit was SVN r32651.	2014-08-31 03:11:50 +00:00
Ralph Castain	5cdbc00136	Re-enable the usock oob component. Ensure the TCP component promotes messages for other procs to the OOB base so that other components have a chance to send the relay. Seems to be passing MTT, so let's see how it works for others. This commit was SVN r32650.	2014-08-30 19:33:46 +00:00
Ralph Castain	a2085a5916	Fix the PSM transport key generator to match prior releases This commit was SVN r32649.	2014-08-30 00:48:25 +00:00
Ralph Castain	8faabed2cd	Add some further initialization and protection for zero-byte messages This commit was SVN r32644.	2014-08-29 17:24:55 +00:00
Ralph Castain	2b225e3776	Cleanup a race condition regarding marking that waitpid_fired. We should always mark it as fired when we enter the wait_local_proc routine, and also mark it as no longer alive if iof_complete has also been found. If other places in the code also update those flags, there is no harm done. This commit was SVN r32643.	2014-08-29 17:03:31 +00:00
Ralph Castain	730e28349e	Some minor uninitialized variable cleanups This commit was SVN r32629.	2014-08-29 02:21:13 +00:00
Ralph Castain	fafdbeec0c	Cleanup and enable the new daemon collective modules for more scalable operations. Thanks to Nadezhda Kogteva (Mellanox) for doing them. This commit was SVN r32624.	2014-08-28 20:35:35 +00:00
Ralph Castain	731a878ff3	Add a bunch of debug to help track down the problem, and eventually find another place where comparison of signatures was incorrectly performed - use the dss compare operation to be consistent and safe This commit was SVN r32620.	2014-08-27 19:52:20 +00:00
Ralph Castain	b87b69e977	Ensure the nodes get added to the job map on the remote nodes, add some debug to grpcomm daemon array construction This commit was SVN r32617.	2014-08-27 16:16:46 +00:00
Ralph Castain	842aaf6167	Correctly end mapping oversubscribed nodes round-robin byslot cmr=v1.8.3:reviewer=rhc This commit was SVN r32616.	2014-08-27 16:15:18 +00:00
Ralph Castain	5a13cdb739	Fix a race condition caused by a bad attribute flag that created an OR instead of an AND condition check This commit was SVN r32587.	2014-08-22 22:48:16 +00:00
Ralph Castain	039b7acfb5	Fix the quoting algorithm so only rsh command lines get quoted values cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32586.	2014-08-22 22:47:38 +00:00
Ralph Castain	f00af81c1d	Little more cleanup under the abort cases cited by Gilles. All seem to be working now This commit was SVN r32585.	2014-08-22 19:57:57 +00:00
Ralph Castain	8f1b9b463e	Fix shared memory operations - need to pass the local topology and cpusets of all local peers so we can properly compute relative locality for them. Also need to set default locality to "on node" in case where cpusets are not passed because procs are not bound. This commit was SVN r32577.	2014-08-22 05:17:51 +00:00
Ralph Castain	c6f78d6e54	The PMI ess component now gets used for more than direct launch, so only set standalone_operation flag if no daemon uri is available so we aggregate show_help messages This commit was SVN r32574.	2014-08-22 03:00:56 +00:00
Ralph Castain	aec5cd08bd	Per the PMIx RFC: WHAT: Merge the PMIx branch into the devel repo, creating a new OPAL “pmix” framework to abstract PMI support for all RTEs. Replace the ORTE daemon-level collectives with a new PMIx server and update the ORTE grpcomm framework to support server-to-server collectives WHY: We’ve had problems dealing with variations in PMI implementations, and need to extend the existing PMI definitions to meet exascale requirements. WHEN: Mon, Aug 25 WHERE: https://github.com/rhc54/ompi-svn-mirror.git Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding. All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level. Accordingly, we have: * created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations. * Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported. * Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint * removed the prior OMPI/OPAL modex code * added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform. * retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand This commit was SVN r32570.	2014-08-21 18:56:47 +00:00
Ralph Castain	b4511913f6	Remove an unnecessary optimization that can cause more trouble than it's worth - just try all the addresses that are given to us. Refs trac:4870 This commit was SVN r32558. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-20 20:58:07 +00:00
Ralph Castain	fa28710d53	Track down the last piece of the connection problem. It appears that providing a netmask of 0 to opal_net_samenetwork results in everything looking like it is on the same network. Hence, we were not retaining any of the alternative addresses, so we had no other way to check them. Refs trac:4870 This commit was SVN r32556. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-20 16:55:36 +00:00
Ralph Castain	343038af7b	Frazzle-frump! Missed that we reset the peer state just before the new check. Refs trac:4870 This commit was SVN r32554. The following Trac tickets were found above: Ticket 4870 --> https://svn.open-mpi.org/trac/ompi/ticket/4870	2014-08-19 22:34:49 +00:00
Ralph Castain	0a91fdf85f	If an initial address fails to connect, record that fact and attempt the next address for that proc. If nothing succeeds, then declare failure. cmr=v1.8.2:reviewer=edgar This commit was SVN r32553.	2014-08-19 19:48:24 +00:00
Ralph Castain	024572cb6c	Sigh - I promised to remove these deprecation warnings back in June. My apologies to Dave Goodell and others who requested it. cmr=v1.8.2:reviewer=dgoodell:subject=remove deprecation warnings for pernode, npernode, and npersocket This commit was SVN r32552.	2014-08-19 19:40:20 +00:00
Jeff Squyres	1551339eba	rsh: revert part of r32517: keep the quoting As part of reviewing CMR #4860, I talked through r32517 with Ralph. In attempt to fix various rsh quoting problems, r32517 removed all the quoting from the main code path and then only added it back in at the end in some cases. This commit puts back the quoting parts that were removed in r32517 (r32517 fixed 2 other important bugs: a) change "--<foo>" to "--mca <foo_equivalent> 1" so that de-duplication works, and b) change a != to ==). refs trac:4860 This commit was SVN r32524. The following SVN revision numbers were found above: r32517 --> open-mpi/ompi@7342bce58f The following Trac tickets were found above: Ticket 4860 --> https://svn.open-mpi.org/trac/ompi/ticket/4860	2014-08-13 19:27:10 +00:00
Ralph Castain	7342bce58f	Cleanup the over-aggressive quoting of params on the orted cmd line. Remove duplicates caused by passing on both cmd line shortcuts and the mca param version of the same thing. Fixes trac:4857 cmr=v1.8.2:reviewer=jsquyres This commit was SVN r32517. The following Trac tickets were found above: Ticket 4857 --> https://svn.open-mpi.org/trac/ompi/ticket/4857	2014-08-13 03:51:04 +00:00
George Bosilca	de7191132d	Remove a few warnings. This commit was SVN r32506.	2014-08-11 13:34:44 +00:00

3426 commits