4897 Commits

Author SHA1 Message Date
Nadezhda Kogteva
a828eada98 sm dstore: set pmix segment size to proper value 2015-03-30 13:34:25 +03:00
Ralph Castain
d07dc362d5 Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common. 2015-03-28 20:34:26 -07:00
Ralph Castain
b67b3619fc If we are using the default bindings, and one or more nodes are not set up to support binding, then don't error out - just don't bind.
Thanks to Annu Desari for pointing out the problem.
2015-03-28 08:20:24 -07:00
Ralph Castain
2f365720b0 Allow root to request the version and help from mpirun without having to override the run-as-root protection.
Thanks to Robert McLay for pointing this out
2015-03-28 08:17:44 -07:00
Ralph Castain
d2d02a1642 ckpt 2015-03-28 07:59:20 -07:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Elena
90f5b2bb84 Introduce -tune command line option to set env vars and mca params from file 2015-03-26 18:33:53 +02:00
rhc54
2ff7575dde Merge pull request #497 from rhc54/topic/sec
Allow for different security domains.
2015-03-25 21:01:29 -07:00
Ralph Castain
6aa33deafb Remove debug 2015-03-25 19:58:51 -07:00
Ralph Castain
10cf455080 Tools need to use the TCP OOB component 2015-03-25 19:56:49 -07:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Ralph Castain
6ba76ed8d8 Per user request, we allow -host to specify a host that is not included in a hostfile (however, we reject it if we were given an allocation by a resource manager). Since we cannot know if an IP addr form references the same node that was previously given as a string name, we have no choice but to assume they are different. Get the topology from the right place in that situation so mpirun can succeed. 2015-03-25 06:16:01 -07:00
rhc54
df24816d64 Merge pull request #488 from lrrajesh/master
Notification msg add severity to the message header.
2015-03-20 09:45:46 -07:00
Ralph Castain
095a8fa684 We don't need to know about non-fatal errors from setting socket options 2015-03-20 07:16:31 -07:00
Ralph Castain
a013f3059f For scalability reasons, and to make life easier for the poor Cray-ites, don't bang on the system for the username - we'll just use the uid. 2015-03-19 21:24:13 -07:00
Howard Pritchard
990e9b47e0 Merge pull request #486 from hppritcha/topic/issue_484
orte/oob: implement alps oob component
2015-03-19 19:40:40 -06:00
Ralph Castain
43a3baad5e Ensure we use the first compute node's topology for mapping
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.

Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.

Correctly count the number of available PUs under each object when given a cpuset

Fix the default binding settings, and correctly count PUs when no cpuset is given

Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Howard Pritchard
6054975913 oob/alps: add configure file for alps oob
Have to have alps rpms installed on a system
for alps component to build, even if separated
by a level of indirection.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-03-19 15:38:14 -07:00
Howard Pritchard
b1f31a4364 orte/oob: implement alps oob component
Implement an almost-do-nothing alps oob component.
When using aprun to launch a job on Cray system,
there is no reason to need an oob system, since ompi
relies on Cray PMI for oob communication.

Fixes #484
2015-03-19 14:11:40 -07:00
lrrajesh
4dc75687e2 Notification msg add severity to the output 2015-03-18 13:55:03 -07:00
Nadezhda Kogteva
7c25b4cea6 grpcomm: fixed brks and rcd algorithms - added enough space for masks in order to get them working at large scale. 2015-03-18 14:33:04 +02:00
Ralph Castain
50277fec76 Adjust MCA param 2015-03-17 19:46:31 -07:00
rhc54
b41d2ad6c4 Merge pull request #481 from rhc54/topic/slurm
Add new MCA parameter to support edge case with debugger at LLNL
2015-03-17 07:40:55 -07:00
Ralph Castain
b01e8c1063 Include the FQDN version and non-stripped version of the hostname in our list of aliases as these (plus localhost) are the most common aliases we see. 2015-03-17 06:26:26 -07:00
Ralph Castain
d7d8ae46ed We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info. 2015-03-17 06:10:20 -07:00
Ralph Castain
3e32c360c7 Add new MCA parameter to support edge case with debugger at LLNL 2015-03-16 20:04:05 -07:00
Ralph Castain
a0487e014c Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used. 2015-03-16 19:42:05 -07:00
Ralph Castain
5ae42c816e Attempt to reduce the RARP traffic during definition of allocations 2015-03-16 16:26:40 -07:00
Ralph Castain
64d11f170a Adjust the default keepalive interval. Refactor the code when setting keepalive options 2015-03-16 12:32:58 -07:00
Ralph Castain
4ded049cbc Modify MCA param description 2015-03-16 11:57:32 -07:00
Ralph Castain
019bba5caf Cleanup a bit - don't need to lookup the protocol number if we just use the right define 2015-03-16 11:54:51 -07:00
Ralph Castain
69ac25bf55 Add support for TCP keepalive on inter-node sockets 2015-03-16 09:59:44 -07:00
adrianreber
714d9aa67e Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart
Topic/orte cr continue like restart
2015-03-12 14:54:02 +01:00
Nathan Hjelm
695dcd5a28 oob/ud: fix compiler warning 2015-03-11 10:53:32 -06:00
Adrian Reber
c08e234af7 FT: fix compilation using --with-ft (5/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

With the changes introduced in the previous patches in this series
some goto constructs for cleanup are no longer necessary and removed.
2015-03-11 14:23:33 +01:00
Adrian Reber
8ba41a834a FT: fix compilation using --with-ft (4/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This patch tries to handle the new xcast semantic.
2015-03-11 14:23:33 +01:00
Adrian Reber
1c5a8df724 FT: fix compilation using --with-ft (2/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

The FT code used barrier mechanisms which have been removed
with aec5cd08bd8c33677276612b899b48618d271efa. This patch replaces
all those different barriers with opal_pmix.fence(NULL, 0);
I am not sure this is completely correct but at least a starting
point for a review.
2015-03-11 14:23:33 +01:00
Adrian Reber
f45dd069bd FT: fix compilation using --with-ft (1/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Gilles Gouaillardet
a69d935d55 oob/tcp: fix misc issues
as reported by Coverity with CIDs 70726, 710564,
1196630, 1269805, 1269803, 1269932
2015-03-10 19:32:01 +09:00
Gilles Gouaillardet
dc0bc756dc iof/base: fix misc memory leak
as reported by Coverity with CID 1196732
2015-03-10 14:37:53 +09:00
Jeff Squyres
a026456bef (orte|ompi|oshmem)*info tools: convert to opal_dl interface
Note that this commit removes option:lt_dladvise from the various
"info" tools output.  This technically breaks our CLI "ABI" because
we're not deprecating it / replacing it with an alias to some other
"info" tool output.

Although the dl/libltdl component contains a "have_lt_dladvise" MCA
var that contains the same information, the "option:lt_dladvise"
output from the various "info" tools is *not* an MCA var, and
therefore we can't alias it.  So it just has to die.
2015-03-09 08:18:13 -07:00
Gilles Gouaillardet
59be12b260 filem/raw: fix misc memory leaks
as reported by Coverity with CIDs 716815, 716817, 720760,
1196703, 1196704, 1196746
2015-03-09 19:56:20 +09:00
Gilles Gouaillardet
2ab9a411f8 plm/base: fix misc memory leaks
as reported by Coverity with CIDs 1196733 and 1196745
2015-03-09 16:25:07 +09:00
Gilles Gouaillardet
fa10025843 ras/slurm: fix misc memory leaks
as reported by Coverity with CIDs 968580 and 1196723-1196727
2015-03-09 15:58:51 +09:00
Gilles Gouaillardet
eae39bd948 ras/simulator: fix misc memory leaks
as reported by Coverity with CIDs 710647, 714133 and 714134
2015-03-09 15:52:29 +09:00
Gilles Gouaillardet
4c0eb11e08 orterun: fix misc errors
as reported by Coverity with CIDs 70700, 71039, 710651
2015-03-09 11:57:18 +09:00
Gilles Gouaillardet
33841361c0 orte-clean: use pclose instead of fclose
as reported by Coverity with CID 1287029
2015-03-09 11:17:59 +09:00
Elena
6c6fe75c7b added one more time interval for barrier to pmix unit test 2015-03-06 10:33:14 +02:00
Ralph Castain
64ec498a20 Add a declspec 2015-03-05 19:48:27 -08:00
Ralph Castain
eaa666bd57 Instantiate debug output variable 2015-03-05 12:25:49 -08:00