Commit graph

4800 commits

Author SHA1 Message Date
Ralph Castain
6d205a3c80 Ensure that singletons pick up the oob/tcp component 2015-03-30 18:10:08 -07:00
Ralph Castain
2fa56fb329 Ensure that orte-submit picks the correct ess module as it is -never- allowed to be used as a distributed tool
Thanks to Mark Santcroos for diagnosing this one.
2015-03-30 18:08:34 -07:00
rhc54
bc016617a0 Merge pull request #501 from rhc54/topic/sec2
Support authentication across security domains
2015-03-30 09:59:43 -07:00
Nadezhda Kogteva
a828eada98 sm dstore: set pmix segment size to proper value 2015-03-30 13:34:25 +03:00
Ralph Castain
d07dc362d5 Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common. 2015-03-28 20:34:26 -07:00
Ralph Castain
b67b3619fc If we are using the default bindings, and one or more nodes are not set up to support binding, then don't error out - just don't bind.
Thanks to Annu Desari for pointing out the problem.
2015-03-28 08:20:24 -07:00
Ralph Castain
2f365720b0 Allow root to request the version and help from mpirun without having to override the run-as-root protection.
Thanks to Robert McLay for pointing this out
2015-03-28 08:17:44 -07:00
Ralph Castain
d2d02a1642 ckpt 2015-03-28 07:59:20 -07:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Elena
90f5b2bb84 Introduce -tune command line option to set env vars and mca params from file 2015-03-26 18:33:53 +02:00
rhc54
2ff7575dde Merge pull request #497 from rhc54/topic/sec
Allow for different security domains.
2015-03-25 21:01:29 -07:00
Ralph Castain
6aa33deafb Remove debug 2015-03-25 19:58:51 -07:00
Ralph Castain
10cf455080 Tools need to use the TCP OOB component 2015-03-25 19:56:49 -07:00
Ralph Castain
1b24536941 Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail. 2015-03-25 13:22:01 -07:00
Ralph Castain
6ba76ed8d8 Per user request, we allow -host to specify a host that is not included in a hostfile (however, we reject it if we were given an allocation by a resource manager). Since we cannot know if an IP addr form references the same node that was previously given as a string name, we have no choice but to assume they are different. Get the topology from the right place in that situation so mpirun can succeed. 2015-03-25 06:16:01 -07:00
rhc54
df24816d64 Merge pull request #488 from lrrajesh/master
Notification msg add severity to the message header.
2015-03-20 09:45:46 -07:00
Ralph Castain
095a8fa684 We don't need to know about non-fatal errors from setting socket options 2015-03-20 07:16:31 -07:00
Ralph Castain
a013f3059f For scalability reasons, and to make life easier for the poor Cray-ites, don't bang on the system for the username - we'll just use the uid. 2015-03-19 21:24:13 -07:00
Howard Pritchard
990e9b47e0 Merge pull request #486 from hppritcha/topic/issue_484
orte/oob: implement alps oob component
2015-03-19 19:40:40 -06:00
Ralph Castain
43a3baad5e Ensure we use the first compute node's topology for mapping
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.

Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.

Correctly count the number of available PUs under each object when given a cpuset

Fix the default binding settings, and correctly count PUs when no cpuset is given

Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Howard Pritchard
6054975913 oob/alps: add configure file for alps oob
Have to have alps rpms installed on a system
for alps component to build, even if separated
by a level of indirection.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-03-19 15:38:14 -07:00
Howard Pritchard
b1f31a4364 orte/oob: implement alps oob component
Implement an almost-do-nothing alps oob component.
When using aprun to launch a job on Cray system,
there is no reason to need an oob system, since ompi
relies on Cray PMI for oob communication.

Fixes #484
2015-03-19 14:11:40 -07:00
lrrajesh
4dc75687e2 Notification msg add severity to the output 2015-03-18 13:55:03 -07:00
Nadezhda Kogteva
7c25b4cea6 grpcomm: fixed brks and rcd algorithms - added enough space for masks so that they work at large scale. 2015-03-18 14:33:04 +02:00
Ralph Castain
50277fec76 Adjust MCA param 2015-03-17 19:46:31 -07:00
rhc54
b41d2ad6c4 Merge pull request #481 from rhc54/topic/slurm
Add new MCA parameter to support edge case with debugger at LLNL
2015-03-17 07:40:55 -07:00
Ralph Castain
b01e8c1063 Include the FQDN version and non-stripped version of the hostname in our list of aliases as these (plus localhost) are the most common aliases we see. 2015-03-17 06:26:26 -07:00
Ralph Castain
d7d8ae46ed We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info. 2015-03-17 06:10:20 -07:00
Ralph Castain
3e32c360c7 Add new MCA parameter to support edge case with debugger at LLNL 2015-03-16 20:04:05 -07:00
Ralph Castain
a0487e014c Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used. 2015-03-16 19:42:05 -07:00
Ralph Castain
5ae42c816e Attempt to reduce the RARP traffic during definition of allocations 2015-03-16 16:26:40 -07:00
Ralph Castain
64d11f170a Adjust the default keepalive interval. Refactor the code when setting keepalive options 2015-03-16 12:32:58 -07:00
Ralph Castain
4ded049cbc Modify MCA param description 2015-03-16 11:57:32 -07:00
Ralph Castain
019bba5caf Cleanup a bit - don't need to lookup the protocol number if we just use the right define 2015-03-16 11:54:51 -07:00
Ralph Castain
69ac25bf55 Add support for TCP keepalive on inter-node sockets 2015-03-16 09:59:44 -07:00
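The keepalive commits above (69ac25bf55, 019bba5caf, 64d11f170a) describe the feature only in outline. The following is a minimal sketch of enabling TCP keepalive on a socket with the standard sockets API; it is not taken from the ORTE source. The helper name `enable_keepalive` and its parameters are hypothetical, and the tuning options are guarded because they are not available on every platform. Failures on the tuning options are ignored here, which is an assumption in the spirit of commit 095a8fa684.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not the ORTE implementation): turn on keepalive
 * probes for TCP socket `sd`. Returns 0 on success, or -1 if the
 * mandatory SO_KEEPALIVE option could not be set. */
int enable_keepalive(int sd, int idle_secs, int intvl_secs, int max_probes)
{
    int on = 1;

    /* Avoid unused-parameter warnings where a tuning option is missing. */
    (void) idle_secs;
    (void) intvl_secs;
    (void) max_probes;

    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }

    /* The tuning options are platform-specific; errors are treated as
     * non-fatal. IPPROTO_TCP is used directly, so no protocol lookup
     * is needed. */
#if defined(TCP_KEEPIDLE)
    (void) setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs));
#endif
#if defined(TCP_KEEPINTVL)
    (void) setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs, sizeof(intvl_secs));
#endif
#if defined(TCP_KEEPCNT)
    (void) setsockopt(sd, IPPROTO_TCP, TCP_KEEPCNT, &max_probes, sizeof(max_probes));
#endif
    return 0;
}
```

A caller would typically invoke such a helper once, right after the socket is created or accepted.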
adrianreber
714d9aa67e Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart
Topic/orte cr continue like restart
2015-03-12 14:54:02 +01:00
Nathan Hjelm
695dcd5a28 oob/ud: fix compiler warning 2015-03-11 10:53:32 -06:00
Adrian Reber
c08e234af7 FT: fix compilation using --with-ft (5/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

With the changes introduced in the previous patches in this series
some goto constructs for cleanup are no longer necessary and removed.
2015-03-11 14:23:33 +01:00
Adrian Reber
8ba41a834a FT: fix compilation using --with-ft (4/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This patch tries to handle the new xcast semantic.
2015-03-11 14:23:33 +01:00
Adrian Reber
1c5a8df724 FT: fix compilation using --with-ft (2/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

The FT code used barrier mechanisms which have been removed
with aec5cd08bd. This patch replaces
all those different barriers with opal_pmix.fence(NULL, 0);
I am not sure this is completely correct but at least a starting
point for a review.
2015-03-11 14:23:33 +01:00
Adrian Reber
f45dd069bd FT: fix compilation using --with-ft (1/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Gilles Gouaillardet
a69d935d55 oob/tcp: fix misc issues
as reported by Coverity with CIDs 70726, 710564,
1196630, 1269805, 1269803, 1269932
2015-03-10 19:32:01 +09:00
Gilles Gouaillardet
dc0bc756dc iof/base: fix misc memory leak
as reported by Coverity with CID 1196732
2015-03-10 14:37:53 +09:00
Jeff Squyres
a026456bef (orte|ompi|oshmem)*info tools: convert to opal_dl interface
Note that this commit removes option:lt_dladvise from the various
"info" tools output.  This technically breaks our CLI "ABI" because
we're not deprecating it / replacing it with an alias to some other
"info" tool output.

Although the dl/libltdl component contains a "have_lt_dladvise" MCA
var that contains the same information, the "option:lt_dladvise"
output from the various "info" tools is *not* an MCA var, and
therefore we can't alias it.  So it just has to die.
2015-03-09 08:18:13 -07:00
Gilles Gouaillardet
59be12b260 filem/raw: fix misc memory leaks
as reported by Coverity with CIDs 716815, 716817, 720760,
1196703, 1196704, 1196746
2015-03-09 19:56:20 +09:00
Gilles Gouaillardet
2ab9a411f8 plm/base: fix misc memory leaks
as reported by Coverity with CIDs 1196733 and 1196745
2015-03-09 16:25:07 +09:00
Gilles Gouaillardet
fa10025843 ras/slurm: fix misc memory leaks
as reported by Coverity with CIDs 968580 and 1196723-1196727
2015-03-09 15:58:51 +09:00
Gilles Gouaillardet
eae39bd948 ras/simulator: fix misc memory leaks
as reported by Coverity with CIDs 710647, 714133 and 714134
2015-03-09 15:52:29 +09:00
Gilles Gouaillardet
4c0eb11e08 orterun: fix misc errors
as reported by Coverity with CIDs 70700, 71039, 710651
2015-03-09 11:57:18 +09:00
Gilles Gouaillardet
33841361c0 orte-clean: use pclose instead of fclose
as reported by Coverity with CID 1287029
2015-03-09 11:17:59 +09:00
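The orte-clean fix above pairs a stream opened with popen() with pclose(). A minimal sketch of that pairing follows; it is an illustration, not the orte-clean code, and the helper name `first_line_of` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: run `cmd` through the shell and copy the first
 * line of its output (without the newline) into `buf`. Returns the
 * status reported by pclose(), or -1 if the command could not be
 * started or the arguments are unusable. */
int first_line_of(const char *cmd, char *buf, size_t buflen)
{
    FILE *fp;

    if (NULL == cmd || NULL == buf || 0 == buflen) {
        return -1;
    }
    buf[0] = '\0';

    fp = popen(cmd, "r");
    if (NULL == fp) {
        return -1;
    }
    if (NULL != fgets(buf, (int) buflen, fp)) {
        buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing newline */
    }

    /* POSIX pairs popen() with pclose(), which waits for the child
     * process and returns its termination status. */
    return pclose(fp);
}
```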
Elena
6c6fe75c7b added one more time interval for barrier to pmix unit test 2015-03-06 10:33:14 +02:00
Ralph Castain
64ec498a20 Add a declspec 2015-03-05 19:48:27 -08:00
Ralph Castain
eaa666bd57 Instantiate debug output variable 2015-03-05 12:25:49 -08:00
Ralph Castain
7ce0a9931c Updates to the notifier interfaces to support system events 2015-03-05 10:39:25 -08:00
Gilles Gouaillardet
7de3f35b90 plm/rsh: fix misc memory leaks
as reported by Coverity with CIDs 71091, 71230, 71231, 72274, 72389,
1196718 and 1196719
2015-03-05 20:03:37 +09:00
Gilles Gouaillardet
33352e9506 schizo: fix misc memory leak
as reported by Coverity with CID 1196722
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
89806c6261 orte/util: fix memory leaks
as reported by Coverity with CIDs 70845, 71855, 710652,
1196738, 1196739, 1196757, 1196758, 1269863 and 1269883
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
4e7b5240e4 orte/tools: fix misc memory leaks
as reported by Coverity with CIDs 70700, 71039, 71854, 72384 and 710651
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
d1b2f043ff fix misc memory leaks
as already reported by Coverity with CIDs
71818, 71819, 72250, 715767, 1196749 and 1274002
2015-03-05 13:58:05 +09:00
Gilles Gouaillardet
42f5a36ee3 rmaps/seq: fix misc memory leaks
as reported by Coverity with CIDs 1269886 and 1269887
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
0c7a2846d1 rmaps/rank_file: fix misc memory leaks
as reported by Coverity with CIDs 72250 and 1196774
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
c15b919635 rmaps/lama: fix misc memory leaks
as reported by Coverity with CIDs 719263, 719264, 1196712 and 1269842
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
456baeb71b rmaps/base: fix misc memory leaks
as reported by Coverity with CIDs 1196751, 1196754, 1196755 and 1269866
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
d8f3b378b3 orte/oob: fix misc memory leaks
as reported by Coverity with CIDs 1196748, 1196749 and 1269895
2015-03-02 15:31:11 +09:00
Jeff Squyres
336626dafe spelling: trivial spelling fix
s/interupted/interrupted/gi
2015-02-27 18:30:43 -08:00
Gilles Gouaillardet
ab78c7f54a orted/pmix: fix misc resource leak
as reported by Coverity with CID 1269844
2015-02-27 19:25:55 +09:00
Mike Dubman
dbc15009b6 Merge pull request #415 from alinask/topic/fix_fork_support_flow
Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
2015-02-26 21:50:11 +02:00
Nathan Hjelm
883d09376f Fix coverity #1271536 2015-02-25 11:35:45 -07:00
rhc54
efbb57430b Merge pull request #419 from nkogteva/master
grpcomm brcks: fix copy-paste bug which affects performance
2015-02-25 07:39:55 -08:00
Alina Sklarevich
e4c4e7df5e Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)

This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.

Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and therefore it failed).
2015-02-25 10:58:50 +02:00
Jeff Squyres
a85a392896 Merge pull request #422 from jsquyres/topic/coverity-fixes
Some Coverity fixes
2015-02-24 17:00:10 -05:00
Jeff Squyres
05f00aface plm base: ensure mca_base_var_get_value() and mca_base_var_find() succeed
This was CID 993712
2015-02-24 15:48:50 -05:00
Ralph Castain
451bd16a10 Remove dead code 2015-02-24 12:41:12 -08:00
Jeff Squyres
4f54fedf05 orterun: ensure to set used_num_procs=true after finding that token
This was CID 71687.
2015-02-24 15:25:39 -05:00
Jeff Squyres
398ae15533 rmaps_base_frame: remove dead code
This was CID 1196641
2015-02-24 15:24:11 -05:00
Jeff Squyres
71ae0ad5ec oob_tcp_component: add #if OPAL_ENABLE_IPV6 around IPv6-specific code
This was CID 1196629
2015-02-24 15:24:11 -05:00
Jeff Squyres
0bd2783b91 oob_usock: don't try to close the socket if it didn't open
This was CID 1196663
2015-02-24 15:24:09 -05:00
Jeff Squyres
e2223cd9bf plm_rsh: ensure cwd array is \0-terminated
This was CID 72257
2015-02-24 15:24:08 -05:00
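The plm_rsh fix above concerns a fixed-size character array that might not end in a terminator. A minimal sketch of the general pattern follows, assuming the array is filled with strncpy(); the actual plm_rsh code may fill its cwd array differently, and the helper name `copy_terminated` is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: copy `src` into the fixed-size array `dst` of
 * `dstlen` bytes, truncating if needed and always terminating. */
void copy_terminated(char *dst, size_t dstlen, const char *src)
{
    if (NULL == dst || NULL == src || 0 == dstlen) {
        return;
    }
    /* strncpy() adds no terminator when src has dstlen - 1 or more
     * characters, so the last byte is set explicitly. */
    strncpy(dst, src, dstlen - 1);
    dst[dstlen - 1] = '\0';
}
```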
Ralph Castain
332e4fa7aa Minor fix - relative host name syntax cannot support usernames as you can't know which hosts will be selected 2015-02-24 12:15:28 -08:00
Nathan Hjelm
ed78553512 Update opal_free_list_t usage to reflect new class interface.
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:

OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st

I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
2015-02-24 10:05:44 -07:00
Nadezhda Kogteva
c4d6ca6468 grpcomm brcks: fix copy-paste bug which affects performance 2015-02-24 17:06:39 +02:00
Jeff Squyres
226a814c9d grpcomm_brks: fix minor compiler warning (rc used before set)
Also check for OBJ_NEW returning NULL.
2015-02-23 09:04:45 -08:00
Jeff Squyres
600858609e grpcomm_rcd: fix minor compiler warning (rc used before set)
Also check for OBJ_NEW returning NULL.
2015-02-23 09:03:07 -08:00
Howard Pritchard
bf89131f9e add owner files to opal/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Jeff Squyres
15be948d79 wrappers: *_EXTRA_INCLUDES does not exist any more
There were a few places where *_EXTRA_INCLUDES (and derivatives) were
still being used.  This commit removes all of them.
2015-02-20 08:43:25 -08:00
Jeff Squyres
9b716d946e wrappers: fix errant @{libdir} reference in pkg-config files
The RPATH support added a @{libdir} token into
<package>_WRAPPER_EXTRA_LDFLAGS.  However, these flags are also
substituted into the pkg-config data files, and they don't understand
the @{foo} notation.  So convert @{libdir} into ${libdir}, which
pkg-config *does* understand.

Thanks to Christoph Junghans (@junghans) for notifying us of the issue.

Fixes #406.
2015-02-20 08:43:19 -08:00
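The substitution described in the pkg-config commit above can be shown on a plain string. The sketch below only illustrates the token rewrite; the real fix presumably lives in the build system rather than in C, and the helper name `at_brace_to_dollar_brace` is hypothetical. It rewrites every `@{` prefix, not only `@{libdir}`, which is an assumption made for simplicity.

```c
#include <string.h>

/* Hypothetical helper: rewrite every "@{" as "${" in place, so that
 * "@{libdir}" becomes "${libdir}", the form pkg-config understands. */
void at_brace_to_dollar_brace(char *flags)
{
    char *p = flags;

    if (NULL == flags) {
        return;
    }
    while (NULL != (p = strstr(p, "@{"))) {
        *p = '$';   /* same length, so no buffer growth is needed */
        p += 2;
    }
}
```

Because both tokens have the same length, the rewrite can be done in place. A lone `@` that is not followed by `{` is left untouched.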
Jeff Squyres
ec62766a71 notifier base: remove unused variables 2015-02-20 07:06:13 -08:00
Elena
48eae25b8f fixed issue with grpcomm rcd and brks algorithms which led to performance issues: data just for part of processes was unpacked and stored locally during fence, therefore clients were forced to ask daemons for data directly during get request 2015-02-20 16:41:25 +02:00
Ralph Castain
f7c28ea706 Fix bad test - opal_buffer and opal_ptr can support NULL locations 2015-02-17 21:46:23 -08:00
Ralph Castain
852fbca020 Shut coverity up 2015-02-17 21:17:23 -08:00
Ralph Castain
c1282d5b99 The opal_buffer type also generates its own alloc, so need to let it pass thru the check 2015-02-17 21:06:19 -08:00
Ralph Castain
207cc74f87 Correct name of help file 2015-02-17 16:03:20 -08:00
Ralph Castain
624b16e070 Protect the unload attribute function 2015-02-17 14:21:23 -08:00
Ralph Castain
78245e8a33 Continue massaging of the notifier framework. Convert it to an event-driven interface. Add the ability to report job state if requested. Cleanup object declarations. 2015-02-17 12:51:11 -08:00
Gilles Gouaillardet
8dc4f30fae orte/tools: fix NULL pointer dereference
as reported by Coverity with CIDs 1196671 and 1196824
2015-02-17 15:45:06 +09:00
Gilles Gouaillardet
b762766969 orte/util: fix misc memory leaks
as reported by Coverity with CIDs 70314, 710653-710657 and 1196741-1196744
2015-02-17 12:27:23 +09:00
Ralph Castain
22f1d29b82 Re-introduce the ORTE notifier framework for logging errors that would otherwise result in abort for persistent systems. Thanks to L. Rajeshnarayanan of Intel for the contribution
Subsequent commits will integrate this capability with the state and errmgr frameworks.
2015-02-16 12:46:58 -08:00
Gilles Gouaillardet
8fe8079080 Fix a build failure when configure'd with --without-hwloc
see http://mtt.open-mpi.org/index.php?do_redir=2235
2015-02-16 10:31:09 +09:00
Jeff Squyres
3ac1d0dae5 *-info: add "lt_dladvise support" lines 2015-02-11 12:25:20 -08:00
Ralph Castain
2a83d2613a Cleanup the orte/test/system directory 2015-02-11 10:42:38 -08:00