Gilles Gouaillardet
dd28b1f680
orted/dfs: fix misc memory leaks
...
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709 and 1269849
2015-05-20 13:09:46 +09:00
Ralph Castain
d3d3e73099
Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket
2015-05-15 07:13:42 -06:00
Ralph Castain
0a345d34e6
Plug the memory leak identified by George
2015-05-14 21:33:48 -06:00
Howard Pritchard
578430c36d
oob/alps: remove comment with personal reference
...
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-14 20:06:21 -07:00
Ralph Castain
8e30579e6e
The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac
2015-05-14 18:09:13 -06:00
Nadezhda Kogteva
d9dcf8352e
oob ud: fixed a bug that prevented it from working with the QoS framework (oob_stress_channel test)
2015-05-13 11:40:01 +03:00
Jeff Squyres
8e8d104520
oob ud: ibv_get_device_list()==NULL can mean no devices present
...
...which is not an error. Don't complain about it.
2015-05-12 10:54:39 -07:00
Jeff Squyres
8f941a6613
oob ud: better error msgs, tolerate systems without UD devices
...
It is perfectly ok to be on a system without UD devices.
Also, make some of the error messages better -- so that the user has a
clue about where the error messages are coming from, and what they
should do.
2015-05-11 13:11:51 -07:00
Mike Dubman
894ba28390
Merge pull request #559 from nkogteva/oob_ud
...
oob ud: made component more user adaptive; opal outputs were replaced by...
2015-05-11 21:09:28 +03:00
Ralph Castain
3cee4152fc
Fix the intercommunicator issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to avoid race conditions. Cleanup listener sockets so we don't leak them
2015-05-11 09:16:25 -07:00
Howard Pritchard
3382d3ce61
ess/alps: remove unnecessary vpid calc
...
There was a redundant computation of the vpid
for orted's happening in ess/alps rte_init
method. Keep the more efficient alps based
method.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-09 20:07:38 -07:00
Ralph Castain
b5382c9bf9
Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel.
...
A few other minor cleanups included.
2015-05-08 11:15:21 -07:00
Ralph Castain
6e95bcd583
Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a typo in coll_sm that prevented that component from registering its MCA params!
2015-05-07 21:05:08 -07:00
Gilles Gouaillardet
a80fda25d8
orte: rename the global variable component_map into orte_component_map
...
Thanks @goodell for pointing this out!
2015-05-08 10:11:59 +09:00
Gilles Gouaillardet
2e384a3b65
initialize common symbols from orte
...
A few uninitialized common symbols are remaining (generated by flex) :
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
* orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
* orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
9cb2fcfa5c
Cleanup the qos code when --enable-timings is given
2015-05-06 20:24:27 -07:00
Ralph Castain
01a9bdf4cf
Cleanup of ud/oob component
2015-05-06 19:48:42 -07:00
Ralph Castain
1f8de276de
Consolidate all the QOS changes into one clean commit
2015-05-06 19:48:42 -07:00
Ralph Castain
8e3f0b1d33
Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support
2015-05-06 15:18:15 -07:00
Ralph Castain
0bb73645f0
Silence Coverity warning
2015-04-30 20:49:28 -07:00
Ralph Castain
7d1980ba83
Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param.
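
The --host forms described above ("foo", "foo:3", "foo:*") could be parsed roughly as in the sketch below. This is not the ORTE implementation: parse_host_entry() and SLOTS_FROM_TOPOLOGY are hypothetical names, and the sketch assumes plain host names (an IPv6 literal containing colons would need extra handling).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical marker: the slot count is left to the topology ("foo:*"). */
#define SLOTS_FROM_TOPOLOGY (-1)

/*
 * Split one --host entry into a host name and a slot count.
 * Returns 0 on success, -1 on a malformed entry.
 */
static int parse_host_entry(const char *entry, char *name, size_t name_len,
                            int *slots)
{
    const char *colon = strchr(entry, ':');
    size_t host_len = (NULL == colon) ? strlen(entry) : (size_t)(colon - entry);

    if (0 == host_len || host_len >= name_len) {
        return -1;
    }
    memcpy(name, entry, host_len);
    name[host_len] = '\0';

    if (NULL == colon) {
        *slots = 1;                       /* bare name => one slot */
    } else if (0 == strcmp(colon + 1, "*")) {
        *slots = SLOTS_FROM_TOPOLOGY;     /* "foo:*" => decided later */
    } else {
        char *end = NULL;
        long n = strtol(colon + 1, &end, 10);
        if (end == colon + 1 || '\0' != *end || n <= 0) {
            return -1;                    /* empty, non-numeric, or <= 0 */
        }
        *slots = (int)n;                  /* "foo:3" => three slots */
    }
    return 0;
}
```

Repeating a bare name on the command line would then simply add one slot per parsed entry.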
2015-04-30 20:35:23 -07:00
Ralph Castain
e26e7ad736
Better support automated tests for map, rank, and bind options
2015-04-30 14:01:13 -07:00
Ralph Castain
7d4f9970d8
Minor cleanup
2015-04-29 17:49:35 -07:00
Nadezhda Kogteva
01ce58391e
oob ud: made component more user adaptive; opal outputs were replaced by help messages.
2015-04-28 15:36:32 +03:00
Jeff Squyres
8fbf34b196
oob ud: put call to ibv_fork_init() before *all* ibv calls
...
Move the call to opal_common_verbs_fork_test() up before the call
to ibv_get_device_list() (just curious -- why not use
opal_ibv_get_device_list()?). This ensures that the call to
ibv_fork_init() is before *all* other ibv_* calls.
2015-04-24 14:19:06 -07:00
Ralph Castain
9104e81958
When --map-by node, we should be unbound. Also remove dead code due to copy/paste error.
2015-04-23 20:35:54 -07:00
Ralph Castain
5003be5c5c
If the user specifies a --map-by <foo> option, then default to bind-to <foo> unless they specify a bind-to option. If they map-by slot/node, then use the default policy based on num_procs.
2015-04-23 13:30:21 -07:00
Ralph Castain
d5e4fd059f
Ensure the binding and locale strings are always defined
2015-04-23 07:43:37 -07:00
Ralph Castain
cb7330a543
Get the output to lineup properly
2015-04-23 07:38:51 -07:00
Jeff Squyres
79243aca4e
display-devel-map: minor output tweak
...
hwloc output can get fairly long, especially on machines with lots of
cores and/or hyperthreads. So put the Locale and Binding output on
separate lines.
2015-04-23 06:14:57 -07:00
Ralph Castain
58e646ccfd
Reduce confusion by having the devel-map display in the same format as report-bindings
2015-04-23 04:30:00 -07:00
Ralph Castain
43229d056e
Protect one more place from a NULL object
2015-04-20 18:45:57 -07:00
Jeff Squyres
11e8c2096b
plm rsh: assign some levels to the rsh PLM MCA params
2015-04-20 16:18:57 -07:00
Nathan Hjelm
359a282e7d
ess/singleton: MCA variable synonyms can not currently have NULL for both framework and component
...
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-20 16:50:52 -06:00
Ralph Castain
e8387fcf88
Protect tools that can never run in distributed mode from getting confused by PMI.
2015-04-20 15:42:57 -07:00
Nathan Hjelm
45e053dbce
orte: use C99 subobject naming for component initialization
...
This commit helps future-proof orte components by initializing each
component member by name.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
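
For readers unfamiliar with the C99 feature named above, a minimal sketch of initializing a component member by member follows. The struct example_component_t and its fields are hypothetical and do not reflect the real ORTE/MCA component layout.

```c
#include <stddef.h>

/* Hypothetical component descriptor; not the real ORTE/MCA layout. */
typedef struct {
    const char *name;
    int major_version;
    int minor_version;
    int (*open)(void);
    int (*close)(void);
} example_component_t;

static int example_open(void)  { return 0; }
static int example_close(void) { return 0; }

/* Each member is set by name, so adding or reordering members in the
 * struct does not silently shift values into the wrong field. Members
 * that are not listed (minor_version here) are zero-initialized. */
static const example_component_t example_component = {
    .name          = "example",
    .major_version = 2,
    .open          = example_open,
    .close         = example_close,
};
```

With positional initialization, inserting a new member in the middle of the struct would require touching every component; with named members, existing initializers keep compiling and keep their meaning.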
2015-04-18 10:29:58 -06:00
Ralph Castain
34b53ac3dc
Silence Coverity warnings
2015-04-18 07:48:22 -07:00
Ralph Castain
12bfb27161
Redo in cleaner form: Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command
2015-04-17 16:11:37 -07:00
Nadezhda Kogteva - nadezhda.kogteva@itseez.com
c2678b0cc9
oob ud: fixes and parameter adjustment
2015-04-17 16:22:43 +03:00
Nathan Hjelm
3436f2917d
Merge pull request #449 from hjelmn/mca_base_update
...
mca/base update
2015-04-16 08:41:48 -06:00
Ralph Castain
d9c555b547
Revert "Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command"
...
This reverts commit open-mpi/ompi@278324c52a .
Revert "Add the ability to pass args to the rsh/ssh command line"
This reverts commit open-mpi/ompi@6f227f8564 .
2015-04-16 08:03:14 -06:00
rhc54
79b9c50717
Merge pull request #535 from rhc54/topic/rsh
...
Add the ability to pass args to the rsh/ssh command line
2015-04-15 21:11:46 -06:00
Ralph Castain
278324c52a
Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command
2015-04-15 20:30:04 -06:00
Ralph Castain
0e23f76eee
Fix comment
2015-04-15 20:09:14 -06:00
Ralph Castain
6f227f8564
Add the ability to pass args to the rsh/ssh command line
2015-04-15 20:07:13 -06:00
Howard Pritchard
283ef4c05d
oob/config: if --with-verbs=no, no ud
...
The oob/ud configure was not honoring the case
if the ompi is configured with --with-verbs=no.
This fixes that problem.
Fixes #522
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-04-14 06:31:18 -07:00
Nathan Hjelm
113c890ccf
Merge pull request #520 from hjelmn/valgrind_cleanness
...
fix memory leaks and valgrind errors
2015-04-13 10:09:34 -06:00
Ralph Castain
9c6d452d6b
If we are using HT cpus and have <= 2 procs, then map-by hwthread by default
2015-04-11 21:18:05 -07:00
Ralph Castain
cd686057f6
If the HNP is on a coprocessor, record it so we don't get an error log later
2015-04-11 15:30:15 -07:00
Nathan Hjelm
a7b0c00ab6
fix memory leaks and valgrind errors
...
This commit fixes several valgrind errors. Included:
- installdirs did not correctly reinitialize all pointers to NULL
at close. This causes valgrind errors on a subsequent call to
opal_init_tool.
- several opal strings were leaked by opal_deregister_params which
was setting them to NULL instead of letting them be freed by the
MCA variable system.
- move opal_net_init to AFTER the variable system is initialized and
opal's MCA variables have been registered. opal_net_init uses a
variable registered by opal_register_params!
- do not leak ompi_mpi_main_thread when it is allocated by
MPI_T_init_thread.
- do not overwrite ompi_mpi_main_thread if it is already set (by
MPI_T_init_thread).
- mca_base_var: read_files was overwriting mca_base_var_file_list
even if it was non-NULL.
- mca_base_var: set all file global variables to initial states on
finalize.
- btl/vader: decrement enumerator reference count to ensure that it
is freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-11 09:28:35 -06:00
Ralph Castain
91e1cbf284
Init variable
2015-04-11 07:44:57 -07:00
Ralph Castain
033418f62a
Correct a typo that reversed the default binding pattern. Ensure we default bind to hwthread if user specified --use-hwthread-cpus if nprocs <= 2, and bind to hwthread if told to do so.
2015-04-10 15:58:35 -07:00
Ralph Castain
3e44d3c9e3
Enable singletons to run without any active OOB module until they attempt to comm_spawn
2015-04-10 14:06:42 -07:00
Ralph Castain
e4f6f83b9d
Attempt to silence new Coverity complaint by ensuring the string read from file is NULL terminated.
2015-04-10 07:54:37 -07:00
Ralph Castain
396700ad8b
Protect the notifier macros against NULL job objects
2015-04-09 16:04:43 -07:00
Nathan Hjelm
c416c423bb
ess/singleton: do not put component strings into the environment
...
putenv requires that any string put into the environment is not
changed or freed. That is not the case with constant strings as they
will go away when dlclose is called on the component. Instead, just
use opal_setenv which does not have this restriction.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
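
The difference described above can be illustrated with POSIX setenv(), used here as a stand-in for opal_setenv (whose signature is not shown in this log). The helper names and the variable EXAMPLE_VAR are hypothetical. setenv() copies the name and value, so the caller's storage may change or disappear afterwards; putenv() instead keeps a pointer to the caller's string.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdlib.h>
#include <string.h>

/*
 * Export name=value using setenv(), which copies its arguments.
 * Returns 0 on success, -1 on failure (as setenv does).
 */
static int export_copy(const char *name, const char *value)
{
    return setenv(name, value, 1 /* overwrite */);
}

/* Simulates storage that later goes away (e.g. an unloaded component). */
static int demo_export_from_temporary(void)
{
    char temp[] = "from_component";
    int rc = export_copy("EXAMPLE_VAR", temp);
    memset(temp, 'X', sizeof(temp) - 1);   /* clobber the source buffer */
    return rc;
}
```

After demo_export_from_temporary() returns, getenv("EXAMPLE_VAR") still yields the original value because the environment holds its own copy.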
2015-04-09 11:00:47 -06:00
Nathan Hjelm
9cd955badf
opal: fix multiple bugs in MCA and opal
...
This commit fixes the following bugs:
- opal_output_finalize did not properly set internal state. This
caused problems when calling the sequence opal_output_init (),
opal_output_finalize (), opal_output_init ().
- opal_info support called mca_base_open () but never called the
matching mca_base_close (). mca_base_open () and mca_base_close ()
have been updated to use an open count instead of an open flag to
allow mca_base_open to be called through multiple paths (as may be
the case when MPI_T is in use).
- orte_info support did not register opal variables. This can cause
orte-info to not return opal variables.
- opal_info, orte_info, and ompi_info support have been updated to
use a register count.
- When opening the dl framework the reference count was added to
ensure the framework stuck around. The framework being closed
prematurely was a bug in the MCA base that has since been
corrected. The increment (and associated decrement) have been
removed.
- dl/dlopen did not set the value of
mca_dl_dlopen_component.filename_suffixes_mca_storage on each call
to register. Instead the value was set in the component
structure. This caused the value to be lost when re-loading the
component. Fixed by setting the default value in register.
- Reset shmem framework state on close to avoid returning a stale
component after reloading opal/shmem.
- MCA base parameters were not properly deregistered when the MCA
base was closed.
This commit may fix #374 .
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
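
The open-count idea mentioned for mca_base_open () / mca_base_close () can be sketched as below. This is an illustration only, assuming a subsystem with a single global state; the example_* names are hypothetical and the real MCA base code is not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical subsystem state; stands in for a framework's globals. */
static int  example_open_count = 0;
static bool example_initialized = false;

static int example_base_open(void)
{
    if (example_open_count++ > 0) {
        return 0;                /* already open via another path */
    }
    example_initialized = true;  /* first opener does the real setup */
    return 0;
}

static int example_base_close(void)
{
    if (example_open_count <= 0) {
        return -1;               /* close without a matching open */
    }
    if (--example_open_count > 0) {
        return 0;                /* other users remain */
    }
    example_initialized = false; /* last closer tears down */
    return 0;
}
```

With a plain flag, the first close would tear the subsystem down even though a second caller still depends on it; the count defers teardown to the last matching close.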
2015-04-07 19:13:20 -06:00
Ralph Castain
0c043dbdc9
Fix typo in var name
2015-04-02 02:32:42 -07:00
Ralph Castain
a4b466efc4
Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts.
2015-04-01 20:21:23 -07:00
Ralph Castain
9f8ae59162
Properly enclose the different && clauses
2015-04-01 18:48:25 -07:00
Ralph Castain
57c21d5209
Ensure the DVM flows thru the "daemons reported" state
2015-04-01 16:47:34 -07:00
Jeff Squyres
99754afd25
orterun.c: re-justify the output message text
...
The type-A personality / English lit major in me compels me to
re-justify the text. :-)
2015-04-01 10:57:23 -07:00
Mike Dubman
8914a9c070
Merge pull request #494 from elenash/modifiers
...
changed mindist mapping policy specifier
2015-04-01 16:31:46 +03:00
Elena
1e913c76c4
changed mindist mapping policy specifier from --map-by dist:device,modifiers to --map-by dist:modifiers -mca rmaps_dist_device device
2015-04-01 15:07:35 +03:00
Nadezhda Kogteva
2d49d9bd45
grpcomm rcd: remove unnecessary malloc warning for case when number of daemons == 1
2015-04-01 11:07:44 +03:00
Mike Dubman
58d002098b
Merge pull request #474 from elenash/master
...
Introduce -tune command line option to set env vars and mca params from ...
2015-04-01 08:23:34 +03:00
Ralph Castain
b468f6a503
Okay, Jeff - use opal_setenv
2015-03-31 20:34:02 -07:00
Ralph Castain
6f9140a341
Add a little more debug to launch
2015-03-31 20:10:21 -07:00
Ralph Castain
e5d96417e7
Update warnings for run-as-root
2015-03-31 17:55:28 -07:00
Ralph Castain
41dd65d6cd
Per Jeff's request, tone down the comments and "standardize" the warning
2015-03-31 17:54:54 -07:00
Ralph Castain
f04eb6a9c0
Extend the root-user protection to some more ORTE tools
2015-03-31 10:34:35 -07:00
Ralph Castain
f863147b05
Per the telecon and chat with Jeff, let root only do the version option without warning. Otherwise, require that the user specifically indicate allow-use-as-root
2015-03-31 10:34:35 -07:00
Ralph Castain
b209c9efa5
Move the "dvm ready" message to stdout so it is easier to trap
2015-03-30 20:12:56 -07:00
Ralph Castain
6d205a3c80
Ensure that singletons pickup the oob/tcp component
2015-03-30 18:10:08 -07:00
Ralph Castain
2fa56fb329
Ensure that orte-submit picks the correct ess module as it is -never- allowed to be used as a distributed tool
...
Thanks to Mark Santcroos for diagnosing this one.
2015-03-30 18:08:34 -07:00
rhc54
bc016617a0
Merge pull request #501 from rhc54/topic/sec2
...
Support authentication across security domains
2015-03-30 09:59:43 -07:00
Nadezhda Kogteva
a828eada98
sm dstore: set pmix segment size to proper value
2015-03-30 13:34:25 +03:00
Ralph Castain
d07dc362d5
Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common.
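
The selection rule described above (sender includes every credential it has, receiver picks the highest-priority one both sides share) might look like the following sketch. The function name and the method strings used with it are hypothetical; the actual sec framework interfaces are not shown in this log.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the first method in the receiver's priority-ordered list that the
 * sender also offered, or NULL if the two sides share none.
 */
static const char *select_credential(const char *const offered[],
                                     size_t n_offered,
                                     const char *const supported[],
                                     size_t n_supported)
{
    for (size_t i = 0; i < n_supported; i++) {   /* highest priority first */
        for (size_t j = 0; j < n_offered; j++) {
            if (0 == strcmp(supported[i], offered[j])) {
                return supported[i];
            }
        }
    }
    return NULL;
}
```

A NULL result corresponds to the case where the two security domains have no method in common and authentication cannot proceed.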
2015-03-28 20:34:26 -07:00
Ralph Castain
b67b3619fc
If we are using the default bindings, and one or more nodes are not setup to support binding, then don't error out - just don't bind.
...
Thanks to Annu Desari for pointing out the problem.
2015-03-28 08:20:24 -07:00
Ralph Castain
2f365720b0
Allow root to request the version and help from mpirun without having to override the run-as-root protection.
...
Thanks to Robert McLay for pointing this out
2015-03-28 08:17:44 -07:00
Ralph Castain
d2d02a1642
ckpt
2015-03-28 07:59:20 -07:00
Nathan Hjelm
b68d66bb9b
MCA: Add the project/project version to the MCA base component
...
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.
All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Elena
90f5b2bb84
Introduce -tune command line option to set env vars and mca params from file
2015-03-26 18:33:53 +02:00
rhc54
2ff7575dde
Merge pull request #497 from rhc54/topic/sec
...
Allow for different security domains.
2015-03-25 21:01:29 -07:00
Ralph Castain
6aa33deafb
Remove debug
2015-03-25 19:58:51 -07:00
Ralph Castain
10cf455080
Tools need to use the TCP OOB component
2015-03-25 19:56:49 -07:00
Ralph Castain
1b24536941
Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail.
2015-03-25 13:22:01 -07:00
Ralph Castain
6ba76ed8d8
Per user request, we allow -host to specify a host that is not included in a hostfile (however, we reject it if we were given an allocation by a resource manager). Since we cannot know if an IP addr form references the same node that was previously given as a string name, we have no choice but to assume they are different. Get the topology from the right place in that situation so mpirun can succeed.
2015-03-25 06:16:01 -07:00
rhc54
df24816d64
Merge pull request #488 from lrrajesh/master
...
Notification msg add severity to the message header.
2015-03-20 09:45:46 -07:00
Ralph Castain
095a8fa684
We don't need to know about non-fatal errors from setting socket options
2015-03-20 07:16:31 -07:00
Ralph Castain
a013f3059f
For scalability reasons, and to make life easier for the poor Cray-ites, don't bang on the system for the username - we'll just use the uid.
2015-03-19 21:24:13 -07:00
Howard Pritchard
990e9b47e0
Merge pull request #486 from hppritcha/topic/issue_484
...
orte/oob: implement alps oob component
2015-03-19 19:40:40 -06:00
Ralph Castain
43a3baad5e
Ensure we use the first compute node's topology for mapping
...
Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes.
Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset.
Correctly count the number of available PUs under each object when given a cpuset
Fix the default binding settings, and correctly count PUs when no cpuset is given
Ensure the binding policy gets set in all cases
2015-03-19 16:30:36 -07:00
Howard Pritchard
6054975913
oob/alps: add configure file for alps oob
...
Have to have alps rpms installed on a system
for alps component to build, even if separated
by a level of indirection.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-03-19 15:38:14 -07:00
Howard Pritchard
b1f31a4364
orte/oob: implement alps oob component
...
Implement an almost-do-nothing alps oob component.
When using aprun to launch a job on Cray system,
there is no reason to need an oob system, since ompi
relies on Cray PMI for oob communication.
Fixes #484
2015-03-19 14:11:40 -07:00
lrrajesh
4dc75687e2
Notification msg add severity to the output
2015-03-18 13:55:03 -07:00
Nadezhda Kogteva
7c25b4cea6
grpcomm: fixed brks and rcd algorithms - added enough space for masks in order to get them working in the large scale.
2015-03-18 14:33:04 +02:00
Ralph Castain
50277fec76
Adjust MCA param
2015-03-17 19:46:31 -07:00
rhc54
b41d2ad6c4
Merge pull request #481 from rhc54/topic/slurm
...
Add new MCA parameter to support edge case with debugger at LLNL
2015-03-17 07:40:55 -07:00
Ralph Castain
b01e8c1063
Include the FQDN version and non-stripped version of the hostname in our list of aliases as these (plus localhost) are the most common aliases we see.
2015-03-17 06:26:26 -07:00
Ralph Castain
d7d8ae46ed
We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info.
2015-03-17 06:10:20 -07:00
Ralph Castain
3e32c360c7
Add new MCA parameter to support edge case with debugger at LLNL
2015-03-16 20:04:05 -07:00
Ralph Castain
a0487e014c
Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used.
2015-03-16 19:42:05 -07:00
Ralph Castain
5ae42c816e
Attempt to reduce the RARP traffic during definition of allocations
2015-03-16 16:26:40 -07:00
Ralph Castain
64d11f170a
Adjust the default keepalive interval. Refactor the code when setting keepalive options
2015-03-16 12:32:58 -07:00
Ralph Castain
4ded049cbc
Modify MCA param description
2015-03-16 11:57:32 -07:00
Ralph Castain
019bba5caf
Cleanup a bit - don't need to lookup the protocol number if we just use the right define
2015-03-16 11:54:51 -07:00
Ralph Castain
69ac25bf55
Add support for TCP keepalive on inter-node sockets
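
As a rough illustration of what enabling keepalive on a socket involves, the sketch below uses the standard SO_KEEPALIVE option plus the per-socket tuning options where the platform defines them. enable_keepalive() and its parameters are hypothetical and are not taken from the oob/tcp code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Enable TCP keepalive on a TCP socket.
 * idle_secs/intvl_secs/probes are illustrative tuning values; they are
 * applied only where the platform defines the corresponding option, and
 * failures to set them are treated as non-fatal.
 * Returns 0 on success, -1 if SO_KEEPALIVE itself could not be set.
 */
static int enable_keepalive(int sd, int idle_secs, int intvl_secs, int probes)
{
    int on = 1;

    (void)idle_secs;
    (void)intvl_secs;
    (void)probes;

    if (sd < 0) {
        return -1;               /* never operate on a negative socket */
    }
    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
#if defined(TCP_KEEPIDLE)
    (void)setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs));
#endif
#if defined(TCP_KEEPINTVL)
    (void)setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_secs, sizeof(intvl_secs));
#endif
#if defined(TCP_KEEPCNT)
    (void)setsockopt(sd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes));
#endif
    return 0;
}
```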
2015-03-16 09:59:44 -07:00
adrianreber
714d9aa67e
Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart
...
Topic/orte cr continue like restart
2015-03-12 14:54:02 +01:00
Nathan Hjelm
695dcd5a28
oob/ud: fix compiler warning
2015-03-11 10:53:32 -06:00
Adrian Reber
c08e234af7
FT: fix compilation using --with-ft (5/5)
...
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
With the changes introduced in the previous patches in this series
some goto constructs for cleanup are no longer necessary and removed.
2015-03-11 14:23:33 +01:00
Adrian Reber
8ba41a834a
FT: fix compilation using --with-ft (4/5)
...
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
This patch tries to handle the new xcast semantic.
2015-03-11 14:23:33 +01:00
Adrian Reber
1c5a8df724
FT: fix compilation using --with-ft (2/5)
...
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
The FT code used barrier mechanisms which have been removed
with aec5cd08bd. This patch replaces all those different barriers
with opal_pmix.fence(NULL, 0);
I am not sure this is completely correct but at least a starting
point for a review.
2015-03-11 14:23:33 +01:00
Adrian Reber
f45dd069bd
FT: fix compilation using --with-ft (1/5)
...
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Gilles Gouaillardet
a69d935d55
oob/tcp: fix misc issues
...
as reported by Coverity with CIDs 70726, 710564,
1196630, 1269805, 1269803, 1269932
2015-03-10 19:32:01 +09:00
Gilles Gouaillardet
dc0bc756dc
iof/base: fix misc memory leak
...
as reported by Coverity with CID 1196732
2015-03-10 14:37:53 +09:00
Jeff Squyres
a026456bef
(orte|ompi|oshmem)*info tools: convert to opal_dl interface
...
Note that this commit removes option:lt_dladvise from the various
"info" tools output. This technically breaks our CLI "ABI" because
we're not deprecating it / replacing it with an alias to some other
"info" tool output.
Although the dl/libltdl component contains an "have_lt_dladvise" MCA
var that contains the same information, the "option:lt_dladvise"
output from the various "info" tools is *not* an MCA var, and
therefore we can't alias it. So it just has to die.
2015-03-09 08:18:13 -07:00
Gilles Gouaillardet
59be12b260
filem/raw: fix misc memory leaks
...
as reported by Coverity with CIDs 716815, 716817, 720760,
1196703, 1196704, 1196746
2015-03-09 19:56:20 +09:00
Gilles Gouaillardet
2ab9a411f8
plm/base: fix misc memory leaks
...
as reported by Coverity with CIDs 1196733 and 1196745
2015-03-09 16:25:07 +09:00
Gilles Gouaillardet
fa10025843
ras/slurm: fix misc memory leaks
...
as reported by Coverity with CIDs 968580 and 1196723-1196727
2015-03-09 15:58:51 +09:00
Gilles Gouaillardet
eae39bd948
ras/simulator: fix misc memory leaks
...
as reported by Coverity with CIDs 710647, 714133 and 714134
2015-03-09 15:52:29 +09:00
Gilles Gouaillardet
4c0eb11e08
orterun: fix misc errors
...
as reported by Coverity with CIDs 70700, 71039, 710651
2015-03-09 11:57:18 +09:00
Gilles Gouaillardet
33841361c0
orte-clean: use pclose instead of fclose
...
as reported by Coverity with CID 1287029
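
The pairing rule behind this fix is that a stream obtained from popen() is closed with pclose(). A minimal sketch follows; first_line_of() is a hypothetical helper and not the orte-clean code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdio.h>

/*
 * Run a shell command, copy its first output line into buf, and return the
 * value of pclose() (the child's wait status), or -1 if popen() failed.
 */
static int first_line_of(const char *cmd, char *buf, size_t len)
{
    FILE *fp = popen(cmd, "r");
    if (NULL == fp) {
        return -1;
    }
    if (len > 0 && NULL == fgets(buf, (int)len, fp)) {
        buf[0] = '\0';           /* command produced no output */
    }
    /* pclose() is the matching call for popen(): it closes the stream
     * and waits for the child. fclose() is not the matching call and is
     * not guaranteed to wait for the child. */
    return pclose(fp);
}
```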
2015-03-09 11:17:59 +09:00
Elena
6c6fe75c7b
added one more time interval for barrier to pmix unit test
2015-03-06 10:33:14 +02:00
Ralph Castain
64ec498a20
Add a declspec
2015-03-05 19:48:27 -08:00
Ralph Castain
eaa666bd57
Instantiate debug output variable
2015-03-05 12:25:49 -08:00
Ralph Castain
7ce0a9931c
Updates to the notifier interfaces to support system events
2015-03-05 10:39:25 -08:00
Gilles Gouaillardet
7de3f35b90
plm/rsh: fix misc memory leaks
...
as reported by Coverity with CIDs 71091, 71230, 71231, 72274, 72389,
1196718 and 1196719
2015-03-05 20:03:37 +09:00
Gilles Gouaillardet
33352e9506
schizo: fix misc memory leak
...
as reported by Coverity with CID 1196722
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
89806c6261
orte/util: fix memory leaks
...
as reported by Coverity with CIDs 70845, 71855, 710652,
1196738, 1196739, 1196757, 1196758, 1269863 and 1269883
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
4e7b5240e4
orte/tools: fix misc memory leaks
...
as reported by Coverity with CIDs 70700, 71039, 71854, 72384 and 710651
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
d1b2f043ff
fix misc memory leaks
...
as already reported by Coverity with CIDs
71818, 71819, 72250, 715767, 1196749 and 1274002
2015-03-05 13:58:05 +09:00
Gilles Gouaillardet
42f5a36ee3
rmaps/seq: fix misc memory leaks
...
as reported by Coverity with CIDs 1269886 and 1269887
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
0c7a2846d1
rmaps/rank_file: fix misc memory leaks
...
as reported by Coverity with CIDs 72250 and 1196774
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
c15b919635
rmaps/lama: fix misc memory leaks
...
as reported by Coverity with CIDs 719263, 719264, 1196712 and 1269842
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
456baeb71b
rmaps/base: fix misc memory leaks
...
as reported by Coverity with CIDs 1196751, 1196754, 1196755 and 1269866
2015-03-02 15:31:11 +09:00
Gilles Gouaillardet
d8f3b378b3
orte/oob: fix misc memory leaks
...
as reported by Coverity with CIDs 1196748, 1196749 and 1269895
2015-03-02 15:31:11 +09:00
Jeff Squyres
336626dafe
spelling: trivial spelling fix
...
s/interupted/interrupted/gi
2015-02-27 18:30:43 -08:00
Gilles Gouaillardet
ab78c7f54a
orted/pmix: fix misc resource leak
...
as reported by Coverity with CID 1269844
2015-02-27 19:25:55 +09:00
Mike Dubman
dbc15009b6
Merge pull request #415 from alinask/topic/fix_fork_support_flow
...
Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
2015-02-26 21:50:11 +02:00
Nathan Hjelm
883d09376f
Fix coverity #1271536
2015-02-25 11:35:45 -07:00
rhc54
efbb57430b
Merge pull request #419 from nkogteva/master
...
grpcomm brks: fix copy-paste bug which affects performance
2015-02-25 07:39:55 -08:00
Alina Sklarevich
e4c4e7df5e
Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
...
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)
This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.
Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and therefore it failed).
2015-02-25 10:58:50 +02:00
Jeff Squyres
a85a392896
Merge pull request #422 from jsquyres/topic/coverity-fixes
...
Some Coverity fixes
2015-02-24 17:00:10 -05:00
Jeff Squyres
05f00aface
plm base: ensure mca_base_var_get_value() and mca_base_var_find() succeed
...
This was CID 993712
2015-02-24 15:48:50 -05:00
Ralph Castain
451bd16a10
Remove dead code
2015-02-24 12:41:12 -08:00
Jeff Squyres
4f54fedf05
orterun: ensure to set used_num_procs=true after finding that token
...
This was CID 71687.
2015-02-24 15:25:39 -05:00
Jeff Squyres
398ae15533
rmaps_base_frame: remove dead code
...
This was CID 1196641
2015-02-24 15:24:11 -05:00
Jeff Squyres
71ae0ad5ec
oob_tcp_component: add #if OPAL_ENABLE_IPV6 around IPv6-specific code
...
This was CID 1196629
2015-02-24 15:24:11 -05:00
Jeff Squyres
0bd2783b91
oob_usock: don't try to close the socket if it didn't open
...
This was CID 1196663
2015-02-24 15:24:09 -05:00
Jeff Squyres
e2223cd9bf
plm_rsh: ensure cwd array is \0-terminated
...
This was CID 72257
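
The class of defect behind this kind of report can be sketched as follows, assuming the buffer is filled with strncpy() (the actual plm_rsh code is not shown in this log, and copy_terminated() is a hypothetical helper). strncpy() writes no trailing '\0' when the source fills the destination, so the terminator is set explicitly.

```c
#include <string.h>

/*
 * Copy src into a fixed-size buffer and guarantee termination.
 * At most dst_len - 1 bytes are copied; the last byte is always '\0'.
 */
static void copy_terminated(char *dst, size_t dst_len, const char *src)
{
    if (0 == dst_len) {
        return;                  /* no room even for the terminator */
    }
    strncpy(dst, src, dst_len - 1);
    dst[dst_len - 1] = '\0';
}
```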
2015-02-24 15:24:08 -05:00
Ralph Castain
332e4fa7aa
Minor fix - relative host name syntax cannot support usernames as you can't know which hosts will be selected
2015-02-24 12:15:28 -08:00
Nathan Hjelm
ed78553512
Update opal_free_list_t usage to reflect new class interface.
...
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:
OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st
I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
2015-02-24 10:05:44 -07:00
Nadezhda Kogteva
c4d6ca6468
grpcomm brks: fix copy-paste bug which affects performance
2015-02-24 17:06:39 +02:00
Jeff Squyres
226a814c9d
grpcomm_brks: fix minor compiler warning (rc used before set)
...
Also check for OBJ_NEW returning NULL.
2015-02-23 09:04:45 -08:00
Jeff Squyres
600858609e
grpcomm_rcd: fix minor compiler warning (rc used before set)
...
Also check for OBJ_NEW returning NULL.
2015-02-23 09:03:07 -08:00
Howard Pritchard
bf89131f9e
add owner files to opal/ompi/orte mca directories
...
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Jeff Squyres
15be948d79
wrappers: *_EXTRA_INCLUDES does not exist any more
...
There were a few places where *_EXTRA_INCLUDES (and derivatives) were
still being used. This commit removes all of them.
2015-02-20 08:43:25 -08:00
Jeff Squyres
9b716d946e
wrappers: fix errant @{libdir} reference in pkg-config files
...
The RPATH support added a @{libdir} token into
<package>_WRAPPER_EXTRA_LDFLAGS. However, these flags are also
substituted into the pkg-config data files, and they don't understand
the @{foo} notation. So convert @{libdir} into ${libdir}, which
pkg-config *does* understand.
Thanks to Christoph Junghans (@junghans) for notifying us of the issue.
Fixes #406.
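The token conversion described above can be sketched in C as follows. This is only an illustration of the idea: the helper name at_braces_to_dollar is hypothetical, and the actual fix was presumably made in the build system's substitution logic rather than in C code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, for illustration only: rewrite every "@{" in a
 * flags string to "${" in place, so that a pkg-config data file sees
 * ${libdir} instead of @{libdir}. Returns the number of tokens changed. */
static int at_braces_to_dollar(char *flags)
{
    int count = 0;
    char *p = flags;

    assert(NULL != flags);
    while (NULL != (p = strstr(p, "@{"))) {
        *p = '$';   /* "@{" becomes "${"; the rest of the token is unchanged */
        p += 2;     /* continue scanning after the rewritten prefix */
        ++count;
    }
    return count;
}
```

Because both prefixes have the same length, the rewrite can be done in place without reallocating the string.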
2015-02-20 08:43:19 -08:00
Jeff Squyres
ec62766a71
notifier base: remove unused variables
2015-02-20 07:06:13 -08:00
Elena
48eae25b8f
fixed issue with grpcomm rcd and brks algorithms which led to performance issues: data just for part of processes was unpacked and stored locally during fence, therefore clients were forced to ask daemons for data directly during get request
2015-02-20 16:41:25 +02:00
Ralph Castain
f7c28ea706
Fix bad test - opal_buffer and opal_ptr can support NULL locations
2015-02-17 21:46:23 -08:00
Ralph Castain
852fbca020
Shut coverity up
2015-02-17 21:17:23 -08:00
Ralph Castain
c1282d5b99
The opal_buffer type also generates its own alloc, so need to let it pass thru the check
2015-02-17 21:06:19 -08:00
Ralph Castain
207cc74f87
Correct name of help file
2015-02-17 16:03:20 -08:00
Ralph Castain
624b16e070
Protect the unload attribute function
2015-02-17 14:21:23 -08:00
Ralph Castain
78245e8a33
Continue massaging of the notifier framework. Convert it to an event-driven interface. Add the ability to report job state if requested. Cleanup object declarations.
2015-02-17 12:51:11 -08:00
Gilles Gouaillardet
8dc4f30fae
orte/tools: fix NULL pointer dereference
...
as reported by Coverity with CIDs 1196671 and 1196824
2015-02-17 15:45:06 +09:00
Gilles Gouaillardet
b762766969
orte/util: fix misc memory leaks
...
as reported by Coverity with CIDs 70314, 710653-710657 and 1196741-1196744
2015-02-17 12:27:23 +09:00
Ralph Castain
22f1d29b82
Re-introduce the ORTE notifier framework for logging errors that would otherwise result in abort for persistent systems. Thanks to L. Rajeshnarayanan of Intel for the contribution
...
Subsequent commits will integrate this capability with the state and errmgr frameworks.
2015-02-16 12:46:58 -08:00
Gilles Gouaillardet
8fe8079080
Fix a build failure when configure'd with --without-hwloc
...
see http://mtt.open-mpi.org/index.php?do_redir=2235
2015-02-16 10:31:09 +09:00
Jeff Squyres
3ac1d0dae5
*-info: add "lt_dladvise support" lines
2015-02-11 12:25:20 -08:00
Ralph Castain
2a83d2613a
Cleanup the orte/test/system directory
2015-02-11 10:42:38 -08:00
Ralph Castain
d5775bf9de
Cleanup orte MPI test directory so it all builds again
2015-02-11 10:14:06 -08:00
Ralph Castain
ce56c0a2cf
Oops - remove debug/exit
2015-02-11 10:14:06 -08:00
Jeff Squyres
c9e3f22933
orte mpi tests: fix a bunch of compiler warnings
2015-02-11 12:28:10 -05:00
Jeff Squyres
07179ef669
orte mpi tests: don't use deprecated MPI functions
...
Change MPI_Errhandler_set -> MPI_Comm_set_errhandler
2015-02-11 12:28:10 -05:00
Jeff Squyres
cc7f433c0f
Makefile: this file should not be executable
2015-02-11 07:33:56 -08:00
Ralph Castain
3de8c5c7c6
Cleanup the munge support - the credential cannot be reused for multiple connections
2015-02-10 20:34:35 -08:00
Ralph Castain
46fb850bb0
Continue adding support for options on orte-submit - still need to shift some of the MCA params to job object attributes
2015-02-10 13:56:14 -08:00
Ralph Castain
116fcaff2c
Start adding support for cmd line options to orte-submit
2015-02-10 12:13:21 -08:00
rhc54
cf3f4def48
Merge pull request #386 from marksantcroos/master
...
Add debug option to orte-dvm.
Looks fine - thanks
2015-02-10 11:38:52 -08:00
Ralph Castain
df2cd96772
Display the local/global attribute flag more prominently. Mark the attributes as global in orte-submit so they will be communicated
2015-02-10 10:47:32 -08:00
Mark Santcroos
ff6a69a68d
Add debug option to orte-dvm.
2015-02-10 13:02:23 -05:00
Ralph Castain
063e4c9989
Cleanup the pretty-print of odls cmds as some were missing. Add a new cmd to terminate the DVM, which the HNP will use to turn around and issue an xcast to the DVM.
2015-02-10 08:27:13 -08:00
Ralph Castain
3ae3b96c17
Fix master compilation - a buried header dependency must have been removed.
2015-02-10 07:22:10 -08:00
Elena
948c20d862
added pmix unit test to tarball
2015-02-10 13:41:15 +02:00
Howard Pritchard
b62d9c2c70
ess/alps: fix compile issue for pgi
...
remove -fi-noident cflag option. Wasn't helping anyway
and caused pgi compiles to break.
2015-02-09 20:49:04 -08:00
Ralph Castain
3478def791
Ensure that nodes get included in the nidmap when spawning a new DVM job - we really only need to do this once, but for now we do it for every job until we work out how to avoid the duplication. Remove debug from orte-dvm tool
2015-02-09 23:47:46 -05:00
Ralph Castain
ef13ba7db3
Add debug-daemons option to orte-dvm
2015-02-09 11:08:45 -05:00
Ralph Castain
a3275aa867
Once again, fix the blasted singleton comm_spawn
2015-02-05 17:34:25 -08:00
Ralph Castain
f28238af59
Fix a race condition seen by Absoft during finalize. Stop the orte progress thread without cleaning it up, thus allowing the frameworks to still cancel their posted recv's. Then cleanup the memory footprint afterwards.
2015-02-05 11:41:37 -08:00
Jeff Squyres
938b8e1dad
schizo: fix free of uninitialized value
...
The "param" value is not assigned before this free() statement. So
remove it.
(yay clang compiler warnings)
2015-02-04 15:50:24 -05:00
Ralph Castain
251084a2da
When a tool requests the spawn of a new job, then exclusively forward output to that tool - the DVM should not output its own copy as well.
2015-02-04 07:59:47 -08:00
Ralph Castain
2b0b012460
Continue refinement of the DVM operations. Send the spawn request to the right place (it helps) as it isn't a comm_spawn request and has to be treated a little differently. Ensure IO gets forwarded back to the tool. Ensure the tool outputs show_help locally as there is no place to send it.
2015-02-04 06:21:54 -08:00
Ralph Castain
7299cc3ab9
Cleanup the communications handshake so that orte-submit properly terminates upon job completion, and properly sends the terminate command to orte-dvm
2015-02-03 07:25:43 -08:00
Elena
5919b636e1
changed output format in pmix unit test
2015-02-02 14:22:51 +02:00
Ralph Castain
4dba298e6e
Update orte-submit manpage, add the ompi-* versions of orte-dvm and orte-submit manpages
2015-02-01 15:46:40 -08:00
Ralph Castain
e303a9b1d6
Provide an orte-dvm man page. Provide an option to orte-submit for terminating the DVM
2015-02-01 12:14:44 -08:00
Ralph Castain
ec5ccb76cf
Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time.
2015-01-30 11:00:43 -08:00
rhc54
e7fa600d85
Merge pull request #360 from elenash/master
...
added unit test for pmix functionality
2015-01-28 06:18:57 -06:00
Elena
472baa1284
added unit test for pmix functionality
2015-01-28 13:18:26 +02:00
Ralph Castain
b838df9eb8
Get slurm to stay out of the way on singletons
2015-01-27 09:29:43 -06:00
Ralph Castain
294ebc907a
Fix singleton operations so they can work inside a slurm environment
2015-01-27 09:29:42 -06:00
Ralph Castain
3eca55caec
Continue fixing singletons in slurm environments
2015-01-27 09:29:42 -06:00
Ralph Castain
fcec24b2a4
Minor cleanups to handle comm_spawn and singletons
2015-01-27 09:29:42 -06:00
Ralph Castain
74385302c0
Add the personality to the orte_job_t datatype support
2015-01-27 09:29:42 -06:00
Ralph Castain
88c38f87d2
Get the orteds to use schizo as well
2015-01-27 09:29:42 -06:00
Ralph Castain
028b00154d
Complete implementation of the schizo framework to support OMPI component
2015-01-27 09:29:42 -06:00
Ralph Castain
11c92eefe6
ckpt
2015-01-27 09:29:42 -06:00
rhc54
a1707326bf
Merge pull request #359 from hppritcha/topic/better_help
...
orte/util: minor improvement to show_help
2015-01-25 08:13:49 -08:00
Howard Pritchard
1e94d84ae6
orte/util: minor improvement to show_help
...
Make sure the show help gives it a good try to
print an error message locally if the
send_buffer_nb method returns an error.
2015-01-23 13:54:03 -08:00
Howard Pritchard
2809c21e0f
rml/oob: check peer param in send methods
...
The rml/oob was not doing sanity checks on the input peer
parameter for the orte_rml_oob_send_nb and orte_rml_oob_send_buffer_nb.
Owing to the fact that there are places in the ompi/orte stack
where things like orte_show_help_norender are called way before
ORTE_PROC_MY_HNP and the like are set up properly, all kinds of weird
startup failures can occur as the rml/oob tries to process send
requests where the peer is junk.
Rather than try to expand this kind of thing:
    /* if we are the HNP, or the RML has not yet been setup,
     * or ROUTED has not been setup,
     * or we weren't given an HNP, or we are running in standalone
     * mode, then all we can do is process this locally
     */
    if (ORTE_PROC_IS_HNP || orte_standalone_operation ||
        NULL == orte_rml.send_buffer_nb ||
        NULL == orte_routed.get_route ||
        NULL == orte_process_info.my_hnp_uri) {
        rc = show_help(filename, topic, output, ORTE_PROC_MY_NAME);
    }
do the right thing in the rml level and return an error rather than
eventually failing in the send owing to peer not being valid.
2015-01-22 06:12:39 -08:00
Howard Pritchard
06d3b57c07
Merge pull request #351 from hppritcha/topic/alps_odls_spawn_bug
...
odls/alps: check if PMI gni rdma creds already set
2015-01-19 11:48:24 -07:00
Howard Pritchard
fd807aee69
odls/alps: check if PMI gni rdma creds already set
...
Need to check if the alps odls component has already
read the rdma creds from alps. It's okay to ask apshepherd
multiple times for rdma creds, but opal_setenv gets
a bit picky about this. Rather than check for the OPAL_EXISTS
return value from opal_setenv, for now just check with
a static variable whether or not orte_odls_alps_get_rdma_creds
has already been successfully called before.
Would be nice to have an opal_getenv function for checking
if an env. variable had already been set by opal_putenv.
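A minimal sketch of the static-variable guard described above, assuming a stand-in for the real credential lookup. The names get_creds_once, fetch_and_export_creds and fetch_count are hypothetical and are not the odls/alps code.

```c
#include <stdbool.h>

/* Counts how often the stand-in fetch below actually ran. */
static int fetch_count = 0;

/* Hypothetical stand-in for asking the launcher for credentials and
 * exporting them into the environment. Returns 0 on success. */
static int fetch_and_export_creds(void)
{
    ++fetch_count;
    return 0;
}

/* Run the fetch at most once per process. The flag is only set after a
 * successful call, so a failed attempt can be retried later. */
static int get_creds_once(void)
{
    static bool already_got_creds = false;
    int rc;

    if (already_got_creds) {
        return 0;
    }
    rc = fetch_and_export_creds();
    if (0 == rc) {
        already_got_creds = true;
    }
    return rc;
}
```

The guard sidesteps the repeated-setenv problem without inspecting the environment, at the cost of one piece of function-local state that is not thread-safe as written.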
2015-01-19 10:12:38 -08:00
Gilles Gouaillardet
661c35ca67
cleanup dead code caused by the removal of the --with-threads configure option
2015-01-16 19:13:59 +09:00
Ralph Castain
e7ff21b3aa
The opal_stop_progress_thread function releases the event base, so don't do it again
2015-01-15 10:48:40 -08:00
Ralph Castain
9ac39b63cc
Use the opal_progress_threads support for the ORTE progress thread in applications
2015-01-15 07:55:19 -08:00
Ralph Castain
d2938a144f
Use the proper interface index. Thanks to Mark Kettenis for spotting the problem and providing a patch
2015-01-12 05:31:02 -08:00
Howard Pritchard
f34dd5f5fd
plm/alps: update copyright
2015-01-07 12:33:38 -07:00
Howard Pritchard
c454d11b01
plm/alps: fix orted abort hang problem
...
Turns out the alps plm component wasn't changing the state
of the job upon terminating the orted's in the case of
an abnormal termination. This caused mpirun to hang
with a zombie'd aprun process if an orted on a node
in the job was killed via signal.
2015-01-07 12:31:41 -07:00
Howard Pritchard
f0f98f13b6
odls/base: fix an edge case with signals
...
In the course of doing some testing with how orted's
handle signaled child processes, found out that very
often doing a kill -9 on a process on a node just
results in the job hanging. The problem was that the
orted odls/errmgr was not properly handling the exit_code
being returned from waitpid. Now mark the proc state
as ORTE_PROC_STATE_ABORTED_BY_SIG if the exit_code
from waitpid indicates the process exited owing to
a signal.
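The check described above can be illustrated with the standard POSIX wait-status macros. This is a sketch rather than the odls code: the child_state_t enum and both function names are hypothetical, with CHILD_ABORTED_BY_SIGNAL standing in for ORTE_PROC_STATE_ABORTED_BY_SIG.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical states for illustration; not the ORTE definitions. */
typedef enum {
    CHILD_EXITED_OK,
    CHILD_EXITED_NONZERO,
    CHILD_ABORTED_BY_SIGNAL,
    CHILD_UNKNOWN
} child_state_t;

/* Decode the status word filled in by waitpid(). */
static child_state_t classify_wait_status(int status)
{
    if (WIFSIGNALED(status)) {
        return CHILD_ABORTED_BY_SIGNAL;   /* e.g. the child got kill -9 */
    }
    if (WIFEXITED(status)) {
        return (0 == WEXITSTATUS(status)) ? CHILD_EXITED_OK
                                          : CHILD_EXITED_NONZERO;
    }
    return CHILD_UNKNOWN;
}

/* Fork a child that either exits cleanly (sig == 0) or kills itself
 * with the given signal, then classify how it terminated. */
static child_state_t run_child_and_classify(int sig)
{
    int status = 0;
    pid_t pid;

    assert(sig >= 0);
    pid = fork();
    if (pid < 0) {
        return CHILD_UNKNOWN;
    }
    if (0 == pid) {
        if (sig > 0) {
            kill(getpid(), sig);
        }
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid) {
        return CHILD_UNKNOWN;
    }
    return classify_wait_status(status);
}
```

Testing WIFSIGNALED before WIFEXITED makes the signal case explicit; treating the raw status as a plain exit code is the kind of mistake that would hide a signaled child.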
2015-01-06 15:42:38 -07:00
Nadezhda Kogteva
05af80b302
Fix commit bffb2b7a4b which broke pmix server functionality
2014-12-24 13:25:23 +02:00
Ralph Castain
43a40f8aac
LSF expresses its affinity file in hwthreads and expects those to be used as cpus, so set things accordingly
2014-12-19 12:06:05 -08:00
Ralph Castain
b314bfb5e9
If someone specifies the bitmap for hwthreads and wants hwthread cpus, then don't parse the slot list as it expects cores - just copy the provided bitmap across as it already has the required info
2014-12-19 10:56:14 -08:00
Jeff Squyres
7b43bdc984
plm base: move flag inside the #if in which it is used
...
Avoid a compiler warning by declaring the tflag only inside the #if in
which it is used (i.e., if hwloc support is built).
2014-12-18 10:56:23 -08:00
Ralph Castain
2581b41d08
Continue refactoring code by splitting the msg processing from the sendrecv code
2014-12-17 19:57:14 -08:00
Ralph Castain
f489e871c2
Take first step towards refactoring the PMIx server code by splitting out the proc_map function into its own file. Update ignore to include .DS_Store from the Mac
2014-12-17 19:08:52 -08:00
Artem Polyakov
01601f3284
Merge pull request #305 from artpol84/timing
...
Timing framework improvement
2014-12-16 15:13:48 +06:00
Ralph Castain
573a574a3c
Remove an unused dstore type that was redundant with another one. Define a corresponding PMIX_NODE_ID type (contains the vpid of the daemon hosting the proc) and ensure that the PMIx server includes that info in its process map
2014-12-15 12:11:13 -08:00
Ralph Castain
a22cc45769
Close the pmix server sockets on exec
2014-12-13 20:30:21 -08:00
Ralph Castain
f4ff791335
Close oob/usock connections upon exec
2014-12-13 20:24:09 -08:00
Ralph Castain
6c4d5a51c4
Close tcp sockets upon exec
2014-12-13 20:23:53 -08:00
Ralph Castain
9658256a98
Restore the passing of the complete job map to the local proc on first get_attr so the info can be used by the MPI layer without continual calls back to the server. We'll find a more memory efficient method later.
2014-12-13 18:44:09 -08:00
Ralph Castain
bffb2b7a4b
Correct some issues with variables used before being set
2014-12-12 17:23:32 -08:00
Ralph Castain
0630680f36
Two cleanups required for transfer to 1.8.4:
...
* Use %d format for the topo signature as some systems apparently have problems with %u
* Use correct variable in show_help message
2014-12-12 17:23:32 -08:00
Rolf vandeVaart
f4aecdbfd2
Change logging function name from log to logfn. Fixes issue with PGI compile
2014-12-12 09:46:44 -05:00
Artem Polyakov
8ffad75a0a
Introduce timing interval measurement facility in timing framework
2014-12-10 16:47:49 +06:00
Ralph Castain
9d5135e6cd
Function definition should use the correct type
2014-12-09 01:04:31 -08:00
Ralph Castain
bb529ebd8e
Revise the way we handle hetero nodes as users are finding this (a) a significant surprise, and (b) confusing as to when it is required. So try to automate it a bit by creating a topology "signature" that mpirun can share on the cmd line with the remote daemons, thus allowing them to check to see if they match. This isn't comprehensive of course - for now, it only checks the number of each type of hwloc object on the node. This is good enough to pick up major differences (e.g., where we have different numbers of sockets or assigned core bindings).
...
Retain the hetero-nodes flag for those cases where the user *knows* that there are differences and our automated system isn't good enough to see it.
Will obviously require further refinement as we find out which variances it can detect, and which it cannot.
2014-12-08 15:38:14 -08:00
elenash
baf32fe480
Merge pull request #308 from elenash/master
...
restored _process_name_print_for_opal function in orte_init: it's requir...
2014-12-08 19:14:36 +03:00
Ralph Castain
b757b3f452
Ensure that the #nodes in the job map gets properly updated when using the sequential mapper. Provide some further diagnostic info to help understand the problem when encountered.
2014-12-08 08:03:53 -08:00
Elena
6cf3925b09
restored _process_name_print_for_opal function in orte_init: it's required for opal output from daemons which never called ompi_init so didn't set opal_process_name_print pointer
2014-12-08 13:13:35 +02:00
Ralph Castain
d6d69e2b13
Get the direct routed component to work with both TCP and USOCK OOB components. We previously had setup the direct component so it would only support direct-launched applications. Thus, all routes went direct between processes. However, if the job had been launched by mpirun, this made no sense - what you wanted instead was to have each app proc talk directly to its daemon, but have the daemons all directly connect to each other.
...
So we need all the routing code for dealing with cross-job communications, lifelines, etc. The HNP will be directly connected to all daemons as they must callback at startup, and so we need to track those children correctly so we know when it is okay to terminate.
We still have to support direct launch, though, as this is the only component we can use in that scenario. So if the app doesn't have daemon URI info, then it must fall back to directly connecting to everything.
2014-12-07 09:11:48 -08:00
Ralph Castain
b1bf557024
Fix the hostfile parser so it correctly ignores binding directives that are just integers. Fix the create_dmns function so we don't hang if we can't get an error before creating the job map for an application.
2014-12-05 15:47:09 -08:00
Elena
af38a762a2
these changes fix direct routed component under mpirun; oob tcp and oob ud are working with direct routed component, but usock doesn't work with direct routed component yet.
2014-12-05 12:38:59 +02:00
Ralph Castain
c4fd6d1cde
Fix typo
2014-12-04 12:24:35 -08:00
Ralph Castain
c4002a8485
Further cleanups on the LSF integration - the affinity file is apparently always present, but simply empty if affinity wasn't set.
2014-12-04 12:24:35 -08:00
Ralph Castain
c88f181efe
Fix singleton comm-spawn, yet again. The new grpcomm collectives require a complete knowledge of every active proc in the system in case they participate in a collective. So ensure we pass the required job info when we spawn new daemons, and construct the necessary connections to allow grpcomm to operate.
2014-12-03 18:11:17 -08:00
Howard Pritchard
c67afadcfc
Merge pull request #289 from hppritcha/topic/remove_pmi
...
Topic/remove pmi
2014-12-03 16:58:35 -07:00
Jeff Squyres
a3af7d6dbb
Revert "lsf configury: add dependent libraries for static linking"
...
This reverts commit 56cfa90dda.
2014-12-03 13:32:56 -08:00
Jeff Squyres
92c2ff91ec
Revert "Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic."
...
This reverts commit open-mpi/ompi@32bf0e7b7e.
2014-12-03 13:15:20 -08:00
Ralph Castain
54c955c92d
Fix a race condition that only appears to be affecting certain setups. The pmix.finalize function closes the file descriptor to the server, which then triggers the errhandler callback. Since the errmgr is about to be unloaded, it might be getting hit.
2014-12-03 12:19:00 -08:00
Howard Pritchard
666344a081
orte/mca/common/alps: fix configure file
...
Fix configure file for alps to actually check for
alps being available.
Also include stdio.h explicitly in common_alps.c
2014-12-03 09:44:18 -07:00
Howard Pritchard
ec38aa3732
orte/mca/common: add missing Makefile.am
2014-12-03 09:44:18 -07:00
Howard Pritchard
191fe0f949
alps configury changes
...
Clean up the orte_check_alps.m4. There was a little of
unnecesary stuff for handling cle 5, since it wasn't actually
doing the right thing, which would be to use pkg-config to
find dependencies both for dynamic and static linking.
Decouple the searching for alps libs, etc. from cray pmi.
Switch the alps ess and alps odls components' config files
to use the ALPS m4 macro.
alps configury fixes
Improve a check for detecting CLE release.
Improve an error message.
2014-12-03 09:44:17 -07:00
Howard Pritchard
d749077e1e
odls/alps: make sure PMI env. variables set up
...
Add call to orte_odls_alps_get_rdma_creds in the
local proc launch step to obtain the Cray Rdma
credentials from the apshepherd, and to set
the PMI env. variables expected by uGNI BTL, etc.
2014-12-03 09:44:17 -07:00
Howard Pritchard
e0487e7702
orte/common/alps: add an alps common lib to orte
...
Add an alps common lib to orte. Add a function
to determine whether or not a process is in a
PAGG container.
Note: we need a better naming convention for
common libs, since right now they use a "flat"
naming convention.
2014-12-03 09:44:17 -07:00
Howard Pritchard
a753c3ece0
ess/alps: add initial alps ess component
...
Note this alps ess component has nothing to do
with the old CNOS alps component used on
Cray Seastar/Portals3 (Cray XT) systems.
To work properly, changes need to be made to the
open method of the ess/pmi component to keep it
from selecting, and thus initializing, the opal/pmix/cray
component.
2014-12-03 09:44:17 -07:00
Ralph Castain
32bf0e7b7e
Cleanup static build requirements by adding the wrapper flags back to the component configure.m4's. Minor cleanup of the lsf configure logic.
2014-12-03 07:14:06 -08:00
Ralph Castain
cb15cc06e1
Minor changes per Jeff's request on PR for 1.8.4
2014-12-02 19:54:10 -08:00
Ralph Castain
6294ed991b
Fix singletons - still working on singleton comm_spawn
2014-12-02 14:12:24 -08:00
Ralph Castain
14cdb04327
Revise the ess/pmi selection logic as all APPs must select it, and no daemons. Cleanup some of the mca param levels in ess so we don't printout the topology quite as easily.
2014-12-01 21:19:11 -08:00
Jeff Squyres
56cfa90dda
lsf configury: add dependent libraries for static linking
...
Ensure to add the LSF dependent libraries and LD flags for the wrapper
compiler static linking case.
2014-12-01 14:59:10 -08:00
Ralph Castain
f92ccaf0f9
Add missing var declarations
2014-12-01 09:36:28 -08:00
Ralph Castain
960ef34988
Ensure the LSF ras adds the hosts to the allocation. Correctly handle the semi-colon vs comma situation in hwloc slot_lists
2014-11-30 14:37:37 -08:00
Ralph Castain
3f9d9ae8b6
Provide tighter LSF integration by correctly handling scenarios where the user has asked LSF to assign bindings. Fix a couple of typos in lex parser definitions. Tell hostfile parser to ignore binding designations in hostfiles. Add an attribute to indicate that cpusets were provided as physical cpu ids.
...
Once validated, a version of this will be backported to the v1.8.4 release.
2014-11-30 11:50:31 -08:00
Elena
b17ea23ce0
fixes for direct routed component under mpirun
2014-11-26 13:36:49 +02:00
Ralph Castain
f48b9012cb
Some minor cleanup. We really don't need another peer error constant to indicate that a peer closed as we already have one for "connection failed", and that's all we really know. Update the orte constants to track their opal equivalents.
2014-11-25 08:02:29 -08:00
Nadezhda Kogteva
8dd21c7736
OOB UD: fix case when multiple oob components were specified in command line (checking of uri).
2014-11-25 11:48:11 +02:00
Gilles Gouaillardet
578fe41788
fix hangs introduced by previous commit a6744b8177
2014-11-25 17:50:44 +09:00
Gilles Gouaillardet
a6744b8177
fix misc memory leaks specific to the master
2014-11-25 13:52:10 +09:00
Gilles Gouaillardet
38879cf682
fix misc memory leaks
2014-11-25 11:32:43 +09:00
Ralph Castain
48f702827e
First part of memory leak cleanups from Gilles
2014-11-24 16:53:33 -08:00
Ralph Castain
2e00e335b9
Add missing header to tarball. Remove stale opal_unignore
2014-11-21 17:35:11 -08:00
Howard Pritchard
6e807c4e8a
odls/alps: minor config cleanup
2014-11-21 11:09:28 -07:00
Nadezhda Kogteva
05b2eb1270
OOB UD: opal_ignore removed from oob ud component: component is compilable. Added support of new RML API, support of opal_buffer as input data. Added usage of routed component.
2014-11-20 10:20:35 +02:00
rhc54
7c0273ecb3
Merge pull request #276 from teng-lin/master
...
Fixed a bug that fails to parse hostname starting with numbers.
2014-11-19 16:39:00 -08:00
Teng Lin
07ff51f43f
Fixed a bug that fails to parse hostname starting with numbers.
...
According to RFC 1123, hostnames that begin with numbers are valid.
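As an illustration of the RFC 1123 rule cited above, the following C sketch accepts labels that begin with a digit. The function name is hypothetical, the length limits (63 per label, 253 overall) are conventional DNS limits assumed here, and the actual fix was presumably made in the hostname parsing rules rather than in a validator like this one.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical validator, for illustration only. Each dot-separated
 * label must start and end with a letter or digit and may contain
 * hyphens in between. */
static bool is_valid_hostname(const char *name)
{
    size_t label_len = 0;
    size_t total = (NULL == name) ? 0 : strlen(name);

    if (0 == total || total > 253) {
        return false;
    }
    for (size_t i = 0; i < total; ++i) {
        unsigned char c = (unsigned char)name[i];

        if ('.' == c) {
            /* reject empty labels and labels that end in a hyphen */
            if (0 == label_len || '-' == name[i - 1]) {
                return false;
            }
            label_len = 0;
            continue;
        }
        if (!isalnum(c) && '-' != c) {
            return false;
        }
        if (0 == label_len && '-' == c) {
            return false;   /* a label may start with a digit, not a hyphen */
        }
        if (++label_len > 63) {
            return false;
        }
    }
    /* the final label must be non-empty and must not end in a hyphen */
    return label_len > 0 && '-' != name[total - 1];
}
```

Under these assumptions, names such as 3com.example.org and 10gnode-01 are accepted, while -node is rejected.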
2014-11-19 16:03:55 -08:00
Howard Pritchard
9425ebefae
Be more selective about closing fd's for alps/odls
...
Be more selective about closing fd's for the alps odls
component. Don't close fd's of pipes set up by the
apshepherd for providing RDMA credentials, etc.
Add an entry to the help file in case
alps_app_lli_pipes returns an error.
2014-11-19 11:21:30 -07:00
Ralph Castain
bb91517349
Allow other layers to register their own print-attribute functions so we can maintain pretty-print capabilities as the attributes are extended.
2014-11-19 09:37:59 -08:00
Ralph Castain
37593b232d
Add a marker for the max attr value being used by ORTE so that other, higher-levels can also use the attribute system
2014-11-19 09:37:59 -08:00
Howard Pritchard
34c156759e
fix some compiler warnings in ras/alps
2014-11-18 11:32:37 -07:00
Howard Pritchard
4df3447d96
fix compare_nodes bug in alps ras component
...
There was an obvious bug in the alps/ras component compare_nodes method
which resulted in the function always evaluating the nodes
as being equivalent.
2014-11-18 11:15:02 -07:00
Howard Pritchard
ff362c16ce
add/update copyrights for alps odls component
2014-11-18 10:16:11 -07:00
Howard Pritchard
dc98b62070
add initial support for an alps odls component
...
It turns out that the support for Open MPI apps on
Cray was hanging on a thin thread of support when
using the mpirun job launcher. It just happened that
with a certain set of configuration options things would
work. This is bound to backfire at some point.
To fix this weakness, as well as to allow for mpirun launched
jobs to benefit from many of the advanced placement features
provided by the Cray Linux Environment (as opposed to the hwloc
only default env of orte), a new odls alps component is introduced.
2014-11-17 14:00:09 -07:00
Ralph Castain
d9ceb5aea4
Fix C++ builds by removing no-longer-needed type declaration
2014-11-14 11:44:24 -08:00
Gilles Gouaillardet
f3b36fdf6e
orted/pmix: fix pmix_server_release when several jobids are running on the same node
2014-11-14 16:17:28 +09:00
Gilles Gouaillardet
84b21d726e
orte/util: add OPAL_{VPID,JOBID} types to orte_attr_{load,unload}
2014-11-14 15:55:25 +09:00
rhc54
1fdb6a62d3
Merge pull request #265 from miked-mellanox/topic/undeprecate_env_x
...
ORTE: undeprecate -x var=val in mpirun
Looks okay to me - thanks!
2014-11-12 08:46:09 -08:00
Mike Dubman
f83d6045aa
ORTE: undeprecate -x var=val in mpirun
...
mpirun -x var=val is back; it is actually a useful alias for -mca mca_base_env_list "var=val"
2014-11-12 10:51:15 +02:00
Ralph Castain
780c93ee57
Per the PR and discussion on today's telecon, extend the process name definition as a two-field struct of uint32_t's down to the OPAL layer. This resolves issues created by prior commits that impacted both heterogeneous and SPARC support. This also simplifies the OMPI code base by removing the need for frequent memcpy's when transitioning between the OMPI/ORTE layers and OPAL.
...
We recognize that this means other users of OPAL will need to "wrap" the opal_process_name_t if they desire to abstract it in some fashion. This is regrettable, and we are looking at possible alternatives that might mitigate that requirement. Meantime, however, we have to put the needs of the OMPI community first, and are taking this step to restore hetero and SPARC support.
2014-11-11 17:00:42 -08:00
Ralph Castain
d0704ef118
Restore handling of physical processors in rankfiles. Note that the prior implementation was likely incorrect as it falsely assumed that physical core indices were unique, which isn't always true. Stipulate that physical rankfiles can only include PU numbers, and bind the result to the core that contains that physical PU. Update the mpirun man page to cover the new use-case.
2014-11-10 14:00:40 -08:00
Ralph Castain
2a90788724
Support physical processor ids in rankfile
2014-11-10 14:00:40 -08:00
Ralph Castain
8c837d3cb3
Doh - if we can't output an entire block, then we need to adjust the number of bytes remaining to be output or else we will output duplicate bytes when next we are able to write.
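The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the iof code: out_block_t, block_consume and block_write are hypothetical names, and the fd is assumed to be non-blocking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical pending-output record, for illustration only. */
typedef struct {
    const char *data;    /* next byte that still has to be written */
    size_t numbytes;     /* bytes remaining from data onward */
} out_block_t;

/* Record that `written` bytes went out. Both the pointer and the
 * remaining count are adjusted; skipping the count adjustment is what
 * would cause already-written bytes to be output again. */
static size_t block_consume(out_block_t *blk, size_t written)
{
    assert(written <= blk->numbytes);
    blk->data += written;
    blk->numbytes -= written;
    return blk->numbytes;
}

/* Try to write the block. Returns 0 when the block is fully written,
 * 1 when some bytes remain (retry when the fd is writable again), and
 * -1 on a hard error. */
static int block_write(int fd, out_block_t *blk)
{
    ssize_t n = write(fd, blk->data, blk->numbytes);

    if (n < 0) {
        return (EAGAIN == errno || EINTR == errno) ? 1 : -1;
    }
    return (block_consume(blk, (size_t)n) > 0) ? 1 : 0;
}
```

Keeping the pointer and the count in one record, and updating them together in one place, makes it harder for the two to drift apart after a short write.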
2014-11-07 13:13:13 -08:00
Ralph Castain
b56b744041
Silence some warnings and remove debug output
2014-11-07 07:54:01 -08:00
Elena
03fc809bc9
This commit contains new dstore component sm which is used for communication between pmix server and clients at the same node via shared memory.
2014-11-06 16:01:19 +02:00
Ralph Castain
738c3e1d72
Ensure that mpirun correctly selects the HNP ess component without attempting to init the PMI subsystem as mpirun won't be supported anyway, so let's avoid the error message. Also, daemons launched by the plm/slurm component must use the ess/slurm module as we cannot trust the Slurm PMI_Init functions to correctly tell us when PMI support is available.
2014-11-03 21:35:42 -08:00
Ralph Castain
6fbc68c830
Update the grpcomm direct component's priority so it sits at the bottom of the list, as it should now that the other components are active. Clean up the signature print function a touch to make it more readable. Remove the unneeded xcast functions in brks and rcd components as we will just fall thru to using the "direct" one
2014-11-03 14:43:17 -08:00
Gilles Gouaillardet
652ecdb888
oob/tcp: always include a missing header file
...
improve open-mpi/ompi@c9d1e16a9e
2014-10-29 13:39:23 +09:00