1
1
Граф коммитов

24518 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
d653cf2847 Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM. 2016-02-21 11:55:49 -08:00
Ralph Castain
309e23ab3a Fix minor typo 2016-02-20 01:33:10 -08:00
yohann
59b6d041f8 mtl/ofi: Check allocated pointer. 2016-02-19 16:59:47 -08:00
yohann
bd47062764 mtl/ofi: Fix error handling. 2016-02-19 16:58:41 -08:00
yohann
404987e9b3 mtl/ofi: Fix mismatching types. 2016-02-19 16:57:26 -08:00
yohann
3ad59435ce mtl/ofi: Prevent possible memory leak. 2016-02-19 16:57:02 -08:00
Ralph Castain
8c92a179c0 Minor memory leak 2016-02-19 15:05:39 -08:00
rhc54
1f7e2d7d41 Merge pull request #1388 from rhc54/topic/iof
Cleanup the output-filename options so they work as expected.
2016-02-19 13:56:05 -08:00
Ralph Castain
0c72ba89b9 Cleanup the output-filename options so they work as expected. Have the remote nodes output locally to the files instead of sending it all back to the HNP.
Fix Solaris issues by renaming struct field
2016-02-19 12:41:46 -08:00
Edgar Gabriel
b33db517c1 Merge pull request #1387 from edgargabriel/dynamic_gen2-overlap
Updates to the dynamic_gen2 component
2016-02-19 12:00:29 -06:00
Edgar Gabriel
92d1b99468 optimize the shuffle step:
1. use communicator collectives if possible for performance reasons
 2. combined multiple allgathers into a single one
2016-02-19 11:04:04 -06:00
Edgar Gabriel
e63836c653 clean up the mca parameter handling of the component. Add new parameters for number of sub groups and write chunk size. This will allow to perform a systematic parameter study. 2016-02-19 10:15:28 -06:00
Edgar Gabriel
4f400314e0 add the dynamic_gen2 component into the fcoll selection table. 2016-02-19 09:32:54 -06:00
Edgar Gabriel
268d525053 change the tag to be a positive value. handle 0-byte situations correctly. 2016-02-19 08:28:50 -06:00
Edgar Gabriel
ad79012059 first cut on the version which overlaps the communication/computation of 2 iterations. 2016-02-19 08:28:50 -06:00
Nathan Hjelm
e57ce1e1ef Merge pull request #1384 from hjelmn/xrc_get_fix
btl/openib: XRC save SRQ#s on the loopback endpoint
2016-02-18 21:48:45 -07:00
Nathan Hjelm
2031bb6f01 btl/openib: XRC save SRQ#s on the loopback endpoint
This commit fixes a bug that can occur when communicating via XRC to
peers on the same node. UDCM was not saving the SRQ numbers on the
loopback endpoint (which shares its ib_addr info with all local peers)
so any messages to local peers use an invalid SRQ number.

Fixes open-mpi/ompi#1383

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-02-18 20:59:11 -07:00
rhc54
bfd4254a7b Merge pull request #1382 from rhc54/topic/cleanup
Cleanup some valgrind complaints about jumps with uninitialized values.
2016-02-18 17:29:37 -08:00
Nathan Hjelm
27e7b6e466 Merge pull request #1381 from hjelmn/ddt_colon_fix
orterun: allow DDT if options contain :'s
2016-02-18 17:48:21 -07:00
Nathan Hjelm
820b178384 Merge pull request #1380 from hjelmn/xrc_get_fix
btl/openib: XRC fix bug that could cause an invalid SRQ# to be used
2016-02-18 17:33:09 -07:00
Ralph Castain
6e68d758b9 Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files).
ck

Remove stale debug

Fix a segfault if no subscribers are present
2016-02-18 16:30:37 -08:00
Nathan Hjelm
69de442136 orterun: allow DDT if options contain :'s
There is a bug in MPMD detection that disables totalview if a : is
found anywhere on the command line. This includes inside an argument
option or MCA variable value. This commit changes the check to look
for the string " : " instead of the character : which should eliminate
the issue in most cases.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 16:56:08 -07:00
Nathan Hjelm
371df45bf8 btl/openib: fix locking bugs with XRC ib_addr lock
This bug fixes two issue with the ib_addr lock:

 - The ib_addr lock must always be obtained regardless of
   opal_using_threads() as the CPC is run in a seperate thread.

 - The ib_addr lock is held in mca_btl_openib_endpoint_connected when
   calling back into the CPC start_connect on any pending
   connections. This will attempt to obtain the ib_addr lock
   again. Since this is not a performance-critical part of the code
   the lock has been changed to be recursive.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 15:55:34 -07:00
Nathan Hjelm
4dc73d7765 btl/openib: XRC fix bug that could cause an invalid SRQ# to be used
This commit fixes a bug that occurs when attempting a get or put
operation on an endpoint that is not already connected. In this case
the remote_srqn may be set to an invalid value as the rem_srqs array
on the endpoint is not populated. This commit moves the usage of the
rem_srqs array to the internal put/get functions where it is
guaranteed this array is populated.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 15:44:29 -07:00
Ralph Castain
1748f44147 Stop a segfault that results in zombied processes by checking for NULL prior to object release 2016-02-18 13:48:41 -08:00
Jeff Squyres
7b73c868d5 memchecker.h: fix memchecker no-data case
Thanks to @clintonstimpson for reporting the issue.

Fixes open-mpi/ompi#100

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-02-18 10:48:11 -08:00
rhc54
142e38cbb2 Merge pull request #1358 from rhc54/topic/notification
Enable the PMIx notification callback system and fix debugger attach
2016-02-18 10:15:07 -08:00
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes ##1225
2016-02-18 09:29:12 -08:00
Nysal Jan K A
c18af0d61f Merge pull request #1378 from nysal/issue_1363
Make UD OOB memory registrations a multiple of page size
2016-02-18 12:34:09 +05:30
rhc54
2745610eb7 Merge pull request #1377 from rhc54/topic/pmix
Plug a leak in the PMIx subsystem
2016-02-17 20:05:45 -08:00
Nysal Jan K.A
cc9b1316a4 Make UD OOB memory registrations a multiple of page size
If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK.
If we only partially use a page, any data allocated on the remainder of
the page will be inaccessible to the child process.

Fixes open-mpi/ompi#1363
2016-02-17 22:19:49 -05:00
Ralph Castain
efb0eff43e Plug a leak in the PMIx subsystem 2016-02-17 19:00:36 -08:00
rhc54
dc4d3edc06 Merge pull request #1372 from rhc54/topic/sing
Further enhance the support for Singularity containers.
2016-02-17 16:39:23 -08:00
Nathan Hjelm
92a15cc316 Merge pull request #1374 from hjelmn/tune_fix
Fix parsing of envvars in MCA files
2016-02-17 17:24:33 -07:00
Nathan Hjelm
4a8fbb5ff3 Merge pull request #1373 from hjelmn/xrc_fixes
btl/openib/udcm: fix local XRC connections
2016-02-17 17:07:15 -07:00
Nathan Hjelm
32236736a4 Fix parsing of envvars in MCA files
This commit fixes a memory corruption bug when parsing lines of the
form:

-x FOO=bar

The code was making changes to the size of the buffer allocated for
key_buffer without making the appropriate changes to
key_buffer_len. This was causing subsequent calls to save_param_name
to write to invalid memory.

This commit makes the following changes:

  - Fix the above bug by modifying trim_name to move the string within
    the buffer instead of re-allocating space for the trimmed string.

  - Cleaned up both trim_name and save_param_name. Both functions took
    a prefix and suffix to trim. Problem was the prefix was not
    treated like a prefix. Instead the "prefix" was located inside the
    string using strstr then the trimmed value started after the
    substring (even in the middle of the string). To allow trimming
    both -x and --x (as well as -mca and --mca) trim_name is now
    called with each prefix.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-17 14:58:05 -07:00
Nathan Hjelm
4f4ea96940 btl/openib/udcm: fix local XRC connections
This commit ensures ib_addr->remote_xrc_rcv_qp_num value is set when
creating the loopback queue pair. This is needed when communicating
with any other local peer.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-17 14:54:19 -07:00
Howard Pritchard
72c7558176 Merge pull request #1371 from hppritcha/topic/alps_common_syms
ras/alps: squelch common symbol warnings
2016-02-17 14:35:43 -07:00
Ralph Castain
8f9508cace Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP. 2016-02-17 13:33:06 -08:00
Howard Pritchard
31841b4367 ras/alps: squelch common symbol warnings
squelch a couple of warnings from the common symbols
script.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-02-17 13:27:29 -06:00
Jeff Squyres
d544e0e6e0 Merge pull request #1347 from ggouaillardet/topic/dss_tests
test/dss: update tests to make them usable again, and run them
2016-02-17 09:01:36 -05:00
Nathan Hjelm
2a728f3cbc Merge pull request #1367 from hjelmn/xrc_fixes
Fix XRC support
2016-02-16 23:43:56 -07:00
Ralph Castain
e0de4423ba Remove debug 2016-02-16 20:58:53 -08:00
rhc54
7a0605f6c9 Merge pull request #1368 from rhc54/topic/iof
Modify the IOF subsystem to handle per-job directives for…
2016-02-16 20:53:49 -08:00
Ralph Castain
50431001a3 Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO.
Fix stdin reader
2016-02-16 18:54:38 -08:00
Gilles Gouaillardet
1e26f9cda4 test/dss: update tests to make them usable again, and run them 2016-02-17 11:27:00 +09:00
Nathan Hjelm
bf8360388f btl/openib/udcm: fix XRC support
This commit fixes two bugs in XRC support

 - When dynamic add_procs support was added to master the remote
   process name was added to the non-XRC request structure. The same
   value was not added to the XRC xconnect structure. This error was
   not caught because the send/recv code was incorrectly using the
   wrong structure member. This commmit adds the member and ensure the
   xconnect code uses the correct structure.

 - XRC loopback QP support has been fixed. It was 1) not setting the
   correct fields on the endpoint structure, 2) calling
   udcm_xrc_recv_qp_connect, and 3) was not initializing the endpoint
   data.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:49:04 -07:00
Nathan Hjelm
201c280e6c btl/openib: fix error in param check in mca_btl_openib_put
mca_btl_openib_put incorrectly checks the qp inline max before
allowing an inline put. This check will always fail for an endpoint
that has not been connected. The commit changes the check to use the
btl_put_local_registration_threshold instead.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:46:32 -07:00
Nathan Hjelm
123a39ac3c btl/openib: fix regression in XRC support
Commit open-mpi/ompi@400af6c52d
introduced a regression in XRC support. The commit reversed the
ordering of shared receive queue (SRQ) and completion queue (CQ)
completion. CQ creation must always preceed SRQ creation when using
XRC as the CQs are needed to create the SRQs. This commit fixes the
ordering so that CQs are always created before SRQs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-16 16:46:20 -07:00
yohann
7fe395c82a mtl/ofi: cleanup 2016-02-16 09:57:57 -08:00