1
1
Граф коммитов

26529 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
410befd255 Merge pull request #2864 from rhc54/topic/rsh
Repair rsh/ssh tree spawn
2017-01-27 16:31:35 -08:00
Ralph Castain
3440b46e5e Merge pull request #2820 from rhc54/topic/async
Per f2f meeting: if async modex is given, default to no MPI init barr…
2017-01-27 15:43:43 -08:00
Ralph Castain
7c795f4416 If the HNP is going to request topology info, it cannot do so via a routed OOB message as the intervening daemons may not be ready. So disable routing until the VM is ready, and have daemons start routing as they receive the xcast launch msg (which includes the data they need to talk to their peers).
Do a little optimization and minimize recomputation of the routing plan.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 15:37:16 -08:00
Ralph Castain
d672fad849 Repair rsh/ssh tree spawn
Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn.

Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 11:35:00 -08:00
Josh Hursey
f4a86904c4 Merge pull request #2813 from jjhursey/fix/ibm/comm-cleanup
communicator: Fix uninitialized variable
2017-01-26 14:35:32 -06:00
Josh Hursey
2e64bf42fb Merge pull request #2810 from jjhursey/fix/ibm/stdiag-to-stdout
Extend options for stddiag routing
2017-01-26 14:29:16 -06:00
Josh Hursey
770c41f493 Merge pull request #2807 from jjhursey/fix/ibm/event-external
libevent/external: Add opal_event_include to this component
2017-01-26 14:26:50 -06:00
Josh Hursey
ebc90f926e Merge pull request #2806 from jjhursey/fix/ibm/aint-diff-type
Fix a minor error at MPI_AINT_DIFF.
2017-01-26 14:23:21 -06:00
Josh Hursey
0408c116eb Merge pull request #2805 from jjhursey/fix/ibm/base-allgatherv
coll/base: Allgatherv MPI_IN_PLACE Bug
2017-01-26 14:21:57 -06:00
Jeff Squyres
b9d7b27cfa Merge pull request #2855 from gpaulsen/fix/ibm/nbc_ireduce_anysrc
Fixing comment only in MPI_IN_PLACE case for ireduce in libnbc.
2017-01-26 11:00:50 -08:00
Geoffrey Paulsen
d2527cff46 Fixing comment only in MPI_IN_PLACE case for ireduce in libnbc.
Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2017-01-26 10:58:51 -08:00
Jeff Squyres
2c277a66fd Merge pull request #2772 from jjhursey/topic/stacktrace-improv
master: opal/stacktrace improvements
2017-01-26 10:48:41 -08:00
Joshua Hursey
6d98559be9 stacktrace: Add flexibility in stacktrace ouptut
- New MCA option: opal_stacktrace_output
   - Specifies where the stack trace output stream goes.
   - Accepts: none, stdout, stderr, file[:filename]
   - Default filename 'stacktrace'
     - Filename will be `stacktrace.PID`, or if VPID is available,
       then the filename will be `stacktrace.VPID.PID`
 - Update util/stacktrace to allow for different output avenues
   including files. Previously this was hardcoded to 'stderr'.
 - Since opal_backtrace_print needs to be signal safe, passing it a
   FILE object that actually represents a file stream is difficult. This
   is because we cannot open the file in the signal handler using
   `fopen` (not safe), but have to use `open` (safe). Additionally, we
   cannot use `fdopen` to convert the `int fd` to a `FILE *fh` since it
   is also not signal safe.
   - I did not want to break the backtrace.h API so I introduced a new
     rule (documented in `backtrace.c`) that if the `FILE *file`
     argument is `NULL` then look for the `opal_stacktrace_output_fileno`
     variable to tell you which file descriptor to use for output.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-26 11:55:32 -06:00
Joshua Hursey
f8918e37a9 opal/stacktace: Raise the signal after processing
- This prevents us for accidentally masking a signal that was meant to
   terminate the application.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-26 11:55:28 -06:00
Nathan Hjelm
fe1c6bd881 Merge pull request #2840 from hjelmn/event_fix
verbs: remove extra event user increment/decrement operation
2017-01-26 07:30:24 -08:00
Ralph Castain
7b6e36ee51 Merge pull request #2844 from rhc54/topic/sig
Cleanup launch
2017-01-26 05:05:23 -08:00
Howard Pritchard
3b82015a78 Merge pull request #2824 from gpaulsen/fix/ibm/nbc_ireduce_anysrc
Fix for Ireduce + MPI_IN_PLACE
2017-01-26 06:03:39 -07:00
Ralph Castain
399de0738e Cleanup launch
Given that we only set OOB contact info from inside of events, or before we begin threaded operations (e.g., in the ess), allow set_contact_info to directly update the oob/base framework globals.

Correct the nidmap regex decompression routine.

Ensure that rank=1 daemon always sends back its topology as this is the most common use-case.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 22:06:09 -08:00
Gilles Gouaillardet
896434b1bd pmix/ext2x: plug a memory leak in opal_lkupcbfunc()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-26 14:07:15 +09:00
Gilles Gouaillardet
6b8e1c217c pmix/ext2x: plug misc memory leaks
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-26 14:06:58 +09:00
Ralph Castain
6c6687116c Merge pull request #2826 from rhc54/topic/topo
Have rank=1 daemon always send its topology back as this is the most common use-case
2017-01-25 19:10:44 -08:00
Geoffrey Paulsen
045d0c5f4c Fix for Ireduce + MPI_IN_PLACE.
Fixes a wrong answer from MPI_Ireduce when the red_sched_chain()
path was taken (which only happens for np<=4 and mesgsize>=64k).

The way libnbc treats MPI_IN_PLACE is to set sbuf == rbuf, and
whether an algorithm will work cleanly or not after that depends on the
details.

In this case the last steps of the algorithm amounted to
    (right neighbor is sending us reduction results from ranks 1..n-1)
    recv into rbuf from right neighbor
    add the contribution from our sbuf into rbuf
this would be fine in general, but if sbuf==rbuf, that recv overwrites
the sbuf. I changed it to recv into a tmpbuf if MPI_IN_PLACE was used.

Signed-off-by: Geoffrey Paulsen <gpaulsen@us.ibm.com>
2017-01-25 18:08:08 -08:00
Nathan Hjelm
9f28c0af39 verbs: remove extra event user increment/decrement operation
Since the oob and connections systems do not work the same way they
did in older versions of Open MPI these operations are no longer
necessary. At best they do nothing and at worst they hurt performance
by making us enter the event library more often in opal_progress().

Fixes #2839

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-01-25 18:37:06 -07:00
Gilles Gouaillardet
d3a5065288 Merge pull request #2815 from ggouaillardet/topic/opal_tsd_keys_destruct
opal/threads: protect opal_tsd_keys_destruct() to fix Java bindings.
2017-01-26 09:24:14 +09:00
Ralph Castain
a7b8190fdc Per f2f meeting: if async modex is given, default to no MPI init barrier, letting the user override that if desired.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 10:13:53 -08:00
Ralph Castain
2f4e87eae9 Have rank=1 daemon always send its topology back as this is the most common use-case
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 09:33:11 -08:00
Jeff Squyres
230bbc597d plm base: make sure to assign "node" early enough
Make sure to assign "node" before using it in ORTE_FLAG_SET.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-01-25 08:02:59 -08:00
Ralph Castain
e7323fdd93 Merge pull request #2823 from rhc54/topic/oob4
Cleanup some code so it is clear that it is executing in an event. En…
2017-01-25 07:48:31 -08:00
Ralph Castain
184ccc8e91 Cleanup some code so it is clear that it is executing in an event. Ensure that peer event base is properly set on incoming connections
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 06:55:11 -08:00
Edgar Gabriel
4e06b96701 Merge pull request #2800 from edgargabriel/pr/sharedfp-append-fix
Pr/sharedfp append fix
2017-01-25 08:01:04 -06:00
Gilles Gouaillardet
142b95df87 pmix/ext2x: plug misc memory leaks regarding opal_pmix2x_event_chain_t handling
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-25 16:17:10 +09:00
Gilles Gouaillardet
7a3d39f079 pmix/ext2x: plug a memory leak in _reg_nspace()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-25 16:17:01 +09:00
Gilles Gouaillardet
ef10d3fd7b orte: add missing include file
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-25 16:15:20 +09:00
Ralph Castain
186059cc00 Merge pull request #2803 from rhc54/topic/host
Revamp -host and -cpu-list options per f2f meeting
2017-01-24 18:37:24 -08:00
Gilles Gouaillardet
e1811cfe17 opal/threads: protect opal_tsd_keys_destruct() to fix Java bindings.
When Java bindings are used, MPI_Init() is not invoked
by the main thread, and this causes some keys being destructed twice.
Reset the per thread values to NULL in order to correctly handle this

Fixes open-mpi/ompi#2811

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-25 10:58:55 +09:00
Joshua Hursey
a2d45f6e9f communicator: Fix uninitialized variable
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-24 16:46:13 -06:00
Joshua Hursey
0e9a06d2c3 orte/iof: Add app stderr to stdout redirection at source
* Add an MCA parameter to combine stdout and stderr at the source
   - `iof_base_redirect_app_stderr_to_stdout`
 * Aids in user debugging when using libraries that mix stderr with stdout

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-24 16:23:48 -06:00
Joshua Hursey
dcd9801f7c orte/iof: Add orte_map_stddiag_to_stdout option
* Similar to `orte_map_stddiag_to_stderr` except it redirects `stddiag`
   to `stdout` instead of `stderr`.
 * Add protection so that the user canot supply both:
   - `orte_map_stddiag_to_stderr`
   - `orte_map_stddiag_to_stdout`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-24 16:22:59 -06:00
Zhi Ming Wang
9718bbac82 Fix a minor error at MPI_AINT_DIFF.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-24 16:06:14 -06:00
Joshua Hursey
d6b306d716 libevent/external: Add opal_event_include to this component
* Adds a parameter to adjust the method used by libevent.
   - Matches that of the libevent2022 component.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-24 16:03:09 -06:00
Mark Allen
a3452adfa9 coll/base: Allgatherv MPI_IN_PLACE Bug
MPI_Allgatherv with MPI_IN_PLACE reads data from wrong location.

They were locating the MPI_IN_PLACE send buffer as
```c
         send_buf = (char*)rbuf;
         for (i = 0; i < rank; ++i) {
             send_buf += ((ptrdiff_t)rcounts[i] * extent);
         }
```
when it should be
```c
         send_buf = (char*)rbuf;
         send_buf += ((ptrdiff_t)disps[rank] * extent);
```
because disps[] specifies where things are in the v-style buffers.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-24 15:52:36 -06:00
Ralph Castain
ef86707fbe Deprecate the --slot-list paramaeter in favor of --cpu-list. Remove the --cpu-set param (mark it as deprecated) and use --cpu-list instead as it was confusing having the two params. The --cpu-list param defines the cpus to be used by procs of this job, and the binding policy will be overlayed on top of it.
Note: since the discovered cpus are filtered against this list, #slots will be set to the #cpus in the list if no slot values are given in a -host or -hostname specification.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-24 13:33:22 -08:00
Ralph Castain
0bfdc0057a Extend the -host:N syntax to accept "*" or "auto" to indicate "auto-detect the #cpus and set #slots to that value"
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-24 10:21:01 -08:00
Ralph Castain
d3907dec98 Make master continue the -host behavior of prior releases: use of -host <foo> specifies a single slot. Requests to run more than one process will require either specifying slots using the "-host foo:N" syntax, or adding --oversubscribe to the cmd line.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-24 10:11:56 -08:00
Edgar Gabriel
cbb3cb9745 fs/ufs: avoid using the exclusive flag with shared file pointer
when a file is opened a second time for shared file pointer operations,
avoid setting the create and exclusive flag.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-01-24 12:11:29 -06:00
Edgar Gabriel
f5289a1803 common/ompio: store correctly the SHAREDFP_IS_SET flag
it looks like disabling the lazy_open flag for sharedfp components
revealead a bug that lead to a crash in file_close in some tests. Make
sure the SHAREDFP_IS_SET flag is correctly set (and not overwritten again),
and we use that to avoid a double-free of the communicator.

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
2017-01-24 12:09:56 -06:00
Josh Hursey
c6595c2289 Merge pull request #2792 from jjhursey/topic/libevent-conf2
libevent2022: Fix broken configure AC_LANG_PROGRAM
2017-01-24 08:31:46 -06:00
Ralph Castain
4e9364b9a4 Merge pull request #2794 from rhc54/topic/regs
Next step in reducing launch time
2017-01-24 03:19:57 -08:00
Gilles Gouaillardet
682f5116aa Merge pull request #2781 from ggouaillardet/topic/misc_fixes_and_plugs
fix misc bugs and plug misc memory leaks
2017-01-24 14:41:45 +09:00
Ralph Castain
86ab751c5e Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-23 19:54:47 -08:00