1
1
Граф коммитов

4257 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
5708872112 Implement support for "local" range when publishing data
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 2d54f7e0dd3a47260b0b2634aae3361316005933)
2017-09-18 19:34:08 -07:00
Josh Hursey
252be7ffb0 Merge pull request #4215 from jjhursey/fix/plm-lsf-rc
plm/lsf: Improve error message if lsb_launch fails
2017-09-18 11:14:25 -05:00
Ralph Castain
3c914a7a97 Complete the fix of the ORTE DVM. We will now use "prun" instead of "orterun -hnp foo" to execute jobs. This provides the feature of automatic discovery of the orte-dvm so you don't need to manually enter URI's or contact file locations. All IO is forwarded to prun.
Still in the "needs to be done" category:

* mapping/ranking/binding options aren't correctly supported

* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-16 13:13:07 -07:00
Joshua Hursey
89c1aaf646 plm/lsf: Improve error message if lsb_launch fails
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-09-15 09:45:58 -05:00
Ralph Castain
7c7d8a69a0 Backport changes from PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
anandhi
4d7de8882f Checking for generic transport "fabric" in mca parameter rml_ofi_transports
to choose the first available non-socket provider.
	modified:   orte/mca/rml/ofi/rml_ofi_component.c
	modified:   orte/mca/rml/ofi/rml_ofi_send.c

Signed-off-by: Anandhi Jayakumar <anandhi.s.jayakumar@intel.com>
2017-09-12 15:39:55 -07:00
Ralph Castain
3477079804 Repair the ORTE DVM
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-11 17:38:21 -07:00
Howard Pritchard
5db9416724 rml/ofi: swat a compiler warning
On the path to -Werror passing builds!

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-08-30 09:16:49 -06:00
Josh Hursey
ad87aa2674 Merge pull request #4121 from jjhursey/explore/dlopen-local
mca: Dynamic components link against project lib
2017-08-25 13:15:51 -05:00
Joshua Hursey
e1d079544b mca: Dynamic components link against project lib
* Resolves #3705
 * Components should link against the project level library to better
   support `dlopen` with `RTLD_LOCAL`.
 * Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
   with the appropriate project level library:
```
MCA components in ompi/
       $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
       $(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
       $(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
       $(top_builddir)/oshmem/liboshmem.la"
```

Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-08-24 11:56:16 -04:00
Ralph Castain
68029b27e4 Fix the orte-dvm operations so that orterun can connect and execute an application. There is a lingering problem, though. The first invocation of orterun succeeds every time. However, subsequent invocations have a high probability of hanging in the OOB connection handshake.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-23 17:31:08 -07:00
Ralph Castain
d80b0c7990 If the HWLOC shared memory system is unable to connect, then fallback to providing the topology via XML. Do not automatically provide the XML to every process as that defeats the purpose of the shared memory system. Instead, use PMIx_Query_info_nb to get the info from the server when required.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-22 18:12:26 -07:00
Brice Goglin
046d870124 rtc/hwloc/shmem: add Inria copyrights
The code for finding the hole for the shmem region actually came from me.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 23:09:57 +02:00
Brice Goglin
baf762d99d rtc/hwloc/shmem: dump /proc/self/maps if failed to find a hole and verbosity > 4
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 19:57:38 +02:00
Brice Goglin
8f6afbb641 rtc/hwloc/shmem: fix "heap" hole search kind
There can be multiple [heap] consecutively in proc/<pid>/maps,
and there's no room between them.
Don't use a hole after the first [heap] is there's another [heap]
immediately after it.

This code would fail to find the last [heap] if there were multiple
[heap] interleaved with non-heap VMA, but our kind "after heap"
wouldn't be meaningful anymore anyway.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 15:42:38 +02:00
Brice Goglin
b8b46b253b rtc/hwloc/shmem: fix "libs" hole search kind
We want the biggest hole *between* heap and stack, not outside.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 15:40:36 +02:00
Ralph Castain
d515f48885 The local PMIx server is notifying its clients of all events, but for some reason I don't recall, the broadcast notification was marked for delivery only to non-default event handlers. This creates a discrepancy between the two behaviors, so don't restrict the broadcast notifications.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-18 17:26:11 -07:00
Ralph Castain
088b6cdeee Silence coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-17 09:49:35 -07:00
Ralph Castain
65fb6070d9 Update tool support by adding MCA params to direct orted's to drop
session and/or system-level tool rendezous files. Ensure PMIx is
enabled for tools

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-15 17:49:47 -07:00
Artem Polyakov
10d6e90bf5 Revert "plm/rsh: Propagate PMIx prefix to orted's"
This reverts commit 71da0fcbef.
(per https://github.com/open-mpi/ompi/pull/4052).
Refs: https://github.com/open-mpi/ompi/issues/3980

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-08-14 21:37:57 +07:00
Ralph Castain
d1b7c3d8d5 Silence some compile-time warnings. Update scripts now that AUTHORS is gone
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-04 20:08:31 -07:00
Ralph Castain
88a7c9dca3 Merge pull request #4013 from rhc54/topic/hwloc
Silence warning on Mac - we know Mac doesn't support hwloc, and so it…
2017-08-03 15:52:44 -06:00
Howard Pritchard
897c62756b Merge pull request #3999 from hppritcha/topic/slurmd_controls_them_all
SLURM: launch all processes via slurmd
2017-08-03 15:33:44 -06:00
Gilles Gouaillardet
6b6e65a5bc rtc/hwloc: fix MCA parameter handling
always re-initialize vmhole *before* mca_base_component_var_register()
otherwise the vmhole gets NULL'ified if orte is initialized a second time.
that typically occurs when Open MPI is configure'd with --disable-dlopen
and the app does MPI_T_init_thread(); MPI_T_finalize(); MPI_T_init_thread();

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-08-03 14:45:43 +09:00
Ralph Castain
9f926b8083 Silence warning on Mac - we know Mac doesn't support hwloc, and so it doesn't matter if a VM hole isn't found. It also doesn't matter in general as all it really means is that we have to turn the hwloc shmem support "off".
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-02 20:20:45 -06:00
Howard Pritchard
d08be74573 SLURM: launch all processes via slurmd
It turns out that the approach of having the HNP do the
fork/exec of MPI ranks on the head node in a SLURM environment
introduces problems when users/sysadmins want to use the SLURM
scancl tool or sbatch --signal option to signal a job.

This commit disables use of the HNP fork/exec procedure when
a job is launched into a SLURM controlled allocation.

update NEWS with a blurb about new ras framework mca parameter.

related to #3998

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2017-08-02 14:56:55 -06:00
Artem Polyakov
71da0fcbef plm/rsh: Propagate PMIx prefix to orted's
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-08-02 08:06:13 +03:00
Ralph Castain
f39ce67982 Merge pull request #3951 from rhc54/topic/hwloc2
Update to hwloc 2.0.0a
2017-08-01 15:18:31 -06:00
Ralph Castain
e94786f4b7 Revert "Check for OPAL_PREFIX and set corresponding PMIX_PREFIX if found"
This reverts commit 3744967adb.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-01 08:14:12 -06:00
Ralph Castain
3744967adb Check for OPAL_PREFIX and set corresponding PMIX_PREFIX if found
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-31 09:14:01 -06:00
Boris Karasev
e20b581529 pmix: fixed immediate request
This commit fixes a hang when using external PMIx v1 module

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2017-07-28 15:53:48 +06:00
Ralph Castain
7a83fdb9bb Update to hwloc 2.0.0a with shmem support.
Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-25 20:26:22 -07:00
Ralph Castain
0042c758f1 Update the tools support so it allows tools to access PMIx
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-25 17:10:08 -07:00
Ralph Castain
b225366012 Bring the ofi/rml component online by completing the wireup protocol for the daemons. Cleanup the current confusion over how connection info gets created and
passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors.

Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.

Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi
ll be beneficial at large scales. Leave it "off" by default.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-20 21:01:57 -07:00
Gilles Gouaillardet
60aa9cfcb6 hwloc: add support for hwloc v2 API
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-20 17:39:44 +09:00
Gilles Gouaillardet
9f29f3bff4 hwloc: since WHOLE_SYSTEM is no more used, remove useless
checks related to offline and disallowed elements

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-20 17:39:21 +09:00
Gilles Gouaillardet
1a34224948 hwloc: do not set the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-20 17:39:16 +09:00
Ralph Castain
fca68b070b Merge pull request #3934 from rhc54/topic/singleton
Fix the isolated pmix component. Cleanup the ess/singleton component …
2017-07-19 16:02:37 -05:00
Ralph Castain
543c16b28d Fix the isolated pmix component. Cleanup the ess/singleton component - we shouldn't be automatically discovering the local topology as that is now done on-demand.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-19 12:14:29 -07:00
Geoffrey Paulsen
71333a4b14 Transitioning ownership of rmaps/seq and rmaps/rank_file from Intel to IBM. 2017-07-18 21:31:01 -04:00
Gilles Gouaillardet
da34e2f109 ess/base: silence a warning
by fixing a static initializer

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-19 09:30:53 +09:00
Ralph Castain
8a98aab6cc Fix signal forwarding on ORTE daemons so that _all_ daemons do it, regardless of environment. Add missing support for SIGTSTP and a few others.
Thanks to Eugene Dedits for reporting the problem.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-18 09:58:55 -07:00
Jeff Squyres
ccf17808b6 Merge pull request #3258 from markalle/pr/symbol_name_pollution
symbol name pollution
2017-07-12 16:19:25 -05:00
Artem Polyakov
832f1b03a4 Merge pull request #3790 from artpol84/orte/iof_sbatch
orte/iof: Address the case when output is a regular file
2017-07-12 09:38:01 -05:00
Gilles Gouaillardet
626e94b689 oob/tcp: make mca_oob_tcp_msg_type_t an uint8_t
so no conversion is required when heterogeneous mode is enabled

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-12 10:27:45 +09:00
Mark Allen
552216f9ba scripted symbol name change (ompi_ prefix)
Passed the below set of symbols into a script that added ompi_ to them all.

Note that if processing a symbol named "foo" the script turns
    foo  into  ompi_foo
but doesn't turn
    foobar  into  ompi_foobar

But beyond that the script is blind to C syntax, so it hits strings and
comments etc as well as vars/functions.

    coll_base_comm_get_reqs
    comm_allgather_pml
    comm_allreduce_pml
    comm_bcast_pml
    fcoll_base_coll_allgather_array
    fcoll_base_coll_allgatherv_array
    fcoll_base_coll_bcast_array
    fcoll_base_coll_gather_array
    fcoll_base_coll_gatherv_array
    fcoll_base_coll_scatterv_array
    fcoll_base_sort_iovec
    mpit_big_lock
    mpit_init_count
    mpit_lock
    mpit_unlock
    netpatterns_base_err
    netpatterns_base_verbose
    netpatterns_cleanup_narray_knomial_tree
    netpatterns_cleanup_recursive_doubling_tree_node
    netpatterns_cleanup_recursive_knomial_allgather_tree_node
    netpatterns_cleanup_recursive_knomial_tree_node
    netpatterns_init
    netpatterns_register_mca_params
    netpatterns_setup_multinomial_tree
    netpatterns_setup_narray_knomial_tree
    netpatterns_setup_narray_tree
    netpatterns_setup_narray_tree_contigous_ranks
    netpatterns_setup_recursive_doubling_n_tree_node
    netpatterns_setup_recursive_doubling_tree_node
    netpatterns_setup_recursive_knomial_allgather_tree_node
    netpatterns_setup_recursive_knomial_tree_node
    pml_v_output_close
    pml_v_output_open
    intercept_extra_state_t
    odls_base_default_wait_local_proc
    _event_debug_mode_on
    _evthread_cond_fns
    _evthread_id_fn
    _evthread_lock_debugging_enabled
    _evthread_lock_fns
    cmd_line_option_t
    cmd_line_param_t
    crs_base_self_checkpoint_fn
    crs_base_self_continue_fn
    crs_base_self_restart_fn
    event_enable_debug_output
    event_global_current_base_
    event_module_include
    eventops
    sync_wait_mt
    trigger_user_inc_callback
    var_type_names
    var_type_sizes

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-07-11 02:13:23 -04:00
Mark Allen
efc25168cd symbol name pollution: making some vars static
As part of addressing symbol name pollution, I'm switching a few
vars/functions to static.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-07-11 02:13:22 -04:00
Gilles Gouaillardet
823382f5d7 plm/base: do not abort when configure'd with --enable-heterogeneous
and a mix of BE/LE is detected

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-07 10:43:54 +09:00
Ralph Castain
2a580fa71e Merge pull request #3801 from rhc54/topic/hetero
Detect that we have a mix of BE/LE in the system
2017-07-06 15:29:06 -07:00
Ralph Castain
8979bfe71e Silence Coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-06 06:07:28 -07:00