Commit Graph

3029 Commits

Author SHA1 Message Date
Ralph Castain
24b91839aa Ensure the process knows its local cpuset early enough to perform the locality computation
This commit was SVN r28221.
2013-03-26 19:14:23 +00:00
Ralph Castain
6ee32767d4 Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument.
This commit was SVN r28219.
2013-03-26 18:27:50 +00:00
Ralph Castain
2f43989d22 Add debug and handle the use-case where someone (a) uses a hostfile while in a managed allocation to sub-allocate runs, and (b) includes the HNP's node in one of those hostfiles.
cmr:v1.7

This commit was SVN r28203.
2013-03-22 00:53:33 +00:00
Jeff Squyres
63d17ce901 Fix CID 968581: ensure that the string read from the socket is always
\0-terminated so that strlen() and strstr() can be used without fear.
Also fix some insignificant mem leaks (which is somewhat moot, because
as soon as we leave those error conditions, the process will be
terminating, but what the heck, might as well fix these while I was in
the file for the \0-termination issue...).

This commit was SVN r28199.
2013-03-21 16:05:50 +00:00
Jeff Squyres
562db0dd11 Fix CID 741328: remove some dead code
This commit was SVN r28192.
2013-03-21 11:15:06 +00:00
Ralph Castain
fa13d27238 Avoid double-release in error path
This commit was SVN r28190.
2013-03-20 21:00:59 +00:00
Ralph Castain
147c6ff9e7 Clean out the cruft leftover from the use_common_ports experiment
cmr:v1.7

This commit was SVN r28184.
2013-03-20 15:07:43 +00:00
Ralph Castain
a4b6fb241f Remove all remaining vestiges of the Windows integration
This commit was SVN r28137.
2013-02-28 17:31:47 +00:00
Ralph Castain
cf9796accd Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes
This commit was SVN r28134.
2013-02-28 01:35:55 +00:00
Ralph Castain
347df93cd4 Handle the case of someone specifying a directory for the application. Ensure we get a non-zero exit status and clarify the error message.
cmr:v1.7

This commit was SVN r28119.
2013-02-27 01:36:21 +00:00
Ralph Castain
f36312ee6f Continue cleanup - this time, start working on the "without full support" flags in ORTE. Remove no-longer-needed configure.m4 files from the ess and errmgr. In the former case, since all priorities are now the same (given the removal of the cnos component), configure priorities are no longer required.
This commit was SVN r28118.
2013-02-26 21:27:48 +00:00
Ralph Castain
74a3ece313 Remove unused component
This commit was SVN r28117.
2013-02-26 20:58:43 +00:00
Ralph Castain
8d2fa3693b First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package.
This commit was SVN r28116.
2013-02-26 20:44:56 +00:00
Ralph Castain
bd9265c560 Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data.
A few changes were required to support this move:

1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution).

2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time.

3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you separately specify the "store" and "fetch" ordering.

4. the ORTE grpcomm and ess/pmi components, and the nidmap code, were updated to work with the new db framework and to specify internal/global storage options.

No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework.

This commit was SVN r28112.
2013-02-26 17:50:04 +00:00
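
The fetch ordering described in item 3 of the r28112 message above can be sketched as follows. This is a minimal Python illustration, not the OPAL C code: the class names, the `fetch` helper, the caching of globally fetched values, and the `(jobid << 32) | vpid` packing are all assumptions made for the example. The commit itself only says that the identifier is an opaque uint64_t and that internal storage is consulted before the global system.

```python
class HashStore:
    """Hypothetical stand-in for the internal hash component."""

    def __init__(self):
        self.data = {}

    def store(self, ident, key, value):
        self.data[(ident, key)] = value

    def fetch(self, ident, key):
        return self.data.get((ident, key))


class GlobalStore:
    """Hypothetical stand-in for a global key-value system such as PMI."""

    def __init__(self, remote):
        self.remote = dict(remote)
        self.lookups = 0  # counts how often the global system is consulted

    def fetch(self, ident, key):
        self.lookups += 1
        return self.remote.get((ident, key))


def make_identifier(jobid, vpid):
    # Assumed packing, for illustration only: the real identifier is opaque.
    return ((jobid & 0xFFFFFFFF) << 32) | (vpid & 0xFFFFFFFF)


def fetch(ident, key, local, global_store):
    # Check internal storage first.
    value = local.fetch(ident, key)
    if value is not None:
        return value
    # Fall back to the global system and remember the answer locally
    # (the caching step is an assumption of this sketch).
    value = global_store.fetch(ident, key)
    if value is not None:
        local.store(ident, key, value)
    return value
```

In this sketch a second `fetch` for the same identifier and key is served from `HashStore` without touching `GlobalStore`.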
Ralph Castain
c0b670bea8 I guess some profiling tools and debuggers require that the argv[0] of each rank be unique so they can create a filename based on that value. For those obscure cases, provide an mpirun cmd line option that indexes each argv[0] by rank
This commit was SVN r28064.
2013-02-15 20:20:49 +00:00
Ralph Castain
b9897267ef Cleanup report-bindings so it always reports the actual binding instead of what was requested. Ensure we don't report twice if it is an MPI process being launched.
This commit was SVN r28057.
2013-02-14 17:24:28 +00:00
Ralph Castain
744ed49b2d Begin cleanup of the thread_lock calls in ORTE. We'll ignore the ones in the rml/oob for now as that code block is being rewritten anyway.
This commit was SVN r28053.
2013-02-13 01:53:12 +00:00
Ralph Castain
b360156a37 Extend print coverage to all types
This commit was SVN r28012.
2013-02-01 14:21:06 +00:00
Ralph Castain
53e0ed71b0 Disqualify slurm module even if slurm support was configured into the build if we don't have an allocation and haven't enabled dynamic allocations
This commit was SVN r27995.
2013-01-31 18:15:47 +00:00
Ralph Castain
166f512924 Add some useful debug to the heartbeat sensor
This commit was SVN r27994.
2013-01-31 18:01:13 +00:00
Ralph Castain
ca9605773b If sensors are enabled, then the daemons need to have their proc->node field linked to their local node object
This commit was SVN r27991.
2013-01-31 16:38:57 +00:00
Ralph Castain
c87fa68f9b Cleanup the resource usage sensor, letting the db handle any printing requests.
This commit was SVN r27990.
2013-01-31 15:20:56 +00:00
Ralph Castain
9625757a71 Add new database component for printing "add_log" info
This commit was SVN r27989.
2013-01-31 15:19:39 +00:00
Ralph Castain
8e8e95ca6b Silence error report - just because someone only defines ipv4 static ports doesn't make a fatal error
This commit was SVN r27976.
2013-01-29 23:48:22 +00:00
Jeff Squyres
8e25b927ab Clean some minor warnings: remove variables that were set but never
used.

This commit was SVN r27974.
2013-01-29 23:35:42 +00:00
Ralph Castain
112f8eedb1 Handle the case where rankfile is providing the allocation
This commit was SVN r27971.
2013-01-29 20:37:58 +00:00
Nathan Hjelm
666bd826dc fix alps configury
This commit was SVN r27962.
2013-01-29 15:44:30 +00:00
Brian Barrett
b8442ba505 Revamp the handling of wrapper compiler flags. The user flags, main configure
flags, and mca flags are kept separate until the very end.  The main configure
wrapper flags should now be modified by using the OPAL_WRAPPER_FLAGS_ADD
macro.  MCA components should either let <framework>_<component>_{LIBS,LDFLAGS}
be copied over OR set <framework>_<component>_WRAPPER_EXTRA_{LIBS,LDFLAGS}.
The situations in which WRAPPER CPPFLAGS can be set by MCA components were
made very small to match the one use case where it makes sense.

This commit was SVN r27950.
2013-01-29 00:00:43 +00:00
Ralph Castain
cfaefb3286 Remove the only place where PMI was used outside a component, and relocate that code to common/pmi.
This commit was SVN r27944.
2013-01-28 20:14:51 +00:00
Brian Barrett
f42783ae1a Move the RTE framework change into the trunk. With this change, all non-CR
runtime code goes through one of the rte, dpm, or pubsub frameworks.

This commit was SVN r27934.
2013-01-27 23:25:10 +00:00
Ralph Castain
6eaf601ae6 Good ol' Cray changed the way node/cpu allocation is handled in their latest release of ALPS, and so our allocator is broken. Adjust for the revised method, but preserve the older method for those Cray users who have not updated their system.
cmr:v1.7

This commit was SVN r27911.
2013-01-25 21:53:31 +00:00
Ralph Castain
f6b4db0b79 Fix rank_file operations. We changed the syntax to use semi-colons between multiple slot assignments so that we could use the comma to separate specific cores, but somehow the flex definitions didn't get updated to accept that character. We also incorrectly zero'd the bitmap between slot assignment sections, and so multiple slot assignments only wound up making the last one in the list.
This commit was SVN r27908.
2013-01-25 18:33:25 +00:00
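
The bitmap bug fixed in r27908 above can be illustrated with a small parser. This is a hedged Python sketch, not the flex-based rank_file code: the `socket:core[,core...]` section format and the function name `parse_slot_list` are assumptions chosen for the example. The commit only states that semicolons separate slot assignments and commas separate cores.

```python
def parse_slot_list(spec):
    """Return the set of (socket, core) pairs named in a slot string.

    Assumed format: sections of the form "socket:core[,core...]",
    joined by ';'.
    """
    # The bitmap is created once, before the loop. Clearing it inside the
    # loop (the behaviour the commit describes as a bug) would keep only
    # the cores from the last section.
    bitmap = set()
    for section in spec.split(";"):
        socket, cores = section.split(":")
        for core in cores.split(","):
            bitmap.add((int(socket), int(core)))
    return bitmap
```

With this sketch, `parse_slot_list("0:0,1;1:2")` accumulates cores from both sections instead of only the final one.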
Ralph Castain
2504da1ac9 Remove stale code - message arrival time doesn't really mean much anymore.
This commit was SVN r27905.
2013-01-24 23:02:02 +00:00
Ralph Castain
9bfb2b989b Silence warning
This commit was SVN r27901.
2013-01-24 19:38:51 +00:00
Ralph Castain
4b310473a1 Correct the computation of the daemon vpid
cmr:v1.7

This commit was SVN r27899.
2013-01-24 18:04:53 +00:00
Ralph Castain
b403ca5bd8 Silence warning
This commit was SVN r27897.
2013-01-23 22:17:08 +00:00
Ralph Castain
4d34d30a97 Silence warning
This commit was SVN r27896.
2013-01-23 22:16:48 +00:00
Ralph Castain
a591fbf06f Add initial support for dynamic allocations. At this time, only Slurm supports the new capability, which will be included in an upcoming release.
Add hooks for supporting dynamic allocation and deallocation to support application-driven requests and fault recovery operations.

This commit was SVN r27879.
2013-01-20 00:33:42 +00:00
Ralph Castain
e4673f3283 Add new job state
This commit was SVN r27878.
2013-01-20 00:30:27 +00:00
Ralph Castain
73387e50e2 Add missing variable def - thanks to Paul Hargrove for spotting.
This commit was SVN r27865.
2013-01-18 14:32:53 +00:00
Ralph Castain
54266837e9 Remove use of param_find function as that function will be disappearing
This commit was SVN r27831.
2013-01-15 19:50:38 +00:00
Ralph Castain
aea6787918 Add new routed component with self-healing connections - based on radix component - for use in monitoring system
This commit was SVN r27757.
2013-01-08 04:40:35 +00:00
Ralph Castain
c9a596b487 Remove unused var
This commit was SVN r27756.
2013-01-08 04:39:30 +00:00
Ralph Castain
beddf3b379 Add required rml tag
This commit was SVN r27751.
2013-01-05 06:32:20 +00:00
Ralph Castain
bee8bf5d8f Update the sensor framework to report stats back to the HNP if requested by including the data in heartbeats.
This commit was SVN r27748.
2013-01-05 06:30:20 +00:00
Ralph Castain
c71e119bbb Extend the db framework to add support for logging data to databases without duplicating all the modex-related storage.
This commit was SVN r27746.
2013-01-05 06:28:09 +00:00
George Bosilca
34eecb8956 Be more explicit about the operation (store or update). Complain loudly
if something goes wrong.

This commit was SVN r27743.
2013-01-04 20:47:25 +00:00
Ralph Castain
cc29f8ff95 Attempt to fix the stupid Cray PMI problem
This commit was SVN r27742.
2013-01-04 02:53:42 +00:00
Nathan Hjelm
6a9ab9b221 Change orte_startup_timeout to be in seconds and remove the 10 second maximum
This commit was SVN r27741.
2013-01-03 23:56:34 +00:00
Ralph Castain
c65de32218 Cleanup the PMI subsystems to support Sam's "rml-less" shared memory wireup. Only retrieve keys that are specifically requested, and only when they are requested. Let string values be segmented across multiple keys, but don't do it for anything else.
This commit was SVN r27737.
2013-01-03 02:16:10 +00:00
Ralph Castain
d1163ebbf2 Ensure we cleanup DFS worker threads during finalize to avoid segfaulting in MCA param cleanup
This commit was SVN r27723.
2012-12-25 21:17:35 +00:00
Ralph Castain
c5ba59ba67 Remove stale component
This commit was SVN r27684.
2012-12-18 04:01:16 +00:00
Ralph Castain
0427a478b2 Remove stale component
This commit was SVN r27683.
2012-12-18 04:00:51 +00:00
Ralph Castain
82f1ba0ea8 Fix static port usage, ensure that both ipv4 and ipv6 are given if ipv6 was enabled
This commit was SVN r27682.
2012-12-18 03:59:49 +00:00
Ralph Castain
2fdd367aa9 Refs trac:3429
Fix bug reported by FreyGuy19713: in cases where HNP node has multiple entries in a hostfile or other allocation, we need to track the total slots allocated to that node.

This commit was SVN r27673.

The following Trac tickets were found above:
  Ticket 3429 --> https://svn.open-mpi.org/trac/ompi/ticket/3429
2012-12-14 17:00:44 +00:00
Ralph Castain
1e92aa2b66 Enable multiple worker threads for processing DFS requests
This commit was SVN r27659.
2012-12-09 02:54:19 +00:00
Ralph Castain
c26ed7dcdd Fix comm_spawn when ORTE progress thread is enabled by ensuring that all operations on the global list of active collectives are done in events to avoid conflicts.
This commit was SVN r27658.
2012-12-09 02:53:20 +00:00
Nathan Hjelm
3e1b13b13a Re-add support for old flex (2.5.4a and earlier) while still cleaning up properly in new flex.
This commit was SVN r27657.
2012-12-07 00:12:43 +00:00
Ralph Castain
1237f8db57 Extend the ras module interface to include the orte_job_t being allocated so that dynamic allocations can be supported
This commit was SVN r27627.
2012-11-23 13:50:10 +00:00
George Bosilca
994d1aba50 Nothing.
This commit was SVN r27626.
2012-11-21 20:07:20 +00:00
Ralph Castain
43f883cb42 Add some more detailed error output to the db_hash component and nidmap code. Ensure the local nodename is included in the HNP's aliases
This commit was SVN r27622.
2012-11-18 17:57:19 +00:00
Ralph Castain
f2ec35536e Fix a bug that prevented MCA params from being forwarded to daemons upon launch
cmr:v1.7

This commit was SVN r27621.
2012-11-18 17:55:26 +00:00
Ralph Castain
e11f32038a Add an MCA param to retain all aliases based on IP addrs for node names so that procs can look them up by interface, if desired. If the param is set, pass aliases around to all daemons and procs for local use
This commit was SVN r27619.
2012-11-16 04:04:29 +00:00
Ralph Castain
3cecc1569b Fix segfault if no file_maps were pushed
This commit was SVN r27612.
2012-11-15 15:39:17 +00:00
Ralph Castain
fe6dfad625 Update DFS to support multi-node operations
This commit was SVN r27594.
2012-11-12 02:54:53 +00:00
Ralph Castain
a6325e4546 Silence compiler warning
This commit was SVN r27590.
2012-11-12 02:51:29 +00:00
Ralph Castain
26f1cd0909 Fix compiler warnings
This commit was SVN r27588.
2012-11-12 02:50:45 +00:00
Ralph Castain
bd887f7f56 Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations.
This commit was SVN r27587.
2012-11-10 14:09:12 +00:00
Ralph Castain
615cc66b44 Protect the HNP cleanup in cases where no session dirs are created
This commit was SVN r27585.
2012-11-10 14:03:07 +00:00
Nathan Hjelm
e0f5137e46 add prototypes for lex destroy functions
This commit was SVN r27580.
2012-11-09 22:00:27 +00:00
Nathan Hjelm
8658bbc902 instead of relying on yyterminate to clean up the lex context call the destroy functions directly (after closing the file)
This commit was SVN r27577.
2012-11-09 16:10:55 +00:00
Ralph Castain
9b729794f2 A prior commit apparently broke the trunk when something was inadvertently left behind - so remove a reference to a no-longer-existing function
This commit was SVN r27574.
2012-11-07 11:11:05 +00:00
Nathan Hjelm
7fb5caea92 Remove the finish_parsing function from various .l files. The function is incomplete (doesn't clean up the lex state) and should be replaced by *_yylex_destroy which correctly cleans up the state.
Checked with the flex 2.5.35. Verified with valgrind that this fixes several "still reachable" leaks.

cmr:v1.7

This commit was SVN r27571.
2012-11-06 19:26:14 +00:00
Nathan Hjelm
bdedd8b0d3 Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1.
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close as they are now symmetric in their functionality.

This commit was SVN r27570.
2012-11-06 19:09:26 +00:00
Brian Barrett
e61c00212d Add files found in svn but not tarball
This commit was SVN r27549.
2012-11-01 02:27:03 +00:00
Nathan Hjelm
2acd0f83de Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter".
It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occurring before the command line parser changes but it appears to be resolved now.

This commit was SVN r27527.

The following SVN revision numbers were found above:
  r27451 --> open-mpi/ompi@d59034e6ef
  r27456 --> open-mpi/ompi@ecdbf34937
2012-10-30 19:45:18 +00:00
Nathan Hjelm
df9bd0ed59 fix bug in plm/rsh that could add extraneous mca options to the orted argv
cmr:v1.7

This commit was SVN r27526.
2012-10-30 19:40:04 +00:00
Ralph Castain
a080de188f Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists.
This commit was SVN r27516.
2012-10-29 23:11:30 +00:00
Ralph Castain
e5e72c3137 Expand the dfs API to support retrieval, loading and purging of file maps.
This commit was SVN r27515.
2012-10-29 23:05:45 +00:00
Ralph Castain
4e52a15e70 Provide for sync on seek and close DFS operations. Eliminate an unnecessary wake-up timer when using ORTE progress thread
This commit was SVN r27500.
2012-10-26 15:49:04 +00:00
Ralph Castain
4ef30c016b Remove stale windows references
This commit was SVN r27491.
2012-10-26 01:19:14 +00:00
Ralph Castain
df642f1508 Add an API to get a remote file's size. Separate dfs cmds from returned data messages so daemons don't get confused.
This commit was SVN r27487.
2012-10-25 22:23:08 +00:00
Ralph Castain
094d6f3143 Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them.
Add a set of URI create/parse utilities

This commit was SVN r27483.
2012-10-25 17:15:17 +00:00
Ralph Castain
32c185f730 Set a priority for output of forwarded IO so it can effectively compete against inbound messages
This commit was SVN r27480.
2012-10-24 23:34:50 +00:00
Ralph Castain
e06c330635 Add the ability to set a backlog limit on forwarded output waiting at mpirun - helps to avoid crashing systems during debug. Note that we default to "unlimited" to maintain current behavior.
This commit was SVN r27479.
2012-10-24 23:21:40 +00:00
Ralph Castain
e6014bf2e1 Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter
This commit was SVN r27477.

The following SVN revision numbers were found above:
  r27451 --> open-mpi/ompi@d59034e6ef
  r27456 --> open-mpi/ompi@ecdbf34937
2012-10-24 18:38:44 +00:00
Ralph Castain
7574d6673b If someone provides the launch_agent cmd, then don't prefix it
cmr:v1.7

This commit was SVN r27473.
2012-10-24 16:14:04 +00:00
Ralph Castain
5c0534a7ad Ensure that comm_spawn launches procs on the nodes specified by add-host and add-hostfile
This commit was SVN r27452.
2012-10-18 00:40:44 +00:00
Nathan Hjelm
d59034e6ef MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions.
cmr:v1.7

This commit was SVN r27451.
2012-10-17 20:17:37 +00:00
Ralph Castain
4028ce7a5d Silence warnings by making types match
This commit was SVN r27446.
2012-10-14 03:45:28 +00:00
Ralph Castain
285a3b168d Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds.
This commit was SVN r27445.
2012-10-14 03:31:32 +00:00
Ralph Castain
04304c186f Remove the setup_hadoop configure script as it is no longer required - the hadoop support components can build without accessing hadoop itself.
This commit was SVN r27385.
2012-09-29 18:30:35 +00:00
Ralph Castain
54db4c35eb Get the trunk to build again when --without-hwloc is specified. Move a couple of key type definitions and utilities out from under the HAVE_HWLOC test so they are always available as they don't really depend on hwloc's presence. Tell two components not to build if hwloc is disabled:
ompi/mca/sbgp/basesmsocket
orte/mca/rmaps/lama

Remove stale configure.params files from the sbgp framework as the OMPI build system no longer looks at those files.

This commit was SVN r27377.
2012-09-26 23:24:27 +00:00
Samuel Gutierrez
42280e2af5 Temporarily make routed binomial the default. We are experiencing issues with
debruijn when launching fewer processes than are actually available within an
allocation. When this is fixed, please revert this change.

This commit was SVN r27376.
2012-09-26 16:08:12 +00:00
Jeff Squyres
cb65a44c6c Fix the component priority assignment. Thanks to Alex Margolin for
the patch.

This commit was SVN r27363.
2012-09-25 07:13:23 +00:00
George Bosilca
6ec41400b3 Fix the error message in case a daemon does not succeed at killing the
local offspring.

This commit was SVN r27362.
2012-09-24 15:25:21 +00:00
Ralph Castain
d5279b0dc8 Make an attempt to protect hwloc cset2str from segfaulting in weird scenario
This commit was SVN r27361.
2012-09-23 16:51:51 +00:00
Ralph Castain
d95025f53a Ensure we clear the usage numbers when binding on multiple nodes so we don't "carry over" info from one node to the next. Use the same tracking mechanism for binding upwards and in-place to avoid doing a bunch of mallocs.
Refs trac:3322

This commit was SVN r27356.

The following Trac tickets were found above:
  Ticket 3322 --> https://svn.open-mpi.org/trac/ompi/ticket/3322
2012-09-20 15:16:06 +00:00
Ralph Castain
445161cd2e Correctly count the total number of allocated slots
This commit was SVN r27353.
2012-09-20 02:50:14 +00:00
Ralph Castain
f592967685 Add missing retain to maintain correct accounting on nodes
This commit was SVN r27352.
2012-09-20 02:30:53 +00:00
Ralph Castain
e309db0be9 Ensure file descriptors are closed upon completion of transfer
This commit was SVN r27349.
2012-09-18 18:39:29 +00:00
Ralph Castain
11305109e1 Track positioned files so we avoid re-positioning them across jobs
This commit was SVN r27347.
2012-09-18 15:56:21 +00:00
Ralph Castain
a3060cdd15 Fix the bind_downward code - it was incorrectly looking across the entire node instead of only looking below the locale to which the proc had been assigned. In other words, if the proc was mapped to a core, then the only hwthreads that should be considered for binding are those directly below that core. The binding algo was incorrectly looking at ALL hwthreads in that scenario, causing the proc to be bound to an HT outside of the mapped location.
This now results in the procs being bound within their assigned location. It also causes us to use only the 0th HT on a core unless --use-hwthread-cpus has been specified (in which case, we use all the HTs in a core). Bind to core binds you to all HTs regardless - the --use-hwthread-cpus only impacts the oversubscribed determination and when binding to HT.

cmr:v1.7

This commit was SVN r27342.
2012-09-14 22:01:19 +00:00
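
The corrected bind_downward behaviour from r27342 above can be sketched with a toy topology. This Python fragment is only an illustration under stated assumptions: the dictionary-based topology and the function name `candidate_hwthreads` are hypothetical and do not reflect the hwloc-based implementation.

```python
def candidate_hwthreads(topology, mapped_core, use_hwthread_cpus=False):
    """Return the hwthreads a proc mapped to `mapped_core` may bind to.

    topology: dict mapping a core id to the list of hwthread ids below it
    (a hypothetical stand-in for the hwloc tree).
    """
    # Only look below the locale the proc was mapped to, never across
    # the whole node.
    below = topology[mapped_core]
    if use_hwthread_cpus:
        return list(below)   # all HTs in the core are usable
    return below[:1]         # otherwise only the 0th HT on the core
```

For a node described as `{0: [0, 1], 1: [2, 3]}`, a proc mapped to core 1 is only ever offered hwthreads 2 and 3.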
Ralph Castain
c4fd3df2df Remove unused variables
This commit was SVN r27319.
2012-09-12 12:03:24 +00:00
Ralph Castain
c82cfecc1c Cleanup comm_spawn for the multi-node case where at least one new process isn't spawned on every node. Avoid the complexities of trying to execute a daemon collective across the dynamic spawn as it becomes too hard to ensure that all daemons participate or are accounted for - instead, use a less scalable but workable solution of sending the data directly between the participating procs. Ensure that singletons get their collectives properly defined at startup so the spawned "HNP" is ready for them.
As a secondary cleanup, the HNP doesn't need to update its nidmap during an xcast as it already has an up-to-date picture of the situation. So just dump that data and move along.

This commit was SVN r27318.
2012-09-12 11:31:36 +00:00
Ralph Castain
6b5f9d7767 Some cleanups for staged execution
This commit was SVN r27317.
2012-09-12 09:15:33 +00:00
Ralph Castain
a0ffeb205a Add an orted component for staged operations and rename the staged component to "staged_hnp".
This commit was SVN r27305.
2012-09-11 20:35:46 +00:00
Ralph Castain
387f657fc2 Nuts - forgot to include this with the MPI Ticket 313 stuff. Set some of the envars needed for MPI_INFO_ENV
This commit was SVN r27304.
2012-09-11 20:35:09 +00:00
Ralph Castain
e8ecd67d53 Once again, bloody SLURM changes the envars and breaks things. Try and track their changes so we get a correct allocation.
This commit was SVN r27302.
2012-09-11 20:31:33 +00:00
Josh Hursey
40132f1874 Protect a potentially uninitialized variable (orte_sstore_base_global_snapshot_ref).
If orte_sstore_base_global_snapshot_ref is null, then it will default appropriately when it is used. When prelaunching we always specify this parameter, but if we are not prelaunching it is possible to allow this to be null and it will initialize when used. However we setup the prelaunching variable in both situations and in the latter that would result in a NULL reference. This patch protects that code segment.

This commit was SVN r27289.
2012-09-11 15:14:28 +00:00
Ralph Castain
ca40cb5f1c Fix comm_spawn by mpirun
This commit was SVN r27285.
2012-09-10 17:09:25 +00:00
Ralph Castain
4ca495c7e3 In managed allocations, we need to ensure that all nodes are flagged as having their slots "given" in case the user incorrectly asks us to change them. Also need to update the HNP node's "slots_given" flag as we don't replace the orte_node_t object when inserting the node info for that node.
This commit was SVN r27258.
2012-09-07 04:08:17 +00:00
Ralph Castain
2110fb7f95 Add some debug
This commit was SVN r27257.
2012-09-07 04:06:37 +00:00
Ralph Castain
78ccb097f0 Fix vm setup in unmanaged environments - needs to construct a node list in the same way we now do for mapping
This commit was SVN r27256.
2012-09-07 01:53:19 +00:00
Ralph Castain
36acbe4ca6 Multiple apps might want the same files, so instead of using the app_idx to determine who gets what, use the actual file names as they are sent anyway
This commit was SVN r27254.
2012-09-06 22:02:05 +00:00
Ralph Castain
e9e52fc78f Gain some efficiency in the staged mapper - if soft locations are in use and get_nodes returns busy, then no need to continue cycling thru the remaining apps as all nodes are occupied
This commit was SVN r27253.
2012-09-06 22:01:18 +00:00
Ralph Castain
efa50346c8 Error out if we are filtering a hostfile and encounter a node that is not in the resource-managed allocation, giving an error message identifying the file and the node. Don't filter managed allocations thru a default hostfile as this can lead to "hidden" errors.
Don't use dash-host info on managed allocations if we are using soft locations

This commit was SVN r27245.
2012-09-05 19:42:00 +00:00
Ralph Castain
d772e0fc3d Add an option to treat dash-host specifications as "requested, but not required". So-called "soft" location requests can allow an application to execute even if the ideal allocation isn't available.
This commit was SVN r27242.
2012-09-05 18:42:09 +00:00
Ralph Castain
64ccf789f2 Ensure the final output is printed
cmr:v1.7

This commit was SVN r27240.
2012-09-05 16:25:15 +00:00
Ralph Castain
fde83a44ab This confusion has been around for awhile, caused by a long-ago decision to track slots allocated to a specific job as opposed to allocated to the overall mpirun instance. We eliminated that quite a while ago, but never consolidated the "slots_alloc" and "slots" fields in orte_node_t. As a result, confusion has grown in the code base as to which field to look at and/or update.
So (finally) consolidate these two fields into one "slots" field. Add a field in orte_job_t to indicate when all the procs for a job will be launched together, so that staged operations can know when MPI operations are allowed.

This commit was SVN r27239.
2012-09-05 01:30:39 +00:00
Ralph Castain
bae5dab916 If (and only if) a user requests, set the default number of slots on any node to the number of objects of the specified type. This *only* takes effect in an unmanaged environment - i.e., if an external resource manager assigns us a number of slots, then that is what we use. However, if we are using a hostfile, then the user may or may not have given us a value for the number of slots on each node.
For those nodes (and *only* those nodes) where the user does *not* specify a slot count, we will set the number of slots according to their direction: either to the number of cores, numas, sockets, or hwthreads. Otherwise, the slot count is set to 1.

Note that the default behavior remains unchanged: in the absence of any value for #slots, and in the absence of any directive to set #slots, we will set #slots=1.

This commit was SVN r27236.
2012-09-04 20:58:26 +00:00
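
The slot-count precedence described in r27236 above can be summarised in a few lines. This is a hedged Python sketch: the function name `slot_count` and its parameters are hypothetical, and the object counts are supplied by the caller rather than discovered from the hardware.

```python
def slot_count(managed_slots=None, hostfile_slots=None,
               slots_from=None, object_counts=None):
    """Pick the number of slots for one node.

    managed_slots:  value assigned by an external resource manager, if any
    hostfile_slots: value the user gave in the hostfile, if any
    slots_from:     user directive naming an object type, e.g. "cores"
    object_counts:  dict of object type -> count on this node
    """
    if managed_slots is not None:
        return managed_slots              # managed environment always wins
    if hostfile_slots is not None:
        return hostfile_slots             # explicit user value is kept
    if slots_from is not None:
        return object_counts[slots_from]  # only when the user asked for it
    return 1                              # unchanged default
```

The ordering of the checks mirrors the message: the directive only applies to unmanaged nodes for which no slot count was given.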
Ralph Castain
18d2f75b56 Ensure we don't re-link files when staging execution as we may be executing more members of the same app. Allow the user to ask that directory trees be "flattened" so that all files appear in the proc's session directory itself.
This commit was SVN r27232.
2012-09-04 17:52:12 +00:00
Ralph Castain
86a16edd5a Move the debug output to the right place
This commit was SVN r27231.
2012-09-04 17:22:17 +00:00
Ralph Castain
11de735e8a Complete the revamp of hostfile support in non-managed environments. Working at the app level, ensure that we utilize only those nodes specified for that app, but fall back to the default hostfile (if available) for those with no specification, further falling back to the local host if the default hostfile is not present or is empty.
This commit was SVN r27230.
2012-09-04 16:34:05 +00:00
Jeff Squyres
b23a6b8eda Shiqing removed this file in r27217 (but neglected to remove it from
the Makefile.am).

This commit was SVN r27226.

The following SVN revision numbers were found above:
  r27217 --> open-mpi/ompi@ddbd542732
2012-09-04 13:06:39 +00:00
Ralph Castain
42d58f17bf Provide a more user-expected way of handling hostfile and dash-host allocations in unmanaged environments. First look for an RM-managed allocation. If nothing is found, or no active module is alive, proceed to look for hostfile and dash-host assignments. These are provided on a per-app basis, so we have to cycle across the apps using the following algorithm:
1. if a hostfile is given, then add the nodes found in that hostfile to our list - i.e., the resulting allocation contains the UNION of all nodes specified in hostfiles from across all apps.

2. any app that has no hostfile but has a dash-host, will have those nodes added to the list

3. any app that fails to have a hostfile or a dash-host will be given the default hostfile, if we have it

Each app will subsequently be filtered using their hostfile and/or dash-host data to ensure that the app only has access to the hosts it specified

Note that any relative node syntax found in the hostfiles or dash-host data will generate an error in this scenario, so only non-relative syntax can be present
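
A rough sketch of the three steps and the per-app filtering, under simplifying assumptions: apps are plain dictionaries holding already-parsed node-name lists, and when an app gives both a hostfile and a dash-host the two filters are applied in turn. All names here are hypothetical and do not come from the ORTE sources.

```python
def build_allocation(apps, default_nodes):
    """Collect the node list across all apps (steps 1-3 above)."""
    allocation = []

    def add(nodes):
        for node in nodes:
            if node not in allocation:  # union, preserving first-seen order
                allocation.append(node)

    for app in apps:
        if app.get("hostfile"):      # step 1: hostfile nodes
            add(app["hostfile"])
        elif app.get("dash_host"):   # step 2: dash-host only
            add(app["dash_host"])
        elif default_nodes:          # step 3: fall back to default hostfile
            add(default_nodes)
    return allocation


def filter_for_app(app, allocation, default_nodes):
    """Restrict the allocation to the hosts this app specified."""
    filters = [f for f in (app.get("hostfile"), app.get("dash_host")) if f]
    if not filters and default_nodes:
        filters = [default_nodes]
    nodes = list(allocation)
    for allowed in filters:
        nodes = [n for n in nodes if n in allowed]
    return nodes
```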

This commit was SVN r27223.
2012-09-04 11:50:52 +00:00
Ralph Castain
3894179e2f Add missing file
This commit was SVN r27222.
2012-09-04 01:16:58 +00:00
Ralph Castain
b5e26a90ea Singletons should insist that the spawned orted HNP use the novm state machine so that any subsequent comm_spawn occurs only on required nodes
This commit was SVN r27220.
2012-09-04 01:09:01 +00:00
Shiqing Fan
ddbd542732 Remove one .windows file.
Add a macro definition for isblank function.

This commit was SVN r27217.
2012-09-03 09:51:44 +00:00
Ralph Castain
66c3f5d18d When getting target nodes for mapping, there is a difference between not finding any nodes that match the required constraints (either in hostfile or dash-host filtering) and finding at least one such node, but all its slots are busy. Make the return code reflect this difference so the caller can take appropriate action.
This commit was SVN r27213.
2012-09-01 10:30:40 +00:00
Ralph Castain
95019cc310 Fix a few places where we weren't completely identifying hostfile-based operations against "localhost" entries. Tell the mapper base to be silent when we don't want errors announced because nodes aren't available for mapping (sometimes it is okay if they are fully used). Fix an infinite loop in the file prepositioning code.
This commit was SVN r27210.
2012-08-31 21:28:49 +00:00
Jeff Squyres
da00d281e6 Oops -- we want the priority to be low, not high (for now). :-)
This commit was SVN r27208.
2012-08-31 21:08:35 +00:00
Jeff Squyres
fcc1c7e33c = Overview =
First revision of the Locality Aware Mapping Algorithm (LAMA) RMAPS
component.  This component is used to effect many different types of
regular process/processor affinity patterns.  Although quite
flexible in the patterns that it provides, it is ''not'' a
fully-arbitrary, rankfile-like solution for process/processor
affinity.  

Inspired by !BlueGene-like network specifications, LAMA has a core
algorithm that is quite good at specifying regular patterns in
multiple "dimensions" (where "dimensions" are expressed in terms of
different hardware elements: processor hardware threads, cores,
sockets, ...etc.).  The LAMA core algorithm is described here:

  http://www.open-mpi.org/papers/cluster-2011-lama/

= LAMA Usage Levels =

LAMA allows specifying affinity in multiple different ways:

 1. None: Specifying no affinity options to mpirun results in exactly
    the same behavior as today: no affinity is used.
 1. Simple: Using the mpirun options "--bind-to <WIDTH>" and "--map-by
    <LEVEL>" to indicate how "wide" each process should be bound
    (i.e., bind to a processor core, or to a processor socket, etc.)
    and how to lay out the processes (i.e., round robin by cores,
    sockets, etc.).  
 1. Expert: Using four new MCA parameters to effect process mapping
    and binding to processors.  These options are a bit complex, and
    are not for the faint of heart, but offer a high degree of
    (regular pattern) flexibility (each of these is described more
    fully below): 
    * rmaps_lama_map: a sequence of characters describing how to lay
      out processes
    * rmaps_lama_bind: a sequence of characters describing the
      resources to bind to each process
    * rmaps_lama_mppr: a sequence of characters describing the maximum
      number of processes to allow per resource (i.e., a specific
      definition of "oversubscription")
    * rmaps_lama_ordering: once all processes are in place, how to
      order the ranks in MPI_COMM_WORLD

We anticipate that most users will utilize the "None" and "Simple"
levels of affinity, and they continue to work just as they do with the
v1.6 series and SVN trunk.  

The Expert level was designed for two purposes:

 1. To provide a precise definition for the "Simple" level (i.e.,
    every --bind-to/--map-by option in the "Simple" level has a
    corresponding precise specification in the "Expert" level)
 1. As modern computing platforms become more complex, we simply
    cannot predict what application developers will need in terms of
    processor affinity.  LAMA is an attempt to provide a highly
    flexible mechanism that allows applications to utilize a variety
    of complex, unique affinity patterns beyond the common "bind to
    core" and "bind to socket" patterns.

= LAMA Simple Level =

The "Simple" level is pretty much the same as what Open MPI has
offered for years.  It supports the same --bind-to and --map-by
options that Open MPI has supported for a while, but expands their
scope a bit.

Specifically, the following options are available for both --bind-to
and --map-by:

 * slot
 * hwthread
 * core
 * l1cache
 * l2cache
 * l3cache
 * socket
 * numa
 * board
 * node

= LAMA Expert Level =

The "Expert" level requires some explanation.  I'll repeat my
disclaimer here: the LAMA Expert level is not for the meek.  It is
flexible, but complex.  '''Most users won't need the Expert level.'''

LAMA works in three phases: mapping, binding, and ordering.  Each is
described below.

== Expert: Mapping ==

Processes are paired with sets of resources.  For example, each
process may be paired with a single processor core.  Or each process
may be paired with an entire processor socket.  LAMA performs this
mapping, obeying the Max Processes Per Resource ("MPPR", pronounced
"mipper") limits.  More on MPPR, below.

Mapping can be performed across multiple hardware levels:

 * h: Hardware thread
 * c: Processor core
 * s: Processor socket
 * L1: L1 cache
 * L2: L2 cache
 * L3: L3 cache
 * N: NUMA node
 * b: Processor board
 * n: Server node

If the act of mapping is that of pairing MPI processes to the
resources that have been allocated to a job, one can easily imagine
looping through all the resources and assigning processes to them.

But to effect different process layout patterns across those
resources, one may want to loop over those resources ''in a different
order.''  That is, if the above-mentioned nine hardware resources
(hardware thread, processor core, etc.) can be thought of as a
nine-dimensional space, you can imagine nine nested loops to traverse
all of them.  And you can imagine that changing the order of nesting
would change the traversal pattern.

LAMA accepts a sequence of tokens representing the above-mentioned
nine hardware resources to specify the order of looping when mapping
resources to processes.

For example, consider a "simple" traversal: csL1L2L3Nbnh.  Reading
that sequence of letters from left-to-right, it specifies mapping by
processor core, processor socket, L1 cache, L2 cache, L3 cache, NUMA
node, processor board, server node, and finally hardware thread.

Wait... what?  That string specifies resources from "smallest" to
"largest" -- with the exception of hardware threads.  Why are they
tacked on to the end?

In short, this string of letters means "map by round robin by core" --
(indeed, it exactly corresponds to the Simple level "--map-by core").
Specifically, LAMA traverses the string from left-to-right and maps
processes to all the resources indicated by that token (e.g., "c" for
processor core).  When there are no more resources indicated by that
token, it goes on to the next token.

Hence, in this case, LAMA will map the first process to the first
core, then it will map the second process to the second core, and so
on.

Once all the cores are exhausted, LAMA effectively ignores all the
other letters until "h" (because all the other resources are made up
of cores; when cores are exhausted, those resources are exhausted,
too).  

If there are still more processes to be mapped, LAMA will then
traverse all the hyperthreads -- meaning that the next process will be
mapped to the second hyperthread on the first core.  And the next
process will be mapped to the second hyperthread on the second core.
And so on.

Keep in mind that the cores involved may span many server nodes; we're
not just talking about the cores (etc.) in a single machine.

As another example, the sequence "sL1L2L3Nbnch" is exactly equivalent
to "--map-by socket" (i.e., LAMA maps the first process to the first
socket, the second process to the second socket, and so on).

The sequence of letters can be combined in many, many different ways to
produce many different regular mapping patterns.
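
The nested-loop idea can be sketched in a few lines.  This is a deliberately simplified model, not the LAMA implementation: it assumes a uniform topology, ignores MPPR limits, and uses only three levels in the example.  The names `tokenize`, `traverse`, and `map_processes` are hypothetical, and `sizes` gives the number of objects of each level inside its parent.

```python
import re
from itertools import islice, product


def tokenize(mapping):
    """Split a mapping string such as "csL1L2L3Nbnh" into its tokens."""
    return re.findall(r"L[123]|[hcsNbn]", mapping)


def traverse(order, sizes):
    """Yield resource coordinates; the leftmost token varies fastest."""
    # itertools.product varies its rightmost argument fastest, so the
    # token order is reversed before building the nested loops.
    slow_to_fast = list(reversed(order))
    for combo in product(*(range(sizes[t]) for t in slow_to_fast)):
        yield dict(zip(slow_to_fast, combo))


def map_processes(nprocs, order, sizes):
    """Assign the first nprocs coordinates of the traversal to processes."""
    return list(islice(traverse(order, sizes), nprocs))


# Two sockets, two cores per socket, two hardware threads per core.
sizes = {"s": 2, "c": 2, "h": 2}
by_core = map_processes(5, ["c", "s", "h"], sizes)
by_socket = map_processes(4, ["s", "c", "h"], sizes)
```

With the order "c, s, h" the first four processes land on the first hardware thread of each core in turn, and the fifth wraps around to the second hardware thread of the first core.  Swapping to "s, c, h" alternates between the two sockets instead.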

=== Max Processes Per Resource (MPPR) ===

The MPPR is an expression that precisely defines the maximum number of
processes that can be mapped to any single resource.  In effect, it
defines the concept of "oversubscription."  Specifically, traditional
HPC wisdom is that "oversubscription" is when there is more than one
MPI process per processor core.

This conventional definition is expressed in a MPPR string of "1:c"
(one process per core).  

But what if your MPI processes are multi-threaded, and they need
multiple cores per process?  You'd need a different description of
"oversubscription" in this case.  Perhaps you want to have one MPI
process per socket.  This would be expressed in a MPPR string of
"1:s".

The general form of an individual MPPR specification is an integer
followed by a colon, followed by any one of the tokens from mapping.
For example, "1:c" is pronounced "one process per core."

Multiple MPPR specifications can be strung together into a
comma-delimited list, too.  All of these MPPR values are then taken
into account when mapping.  Here are some examples:

 * 1:c -- allow, at most, one process per processor core (i.e., don't
   schedule by hyperthread)
 * 1:s -- allow, at most, one process per processor socket (e.g., 
   that process may be multithreaded, or wants exclusive use of the
   socket's caches)
 * 1:s,2:n -- only allow one process per processor socket, but, at
   most, two processes per server node (e.g., if the two MPI processes
   will consume all the RAM on the server node, even if there are more
   processor cores available)

If mapping all processes to resources would exceed a MPPR limit, this
job is ruled to be oversubscribed.  If --oversubscribe was specified
on the mpirun command line, the job continues.  Otherwise, LAMA will
abort the job.

Additionally, if --oversubscribe is specified, LAMA will endlessly
cycle through the mapping token string until all processes have been
mapped.
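
As an illustration of how an MPPR string could be checked against a finished mapping, here is a small sketch.  It is not the LAMA code: the helper names are hypothetical, only four levels are modelled, and each placement is assumed to be a dictionary of per-level indices.

```python
from collections import Counter

# Simplified subset of the hardware levels, largest to smallest.
LEVELS = ["n", "s", "c", "h"]


def parse_mppr(spec):
    """Turn "1:s,2:n" into {"s": 1, "n": 2}."""
    limits = {}
    for item in spec.split(","):
        count, token = item.split(":")
        limits[token] = int(count)
    return limits


def resource_id(placement, token):
    """Identify one resource by its index at this level and every level above."""
    depth = LEVELS.index(token) + 1
    return tuple(placement[t] for t in LEVELS[:depth])


def exceeds_mppr(placements, limits):
    """Return True if any resource holds more processes than its limit."""
    for token, limit in limits.items():
        counts = Counter(resource_id(p, token) for p in placements)
        if any(n > limit for n in counts.values()):
            return True
    return False


# Two processes on the same socket, on different cores.
placements = [
    {"n": 0, "s": 0, "c": 0, "h": 0},
    {"n": 0, "s": 0, "c": 1, "h": 0},
]
```

Under "1:c" this placement is within limits, while under "1:s,2:n" it counts as oversubscribed because both processes share socket 0.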

== Expert: Binding == 

Once processes have been paired with resources during the Mapping
stage, they are optionally bound to a (potentially different) set of
resources.  For example, processes may be mapped round robin by
processor socket, but bound to an individual processor core.

To be clear: if binding is not used, then mapping is effectively
reduced to "counting how many processes end up on each server node."
Without binding, there's no enforcement that a process will stay where
LAMA thinks it was placed.  

With binding, however, processes are bound to a set of hardware
threads.  The number of threads to which the process is bound is
sometimes referred to as the "binding width".  For example, if a
process is bound to all the hardware threads in a processor socket,
its "width" is the processor socket.

(note that we specifically do not say that the hardware threads are
sequential, even if they are all within a single resource such as a
processor core or socket.  BIOS ordering of hardware threads can be
wonky; so we only refer to "sets of hardware threads")

Bindings are expressed as an integer and a token from the mapping
string.  For example "1s" means "bind each process to one processor
socket" (there is no ":" in the binding string because the ":" is
pronounced as "per" when reading the MPPR string).

Note that it only makes sense to bind processes to a single resource
specification (unlike the MPPR specification, where multiple limits
can be specified).
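
A binding string of this form is easy to split into its two parts.  The parser below is a hypothetical sketch that accepts only the tokens listed in the Mapping section; it is not taken from the LAMA sources.

```python
import re

# An integer immediately followed by one mapping token, with no colon.
_BINDING = re.compile(r"^(\d+)(L[123]|[hcsNbn])$")


def parse_binding(spec):
    """Split a binding string such as "1s" into (count, token)."""
    match = _BINDING.match(spec)
    if match is None:
        raise ValueError(f"not a valid binding string: {spec!r}")
    return int(match.group(1)), match.group(2)
```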

== Expert: Ordering ==

Finally, processes are assigned a rank in MPI_COMM_WORLD.  LAMA
currently offers two ordering modes: sequential or natural:

 * Sequential: if you laid out all the hardware resources in a single
   line, and then overlaid all the MPI processes on top of them, they
   are ordered from 0 to (N-1) from left-to-right.
 * Natural: the ordering of ranks follows the mapping ordering.  For
   example, consider a server node with two processor sockets, each
   containing four cores.  The command line "mpirun -np 8 --bind-to
   core --map-by socket --order n a.out" would result in MCW ranks
   that look like this: [0 2 4 6] [1 3 5 7].
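
The two ordering modes can be compared on the example above (two sockets, four cores each, eight processes mapped round robin by socket).  This is a minimal sketch with hypothetical names; each placement is a `(socket, core)` pair listed in mapping order.

```python
def order_ranks(placements, mode):
    """Return {(socket, core): rank} for the given ordering mode."""
    if mode == "natural":
        ordered = list(placements)    # ranks follow the mapping order
    elif mode == "sequential":
        ordered = sorted(placements)  # ranks follow the hardware layout
    else:
        raise ValueError(f"unknown ordering mode: {mode!r}")
    return {location: rank for rank, location in enumerate(ordered)}


def layout(ranks, sockets, cores):
    """Show the ranks socket by socket, core by core."""
    return [[ranks[(s, c)] for c in range(cores)] for s in range(sockets)]


# Map by socket: process i goes to socket i % 2, core i // 2.
placements = [(i % 2, i // 2) for i in range(8)]

natural = layout(order_ranks(placements, "natural"), 2, 4)
sequential = layout(order_ranks(placements, "sequential"), 2, 4)
```

Here `natural` reproduces the [0 2 4 6] [1 3 5 7] pattern, while `sequential` gives [0 1 2 3] [4 5 6 7].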

= Execution =

At this point, the job is fully mapped, optionally bound, and its
ranks in MPI_COMM_WORLD are ordered.  It now starts its execution.

= Final Notes =

Note that at this point, lama is not the default mapper.  It must be
activated with "--mca rmaps lama".  We'll continue to do further
testing and comparative analysis with the current set of ORTE mappers.

Also, note that the LAMA algorithm can handle heterogeneity between
hardware resources (e.g., an MPI job spanning server nodes with
differing numbers of processor sockets).  For lack of a longer
explanation (this commit message is already long enough!), LAMA considers
each server node individually during mapping and binding.

See the LAMA paper for more details:
http://www.open-mpi.org/papers/cluster-2011-lama/

This commit was SVN r27206.
2012-08-31 19:57:53 +00:00
Jeff Squyres
ca5bd85364 Add some help messages to the RAS simulator for PEBKAC issues.
This commit was SVN r27203.
2012-08-31 16:33:29 +00:00
Ralph Castain
38ce23db43 Add some protection to allow NULL bytes in byte objects and NULL strings to be handled cleanly in nidmaps and modex entries. Ensure there is a valid nidmap available for the HNP to pass down to any local procs when it is operating alone.
This commit was SVN r27188.
2012-08-31 01:07:36 +00:00
Ralph Castain
efc4a40c8a It is okay for a key not to be found
This commit was SVN r27187.
2012-08-30 15:12:23 +00:00
Ralph Castain
6dbb7a8493 Just because a peer didn't post a particular pmi key, that doesn't mean it is an immediate irrecoverable error - could be they just don't have a matching interface. Let the upper layer decide what to do about it.
This commit was SVN r27186.
2012-08-30 14:12:09 +00:00
Ralph Castain
1b659de132 Get staged execution working on multi-node setups. Improve efficiency by only remapping if all procs not yet mapped in the job.
This commit was SVN r27181.
2012-08-29 20:35:52 +00:00
Ralph Castain
a3b08f5800 Fix a few things relating to comm_spawn that cause new daemons to be launched. Ensure that all new daemons receive a full pidmap. Properly mark the daemon job as "updated" when daemons are added.
This commit was SVN r27177.
2012-08-29 03:11:37 +00:00
Ralph Castain
f0077820f2 Silence warning
This commit was SVN r27175.
2012-08-28 22:27:41 +00:00
Ralph Castain
a414ffdf4c Remove debug
This commit was SVN r27174.
2012-08-28 22:18:00 +00:00
Ralph Castain
30fd9d7abc MPI procs should definitely not be trapping SIGCHLD - only ORTE tools need to do so
This commit was SVN r27166.
2012-08-28 21:39:06 +00:00
Ralph Castain
98580c117b Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine.
Remove some stale configure.m4's we no longer need.

Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.
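
The idea behind staged execution can be illustrated with a toy scheduler.  This is only a sketch of the concept, not the new mapper or state machine: the function name is hypothetical, run times are assumed to be known in advance, and `slots` must be at least 1.

```python
import heapq


def staged_schedule(durations, slots):
    """Return {proc: start_time} when at most `slots` procs run at once."""
    running = []  # min-heap of (finish_time, proc)
    starts = {}
    now = 0
    for proc, duration in enumerate(durations):
        if len(running) == slots:
            # No free resource: wait for the earliest proc to complete.
            now, _ = heapq.heappop(running)
        starts[proc] = now
        heapq.heappush(running, (now + duration, proc))
    return starts


# Three procs, two slots: the third starts when the shortest one finishes.
starts = staged_schedule([3, 1, 2], slots=2)
```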

This commit was SVN r27165.
2012-08-28 21:20:17 +00:00
Ralph Castain
aadfe1b61e Fix a missing test that breaks novm operation.
CMR:v1.7

This commit was SVN r27163.
2012-08-28 21:13:57 +00:00
Ralph Castain
d310dd8c58 Fix a strange race condition by creating a separate buffer for each send - apparently, just a retain isn't enough protection on some systems
This commit was SVN r27161.
2012-08-28 17:17:34 +00:00
Ralph Castain
11c68e2299 Correct the count in the pmi key
This commit was SVN r27156.
2012-08-28 15:05:02 +00:00
Ralph Castain
6e8c97c77c Per Sam's eagle-eyed review, free the malloc'd memory if getcwd fails for some strange reason.
This commit was SVN r27150.
2012-08-27 19:15:16 +00:00
Ralph Castain
bccc20d13e Deal with one last corner case of positioning a dot-file
This commit was SVN r27144.
2012-08-26 03:49:31 +00:00
Ralph Castain
63d41c643d Minor cleanup
This commit was SVN r27143.
2012-08-25 14:24:45 +00:00
Ralph Castain
05f0b4c653 Couple of minor cleanups
This commit was SVN r27135.
2012-08-24 21:14:40 +00:00