Ralph Castain
2f43989d22
Add debug and handle the use-case where someone (a) uses a hostfile while in a managed allocation to sub-allocate runs, and (b) includes the HNP's node in one of those hostfiles.
...
cmr:v1.7
This commit was SVN r28203.
2013-03-22 00:53:33 +00:00
Jeff Squyres
63d17ce901
Fix CID 968581: ensure that the string read from the socket is always
...
\0-terminated so that strlen() and strstr() can be used without fear.
Also fix some insignificant mem leaks (which is somewhat moot, because
as soon as we leave those error conditions, the process will be
terminating, but what the heck, might as well fix these while I was in
the file for the \0-termination issue...).
This commit was SVN r28199.
2013-03-21 16:05:50 +00:00
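The \0-termination fix described in the entry above can be illustrated with a minimal sketch. This is not the code from the commit: `read_terminated` is a hypothetical helper, and the use of a plain `read()` on a descriptor is an assumption made only for illustration. The point it shows is reserving one byte of the buffer so the result is always a valid C string.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the Open MPI source.
 * Reads at most cap-1 bytes from fd into buf and always writes a
 * terminating '\0', so strlen() and strstr() are safe on the result.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_terminated(int fd, char *buf, size_t cap)
{
    ssize_t n;

    if (NULL == buf || 0 == cap) {
        errno = EINVAL;
        return -1;
    }

    do {
        /* Reserve one byte for the terminator. */
        n = read(fd, buf, cap - 1);
    } while (n < 0 && EINTR == errno);

    if (n < 0) {
        buf[0] = '\0';   /* leave a valid empty string on error */
        return -1;
    }

    buf[n] = '\0';
    return n;
}
```

A caller would pass the full size of its buffer as `cap`; the helper never writes past `buf[cap - 1]`.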
Jeff Squyres
562db0dd11
Fix CID 741328: remove some dead code
...
This commit was SVN r28192.
2013-03-21 11:15:06 +00:00
Ralph Castain
fa13d27238
Avoid double-release in error path
...
This commit was SVN r28190.
2013-03-20 21:00:59 +00:00
Ralph Castain
147c6ff9e7
Clean out the cruft leftover from the use_common_ports experiment
...
cmr:v1.7
This commit was SVN r28184.
2013-03-20 15:07:43 +00:00
Ralph Castain
a4b6fb241f
Remove all remaining vestiges of the Windows integration
...
This commit was SVN r28137.
2013-02-28 17:31:47 +00:00
Ralph Castain
cf9796accd
Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes
...
This commit was SVN r28134.
2013-02-28 01:35:55 +00:00
Ralph Castain
347df93cd4
Handle the case of someone specifying a directory for the application. Ensure we get a non-zero exit status and clarify the error message.
...
cmr:v1.7
This commit was SVN r28119.
2013-02-27 01:36:21 +00:00
Ralph Castain
f36312ee6f
Continue cleanup - this time, start working on the "without full support" flags in ORTE. Remove no-longer-needed configure.m4 files from the ess and errmgr. In the former case, since all priorities are now the same (given the removal of the cnos component), configure priorities are no longer required.
...
This commit was SVN r28118.
2013-02-26 21:27:48 +00:00
Ralph Castain
74a3ece313
Remove unused component
...
This commit was SVN r28117.
2013-02-26 20:58:43 +00:00
Ralph Castain
8d2fa3693b
First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package.
...
This commit was SVN r28116.
2013-02-26 20:44:56 +00:00
Ralph Castain
bd9265c560
Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data.
...
A few changes were required to support this move:
1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution).
2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time.
3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you separately specify the "store" and "fetch" ordering.
4. the ORTE grpcomm and ess/pmi components, and the nidmap code, were updated to work with the new db framework and to specify internal/global storage options.
No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework.
This commit was SVN r28112.
2013-02-26 17:50:04 +00:00
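The two ideas in the entry above (an opaque 64-bit process identifier, and a fetch that tries internal storage before the global system) can be sketched as follows. Everything here is hypothetical and for illustration only: the `example_*` names, the two 32-bit fields of the name struct, the packing order, and the toy stores are assumptions, not the actual OPAL or ORTE definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real types live in the OPAL/ORTE headers. */
typedef uint64_t example_id_t;
typedef struct { uint32_t jobid; uint32_t vpid; } example_name_t;

/* Pack a two-field name into one opaque 64-bit value (assumed layout).
 * Code that only sees example_id_t need not know what the halves mean. */
example_id_t example_pack(example_name_t name)
{
    return ((example_id_t)name.jobid << 32) | (example_id_t)name.vpid;
}

/* A fetch function returns 0 and fills *value on success, -1 otherwise. */
typedef int (*example_fetch_fn)(example_id_t id, const char *key, int *value);

/* Toy "internal" store: holds one key for one process. */
int example_fetch_local(example_id_t id, const char *key, int *value)
{
    if (id == ((example_id_t)1 << 32) && 0 == strcmp(key, "bind_level")) {
        *value = 2;
        return 0;
    }
    return -1;
}

/* Toy "global" store, standing in for a system such as PMI. */
int example_fetch_global(example_id_t id, const char *key, int *value)
{
    (void)id;
    if (0 == strcmp(key, "node_rank")) {
        *value = 7;
        return 0;
    }
    return -1;
}

/* "Select many": try each selected component in fetch order and
 * stop at the first one that has the data. */
int example_fetch(const example_fetch_fn *order, size_t count,
                  example_id_t id, const char *key, int *value)
{
    for (size_t i = 0; i < count; i++) {
        if (0 == order[i](id, key, value)) {
            return 0;
        }
    }
    return -1;
}
```

Listing `example_fetch_local` before `example_fetch_global` in the `order` array gives the "check internal storage first, then go global" behavior; a separate array could express a different ordering for stores.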
Ralph Castain
c0b670bea8
I guess some profiling tools and debuggers require that the argv[0] of each rank be unique so they can create a filename based on that value. For those obscure cases, provide an mpirun cmd line option that indexes each argv[0] by rank
...
This commit was SVN r28064.
2013-02-15 20:20:49 +00:00
Ralph Castain
b9897267ef
Cleanup report-bindings so it always reports the actual binding instead of what was requested. Ensure we don't report twice if it is an MPI process being launched.
...
This commit was SVN r28057.
2013-02-14 17:24:28 +00:00
Ralph Castain
744ed49b2d
Begin cleanup of the thread_lock calls in ORTE. We'll ignore the ones in the rml/oob for now as that code block is being rewritten anyway.
...
This commit was SVN r28053.
2013-02-13 01:53:12 +00:00
Ralph Castain
b360156a37
Extend print coverage to all types
...
This commit was SVN r28012.
2013-02-01 14:21:06 +00:00
Ralph Castain
53e0ed71b0
Disqualify slurm module even if slurm support was configured into the build if we don't have an allocation and haven't enabled dynamic allocations
...
This commit was SVN r27995.
2013-01-31 18:15:47 +00:00
Ralph Castain
166f512924
Add some useful debug to the heartbeat sensor
...
This commit was SVN r27994.
2013-01-31 18:01:13 +00:00
Ralph Castain
ca9605773b
If sensors are enabled, then the daemons need to have their proc->node field linked to their local node object
...
This commit was SVN r27991.
2013-01-31 16:38:57 +00:00
Ralph Castain
c87fa68f9b
Cleanup the resource usage sensor, letting the db handle any printing requests.
...
This commit was SVN r27990.
2013-01-31 15:20:56 +00:00
Ralph Castain
9625757a71
Add new database component for printing "add_log" info
...
This commit was SVN r27989.
2013-01-31 15:19:39 +00:00
Ralph Castain
8e8e95ca6b
Silence error report - just because someone only defines ipv4 static ports doesn't make a fatal error
...
This commit was SVN r27976.
2013-01-29 23:48:22 +00:00
Jeff Squyres
8e25b927ab
Clean some minor warnings: remove variables that were set but never
...
used.
This commit was SVN r27974.
2013-01-29 23:35:42 +00:00
Ralph Castain
112f8eedb1
Handle the case where rankfile is providing the allocation
...
This commit was SVN r27971.
2013-01-29 20:37:58 +00:00
Nathan Hjelm
666bd826dc
fix alps configury
...
This commit was SVN r27962.
2013-01-29 15:44:30 +00:00
Brian Barrett
b8442ba505
Revamp the handling of wrapper compiler flags. The user flags, main configure
...
flags, and mca flags are kept separate until the very end. The main configure
wrapper flags should now be modified by using the OPAL_WRAPPER_FLAGS_ADD
macro. MCA components should either let <framework>_<component>_{LIBS,LDFLAGS}
be copied over OR set <framework>_<component>_WRAPPER_EXTRA_{LIBS,LDFLAGS}.
The situations in which WRAPPER CPPFLAGS can be set by MCA components were
narrowed to match the one use case where it makes sense.
This commit was SVN r27950.
2013-01-29 00:00:43 +00:00
Ralph Castain
cfaefb3286
Remove the only place where PMI was used outside a component, and relocate that code to common/pmi.
...
This commit was SVN r27944.
2013-01-28 20:14:51 +00:00
Brian Barrett
f42783ae1a
Move the RTE framework change into the trunk. With this change, all non-CR
...
runtime code goes through one of the rte, dpm, or pubsub frameworks.
This commit was SVN r27934.
2013-01-27 23:25:10 +00:00
Ralph Castain
6eaf601ae6
Good ol' Cray changed the way node/cpu allocation is handled in their latest release of ALPS, and so our allocator is broken. Adjust for the revised method, but preserve the older method for those Cray users who have not updated their system.
...
cmr:v1.7
This commit was SVN r27911.
2013-01-25 21:53:31 +00:00
Ralph Castain
f6b4db0b79
Fix rank_file operations. We changed the syntax to use semi-colons between multiple slot assignments so that we could use the comma to separate specific cores, but somehow the flex definitions didn't get updated to accept that character. We also incorrectly zero'd the bitmap between slot assignment sections, so with multiple slot assignments only the last one in the list took effect.
...
This commit was SVN r27908.
2013-01-25 18:33:25 +00:00
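The bitmap issue in the entry above can be shown with a small sketch. The grammar here is deliberately simplified and hypothetical (bare core indices 0-63, commas within a section, semi-colons between sections); it is not the real rankfile syntax or the flex-based parser. What it demonstrates is accumulating bits across all sections instead of clearing the bitmap at each semi-colon.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified parser: "c,c;c" with core indices 0-63.
 * Bits from every section are OR-ed into one mask; the mask is never
 * reset when a ';' is seen. Returns 0 on success, -1 on a parse error. */
int example_parse_slots(const char *spec, uint64_t *mask)
{
    uint64_t acc = 0;
    const char *p = spec;

    while ('\0' != *p) {
        char *end;
        unsigned long core = strtoul(p, &end, 10);

        if (end == p || core > 63) {
            return -1;                 /* not a number, or out of range */
        }
        acc |= (uint64_t)1 << core;
        p = end;

        if (',' == *p || ';' == *p) {
            p++;                       /* both separators keep accumulating */
        } else if ('\0' != *p) {
            return -1;                 /* unexpected character */
        }
    }

    *mask = acc;
    return 0;
}
```

With the buggy behavior described in the entry, a spec such as `0,1;4` would have kept only the bit for core 4; the sketch keeps all three.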
Ralph Castain
2504da1ac9
Remove stale code - message arrival time doesn't really mean much anymore.
...
This commit was SVN r27905.
2013-01-24 23:02:02 +00:00
Ralph Castain
9bfb2b989b
Silence warning
...
This commit was SVN r27901.
2013-01-24 19:38:51 +00:00
Ralph Castain
4b310473a1
Correct the computation of the daemon vpid
...
cmr:v1.7
This commit was SVN r27899.
2013-01-24 18:04:53 +00:00
Ralph Castain
b403ca5bd8
Silence warning
...
This commit was SVN r27897.
2013-01-23 22:17:08 +00:00
Ralph Castain
4d34d30a97
Silence warning
...
This commit was SVN r27896.
2013-01-23 22:16:48 +00:00
Ralph Castain
a591fbf06f
Add initial support for dynamic allocations. At this time, only Slurm supports the new capability, which will be included in an upcoming release.
...
Add hooks for supporting dynamic allocation and deallocation to support application-driven requests and fault recovery operations.
This commit was SVN r27879.
2013-01-20 00:33:42 +00:00
Ralph Castain
e4673f3283
Add new job state
...
This commit was SVN r27878.
2013-01-20 00:30:27 +00:00
Ralph Castain
73387e50e2
Add missing variable def - thanks to Paul Hargrove for spotting.
...
This commit was SVN r27865.
2013-01-18 14:32:53 +00:00
Ralph Castain
54266837e9
Remove use of param_find function as that function will be disappearing
...
This commit was SVN r27831.
2013-01-15 19:50:38 +00:00
Ralph Castain
aea6787918
Add new routed component with self-healing connections - based on radix component - for use in monitoring system
...
This commit was SVN r27757.
2013-01-08 04:40:35 +00:00
Ralph Castain
c9a596b487
Remove unused var
...
This commit was SVN r27756.
2013-01-08 04:39:30 +00:00
Ralph Castain
beddf3b379
Add required rml tag
...
This commit was SVN r27751.
2013-01-05 06:32:20 +00:00
Ralph Castain
bee8bf5d8f
Update the sensor framework to report stats back to the HNP if requested by including the data in heartbeats.
...
This commit was SVN r27748.
2013-01-05 06:30:20 +00:00
Ralph Castain
c71e119bbb
Extend the db framework to add support for logging data to databases without duplicating all the modex-related storage.
...
This commit was SVN r27746.
2013-01-05 06:28:09 +00:00
George Bosilca
34eecb8956
Be more explicit about the operation (store or update). Complain loudly
...
if something goes wrong.
This commit was SVN r27743.
2013-01-04 20:47:25 +00:00
Ralph Castain
cc29f8ff95
Attempt to fix the stupid Cray PMI problem
...
This commit was SVN r27742.
2013-01-04 02:53:42 +00:00
Nathan Hjelm
6a9ab9b221
Change orte_startup_timeout to be in seconds and remove the 10 second maximum
...
This commit was SVN r27741.
2013-01-03 23:56:34 +00:00
Ralph Castain
c65de32218
Cleanup the PMI subsystems to support Sam's "rml-less" shared memory wireup. Only retrieve keys that are specifically requested, and only when they are requested. Let string values be segmented across multiple keys, but don't do it for anything else.
...
This commit was SVN r27737.
2013-01-03 02:16:10 +00:00
Ralph Castain
d1163ebbf2
Ensure we cleanup DFS worker threads during finalize to avoid segfaulting in MCA param cleanup
...
This commit was SVN r27723.
2012-12-25 21:17:35 +00:00
Ralph Castain
c5ba59ba67
Remove stale component
...
This commit was SVN r27684.
2012-12-18 04:01:16 +00:00
Ralph Castain
0427a478b2
Remove stale component
...
This commit was SVN r27683.
2012-12-18 04:00:51 +00:00
Ralph Castain
82f1ba0ea8
Fix static port usage, ensure that both ipv4 and ipv6 are given if ipv6 was enabled
...
This commit was SVN r27682.
2012-12-18 03:59:49 +00:00
Ralph Castain
2fdd367aa9
Refs trac:3429
...
Fix bug reported by FreyGuy19713: in cases where HNP node has multiple entries in a hostfile or other allocation, we need to track the total slots allocated to that node.
This commit was SVN r27673.
The following Trac tickets were found above:
Ticket 3429 --> https://svn.open-mpi.org/trac/ompi/ticket/3429
2012-12-14 17:00:44 +00:00
Ralph Castain
1e92aa2b66
Enable multiple worker threads for processing DFS requests
...
This commit was SVN r27659.
2012-12-09 02:54:19 +00:00
Ralph Castain
c26ed7dcdd
Fix comm_spawn when ORTE progress thread is enabled by ensuring that all operations on the global list of active collectives are done in events to avoid conflicts.
...
This commit was SVN r27658.
2012-12-09 02:53:20 +00:00
Nathan Hjelm
3e1b13b13a
Re-add support for old flex (2.5.4a and earlier) while still cleaning up properly in new flex.
...
This commit was SVN r27657.
2012-12-07 00:12:43 +00:00
Ralph Castain
1237f8db57
Extend the ras module interface to include the orte_job_t being allocated so that dynamic allocations can be supported
...
This commit was SVN r27627.
2012-11-23 13:50:10 +00:00
George Bosilca
994d1aba50
Nothing.
...
This commit was SVN r27626.
2012-11-21 20:07:20 +00:00
Ralph Castain
43f883cb42
Add some more detailed error output to the db_hash component and nidmap code. Ensure the local nodename is included in the HNP's aliases
...
This commit was SVN r27622.
2012-11-18 17:57:19 +00:00
Ralph Castain
f2ec35536e
Fix a bug that prevented MCA params from being forwarded to daemons upon launch
...
cmr:v1.7
This commit was SVN r27621.
2012-11-18 17:55:26 +00:00
Ralph Castain
e11f32038a
Add an MCA param to retain all aliases based on IP addrs for node names so that procs can look them up by interface, if desired. If the param is set, pass aliases around to all daemons and procs for local use
...
This commit was SVN r27619.
2012-11-16 04:04:29 +00:00
Ralph Castain
3cecc1569b
Fix segfault if no file_maps were pushed
...
This commit was SVN r27612.
2012-11-15 15:39:17 +00:00
Ralph Castain
fe6dfad625
Update DFS to support multi-node operations
...
This commit was SVN r27594.
2012-11-12 02:54:53 +00:00
Ralph Castain
a6325e4546
Silence compiler warning
...
This commit was SVN r27590.
2012-11-12 02:51:29 +00:00
Ralph Castain
26f1cd0909
Fix compiler warnings
...
This commit was SVN r27588.
2012-11-12 02:50:45 +00:00
Ralph Castain
bd887f7f56
Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations.
...
This commit was SVN r27587.
2012-11-10 14:09:12 +00:00
Ralph Castain
615cc66b44
Protect the HNP cleanup in cases where no session dirs are created
...
This commit was SVN r27585.
2012-11-10 14:03:07 +00:00
Nathan Hjelm
e0f5137e46
add prototypes for lex destroy functions
...
This commit was SVN r27580.
2012-11-09 22:00:27 +00:00
Nathan Hjelm
8658bbc902
instead of relying on yyterminate to clean up the lex context call the destroy functions directly (after closing the file)
...
This commit was SVN r27577.
2012-11-09 16:10:55 +00:00
Ralph Castain
9b729794f2
A prior commit apparently broke the trunk when something was inadvertently left behind - so remove a reference to a no-longer-existing function
...
This commit was SVN r27574.
2012-11-07 11:11:05 +00:00
Nathan Hjelm
7fb5caea92
Remove the finish_parsing function from various .l files. The function is incomplete (doesn't clean up the lex state) and should be replaced by *_yylex_destroy which correctly cleans up the state.
...
Checked with the flex 2.5.35. Verified with valgrind that this fixes several "still reachable" leaks.
cmr:v1.7
This commit was SVN r27571.
2012-11-06 19:26:14 +00:00
Nathan Hjelm
bdedd8b0d3
Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1.
...
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this, several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close as they are now symmetric in their functionality.
This commit was SVN r27570.
2012-11-06 19:09:26 +00:00
Brian Barrett
e61c00212d
Add files found in svn but not tarball
...
This commit was SVN r27549.
2012-11-01 02:27:03 +00:00
Nathan Hjelm
2acd0f83de
Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter".
...
It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occurring before the command line parser changes but it appears to be resolved now.
This commit was SVN r27527.
The following SVN revision numbers were found above:
r27451 --> open-mpi/ompi@d59034e6ef
r27456 --> open-mpi/ompi@ecdbf34937
2012-10-30 19:45:18 +00:00
Nathan Hjelm
df9bd0ed59
fix bug in plm/rsh that could add extraneous mca options to the orted argv
...
cmr:v1.7
This commit was SVN r27526.
2012-10-30 19:40:04 +00:00
Ralph Castain
a080de188f
Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists.
...
This commit was SVN r27516.
2012-10-29 23:11:30 +00:00
Ralph Castain
e5e72c3137
Expand the dfs API to support retrieval, loading and purging of file maps.
...
This commit was SVN r27515.
2012-10-29 23:05:45 +00:00
Ralph Castain
4e52a15e70
Provide for sync on seek and close DFS operations. Eliminate an unnecessary wake-up timer when using ORTE progress thread
...
This commit was SVN r27500.
2012-10-26 15:49:04 +00:00
Ralph Castain
4ef30c016b
Remove stale windows references
...
This commit was SVN r27491.
2012-10-26 01:19:14 +00:00
Ralph Castain
df642f1508
Add an API to get a remote file's size. Separate dfs cmds from returned data messages so daemons don't get confused.
...
This commit was SVN r27487.
2012-10-25 22:23:08 +00:00
Ralph Castain
094d6f3143
Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them.
...
Add a set of URI create/parse utilities
This commit was SVN r27483.
2012-10-25 17:15:17 +00:00
Ralph Castain
32c185f730
Set a priority for output of forwarded IO so it can effectively compete against inbound messages
...
This commit was SVN r27480.
2012-10-24 23:34:50 +00:00
Ralph Castain
e06c330635
Add the ability to set a backlog limit on forwarded output waiting at mpirun - helps to avoid crashing systems during debug. Note that we default to "unlimited" to maintain current behavior.
...
This commit was SVN r27479.
2012-10-24 23:21:40 +00:00
Ralph Castain
e6014bf2e1
Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter
...
This commit was SVN r27477.
The following SVN revision numbers were found above:
r27451 --> open-mpi/ompi@d59034e6ef
r27456 --> open-mpi/ompi@ecdbf34937
2012-10-24 18:38:44 +00:00
Ralph Castain
7574d6673b
If someone provides the launch_agent cmd, then don't prefix it
...
cmr:v1.7
This commit was SVN r27473.
2012-10-24 16:14:04 +00:00
Ralph Castain
5c0534a7ad
Ensure that comm_spawn launches procs on the nodes specified by add-host and add-hostfile
...
This commit was SVN r27452.
2012-10-18 00:40:44 +00:00
Nathan Hjelm
d59034e6ef
MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions.
...
cmr:v1.7
This commit was SVN r27451.
2012-10-17 20:17:37 +00:00
Ralph Castain
4028ce7a5d
Silence warnings by making types match
...
This commit was SVN r27446.
2012-10-14 03:45:28 +00:00
Ralph Castain
285a3b168d
Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds.
...
This commit was SVN r27445.
2012-10-14 03:31:32 +00:00
Ralph Castain
04304c186f
Remove the setup_hadoop configure script as it is no longer required - the hadoop support components can build without accessing hadoop itself.
...
This commit was SVN r27385.
2012-09-29 18:30:35 +00:00
Ralph Castain
54db4c35eb
Get the trunk to build again when --without-hwloc is specified. Move a couple of key type definitions and utilities out from under the HAVE_HWLOC test so they are always available as they don't really depend on hwloc's presence. Tell two components not to build if hwloc is disabled:
...
ompi/mca/sbgp/basesmsocket
orte/mca/rmaps/lama
Remove stale configure.params files from the sbgp framework as the OMPI build system no longer looks at those files.
This commit was SVN r27377.
2012-09-26 23:24:27 +00:00
Samuel Gutierrez
42280e2af5
Temporarily make routed binomial the default. We are experiencing issues with
...
debruijn when launching fewer processes than are actually available within an
allocation. When this is fixed, please revert this change.
This commit was SVN r27376.
2012-09-26 16:08:12 +00:00
Jeff Squyres
cb65a44c6c
Fix the component priority assignment. Thanks to Alex Margolin for
...
the patch.
This commit was SVN r27363.
2012-09-25 07:13:23 +00:00
George Bosilca
6ec41400b3
Fix the error message in case a daemon does not succeed at killing the
...
local offspring.
This commit was SVN r27362.
2012-09-24 15:25:21 +00:00
Ralph Castain
d5279b0dc8
Make an attempt to protect hwloc cset2str from segfaulting in weird scenario
...
This commit was SVN r27361.
2012-09-23 16:51:51 +00:00
Ralph Castain
d95025f53a
Ensure we clear the usage numbers when binding on multiple nodes so we don't "carry over" info from one node to the next. Use the same tracking mechanism for binding upwards and in-place to avoid doing a bunch of mallocs.
...
Refs trac:3322
This commit was SVN r27356.
The following Trac tickets were found above:
Ticket 3322 --> https://svn.open-mpi.org/trac/ompi/ticket/3322
2012-09-20 15:16:06 +00:00
Ralph Castain
445161cd2e
Correctly count the total number of allocated slots
...
This commit was SVN r27353.
2012-09-20 02:50:14 +00:00
Ralph Castain
f592967685
Add missing retain to maintain correct accounting on nodes
...
This commit was SVN r27352.
2012-09-20 02:30:53 +00:00
Ralph Castain
e309db0be9
Ensure file descriptors are closed upon completion of transfer
...
This commit was SVN r27349.
2012-09-18 18:39:29 +00:00
Ralph Castain
11305109e1
Track positioned files so we avoid re-positioning them across jobs
...
This commit was SVN r27347.
2012-09-18 15:56:21 +00:00