Commit graph

403 commits

Author SHA1 Message Date
Ralph Castain
92cb93b21e Remove set-but-unused variable
Refs trac:3685

This commit was SVN r28855.

The following Trac tickets were found above:
  Ticket 3685 --> https://svn.open-mpi.org/trac/ompi/ticket/3685
2013-07-19 01:42:35 +00:00
Ralph Castain
bc2586cf3c Refs trac:3685. Check error code returned by PMI2_Info_GetJobAttr.
This commit was SVN r28854.

The following Trac tickets were found above:
  Ticket 3685 --> https://svn.open-mpi.org/trac/ompi/ticket/3685
2013-07-19 01:24:51 +00:00
Ralph Castain
a10546d5c1 Cleanup and rename of platform files
This commit was SVN r28853.
2013-07-19 01:18:41 +00:00
Ralph Castain
6c50c8167c Fix pmi-1 compile when no pmi2 is present
This commit was SVN r28849.
2013-07-18 22:45:08 +00:00
Ralph Castain
256034a3dc Sigh - fix a couple of spots I missed
Refs trac:3683

This commit was SVN r28843.

The following Trac tickets were found above:
  Ticket 3683 --> https://svn.open-mpi.org/trac/ompi/ticket/3683
2013-07-18 19:07:16 +00:00
Ralph Castain
fc3b777ef5 Cleanup a variable that isn't used if pmi2 support is available
Refs trac:3683

This commit was SVN r28841.

The following Trac tickets were found above:
  Ticket 3683 --> https://svn.open-mpi.org/trac/ompi/ticket/3683
2013-07-18 17:19:13 +00:00
Ralph Castain
92c6b806b9 Based on a patch submitted by Piotr Lesnicki of Bull, cleanup the PMI2 support. This has not been tested yet on multiple environments (e.g., Cray), so it needs more evaluation prior to moving to the 1.7 branch.
cmr:v1.7.3:reviewer=rhc

This commit was SVN r28837.
2013-07-18 14:46:07 +00:00
Joshua Ladd
0b5c1f2ea8 Add 'generic' support for PMI2 (previously, we checked for PMI2 only on Cray systems.) If your resource manager (e.g. SLURM) has support for PMI2, then the --with-pmi configure flag will enable its usage. If you don't have PMI2, then you will fall back to regular old PMI1. This patch was submitted by Ralph Castain and reviewed and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd
This commit was SVN r28666.
2013-06-21 15:28:14 +00:00
Ralph Castain
a0a6412545 Do a little cleanup on abnormal termination procedure - don't keep submitting forced exit events (one will do), no need to reset the abnormal termination pipe event in orterun, etc.
This commit was SVN r28450.
2013-05-05 17:39:45 +00:00
Ralph Castain
5d7a93c032 Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed.
Interesting how many includes had to be fixed here and there to fill in missing dependencies :-)

This commit was SVN r28411.
2013-04-29 17:02:37 +00:00
Ralph Castain
45af6cf59e The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem.
This commit was SVN r28308.
2013-04-08 23:34:16 +00:00
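The opaque identifier discussed in commit 45af6cf59e can be pictured as the bytes of a process name copied into a single 64-bit value, rather than reinterpreted in place. The sketch below is only an illustration of that byte-copy idea: the two 32-bit fields (`jobid`, `vpid`) and both function names are assumptions made for this example, not the actual OPAL/ORTE definitions.

```python
import struct


def name_to_identifier(jobid, vpid):
    """Copy a hypothetical (jobid, vpid) name into one opaque 64-bit integer."""
    raw = struct.pack("=II", jobid, vpid)   # the 8 bytes of the two-field name
    (ident,) = struct.unpack("=Q", raw)     # same bytes, read as one uint64
    return ident


def identifier_to_name(ident):
    """Copy the 8 bytes back out and split them into the two assumed fields."""
    return struct.unpack("=II", struct.pack("=Q", ident))
```

Code that only holds the identifier never needs to know what the two halves mean; only the layer that built it copies the bytes back out.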
Ralph Castain
1b5b9afa85 Silence warnings
This commit was SVN r28244.
2013-03-27 21:56:46 +00:00
Nathan Hjelm
17315bf360 Now that the entire codebase has been updated to use the MCA framework
system remove the last calls to the MCA parameter system.

This commit was SVN r28242.
2013-03-27 21:17:53 +00:00
Nathan Hjelm
c041156f60 Update ORTE frameworks to use the MCA framework system.
This commit was SVN r28240.
2013-03-27 21:14:43 +00:00
Nathan Hjelm
365cf48db5 Update OPAL frameworks to use the MCA framework system.
This commit was SVN r28239.
2013-03-27 21:11:47 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file can not be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few components actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
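The source rules listed in commit cf377db823 (an override file that no other file or environment value can replace, plus a command line source) can be sketched as a ranking of sources. This is a minimal model, not the mca_base_var API: the `Source` enum, the `Variable` class, and the relative order of the non-override sources are assumptions made for illustration.

```python
from enum import IntEnum


class Source(IntEnum):
    """Hypothetical source ranks; a higher rank wins."""
    DEFAULT = 0
    FILE = 1
    ENV = 2
    COMMAND_LINE = 3
    OVERRIDE = 4  # override file: nothing ranked lower may replace it


class Variable:
    def __init__(self, name, default):
        self.name = name
        self.value = default
        self.source = Source.DEFAULT

    def set(self, value, source):
        """Apply the value only if its source ranks at least as high as the current one."""
        if source >= self.source:
            self.value = value
            self.source = source
            return True
        return False
```

Once a value arrives from the override source, later attempts from a file or the environment are rejected and the variable keeps its value.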
Ralph Castain
24b91839aa Ensure the process knows its local cpuset early enough to perform the locality computation
This commit was SVN r28221.
2013-03-26 19:14:23 +00:00
Ralph Castain
6ee32767d4 Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument.
This commit was SVN r28219.
2013-03-26 18:27:50 +00:00
Ralph Castain
147c6ff9e7 Clean out the cruft left over from the use_common_ports experiment
cmr:v1.7

This commit was SVN r28184.
2013-03-20 15:07:43 +00:00
Ralph Castain
a4b6fb241f Remove all remaining vestiges of the Windows integration
This commit was SVN r28137.
2013-02-28 17:31:47 +00:00
Ralph Castain
cf9796accd Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes
This commit was SVN r28134.
2013-02-28 01:35:55 +00:00
Ralph Castain
f36312ee6f Continue cleanup - this time, start working on the "without full support" flags in ORTE. Remove no-longer-needed configure.m4 files from the ess and errmgr. In the former case, since all priorities are now the same (given the removal of the cnos component), configure priorities are no longer required.
This commit was SVN r28118.
2013-02-26 21:27:48 +00:00
Ralph Castain
8d2fa3693b First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package.
This commit was SVN r28116.
2013-02-26 20:44:56 +00:00
Ralph Castain
bd9265c560 Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data.
A few changes were required to support this move:

1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution).

2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time.

3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you separately specify the "store" and "fetch" ordering.

4. the ORTE grpcomm and ess/pmi components, and the nidmap code, were updated to work with the new db framework and to specify internal/global storage options.

No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework.

This commit was SVN r28112.
2013-02-26 17:50:04 +00:00
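The "select many" behavior described in commit bd9265c560, with separate "store" and "fetch" orderings and a fetch that checks internal storage before the global system, can be sketched as follows. All class and method names here are hypothetical stand-ins, not the opal db framework interface.

```python
class HashStore:
    """Stand-in for a component that keeps data in local (internal) storage."""

    def __init__(self):
        self.data = {}

    def store(self, proc, key, value):
        self.data[(proc, key)] = value
        return True

    def fetch(self, proc, key):
        return self.data.get((proc, key))


class GlobalStore(HashStore):
    """Stand-in for a component backed by a global system such as PMI."""


class DbFramework:
    def __init__(self, store_order, fetch_order):
        self.store_order = store_order  # components tried in order when storing
        self.fetch_order = fetch_order  # components tried in order when fetching

    def store(self, proc, key, value):
        # The first component in store order that accepts the data keeps it.
        for component in self.store_order:
            if component.store(proc, key, value):
                return
        raise RuntimeError("no component accepted the data")

    def fetch(self, proc, key):
        # Internal storage comes first in fetch order; fall through to the next.
        for component in self.fetch_order:
            value = component.fetch(proc, key)
            if value is not None:
                return value
        raise KeyError((proc, key))
```

Here `proc` plays the role of the opaque 64-bit identifier. Listing the local component first in `fetch_order` means the global system is consulted only when the data is not already held locally.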
Ralph Castain
b9897267ef Cleanup report-bindings so it always reports the actual binding instead of what was requested. Ensure we don't report twice if it is an MPI process being launched.
This commit was SVN r28057.
2013-02-14 17:24:28 +00:00
Ralph Castain
ca9605773b If sensors are enabled, then the daemons need to have their proc->node field linked to their local node object
This commit was SVN r27991.
2013-01-31 16:38:57 +00:00
Brian Barrett
b8442ba505 Revamp the handling of wrapper compiler flags. The user flags, main configure
flags, and mca flags are kept separate until the very end.  The main configure
wrapper flags should now be modified by using the OPAL_WRAPPER_FLAGS_ADD
macro.  MCA components should either let <framework>_<component>_{LIBS,LDFLAGS}
be copied over OR set <framework>_<component>_WRAPPER_EXTRA_{LIBS,LDFLAGS}.
The situations in which WRAPPER CPPFLAGS can be set by MCA components were
made very small to match the one use case where it makes sense.

This commit was SVN r27950.
2013-01-29 00:00:43 +00:00
Ralph Castain
cfaefb3286 Remove the only place where PMI was used outside a component, and relocate that code to common/pmi.
This commit was SVN r27944.
2013-01-28 20:14:51 +00:00
Ralph Castain
4b310473a1 Correct the computation of the daemon vpid
cmr:v1.7

This commit was SVN r27899.
2013-01-24 18:04:53 +00:00
Ralph Castain
c65de32218 Cleanup the PMI subsystems to support Sam's "rml-less" shared memory wireup. Only retrieve keys that are specifically requested, and only when they are requested. Let string values be segmented across multiple keys, but don't do it for anything else.
This commit was SVN r27737.
2013-01-03 02:16:10 +00:00
Ralph Castain
d1163ebbf2 Ensure we cleanup DFS worker threads during finalize to avoid segfaulting in MCA param cleanup
This commit was SVN r27723.
2012-12-25 21:17:35 +00:00
Ralph Castain
43f883cb42 Add some more detailed error output to the db_hash component and nidmap code. Ensure the local nodename is included in the HNP's aliases
This commit was SVN r27622.
2012-11-18 17:57:19 +00:00
Ralph Castain
e11f32038a Add an MCA param to retain all aliases based on IP addrs for node names so that procs can look them up by interface, if desired. If the param is set, pass aliases around to all daemons and procs for local use
This commit was SVN r27619.
2012-11-16 04:04:29 +00:00
Ralph Castain
615cc66b44 Protect the HNP cleanup in cases where no session dirs are created
This commit was SVN r27585.
2012-11-10 14:03:07 +00:00
Nathan Hjelm
bdedd8b0d3 Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1.
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close closes one. To add to this, several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close as they are now symmetric in their functionality.

This commit was SVN r27570.
2012-11-06 19:09:26 +00:00
Nathan Hjelm
2acd0f83de Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter".
It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occurring before the command line parser changes but it appears to be resolved now.

This commit was SVN r27527.

The following SVN revision numbers were found above:
  r27451 --> open-mpi/ompi@d59034e6ef
  r27456 --> open-mpi/ompi@ecdbf34937
2012-10-30 19:45:18 +00:00
Ralph Castain
094d6f3143 Add a new "distributed file system" capability to support file access operations across nodes that do not have a network file system attached to them.
Add a set of URI create/parse utilities

This commit was SVN r27483.
2012-10-25 17:15:17 +00:00
Ralph Castain
e6014bf2e1 Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter
This commit was SVN r27477.

The following SVN revision numbers were found above:
  r27451 --> open-mpi/ompi@d59034e6ef
  r27456 --> open-mpi/ompi@ecdbf34937
2012-10-24 18:38:44 +00:00
Nathan Hjelm
d59034e6ef MCA: remove deprecated mca_base_param functions (mca_base_param_register_int, mca_base_param_register_string, mca_base_param_environ_variable). Remove all uses of deprecated functions.
cmr:v1.7

This commit was SVN r27451.
2012-10-17 20:17:37 +00:00
Ralph Castain
64ccf789f2 Ensure the final output is printed
cmr:v1.7

This commit was SVN r27240.
2012-09-05 16:25:15 +00:00
Ralph Castain
b5e26a90ea Singletons should insist that the spawned orted HNP use the novm state machine so that any subsequent comm_spawn occurs only on required nodes
This commit was SVN r27220.
2012-09-04 01:09:01 +00:00
Ralph Castain
30fd9d7abc MPI procs should definitely not be trapping SIGCHLD - only ORTE tools need to do so
This commit was SVN r27166.
2012-08-28 21:39:06 +00:00
Ralph Castain
98580c117b Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine.
Remove some stale configure.m4's we no longer need.

Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.

This commit was SVN r27165.
2012-08-28 21:20:17 +00:00
Ralph Castain
7bcf2f8b5c Stop leaving droppings behind us
This commit was SVN r27111.
2012-08-22 17:39:22 +00:00
Ralph Castain
589acf550c Improve the new MPI_INFO_ENV to better handle Java applications and to correctly report the info for singletons.
This commit was SVN r27025.
2012-08-13 22:13:49 +00:00
Ralph Castain
cb48fd52d4 Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions:
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context

Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.

This commit was SVN r27005.
2012-08-12 01:28:23 +00:00
Shiqing Fan
e304c19920 This is also used on Windows.
This commit was SVN r26975.
2012-08-08 16:44:00 +00:00
Ralph Castain
dc22ea5cde A little cleaner on the message about repeated ctrl-c, and re-enable the event so we can abort if we see multiple ctrl-c's that don't meet the time requirement
This commit was SVN r26945.
2012-08-03 01:26:18 +00:00
Ralph Castain
e6c72bfd53 Ensure we can forcibly exit even when we are stuck inside of an event by replacing the libevent signal handler with a POSIX one that (a) attempts to trip a libevent termination event and (b) if another ctrl-c hits within 5 seconds, just calls exit.
This commit was SVN r26943.
2012-08-02 21:15:35 +00:00
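The two-step ctrl-c handling in commit e6c72bfd53 can be sketched as a small decision policy: the first interrupt asks for an orderly termination, and a second interrupt inside the 5-second window forces an exit. The class name, the returned strings, and the injectable clock are assumptions made for this illustration; the actual handler is C code inside orterun.

```python
import time


class InterruptPolicy:
    """Decide how to react to each ctrl-c."""

    def __init__(self, window=5.0, clock=time.monotonic):
        self.window = window   # seconds within which a repeat forces an exit
        self.clock = clock     # injectable so the policy can be tested
        self.last = None       # time of the most recent unforced interrupt

    def on_interrupt(self):
        now = self.clock()
        if self.last is not None and now - self.last <= self.window:
            return "force_exit"
        self.last = now
        return "request_termination"
```

In a real program the policy could be called from a handler installed with `signal.signal(signal.SIGINT, ...)`, which would call `os._exit` on "force_exit" and otherwise set a termination flag.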
Ralph Castain
6ee35e4977 Add num_local_peers to orte_process_info so we don't keep re-computing it, ensure it is available for direct launch via pmi as well
This commit was SVN r26931.
2012-07-31 21:21:50 +00:00
Abhishek Kulkarni
1ce378b5c6 Make C/R work with nodes > 1. This fix makes sure that the app coordinators send
the "ready-to-checkpoint" signal to the global coordinator only after ORTE has
initialized.

This commit was SVN r26795.
2012-07-13 23:37:29 +00:00
Abhishek Kulkarni
eec5a28aa4 More C/R fixes.
* Fix a typo introduced by the removal of the notifier framework
* Fix to flush the modex cached data correctly using the orte DB API.

This commit was SVN r26773.
2012-07-10 01:19:46 +00:00
Abhishek Kulkarni
5c58a1c9c1 Fix C/R support in the trunk.
Among other things, this patch deals with the following issues:
* fix ompi-checkpoint argument parsing
* ompi-restart -showme prints an extraneous "Restarted child with PID" 
  message. Move around the debug statement to avoid this.
* fixes for the state machine changes

This commit was SVN r26770.
2012-07-09 23:34:13 +00:00
Ralph Castain
0dfe29b1a6 Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required.
Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework.

Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one.

This commit was SVN r26678.
2012-06-27 14:53:55 +00:00
Brian Barrett
b22faedd9d Remove the Portals4 SHMEM reference implementation runtime support, as we're
no longer using the runtime provided by the reference implementation.

Remove the Catamount support from ORTE, since we're no longer supporting
Catamount.  Left the Catamount timer component, because I'm not sure whether
it's used on the XTs running CNL.

This commit was SVN r26677.
2012-06-27 14:17:43 +00:00
Ralph Castain
92527da4e3 Remove unused component
This commit was SVN r26660.
2012-06-26 00:49:28 +00:00
Ralph Castain
e6f3586415 Remove the orte notifier framework, per discussion at the devel meeting and follow-up with Jeff (who took the action item)
This commit was SVN r26637.
2012-06-22 18:09:23 +00:00
Ralph Castain
0a713cd27e Add database framework to ORTE and refactor modex code to utilize it. Create the "hash" db component from the prior modex db code. Leave the other components ignored for now - will activate them later.
Modex is still a blocking operation at this point.

This commit was SVN r26618.
2012-06-19 13:38:42 +00:00
Ralph Castain
96c778656a Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP.
Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build.

This commit was SVN r26606.
2012-06-15 10:15:07 +00:00
Ralph Castain
269cb2b8d9 Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base.
This commit was SVN r26591.
2012-06-11 19:59:53 +00:00
Brian Barrett
7406ef1241 Make all the PMI components depend on the common pmi library and properly
install the common pmi library

This commit was SVN r26588.
2012-06-11 15:58:09 +00:00
Nathan Hjelm
f2d4e95429 doh! add missing include
This commit was SVN r26471.
2012-05-22 20:49:13 +00:00
Nathan Hjelm
cdc3c87ba6 move pmi init/finalize into a common component
This commit was SVN r26470.
2012-05-22 15:15:39 +00:00
Nathan Hjelm
78b8b3cf76 bug fix: actually close ess components
This commit was SVN r26469.
2012-05-22 15:09:18 +00:00
Nathan Hjelm
6eeca66475 add an option to enable static ports. disabled by default
This commit was SVN r26462.
2012-05-21 19:56:15 +00:00
Ralph Castain
70a106fa71 Fix binding on remote nodes - need to pass the binding bitmap!
This commit was SVN r26403.
2012-05-08 03:52:39 +00:00
Jeff Squyres
2ba10c37fe Per RFC, bring in the following changes:
 * Remove paffinity, maffinity, and carto frameworks -- they've been
   wholly replaced by hwloc.
 * Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
 * Update sm, smcuda, wv, and openib components to no longer use carto.
   Instead, use hwloc data.  There are still optimizations possible in
   the sm/smcuda BTLs (i.e., making multiple mpools).  Also, the old
   carto-based code found out how many NUMA nodes were ''available''
   -- not how many were used ''in this job''.  The new hwloc-using
   code computes the same value -- it was not updated to calculate how
   many NUMA nodes are used ''by this job.''
   * Note that I cannot compile the smcuda and wv BTLs -- I ''think''
     they're right, but they need to be verified by their owners.
 * The openib component now does a bunch of stuff to figure out where
   "near" OpenFabrics devices are.  '''THIS IS A CHANGE IN DEFAULT
   BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
   (I do not have a NUMA machine with an OpenFabrics device that is a
   non-uniform distance from multiple different NUMA nodes).
 * Completely rewrite the OMPI_Affinity_str() routine from the
   "affinity" mpiext extension.  This extension now understands
   hyperthreads; the output format of it has changed a bit to reflect
   this new information.
 * Bunches of minor changes around the code base to update names/types
   from maffinity/paffinity-based names to hwloc-based names.
 * Add some helper functions into the hwloc base, mainly having to do
   with the fact that we have the hwloc data reporting ''all''
   topology information, but sometimes you really only want the
   (online | available) data.

This commit was SVN r26391.
2012-05-07 14:52:54 +00:00
Ralph Castain
45fee2b491 Resolve the case where only the HNP is in the system (i.e., single-node operation)
This commit was SVN r26382.
2012-05-03 18:00:01 +00:00
Ralph Castain
9f724db182 Remove duplicate event assignment
This commit was SVN r26360.
2012-04-30 16:06:20 +00:00
Ralph Castain
289f9f41ec From long-term discussions, have the daemons use the node_t and proc_t structs and arrays instead of the pidmap and nidmap arrays. Sets the stage for future work.
This commit was SVN r26359.
2012-04-29 00:10:01 +00:00
Ralph Castain
5d14fa7546 Fix mpi_abort, minimize error output.
This commit was SVN r26266.
2012-04-11 14:37:08 +00:00
Ralph Castain
14d5525fb1 Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified.
This commit was SVN r26261.
2012-04-10 19:08:54 +00:00
Ralph Castain
a34be856aa Now that we have PMI support, this is no longer needed
This commit was SVN r26254.
2012-04-07 13:36:24 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
46b040c79f Fix typo
This commit was SVN r26189.
2012-03-24 00:31:05 +00:00
Ralph Castain
2bd75ec7e3 Fix Cray XE builds - the priority here needs to equal that of the HNP component so that both build. Otherwise, mpirun tries to use PMI for its basis, and that doesn't work!
This commit was SVN r26188.
2012-03-23 20:06:34 +00:00
Josh Hursey
4dd9f89a99 Create an MCA parameter (ess_base_stream_buffering) that allows the user to override the system default for buffering of stdout/stderr streams. See 'man setvbuf' for more information.
Note: I am working on a system that buffered all output until the application finished due to a default of 'fully buffered.' This makes debugging painful. This switch fixed the problem by allowing me to adjust the buffering.

This commit was SVN r26119.
2012-03-08 22:02:28 +00:00
Ralph Castain
834a86420b Ensure we use the slurm module for slurm environments, and correct init order in pmi module when used by daemons
This commit was SVN r26089.
2012-03-02 23:10:48 +00:00
Ralph Castain
a83da303c5 When using PMI, we know the ranks that share our node and their relative local/node ranks. Save that info in the pidmap array so that BTLs that require early knowledge of local ranks can access it.
This commit was SVN r25992.
2012-02-21 16:43:17 +00:00
Jeff Squyres
b295a01d8e Fix another configury error found by Paul Hargrove. Thanks, Paul!
This commit was SVN r25971.
2012-02-20 21:38:27 +00:00
Ralph Castain
a0edae52f2 Ensure the wrapper flags get entered in the right order, with -lpmi coming before the alps util libs
This commit was SVN r25809.
2012-01-27 20:56:21 +00:00
Ralph Castain
be3dfb6a1a Ensure that we only add -lpmi once to the wrapper compilers, no matter how many components might use it.
This commit was SVN r25753.
2012-01-20 04:56:38 +00:00
Ralph Castain
d7fe1615b6 Add missing dollar sign on variable
This commit was SVN r25745.
2012-01-19 20:45:22 +00:00
Ralph Castain
0d20f745e2 Remove stale function def
This commit was SVN r25744.
2012-01-19 20:40:48 +00:00
Nathan Hjelm
6d0e7a0a0e don't enable ess/alps unless cnos is available
This commit was SVN r25743.
2012-01-19 19:36:00 +00:00
Ralph Castain
9d556e2f17 Allow daemons to use PMI to get their name where PMI support is available while using the standard grpcomm and other capabilities. Remove the GNI code from the alps ess component as that component should only be for alps/cnos installations.
This commit was SVN r25737.
2012-01-18 20:56:53 +00:00
Ralph Castain
bf103de66c My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress.
Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts.

This commit was SVN r25713.
2012-01-11 15:53:09 +00:00
Ralph Castain
7180ad40ad Fix a couple of minor buglets
This commit was SVN r25589.
2011-12-07 21:08:35 +00:00
Abhishek Kulkarni
0b7c51fae2 Correct an invalid reference to a missing help file.
This commit was SVN r25573.
2011-12-05 21:29:07 +00:00
Josh Hursey
cc57840b53 Fix ess/tool so that it does not segv when using the rsh PLM. Just have it use the base function directly to avoid similar problems with finalizing other components.
This commit was SVN r25571.
2011-12-05 15:40:46 +00:00
Ralph Castain
c56acf60ca Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order.
Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use.

Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact.

Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code.

This commit was SVN r25551.
2011-11-30 19:58:24 +00:00
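The reordering described in commit c56acf60ca, where the daemon-to-node assignment is made when each daemon calls back and nodes are then listed in daemon-vpid order, can be sketched as below. The function name and the shape of its inputs (a list of allocated node names and a vpid-to-node dictionary built from callbacks) are assumptions for illustration only.

```python
def order_nodes_by_daemon(allocation, reports):
    """Return node names in daemon-vpid order.

    allocation: node names in the order they appear in the allocation.
    reports: dict mapping daemon vpid -> node name the daemon reported.
    """
    known = set(allocation)
    ordered = []
    for vpid in sorted(reports):
        node = reports[vpid]
        if node not in known:
            raise ValueError(f"daemon {vpid} reported unknown node {node!r}")
        ordered.append(node)
    return ordered
```

If the launch environment happens to place daemons in allocation order, the result equals the allocation order, which matches the note that environments such as rsh are unaffected.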
Ralph Castain
b475421c16 As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn.
This commit was SVN r25507.
2011-11-26 02:33:05 +00:00
Ralph Castain
9b59d8de6f This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build.
Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations.

Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way.

This commit was SVN r25497.
2011-11-22 21:24:35 +00:00
George Bosilca
61f273b987 Do not tolerate uninitialized variables.
This commit was SVN r25489.
2011-11-18 10:19:24 +00:00
Samuel Gutierrez
e03bc93fb7 only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code.
This commit was SVN r25447.
2011-11-06 17:28:40 +00:00
Samuel Gutierrez
3fe7b3ee54 add PMI support to ess alps module. xt system guys: please yell at me if i missed something in cnos.
This commit was SVN r25423.
2011-11-03 04:04:32 +00:00
Samuel Gutierrez
27b9bcfafd update ess alps configuration file to include CNOS and PMI checks. some of the features committed here aren't being used, but they will be. also update orte_check_pmi.m4 to include missing call to action-if-not-found if --with-pmi is not specified or is disabled.
This commit was SVN r25422.
2011-11-03 02:14:47 +00:00
Ralph Castain
b77552c45d Cleanup some include files, return a silent error in open/select as the complaining component already output a message
This commit was SVN r25416.
2011-11-02 17:42:06 +00:00
Ralph Castain
55b996678e Minor indentation changes
This commit was SVN r25414.
2011-11-02 15:56:56 +00:00
Ralph Castain
d28dd55d33 Minimize the amount of topology info returned by the daemons. Most clusters, especially at scale, use the same node topology on every node, so there is no reason to return the topology from every daemon. Borrow a page from the --hetero-apps page and let users indicate that the node topology differs by adding a --hetero-nodes option to mpirun. If the option is set, then every daemon returns topology info. If not set, then only daemon vpid=1 returns it.

We always want one daemon to return the topology as the head node is often different from the compute nodes. Having one daemon return the compute node topology allows us to detect any such difference. All compute nodes are then set to the same topology.

This commit was SVN r25408.
2011-11-01 18:43:10 +00:00
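The reporting rule in commit d28dd55d33 reduces to one predicate: with --hetero-nodes every daemon returns its topology, otherwise only daemon vpid=1 does. The sketch below uses hypothetical function names and simply counts how many topology reports a given number of daemons would send.

```python
def should_report_topology(vpid, hetero_nodes):
    """True if the daemon with this vpid should return its node topology."""
    return hetero_nodes or vpid == 1


def count_topology_reports(num_daemons, hetero_nodes):
    """Count reports from daemons with vpids 1..num_daemons."""
    return sum(
        should_report_topology(vpid, hetero_nodes)
        for vpid in range(1, num_daemons + 1)
    )
```

On a homogeneous cluster the count stays at one regardless of the number of daemons, which is the saving the commit message describes.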