This allows the user to specify certain options to srun when an application
is launched with this PLS.
A useful example is the need to set the time to wait from when the first
process completes and when slurm kills remaining processes:
pls_slurm_args=--wait=1200
This commit was SVN r7206.
app_context:
mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
-np 2 -prefix /path/to/ompi/on/machineB ./exec2
- Allow with -mca pls_rsh_assume_same_shell 0, the checking for the
SHELL-variable on the actual node (currently 1st node).
Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and
csh/tcsh.
This commit was SVN r7195.
add a -I to find the included ltdl.h (vs. a system-installed ltdl.h)
- Clean up kruft in a bunch of Makefile.am's to remove now-unnecessary
AM_CPPFLAGS settings to get static-components.h for each framework
- Move the component_repository API functions out of opal/mca/base/base.h
and into opal/mca/base/mca_base_component_repository.h in order to
decrease unnecessary dependencies (e.g., before this, almost
everything in the tree depended on ltdl.h, which is unnecessary --
only a small number of files really need ltdl.h)
This commit was SVN r7127.
it to be an exit.
* Put the srun process (or what is about to become the srun process) in
it's own process group so that group-wide signals (such as the
SIGINT sent by hitting cntl-c in a shell) are not sent to the srun
process.
This commit was SVN r7068.
tree.
- fix up #include's throughout the tree (yay contrib/search_replace.pl!)
- remove a few extraneous #include's
- remove orte_sys_info*() from opal_init()/opal_finalize() (it's
already in orte_init_stage1() and orte_system_finalize())
- remove dependencies in opal on orte_system_info -- util/os_path.c
and util/os_create_dirpath.c (they only used path_sep, anyway --
easily changed to #defines)
This commit was SVN r7059.
session directory cleanup (among other things)
- When we get an abnormal exit in orterun (i.e., timeout expires and
we haven't gotten termination notices from all processes), print a
better message an exit in a better way (which includes session
directory cleanup)
- Fix tm and poe pls's to not exit() but rather propagate the error up
the stack (where relevant)
This commit was SVN r7058.
multi-client issues the old version had. Also, ignore the NULL iof
component, since we shouldn't use it when using the proxy orteds
This commit was SVN r6939.
against the total number of processors. If not oversubscribing, emit
the MCA environment variable mpi_paffinity_processor with the
processor number to bind the process to. This parameter is picked up
during MPI_Init (i.e., ompi_mpi_init()) and used to bind the process,
but currently iif the MCA param mpi_paffinity_alone is set to a
nonzero value (i.e., the user asks for it).
This commit was SVN r6906.
- change the framework opens to [mostly] use the new MCA param API
- properly pass in framework debug output streams to the
mca_base_component_open() function
This commit was SVN r6888.
ns_replica.c
- Removed the error logging since I use this function in orte_init_stage1 to
check if we have created a cellid yet or not.
ras_types.h & rase_base_node.h
- This was an empty file. moved the orte_ras_node_t from base/ras_base_node.h
to this file.
- Changed the name of orte_ras_base_node_t to orte_ras_node_t to match the
naming mechanisms in place.
ras.h
- Exposed 2 functions:
- node_insert:
This takes a list of orte_ras_base_node_t's and places them in the Node
Segment of the GPR. This is to be used in orte_init_stage1 for singleton
processes, and the hostfile parsing (see rds_hostfile.c). This just puts
in the appropriate API interface to keep from calling the
orte_ras_base_node_insert function directly.
- node_query:
This is used in hostfile parsing. This just puts in the appropriate API
interface to keep from calling the orte_ras_base_node_query function
directly.
- Touched all of the implemented components to add reference to these new
function pointers
ras_base_select.c & ras_base_open.c
- Add and set the global module reference
rds.h
- Exposed 1 function:
- store_resource:
This stores a list of rds_cell_desc_t's to the Resource Segment.
This is used in conjunction with the orte_ras.node_insert function in
both the orte_init_stage1 for singleton processes and rds_hostfile.c
rds_base_select.c & rds_base_open.c
- Add and set the global module reference
rds_hostfile.c
- Added functionality to create a new cellid for each hostfile, placing
each entry in the hostfile into the same cellid. Currently this is
commented out with the cellid hard coded to 0, with the intention of
taking this out once ORTE is able to handle multiple cellid's
- Instead of just adding hosts to the Node Segment via a direct call to
the ras_base_node_insert() function. First add the hosts to the Resource
Segment of the GPR using the orte_rds.store_resource() function then use
the API version of orte_ras.node_insert() to store the hosts on the Node
Segment.
- Add 1 new function pointer to module as required by the API.
rds_hostfile_component.c
- Converted this to use the new MCA parameter registration
orte_init_stage1.c
- It is possible that a cellid was not created yet for the current environment.
So I put in some logic to test if the cellid 0 existed. If it does then
continue, otherwise create the cellid so we can properly interact with the
GPR via the RDS.
- For the singleton case we insert some 'dummy' data into the GPR. The RAS
matches this logic, so I took out the duplicate GPR put logic, and
replaced it with a call to the orte_ras.node_insert() function.
- Further before calling orte_ras.node_insert() in the singleton case,
we also call orte_rds.store_resource() to add the singleton node to the
Resource Segment.
Console:
- Added a bunch of new functions. Still experimenting with many aspects of the
implementation. This is a checkpoint, and has very limited functionality.
- Should not be considered stable at the moment.
This commit was SVN r6813.
- converted some things to new MCA param API
- renamed the pls_bproc_seed component struct so its name isn't the same as
the pls_bproc component's struct
- minor bugfixes
This commit was SVN r6774.
- convert MCA params to the new API
- some style and indenting fixes
- look at local shell, and if [new] MCA param
pls_rsh_assume_same_shell is 1, then assume that the remote shell is
the same as the local shell. If pls_rsh_assume_same_shell is 0, do
a probe to figure out what the remote shell is (NOT CURRENTLY
IMPLEMENTED! you'll get a run-time warning if you set this MCA param
to 0).
- if the remote shell is not csh and not bash, then prefix the remote
command with "( ! [ -e ./.profile ] || . ./.profile;" (and suffix it
with ")") so that we run the .profile on the remote side in order to
set PATHs and the like. See the LAM FAQ for details (will someday
be on the Open MPI FAQ:
http://www.lam-mpi.org/faq/category4.php3#question8)
- add a bunch of debugging output if the MCA param pls_rsh_debug is
enabled (or the top-level debug MCA param is enabled)
- add more help messages (and corresponding calls to opal_show_help())
in help-pls-rsh.txt
This commit was SVN r6731.
- we now properly support multiple application contexts
- much improved error messages, using opal_show_help
- fix some small bugs in the way the processes were discovering their names
- better searching for orted
- use the new mca parameter interface
These changes still need some testing, but they seem stable.
This commit was SVN r6719.
test from orte_init_stage1 into a new framework, Startup Discovery Service
(sds). This allows us to have more flexibility with platforms like
Red Storm, which do not have a universe in the usual meaning and don't have
a seed daemon they can contact
This commit was SVN r6630.