cleanup code in the signal part of the event library
* Only attempt to forward standard input if we have a controlling terminal
(isatty() returns 1) and we are the foreground process OR we do not have
a controlling terminal (isatty() returns 0). If we have a controlling
terminal, check at each SIGCONT if we should change our forwarding,
since our foreground / background status may have changed.
Unfortunately, there isn't a great way in the iof framework to know if
we are capturing a starter's stdin. Use the logic that if it's a source
AND tagged as standard input, it's a starter's stdin. This seems to
work for all the common usages.
Both these need to go to the v1.0 branch.
This commit was SVN r8894.
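The forwarding rule from r8894 can be sketched roughly as follows. This is an illustration in Python, not the ORTE iof code: should_forward_stdin is a hypothetical name, and using tcgetpgrp() to decide "we are the foreground process" is an assumption about how that check is made.

```python
import os

def should_forward_stdin(fd=0):
    """Decide whether to forward stdin under the rule described above."""
    # isatty() false: stdin is a file, pipe, etc., so always forward.
    if not os.isatty(fd):
        return True
    try:
        # On a terminal: forward only while our process group is the
        # terminal's foreground process group (assumed check).
        return os.tcgetpgrp(fd) == os.getpgrp()
    except OSError:
        # Could not query the terminal; be conservative and do not forward.
        return False
```

A SIGCONT handler would simply call this function again and start or stop forwarding based on the new answer.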
the svc component so that it can disable the rml exception callback, fixing
a race condition in the shutdown mechanism of orte.
This should probably go to the v1.0 branch.
This commit was SVN r8893.
discards all of the data in the pty that hasn't been read. This was
leading to data being discarded when files were redirected into
mpirun and read by rank 0 of the job. This was very "not good".
The decision to not use ptys for stdin was made based on what Tim said
that LA-MPI was doing.
This needs to go to the v1.0 branch... Tim should probably review...
This commit was SVN r8892.
r8698), with changes below:
- Split wrapper flags into those required for each of the three projects,
and cleaned up some cruft (including the LIBMPI_EXTRA_*FLAGS) through-
out the build system
- Added opal_init_util and opal_finalize_util to allow init / cleanup
of all the opal code that doesn't require the MCA system
- Create standalone key=value file parser, based on the one that used
to be in the mca param parser, so that it can be shared in multiple
places
- Add wrapper datafiles for opal, orte, and ompi wrappers, and add
wrapper compiler with support for all the old features
This commit was SVN r8699.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r8690
r8698
intended to include the OMPI_DEBUG_ZERO call).
These debugging statements should not have affected correctness
because the value of 78 will be overridden in the read() and the
assert()/abort() stuff will only be triggered on an error which should
never happen (i.e., the error should have been handled by the prior if
conditional). But still, this code should not be there.
This commit was SVN r8649.
The following SVN revision numbers were found above:
r8643 --> open-mpi/ompi@a6b869ed68
(windows). Instead use the LN_S variable exported by the Makefile (set to
"ln -s" on all Unixes and to "cp -p" on windows).
When we remove an executable use the correct extension for its name
(add $(EXEEXT) to the name).
This commit was SVN r8616.
and path->agent_path so that it's totally clear what these are for
- make a new rsh component param for agent_param (the value from the
MCA param)
- delay the path check for the agent until the component init -- don't
make it fail during open, because the MCA base will print a warning
if a component fails open() (e.g., on clusters without rsh/ssh (!),
this component was failing noisily even though it was
normal/expected)
This commit was SVN r8596.
When checking for the shell with --mca pls_rsh_assume_same_shell 0,
have the node point to sensible values.
This commit was SVN r8563.
The following SVN revision numbers were found above:
r7664 --> open-mpi/ompi@0629cdc2d7
now take a colon-delimited list of agents (and associated argv). Also
change the default value to "ssh : rsh". Hence, if we run on a
cluster that does not have ssh, we'll fall back to rsh. If we can't
find rsh, then the rsh component will disqualify itself from
selection.
This commit was SVN r8514.
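A minimal sketch of the "first agent found wins" behaviour for a colon-delimited agent list. pick_agent is a hypothetical helper, not the rsh component's code, and the PATH lookup via shutil.which is an assumption standing in for the component's own path check.

```python
import shlex
import shutil

def pick_agent(spec="ssh : rsh", which=shutil.which):
    """Return the argv of the first agent in spec found on PATH, else None."""
    for entry in spec.split(":"):
        argv = shlex.split(entry)          # agent name plus its associated argv
        if argv and which(argv[0]) is not None:
            return argv
    # No agent found: the caller would disqualify the component.
    return None
```

With the default "ssh : rsh", a cluster without ssh falls through to rsh, and a None result corresponds to the component removing itself from selection.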
- Need to make sure that SIZE_MAX exists as a constant if stdint.h
doesn't exist
- struct timeval is defined in unistd.h on IRIX, so need to include
that header file wherever struct timeval is used.
This commit was SVN r8361.
- when eof is reached at orterun, send a 0 byte message to peer indicating eof
- on receipt of zero byte message - close corresponding file descriptor associated with the endpoint
- require setup ptys for stdin and stdout so that stdin can be closed independently of stdout
This commit was SVN r8264.
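The zero-byte EOF convention above can be sketched like this. forward and deliver are hypothetical names, the transport is reduced to a plain callback, and partial writes are ignored for brevity.

```python
import os

def forward(src_fd, send, chunk=4096):
    """Sender side: relay src_fd through send(); an empty message marks EOF."""
    while True:
        data = os.read(src_fd, chunk)
        send(data)                 # b"" at EOF is the zero-byte message
        if not data:
            return

def deliver(dst_fd, data):
    """Receiver side: write the payload, or close dst_fd on a zero-byte message.

    Returns False once the descriptor has been closed.
    """
    if not data:
        os.close(dst_fd)
        return False
    os.write(dst_fd, data)
    return True
```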
debugger scheme described in
http://www.open-mpi.org/community/lists/users/2005/10/0214.php. This
makes our user-level debugger scheme much more vendor-independent
(although the "-tv" option will still work for backwards compatibility
-- it'll just be a synonym for "--debug").
This commit was SVN r8206.
* turns out (duh!) that there was a reason that the <projectdir>dir
variable was set in the AM conditional. If not, stupid directories
are created and not needed... duh.
This commit was SVN r8205.
component/base Makefile.am files, reducing the time configure spends
stamping out Makefiles at the end
* Install base_impl.h file when devel-headers are being installed
This commit was SVN r8200.
When compiling C++ code that includes something that looks for the C++
header file "memory" (stupid C++ headers not having .h extensions), it
goes through the header file search path, which includes $(topsrcdir)/opal,
so it finds the directory $(topsrcdir)/opal/memory/ and tries to load
that as the memory header file and all goes downhill.
This commit was SVN r8111.
spinning in orte_iof_base_flush() when running
intel_tests/src/MPI_Errhandler_fatal_c
When we close an endpoint by taking it out of the event handler, we need to make
sure that it fits the criteria to pass through orte_iof_base_flush(), specifically
make sure we clean out the ep_frags list.
Note: This is more of a sanity check, since the endpoint should already be
in this state at the point of closure.
Secondly in orte_iof_base_endpoint_read_handler(), if we determine that it is
necessary to close the endpoint we have to "return" after doing so, otherwise
we add another frag to the endpoint which will cause it to hang in
orte_iof_base_flush().
Bug go squish!
This commit was SVN r8109.
OS X (you get an undefined symbol opal_event_lock). Since the code is
all #if 0'ed out, #if 0 out the header for now as well.
I believe console and openmpi are to be removed from OMPI before 1.0
release, so this doesn't need to go to the 1.0 branch
This commit was SVN r8089.
This takes care of Troy's first segfault problem, and compile errors that will likely happen as soon as Ken applies George's patch and runs make again.
This commit was SVN r7833.
it's not needed and there could be multiple sources each w/ their
own sequence.
- if a write doesn't complete, need to check for non-blocking case..
This commit was SVN r7795.
originally suggested by Ralf Wildenhues, to try to speed autogen, configure,
and make (and possibly even make install). Use automake's include directive
to drastically reduce the number of Makefile files (although the number of
Makefile.am files is the same - most are just included in a top-level
Makefile.am). Also use an Automake SUBDIRs feature to eliminate the
dynamic-mca tree, which was no longer really needed. This makes adding
a framework easier (since you don't have to remember the dynamic-mca
tree) and makes building faster (as make doesn't have to recurse through
the dynamic-mca tree)
This commit was SVN r7777.
oversubscribed on a node. And thus whether to call sched_yield or not.
The value of node->node_slots_inuse does not currently represent the number of
slots actually in use, at the moment. This is actually a bug in the RAS/RMAPS
base components, but the fix for that specific bug is bigger than we want to
address at the moment (but will certainly do so in the near future).
Since we cannot trust this value, use the total number of mapped processes
(which was properly set by the RMAPS component upon mapping -- Just not
properly propagated back to the registry's node segment) from the process
mapping.
In addition to this change I cleaned up a couple of the debug messages. It
seems that TM and RSH are the only two directly affected by this. SLURM
would be if that section of code wasn't currently inactive, but I put the fix
in for posterity.
This commit was SVN r7743.
too distant past
* work around apparently broken handling of max_slots somewhere along
the line by just setting it to 0
Both changes should go to the trunk.
This commit was SVN r7710.
this loop if "nodes" is an empty list. get_first, in this loop context,
allows us to do just that, while get_begin doesn't.
This fixes a --host problem that appeared on the Linux PPC64 build.
This commit was SVN r7703.
Also revised the callbacks to store and utilize local variables to avoid problems where threads modify the global structures. Not sure this totally fixes the problem, but it's a shot - suggested by Josh (and Jeff, I believe).
This commit was SVN r7694.
command:
svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps .
(where "." is a trunk checkout)
The logs from this branch are much more descriptive than I will put
here (including a *really* long description from last night). Here's
the short version:
- fixed some broken implementations in ras and rmaps
- "orterun --host ..." now works and has clearly defined semantics
(this was the impetus for the branch and all these fixes -- LANL had
a requirement for --host to work for 1.0)
- there is still a little bit of cleanup left to do post-1.0 (we got
correct functionality for 1.0 -- we did not fix bad implementations
that still "work")
- rds/hostfile and ras/hostfile handshaking
- singleton node segment assignments in stage1
- remove the default hostfile (no need for it anymore with the
localhost ras component)
- clean up pls components to avoid duplicate ras mapping queries
- [possible] -bynode/-byslot being specific to a single app context
This commit was SVN r7664.
Some of the functions in opal_init are void or return a bool (opal_output_init, but always returns true.. eh?), so I don't check them.
This commit was SVN r7638.
it's always ORTE_SUCCESS and sometimes masks real !=ORTE_SUCCESS rc
values.
- Add MCA param pls_tm_want_path_check. If nonzero (the default),
check for the orted in the PATH before each tm_spawn()'ing (doing a
little caching so that we don't hammer on the filesystem -- remember
all the PATH's where we successfully found the orted so that we
don't have to query the filesystem multiple times for a PATH where
we previously found the orted)
- Be sure to opal_argv_split() the pls_tm_orted MCA param
This commit was SVN r7625.
will fire some subscriptions that will eventually result in invoking
terminate_job (i.e., terminate anything that may have been
successfully started by launch).
This commit was SVN r7622.
at the moment.
Also remove all references to --map, and (C, N) command line options in the
help file. These references will be put back in when these options are
implemented.
This commit was SVN r7574.
If you use --prefix and then "-x LD_LIBRARY_PATH", the rsh pls would
take great pains to ensure that PATH and LD_LIBRARY_PATH were setup
correctly on the local and remote nodes, but then the fork pls would
blithely overwrite LD_LIBRARY_PATH with what the user exported (i.e.,
most likely without our prefix). This patch takes care of that -- the
fork pls examines the incoming environment, and if it sees PATH or
LD_LIBRARY_PATH, it re-prefixes those variables.
This commit was SVN r7566.
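The re-prefixing step in r7566 amounts to something like the sketch below. reprefix_env is a hypothetical name, and the bin/ and lib/ subdirectories under the prefix are an assumption about the install layout.

```python
import os

def reprefix_env(env, prefix):
    """Return a copy of env with the prefix directories put back in front."""
    out = dict(env)
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        if var in out:                      # only touch variables that came in
            ours = os.path.join(prefix, subdir)
            if out[var]:
                out[var] = ours + os.pathsep + out[var]
            else:
                out[var] = ours
    return out
```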
it is possible that if the receive has already arrived the callback will
be called before recv_buffer_nb() returns. This causes deadlock
as we try to acquire the lock, but already hold it.
This was causing orterun and orteds to stall in certain situations.
Became evident when stress testing dynamics with remote nodes.
This commit was SVN r7543.
- Fix bug identified by users: --prefix may also apply on the local
node; we need to prefix the PATH and LD_LIBRARY_PATH environment
variables before invoking execve()
This commit was SVN r7541.
some orted's to stall on locks in the MPI Dynamics cases. Since it
is not essential that we call these functions, they can go away.
Unlock the peer lock when aborting. This causes a potential deadlock
in do_waitall [see comment in code]. This was causing orteds to
deadlock at times when the seed had terminated. With proper interleaving
and timing the orted was deadlocking. This seems to have fixed this in
my stress testing with MPI 2 Dynamics.
This commit was SVN r7539.
The NS replica should give out tags that are over ORTE_RML_TAG_DYNAMIC
or it will overlap with other outstanding tags. This overlap was killing
MPI_Comm_spawn when a program tried to use it multiple times (> 3).
With this fix MPI_Comm_spawn is behaving properly.
A program can call it many times in a row without problem.
NOTE: Not tested for multi-threaded build yet
(A long time debugging for a one liner... :/)
This commit was SVN r7529.
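A sketch of the idea behind the r7529 fix: dynamically assigned tags start above a reserved boundary so they can never collide with statically defined tags. The numeric value used for ORTE_RML_TAG_DYNAMIC here is a placeholder for illustration only, and TagAllocator is a hypothetical class, not the NS replica code.

```python
ORTE_RML_TAG_DYNAMIC = 2000   # placeholder value, NOT the real constant

class TagAllocator:
    """Hands out unique tags strictly above the dynamic boundary."""

    def __init__(self, dynamic_base=ORTE_RML_TAG_DYNAMIC):
        self._next = dynamic_base + 1

    def assign(self):
        tag = self._next
        self._next += 1
        return tag
```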
The ras_base_schedule_policy MCA parameter is working once again. Before, it
would only do slot-based allocation, even if the MCA parameter was set properly.
Currently you can specify to orterun a node allocation by either:
-mca ras_base_schedule_policy node
-bynode
and slot allocation (which is the default) by:
-mca ras_base_schedule_policy slot
-byslot
This commit was SVN r7513.
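For readers unfamiliar with the two policies, here is a small sketch of the difference. It assumes the usual meaning of the terms (by slot fills every slot on a node before moving on, by node deals processes out one node at a time); map_procs is a hypothetical function and oversubscription is not modelled.

```python
def map_procs(nodes, nprocs, policy="slot"):
    """nodes is a list of (name, slots); returns the node name for each rank."""
    if policy == "slot":
        # Fill all slots of one node before moving to the next.
        order = [name for name, slots in nodes for _ in range(slots)]
    elif policy == "node":
        # Round-robin across nodes, one slot per node per pass.
        most = max(slots for _, slots in nodes)
        order = [name for i in range(most)
                 for name, slots in nodes if i < slots]
    else:
        raise ValueError("unknown policy: " + policy)
    if nprocs > len(order):
        raise ValueError("not enough slots (oversubscription not modelled)")
    return order[:nprocs]
```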
with the mutex locked and as this function will call oob_send which will call the lookup again
... we will deadlock as the mutex is already locked. The solution is to release the mutex before
going into the subscription. Then of course the logic to remove the item when something went
wrong with the subscription is a little bit more complex.
This commit was SVN r7429.
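The pattern described in r7429 (drop the mutex before calling into code that may re-enter, then undo carefully on failure) looks roughly like this. PeerCache and its fields are hypothetical; the subscribe callback stands in for the subscription that ends up calling the lookup again.

```python
import threading

class PeerCache:
    def __init__(self, subscribe):
        self._lock = threading.Lock()   # non-recursive, like a plain mutex
        self._peers = {}
        self._subscribe = subscribe     # may call back into lookup()

    def lookup(self, name):
        with self._lock:
            if name in self._peers:
                return self._peers[name]
            entry = self._peers[name] = {"name": name, "ready": False}
        # The lock is released here, so the subscription can re-enter lookup().
        try:
            self._subscribe(self, name)
        except Exception:
            with self._lock:
                # Remove the item, but only if it is still the one we inserted.
                if self._peers.get(name) is entry:
                    del self._peers[name]
            raise
        entry["ready"] = True
        return entry
```

The extra identity check on removal is the "little bit more complex" part: once the lock has been dropped, another caller may have replaced the entry.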
add an event, it can call the spawn function directly. This avoids it waiting on a condition that
will never be released.
This commit was SVN r7428.
This fixes one of the race conditions when orterun is sent a kill signal.
Before it would sometimes spin in the OOB waiting for a message to complete
to a peer that was no longer around. Stalling at this level prevented orterun
from noticing that it had received a kill signal.
This commit was SVN r7408.
automagically don't build on platforms without such things
* Fix for mistaken use of cache variable in assembly setup
* one more cached test hits the books
This commit was SVN r7404.
LIBADD instead of appending to the existing one.
Also removed some more Makefile.options whitespace, and I think emacs
removed some tabs (i.e., replaced them with whitespace).
This commit was SVN r7399.
Makefile.options
- Sample in each of the three projects of how to link against the
relevant libraries so that when components are loaded into a parent
process' space, we don't rely on the libopal/liborte/libmpi symbols
being in the parent's public symbol namespace -- instead,
dynamically link to the relevant libraries, allowing the dynamic
linker to pull those libraries in at run-time, if needed
This commit was SVN r7397.
However we do want to do a bit of cleanup on the node before we exit,
specifically clean out the session directory. I also had a couple of the
subsystems that don't depend upon peers (which is key) clean up as well.
Pedantic formatting issue in oob_tcp.h
This commit was SVN r7387.
caller to specify a subset of the state variables that it can subscribe to.
This is specified with one of three special flags defined in rmgr/rmgr_types.h
This is useful when we only care about a subset of the state changes, such as
in orted which only needs to know when a job has terminated or aborted.
This commit was SVN r7356.
waiting instead for the SOH to indicate that the jobid has terminated.
Consider a scheduled environment in which your program has a section of MPI
code followed by a section of computation that some processes execute while
other processes terminate normally. This patch keeps the scheduler from
terminating all of the processes and the allocation if all of the processes
on an allocated node exit well before other processes on other nodes.
This commit was SVN r7333.
The following formats are parsed:
user@IPv4
user@fqdn
IPv4 or fqdn [username|user-name|user_name]=user
- Try a better error-detection when parsing (recognize wrong
IPs, fqdns...)
This commit was SVN r7288.
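The accepted entry formats can be illustrated with a small parser. This is a sketch only: parse_host_entry is a hypothetical function, it does not validate IPv4 addresses or FQDNs, and it is not the hostfile lexer itself.

```python
import re

# The three accepted spellings of the user keyword.
_USER_KEY = re.compile(r"^(?:username|user-name|user_name)=(\S+)$")

def parse_host_entry(line):
    """Return (host, user) for one entry; user is None if not given."""
    fields = line.split()
    if not fields:
        raise ValueError("empty entry")
    user = None
    first = fields[0]
    if "@" in first:                       # user@IPv4 or user@fqdn
        user, _, host = first.partition("@")
        if not user or not host:
            raise ValueError("malformed user@host: " + first)
    else:                                  # IPv4 or fqdn, optional keyword
        host = first
    for field in fields[1:]:
        match = _USER_KEY.match(field)
        if match:
            user = match.group(1)
    return host, user
```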
that multiple processes don't overwrite each other. Change that
default in orte_init_stage1() to just "output-" (because the file will
be in a process-unique directory at that point; the pid is no longer
necessary).
This commit was SVN r7256.
opal_output_set_output_file_info(). This allows getting and setting
the default directory where output stream files will be opened (for
all *new* streams). Until this function is invoked, the default
location is $TMPDIR or $HOME (if $TMPDIR is not defined).
Added a call into orte_init_stage1() to call this function
immediately after the session directory is created and set the default
location of stream files to be the process' session directory.
This commit was SVN r7254.
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
number to be set at autoconf time (instead of at configure time, as
it was before). Set the version number, minus the subversion r number,
at autoconf time. Override the internal variables to include the r
number (if needed) at configure time. Basically, the right thing
should always happen. The only place it might not is the version
reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE takes a list of options, no need to specify
them in all the Makefile.am files.
* Adds support for subdir-objects, meaning that object files are put
in the directory containing source files, even if the Makefile.am is
in another directory. This should start making it feasible to
reduce the number of Makefile.am files we have in the tree, which
will greatly reduce the time to run autogen and configure.
This commit was SVN r7211.
This allows the user to specify certain options to srun when an application
is launched with this PLS.
A useful example is the need to set the time to wait from when the first
process completes and when slurm kills remaining processes:
pls_slurm_args=--wait=1200
This commit was SVN r7206.
app_context:
mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
-np 2 -prefix /path/to/ompi/on/machineB ./exec2
- With -mca pls_rsh_assume_same_shell 0, allow checking the SHELL
variable on the actual node (currently the 1st node).
Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and
csh/tcsh.
This commit was SVN r7195.
CTRL-C'd.
We were calling orte_finalize recursively which caused a segv when it tried to
use a freed framework (orte_rmgr in this case).
I added a status flag to orte_universe_info to indicate where we are in the code.
This was needed to determine if we should call orte_abort or not when shutting
down in the tcp oob.
This commit was SVN r7160.
1. Valgrind is good for something - chasing down memory leaks in registry led me to re-visit the dictionary functions and discover that I wasn't keeping track of the number of dictionary entries on each segment! Resulted in wasted time searching blank entries as well as leaked memory. This has now been fixed.
2. Fixed the orte_bitmap test. The init function for that class has been eliminated and the constructor adjusted to provide that functionality.
This commit was SVN r7136.
1. user does NOT specify the universe name. For the default universe case, if we detect an existing default universe and cannot connect to it, we quietly create an alternative default name by adding the pid to the orte_default_universe name and move on - we no longer provide a warning message for this case.
2. user specified a universe name. If we detect an existing universe of that name and cannot connect to it, we consider this an error condition and abort.
This commit was SVN r7131.
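The two cases above can be summarized in a short decision function. Everything here is illustrative: resolve_universe, the exists/can_connect callbacks, the "default-universe" string and the "name-pid" format are assumptions, since the message only says the pid is added to the default name.

```python
import os

class UniverseError(RuntimeError):
    pass

def resolve_universe(requested, exists, can_connect,
                     default="default-universe", pid=None):
    """Pick the universe name to use, following the two cases above."""
    name = requested if requested is not None else default
    if not exists(name) or can_connect(name):
        return name
    if requested is None:
        # Case 1: default universe exists but is unreachable.
        # Quietly fall back to a pid-qualified default name.
        if pid is None:
            pid = os.getpid()
        return "{}-{}".format(default, pid)
    # Case 2: the user asked for this universe by name; treat as an error.
    raise UniverseError("cannot connect to universe " + name)
```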
1. Added OMPI_PROC_ARCH as a defined registry key and added the code so that the architecture info gets properly transmitted across all processes using the startup message.
2. Added an OMPI_MODEX_KEY definition and removed the hard-coded "modex" key from pml_modex_exchange
This commit was SVN r7129.
add a -I to find the included ltdl.h (vs. a system-installed ltdl.h)
- Clean up cruft in a bunch of Makefile.am's to remove now-unnecessary
AM_CPPFLAGS settings to get static-components.h for each framework
- Move the component_repository API functions out of opal/mca/base/base.h
and into opal/mca/base/mca_base_component_repository.h in order to
decrease unnecessary dependencies (e.g., before this, almost
everything in the tree depended on ltdl.h, which is unnecessary --
only a small number of files really need ltdl.h)
This commit was SVN r7127.
include any optimization flags
- Use these flags to always compile ompi/debuggers/* and orterun so
that parallel debuggers (such as Totalview) can always see the
debugging symbols (see comments in ompi/debuggers/Makefile.am and
orte/tools/orterun/Makefile.am)
- Remove some obsolete LAM-named variables from configure.ac
This commit was SVN r7125.
Here's the huge registry check-in you've all been waiting for with bated breath. The revised version sends a single message to all processes at the various stage gates, thus making the startup much more scalable. I could provide you with all the tawdry details, but won't for now - you are welcome to ask, though, and I'll merrily bore your ears to tears.
In addition, the commit contains the following:
1. set the ignore properties on ompi/debuggers and orte/mca/pls/poe
2. Added simplified subscribe and put functions to the registry's API. I have also converted all of the ompi functions that registered subscriptions to the new API, and caught their associated put's as well.
In a follow-on commit, I'll be adding support for George's hetero arch registry subscription (wanted to get this one in first).
This commit was SVN r7118.
it to be an exit.
* Put the srun process (or what is about to become the srun process) in
its own process group so that group-wide signals (such as the
SIGINT sent by hitting Ctrl-C in a shell) are not sent to the srun
process.
This commit was SVN r7068.
orte_init_stage1(), since not all ORTE processes call orte_init().
* Expand opal_error test case to make sure ORTE error codes print
properly
* Make project error codes start at easy values (OPAL is -1 to -100,
ORTE is -101 to -200, OMPI is less than -201) to make it easier
to figure out what an error code as an integer means. Also has
the nice property of not changing the values of error codes every
time a new error code is added.
This commit was SVN r7061.
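With fixed ranges per project, an integer error code can be attributed to its project at a glance. A sketch, with one assumption flagged: the message says OMPI is "less than -201", and this example treats -201 itself as the start of the OMPI range so that no value is left unassigned. error_project is a hypothetical helper.

```python
def error_project(code):
    """Map an integer error code to the project whose range contains it."""
    if -100 <= code <= -1:
        return "OPAL"
    if -200 <= code <= -101:
        return "ORTE"
    if code <= -201:          # assumption: OMPI range begins at -201
        return "OMPI"
    return None               # 0 or positive: not an error code in this scheme
```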
tree.
- fix up #include's throughout the tree (yay contrib/search_replace.pl!)
- remove a few extraneous #include's
- remove orte_sys_info*() from opal_init()/opal_finalize() (it's
already in orte_init_stage1() and orte_system_finalize())
- remove dependencies in opal on orte_system_info -- util/os_path.c
and util/os_create_dirpath.c (they only used path_sep, anyway --
easily changed to #defines)
This commit was SVN r7059.
session directory cleanup (among other things)
- When we get an abnormal exit in orterun (i.e., timeout expires and
we haven't gotten termination notices from all processes), print a
better message and exit in a better way (which includes session
directory cleanup)
- Fix tm and poe pls's to not exit() but rather propagate the error up
the stack (where relevant)
This commit was SVN r7058.
- Change orte_base_infrastructre to orte_infrastructre to conform with
ompi_info's needs
- Move MCA Param registration in ORTE to a centralized function that is
called first in orte_init_stage1
- Set the infrastructure flag as an argument to orte_init
- Adjust initialization functions to properly pass down the infrastructure
flag.
This commit was SVN r7053.
Also check to see if the infrastructure flag was previously set before assuming it
to be false. This was causing orterun to operate incorrectly in the presence
of a persistent daemon.
This commit was SVN r7039.
NOTE: These have NOT been added to the Makefile.am in the repository. Please do NOT add them at this time - I will do so later.
This commit was SVN r6979.
OPAL_ERROR, same for all the other error codes. Also, make sure that there
are never conflicts between OPAL and ORTE error codes (for example).
Finally, provide opal_perror(), opal_strerror(), and opal_strerror_r() to
give stringified error messages for the different error codes
This commit was SVN r6969.
multi-client issues the old version had. Also, ignore the NULL iof
component, since we shouldn't use it when using the proxy orteds
This commit was SVN r6939.
against the total number of processors. If not oversubscribing, emit
the MCA environment variable mpi_paffinity_processor with the
processor number to bind the process to. This parameter is picked up
during MPI_Init (i.e., ompi_mpi_init()) and used to bind the process,
but currently only if the MCA param mpi_paffinity_alone is set to a
nonzero value (i.e., the user asks for it).
This commit was SVN r6906.
- change the framework opens to [mostly] use the new MCA param API
- properly pass in framework debug output streams to the
mca_base_component_open() function
This commit was SVN r6888.
* Add base to memory framework so that we can do something sane with
ompi_info
* Updated ompi_info to print components for memory framework and
show whether we have memory hooks active or not.
This commit was SVN r6861.
ns_replica.c
- Removed the error logging since I use this function in orte_init_stage1 to
check if we have created a cellid yet or not.
ras_types.h & ras_base_node.h
- This was an empty file. moved the orte_ras_node_t from base/ras_base_node.h
to this file.
- Changed the name of orte_ras_base_node_t to orte_ras_node_t to match the
naming mechanisms in place.
ras.h
- Exposed 2 functions:
- node_insert:
This takes a list of orte_ras_base_node_t's and places them in the Node
Segment of the GPR. This is to be used in orte_init_stage1 for singleton
processes, and the hostfile parsing (see rds_hostfile.c). This just puts
in the appropriate API interface to keep from calling the
orte_ras_base_node_insert function directly.
- node_query:
This is used in hostfile parsing. This just puts in the appropriate API
interface to keep from calling the orte_ras_base_node_query function
directly.
- Touched all of the implemented components to add reference to these new
function pointers
ras_base_select.c & ras_base_open.c
- Add and set the global module reference
rds.h
- Exposed 1 function:
- store_resource:
This stores a list of rds_cell_desc_t's to the Resource Segment.
This is used in conjunction with the orte_ras.node_insert function in
both the orte_init_stage1 for singleton processes and rds_hostfile.c
rds_base_select.c & rds_base_open.c
- Add and set the global module reference
rds_hostfile.c
- Added functionality to create a new cellid for each hostfile, placing
each entry in the hostfile into the same cellid. Currently this is
commented out with the cellid hard coded to 0, with the intention of
taking this out once ORTE is able to handle multiple cellid's
- Instead of just adding hosts to the Node Segment via a direct call to
the ras_base_node_insert() function, first add the hosts to the Resource
Segment of the GPR using the orte_rds.store_resource() function then use
the API version of orte_ras.node_insert() to store the hosts on the Node
Segment.
- Add 1 new function pointer to module as required by the API.
rds_hostfile_component.c
- Converted this to use the new MCA parameter registration
orte_init_stage1.c
- It is possible that a cellid was not created yet for the current environment.
So I put in some logic to test if the cellid 0 existed. If it does then
continue, otherwise create the cellid so we can properly interact with the
GPR via the RDS.
- For the singleton case we insert some 'dummy' data into the GPR. The RAS
matches this logic, so I took out the duplicate GPR put logic, and
replaced it with a call to the orte_ras.node_insert() function.
- Further before calling orte_ras.node_insert() in the singleton case,
we also call orte_rds.store_resource() to add the singleton node to the
Resource Segment.
Console:
- Added a bunch of new functions. Still experimenting with many aspects of the
implementation. This is a checkpoint, and has very limited functionality.
- Should not be considered stable at the moment.
This commit was SVN r6813.
using orteprobe.
Created a header file for orte_setup_hnp. [HNP = Head Node Process]
General cleanup and added a bit of documentation in orte_setup_hnp.c
Also fixed a cellid tokens issue (circa line 285)
Changed the launched scope from private to public
In orteprobe:
- added reference to orted.h to avoid duplicate header contents in orteprobe.h
- removed the version tag, and put in a verbose argument
- Fixed a buffer packing problem that was preventing the parent from receiving the
proper contact information for the new daemon.
This commit was SVN r6802.
If you register a parameter a second time, it overwrites the default
value (this was causing a problem with mpirun not being marked as orte
infrastructure, and therefore thinking that it was a singleton, and
therefore always adding the localhost into the node list).
This commit was SVN r6789.
- converted some things to new MCA param API
- renamed the pls_bproc_seed component struct so its name isn't the same as
the pls_bproc component's struct
- minor bugfixes
This commit was SVN r6774.
his e-mail:
I ran into a small bug in rmaps_rr.c: map_app_by_slot which was
triggered by using multiple app contexts. Basically, if not all the
slots we allocated on a node were used by an app, we would
automatically move onto the next node. This caused a problem with
multiple app contexts when the first app takes a partial allocation of
a node, the second app would not be able to access these slots because
we had already moved past the node, and the byslot routine does not
wrap back around the list.
This commit was SVN r6766.
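The behaviour after the fix can be sketched as a by-slot mapper that keeps one cursor across app contexts and only advances once a node is actually full. map_apps_by_slot is a hypothetical function, not the code in rmaps_rr.c.

```python
def map_apps_by_slot(nodes, app_sizes):
    """nodes is a list of (name, slots); app_sizes gives procs per app context.

    Returns one list of node names per app context.
    """
    free = [[name, slots] for name, slots in nodes]
    cur = 0                          # cursor shared by all app contexts
    result = []
    for nprocs in app_sizes:
        placed = []
        while len(placed) < nprocs:
            if cur == len(free):
                raise ValueError("not enough slots")
            if free[cur][1] == 0:
                cur += 1             # move on only when the node is full
                continue
            free[cur][1] -= 1
            placed.append(free[cur][0])
        result.append(placed)
    return result
```

With two 4-slot nodes and app contexts of 2 and 3 processes, the second app first uses the two slots left over on the first node instead of skipping straight to the second node.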
that were set on the command line. This was technically exactly the
way the code was designed, but it certainly violated the Law of Least
Astonishment (even to its designer ;-) ). So now if you execute
something like this:
mpirun -mca pls_rsh_debug 1 -np 4 hello
You'll see debugging output from the rsh pls component, as you would
expect (this was not previously the case -- the MCA pls_rsh_debug
param would be set to 1 in the 4 spawned hello processes, but *not*
in the orterun process).
More specifically, MCA parameters will be set in the orterun process
in the following cases:
- The new command line switch "--gmca" (or "-gmca") is used,
indicating that the MCA parameter is "global". --gmca also means
that that MCA parameter will be applied to all context app's. For
example:
mpirun -gmca foo bar -np 1 hello : -np 2 goodbye
The foo MCA param will be set in both the hello and goodbye
processes.
- If there is only one context app. For example:
mpirun -mca pls_rsh_debug 1 -np 4 hello
will set pls_rsh_debug to 1 in both the orterun process and the 4
spawned hello processes.
Also added a few more comments inside orterun to document a somewhat
confusing use of a state variable in a recursive case.
This commit was SVN r6764.
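The two rules above for when a parameter also takes effect in orterun can be modelled as follows. orterun_params is a hypothetical function, and letting a per-app -mca value override a --gmca value with the same name is an assumption that the message does not address.

```python
def orterun_params(global_params, app_contexts):
    """global_params: dict from --gmca; app_contexts: one dict of -mca
    params per app context. Returns (params in orterun, params per app)."""
    # --gmca params apply to every app context.
    per_app = [dict(global_params, **ctx) for ctx in app_contexts]
    # orterun itself always sees the global params...
    in_orterun = dict(global_params)
    # ...and the -mca params too, but only when there is a single context.
    if len(app_contexts) == 1:
        in_orterun.update(app_contexts[0])
    return in_orterun, per_app
```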
1. dump_xxx - analogous to the registry's dump commands, allows you to examine the contents of the name services' structures
2. get_job_peers - get an array of process names for all processes in the specified job
This commit was SVN r6759.
Somehow, in changing over to the new MCA interfaces, the "set" part of that logic got lost, so the singleton flag was always being set. This should repair some of the anomalous behavior seen recently where the local host was always being used for an application process.
This commit was SVN r6757.
containers when one is requested.
Fix a bug in gpr_replica_del_index_api which doesn't preset num_tokens and
num_keys, but assumes they are 0.
Fix orte_ras_base_node_delete() function to operate properly to delete the
appropriate container in the 'orte-node' segment when requested.
This commit was SVN r6756.
- convert MCA params to the new API
- some style and indenting fixes
- look at local shell, and if [new] MCA param
pls_rsh_assume_same_shell is 1, then assume that the remote shell is
the same as the local shell. If pls_rsh_assume_same_shell is 0, do
a probe to figure out what the remote shell is (NOT CURRENTLY
IMPLEMENTED! you'll get a run-time warning if you set this MCA param
to 0).
- if the remote shell is not csh and not bash, then prefix the remote
command with "( ! [ -e ./.profile ] || . ./.profile;" (and suffix it
with ")") so that we run the .profile on the remote side in order to
set PATHs and the like. See the LAM FAQ for details (this will
someday be on the Open MPI FAQ):
http://www.lam-mpi.org/faq/category4.php3#question8
- add a bunch of debugging output if the MCA param pls_rsh_debug is
enabled (or the top-level debug MCA param is enabled)
- add more help messages (and corresponding calls to opal_show_help())
in help-pls-rsh.txt
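The .profile wrapping described in the list above can be sketched as follows. This is a hedged Python illustration, not the rsh pls component's C code; `wrap_remote_command` is a hypothetical name, and the single spaces placed around the command are an assumption. Only the prefix and suffix strings come from the commit message, and the shell test follows it literally (csh and bash only).

```python
import os

# Prefix and suffix as quoted in the commit message.
PROFILE_PREFIX = "( ! [ -e ./.profile ] || . ./.profile;"
PROFILE_SUFFIX = ")"


def wrap_remote_command(command, remote_shell):
    """Wrap `command` so ./.profile is sourced first, unless the remote
    shell is csh or bash (hypothetical helper for illustration)."""
    shell_name = os.path.basename(remote_shell)
    if shell_name in ("csh", "bash"):
        # These shells are left alone, per the commit message.
        return command
    # Any other shell: source ./.profile (if it exists) in a subshell.
    return f"{PROFILE_PREFIX} {command} {PROFILE_SUFFIX}"


wrapped = wrap_remote_command("orted", "/bin/sh")
unwrapped = wrap_remote_command("orted", "/bin/bash")
```

The `! [ -e ./.profile ] ||` guard keeps the wrapped command from failing when no .profile exists on the remote side.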
This commit was SVN r6731.
- we now properly support multiple application contexts
- much improved error messages, using opal_show_help
- fix some small bugs in the way the processes were discovering their names
- better searching for orted
- use the new mca parameter interface
These changes still need some testing, but they seem stable.
This commit was SVN r6719.
- Add functionality to parse multiple arguments provided in the console
- Cleaned up help function
- Added an option to hide commands from the help menu
Working on launching and reaping of daemons from within the console.
This commit was SVN r6699.
This required a little fiddling with a number of areas. The biggest problem was that it uncovered the potential for an infinite loop in the registry. If a callback function modified the registry, the registry checked the triggers to see if anything had fired. If the original callback was due to a trigger firing, that condition hadn't changed, so the trigger fired again, which caused the callback to be called, which modified the registry, which checked the triggers, and so on.
Triggers are now checked and then "flagged" as being "in process" so that the registry will NOT recheck that trigger until all callbacks have been processed. Tried doing this with subscriptions as well, but that caused a problem - when we release processes from a stagegate, they (at the moment) immediately place data on the registry that should cause a subscription to fire. Unfortunately, the system will just hang if that subscription doesn't get processed. So, I have left the subscription system alone - any callback function that modifies the registry in a fashion that will fire a subscription will indeed fire that subscription. We'll have to see if this causes problems - it shouldn't, but a careless user could lock things up if the callback generates a callback to itself.
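A minimal sketch of the "in process" flag, assuming a toy registry: the class `ToyRegistry`, its methods, and the dict-based trigger records are all hypothetical and do not reflect the real gpr data structures. It only shows how flagging a fired trigger stops a callback that modifies the registry from re-firing the same trigger.

```python
class ToyRegistry:
    """Toy model of a registry with triggers (illustrative only)."""

    def __init__(self):
        self.data = {}
        self.triggers = []

    def add_trigger(self, condition, callback):
        self.triggers.append(
            {"condition": condition, "callback": callback, "in_process": False}
        )

    def put(self, key, value):
        # Every modification re-checks the triggers.
        self.data[key] = value
        self._check_triggers()

    def _check_triggers(self):
        # Skip triggers already flagged as "in process".
        fired = [
            t for t in self.triggers
            if not t["in_process"] and t["condition"](self.data)
        ]
        for t in fired:
            t["in_process"] = True
        try:
            for t in fired:
                t["callback"](self)
        finally:
            # Clear the flags only after all callbacks have been processed.
            for t in fired:
                t["in_process"] = False


calls = []

def release(registry):
    calls.append("fired")
    # The callback modifies the registry; without the flag this would
    # re-fire the same trigger and recurse without end.
    registry.put("released", True)

reg = ToyRegistry()
reg.add_trigger(lambda d: d.get("count", 0) >= 3, release)
reg.put("count", 3)
```

Subscriptions are deliberately not modeled here, since the commit leaves the subscription path unchanged.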
Also fixed the code that placed a process' RML contact info on the registry to eliminate the leading '/' from the string.
This commit was SVN r6684.
- Added user help messages.
- Abstracted the internal commands, and the mechanism for
parsing and executing them.
- Cleaned up the command line parsing
- Some other misc. cleanup items.
Still much more work to do here, but should provide a more
intuitive interface for extending functionality in the
system.
This commit was SVN r6676.
more user friendly error messages.
Removed the "--version" command line option, since users should
get this from ompi_info [later to be orte_info].
If we find an invalid command line option print out the help
screen before exiting.
This commit was SVN r6670.
support in OMPI. Currently only enables/disables the architecture
sharing modex in ob1 pml.
* Add sds framework to ompi_info
* Figure out table ids to use for Portals BTL at configure time, since
we should use 30 & 31 on Red Storm, but the reference implementation
only supports 0-8.
* Some bug fixes in Portals UTCP sds
This commit was SVN r6650.
* Add Portals UTCP reference sds for when we are using the portals
reference implementation without the ORTE starters (when we want to
pretend like we're on Red Storm, only with a debugger and valgrind and
possibly even a printf that actually works...)
* Add super-secret --with flag to cnos rml to enable the cnos rml but
disable cnos_barrier (for use with portals utcp reference implementation)
This commit was SVN r6642.
test from orte_init_stage1 into a new framework, Startup Discovery Service
(sds). This allows us to have more flexibility with platforms like
Red Storm, which do not have a universe in the usual meaning and don't have
a seed daemon they can contact
This commit was SVN r6630.
- only call sched_yield if it exists
- don't fail out if modex doesn't work in ob1
- bunch of fixes for Portals BTL
- add cnos rml component
- add NULL gpr component (should only be used if replica AND proxy
fail to load)
This commit was SVN r6629.
Haven't fully tested these yet (nobody is using them at the moment that I know of - good thing, since they haven't been working for a long time - though I know the MPI-2 stuff needs the functionality), but will do so shortly. For now, they compile.
This commit was SVN r6567.
1. Modify the registry to eliminate redundant data copying for startup messages.
2. Revise the subscription/trigger system to avoid redundant storage of triggers and subscriptions. This dramatically reduces the search time when a registry action occurs - to illustrate the point, there are now only a handful of triggers on the system for each job. Before, there were a handful of triggers for each PROCESS in the job, all of which had to be checked every time something happened on the registry. This is much, much faster now.
3. Update all subscriptions to the new format. There are now "named" subscriptions - this allows you to "name" a subscription that all the processes will be using. The first one to hit the registry actually defines the subscription. From then on, any subsequent "subscribes" to the same name just cause that process to "attach" to the existing subscription. This keeps the number of subscriptions being tracked by the registry to a minimum, while ensuring that each process still gets notified.
4. Do the same for triggers.
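The "named" subscription idea in item 3 can be sketched as below. This is a hypothetical Python model, not the registry's actual API: `SubscriptionTable`, `subscribe`, and `notify` are names invented for the example. The first subscriber to a name defines the subscription; later subscribers only attach to it.

```python
class SubscriptionTable:
    """Toy table of named subscriptions (illustrative only)."""

    def __init__(self):
        self._by_name = {}

    def subscribe(self, name, process, definition=None):
        sub = self._by_name.get(name)
        if sub is None:
            # First process to use this name defines the subscription.
            sub = {"definition": definition, "attached": []}
            self._by_name[name] = sub
        # Every process, including the first, attaches to it.
        sub["attached"].append(process)
        return sub

    def notify(self, name, data):
        # Each attached process still gets its own notification.
        return [(proc, data) for proc in self._by_name[name]["attached"]]

    def __len__(self):
        return len(self._by_name)


table = SubscriptionTable()
for rank in range(4):
    table.subscribe("modex", rank, definition="watch modex keys")

notices = table.notify("modex", "payload")
```

Four processes subscribing under one name leave a single tracked subscription, while `notify` still produces one notice per process.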
Also fixed a duplicate subscription problem that was causing people to receive data equal to the number of processes times the data they should have received from a trigger/subscription. Sorry about that... :-( ...but it's all better now!
Uncovered a situation where the modex data seems to be getting entered on the registry a second time - the latter time coming after the compound command has been "fired", thereby causing all the subscriptions to fire. Asked Tim and Jeff to look into this.
Second phase of the changes will involve modifying the xcast system so that the same message gets sent to all processes. This will further reduce the message traffic, and - once we have a true "broadcast" version of xcast - really speed things up and improve scalability.
This commit was SVN r6542.
1. Fix the registry's overwrite logic. It was only overwriting the first keyval specified in a value - the rest were just added on regardless of whether or not the keyval already existed. This was the source of the multiple keyvals some people were seeing - should be fixed now.
2. Change the orted command parsing options so it reports options that aren't recognized - should help reduce confusion
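The overwrite fix in item 1 can be illustrated with a small sketch, assuming a container modeled as a list of (key, value) pairs. The function name `put_keyvals` and this layout are hypothetical, not the registry's real structures. The point is that every keyval in the incoming value is checked against the container, not only the first.

```python
def put_keyvals(container, new_keyvals):
    """Store each incoming keyval, overwriting an existing entry with the
    same key instead of appending a duplicate (illustrative only)."""
    for key, value in new_keyvals:
        for i, (existing_key, _) in enumerate(container):
            if existing_key == key:
                # Key already present: overwrite in place.
                container[i] = (key, value)
                break
        else:
            # Key not present: add it.
            container.append((key, value))
    return container


stored = [("a", 1), ("b", 2)]
put_keyvals(stored, [("a", 10), ("b", 20), ("c", 30)])
```

Under the buggy behavior described above, only "a" would have been overwritten and "b" would have ended up stored twice.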
This commit was SVN r6536.