- move files out of toplevel include/ and etc/, moving them into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h,
mpi.h, and mpif.h)
This commit was SVN r8985.
The INIT counter is supposed to be adjusted when the processes are mapped - this is now done correctly.
The LAUNCHED counter is supposed to be adjusted when the pls sets the process pid info into the registry and changes the state to LAUNCHED. This could probably be changed to have that function use the set_proc_soh API, but this fixes the problem for now.
Thanks to Brian for finding that the triggers were not being fired.
This commit was SVN r8948.
cleanup code in the signal part of the event library
* Only attempt to forward standard input if we have a controlling terminal
(isatty() returns 1) and we are the foreground process OR we do not have
a controlling terminal (isatty() returns 0). If we have a controlling
terminal, check at each SIGCONT if we should change our forwarding,
since our foreground / background status may have changed.
Unfortunately, there isn't a great way in the iof framework to know if
we are capturing a starter's stdin. Use the logic that if it's a source
AND tagged as standard input, it's a starter's stdin. This seems to
work for all the common usages.
Both these need to go to the v1.0 branch.
This commit was SVN r8894.
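The stdin-forwarding rule described in r8894 can be sketched as follows. This is a minimal illustration in Python rather than the actual iof code; the function names and the `on_change` hook are hypothetical, and the foreground test (comparing the terminal's foreground process group with our own) is an assumption about how "we are the foreground process" would be checked.

```python
import os
import signal


def should_forward_stdin(fd: int) -> bool:
    """Return True if stdin on `fd` should be forwarded right now."""
    if not os.isatty(fd):
        # isatty() is false (pipe, file, /dev/null): always forward.
        return True
    try:
        # Assumed foreground test: the terminal's foreground process
        # group is the same as our own process group.
        return os.tcgetpgrp(fd) == os.getpgrp()
    except OSError:
        # Could not query the terminal; be conservative, do not forward.
        return False


def install_sigcont_recheck(fd: int, on_change) -> None:
    """Re-evaluate forwarding at each SIGCONT (hypothetical hook)."""
    def handler(signum, frame):
        # Foreground / background status may have changed while stopped.
        on_change(should_forward_stdin(fd))

    signal.signal(signal.SIGCONT, handler)
```

With a pipe as stdin the function always answers "forward"; with a terminal the answer depends on whether the job is currently in the foreground.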
the svc component so that it can disable the rml exception callback, fixing
a race condition in the shutdown mechanism of orte.
This should probably go to the v1.0 branch.
This commit was SVN r8893.
discards all of the data in the pty that hasn't been read. This was
leading to data being discarded when files were redirected into
mpirun and read by rank 0 of the job. This was very "not good".
The decision to not use ptys for stdin was made based on what Tim said
that LA-MPI was doing.
This needs to go to the v1.0 branch... Tim should probably review...
This commit was SVN r8892.
intended to include the OMPI_DEBUG_ZERO call).
These debugging statements should not have affected correctness
because the value of 78 will be overridden in the read() and the
assert()/abort() stuff will only be triggered on an error which should
never happen (i.e., the error should have been handled by the prior if
conditional). But still, this code should not be there.
This commit was SVN r8649.
The following SVN revision numbers were found above:
r8643 --> open-mpi/ompi@a6b869ed68
and path->agent_path so that it's totally clear what these are for
- make a new rsh component param for agent_param (the value from the
MCA param)
- delay the path check for the agent until the component init -- don't
make it fail during open, because the MCA base will print a warning
if a component fails open() (e.g., on clusters without rsh/ssh (!),
this component was failing noisily even though it was
normal/expected)
This commit was SVN r8596.
When checking for the shell with --mca pls_rsh_assume_same_shell 0,
have the node point to sensible values.
This commit was SVN r8563.
The following SVN revision numbers were found above:
r7664 --> open-mpi/ompi@0629cdc2d7
now take a colon-delimited list of agents (and associated argv). Also
change the default value to "ssh : rsh". Hence, if we run on a
cluster that does not have ssh, we'll fall back to rsh. If we can't
find rsh, then the rsh component will disqualify itself from
selection.
This commit was SVN r8514.
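The fallback behaviour of the colon-delimited agent list in r8514 can be sketched like this. It is an illustrative Python version, not the rsh component's code: `parse_agents` and `pick_agent` are hypothetical names, and splitting each entry with shell-style quoting is an assumption.

```python
import shlex
import shutil


def parse_agents(value: str) -> list[list[str]]:
    """Split e.g. "ssh : rsh -l me" into one argv list per agent."""
    agents = []
    for entry in value.split(":"):
        argv = shlex.split(entry)
        if argv:  # skip empty entries such as a trailing ':'
            agents.append(argv)
    return agents


def pick_agent(value: str, which=shutil.which):
    """Return the argv of the first agent found in PATH, or None.

    None corresponds to the component disqualifying itself.
    """
    for argv in parse_agents(value):
        path = which(argv[0])
        if path is not None:
            return [path] + argv[1:]
    return None
```

Calling `pick_agent("ssh : rsh")` on a cluster without ssh would fall through to rsh, and return `None` if neither is installed.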
- Need to make sure that SIZE_MAX exists as a constant if stdint.h
doesn't exist
- struct timeval is defined in unistd.h on IRIX, so need to include
that header file wherever struct timeval is used.
This commit was SVN r8361.
- when eof is reached at orterun, send a 0 byte message to peer indicating eof
- on receipt of zero byte message - close corresponding file descriptor associated with the endpoint
- require setup ptys for stdin and stdout so that stdin can be closed independently of stdout
This commit was SVN r8264.
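The zero-byte EOF convention from r8264 can be illustrated with a small sketch. The framing (a 4-byte length header) and all function names are assumptions made for the example; only the idea that an empty message signals EOF and causes the receiver to close its descriptor comes from the commit message.

```python
import socket
import struct


def send_frame(sock: socket.socket, payload: bytes) -> None:
    # Assumed framing: 4-byte big-endian length, then the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf += chunk
    return buf


def recv_frame(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def forward(source, sock: socket.socket, chunk: int = 4096) -> None:
    """Sender side: read() returns b"" at EOF, which is sent as a 0-byte message."""
    while True:
        data = source.read(chunk)
        send_frame(sock, data)
        if not data:
            return


def drain(sock: socket.socket, sink) -> None:
    """Receiver side: a 0-byte message closes the sink for this endpoint."""
    while True:
        data = recv_frame(sock)
        if not data:
            sink.close()
            return
        sink.write(data)
```

Because only the sink tied to that endpoint is closed, other streams on the same connection can stay open.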
* turns out (duh!) that there was a reason that the <projectdir>dir
variable was set in the AM conditional. If not, stupid directories
are created and not needed... duh.
This commit was SVN r8205.
component/base Makefile.am files, reducing the time configure spends
stamping out Makefiles at the end
* Install base_impl.h file when devel-headers are being installed
This commit was SVN r8200.
spinning in orte_iof_base_flush() when running
intel_tests/src/MPI_Errhandler_fatal_c
When we close an endpoint by taking it out of the event handler, we need to make
sure that it fits the criteria to pass through orte_iof_base_flush(), specifically
make sure we clean out the ep_frags list.
Note: This is more of a sanity check, since the endpoint should already be
in this state at the point of closure.
Secondly in orte_iof_base_endpoint_read_handler(), if we determine that it is
necessary to close the endpoint we have to "return" after doing so, otherwise
we add another frag to the endpoint which will cause it to hang in
orte_iof_base_flush().
Bug go squish!
This commit was SVN r8109.
it's not needed and there could be multiple sources each w/ their
own sequence.
- if a write doesn't complete, need to check for non-blocking case..
This commit was SVN r7795.
originally suggested by Ralf Wildenhues, to try to speed autogen, configure,
and make (and possibly even make install). Use automake's include directive
to drastically reduce the number of Makefile files (although the number of
Makefile.am files is the same - most are just included in a top-level
Makefile.am). Also use an Automake SUBDIRS feature to eliminate the
dynamic-mca tree, which was no longer really needed. This makes adding
a framework easier (since you don't have to remember the dynamic-mca
tree) and makes building faster (as make doesn't have to recurse through
the dynamic-mca tree)
This commit was SVN r7777.
oversubscribed on a node. And thus whether to call sched_yield or not.
The value of node->node_slots_inuse does not currently represent the number of
slots actually in use, at the moment. This is actually a bug in the RAS/RMAPS
base components, but the fix for that specific bug is bigger than we want to
address at the moment (but will certainly do so in the near future).
Since we cannot trust this value, use the total number of mapped processes
(which was properly set by the RMAPS component upon mapping -- Just not
properly propagated back to the registry's node segment) from the process
mapping.
In addition to this change I cleaned up a couple of the debug messages. It
seems that TM and RSH are the only two directly affected by this. SLURM
would be if that section of code wasn't currently inactive, but put the fix
in for posterity.
This commit was SVN r7743.
too distant past
* work around apparently broken handling of max_slots somewhere along
the line by just setting it to 0
Both changes should go to the trunk.
This commit was SVN r7710.
this loop if "nodes" is an empty list. get_first, in this loop context,
allows us to do just that, while get_begin doesn't.
This fixes a --host problem that appeared on the Linux PPC64 build.
This commit was SVN r7703.
Also revised the callbacks to store and utilize local variables to avoid problems where threads modify the global structures. Not sure this totally fixes the problem, but it's a shot - suggested by Josh (and Jeff, I believe).
This commit was SVN r7694.
command:
svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps .
(where "." is a trunk checkout)
The logs from this branch are much more descriptive than I will put
here (including a *really* long description from last night). Here's
the short version:
- fixed some broken implementations in ras and rmaps
- "orterun --host ..." now works and has clearly defined semantics
(this was the impetus for the branch and all these fixes -- LANL had
a requirement for --host to work for 1.0)
- there is still a little bit of cleanup left to do post-1.0 (we got
correct functionality for 1.0 -- we did not fix bad implementations
that still "work")
- rds/hostfile and ras/hostfile handshaking
- singleton node segment assignments in stage1
- remove the default hostfile (no need for it anymore with the
localhost ras component)
- clean up pls components to avoid duplicate ras mapping queries
- [possible] -bynode/-byslot being specific to a single app context
This commit was SVN r7664.
it's always ORTE_SUCCESS and sometimes masks real !=ORTE_SUCCESS rc
values.
- Add MCA param pls_tm_want_path_check. If nonzero (the default),
check for the orted in the PATH before each tm_spawn()'ing (doing a
little caching so that we don't hammer on the filesystem -- remember
all the PATH's where we successfully found the orted so that we
don't have to query the filesystem multiple times for a PATH where
we previously found the orted)
- Be sure to opal_argv_split() the pls_tm_orted MCA param
This commit was SVN r7625.
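The PATH caching described for pls_tm_want_path_check in r7625 might look roughly like this. It is a sketch only: the cache structure, the function names, and the injectable `is_executable` check are all hypothetical and not taken from the tm component.

```python
import os

# PATH strings in which the executable was already found once.
_paths_with_orted: set[str] = set()


def _is_executable(candidate: str) -> bool:
    return os.path.isfile(candidate) and os.access(candidate, os.X_OK)


def orted_in_path(path_value: str, exe: str = "orted",
                  is_executable=_is_executable) -> bool:
    """Check for `exe` in `path_value`, remembering successful PATHs."""
    if path_value in _paths_with_orted:
        return True  # cached hit: no filesystem query at all
    for directory in path_value.split(os.pathsep):
        if is_executable(os.path.join(directory or ".", exe)):
            _paths_with_orted.add(path_value)
            return True
    return False
```

Only successes are cached, so a PATH that fails is re-checked on the next spawn, matching the "remember all the PATH's where we successfully found the orted" wording.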
will fire some subscriptions that will eventually result in invoking
terminate_job (i.e., terminate anything that may have been
successfully started by launch).
This commit was SVN r7622.
If you use --prefix and then "-x LD_LIBRARY_PATH", the rsh pls would
take great pains to ensure that PATH and LD_LIBRARY_PATH were setup
correctly on the local and remote nodes, but then the fork pls would
blithely overwrite LD_LIBRARY_PATH with what the user exported (i.e.,
most likely without our prefix). This patch takes care of that -- the
fork pls examines the incoming environment, and if it sees PATH or
LD_LIBRARY_PATH, it re-prefixes those variables.
This commit was SVN r7566.
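The re-prefixing step from r7566 can be sketched as below. This is an illustration, not the fork pls code: `reprefix_env` is a hypothetical name, and the use of `<prefix>/bin` for PATH and `<prefix>/lib` for LD_LIBRARY_PATH is an assumption about the install layout.

```python
def reprefix_env(env: dict[str, str], prefix: str) -> dict[str, str]:
    """Put the prefix directories first in PATH / LD_LIBRARY_PATH.

    Only variables already present in `env` are touched, mirroring
    "if it sees PATH or LD_LIBRARY_PATH, it re-prefixes those variables".
    """
    out = dict(env)
    # Assumed layout: executables in <prefix>/bin, libraries in <prefix>/lib.
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        if var in out:
            ours = f"{prefix}/{subdir}"
            # Drop empty entries and any earlier copy of our directory
            # so that repeated application does not grow the variable.
            rest = [p for p in out[var].split(":") if p and p != ours]
            out[var] = ":".join([ours] + rest)
    return out
```

Applied to an environment where the user exported their own LD_LIBRARY_PATH with `-x`, the prefix library directory ends up in front again.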
it is possible that if the receive has already arrived the callback will
be called before recv_buffer_nb() returns. This causes deadlock
as we try to acquire the lock, but already hold it.
This was causing orterun and orteds to stall in certain situations.
Became evident when stress testing dynamics with remote nodes.
This commit was SVN r7543.
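The hazard described in r7543 (a callback running synchronously while the poster still holds a non-recursive lock) can be reproduced in miniature. Everything here is hypothetical: `post_recv_nb` merely stands in for a non-blocking receive, and releasing the lock before posting is one common remedy, not necessarily the fix that was committed.

```python
import threading

lock = threading.Lock()  # non-reentrant, like a plain mutex


def post_recv_nb(callback, already_arrived: bool) -> None:
    """Hypothetical stand-in for posting a non-blocking receive."""
    if already_arrived:
        callback()  # fires before this function returns


def make_callback(log: list):
    def callback():
        # A blocking acquire() would hang forever if the poster still
        # holds the lock; a non-blocking attempt lets the sketch report it.
        if lock.acquire(blocking=False):
            try:
                log.append("handled")
            finally:
                lock.release()
        else:
            log.append("would deadlock")
    return callback


def post_while_holding_lock(log: list) -> None:
    with lock:
        post_recv_nb(make_callback(log), already_arrived=True)


def post_after_releasing_lock(log: list) -> None:
    with lock:
        pass  # update shared state under the lock
    post_recv_nb(make_callback(log), already_arrived=True)
```

The first variant hits the self-deadlock path whenever the message is already queued; the second cannot, because the lock is free by the time the callback may run.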
- Fix bug identified by users: --prefix may also apply on the local
node; we need to prefix the PATH and LD_LIBRARY_PATH environment
variables before invoking execve()
This commit was SVN r7541.
some orted's to stall on locks in the MPI Dynamics cases. Since it
is not essential that we call these functions, they can go away.
Unlock the peer lock when aborting. This causes a potential deadlock
in do_waitall [see comment in code]. This was causing orteds to
deadlock at times when the seed had terminated. With proper interleaving
and timing the orted was deadlocking. This seems to have fixed this in
my stress testing with MPI 2 Dynamics.
This commit was SVN r7539.
The NS replica should give out tags that are over ORTE_RML_TAG_DYNAMIC
or it will overlap with other outstanding tags. This overlap was killing
MPI_Comm_spawn when a program tried to use it multiple times (> 3).
With this fix MPI_Comm_spawn is behaving properly.
A program can call it many times in a row without problem.
NOTE: Not tested for multi-threaded build yet
(A long time debugging for a one liner... :/)
This commit was SVN r7529.
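The rule in r7529, that dynamically assigned tags must lie above ORTE_RML_TAG_DYNAMIC, can be shown with a toy allocator. The numeric value below is a placeholder chosen for the example (the real constant is not given in the message), and `TagAllocator` is a hypothetical class, not the NS replica code.

```python
import itertools

# Placeholder value for illustration only; the real constant lives in ORTE.
ORTE_RML_TAG_DYNAMIC = 2000


class TagAllocator:
    """Hand out unique tags strictly above the reserved range."""

    def __init__(self, dynamic_base: int = ORTE_RML_TAG_DYNAMIC):
        # Start one past the boundary so no reserved tag is ever reused.
        self._next = itertools.count(dynamic_base + 1)

    def assign(self) -> int:
        return next(self._next)
```

Starting the counter below the boundary would eventually hand out a tag that collides with a reserved one, which fits the failure only showing up after several spawns.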
Have the ras_base_schedule_policy MCA parameter working once again. Before, it
would only do slot based allocation, even if the MCA parameter was set properly.
Currently you can specify to orterun a node allocation by either:
-mca ras_base_schedule_policy node
-bynode
and slot allocation (which is the default) by:
-mca ras_base_schedule_policy slot
-byslot
This commit was SVN r7513.
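The difference between the two policies in r7513 can be sketched as two placement functions. This is a simplified illustration with hypothetical names; it assumes unique node names and refuses oversubscription, which the real mapper may handle differently.

```python
def map_by_slot(nodes: list[tuple[str, int]], nprocs: int) -> list[str]:
    """Fill every slot on a node before moving to the next node."""
    placement = []
    for name, slots in nodes:
        for _ in range(slots):
            if len(placement) == nprocs:
                return placement
            placement.append(name)
    if len(placement) < nprocs:
        raise ValueError("not enough slots")
    return placement


def map_by_node(nodes: list[tuple[str, int]], nprocs: int) -> list[str]:
    """Place one process per node in turn, skipping full nodes."""
    if nprocs > sum(slots for _, slots in nodes):
        raise ValueError("not enough slots")
    used = {name: 0 for name, _ in nodes}
    placement = []
    while len(placement) < nprocs:
        for name, slots in nodes:
            if len(placement) == nprocs:
                break
            if used[name] < slots:
                used[name] += 1
                placement.append(name)
    return placement
```

For two nodes with two slots each and four processes, the slot policy groups consecutive ranks on the same node, while the node policy alternates between nodes.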
with the mutex locked and as this function will call oob_send which will call the lookup again
... we will deadlock as the mutex is already locked. The solution is to release the mutex before
going into the subscription. Then of course the logic to remove the item when something went
wrong with the subscription is a little bit more complex.
This commit was SVN r7429.
add an event, it can call the spawn function directly. This avoids waiting on a condition that
will never be released.
This commit was SVN r7428.
This fixes one of the race conditions when orterun is sent a kill signal.
Before it would sometimes spin in the OOB waiting for a message to complete
to a peer that was no longer around. Stalling at this level prevented orterun
from noticing that it had received a kill signal.
This commit was SVN r7408.
automagically don't build on platforms without such things
* Fix for mistaken use of cache variable in assembly setup
* one more cached test hits the books
This commit was SVN r7404.
LIBADD instead of appending to the existing one.
Also removed some more Makefile.options whitespace, and I think emacs
removed some tabs (i.e., replaced them with whitespace).
This commit was SVN r7399.
Makefile.options
- Sample in each of the three projects of how to link against the
relevant libraries so that when components are loaded into a parent
process' space, we don't rely on the libopal/liborte/libmpi symbols
being in the parent's public symbol namespace -- instead,
dynamically link to the relevant libraries, allowing the dynamic
linker to pull those libraries in at run-time, if needed
This commit was SVN r7397.
However we do want to do a bit of cleanup on the node before we exit,
specifically clean out the session directory. I also had a couple of the
subsystems that don't depend upon peers (which is key) clean up as well.
Pedantic formatting issue in oob_tcp.h
This commit was SVN r7387.
caller to specify a subset of the state variables that it can subscribe to.
This is specified with one of three special flags defined in rmgr/rmgr_types.h
This is useful when we only care about a subset of the state changes, such as
in orted which only needs to know when a job has terminated or aborted.
This commit was SVN r7356.
The following formats are parsed:
user@IPv4
user@fqdn
IPv4 or fqdn [username|user-name|user_name]=user
- Try a better error-detection when parsing (recognize wrong
IPs, fqdns...)
This commit was SVN r7288.
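The entry formats listed in r7288 can be parsed along these lines. The parser below is a sketch with hypothetical names; the validation rules (an all-numeric name must be a valid IPv4 address, other names must consist of ordinary host-name labels) are assumptions about what "recognize wrong IPs, fqdns" might cover.

```python
import ipaddress
import re

_USER_KEY = re.compile(r"^(?:username|user-name|user_name)=(\S+)$")
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?")


def _check_host(host: str) -> str:
    labels = host.split(".")
    if all(label.isdigit() for label in labels):
        # Looks like an IPv4 address: raises ValueError on e.g. 300.1.1.1
        ipaddress.IPv4Address(host)
    elif not all(_LABEL.fullmatch(label) for label in labels):
        raise ValueError(f"bad host name: {host!r}")
    return host


def parse_host_entry(line: str) -> tuple[str, "str | None"]:
    """Return (host, user) for 'user@host' or 'host username=user'."""
    fields = line.split()
    if not fields:
        raise ValueError("empty entry")
    user = None
    first = fields[0]
    if "@" in first:
        user, _, first = first.partition("@")
        if not user:
            raise ValueError("missing user before '@'")
    host = _check_host(first)
    for field in fields[1:]:
        match = _USER_KEY.match(field)
        if match:
            user = match.group(1)
    return host, user
```

Fields other than the three user keys are ignored here, so an entry carrying additional key=value settings would still parse.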
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
number to be set at autoconf time (instead of at configure time, as
it was before). Set the version number, minus the subversion r number,
at autoconf time. Override the internal variables to include the r
number (if needed) at configure time. Basically, the right thing
should always happen. The only place it might not is the version
reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE takes a list of options, no need to specify
them in all the Makefile.am files.
* Adds support for subdir-objects, meaning that object files are put
in the directory containing source files, even if the Makefile.am is
in another directory. This should start making it feasible to
reduce the number of Makefile.am files we have in the tree, which
will greatly reduce the time to run autogen and configure.
This commit was SVN r7211.
This allows the user to specify certain options to srun when an application
is launched with this PLS.
A useful example is the need to set the time to wait from when the first
process completes and when slurm kills remaining processes:
pls_slurm_args=--wait=1200
This commit was SVN r7206.
app_context:
mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
-np 2 -prefix /path/to/ompi/on/machineB ./exec2
- Allow with -mca pls_rsh_assume_same_shell 0, the checking for the
SHELL-variable on the actual node (currently 1st node).
Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and
csh/tcsh.
This commit was SVN r7195.
CTRL-C'd.
We were calling orte_finalize recursively which caused a segv when it tried to
use a freed framework (orte_rmgr in this case).
I added a status flag to orte_universe_info to indicate where we are in the code.
This was needed to determine if we should call orte_abort or not when shutting
down in the tcp oob.
This commit was SVN r7160.
1. Valgrind is good for something - chasing down memory leaks in registry led me to re-visit the dictionary functions and discover that I wasn't keeping track of the number of dictionary entries on each segment! Resulted in wasted time searching blank entries as well as leaked memory. This has now been fixed.
2. Fixed the orte_bitmap test. The init function for that class has been eliminated and the constructor adjusted to provide that functionality.
This commit was SVN r7136.
1. user does NOT specify the universe name. For the default universe case, if we detect an existing default universe and cannot connect to it, we quietly create an alternative default name by adding the pid to the orte_default_universe name and move on - we no longer provide a warning message for this case.
2. user specified a universe name. If we detect an existing universe of that name and cannot connect to it, we consider this an error condition and abort.
This commit was SVN r7131.
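The two cases in r7131 reduce to a small decision function. This is a sketch under stated assumptions: the default name string and the `-<pid>` suffix format are placeholders (the message only says the pid is added to the orte_default_universe name), and `choose_universe` is a hypothetical name.

```python
import os

# Placeholder for the orte_default_universe name.
DEFAULT_UNIVERSE = "default-universe"


def choose_universe(requested, unreachable: set, pid=None) -> str:
    """Pick a universe name given the set of existing but unreachable ones."""
    if requested is None:
        # Case 1: no name given. Quietly fall back to default + pid.
        if DEFAULT_UNIVERSE in unreachable:
            suffix = os.getpid() if pid is None else pid
            return f"{DEFAULT_UNIVERSE}-{suffix}"  # assumed separator
        return DEFAULT_UNIVERSE
    # Case 2: explicit name. An unreachable universe is an error.
    if requested in unreachable:
        raise RuntimeError(f"cannot connect to universe {requested!r}")
    return requested
```

The asymmetry is deliberate: a user who asked for a specific universe presumably wants that one, so silently substituting another name would hide the problem.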
1. Added OMPI_PROC_ARCH as a defined registry key and added the code so that the architecture info gets properly transmitted across all processes using the startup message.
2. Added an OMPI_MODEX_KEY definition and removed the hard-coded "modex" key from pml_modex_exchange
This commit was SVN r7129.
add a -I to find the included ltdl.h (vs. a system-installed ltdl.h)
- Clean up cruft in a bunch of Makefile.am's to remove now-unnecessary
AM_CPPFLAGS settings to get static-components.h for each framework
- Move the component_repository API functions out of opal/mca/base/base.h
and into opal/mca/base/mca_base_component_repository.h in order to
decrease unnecessary dependencies (e.g., before this, almost
everything in the tree depended on ltdl.h, which is unnecessary --
only a small number of files really need ltdl.h)
This commit was SVN r7127.
Here's the huge registry check-in you've all been waiting for with bated breath. The revised version sends a single message to all processes at the various stage gates, thus making the startup much more scalable. I could provide you with all the tawdry details, but won't for now - you are welcome to ask, though, and I'll merrily bore your ears to tears.
In addition, the commit contains the following:
1. set the ignore properties on ompi/debuggers and orte/mca/pls/poe
2. Added simplified subscribe and put functions to the registry's API. I have also converted all of the ompi functions that registered subscriptions to the new API, and caught their associated put's as well.
In a follow-on commit, I'll be adding support for George's hetero arch registry subscription (wanted to get this one in first).
This commit was SVN r7118.
it to be an exit.
* Put the srun process (or what is about to become the srun process) in
its own process group so that group-wide signals (such as the
SIGINT sent by hitting Ctrl-C in a shell) are not sent to the srun
process.
This commit was SVN r7068.
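The process-group isolation described in r7068 can be demonstrated on a POSIX system as follows. This is an illustrative Python sketch, not the slurm pls code; `launch_in_own_group` is a hypothetical helper, and calling `setpgid(0, 0)` in the child before exec is assumed to be the mechanism.

```python
import os
import subprocess
import sys


def launch_in_own_group(argv: list[str]) -> subprocess.Popen:
    """Start `argv` as the leader of a new process group (POSIX only).

    setpgid(0, 0) runs in the child between fork and exec, so a
    terminal-generated SIGINT aimed at the parent's group skips the child.
    """
    return subprocess.Popen(argv, preexec_fn=lambda: os.setpgid(0, 0))


if __name__ == "__main__":
    child = launch_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(1)"])
    print("parent group:", os.getpgrp(),
          "child group:", os.getpgid(child.pid))
    child.wait()
```

The parent then has to forward or translate signals itself if it wants the child to react to Ctrl-C in a controlled way.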
tree.
- fix up #include's throughout the tree (yay contrib/search_replace.pl!)
- remove a few extraneous #include's
- remove orte_sys_info*() from opal_init()/opal_finalize() (it's
already in orte_init_stage1() and orte_system_finalize())
- remove dependencies in opal on orte_system_info -- util/os_path.c
and util/os_create_dirpath.c (they only used path_sep, anyway --
easily changed to #defines)
This commit was SVN r7059.
session directory cleanup (among other things)
- When we get an abnormal exit in orterun (i.e., timeout expires and
we haven't gotten termination notices from all processes), print a
better message and exit in a better way (which includes session
directory cleanup)
- Fix tm and poe pls's to not exit() but rather propagate the error up
the stack (where relevant)
This commit was SVN r7058.
- Change orte_base_infrastructre to orte_infrastructre to conform with
ompi_info's needs
- Move MCA Param registration in ORTE to a centralized function that is
called first in orte_init_stage1
- Set the infrastructure flag as an argument to orte_init
- Adjust initialization functions to properly pass down the infrastructure
flag.
This commit was SVN r7053.
Also check to see if infrastructure flag was previously set before assuming it
to be false. This was causing orterun to operate incorrectly in the presence
of a persistent daemon.
This commit was SVN r7039.
NOTE: These have NOT been added to the Makefile.am in the repository. Please do NOT add them at this time - I will do so later.
This commit was SVN r6979.
OPAL_ERROR, same for all the other error codes. Also, make sure that there
are never conflicts between OPAL and ORTE error codes (for example).
Finally, provide opal_perror(), opal_strerror(), and opal_strerror_r() to
give stringified error messages for the different error codes
This commit was SVN r6969.
multi-client issues the old version had. Also, ignore the NULL iof
component, since we shouldn't use it when using the proxy orteds
This commit was SVN r6939.