- Need to make sure that SIZE_MAX exists as a constant if stdint.h
doesn't exist
- struct timeval is defined in unistd.h on IRIX, so need to include
that headerfile where ever struct timeval is used.
This commit was SVN r8361.
- when eof is reached at orterun, send a 0 byte message to peer indicating eof
- on receipt of zero byte message - close corresponding file descriptor associated with the endpoint
- require setup ptys for stdin and stdout so that stdin can be closed independently of stdout
This commit was SVN r8264.
debugger scheme described in
http://www.open-mpi.org/community/lists/users/2005/10/0214.php. This
makes our user-level debugger scheme much more vendor-independent
(although the "-tv" option will still work for backwards compatibility
-- it'll just be a synonum of "--debug").
This commit was SVN r8206.
* turns out (duh!) that there was a reason that the <projectdir>dir
variable was set in the AM conditional. If not, stupid directories
are created and not needed... duh.
This commit was SVN r8205.
component/base Makefile.am files, reducing the time configure spends
stamping out Makefiles at the end
* Install base_impl.h file when devel-headers are being installed
This commit was SVN r8200.
When compiling C++ code that includes something that looks for the C++
header file "memory" (stupid C++ headers not having .h extensions), it
goes through the header file search path, which includes $(topsrcdir)/opal,
so it finds the directory $(topsrcdir)/opal/memory/ and tries to load
that as the memory header file and all goes downhill.
This commit was SVN r8111.
spining in orte_iof_base_flush() when running
intel_tests/src/MPI_Errhandler_fatal_c
When we close an endpoint by taking it out of the envent handler, we need to make
sure that it fits the criteria to pass through orte_iof_base_flush(), specificly
make sure we clean out the ep_frags list.
Note: This is more of a sanity check, since the endpoint should already be
in this state at the point of closure.
Secondly in orte_iof_base_endpoint_read_handler(), if we determine that it is
necessary to close the endpoint we have to "return" after doing so, otherwise
we add another frag to the endpoint which will cause it to hang in
orte_iof_base_flush().
Bug go squish!
This commit was SVN r8109.
OS X (you get an undefined symbol opal_event_lock). Since the code is
all #if 0'ed out, #if 0 out the header for now as well.
I believe console and openmpi are to be removed from OMPI before 1.0
release, so this doesn't need to go to the 1.0 branch
This commit was SVN r8089.
This takes care of Troy's first segfault problem, and compile errors that will likely happen as soon as Ken applies George's patch and runs make again.
This commit was SVN r7833.
its not needed and there could be multiple sources each w/ their
own sequence.
- if a write doesn't complete, need to check for non-blocking case..
This commit was SVN r7795.
originally suggested by Ralf Wildenhues, to try to speed autogen, configure,
and make (and possibly even make install). Use automake's include directive
to drastically reduce the number of Makefile files (although the number of
Makefile.am files is the same - most are just included in a top-level
Makefile.am). Also use an Automake SUBDIRs feature to eliminate the
dynamic-mca tree, which was no longer really needed. This makes adding
a framework easier (since you don't have to remember the dynamic-mca
tree) and makes building faster (as make doesn't have to recurse through
the dynamic-mca tree)
This commit was SVN r7777.
oversubscribed on a node. And thus whether to call sched_yield or not.
The value of node->node_slots_inuse does not currently represent the number of
slots actually in use, at the moment. This is actually a bug in the RAS/RMAPS
base components, but the fix for that specific bug is bigger than we want to
address at the moment (but will certianly do so in the near future).
Since we cannot trust this value, use the total number of mapped processes
(which was properly set by the RMAPS component upon mapping -- Just not
properly propagated back to the registry's node segment) from the process
mapping.
In addition to this change I cleaned up a couple of the debug messages. It
seems that TM and RSH are the only two directly effected by this. SLURM
would be if that section of code wasn't currently inactive, but put the fix
in for prosparity.
This commit was SVN r7743.
too distant past
* work around apparently broken handling of max_slots somewhere along
the line by just setting it to 0
Both changes should go to the trunk.
This commit was SVN r7710.
this loop if "nodes" is an empty list. get_first, in this loop context,
allows us to do just that, while get_begin doesn't.
This fixes a --host problem that appeared on the Linux PPC64 build.
This commit was SVN r7703.
Also revised the callbacks to store and utilize local variables to avoid problems where threads modify the global structures. Not sure this totally fixes the problem, but it's a shot - suggested by Josh (and Jeff, I believe).
This commit was SVN r7694.
command:
svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps .
(where "." is a trunk checkout)
The logs from this branch are much more descriptive than I will put
here (including a *really* long description from last night). Here's
the short version:
- fixed some broken implementations in ras and rmaps
- "orterun --host ..." now works and has clearly defined semantics
(this was the impetus for the branch and all these fixes -- LANL had
a requirement for --host to work for 1.0)
- there is still a little bit of cleanup left to do post-1.0 (we got
correct functionality for 1.0 -- we did not fix bad implementations
that still "work")
- rds/hostfile and ras/hostfile handshaking
- singleton node segment assignments in stage1
- remove the default hostfile (no need for it anymore with the
localhost ras component)
- clean up pls components to avoid duplicate ras mapping queries
- [possible] -bynode/-byslot being specific to a single app context
This commit was SVN r7664.
Some of the functions in opal_init are void or return a bool (opal_output_init, but always returns true.. eh?), so I don't check them.
This commit was SVN r7638.
it's always ORTE_SUCCESS and sometimes masks real !=ORTE_SUCCESS rc
values.
- Add MCA param pls_tm_want_path_check. If nonzero (the default),
check for the orted in the PATH before each tm_spawn()'ing (doing a
little caching so that we don't hammer on the filesystem -- remember
all the PATH's where we successfully found the orted so that we
don't have to query the filesystem multiple times for a PATH where
we previously found the orted)
- Be sure to opal_argv_split() the pls_tm_orted MCA param
This commit was SVN r7625.
will fire some subscriptions that will eventually result in invoking
terminate_job (i.e., terminate anything that may have been
successfully started by launch).
This commit was SVN r7622.
at the moment.
Also remove all references to --map, and (C, N) command line options in the
help file. These references will be put back in when these options are
implemented.
This commit was SVN r7574.
If you use --prefix and then "-x LD_LIBRARY_PATH", the rsh pls would
take great pains to ensure that PATH and LD_LIBRARY_PATH were setup
correctly on the local and remote nodes, but then the fork pls would
blitely overwrite LD_LIBRARY_PATH with what the user exported (i.e.,
most likely without our prefix). This patch takes care of that -- the
fork pls examines the incoming environment, and if it sees PATH or
LD_LIBRARY_PATH, it re-prefixes those variables.
This commit was SVN r7566.
it is possible that if the receive has been arrived the callback will
be called before recv_buffer_nb() returns. This causes deadlock
as we try to acquire the lock, but already hold it.
This was causing orterun and orteds to stall in certian situations.
Became evident when stress testing dynamics with remote nodes.
This commit was SVN r7543.
- Fix bug identified by users: --prefix may also apply on the local
node; we need to prefix the PATH and LD_LIBRARY_PATH environment
variables before invoking execve()
This commit was SVN r7541.
some orted's to stall on locks in the MPI Dynamics cases. Since it
is not essentual that we call these functions, they can so away.
Unlock the peer lock when aborting. This causes a potential deadlock
in do_waitall [see comment in code]. This was causing orteds to
deadlock at times when the seed had terminated. With proper interleaving
and timing the orted was deadlocking. This seems to have fixed this in
my stress testing with MPI 2 Dynamics.
This commit was SVN r7539.
The NS replica should give out tags that are over ORTE_RML_TAG_DYNAMIC
or it will overlap with other outstanding tags. This overlap was killing
MPI_Comm_spawn when a program tried to use it multiple times (> 3).
With this fix MPI_Comm_spawn is behaving properly.
A program can call it many times in a row with out problem.
NOTE: Not tested for multi-threaded build yet
(A long time debugging for a one liner... :/)
This commit was SVN r7529.
Have the ras_base_schedule_policy MCA parameter working once again. before it
would only do slot based allocation, even if the MCA parameter was set properly.
Currently you can specify to orterun a node allocation by either:
-mca ras_base_schedule_policy node
-bynode
and slot allocation (which is the default) by:
-mca ras_base_schedule_policy slot
-byslot
This commit was SVN r7513.
with the mutex locked and as this function will call oob_send which will call the lookup again
... we will deadlock as the mutex is already lock. The solution is to release the mutex before
going into the subscription. Then of course the logic to remote the item when something went
wrong with the subscrition is a little bit more complex.
This commit was SVN r7429.
add an event, it can call the spawn function directly. This will avoid it standing on the condition who
will never get released.
This commit was SVN r7428.
This fixes one of the race conditions in orterun is sent a kill signal.
Before it would sometimes spin in the OOB waiting for a message to complete
to a peer that was no longer around. Stalling at this level prevented orterun
from noticing that it had received a kill signal.
This commit was SVN r7408.
automagically don't build on platforms without such things
* Fix for mistaken use of cache variable in assembly setup
* one more cached test hits the books
This commit was SVN r7404.
LIBADD instead of appending to the existing one.
Also removed some more Makefile.options whitespace, and I think emacs
removed some tabs (i.e., replaced them with whitespace).
This commit was SVN r7399.
Makefile.options
- Sample in each of the three projects of how to link againt the
relevant libraries so that when components are loaded into a parent
process' space, we don't rely on the libopal/liborte/libmpi symbols
being in the parent's public symbol namespace -- instead,
dynamically link to the relevant libraries, allowing the dynamic
linker to pull those libraries in at run-time, if needed
This commit was SVN r7397.
However we do want to do a bit of cleanup on the node before we exit,
specificly clean out the session directory. I also had a couple of the
subsystems that don't depend upon peers (which is key) clean up as well.
Pedantic formatting issue in oob_tcp.h
This commit was SVN r7387.
caller to specify a subset of the state variables that it can can subscribe to.
This is specified with one of three special flags defined in rmgr/rmgr_types.h
This is useful when we only care about a subset of the state changes, such as
in orted which only needs to know when a job has terminated or aborted.
This commit was SVN r7356.
waiting instead for the SOH to indicate that the jobid has terminated.
In a scheduled environment, if your program has a section of MPI code
followed by a section of computation that some processes execute while
other proceses terminate normally. This patch keeps the scheduler from
terminating all of the processes and the allocation if all of the processes
on an allocated node exit well before other processes on other nodes.
This commit was SVN r7333.
The following formats are parsed:
user@IPv4
user@fqdn
IPv4 or fqdn [username|user-name|user_name]=user
- Try a better error-detection when parsing (recognize wrong
IPs, fqdns...)
This commit was SVN r7288.
that multiple processes don't overwrite each other. Change that
default in orte_init_stage1() to just "output-" (because the file will
be in a process-unique directory at that point; the pid is no longer
necessary).
This commit was SVN r7256.
opal_output_set_output_file_info(). This allows getting and setting
the default directory where output stream files will be opened (for
all *new* streams). Before this function is not invoked, the default
location is $TMPDIR or $HOME (if $TMPDIR is not defined).
Added a call into orte_init_stage1() to call this function
immediately after the session directory is created and set the default
location of stream files to be the process' session directory.
This commit was SVN r7254.
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
number to be set at autoconf time (instead of at configure time, as
it was before). Set the version number, minus the subversion r number,
at autoconf time. Override the internal variables to include the r
number (if needed) at configure time. Basically, the right thing
should always happen. The only place it might not is the version
reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE taks a list of options, no need to specify
them in all the Makefile.am files.
* Addes support for subdir-objects, meaning that object files are put
in the directory containing source files, even if the Makefile.am is
in another directory. This should start making it feasible to
reduce the number of Makefile.am files we have in the tree, which
will greatly reduce the time to run autogen and configure.
This commit was SVN r7211.
This allows the user to specify certain options to srun when an application
is launched with this PLS.
A useful example is the need to set the time to wait from when the first
process completes and when slurm kills remaining processes:
pls_slurm_args=--wait=1200
This commit was SVN r7206.
app_context:
mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
-np 2 -prefix /path/to/ompi/on/machineB ./exec2
- Allow with -mca pls_rsh_assume_same_shell 0, the checking for the
SHELL-variable on the actual node (currently 1st node).
Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and
csh/tcsh.
This commit was SVN r7195.
CTRL-C'd.
We were calling orte_finalize recursively which caused a segv when it tried to
use a freed framework (orte_rmgr in this case).
I added a status flag to orte_universe_info to indicate where we are in the code.
This was needed to determine if we should call orte_abort or not when shutting
down in the tcp oob.
This commit was SVN r7160.
1. Valgrind is good for something - chasing down memory leaks in registry led me to re-visit the dictionary functions and discover that I wasn't keeping track of the number of dictionary entries on each segment! Resulted in wasted time searching blank entries as well as leaked memory. This has now been fixed.
2. Fixed the orte_bitmap test. The init function for that class has been eliminated and the constructor adjusted to provide that functionality.
This commit was SVN r7136.
1. user does NOT specify the universe name. For the default universe case, if we detect an existing default universe and cannot connect to it, we quietly create an alternative default name by adding the pid to the orte_default_universe name and move on - we no longer provide a warning message for this case.
2. user specified a universe name. If we detect an existing universe of that name and cannot connect to it, we consider this an error condition and abort.
This commit was SVN r7131.