with the mutex locked, and as this function will call oob_send, which will call the lookup again,
we will deadlock because the mutex is already locked. The solution is to release the mutex before
going into the subscription. Of course, the logic to remove the item when something goes
wrong with the subscription is then a little more complex.
This commit was SVN r7429.
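For illustration only, the unlock-before-subscribe pattern described in r7429 can be sketched as below. This is not the actual OOB code: the peer table and the names add_peer, peer_lookup and subscribe_peer are hypothetical stand-ins, and a plain non-recursive pthread mutex is assumed.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 8

/* Hypothetical peer table guarded by one non-recursive mutex. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int peer_ids[MAX_PEERS];
static size_t peer_count = 0;

/* Stand-in for the lookup done on the send path; it takes the lock itself. */
bool peer_lookup(int id)
{
    bool found = false;
    pthread_mutex_lock(&table_lock);
    for (size_t i = 0; i < peer_count; i++) {
        if (peer_ids[i] == id) {
            found = true;
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return found;
}

/* Stand-in for the subscription: it re-enters the lookup, as a send would.
 * Calling it with table_lock held would deadlock. */
int subscribe_peer(int id, bool simulate_failure)
{
    if (!peer_lookup(id)) {
        return -1;
    }
    return simulate_failure ? -1 : 0;
}

/* Insert the peer, drop the lock, then subscribe. On failure the entry is
 * removed again, searching by id because the table may have changed while
 * the lock was released. */
int add_peer(int id, bool simulate_failure)
{
    pthread_mutex_lock(&table_lock);
    if (peer_count == MAX_PEERS) {
        pthread_mutex_unlock(&table_lock);
        return -1;
    }
    peer_ids[peer_count++] = id;
    pthread_mutex_unlock(&table_lock);   /* release before subscribing */

    if (subscribe_peer(id, simulate_failure) != 0) {
        pthread_mutex_lock(&table_lock);
        for (size_t i = 0; i < peer_count; i++) {
            if (peer_ids[i] == id) {
                peer_ids[i] = peer_ids[--peer_count];
                break;
            }
        }
        pthread_mutex_unlock(&table_lock);
        return -1;
    }
    return 0;
}
```

The extra complexity mentioned in the message shows up in the failure path: the entry cannot be assumed to sit where it was inserted, so it is looked up again under the lock.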
add an event, it can call the spawn function directly. This avoids it waiting on a condition that
will never be released.
This commit was SVN r7428.
This fixes one of the race conditions that occur when orterun is sent a kill signal.
Before it would sometimes spin in the OOB waiting for a message to complete
to a peer that was no longer around. Stalling at this level prevented orterun
from noticing that it had received a kill signal.
This commit was SVN r7408.
automagically don't build on platforms without such things
* Fix for mistaken use of cache variable in assembly setup
* one more cached test hits the books
This commit was SVN r7404.
LIBADD instead of appending to the existing one.
Also removed some more Makefile.options whitespace, and I think emacs
removed some tabs (i.e., replaced them with whitespace).
This commit was SVN r7399.
Makefile.options
- Sample in each of the three projects of how to link against the
relevant libraries so that when components are loaded into a parent
process' space, we don't rely on the libopal/liborte/libmpi symbols
being in the parent's public symbol namespace -- instead,
dynamically link to the relevant libraries, allowing the dynamic
linker to pull those libraries in at run-time, if needed
This commit was SVN r7397.
However we do want to do a bit of cleanup on the node before we exit,
specifically clean out the session directory. I also had a couple of the
subsystems that don't depend upon peers (which is key) clean up as well.
Pedantic formatting issue in oob_tcp.h
This commit was SVN r7387.
caller to specify a subset of the state variables that it can subscribe to.
This is specified with one of three special flags defined in rmgr/rmgr_types.h
This is useful when we only care about a subset of the state changes, such as
in orted which only needs to know when a job has terminated or aborted.
This commit was SVN r7356.
waiting instead for the SOH to indicate that the jobid has terminated.
In a scheduled environment, your program may have a section of MPI code
followed by a section of computation that some processes execute while
other processes terminate normally. This patch keeps the scheduler from
terminating all of the processes and the allocation if all of the processes
on an allocated node exit well before other processes on other nodes.
This commit was SVN r7333.
The following formats are parsed:
user@IPv4
user@fqdn
IPv4 or fqdn [username|user-name|user_name]=user
- Try a better error-detection when parsing (recognize wrong
IPs, fqdns...)
This commit was SVN r7288.
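As a rough sketch of the first two formats listed in r7288 (user@IPv4 and user@fqdn), the split at the '@' could look like the following. The function name split_user_host is hypothetical, the key=value form is not handled, and no validation of the IP or fqdn part is attempted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Split "user@host" into its two parts. Returns false when the '@' is
 * missing or duplicated, when either part is empty, or when a part does
 * not fit into the caller's buffer. */
bool split_user_host(const char *entry,
                     char *user, size_t user_len,
                     char *host, size_t host_len)
{
    const char *at = strchr(entry, '@');
    if (at == NULL || at == entry || at[1] == '\0') {
        return false;
    }
    if (strchr(at + 1, '@') != NULL) {
        return false;
    }

    size_t ulen = (size_t)(at - entry);
    size_t hlen = strlen(at + 1);
    if (ulen >= user_len || hlen >= host_len) {
        return false;
    }

    memcpy(user, entry, ulen);
    user[ulen] = '\0';
    memcpy(host, at + 1, hlen + 1);   /* copies the terminating NUL too */
    return true;
}
```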
that multiple processes don't overwrite each other. Change that
default in orte_init_stage1() to just "output-" (because the file will
be in a process-unique directory at that point; the pid is no longer
necessary).
This commit was SVN r7256.
opal_output_set_output_file_info(). This allows getting and setting
the default directory where output stream files will be opened (for
all *new* streams). Until this function is invoked, the default
location is $TMPDIR or $HOME (if $TMPDIR is not defined).
Added a call into orte_init_stage1() to call this function
immediately after the session directory is created and set the default
location of stream files to be the process' session directory.
This commit was SVN r7254.
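The default-location rule described in r7254 ($TMPDIR, else $HOME, until a directory is set explicitly) can be sketched as below. This is not the opal_output implementation; default_output_dir and its override_dir parameter are hypothetical names used only to show the fallback order.

```c
#include <stdlib.h>

/* Pick the directory for new output stream files.
 * override_dir models a directory set explicitly (e.g. the session
 * directory); when it is NULL, fall back to $TMPDIR and then $HOME.
 * May return NULL if neither variable is defined. */
const char *default_output_dir(const char *override_dir)
{
    if (override_dir != NULL) {
        return override_dir;
    }
    const char *dir = getenv("TMPDIR");
    if (dir == NULL) {
        dir = getenv("HOME");
    }
    return dir;
}
```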
AM_INIT_AUTOMAKE, instead of the deprecated version.
* Work around dumbness in modern AC_INIT that requires the version
number to be set at autoconf time (instead of at configure time, as
it was before). Set the version number, minus the subversion r number,
at autoconf time. Override the internal variables to include the r
number (if needed) at configure time. Basically, the right thing
should always happen. The only place it might not is the version
reported as part of configure --help will not have an r number.
* Since AM_INIT_AUTOMAKE takes a list of options, no need to specify
them in all the Makefile.am files.
* Adds support for subdir-objects, meaning that object files are put
in the directory containing source files, even if the Makefile.am is
in another directory. This should start making it feasible to
reduce the number of Makefile.am files we have in the tree, which
will greatly reduce the time to run autogen and configure.
This commit was SVN r7211.
This allows the user to specify certain options to srun when an application
is launched with this PLS.
A useful example is the need to set the time to wait between when the first
process completes and when slurm kills the remaining processes:
pls_slurm_args=--wait=1200
This commit was SVN r7206.
app_context:
mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \
-np 2 -prefix /path/to/ompi/on/machineB ./exec2
- With -mca pls_rsh_assume_same_shell 0, allow checking the
SHELL variable on the actual node (currently the 1st node).
Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and
csh/tcsh.
This commit was SVN r7195.
CTRL-C'd.
We were calling orte_finalize recursively which caused a segv when it tried to
use a freed framework (orte_rmgr in this case).
I added a status flag to orte_universe_info to indicate where we are in the code.
This was needed to determine if we should call orte_abort or not when shutting
down in the tcp oob.
This commit was SVN r7160.
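A minimal sketch of the kind of status flag described in r7160, assuming a simple three-state enum. The names rt_state, runtime_finalize and close_transport are hypothetical; the real flag lives in orte_universe_info and its values are not shown in this message.

```c
/* Where the runtime is in its lifecycle. */
typedef enum { RT_RUNNING, RT_FINALIZING, RT_FINALIZED } rt_state_t;

static rt_state_t rt_state = RT_RUNNING;
static int teardown_count = 0;   /* counts how often teardown really ran */

void runtime_finalize(void);

/* Stand-in for a transport shutdown whose error path tries to shut the
 * runtime down again, which is the recursion the flag guards against. */
static void close_transport(void)
{
    runtime_finalize();
}

void runtime_finalize(void)
{
    if (rt_state != RT_RUNNING) {
        return;                  /* already finalizing or finalized */
    }
    rt_state = RT_FINALIZING;
    close_transport();           /* the nested call returns immediately */
    teardown_count++;
    rt_state = RT_FINALIZED;
}
```

With the guard in place the nested call becomes a no-op, so frameworks are torn down exactly once.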
1. Valgrind is good for something - chasing down memory leaks in registry led me to re-visit the dictionary functions and discover that I wasn't keeping track of the number of dictionary entries on each segment! Resulted in wasted time searching blank entries as well as leaked memory. This has now been fixed.
2. Fixed the orte_bitmap test. The init function for that class has been eliminated and the constructor adjusted to provide that functionality.
This commit was SVN r7136.
1. user does NOT specify the universe name. For the default universe case, if we detect an existing default universe and cannot connect to it, we quietly create an alternative default name by adding the pid to the orte_default_universe name and move on - we no longer provide a warning message for this case.
2. user specified a universe name. If we detect an existing universe of that name and cannot connect to it, we consider this an error condition and abort.
This commit was SVN r7131.
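For case 1 above, building the alternative name by adding the pid might look roughly like this. The message does not say how the pid is attached, so the "-" separator and the function name alt_universe_name are assumptions made for illustration.

```c
#include <stdio.h>

/* Write "<default_name>-<pid>" into buf.
 * Returns 0 on success, -1 if the result does not fit. */
int alt_universe_name(char *buf, size_t len,
                      const char *default_name, long pid)
{
    int n = snprintf(buf, len, "%s-%ld", default_name, pid);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

A caller would typically pass (long)getpid() as the pid argument.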
1. Added OMPI_PROC_ARCH as a defined registry key and added the code so that the architecture info gets properly transmitted across all processes using the startup message.
2. Added an OMPI_MODEX_KEY definition and removed the hard-coded "modex" key from pml_modex_exchange
This commit was SVN r7129.
add a -I to find the included ltdl.h (vs. a system-installed ltdl.h)
- Clean up cruft in a bunch of Makefile.am's to remove now-unnecessary
AM_CPPFLAGS settings to get static-components.h for each framework
- Move the component_repository API functions out of opal/mca/base/base.h
and into opal/mca/base/mca_base_component_repository.h in order to
decrease unnecessary dependencies (e.g., before this, almost
everything in the tree depended on ltdl.h, which is unnecessary --
only a small number of files really need ltdl.h)
This commit was SVN r7127.
include any optimization flags
- Use these flags to always compile ompi/debuggers/* and orterun so
that parallel debuggers (such as Totalview) can always see the
debugging symbols (see comments in ompi/debuggers/Makefile.am and
orte/tools/orterun/Makefile.am)
- Remove some obsolete LAM-named variables from configure.ac
This commit was SVN r7125.
Here's the huge registry check-in you've all been waiting for with bated breath. The revised version sends a single message to all processes at the various stage gates, thus making the startup much more scalable. I could provide you with all the tawdry details, but won't for now - you are welcome to ask, though, and I'll merrily bore your ears to tears.
In addition, the commit contains the following:
1. set the ignore properties on ompi/debuggers and orte/mca/pls/poe
2. Added simplified subscribe and put functions to the registry's API. I have also converted all of the ompi functions that registered subscriptions to the new API, and caught their associated put's as well.
In a follow-on commit, I'll be adding support for George's hetero arch registry subscription (wanted to get this one in first).
This commit was SVN r7118.
it to be an exit.
* Put the srun process (or what is about to become the srun process) in
its own process group so that group-wide signals (such as the
SIGINT sent by hitting Ctrl-C in a shell) are not sent to the srun
process.
This commit was SVN r7068.
orte_init_stage1(), since not all ORTE processes call orte_init().
* Expand opal_error test case to make sure ORTE error codes print
properly
* Make project error codes start at easy values (OPAL is -1 to -100,
ORTE is -101 to -200, OMPI is less than -201) to make it easier
to figure out what an error code as an integer means. Also has
the nice property of not changing the values of error codes every
time a new error code is added.
This commit was SVN r7061.
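The range layout in r7061 makes it possible to tell the owning project from the integer alone. A small sketch, with hypothetical names (project_t, project_of_error) and two assumptions: non-negative values are treated as "not an error", and everything below -200 is attributed to OMPI.

```c
typedef enum {
    PROJECT_NONE,   /* not a negative error code */
    PROJECT_OPAL,   /* -1 to -100 */
    PROJECT_ORTE,   /* -101 to -200 */
    PROJECT_OMPI    /* below -200 (assumption for the exact boundary) */
} project_t;

/* Map an integer error code to the project whose range it falls in. */
project_t project_of_error(int code)
{
    if (code >= 0) {
        return PROJECT_NONE;
    }
    if (code >= -100) {
        return PROJECT_OPAL;
    }
    if (code >= -200) {
        return PROJECT_ORTE;
    }
    return PROJECT_OMPI;
}
```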
tree.
- fix up #include's throughout the tree (yay contrib/search_replace.pl!)
- remove a few extraneous #include's
- remove orte_sys_info*() from opal_init()/opal_finalize() (it's
already in orte_init_stage1() and orte_system_finalize())
- remove dependencies in opal on orte_system_info -- util/os_path.c
and util/os_create_dirpath.c (they only used path_sep, anyway --
easily changed to #defines)
This commit was SVN r7059.
session directory cleanup (among other things)
- When we get an abnormal exit in orterun (i.e., timeout expires and
we haven't gotten termination notices from all processes), print a
better message and exit in a better way (which includes session
directory cleanup)
- Fix tm and poe pls's to not exit() but rather propagate the error up
the stack (where relevant)
This commit was SVN r7058.
- Change orte_base_infrastructre to orte_infrastructre to conform with
ompi_info's needs
- Move MCA Param registration in ORTE to a centralized function that is
called first in orte_init_stage1
- Set the infrastructure flag as an argument to orte_init
- Adjust initialization functions to properly pass down the infrastructure
flag.
This commit was SVN r7053.
Also check to see if the infrastructure flag was previously set before assuming it
to be false. This was causing orterun to operate incorrectly in the presence
of a persistent daemon.
This commit was SVN r7039.
NOTE: These have NOT been added to the Makefile.am in the repository. Please do NOT add them at this time - I will do so later.
This commit was SVN r6979.
OPAL_ERROR, same for all the other error codes. Also, make sure that there
are never conflicts between OPAL and ORTE error codes (for example).
Finally, provide opal_perror(), opal_strerror(), and opal_strerror_r() to
give stringified error messages for the different error codes
This commit was SVN r6969.
multi-client issues the old version had. Also, ignore the NULL iof
component, since we shouldn't use it when using the proxy orteds
This commit was SVN r6939.
against the total number of processors. If not oversubscribing, emit
the MCA environment variable mpi_paffinity_processor with the
processor number to bind the process to. This parameter is picked up
during MPI_Init (i.e., ompi_mpi_init()) and used to bind the process,
but currently only if the MCA param mpi_paffinity_alone is set to a
nonzero value (i.e., the user asks for it).
This commit was SVN r6906.
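A rough sketch of the oversubscription check in r6906 follows. The function name paffinity_env and its parameters are hypothetical, using the local rank as the processor number is an assumption (the message does not say how the number is chosen), and the OMPI_MCA_ prefix is the usual environment form of an MCA parameter.

```c
#include <stdio.h>

/* If the node is not oversubscribed, write the environment setting that
 * carries the processor number into buf.
 * Returns 1 if a setting was written, 0 if oversubscribed (nothing is
 * emitted), -1 if buf is too small. */
int paffinity_env(char *buf, size_t len,
                  int local_rank, int local_procs, int num_processors)
{
    if (local_procs > num_processors) {
        return 0;                /* oversubscribed: do not bind */
    }
    int n = snprintf(buf, len, "OMPI_MCA_mpi_paffinity_processor=%d",
                     local_rank);
    return (n > 0 && (size_t)n < len) ? 1 : -1;
}
```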
- change the framework opens to [mostly] use the new MCA param API
- properly pass in framework debug output streams to the
mca_base_component_open() function
This commit was SVN r6888.
* Add base to memory framework so that we can do something sane with
ompi_info
* Updated ompi_info to print components for memory framework and
show whether we have memory hooks active or not.
This commit was SVN r6861.