We still have an issue with the io forwarding going through the spawning process, but that will be dealt with at a future time.
This commit was SVN r11943.
It turns out that we were improperly allocating an array if -np was not passed. Also, we were not really using this array for anything. So this gets rid of the array and performs some minor cleanup.
This commit was SVN r11934.
The following Trac tickets were found above:
Ticket 452 --> https://svn.open-mpi.org/trac/ompi/ticket/452
* Error message in an NSError object is localizedDescription, not
localizedErrorReason. The latter is a description of how the error
can occur, which is usually nothing in XGrid frameworks.
* Clean up silly error in finding the Kerberos Service Principal
when using Kerberos authentication
* Print useful error message when a connection unexpectedly closes,
as this is usually authentication related...
This commit was SVN r11923.
remove requirements on .la files on wrapper scripts
Ticket: #374
extend compilers to support 32 bit and 64 bit in one version of the wrapper
Submitted by: Dan Lacher
Reviewed by: Rolf Vandevaart
This commit was SVN r11908.
install-exec-hook is not only wrong, it can cause ordering issues such
as trying to put sym links to man pages in directories that do not yet
exist.
This commit was SVN r11893.
I have added a new MCA param (hey, you can't have too many!) called OMPI_MCA_orte_timing. If set to anything other than zero, the system will report out critical timing loops. At the moment, this includes three measurements:
1. Time spent going through the RDS->RAS->RMAPS, setting up triggers, etc. prior to calling the actual PLS launch function. This is reported out as time to setup job.
2. Time spent in MPI_Init from start of that function (well, right after opal_init) to the place where we send all of our info to the registry. Reported out as time from start to exec_compound_cmd
3. Time actually spent executing the compound cmd. Reported out as time to exec_compound_cmd.
A few additional timing points will be added shortly.
These may eventually be removed or (better) set up with a conditional compile flag.
This commit was SVN r11892.
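A minimal sketch of the kind of measurement described in r11892 above, assuming the value is read straight from the environment and that wall-clock time is sampled with gettimeofday(). The function names timing_enabled, elapsed_usec and timed_phase are hypothetical and are not taken from the ORTE source; the real code goes through the MCA parameter system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Returns nonzero when OMPI_MCA_orte_timing is set to a nonzero number.
 * Reading the environment directly is an assumption of this sketch;
 * a non-numeric value parses as 0 and is treated as "off". */
int timing_enabled(void)
{
    const char *val = getenv("OMPI_MCA_orte_timing");
    return (NULL != val && 0 != strtol(val, NULL, 10));
}

/* Elapsed time between two gettimeofday() samples, in microseconds. */
long elapsed_usec(const struct timeval *start, const struct timeval *stop)
{
    return (long)(stop->tv_sec - start->tv_sec) * 1000000L
         + (long)(stop->tv_usec - start->tv_usec);
}

/* Hypothetical wrapper: run one phase, measure it, and report the
 * result on stderr only when timing is switched on. */
long timed_phase(const char *label, void (*phase)(void))
{
    struct timeval start, stop;
    long usec;

    gettimeofday(&start, NULL);
    phase();
    gettimeofday(&stop, NULL);

    usec = elapsed_usec(&start, &stop);
    if (timing_enabled()) {
        fprintf(stderr, "%s: %ld usec\n", label, usec);
    }
    return usec;
}
```

Wrapping each of the three measured regions in a call like timed_phase("time to setup job", ...) would keep the reporting in one place, which also makes it easy to compile out later.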
1. PLS finalize was not being called. Now ensure that happens during orte_finalize.
2. Errmgr proxies were sending their messages to the wrong tag - typical cut/paste error.
This commit was SVN r11891.
__DARWIN_ALIGN_POWER define from the last release of the OS X compiler
toolchain. The bug in net/if.h, however, is still there. So look
for the hints that we're on a 64 bit Apple PowerPC instead.
* If we don't find a buffer size that works by 10MB, we're never
going to. So add some code to limit the buffer size we'll try
so that we don't fall into an infinite loop
* Detect errors in opal_ifcount in the oob init code
Refs trac:420
This commit was SVN r11825.
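The buffer-size limit from r11825 above can be sketched roughly as follows. The 10MB cap comes from the commit text; the doubling strategy, the probe callback, and the names find_working_bufsize, probe_needs_4k and probe_never are assumptions made for illustration and are not the actual opal_if code.

```c
#include <stddef.h>

/* 10MB cap, as stated in the commit message. */
#define MAX_PROBE_BUFSIZE ((size_t)10 * 1024 * 1024)

/* Hypothetical probe: returns nonzero if a buffer of `size` bytes was
 * large enough. Stands in for the real interface-list query. */
typedef int (*probe_fn)(size_t size);

/* Grow the buffer size (doubling is an assumption of this sketch) until
 * the probe succeeds or the cap is exceeded. Returns the working size,
 * or 0 if nothing up to the cap worked. */
size_t find_working_bufsize(size_t initial, probe_fn probe)
{
    size_t size;

    if (0 == initial) {
        return 0;   /* 0 * 2 never grows, so refuse it up front */
    }
    for (size = initial; size <= MAX_PROBE_BUFSIZE; size *= 2) {
        if (probe(size)) {
            return size;
        }
    }
    return 0;       /* gave up instead of looping forever */
}

/* Example probes used to exercise the loop. */
int probe_needs_4k(size_t size) { return size >= 4096; }
int probe_never(size_t size)    { (void)size; return 0; }
```

The caller treats a return value of 0 as an error, which is the case the oob init code now has to check for.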
The following Trac tickets were found above:
Ticket 420 --> https://svn.open-mpi.org/trac/ompi/ticket/420
on 64 bit platforms sizeof(size_t) != sizeof(orte_std_cntr_t), and we were incorrectly
assuming this when dealing with num procs. It worked on little endian platforms, but
not big endian. So change num_procs to type int, and cast where needed.
This commit was SVN r11796.
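To see why the num_procs bug in r11796 above only showed up on big endian machines, here is a small byte-level sketch. It assumes the counter type is 32 bits wide while size_t is 64 bits; the helper names store_u64 and load_u32_prefix are hypothetical. Byte order is simulated explicitly so the example behaves the same on any host.

```c
#include <stdint.h>

/* Write a 64-bit value into buf using the requested byte order. */
void store_u64(unsigned char buf[8], uint64_t value, int big_endian)
{
    int i;
    for (i = 0; i < 8; ++i) {
        int shift = big_endian ? (56 - 8 * i) : (8 * i);
        buf[i] = (unsigned char)(value >> shift);
    }
}

/* Interpret only the FIRST four bytes of buf as a 32-bit value in the
 * requested byte order. This mimics reading a 64-bit slot through a
 * pointer to a 32-bit type. */
uint32_t load_u32_prefix(const unsigned char buf[8], int big_endian)
{
    uint32_t v = 0;
    int i;
    for (i = 0; i < 4; ++i) {
        int shift = big_endian ? (24 - 8 * i) : (8 * i);
        v |= (uint32_t)buf[i] << shift;
    }
    return v;
}
```

With a value of 4 stored little endian, the first four bytes still decode to 4, so the mismatch goes unnoticed. Stored big endian, the first four bytes are the high-order half and decode to 0.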
wider space than getpid()
* Include <time.h> to get time()'s prototype
* Fix typo that prevented using /dev/urandom on systems that had it
This commit was SVN r11780.
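A rough sketch of the seeding approach hinted at in r11780 above: prefer /dev/urandom where it exists and fall back to time() mixed with getpid() otherwise. What exactly is being seeded is not visible in the message, so the function name get_seed, its parameters, and the fallback mixing are all assumptions.

```c
#include <stdio.h>
#include <sys/types.h>
#include <time.h>      /* time() prototype */
#include <unistd.h>    /* getpid() */

/* Read a seed from `path` (normally "/dev/urandom"). If the device is
 * missing or the read fails, fall back to time() XOR getpid().
 * *from_device is set to 1 when the device supplied the seed. */
unsigned int get_seed(const char *path, int *from_device)
{
    unsigned int seed = 0;
    FILE *fp = fopen(path, "rb");

    *from_device = 0;
    if (NULL != fp) {
        size_t n = fread(&seed, sizeof(seed), 1, fp);
        fclose(fp);
        if (1 == n) {
            *from_device = 1;
            return seed;
        }
    }
    /* Weaker fallback: still varies across processes and across runs. */
    return (unsigned int)time(NULL) ^ (unsigned int)getpid();
}
```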
Fix for double mutex free that would cause an abort condition in the orted
whenever threads were enabled.
This commit was SVN r11759.
The following Trac tickets were found above:
Ticket 391 --> https://svn.open-mpi.org/trac/ompi/ticket/391
LoadLeveler only sets LOADL_PROCESSOR_LIST when there are 128 or fewer tasks allocated to a job. The POE RAS relied on this variable, so I created a new RAS which uses the LoadLeveler API instead of relying on the environment variable. This still needs some testing, so for now we use the POE RAS whenever LOADL_PROCESSOR_LIST is set; otherwise we fall back on this component.
Unfortunately, this will require an autogen...
This commit was SVN r11732.
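The selection rule in r11732 above amounts to a single environment check. A minimal sketch, assuming the decision is made purely on whether LOADL_PROCESSOR_LIST is present; the enum and the function choose_ras are hypothetical and do not reflect how the MCA framework actually selects components.

```c
#include <stdlib.h>

typedef enum {
    RAS_POE,              /* existing component, reads the environment */
    RAS_LOADLEVELER_API   /* new component, queries the LoadLeveler API */
} ras_choice_t;

/* Prefer the POE RAS when LoadLeveler exported the processor list
 * (128 or fewer tasks); otherwise fall back on the API-based RAS. */
ras_choice_t choose_ras(void)
{
    const char *list = getenv("LOADL_PROCESSOR_LIST");
    return (NULL != list) ? RAS_POE : RAS_LOADLEVELER_API;
}
```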
We were still waiting the entire duration of the timeout before we figured out that a connect() was successful. Re-introduce adding the peer_send_event so that we detect immediately when a connect() completes.
Also make sure to delete the timeout event in complete_connect().
Fixed a struct timeval initialization warning reported by Jeff.
Remove an erroneous opal_output().
This commit was SVN r11724.
The following SVN revision numbers were found above:
r11718 --> open-mpi/ompi@1b6231a9b5
Each 's' partition has its own TCP network. It's fine to use this network for jobs that fit inside the partition, but the TCP OOB fails when trying to connect across two partitions, because there are two disjoint networks. Each node also has another TCP network connecting ALL nodes together.
So the solution is to actually try all the available TCP interfaces on a node, instead of erroring when the first one fails.
Also, the default TCP connect() timeout is way too long (5 minutes) - use our own timeout mechanism, with the timeout value expressed as an MCA parameter.
This commit was SVN r11718.
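The approach in r11718 and r11724 above (try every address, bound each attempt with our own timeout, and notice a completed connect() right away) can be sketched with a non-blocking socket and select(). This is not the OOB code, which uses the event library; connect_with_timeout and connect_any are hypothetical names, and timeout_sec stands in for the MCA parameter.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Try one address with a bounded wait. Returns a connected fd (left in
 * non-blocking mode) or -1 on failure or timeout. */
int connect_with_timeout(const struct sockaddr_in *addr, int timeout_sec)
{
    int fd, flags, err = 0;
    socklen_t len = sizeof(err);
    fd_set wset;
    struct timeval tv;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    if (0 == connect(fd, (const struct sockaddr *)addr, sizeof(*addr))) {
        return fd;                      /* connected immediately */
    }
    if (EINPROGRESS != errno) {
        close(fd);
        return -1;                      /* hard failure, e.g. refused */
    }

    /* select() wakes up as soon as the socket becomes writable, so a
     * successful connect is detected without waiting out the timeout. */
    FD_ZERO(&wset);
    FD_SET(fd, &wset);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    if (select(fd + 1, NULL, &wset, NULL, &tv) <= 0) {
        close(fd);
        return -1;                      /* timed out or select failed */
    }
    /* Writable does not mean connected: check the pending error. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || 0 != err) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Walk every candidate address instead of giving up on the first one. */
int connect_any(const struct sockaddr_in *addrs, int count, int timeout_sec)
{
    int i;
    for (i = 0; i < count; ++i) {
        int fd = connect_with_timeout(&addrs[i], timeout_sec);
        if (fd >= 0) {
            return fd;
        }
    }
    return -1;
}
```

On a machine with two disjoint networks, the address on the partition-local network times out or is refused and the loop simply moves on to the address on the network that connects all nodes.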
Add --enable-orterun-prefix-by-default (and a synonym:
--enable-mpirun-prefix-by-default) to make orterun always behave as if
"--prefix $prefix" was given on the command line (where $prefix is the
value given to the --prefix option to configure). This prevents many
rsh/ssh users from needing to modify their shell startup files to set
the LD_LIBRARY_PATH for Open MPI (they will still need to set PATH or
otherwise find the OMPI executables to mpicc/mpirun/etc. their MPI
applications).
Also added --noprefix option to orterun to disable this behavior.
Finally, note that even if --enable-orterun-prefix-by-default is
specified, if the user specifies --prefix or /path/to/mpirun, these
options will override the default value of the prefix ($prefix).
This commit was SVN r11669.
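The precedence rules in r11669 above can be written down as a small decision function. This is a sketch only: resolve_prefix and its parameters are hypothetical, checking --prefix before /path/to/mpirun is an assumption (the message does not say which wins when both are given), and --noprefix is assumed to suppress only the configured default.

```c
#include <stddef.h>

/* Returns the prefix orterun would act on, or NULL for "no prefix".
 *   cli_prefix        value given to --prefix, or NULL
 *   mpirun_dir_prefix prefix implied by invoking /path/to/mpirun, or NULL
 *   configured_prefix $prefix when built with
 *                     --enable-orterun-prefix-by-default, else NULL
 *   noprefix          nonzero when --noprefix was given */
const char *resolve_prefix(const char *cli_prefix,
                           const char *mpirun_dir_prefix,
                           const char *configured_prefix,
                           int noprefix)
{
    if (NULL != cli_prefix) {
        return cli_prefix;          /* explicit option overrides default */
    }
    if (NULL != mpirun_dir_prefix) {
        return mpirun_dir_prefix;   /* absolute path overrides default */
    }
    if (!noprefix && NULL != configured_prefix) {
        return configured_prefix;   /* behave as if "--prefix $prefix" */
    }
    return NULL;
}
```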
The following Trac tickets were found above:
Ticket 377 --> https://svn.open-mpi.org/trac/ompi/ticket/377
Allow the POE RAS to be compiled for Linux as well as AIX.
The POE RAS is really a Loadleveler RAS, and IU now has
a cluster that uses Loadleveler in a Linux environment (BigRed).
This seems to be the only thing we need to do so far to run
Open MPI on BigRed. Yay :)
This commit was SVN r11600.
set to 1 or 0 instead of the user defined number or default (128).
This caused the PLS to deadlock when using '--debug-daemons' with
more than 2 processes. :(
svn blame says that it was broken in r11347
It is *not* a problem on v1.1 or v1.2 branches.
Bug spotted by Tim Mattox and myself.
This commit was SVN r11575.
The following SVN revision numbers were found above:
r11347 --> open-mpi/ompi@f52c10d18e
- everything statically built (dynamically opened).
- OPAL, ORTE and OMPI static libraries and all the components
as dynamic files (DLL).
- everything as dynamic files (DLL).
This commit was SVN r11461.
create a process component that uses CreateProcess to spawn the child.
Special care should be taken in order to correctly redirect the stdin,
stdout and stderr of the child process.
This commit was SVN r11405.
- Remove extra NULL argument from rsh module.
This commit was SVN r11377.
The following SVN revision numbers were found above:
r11347 --> open-mpi/ompi@f52c10d18e
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unnamed structures
- no implicit cast.
Plus a full implementation for the orte_wait functions.
This commit was SVN r11347.
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.
This commit was SVN r11270.
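For readers unfamiliar with per-project export macros like the ones named in r11270 above, here is a generic sketch of how such a macro is commonly defined. Only the name OPAL_DECLSPEC comes from the commit message; EXAMPLE_BUILDING_OPAL and example_opal_version are hypothetical, and the actual definitions in the tree may differ. ORTE_DECLSPEC and OMPI_DECLSPEC would follow the same pattern with their own build-time switch.

```c
/* Normally set by the build system only while compiling the OPAL
 * library itself; defined here so the sketch is self-contained. */
#define EXAMPLE_BUILDING_OPAL 1

#if defined(_WIN32)
#  if defined(EXAMPLE_BUILDING_OPAL)
#    define OPAL_DECLSPEC __declspec(dllexport)  /* building the DLL */
#  else
#    define OPAL_DECLSPEC __declspec(dllimport)  /* using the DLL */
#  endif
#else
#  define OPAL_DECLSPEC                          /* no-op elsewhere */
#endif

/* A public symbol is tagged with the macro of the project it lives in. */
OPAL_DECLSPEC int example_opal_version(void);

int example_opal_version(void)
{
    return 1;
}
```

One macro per project matters because a single build can export OPAL symbols while importing nothing, and later import OPAL symbols while exporting ORTE ones, so a single shared switch cannot express both at once.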
Other changes:
1. Remove the old xcpu components as they are not functional.
2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.
This will require an autogen/configure, I'm afraid.
This commit was SVN r11228.
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).
Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).
I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).
In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...
Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.
This commit was SVN r11204.