Split the finalize process into two parts: one that finalizes the orte subsystems, and another that finalizes (what will become) the opal subsystems. This is needed to properly restart the workstation process once the remote launch has completed.
This commit was SVN r5758.
launching new processes so that they get all the default unix
behaviors (i.e., become killable, and don't accidentally have some
signals blocked -- left over from the event library).
This commit was SVN r5757.
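For r5757 above, a minimal sketch of restoring default signal behavior in a child between fork() and exec(). The function name and the list of signals are assumptions for illustration, not the code from this commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical helper: restore default dispositions for common signals
 * and clear the blocked-signal mask. Returns 0 on success, -1 on failure. */
static int reset_signals_to_default(void)
{
    static const int sigs[] = { SIGHUP, SIGINT, SIGQUIT, SIGPIPE,
                                SIGTERM, SIGCHLD, SIGUSR1, SIGUSR2 };
    struct sigaction act;
    sigset_t none;
    size_t i;

    memset(&act, 0, sizeof(act));
    act.sa_handler = SIG_DFL;
    sigemptyset(&act.sa_mask);

    for (i = 0; i < sizeof(sigs) / sizeof(sigs[0]); ++i) {
        if (sigaction(sigs[i], &act, NULL) != 0) {
            return -1;
        }
    }

    /* Unblock everything, in case the parent left signals blocked. */
    sigemptyset(&none);
    return sigprocmask(SIG_SETMASK, &none, NULL);
}
```

A launcher would call this in the child right after fork() and before exec(), so the new process starts killable and with no inherited blocked signals.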
Remote launch of daemon now working. Bunch of forced diagnostic messages in it, though, which I'll leave until I release this for actual use.
This commit was SVN r5750.
Orted has been modified to take a new parameter - a file descriptor used as a pipe to pass the daemon's contact info back to the probe when the daemon is remotely launched.
This commit was SVN r5748.
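For r5748 above, a rough sketch of passing a contact string over a pipe file descriptor. The function names and the NUL-terminated framing are assumptions; the commit does not say how orted names the parameter or frames the data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Daemon side (hypothetical): write the contact string, including its
 * terminating NUL, to the write end of the pipe. */
static int send_contact_info(int fd, const char *contact)
{
    size_t left = strlen(contact) + 1;
    const char *p = contact;

    assert(fd >= 0);
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            return -1;
        }
        p += n;
        left -= (size_t) n;
    }
    return 0;
}

/* Probe side (hypothetical): read until the NUL terminator arrives.
 * Returns -1 on error, EOF, or if the buffer fills up first. */
static int recv_contact_info(int fd, char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    while (used < len) {
        ssize_t n = read(fd, buf + used, len - used);
        if (n < 0) {
            return -1;
        }
        if (n == 0) {
            break;
        }
        used += (size_t) n;
        if (buf[used - 1] == '\0') {
            return 0;
        }
    }
    return -1;
}
```

In a real launch the probe would create the pipe, fork/exec the daemon with the write end's descriptor number as an argument, and read from the other end. The sketch does not retry on EINTR.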
You will now receive a message indicating that an existing universe was detected, but connection to it was refused. The system will tell you the name it created for the new universe it will now be using.
This commit was SVN r5747.
(mostly UB/LB related and doing questionable things). Otherwise a lot of changes:
- cleaner code + more comments
- stronger bound checker (DDT error messages if we exceed the buffer capacity)
- differentiate the 3 internal types: loop, basic element and end_loop (using union)
- more MACROS to solve the repetitive problems
- more output in debug mode (and if requested by the user).
- correct some mismatches between the usage of true_extent and extent
- improve the special cases (contiguous data, contiguous with gaps, no conversion ...)
- to allow gdb to break in the pack/unpack functions, they become real functions (defined
in dt_pack.c and dt_unpack.c) if OMPI_ENABLE_DEBUG is defined. Otherwise they are just macros.
- a new approach for computing the lower bound and upper bound. At the same time, true_lb and
true_ub have been modified to match the new algorithm.
- handle specific cases in the datatype optimization. In some cases the datatype definition grows,
but the complexity decreases.
This commit was SVN r5729.
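For the "real function in debug mode, macro otherwise" item in r5729 above, the pattern can be sketched roughly as follows. copy_bytes is a hypothetical stand-in for the pack/unpack routines, and testing the flag with #if (rather than #ifdef) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef OMPI_ENABLE_DEBUG
#define OMPI_ENABLE_DEBUG 0   /* assumed default for this sketch */
#endif

#if OMPI_ENABLE_DEBUG
/* Debug build: a real function, so a debugger can break on it. */
static void copy_bytes(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
}
#else
/* Optimized build: the same name expands inline, with no call overhead. */
#define copy_bytes(dst, src, len) memcpy((dst), (src), (len))
#endif
```

Callers use copy_bytes() the same way in both builds; only the debug build gives gdb a symbol to stop on.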
is int (mostly 4 bytes) and strdup normally returns a char* (usually 8 bytes). The result: a corrupted stack
and other weird things ...
This commit was SVN r5725.
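The start of the r5725 message is cut off, so the following rests on an assumption: the symptom matches strdup() being called without a prototype in scope, which older C compilers treat as returning int. A minimal sketch of the safe form, with a hypothetical helper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>   /* declares strdup(), so its char* return is not truncated to int */

/* Hypothetical helper: duplicate a string with the prototype visible. */
static char *copy_name(const char *name)
{
    assert(name != NULL);
    return strdup(name);
}
```

The caller owns the returned buffer and releases it with free().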
Need to do some refining of the component, but it meets basic requirements right now. Nobody else should notice any change - system basically ignores it unless you tell it to do something.
This commit was SVN r5723.
Added a special case under the win_makefile for the gpr/replica directory
since it contains multiple dependent layers of directories.
Added a couple of OMPI_DECLSPECs. Change a conflicting variable name in
gpr_replica_dict_tl.c from 'new' to 'new_dict'.
This commit was SVN r5712.
Fixes for orterun in handling different MCA params for different
processes (reviewed by Brian):
- By design, if you run the following:
mpirun --mca foo aaa --mca foo bbb a.out
a.out will get a single MCA param for foo with value "aaa,bbb".
- However, if you specify multiple apps with different values for the
same MCA param, you should expect to get the different values for
each app. For example:
mpirun --mca foo aaa a.out : --mca foo bbb b.out
Should yield a.out with a "foo" param with value "aaa" and b.out
with a "foo" param with a value "bbb".
- This did not work -- both a.out and b.out would get a "foo" with
"aaa,bbb".
- This commit fixes this behavior -- now a.out will get aaa and b.out
will get bbb.
- Additionally, if you mix --mca and an app file, you can have
"global" params and per-line-in-the-appfile params. For example:
mpirun --mca foo zzzz --app appfile
where "appfile" contains:
-np 1 --mca bar aaa a.out
-np 1 --mca bar bbb b.out
In this case, a.out will get foo=zzzz and bar=aaa, and b.out will
get foo=zzzz and bar=bbb.
Spiffy.
Ok, fortran build is done... back to Fortran... sigh...
This commit was SVN r5710.
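The per-app behavior described in r5710 above can be sketched as a small scan over the command-line tokens. This is an illustration only, not orterun's parser: collect_mca is a hypothetical function, it treats ":" as the app separator, and it does not model app files or global params.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: gather every "--mca <name> <value>" that belongs to
 * app number `app` (0-based, apps separated by ":") and join the values
 * with commas. Returns the number of values found, or -1 if `out` is too small. */
static int collect_mca(int argc, char **argv, int app,
                       const char *name, char *out, size_t outlen)
{
    int cur = 0, found = 0, i;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (i = 0; i < argc; ++i) {
        if (strcmp(argv[i], ":") == 0) {
            ++cur;                      /* next app context */
            continue;
        }
        if (strcmp(argv[i], "--mca") == 0 && i + 2 < argc) {
            if (cur == app && strcmp(argv[i + 1], name) == 0) {
                size_t used = strlen(out);
                int n = snprintf(out + used, outlen - used, "%s%s",
                                 found ? "," : "", argv[i + 2]);
                if (n < 0 || (size_t) n >= outlen - used) {
                    return -1;
                }
                ++found;
            }
            i += 2;                     /* skip the name and value tokens */
        }
    }
    return found;
}
```

With the tokens from the first example, app 0 collects "aaa,bbb"; with the second example, app 0 collects "aaa" and app 1 collects "bbb".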
* start refactoring duplicate code into inline functions (probably will
have to become macros, but not until debugging is done)
* general code cleanup
This commit was SVN r5706.
1. Added a new function to launch head node processes on remote nodes.
2. Added new tool "orteprobe" that checks to see if a daemon is running on a node. If so, it reports the contact info back to the requestor. If not, it will (eventually - but not now) fork/exec a daemon on the node, report the contact info back to requestor, and then die.
3. Modified orted to handle universe name parameters, and added separate command line flags for debugging the daemon and saving daemon debugging output in a file. The "debug" flag now turns on the runtime debug info instead of the daemon debug - thus, you can now just get daemon debug info if you like.
4. Fix the dps to handle zero length strings correctly.
5. Modify the fork and rsh launchers to pass required environmental variables to the daemons and processes
6. Pulled the redirection of stdin/stdout/stderr for the daemon out of orted and put it into the daemon_init function to simplify orted logic.
7. Modified sys_info to correctly deal with passed mca param
8. Modified univ_info to parse incoming universe location information.
This commit was SVN r5705.
* make buffers really big so that we pass allocmem until we figure out
why we're not flow controlling as I expected
* set event queue to invalid initially and use that as the enabled test
rather than a separate bool - shrinks the module a bit
* add dropped count checks, with a panic if one occurs. Still need to
implement some type of retransmit logic.
This commit was SVN r5704.
- don't free the send buffer unless the converter tells us we need to
- properly do the math to determine when the receive buffer has been
fully used and unlinked itself
This commit was SVN r5703.
* Minor formatting fixes in XGrid RAS component
* Code cleanup in XGrid PLS component:
- If we can't get daemon contact information, kill the job at the XGrid
level
- Add MCA parameter pls_xgrid_delete_job that will delete the job from
XGrid when complete (this seems like standard behavior, so it's the
default)
- Remove compiler warning about getting the name of a XGGrid object
- Properly populate the daemon information for the killing code
This commit was SVN r5697.
more than we have asked for (on my G5). Anyway, now I hope I have enough memory to print out
the full description of the datatype.
This commit was SVN r5690.
Many changes to headers for OMPI_DECLSPEC, and
proper placement of c_plusplus defines in those files.
mca/gpr/replica and tools are the two sets of directories
that still need work for the Windows build for this pass.
This commit was SVN r5688.
- app->num_procs changed to a size_t, which hosed the initialization
of its value to -1 (not sure why the compiler didn't complain
#$%@#$%), which was there to catch the case when the user forgot to
specify -np (or some other equivalent). Fixed.
This commit was SVN r5672.
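The r5672 message does not say how the -1 initialization was fixed. One possible approach, sketched here with hypothetical names, is an explicit sentinel that is valid for an unsigned type:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "the user never gave -np". */
#define NUM_PROCS_UNSET SIZE_MAX

struct app_context {
    size_t num_procs;
};

static void app_init(struct app_context *app)
{
    assert(app != NULL);
    app->num_procs = NUM_PROCS_UNSET;
}

/* A test such as (num_procs < 0) can never be true for a size_t, so the
 * "forgot -np" check has to compare against the sentinel instead. */
static int app_np_was_given(const struct app_context *app)
{
    return app->num_procs != NUM_PROCS_UNSET;
}
```

Assigning -1 to a size_t silently wraps to SIZE_MAX, which is why the compiler may accept it without complaint.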
- Change all uses of *printf'ing a size_t to use an explicit cast to
(unsigned long) and the %lu escape
- change ORTE_GPR_REPLICA_MAX_SIZE to INT_MAX until bug 1345 is fixed
(i.e., until we allow size_t in MCA params)
- ns_base_local_fns.c:orte_ns_base_get_proc_name_string(): changed
from %0X -> %lu
- ORTE_NAME_ARGS added explicit (unsigned long) casts, and changed all
usages of ORTE_NAME_ARGS to use %lu's
This commit was SVN r5644.
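The cast-and-%lu convention from r5644 above looks roughly like this. format_count is a hypothetical helper used only to show the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Print a size_t portably on pre-C99 printf implementations:
 * cast to unsigned long and use the %lu conversion. */
static int format_count(char *buf, size_t len, size_t count)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%lu", (unsigned long) count);
}
```

C99 also provides %zu for size_t. The cast form truncates large values on platforms where unsigned long is narrower than size_t, such as 64-bit Windows.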
As an FYI: the pack/unpack routines should be happy with a NULL string (and appear to be so). Issue here was that the constructor was not called, which means that the string pointer was not initialized to NULL as it ordinarily would have been.
This commit was SVN r5639.
1. Instead of removing various src/ component directories, simply
"flatten" the Makefile.am structure by having only a single
top-level Makefile.am for the component, and having it include
src/Makefile.extra (which is where the source files are listed).
This effectively makes the build faster because "make" does not
traverse down into src/, and we don't build a Makefile for that
directory.
2. Did end up moving topo/unity/src/* into topo/unity, which is where
I figured out that option #1 would be a bit easier (and safer,
considering that other developers are actively working in various
src/ directories -- moving things around while they're working
would be Bad!)
3. Did not consolidate most of the io/romio component because of the
nightmare of sym links (especially w.r.t. VPATH builds) in the
included ROMIO distribution. I wasted too much time trying to get
that stuff right and finally gave up -- this is a "low hanging
fruit" optimization, after all.
This commit was SVN r5629.
- sends/recvs short messages (less than first frag size)
- does not properly ACK messages, so Ssend() is borked
- leaks memory like there's no tomorrow
- don't use it just yet
This commit was SVN r5625.
1. Added pid_t to the dps
2. Processes now "register" their local pid and update their location (i.e., nodename) on the registry during mpi_init
3. Added a new error code for values that exceed maximum for their data type (useful when transitioning a value from one variable to another of different size)
4. Fixed a few places where size_t was being incorrectly handled
5. Updated dps_test to cover pid_t types
This should now provide support for TotalView connection - which David is pursuing.
This commit was SVN r5623.
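Item 3 of r5623 above (an error code for values that exceed the maximum of their data type) suggests a checked conversion along these lines. The constant and function names are hypothetical, not the ones added by the commit.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define EX_SUCCESS                  0
#define EX_ERR_VALUE_OUT_OF_BOUNDS  (-1)   /* hypothetical error code */

/* Move a size_t into an int, reporting an error instead of
 * silently truncating when the value does not fit. */
static int size_to_int(size_t in, int *out)
{
    assert(out != NULL);
    if (in > (size_t) INT_MAX) {
        return EX_ERR_VALUE_OUT_OF_BOUNDS;
    }
    *out = (int) in;
    return EX_SUCCESS;
}
```

On failure the output variable is left untouched, so callers can keep a previous or default value.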
biggie), so we gain nothing there. On 10.4, it's implemented directly,
but doesn't support devices (which messes up pty support and IO
forwarding).
This commit was SVN r5621.
on all 64-bit architectures. The problem was that, for unpack, the source pointer was cast to a
specific type (uint32_t for 32-bit data) and then hton* was applied. The result was ... unexpected.
This patch always memcpys the data into a temporary variable of the correct size before calling the
ntoh* functions, so we can ensure that the data is always correctly aligned.
Moreover, I added a debugging layer. OMPI_OUTPUT is used to print out the data being packed and
unpacked. It generates a lot of output but will hopefully allow us to spot a few bugs. This layer is not
complete: the output stream descriptor is set to -1 (no output).
This commit was SVN r5617.
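The memcpy-before-ntoh* approach from r5617 above can be sketched for the 32-bit case as follows. unpack_uint32 is a hypothetical name, not the routine touched by the commit.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a network-order 32-bit value from a possibly unaligned buffer.
 * Copying into a properly aligned temporary avoids dereferencing a
 * misaligned uint32_t pointer. */
static uint32_t unpack_uint32(const void *src)
{
    uint32_t tmp;

    assert(src != NULL);
    memcpy(&tmp, src, sizeof(tmp));
    return ntohl(tmp);
}
```

Casting the source pointer directly to uint32_t* would be undefined behavior when the buffer offset is not a multiple of four, which is the kind of failure the commit describes.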
Anyway, now I'm able to run on several 64-bit architectures (Athlon and G5), so
I suppose that we are back online on 64 bits.
This commit was SVN r5616.
everything in one directory. Still have only one Makefile, so it shouldn't
change build time at all
* Now that I finally understand the header system for data, refactor a little
bit of the code to match what really should be happening
* start of a hacked up send() - puts the data for a 0 byte message on the
other side, and all the pointers are where i think they should be. So
my plan of attack will work. But I think I'm going to have to use
iovecs instead of memcpy() real soon now.
This commit was SVN r5610.
one is selected it will be used for all purposes: small messages and long messages (even if the
long message is still split into several fragments). When 2 PTLs exist per peer,
the first one is used for latency (small messages and rendezvous requests) while the second one
is used for bandwidth.
This commit was SVN r5600.
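A rough sketch of the one-or-two PTL selection described in r5600 above. The struct layout, the names, and the rule "fragment 0 goes to the latency PTL" are assumptions made for illustration; the real PML data structures are not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

struct ptl {
    const char *name;
};

struct peer_ptls {
    const struct ptl *latency;    /* always set */
    const struct ptl *bandwidth;  /* NULL when only one PTL is available */
};

/* Pick the PTL for a given fragment: the first fragment (a small message
 * or a rendezvous request) uses the latency PTL, later fragments use the
 * bandwidth PTL if there is one. */
static const struct ptl *ptl_for_fragment(const struct peer_ptls *peer,
                                          size_t frag_index)
{
    assert(peer != NULL && peer->latency != NULL);
    if (frag_index == 0 || peer->bandwidth == NULL) {
        return peer->latency;
    }
    return peer->bandwidth;
}
```

With a single PTL every fragment takes the same path, which matches the "used for all purposes" case in the message.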
Jeff sent me the way to do that automatically, and I'm pretty sure I'm not the only one who misses some
of the functionality of our build system. The idea is really cool: let only the developer of a
component have it active until it reaches a stable state. For everyone else, the .ompi_ignore
file prevents the component from being compiled.
cd src/mca/pml/uniq
echo $USER > .ompi_unignore
svn add .ompi_unignore
svn ci .ompi_unignore
This commit was SVN r5595.
The idea behind this PML is to minimize the overhead of managing multiple PTLs. For each node, UNIQ keeps two PTLs:
one for latency and one for bandwidth. In the next version I want to add a configure parameter to allow the user
to select how many PTLs they want: one or two.
This commit was SVN r5593.
based around PTL_MD_MAX_SIZE, which apparently isn't implemented in
Cray's Portals implementation. Time to rethink that design :/
This commit was SVN r5576.
HEADS UP: string versions of names are now presented in DECIMAL format - not HEX as they previously were. If you used the name services functions (as you were supposed to do) to access these names, you will not have any problems. If you did it yourself, then you need to fix it - my suggestion would be that you fix your code by using the name service functions to avoid future problems.
This commit was SVN r5571.
1. *correctly* fix the printing of size_t variables. Need to do this through a #define, not just typecast things. Thanks to Jeff/Brian for suggesting a cleaner way to do it (as opposed to just doing the #define at the print location). Note that not ALL of the prints have been "fixed" yet - will continue to identify them.
2. Add int64 and size_t to the pack/unpack unit tests.
3. Fix a bug in the int64 pack/unpack system.
This commit was SVN r5570.
the trick: I decided to always print it as an unsigned long and explicitly cast everything to this type.
Thus, I changed all printf formats from %d to %lu and cast all arguments to the correct type (unsigned long).
This commit was SVN r5568.