Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed.
This commit was SVN r18664.
This patch also fixes a minor bug discovered along the way: we had "lost" the passing of the oversubscribed condition flag from the mapper to the orteds. Thus, we were not setting sched_yield correctly when in oversubscribed conditions (except when a hostfile was specified - different logic there because we treat the number of slots allocated on the node as "uncertain")
I did not modify the process component in this patch - I will send a proposed patch to the maintainers of that component so they can review it first.
This commit was SVN r16418.
and implementation. This has shown drastic performance benefit when
transferring Many files at roughly the same time.
I tested this for many different filem operations and everything was working
fine. Let me know if you have any problems with this functionality.
Some Notes:
- opal-checkpoint now has a 'quiet' flag to keep it from being too verbose.
- FileM RSH component is fully non-blocking.
- FileM RSH component has incomming connection throttling since by default
ssh only allows 10 concurrent scp connections to any single host. This
default can be adjusted via an MCA parameter.
{{{-mca filem_rsh_max_incomming 10}}}
- There is an MCA parameter for max outgoing connections, but it is currently
not implemented. If someone needs it then it should not be hard to implement.
{{{-mca filem_rsh_max_outgoing 10}}}
- Changed the FileM request structure so that it is a bit more explicit and
flexible.
- Moved the 'preload-binary' and 'preload-files' functionality into odls/base
allowing for code reuse in the 'process' and 'default' ODLS components.
- Fixed a bug in the process name resolution which broke the 'preload-*'
functionality due to GPR table structure changes.
- The FileM RSH component might be able to see even more speedup from using a
thread pool to operate on the work_pool structures, but that is for future
work.
- Added a 'opal-show-help' file to ODLS Base
This commit was SVN r16252.
Short description: major changes include -
1. singletons now fork/exec a local daemon to manage their operations.
2. the orte daemon code now resides in libopen-rte
3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided.
I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine.
This commit was SVN r15390.
1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs.
2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system.
This commit was SVN r12558.
This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue.
This commit was SVN r12225.