The problem was eventually traced to two bugs in the code:
1. the orted wasn't resetting the write event flag, which prevented it from turning the write event on again.
2. the HNP needed to check whether stdin was attached to a tty before adding the delay for fairness. If it is attached to a tty, there is no need for the delay. This prevents some strangely slow typing response.
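The tty check in item 2 can be sketched roughly as follows. This is a minimal illustration, not the actual HNP code: the function name and constant are hypothetical, and the 10 ms value is assumed from the description of the fairness delay in r20064.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical constant: the fairness delay, assumed to be 10 ms. */
#define STDIN_FAIRNESS_DELAY_USEC 10000L

/* Hypothetical helper: how long to wait before re-arming the stdin
 * read event. An interactive tty gets no delay, so typing stays
 * responsive; a file or pipe gets the fairness delay. */
long stdin_restart_delay_usec(int fd)
{
    assert(fd >= 0);
    return isatty(fd) ? 0L : STDIN_FAIRNESS_DELAY_USEC;
}
```

Calling it with a pipe or redirected file descriptor yields the delay, while a terminal descriptor yields zero.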
This patch needs to move to 1.3.
This commit was SVN r20246.
Error output strings were changed to be unique per code site.
They are still pretty meaningless to the user, but at least now
developers might be able to find which unique place in the code
reported which error.
This commit was SVN r20238.
Consolidate the nid/pid lookup code with the rest of the nid/pid code so that changes are easier to track. Add the ability to send cluster profile info as part of the nidmap. Clean up the setup and teardown of the new global nidmap and pidmap objects.
This commit was SVN r20219.
Also, per chat with Jeff, modified the Makefile.am files of a few orte tools so that they are consistent in the way we generate the ompi-equivalent cmds.
This commit was SVN r20165.
changes to the already existing ccp components
event/win32.c: merge old FD handling into new
opal_installdirs_windows.c: fix the registry handling
This commit was SVN r20109.
Basically, the remaining problem turned out to be:
1. closing stdout/stderr during orte_finalize of mpirun
2. inadvertently setting up a write event on fd = -1
3. devising a scheme to more accurately track when the stdin write event was active vs closed so it only got released once
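Items 2 and 3 amount to a small piece of state tracking. The sketch below is one way to express it, under stated assumptions: the struct and function names are hypothetical and are not the IOF code, and no real descriptor is opened or closed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the stdin sink. */
typedef struct {
    int  fd;            /* -1 means there is no descriptor */
    bool write_pending; /* a write event is currently armed */
    bool closed;        /* the sink has already been released */
    int  releases;      /* how many times resources were actually freed */
} sink_t;

void sink_init(sink_t *s, int fd)
{
    assert(s != NULL);
    s->fd = fd;
    s->write_pending = false;
    s->closed = false;
    s->releases = 0;
}

/* Arm a write event; refuse on fd = -1, on a closed sink, or if one
 * is already active. */
bool sink_arm_write(sink_t *s)
{
    if (s->closed || s->fd < 0 || s->write_pending) {
        return false;
    }
    s->write_pending = true;
    return true;
}

/* The write event fired and has been serviced. */
void sink_write_done(sink_t *s)
{
    s->write_pending = false;
}

/* Release the sink; only the first call has any effect. */
bool sink_release(sink_t *s)
{
    if (s->closed) {
        return false;
    }
    s->closed = true;
    s->write_pending = false;
    s->fd = -1;
    s->releases++;
    return true;
}
```

Keeping "active" and "closed" as separate flags is what lets a second release become a no-op instead of a double free.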
This passed prelim MTT testing by Jeff and Tim, but should soak for a while before migrating to 1.3.
This commit was SVN r20106.
The following SVN revision numbers were found above:
r20064 --> open-mpi/ompi@a07660aea8
r20068 --> open-mpi/ompi@ec930d14a9
r20074 --> open-mpi/ompi@2940309613
This is a small launch performance improvement: we now relay the launch cmd across to the next daemon before taking the time to launch our own local procs. Small as it is, it does allow more parallel operations during the launch procedure.
This commit was SVN r20104.
1. minor modification to include two new opal MCA params:
   (a) opal_profile: outputs what components were selected by each framework;
       currently enabled for most, but not all, frameworks
   (b) opal_profile_file: name of file that contains profile info required
       for modex
2. introduction of two new tools:
   (a) ompi-probe: MPI process that simply calls MPI_Init/Finalize with
       opal_profile set. Also reports back the rml IP address for all
       interfaces on the node
   (b) ompi-profiler: uses ompi-probe to create the profile_file, also
       reports out a summary of what framework components are actually
       being used to help with configuration options
3. modification of the grpcomm basic component to utilize the
   profile file in place of the modex where possible
4. modification of orterun so it properly sees opal mca params and
   handles opal_profile correctly to ensure we don't get its profile
5. similar mod to orted as for orterun
6. addition of new test that calls orte_init followed by calls to
   grpcomm.barrier
This is all completely benign unless actively selected. At the moment, it only supports modex-less launch for openib-based systems. A minor mod to the TCP btl would be required to enable it as well, if people are interested. Similarly, anyone interested in enabling other BTLs for modex-less operation should let me know and I'll give you the magic details.
This seems to significantly improve scalability provided the file can be locally located on the nodes. I'm looking at an alternative means of disseminating the info (perhaps in launch message) as an option for removing that constraint.
This commit was SVN r20098.
1. a direct callback from waitpid - this set the waitpid_fired flag
2. a notify event callback from the IOF - this set the iof complete flag
3. a message via the daemon cmd processor from the proc "de-registering" the sync, thus indicating it was going through MPI_Finalize.
The problem is that these could overlap, with the first two allowing the orted to declare the proc complete before the daemon had responded to #3.
This change forces all three events to flow through the daemon cmd processor, thus ensuring an ordered handling. I'm not certain this will solve the problem, but will await further MTT reports to see. Unfortunately, the problem doesn't show up on any manual or script-based tests I have been able to run, even when I duplicate the exact cmd that fails under MTT.
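A rough sketch of the idea of funneling all three events through one ordered processor is shown below. All names here are hypothetical and this is not the orted code. The completion rule (waitpid fired and IOF complete) is assumed from the r20064 description; the point of the sketch is only that events are handled strictly in the order they were posted, so a de-registration posted first is always handled first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    EV_WAITPID_FIRED,
    EV_IOF_COMPLETE,
    EV_SYNC_DEREGISTERED
} proc_event_t;

#define QUEUE_CAP 8

/* Hypothetical single command processor with a FIFO ring buffer. */
typedef struct {
    proc_event_t events[QUEUE_CAP];
    size_t head;   /* index of the oldest queued event */
    size_t count;  /* number of queued events */
    bool waitpid_fired;
    bool iof_complete;
    bool deregistered;
} cmd_processor_t;

void cmdproc_init(cmd_processor_t *p)
{
    assert(p != NULL);
    p->head = 0;
    p->count = 0;
    p->waitpid_fired = false;
    p->iof_complete = false;
    p->deregistered = false;
}

/* Callbacks only enqueue; they never touch the flags directly. */
bool cmdproc_post(cmd_processor_t *p, proc_event_t ev)
{
    if (p->count == QUEUE_CAP) {
        return false;
    }
    p->events[(p->head + p->count) % QUEUE_CAP] = ev;
    p->count++;
    return true;
}

/* Handle the oldest queued event; returns false if the queue is empty. */
bool cmdproc_step(cmd_processor_t *p)
{
    if (p->count == 0) {
        return false;
    }
    proc_event_t ev = p->events[p->head];
    p->head = (p->head + 1) % QUEUE_CAP;
    p->count--;
    switch (ev) {
    case EV_WAITPID_FIRED:     p->waitpid_fired = true; break;
    case EV_IOF_COMPLETE:      p->iof_complete = true;  break;
    case EV_SYNC_DEREGISTERED: p->deregistered = true;  break;
    }
    return true;
}

/* Assumed completion rule: both waitpid and IOF have been handled. */
bool cmdproc_complete(const cmd_processor_t *p)
{
    return p->waitpid_fired && p->iof_complete;
}
```

Because the flags change only inside `cmdproc_step`, the completion check can never observe the first two events ahead of an earlier de-registration.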
This commit was SVN r20074.
1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app
2. change of all IOF read and write events to be non-persistent so they can properly be shut down and restarted only when required
3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin.
This commit was SVN r20064.
This seems to fix the segv seen on process restart.
This commit was SVN r20051.
The following SVN revision numbers were found above:
r20022 --> open-mpi/ompi@9a57db4a81
This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message.
This commit was SVN r20022.
The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit.
If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state.
Mucho bad. :-/
This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths).
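The "wait for the launcher to report complete before exiting" pattern can be sketched with plain fork/exec/waitpid. This is a simplified, blocking illustration under stated assumptions: the function name is hypothetical, any command can stand in for srun, and the real plm code is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork/exec a launcher command and do not return
 * until the child has been reaped. Returns the child's exit status,
 * or -1 on failure or abnormal termination. */
int run_launcher_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    int status = 0;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0); /* retry if interrupted by a signal */
    } while (r < 0 && errno == EINTR);

    if (r < 0 || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

The parent exits only after `waitpid` has returned, which is the ordering the commit describes for mpirun and srun.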
This commit was SVN r20018.