1. after we get enough jobs in the pidmap, the address of the jobmap pointer array data can move due to realloc. Need to reset the jobs pointer each time through to ensure it is pointing to valid data
2. when we exit the loop, rc will be set to an error due to reading past end of buffer - need to reset so it is ignored
3. need to ensure we only try to read one jobid each time through loop
This commit was SVN r20030.
This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message.
This commit was SVN r20022.
The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit.
If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state.
Mucho bad. :-/
This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths).
This commit was SVN r20018.
determing of the IP subnet. The netmask was being used improperly when
determining which subnet each connection is on. Part two is the ability to
include/exclude specific subnets.
This patch fixes ticket #1665
This commit was SVN r20016.
memory hooks (munmap) and initialize the mallopt component, and
nothing else.
Use this mpool in the MX common initialization, supporting both BTL
and MTL. Automatically set the MX_RCACHE environment variable to
enable registration cache in MX.
Tested with success for munmap() and large free().
This commit was SVN r20003.
because we wasted a good amount of time today assuming that it was
returning the actual netmask. Specifically, we were confused why it
returned 0x18 instead of 0xffffff00 for a class C subnet (the
head-smacking moment wasn't until [much] later when we converted 0x18
to decimal, which is 24. Then the Clue Light(tm) went on).
This commit was SVN r20002.
* If max_inline_data == -1 perform runtime detection
* If max_inline_data >=0 use the value provided
* If the user does not explicitly set this via command line, use the value from INI file
This commit fixes trac:1662
This commit was SVN r19995.
The following Trac tickets were found above:
Ticket 1662 --> https://svn.open-mpi.org/trac/ompi/ticket/1662
1. modify the iof to track when a proc actually closes all of its open iof output pipes. When this occurs, notify the odls that the proc's iof is complete. This is done via a zero-time event so that we can step out of the read event before processing the notification.
2. in the odls, modify the waitpid callback so it only flags that it was called. Add a function to receive the iof-complete notification, and a function that checks for both iof complete and waitpid callback before declaring a proc fully terminated. This ensures that we read and deliver -all- of the IO prior to declaring the job complete.
Also modified the odls call to orte_iof.close (and the component's implementation) so it only closes stdin, leaving the other io channels alone. This fixes the other half of the known problem.
This should fix the ticket on this subject, but I'll wait to close it pending further testing in the trunk.
This commit was SVN r19991.
This happens with really long paths as part of the variable name.
Found in MTT testing (where the paths are long). This will need to be moved to v1.3
This commit was SVN r19989.
This will sit in trunk for a few days - would like to actually see some errors reported to syslog before moving the code to 1.3
This commit was SVN r19986.
another bug. This also causes 0-byte requests to be treated as a buffer
error, causing the base request to be requeued. On Cray XT, it may be
temporarily impossible to make allocations for buffer requests, as the
default stack size is small (8 MB) and there is no true swap device.
Even with the stack size increased, there will be cases in which this
condition recurs.
One possibility is to make the buffer allocations off of the heap; but,
this does not change the fact that eventually an out-of-memory condition
will occur and we need to support multiple receives in transit, a
condition for which the available buffer space may change. On the other
hand, if we switch to allocating the buffer space from the heap, we will
need to return an error when the allocation fails and there are no other
buffers in transit.
This commit was SVN r19981.
It highlighted a bug in the bookmark component where for persistent sends we were not copying the context, but just moving it. This caused us to lose track of the message if it is started/completed multiple times.
This will need to be brought over to the v1.3 branch, but it should soak overnight to get a round of testing first.
This commit was SVN r19962.