I know it's just a technicality, but it is time to address such things rather than just letting them continue to propagate. :-)
This commit was SVN r12954.
This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it".
This change will need to migrate across to 1.2 after it "soaks" the appropriate time.
This commit was SVN r12952.
is allocated on a per comm_world instance, with the lowest rank
in comm_world on the given host creating and initializing the file,
and then notifying the remaining processes via the OOB.
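In rough pseudo-C, the pattern described is (all helper names here are
hypothetical, not the actual mpool/OOB calls):
{{{
/* per-host setup: lowest comm_world rank on the host creates the
 * file; the OOB message is just a "file is ready" signal */
if (my_rank == lowest_rank_on_this_host) {
    create_and_init_backing_file(path);   /* hypothetical helper */
    oob_notify_local_peers(path);         /* hypothetical helper */
} else {
    oob_wait_for_file_ready(path);        /* hypothetical helper */
}
attach_to_backing_file(path);             /* everyone maps the file */
}}}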
Reviewed: Ralph Castain, Brian Barrett
Addressing ticket #674.
This commit was SVN r12949.
rc (which is -1 or 4 if we hit this case) resulted in an odd error
claiming that a signal killed the proc (instead of a startup error, as
is the reality).
Instead, use the W_EXITCODE macro (if available) to build up an exit
code that carries an error code as the exit status, but does not make
it look like the process died from a signal.
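For reference, a minimal sketch of what that looks like. W_EXITCODE is
a glibc extension in <sys/wait.h> (hence the "if available" guard); the
fallback shift is my assumption about the common status layout, not
something this commit promises:
{{{
#include <sys/wait.h>

/* build a wait()-style status word meaning "exited with code rc"
 * rather than "killed by signal rc" */
static int build_exit_status(int rc)
{
#ifdef W_EXITCODE
    return W_EXITCODE(rc, 0);      /* exit status = rc, no signal */
#else
    return (rc & 0xff) << 8;       /* typical layout: code in bits 8-15 */
#endif
}
}}}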
This commit was SVN r12890.
1. no -np provided - put one proc/node across all allocated nodes
2. -np N provided, N > #nodes - we print a pretty error message and exit
3. -np N provided, N <= #nodes - put one proc/node across N nodes
I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error.
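Roughly, the resulting control flow looks like this (a sketch only -
the show_help file and topic names are made up, not the actual rmaps
code):
{{{
if (!np_was_given) {                          /* case 1 */
    num_procs = num_allocated_nodes;          /* one proc per node */
} else if (num_procs > num_allocated_nodes) { /* case 2 */
    /* pretty error message; file/topic names here are hypothetical */
    opal_show_help("help-rmaps.txt", "too-many-procs", true);
    return ORTE_ERR_SILENT;                   /* no ORTE_ERROR_LOG chain */
} else {                                      /* case 3 */
    /* one proc per node across the first num_procs nodes */
}
}}}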
This commit was SVN r12821.
I found only two places that were looking at the tokens:
1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.
2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "strip" the value structures of tokens and segment strings, and (c) correctly obtain process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).
This commit was SVN r12813.
Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available.
This commit was SVN r12788.
1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1".
2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial". (A sketch of the binomial fan-out follows this list.)
3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1".
4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer)
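For item 2, the binomial mode relays the message along the classic
binomial tree - O(log N) rounds instead of N sequential sends. A
generic, self-contained sketch of that pattern (not the actual xcast
code; the sends/recvs are left as comments):
{{{
/* classic binomial broadcast from rank 0 among "size" daemons */
void binomial_relay(int rank, int size)
{
    int mask = 1;
    /* receive phase: my parent is my rank minus my lowest set bit */
    while (mask < size) {
        if (rank & mask) {
            /* recv_from(rank - mask); */
            break;
        }
        mask <<= 1;
    }
    /* send phase: relay to peers at decreasing power-of-two offsets */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size) {
            /* send_to(rank + mask); */
        }
        mask >>= 1;
    }
}
}}}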
This commit was SVN r12722.
Ralph identified the problem, I tracked down ''where'' the fd was
being closed, and Brian figured out ''why'' (and the fix).
What was happening is that a remote process was closing its
stdout/stderr and therefore sending a 0-byte IOF message to mpirun.
mpirun, in turn, closed the iof endpoint associated with that stream
(i.e., stdout/stderr). IOF does this to handle the case where
mpirun's stdin is closed -- this causes all the ORTE-started
processes to have their stdins closed as well.
So the workaround here is: if we get a 0-byte IOF message on a sink
(indicating a remote closure), and that sink is the special stdout or
stderr stream, don't actually close anything in the local process.
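In rough code form, the guard looks something like this (field and
constant names are approximations, not the exact ones in the tree):
{{{
/* a 0-byte read means the remote side closed its stream; for the
 * shared stdout/stderr sinks, just ignore it */
if (0 == num_bytes &&
    (ORTE_IOF_STDOUT == sink->tag || ORTE_IOF_STDERR == sink->tag)) {
    return;   /* leave the local endpoint open for everyone else */
}
}}}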
This commit was SVN r12691.
The following Trac tickets were found above:
Ticket 635 --> https://svn.open-mpi.org/trac/ompi/ticket/635
because they are in ORTE, not OMPI. Also, remove the ORTE_PROCESS_NAME macros
in iof base as they are duplicates of the ones that were in ns_types, which
meant that bad things happened if you changed what an orte_process_name_t
looked like.
This commit was SVN r12646.
We were burned again by the fact that the bproc state monitor creates entries on the node segment for *all* the nodes in the cluster when it is opened during orte_init. As a result, the bjs allocator was never being called, and the system merrily assumed that *all* nodes in the cluster had been allocated to it.
To fix this, I removed a test that had been inserted into the allocation procedure that checked for a non-zero node segment. This was an old artifact - the RAS components already know that they are not to overwrite any existing node segment entries (at least, bproc does - I will check the others. For now, I just want to save the bproc fix on this machine).
This commit was SVN r12640.
Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check for HNP/non-HNP operation is made in the map_jobs function prior to execution.
This commit was SVN r12619.
Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone.
This commit was SVN r12616.
Note that Bproc won't support this operation, so we just ignore the --reuse-daemons directive.
I'm afraid I don't understand the POE and XGrid environments well enough to attempt the necessary modifications.
Also, please note that XGrid support has been broken on the trunk. I don't understand the code syntax well enough to make the required changes to that PLS component, so it won't compile at the moment. I'm hoping Brian has a few minutes to fix it after SC.
This commit was SVN r12614.
1. new functionality in the pls base to check for reusable daemons and launch upon them
2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED
3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse".
This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on.
This commit was SVN r12608.
1. use non-blocking sends to transmit commands (this was actually done in a prior commit)
2. have an "ack" message sent back from the orted when it completes the command
The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occasionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done.
Best of both worlds.
This commit was SVN r12605.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
Add some debugger output to the ODLS default component.
Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time.
This commit was SVN r12585.
Make it so if -np was not passed and -pernode was, we map bynode
This commit was SVN r12580.
The following Trac tickets were found above:
Ticket 612 --> https://svn.open-mpi.org/trac/ompi/ticket/612
Add some debugging output to the ODLS default module, and the orted.
Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure.
Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved.
Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly.
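Presumably the idea is along these lines (my sketch of the
yield-instead-of-sleep approach; the helper is hypothetical and this is
not the committed code):
{{{
#include <sched.h>

/* give up the processor between kill/waitpid passes instead of
 * sleeping for a fixed interval */
while (local_procs_still_alive()) {   /* hypothetical helper */
    sched_yield();
}
}}}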
This commit was SVN r12579.
1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had difficulty with the case where the process was never actually started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs.
2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system.
This commit was SVN r12558.
check for Bourne shell, because Bourne shell is the lowest common
denominator for bash/ksh/sh.
- Make some shell expressions sh-compatible
This commit was SVN r12509.
1. Added reporting points around the xcasts in MPI_Init. Note that these times will include time spent waiting for a trigger to fire, which is why the times between stage gates did NOT include these times initially. The inter-stage-gate times still do NOT include the xcast time - the xcast time is reported separately.
2. Added the process vpid on the MPI_Init timing reports for clarity.
3. Added a report from the xcast function on the HNP that outputs the number of bytes in the message being sent to the processes.
This commit was SVN r12422.
1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch"
2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command.
3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job
4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map.
Enjoy! My personal favorite is the first one - helps with debugging.
This commit was SVN r12379.
Setup subscriptions to correctly return the MPI_APPNUM attribute.
Fix an unreported bug that was found. The universe size was incorrectly defined in the attributes code. As coded, it looked for size_t values and based its size computation on those numbers. Unfortunately, the node_slots value had been changed to an orte_std_cntr_t a while back! So the universe size was never updated.
Update the hello_nodename test to check for MPI_APPNUM (see the example below).
Add a definition to ns_types for ORTE_PROC_MY_NAME - just a shortcut for orte_process_info.my_name. Brought over from ORTE 2.0 as it will be used extensively there.
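A self-contained way to exercise the attribute, in the same spirit as
the updated hello_nodename test:
{{{
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int *appnum, flag;

    MPI_Init(&argc, &argv);
    /* MPI_APPNUM identifies which app_context this process came from */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_APPNUM, &appnum, &flag);
    if (flag) {
        printf("MPI_APPNUM = %d\n", *appnum);
    }
    MPI_Finalize();
    return 0;
}
}}}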
This commit was SVN r12377.
Give a more intelligible error message when someone passes -nolocal and the only available node is the local node.
This commit was SVN r12325.
The following Trac tickets were found above:
Ticket 487 --> https://svn.open-mpi.org/trac/ompi/ticket/487
If you want to look at our launch and MPI process startup times, you can do so with two MCA params:
OMPI_MCA_orte_timing: set it to anything non-zero and you will get the launch time for different steps in the job launch procedure. The degree of detail depends on the launch environment. rsh will provide you with the average, min, and max launch time for the daemons. SLURM block launches the daemon, so you only get the time to launch the daemons and the total time to launch the job. Ditto for bproc. TM looks more like rsh. Only those four environments are currently supported - anyone interested in extending this capability to other environs is welcome to do so. In all cases, you also get the time to setup the job for launch.
OMPI_MCA_ompi_timing: set it to anything non-zero and you will get the time for mpi_init to reach the compound registry command, the time to execute that command, the time to go from our stage1 barrier to the stage2 barrier, and the time to go from the stage2 barrier to the end of mpi_init. This will be output for each process, so you'll have to compile any statistics on your own. Note: if someone develops a nice parser to do so, it would be really appreciated if you could/would share!
This commit was SVN r12302.
packing a sockaddr_in, as there are some endianness and padding issues
with sending a sockaddr_in. Note that the sin_port and sin_addr are
already in network byte order, which is why we pack them as a byte
string.
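A sketch of the byte-string packing (assuming the dss pack call of this
vintage; the exact signature may differ):
{{{
#include <netinet/in.h>

struct sockaddr_in addr;   /* already filled in; sin_port and sin_addr
                            * are already in network byte order */

/* pack as opaque bytes - sidesteps endianness and struct padding */
orte_dss.pack(buffer, &addr.sin_port, sizeof(addr.sin_port), ORTE_BYTE);
orte_dss.pack(buffer, &addr.sin_addr, sizeof(addr.sin_addr), ORTE_BYTE);
}}}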
Refs trac:493
This commit was SVN r12301.
The following Trac tickets were found above:
Ticket 493 --> https://svn.open-mpi.org/trac/ompi/ticket/493
Only close off stdout/stderr from the daemons if we are not debugging the slurm pls and --debug-daemons was not passed.
This commit was SVN r12276.
The following Trac tickets were found above:
Ticket 352 --> https://svn.open-mpi.org/trac/ompi/ticket/352
Get the ordering right so that a singleton can start.
Protect the rmgr copy app_context function from NULL fields
Tell the mapper it is okay for there not to be a pre-existing mapping plan for a parent when dynamically spawning processes
This commit was SVN r12257.
Fix the persistent daemon problem where it was exiting when a job completed. Problem was that the persistent daemon would order the job daemons to exit. They would then send an 'ack' back to the persistent daemon - but the ack consisted of an echo of the "exit" command, which was recv'd by the wrong listener, who treated it as a properly sent cmd... and exited.
This commit was SVN r12243.
- It is possible to leave a byslot/bynode routine and have cur_node_item be NULL, so check for that.
- After we do an allocation where the user has provided a map (i.e. with --host), cur_node_item is pointing into the map list, not the global list. Change it to point into the global list.
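i.e., guards roughly like these (illustrative, not the committed diff):
{{{
/* byslot/bynode mapping can walk off the end of the list */
if (NULL == cur_node_item) {
    cur_node_item = opal_list_get_first(&node_list);
}
/* after a user-provided --host map, re-point the bookmark from the
 * map's own list back into the global node list before continuing */
}}}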
This commit was SVN r12232.
Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come.
This commit was SVN r12229.
This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue.
This commit was SVN r12225.
- Simplified the logic of the ras modules by moving the attribute handling into the base allocation function. This allows us to decide how to allocate based on the situation, and solves some of the allocation problems we were having with comm_spawn.
- moved the proxy component into the base. This was done because we always want to call the proxy functions if we are not on an HNP, regardless of the attributes passed.
- Got rid of the hostfile component. What little logic was in it was moved into the base to deal with other circumstances. The hostfile information is currently being propagated into the registry by the RDS, so we just use what is already in the registry.
- renamed some slurm function so that they have the proper prefix. Not strictly necessary as they were static, but it makes debugging much easier.
- fixed a buglet in the round_robin rmaps where we would return an error when really no error occurred.
I tried to make proper corrections to all the ras modules, but I cannot test all of them.
This commit was SVN r12202.
Update the mapper so it correctly points to the next node to be used if we are mapping by slots. As it was, if we had an app_context that used only one slot on a node, the next app_context would start on the next node - leaving a blank slot in-between.
This commit was SVN r12193.
Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system.
Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn).
This commit was SVN r12179.
This change does a couple of things:
1. Since the USE_PARENT_ALLOC attribute is a directive regarding the allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file.
2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface).
This commit was SVN r12164.
In this implementation, we begin mapping on the first node that has at least one slot available as measured by the slots_inuse versus the soft limit. If none of the nodes meet that criterion, we just start at the beginning of the node list since we are oversubscribed anyway.
Note that we ignore this logic if the user specifies a mapping - then it's just "user beware".
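The starting-point rule sketched in code (the list and field names are
illustrative):
{{{
/* start at the first node with a free slot under the soft limit; if
 * every node is full we are oversubscribing anyway, so wrap to the
 * head of the list */
opal_list_item_t *start = NULL, *item;

for (item  = opal_list_get_first(&node_list);
     item != opal_list_get_end(&node_list);
     item  = opal_list_get_next(item)) {
    orte_ras_node_t *node = (orte_ras_node_t*)item;
    if (node->node_slots_inuse < node->node_slots) {
        start = item;
        break;
    }
}
if (NULL == start) {
    start = opal_list_get_first(&node_list);
}
}}}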
The real root cause of the problem is that we don't adjust sched_yield as we add processes onto a node. Hence, the node becomes oversubscribed and performance goes into the toilet. What we REALLY need to do to solve the problem is:
(a) modify the PLS components so they reuse the existing daemons,
(b) create a way to tell a running process to adjust its sched_yield, and
(c) modify the ODLS components to update the sched_yield on a process per the new method
Until we do that, we will continue to have this problem - all this fix (and any subsequent one that focuses solely on the mapper) does is hopefully make it happen less often.
This commit was SVN r12145.
Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off.
To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place.
I used those capabilities in two places:
1. Added an attribute list to the rmgr.spawn interface.
2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h).
So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms.
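Usage ends up looking roughly like this (the attribute helper and spawn
signature are paraphrased, not the real API - only the
USE_PARENT_ALLOCATION attribute itself is named in rmgr_types.h):
{{{
/* illustrative: ask the RAS to reuse the parent job's allocation
 * when spawning a child job */
opal_list_t attrs;

OBJ_CONSTRUCT(&attrs, opal_list_t);
orte_rmgr.add_attribute(&attrs, ORTE_RMGR_USE_PARENT_ALLOCATION); /* approx */
rc = orte_rmgr.spawn(apps, num_apps, &child_jobid, &attrs);       /* approx */
OBJ_DESTRUCT(&attrs);
}}}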
This commit was SVN r12138.
seed value have something set to true. Allow selection of the threaded
listen type if (and only if) the process is the HNP...
This commit was SVN r12105.
processes launched locally for the stdio file names. This was causing
the expected files to not exist and bproc_vexecmove_io to fail.
* Clean up a bunch of debugging output in the bproc pls
This commit was SVN r12102.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).
Gridengine compiles but I cannot test (believe it likely will run).
Poe and xgrid compile to the extent they can without the proper include files.
This commit was SVN r12059.