Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system.
This commit was SVN r18252.
Restore the "do-not-launch" functionality so users can test a mapping without launching it.
Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests.
Add a function to hostfile to generate an ordered list of host names from a hostfile
This commit was SVN r18190.
about linkers, have all OPAL, ORTE, and OMPI components '''not'' link
against the OPAL, ORTE, or OMPI libraries.
See ttp://www.open-mpi.org/community/lists/users/2007/10/4220.php for
details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a
better-formatted version of the same info).
This commit was SVN r16968.
You will not see any impact from this change unless you use the syntax described in ticket #1023. I've tried as many of the RAS components as possible and saw no problem - there may be issues with other RAS components that would not compile on any of my systems. Anything that appears should be trivial to fix.
This commit was SVN r15427.
1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names.
2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (b) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used.
3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying.
Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed.
This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems.
This commit was SVN r15007.
To be precise, given this hypothetical launching pattern:
host1: vpids 0, 2, 4, 6
host2: vpids 1, 3, 5, 7
The local_rank for these procs would be:
host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3
host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3
and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid.
I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future.
This commit was SVN r14706.
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search patch is checking
$pkglibdir).
This commit was SVN r14345.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r14289
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.
This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.
This commit closes trac:158
More details to follow.
This commit was SVN r14051.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r13912
The following Trac tickets were found above:
Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
components that use configure.m4 for configuration or are always built.
The macro has not been needed since moving to configure types other than
configure.stub
Fixes trac:590
This commit was SVN r13031.
The following Trac tickets were found above:
Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it".
This change will need to migrate across to 1.2 after it "soaks" the appropriate time.
This commit was SVN r12952.
1. no -np provided - put one proc/node across all allocated nodes
2. -np N provided, N > #nodes - we print a pretty error message and exit
3. -np N provided, N <= #nodes - put one proc/node across N nodes
I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error.
This commit was SVN r12821.
Make it so if -np was not passed and -pernode was, we map bynode
This commit was SVN r12580.
The following Trac tickets were found above:
Ticket 612 --> https://svn.open-mpi.org/trac/ompi/ticket/612
- It is possible to leave a byslot/bynode routine and have cur_node_item be NULL, so check for that.
- After we do an allocation where the user has provided a map (i.e. with --host), cur_node_item is pointing into the map list, not the global list. Change it to point into the global list.
This commit was SVN r12232.
Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come.
This commit was SVN r12229.
- Simplified the logic of the ras modules by moving the attribute handling into the base allocation function. This allows us to decide how to allocate based on the situation, and solves some of the allocation problems we were having with comm_spawn.
- moved the proxy component into the base. This was done because we always want to call the proxy functions if we are not on a HNP regardless of the attributes passed.
- Got rid of the hostfile component. What little logic was in it was moved into the base to deal with other circumstances. The hostfile information is currently being propagated into the registry by the RDS, so we just use what is already in the registry.
- renamed some slurm function so that they have the proper prefix. Not strictly necessary as they were static, but it makes debugging much easier.
- fixed a buglet in the round_robin rmaps where we would return an error when really no error occured.
I tried to make proper corrections to all the ras modules, but I cannot test all of them.
This commit was SVN r12202.
Update the mapper so it correctly points to the next node to be used if we are mapping by slots. As it was, if we had an app_context that used only one slot on a node, the next app_context would start on the next node - leaving a blank slot in-between.
This commit was SVN r12193.
Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system.
Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn).
This commit was SVN r12179.
This change does a couple of things:
1. Since the USE_PARENT_ALLOC attribute is a directive about regarding allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file.
2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface).
This commit was SVN r12164.
In this implementation, we begin mapping on the first node that has at least one slot available as measured by the slots_inuse versus the soft limit. If none of the nodes meet that criterion, we just start at the beginning of the node list since we are oversubscribed anyway.
Note that we ignore this logic if the user specifies a mapping - then it's just "user beware".
The real root cause of the problem is that we don't adjust sched_yield as we add processes onto a node. Hence, the node becomes oversubscribed and performance goes into the toilet. What we REALLY need to do to solve the problem is:
(a) modify the PLS components so they reuse the existing daemons,
(b) create a way to tell a running process to adjust its sched_yield, and
(c) modify the ODLS components to update the sched_yield on a process per the new method
Until we do that, we will continue to have this problem - all this fix (and any subsequent one that focuses solely on the mapper) does is hopefully make it happen less often.
This commit was SVN r12145.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).
Gridengine compiles but I cannot test (believe it likely will run).
Poe and xgrid compile to the extent they can without the proper include files.
This commit was SVN r12059.
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unamed structures
- no implicit cast.
Plus a full implementation for the orte_wait functions.
This commit was SVN r11347.
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.
This commit was SVN r11270.
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).
Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).
I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).
In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...
Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.
This commit was SVN r11204.
Update the help text to report errors when not following that rule.
Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base.
The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so.
This commit was SVN r10708.
1. Modifies the RAS framework so it correctly stores and retrieves the actual slots in use, not just those that were allocated. Although the RAS node structure had storage for the number of slots in use, it turned out that the base function for storing and retrieving that information ignored what was in the field and simply set it equal to the number of slots allocated. This has now been fixed.
2. Modified the RMAPS framework so it updates the registry with the actual number of slots used by the mapping. Note that daemons are still NOT counted in this process as daemons are NOT mapped at this time. This will be fixed in 2.0, but will not be addressed in 1.x.
3. Added a new MCA parameter "rmaps_base_no_oversubscribe" that tells the system not to oversubscribe nodes even if the underlying environment permits it. The default is to oversubscribe if needed and the underlying environment permits it. I'm sure someone may argue "why would a user do that?", but it turns out that (looking ahead to dynamic resource reservations) sometimes users won't know how many nodes or slots they've been given in advance - this just allows them to say "hey, I'd rather not run if I didn't get enough".
4. Reorganizes the RMAPS framework to more easily support multiple components. A lot of the logic in the round_robin mapper was very valuable to any component - this has been moved to the base so others can take advantage of it.
5. Added a new test program "hello_nodename" - just does "hello_world" but also prints out the name of the node it is on.
6. Made the orte_ras_node_t object a full ORTE data type so it can more easily be copied, packed, etc. This proved helpful for the RMAPS code reorganization and might be of use elsewhere too.
This commit was SVN r10697.
/tmp/tm-merge). Validated by RHC. Summary:
- Add --nolocal (and -nolocal) options to orterun
- Make some scalability improvements to the tm pls
This commit was SVN r10651.
We were stuck in an infinite loop inside the rmaps round_robin
component when the user specified a host, then over subscribed it.
Instead of retuning an error, we looped forever.
For example:
$ cat hostfile
A slots=2 max-slots=2
B slots=2 max-slots=2
$ mpirun -np 3 --hostfile hostfile --host B
<hang>
The loop would not terminate because both host A and B are in the
'nodes' structure as they are both allocated to the job. However,
after allocating 2 slots to host B, we remove it from the node list
leaving us with a 'nodes' structure with just A in it. Since we can't
use host A, we keep looping here until we find a node that we can use.
This patch checks to make sure that if we get into this situation where
rmaps is looping over the list a second time without finding a node
during the first pass then we know that there are no nodes left to
use, so we have a resource allocation error, and should return to the user.
This patch should be moved to all of the release branches
This commit was SVN r10131.
- move files out of toplevel include/ and etc/, moving it into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h
mpi.h, and mpif.h)
This commit was SVN r8985.