1
1

65 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
455e4ada9a Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes.
This commit was SVN r13191.
2007-01-18 17:15:19 +00:00
Ralph Castain
90f5e3fad8 Fix a buglet in the singleton startup procedure. For purposes of minimizing the xcast message, we "strip" the descriptive info on all subscription messages. This means, though, that we have to store the process name and other info so it can be retrieved in the body of the subscription data (as opposed to in the description). This wasn't being done for singletons because they don't call the RMAPS to "map" themselves.
This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it".

This change will need to migrate across to 1.2 after it "soaks" the appropriate time.

This commit was SVN r12952.
2007-01-02 16:14:44 +00:00
Ralph Castain
64ec238b7b Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs.
This commit was SVN r12864.
2006-12-15 02:34:14 +00:00
Ralph Castain
3b064a624e For convenience, revise the orte_job_map_t object so it includes the vpid start/range values, the number of nodes, and the number of processes on each node. These values are all used in various places in the code base - we currently re-compute them multiple times. Since these values do not change and are already being computed by the RMAPS framework, we might as well just save them for re-use.
This commit was SVN r12829.
2006-12-12 16:07:23 +00:00
Ralph Castain
28ce8e5e5e Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior).
In "--npernode" operation, the "-np" command line parameter is ignored.

This commit was SVN r12826.
2006-12-12 00:54:05 +00:00
Ralph Castain
eb941d8ae2 Fix a bug that declared a node as "oversubscribed" a little early during the mapper procedure. This only affected the mapping procedure, and only if you had set the "--no-oversubscribe" flag.
Kudos to Tim Prins for finding it.

This commit was SVN r12757.
2006-12-05 13:04:27 +00:00
Ralph Castain
f771cc4fbd Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option.
Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution.

This commit was SVN r12619.
2006-11-17 19:06:10 +00:00
Ralph Castain
ca5b4358fa Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too.
Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone.

This commit was SVN r12616.
2006-11-17 02:58:46 +00:00
Ralph Castain
6d6cebb4a7 Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.

I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).

This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
Tim Prins
894b220fbb Fixes trac:487
Give a more intelligible error message when someone passes -nolocal and the only available node is the local node.

This commit was SVN r12325.

The following Trac tickets were found above:
  Ticket 487 --> https://svn.open-mpi.org/trac/ompi/ticket/487
2006-10-26 21:46:18 +00:00
Ralph Castain
601443e690 Fix the other end of the attribute store/retrieve to correctly handle attributes with no value
This commit was SVN r12290.
2006-10-25 01:41:58 +00:00
Ralph Castain
46166a9c77 Fix a bug that would cause a segfault when someone specified an option that had no corresponding value
This commit was SVN r12288.
2006-10-24 22:06:18 +00:00
Ralph Castain
7a77ef0ae3 Given the amount of pain singletons cause, one can't help but wonder if it REALLY was that much trouble for people to type "mpirun -n 1 foo"....sigh.
Get the ordering right so that a singleton can start.

Protect the rmgr copy app_context function from NULL fields

Tell the mapper it is okay for there not to be a pre-existing mapping plan for a parent when dynamically spawning processes

This commit was SVN r12257.
2006-10-23 15:15:45 +00:00
George Bosilca
2aa3e51223 Nothing relevant. Only a set of castings to have a clean compile on
Windows. The cl.exe compiler is pretty good at complaining about
any kind of non explicit cast.

This commit was SVN r12207.
2006-10-20 02:25:50 +00:00
Tim Prins
ab964d096a Need a terminating NULL...
This commit was SVN r12180.
2006-10-18 20:52:31 +00:00
Ralph Castain
d0eb7d7216 Complete the attribute management functions.
Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system.

Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn).

This commit was SVN r12179.
2006-10-18 20:02:16 +00:00
Ralph Castain
f4a458532b This doesn't totally resolve the comm_spawn problem, but it helps a little. I'll continue working on it and hope to resolve it completely shortly. The issue primarily centers on where to start mapping the child job's processes, and how to deal with oversubscription that might result. At the moment, I am trying to resolve the first issue first (hey, that even sounds right!).
This change does a couple of things:

1. Since the USE_PARENT_ALLOC attribute is a directive about regarding allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file.

2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface).

This commit was SVN r12164.
2006-10-18 14:01:44 +00:00
Ralph Castain
699ffcf359 Restore the "bynode" mapping functionality - accidentally deleted setting of parameter
This commit was SVN r12078.
2006-10-10 19:41:22 +00:00
George Bosilca
7dadc1832d Correctly export the required functions. They are defined in a private file, but they are completely public.
This commit was SVN r12070.
2006-10-10 04:54:51 +00:00
Ralph Castain
2e09128337 Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destuctor to fail!!
Restore the OBJ_RELEASE calls to cleanup map objects.

This commit was SVN r12064.
2006-10-07 22:44:00 +00:00
Ralph Castain
98dd57b70e Add a new option to launch "pernode" - launches one process/node across all available nodes.
The other options also work correctly: "-bynode" with no -np will launch on all *slots*, mapped on a per-node basis.

This commit was SVN r12063.
2006-10-07 19:50:12 +00:00
Ralph Castain
ae79894bad Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).

Gridengine compiles but I cannot test (believe it likely will run).

Poe and xgrid compile to the extent they can without the proper include files.

This commit was SVN r12059.
2006-10-07 15:45:24 +00:00
George Bosilca
090b8a9098 opal_list_is_empty return true or false ...
This commit was SVN r12000.
2006-10-05 05:26:08 +00:00
George Bosilca
ad5810e33f ORTE_DECLSPEC what needs to be ORTE_DECLSPES.
This commit was SVN r11997.
2006-10-05 05:22:22 +00:00
Ralph Castain
faf3a558e6 Missing CR at end of file
This commit was SVN r11959.
2006-10-03 18:17:52 +00:00
Ralph Castain
cd7d87aa7b Define the map data types for dss compatibility. Setup to debug bproc
This commit was SVN r11955.
2006-10-03 17:40:00 +00:00
Ralph Castain
121f834776 Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component.
Still need some work on the proxy component, and on job termination for persistent daemon case.

This commit was SVN r11928.
2006-10-02 00:46:31 +00:00
Ralph Castain
37dfdb76eb Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done.
This commit was SVN r11661.
2006-09-14 21:29:51 +00:00
George Bosilca
f52c10d18e And ORTE is ready for prime-time. All Windows tricks are in:
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unamed structures
- no implicit cast.

Plus a full implementation for the orte_wait functions.

This commit was SVN r11347.
2006-08-23 03:32:36 +00:00
George Bosilca
6afa4c6c64 Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.

This commit was SVN r11270.
2006-08-20 15:54:04 +00:00
Ralph Castain
8c7f0ed9ae Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system.
Other changes:

1. Remove the old xcpu components as they are not functional.

2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.

This will require an autogen/configure, I'm afraid.

This commit was SVN r11228.
2006-08-16 16:35:09 +00:00
Ralph Castain
5dfd54c778 With the branch to 1.2 made....
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).

Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).

I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).

In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...

Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.

This commit was SVN r11204.
2006-08-15 19:54:10 +00:00
Josh Hursey
9a31060b6d Fix r10725 so that the trunk builds again.
This commit was SVN r10733.

The following SVN revision numbers were found above:
  r10725 --> open-mpi/ompi@ae222cca5b
2006-07-11 14:48:31 +00:00
Ralph Castain
ae222cca5b Include the help file so it can be accessed
This commit was SVN r10725.
2006-07-11 12:15:25 +00:00
Ralph Castain
6129a5a887 Enable -host support for "mpirun a.out". You can now execute on all slots on specified nodes within your overall allocation.
This commit was SVN r10713.
2006-07-11 02:59:23 +00:00
Ralph Castain
febc143d8c Per LANL's stated need, add functionality that runs a.out across ALL available process slots if no num_proc is specified on the command line. However, please note the following limitation: we ONLY allow ONE application to be specified on the command line when this feature is invoked. If multiple apps are specified, the user MUST also specify the number to be launched for each and every one of them.
Update the help text to report errors when not following that rule.

Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base.

The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so.

This commit was SVN r10708.
2006-07-10 21:25:33 +00:00
Ralph Castain
3d220cbd48 This patch fixes several issues relating to comm_spawn and N1GE. In particular, it does the following:
1. Modifies the RAS framework so it correctly stores and retrieves the actual slots in use, not just those that were allocated. Although the RAS node structure had storage for the number of slots in use, it turned out that the base function for storing and retrieving that information ignored what was in the field and simply set it equal to the number of slots allocated. This has now been fixed.

2. Modified the RMAPS framework so it updates the registry with the actual number of slots used by the mapping. Note that daemons are still NOT counted in this process as daemons are NOT mapped at this time. This will be fixed in 2.0, but will not be addressed in 1.x.

3. Added a new MCA parameter "rmaps_base_no_oversubscribe" that tells the system not to oversubscribe nodes even if the underlying environment permits it. The default is to oversubscribe if needed and the underlying environment permits it. I'm sure someone may argue "why would a user do that?", but it turns out that (looking ahead to dynamic resource reservations) sometimes users won't know how many nodes or slots they've been given in advance - this just allows them to say "hey, I'd rather not run if I didn't get enough".

4. Reorganizes the RMAPS framework to more easily support multiple components. A lot of the logic in the round_robin mapper was very valuable to any component - this has been moved to the base so others can take advantage of it.

5. Added a new test program "hello_nodename" - just does "hello_world" but also prints out the name of the node it is on.

6. Made the orte_ras_node_t object a full ORTE data type so it can more easily be copied, packed, etc. This proved helpful for the RMAPS code reorganization and might be of use elsewhere too.

This commit was SVN r10697.
2006-07-10 14:10:21 +00:00
Jeff Squyres
538965aeb0 Final merge of stuff from /tmp/tm-stuff tree (merged through
/tmp/tm-merge).  Validated by RHC.  Summary:

- Add --nolocal (and -nolocal) options to orterun
- Make some scalability improvements to the tm pls

This commit was SVN r10651.
2006-07-04 20:12:35 +00:00
Josh Hursey
58110f9fc9 Fixes Ticket #125 for both the trunk and v1.1 branch.
This commit will apply cleanly to the v1.1 branch, and should
be moved over once I get someone to verify it.

The problem is outlined in the bug. The fix was to move the
setting of the app context index (idx) before we put it in the
GPR so that it is propogated to the gpr.

The reason this hasn't bitten us before is because we init
app->idx to 0, which is true most of the time. Except that is
when MPI_Comm_spawn_multiple in which we put in more than 
one app context, thus care about correct indexing.

This was causing down the line memory corruption by overrunning
the mapping array. This commit also puts in a check to make 
sure that we error out if we ever try to do that again.

This commit was SVN r10380.
2006-06-15 22:14:07 +00:00
Brian Barrett
566a050c23 Next step in the project split, mainly source code re-arranging
- move files out of toplevel include/ and etc/, moving it into the
    sub-projects
  - rather than including config headers with <project>/include, 
    have them as <project>
  - require all headers to be included with a project prefix, with
    the exception of the config headers ({opal,orte,ompi}_config.h
    mpi.h, and mpif.h)

This commit was SVN r8985.
2006-02-12 01:33:29 +00:00
Ralph Castain
1abe8ef368 Well, it certainly helps triggers to fire if the respective responsible routines adjust the counters!
The INIT counter is supposed to be adjusted when the processes are mapped - this is now done correctly.

The LAUNCHED counter is supposed to be adjusted when the pls sets the process pid info into the registry and changes the state to LAUNCHED. This could probably be changed to have that function use the set_proc_soh API, but this fixes the problem for now.

Thanks to Brian for finding that the triggers were not being fired.

This commit was SVN r8948.
2006-02-09 15:39:06 +00:00
Ralph Castain
4b9f015c0b Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list.
This commit was SVN r8912.
2006-02-07 03:32:36 +00:00
George Bosilca
7d8d516a4a A bunch of fixed for Windows support.
- protection with __WINDOWS__ and not WIN32 or _WIN32
 - protect all the headers

This commit was SVN r8463.
2005-12-12 20:04:00 +00:00
Jeff Squyres
31336e4773 Add some missing headers / correct one installation directory
This commit was SVN r8408.
2005-12-08 04:00:52 +00:00
Brian Barrett
8faa1884f0 * The last of the build system optimizations. Combine the component and
component/base Makefile.am files, reducing the time configure spends
  stamping out Makefiles at the end
* Install base_impl.h file when devel-headers are being installed

This commit was SVN r8200.
2005-11-20 01:03:01 +00:00
Tim Woodall
7f20198d49 Filter the set of data returned to the daemons during
startup using the new get_conditional command to improve
scalability during launch

This commit was SVN r8097.
2005-11-10 16:44:51 +00:00
Jeff Squyres
42ec26e640 Update the copyright notices for IU and UTK.
This commit was SVN r7999.
2005-11-05 19:57:48 +00:00
Jeff Squyres
0379b27969 Add missing DESTRUCT
This commit was SVN r7948.
2005-11-01 13:35:44 +00:00
Tim Woodall
e27dfb180d yet another fix
This commit was SVN r7941.
2005-10-31 21:59:14 +00:00
Tim Woodall
aa5b61e4f1 corrections for multiple app contexts
This commit was SVN r7939.
2005-10-31 20:37:44 +00:00