1
1
Граф коммитов

830 Коммитов

Автор SHA1 Сообщение Дата
Tim Prins
ade94b523b Fixed a number of issues related to resource allocation:
- Simplified the logic of the ras modules by moving the attribute handling into the base allocation function. This allows us to decide how to allocate based on the situation, and solves some of the allocation problems we were having with comm_spawn.
- moved the proxy component into the base. This was done because we always want to call the proxy functions if we are not on a HNP regardless of the attributes passed. 
- Got rid of the hostfile component. What little logic was in it was moved into the base to deal with other circumstances. The hostfile information is currently being propagated into the registry by the RDS, so we just use what is already in the registry.
- renamed some slurm function so that they have the proper prefix. Not strictly necessary as they were static, but it makes debugging much easier.
- fixed a buglet in the round_robin rmaps where we would return an error when really no error occured.

I tried to make proper corrections to all the ras modules, but I cannot test all of them.

This commit was SVN r12202.
2006-10-19 23:33:51 +00:00
Ralph Castain
ab196c3121 Okay, this fixes the problem of MCA params spreading too far. Sorry for the multiple corrections.
This commit was SVN r12201.
2006-10-19 22:51:02 +00:00
George Bosilca
c788f0dd51 Apparently, the MCA params are being loaded into the environment params
of the individual app_contexts as well. Clear them of "bad" ones.

This commit was SVN r12198.
2006-10-19 21:19:08 +00:00
Ralph Castain
d0d3d7fd41 Apparently, the MCA params are being loaded into the environment params of the individual app_contexts as well. Clear them of "bad" ones.
This commit was SVN r12197.
2006-10-19 20:49:35 +00:00
Ralph Castain
382f954fff Fix a bug in the way we saved and passed environments to child processes on remote nodes. The problem was that MCA directives for component selection were being passed back to the children. However, now that we only allow certain components to operate on HNPs, this caused the children to bomb out of orte_init.
This commit was SVN r12196.
2006-10-19 20:35:55 +00:00
Brian Barrett
204f5b8f52 - Clean up wrapper compiler man pages during maintainer-clean, since
they might require special tools (not sure if sed with multiple -e
    arguments is totally portable)
  - ignore the opalcc.1 man page.  Couldn't do this in the previous
    man page commit (r12192) because I was removing opalcc.1 in that
    commit.

This commit was SVN r12194.

The following SVN revision numbers were found above:
  r12192 --> open-mpi/ompi@581a4b0a4e
2006-10-19 20:14:40 +00:00
Ralph Castain
263f4379e8 Clean up an error in the mapper that caused "-hosts" to bomb.
Update the mapper so it correctly points to the next node to be used if we are mapping by slots. As it was, if we had an app_context that used only one slot on a node, the next app_context would start on the next node - leaving a blank slot in-between.

This commit was SVN r12193.
2006-10-19 18:57:29 +00:00
Brian Barrett
581a4b0a4e A few cleanups to the wrapper compiler build system / man pages:
- Only install opal{cc,c++} and orte{cc,c++} if configured with
     --with-devel-headers.  Right now, they are always installed, but 
    there are no header files installed for either project, so there's
    really not much way for a user to actually compile an OPAL / ORTE
    application.

  - Drop support for opalCC and orteCC.  It's a pain to setup all the 
    symlinks (indeed, they are currently done wrong for opalCC) and 
    there's no history like there is for mpiCC.

  - Change what is currently opalcc.1 to opal_wrapper.1 and add some
    macros that get sed'ed so that the man pages appear to be 
    customized for the given command.  

  - Install the wrapper data files even if we compiled with 
    --disable-binaries.  This is for the use case of doing multi-lib
    builds, where one word size will only have the library built, but 
    we need both set of wrapper data files to piece together to 
    activate the multi-lib support in the wrapper compilers.

This commit was SVN r12192.
2006-10-19 18:34:17 +00:00
Tim Prins
81d400ddfd break when they are equal, not not equal
This commit was SVN r12182.
2006-10-18 21:47:01 +00:00
Tim Prins
ab964d096a Need a terminating NULL...
This commit was SVN r12180.
2006-10-18 20:52:31 +00:00
Ralph Castain
d0eb7d7216 Complete the attribute management functions.
Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system.

Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn).

This commit was SVN r12179.
2006-10-18 20:02:16 +00:00
Ralph Castain
f4a458532b This doesn't totally resolve the comm_spawn problem, but it helps a little. I'll continue working on it and hope to resolve it completely shortly. The issue primarily centers on where to start mapping the child job's processes, and how to deal with oversubscription that might result. At the moment, I am trying to resolve the first issue first (hey, that even sounds right!).
This change does a couple of things:

1. Since the USE_PARENT_ALLOC attribute is a directive about regarding allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file.

2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface).

This commit was SVN r12164.
2006-10-18 14:01:44 +00:00
George Bosilca
50649dd6a9 What we write it's a long long so we should be using the long long format.
This commit was SVN r12157.
2006-10-18 00:02:20 +00:00
George Bosilca
1d69375dd7 Somehow, somewhere we have to initialize rc ...
This commit was SVN r12156.
2006-10-17 22:32:21 +00:00
Ralph Castain
0c0fe022ff This is a first cut at fixing the problem of comm_spawn children being mapped onto the same nodes as their parents. I am not convinced the behavior implemented here is the long-term right one, but hopefully it will help alleviate the situation for now.
In this implementation, we begin mapping on the first node that has at least one slot available as measured by the slots_inuse versus the soft limit. If none of the nodes meet that criterion, we just start at the beginning of the node list since we are oversubscribed anyway.

Note that we ignore this logic if the user specifies a mapping - then it's just "user beware".

The real root cause of the problem is that we don't adjust sched_yield as we add processes onto a node. Hence, the node becomes oversubscribed and performance goes into the toilet. What we REALLY need to do to solve the problem is:

(a) modify the PLS components so they reuse the existing daemons, 

(b) create a way to tell a running process to adjust its sched_yield, and

(c) modify the ODLS components to update the sched_yield on a process per the new method

Until we do that, we will continue to have this problem - all this fix (and any subsequent one that focuses solely on the mapper) does is hopefully make it happen less often.

This commit was SVN r12145.
2006-10-17 19:35:00 +00:00
Tim Prins
720eb88cad Make no-op function match new interface.
This commit was SVN r12142.
2006-10-17 17:34:06 +00:00
Tim Prins
8b0170148e Add some missing headers.
This commit was SVN r12141.
2006-10-17 17:28:02 +00:00
Tim Prins
5d31332f97 Goodbye poe, long live LoadLeveler...
This commit was SVN r12140.
2006-10-17 17:07:48 +00:00
Ralph Castain
13227e36ab This commit looks a lot bigger than it is, so relax :-)
Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off.

To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place.

I used those capabilities in two places:

1. Added an attribute list to the rmgr.spawn interface.

2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h).

So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms.

This commit was SVN r12138.
2006-10-17 16:06:17 +00:00
Rainer Keller
3f88937081 - Error logging is really not yet enabled.
- Correct the error log for orte_errmgr_base_select
 - Spelling fixes

This commit was SVN r12135.
2006-10-17 09:11:20 +00:00
Ralph Castain
16e52c5784 Fix a non-compliance issue regarding hostfiles as reported by Sun. The man page states that an entry that specifies slots_max but does not specify "slots" will have the soft limit default to the hard limit. The hostfile implementation, however, defaulted the soft limit to 1.
This fix changes that behavior to conform to the man page.

This commit was SVN r12129.
2006-10-17 00:43:12 +00:00
Brian Barrett
9adde4f7b8 Allow multilib capability based on compiler flags. See:
https://svn.open-mpi.org/trac/ompi/wiki/compilerwrapper3264
for more information.

Refs trac:374

This commit was SVN r12120.

The following Trac tickets were found above:
  Ticket 374 --> https://svn.open-mpi.org/trac/ompi/ticket/374
2006-10-15 21:21:08 +00:00
Ralph Castain
3f55d6897a Remove the memory debugging options. Fix what appears to be a typo in a help file.
This commit was SVN r12107.
2006-10-12 00:44:48 +00:00
Brian Barrett
fce5130333 Delay opening the listen socket until module init, so that we can have the
seed value have something set to true.  Allow selection of the listen
type to thread if (and only if) the process is the HNP...

This commit was SVN r12105.
2006-10-11 21:29:29 +00:00
Brian Barrett
29c91cf2f3 * Fix issue in odls_bproc where we were using vpid instead of the number of
processes launched locally for the stdio file names.  This was causing
    the expected files to not exist and bproc_vexecmove_io to fail.
  * Clean up a bunch of debugging output in the bproc pls

This commit was SVN r12102.
2006-10-11 20:34:12 +00:00
Ralph Castain
f91a95b3fe Fix the bug that caused mpirun to hang when a remote executable wasn't found using the rsh launcher. Will now test on a remote node
This commit was SVN r12095.
2006-10-11 18:43:13 +00:00
Ralph Castain
2da8245be0 Correctly propagate no-daemonize
This commit was SVN r12093.
2006-10-11 17:53:17 +00:00
Ralph Castain
e5877cc459 Add the proper valgrind params
This commit was SVN r12092.
2006-10-11 17:48:41 +00:00
George Bosilca
b56636c855 orte_pls belong to the PLS framework, therefore it should only be defined
in pls.h.

This commit was SVN r12089.
2006-10-11 17:12:22 +00:00
Ralph Castain
27e305347c Add a couple of options to orterun that support debugging of daemons for memory corruption.
Ensure that the environment provided to local application processes isn't "polluted" by the orteds

This commit was SVN r12087.
2006-10-11 15:18:57 +00:00
Brian Barrett
f5b8f1f2f0 Work around Automake not knowing how to properly configure libtool to build
Objective C libraries

Refs trac:483

This commit was SVN r12080.

The following Trac tickets were found above:
  Ticket 483 --> https://svn.open-mpi.org/trac/ompi/ticket/483
2006-10-10 20:14:26 +00:00
Ralph Castain
699ffcf359 Restore the "bynode" mapping functionality - accidentally deleted setting of parameter
This commit was SVN r12078.
2006-10-10 19:41:22 +00:00
George Bosilca
7dadc1832d Correctly export the required functions. They are defined in a private file, but they are completely public.
This commit was SVN r12070.
2006-10-10 04:54:51 +00:00
Ralph Castain
0e9dc590b7 Fix typo that didn't make it over from testing on vogon
This commit was SVN r12068.
2006-10-09 20:37:39 +00:00
Ralph Castain
cebdc51762 Remove a debugging output
This commit was SVN r12066.
2006-10-09 02:10:52 +00:00
Ralph Castain
e7f6fa22d6 Fix return code so that mpirun returns the right thing when an abort is encountered.
This commit was SVN r12065.
2006-10-09 01:04:00 +00:00
Ralph Castain
2e09128337 Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destuctor to fail!!
Restore the OBJ_RELEASE calls to cleanup map objects.

This commit was SVN r12064.
2006-10-07 22:44:00 +00:00
Ralph Castain
98dd57b70e Add a new option to launch "pernode" - launches one process/node across all available nodes.
The other options also work correctly: "-bynode" with no -np will launch on all *slots*, mapped on a per-node basis.

This commit was SVN r12063.
2006-10-07 19:50:12 +00:00
Jeff Squyres
efe28d62e9 Fix some compiler errors. I have *not* checked this for correctness;
but it does now compile.

This commit was SVN r12062.
2006-10-07 19:10:56 +00:00
Ralph Castain
889ddefe85 Remove release that caused totalview connection to bomb
This commit was SVN r12061.
2006-10-07 18:25:56 +00:00
Ralph Castain
ae79894bad Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).

Gridengine compiles but I cannot test (believe it likely will run).

Poe and xgrid compile to the extent they can without the proper include files.

This commit was SVN r12059.
2006-10-07 15:45:24 +00:00
Ralph Castain
82a023c731 Fix a typo that caused a segfault if a caller requested that we abort an array of procs (i.e., MPI_Abort when it specifies the other procs to be aborted).
Should hopefully address the recent problem seen with the BLACS AUX test as discussed on the user mailing list.

This commit was SVN r12055.
2006-10-07 01:58:11 +00:00
Ralph Castain
ee0df85ece Add a stupid, useless const since someone put it in the odls.h file.
This commit was SVN r12054.
2006-10-07 01:42:23 +00:00
George Bosilca
cda46efd2a Some missing DECLSPEC
This commit was SVN r12047.
2006-10-06 15:21:52 +00:00
Jeff Squyres
72cf2fe813 Oops: --noprefix should not take an argument.
This commit was SVN r12043.
2006-10-06 13:02:56 +00:00
George Bosilca
017af37291 Keep only the useful _DECLSPEC and _DECLSPEC some globals.
This commit was SVN r12042.
2006-10-06 07:24:34 +00:00
George Bosilca
b7579b09c7 Correctly handle the ORTE_DECLSPEC attribute.
This commit was SVN r12041.
2006-10-06 07:08:17 +00:00
George Bosilca
422ce1d3f8 If we are in the ODLS framework then the types should be called odls not pls.
This commit was SVN r12040.
2006-10-06 07:05:58 +00:00
George Bosilca
7fed79434e Windows is now able to create local processes.
This commit was SVN r12039.
2006-10-06 07:04:43 +00:00
George Bosilca
fa01b9b9aa Last step for the name reversion (from windows back to process).
This commit was SVN r12014.
2006-10-05 06:36:11 +00:00