1
1
Граф коммитов

874 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
4e50cdae52 This commit accomplishes two things:
1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs.

2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system.

This commit was SVN r12558.
2006-11-11 04:03:45 +00:00
Sven Stork
9ba4c4a7ee - Add support for "sh" handling. Instead of detecting of bash we now
check for bourne shell, because bourne shell is the smallest
  common divisor for bash/ksh/sh.
- Make some shell expressions sh compatible 

This commit was SVN r12509.
2006-11-09 10:16:45 +00:00
Ralph Castain
a3be8261fb Fix a bug that had us generate an error message and abort startup when there were stale universe directories around. Now, we just ignore them.
This commit was SVN r12472.
2006-11-07 21:34:57 +00:00
Ralph Castain
ea77beca29 cleanup the TM modules in prep for T-bird tests. The TM RAS will now report time required to resolve hostnames
This commit was SVN r12449.
2006-11-06 20:56:18 +00:00
Ralph Castain
884caeb2c7 Add timing tests for the TM ras
This commit was SVN r12445.
2006-11-06 18:41:22 +00:00
Galen Shipman
68d9922f44 enable/disable connection sleep in oob_tcp.c via mca param.. on by default..
This commit was SVN r12444.
2006-11-06 18:00:46 +00:00
Ralph Castain
194bdd413b Cleanup the problem of connecting to default universes.
This commit was SVN r12438.
2006-11-06 15:28:38 +00:00
Ralph Castain
d182ae7472 Clean up a few compiler warnings courtesy of Jeff
This commit was SVN r12430.
2006-11-03 20:45:22 +00:00
Ralph Castain
c77f6c605e Update timing reports:
1. Remove timing of xcast from mpi_init

2. Add timing report from oob_xcast on how long it took to send the message

This commit was SVN r12428.
2006-11-03 18:55:05 +00:00
Ralph Castain
761c8beeb7 Add output of the compound command size to the timing reports
This commit was SVN r12423.
2006-11-03 16:11:37 +00:00
Ralph Castain
60e27c77e7 Add some additional timing reporting:
1. Added reporting points around the xcasts in MPI_Init. Note that these times will include time spent waiting for a trigger to fire, which is why the times between stage gates did NOT include these times initially. The inter-stage-gate times still do NOT include the xcast time - the xcast time is reported separately.

2. Added the process vpid on the MPI_Init timing reports for clarity.

3. Added a report from the xcast function on the HNP that outputs the number of bytes in the message being sent to the processes.

This commit was SVN r12422.
2006-11-03 16:04:40 +00:00
Ralph Castain
2472f76d89 Make sure the print_map doesn't get called if we don't do the map step in rmgr.spawn
This commit was SVN r12404.
2006-11-02 05:52:24 +00:00
Ralph Castain
dc4de57253 Per request from the Eclipse team, change the attributes that control the "flow" through the rmgr.spawn procedure from "stop after this step" to "do this step". This allows them to use the spawn procedure to complete the launch one step at a time so they can let the user see the results after each step.
Shouldn't affect anyone else.

This commit was SVN r12403.
2006-11-02 05:15:24 +00:00
Ralph Castain
233dac8bba Enable the new attributes for remote operations such as comm_spawn
This commit was SVN r12382.
2006-10-31 23:52:25 +00:00
Ralph Castain
30de73a712 Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include:
1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch"

2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command.

3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job

4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map.

Enjoy! My personal favorite is the first one - helps with debugging.

This commit was SVN r12379.
2006-10-31 22:16:51 +00:00
Ralph Castain
17c71f8d2a Fix ticket #545
Setup subscriptions to correctly return the MPI_APPNUM attribute.

Fix an unreported bug that was found. The universe size was incorrectly defined in the attributes code. As coded, it looked for size_t values and based its size computation on those numbers. Unfortunately, the node_slots value had been changed to an orte_std_cntr_t awhile back! So the universe size was never updated.

Update the hello_nodename test to check for MPI_APPNUM.

Add a definition to ns_types for ORTE_PROC_MY_NAME - just a shortcut for orte_process_info.my_name. Brought over from ORTE 2.0 as it will be used extensively there.

This commit was SVN r12377.
2006-10-31 21:29:07 +00:00
Tim Prins
894b220fbb Fixes trac:487
Give a more intelligible error message when someone passes -nolocal and the only available node is the local node.

This commit was SVN r12325.

The following Trac tickets were found above:
  Ticket 487 --> https://svn.open-mpi.org/trac/ompi/ticket/487
2006-10-26 21:46:18 +00:00
Ralph Castain
36d4511143 Bring the timing instrumentation to the trunk.
If you want to look at our launch and MPI process startup times, you can do so with two MCA params:

OMPI_MCA_orte_timing: set it to anything non-zero and you will get the launch time for different steps in the job launch procedure. The degree of detail depends on the launch environment. rsh will provide you with the average, min, and max launch time for the daemons. SLURM block launches the daemon, so you only get the time to launch the daemons and the total time to launch the job. Ditto for bproc. TM looks more like rsh. Only those four environments are currently supported - anyone interested in extending this capability to other environs is welcome to do so. In all cases, you also get the time to setup the job for launch.

OMPI_MCA_ompi_timing: set it to anything non-zero and you will get the time for mpi_init to reach the compound registry command, the time to execute that command, the time to go from our stage1 barrier to the stage2 barrier, and the time to go from the stage2 barrier to the end of mpi_init. This will be output for each process, so you'll have to compile any statistics on your own. Note: if someone develops a nice parser to do so, it would be really appreciated if you could/would share!

This commit was SVN r12302.
2006-10-25 15:27:47 +00:00
Brian Barrett
d6ff14ed61 Hand-pack the connection information for each peer rather than just
packing a sockaddr_in, as there are some endianness and padding issues
with sending a sockaddr_in.  Note that the sin_port and sin_addr are
already in network byte order, which is why we pack them as a byte
string.

Refs trac:493

This commit was SVN r12301.

The following Trac tickets were found above:
  Ticket 493 --> https://svn.open-mpi.org/trac/ompi/ticket/493
2006-10-25 15:09:30 +00:00
Ralph Castain
601443e690 Fix the other end of the attribute store/retrieve to correctly handle attributes with no value
This commit was SVN r12290.
2006-10-25 01:41:58 +00:00
Ralph Castain
46166a9c77 Fix a bug that would cause a segfault when someone specified an option that had no corresponding value
This commit was SVN r12288.
2006-10-24 22:06:18 +00:00
Ralph Castain
a7acd22e47 Fix a fix so we see the correct MCA param
This commit was SVN r12282.
2006-10-24 17:24:09 +00:00
George Bosilca
ed3caa9fc7 mca_base_param_reg_int require a pointer to a component not a name.
This commit was SVN r12281.
2006-10-24 16:41:51 +00:00
George Bosilca
7dec7731ce Instead of size_t use orte_std_cntr_t. Remove all warnings.
This commit was SVN r12280.
2006-10-24 16:40:49 +00:00
Ralph Castain
8636ac6a4d Fix ticket 353 - print out a nice message that the combination of debug-daemons and num_concurrent in the pls rsh launcher will cause deadlock and exit
This commit was SVN r12279.
2006-10-24 15:59:02 +00:00
Tim Prins
cb622db7c9 Fixes trac:352
Only close off stdout/stderr from the daemons if we are not debugging the slurm pls and --debug-daemons was not passed.

This commit was SVN r12276.

The following Trac tickets were found above:
  Ticket 352 --> https://svn.open-mpi.org/trac/ompi/ticket/352
2006-10-24 13:05:13 +00:00
Tim Prins
93d61d01fb Fix for a problem on SLURM we have neen having since r12243 where mpirun would hang after the process had finished. It turns out that we were always reporting the name of the daemon wrong, but we simply never noticed as we never used it, until r12243. This makes it so we report the name of the daemon correctly.
This commit was SVN r12274.

The following SVN revision numbers were found above:
  r12243 --> open-mpi/ompi@153e38ffc9
2006-10-24 01:41:28 +00:00
Ralph Castain
7a77ef0ae3 Given the amount of pain singletons cause, one can't help but wonder if it REALLY was that much trouble for people to type "mpirun -n 1 foo"....sigh.
Get the ordering right so that a singleton can start.

Protect the rmgr copy app_context function from NULL fields

Tell the mapper it is okay for there not to be a pre-existing mapping plan for a parent when dynamically spawning processes

This commit was SVN r12257.
2006-10-23 15:15:45 +00:00
Ralph Castain
c5b59829aa Fix a long-lingering annoyance. Calling mpirun with a non-existent application would cause the system to hang on all environments. Reason was that the orted would exit, which it should never do without explicit orders to that affect.
This commit was SVN r12255.
2006-10-23 13:27:31 +00:00
George Bosilca
ee559e9947 Do not completely reset the orterun_globals. Keep the condition and the mutex,
but reset everything else. Once initialized the condition (and the attached
mutex) should be kept alive as long as possible if we want to be able to
retrieve all the informations.

This commit was SVN r12253.
2006-10-23 03:34:08 +00:00
George Bosilca
2a863df0a5 Newline is required by some compilers at the end of a file.
This commit was SVN r12244.
2006-10-21 05:56:04 +00:00
Ralph Castain
153e38ffc9 Lesson to be learned: if you send an ack to a recv'd command, better not send it to the same tag it came from - at least, not if there is a persistent recv on that tag!
Fix the persistent daemon problem where it was exiting when a job completed. Problem was that the persistent daemon would order the job daemons to exit. They would then send an 'ack' back to the persistent daemon - but the ack consisted of an echo of the "exit" command, which was recv'd by the wrong listener who treated it as a properly sent cmd....and exited.

This commit was SVN r12243.
2006-10-21 02:53:19 +00:00
Ralph Castain
ab7bbb80a5 Teach the mapper to correctly handle the unbalanced --host scenario. We now map in a more expected fashion.
This commit was SVN r12240.
2006-10-20 20:48:24 +00:00
George Bosilca
548b94e4e1 Add a missing ORTE_DECLSPEC.
This commit was SVN r12236.
2006-10-20 19:37:21 +00:00
Tim Prins
28bf4d85ab A couple of small fixes:
- It is possible to leave a byslot/bynode routine and have cur_node_item be NULL, so check for that.
- After we do an allocation where the user has provided a map (i.e. with --host), cur_node_item is pointing into the map list, not the global list. Change it to point into the global list.

This commit was SVN r12232.
2006-10-20 19:00:17 +00:00
Ralph Castain
955d11fa7b The bookmark now respects slot assignments a little better. It will not oversubscribe the first node, but will take only what is available there before moving on.
See the comment in orte/mca/rmaps/round_robin/rmaps_rr.c if you want the details... :-)

This commit was SVN r12230.
2006-10-20 18:24:14 +00:00
Ralph Castain
ec0bb9ffda Fix the bookmark system - we now have children being correctly spawned where they should!
Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come.

This commit was SVN r12229.
2006-10-20 18:05:16 +00:00
Ralph Castain
c07d4e2510 Cleaner rendition now extended to other environments. Remove MCA params for backend procs that can cause trouble. Specifically, any directives on the selection of components for RDS, RAS, RMAPS, PLS, and RMGR can be bad mojo on the backend.
This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue.

This commit was SVN r12225.
2006-10-20 16:50:13 +00:00
Ralph Castain
02efd07b60 Fix the MCA param passing issue, at least for rsh at the moment. I will clean this up and move it to the other environments once I shift back to a local computer.
This commit was SVN r12224.
2006-10-20 15:27:29 +00:00
Ralph Castain
b07a6b1d7a Fix a major typo that caused remote launch to crash - had something inside the wrong brace
This commit was SVN r12221.
2006-10-20 14:30:23 +00:00
Brian Barrett
37fad860b7 Grrr... Forgot that EXTRA_DIST and man_MANS are not set to include all the
possible things contained in the conditional like other rules are (for
example, a SOURCES rule in a conditional automatically has its files
added to the dist rules, even if that conditional isn't tru when
make dist occurs).  So the man files weren't in the tarball.

Put the EXTRA_DIST with the files explicitly listed outside any conditionals
so the man pages always end up in the tarball.

This commit was SVN r12220.
2006-10-20 14:15:38 +00:00
George Bosilca
2aa3e51223 Nothing relevant. Only a set of castings to have a clean compile on
Windows. The cl.exe compiler is pretty good at complaining about
any kind of non explicit cast.

This commit was SVN r12207.
2006-10-20 02:25:50 +00:00
George Bosilca
7982a23bde ORTE_DECLSPEC should be ...
This commit was SVN r12206.
2006-10-20 02:23:54 +00:00
George Bosilca
5a939e21b2 Populate the file with ORTE_DECLSPEC declarations.
This commit was SVN r12205.
2006-10-20 02:19:30 +00:00
Tim Prins
ade94b523b Fixed a number of issues related to resource allocation:
- Simplified the logic of the ras modules by moving the attribute handling into the base allocation function. This allows us to decide how to allocate based on the situation, and solves some of the allocation problems we were having with comm_spawn.
- moved the proxy component into the base. This was done because we always want to call the proxy functions if we are not on a HNP regardless of the attributes passed. 
- Got rid of the hostfile component. What little logic was in it was moved into the base to deal with other circumstances. The hostfile information is currently being propagated into the registry by the RDS, so we just use what is already in the registry.
- renamed some slurm function so that they have the proper prefix. Not strictly necessary as they were static, but it makes debugging much easier.
- fixed a buglet in the round_robin rmaps where we would return an error when really no error occured.

I tried to make proper corrections to all the ras modules, but I cannot test all of them.

This commit was SVN r12202.
2006-10-19 23:33:51 +00:00
Ralph Castain
ab196c3121 Okay, this fixes the problem of MCA params spreading too far. Sorry for the multiple corrections.
This commit was SVN r12201.
2006-10-19 22:51:02 +00:00
George Bosilca
c788f0dd51 Apparently, the MCA params are being loaded into the environment params
of the individual app_contexts as well. Clear them of "bad" ones.

This commit was SVN r12198.
2006-10-19 21:19:08 +00:00
Ralph Castain
d0d3d7fd41 Apparently, the MCA params are being loaded into the environment params of the individual app_contexts as well. Clear them of "bad" ones.
This commit was SVN r12197.
2006-10-19 20:49:35 +00:00
Ralph Castain
382f954fff Fix a bug in the way we saved and passed environments to child processes on remote nodes. The problem was that MCA directives for component selection were being passed back to the children. However, now that we only allow certain components to operate on HNPs, this caused the children to bomb out of orte_init.
This commit was SVN r12196.
2006-10-19 20:35:55 +00:00
Brian Barrett
204f5b8f52 - Clean up wrapper compiler man pages during maintainer-clean, since
they might require special tools (not sure if sed with multiple -e
    arguments is totally portable)
  - ignore the opalcc.1 man page.  Couldn't do this in the previous
    man page commit (r12192) because I was removing opalcc.1 in that
    commit.

This commit was SVN r12194.

The following SVN revision numbers were found above:
  r12192 --> open-mpi/ompi@581a4b0a4e
2006-10-19 20:14:40 +00:00