Nathan Hjelm
b9959a95cd
ack! one more
...
This commit was SVN r26472.
2012-05-22 20:52:52 +00:00
Nathan Hjelm
f2d4e95429
doh! add missing include
...
This commit was SVN r26471.
2012-05-22 20:49:13 +00:00
Nathan Hjelm
cdc3c87ba6
move pmi init/finalize into a common component
...
This commit was SVN r26470.
2012-05-22 15:15:39 +00:00
Nathan Hjelm
78b8b3cf76
bug fix: actually close ess components
...
This commit was SVN r26469.
2012-05-22 15:09:18 +00:00
Nathan Hjelm
6eeca66475
add an option to enable static ports. diabled by default
...
This commit was SVN r26462.
2012-05-21 19:56:15 +00:00
Ralph Castain
83d69b6c95
Enable the ORTE progress thread for apps (not needed in the tools as they already continuously loop in the event lib). This appears to be working, at least for MPI apps that only use shared memory (a simple "hello"). More testing is required to identify where problems will occur - this is only intended to allow further development.
...
In order to use the progress thread, you must configure with:
--enable-orte-progress-threads --enable-event-thread-support
This commit was SVN r26457.
2012-05-20 15:14:43 +00:00
Ralph Castain
c4f8043064
Per Nathan, with a little cleanup by me: update the PMI support to aggregate modex info, thus reducing the number of keys required so it fits within Cray default constraints
...
This commit was SVN r26456.
2012-05-19 16:12:52 +00:00
Jeff Squyres
cab31eafce
Revert r26413: it was causing too much confusion. When an MPI proc
...
exits with status 77, the whole job will be killed, but mpirun will
still return an exit status of 77, so MTT will report it as a skip
anyway.
This commit was SVN r26445.
The following SVN revision numbers were found above:
r26413 --> open-mpi/ompi@02aa36f2e5
2012-05-16 14:45:58 +00:00
Jeff Squyres
02aa36f2e5
ORTE defaults to killing the entire job when any process exits with a
...
nonzero status (we polled other MPI implementations since one one in
the OMPI community had a concrete opinion on what behavior to do here
-- all other MPI's seem to adhere to this behavior, too).
This commit adds an MCA parameter that allows us to tell ORTE to
''not'' kill jobs when a process exits with a status of 77, meaning
the GNU testing standard of "this test was skipped". In all the OMPI
tests, all procs will either return 77 or not. So if they all return
77, mpirun won't consider it an error, but will still return an exit
status of 77 (so that MTT can know that the test was cleanly skipped).
This commit was SVN r26413.
2012-05-08 21:49:05 +00:00
Ralph Castain
70a106fa71
Fix binding on remote nodes - need to pass the binding bitmap!
...
This commit was SVN r26403.
2012-05-08 03:52:39 +00:00
Jeff Squyres
2ba10c37fe
Per RFC, bring in the following changes:
...
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
2012-05-07 14:52:54 +00:00
Ralph Castain
44b8608f0a
Convert debug to verbose
...
This commit was SVN r26384.
2012-05-05 17:46:10 +00:00
Ralph Castain
96bfeb591c
Ensure flag is passed to remote daemons
...
This commit was SVN r26383.
2012-05-03 22:31:25 +00:00
Ralph Castain
45fee2b491
Resolve the case where only the HNP is in the system (i.e., single-node operation)
...
This commit was SVN r26382.
2012-05-03 18:00:01 +00:00
Ralph Castain
c352ca36c2
Minor cleanup
...
This commit was SVN r26381.
2012-05-02 21:23:37 +00:00
Ralph Castain
b2f77bf08f
Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications.
...
Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well.
This commit was SVN r26380.
2012-05-02 21:00:22 +00:00
Ralph Castain
c5da4f24d7
Fix stupid singletons - get the pidmap message correct
...
This commit was SVN r26378.
2012-05-02 17:48:02 +00:00
Ralph Castain
9f724db182
Remove duplicate event assignment
...
This commit was SVN r26360.
2012-04-30 16:06:20 +00:00
Ralph Castain
289f9f41ec
From long-term discussions, have the daemons use the node_t and proc_t structs and arrays instead of the pidmap and nidmap arrays. Sets the stage for future work.
...
This commit was SVN r26359.
2012-04-29 00:10:01 +00:00
Ralph Castain
f3e3704c9e
Per request from Brian, enable mapping of stddiag output (output from opal_output calls) to stderr of the local process. This allows you to obtain that output in a local window (for example, when using xterm for each process) instead of having it automatically forwarded to mpirun. Turn this on automatically whenever someone uses the -xterm option, and to be set manually using the orte_map_stddiag_to_stderr mca param.
...
This commit was SVN r26352.
2012-04-27 14:39:34 +00:00
Jeff Squyres
46f47e08b6
Remove typo/extra brackets and parens.
...
This commit was SVN r26351.
2012-04-27 13:48:43 +00:00
Jeff Squyres
9d0df5a9a6
Update configury in the new oob ud component: actually check to see if
...
it succeeds and run $1 or $2, accordingly. This allows "make dist" to
run properly on machines that do not have OpenFabrics stuff installed
(e.g., the nightly tarball build machine).
There's still more to be done here -- it doesn't check for non-uniform
directories where the OpenFabrics headers/libraries might be
installed. We might need to re-tool/combine
ompi/config/ompi_check_openib.m4 (which checks for way more than
oob/ud needs) and move it up to config/ompi_check_ofa.m4, or
something...?
This commit was SVN r26350.
2012-04-27 11:32:56 +00:00
Jeff Squyres
9829d2279f
System-level includes should be at the top of the file, before most
...
OPAL/ORTE/OMPI includes.
This commit was SVN r26349.
2012-04-27 11:29:22 +00:00
Ralph Castain
38af7db183
Ensure the progress message comes out right away. Otherwise, on a large system where proc state messages are arriving frequently, the message doesn't get printed until the launch is done!
...
This commit was SVN r26346.
2012-04-26 23:41:03 +00:00
Nathan Hjelm
e1e0d466e5
Merge ssh://ct-fe1/usr/projects/hpctools/hjelmn/ompi-trunk-git into HEAD
...
This commit was SVN r26344.
2012-04-26 22:06:12 +00:00
Ralph Castain
3461809341
Fix reporting of launch progress so the numbers are correct and appear when they should
...
This commit was SVN r26342.
2012-04-26 00:10:09 +00:00
Ralph Castain
71805bf7e4
Clearout the startup_timeout event if the job did in fact start. Have ORTE_TERMINATE use the job state macro so debug will show where it was called
...
This commit was SVN r26334.
2012-04-25 01:05:17 +00:00
Jeff Squyres
708b497968
Ensure to unset the iof "active" flag after the libevent read callback
...
fires (it's already reset once we queue up the read event again). Failure
to unset the active flag would cause other logic to not queue up the
read event again, because it thought the read event was still active).
This commit was SVN r26311.
2012-04-23 15:58:12 +00:00
Ralph Castain
7999266f99
Silence warning by removing unused var
...
This commit was SVN r26275.
2012-04-17 22:34:48 +00:00
Ralph Castain
f68487016c
Add test code from Terry. Properly terminate if we don't abort on non-zero exit
...
This commit was SVN r26271.
2012-04-16 16:44:23 +00:00
Ralph Castain
ddfbde587f
Change the default to "abort" the job when any process exits with a non-zero status. Add the required code to ensure the orted tells the HNP about the problem.
...
This commit was SVN r26270.
2012-04-13 21:19:46 +00:00
Ralph Castain
7741ba47be
Fix comm_spawn that spans multiple nodes
...
This commit was SVN r26268.
2012-04-13 01:59:07 +00:00
Ralph Castain
4d16790836
Fix collectives for jobs running across partial allocations
...
This commit was SVN r26267.
2012-04-13 00:38:47 +00:00
Ralph Castain
5d14fa7546
Fix mpi_abort, minimize error output.
...
This commit was SVN r26266.
2012-04-11 14:37:08 +00:00
Ralph Castain
d3dfba3872
Fix the scenario where an MPI error handler causes a proc to exit after finalize, but with non-zero status to indicate an error occurred.
...
This commit was SVN r26265.
2012-04-11 02:23:46 +00:00
Ralph Castain
9cd4c06488
Get things to build and run when --disable-orte is specified
...
This commit was SVN r26263.
2012-04-10 21:50:01 +00:00
Ralph Castain
14d5525fb1
Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified.
...
This commit was SVN r26261.
2012-04-10 19:08:54 +00:00
Ralph Castain
53bbcf4b5b
Plug slot allocation leak
...
This commit was SVN r26260.
2012-04-10 14:56:24 +00:00
Ralph Castain
f5cd996b91
Fix the case where n=1
...
This commit was SVN r26258.
2012-04-09 22:44:56 +00:00
Ralph Castain
a34be856aa
Now that we have PMI support, this is no longer needed
...
This commit was SVN r26254.
2012-04-07 13:36:24 +00:00
Ralph Castain
71f9e69c62
Remove stale code
...
This commit was SVN r26253.
2012-04-07 13:34:12 +00:00
Ralph Castain
19630ca28d
Remove stale code
...
This commit was SVN r26252.
2012-04-07 13:33:40 +00:00
Ralph Castain
93bbeabc55
Remove stale code
...
This commit was SVN r26251.
2012-04-07 13:33:30 +00:00
Ralph Castain
b6cde9a8d1
Remove stale code
...
This commit was SVN r26250.
2012-04-07 13:33:18 +00:00
George Bosilca
319f76d66a
Low hanging fruit. Remove a declared but not defined function.
...
This commit was SVN r26245.
2012-04-06 15:43:28 +00:00
Ralph Castain
ed197acaa2
Eliminate stale code
...
This commit was SVN r26244.
2012-04-06 15:31:13 +00:00
Ralph Castain
bd8b4f7f1e
Sorry for mid-day commit, but I had promised on the call to do this upon my return.
...
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.
Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.
This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
ca3ff58c76
Ensure we get a non-zero exit status when we can't find the specified fork agent. Output a better error message, and ensure we don't multiply report the problem.
...
This commit was SVN r26191.
2012-03-24 00:49:38 +00:00
Ralph Castain
46b040c79f
Fix typo
...
This commit was SVN r26189.
2012-03-24 00:31:05 +00:00
Ralph Castain
2bd75ec7e3
Fix Cray XE builds - the priority here needs to equal that of the HNP component so that both build. Otherwise, mpirun tries to use PMI for its basis, and that doesn't work!
...
This commit was SVN r26188.
2012-03-23 20:06:34 +00:00
Ralph Castain
811413e9bc
Correctly handle multiple cpu-set ranges. Correctly support optional binding directives combined with cpu-set.
...
This commit was SVN r26187.
2012-03-23 14:50:41 +00:00
Ralph Castain
ce0caf7567
Support -cpu-set by binding to the specified cpus in the absence of any other binding directive. Allows users to subdivide nodes for multiple parallel mpirun invocations.
...
This commit was SVN r26186.
2012-03-23 14:05:52 +00:00
Ralph Castain
33ed3cda07
Update the gridengine allocator to support data from multiple queues by checking for duplicate node entries
...
This commit was SVN r26148.
2012-03-15 17:45:50 +00:00
Josh Hursey
4dd9f89a99
Create an MCA parameter (ess_base_stream_buffering) that allows the user to override the system default for buffering of stdout/stderr streams. See 'man setvbuf' for more information.
...
Note: I am working on a system that buffered all output until the application fishished due to a default of 'fully buffered.' This makes debugging painful. This switch fixed the problem by allowing me to adjust the buffering.
This commit was SVN r26119.
2012-03-08 22:02:28 +00:00
Ralph Castain
e71e871bae
Initialize sink location when stdin is forwarded to all ranks
...
This commit was SVN r26107.
2012-03-06 15:47:04 +00:00
Ralph Castain
366f9d1518
Add some missing localities to the hwloc pretty-print, fix pmi modex
...
This commit was SVN r26105.
2012-03-06 06:21:10 +00:00
Ralph Castain
834a86420b
Ensure we use the slurm module for slurm environments, and correct init order in pmi module when used by daemons
...
This commit was SVN r26089.
2012-03-02 23:10:48 +00:00
Ralph Castain
ceb34ed0c9
Fix typo
...
This commit was SVN r26079.
2012-03-02 09:58:09 +00:00
Ralph Castain
b2f1bade37
Fix the -H localhost issue
...
This commit was SVN r26071.
2012-02-29 16:56:00 +00:00
Jeff Squyres
81dc6a11ee
Fix typo in copyright notice, found by Paul Hargrove
...
This commit was SVN r26070.
2012-02-29 02:02:54 +00:00
Ralph Castain
a83da303c5
When using PMI, we know the ranks that share our node and their relative local/node ranks. Save that info in the pidmap array so that BTLs that require early knowledge of local ranks can access it.
...
This commit was SVN r25992.
2012-02-21 16:43:17 +00:00
Jeff Squyres
b6a90434e4
Fix some include file header ordering issues for some BSDs, suggested
...
by Paul Hargrove.
This commit was SVN r25984.
2012-02-21 13:32:14 +00:00
Jeff Squyres
b295a01d8e
Fix another configury error found by Paul Hargrove. Thanks, Paul!
...
This commit was SVN r25971.
2012-02-20 21:38:27 +00:00
Jeff Squyres
cdc783925e
(Re-)Add oob_tcp_if_(in|ex)clude functionality to allow CIDR notation,
...
just like the btl_tcp_if_(in|ex)clude MCA param.
This commit was SVN r25953.
2012-02-17 15:38:42 +00:00
Jeff Squyres
3e22450345
Fix the oob_tcp_verbose MCA param; make it actually apply to the OOB
...
TCP verbose handle (not the generic/0 handle).
This commit was SVN r25942.
2012-02-16 22:28:11 +00:00
Ralph Castain
b3aabf1565
Cleanup the --without-hwloc build. Thanks to Paul Hargrove for reporting it broken.
...
This commit was SVN r25931.
2012-02-15 11:08:57 +00:00
Ralph Castain
91977444af
Silence warnings
...
This commit was SVN r25929.
2012-02-15 03:42:27 +00:00
Ralph Castain
bba6508b4b
Handle the default hostfile case a little better...
...
This commit was SVN r25928.
2012-02-15 03:33:49 +00:00
Ralph Castain
f14c4be580
Correct the ordering logic so the list gets correctly built in daemon vpid order
...
This commit was SVN r25818.
2012-01-30 16:25:07 +00:00
Shiqing Fan
bfbd3c67a5
Add a windows file into the tarball.
...
This commit was SVN r25811.
2012-01-29 10:12:02 +00:00
Ralph Castain
a0edae52f2
Ensure the wrapper flags get entered in the right order, with -lpmi coming before the alps util libs
...
This commit was SVN r25809.
2012-01-27 20:56:21 +00:00
Ralph Castain
3f31feee6f
Handle the case where a user's rankfile specifies only cpus, and not socket:cpu pairs.
...
This commit was SVN r25803.
2012-01-27 12:21:45 +00:00
Ralph Castain
07f3a91075
Okay, get srun to play nice. Problem was that everything worked fine so long as the user did "salloc" with an argument requesting a specific number of nodes. However, if the user specified instead a number of processes, then we launched that number of daemons - resulting in multiple daemons/node. Not good.
...
So force things to behave correctly either way.
This commit was SVN r25792.
2012-01-26 19:58:57 +00:00
Ralph Castain
ef94e606c7
Add some debug
...
This commit was SVN r25791.
2012-01-26 19:23:32 +00:00
Ralph Castain
1449b27e9f
Ensure that slurm only launches one orted/node, regardless of how the allocation was obtained.
...
This commit was SVN r25790.
2012-01-26 19:23:15 +00:00
Jeff Squyres
64165ce758
r25775 removed the .windows from this directory, but left it in the
...
Makefile.am.
This commit was SVN r25782.
The following SVN revision numbers were found above:
r25775 --> open-mpi/ompi@2c9a4beffd
2012-01-26 10:45:06 +00:00
Jeff Squyres
3751495443
Add missing arguments for the new DYLD_LIBRARY_PATH stuff.
...
This commit was SVN r25780.
2012-01-26 00:35:48 +00:00
Ralph Castain
079e4d9156
Per George's comment, just duplicate the lib path envars to provide both Linux and Mac compatible values
...
This commit was SVN r25776.
2012-01-25 14:37:36 +00:00
Shiqing Fan
2c9a4beffd
Add and remove a few components for windows build.
...
This commit was SVN r25775.
2012-01-25 09:01:27 +00:00
Ralph Castain
8b115754e6
Fix typo
...
This commit was SVN r25763.
2012-01-21 23:50:39 +00:00
Ralph Castain
469e40ace2
Expand the coverage a little when looking at remote shells for rsh. Prior patch (r25758) works only if both ends of the rsh/ssh connection are Mac. What we really want is to use the Mac version of ld_library_path when the remote end is Mac, regardless of the OS where mpirun is executing. So add a test for system type to the remote_shell test, and set the ld_library_path name to match the remote system type.
...
This commit was SVN r25762.
The following SVN revision numbers were found above:
r25758 --> open-mpi/ompi@1afb77e603
2012-01-21 23:48:42 +00:00
Ralph Castain
1afb77e603
Mac requires setting DYLD_LIBRARY_PATH instead of the Linux standard LD_LIBRARY_PATH, so ensure we set that when using rsh to launch in Mac environments.
...
Thanks to Teng Lin for the patch!
This commit was SVN r25758.
2012-01-20 19:14:32 +00:00
Ralph Castain
be3dfb6a1a
Ensure that we only add -lpmi once to the wrapper compilers, no matter how many components might use it.
...
This commit was SVN r25753.
2012-01-20 04:56:38 +00:00
Ralph Castain
d7fe1615b6
Add missing dollar sign on variable
...
This commit was SVN r25745.
2012-01-19 20:45:22 +00:00
Ralph Castain
0d20f745e2
Remove stale function def
...
This commit was SVN r25744.
2012-01-19 20:40:48 +00:00
Nathan Hjelm
6d0e7a0a0e
don't enable ess/alps unless cnos is available
...
This commit was SVN r25743.
2012-01-19 19:36:00 +00:00
Ralph Castain
9d556e2f17
Allow daemons to use PMI to get their name where PMI support is available while using the standard grpcomm and other capabilities. Remove the GNI code from the alps ess component as that component should only be for alps/cnos installations.
...
This commit was SVN r25737.
2012-01-18 20:56:53 +00:00
Ralph Castain
6235a355de
Correctly handle co-spawning of daemons when attaching to a running job. We cannot use the general process mappers as we only want debugger daemons spawned on nodes where application procs already exist. So custom build the map for the debugger daemon job, and have the plm just launch that job without doing its usual vm-spawn step.
...
This commit was SVN r25736.
2012-01-18 00:19:49 +00:00
Nathan Hjelm
a2437feba7
removed debug message
...
This commit was SVN r25722.
2012-01-12 20:23:59 +00:00
Nathan Hjelm
5ab1674138
fixed de bruijn copyrights
...
This commit was SVN r25720.
2012-01-12 17:18:08 +00:00
Nathan Hjelm
c57f18999d
added Debruijn routed component
...
This commit was SVN r25717.
2012-01-12 17:11:03 +00:00
Ralph Castain
477582abef
Grrrr....fix ALL the cases where the membind warning occurs.
...
This commit was SVN r25715.
2012-01-11 23:51:18 +00:00
Ralph Castain
bf103de66c
My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress.
...
Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts.
This commit was SVN r25713.
2012-01-11 15:53:09 +00:00
Ralph Castain
167ad944c4
Surprise, surprise - hwloc treats memory binding as at the thread, not process, level. Thus, hwloc always sets the membind proc-level support flag to false, and indicates actual memory binding support via the thread-level flag. So...just to be safe, test -both- flags and issue the "no support" warning ONLY if both are false.
...
This commit was SVN r25709.
2012-01-11 01:12:57 +00:00
Shiqing Fan
e3dfc49ced
make correct use of the newly updated structures in the Windows module.
...
This commit was SVN r25699.
2012-01-09 11:08:34 +00:00
Ralph Castain
840841bb8f
Missed a couple
...
This commit was SVN r25686.
2011-12-29 23:30:19 +00:00
Ralph Castain
af7fb68cfb
If we forward envars in rsh, then we have to be very careful about both duplicate entries and disallowed characters on the cmd line. To aid with detecting duplicates, make all cmd line options be given in their mca variant. Check anything we might add for semi-colons and protect those values with quotes.
...
This commit was SVN r25685.
2011-12-29 23:25:25 +00:00
Jeff Squyres
a4c8bb27fa
Pull in the MPIR_Breakpoint symbol via a dummy function in
...
debuggers_base_fns.c: orte_debugger_base_pull_mpir_breakpoint().
This commit was SVN r25660.
2011-12-15 18:39:34 +00:00
Ralph Castain
2dd2694f25
Fix comm_spawn in oversubscribed conditions. IF oversubscription is allowed, let nodes flow into the mapper even if they are oversubscribed, constrained by the slots_max absolute ceiling. Cleanup error messages when comm_spawn fails so it correctly and succintly reports the ereror.
...
This commit was SVN r25659.
2011-12-15 18:04:48 +00:00
Ralph Castain
1adefcc176
When routing is not enabled, all routes must go direct
...
This commit was SVN r25656.
2011-12-15 15:32:09 +00:00
Ralph Castain
e683b2f9c7
Minor touchup - reset the pointer to the end of the list each time to ensure we get the nodes in correct daemon order
...
This commit was SVN r25651.
2011-12-14 22:16:52 +00:00
Ralph Castain
912abe8a6c
Catch one more use-case
...
This commit was SVN r25649.
2011-12-14 21:03:19 +00:00
Ralph Castain
f531b09a8d
Correctly handle -host and -hostfile options. Ensure the initial vm launch constrains itself to the union of specified hosts if those options are given. Get oversubscribe set correctly for that case.
...
This commit was SVN r25648.
2011-12-14 20:01:15 +00:00
George Bosilca
ac26f58bd7
I guess this wasn't yet ready for prime time.
...
This commit was SVN r25624.
2011-12-12 23:55:11 +00:00
Nathan Hjelm
885d5cbcf8
enable ptmalloc with using uGNI
...
This commit was SVN r25621.
2011-12-12 20:52:51 +00:00
Nathan Hjelm
be11acf727
bug fix. don't add node to allocated_nodes twice
...
This commit was SVN r25619.
2011-12-12 19:14:41 +00:00
Ralph Castain
3f1ae5d89b
No longer need this include
...
This commit was SVN r25606.
2011-12-09 00:40:07 +00:00
Ralph Castain
44094cd5b3
Remove compiler warning
...
This commit was SVN r25601.
2011-12-08 16:35:41 +00:00
Samuel Gutierrez
0a922dcb3e
fixes XE6 build.
...
This commit was SVN r25600.
2011-12-08 16:13:58 +00:00
Samuel Gutierrez
0588e9ba36
add Cray XK6 support to ras alps. the configuration file is a different format and is in a different place.
...
This commit was SVN r25599.
2011-12-08 14:05:02 +00:00
Ralph Castain
7180ad40ad
Fix a copule of minor buglets
...
This commit was SVN r25589.
2011-12-07 21:08:35 +00:00
Ralph Castain
3e7ab1212a
Since this has come up a number of times, have the rsh launcher add MCA params from the environment by default. If it finds that the cmd line is too long, error out with a message directing the user to set a param to ignore the environmental MCA params.
...
This commit was SVN r25581.
2011-12-07 01:24:36 +00:00
Ralph Castain
7510339725
Remove stale orte_vm_launch param. Add a param that allows users to specify envars to forward/set so they can do it in the MCA param file instead of only via mpirun cmd line.
...
This commit was SVN r25580.
2011-12-06 21:31:22 +00:00
Ralph Castain
15facc4ba6
Fix comm_spawn yet again...add another test
...
This commit was SVN r25579.
2011-12-06 20:15:40 +00:00
Ralph Castain
90b7f2a7bf
The rest of the multi app_context fix. Remove the restriction on number of app_contexts that can have zero np specified as multiple mappers now support that use-case. Update the ranking algorithms to respect and track bookmarks. Ensure we properly set the oversubscribed flag on a per-node basis.
...
This commit was SVN r25578.
2011-12-06 17:28:29 +00:00
Ralph Castain
d9c7764e9b
Remove some debug
...
This commit was SVN r25575.
2011-12-05 22:04:50 +00:00
Ralph Castain
df2f594aa8
Some cleanup associated with multiple app_contexts. Ensure nodes only get entered once into the map. Correctly handle bookmarks. Cleanup tracking of slots_inuse and correct detection of oversubscription.
...
Still need to resolve the ranking issue so it starts at the bookmark, but that will come next.
This commit was SVN r25574.
2011-12-05 22:01:08 +00:00
Abhishek Kulkarni
0b7c51fae2
Correct an invalid reference to a missing help file.
...
This commit was SVN r25573.
2011-12-05 21:29:07 +00:00
Josh Hursey
b5ac320826
* If not able to checkpoint at this time (say because we are already checkpointing or restarting) then make sure to re-set the listener so that we can checkpoint later.
...
* Work around duplicate node names in the map. It should not happen normally, but if the rmaps component gets this wrong provide a work around. Ralph is working on a rmaps fix for this, so we will likely remove/comment out the fix later.
This commit was SVN r25572.
2011-12-05 19:29:26 +00:00
Josh Hursey
cc57840b53
Fix ess/tool so that it does not segv when using the rsh PLM. Just have it use the base function directly to avoid similar problems with finalizing other components.
...
This commit was SVN r25571.
2011-12-05 15:40:46 +00:00
Ralph Castain
6cbd8fa6c9
Keep everyone in sync with new job state
...
This commit was SVN r25563.
2011-12-02 14:12:40 +00:00
Ralph Castain
07655e2945
Handle the case where the allocator "fibs" to us about the node names. In some cases (ahem...you know who you are!), the allocator will tell us a node number (e.g., "16"). However, the daemon will return a node name (e.g., "nid0016") - leaving us not recognizing its location.
...
So provide a new parameter (can't have too many!) that handles this situation by stripping the prefix from the returned node name. Also do a little cleanup to ensure we cleanly exit from errors, without generating too many annoying messages.
This commit was SVN r25562.
2011-12-02 14:10:08 +00:00
Jeff Squyres
ecf6ba910c
Silence a few icc warnings and about mixing enums with other types.
...
This commit was SVN r25560.
2011-12-02 13:18:54 +00:00
Ralph Castain
641e17f26c
A better way of handling fqdn allocations. Prior method was wrong as it equated "node1" with "node10", which definitely caused problems.
...
Detect the addition of fqdn nodes in the allocation. If not found, then strip all incoming hostnames from daemons of any domain info when matching those names against the names in the node pool.
Leave some protection and "live" diagnostic output in place so we can continue to detect problems across all environments.
This commit was SVN r25557.
2011-12-01 14:24:43 +00:00
Ralph Castain
512aea79bc
Print the right nodename value, fix the strange case
...
This commit was SVN r25556.
2011-12-01 02:31:56 +00:00
Ralph Castain
44394c6b34
Add a little more protection
...
This commit was SVN r25555.
2011-12-01 00:30:56 +00:00
Ralph Castain
c4ea7a252a
Add a little protection against badly formed node names so we don't segfault if they are encountered
...
This commit was SVN r25554.
2011-11-30 23:33:59 +00:00
Ralph Castain
fa9e99454a
Don't divide by cpus-per-task - we'll deal with that at binding time.
...
This commit was SVN r25552.
2011-11-30 21:35:25 +00:00
Ralph Castain
c56acf60ca
Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order.
...
Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use.
Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact.
Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code.
This commit was SVN r25551.
2011-11-30 19:58:24 +00:00
George Bosilca
7a238933b6
Silence a compiler warning.
...
This commit was SVN r25543.
2011-11-29 20:53:08 +00:00
Terry Dontje
b1bb339d23
fix r25507 rationalization of rsh support by removing include of plm_base_rsh_support.h from tm module
...
This commit was SVN r25519.
The following SVN revision numbers were found above:
r25507 --> open-mpi/ompi@b475421c16
2011-11-29 11:49:41 +00:00
Ralph Castain
0d55a3d739
Missed one spot...
...
This commit was SVN r25517.
2011-11-28 22:30:53 +00:00
Ralph Castain
237c79b6d7
Fix daemon collectives - missed the one spot where returning orte_routed_tree_t was required. Sigh. Change the routed components to return that type on the list of children when get_routing_tree is called.
...
This commit was SVN r25516.
2011-11-28 22:24:49 +00:00
Ralph Castain
89e5bd27a2
Fix copyright date
...
This commit was SVN r25512.
2011-11-28 15:54:04 +00:00
Ralph Castain
70ab8422b1
Per the internal comments, the delay between ssh invocations is not there for debugging purposes, but rather to allow for NIS authetication times. We have seen that problem in the past, so don't just do the delay when we are debugging - use the delay for the intended purpose. Also, allow for shorter than second-level delays as it doesn't always have to be so long.
...
This commit was SVN r25510.
2011-11-27 01:49:42 +00:00
Ralph Castain
b173316b74
Dont induce a delay between spawns unless specifically asked to do so
...
This commit was SVN r25509.
2011-11-26 16:50:31 +00:00
Ralph Castain
b475421c16
As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn.
...
This commit was SVN r25507.
2011-11-26 02:33:05 +00:00
Ralph Castain
9b59d8de6f
This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build.
...
Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations.
Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way.
This commit was SVN r25497.
2011-11-22 21:24:35 +00:00
George Bosilca
1000af1c48
No need to abort there, returning an error trigger the
...
abort at the upper level.
This commit was SVN r25494.
2011-11-18 19:07:26 +00:00
Ralph Castain
866edf6a89
Now that George has found his problem, we no longer need the bozo check. Interesting how these platform-specific issues surface...
...
This commit was SVN r25493.
2011-11-18 17:43:14 +00:00
George Bosilca
b613c7eacb
Fix the issue with the round robin mapper. When mixing
...
different precisions, one should manually promote the
participants to the expected type. In this particular
example as opal_list_get_size returns an unsigned long,
the computation on the left side is translated to an
unsigned. If the hostfile contains more nodes that what
required (via the -np), this leads to a gigantic value
for the balance, and breaks the round robin algorithm.
This commit was SVN r25492.
2011-11-18 17:03:35 +00:00
Ralph Castain
1e5e9bde77
Add protection against a bozo case where we could end up in an infinite loop while calculating ranks
...
This commit was SVN r25491.
2011-11-18 15:35:55 +00:00
George Bosilca
61f273b987
Do not tolerate uninitialized variables.
...
This commit was SVN r25489.
2011-11-18 10:19:24 +00:00
Ralph Castain
b34acd0476
Grrr....get the correct number too!
...
This commit was SVN r25478.
2011-11-15 11:11:47 +00:00
Ralph Castain
593fc388a9
Make it a little more obvious as to which nodes are from each topology by labeling them with a letter.
...
This commit was SVN r25477.
2011-11-15 11:10:39 +00:00
Ralph Castain
6310361532
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
...
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admin's of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 03:40:11 +00:00
Ralph Castain
c8e105bd8c
Remove stale code
...
This commit was SVN r25475.
2011-11-14 23:39:23 +00:00
Ralph Castain
793f4c688f
Extend capability to support heterogeneous clusters with multiple topologies
...
This commit was SVN r25474.
2011-11-13 23:23:09 +00:00
Ralph Castain
6b5e1b89cf
Turn off tree spawn as it doesn't currently work - will fix shortly. Add topology collection
...
This commit was SVN r25472.
2011-11-11 23:42:36 +00:00
Ralph Castain
d008aeb531
Silence debug
...
This commit was SVN r25471.
2011-11-11 16:42:45 +00:00
George Bosilca
3d318a4c26
Put the interface of our MPIR support in sync with the document accepted by the MPI
...
Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf ).
This commit was SVN r25456.
2011-11-08 01:24:16 +00:00
George Bosilca
85a18dab74
MPIR_partial_attach_ok is not a volatile, but a constant.
...
This commit was SVN r25455.
2011-11-08 01:00:38 +00:00
Ralph Castain
a3ce355a60
Revert r25453 and r25450 until we can fix the libevent2013 configure code - still not getting the includedir to eval correctly.
...
This commit was SVN r25454.
The following SVN revision numbers were found above:
r25450 --> open-mpi/ompi@7f7d5c4f1f
r25453 --> open-mpi/ompi@c9fe8c32e2
2011-11-07 16:23:44 +00:00
Samuel Gutierrez
e03bc93fb7
only use pmi grpcomm and pubsub during the direct launch case. use PMI environment variable to setup vpid in ess alps on cray xe systems. add pmi test code.
...
This commit was SVN r25447.
2011-11-06 17:28:40 +00:00
Ralph Castain
fcee46b063
Add an option for printing a diffable process map for testing mappers
...
This commit was SVN r25428.
2011-11-03 14:22:07 +00:00
Samuel Gutierrez
3fe7b3ee54
add PMI support to ess alps module. xt system guys: please yell at me if i missed something in cnos.
...
This commit was SVN r25423.
2011-11-03 04:04:32 +00:00
Samuel Gutierrez
27b9bcfafd
update ess alps configuration file to include CNOS and PMI checks. some of the features committed here aren't being used, but they will be. also update orte_check_pmi.m4 to include missing call to action-if-not-found if --with-pmi is not specified or is disabled.
...
This commit was SVN r25422.
2011-11-03 02:14:47 +00:00
Jeff Squyres
7f6f7bd0eb
Remove this component; twitter long ago switched to the oauth
...
authentication, and no one has ever updated this component to match.
It can be revived out of history if anyone cares.
This commit was SVN r25421.
2011-11-02 21:04:49 +00:00
Ralph Castain
b2e2d24726
As in the rsh module, report failed daemons to the errmgr for proper cleanup
...
This commit was SVN r25419.
2011-11-02 18:30:22 +00:00
Ralph Castain
3e4165fd8d
Cleanup includes
...
This commit was SVN r25418.
2011-11-02 18:28:28 +00:00
Ralph Castain
b77552c45d
Cleanup some include files, return a silent error in open/select as the complaining component already output a message
...
This commit was SVN r25416.
2011-11-02 17:42:06 +00:00
Ralph Castain
55b996678e
Minor indentation changes
...
This commit was SVN r25414.
2011-11-02 15:56:56 +00:00
Ralph Castain
f00753881e
Handle the case where mpirun -is- of the same topology as the compute nodes.
...
This commit was SVN r25412.
2011-11-01 22:26:03 +00:00
Ralph Castain
d28dd55d33
Minimize the amount of topology info returned by the daemons. Most clusters, especially at scale, use the same node topology on every node, so there is no re
...
ason to return the topology from every daemon. Borrow a page from the --hetero-apps page and let users indicate that the node topology differs by adding a --
hetero-nodes option to mpirun. If the option is set, then every daemon returns topology info. If not set, then only daemon vpid=1 returns it.
We always want one daemon to return the topology as the head node is often different from the compute nodes. Having one daemon return the compute node topolo
gy allows us to detect any such difference. All compute nodes are then set to the same topology.
This commit was SVN r25408.
2011-11-01 18:43:10 +00:00
Ralph Castain
14966e0f8f
Cleanup PMI startup - if a component isn't selected, it should finalize PMI IFF it started it. Otherwise, components that aren't selected can finalize PMI when it is in use by other parts of the system.
...
This commit was SVN r25407.
2011-11-01 16:25:12 +00:00
Ralph Castain
d492b20975
Bozo check for topology info
...
This commit was SVN r25398.
2011-10-30 11:49:38 +00:00
Ralph Castain
4232115a98
Ensure pruning remains within the current job/app being mapped.
...
This commit was SVN r25397.
2011-10-30 00:02:20 +00:00
Ralph Castain
648c85b41b
Add a simple pattern mapper as an example of how to use the topology info to create desired mappings. Let the user specify a pattern based on resource types, and map that pattern across all available nodes as resources permit.
...
Don't automatically display the topology for each node when --display-devel-map is set as it can overwhelm the reader. Use a separate flag --display-topo to get it.
This commit was SVN r25396.
2011-10-29 15:12:45 +00:00
Ralph Castain
12a589130a
Add some debug
...
This commit was SVN r25395.
2011-10-29 15:07:58 +00:00
Ralph Castain
965b04d1a5
Use the new utilities to get a topology that reflects available cpus
...
This commit was SVN r25394.
2011-10-29 15:07:36 +00:00
Ralph Castain
e50bcbf028
Add the ability to specify a topology-containing xml file to describe the simulated nodes to support mapping tests against arbitrary topologies
...
This commit was SVN r25388.
2011-10-29 02:01:11 +00:00
Ralph Castain
7fa5f82d70
Add simulator component to support testing of large scale mapping methods. Automatically sets do-not-resolve and do-not-launch, and creates however many nodes the user wants to simulate in the system.
...
This commit was SVN r25386.
2011-10-28 23:48:53 +00:00
Ralph Castain
e2eb8d5f78
Remove bad param registration - that param was already registered as an int_name in another location.
...
This commit was SVN r25381.
2011-10-28 19:14:43 +00:00
Josh Hursey
6726590b1c
Remove the 'ess_node_rank' accessor from here. This caused running under 'tm' to segv at the orteds.
...
It just looks like this part of the component was not updated during r25331. It was removed from the 'env' and 'slurm' environments in that patch. It looks like 'tm' was updated, but did not get this particular piece.
This commit was SVN r25380.
The following SVN revision numbers were found above:
r25331 --> open-mpi/ompi@b44f8d4b28
2011-10-28 17:41:35 +00:00
Josh Hursey
59ff1dbbfb
Fix indentation problem that caused a segv when running without regex.
...
This was introduced in r25063.
This commit was SVN r25379.
The following SVN revision numbers were found above:
r25063 --> open-mpi/ompi@e58623cd5b
2011-10-28 13:39:32 +00:00
Samuel Gutierrez
922e41a318
fix typo. use PMI_Initialized for init status instead of PMI_Init.
...
This commit was SVN r25377.
2011-10-27 22:27:30 +00:00
Ralph Castain
951d72692c
Reverse the #if direction so we report daemon failure to the errmgr - otherwise, we just hang if a daemon fails to start.
...
Reviewed with Josh.
This commit was SVN r25366.
2011-10-25 19:09:52 +00:00
Ralph Castain
c55cba55a7
Totally trivial spelling fix
...
This commit was SVN r25361.
2011-10-24 14:06:33 +00:00
Ralph Castain
955d8e7d46
Allow apps to use pmi when launched by mpirun, if desired, without affecting daemons
...
This commit was SVN r25359.
2011-10-23 15:57:13 +00:00
Nathan Hjelm
e8af0d8589
don't use alps paffinity
...
This commit was SVN r25358.
2011-10-21 22:52:03 +00:00
Abhishek Kulkarni
46952e9008
Fix C/R functionality in trunk. Intra-node checkpointing of a job now works as expected.
...
Signed-off-by: Abhishek Kulkarni <adkulkar@osl.iu.edu>
This commit was SVN r25357.
2011-10-21 22:07:35 +00:00
Nathan Hjelm
7b1172b346
need a terminating character in the decoded string
...
This commit was SVN r25355.
2011-10-21 16:46:28 +00:00
Nathan Hjelm
cd257ac707
fixed typo in pmi grpcomm
...
This commit was SVN r25353.
2011-10-21 16:28:36 +00:00
Shiqing Fan
5711414eb7
Fix Windows build
...
This commit was SVN r25351.
2011-10-21 14:46:58 +00:00
Ralph Castain
53ef085567
Fix a minor issue seen by Jeff in specific failure pathway
...
This commit was SVN r25350.
2011-10-21 14:44:48 +00:00
Ralph Castain
3e72fccacf
Cray's PMI implementation is quite different from slurm's - they extended PMI-1 by adding some, but not all, of the PMI-2 APIs. So you can't just switch to using PMI-2 functions as it isn't a complete implementation. Instead, you have to selectively figure out which ones they have in PMI-2, and use any missing ones from PMI-1. What fun.
...
Modify the configure logic and the PMI components to accommodate Cray's approach. Refactor the PMI error reporting code so it resides in only one place. Cray actually decided -not- to define the PMI-2 error codes, so we have to use the PMI-1 codes instead. More fun.
This commit was SVN r25348.
2011-10-21 04:54:38 +00:00
Nathan Hjelm
beb8d8ce32
pmi return code wtf
...
This commit was SVN r25336.
2011-10-20 17:51:24 +00:00
Ralph Castain
84713d5a84
Fix singletons again - must have been broken for a very long time, which only shows how little anyone cares about this capability.
...
This commit was SVN r25332.
2011-10-19 20:19:08 +00:00
Ralph Castain
b44f8d4b28
Complete implementation of the ess.proc_get_locality API. Up to this point, the API was only capable of telling if the specified proc was sharing a node with you. However, the returned value was capable of telling you much more detailed info - e.g., if the proc shares a socket, a cache, or numa node. We just didn't have the data to provide that detail.
...
Use hwloc to obtain the cpuset for each process during mpi_init, and share that info in the modex. As it arrives, use a new opal_hwloc_base utility function to parse the value against the local proc's cpuset and determine where they overlap. Cache the value in the pmap object as it may be referenced multiple times.
Thus, the return value from orte_ess.proc_get_locality is a 16-bit bitmask that describes the resources being shared with you. This bitmask can be tested using the macros in opal/mca/paffinity/paffinity.h
Locality is available for all procs, whether launched via mpirun or directly with an external launcher such as slurm or aprun.
This commit was SVN r25331.
2011-10-19 20:18:14 +00:00
Ralph Castain
2958f3de34
Add some clarifying comments and a small efficiency improvement
...
This commit was SVN r25322.
2011-10-18 18:30:43 +00:00
Ralph Castain
b771114086
Fix the fix :-)
...
If the errmgr is going to try and hold the orted until all routes and children are gone, then the exit cmd must do the same. Otherwise, the orted exits immediately without waiting for routes to be dismantled, which is why we don't see the connections close.
Also cleanup some diagnostics and add some debug to more clearly see what's going on.
This commit was SVN r25321.
2011-10-18 17:56:37 +00:00
Ralph Castain
ae8e556d14
Okay, once again let's fix the vpid calculator. Identified problem with prior commit (some rmaps components already place their procs in the jdata->procs array, and others don't), so account for those variations.
...
This commit was SVN r25315.
2011-10-18 15:50:11 +00:00
George Bosilca
749b63c09d
Provide a generic fix for the termination issue instead of r25248. The
...
termination condition is to be checked at the daemon/HNP level not down
in the routing.
This commit was SVN r25313.
The following SVN revision numbers were found above:
r25248 --> open-mpi/ompi@b42ccc89b8
2011-10-18 03:07:37 +00:00
George Bosilca
f28890fbb7
Revert r25302 as it break the --bynode option.
...
This commit was SVN r25311.
The following SVN revision numbers were found above:
r25302 --> open-mpi/ompi@d7a8553179
2011-10-18 02:48:17 +00:00
Ralph Castain
2fdd9c6dea
Ensure mpirun doesn't pick this component
...
This commit was SVN r25307.
2011-10-17 22:28:28 +00:00
Ralph Castain
8f0ef54130
Complete implementation of pmi support. Ensure we support both mpirun and direct launch within same configuration to avoid requiring separate builds. Add support for generic pmi, not just under slurm. Add publish/subscribe support, although slurm's pmi implementation will just return an error as it hasn't been done yet.
...
This commit was SVN r25303.
2011-10-17 20:51:22 +00:00
Ralph Castain
d7a8553179
Fix the mapping algo for computing vpids - it was borked for bynode operations when using nperxxx directives
...
This commit was SVN r25302.
2011-10-17 19:49:04 +00:00
Ralph Castain
89a20de474
Remove unused includes. Ensure that the error log is at least always available as we otherwise segfault when reporting errors that occur prior to opening the errmgr framework
...
This commit was SVN r25288.
2011-10-14 18:45:11 +00:00
Ralph Castain
07dbbc6513
Sorry for mid-day correction - but folks are trying to test this, and we didn't realize it was still ignored :-(
...
This commit was SVN r25287.
2011-10-14 16:19:20 +00:00
Ralph Castain
054c485dcf
Cleanup a race condition and an unreliable method that caused us to not properly handle procs that trapped sigterm for cleanup purposes while ORTE was trying to kill them. Thanks to Rick Payne and Ian Wells of Cisco for spending weeks chasing this down.
...
Fix a termination issue that caused procs local to mpirun to not be killed if they weren't calling into the library. Thanks to Terry Dontje for spending countless hours chasing his tail on this one! :-(
This commit was SVN r25285.
2011-10-14 15:39:54 +00:00
Ralph Castain
08fa9e1c6a
Correct include path
...
This commit was SVN r25282.
2011-10-13 23:46:52 +00:00
Ralph Castain
b96ef2161d
Complete the PMI support. Generalize PMI operations to support both slurm and non-slurm environments. Correct some configuration issues - we really only want the PMI integration at the individual component level. Ensure that the pmi grpcomm component doesn't get selected when launching via mpirun by setting its priority below the bad component.
...
Only verified in a slurm environment as that's all I have access to...
This commit was SVN r25275.
2011-10-12 20:59:25 +00:00
Ralph Castain
634f83fc52
Fix the routed components. All had errors, some completely broken. You cannot test
...
0 == ORTE_EPOCH_CMP(target->epoch,ORTE_EPOCH_INVALID)
when epoch is not configured as this will always return true. This caused get_route to return an error in all non-binomial routed modules, and caused all components to return an error when delete_route was called.
So protect the checks with ORTE_ENABLE_EPOCH so we get the correct behavior.
This commit was SVN r25274.
2011-10-12 20:18:57 +00:00
Ralph Castain
24a46f2acb
These were missed by prior commit - need to remove lingering references to OPAL_HWLOC_HAVE_XML
...
This commit was SVN r25272.
2011-10-12 16:54:03 +00:00
George Bosilca
872d377021
Tell what the update status is.
...
This commit was SVN r25259.
2011-10-11 19:49:12 +00:00
Brian Barrett
98e98ce2c5
* opal_atomic_trylock is documented to return 0 if the lock was acquired,
...
1 otherwise. It was doing the opposite, so this patch fixes the
return values. All uses (all in ORTE) used the actual return values,
not the documented values, so fix them as well.
This commit was SVN r25257.
2011-10-11 18:43:45 +00:00
Ralph Castain
2f38ff5e54
Ensure we don't try to build this module unless pmi is specifically requested
...
This commit was SVN r25252.
2011-10-11 06:12:04 +00:00
Ralph Castain
baefdabd98
Add some debug. Now confirmed to work correctly (prior problem was with odin tcp connection, not code).
...
This commit was SVN r25249.
2011-10-11 02:15:17 +00:00
Ralph Castain
b42ccc89b8
Although this didn't solve the earlier termination problem, the code will be required once we get connection terminations properly detected. If a daemon (or HNP) is trying to terminate, then we need to check for termination conditions whenever a route is lost - when all child connections are gone, then we are free to finalize.
...
This commit was SVN r25248.
2011-10-10 21:41:49 +00:00
Ralph Castain
1aa1c2e9b4
Get the slurm pmi support working. Cannot use infiniband, of course, as the oob can't make the connection - may try other existing methods. Modex may not quite be working right yet
...
as odin was having trouble making TCP connections, but at least the configure now works so things build, so save that for now
This commit was SVN r25247.
2011-10-10 21:39:10 +00:00
Swen Boehm
08b4322a1a
patched the lex files to not issue the following compiler warning:
...
'yyunput' defined but not used
This commit was SVN r25246.
2011-10-10 18:13:04 +00:00
George Bosilca
649af6c925
Enumerated mixed with another type (int) is tolerated but
...
easily fixable.
This commit was SVN r25241.
2011-10-09 03:54:52 +00:00
Terry Dontje
c6691b4122
clean up local procs when abort or abort signal happens
...
This commit was SVN r25237.
2011-10-06 19:19:55 +00:00
George Bosilca
c6d6c9aece
Remove some #if by using the correct macro (aka. ORTE_EPOCH_CMP).
...
This commit was SVN r25229.
2011-10-04 14:42:40 +00:00
Samuel Gutierrez
25cbf79592
modifications to ras alps. this commit allows users to mpirun without having to set id environment variables (BASIL_RESERVATION_ID, OMPI_ALPS_RESID). note, however, that we preserved the old behavior. if an id environment variable is set, it will be obeyed and our new code path is essentially bypassed. if we missed something, please yell at us. with this commit, the use of ras-alps-command.sh is no longer needed... at least that is our hope.
...
This commit was SVN r25181.
2011-09-26 21:31:08 +00:00
Ralph Castain
8347385630
Fix the radix routed component.
...
This commit was SVN r25175.
2011-09-22 09:32:53 +00:00
Ralph Castain
1cd7b02df3
Add a set of default errmgr components that support solely the default "everything dies on error" behavior. Set their priority to be selected by default, but provide params to adjust those priorities to allow other component selection.
...
This commit was SVN r25139.
2011-09-13 22:03:45 +00:00
Ralph Castain
3c4f04f4d9
Ensure opal_hwloc_topology is NULL after being destroyed
...
This commit was SVN r25138.
2011-09-13 19:21:10 +00:00
Nathan Hjelm
079ccdf8b1
fix debugger co-location launching
...
This commit was SVN r25136.
2011-09-13 15:08:03 +00:00
Ralph Castain
ca7638553f
Remove stale code
...
This commit was SVN r25133.
2011-09-12 23:00:41 +00:00
Ralph Castain
556a05566e
Silence warning
...
This commit was SVN r25130.
2011-09-12 16:21:51 +00:00
Ralph Castain
92c7372e20
Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves.
...
Remove the sysinfo framework as hwloc replaces that functionality.
This commit was SVN r25124.
2011-09-11 19:02:24 +00:00
Ralph Castain
2091e39bee
Record the file descriptor on the read event when building optimized
...
This commit was SVN r25123.
2011-09-11 18:57:14 +00:00
Wesley Bland
f8740e5478
Correct a typo reported by Pasha.
...
This commit was SVN r25109.
2011-08-30 18:44:52 +00:00
Ralph Castain
03ddf8520b
Resolve not-used warnings
...
This commit was SVN r25101.
2011-08-27 14:27:15 +00:00
Wesley Bland
f542ecd578
Fix a couple of problems with the resil code not compiling.
...
This commit was SVN r25099.
2011-08-27 03:21:00 +00:00
George Bosilca
a4245b8d63
Remove some warnings related to the resilience patch.
...
This commit was SVN r25097.
2011-08-27 00:15:34 +00:00
George Bosilca
67ec5a0556
While doing cleanup remove pending warnings.
...
This commit was SVN r25095.
2011-08-26 23:38:06 +00:00
Wesley Bland
4e7ff0bd5e
By popular demand the epoch code is now disabled by default.
...
To enable the epochs and the resilient orte code, use the configure flag:
--enable-resilient-orte
This will define both:
ORTE_ENABLE_EPOCH
ORTE_RESIL_ORTE
This commit was SVN r25093.
2011-08-26 22:16:14 +00:00
Ralph Castain
1c08a4006c
Refactor some code to remove a few API handles from errmgr. Reviewed/tested by Wes.
...
This commit was SVN r25064.
2011-08-18 16:24:45 +00:00
Ralph Castain
e58623cd5b
Bring alps back to full operations by correctly computing daemon names. Unfortunately, alps doesn't assign cnos rank in node-based order - i.e., cnos rank=0 isn't necessarily on the first node of the execution. So adjust when using static ports.
...
Add some debug to nidmap
Ensure that the HNP's node name is not included in the regex when launching via rshbase as that node is automatically included in the daemon map.
This commit was SVN r25063.
2011-08-18 14:59:18 +00:00
Wesley Bland
a2a20c3766
I believe this should fix the race condition that Terry is seeing in the MTT
...
tests. It appears that nothing in the errmgr was using the mutexes to protect
the odls child list.
This commit was SVN r25062.
2011-08-18 14:52:30 +00:00
Ralph Castain
23f47295a8
Add even more debug
...
This commit was SVN r25053.
2011-08-16 16:41:33 +00:00
Ralph Castain
d624d43f69
Add more debug
...
This commit was SVN r25052.
2011-08-16 15:47:37 +00:00
Ralph Castain
3d96497581
Add debug
...
This commit was SVN r25050.
2011-08-16 12:22:05 +00:00
Shiqing Fan
7292ee2387
One .windows file is missing in the tarball.
...
This commit was SVN r25049.
2011-08-15 10:21:25 +00:00
Shiqing Fan
3af7c9f7bb
Complete the MinGW build support on Windows.
...
This commit was SVN r25048.
2011-08-15 09:47:23 +00:00
Shiqing Fan
627f1dd351
Correct several export declarations.
...
This commit was SVN r25047.
2011-08-15 09:45:51 +00:00
Ralph Castain
ca3d29a1e6
Extend regex support to a bigger audience
...
This commit was SVN r25046.
2011-08-12 21:02:48 +00:00
Ralph Castain
ea4e2c2db4
Unused variables
...
This commit was SVN r25045.
2011-08-12 21:02:09 +00:00
Jeff Squyres
1cbfb53801
r24976 wasn't quite right -- you now actually get a warning if you
...
specify btl_tcp_if_include because btl_tcp_if_exclude is defaulted to
the loopback devices.
This commit does a few things:
* Introduce a new OPAL MCA base function:
mca_base_param_check_exclusive_string(). It checks to see that the
''user'' does not set two MCA parameters that are mutually
exclusive by checking the source of those MCS param values.
* Use the above function in many BTLs (and the OOB TCP) to ensure
that <foo>_if_include and <foo>_if_exclude are not both specified
''by the user''.
* Re-arrange many of these BTLs to move their MCA registration code
into a separate component_register() function (vs. the
component_open() function).
This code has been nominally reviewed and checked by Ralph, George,
Terry, and Shiqing.
This commit was SVN r25043.
The following SVN revision numbers were found above:
r24976 --> open-mpi/ompi@8f4ac54336
2011-08-10 17:24:36 +00:00
Ralph Castain
b360c98afd
Per request from Pasha, revert r25004 - but modified a touch to reflect fact that opal_argv_append copies the provided string, so we don't need to print it and then free it.
...
This commit was SVN r25037.
The following SVN revision numbers were found above:
r25004 --> open-mpi/ompi@2418831bea
2011-08-09 22:42:27 +00:00
Nathan Hjelm
aa3d302a05
use persistent rml_recv in iof
...
This commit was SVN r25035.
2011-08-09 21:30:12 +00:00
Ralph Castain
f1951e7ccd
If we are abnormally terminating, then don't wait for orteds to report back. Send them a "halt_vm" command, which instructs them to kill their local procs and immediately terminate, doing their best to cleanup on the way out.
...
Also do a little cleanup on debug output in rshbase.
This commit was SVN r25033.
2011-08-09 17:42:19 +00:00
Wesley Bland
67feeb6aca
Move the errmgr code back. This shouldn't cause the svn problems that I
...
apparently caused last time. Sorry about that. This one will just be a big
changelog.
This commit was SVN r25016.
2011-08-08 16:01:08 +00:00
Wesley Bland
09274cd047
Make sure that the epoch is initialized everywhere so we don't get weird output
...
during valgrind. This shouldn't have caused any problems with any actual
execution. Just extra warnings in valgrind.
This commit was SVN r25015.
2011-08-08 15:11:55 +00:00
Ralph Castain
8014e3429e
Don't double-count procs as they are launched
...
This commit was SVN r25011.
2011-08-08 06:05:23 +00:00
Ralph Castain
4083dc617f
Fix computation of number of required files and file descriptors - it only depends on the total number of local procs, not on the number of procs in the entire job!
...
This commit was SVN r25008.
2011-08-08 04:09:40 +00:00
Ralph Castain
8b3c562b84
Adjust verbosity levels to make it easier to debug at scale
...
This commit was SVN r25006.
2011-08-07 21:14:21 +00:00
Ralph Castain
2418831bea
Pass the nodelist to the aprun command even when using all nodes
...
This commit was SVN r25004.
2011-08-06 04:19:41 +00:00
Ralph Castain
bd8e43a2de
Correct debug output so it doesn't falsely report the module
...
This commit was SVN r25003.
2011-08-05 20:30:34 +00:00
Ralph Castain
d603c79ab4
Fix the FAILED_TO_START scenario so orted doesn't segfault
...
This commit was SVN r25002.
2011-08-05 20:29:50 +00:00
Ralph Castain
c86bfb4e90
Need to copy the string
...
This commit was SVN r25001.
2011-08-05 19:03:28 +00:00
Ralph Castain
066022126e
Sort the nodes to be in numerically increasing order so the regex has a chance of working right.
...
This commit was SVN r24993.
2011-08-05 03:37:13 +00:00
Jeff Squyres
294e1f50cd
Remove compiler warning about nested comment
...
This commit was SVN r24984.
2011-08-03 18:30:56 +00:00
Wesley Bland
87a96da99c
Should fix some of the shutdown woes of the errmgr.
...
Correctly checks that the orted's job is completed.
Correctly tests to make sure that there is shutdown going on (doesn't rely on orte_orteds_term_ordered).
Adds a patch from Ralph to correctdly check the status of processes.
This commit was SVN r24962.
2011-08-01 14:00:41 +00:00
Ralph Castain
42b125ef35
Move the debug so it more accurately reports
...
This commit was SVN r24961.
2011-07-29 20:48:46 +00:00
Ralph Castain
70bca4691f
Add a new "sensor" module that supports fault tolerance tests - randomly kills local procs and/or the daemon itself
...
This commit was SVN r24960.
2011-07-29 20:48:22 +00:00
Wesley Bland
5fde3e0e00
Move the resilient orte errmgr code into a seperate errmgr for now while it's
...
still unstable. Reverted errmgr modules back to the original errmgr (with the
updates since the resilient code was brought into the trunk).
This commit was SVN r24958.
2011-07-28 21:24:34 +00:00
Ralph Castain
6c879f87fb
Add a new param "orte_remote_tmpdir_base" for those situations where the compute nodes require a different session directory head than the head node.
...
This commit was SVN r24956.
2011-07-27 19:37:17 +00:00
Ralph Castain
decab98fb2
Do a little better job of catching up on missed mcast messages, and provide a way out of scenarios where catch-up is impossible.
...
This commit was SVN r24955.
2011-07-27 14:58:30 +00:00
Ralph Castain
c3bc33b3fb
Don't be so restrictive - accept "slots" as well as "slot" in rank file
...
This commit was SVN r24954.
2011-07-27 00:45:30 +00:00
Wesley Bland
b972fd84e1
No longer sends extra FAILED_NOTIFICATION messages in the non-failure case.
...
Should reduce finalize complexity and avoid a race condition that has been
detected by a few users.
This commit was SVN r24952.
2011-07-26 20:47:44 +00:00
Ralph Castain
db193555c2
Use non-blocking sends for recovering from lost multicast messages
...
This commit was SVN r24943.
2011-07-25 18:49:47 +00:00
Ralph Castain
869024f1c6
You have to initialize th daemon param -before- using it to get epoch!!
...
This commit was SVN r24926.
2011-07-23 20:19:43 +00:00
Ralph Castain
361bcef253
Close multicast before rml
...
This commit was SVN r24925.
2011-07-23 20:19:15 +00:00
Shiqing Fan
cc4403a863
Remove two unused windows files.
...
This commit was SVN r24913.
2011-07-21 12:53:32 +00:00
Brian Barrett
3bd66a5932
* Remove unused Portals3.3 reference implementation support
...
This commit was SVN r24906.
2011-07-20 23:30:29 +00:00
Eugene Loh
921852e1e5
Clean up the computations of num_procs_alive. Do some code
...
refactoring to improve readability and to compute num_procs_alive
correctly and to remove the use of loop iteration variables for
two loops nested one inside another (causing MPI_Comm_spawn_multiple
to fail).
This commit was SVN r24903.
2011-07-14 20:10:48 +00:00
Ralph Castain
8853e0e80a
Fix regular expression analyzer for slurmd - use a slurm-specific version
...
Fix multi-node routing for daemon startup when static ports are not set
This commit was SVN r24898.
2011-07-13 22:49:56 +00:00
Ralph Castain
8d1b31b887
Don't know how we got away with this for so long, but we really shouldn't be referencing pointer array objects directly.
...
Also, fix an error in mpirx debugger module - the pointer array object is the pointer to the object itself, not the object "super" like in an opal_list.
This commit was SVN r24894.
2011-07-13 20:11:14 +00:00
Ralph Castain
1405bacd85
Ensure we dont segfault if we report an error
...
This commit was SVN r24890.
2011-07-13 15:00:22 +00:00
Jeff Squyres
3893a5a1de
Fix compile error introduced in r24888.
...
This commit was SVN r24889.
The following SVN revision numbers were found above:
r24888 --> open-mpi/ompi@e5253647ea
2011-07-13 14:18:00 +00:00
Shiqing Fan
e5253647ea
Fix a type cast.
...
This commit was SVN r24888.
2011-07-13 09:00:17 +00:00
Nathan Hjelm
3f4e5d7dd6
add missing thread lock/unlock around condition_broadcast
...
This commit was SVN r24885.
2011-07-12 15:43:56 +00:00
Nathan Hjelm
c3ec2e2614
fix a potential race condition in rml
...
This commit was SVN r24884.
2011-07-12 15:43:12 +00:00
Abhishek Kulkarni
6bf02d1344
Fixes (after the new ORTE resiliency layer was merged) to make the trunk build with C/R flags turned on.
...
This commit was SVN r24872.
2011-07-10 23:36:26 +00:00
Ralph Castain
05f4926bfe
Remove some remaining cruft re regular expressions - caused the trunk to fail if regex wasn't being used
...
This commit was SVN r24861.
2011-07-08 06:42:12 +00:00
Ralph Castain
1ee7c39982
Fix some major bit-rot on scalable launch. If static ports are provided, then daemons can connect back to the HNP via the routed connection tree instead of doing so directly. In order to do that at scale, the node list must be passed as a regular expression - otherwise, the orted command line gets too long.
...
Over the course of time, usage of static ports got corrupted in several places, the "parent" info got incorrectly reset, etc. So correct all that and get the regex-based wireup going again.
Also, don't pass node lists if static ports aren't enabled - they are of no value to the orted and just create the possibility of overly-long cmd lines.
This commit was SVN r24860.
2011-07-07 18:54:30 +00:00
Ralph Castain
6496b2f845
Ensure we terminate properly on non-zero exit status
...
This commit was SVN r24859.
2011-07-07 14:33:49 +00:00
Wesley Bland
0628963506
Fix a return code when a process isn't found.
...
This commit was SVN r24845.
2011-06-30 15:22:54 +00:00
Brian Barrett
e52fef28ca
Do something rational for the disable full support case
...
This commit was SVN r24844.
2011-06-30 14:48:19 +00:00
Ralph Castain
8ac35a8496
Fully enable the monitoring of memory usage and automatic termination of memory hogs when limits are reached. Improve the efficiency of the sensor system so we don't multiply sample the resource usage if multiple modules are active. Ensure we output the proc error summary when we abnormally terminate.
...
This commit was SVN r24843.
2011-06-30 14:11:56 +00:00
Ralph Castain
c449871ade
Add an mca param to set the "fork agent" - i.e., a program to be run when forking off a process (e.g., valgrind). While you could specify this by "mpirun -n N fork_agent ./my_app", not everyone launches procs with ORTE from mpirun.
...
Provide the ability to store recent stat histories using the ring_buffer class
This commit was SVN r24842.
2011-06-30 03:12:38 +00:00
Ralph Castain
2e1fa3e08e
Don't error out if the recv.cancel comes back not found as this is just a race condition
...
This commit was SVN r24841.
2011-06-30 01:19:50 +00:00
Wesley Bland
84be81df95
Standardize the initialization of the EPOCH's.
...
Everyone will be starting at MIN anyway (until we implement restart of course)
so there's no reason to set the epoch to INVALID and then immediately reset them
to MIN. This way there's less room to make mistakes later.
This commit was SVN r24829.
2011-06-28 14:20:33 +00:00
Ralph Castain
d316701d3c
Remove unnecessary mcast channel
...
This commit was SVN r24816.
2011-06-23 20:44:22 +00:00
Wesley Bland
e1ba09ad51
Add a resilience to ORTE. Allows the runtime to continue after a process (or
...
ORTED) failure. Note that more work will be necessary to allow the MPI layer to
take advantage of this.
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9299.php
This commit was SVN r24815.
2011-06-23 20:38:02 +00:00
Ralph Castain
391074cde6
Add a tag
...
This commit was SVN r24813.
2011-06-23 15:12:25 +00:00
Samuel Gutierrez
81f38b258a
commit of new shared memory backing facility framework (shmem) and its components.
...
This commit was SVN r24795.
2011-06-21 15:41:57 +00:00
Ralph Castain
9491fbb60c
Remove two stale modules
...
This commit was SVN r24794.
2011-06-21 05:57:39 +00:00
Ralph Castain
b95ede99d5
Ensure we use the pmi grpcomm when using pmi
...
This commit was SVN r24793.
2011-06-20 21:57:47 +00:00
Ralph Castain
92a65f21bf
Restore slurm pmi support from long, long ago. Since we already have the ability to directly srun an MPI job, just conditionally add the PMI support for key values and provide a grpcomm module that uses PMI for barriers and modex.
...
Currently ompi_ignored, and unignored only for me (others to soon follow).
This commit was SVN r24792.
2011-06-20 21:04:46 +00:00
Ralph Castain
042ee3ec48
Support the option of outputting error_log messages with something other than the process name
...
This commit was SVN r24784.
2011-06-17 14:50:00 +00:00
Ralph Castain
61dd7f4588
Need to pass identification of input/output channels
...
This commit was SVN r24783.
2011-06-17 14:48:59 +00:00
Ralph Castain
93dcfc15d0
Let the upper layer set the channels to be opened
...
This commit was SVN r24779.
2011-06-16 20:31:52 +00:00
Ralph Castain
7f2d2e3de7
Track the app_context rank - will equal overall rank for single app_context jobs
...
This commit was SVN r24778.
2011-06-16 20:31:30 +00:00
Josh Hursey
0eb3b3b7b0
Fix missing functionality in MPI_Abort so that the group of peers defined by the communicator that should be aborted with this process are requested from the runtime before the local process exits.
...
Per RFC:
http://www.open-mpi.org/community/lists/devel/2011/06/9335.php
This commit was SVN r24775.
2011-06-15 13:10:13 +00:00
Ralph Castain
033cbbed31
Don't automatically assign group channels if not given - let the layer above figure it out.
...
This commit was SVN r24771.
2011-06-10 16:28:18 +00:00
Ralph Castain
e039c7b7ea
Avoid crashing when debugging rmaps and a non-string resource constraint is given
...
This commit was SVN r24770.
2011-06-10 16:27:30 +00:00
Josh Hursey
9080eaedf3
Fully initialize the orte_errmgr_base_component_t structure for the app.
...
This commit was SVN r24765.
2011-06-09 14:22:25 +00:00
Josh Hursey
20339a7900
Minor coding style and intentation fixes.
...
This commit was SVN r24764.
2011-06-09 14:16:06 +00:00
Ralph Castain
fc8d920c56
Dont monitor resource usage unless requested, even if sensors are enabled during configure
...
This commit was SVN r24760.
2011-06-08 10:47:25 +00:00
Ralph Castain
906eb925f1
Update the resource usage sensor to initiate support for time analysis of measurements
...
This commit was SVN r24759.
2011-06-07 23:22:51 +00:00
Ralph Castain
f3cae3d6f3
Cleanup the handling of if_include and if_exclude arguments based on CIDR notation.
...
Fix a bug in the new code that prevented the system from correctly matching addresses.
Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide.
Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result.
Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages.
NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing.
This commit was SVN r24755.
2011-06-07 02:09:11 +00:00
George Bosilca
6b52d8f519
The paffinity is apparently needed.
...
This commit was SVN r24749.
2011-06-06 01:20:01 +00:00
Ralph Castain
bd8d9a943a
Add diagnostics
...
This commit was SVN r24748.
2011-06-05 19:17:56 +00:00
Ralph Castain
1491d52bd7
Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3").
...
This commit was SVN r24747.
2011-06-05 19:16:42 +00:00
George Bosilca
454519842e
Report bindings if requested.
...
This commit was SVN r24743.
2011-06-02 17:17:10 +00:00
George Bosilca
1eccadbd87
No need for the paffinity here.
...
This commit was SVN r24742.
2011-06-02 17:16:25 +00:00
Ralph Castain
8f401a0563
Enable the ability to constrain applications to hosts on the basis of resources.
...
This commit was SVN r24736.
2011-05-28 22:18:19 +00:00
Brian Barrett
beb1bc70b2
* Add support for using modex to exchange NID/PID pairs when using Portals4.
...
Rather than try to support a bunch of lightweight environments like I did
with the Portals3 code, always use the "modex" and hack the grpcomm for
the SHMEM implementation to return the right nid/pid for a remote
process by "magic".
This commit was SVN r24733.
2011-05-25 22:10:27 +00:00
Ralph Castain
81b6c50daa
Correct stale typo
...
This commit was SVN r24725.
2011-05-23 17:34:22 +00:00
Ralph Castain
661f508e62
Fix typo
...
This commit was SVN r24723.
2011-05-22 00:20:42 +00:00
Ralph Castain
b2331113a5
Add some debug
...
This commit was SVN r24722.
2011-05-21 21:09:47 +00:00
Ralph Castain
1b5ca323c6
Always followup with sigkill when killing local procs as procs can trap sigterm and get stuck
...
This commit was SVN r24719.
2011-05-20 22:40:10 +00:00
Ralph Castain
c5686ecfca
Dont sample stats for pid=0 children
...
This commit was SVN r24717.
2011-05-20 14:33:23 +00:00
Ralph Castain
b03e4481a3
Plug a couple of additional places in case orte_iof is not opened
...
This commit was SVN r24716.
2011-05-20 13:42:53 +00:00
Ralph Castain
dc0bb0571b
Record the number of heartbeats recvd each period for diag purposes
...
This commit was SVN r24714.
2011-05-20 00:21:33 +00:00
Ralph Castain
69dce0ec10
Minor heartbeat cleanups
...
This commit was SVN r24713.
2011-05-19 21:27:44 +00:00
Ralph Castain
c3df95dd13
Prevent failure due to race condition during abnormal term
...
This commit was SVN r24712.
2011-05-19 21:27:05 +00:00
Ralph Castain
b0f47e6f59
Allow orte_iof to not be opened
...
This commit was SVN r24711.
2011-05-19 21:26:30 +00:00
Ralph Castain
1f3911cc8b
Add a new proc state
...
This commit was SVN r24710.
2011-05-19 21:25:58 +00:00
Ralph Castain
b47ec2ee87
Remove lingering references to opal_profile option
...
This commit was SVN r24709.
2011-05-18 18:27:29 +00:00
Ralph Castain
d34bab541d
Remove the ompi-profiler tool and its attendant ompi-probe program. Also remove the grpcomm basic component since its only function was to support profiled clusters, which nobody was doing. :-(
...
This commit was SVN r24704.
2011-05-17 03:30:25 +00:00
Ralph Castain
a3e43594a4
Extend node stats to include additional memory info. Change "darwin" pstat module to "test" as we don't really know how to get all the stat info for darwin.
...
Add a new OPAL_ERROR_LOG macro similar to the ORTE_ERROR_LOG one.
This commit was SVN r24692.
2011-05-08 14:45:16 +00:00
Ralph Castain
c160f5d5a2
Add ability to specify mcast interfaces by name
...
This commit was SVN r24691.
2011-05-08 14:42:48 +00:00
Thomas Herault
fb3fd8fd0e
items belonging to peer_send_queue are mca_oob_tcp_msg_t *, which are obtained through a opal_freelist.
...
They shouldn't be released, but returned to the freelist.
This commit was SVN r24679.
2011-05-03 21:03:09 +00:00
Ralph Castain
138928fcf4
Use ports as multicast channels instead of networks so we avoid stepping into reserved spaces.
...
This commit was SVN r24666.
2011-04-29 18:46:40 +00:00
Shiqing Fan
4490fdbd34
Add the initial support for MinGW and MSYS.
...
Correctly check the dependencies of MSYS env.
Set up configure include and lib path for building the package.
update a few more CMake scripts.
This commit was SVN r24663.
2011-04-29 14:42:07 +00:00
Ralph Castain
c78531ce8a
Don't free the envar that gets putenv'd as that messes up the environ
...
This commit was SVN r24660.
2011-04-29 08:50:29 +00:00
Ralph Castain
b586f2952e
Arggg...revert r24645. I knew those fields were there for a reason...sigh.
...
This commit was SVN r24647.
The following SVN revision numbers were found above:
r24645 --> open-mpi/ompi@e4732110da
2011-04-28 15:07:00 +00:00
Ralph Castain
859aaab93d
In the case of direct-launched processes running under slurm, psm requires that the pre_condition_transports MCA param be set. This is normally computed by mpirun and inserted into each proc's environ, but that doesn't work here.
...
So separate out the printing of that key, and let the individual procs generate it in a way that ensures they all get the same result.
This commit was SVN r24646.
2011-04-28 13:54:33 +00:00
Ralph Castain
e4732110da
Remove a couple more stale fields
...
This commit was SVN r24645.
2011-04-28 00:26:38 +00:00
Ralph Castain
9988b97b97
Extend/update how we handle process stats. Add the ability to collect node-level stats separate from the process stats. Update the process stat memory fields to report in MBytes instead of KBytes as I can't find any process that runs in KBytes nowadays.
...
Rename the memusage sensor plugin to "resusage" as it will soon be updated to include full process stat monitoring.
Extend the heartbeat sensor to report node and process stats in the heartbeat.
Store the process and node stats in their respective orte_xxx_t object.
This commit was SVN r24629.
2011-04-21 22:55:45 +00:00
Ralph Castain
5f64b830f9
Ensure we only kill threads once
...
This commit was SVN r24620.
2011-04-18 14:47:09 +00:00
Ralph Castain
89501e6e24
Don't try to politely end threads when abnormally terminating as we can hang if the thread is in a stuck callback.
...
This commit was SVN r24618.
2011-04-18 12:21:47 +00:00
Ralph Castain
3a28556472
Expand our handling of non-zero exit status. If a process exits with non-zero status, pass that info along to the user in case it means something to them, even if the process also exited without calling MPI_Finalize. If the process calls MPI_Abort, that trumps the exit status question.
...
Provide a new MCA param that allows the user to direct that we abort the job once a process exits with non-zero status. No recovery is allowed in such cases to avoid trying to restart a process that has already exited MPI.
This commit was SVN r24614.
2011-04-14 15:04:21 +00:00
Ralph Castain
30fb002524
Take the first small step towards rationalizing rsh support. Create a new "rshbase" component that contains a simple rsh module - no tree spawn, uses all the base functions for launch support. Extend the base rsh support functions to include those functions in common across all rsh modules.
...
Only a minor change made to the current rsh module to avoid a naming conflict. Otherwise, left it alone to avoid creating conflicts with other external work. The current rsh module remains the default for rsh/ssh support, and continues to contain the support for SGE and Loadleveler.
This commit was SVN r24593.
2011-03-30 01:15:07 +00:00
Nysal Jan
866ae8b43a
Close the file descriptor
...
This commit was SVN r24580.
2011-03-29 08:42:49 +00:00
Nysal Jan
c8c6b0edab
Improve LoadLeveler integration with Open MPI. Add support for LL native rsh agent - llspawn
...
This commit was SVN r24579.
2011-03-29 07:46:59 +00:00
Nathan Hjelm
8634b6394f
fixed plm/tm component
...
This commit was SVN r24577.
2011-03-25 22:20:15 +00:00
Ralph Castain
d7e029cb40
Convert heartbeat to multicast basis
...
This commit was SVN r24570.
2011-03-24 19:05:39 +00:00
Ralph Castain
90698a2c02
Ensure that blocking recvs wait until the data is actually recvd
...
This commit was SVN r24558.
2011-03-22 18:45:54 +00:00
Ralph Castain
888472f671
Do not release recv as the calling function needs that data and will release it later
...
This commit was SVN r24557.
2011-03-22 18:44:56 +00:00
Ralph Castain
30981de200
Minor cleanups courtesy of Nysal - thanks!
...
This commit was SVN r24552.
2011-03-22 13:48:58 +00:00
Ralph Castain
c1396b278c
Resolve the rsh confusion by splitting the initial search for a launch agent from the actual setup of the launch agent values in the plm base globals. Have each aspiring rsh-clone call lookup to see if their desired launch agent is available - if not, then reject that plm component.
...
If so, then setup the actual launch agent values only when the module init function is called.
This resolves the current conflict between the rsh and rshd components. Hopefully, it may avoid future problems in this area -provided- any new uses of rsh-like launchers abide by the lookup-and-then-setup rule.
This commit was SVN r24550.
2011-03-22 02:23:09 +00:00
Ralph Castain
d17b50e1ff
Add the appropriate hooks to tell Totalview to display the user's main program upon startup. Apparently, this hook got lost somewhere after the 1.2 series :-(
...
Thanks to David Turner and the TV folks for passing this along.
This commit was SVN r24549.
2011-03-21 17:40:58 +00:00
Ralph Castain
795ca2cff2
Complete implementation of the multicast-based grpcomm module
...
This commit was SVN r24548.
2011-03-20 01:18:06 +00:00
Eugene Loh
2770a12beb
Continue clean up of thread options started in r22841, 22842, and 22849.
...
No need for any CMRs to 1.5... that was already done in CMR 2728.
This commit was SVN r24545.
The following SVN revision numbers were found above:
r22841 --> open-mpi/ompi@b400b84162
2011-03-18 21:36:35 +00:00