Commit graph

2794 commits

Author SHA1 Message Date
Ralph Castain
efc4a40c8a It is okay for a key not to be found
This commit was SVN r27187.
2012-08-30 15:12:23 +00:00
Ralph Castain
6dbb7a8493 Just because a peer didn't post a particular pmi key, that doesn't mean it is an immediate irrecoverable error - could be they just don't have a matching interface. Let the upper layer decide what to do about it.
This commit was SVN r27186.
2012-08-30 14:12:09 +00:00
Ralph Castain
1b659de132 Get staged execution working on multi-node setups. Improve efficiency by only remapping if not all procs in the job have been mapped yet.
This commit was SVN r27181.
2012-08-29 20:35:52 +00:00
Ralph Castain
a3b08f5800 Fix a few things relating to comm_spawn that cause new daemons to be launched. Ensure that all new daemons receive a full pidmap. Properly mark the daemon job as "updated" when daemons are added.
This commit was SVN r27177.
2012-08-29 03:11:37 +00:00
Ralph Castain
f0077820f2 Silence warning
This commit was SVN r27175.
2012-08-28 22:27:41 +00:00
Ralph Castain
a414ffdf4c Remove debug
This commit was SVN r27174.
2012-08-28 22:18:00 +00:00
Ralph Castain
30fd9d7abc MPI procs should definitely not be trapping SIGCHLD - only ORTE tools need to do so
This commit was SVN r27166.
2012-08-28 21:39:06 +00:00
Ralph Castain
98580c117b Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine.
Remove some stale configure.m4's we no longer need.

Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.

This commit was SVN r27165.
2012-08-28 21:20:17 +00:00
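
A minimal sketch of the staged-execution idea described in commit 98580c117b: run as many procs as there are free slots, and start the next proc whenever a running one completes. This is an illustrative simulation only, not the ORTE mapper or state machine; the function name `staged_schedule` and its inputs (`durations`, `slots`) are hypothetical.

```python
import heapq


def staged_schedule(durations, slots):
    """Return (start_time, proc_index) pairs for a staged run.

    durations: run time of each proc (hypothetical input for this sketch)
    slots: number of procs that may run at once without oversubscribing
    """
    if slots < 1:
        raise ValueError("need at least one slot")
    running = []  # min-heap of (finish_time, proc_index)
    starts = []
    now = 0
    for idx, dur in enumerate(durations):
        if len(running) == slots:
            # All slots are busy: wait for the earliest proc to complete.
            now, _ = heapq.heappop(running)
        starts.append((now, idx))
        heapq.heappush(running, (now + dur, idx))
    return starts
```

For example, `staged_schedule([3, 1, 2, 1], slots=2)` starts the first two procs at time 0 and launches the remaining two as slots free up.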
Ralph Castain
aadfe1b61e Fix a missing test that breaks novm operation.
CMR:v1.7

This commit was SVN r27163.
2012-08-28 21:13:57 +00:00
Ralph Castain
d310dd8c58 Fix a strange race condition by creating a separate buffer for each send - apparently, just a retain isn't enough protection on some systems
This commit was SVN r27161.
2012-08-28 17:17:34 +00:00
Ralph Castain
11c68e2299 Correct the count in the pmi key
This commit was SVN r27156.
2012-08-28 15:05:02 +00:00
Ralph Castain
6e8c97c77c Per Sam's eagle-eyed review, free the malloc'd memory if getcwd fails for some strange reason.
This commit was SVN r27150.
2012-08-27 19:15:16 +00:00
Ralph Castain
bccc20d13e Deal with one last corner case of positioning a dot-file
This commit was SVN r27144.
2012-08-26 03:49:31 +00:00
Ralph Castain
63d41c643d Minor cleanup
This commit was SVN r27143.
2012-08-25 14:24:45 +00:00
Ralph Castain
05f0b4c653 Couple of minor cleanups
This commit was SVN r27135.
2012-08-24 21:14:40 +00:00
Ralph Castain
d6cbff6d4e Since the preload flags are at the app_context level, we need to link only those files/exe's that pertain to each app_context to the corresponding procs. Also, gain a little optimization by checking to ensure we only send files once - this probably won't work when daemons are created on-the-fly, but that's for some other day
This commit was SVN r27134.
2012-08-24 16:16:30 +00:00
Jeff Squyres
20612c4194 Don't close the IOF stdin if we happen to read less than a full
buffer's worth of data -- interactive stdin will have that behavior
frequently. 

This commit was SVN r27131.
2012-08-24 14:29:19 +00:00
Ralph Castain
e0c39c94e8 Complete the cleanup of the preload files system. Remove the dest_dir option as moving things to arbitrary locations - especially absolute paths - can prove disastrous. Remove the preload_libs option as these can be treated as just files. Cleanup some of the pack/unpack code as the dss handles NULL strings just fine. Deal a little better with absolute paths, noting that tar now strips the leading '/' for us (showing my age as it didn't used to do so).
Remove the odls_base_state.c file as that code is now covered by the new broadcast form of preload_files.

This commit was SVN r27127.
2012-08-24 02:28:29 +00:00
Ralph Castain
b4a544ad2a Per discussion with Josh, use the --preload-xxx cmd line options to broadcast files to all nodes. Add --set-cwd-to-session-dir option to start procs in their session directories. Add OMPI_FILE_LOCATION envar to tell procs where their prepositioned files went.
This commit was SVN r27125.
2012-08-23 21:28:05 +00:00
Ralph Castain
855c9ae6cf Support archives .tar, .bz[2,zip], and .gz[ip]
This commit was SVN r27123.
2012-08-23 15:38:39 +00:00
Ralph Castain
286c610712 Protect us against the scenario where filem is included in enable-mca-no-build
This commit was SVN r27122.
2012-08-23 13:52:06 +00:00
Shiqing Fan
d141d94bd7 Include the new .windows files into the tarball.
This commit was SVN r27121.
2012-08-23 12:50:51 +00:00
Ralph Castain
7237a938bf Extend the filem interface to support prepositioning and linking required local files for execution. Create a new "raw" module that uses xcast to send the files to all nodes as this is faster than doing an scp in a linear pattern
This commit was SVN r27118.
2012-08-22 21:43:20 +00:00
Ralph Castain
ed4b354846 Ensure we pass along user-specified mca params from the cmd line when doing a tree spawn, but don't extend the cmd line with duplicates or things that shouldn't be there
This commit was SVN r27117.
2012-08-22 21:41:50 +00:00
Ralph Castain
5d7872fd68 Cleanup the tag list
This commit was SVN r27115.
2012-08-22 21:37:58 +00:00
Ralph Castain
7bcf2f8b5c Stop leaving droppings behind us
This commit was SVN r27111.
2012-08-22 17:39:22 +00:00
Shiqing Fan
95b9552546 include several components for Windows build.
This commit was SVN r27108.
2012-08-22 14:46:49 +00:00
Jeff Squyres
c8cee23ee7 Priorities really shouldn't be less than 0.
This commit was SVN r27098.
2012-08-21 15:47:15 +00:00
Ralph Castain
dacb07000d Turn udcm and ud oob off by default, but allow them to build and be used if someone wants to test them
cmr:v1.7

This commit was SVN r27097.
2012-08-21 15:18:34 +00:00
Ralph Castain
64cf75cec5 Add some debug
This commit was SVN r27087.
2012-08-17 02:19:26 +00:00
Ralph Castain
35fef87202 Make the "no virtual machine" selection more intuitive by providing a --novm option to mpirun.
This commit was SVN r27048.
2012-08-15 14:55:03 +00:00
Nathan Hjelm
8e03f77004 update alps configure scripts
This commit was SVN r27026.
2012-08-13 22:57:55 +00:00
Ralph Castain
589acf550c Improve the new MPI_INFO_ENV to better handle Java applications and to correctly report the info for singletons.
This commit was SVN r27025.
2012-08-13 22:13:49 +00:00
Ralph Castain
49a757e0bd Silly me - now that all daemons are stripping their prefix on the backend, we no longer need to do it as they report
This commit was SVN r27023.
2012-08-13 20:48:13 +00:00
Ralph Castain
b9b41d8662 For cases where the alpha+non-zero prefix must be removed from a node name, be sure to do it everywhere we access node names - otherwise, modex methods such as pmi will fail to correctly identify procs on the same node
This commit was SVN r27022.
2012-08-13 20:44:56 +00:00
Ralph Castain
cb48fd52d4 Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_Info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions:
"num_app_ctx" - the number of app_contexts in the job
"first_rank" - the MPI rank of the first process in each app_context
"np" - the number of procs in each app_context

Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it.

This commit was SVN r27005.
2012-08-12 01:28:23 +00:00
Ralph Castain
c4ee297a60 Cleanup the pmi grpcomm module so it passes non-btl modex data correctly.
This commit was SVN r26992.
2012-08-10 20:35:50 +00:00
Ralph Castain
e3e9b7345d First cut at updating the ccp launcher to use the state machine
This commit was SVN r26986.
2012-08-10 17:09:33 +00:00
Shiqing Fan
e304c19920 This is also used on Windows.
This commit was SVN r26975.
2012-08-08 16:44:00 +00:00
George Bosilca
ba879c2c51 Remove the unused map.
This commit was SVN r26960.
2012-08-07 12:06:13 +00:00
Shiqing Fan
2f442799f8 fix several typecasts
This commit was SVN r26957.
2012-08-07 10:41:53 +00:00
Ralph Castain
53b1a1c976 Cleanly error out when someone asks to map-to <object> if that object doesn't exist on a node.
This commit was SVN r26950.
2012-08-04 21:52:36 +00:00
Ralph Castain
61b09a132b Fix bynode mapping of multiple app-contexts
This commit was SVN r26949.
2012-08-03 21:45:40 +00:00
Ralph Castain
96f6f94c24 Ensure we don't get trapped in an infinite loop when ranking bynode if something isn't right
This commit was SVN r26948.
2012-08-03 21:45:10 +00:00
Ralph Castain
0d878937fe If a callback is set in the state machine, and the state doesn't yet exist, create it
This commit was SVN r26947.
2012-08-03 21:43:36 +00:00
Ralph Castain
431d5361ed For those who really preferred our prior mode of operation that mapped procs and only launched daemons on the nodes that had procs on them, introduce the "novm" state machine component. This recreates the old mode of operation by re-ordering the launch sequence so that we allocate, then map, and then launch daemons only on the required nodes (instead of across the entire allocation).
This commit was SVN r26946.
2012-08-03 16:30:05 +00:00
Ralph Castain
dc22ea5cde A little cleaner on the message about repeated ctrl-c, and re-enable the event so we can abort if we see multiple ctrl-c's that don't meet the time requirement
This commit was SVN r26945.
2012-08-03 01:26:18 +00:00
Ralph Castain
e6c72bfd53 Ensure we can forcibly exit even when we are stuck inside of an event by replacing the libevent signal handler with a POSIX one that (a) attempts to trip a libevent termination event and (b) if another ctrl-c hits within 5 seconds, just calls exit.
This commit was SVN r26943.
2012-08-02 21:15:35 +00:00
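
The two-step ctrl-c handling described in commit e6c72bfd53 can be sketched as a small decision function: the first interrupt requests an orderly termination, and a second interrupt inside the time window forces an exit. This is a hedged illustration of the logic only, not the ORTE signal handler; the class name `CtrlCGate`, the returned strings, and the choice to treat exactly 5 seconds as inside the window are assumptions.

```python
class CtrlCGate:
    """Decide what to do on each ctrl-c, given its timestamp in seconds."""

    WINDOW = 5.0  # seconds; the commit message says "within 5 seconds"

    def __init__(self):
        self.last = None  # time of the most recent ctrl-c, if any

    def on_interrupt(self, now):
        if self.last is not None and now - self.last <= self.WINDOW:
            # Second ctrl-c arrived quickly: stop waiting and exit.
            return "force_exit"
        # First ctrl-c, or the previous one was too long ago:
        # remember it and ask for an orderly termination.
        self.last = now
        return "request_termination"
```

In a real handler the timestamp would come from a monotonic clock; passing it in explicitly keeps the sketch easy to check.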
Ralph Castain
d818c9d407 Includes a patch from Jeff and Josh: update the simulator module to allow specification of multiple slot and max_slot counts for each node group (but don't require it). Remove the requirement that each node group provide its own topology. Adjust verbosities to allow showing some light debug output to see what nodes have been added without getting a bunch of other stuff.
This commit was SVN r26936.
2012-08-02 04:57:13 +00:00
Jeff Squyres
62c2ff7ee7 It's actually ''not'' an error to exit if all routes and children are
gone.  So exit with 0, not ORTE_ERROR_DEFAULT_EXIT_CODE (which is 1).

This fixes a race condition in the rsh launcher upon termination,
where ORTE would sometimes think that a daemon failed to launch.

This commit was SVN r26935.
2012-08-01 19:49:19 +00:00