openmpi

Author	SHA1	Message	Date
Ralph Castain	e0c39c94e8	Complete the cleanup of the preload files system. Remove the dest_dir option as moving things to arbitrary locations - especially absolute paths - can prove disastrous. Remove the preload_libs option as these can be treated as just files. Cleanup some of the pack/unpack code as the dss handles NULL strings just fine. Deal a little better with absolute paths, noting that tar now strips the leading '/' for us (showing my age as it didn't used to do so). Remove the odls_base_state.c file as that code is now covered by the new broadcast form of preload_files. This commit was SVN r27127.	2012-08-24 02:28:29 +00:00
Ralph Castain	b4a544ad2a	Per discussion with Josh, use the --preload-xxx cmd line options to broadcast files to all nodes. Add --set-cwd-to-session-dir option to start procs in their session directories. Add OMPI_FILE_LOCATION envar to tell procs where their prepositioned files went. This commit was SVN r27125.	2012-08-23 21:28:05 +00:00
Ralph Castain	e4d82b8912	Turn off the common port by default for now until we get rollup working properly on ALL platforms. This commit was SVN r27060.	2012-08-15 22:13:04 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job "first_rank" - the MPI rank of the first process in each app_context "np" - the number of procs in each app_context Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
Ralph Castain	431d5361ed	For those who really preferred our prior mode of operation that mapped procs and only launched daemons on the nodes that had procs on them, introduce the "novm" state machine component. This recreates the old mode of operation by re-ordering the launch sequence so that we allocate, then map, and then launch daemons only on the reqd nodes (instead of across the entire allocation). This commit was SVN r26946.	2012-08-03 16:30:05 +00:00
Shiqing Fan	660188307c	fix an export declaration name This commit was SVN r26895.	2012-07-27 13:26:24 +00:00
Shiqing Fan	8c4a3e1269	correct the symbol dllexports for windows build This commit was SVN r26827.	2012-07-22 08:54:50 +00:00
Ralph Castain	e335de3564	Refactor ompi_info, splitting it into parts according to the layer involved. Thus, we call down to the opal layer to get those frameworks and components, and down to the orte layer to get those. Still some abstraction breaks, but they mostly involve renaming of OMPI_foo labels that have been around since before we split the build system by layer. This commit was SVN r26695.	2012-06-28 18:23:34 +00:00
Ralph Castain	0dfe29b1a6	Roll in the rest of the modex change. Eliminate all non-modex API access of RTE info from the MPI layer - in some cases, the info was already present (either in the ompi_proc_t or in the orte_process_info struct) and no call was necessary. This removes all calls to orte_ess from the MPI layer. Calls to orte_grpcomm remain required. Update all the orte ess components to remove their associated APIs for retrieving proc data. Update the grpcomm API to reflect transfer of set/get modex info to the db framework. Note that this doesn't recreate the old GPR. This is strictly a local db storage that may (at some point) obtain any missing data from the local daemon as part of an async methodology. The framework allows us to experiment with such methods without perturbing the default one. This commit was SVN r26678.	2012-06-27 14:53:55 +00:00
Ralph Castain	b990c65a53	Remove another antiquated dss function - the 'size' API isn't used anywhere since the GPR went away This commit was SVN r26646.	2012-06-25 13:33:45 +00:00
Ralph Castain	abe7dd8274	Cleanup the dss by removing unused functions This commit was SVN r26644.	2012-06-23 21:20:09 +00:00
Ralph Castain	019857b616	Ensure that we don't attempt to use common ports if --disable-static was specified. This commit was SVN r26620.	2012-06-20 03:14:11 +00:00
Ralph Castain	9b026c6695	For now, run MTT with the use_common_port option enabled. This would be the desirable scenario for users, especially at scale, so let's see if it creates any issues. This commit was SVN r26609.	2012-06-15 15:46:38 +00:00
Ralph Castain	96c778656a	Improve launch performance on clusters that use dedicated nodes by instructing the orteds to use the same port as the HNP, thus allowing them to "rollup" their initial callback via the routed network. This substantially reduces the HNP bottleneck and the number of ports opened by the HNP. Restore enable-static-ports option by default - the Cray will have to disable it to get around their library issues, but that's just a warning problem as opposed to blocking the build. This commit was SVN r26606.	2012-06-15 10:15:07 +00:00
Ralph Castain	9506ac1617	Remove debug This commit was SVN r26592.	2012-06-11 20:02:53 +00:00
Ralph Castain	269cb2b8d9	Some cleanup to remove calls to opal_progress when running with orte progress threads, and to ensure that all orte-related events are in the orte event base. This commit was SVN r26591.	2012-06-11 19:59:53 +00:00
Ralph Castain	0442a807c0	Default the OOB to the "ud" component IFF the HNP finds itself on a node with a supported Infiniband device. Ensure that the daemons all pick the matching component by dictating the selection via mca param on the orted cmd line. This commit was SVN r26582.	2012-06-08 01:23:08 +00:00
Ralph Castain	d6279fc971	Fix the debugger daemon launch support to fit the new state machine. Treat debugger daemons just like any other job, except that we map them only to nodes where an app process currently exists (as opposed to every node in the system). Trigger breakpoint and rank0 release only after the debugger daemons are in position. This commit was SVN r26556.	2012-06-06 02:01:23 +00:00
Ralph Castain	9883f42caf	Add missing commit This commit was SVN r26501.	2012-05-28 02:20:20 +00:00
Ralph Castain	be6ed9c2df	Allow partial use of allocations by specifying the max number of daemons (i.e., max VM size) for the job This commit was SVN r26499.	2012-05-27 16:48:19 +00:00
Ralph Castain	83d69b6c95	Enable the ORTE progress thread for apps (not needed in the tools as they already continuously loop in the event lib). This appears to be working, at least for MPI apps that only use shared memory (a simple "hello"). More testing is required to identify where problems will occur - this is only intended to allow further development. In order to use the progress thread, you must configure with: --enable-orte-progress-threads --enable-event-thread-support This commit was SVN r26457.	2012-05-20 15:14:43 +00:00
Jeff Squyres	dab7d36a81	Fix location of the default hostfile. Thanks to Götz Waschk for identifying the problem. This commit was SVN r26441.	2012-05-15 16:13:39 +00:00
Jeff Squyres	2ba10c37fe	Per RFC, bring in the following changes: * Remove paffinity, maffinity, and carto frameworks -- they've been wholly replaced by hwloc. * Move ompi_mpi_init() affinity-setting/checking code down to ORTE. * Update sm, smcuda, wv, and openib components to no longer use carto. Instead, use hwloc data. There are still optimizations possible in the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old carto-based code found out how many NUMA nodes were ''available'' -- not how many were used ''in this job''. The new hwloc-using code computes the same value -- it was not updated to calculate how many NUMA nodes are used ''by this job.'' * Note that I cannot compile the smcuda and wv BTLs -- I ''think'' they're right, but they need to be verified by their owners. * The openib component now does a bunch of stuff to figure out where "near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors (I do not have a NUMA machine with an OpenFabrics device that is a non-uniform distance from multiple different NUMA nodes). * Completely rewrite the OMPI_Affinity_str() routine from the "affinity" mpiext extension. This extension now understands hyperthreads; the output format of it has changed a bit to reflect this new information. * Bunches of minor changes around the code base to update names/types from maffinity/paffinity-based names to hwloc-based names. * Add some helper functions into the hwloc base, mainly having to do with the fact that we have the hwloc data reporting ''all'' topology information, but sometimes you really only want the (online | available) data. This commit was SVN r26391.	2012-05-07 14:52:54 +00:00
Ralph Castain	b2f77bf08f	Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications. Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well. This commit was SVN r26380.	2012-05-02 21:00:22 +00:00
Ralph Castain	47a5e30095	Ensure debug output levels if we are debugging This commit was SVN r26358.	2012-04-29 00:03:28 +00:00
Ralph Castain	f3e3704c9e	Per request from Brian, enable mapping of stddiag output (output from opal_output calls) to stderr of the local process. This allows you to obtain that output in a local window (for example, when using xterm for each process) instead of having it automatically forwarded to mpirun. Turn this on automatically whenever someone uses the -xterm option, and to be set manually using the orte_map_stddiag_to_stderr mca param. This commit was SVN r26352.	2012-04-27 14:39:34 +00:00
Ralph Castain	3b5b185c86	Don't double free timer events This commit was SVN r26341.	2012-04-25 17:36:12 +00:00
Ralph Castain	ddfbde587f	Change the default to "abort" the job when any process exits with a non-zero status. Add the required code to ensure the orted tells the HNP about the problem. This commit was SVN r26270.	2012-04-13 21:19:46 +00:00
Ralph Castain	4d16790836	Fix collectives for jobs running across partial allocations This commit was SVN r26267.	2012-04-13 00:38:47 +00:00
Ralph Castain	5d14fa7546	Fix mpi_abort, minimize error output. This commit was SVN r26266.	2012-04-11 14:37:08 +00:00
Ralph Castain	9cd4c06488	Get things to build and run when --disable-orte is specified This commit was SVN r26263.	2012-04-10 21:50:01 +00:00
Ralph Castain	14d5525fb1	Some minor cleanups. Get singletons working. Cleanup abort handling so it gets properly identified. This commit was SVN r26261.	2012-04-10 19:08:54 +00:00
Ralph Castain	48de3a2501	Minor fixes so orte_progress_thread can work This commit was SVN r26248.	2012-04-06 15:50:49 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Ralph Castain	ca3ff58c76	Ensure we get a non-zero exit status when we can't find the specified fork agent. Output a better error message, and ensure we don't multiply report the problem. This commit was SVN r26191.	2012-03-24 00:49:38 +00:00
Josh Hursey	a595525366	Add the caller's name to the 'comm failed' error message, so we know between which two peers the communication failed. This commit was SVN r26117.	2012-03-08 21:55:19 +00:00
Ralph Castain	3d718863a8	Fix typo - thanks Pascal This commit was SVN r26064.	2012-02-28 14:33:55 +00:00
Ralph Castain	d7d8a8cdf7	Some cleanup of the tmpdir session directory specifications. Remove the --tmpdir option from orterun as it was confusing. Create an orte_local_tmpdir_base mca param in its place. Clarify the role of the local vs remote vs global tmpdir base params, and ensure that you don't set conflicting options. Remove the OMPI_PREFIX_ENV environmental variable as that was totally confusing as a way of setting a tmpdir base location. This commit was SVN r25941.	2012-02-16 16:10:01 +00:00
Ralph Castain	3da1787c06	Allow there to be no default hostfile without generating an error This commit was SVN r25930.	2012-02-15 04:16:05 +00:00
Ralph Castain	a5e4dc6803	In accordance with prior releases, we are supposed to default to looking at the openmpi-default-hostfile as a default hostfile. Restore that behavior, but ignore the file if it is empty. Allow the user to ignore any MCA param setting pointing to a default hostfile by setting the param to "none" (via cmd line or whatever) - this allows them to override a setting in the system default MCA param file. This commit was SVN r25851.	2012-02-01 17:40:44 +00:00
Ralph Castain	11a37d3978	Fix the default This commit was SVN r25733.	2012-01-17 21:09:27 +00:00
Ralph Castain	fd0d9f73c6	Make preload_binaries an MCA param so it can be set in the default MCA parameters for a system This commit was SVN r25728.	2012-01-17 17:16:05 +00:00
Ralph Castain	ce7ddd0e10	Create the debugger attach fifo unless the user requests that we periodically poll instead. This commit was SVN r25714.	2012-01-11 19:44:22 +00:00
Ralph Castain	bf103de66c	My apologies for doing this outside of the usual time restrictions, but we need to get this in so we can make progress. Move the ORTE-level debugger code back into orterun and out of the ORTE library to resolve symbol conflicts. This commit was SVN r25713.	2012-01-11 15:53:09 +00:00
Ralph Castain	437c52d2bf	Routing must be enabled by default This commit was SVN r25657.	2011-12-15 17:13:52 +00:00
Ralph Castain	7510339725	Remove stale orte_vm_launch param. Add a param that allows users to specify envars to forward/set so they can do it in the MCA param file instead of only via mpirun cmd line. This commit was SVN r25580.	2011-12-06 21:31:22 +00:00
Ralph Castain	df2f594aa8	Some cleanup associated with multiple app_contexts. Ensure nodes only get entered once into the map. Correctly handle bookmarks. Cleanup tracking of slots_inuse and correct detection of oversubscription. Still need to resolve the ranking issue so it starts at the bookmark, but that will come next. This commit was SVN r25574.	2011-12-05 22:01:08 +00:00
Ralph Castain	641e17f26c	A better way of handling fqdn allocations. Prior method was wrong as it equated "node1" with "node10", which definitely caused problems. Detect the addition of fqdn nodes in the allocation. If not found, then strip all incoming hostnames from daemons of any domain info when matching those names against the names in the node pool. Leave some protection and "live" diagnostic output in place so we can continue to detect problems across all environments. This commit was SVN r25557.	2011-12-01 14:24:43 +00:00
Ralph Castain	c56acf60ca	Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order. Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use. Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact. Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code. This commit was SVN r25551.	2011-11-30 19:58:24 +00:00
Ralph Castain	b475421c16	As promised, rationalize the rsh support. Remove rshbase and the base rsh support, centralizing all rsh support into the rsh component. Remove the "slave" launch support as that experiment is complete. Fix tree spawn and make that the default method for rsh launch, turning it "off" for qrsh as that system does not support tree spawn. This commit was SVN r25507.	2011-11-26 02:33:05 +00:00


529 commits