openmpi

Author	SHA1	Message	Date
Ralph Castain	71a24d6e74	Add some debug This commit was SVN r29326.	2013-10-02 01:37:02 +00:00
Ralph Castain	c6b7d9d027	Fix variable declaration This commit was SVN r29324.	2013-10-02 01:08:51 +00:00
Ralph Castain	fcb381c2e2	Minor cleanup - should behave the same, but just cleanup the variable names to avoid confusion This commit was SVN r29323.	2013-10-02 00:10:36 +00:00
Ralph Castain	d565a76814	Do some cleanup of the way we handle modex data. Identify data that needs to be shared with peers in my job vs data that needs to be shared with non-peers - no point in sharing extra data. When we share data with some process(es) from another job, we cannot know in advance what info they have or lack, so we have to share everything just in case. This limits the optimization we can do for things like comm_spawn. Create a new required key in the OMPI layer for retrieving a "node id" from the database. ALL RTE'S MUST DEFINE THIS KEY. This allows us to compute locality in the MPI layer, which is necessary when we do things like intercomm_create. cmr:v1.7.4:reviewer=rhc:subject=Cleanup handling of modex data This commit was SVN r29274.	2013-09-27 00:37:49 +00:00
Ralph Castain	6522963b9c	Flag that a daemon has been launched when it reports back to the HNP so we avoid re-launching it on spawns against dynamic allocations cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29245.	2013-09-25 16:58:19 +00:00
Ralph Castain	23c8848157	Only connect the first time thru the Torque launch, remove stale code cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29227.	2013-09-22 23:53:57 +00:00
Ralph Castain	400c68ed0f	Fix a segfault when a topology file is given to use in place of the one detected by mpirun itself. In that situation, the rmaps framework replaces the opal_hwloc_topology structure - but since that occurs after mpirun has set the node->topology field, we lose that definition. So don't set the node->topology field until after the rmaps framework has been opened. Does not need to go to 1.7 branch as that ordering is different. Modified: orte/mca/ess/hnp/ess_hnp_module.c This commit was SVN r29225.	2013-09-21 19:47:41 +00:00
Jeff Squyres	758cd25fff	Move the MCA / MPI_T level of the LAMA component down to 5 (from 9). This commit was SVN r29214.	2013-09-20 15:23:27 +00:00
George Bosilca	273d66d0f2	The MPI_Intercomm_create test was broken, as the remote peer was always considered as being 1 (instead of count). This commit was SVN r29207.	2013-09-18 16:47:54 +00:00
Ralph Castain	865a7028f8	Per patch from George, with a few minor cleanups. Correctly address the complete exchange of required wireup information in Intercomm_create so all procs in the resulting communicator know how to talk to each other. Refs trac:29166 This commit was SVN r29200. The following Trac tickets were found above: Ticket 29166 --> https://svn.open-mpi.org/trac/ompi/ticket/29166	2013-09-18 02:01:30 +00:00
Ralph Castain	99611ac1d2	Revert r29166 in favor of a better solution from George This commit was SVN r29199. The following SVN revision numbers were found above: r29166 --> open-mpi/ompi@497c7e6abb	2013-09-18 01:41:26 +00:00
George Bosilca	9e6c3c0646	Save the error code. This commit was SVN r29196.	2013-09-17 23:50:11 +00:00
Ralph Castain	2680bff88e	The function orte_iof_base_setup_prefork attempts to create a pty for child stdout and falls back to plain pipe if openpty fails. Child uses the 'usepty' flag to decide whether to treat this descriptor as a pty or as a pipe. Set 'usepty' flag to 0 upon openpty failure to inform the child that it isn't dealing with a pty even though pty has been requested. Thanks to Michal Peclo for reporting it and providing a patch. cmr:v1.7.3:reviewer=jsquyres cmr:v1.6.6:reviewer=jsquyres This commit was SVN r29169.	2013-09-15 15:33:51 +00:00
Ralph Castain	b64c8dafd8	Cleanup some errors in pubsub - must set the active flag before posting the recv in case the message has already arrived Refs trac:3696 This commit was SVN r29167. The following Trac tickets were found above: Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696	2013-09-15 15:26:32 +00:00
Ralph Castain	497c7e6abb	Fixes trac:2904 The intercomm "merge" function can create a linkage between procs that was not reflected anywhere in a modex, and so at least some of the procs in the resulting communicator don't know how to talk to some of the new communicator's peers. For example, consider the case where: 1. parent job A comm_spawns a process (job B) - these processes exchange modex and can communicate 2. parent job A now comm_spawns another process (job C) - again, these can communicate, but the proc in C knows nothing of B 3. do an intercomm merge across the communicators created by the two comm_spawns. This puts B and C into the same communicator, but they know nothing about how to talk to each other as they were not involved in any exchange of contact info. Hence, collectives on that communicator now fail. This fix adds an API to the ompi/dpm framework that (a) exchanges the modex info across the procs in the merge to ensure all procs know how to communicate, and (b) calls add_procs to give the btl's a chance to select transports to any new procs. cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29166. The following Trac tickets were found above: Ticket 2904 --> https://svn.open-mpi.org/trac/ompi/ticket/2904	2013-09-15 15:00:40 +00:00
Ralph Castain	eb132f923b	Check for bozo error of negative np for an app as this will cause ORTE to spin forever. cmr:v1.7.3:reviewer=jsquyres:subject=Check for negative np cmr:v1.6.6:reviewer=jsquyres:subject=Check for negative np This commit was SVN r29157.	2013-09-11 19:21:22 +00:00
Ralph Castain	2a116ecdfc	Fix a race condition created when two processes attempt to send to each other at the same time. This causes both processes to start connection procedures, resulting in a conflict that can cause messages to be lost. Add detection of this condition, and have both processes cancel their connect operations. The process with the higher rank will reconnect, while the lower rank process will simply wait for the connection to be created. Refs trac:3696 This commit was SVN r29139. The following Trac tickets were found above: Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696	2013-09-06 05:15:25 +00:00
Ralph Castain	e8697de521	Deal with PGI compilers on the Mac by initializing a global variable. cmr:v1.6.6:reviewer=jsquyres cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29129.	2013-09-05 21:40:50 +00:00
Ralph Castain	13ae51a91b	Protect against possible race conditions and threads by ensuring that rml send always occurs inside an event. cmr:v1.7.4:reviewer=jsquyres:subject=Protect against race conditions in rml send This commit was SVN r29128.	2013-09-05 01:16:32 +00:00
Ralph Castain	d32dfc96be	Use the rankfile to obtain list of nodes for VM launch if/when rankfile is given. cmr:v1.7.3:reviewer=jsquyres:subject=Obtain VM nodes from rankfile This commit was SVN r29119.	2013-09-04 16:37:30 +00:00
Ralph Castain	d9f0505952	Fix the lama verbose outputs so they don't segfault if someone asks for verbose output, but isn't using lama cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29108.	2013-09-03 17:55:35 +00:00
Ralph Castain	2bfa99e945	If a rankfile is given and the number of procs is not specified on the mpirun command line, then set the number of procs to the number of ranks in the rankfile cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29104.	2013-09-02 15:04:40 +00:00
Ralph Castain	43d1cd92ac	Ensure we activate the "daemons launched" state when only the HNP is left or else we will hang. cmr:v1.7.3:reviewer=jsquyres This commit was SVN r29094.	2013-08-29 22:50:51 +00:00
Dave Goodell	d17f104e7a	oob: squash some valgrind warnings These warnings were harmless, but they appeared even for simple programs like single-process runs of `ring_c`. This commit was SVN r29093.	2013-08-29 21:08:44 +00:00
Ralph Castain	12d4f45b5e	Silence warning: oob_tcp_connection.c: In function 'mca_oob_tcp_peer_accept': oob_tcp_connection.c:725:9: warning: variable 'cmpval' set but not used [-Wunused-but-set-variable] Refs trac:3696 This commit was SVN r29091. The following Trac tickets were found above: Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696	2013-08-29 20:56:05 +00:00
Ralph Castain	7a7cfdd519	A little cleanup - the base function to sort numa lists must return something or you get a warning about non-void function returning without value, so cleanup the return values. Ensure the mindist module actually checks for a return of "error" so it won't segfault, and have it emit a polite message when that happens. cmr:v1.7.3:reviewer=jladd This commit was SVN r29089.	2013-08-29 20:01:06 +00:00
Ralph Castain	c71e760e6c	The modex code was unfortunately written solely for PMI1 when updated to minimize calls to PMI_get - add the required PMI2 code This commit was SVN r29084.	2013-08-28 23:52:32 +00:00
Joshua Ladd	1802aabf1a	Add support for autodetecting a MLNX HCA in the rmaps min distance feature. In this way, .ini files distributed with software stacks need not specify a particular HCA but instead may select the key word auto which will automatically select the discovered device. To use this feature, simply pass the keyword auto instead of a specific device name, --mca rmaps_base_dist_hca auto. If more than one card is installed, the mapper will inform the user of this and, at this point, the user will then need to specify which card via the normal route, e.g. --mca rmaps_base_dist_hca <dev_name>. This should be added to cmr=v1.7.4:reviewer=rhc:subject=Autodetect logic for min dist mapping This commit was SVN r29079.	2013-08-28 16:23:33 +00:00
Ralph Castain	7125143253	Replace missing opal_db open/select that was apparently lost on a prior merge. Thanks to Nathan for pointing it out This commit was SVN r29072.	2013-08-27 19:42:31 +00:00
George Bosilca	65a362909d	Can't see how it works ... Thanks Thomas and Arm for the patch. This commit was SVN r29066.	2013-08-27 16:52:24 +00:00
Ralph Castain	c9a25465da	Don't need the number of nodes any more for PMI Refs trac:3729 This commit was SVN r29064. The following Trac tickets were found above: Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729	2013-08-23 18:36:51 +00:00
Ralph Castain	6d24b34940	Extend the dpm framework API to support persistent accept/connect operations: * paccept - establish a persistent listening port for async connect requests * pconnect - async connect to remote process that has posted a paccept port. Provides a timeout mechanism, and allows the underlying implementation to retry until timeout * pclose - shuts down a prior paccept posting Includes example programs paccept.c and pconnect.c in orte/test/mpi. New MPI extension interfaces coming... This commit was SVN r29063.	2013-08-23 18:02:50 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Ralph Castain	63d10d2d0d	Fix typo Refs trac:3729 This commit was SVN r29057. The following Trac tickets were found above: Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729	2013-08-22 16:05:58 +00:00
Ralph Castain	16c5b30a1f	Since the calls to "PMI get" scale by number of procs (not nodes), it makes more sense to have the MCA param be the cutoff based on number of procs. Also, it occurred to me that this shouldn't impact the nidmap process as that is built and circulated when we launch via mpirun, not during direct launch. So shift the cutoff param to the MPI layer, and have it solely determine whether or not we call modex_recv on the hostname. If comm_world is of size greater than the cutoff, then we don't automatically retrieve the hostname when we build the ompi_proc_t for a process - instead, we fill the hostname entry on first call to modex_recv for that process. The param is now "ompi_hostname_cutoff=N", where N=number of procs for cutoff. Refs trac:3729 This commit was SVN r29056. The following Trac tickets were found above: Ticket 3729 --> https://svn.open-mpi.org/trac/ompi/ticket/3729	2013-08-22 03:40:26 +00:00
Ralph Castain	45e695928f	As per the email discussion, revise the sparse handling of hostnames so that we avoid potential infinite loops while allowing large-scale users to improve their startup time: * add a new MCA param orte_hostname_cutoff to specify the number of nodes at which we stop including hostnames. This defaults to INT_MAX => always include hostnames. If a value is given, then we will include hostnames for any allocation smaller than the given limit. * remove ompi_proc_get_hostname. Replace all occurrences with a direct link to ompi_proc_t's proc_hostname, protected by appropriate "if NULL" * modify the OMPI-ORTE integration component so that any call to modex_recv automatically loads the ompi_proc_t->proc_hostname field as well as returning the requested info. Thus, any process whose modex info you retrieve will automatically receive the hostname. Note that on-demand retrieval is still enabled - i.e., if we are running under direct launch with PMI, the hostname will be fetched upon first call to modex_recv, and then the ompi_proc_t->proc_hostname field will be loaded * removed a stale MCA param "mpi_keep_peer_hostnames" that was no longer used anywhere in the code base * added an envar lookup in ess/pmi for the number of nodes in the allocation. Sadly, PMI itself doesn't provide that info, so we have to get it a different way. Currently, we support PBS-based systems and SLURM - for any other, rank0 will emit a warning and we assume max number of daemons so we will always retain hostnames This commit was SVN r29052.	2013-08-20 18:59:36 +00:00
Ralph Castain	9aebd7e281	Ensure we register the nidmap verbosity in mpirun, and add some debug This commit was SVN r29042.	2013-08-18 23:40:32 +00:00
Ralph Castain	611d7f9f6b	When we direct launch an application, we rely on PMI for wireup support. In doing so, we lose the de facto data compression we get from the ORTE modex since we no longer get all the wireup info from every proc in a single blob. Instead, we have to iterate over all the procs, calling PMI_KVS_get for every value we require. This creates a really bad scaling behavior. Users have found a nearly 20% launch time differential between mpirun and PMI, with PMI being the slower method. Some of the problem is attributable to poor exchange algorithms in RM's like Slurm and Alps, but we make things worse by calling "get" so many times. Nathan (with a tad advice from me) has attempted to alleviate this problem by reducing the number of "get" calls. This required the following changes: * upon first request for data, have the OPAL db pmi component fetch and decode all the info from a given remote proc. It turned out we weren't caching the info, so we would continually request it and only decode the piece we needed for the immediate request. We now decode all the info and push it into the db hash component for local storage - and then all subsequent retrievals are fulfilled locally * reduced the amount of data by eliminating the exchange of the OMPI_ARCH value if heterogeneity is not enabled. This was used solely as a check so we would error out if the system wasn't actually homogeneous, which was fine when we thought there was no cost in doing the check. Unfortunately, at large scale and with direct launch, there is a non-zero cost of making this test. We are open to finding a compromise (perhaps turning the test off if requested?), if people feel strongly about performing the test * reduced the amount of RTE data being automatically fetched, and fetched the rest only upon request. In particular, we no longer immediately fetch the hostname (which is only used for error reporting), but instead get it when needed. 
Likewise for the RML uri as that info is only required for some (not all) environments. In addition, we no longer fetch the locality unless required, relying instead on the PMI clique info to tell us who is on our local node (if additional info is required, the fetch is performed when a modex_recv is issued). Again, all this only impacts direct launch - all the info is provided when launched via mpirun as there is no added cost to getting it Barring objections, we may move this (plus any required other pieces) to the 1.7 branch once it soaks for an appropriate time. This commit was SVN r29040.	2013-08-17 00:49:18 +00:00
Ralph Castain	b2d86e1857	Silence uninitialized var warning This commit was SVN r29034.	2013-08-16 21:35:51 +00:00
Ralph Castain	b34bff8792	Cleanup warning This commit was SVN r29032.	2013-08-16 21:14:35 +00:00
Ralph Castain	bebe852057	Add new info key for publish that allows user to designate that the port is to be unique - i.e., to return an error if that service has already been published. Default is to overwrite This commit was SVN r29028.	2013-08-14 04:21:17 +00:00
Ralph Castain	72b5e867ab	Correct shutdown ordering - rml must go last This commit was SVN r29027.	2013-08-14 04:20:17 +00:00
Ralph Castain	8a4c5f4957	Attempt to plug a few memory leaks by ensuring we finalize all things opened during init. However, we are still leaking memory like a sieve in param registration and hwloc. This commit was SVN r29026.	2013-08-14 02:03:00 +00:00
Nathan Hjelm	b2e773ece3	Fix debugger support for direct-launched jobs. The orte rte component checks the orte_standalone_operation to decide if it should wait for a message from the hnp or wait on the debugger. This variable needed to be set to true in ess/pmi to enable the correct path when direct launching. cmr=v1.7.3:reviewer=rhc cmr=v1.6.6:reviewer=rhc This commit was SVN r29013.	2013-08-09 22:39:41 +00:00
Nathan Hjelm	841ed962f6	fix MCA variable and component system leaks cmr=v1.7.3:reviewer=rhc This commit was SVN r29011.	2013-08-09 19:50:28 +00:00
Nathan Hjelm	88cadc552d	Make opal/db/pmi use as few PMI keys as possible. This commit reintroduces key compression into the pmi db. This feature compresses the keys stored into the component into a small number of PMI keys by serializing the data and base64 encoding the result. This will avoid issues with Cray PMI which restricts us to ~ 3 PMI keys per rank. This commit was SVN r28993.	2013-08-03 01:06:59 +00:00
Ralph Castain	285429a1c6	Remove release of buffer - non-blocking send callback will do it This commit was SVN r28985.	2013-08-02 03:49:17 +00:00
Ralph Castain	37db1727a2	Refs trac:3710 Simplify the whole stripping of prefix method by consolidating it into a single MCA param. Allow for multiple prefixes to be stripped, each separated in the param by a comma. If no prefix is given, or the specified prefix isn't in the nodename, then just use the hostname itself. This commit was SVN r28974. The following Trac tickets were found above: Ticket 3710 --> https://svn.open-mpi.org/trac/ompi/ticket/3710	2013-08-01 00:32:10 +00:00
Nathan Hjelm	83a3fc2fd2	Add an option to control which hostnames orte_strip_prefix_from_node_names works on. This corrects a problem with Cray systems where the login node's hostname was being stripped causing the login node to be used as a compute node by mpirun. cmr=v1.7.3:reviewer=rhc This commit was SVN r28970.	2013-07-31 18:42:02 +00:00
Nathan Hjelm	ebbb32120a	MCA/base: variable system updates - Use an enumerator to handle bool values. - Fix a leak in the variable enumerator. - Fix a leak in an orte parameter. This commit was SVN r28949.	2013-07-25 15:42:01 +00:00
Ralph Castain	6c1a140e99	Per request from Nathan, add a "commit" API to the opal db framework. This allows him to aggregate keys to work around the Cray's severe PMI limitations This commit was SVN r28917.	2013-07-22 22:57:16 +00:00
Ralph Castain	5d12ab3873	Ensure we always set num_local_peers for both PMI2 and PMI1 This commit was SVN r28860.	2013-07-19 04:34:58 +00:00
Ralph Castain	b033a6b6d6	One last Cray-inspired fix... Refs trac:3685 This commit was SVN r28857. The following Trac tickets were found above: Ticket 3685 --> https://svn.open-mpi.org/trac/ompi/ticket/3685	2013-07-19 03:04:00 +00:00
Ralph Castain	92cb93b21e	Remove set-but-unused variable Refs trac:3685 This commit was SVN r28855. The following Trac tickets were found above: Ticket 3685 --> https://svn.open-mpi.org/trac/ompi/ticket/3685	2013-07-19 01:42:35 +00:00
Ralph Castain	bc2586cf3c	Refs trac:3685. Check error code returned by PMI2_Info_GetJobAttr. This commit was SVN r28854. The following Trac tickets were found above: Ticket 3685 --> https://svn.open-mpi.org/trac/ompi/ticket/3685	2013-07-19 01:24:51 +00:00
Ralph Castain	a10546d5c1	Cleanup and rename of platform files This commit was SVN r28853.	2013-07-19 01:18:41 +00:00
Ralph Castain	e4e678e234	Per the RFC and discussion on the devel list, update the RTE-MPI error handling interface. There are a few differences in the code from the original RFC that came out of the discussion - I've captured those in the following writeup George and I were talking about ORTE's error handling the other day in regards to the right way to deal with errors in the updated OOB. Specifically, it seemed a bad idea for a library such as ORTE to be aborting the job on its own prerogative. If we lose a connection or cannot send a message, then we really should just report it upwards and let the application and/or upper layers decide what to do about it. The current code base only allows a single error callback to exist, which seemed unduly limiting. So, based on the conversation, I've modified the errmgr interface to provide a mechanism for registering any number of error handlers (this replaces the current "set_fault_callback" API). When an error occurs, these handlers will be called in order until one responds that the error has been "resolved" - i.e., no further action is required - by returning OMPI_SUCCESS. The default MPI layer error handler is specified to go "last" and calls mpi_abort, so the current "abort" behavior is preserved unless other error handlers are registered. In the register_callback function, I provide an "order" param so you can specify "this callback must come first" or "this callback must come last". Seemed to me that we will probably have different code areas registering callbacks, and one might require it go first (the default "abort" will always require it go last). So you can append and prepend, or go first. Note that only one registration can declare itself "first" or "last", and since the default "abort" callback automatically takes "last", that one isn't available. 
:-) The errhandler callback function passes an opal_pointer_array of structs, each of which contains the name of the proc involved (which can be yourself for internal errors) and the error code. This is a change from the current fault callback which returned an opal_pointer_array of just process names. Rationale is that you might need to see the cause of the error to decide what action to take. I realize that isn't a requirement for remote procs, but remember that we will use the SAME interface to report RTE errors internal to the proc itself. In those cases, you really do need to see the error code. It is legal to pass a NULL for the pointer array (e.g., when reporting an internal failure without error code), so handlers must be prepared for that possibility. If people find that too burdensome, we can remove it. Should we ever decide to create a separate callback path for internal errors vs remote process failures, or if we decide to do something different based on experience, then we can adjust this API. This commit was SVN r28852.	2013-07-19 01:08:53 +00:00
Ralph Castain	6c50c8167c	Fix pmi-1 compile when no pmi2 is present This commit was SVN r28849.	2013-07-18 22:45:08 +00:00
Ralph Castain	256034a3dc	Sigh - fix a couple of spots I missed Refs trac:3683 This commit was SVN r28843. The following Trac tickets were found above: Ticket 3683 --> https://svn.open-mpi.org/trac/ompi/ticket/3683	2013-07-18 19:07:16 +00:00
Ralph Castain	fc3b777ef5	Cleanup a variable that isn't used if pmi2 support is available Refs trac:3683 This commit was SVN r28841. The following Trac tickets were found above: Ticket 3683 --> https://svn.open-mpi.org/trac/ompi/ticket/3683	2013-07-18 17:19:13 +00:00
Ralph Castain	92c6b806b9	Based on a patch submitted by Piotr Lesnicki of Bull, cleanup the PMI2 support. This has not been tested yet on multiple environments (e.g., Cray), so it needs more evaluation prior to moving to the 1.7 branch. cmr:v1.7.3:reviewer=rhc This commit was SVN r28837.	2013-07-18 14:46:07 +00:00
Ralph Castain	10ca1c1b04	Turns out that there was exactly ONE place in all of the OMPI code base that still referred to OPAL_TRACE, though a few places retained the include file for no reason. So no point in letting this sit as it is clearly an unused "feature". This commit was SVN r28789.	2013-07-14 18:57:20 +00:00
Ralph Castain	563bf60fb8	Fix an ordering problem that crept in due to the change in MCA param system. One MCA param would set a value, and then we did a hard reset of that value before testing another MCA param, thus removing a critical value for proper operation of the first param. cmr:v1.7.3:reviewer=brbarret This commit was SVN r28788.	2013-07-14 16:22:13 +00:00
Ralph Castain	7a21661785	Silence a warning when --without-hwloc is used This commit was SVN r28783.	2013-07-13 17:17:17 +00:00
Dave Goodell	3741d62308	fix --without-hwloc build failure All builds since r28682 configured with '--without-hwloc' fail at "make" time without this fix. Reviewed by rhc@ This commit was SVN r28769. The following SVN revision numbers were found above: r28682 --> open-mpi/ompi@446e33a5d8	2013-07-12 17:21:14 +00:00
Jeff Squyres	baa3182794	Per RFC (http://www.open-mpi.org/community/lists/devel/2013/07/12534.php), remove a bunch of dead code. This commit was SVN r28756.	2013-07-11 17:34:28 +00:00
Ralph Castain	028f5ee7a6	Cleanup some bitrot from moving the db framework to opal and from the new mca param system This commit was SVN r28741.	2013-07-09 14:37:08 +00:00
Ralph Castain	62378209f0	Even if we don't find the default hostfile, and nothing else was provided, then use all the known nodes. cmr:v1.7.3:#3653:reviewer=jsquyres cmr:v1.6.6:#3654:reviewer=jsquyres This commit was SVN r28718.	2013-07-03 22:31:32 +00:00
Ralph Castain	443a6802b9	If the default hostfile is empty, we need to pick up all the known nodes, not just the head node. cmr:v1.7.3:reviewer=jsquyres cmr:v1.6.6:reviewer=jsquyres This commit was SVN r28717.	2013-07-03 22:25:51 +00:00
Ralph Castain	446e33a5d8	There are cases where we want to use the novm state machine, but the backend node topology differs from that where mpirun is executing. In those cases, we can wind up thinking we are oversubscribed because the head node has fewer cores than the compute nodes. To resolve this situation, add the ability to specify a backend topology file that mpirun shall use for its mapping operations. Create a new "set_topology" function in opal hwloc to support it. This commit was SVN r28682.	2013-06-27 03:04:50 +00:00
Joshua Ladd	0b5c1f2ea8	Add 'generic' support for PMI2 (previously, we checked for PMI2 only on Cray systems.) If your resource manager (e.g. SLURM) has support for PMI2, then the --with-pmi configure flag will enable its usage. If you don't have PMI2, then you will fall back to regular old PMI1. This patch was submitted by Ralph Castain and reviewed and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd This commit was SVN r28666.	2013-06-21 15:28:14 +00:00
Nathan Hjelm	299d5b3dd7	Fix two debugger attach bugs. - orte_debugger_init_after_spawn was not being called for debuggers that use the MPIR_attach_fifo to co-locate debugger daemons. - MPIR_Breakpoint was not getting called if a debugger reattached. Add a job state (ORTE_JOB_STATE_DEBUGGER_DETACH) to reset mpir_breakpoint_fired to false when a debugger detaches to ensure MPIR_Breakpoint is called if another debugger attaches. Tested with STAT 2.0/launchmon 1.0. cmr:v1.7 This commit was SVN r28665.	2013-06-20 16:18:05 +00:00
Jeff Squyres	b9ca8e3cd1	Tweaked the help message a bit (this is the end result of iterating on the message in email between Mike, Ralph, Jeff). Add this to CMR #3642 and #3643. This commit was SVN r28662.	2013-06-20 13:19:23 +00:00
Ralph Castain	13665bffe8	Per an off-list discussion, it appears possible for a system to report failure when executing getpwuid. There are several reasons for this error to occur, most notably if the system uses a network-based authentication protocol (e.g., NIS) and that system gets overwhelmed when we launch on a lot of nodes. There is no good way to recover from this scenario, and from past experience, using the user's name in the session directory (as opposed to the uid) is very helpful when things go wrong. So print a help message when this happens (it is extremely rare, but has happened at least once now) and return an error. cmr:v1.7.3,reviewer=jsquyres cmr:v1.6.5,reviewer=jsquyres This commit was SVN r28658.	2013-06-20 04:30:42 +00:00
Ralph Castain	a51a0a8c48	Fix uninitialized var This commit was SVN r28652.	2013-06-18 22:41:47 +00:00
George Bosilca	b4ebc417a1	Correctly register the component MCA parameters. Few cleanups in the includes. This commit was SVN r28645.	2013-06-15 16:05:09 +00:00
Nathan Hjelm	518d1fe200	Fix two typos that prevented alps direct launch from working This commit was SVN r28628.	2013-06-13 17:04:08 +00:00
Joshua Ladd	61ffb47573	Minor fix for the min-dist mapping algorithm: we need to call 'get_nbobjs_by_type' first, before we get the sorted list of nodes - we need to add node objects and fill them in the summary object for the current topology. This patch was submitted by Elena Elkina and pushed by Josh Ladd. This should be added to cmr:v1.7:reviewer=jladd This commit was SVN r28578.	2013-05-31 15:19:59 +00:00
Jeff Squyres	6d173af329	This commit introduces a new "mindist" ORTE RMAPS mapper, as well as some relevant updates/new functionality in the opal/mca/hwloc and orte/mca/rmaps bases. This work was mainly developed by Mellanox, with a bunch of advice from Ralph Castain, and some minor advice from Brice Goglin and Jeff Squyres. Even though this is mainly Mellanox's work, Jeff is committing only for logistical reasons (he holds the hg+svn combo tree, and can therefore commit it directly back to SVN). ----- Implemented distance-based mapping algorithm as a new "mindist" component in the rmaps framework. It allows mapping processes by NUMA due to PCI locality information as reported by the BIOS - from the closest to device to furthest. To use this algorithm, specify: {{{mpirun --map-by dist:<device_name>}}} where <device_name> can be mlx5_0, ib0, etc. There are two modes provided: 1. bynode: load-balancing across nodes 1. byslot: go through slots sequentially (i.e., the first nodes are more loaded) These options are regulated by the optional ''span'' modifier; the command line parameter looks like: {{{mpirun --map-by dist:<device_name>,span}}} So, for example, if there are 2 nodes, each with 8 cores, and we'd like to run 10 processes, the mindist algorithm will place 8 processes to the first node and 2 to the second by default. But if you want to place 5 processes to each node, you can add a span modifier in your command line to do that. If there are two NUMA nodes on the node, each with 4 cores, and we run 6 processes, the mindist algorithm will try to find the NUMA closest to the specified device, and if successful, it will place 4 processes on that NUMA but leaving the remaining two to the next NUMA node. You can also specify the number of cpus per MPI process. This option is handled so that we map as many processes to the closest NUMA as we can (number of available processors at the NUMA divided by number of cpus per rank) and then go on with the next closest NUMA. 
The default binding option for this mapping is bind-to-numa. It works if you don't specify any binding policy. But if you specified binding level that was "lower" than NUMA (i.e hwthread, core, socket) it would bind to whatever level you specify. This commit was SVN r28552.	2013-05-22 13:04:40 +00:00
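The placement arithmetic described in the mindist commit above can be sketched as follows. This is a minimal illustration only, not the rmaps code: the helper names `distribute` and `slots_per_numa` and their parameters are hypothetical, and the containers (nodes, or NUMA domains) are assumed to be already ordered closest-first.

```python
def distribute(nprocs, slots, span=False):
    """Place nprocs across containers with the given slot counts.

    Hypothetical helper for illustration; not part of Open MPI.
    `slots` lists the capacity of each node (or NUMA domain),
    ordered from closest to furthest.
    """
    counts = [0] * len(slots)
    remaining = nprocs
    if span:
        # span modifier: round-robin across containers to balance the load
        while remaining > 0:
            placed = False
            for i, cap in enumerate(slots):
                if remaining > 0 and counts[i] < cap:
                    counts[i] += 1
                    remaining -= 1
                    placed = True
            if not placed:
                raise ValueError("not enough slots")
    else:
        # default: fill each container before moving on to the next one
        for i, cap in enumerate(slots):
            take = min(cap, remaining)
            counts[i] = take
            remaining -= take
        if remaining:
            raise ValueError("not enough slots")
    return counts


def slots_per_numa(cores, cpus_per_rank):
    # Processes that fit on one NUMA domain:
    # available processors divided by cpus per rank.
    return cores // cpus_per_rank
```

With two 8-core nodes and 10 processes, `distribute(10, [8, 8])` fills the first node before using the second, while `span=True` balances the two nodes, matching the examples in the commit message.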
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Ralph Castain	e100b8d165	don't need the return value, but should check for error This commit was SVN r28534.	2013-05-16 15:15:02 +00:00
Jeff Squyres	128cc27417	Minor type fix (they're both enums/ints, so the compiler previously silently cast them). This commit was SVN r28532.	2013-05-16 00:47:37 +00:00
Ralph Castain	93ba4247f8	remove extra paren when --without-hwloc This commit was SVN r28530.	2013-05-15 21:31:45 +00:00
Ralph Castain	3a372a65b8	Mapping policies must be tested as equalities as they are values, not bitmasks This commit was SVN r28526.	2013-05-15 13:45:00 +00:00
Ralph Castain	29e4b0cc50	Cannot test equality on mapping directives as it is a bitmask This commit was SVN r28525.	2013-05-15 13:41:49 +00:00
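The distinction drawn in the two commits above (policies are enumerated values, directives are bit flags) can be illustrated with a small sketch. The constant names and the bit layout below are assumptions made for the example, not the actual ORTE definitions.

```python
# Hypothetical layout: low byte holds the policy (an enumerated value),
# higher bits hold directive flags. Not the real ORTE constants.
POLICY_MASK = 0x00FF
BYNODE, BYSLOT, BYSOCKET = 1, 2, 3   # values; note 3 == 1 | 2
NO_OVERSUBSCRIBE = 0x0100            # flags; each is an independent bit
SPAN = 0x0200


def get_policy(mapping):
    # Extract the policy value; compare the result with ==.
    return mapping & POLICY_MASK


def has_directive(mapping, flag):
    # Directives can be combined, so test them with &.
    return (mapping & flag) != 0


mapping = BYSOCKET | NO_OVERSUBSCRIBE | SPAN

# Correct: the policy is compared as a value.
policy_is_bysocket = get_policy(mapping) == BYSOCKET
# Wrong: a bitwise test on a value reports BYNODE because 3 & 1 != 0.
false_positive = (get_policy(mapping) & BYNODE) != 0
# Wrong the other way: equality on a flag fails when other bits are set.
missed_flag = mapping == SPAN
```

Here `false_positive` ends up true although the policy is not BYNODE, and `missed_flag` ends up false although SPAN is set, which is the pair of mistakes the two commits describe.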
Ralph Castain	04b11accd3	Silence a few warnings This commit was SVN r28515.	2013-05-14 21:58:40 +00:00
Ralph Castain	5296099ecb	Fix the cpus-per-rank when binding to hwthreads. Add cpus-per-rank to diag printout Thanks to Elena for reporting the problem This commit was SVN r28508.	2013-05-14 20:17:50 +00:00
Ralph Castain	427b6b0b47	Fix the verbosity of yet another framework...sigh. This commit was SVN r28481.	2013-05-13 14:36:32 +00:00
Jeff Squyres	456df1c9f7	Remove redundant opal_output() messages from the module; the called functions will now show_help() their own error messages if something goes wrong (per r28470). This commit was SVN r28471. The following SVN revision numbers were found above: r28470 --> open-mpi/ompi@2ff95a7739	2013-05-10 15:12:07 +00:00
Jeff Squyres	2ff95a7739	Proper show_help error messages for LAMA. This commit was SVN r28470.	2013-05-10 15:06:25 +00:00
Ralph Castain	f15fe5045e	Ensure that debugger connect can occur by getting the rml contact info updated before calling init_after_spawn cmr:v1.7.3,reviewer=jsquyres This commit was SVN r28455.	2013-05-06 22:00:45 +00:00
Ralph Castain	c52b94af8b	Revert r28453 and r28452 - wrong fix This commit was SVN r28454. The following SVN revision numbers were found above: r28452 --> open-mpi/ompi@756ee4b5e0 r28453 --> open-mpi/ompi@6da24143a2	2013-05-06 21:52:17 +00:00
Ralph Castain	6da24143a2	Minor performance improvement This commit was SVN r28453.	2013-05-06 20:27:16 +00:00
Ralph Castain	756ee4b5e0	Update the rml_uri for each proc so debuggers can attach This commit was SVN r28452.	2013-05-06 20:18:14 +00:00
Ralph Castain	707d0e653a	Must use equal and not & comparison for mapping directives This commit was SVN r28451.	2013-05-06 15:07:12 +00:00
Ralph Castain	a0a6412545	Do a little cleanup on abnormal termination procedure - don't keep submitting forced exit events (one will do), no need to reset the abnormal termination pipe event in orterun, etc. This commit was SVN r28450.	2013-05-05 17:39:45 +00:00
Ralph Castain	fb2a694587	Fix print This commit was SVN r28446.	2013-05-04 22:37:34 +00:00
Ralph Castain	27e3e382d5	No need for ORTE tools to use orte progress thread This commit was SVN r28445.	2013-05-04 21:13:20 +00:00
Jeff Squyres	42a9a4c62c	After examining a '''lot''' of MTT output with Ralph, fix the cause of many, many MTT timeouts when running jobs under SLURM: send the right command at the end to cause remote orteds to shut down. This commit was SVN r28438.	2013-05-02 00:23:53 +00:00
Ralph Castain	5d7a93c032	Add the ability to use an external version of libevent. Clearly not recommended at this time. I've verified that it works in limited scenarios, but more thorough testing and performance impacts need to be assessed. Interesting how many includes had to be fixed here and there to fill in missing dependencies :-) This commit was SVN r28411.	2013-04-29 17:02:37 +00:00
Ralph Castain	3a354c4ea3	Cleanup the verbose output channel name This commit was SVN r28391.	2013-04-24 23:44:02 +00:00
Ralph Castain	c5e1a7dc65	fix typo This commit was SVN r28390.	2013-04-24 23:37:59 +00:00
Nathan Hjelm	55decca2b7	routed/debruijn: if the next hop for a message is unknown forward the message to the parent process This commit was SVN r28384.	2013-04-24 19:05:25 +00:00
Nathan Hjelm	bccf8c657a	Per RFC add initial support for the MPI 3.0 tools interface. Current MPI_T support: - Full cvar interface. - Full categories interface. - No pvar support at this time. This commit was SVN r28376.	2013-04-24 15:59:23 +00:00
Ralph Castain	2e8946db0a	Add some debug output This commit was SVN r28371.	2013-04-23 23:11:22 +00:00
Ralph Castain	b6377a2138	Properly remove the object from the list prior to releasing it when an error is encountered This commit was SVN r28370.	2013-04-23 22:44:52 +00:00
Ralph Castain	1a54e52da4	Per conversation with Josh H., remove the stale rsh component from the filem framework. Let the "raw" component do the job. This commit was SVN r28338.	2013-04-16 20:39:48 +00:00
Ralph Castain	d6ac721e22	Add client/server test This commit was SVN r28332.	2013-04-15 13:10:42 +00:00
Jeff Squyres	349ee654c1	Fix some --without-hwloc compile errors. Also remove one assigned-but-not-used variable assignment. This commit was SVN r28321.	2013-04-10 15:08:31 +00:00
Ralph Castain	45af6cf59e	The move of the orte_db framework to opal required that we create an opaque opal_identifier_t type as OPAL cannot know anything about the ORTE process name. However, passing a value down to opal and then having the db components reference it causes alignment issues on Solaris Sparc platforms. So pass the pointer instead and do the old "memcpy" trick to avoid the problem. This commit was SVN r28308.	2013-04-08 23:34:16 +00:00
Ralph Castain	2040afdcae	Daemonize prior to starting any progress threads This commit was SVN r28303.	2013-04-08 00:42:57 +00:00
Ralph Castain	76426285f0	Cannot retain opal_buffer_t, so use a copy This commit was SVN r28302.	2013-04-07 23:02:59 +00:00
Ralph Castain	698b4ad6e7	Fix the parameter handling so no-tree-spawn isn't getting reversed This commit was SVN r28300.	2013-04-07 15:48:25 +00:00
Ralph Castain	1f011bef99	Cleanup the updated sys limits capability. Fix a few copy/paste bugs (my bad). Shift the limit set to the ODLS default module so that we set the limits for all apps, even those that don't call opal_init. Leave it in opal_init as well to support direct-launch apps, but ensure we only set the limits once by removing the envar after launch by ODLS. Provide some nice error messages if we fail to set the limits. Since the user had to specifically request we set the limit, treat failure as an error-out situation. This commit was SVN r28288.	2013-04-04 16:00:17 +00:00
Ralph Castain	252147fba6	Cleanup error message if unknown host is given in -host and -hostfile options This commit was SVN r28262.	2013-03-28 16:52:10 +00:00
Ralph Castain	e6ae088813	Cleanup error outputs when a daemon fails to start This commit was SVN r28261.	2013-03-28 16:51:19 +00:00
Ralph Castain	21ee48de57	Add missing static declaration This commit was SVN r28247.	2013-03-27 21:59:17 +00:00
Ralph Castain	1b5b9afa85	Silence warnings This commit was SVN r28244.	2013-03-27 21:56:46 +00:00
Nathan Hjelm	17315bf360	Now that the entire codebase has been updated to use the MCA framework system remove the last calls to the MCA parameter system. This commit was SVN r28242.	2013-03-27 21:17:53 +00:00
Nathan Hjelm	9d4a26f47d	Update OMPI frameworks to use the MCA framework system. Notes: - This commit also eliminates the need for an available components list in use in several frameworks. None of the code in question was making use of the priority field of the priority component list item so these extra lists were removed. - Cleaned up selection code in several frameworks to sort lists using opal_list_sort. - Cleans up the ompi/orte-info functions. Expose the functions that construct the list of params so they can be used elsewhere. patches for mtl/portals4 from brian missed a few output variables in openib This commit was SVN r28241.	2013-03-27 21:17:31 +00:00
Nathan Hjelm	c041156f60	Update ORTE frameworks to use the MCA framework system. This commit was SVN r28240.	2013-03-27 21:14:43 +00:00
Nathan Hjelm	365cf48db5	Update OPAL frameworks to use the MCA framework system. This commit was SVN r28239.	2013-03-27 21:11:47 +00:00
Nathan Hjelm	c3b67d0187	Automatically generate a list of installed frameworks in project/include/project/frameworks.h This commit was SVN r28238.	2013-03-27 21:10:32 +00:00
Nathan Hjelm	cf377db823	MCA/base: Add new MCA variable system Features: - Support for an override parameter file (openmpi-mca-param-override.conf). Variable values in this file can not be overridden by any file or environment value. - Support for boolean, unsigned, and unsigned long long variables. - Support for true/false values. - Support for enumerations on integer variables. - Support for MPIT scope, verbosity, and binding. - Support for command line source. - Support for setting variable source via the environment using OMPI_MCA_SOURCE_<var name>=source (either command or file:filename) - Cleaner API. - Support for variable groups (equivalent to MPIT categories). Notes: - Variables must be created with a backing store (char *, int , or bool *) that must live at least as long as the variable. - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of mca_base_var_set_value() to change the value. - String values are duplicated when the variable is registered. It is up to the caller to free the original value if necessary. The new value will be freed by the mca_base_var system and must not be freed by the user. - Variables with constant scope may not be settable. - Variable groups (and all associated variables) are deregistered when the component is closed or the component repository item is freed. This prevents a segmentation fault from accessing a variable after its component is unloaded. - After some discussion we decided we should remove the automatic registration of component priority variables. Few components actually made use of this feature. - The enumerator interface was updated to be general enough to handle future uses of the interface. - The code to generate ompi_info output has been moved into the MCA variable system. See mca_base_var_dump(). 
opal: update core and components to mca_base_var system orte: update core and components to mca_base_var system ompi: update core and components to mca_base_var system This commit also modifies the rmaps framework. The following variables were moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode, rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables. This commit was SVN r28236.	2013-03-27 21:09:41 +00:00
Ralph Castain	e7ac6c9bde	Don't build rank_file if you can't use it anyway This commit was SVN r28233.	2013-03-27 15:12:40 +00:00
Ralph Castain	256414121e	Protect the cpus-per-rank MCA param registration so that --without-hwloc will build This commit was SVN r28232.	2013-03-27 14:53:30 +00:00
Ralph Castain	317915225c	Finish the binding cleanup by removing the no-longer-used binding level scheme. This proved to be fallible as there is no guarantee that the hierarchy it used matched physical reality of the machine (e.g., is L3 "above" the socket or not). Still have to complete the ppr update, but get the rest of it correct. This commit was SVN r28223.	2013-03-26 20:09:49 +00:00
Ralph Castain	24b91839aa	Ensure the process knows its local cpuset early enough to perform the locality computation This commit was SVN r28221.	2013-03-26 19:14:23 +00:00
Ralph Castain	6ee32767d4	Restore the cpus-per-proc option for byslot and bynode mapping. Remove the bind_idx (which recorded the index of the hwloc object where the proc was bound) as this would no longer be unique, and just use the bitmap as the standard reference for location. Update the relative locality computation to take bitmaps as its argument. This commit was SVN r28219.	2013-03-26 18:27:50 +00:00
Ralph Castain	9c68e60965	Add test for comm_spawn with info keys This commit was SVN r28207.	2013-03-24 14:38:33 +00:00
Ralph Castain	2f43989d22	Add debug and handle the use-case where someone (a) uses a hostfile while in a managed allocation to sub-allocate runs, and (b) includes the HNP's node in one of those hostfiles. cmr:v1.7 This commit was SVN r28203.	2013-03-22 00:53:33 +00:00
Jeff Squyres	63d17ce901	Fix CID 968581: ensure that the string read from the socket is always \0-terminated so that strlen() and strstr() can be used without fear. Also fix some insignificant mem leaks (which is somewhat moot, because as soon as we leave those error conditions, the process will be terminating, but what the heck, might as well fix these while I was in the file for the \0-termination issue...). This commit was SVN r28199.	2013-03-21 16:05:50 +00:00
Jeff Squyres	562db0dd11	Fix CID 741328: remove some dead code This commit was SVN r28192.	2013-03-21 11:15:06 +00:00
Ralph Castain	fa13d27238	Avoid double-release in error path This commit was SVN r28190.	2013-03-20 21:00:59 +00:00
Ralph Castain	147c6ff9e7	Clean out the cruft leftover from the use_common_ports experiment cmr:v1.7 This commit was SVN r28184.	2013-03-20 15:07:43 +00:00
Ralph Castain	a4b6fb241f	Remove all remaining vestiges of the Windows integration This commit was SVN r28137.	2013-02-28 17:31:47 +00:00
Ralph Castain	63727aa714	Use a non-blocking send in show_help as it could be called from inside an event This commit was SVN r28135.	2013-02-28 17:19:18 +00:00
Ralph Castain	cf9796accd	Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes This commit was SVN r28134.	2013-02-28 01:35:55 +00:00
Ralph Castain	347df93cd4	Handle the case of someone specifying a directory for the application. Ensure we get a non-zero exit status and clarify the error message. cmr:v1.7 This commit was SVN r28119.	2013-02-27 01:36:21 +00:00
Ralph Castain	f36312ee6f	Continue cleanup - this time, start working on the "without full support" flags in ORTE. Remove no-longer-needed configure.m4 files from the ess and errmgr. In the former case, since all priorities are now the same (given the removal of the cnos component), configure priorities are no longer required. This commit was SVN r28118.	2013-02-26 21:27:48 +00:00
Ralph Castain	74a3ece313	Remove unused component This commit was SVN r28117.	2013-02-26 20:58:43 +00:00
Ralph Castain	8d2fa3693b	First cut at removing the native Windows support. Remove all the Windows-specific components, and the .windows files sprinkled around. Remove the Windows platform files and MTT scripts. Update the NEWS to point Windows users to the cygwin package. This commit was SVN r28116.	2013-02-26 20:44:56 +00:00
Ralph Castain	bd9265c560	Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data. A few changes were required to support this move: 1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution). 2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time. 3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. 
This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you separately specify the "store" and "fetch" ordering. 4. the ORTE grpcomm and ess/pmi components, and the nidmap code, were updated to work with the new db framework and to specify internal/global storage options. No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework. This commit was SVN r28112.	2013-02-26 17:50:04 +00:00
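The "select many" arrangement with separate store and fetch orderings, as described in the commit above, might look roughly like the sketch below. All class and method names are hypothetical stand-ins rather than the opal db API, and the rule that a store goes to the first component that accepts it is an assumption made for the example.

```python
class HashStore:
    """Stand-in for internal storage, indexed by an opaque 64-bit id."""

    def __init__(self):
        self._data = {}

    def store(self, ident, key, value):
        self._data[(ident, key)] = value
        return True

    def fetch(self, ident, key):
        return self._data.get((ident, key))


class GlobalStore(HashStore):
    """Stand-in for a global system such as PMI; counts its lookups."""

    def __init__(self):
        super().__init__()
        self.lookups = 0

    def fetch(self, ident, key):
        self.lookups += 1
        return super().fetch(ident, key)


class Db:
    """Framework front end with independent store and fetch orderings."""

    def __init__(self, store_order, fetch_order):
        self.store_order = store_order
        self.fetch_order = fetch_order

    def store(self, ident, key, value):
        # Assumption: the first component that accepts the data keeps it.
        for component in self.store_order:
            if component.store(ident, key, value):
                return True
        return False

    def fetch(self, ident, key):
        # Check internal storage first, then fall through to the global system.
        for component in self.fetch_order:
            value = component.fetch(ident, key)
            if value is not None:
                return value
        return None


local, glob = HashStore(), GlobalStore()
db = Db(store_order=[local], fetch_order=[local, glob])
glob.store(7, "hostname", "node01")      # data published by a peer
db.store(7, "bind_level", "core")        # data kept internally
```

A fetch of `bind_level` is answered from internal storage without touching the global component, while a fetch of `hostname` misses locally and falls through to it.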
Jeff Squyres	9bd4b814db	Fix one more nroff macro issue This commit was SVN r28090.	2013-02-21 17:38:06 +00:00
Jeff Squyres	76fcd42bc3	Fix minor nroff macro issues. This commit was SVN r28088.	2013-02-21 17:35:36 +00:00
Jeff Squyres	12e047e594	Update documentation for rankfiles in orterun.1: * Add a little more description of what rankfiles are * Update that we use logical numbering for socket:core notation * Mention +nX notation This commit was SVN r28067.	2013-02-16 17:52:30 +00:00
Ralph Castain	c0b670bea8	I guess some profiling tools and debuggers require that the argv[0] of each rank be unique so they can create a filename based on that value. For those obscure cases, provide an mpirun cmd line option that indexes each argv[0] by rank This commit was SVN r28064.	2013-02-15 20:20:49 +00:00
Ralph Castain	b9897267ef	Cleanup report-bindings so it always reports the actual binding instead of what was requested. Ensure we don't report twice if it is an MPI process being launched. This commit was SVN r28057.	2013-02-14 17:24:28 +00:00
Ralph Castain	744ed49b2d	Begin cleanup of the thread_lock calls in ORTE. We'll ignore the ones in the rml/oob for now as that code block is being rewritten anyway. This commit was SVN r28053.	2013-02-13 01:53:12 +00:00
Brian Barrett	504a6d036f	* Rather than use the extra_includes directive, add the extra includes (which is really just -I${includedir}/openmpi/ for devel headers) to CPPFLAGS, since all the other necessary -Is for devel headers (like libevent and hwloc) are added to CPPFLAGS. * Clean up ${includedir} and ${libdir} for script wrapper compilers * Update script wrapper compilers to work like the C wrapper compilers w.r.t static and dynamic linking * Remove the ORTE script wrapper compilers since they didn't support the ${includedir} stuff and Ralph said they weren't used anymore. This commit was SVN r28052.	2013-02-13 00:33:05 +00:00
Joshua Ladd	70ad711337	Backing out the Open SHMEM project This commit was SVN r28050.	2013-02-12 17:45:27 +00:00
Mike Dubman	ff384daab4	Added new project: oshmem. This commit was SVN r28048.	2013-02-12 15:33:21 +00:00
Ralph Castain	b360156a37	Extend print coverage to all types This commit was SVN r28012.	2013-02-01 14:21:06 +00:00
Ralph Castain	afb0db5b6f	Okay, Jeff - just for you...flow the show help thru the orte functions so help messages will be aggregated This commit was SVN r28007.	2013-02-01 00:35:48 +00:00
Ralph Castain	53e0ed71b0	Disqualify slurm module even if slurm support was configured into the build if we don't have an allocation and haven't enabled dynamic allocations This commit was SVN r27995.	2013-01-31 18:15:47 +00:00
Ralph Castain	166f512924	Add some useful debug to the heartbeat sensor This commit was SVN r27994.	2013-01-31 18:01:13 +00:00
Ralph Castain	ca9605773b	If sensors are enabled, then the daemons need to have their proc->node field linked to their local node object This commit was SVN r27991.	2013-01-31 16:38:57 +00:00
Ralph Castain	c87fa68f9b	Cleanup the resource usage sensor, letting the db handle any printing requests. This commit was SVN r27990.	2013-01-31 15:20:56 +00:00
Ralph Castain	9625757a71	Add new database component for printing "add_log" info This commit was SVN r27989.	2013-01-31 15:19:39 +00:00
Ralph Castain	8e8e95ca6b	Silence error report - just because someone only defines ipv4 static ports doesn't make a fatal error This commit was SVN r27976.	2013-01-29 23:48:22 +00:00
Jeff Squyres	8e25b927ab	Clean some minor warnings: remove variables that were set but never used. This commit was SVN r27974.	2013-01-29 23:35:42 +00:00
Ralph Castain	112f8eedb1	Handle the case where rankfile is providing the allocation This commit was SVN r27971.	2013-01-29 20:37:58 +00:00
Nathan Hjelm	666bd826dc	fix alps configury This commit was SVN r27962.	2013-01-29 15:44:30 +00:00
Brian Barrett	b8442ba505	Revamp the handling of wrapper compiler flags. The user flags, main configure flags, and mca flags are kept separate until the very end. The main configure wrapper flags should now be modified by using the OPAL_WRAPPER_FLAGS_ADD macro. MCA components should either let <framework>_<component>_{LIBS,LDFLAGS} be copied over OR set <framework>_<component>_WRAPPER_EXTRA_{LIBS,LDFLAGS}. The set of situations in which WRAPPER CPPFLAGS can be set by MCA components was made very small to match the one use case where it makes sense. This commit was SVN r27950.	2013-01-29 00:00:43 +00:00
Ralph Castain	cfaefb3286	Remove the only place where PMI was used outside a component, and relocate that code to common/pmi. This commit was SVN r27944.	2013-01-28 20:14:51 +00:00
Brian Barrett	f42783ae1a	Move the RTE framework change into the trunk. With this change, all non-CR runtime code goes through one of the rte, dpm, or pubsub frameworks. This commit was SVN r27934.	2013-01-27 23:25:10 +00:00
Ralph Castain	6eaf601ae6	Good ol' Cray changed the way node/cpu allocation is handled in their latest release of ALPS, and so our allocator is broken. Adjust for the revised method, but preserve the older method for those Cray users who have not updated their system. cmr:v1.7 This commit was SVN r27911.	2013-01-25 21:53:31 +00:00
Ralph Castain	f6b4db0b79	Fix rank_file operations. We changed the syntax to use semi-colons between multiple slot assignments so that we could use the comma to separate specific cores, but somehow the flex definitions didn't get updated to accept that character. We also incorrectly zero'd the bitmap between slot assignment sections, and so multiple slot assignments only wound up making the last one in the list. This commit was SVN r27908.	2013-01-25 18:33:25 +00:00
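The bitmap bug described in the rank_file commit above can be reproduced in a few lines. The slot syntax used here (`socket:core,core` sections separated by semicolons) is a simplified assumption for illustration, and `parse_slots` is a hypothetical helper, not the flex-based parser.

```python
def parse_slots(spec, reset_each_section=False):
    """Collect (socket, core) pairs from a simplified slot specification.

    Hypothetical helper. With reset_each_section=True it mimics the bug:
    the bitmap is cleared at every section, so only the last one survives.
    """
    bitmap = set()
    for section in spec.split(";"):      # semicolons separate slot assignments
        if reset_each_section:
            bitmap = set()               # the bug: earlier sections are lost
        socket, cores = section.split(":")
        for core in cores.split(","):    # commas separate specific cores
            bitmap.add((int(socket), int(core)))
    return bitmap


fixed = parse_slots("0:0,1;1:2")
buggy = parse_slots("0:0,1;1:2", reset_each_section=True)
```

With the reset in place only the last section, `(1, 2)`, remains, which matches the symptom reported in the commit message.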
Ralph Castain	2504da1ac9	Remove stale code - message arrival time doesn't really mean much anymore. This commit was SVN r27905.	2013-01-24 23:02:02 +00:00
Brian Barrett	0e799a93c3	Automake will ship the .in file whether or not the conditional is taken, so don't install orte_wrapper_script when it's not used This commit was SVN r27902.	2013-01-24 21:36:25 +00:00
Ralph Castain	9bfb2b989b	Silence warning This commit was SVN r27901.	2013-01-24 19:38:51 +00:00
Ralph Castain	4b310473a1	Correct the computation of the daemon vpid cmr:v1.7 This commit was SVN r27899.	2013-01-24 18:04:53 +00:00
Ralph Castain	b403ca5bd8	Silence warning This commit was SVN r27897.	2013-01-23 22:17:08 +00:00
Ralph Castain	4d34d30a97	Silence warning This commit was SVN r27896.	2013-01-23 22:16:48 +00:00
Ralph Castain	6e2cabb87f	Remove duplicate code This commit was SVN r27889.	2013-01-23 02:07:06 +00:00
Ralph Castain	a591fbf06f	Add initial support for dynamic allocations. At this time, only Slurm supports the new capability, which will be included in an upcoming release. Add hooks for supporting dynamic allocation and deallocation to support application-driven requests and fault recovery operations. This commit was SVN r27879.	2013-01-20 00:33:42 +00:00
Ralph Castain	e4673f3283	Add new job state This commit was SVN r27878.	2013-01-20 00:30:27 +00:00
Ralph Castain	7aa80b984d	Add new test program This commit was SVN r27877.	2013-01-20 00:29:45 +00:00
Ralph Castain	7102d7c5f7	ick - brain is fried. take that test out as it isn't needed on a regular basis This commit was SVN r27875.	2013-01-19 14:48:31 +00:00
Ralph Castain	38786457cb	Add new test This commit was SVN r27874.	2013-01-19 14:46:23 +00:00
Ralph Castain	73387e50e2	Add missing variable def - thanks to Paul Hargrove for spotting. This commit was SVN r27865.	2013-01-18 14:32:53 +00:00
George Bosilca	e69dc00460	Don't duplicate headers or global variables. This commit was SVN r27864.	2013-01-18 11:51:56 +00:00
Ralph Castain	c96cc2d5a0	In order to properly connect to debuggers like STAT, we need to get the hostname in its unstripped version for the MPIR_proctab. Unfortunately, we need a stripped version for Cray's alps launcher. So when we are stripping the hostname prefix, retain alias hostnames and add the ability to specify an alias to use in the proctab. This commit was SVN r27863.	2013-01-18 05:00:05 +00:00
Ralph Castain	54266837e9	Remove use of param_find function as that function will be disappearing This commit was SVN r27831.	2013-01-15 19:50:38 +00:00
Ralph Castain	5b8de0b9f4	Ouch - opal_progress calls event_loop with a NO_BLOCK flag. So when run without progress threads, the ORTE tools were not blocking in the event lib as they should be. Avoid calling opal_progress inside ORTE by directly using the event_loop call instead of ORTE_WAIT_FOR_COMPLETION as parts of the OMPI layer are using that macro. Thanks to George for spotting the problem. This commit was SVN r27815.	2013-01-14 23:06:42 +00:00
Ralph Castain	97ee683275	Track parentage of procs This commit was SVN r27758.	2013-01-08 04:41:12 +00:00
Ralph Castain	aea6787918	Add new routed component with self-healing connections - based on radix component - for use in monitoring system This commit was SVN r27757.	2013-01-08 04:40:35 +00:00
Ralph Castain	c9a596b487	Remove unused var This commit was SVN r27756.	2013-01-08 04:39:30 +00:00
Ralph Castain	beddf3b379	Add required rml tag This commit was SVN r27751.	2013-01-05 06:32:20 +00:00
Ralph Castain	bee8bf5d8f	Update the sensor framework to report stats back to the HNP if requested by including the data in heartbeats. This commit was SVN r27748.	2013-01-05 06:30:20 +00:00
Ralph Castain	c71e119bbb	Extend the db framework to add support for logging data to databases without duplicating all the modex-related storage. This commit was SVN r27746.	2013-01-05 06:28:09 +00:00
George Bosilca	34eecb8956	Be more explicit about the operation (store or update). complain loudly if something goes wrong. This commit was SVN r27743.	2013-01-04 20:47:25 +00:00
Ralph Castain	cc29f8ff95	Attempt to fix the stupid Cray PMI problem This commit was SVN r27742.	2013-01-04 02:53:42 +00:00
Nathan Hjelm	6a9ab9b221	Change orte_startup_timeout to be in seconds and remove the 10 second maximum This commit was SVN r27741.	2013-01-03 23:56:34 +00:00
Ralph Castain	81a8e21939	Need to have the event thread running during init/finalize, but we still have a problem with cleanup - so comment out the event_base_free for now. This commit was SVN r27738.	2013-01-03 02:16:57 +00:00
Ralph Castain	c65de32218	Cleanup the PMI subsystems to support Sam's "rml-less" shared memory wireup. Only retrieve keys that are specifically requested, and only when they are requested. Let string values be segmented across multiple keys, but don't do it for anything else. This commit was SVN r27737.	2013-01-03 02:16:10 +00:00
Ralph Castain	c1690f403e	Remove non-existent file This commit was SVN r27730.	2012-12-29 02:21:50 +00:00
Ralph Castain	68329b516c	Cleanup stale test codes This commit was SVN r27729.	2012-12-28 16:52:51 +00:00
Ralph Castain	d1163ebbf2	Ensure we cleanup DFS worker threads during finalize to avoid segfaulting in MCA param cleanup This commit was SVN r27723.	2012-12-25 21:17:35 +00:00
Ralph Castain	64da742d5f	Remove the orte_finalize_event variable - no longer needed This commit was SVN r27722.	2012-12-25 19:33:20 +00:00
Ralph Castain	cada035f38	Fix the segfault problem in the orteds - turns out it only occurred with progress threads enabled. Ensure the thread gets started at the right time (at the end of init), although the event base gets created earlier. Remove the finalize event as we can instead use the loopbreak call to exit the event loop. This commit was SVN r27721.	2012-12-25 19:30:18 +00:00
Ralph Castain	c8e34813b6	THIS IS A TEMPORARY FIX - do not finalize opal as the parameter system has been broken and will segfault when finalized. THIS PATCH MUST BE REMOVED WHEN THE PARAMETER SYSTEM HAS BEEN FIXED. This commit was SVN r27720.	2012-12-24 18:42:19 +00:00
Ralph Castain	72bea688f1	Fix typo This commit was SVN r27717.	2012-12-23 18:13:39 +00:00
Ralph Castain	852a709c0e	Add libopen-pal to the libraries as all these tools directly reference OPAL functions, and the list of OS's that don't support indirect linking grows (Mac and Ubuntu, for now). This commit was SVN r27716.	2012-12-23 15:54:05 +00:00
Jeff Squyres	b29b852281	Consolidate all the opal/orte/ompi .m4 files back to the top-level config/ directory. We split them apart a while ago in the hopes that it would simplify things, but it didn't really (e.g., because there were still some ompi/opal .m4 files in the top-level config/ directory, resulting in developer confusion where any given m4 macro was defined). So this commit consolidates them back into the top-level directory for simplicity. There are still (at least) two changes that would be nice to make: 1. Split any generated .m4 file (e.g., autogen-generated .m4 files) into a separate directory somewhere so that a top-level -Iconfig/ will only get our explicitly defined macros, not the autogen stuff (e.g., with libevent2019 needing to get the visibility macro, but NOT all the autogen-generated inclusion of component configure.m4 files). 2. Change configure to be of the form: {{{ # ...a small amount of preamble/setup... OPAL_SETUP m4_ifdef([project_orte], [ORTE_SETUP]) m4_ifdef([project_ompi], [OMPI_SETUP]) # ...a small amount of finishing stuff... }}} I doubt we'll ever get anything as clean as that, but that would be the goal to shoot for. This commit was SVN r27704.	2012-12-19 00:00:36 +00:00
Ralph Castain	ab73d11368	Oops - push missing definitions This commit was SVN r27688.	2012-12-18 16:43:03 +00:00
Ralph Castain	c5ba59ba67	Remove stale component This commit was SVN r27684.	2012-12-18 04:01:16 +00:00
Ralph Castain	0427a478b2	Remove stale component This commit was SVN r27683.	2012-12-18 04:00:51 +00:00
Ralph Castain	82f1ba0ea8	Fix static port usage, ensure that both ipv4 and ipv6 are given if ipv6 was enabled This commit was SVN r27682.	2012-12-18 03:59:49 +00:00
Ralph Castain	2fdd367aa9	Refs trac:3429 Fix bug reported by FreyGuy19713: in cases where HNP node has multiple entries in a hostfile or other allocation, we need to track the total slots allocated to that node. This commit was SVN r27673. The following Trac tickets were found above: Ticket 3429 --> https://svn.open-mpi.org/trac/ompi/ticket/3429	2012-12-14 17:00:44 +00:00
Jeff Squyres	c5b0bcd9f7	Refs trac:3422 * Add some comments in the -wrapper-data-txt.in files just so that someone doesn't forget in the future why we link in what we do in the MPI and ORTE wrapper compilers. Update ompi_wrapper_script.in to match the new behavior. * Update orte_wrapper_script.in to support --openmpi:linkall (which is a no-op in this case) This commit was SVN r27672. The following Trac tickets were found above: Ticket 3422 --> https://svn.open-mpi.org/trac/ompi/ticket/3422	2012-12-14 16:34:20 +00:00
Jeff Squyres	f779b1ded9	Put back the static-library-detection stuff from r27668, with some additional functionality. Rationale (refs trac:3422): * Normal MPI applications only ever use the MPI API. Hence, -lmpi is sufficient (they'll never directly call ORTE or OPAL functions). This is arguably the most common case. * That being said, we do have some test programs (e.g., those in orte/test/mpi) that call MPI functions but also call ORTE/OPAL functions. I've also written the occasional MPI test program that calls opal_output, for example (there even might be a few tests in the IBM test suite that directly call ORTE/OPAL functions). * Even though this is not a common case, these applications should also compile/link with mpicc. * So we should add a --openmpi:linkall option that will also link in whatever is necessary to call ORTE/OPAL functions * Yes, we could hard-code "-lopen-rte -lopen-pal" in Makefiles, but we do reserve the right to change those library names and/or add others someday, so it's better to abstract out the names and let the wrapper supply whatever is necessary. * ORTE programs, however, are different. They almost always call OPAL functions (e.g., if they want to send a message, they must use the OPAL DSS). As such, it seems like the ORTE programs should always link in OPAL. Therefore: * Add undocumented --openmpi:linkall flag to the wrapper compilers. See the comment in opal_wrapper.c for an explanation of what it does. This flag is only intended for Open MPI developers -- not end users. That's why it's undocumented. * Update orte/test/mpi/Makefile.am to add --openmpi:linkall * Make ortecc/ortec++'s wrapper data text files always explicitly link in libopen-pal This commit was SVN r27670. The following SVN revision numbers were found above: r27668 --> open-mpi/ompi@cf845897aa The following Trac tickets were found above: Ticket 3422 --> https://svn.open-mpi.org/trac/ompi/ticket/3422	2012-12-13 22:31:37 +00:00
Jeff Squyres	cf845897aa	Temporarily revert r27662 and r27667 because something wonky is happening on OS X. Grumble... This commit was SVN r27668. The following SVN revision numbers were found above: r27662 --> open-mpi/ompi@97cc916007 r27667 --> open-mpi/ompi@529f6244ca	2012-12-11 23:08:14 +00:00
Jeff Squyres	97cc916007	Per discussion at the Open MPI developer meeting last week: 1. Restore libopen-pal.la, libopen-rte.la, and libmpi.la to be separate entities (i.e., don't have libopen-rte.la include libopen-pal.la, and don't have libmpi.la include libopen-pal.la). Yay! 2. Consequently, make the wrapper compilers look for flags indicating that the user wants to compile statically (currently: -static, --static, -Bstatic, and "-Wl," in front of all of those). If so, follow a 6-way matrix for determining which libraries to list on the underlying command line. 3. To support that, add the name of a token static and dynamic library to look for in each of the wrapper compiler data files. 4. Fix a long-standing typo in the opalcc wrapper data file. This commit was SVN r27662.	2012-12-11 01:46:59 +00:00
Ralph Castain	1e92aa2b66	Enable multiple worker threads for processing DFS requests This commit was SVN r27659.	2012-12-09 02:54:19 +00:00
Ralph Castain	c26ed7dcdd	Fix comm_spawn when ORTE progress thread is enabled by ensuring that all operations on the global list of active collectives are done in events to avoid conflicts. This commit was SVN r27658.	2012-12-09 02:53:20 +00:00
Nathan Hjelm	3e1b13b13a	Re-add support for old flex (2.5.4a and earlier) while still cleaning up properly in new flex. This commit was SVN r27657.	2012-12-07 00:12:43 +00:00
Ralph Castain	1237f8db57	Extend the ras module interface to include the orte_job_t being allocated so that dynamic allocations can be supported This commit was SVN r27627.	2012-11-23 13:50:10 +00:00
George Bosilca	994d1aba50	Nothing. This commit was SVN r27626.	2012-11-21 20:07:20 +00:00
Nathan Hjelm	a427a7e727	do not include c99 flag in compiler wrappers This commit was SVN r27625.	2012-11-20 19:33:14 +00:00
Ralph Castain	7a5f6b584c	Have orte-info show thread support as well This commit was SVN r27624.	2012-11-18 18:15:22 +00:00
Ralph Castain	43f883cb42	Add some more detailed error output to the db_hash component and nidmap code. Ensure the local nodename is included in the HNP's aliases This commit was SVN r27622.	2012-11-18 17:57:19 +00:00
Ralph Castain	f2ec35536e	Fix a bug that prevented MCA params from being forwarded to daemons upon launch cmr:v1.7 This commit was SVN r27621.	2012-11-18 17:55:26 +00:00
Ralph Castain	e11f32038a	Add an MCA param to retain all aliases based on IP addrs for node names so that procs can look them up by interface, if desired. If the param is set, pass aliases around to all daemons and procs for local use This commit was SVN r27619.	2012-11-16 04:04:29 +00:00
Ralph Castain	da6428a822	Add an MCA param to help debug the ORTE progress thread This commit was SVN r27614.	2012-11-15 15:54:38 +00:00
Ralph Castain	5241925b62	Add the dfs to ompi_info This commit was SVN r27613.	2012-11-15 15:54:07 +00:00
Ralph Castain	3cecc1569b	Fix segfault if no file_maps were pushed This commit was SVN r27612.	2012-11-15 15:39:17 +00:00
Ralph Castain	fe6dfad625	Update DFS to support multi-node operations This commit was SVN r27594.	2012-11-12 02:54:53 +00:00
Ralph Castain	fefec03e78	Enable all ORTE tools to use progress threads if they are enabled This commit was SVN r27593.	2012-11-12 02:54:09 +00:00
Ralph Castain	a6325e4546	Silence compiler warning This commit was SVN r27590.	2012-11-12 02:51:29 +00:00
Ralph Castain	26f1cd0909	Fix compiler warnings This commit was SVN r27588.	2012-11-12 02:50:45 +00:00
Ralph Castain	bd887f7f56	Add a new "test" component to the DFS that treats all files as remote in order to test the app-to-daemon interactions on a single machine. Set a global param to indicate we are using staged execution. Add a param to indicate it is okay for non-MPI processes to execute without finalizing. Cleanup file map load and fetch operations. This commit was SVN r27587.	2012-11-10 14:09:12 +00:00
Ralph Castain	81d0b06842	Strip the domain info from the hostname if that option is specified, protecting IP address-based names This commit was SVN r27586.	2012-11-10 14:05:27 +00:00
Ralph Castain	615cc66b44	Protect the HNP cleanup in cases where no session dirs are created This commit was SVN r27585.	2012-11-10 14:03:07 +00:00
Ralph Castain	fd632147df	Per patch from Nathan, with a few fixes, cleanup the orte-info tool This commit was SVN r27581.	2012-11-10 04:11:40 +00:00
Nathan Hjelm	e0f5137e46	add prototypes for lex destroy functions This commit was SVN r27580.	2012-11-09 22:00:27 +00:00
Nathan Hjelm	a754674fd7	Per the specification for putenv (http://pubs.opengroup.org/onlinepubs/009604599/functions/putenv.html ) the string given to putenv becomes part of the environment. The string must not be changed or freed. cmr:v1.7 This commit was SVN r27578.	2012-11-09 16:33:14 +00:00
Nathan Hjelm	8658bbc902	instead of relying on yyterminate to clean up the lex context call the destroy functions directly (after closing the file) This commit was SVN r27577.	2012-11-09 16:10:55 +00:00
Nathan Hjelm	842caae4c7	Fix a small leak in orte/util/name_fns.c cmr:v1.7 This commit was SVN r27576.	2012-11-07 23:59:49 +00:00
Ralph Castain	9b729794f2	A prior commit apparently broke the trunk when something was inadvertently left behind - so remove a reference to a no-longer-existing function This commit was SVN r27574.	2012-11-07 11:11:05 +00:00
Nathan Hjelm	7fb5caea92	Remove the finish_parsing function from various .l files. The function is incomplete (doesn't clean up the lex state) and should be replaced by *_yylex_destroy which correctly cleans up the state. Checked with the flex 2.5.35. Verified with valgrind that this fixes several "still reachable" leaks. cmr:v1.7 This commit was SVN r27571.	2012-11-06 19:26:14 +00:00
Nathan Hjelm	bdedd8b0d3	Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1. Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this, several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the semantics of mca_base_components_open/close as they are now symmetric in their functionality. This commit was SVN r27570.	2012-11-06 19:09:26 +00:00
Ralph Castain	27b41a7db4	If the nodename is an IP address, we need to retain the full name (even if keep_fqdn is false) so that the ssh tree spawn can proceed. cmr:v1.7 This commit was SVN r27561.	2012-11-05 16:59:53 +00:00
Ralph Castain	3812979315	Remove stale file - we removed the size functions awhile back This commit was SVN r27551.	2012-11-01 03:36:51 +00:00
Brian Barrett	e61c00212d	Add files found in svn but not tarball This commit was SVN r27549.	2012-11-01 02:27:03 +00:00
Brian Barrett	dd907e8d4c	Remove unused macro This commit was SVN r27547.	2012-11-01 02:04:57 +00:00
Nathan Hjelm	2acd0f83de	Revert "Revert r27451 and r27456 - the cmd line parser is incorrectly marking the application as an MCA parameter". It appears the problem was not with the command line parser but the rsh plm. I don't know why this problem was not occurring before the command line parser changes but it appears to be resolved now. This commit was SVN r27527. The following SVN revision numbers were found above: r27451 --> open-mpi/ompi@d59034e6ef r27456 --> open-mpi/ompi@ecdbf34937	2012-10-30 19:45:18 +00:00
Nathan Hjelm	df9bd0ed59	fix bug in plm/rsh that could add extraneous mca options to the orted argv cmr:v1.7 This commit was SVN r27526.	2012-10-30 19:40:04 +00:00
Ralph Castain	7c3a2c3c92	Allow PMI support to find subdirs named lib64 instead of lib - thanks to Guillaume.Papaure of Bull for the patch This commit was SVN r27519.	2012-10-30 17:38:57 +00:00
Ralph Castain	a080de188f	Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists. This commit was SVN r27516.	2012-10-29 23:11:30 +00:00

4204 commits