openmpi

Author	SHA1	Message	Date
Ralph Castain	98580c117b	Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine. Remove some stale configure.m4's we no longer need. Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages. This commit was SVN r27165.	2012-08-28 21:20:17 +00:00
Ralph Castain	aadfe1b61e	Fix a missing test that breaks novm operation. CMR:v1.7 This commit was SVN r27163.	2012-08-28 21:13:57 +00:00
Ralph Castain	cb48fd52d4	Implement the MPI_Info part of MPI-3 Ticket 313. Add an MPI_Info object MPI_INFO_GET_ENV that contains a number of run-time related pieces of info. This includes all the required ones in the ticket, plus a few that specifically address recent user questions: "num_app_ctx" - the number of app_contexts in the job "first_rank" - the MPI rank of the first process in each app_context "np" - the number of procs in each app_context Still need clarification on the MPI_Init portion of the ticket. Specifically, does the ticket call for returning an error if someone calls MPI_Init more than once in a program? We set a flag to tell us that we have been initialized, but currently never check it. This commit was SVN r27005.	2012-08-12 01:28:23 +00:00
George Bosilca	ba879c2c51	Remove the unused map. This commit was SVN r26960.	2012-08-07 12:06:13 +00:00
Ralph Castain	53b1a1c976	Cleanly error out when someone asks to map-to <object> if that object doesn't exist on a node. This commit was SVN r26950.	2012-08-04 21:52:36 +00:00
Ralph Castain	61b09a132b	Fix bynode mapping of multiple app-contexts This commit was SVN r26949.	2012-08-03 21:45:40 +00:00
Ralph Castain	96f6f94c24	Ensure we don't get trapped in an infinite loop when ranking bynode if something isn't right This commit was SVN r26948.	2012-08-03 21:45:10 +00:00
Ralph Castain	431d5361ed	For those who really preferred our prior mode of operation that mapped procs and only launched daemons on the nodes that had procs on them, introduce the "novm" state machine component. This recreates the old mode of operation by re-ordering the launch sequence so that we allocate, then map, and then launch daemons only on the reqd nodes (instead of across the entire allocation). This commit was SVN r26946.	2012-08-03 16:30:05 +00:00
Shiqing Fan	12d99a9ebb	Update the hwloc build on Windows and related files. This commit was SVN r26818.	2012-07-20 12:14:28 +00:00
Ralph Castain	b0938a254e	Don't use a mutex where it isn't needed This commit was SVN r26521.	2012-05-29 20:21:11 +00:00
Ralph Castain	32b66c166b	Missed one blasted spot This commit was SVN r26520.	2012-05-29 20:20:10 +00:00
Ralph Castain	9bedb25dda	Cleanup some compiler warnings, some of which are actual logic errors This commit was SVN r26519.	2012-05-29 20:11:51 +00:00
Jeff Squyres	2ba10c37fe	Per RFC, bring in the following changes: * Remove paffinity, maffinity, and carto frameworks -- they've been wholly replaced by hwloc. * Move ompi_mpi_init() affinity-setting/checking code down to ORTE. * Update sm, smcuda, wv, and openib components to no longer use carto. Instead, use hwloc data. There are still optimizations possible in the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old carto-based code found out how many NUMA nodes were ''available'' -- not how many were used ''in this job''. The new hwloc-using code computes the same value -- it was not updated to calculate how many NUMA nodes are used ''by this job.'' * Note that I cannot compile the smcuda and wv BTLs -- I ''think'' they're right, but they need to be verified by their owners. * The openib component now does a bunch of stuff to figure out where "near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors (I do not have a NUMA machine with an OpenFabrics device that is a non-uniform distance from multiple different NUMA nodes). * Completely rewrite the OMPI_Affinity_str() routine from the "affinity" mpiext extension. This extension now understands hyperthreads; the output format of it has changed a bit to reflect this new information. * Bunches of minor changes around the code base to update names/types from maffinity/paffinity-based names to hwloc-based names. * Add some helper functions into the hwloc base, mainly having to do with the fact that we have the hwloc data reporting ''all'' topology information, but sometimes you really only want the (online \| available) data. This commit was SVN r26391.	2012-05-07 14:52:54 +00:00
Ralph Castain	bd8b4f7f1e	Sorry for mid-day commit, but I had promised on the call to do this upon my return. Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code. Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch. This commit was SVN r26242.	2012-04-06 14:23:13 +00:00
Ralph Castain	811413e9bc	Correctly handle multiple cpu-set ranges. Correctly support optional binding directives combined with cpu-set. This commit was SVN r26187.	2012-03-23 14:50:41 +00:00
Ralph Castain	ce0caf7567	Support -cpu-set by binding to the specified cpus in the absence of any other binding directive. Allows users to subdivide nodes for multiple parallel mpirun invocations. This commit was SVN r26186.	2012-03-23 14:05:52 +00:00
Ralph Castain	b3aabf1565	Cleanup the --without-hwloc build. Thanks to Paul Hargrove for reporting it broken. This commit was SVN r25931.	2012-02-15 11:08:57 +00:00
Ralph Castain	bba6508b4b	Handle the default hostfile case a little better... This commit was SVN r25928.	2012-02-15 03:33:49 +00:00
Ralph Castain	f14c4be580	Correct the ordering logic so the list gets correctly built in daemon vpid order This commit was SVN r25818.	2012-01-30 16:25:07 +00:00
Shiqing Fan	bfbd3c67a5	Add a windows file into the tarball. This commit was SVN r25811.	2012-01-29 10:12:02 +00:00
Ralph Castain	3f31feee6f	Handle the case where a user's rankfile specifies only cpus, and not socket:cpu pairs. This commit was SVN r25803.	2012-01-27 12:21:45 +00:00
Ralph Castain	ef94e606c7	Add some debug This commit was SVN r25791.	2012-01-26 19:23:32 +00:00
Shiqing Fan	2c9a4beffd	Add and remove a few components for windows build. This commit was SVN r25775.	2012-01-25 09:01:27 +00:00
Ralph Castain	477582abef	Grrrr....fix ALL the cases where the membind warning occurs. This commit was SVN r25715.	2012-01-11 23:51:18 +00:00
Ralph Castain	167ad944c4	Surprise, surprise - hwloc treats memory binding as at the thread, not process, level. Thus, hwloc always sets the membind proc-level support flag to false, and indicates actual memory binding support via the thread-level flag. So...just to be safe, test -both- flags and issue the "no support" warning ONLY if both are false. This commit was SVN r25709.	2012-01-11 01:12:57 +00:00
Ralph Castain	2dd2694f25	Fix comm_spawn in oversubscribed conditions. If oversubscription is allowed, let nodes flow into the mapper even if they are oversubscribed, constrained by the slots_max absolute ceiling. Cleanup error messages when comm_spawn fails so it correctly and succinctly reports the error. This commit was SVN r25659.	2011-12-15 18:04:48 +00:00
Ralph Castain	e683b2f9c7	Minor touchup - reset the pointer to the end of the list each time to ensure we get the nodes in correct daemon order This commit was SVN r25651.	2011-12-14 22:16:52 +00:00
Ralph Castain	f531b09a8d	Correctly handle -host and -hostfile options. Ensure the initial vm launch constrains itself to the union of specified hosts if those options are given. Get oversubscribe set correctly for that case. This commit was SVN r25648.	2011-12-14 20:01:15 +00:00
George Bosilca	ac26f58bd7	I guess this wasn't yet ready for prime time. This commit was SVN r25624.	2011-12-12 23:55:11 +00:00
Nathan Hjelm	885d5cbcf8	Enable ptmalloc when using uGNI This commit was SVN r25621.	2011-12-12 20:52:51 +00:00
Nathan Hjelm	be11acf727	bug fix. don't add node to allocated_nodes twice This commit was SVN r25619.	2011-12-12 19:14:41 +00:00
Ralph Castain	7510339725	Remove stale orte_vm_launch param. Add a param that allows users to specify envars to forward/set so they can do it in the MCA param file instead of only via mpirun cmd line. This commit was SVN r25580.	2011-12-06 21:31:22 +00:00
Ralph Castain	15facc4ba6	Fix comm_spawn yet again...add another test This commit was SVN r25579.	2011-12-06 20:15:40 +00:00
Ralph Castain	90b7f2a7bf	The rest of the multi app_context fix. Remove the restriction on number of app_contexts that can have zero np specified as multiple mappers now support that use-case. Update the ranking algorithms to respect and track bookmarks. Ensure we properly set the oversubscribed flag on a per-node basis. This commit was SVN r25578.	2011-12-06 17:28:29 +00:00
Ralph Castain	d9c7764e9b	Remove some debug This commit was SVN r25575.	2011-12-05 22:04:50 +00:00
Ralph Castain	df2f594aa8	Some cleanup associated with multiple app_contexts. Ensure nodes only get entered once into the map. Correctly handle bookmarks. Cleanup tracking of slots_inuse and correct detection of oversubscription. Still need to resolve the ranking issue so it starts at the bookmark, but that will come next. This commit was SVN r25574.	2011-12-05 22:01:08 +00:00
Ralph Castain	07655e2945	Handle the case where the allocator "fibs" to us about the node names. In some cases (ahem...you know who you are!), the allocator will tell us a node number (e.g., "16"). However, the daemon will return a node name (e.g., "nid0016") - leaving us not recognizing its location. So provide a new parameter (can't have too many!) that handles this situation by stripping the prefix from the returned node name. Also do a little cleanup to ensure we cleanly exit from errors, without generating too many annoying messages. This commit was SVN r25562.	2011-12-02 14:10:08 +00:00
Jeff Squyres	ecf6ba910c	Silence a few icc warnings about mixing enums with other types. This commit was SVN r25560.	2011-12-02 13:18:54 +00:00
Ralph Castain	c56acf60ca	Although we never really thought about it, we made an unconscious assumption in the mapper system - we assumed that the daemons would be placed on nodes in the order that the nodes appear in the allocation. In other words, we assumed that the launch environment would map processes in node order. Turns out, this isn't necessarily true. The Cray, for example, launches processes in a toroidal pattern, thus causing the daemons to wind up somewhere other than what we thought. Other environments (e.g., slurm) are also capable of such behavior, depending upon the default mapping algorithm they are told to use. Resolve this problem by making the daemon-to-node assignment in the affected environments when the daemon calls back and tells us what node it is on. Order the nodes in the mapping list so they are in daemon-vpid order as opposed to the order in which they show in the allocation. For environments that don't exhibit this mapping behavior (e.g., rsh), this won't have any impact. Also, clean up the vm launch procedure a little bit so it more closely aligns with the state machine implementation that is coming, and remove some lingering "slave" code. This commit was SVN r25551.	2011-11-30 19:58:24 +00:00
Ralph Castain	9b59d8de6f	This is actually a much smaller commit than it appears at first glance - it just touches a lot of files. The --without-rte-support configuration option has never really been implemented completely. The option caused various objects not to be defined and conditionally compiled some base functions, but did nothing to prevent build of the component libraries. Unfortunately, since many of those components use objects covered by the option, it caused builds to break if those components were allowed to build. Brian dealt with this in the past by creating platform files and using "no-build" to block the components. This was clunky, but acceptable when only one organization was using that option. However, that number has now expanded to at least two more locations. Accordingly, make --without-rte-support actually work by adding appropriate configury to prevent components from building when they shouldn't. While doing so, remove two frameworks (db and rmcast) that are no longer used as ORCM comes to a close (besides, they belonged in ORCM now anyway). Do some minor cleanups along the way. This commit was SVN r25497.	2011-11-22 21:24:35 +00:00
Ralph Castain	866edf6a89	Now that George has found his problem, we no longer need the bozo check. Interesting how these platform-specific issues surface... This commit was SVN r25493.	2011-11-18 17:43:14 +00:00
George Bosilca	b613c7eacb	Fix the issue with the round robin mapper. When mixing different precisions, one should manually promote the participants to the expected type. In this particular example, as opal_list_get_size returns an unsigned long, the computation on the left side is translated to an unsigned. If the hostfile contains more nodes than what is required (via the -np), this leads to a gigantic value for the balance, and breaks the round robin algorithm. This commit was SVN r25492.	2011-11-18 17:03:35 +00:00
Ralph Castain	1e5e9bde77	Add protection against a bozo case where we could end up in an infinite loop while calculating ranks This commit was SVN r25491.	2011-11-18 15:35:55 +00:00
George Bosilca	61f273b987	Do not tolerate uninitialized variables. This commit was SVN r25489.	2011-11-18 10:19:24 +00:00
Ralph Castain	6310361532	At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here: https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation. In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions: 1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior. 2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation. 3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so. As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes. This commit was SVN r25476.	2011-11-15 03:40:11 +00:00
Ralph Castain	fcee46b063	Add an option for printing a diffable process map for testing mappers This commit was SVN r25428.	2011-11-03 14:22:07 +00:00
Ralph Castain	d492b20975	Bozo check for topology info This commit was SVN r25398.	2011-10-30 11:49:38 +00:00
Ralph Castain	4232115a98	Ensure pruning remains within the current job/app being mapped. This commit was SVN r25397.	2011-10-30 00:02:20 +00:00
Ralph Castain	648c85b41b	Add a simple pattern mapper as an example of how to use the topology info to create desired mappings. Let the user specify a pattern based on resource types, and map that pattern across all available nodes as resources permit. Don't automatically display the topology for each node when --display-devel-map is set as it can overwhelm the reader. Use a separate flag --display-topo to get it. This commit was SVN r25396.	2011-10-29 15:12:45 +00:00
Ralph Castain	2958f3de34	Add some clarifying comments and a small efficiency improvement This commit was SVN r25322.	2011-10-18 18:30:43 +00:00
Ralph Castain	ae8e556d14	Okay, once again let's fix the vpid calculator. Identified problem with prior commit (some rmaps components already place their procs in the jdata->procs array, and others don't), so account for those variations. This commit was SVN r25315.	2011-10-18 15:50:11 +00:00
George Bosilca	f28890fbb7	Revert r25302 as it breaks the --bynode option. This commit was SVN r25311. The following SVN revision numbers were found above: r25302 --> open-mpi/ompi@d7a8553179	2011-10-18 02:48:17 +00:00
Ralph Castain	d7a8553179	Fix the mapping algo for computing vpids - it was borked for bynode operations when using nperxxx directives This commit was SVN r25302.	2011-10-17 19:49:04 +00:00
Swen Boehm	08b4322a1a	patched the lex files to not issue the following compiler warning: 'yyunput' defined but not used This commit was SVN r25246.	2011-10-10 18:13:04 +00:00
Ralph Castain	92c7372e20	Per the RFC from Jeff, move hwloc from opal/mca/common to its own static framework ala libevent. Have ORTE daemons collect the topology info at startup and, if --enable-hwloc-xml is set, send that info back to the HNP for later use. The HNP only retains unique topology "templates" to reduce memory footprint. Have the daemon include the local topology info in the nidmap buffer sent to each app so the apps don't all hammer the local system to discover it for themselves. Remove the sysinfo framework as hwloc replaces that functionality. This commit was SVN r25124.	2011-09-11 19:02:24 +00:00
Wesley Bland	4e7ff0bd5e	By popular demand the epoch code is now disabled by default. To enable the epochs and the resilient orte code, use the configure flag: --enable-resilient-orte This will define both: ORTE_ENABLE_EPOCH ORTE_RESIL_ORTE This commit was SVN r25093.	2011-08-26 22:16:14 +00:00
Wesley Bland	09274cd047	Make sure that the epoch is initialized everywhere so we don't get weird output during valgrind. This shouldn't have caused any problems with any actual execution. Just extra warnings in valgrind. This commit was SVN r25015.	2011-08-08 15:11:55 +00:00
Ralph Castain	c3bc33b3fb	Don't be so restrictive - accept "slots" as well as "slot" in rank file This commit was SVN r24954.	2011-07-27 00:45:30 +00:00
Ralph Castain	1405bacd85	Ensure we don't segfault if we report an error This commit was SVN r24890.	2011-07-13 15:00:22 +00:00
Wesley Bland	e1ba09ad51	Add resilience to ORTE. Allows the runtime to continue after a process (or ORTED) failure. Note that more work will be necessary to allow the MPI layer to take advantage of this. Per RFC: http://www.open-mpi.org/community/lists/devel/2011/06/9299.php This commit was SVN r24815.	2011-06-23 20:38:02 +00:00
Ralph Castain	7f2d2e3de7	Track the app_context rank - will equal overall rank for single app_context jobs This commit was SVN r24778.	2011-06-16 20:31:30 +00:00
Ralph Castain	e039c7b7ea	Avoid crashing when debugging rmaps and a non-string resource constraint is given This commit was SVN r24770.	2011-06-10 16:27:30 +00:00
Ralph Castain	bd8d9a943a	Add diagnostics This commit was SVN r24748.	2011-06-05 19:17:56 +00:00
Ralph Castain	8f401a0563	Enable the ability to constrain applications to hosts on the basis of resources. This commit was SVN r24736.	2011-05-28 22:18:19 +00:00
Ralph Castain	dc6f616599	Enable VM launch. For some time, ORTE has had the ability to launch daemons on all nodes prior to launching an application. It has largely been used outside of the OMPI community, and so was never explicitly turned "on" inside OMPI releases. Nevertheless, the code has been there. Allowing VM launches does not require ANY changes to existing PLM components. All that was required was to have orterun launch the daemons as a separate call to orte_plm.spawn -prior- to launching the applications. The rest of the VM support code resides in the rmaps framework: (a) a check when asked to map a job to see if it is the daemon job, and (b) a separate "setup_virtual_machine" mapper in the rmaps base that creates the required map so the PLM's will do the right thing. In order to support those users who have no RM allocation but like to give the allocation in the form of a -host or -hostfile argument to their application, there is a little more code in orterun and the setup_virtual_machine mapper to capture information passed in that manner. This has been tested with rsh and slurm environments, and, since there is nothing environment-specific in the implementation, should work in others as well - but needs to be proven. This commit was SVN r24524.	2011-03-12 22:50:53 +00:00
Ralph Castain	df82e4cd36	Plug a memory leak This commit was SVN r24521.	2011-03-12 15:37:33 +00:00
Ralph Castain	1297acde13	George raised some valid concerns about the extensibility of the revised rmaps framework. Address those by: 1. removing the enum of mapper values 2. change the req_mapper and last_mapper fields to char* so they can hold the component name instead of a mapper flag 3. revise the selection logic in the mapper components to reflect the change. Components now look for their name in the req_mapper field, or to see if other criteria (e.g., npernode) are set that mandate their doing the mapping Several MCA params resided in the rmaps base for historical reasons - they have been in the base since at least the original 1.2 release (and perhaps earlier). However, George correctly pointed out that they really should reside in their respective components. Accordingly, move them to the components, but register synonyms to the old names to avoid breaking backward compatibility. These revisions retain the current functionality of allowing comm_spawn'd jobs to use different mappers than the original job, and for the errmgr to utilize the resilient mapper to recover processes regardless of how they were originally mapped. Given the large number of possible combinations, I am sure that someone will find a corner-case combination of values and selection criteria that cause either no mapper to be selected, or one other than the intended to be used. No one can test all the ways people will use this system, so I expect debugging to continue for awhile. The ability of comm_spawn'd jobs to exploit this functionality relies on changes to the orte_dpm component - this will be committed separately. This commit was SVN r24520.	2011-03-12 05:30:09 +00:00
Ralph Castain	3b4421d8e3	Separately track requested and last-used mapper so we don't lose that info This commit was SVN r24502.	2011-03-09 18:51:36 +00:00
George Bosilca	9bbe00bdc3	Set the return code from the processes upstream. This commit was SVN r24483.	2011-03-03 00:02:21 +00:00
George Bosilca	c6a5f9706a	Thomas's patch: Assume we won't fail unless notified by a child. This commit was SVN r24482.	2011-03-02 23:50:01 +00:00
Josh Hursey	62bba1bf12	Name the enum so that it represents as an actual symbol in gdb, instead of just a number. This commit was SVN r24472.	2011-03-01 21:00:03 +00:00
Ralph Castain	f014284f91	Update resilient recovery mapping algorithm to be a bit more sophisticated. Track the prior node a proc was on so we avoid ricochet effect. Also avoid putting recovering proc onto node that is already occupied by a peer as this degrades fault tolerance. This commit was SVN r24417.	2011-02-20 18:46:21 +00:00
Ralph Castain	ef56e6d78b	Helps to move the pointer This commit was SVN r24414.	2011-02-18 14:01:25 +00:00
Ralph Castain	7b35ada7fc	Fix ricochet effect - move failed procs to next on list instead of loadbalancing This commit was SVN r24413.	2011-02-18 13:11:55 +00:00
Ralph Castain	65ba6af44d	Cleanup our handling of VMs to ensure daemons don't get mapped when operating with a VM. Have each mapper flag it did the map so we can see who did it later. Ensure procs are flagged as "ready to launch". This commit was SVN r24406.	2011-02-16 23:01:57 +00:00
Ralph Castain	9b38525d1e	Remove unused include files This commit was SVN r24394.	2011-02-16 00:32:47 +00:00
Ralph Castain	5120e6aec3	Redefine the rmaps framework to allow multiple mapper modules to be active at the same time. This allows users to map the primary job one way, and map any comm_spawn'd job in a different way. Modules are given the opportunity to map a job in priority order, with the round-robin mapper having the highest default priority. Priority of each module can be defined using mca param. When called, each mapper checks to see if it can map the job. If npernode is provided, for example, then the loadbalance mapper accepts the assignment and performs the operation - all mappers before it will "pass" as they can't map npernode requests. Also remove the stale and never completed topo mapper. This commit was SVN r24393.	2011-02-15 23:24:31 +00:00
Shiqing Fan	f43862420c	Convert the bad dos line endings to unix style for all windows related files. This commit was SVN r24137.	2010-12-02 12:08:08 +00:00
Ralph Castain	30c37ea536	Ensure that the oversubscribed condition of nodes is accurately reported by the mapper, and that the results are communicated and used by the backend orteds when setting sched_yield on local procs. Restores prior behavior that was somehow lost along the way. Includes a patch from Damien Guinier to fix vpid assignments when cpus-per-task is specified. This commit was SVN r24126.	2010-12-01 12:51:39 +00:00
Jeff Squyres	73bcc4a36b	Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers This commit was SVN r23801. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-24 22:53:28 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example, a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Josh Hursey	ba7e94dd89	Some relatively minor C/R related cleanup * Fix a configure warning for checking --enable-ft-thread * In hnp and orted ErrMgr components check to see if other components have already recovered this process before trying to recover it again. * Fix 'npernode' for restarting using the resilient rmaps component * export ompi_info_set, so that internal functionality can use it. This commit was SVN r23535.	2010-07-30 18:59:34 +00:00
Ralph Castain	ad5eaee4c6	Protect against NULL and provide additional resource check/error report This commit was SVN r23432.	2010-07-19 18:33:32 +00:00
Ralph Castain	510ade9503	Do not use nodes that are flagged as down or do-not-use for this map. Modify error output to reflect possible reasons no nodes would be available This commit was SVN r23333.	2010-07-01 19:39:31 +00:00
Ethan Mallove	57eee4d75c	* Can't put var declarations in the middle of code * Use OBJ_RELEASE on data that was OBJ_NEW'd * Limit single-line char width * Use ORTE_ERR_BAD_PARAM on a rankfile typo, not ORTE_ERR_SILENT * Add copyright This commit was SVN r23196.	2010-05-21 15:30:38 +00:00
Ralph Castain	aaaeea6f17	Once again, fix the blasted rank_file mapper. I can't guarantee that I fixed it correctly, but at least now it compiles! This commit was SVN r23190.	2010-05-21 09:46:42 +00:00
Ethan Mallove	e751f3c21c	Add a check for a duplicate rank assignment in the rankfile parser (Fixes trac:2414) This commit was SVN r23186. The following Trac tickets were found above: Ticket 2414 --> https://svn.open-mpi.org/trac/ompi/ticket/2414	2010-05-20 18:38:03 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Ralph Castain	871f445848	Ignore nodes that are "down" when generating maps This commit was SVN r23119.	2010-05-12 18:08:40 +00:00
Ralph Castain	8da781af84	Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools This commit was SVN r22958.	2010-04-12 22:33:09 +00:00
Ralph Castain	d3ed4e68b7	Utilize a non-used mapping policy bit to define a policy that uses only existing alive daemons to support virtual machines and restarting processes on already-active nodes This commit was SVN r22951.	2010-04-10 05:02:47 +00:00
Ralph Castain	a1e82e9d05	Per discussion with Josh, cleanup the errmgr API by creating separate modules for the public vs internal APIs. This mirrors the architecture used in other frameworks that had similar requirements. Remove the orcm errmgr module - moving to the orcm code base so it can utilize orcm communications and not interfere with ompi-related operations. This commit was SVN r22931.	2010-04-05 22:59:21 +00:00
Ralph Castain	1caba7af2f	Fix a bunch of compiler warnings reported by Jeff This commit was SVN r22930.	2010-04-03 00:20:19 +00:00
Ralph Castain	84c7973df8	Update the #procs in the job prior to assigning vpids for each app_context. This commit was SVN r22929.	2010-04-03 00:03:35 +00:00
Ralph Castain	6b43b76f9d	Some updates required for generating a LAM-style virtual machine. Retain the local node if requested. Properly setup the daemon job map for a VM launch. This commit was SVN r22928.	2010-04-03 00:03:01 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Ralph Castain	7ebf72b4aa	Trivial cleanup This commit was SVN r22813.	2010-03-10 18:24:38 +00:00
Ralph Castain	7fd7b7a8cc	Fix the load_balance mapper so that it sets the #procs in the job before attempting to compute vpids This commit was SVN r22812.	2010-03-10 17:52:19 +00:00
Ralph Castain	4355134991	Let the vm launcher specify the mapping policy This commit was SVN r22797.	2010-03-08 19:13:21 +00:00
Ralph Castain	bfa39d7f7e	Update the seq mapper to support lists from -host. Reorg the dash_host code to provide an ordered list as required by the seq mapper This commit was SVN r22795.	2010-03-08 09:54:49 +00:00
Ralph Castain	69fe5ca69b	Correctly compute bynode mapping, even in the presence of a $#$%#@^$ rankfile This commit was SVN r22748.	2010-03-02 05:21:42 +00:00
Ralph Castain	5514d9c673	Fix the stupid rankfile mapper again, hopefully not breaking everything else to accommodate it. Looks like the round-robin mappers still work, at least... This commit was SVN r22746.	2010-03-01 20:40:47 +00:00
Ralph Castain	359dc5cad3	Complete the app_idx change by cleaning up warnings in mappers This commit was SVN r22728.	2010-02-27 18:14:27 +00:00
Josh Hursey	a3583b8f57	Fix --bynode option to remember for subsequent jobs where it left off last time. Add a ''map_bynode'' info key to determine if the job to be started by comm_spawn* should be mapped by node or by slot. Default is to map according to the default policy set when the parent job was started. cmr:v1.5.1 This commit was SVN r22564.	2010-02-05 15:37:49 +00:00
Shiqing Fan	bdc13dacb1	A type cast. This commit was SVN r22520.	2010-01-31 20:22:22 +00:00
Ralph Castain	7badff9d2d	Okay to return no available nodes for mapping when launching daemons - just means there is nothing to do This commit was SVN r22509.	2010-01-28 22:58:28 +00:00
Ralph Castain	f66b6cae23	Enable the boot of an orted "virtual machine". Modify the mapper framework to allow mapping of only daemons. Remove the cm ras module as no longer required. Modify the orted code to always send back node arch info. Remove the "--enable-bootstrap" configure option as this feature will now always be available. This commit was SVN r22480.	2010-01-25 22:25:13 +00:00
Shiqing Fan	872a4047ba	Fix a bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions. In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically, but in CMake 2.8 this behavior has changed: it only adds the dependencies and does not link, which causes linking errors at compilation time. This commit was SVN r22405.	2010-01-14 18:10:20 +00:00
Ralph Castain	cec840f6b9	The ability to add procs to a running job was unfortunately borked when we added the detection of a proc exiting before calling init. Re-enable it here, ensuring that procs that are being restarted and/or added to a job do -not- call barrier during orte_init. This commit was SVN r22404.	2010-01-14 17:59:42 +00:00
Ralph Castain	5e031d9ded	Let a restarted process have access to all known nodes instead of only those already in its prior job map This commit was SVN r22225.	2009-11-19 19:45:11 +00:00
Ralph Castain	f1f156d57b	Make rmaps base open function play nicely with ompi_info This commit was SVN r22111.	2009-10-20 07:28:23 +00:00
Ralph Castain	d8d80d6f1a	Closes trac:2054. Check if a user specifies more cpus-per-rank than there are cpus in a socket - if so, politely tell them "you are stupid" and abort. This commit was SVN r22091. The following Trac tickets were found above: Ticket 2054 --> https://svn.open-mpi.org/trac/ompi/ticket/2054	2009-10-13 04:19:07 +00:00
Ralph Castain	1475d34c13	Ensure we default to byslot mapping This commit was SVN r22090.	2009-10-11 23:50:42 +00:00
Ralph Castain	40e2299fa7	Test to ensure that num_procs was provided for the resilient mapper - it cannot be used with options like npernode. Cleanup the show_help text file This commit was SVN r22082.	2009-10-09 15:26:23 +00:00
Ralph Castain	dcab61ad83	Restore the prior default rank assignment scheme for round-robin mappers. Ensure that each app_context has sequential vpids. This commit was SVN r22048.	2009-10-02 03:16:18 +00:00
Ralph Castain	a15c58c583	Fix the proc assignment into the job data object during assignment of vpids as comm_spawned procs were being overwritten by their parents with the same vpid. Add a little debug output when updating proc state This commit was SVN r22042.	2009-10-01 13:44:34 +00:00
Ralph Castain	51f64aaf96	Add a new ras module to support bootstrap operations. Additional functionality may eventually be required in the component, but for now all it does is provide a mechanism for ensuring that other allocations don't confuse the system. Only active if specifically directed to use it This commit was SVN r22040.	2009-09-30 23:30:24 +00:00
Ralph Castain	dff0d01673	Yet another paffinity cleanup...sigh. 1. ensure that orte_rmaps_base_schedule_policy does not override cmd line settings 2. when you try to bind to more cores than we have, generate a not-enough-processors error message 3. allow npersocket -bind-to-core combination - because, yes, somebody actually wants to do it. This commit was SVN r21996.	2009-09-22 18:44:53 +00:00
Ralph Castain	8da3aa8d5c	Some (hopefully final!) adjustments and corrections to the paffinity support: 1. default -npersocket to force -bind-to-socket 2. if we cannot get a value for cores/socket, try using #logical cpus. otherwise, default to 1 core 3. add missing error message for not-enough-processors 4. since we no longer loop through orte_register_params twice, put the auto-detect of topology info in the rte_init for hnp and std_orted 5. fix bind-to-core, bysocket combination This commit was SVN r21992.	2009-09-22 15:41:03 +00:00
Ralph Castain	98a4450df6	Fix the seq mapper by initializing the proc object to NULL before claiming a slot for it This commit was SVN r21969.	2009-09-17 05:18:37 +00:00
Ralph Castain	142036f2c0	Issue an error message and abort if the user requests a number of processes that conflicts with nperxxx directives when evaluated against available resources This commit was SVN r21949.	2009-09-07 03:36:10 +00:00
Jeff Squyres	e1fe03ad44	Minor grammar fixes, and use "#" for separating lines, not blank lines. This commit was SVN r21931.	2009-09-03 07:02:21 +00:00
Ralph Castain	0421a49844	Update the xml support to allow -xml-file foo whereby we redirect all xml formatted output (and ONLY xml formatted output) to a specified file This commit was SVN r21930.	2009-09-02 18:03:10 +00:00
Lenny Verkhovsky	2a594fec6c	Added a help message to the rankfile mapper for the failure that occurs when an alias is used instead of the full hostname This commit was SVN r21919.	2009-09-01 11:17:32 +00:00
Ralph Castain	0394a4884d	Setup cpus-per-proc and cpus-per-rank as synonyms, both in mca params and on mpirun cmd line This commit was SVN r21914.	2009-08-30 14:30:36 +00:00
Ralph Castain	2d27bc9824	Default npersocket to bind-to-socket unless otherwise directed This commit was SVN r21904.	2009-08-27 13:21:14 +00:00
Ralph Castain	5e710928a5	Revise the new binding system slightly: 1. finalize the logic for properly respecting externally assigned bindings. Thanks to Chris Samuel for his help with this. Still needs some acid testing, but appears to now work. 2. remove the double-logic of requiring opal_paffinity_alone AND bind-to-foo. If the user specifies bind-to-foo, trust her and just do it. This commit was SVN r21885.	2009-08-26 02:01:49 +00:00
Ralph Castain	2016a3180b	Silence compiler warnings about uninitialized variables This commit was SVN r21883.	2009-08-26 01:56:39 +00:00
Ralph Castain	9ad33a4688	Silence compiler warning about uninitialized variable This commit was SVN r21882.	2009-08-26 01:56:11 +00:00
Rainer Keller	8e1b23779f	- Replace combinations of #if defined (c_plusplus) defined (__cplusplus) followed by extern "C" { and the closing counterpart by BEGIN_C_DECLS and END_C_DECLS. Notable exceptions are: - opal/include/opal_config_bottom.h: This is our generated code, that itself defines BEGIN_C_DECL and END_C_DECL - ompi/mpi/cxx/mpicxx.h: Here we do not include opal_config_bottom.h: - Belongs to external code: opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.c opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.h - opal/include/opal/prefetch.h: Has C++ specific macros that are protected: - Had #if ... } #endif _and_ END_C_DECLS (aka end up with 2x END_C_DECLS) ompi/mca/btl/openib/btl_openib.h - opal/event/event.h has #ifdef __cplusplus as BEGIN_C_DECLS... - opal/win32/ompi_process.h: had extern "C"\n {... opal/win32/ompi_process.h: dito - ompi/mca/btl/pcie/btl_pcie_lex.l: needed to add *_C_DECLS ompi/mpi/f90/test/align_c.c: dito - ompi/debuggers/msgq_interface.h: used #ifdef __cplusplus - ompi/mpi/f90/xml/common-C.xsl: Amend Tested on linux using --with-openib and --with-mx The following do not contain either opal_config.h, orte_config.h or ompi_config.h (but possibly other header files, that include one of the above): ompi/mca/bml/r2/bml_r2_ft.h ompi/mca/btl/gm/btl_gm_endpoint.h ompi/mca/btl/gm/btl_gm_proc.h ompi/mca/btl/mx/btl_mx_endpoint.h ompi/mca/btl/ofud/btl_ofud_endpoint.h ompi/mca/btl/ofud/btl_ofud_frag.h ompi/mca/btl/ofud/btl_ofud_proc.h ompi/mca/btl/openib/btl_openib_mca.h ompi/mca/btl/portals/btl_portals_endpoint.h ompi/mca/btl/portals/btl_portals_frag.h ompi/mca/btl/sctp/btl_sctp_endpoint.h ompi/mca/btl/sctp/btl_sctp_proc.h ompi/mca/btl/tcp/btl_tcp_endpoint.h ompi/mca/btl/tcp/btl_tcp_ft.h ompi/mca/btl/tcp/btl_tcp_proc.h ompi/mca/btl/template/btl_template_endpoint.h ompi/mca/btl/template/btl_template_proc.h ompi/mca/btl/udapl/btl_udapl_eager_rdma.h ompi/mca/btl/udapl/btl_udapl_endpoint.h ompi/mca/btl/udapl/btl_udapl_mca.h ompi/mca/btl/udapl/btl_udapl_proc.h ompi/mca/mtl/mx/mtl_mx_endpoint.h ompi/mca/mtl/mx/mtl_mx.h ompi/mca/mtl/psm/mtl_psm_endpoint.h ompi/mca/mtl/psm/mtl_psm.h ompi/mca/pml/cm/pml_cm_component.h ompi/mca/pml/csum/pml_csum_comm.h ompi/mca/pml/dr/pml_dr_comm.h ompi/mca/pml/dr/pml_dr_component.h ompi/mca/pml/dr/pml_dr_endpoint.h ompi/mca/pml/dr/pml_dr_recvfrag.h ompi/mca/pml/example/pml_example.h ompi/mca/pml/ob1/pml_ob1_comm.h ompi/mca/pml/ob1/pml_ob1_component.h ompi/mca/pml/ob1/pml_ob1_endpoint.h ompi/mca/pml/ob1/pml_ob1_rdmafrag.h ompi/mca/pml/ob1/pml_ob1_recvfrag.h ompi/mca/pml/v/pml_v_output.h opal/include/opal/prefetch.h opal/mca/timer/aix/timer_aix.h opal/util/qsort.h test/support/components.h This commit was SVN r21855. The following SVN revision numbers were found above: r2 --> open-mpi/ompi@58fdc18855	2009-08-20 11:42:18 +00:00
Ralph Castain	646a3500a7	Correctly account for number of procs in the job This commit was SVN r21843.	2009-08-20 00:07:38 +00:00
Ralph Castain	0005e6e834	Correct a couple of bugs in the rank_file mapper that were incorrectly assigning vpids. Add a capability to parse the rankfile to extract node information in place of requiring both hostfile and rankfile for non-RM managed environments. The rankfile is -only- parsed for this IF the hostfile and -host options are not given. Otherwise, those are used to establish allocation info as we did before this commit. This commit was SVN r21815.	2009-08-13 16:08:43 +00:00
Shiqing Fan	bce2f44154	Update related .windows files with proper compiling properties, in order to have a successful DSO build. This commit was SVN r21805.	2009-08-12 08:55:58 +00:00
Ralph Castain	1dc12046f1	Modify the OMPI paffinity and mapping system to support socket-level mapping and binding. Mostly refactors existing code, with modifications to the odls_default module to support the new capabilities. Adds several new mpirun options: * -bysocket - assign ranks on a node by socket. Effectively load balances the procs assigned to a node across the available sockets. Note that ranks can still be bound to a specific core within the socket, or to the entire socket - the mapping is independent of the binding. * -bind-to-socket - bind each rank to all the cores on the socket to which they are assigned. * -bind-to-core - currently the default behavior (maintained from prior default) * -npersocket N - launch N procs for every socket on a node. Note that this implies we know how many sockets are on a node. Mpirun will determine its local values. These can be overridden by provided values, either via MCA param or in a hostfile Similar features/options are provided at the board level for multi-board nodes. Documentation to follow... This commit was SVN r21791.	2009-08-11 02:51:27 +00:00
Ralph Castain	c0e85a492c	Deleted one too many lines...might be good to set the value of oldnode! Thanks George. This commit was SVN r21702.	2009-07-16 18:49:24 +00:00
Ralph Castain	ae6c36ae01	Ensure that jdata->num_procs is correct when the rank_file mapper is mapping more procs than are specified in the rank_file This commit was SVN r21690.	2009-07-15 22:45:12 +00:00
Ralph Castain	247ba7e90d	Use the base function to claim a slot when fault groups are not defined This commit was SVN r21681.	2009-07-15 11:28:58 +00:00
Ralph Castain	dbac602be5	Add support for the add-host and add-hostfile MPI Info keys to allow Comm_spawn users to add new hosts to those already known by mpirun. Requires full testing once comm_spawn is fixed (Edgar is working that now). This commit was SVN r21664.	2009-07-14 14:34:11 +00:00
Ralph Castain	b97f885c00	Restore the original API to terminate individual processes instead of the entire job. This was originally removed as we didn't at that time know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality. Continue work on the resilient mapper, completing support for fault groups. This commit was SVN r21639.	2009-07-13 02:29:17 +00:00
Ralph Castain	e30826c6e1	Quiet some compiler warnings This commit was SVN r21591.	2009-07-02 17:48:36 +00:00
Lenny Verkhovsky	e03807a3d1	small patch to extend current rankfile syntax to be compliant with orte_hosts syntax making it possible to claim relative hosts from the hostfile/scheduler by using +n# hostname, where 0 <= # < np ex: cat ~/work/svn/hpc/dev/test/Rankfile/rankfile rank 0=+n0 slot=0 rank 1=+n0 slot=1 rank 2=+n1 slot=2 rank 3=+n1 slot=1 This commit was SVN r21557.	2009-06-28 11:20:56 +00:00
Ralph Castain	b96a71b62e	Enable restart of individual processes upon command via the errmgr callback function. It needs an external application to drive this capability, so normal operations shouldn't be affected. Does not support MPI applications. More work coming to update daemon accounting on movement of procs across nodes. This commit was SVN r21545.	2009-06-26 20:54:58 +00:00
Lenny Verkhovsky	efa800efea	removed orphan files in rankfile mapper This commit was SVN r21532.	2009-06-25 17:14:10 +00:00
Ralph Castain	e9fc0a74fb	Silence compiler warnings This commit was SVN r21445.	2009-06-16 13:34:31 +00:00
Ralph Castain	d1dd8c2653	Ensure we accurately count the number of new daemons to be launched, especially if we are restarting processes. Have the resilient mapper also setup for new daemons in case the PLM needs them. This commit was SVN r21437.	2009-06-15 13:55:01 +00:00
Ralph Castain	c0c56e30c9	Add a missing function to the resilient mapper so it defines daemons in case they are needed This commit was SVN r21428.	2009-06-12 19:48:13 +00:00
Ralph Castain	170327e575	Reorg the rmaps components to collect shared code for byslot and bynode mapping in the base so we quit duplicating it in every mapper This commit was SVN r21424.	2009-06-12 17:52:17 +00:00
Shiqing Fan	5a90b3068e	Two type casts. This commit was SVN r21388.	2009-06-07 12:51:46 +00:00
Ralph Castain	0a67bcb653	Minor cleanups This commit was SVN r21387.	2009-06-06 15:44:00 +00:00
Ralph Castain	0336460b0a	Continue implementation of resilient operations by supporting reuse of jobids for restarted procs. Ensure that restarted processes have valid node and local ranks, and that node rank values are passed to direct-launched processes. This commit was SVN r21385.	2009-06-06 01:08:47 +00:00
Ralph Castain	303e3a1d39	Add a resilient mapping capability - currently maps by fault groups (if provided), still need to add the remapping capability for failed procs. This commit was SVN r21350.	2009-06-02 03:23:20 +00:00
Ralph Castain	f139cfd28a	Fully enable the use of static ports to minimize connections on mpirun. When static ports are provided, daemons will automatically use routes defined by the selected routed module to callback to mpirun during startup, thus eliminating the dedicated daemon-to-mpirun connection. Therefore, the total number of connections on mpirun will equal the fanout of the routed module (instead of #nodes in job). Add a new tm ess module that exploits this capability. Update the various plm modules to enable it - just a minor change reflecting an added param to a plm base function. Additional fixes included: 1. remove an erroneous cleanup of session directories in the tool finalize procedure - tools don't create session directories to begin with! 2. fix a duplicate free when attempting to execute a non-existent app 3. clean up a typo in the comm utilities 4. fix comm_spawn - was perturbed by the changes in pack/unpack of orte_job_t to properly support orte-ps Been tested on slurm and tm machines, using all tests in orte/test/mpi. May run into issue with command line length on large jobs due to inclusion of node info to support static ports - will fix this next with addition of regexp generator to compress that info. This commit was SVN r21248.	2009-05-16 04:15:55 +00:00
Ralph Castain	fd5dd9c4cb	Ensure we correctly cycle through map_by_slot when mapping leftover procs in rankfile mapper This commit was SVN r21219.	2009-05-12 15:41:55 +00:00
Ralph Castain	fa839f4a30	Fix a bug in the rankfile mapper when nooversubscribe is set This commit was SVN r21208.	2009-05-11 23:44:59 +00:00
Shiqing Fan	cd565923d3	Completely remove ltdl support for Windows build. This commit was SVN r21170.	2009-05-05 18:59:13 +00:00
Ralph Castain	e615af8b80	Silence coverity... This commit was SVN r21149.	2009-05-04 22:22:47 +00:00
Ralph Castain	7194f1636f	Complete the rewrite of rankfile mapper - ensure that all non-specified ranks are properly mapped when dealing with multiple app_contexts This commit was SVN r21111.	2009-04-29 14:06:53 +00:00
Rainer Keller	71052deebb	- Get rid of incompatible implicit declaration Need #include string.h This commit was SVN r21104.	2009-04-29 08:11:37 +00:00
Ralph Castain	5fa3b38d3c	Revert r21097 as this results in multiple instantiations of global variables. Instead, fix the problem by including orte_globals.h in the orte_init.c. Since I already had some changes in there, add in the rmaps rank_file changes - should work okay, but not fully tested. This commit was SVN r21099. The following SVN revision numbers were found above: r21097 --> open-mpi/ompi@88ae934c26	2009-04-29 02:13:14 +00:00
Rainer Keller	221fb9dbca	... Delayed due to notifier commits earlier this day ... - Delete unnecessary header files using contrib/check_unnecessary_headers.sh after applying patches, that include headers, being "lost" due to inclusion in one of the now deleted headers... In total 817 files are touched. In ompi/mpi/c/ header files are moved up into the actual c-file, where necessary (these are the only additional #include), otherwise it is only deletions of #include (apart from the above additions required due to notifier...) - To get different MCAs (OpenIB, TM, ALPS), an earlier version was successfully compiled (yesterday) on: Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled This commit was SVN r21096.	2009-04-29 01:32:14 +00:00
Rainer Keller	6c1cce8761	- For the upcoming header cleanup commit, several header files (previously included by header-files) now have to be moved "upward". This is mainly system headers such as string.h, stdio.h and for networking, but also some orte headers. This commit was SVN r21095.	2009-04-29 00:49:23 +00:00
Shiqing Fan	3d4e0472d6	Add windows support files into the tarball, including .windows, CMakeLists.txt files, and CMake modules. Thanks to Jeff for testing it on Linux. This commit was SVN r21069.	2009-04-24 16:39:33 +00:00
Rainer Keller	bff1b2a22b	- Finally add the missing opal/util/output.h for the OPAL_OUTPUT_VERBOSE macro. - ompi/errhandler/errhandler_predefined.h: Well, just the missing fwd declarations... This commit was SVN r20820.	2009-03-17 22:37:15 +00:00
Rainer Keller	a94438343b	- Revert r20740 This commit was SVN r20741. The following SVN revision numbers were found above: r20740 --> open-mpi/ompi@2a70618a77	2009-03-05 21:50:47 +00:00
Rainer Keller	2a70618a77	- Second patch, as discussed in Louisville. Replace short macros in orte/util/name_fns.h to the actual fct. call. - Compiles on linux/x86-64 This commit was SVN r20740.	2009-03-05 21:14:18 +00:00
Rainer Keller	fd28b392bf	- An intrusive commit yet again (sorry): with the separation we get bitten by header depending on having already included the corresponding [opal\|orte\|ompi]_config.h header. When separating, things like [OPAL\|ORTE\|OMPI]_DECLSPEC are missed. Script to add the corresponding header in front of all following (taking care of possible #ifdef HAVE_...) - Including some minor cleanups to - ompi/group/group.h -- include _after_ #ifndef OMPI_GROUP_H - ompi/mca/btl/btl.h -- include _after_ #ifndef MCA_BTL_H - ompi/mca/crcp/bkmrk/crcp_bkmrk_btl.c -- still no need for orte/util/output.h - ompi/mca/pml/dr/pml_dr_recvreq.c -- no need for mpool.h - ompi/mca/btl/btl.h -- reorder to fit - ompi/mca/bml/bml.h -- reorder to fit - ompi/runtime/ompi_mpi_finalize.c -- reorder to fit - ompi/request/request.h -- additionally need ompi/constants.h - Tested on linux/x86-64 This commit was SVN r20720.	2009-03-04 15:35:54 +00:00
Ralph Castain	f11931306a	Modify the accounting system to recycle jobids. Properly recover resources from nodes and jobs upon completion. Adjustments in several places were required to deal with sparsely populated job, node, and proc arrays as a result of this change. Correct an error wrt how jobids were being computed. Needed to ensure that the job family field was not overrun as we increment jobids for comm_spawn. Update the slurm plm module so it uses the new slurm termination procedure (brings trunk back into alignment with 1.3 branch). Update the slurmd ess component so it doesn't get selected if we are running a singleton inside of a slurm allocation. Cleanup HNP init by moving some code that had been in orte_globals.c for historical reasons into the ess hnp module, and removing the call to that code from the ess_base_std_prolog NOTE: this change allows orte to support an infinite aggregate number of comm_spawn's, with up to 64k being alive at any one instant. HOWEVER, the MPI layer currently does -not- support re-use of jobids. I did some prototype coding to revise the ompi_proc_t structures, but the BTLs are caching their own data, and there was no readily apparent way to update it. Thus, attempts to spawn more than the 64k limit will abort to avoid causing the MPI layer to hang. This commit was SVN r20700.	2009-03-03 16:39:13 +00:00
Rainer Keller	04567d3af0	- Header orte/mca/errmgr/errmgr.h is not needed. Once again compiles fine with -Wimplicit-function-declaration This commit was SVN r20640.	2009-02-26 04:05:30 +00:00
Rainer Keller	b356e90fa1	- Get rid of include orte/util/proc_info.h, if not needed Only proc_info.h-internal include file is opal/dss/dss_types.h - In one case (orte/util/hnp_contact.c) had to add proc_info.h again. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration works fine, no errors. Again, let's have MTT the last word. This commit was SVN r20631.	2009-02-25 03:38:00 +00:00
Rainer Keller	d81443cc5a	- On the way to get the BTLs split out and lessen dependency on orte: Often, orte/util/show_help.h is included, although no functionality is required -- instead, most often opal_output.h, or orte/mca/rml/rml_types.h Please see orte_show_help_replacement.sh committed next. - Local compilation (Linux/x86_64) w/ -Wimplicit-function-declaration actually showed two missing #include "orte/util/show_help.h" in orte/mca/odls/base/odls_base_default_fns.c and in orte/tools/orte-top/orte-top.c Manually added these. Let's have MTT the last word. This commit was SVN r20557.	2009-02-14 02:26:12 +00:00
Shiqing Fan	a5281f0434	- 1/4 commit for Windows Visual Studio and CCP support: CMakeLists and .windows files. In contribs preconfigured and precompiled parts. This commit was SVN r20108.	2008-12-10 20:59:20 +00:00
Ralph Castain	89792bbc72	May as well have the other "clean" outputs use the same channel This commit was SVN r20082.	2008-12-08 19:37:22 +00:00
Ralph Castain	6db5737779	Remove a couple of mutex vars that were defined and used - but never initialized. No clear way to initialize them, and that area of the code should never see threads anyway. This commit was SVN r19889.	2008-11-03 17:23:10 +00:00
George Bosilca	9528d33e90	Nothing relevant, few indentations and replace tab by spaces. This commit was SVN r19870.	2008-10-31 22:24:52 +00:00
Ralph Castain	aa11e0977c	Correct a bug in the bookmarking code that incorrectly looked at #slots instead of #slots_allocated, thus causing slot reductions in hostfiles to be ignored when selecting our starting node. Fixes trac:1527 This commit was SVN r19656. The following Trac tickets were found above: Ticket 1527 --> https://svn.open-mpi.org/trac/ompi/ticket/1527	2008-09-29 14:09:02 +00:00
Ralph Castain	037231fbcb	Modify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required. This commit was SVN r19637.	2008-09-25 13:39:08 +00:00
Ralph Castain	e64b79f30f	Modify the --display-map and --display-alloc per note on devel list to reduce info for user understanding. Add --display-devel-map and --display-devel-alloc to display all the detailed info we used to provide - it is only of use/interest to developers anyway and confuses users. This commit was SVN r19608.	2008-09-23 15:46:34 +00:00
Ralph Castain	c0d7fbaf88	A few mapping cleanups - mostly aimed to properly balancing loads so multi app-context comm_spawns don't dump everything on one node. This commit was SVN r19519.	2008-09-08 15:45:55 +00:00
George Bosilca	579d70edad	We should use #ifdef and not #if This commit was SVN r19504.	2008-09-05 12:44:19 +00:00
Rainer Keller	0d08866786	- Declare functions in lex-files as extern "C" {} to get rid of warnings. This commit was SVN r19132.	2008-08-04 11:49:01 +00:00
Ralph Castain	1210a96d82	Ensure a value gets defined before used...thanks Jeff This commit was SVN r19075.	2008-07-29 13:08:45 +00:00
Jeff Squyres	0af7ac53f2	Fixes trac:1392, #1400 * add "register" function to mca_base_component_t * converted coll:basic and paffinity:linux and paffinity:solaris to use this function * we'll convert the rest over time (I'll file a ticket once all this is committed) * add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier * new mca_base_component_t size: 196 bytes * new mca_base_component_data_2_0_0_t size: 36 bytes * MCA base version bumped to v2.0 * '''We now refuse to load components that are not MCA v2.0.x''' * all MCA frameworks versions bumped to v2.0 * be a little more explicit about version numbers in the MCA base * add big comment in mca.h about versioning philosophy This commit was SVN r19073. The following Trac tickets were found above: Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392	2008-07-28 22:40:57 +00:00
Ralph Castain	1a77b15523	Modify the handling of hostfiles to allow them to subdivide allocations. Utilize the "slots_alloc" field of the orte_node_t object - which had previously been unused - to track the #slots allocated to a given app_context. Let the hostfile filtering action utilize the #slots field to modify the allocated slots for each app_context. This commit was SVN r19066.	2008-07-28 15:10:40 +00:00
Ralph Castain	3107545709	Ensure that ORTE processes such as mpirun and orted never inadvertently bind themselves to cores. Change the mca param name used by the rank_file mapper to get user directives on slot lists to be different from that used by MPI procs to discover their binding. Add a cmd line option to orterun to make it easier for a user to specify the slot list (basically, hide the mca param name). Discussed and reviewed with Lenny and Jeff. This commit was SVN r19062.	2008-07-28 14:18:36 +00:00
Ralph Castain	d5a916d350	Fix a problem reported by IBM: nolocal and bynode combined to map byslot. Problem actually was that any time multiple mapping policy directives were provided, we would only map byslot due to incorrect if statement conditions. Thanks to Kris Davis for his patience while we tracked this down! This commit was SVN r19039.	2008-07-25 17:50:46 +00:00
Ralph Castain	718cceddaa	Ensure that we only launch procs on the HNP if that node is actually included in the allocation. This commit was SVN r19038.	2008-07-25 17:13:22 +00:00
Ralph Castain	a1d296ae03	This commit fixes ticket #1410 Fix a few bugs in the mappers: 1. Ensure that bynode with no -np fills all available slots - it just does so with the ranks set bynode instead of byslot 2. fix --nolocal behavior so it works correctly in all cases. We still have to test the host's name using opal_ifislocal in the mapper because the name returned by gethostname to orte_process_info.hostname can be an FQDN, but a hostfile may contain a non-FQDN version. 3. Add missing --nolocal logic to the seq mapper Oversubscribed mapping seemed to be working okay without repair, so I couldn't verify my own bug report in that regard. Also included are some preliminary changes to support the modified hostfile behavior, which will be committed shortly: 1. removed the totally useless "allocate" field in the orte_node_t object since every node is automatically allocated for use - and everything ignored the field anyway 2. correctly initialize the slots_alloc field when the allocation is read This commit was SVN r19030.	2008-07-25 13:35:12 +00:00
Lenny Verkhovsky	b4d54dda57	Fixed a possible segfault when using RANKFILE with not all ranks assigned. Fixed allocation of all ranks when using RANKFILE with not all ranks assigned. Abort a little earlier if using RANKFILE but np wasn't specified. Clean mca_rmaps_rank_file_component.debug This commit was SVN r19004.	2008-07-23 17:44:02 +00:00
Jeff Squyres	583bf425c0	Fixes trac:1383: Short version: remove opal_paffinity_alone and restore mpi_paffinity_alone. ORTE makes various information available for the MPI layer to decide what it wants to do in terms of processor affinity. Details: * remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone MCA param * move opal_paffinity_slot_list param registration to paffinity base * ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that succeeds use that. If no slot list was set, see if mpi_paffinity_alone was set. If so, bind this process to its Node Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you COMM_SPAWN to a host in this ORTE universe that already has procs on it, the NLR for the new job will start at N (not 0). So this is slightly better than mpi_paffinity_alone in the v1.2 series. * If a slot list is specified and mpi_paffinity_alone is set, we display an error and abort. * Remove calls from rmaps/rank_file component to register and lookup opal_paffinity mca params. * Remove code in orte/odls that set affinities - instead, have them just pass a slot_list if it exists. * Cleanup the orte/odls code that determined oversubscribed/want_processor as these were just opposites of each other. This commit was SVN r18874. The following Trac tickets were found above: Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383	2008-07-10 21:12:45 +00:00
Lenny Verkhovsky	30f0b33274	Priority of rmaps_rank_file_component changed to 100 when it is selected. This commit was SVN r18866.	2008-07-10 13:57:40 +00:00
Lenny Verkhovsky	c143c95ff9	Partial rankfile slots allocation fix This commit was SVN r18787.	2008-07-01 08:54:20 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	f9d809748c	Glad someone found that last error - caused me to review the code and find a couple of other cleanups! Nothing major, but just ensure that things flow smoothly since we had a "shadowed" variable. This commit was SVN r18643.	2008-06-10 19:15:59 +00:00
Camille Coti	67cd1849f7	*map was still NULL in the else statement, causing a segmentation fault when a field of the structure was accessed. This commit was SVN r18642.	2008-06-10 19:00:57 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
Ralph Castain	0da811ce79	Initial work on xml support - allocation and job map outputs completed. More to come. This commit was SVN r18587.	2008-06-04 20:53:12 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Josh Hursey	4ac7016200	Make sure to check "opal_list_get_last" instead of "opal_list_get_end". The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list. When oversubscribing by way of an appfile, we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory. This commit fixes trac:1279 This commit was SVN r18529. The following Trac tickets were found above: Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279	2008-05-28 19:37:20 +00:00
Ralph Castain	93d932aa0c	Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function. This commit was SVN r18507.	2008-05-27 15:46:21 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not their opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will behave similarly to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	64ef4102c4	Add the topo mapper module - requires some work in carto for completion. Little cleanup in round-robin mapper. This commit was SVN r18412.	2008-05-08 05:09:13 +00:00
Josh Hursey	9971bc9d95	Merge in the mca_base_select changes per RFC: http://www.open-mpi.org/community/lists/devel/2008/04/3779.php {{{ svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play . }}} Any components not in the trunk, but in one of the affected frameworks must be updated. Contact the list, look at the RFC, or look at the diff for how to do this. Sorry for the early commit of this, but I wanted to get it in today (per RFC) and didn't know if I would have a chance later today. This commit was SVN r18381.	2008-05-06 18:08:45 +00:00
Ralph Castain	432d441b3e	Cleanup a bug found by Josh that caused multiple app_contexts to keep mapping onto the first node in an allocation. Continue work on loadbalancing. Cleanup code organization in rmaps_base. This commit was SVN r18353.	2008-05-01 21:07:49 +00:00
Ralph Castain	1766442591	Fix a double-free when tree-spawning. Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context. This commit was SVN r18344.	2008-05-01 14:49:56 +00:00
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
Ralph Castain	eece9f88f0	Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node. We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job. This commit was SVN r18263.	2008-04-23 17:42:59 +00:00
Ralph Castain	5311b13b60	Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system. This commit was SVN r18252.	2008-04-23 14:52:09 +00:00
Lenny Verkhovsky	456ce6c4da	Few cleanups in Rank_File component + fixed opal_paffinity_slot_list without rankfile This commit was SVN r18249.	2008-04-23 13:34:05 +00:00
Ralph Castain	16c9100633	Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation. This commit was SVN r18216.	2008-04-20 02:25:45 +00:00
Ralph Castain	07f0a71faa	Cleanup the show_help entries on the seq mapper This commit was SVN r18191.	2008-04-17 14:43:15 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile. This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Ralph Castain	66e532669a	Remove some dead code This commit was SVN r18182.	2008-04-16 20:33:53 +00:00
Ralph Castain	3a0d09300b	Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations. Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study. This commit was SVN r18115.	2008-04-09 22:10:53 +00:00
Lenny Verkhovsky	2be4e32c79	1. Fixing Possible strdup of NULL 2. Fixing num_alloc when combined mapping policies ( rankfile & byslot or bynode ) This commit was SVN r18073.	2008-04-02 14:12:38 +00:00
Ralph Castain	51533c9340	Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile. Not functional yet - still under development. Just placeholding for now to clear a backlog This commit was SVN r18062.	2008-04-01 20:03:49 +00:00
Ralph Castain	1889bbd119	Quiet some warnings about uninitialized variables This commit was SVN r18032.	2008-03-31 13:52:10 +00:00
Ralph Castain	8506be755d	Clean-up the mess. Repair static builds. Remove unused and empty C-decl braces. Add missing prototype for function. This commit was SVN r18031.	2008-03-31 13:02:33 +00:00
Lenny Verkhovsky	cb83a1287d	Really deleted old files now. This commit was SVN r18018.	2008-03-30 11:50:19 +00:00
Lenny Verkhovsky	f734ba51a4	Added files with names according to prefix rule This commit was SVN r18017.	2008-03-30 11:42:09 +00:00
Lenny Verkhovsky	b43f4a2dc9	Deleted and added files after prefix rule changes This commit was SVN r18016.	2008-03-30 11:41:01 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Jeff Squyres	ac2e329353	Oops! That should not have been removed... This commit was SVN r17865.	2008-03-18 14:42:30 +00:00
Jeff Squyres	bd92720d41	More fixes to make it compile and play nice on OS X. Still more fixes are required; sending mail to devel shortly... This commit was SVN r17864.	2008-03-18 14:38:52 +00:00
Ralph Castain	8f31a62600	Fix compilation errors so this will compile, remove unused variables This commit was SVN r17862.	2008-03-18 13:01:26 +00:00
Lenny Verkhovsky	647bce6d3e	Support for new RMAPS rank mapping component This commit was SVN r17860.	2008-03-18 09:39:07 +00:00
Lenny Verkhovsky	14c32f87d5	Added new RMAPS component for rank mapping This commit was SVN r17859.	2008-03-18 09:33:49 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Ralph Castain	edb8e32a7a	Add default hostfile parameter plus --default-hostfile command line option. Fix error message when job setup failed This commit was SVN r17724.	2008-03-05 04:54:57 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
George Bosilca	48f5a26e8c	Cast to keep VC happy (quiet). This commit was SVN r17054.	2008-01-04 23:13:32 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusion and discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Jeff Squyres	9e4387d021	* Use new BEGIN_C_DECLS / END_C_DECLS convention * Add newline at end of file to avoid compiler warning This commit was SVN r16579.	2007-10-26 13:40:38 +00:00
Shiqing Fan	3c38c9c020	- Add extern "C" to resolve linkage specification problems. This commit was SVN r16577.	2007-10-26 09:54:42 +00:00
Josh Hursey	729c63cf9d	Fix invalid MCA 'base' names so they appear in ompi_info. A subset of this patch needs to be applied to v1.2 Refs trac:928 This commit was SVN r15918. The following Trac tickets were found above: Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928	2007-08-18 03:05:45 +00:00
Brian Barrett	801fffabff	Don't assume things about the contact info string in the general case. There is no need for the IP address in most cases (filem being one dubious exception), so just publish and hand around the supposedly opaque contact info strings This commit was SVN r15638.	2007-07-26 16:51:41 +00:00
Brian Barrett	5b9fa7e998	reapply r15517 and r15520, which were removed in r15527 so that I could get the RML/OOB merge in slightly easier This commit was SVN r15530. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95 r15520 --> open-mpi/ompi@9cbc9df1b8 r15527 --> open-mpi/ompi@2d17dd9516	2007-07-20 02:34:29 +00:00
Brian Barrett	39a6057fc6	A number of improvements / changes to the RML/OOB layers: * General TCP cleanup for OPAL / ORTE * Simplifying the OOB by moving much of the logic into the RML * Allowing the OOB RML component to do routing of messages * Adding a component framework for handling routing tables * Moving the xcast functionality from the OOB base to its own framework Includes merge from tmp/bwb-oob-rml-merge revisions: r15506, r15507, r15508, r15510, r15511, r15512, r15513 This commit was SVN r15528. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r15506 r15507 r15508 r15510 r15511 r15512 r15513	2007-07-20 01:34:02 +00:00
Brian Barrett	2d17dd9516	Temporarily back out r15517 and r15520 so that I can get the RML / OOB changes to cleanly apply. This commit was SVN r15527. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95	2007-07-20 01:10:34 +00:00
Ralph Castain	41977fcc95	Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but... Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point. Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings. This commit was SVN r15517.	2007-07-19 20:56:46 +00:00
Ralph Castain	d109e9a6f4	Roll in the Voltaire core/socket/etc process mapping implementation. Only change I made was to cleanup some of the diagnostic output in the odls_default component so it uses the -mca odls_base_verbose parameter. You will not see any impact from this change unless you use the syntax described in ticket #1023. I've tried as many of the RAS components as possible and saw no problem - there may be issues with other RAS components that would not compile on any of my systems. Anything that appears should be trivial to fix. This commit was SVN r15427.	2007-07-14 15:14:07 +00:00
Ralph Castain	bd65f8ba88	Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. Short description: major changes include - 1. singletons now fork/exec a local daemon to manage their operations. 2. the orte daemon code now resides in libopen-rte 3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided. I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine. This commit was SVN r15390.	2007-07-12 19:53:18 +00:00
George Bosilca	715f6012cf	The DSS pack function can use the const attribute for the src field as it is never modified by the pack functions directly. Enforce it all over the code base. This commit was SVN r15026.	2007-06-12 22:47:14 +00:00
Tim Prins	1467558157	Cleanup a couple warnings. Update svn:ignore This commit was SVN r15009.	2007-06-12 14:11:06 +00:00
Ralph Castain	85df3bd92f	Bring in the generalized xcast communication system along with the correspondingly revised orted launch. I will send a message out to developers explaining the basic changes. In brief: 1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names. 2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (c) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used. 3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying. Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. 
The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems. This commit was SVN r15007.	2007-06-12 13:28:54 +00:00
Ralph Castain	ea0c03fd7a	Revert out r14910. Turns out that the GPR has to be able to deal with NULL data values. We fixed this a long time ago on the "put" side, but never dealt with it for "get" - hence, we could "put" ORTE_UNDEF'd attributes in a mapping policy, but couldn't retrieve them. This is why you only encountered the error on comm_spawn and not during the original launch of a job. This correctly repairs the problem by enabling the GPR's "get" function to correctly handle NULL data values. This commit was SVN r14916. The following SVN revision numbers were found above: r14910 --> open-mpi/ompi@0757467d77	2007-06-06 18:34:54 +00:00
Ralph Castain	0757467d77	Fix comm_spawn. The problem stems from our use of the existence of an attribute as equivalent to a boolean "true" - in other words, we only confirm the existence of an attribute on a list to indicate something as opposed to looking at its specific value. Hence, we create the attribute with a type of ORTE_UNDEF - which is fine...until we then attempt to store/retrieve that attribute from the registry. In that case, the DSS barks because it treats ORTE_UNDEF as an error. The only place where we attempt to store/retrieve attributes is in the RMAPS framework in support of comm_spawn. So this is where things broke down. The fix was simply to say "if the attribute data type is ORTE_UNDEF, then treat it like a boolean with value true". Trivial fix - solves problem. This commit was SVN r14910.	2007-06-06 15:16:22 +00:00
Rainer Keller	7d84de8510	- now the formatting (just getting rid of spaces at the end).... This commit was SVN r14764.	2007-05-24 19:10:32 +00:00
Rainer Keller	ff3cfc0011	- Get rid of "set but never used" warning This commit was SVN r14763.	2007-05-24 19:07:45 +00:00

557 commits