openmpi

Author	SHA1	Message	Date
Lenny Verkhovsky	c143c95ff9	Partial rankfile slots allocation fix. This commit was SVN r18787.	2008-07-01 08:54:20 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	f9d809748c	Glad someone found that last error - caused me to review the code and find a couple of other cleanups! Nothing major, but just ensure that things flow smoothly since we had a "shadowed" variable. This commit was SVN r18643.	2008-06-10 19:15:59 +00:00
Camille Coti	67cd1849f7	*map was still NULL in the else statement, causing a segmentation fault when a field of the structure was accessed. This commit was SVN r18642.	2008-06-10 19:00:57 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
Ralph Castain	0da811ce79	Initial work on xml support - allocation and job map outputs completed. More to come. This commit was SVN r18587.	2008-06-04 20:53:12 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface. This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Josh Hursey	4ac7016200	Make sure to check "opal_list_get_last" instead of "opal_list_get_end". The former will return a valid item in the list, the latter will return an invalid item that marks the end of the list. When oversubscribing by way of an appfile, we would cause a segv because we tried to interpret the invalid item returned by "opal_list_get_end" instead of a valid item. We would then try to write to unallocated memory. This commit fixes trac:1279. This commit was SVN r18529. The following Trac tickets were found above: Ticket 1279 --> https://svn.open-mpi.org/trac/ompi/ticket/1279	2008-05-28 19:37:20 +00:00
Ralph Castain	93d932aa0c	Ensure that the display-map and display-allocation outputs get processed through the new OPAL filter framework by passing them through orte_output instead of using the opal_dss.dump function. This commit was SVN r18507.	2008-05-27 15:46:21 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not the opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active.
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	64ef4102c4	Add the topo mapper module - requires some work in carto for completion. Little cleanup in round-robin mapper. This commit was SVN r18412.	2008-05-08 05:09:13 +00:00
Josh Hursey	9971bc9d95	Merge in the mca_base_select changes per RFC: http://www.open-mpi.org/community/lists/devel/2008/04/3779.php {{{ svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play . }}} Any components not in the trunk, but in one of the affected frameworks must be updated. Contact the list, look at the RFC, or look at the diff for how to do this. Sorry for the early commit of this, but I wanted to get it in today (per RFC) and didn't know if I would have a chance later today. This commit was SVN r18381.	2008-05-06 18:08:45 +00:00
Ralph Castain	432d441b3e	Cleanup a bug found by Josh that caused multiple app_contexts to keep mapping onto the first node in an allocation. Continue work on loadbalancing. Cleanup code organization in rmaps_base. This commit was SVN r18353.	2008-05-01 21:07:49 +00:00
Ralph Castain	1766442591	Fix a double-free when tree-spawning. Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context. This commit was SVN r18344.	2008-05-01 14:49:56 +00:00
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
Ralph Castain	eece9f88f0	Fix a bug in the way we computed local_rank. This needs to be the local_rank -among my job peers- on a node. We were mistakenly computing the local_rank across -all- jobs with procs on that node. While the two definitions are equivalent for an initial launch, comm_spawn'd procs would get the wrong local_rank. In particular, there would not be a local_rank=0 proc in the comm_spawn'd job on any node that was shared with the initial job. This commit was SVN r18263.	2008-04-23 17:42:59 +00:00
Ralph Castain	5311b13b60	Add a loadbalancing feature to the round-robin mapper - more to be sent to devel list. Fix a potential problem with RM-provided nodenames not matching returns from gethostname - ensure that the HNP's nodename gets DNS-resolved when comparing against RM-provided hostnames. Note that this may be an issue for RM-based clusters that don't have local DNS resolution, but hopefully that is more indicative of a poorly configured system. This commit was SVN r18252.	2008-04-23 14:52:09 +00:00
Lenny Verkhovsky	456ce6c4da	Few cleanups in Rank_File component + fixed opal_paffinity_slot_list without rankfile. This commit was SVN r18249.	2008-04-23 13:34:05 +00:00
Ralph Castain	16c9100633	Add --display-allocation option to orterun that will display the node-by-node information regarding your allocation. This commit was SVN r18216.	2008-04-20 02:25:45 +00:00
Ralph Castain	07f0a71faa	Cleanup the show_help entries on the seq mapper. This commit was SVN r18191.	2008-04-17 14:43:15 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile. This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Ralph Castain	66e532669a	Remove some dead code. This commit was SVN r18182.	2008-04-16 20:33:53 +00:00
Ralph Castain	3a0d09300b	Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations. Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study. This commit was SVN r18115.	2008-04-09 22:10:53 +00:00
Lenny Verkhovsky	2be4e32c79	1. Fixing possible strdup of NULL. 2. Fixing num_alloc when mapping policies are combined (rankfile & byslot or bynode). This commit was SVN r18073.	2008-04-02 14:12:38 +00:00
Ralph Castain	51533c9340	Add a new mapper component that sequentially maps ranks-to-hosts according to the ordering in the hostfile. Not functional yet - still under development. Just placeholding for now to clear a backlog. This commit was SVN r18062.	2008-04-01 20:03:49 +00:00
Ralph Castain	1889bbd119	Quiet some warnings about uninitialized variables. This commit was SVN r18032.	2008-03-31 13:52:10 +00:00
Ralph Castain	8506be755d	Clean-up the mess. Repair static builds. Remove unused and empty C-decl braces. Add missing prototype for function. This commit was SVN r18031.	2008-03-31 13:02:33 +00:00
Lenny Verkhovsky	cb83a1287d	Really deleted old files now. This commit was SVN r18018.	2008-03-30 11:50:19 +00:00
Lenny Verkhovsky	f734ba51a4	Added files with names according to prefix rule. This commit was SVN r18017.	2008-03-30 11:42:09 +00:00
Lenny Verkhovsky	b43f4a2dc9	Deleted and added files after prefix rule changes. This commit was SVN r18016.	2008-03-30 11:41:01 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Jeff Squyres	ac2e329353	Oops! That should not have been removed... This commit was SVN r17865.	2008-03-18 14:42:30 +00:00
Jeff Squyres	bd92720d41	More fixes to make it compile and play nice on OS X. Still more fixes are required; sending mail to devel shortly... This commit was SVN r17864.	2008-03-18 14:38:52 +00:00
Ralph Castain	8f31a62600	Fix compilation errors so this will compile, remove unused variables. This commit was SVN r17862.	2008-03-18 13:01:26 +00:00
Lenny Verkhovsky	647bce6d3e	Support for new RMAPS rank mapping component. This commit was SVN r17860.	2008-03-18 09:39:07 +00:00
Lenny Verkhovsky	14c32f87d5	Added new RMAPS component for rank mapping. This commit was SVN r17859.	2008-03-18 09:33:49 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124. The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile is passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead uses a simple argv of -host options. This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Ralph Castain	edb8e32a7a	Add default hostfile parameter plus --default-hostfile command line option. Fix error message when job setup failed. This commit was SVN r17724.	2008-03-05 04:54:57 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer. This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
George Bosilca	48f5a26e8c	Cast to keep VC happy (quiet). This commit was SVN r17054.	2008-01-04 23:13:32 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusion and discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Jeff Squyres	9e4387d021	* Use new BEGIN_C_DECLS / END_C_DECLS convention * Add newline at end of file to avoid compiler warning This commit was SVN r16579.	2007-10-26 13:40:38 +00:00
Shiqing Fan	3c38c9c020	- Add extern "C" to resolve linkage specification problems. This commit was SVN r16577.	2007-10-26 09:54:42 +00:00
Josh Hursey	729c63cf9d	Fix invalid MCA 'base' names so they appear in ompi_info. A subset of this patch needs to be applied to v1.2. Refs trac:928. This commit was SVN r15918. The following Trac tickets were found above: Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928	2007-08-18 03:05:45 +00:00
Brian Barrett	801fffabff	Don't assume things about the contact info string in the general case. There is no need for the IP address in most cases (filem being one dubious exception), so just publish and hand around the supposedly opaque contact info strings. This commit was SVN r15638.	2007-07-26 16:51:41 +00:00
Brian Barrett	5b9fa7e998	Reapply r15517 and r15520, which were removed in r15527 so that I could get the RML/OOB merge in slightly easier. This commit was SVN r15530. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95 r15520 --> open-mpi/ompi@9cbc9df1b8 r15527 --> open-mpi/ompi@2d17dd9516	2007-07-20 02:34:29 +00:00
Brian Barrett	39a6057fc6	A number of improvements / changes to the RML/OOB layers: * General TCP cleanup for OPAL / ORTE * Simplifying the OOB by moving much of the logic into the RML * Allowing the OOB RML component to do routing of messages * Adding a component framework for handling routing tables * Moving the xcast functionality from the OOB base to its own framework. Includes merge from tmp/bwb-oob-rml-merge revisions: r15506, r15507, r15508, r15510, r15511, r15512, r15513. This commit was SVN r15528. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r15506 r15507 r15508 r15510 r15511 r15512 r15513	2007-07-20 01:34:02 +00:00
Brian Barrett	2d17dd9516	Temporarily back out r15517 and r15520 so that I can get the RML / OOB changes to cleanly apply. This commit was SVN r15527. The following SVN revision numbers were found above: r15517 --> open-mpi/ompi@41977fcc95	2007-07-20 01:10:34 +00:00
Ralph Castain	41977fcc95	Remove the cellid field from the orte_process_name_t structure. This only affects a handful of files in itself, but... Cleanup ALL instances of output involving the printing of orte_process_name_t structures using the ORTE_NAME_ARGS macro so that the number of fields and type of data match. Replace those values with a new macro/function pair ORTE_NAME_PRINT that outputs a string (using the new thread safe data capability) so that any future changes to the printing of those structures can be accomplished with a change to a single point. Note that I could not possibly find outputs that directly print the orte_process_name_t fields, but only dealt with those that used ORTE_NAME_ARGS. Hence, you may still have a few outputs that bark during compilation. Also, I could only verify those that fall within environments I can compile on, so other environments may yield some minor warnings. This commit was SVN r15517.	2007-07-19 20:56:46 +00:00
Ralph Castain	d109e9a6f4	Roll in the Voltaire core/socket/etc process mapping implementation. Only change I made was to cleanup some of the diagnostic output in the odls_default component so it uses the -mca odls_base_verbose parameter. You will not see any impact from this change unless you use the syntax described in ticket #1023. I've tried as many of the RAS components as possible and saw no problem - there may be issues with other RAS components that would not compile on any of my systems. Anything that appears should be trivial to fix. This commit was SVN r15427.	2007-07-14 15:14:07 +00:00
Ralph Castain	bd65f8ba88	Bring in an updated launch system for the orteds. This commit restores the ability to execute singletons and singleton comm_spawn, both in single node and multi-node environments. Short description: major changes include - 1. singletons now fork/exec a local daemon to manage their operations. 2. the orte daemon code now resides in libopen-rte 3. daemons no longer use the orte triggering system during startup. Instead, they directly call back to their parent pls component to report ready to operate. A base function to count the callbacks has been provided. I have modified all the pls components except xcpu and poe (don't understand either well enough to do it). Full functionality has been verified for rsh, SLURM, and TM systems. Compile has been verified for xgrid and gridengine. This commit was SVN r15390.	2007-07-12 19:53:18 +00:00
George Bosilca	715f6012cf	The DSS pack function can use the const attribute for the src field as it is never modified by the pack functions directly. Enforce it all over the code base. This commit was SVN r15026.	2007-06-12 22:47:14 +00:00
Tim Prins	1467558157	Cleanup a couple warnings. Update svn:ignore. This commit was SVN r15009.	2007-06-12 14:11:06 +00:00
Ralph Castain	85df3bd92f	Bring in the generalized xcast communication system along with the correspondingly revised orted launch. I will send a message out to developers explaining the basic changes. In brief: 1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names. 2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (c) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used. 3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying. Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed. This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified.
The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems. This commit was SVN r15007.	2007-06-12 13:28:54 +00:00
Ralph Castain	ea0c03fd7a	Revert out r14910. Turns out that the GPR has to be able to deal with NULL data values. We fixed this a long time ago on the "put" side, but never dealt with it for "get" - hence, we could "put" ORTE_UNDEF'd attributes in a mapping policy, but couldn't retrieve them. This is why you only encountered the error on comm_spawn and not during the original launch of a job. This correctly repairs the problem by enabling the GPR's "get" function to correctly handle NULL data values. This commit was SVN r14916. The following SVN revision numbers were found above: r14910 --> open-mpi/ompi@0757467d77	2007-06-06 18:34:54 +00:00
Ralph Castain	0757467d77	Fix comm_spawn. The problem stems from our use of the existence of an attribute as equivalent to a boolean "true" - in other words, we only confirm the existence of an attribute on a list to indicate something as opposed to looking at its specific value. Hence, we create the attribute with a type of ORTE_UNDEF - which is fine...until we then attempt to store/retrieve that attribute from the registry. In that case, the DSS barks because it treats ORTE_UNDEF as an error. The only place where we attempt to store/retrieve attributes is in the RMAPS framework in support of comm_spawn. So this is where things broke down. The fix was simply to say "if the attribute data type is ORTE_UNDEF, then treat it like a boolean with value true". Trivial fix - solves problem. This commit was SVN r14910.	2007-06-06 15:16:22 +00:00
Rainer Keller	7d84de8510	- now the formatting (just getting rid of spaces at the end).... This commit was SVN r14764.	2007-05-24 19:10:32 +00:00
Rainer Keller	ff3cfc0011	- Get rid of "set but never used" warning. This commit was SVN r14763.	2007-05-24 19:07:45 +00:00
Ralph Castain	d9acc93efa	Compute and pass the local_rank and local number of procs (in that proc's job) on the node. To be precise, given this hypothetical launching pattern: host1: vpids 0, 2, 4, 6 host2: vpids 1, 3, 5, 7 The local_rank for these procs would be: host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3 host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3 and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid. I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future. This commit was SVN r14706.	2007-05-21 14:30:10 +00:00
Jeff Squyres	51f286d737	Just like r14289 on the ORTE trunk: Per discussions with Brian and Ralph, make a slight correction in where components are installed. Use $pkglibdir, not $libdir/openmpi, so that when compiled in the orte trunk, components are installed to the right directory (because the component search path is checking $pkglibdir). This commit was SVN r14345. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r14289	2007-04-12 11:19:42 +00:00
Tim Prins	2ffc02870d	Reduce the memory usage of the GPR: - Make it so that all the GPR pointer arrays are allocated initially at 16 elements instead of 512. This saves (on a 64 bit machine) approximately 4*(# procs + # nodes) KB. - Fix up the segment prealloc function so that preallocating an existent segment is not an error, and make the areas where we do large inserts use it. Fix the orte_pointer_array to efficiently implement setting its size. Before we just realloced the array one block at a time until the desired size was reached. Now we resize it all in one realloc. This commit was SVN r14264.	2007-04-09 00:40:15 +00:00
Tim Prins	2f74160a37	Fix some more memory leaks. This commit was SVN r14175.	2007-03-30 13:43:50 +00:00
Tim Prins	9cb455272b	Fix a pile of memory leaks in ORTE. Fix a major memory leak in the SLURM RAS, and cleanup a bit of code there. This commit was SVN r14164.	2007-03-29 00:50:56 +00:00
Ralph Castain	0d98264097	Fix the nolocal option on the OMPI trunk. This commit was SVN r14138.	2007-03-24 16:16:16 +00:00
Josh Hursey	dadca7da88	Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). This merge adds Checkpoint/Restart support to Open MPI. The initial frameworks and components support a LAM/MPI-like implementation. This commit follows the risk assessment presented to the Open MPI core development group on Feb. 22, 2007. This commit closes trac:158. More details to follow. This commit was SVN r14051. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r13912 The following Trac tickets were found above: Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158	2007-03-16 23:11:45 +00:00
Ralph Castain	5818a32245	Bring in a forgotten speed improvement for the TM launcher that was developed during SNL Tbird testing last year. Remove the redundant and slow calls to TM to resolve hostnames. Instead, read the host info from the PBS file during the RAS, and then just use that info in the PLS (rather than getting it again). Adjust the RMAPS mapped_node object to propagate the required launch_id info now included in the ras_node object. This provides support for those few systems that don't use nodename to launch, but instead want some id (typically an index into the array of allocated nodes). This value gets set for each node in the RAS - the RMAPS just propagates it for easy launch. This commit was SVN r13581.	2007-02-09 15:06:45 +00:00
Ralph Castain	1487e22ec8	Store the mapping mode so that it can be recovered later. This commit was SVN r13197.	2007-01-18 20:00:15 +00:00
Ralph Castain	455e4ada9a	Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes. This commit was SVN r13191.	2007-01-18 17:15:19 +00:00
Brian Barrett	a34e67d743	Remove unneeded PARAM_INIT_FILE variable in configure.params files used by components that use configure.m4 for configuration or are always built. The macro has not been needed since moving to configure types other than configure.stub. Fixes trac:590. This commit was SVN r13031. The following Trac tickets were found above: Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590	2007-01-08 03:44:22 +00:00
Ralph Castain	90f5e3fad8	Fix a buglet in the singleton startup procedure. For purposes of minimizing the xcast message, we "strip" the descriptive info on all subscription messages. This means, though, that we have to store the process name and other info so it can be retrieved in the body of the subscription data (as opposed to in the description). This wasn't being done for singletons because they don't call the RMAPS to "map" themselves. This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it". This change will need to migrate across to 1.2 after it "soaks" the appropriate time. This commit was SVN r12952.	2007-01-02 16:14:44 +00:00
Ralph Castain	64ec238b7b	Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs. This commit was SVN r12864.	2006-12-15 02:34:14 +00:00
Ralph Castain	3b064a624e	For convenience, revise the orte_job_map_t object so it includes the vpid start/range values, the number of nodes, and the number of processes on each node. These values are all used in various places in the code base - we currently re-compute them multiple times. Since these values do not change and are already being computed by the RMAPS framework, we might as well just save them for re-use. This commit was SVN r12829.	2006-12-12 16:07:23 +00:00
Ralph Castain	28ce8e5e5e	Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior). In "--npernode" operation, the "-np" command line parameter is ignored. This commit was SVN r12826.	2006-12-12 00:54:05 +00:00
Ralph Castain	8314e8dbb9	Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode: 1. no -np provided - put one proc/node across all allocated nodes 2. -np N provided, N > #nodes - we print a pretty error message and exit 3. -np N provided, N <= #nodes - put one proc/node across N nodes I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error. This commit was SVN r12821.	2006-12-11 18:07:07 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Ralph Castain	eb941d8ae2	Fix a bug that declared a node as "oversubscribed" a little early during the mapper procedure. This only affected the mapping procedure, and only if you had set the "--no-oversubscribe" flag. Kudos to Tim Prins for finding it. This commit was SVN r12757.	2006-12-05 13:04:27 +00:00
Ralph Castain	f771cc4fbd	Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option. Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution. This commit was SVN r12619.	2006-11-17 19:06:10 +00:00
Ralph Castain	ca5b4358fa	Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too. Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone. This commit was SVN r12616.	2006-11-17 02:58:46 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
Tim Prins	39bc652899	Refs trac:612 Make it so if -np was not passed and -pernode was, we map bynode This commit was SVN r12580. The following Trac tickets were found above: Ticket 612 --> https://svn.open-mpi.org/trac/ompi/ticket/612	2006-11-13 19:13:21 +00:00
Ralph Castain	30de73a712	Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include: 1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch" 2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command. 3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job 4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map. Enjoy! My personal favorite is the first one - helps with debugging. This commit was SVN r12379.	2006-10-31 22:16:51 +00:00
Tim Prins	894b220fbb	Fixes trac:487 Give a more intelligible error message when someone passes -nolocal and the only available node is the local node. This commit was SVN r12325. The following Trac tickets were found above: Ticket 487 --> https://svn.open-mpi.org/trac/ompi/ticket/487	2006-10-26 21:46:18 +00:00
Ralph Castain	601443e690	Fix the other end of the attribute store/retrieve to correctly handle attributes with no value This commit was SVN r12290.	2006-10-25 01:41:58 +00:00
Ralph Castain	46166a9c77	Fix a bug that would cause a segfault when someone specified an option that had no corresponding value This commit was SVN r12288.	2006-10-24 22:06:18 +00:00
Ralph Castain	7a77ef0ae3	Given the amount of pain singletons cause, one can't help but wonder if it REALLY was that much trouble for people to type "mpirun -n 1 foo"....sigh. Get the ordering right so that a singleton can start. Protect the rmgr copy app_context function from NULL fields Tell the mapper it is okay for there not to be a pre-existing mapping plan for a parent when dynamically spawning processes This commit was SVN r12257.	2006-10-23 15:15:45 +00:00
Ralph Castain	ab7bbb80a5	Teach the mapper to correctly handle the unbalanced --host scenario. We now map in a more expected fashion. This commit was SVN r12240.	2006-10-20 20:48:24 +00:00
Tim Prins	28bf4d85ab	A couple of small fixes: - It is possible to leave a byslot/bynode routine and have cur_node_item be NULL, so check for that. - After we do an allocation where the user has provided a map (i.e. with --host), cur_node_item is pointing into the map list, not the global list. Change it to point into the global list. This commit was SVN r12232.	2006-10-20 19:00:17 +00:00
Ralph Castain	955d11fa7b	The bookmark now respects slot assignments a little better. It will not oversubscribe the first node, but will take only what is available there before moving on. See the comment in orte/mca/rmaps/round_robin/rmaps_rr.c if you want the details... :-) This commit was SVN r12230.	2006-10-20 18:24:14 +00:00
Ralph Castain	ec0bb9ffda	Fix the bookmark system - we now have children being correctly spawned where they should! Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come. This commit was SVN r12229.	2006-10-20 18:05:16 +00:00
George Bosilca	2aa3e51223	Nothing relevant. Only a set of castings to have a clean compile on Windows. The cl.exe compiler is pretty good at complaining about any kind of non explicit cast. This commit was SVN r12207.	2006-10-20 02:25:50 +00:00
Tim Prins	ade94b523b	Fixed a number of issues related to resource allocation: - Simplified the logic of the ras modules by moving the attribute handling into the base allocation function. This allows us to decide how to allocate based on the situation, and solves some of the allocation problems we were having with comm_spawn. - moved the proxy component into the base. This was done because we always want to call the proxy functions if we are not on a HNP regardless of the attributes passed. - Got rid of the hostfile component. What little logic was in it was moved into the base to deal with other circumstances. The hostfile information is currently being propagated into the registry by the RDS, so we just use what is already in the registry. - renamed some slurm functions so that they have the proper prefix. Not strictly necessary as they were static, but it makes debugging much easier. - fixed a buglet in the round_robin rmaps where we would return an error when really no error occurred. I tried to make proper corrections to all the ras modules, but I cannot test all of them. This commit was SVN r12202.	2006-10-19 23:33:51 +00:00
Ralph Castain	263f4379e8	Clean up an error in the mapper that caused "-hosts" to bomb. Update the mapper so it correctly points to the next node to be used if we are mapping by slots. As it was, if we had an app_context that used only one slot on a node, the next app_context would start on the next node - leaving a blank slot in-between. This commit was SVN r12193.	2006-10-19 18:57:29 +00:00
Tim Prins	81d400ddfd	break when they are equal, not not equal This commit was SVN r12182.	2006-10-18 21:47:01 +00:00
Tim Prins	ab964d096a	Need a terminating NULL... This commit was SVN r12180.	2006-10-18 20:52:31 +00:00
Ralph Castain	d0eb7d7216	Complete the attribute management functions. Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system. Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn). This commit was SVN r12179.	2006-10-18 20:02:16 +00:00
Ralph Castain	f4a458532b	This doesn't totally resolve the comm_spawn problem, but it helps a little. I'll continue working on it and hope to resolve it completely shortly. The issue primarily centers on where to start mapping the child job's processes, and how to deal with oversubscription that might result. At the moment, I am trying to resolve the first issue first (hey, that even sounds right!). This change does a couple of things: 1. Since the USE_PARENT_ALLOC attribute is a directive regarding allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file. 2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface). This commit was SVN r12164.	2006-10-18 14:01:44 +00:00
Ralph Castain	0c0fe022ff	This is a first cut at fixing the problem of comm_spawn children being mapped onto the same nodes as their parents. I am not convinced the behavior implemented here is the long-term right one, but hopefully it will help alleviate the situation for now. In this implementation, we begin mapping on the first node that has at least one slot available as measured by the slots_inuse versus the soft limit. If none of the nodes meet that criterion, we just start at the beginning of the node list since we are oversubscribed anyway. Note that we ignore this logic if the user specifies a mapping - then it's just "user beware". The real root cause of the problem is that we don't adjust sched_yield as we add processes onto a node. Hence, the node becomes oversubscribed and performance goes into the toilet. What we REALLY need to do to solve the problem is: (a) modify the PLS components so they reuse the existing daemons, (b) create a way to tell a running process to adjust its sched_yield, and (c) modify the ODLS components to update the sched_yield on a process per the new method Until we do that, we will continue to have this problem - all this fix (and any subsequent one that focuses solely on the mapper) does is hopefully make it happen less often. This commit was SVN r12145.	2006-10-17 19:35:00 +00:00
Ralph Castain	699ffcf359	Restore the "bynode" mapping functionality - accidentally deleted setting of parameter This commit was SVN r12078.	2006-10-10 19:41:22 +00:00
George Bosilca	7dadc1832d	Correctly export the required functions. They are defined in a private file, but they are completely public. This commit was SVN r12070.	2006-10-10 04:54:51 +00:00
Ralph Castain	2e09128337	Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destructor to fail!! Restore the OBJ_RELEASE calls to cleanup map objects. This commit was SVN r12064.	2006-10-07 22:44:00 +00:00
Ralph Castain	98dd57b70e	Add a new option to launch "pernode" - launches one process/node across all available nodes. The other options also work correctly: "-bynode" with no -np will launch on all slots, mapped on a per-node basis. This commit was SVN r12063.	2006-10-07 19:50:12 +00:00
Ralph Castain	ae79894bad	Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue. I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there). Gridengine compiles but I cannot test (believe it likely will run). Poe and xgrid compile to the extent they can without the proper include files. This commit was SVN r12059.	2006-10-07 15:45:24 +00:00
George Bosilca	cda46efd2a	Some missing DECLSPEC This commit was SVN r12047.	2006-10-06 15:21:52 +00:00
George Bosilca	090b8a9098	opal_list_is_empty returns true or false ... This commit was SVN r12000.	2006-10-05 05:26:08 +00:00
George Bosilca	ad5810e33f	ORTE_DECLSPEC what needs to be ORTE_DECLSPEC. This commit was SVN r11997.	2006-10-05 05:22:22 +00:00
Ralph Castain	faf3a558e6	Missing CR at end of file This commit was SVN r11959.	2006-10-03 18:17:52 +00:00
Ralph Castain	cd7d87aa7b	Define the map data types for dss compatibility. Setup to debug bproc This commit was SVN r11955.	2006-10-03 17:40:00 +00:00
Ralph Castain	4e39878944	Add a "dump" capability to the DSS so one can display a single data value to an output stream. Add some comments to the map type def in prep for building its data type support. This commit was SVN r11947.	2006-10-03 08:40:35 +00:00
Ralph Castain	121f834776	Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component. Still need some work on the proxy component, and on job termination for persistent daemon case. This commit was SVN r11928.	2006-10-02 00:46:31 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
George Bosilca	f52c10d18e	And ORTE is ready for prime-time. All Windows tricks are in: - use the OPAL functions for PATH and environment variables - make all headers C++ friendly - no unnamed structures - no implicit cast. Plus a full implementation for the orte_wait functions. This commit was SVN r11347.	2006-08-23 03:32:36 +00:00
George Bosilca	6afa4c6c64	Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3 different macros, one for each project. Therefore, now we have OPAL_DECLSPEC, ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project. This commit was SVN r11270.	2006-08-20 15:54:04 +00:00
Ralph Castain	8c7f0ed9ae	Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system. Other changes: 1. Remove the old xcpu components as they are not functional. 2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one. This will require an autogen/configure, I'm afraid. This commit was SVN r11228.	2006-08-16 16:35:09 +00:00
Ralph Castain	5dfd54c778	With the branch to 1.2 made.... Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced). Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up). I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t). In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but... Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems. This commit was SVN r11204.	2006-08-15 19:54:10 +00:00
Jeff Squyres	c2d4dfce78	Remove unused variable This commit was SVN r10985.	2006-07-25 21:43:21 +00:00
Ralph Castain	65acc9325a	Fix a bug that crept in during the last change to support "mpirun a.out" operations. Since we now reserve a range of vpids for each app_context, we no longer need to track the rank and offset the starting vpid each time through the mapper - the name service automatically accounts for the offset when allocating the next starting vpid for the job. This should be shifted to v1.1. This commit was SVN r10916.	2006-07-20 21:06:15 +00:00
Ralph Castain	11125dd67a	George has a retarded compiler - but that's okay. This will quiet its warning system. This commit was SVN r10736.	2006-07-11 15:27:02 +00:00
Josh Hursey	9a31060b6d	Fix r10725 so that the trunk builds again. This commit was SVN r10733. The following SVN revision numbers were found above: r10725 --> open-mpi/ompi@ae222cca5b	2006-07-11 14:48:31 +00:00
Ralph Castain	ae222cca5b	Include the help file so it can be accessed This commit was SVN r10725.	2006-07-11 12:15:25 +00:00
Ralph Castain	6129a5a887	Enable -host support for "mpirun a.out". You can now execute on all slots on specified nodes within your overall allocation. This commit was SVN r10713.	2006-07-11 02:59:23 +00:00
Ralph Castain	febc143d8c	Per LANL's stated need, add functionality that runs a.out across ALL available process slots if no num_proc is specified on the command line. However, please note the following limitation: we ONLY allow ONE application to be specified on the command line when this feature is invoked. If multiple apps are specified, the user MUST also specify the number to be launched for each and every one of them. Update the help text to report errors when not following that rule. Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base. The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so. This commit was SVN r10708.	2006-07-10 21:25:33 +00:00
Ralph Castain	3d220cbd48	This patch fixes several issues relating to comm_spawn and N1GE. In particular, it does the following: 1. Modifies the RAS framework so it correctly stores and retrieves the actual slots in use, not just those that were allocated. Although the RAS node structure had storage for the number of slots in use, it turned out that the base function for storing and retrieving that information ignored what was in the field and simply set it equal to the number of slots allocated. This has now been fixed. 2. Modified the RMAPS framework so it updates the registry with the actual number of slots used by the mapping. Note that daemons are still NOT counted in this process as daemons are NOT mapped at this time. This will be fixed in 2.0, but will not be addressed in 1.x. 3. Added a new MCA parameter "rmaps_base_no_oversubscribe" that tells the system not to oversubscribe nodes even if the underlying environment permits it. The default is to oversubscribe if needed and the underlying environment permits it. I'm sure someone may argue "why would a user do that?", but it turns out that (looking ahead to dynamic resource reservations) sometimes users won't know how many nodes or slots they've been given in advance - this just allows them to say "hey, I'd rather not run if I didn't get enough". 4. Reorganizes the RMAPS framework to more easily support multiple components. A lot of the logic in the round_robin mapper was very valuable to any component - this has been moved to the base so others can take advantage of it. 5. Added a new test program "hello_nodename" - just does "hello_world" but also prints out the name of the node it is on. 6. Made the orte_ras_node_t object a full ORTE data type so it can more easily be copied, packed, etc. This proved helpful for the RMAPS code reorganization and might be of use elsewhere too. This commit was SVN r10697.	2006-07-10 14:10:21 +00:00
Jeff Squyres	538965aeb0	Final merge of stuff from /tmp/tm-stuff tree (merged through /tmp/tm-merge). Validated by RHC. Summary: - Add --nolocal (and -nolocal) options to orterun - Make some scalability improvements to the tm pls This commit was SVN r10651.	2006-07-04 20:12:35 +00:00
Josh Hursey	58110f9fc9	Fixes Ticket #125 for both the trunk and v1.1 branch. This commit will apply cleanly to the v1.1 branch, and should be moved over once I get someone to verify it. The problem is outlined in the bug. The fix was to move the setting of the app context index (idx) before we put it in the GPR so that it is propagated to the gpr. The reason this hasn't bitten us before is because we init app->idx to 0, which is true most of the time. The exception is MPI_Comm_spawn_multiple, in which we put in more than one app context and thus care about correct indexing. This was causing memory corruption down the line by overrunning the mapping array. This commit also puts in a check to make sure that we error out if we ever try to do that again. This commit was SVN r10380.	2006-06-15 22:14:07 +00:00
Josh Hursey	2f20a38c98	This is a fix for bug Ticket #27. We were stuck in an infinite loop inside the rmaps round_robin component when the user specified a host, then oversubscribed it. Instead of returning an error, we looped forever. For example: $ cat hostfile A slots=2 max-slots=2 B slots=2 max-slots=2 $ mpirun -np 3 --hostfile hostfile --host B <hang> The loop would not terminate because both host A and B are in the 'nodes' structure as they are both allocated to the job. However, after allocating 2 slots to host B, we remove it from the node list leaving us with a 'nodes' structure with just A in it. Since we can't use host A, we keep looping here until we find a node that we can use. This patch checks to make sure that if we get into this situation where rmaps is looping over the list a second time without finding a node during the first pass then we know that there are no nodes left to use, so we have a resource allocation error, and should return to the user. This patch should be moved to all of the release branches. This commit was SVN r10131.	2006-05-31 03:42:01 +00:00
Brian Barrett	566a050c23	Next step in the project split, mainly source code re-arranging - move files out of toplevel include/ and etc/, moving it into the sub-projects - rather than including config headers with <project>/include, have them as <project> - require all headers to be included with a project prefix, with the exception of the config headers ({opal,orte,ompi}_config.h mpi.h, and mpif.h) This commit was SVN r8985.	2006-02-12 01:33:29 +00:00
Ralph Castain	1abe8ef368	Well, it certainly helps triggers to fire if the respective responsible routines adjust the counters! The INIT counter is supposed to be adjusted when the processes are mapped - this is now done correctly. The LAUNCHED counter is supposed to be adjusted when the pls sets the process pid info into the registry and changes the state to LAUNCHED. This could probably be changed to have that function use the set_proc_soh API, but this fixes the problem for now. Thanks to Brian for finding that the triggers were not being fired. This commit was SVN r8948.	2006-02-09 15:39:06 +00:00
Ralph Castain	4b9f015c0b	Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list. This commit was SVN r8912.	2006-02-07 03:32:36 +00:00
George Bosilca	7d8d516a4a	A bunch of fixes for Windows support. - protection with __WINDOWS__ and not WIN32 or _WIN32 - protect all the headers This commit was SVN r8463.	2005-12-12 20:04:00 +00:00
Jeff Squyres	31336e4773	Add some missing headers / correct one installation directory This commit was SVN r8408.	2005-12-08 04:00:52 +00:00
Jeff Squyres	6fbd321442	Fix a bunch of install locations for header files This commit was SVN r8406.	2005-12-08 00:54:44 +00:00
Brian Barrett	20cea60b82	* fix "make distclean" error in PML * turns out (duh!) that there was a reason that the <projectdir>dir variable was set in the AM conditional. If not, stupid directories are created and not needed... duh. This commit was SVN r8205.	2005-11-20 07:41:09 +00:00
Brian Barrett	8faa1884f0	* The last of the build system optimizations. Combine the component and component/base Makefile.am files, reducing the time configure spends stamping out Makefiles at the end * Install base_impl.h file when devel-headers are being installed This commit was SVN r8200.	2005-11-20 01:03:01 +00:00
George Bosilca	c802d54696	The return type is an int. Casting it to a size_t before checking if it's bigger than zero leads to a true condition ... always ... This commit was SVN r8114.	2005-11-11 06:34:14 +00:00
Tim Woodall	7f20198d49	Filter the set of data returned to the daemons during startup using the new get_conditional command to improve scalability during launch This commit was SVN r8097.	2005-11-10 16:44:51 +00:00
Jeff Squyres	42ec26e640	Update the copyright notices for IU and UTK. This commit was SVN r7999.	2005-11-05 19:57:48 +00:00
Jeff Squyres	0379b27969	Add missing DESTRUCT This commit was SVN r7948.	2005-11-01 13:35:44 +00:00
Tim Woodall	e27dfb180d	yet another fix This commit was SVN r7941.	2005-10-31 21:59:14 +00:00
Tim Woodall	aa5b61e4f1	corrections for multiple app contexts This commit was SVN r7939.	2005-10-31 20:37:44 +00:00
Jeff Squyres	8503fce61b	Remove debugging message This commit was SVN r7924.	2005-10-28 18:53:20 +00:00
Tim Woodall	3fd351117a	removed debug This commit was SVN r7902.	2005-10-27 21:07:49 +00:00
Tim Woodall	793836da57	removed debug This commit was SVN r7897.	2005-10-27 17:10:49 +00:00
Tim Woodall	60754acae8	- modified rmaps data structures to point directly to ras node - modified rsh to NOT query for each nodes mapping, as all data is already available in the rmaps structures This commit was SVN r7894.	2005-10-27 17:04:10 +00:00
Brian Barrett	1302cb4072	The next in a long line of crazed build system changes from Brian. This was originally suggested by Ralf Wildenhues, to try to speed autogen, configure, and make (and possibly even make install). Use automake's include directive to drastically reduce the number of Makefile files (although the number of Makefile.am files is the same - most are just included in a top-level Makefile.am). Also use an Automake SUBDIRs feature to eliminate the dynamic-mca tree, which was no longer really needed. This makes adding a framework easier (since you don't have to remember the dynamic-mca tree) and makes building faster (as make doesn't have to recurse through the dynamic-mca tree) This commit was SVN r7777.	2005-10-17 00:21:10 +00:00
Josh Hursey	0f08e87a1f	Fixed a max_slots off by one problem that Brian highlighted. Also cleaned up the error message when allocating over the number of slots available. This commit was SVN r7715.	2005-10-12 02:09:56 +00:00
Josh Hursey	d5ebb5c46a	fix a compiler warning This commit was SVN r7674.	2005-10-08 17:03:12 +00:00
Jeff Squyres	0629cdc2d7	Bring back the changes from /tmp/jjhursey-rmaps. Specific merge command: svn merge -r 7567:7663 https://svn.open-mpi.org/svn/ompi/tmp/jjhursey-rmaps . (where "." is a trunk checkout) The logs from this branch are much more descriptive than I will put here (including a really long description from last night). Here's the short version: - fixed some broken implementations in ras and rmaps - "orterun --host ..." now works and has clearly defined semantics (this was the impetus for the branch and all these fixes -- LANL had a requirement for --host to work for 1.0) - there is still a little bit of cleanup left to do post-1.0 (we got correct functionality for 1.0 -- we did not fix bad implementations that still "work") - rds/hostfile and ras/hostfile handshaking - singleton node segment assignments in stage1 - remove the default hostfile (no need for it anymore with the localhost ras component) - clean up pls components to avoid duplicate ras mapping queries - [possible] -bynode/-byslot being specific to a single app context This commit was SVN r7664.	2005-10-07 22:24:52 +00:00
Andrew Friedley	82ee2933a5	- Add an opal_show_help() to the pls fork module to explain what went wrong when the execv to start the application fails. - Add a couple opal_show_help()'s to indicate when not enough slots/nodes are available to satisfy a request. This commit was SVN r7555.	2005-09-30 14:30:21 +00:00
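
The termination check described in commit 2f20a38c98 (SVN r10131) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the rmaps round_robin component: `node_t`, `map_round_robin`, the `usable` flag, and the `MAP_*` constants are hypothetical names, and nodes excluded by `--host` are modeled as a flag rather than being removed from a list. The idea is to remember whether a full pass over the node list placed anything, and to report a resource error instead of looping again when it did not.

```c
#include <stddef.h>

/* Hypothetical return codes; the real ORTE constants are not used here. */
#define MAP_OK 0
#define MAP_ERR_OUT_OF_RESOURCE (-1)

/* Hypothetical, simplified node record. */
typedef struct {
    int slots_inuse;  /* processes already placed on this node */
    int max_slots;    /* hard limit; 0 means no hard limit */
    int usable;       /* nonzero if the user's host list allows this node */
} node_t;

/* Place num_procs processes one per node per pass. If a complete pass
 * places nothing, no node can accept more work, so return an error
 * rather than cycling over the list forever. */
static int map_round_robin(node_t *nodes, size_t num_nodes, int num_procs)
{
    int mapped = 0;

    while (mapped < num_procs) {
        int placed_this_pass = 0;

        for (size_t i = 0; i < num_nodes && mapped < num_procs; ++i) {
            node_t *n = &nodes[i];

            if (!n->usable) {
                continue;  /* excluded by the user's host list */
            }
            if (n->max_slots > 0 && n->slots_inuse >= n->max_slots) {
                continue;  /* hard limit reached */
            }
            n->slots_inuse++;
            mapped++;
            placed_this_pass = 1;
        }

        if (!placed_this_pass) {
            return MAP_ERR_OUT_OF_RESOURCE;
        }
    }
    return MAP_OK;
}
```

With two nodes of `max_slots` 2 where only the second is usable, a request for three processes fills the second node and then returns `MAP_ERR_OUT_OF_RESOURCE`, which mirrors the hostfile scenario in that commit message.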

267 commits
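
The class of bug fixed in commit c802d54696 (SVN r8114) can be illustrated with a small C sketch. The function names below are hypothetical and do not come from the Open MPI sources. The sketch assumes only standard C conversion rules: converting a negative `int` to `size_t` yields a large positive value, so a "greater than zero" test on the converted value succeeds for error codes as well.

```c
#include <stdbool.h>
#include <stddef.h>

/* Buggy form: the cast converts a negative return code such as -1 into
 * a very large unsigned value, so the test passes for errors too. */
static bool buggy_is_positive(int rc)
{
    return (size_t)rc > 0;
}

/* Corrected form: compare the value in its original signed type. */
static bool fixed_is_positive(int rc)
{
    return rc > 0;
}
```

Here `buggy_is_positive(-1)` evaluates to true while `fixed_is_positive(-1)` evaluates to false; only a return value of exactly zero makes the buggy check fail.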