openmpi

Author	SHA1	Message	Date
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Ralph Castain	e1e224b81a	Silence a couple of minor compiler warnings This commit was SVN r18617.	2008-06-09 12:57:41 +00:00
Ralph Castain	7bee71aa59	Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit. Add a new function to opal_progress that tells us our recursion depth to support that solution. Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death... Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang. After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them. Thank you Jeff! :-/ This commit was SVN r18611.	2008-06-06 19:36:27 +00:00
George Bosilca	b2aa751c28	Remove a race condition in the threaded mode. As a callback is allowed to modify the callback array (add or remove), make sure we don't call the same callback twice if it gets removed in another thread. This commit was SVN r18608.	2008-06-06 15:54:40 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
Jeff Squyres	12a3fe57e1	As pointed out by Ralf W. (http://www.open-mpi.org/community/lists/devel/2008/06/4095.php), these dependencies don't need to be here. This commit was SVN r18603.	2008-06-06 01:20:47 +00:00
Jeff Squyres	b123629e6a	Fix CIDs 458, 716, 717: ensure that strings are long enough to always be properly \0 terminated. This commit was SVN r18602.	2008-06-06 00:59:08 +00:00
Jeff Squyres	e2b08aaca4	Fix bad free's found in CID 707 and CID 708. This commit was SVN r18600.	2008-06-05 20:49:33 +00:00
Lenny Verkhovsky	a8b5dcb204	Added more output info about socket:core pair in paffinity / rankfile components This commit was SVN r18589.	2008-06-05 10:28:44 +00:00
Ralph Castain	ca91ec525b	Add a suffix to the opal_output stream descriptor object - we can now output both a prefix and a suffix for a given stream. Default the suffix to NULL. Remove lingering references to a filtering system as this will no longer be implemented. This commit was SVN r18586.	2008-06-04 20:52:20 +00:00
Josh Hursey	78f14b5255	Fix the none.checkpoint command. orte-checkpoint/orte-restart do not seem to fully like orte_output, so revert them to opal_output for now. Since we have no need for the additional complexity of orte_output we can drop it for now and revisit this if anyone needs it later. It seems that if you set the verbose level on an output handle then try to call a normal orte_output() on it then the message will not be printed. This is the same for opal_output, and seems incorrect to me because it stops some error messages from being printed out if you do not directly specify opal_output(0, ...). Maybe someone should take a look at this. orte-checkpoint would segv if passed an incorrect PID. Fixed the return code so it errors out properly. Thanks to Eric Roman for bringing this to my attention. This commit was SVN r18583.	2008-06-04 14:44:11 +00:00
Jeff Squyres	530a15baa4	Fix cross-compiling scenario with valgrind.m4. This commit was SVN r18579.	2008-06-04 11:58:41 +00:00
Shiqing Fan	2dc812f720	Clean configure.m4 of memchecker/valgrind. If Valgrind is requested but the wrong version is supplied, print error messages and stop. Save the CPPFLAGS in opal_memchecker_valgrind_CPPFLAGS, which could be used in Makefile.am. Many thanks to Jeff. This commit was SVN r18573.	2008-06-04 11:46:50 +00:00
Ralph Castain	9927b2445c	Remove the filter framework - the xml support will have to be provided in a different manner that will be implemented shortly This commit was SVN r18572.	2008-06-04 09:04:51 +00:00
Jeff Squyres	75a97ebbf0	Many thanks to Ralf W. for finding a subtle bug in these Makefile.am's that can sometimes cause problems with "make -j [N>1] install". Ensure to make the target directory before we copy stuff into it -- read the thread starting here for more details: http://www.open-mpi.org/community/lists/devel/2008/06/4080.php This commit was SVN r18570.	2008-06-04 01:28:03 +00:00
Jeff Squyres	3b568d4b14	Remove an old attempt to understand the tradeoffs with using GNU libc's malloc_hooks functionality, which turned out to be totally unusable in practice. I think we just always forgot to remove them. This commit was SVN r18547.	2008-05-30 00:11:12 +00:00
Shiqing Fan	b67a1244b6	Some small fixes. This commit was SVN r18541.	2008-05-29 15:05:28 +00:00
Jeff Squyres	ed5bc2cd08	Per http://www.open-mpi.org/community/lists/devel/2008/05/4057.php , remove the darwin memory hooks component This commit was SVN r18531.	2008-05-28 23:50:53 +00:00
Sharon Melamed	64fe554b8e	Fix bug in carto component select. After the insertion of mca_base_select the carto file component was never selected. This commit was SVN r18496.	2008-05-26 12:52:41 +00:00
Jeff Squyres	d45cb82ecc	Fix two bugs in PLPA: 1. If we don't have the topology information, don't bother trying to create cross-referencing information 2. Ensure to only check for valid processor ID's This commit was SVN r18462.	2008-05-20 12:57:12 +00:00
Terry Dontje	ef7ac86929	created opal_version_string and orte_version_string to match the ompi changes made in r18345 for ompi_version_string. This was done per request from Jeff Squyres to maintain consistency and to remove some warnings caused by the non-use of some static const char. This commit was SVN r18461. The following SVN revision numbers were found above: r18345 --> open-mpi/ompi@8dd0421015	2008-05-20 12:13:19 +00:00
Jeff Squyres	ea1582856f	Clarify some messages, move AC_ARG_WITH outside of the conditional This commit was SVN r18459.	2008-05-19 23:13:31 +00:00
Jeff Squyres	d12b21e21b	Ensure that if an error occurs, we actually return that error rather than an undefined value (which could be 0/OPAL_SUCCESS). This commit was SVN r18452.	2008-05-19 11:57:44 +00:00
Terry Dontje	517abf9b09	This commit fixes trac:1288. This commit was SVN r18441. The following Trac tickets were found above: Ticket 1288 --> https://svn.open-mpi.org/trac/ompi/ticket/1288	2008-05-15 17:40:08 +00:00
Jeff Squyres	fb17097de4	Make ompi_info correctly display "filter" components This commit was SVN r18435.	2008-05-13 20:56:20 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not their opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Josh Hursey	c70ba283b8	Fix a warning, and some return codes. Thanks to Jeff for pointing this out to me. This commit was SVN r18430.	2008-05-13 13:10:16 +00:00
Josh Hursey	4236255700	Add the framework name to the verbose message for improved debugging. Also set the 'best_priority' to the smallest 32 bit integer possible so negative priority components can be selected if they are the highest ranking component available. This commit was SVN r18427.	2008-05-12 14:07:37 +00:00
Rainer Keller	b0cbeb0b41	- Add detection of __attribute__((hot)) and __attribute__((cold)) to allow explicit grouping of hot functions into similar code sections upon link-time. Should decrease TLB misses (iff the code- section is really too large)... Candidates for __opal_attribute_hot__ are MPI_Isend MPI_Irecv, MPI_Wait, MPI_Waitall Candidates for __opal_attribute_cold__ are MPI_Init, MPI_Finalize and MPI_Abort... This commit was SVN r18421.	2008-05-10 10:38:51 +00:00
Josh Hursey	9b0cd5b02a	Remove the 'include' check from mca_base_select. include/exclude is handled by the mca_base_open functionality and it is redundant (and wrong) to check this in the select function. Thanks to Pak Lui for bringing this to my attention. This commit was SVN r18418.	2008-05-08 23:41:07 +00:00
Josh Hursey	da2f1c58e2	Some checkpoint/restart cleanup. * Remove the opal_only option. This was suffering from bit rot, and no one uses it. It can be added back fairly easily if wanted. * Cleanup metadata interactions at the local level. * Touch up some of the INC functionality (fix typos and a minor ordering issue) This commit was SVN r18416.	2008-05-08 18:47:47 +00:00
Josh Hursey	8739edc580	Fix a couple of OPAL_DECLSPEC missing from r18407 This commit was SVN r18415. The following SVN revision numbers were found above: r18407 --> open-mpi/ompi@7c7b9b0486	2008-05-08 18:44:23 +00:00
George Bosilca	fe495e429a	Completely remove the kqueue support on Mac OS X. Remove the kqueue test that tries to detect whether kqueue might work with ptys. This commit was SVN r18411.	2008-05-08 02:33:23 +00:00
Ralph Castain	7c7b9b0486	Do a little cleanup on the opal graph class and opal carto framework to conform to OMPI naming conventions and avoid potential conflict with user applications - no change in functionality, passes carto test program This commit was SVN r18407.	2008-05-07 19:33:49 +00:00
Josh Hursey	9971bc9d95	Merge in the mca_base_select changes per RFC: http://www.open-mpi.org/community/lists/devel/2008/04/3779.php {{{ svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play . }}} Any components not in the trunk, but in one of the affected frameworks must be updated. Contact the list, look at the RFC, or look at the diff for how to do this. Sorry for the early commit of this, but I wanted to get it in today (per RFC) and didn't know if I would have a chance later today. This commit was SVN r18381.	2008-05-06 18:08:45 +00:00
Aurelien Bouteiller	c06620ad70	Add a const to the parameters of opal_dss_compare. This commit was SVN r18374.	2008-05-05 19:12:01 +00:00
Brad Penoff	4f104ba5d1	Add header for FreeBSD. This commit was SVN r18366.	2008-05-03 23:07:45 +00:00
George Bosilca	f5dfc005a4	Only check for /proc/cpuinfo if we are on a supported architecture. This commit was SVN r18331.	2008-04-29 22:36:18 +00:00
George Bosilca	465f690f90	We need to force the compiler to preprocess these files as some of them use #include. The standard way is to rename the file to .S instead of .s. This commit was SVN r18290.	2008-04-24 21:40:40 +00:00
Josh Hursey	2c736873bb	Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors. The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shutdown and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using has been recycled so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge. The fix is to have the OMPI level shutdown tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit. Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it. * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level. * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components. * Update ft_event functions in PML and BML to handle the new restart state. * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging. This commit was SVN r18276.	2008-04-24 17:54:22 +00:00
Shiqing Fan	4a9787979e	When valgrind is not available or it is deselected (--without-valgrind, --with-valgrind=no), don't compile this component, continue without aborting. This commit was SVN r18243.	2008-04-23 11:50:42 +00:00
Josh Hursey	cc83d41ad9	Merge in tmp/jjh-scratch {{{ svn merge -r 18218:18240 https://svn.open-mpi.org/svn/ompi/tmp/jjh-scratch . }}} Contains: * Primarily a fix for a user reported problem where a cached file descriptor is causing a SIGPIPE on restart. * Cleanup some small memory leaks from using mca_base_param_env_var() - Thanks Jeff * Cleanup ORTE FT tool compilation in non-FT builds - Thanks Tim P. * Cleanup mpi interface with misplaced {{{OPAL_CR_ENTER_LIBRARY}}} - Thanks Terry * Some other sundry cleanup items all dealing with C/R functionality in the trunk. This commit was SVN r18241.	2008-04-23 00:17:12 +00:00
Jeff Squyres	db2695ccab	Make the symbols be visible. This commit was SVN r18201.	2008-04-18 00:26:17 +00:00
Ralph Castain	fa082cafa9	Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex. Note: this is an early preliminary step in the movement of portions of the datatype engine to the opal layer. This commit was SVN r18198.	2008-04-17 20:43:56 +00:00
George Bosilca	01148b77dc	Generate the help message for the available event ops. Now the list only contains the ones that are compiled on the current ompi. This commit was SVN r18196.	2008-04-17 18:16:54 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Shiqing Fan	49fbc4e795	These functions should always have a return value. This commit was SVN r18174.	2008-04-16 13:54:15 +00:00
George Bosilca	b359d84661	Use the correct prefix. This commit was SVN r18048.	2008-03-31 21:42:59 +00:00
George Bosilca	be2454e0c5	Default the temporary directory to /tmp if no special environment variables are set. This commit was SVN r18046.	2008-03-31 20:15:49 +00:00
George Bosilca	ee784b601e	For consistency reasons always use opal_home_directory and opal_tmp_directory. This commit was SVN r18043.	2008-03-31 18:13:41 +00:00


978 Commits