Commit graph

1825 commits

Author SHA1 Message Date
Ralph Castain
7183179f56 Provide native integration with SLURM 2.0's OMPI support
This commit was SVN r21865.
2009-08-21 18:03:34 +00:00
Shiqing Fan
baa81a6525 Add a missing header to compile on Windows.
This commit was SVN r21861.
2009-08-21 07:20:21 +00:00
Rainer Keller
8e1b23779f - Replace combinations of
      #if defined (c_plusplus) || defined (__cplusplus)
   followed by
      extern "C" {
   and the closing counterpart with BEGIN_C_DECLS and END_C_DECLS.

   Notable exceptions are:
    - opal/include/opal_config_bottom.h:
      This is our generated code, which itself defines BEGIN_C_DECLS and
      END_C_DECLS
    - ompi/mpi/cxx/mpicxx.h:
      Here we do not include opal_config_bottom.h.
    - Belongs to external code:
      opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.c
      opal/mca/backtrace/darwin/MoreBacktrace/MoreDebugging/MoreBacktrace.h
    - opal/include/opal/prefetch.h:
      Has C++ specific macros that are protected.

    - Had #if ... } #endif  _and_ END_C_DECLS (aka end up with 2x
      END_C_DECLS)
      ompi/mca/btl/openib/btl_openib.h
    - opal/event/event.h has #ifdef __cplusplus as BEGIN_C_DECLS...
    - opal/win32/ompi_process.h: had extern "C"\n {...
      opal/win32/ompi_process.h: ditto
    - ompi/mca/btl/pcie/btl_pcie_lex.l: needed to add *_C_DECLS
      ompi/mpi/f90/test/align_c.c: ditto
    - ompi/debuggers/msgq_interface.h: used #ifdef __cplusplus
    - ompi/mpi/f90/xml/common-C.xsl: Amend

   Tested on Linux using --with-openib and --with-mx

   The following do not include opal_config.h, orte_config.h, or
   ompi_config.h directly
   (but possibly include other header files that include one of the above):
      ompi/mca/bml/r2/bml_r2_ft.h
      ompi/mca/btl/gm/btl_gm_endpoint.h
      ompi/mca/btl/gm/btl_gm_proc.h
      ompi/mca/btl/mx/btl_mx_endpoint.h
      ompi/mca/btl/ofud/btl_ofud_endpoint.h
      ompi/mca/btl/ofud/btl_ofud_frag.h
      ompi/mca/btl/ofud/btl_ofud_proc.h
      ompi/mca/btl/openib/btl_openib_mca.h
      ompi/mca/btl/portals/btl_portals_endpoint.h
      ompi/mca/btl/portals/btl_portals_frag.h
      ompi/mca/btl/sctp/btl_sctp_endpoint.h
      ompi/mca/btl/sctp/btl_sctp_proc.h
      ompi/mca/btl/tcp/btl_tcp_endpoint.h
      ompi/mca/btl/tcp/btl_tcp_ft.h
      ompi/mca/btl/tcp/btl_tcp_proc.h
      ompi/mca/btl/template/btl_template_endpoint.h
      ompi/mca/btl/template/btl_template_proc.h
      ompi/mca/btl/udapl/btl_udapl_eager_rdma.h
      ompi/mca/btl/udapl/btl_udapl_endpoint.h
      ompi/mca/btl/udapl/btl_udapl_mca.h
      ompi/mca/btl/udapl/btl_udapl_proc.h
      ompi/mca/mtl/mx/mtl_mx_endpoint.h
      ompi/mca/mtl/mx/mtl_mx.h
      ompi/mca/mtl/psm/mtl_psm_endpoint.h
      ompi/mca/mtl/psm/mtl_psm.h
      ompi/mca/pml/cm/pml_cm_component.h
      ompi/mca/pml/csum/pml_csum_comm.h
      ompi/mca/pml/dr/pml_dr_comm.h
      ompi/mca/pml/dr/pml_dr_component.h
      ompi/mca/pml/dr/pml_dr_endpoint.h
      ompi/mca/pml/dr/pml_dr_recvfrag.h
      ompi/mca/pml/example/pml_example.h
      ompi/mca/pml/ob1/pml_ob1_comm.h
      ompi/mca/pml/ob1/pml_ob1_component.h
      ompi/mca/pml/ob1/pml_ob1_endpoint.h
      ompi/mca/pml/ob1/pml_ob1_rdmafrag.h
      ompi/mca/pml/ob1/pml_ob1_recvfrag.h
      ompi/mca/pml/v/pml_v_output.h
      opal/include/opal/prefetch.h
      opal/mca/timer/aix/timer_aix.h
      opal/util/qsort.h
      test/support/components.h

This commit was SVN r21855.

The following SVN revision numbers were found above:
  r2 --> open-mpi/ompi@58fdc18855
2009-08-20 11:42:18 +00:00
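
The replacement described in 8e1b23779f can be sketched as below. The macro definitions are an assumption modeled on the common idiom (the real ones are generated into opal_config_bottom.h and may differ in detail), and `example_add` is a hypothetical function used only for illustration.

```c
/* Assumed definitions, for illustration only. */
#undef BEGIN_C_DECLS
#undef END_C_DECLS
#if defined(c_plusplus) || defined(__cplusplus)
# define BEGIN_C_DECLS extern "C" {
# define END_C_DECLS   }
#else
# define BEGIN_C_DECLS
# define END_C_DECLS
#endif

/* Before the change, a header would open with
 *     #if defined(c_plusplus) || defined(__cplusplus)
 *     extern "C" {
 *     #endif
 * and close with the matching guarded "}".
 * After the change, the same header reads: */
BEGIN_C_DECLS

/* Hypothetical declaration that gets C linkage when included from C++. */
int example_add(int a, int b);

END_C_DECLS

/* Definition, normally in a separate .c file. */
int example_add(int a, int b)
{
    return a + b;
}
```

When compiled as C both macros expand to nothing, so the header is unchanged for C callers.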
Rainer Keller
3f742fc35b - add missing #include
This commit was SVN r21854.
2009-08-20 11:20:53 +00:00
Ralph Castain
40fc0b6367 Silence compiler warning
This commit was SVN r21850.
2009-08-20 04:57:23 +00:00
Ralph Castain
c3c642aa0d Add two new frameworks for sensing and predicting faults. This is just the bare-bones plumbing for now - will instantiate soon.
No ess modules reference these frameworks yet, so they are completely inactive.

This commit was SVN r21847.
2009-08-20 04:27:16 +00:00
Ralph Castain
646a3500a7 Correctly account for number of procs in the job
This commit was SVN r21843.
2009-08-20 00:07:38 +00:00
Ralph Castain
e66a0be796 First attempt at making OMPI respect external bindings. Detect any external bindings on the daemons, and use that to determine which sockets/cores to bind to.
I have no machine which allows me to do external binding, so I will have to ask others to test the new logic. However, I did verify that these changes don't break the existing logic when no external bindings were present.

This commit was SVN r21842.
2009-08-19 19:29:15 +00:00
Ralph Castain
dbb3cbe3dd Fix a reported problem when specifying orte_launch_agent - if only one word was given, we inadvertently appended a "NULL" to the end of the cmd.
This commit was SVN r21827.
2009-08-18 14:57:34 +00:00
Ralph Castain
3f3b46495e Add some error checking to the tm launcher
This commit was SVN r21818.
2009-08-14 03:13:02 +00:00
Ralph Castain
0005e6e834 Correct a couple of bugs in the rank_file mapper that were incorrectly assigning vpids.
Add a capability to parse the rankfile to extract node information in place of requiring both hostfile and rankfile for non-RM managed environments. The rankfile is -only- parsed for this IF the hostfile and -host options are not given. Otherwise, those are used to establish allocation info as we did before this commit.

This commit was SVN r21815.
2009-08-13 16:08:43 +00:00
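
A minimal sketch of pulling node information out of a rankfile line, as 0005e6e834 describes. It assumes lines of the form `rank <N>=<host> slot=<slots>`; `rankfile_line_host` is a hypothetical helper and is not the ORTE parser.

```c
#include <stdio.h>

/* Hypothetical helper: extract the rank and host name from one rankfile
 * line. The host buffer must hold at least 64 bytes.
 * Returns 0 on success, -1 if the line does not match the assumed form. */
int rankfile_line_host(const char *line, int *rank, char *host)
{
    if (line == NULL || rank == NULL || host == NULL) {
        return -1;
    }
    /* %63[^ \t\n] reads the host name up to the next whitespace. */
    if (sscanf(line, " rank %d=%63[^ \t\n]", rank, host) != 2) {
        return -1;
    }
    return 0;
}
```

A caller could collect the distinct host names returned this way to build the node list when neither a hostfile nor -host is given.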
Rainer Keller
cea3d68ef6 - Fix reference counting of daemons killed.
This commit was SVN r21810.
2009-08-12 14:04:50 +00:00
Shiqing Fan
bce2f44154 Update related .windows files with proper compiling properties, in order to have a successful DSO build.
This commit was SVN r21805.
2009-08-12 08:55:58 +00:00
Josh Hursey
843a61f7eb Add another missing header for FT that got lost in the shuffle yesterday.
This commit was SVN r21794.
2009-08-11 13:32:27 +00:00
Ralph Castain
1dc12046f1 Modify the OMPI paffinity and mapping system to support socket-level mapping and binding. Mostly refactors existing code, with modifications to the odls_default module to support the new capabilities.
Adds several new mpirun options:

* -bysocket - assign ranks on a node by socket. Effectively load balances the procs assigned to a node across the available sockets. Note that ranks can still be bound to a specific core within the socket, or to the entire socket - the mapping is independent of the binding.

* -bind-to-socket - bind each rank to all the cores on the socket to which they are assigned.

* -bind-to-core - currently the default behavior (maintained from prior default)

* -npersocket N - launch N procs for every socket on a node. Note that this implies we know how many sockets are on a node. Mpirun will determine its local values. These can be overridden by provided values, either via MCA param or in a hostfile.

Similar features/options are provided at the board level for multi-board nodes.

Documentation to follow...

This commit was SVN r21791.
2009-08-11 02:51:27 +00:00
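
The load-balancing effect of -bysocket and the proc count implied by -npersocket in 1dc12046f1 can be illustrated with a small sketch. Round-robin placement is an assumption here, and both function names are hypothetical; the actual mapper code may order ranks differently.

```c
#include <stddef.h>

/* Hypothetical helper: spread the procs assigned to one node across its
 * sockets in round-robin order. socket_of[i] receives the socket index of
 * the i-th local proc. Returns 0 on success, -1 on invalid arguments. */
int map_by_socket(size_t nprocs, size_t nsockets, size_t *socket_of)
{
    if (nsockets == 0 || (nprocs > 0 && socket_of == NULL)) {
        return -1;
    }
    for (size_t i = 0; i < nprocs; ++i) {
        socket_of[i] = i % nsockets;
    }
    return 0;
}

/* -npersocket N launches N procs for every socket on the node. */
size_t procs_for_npersocket(size_t n, size_t nsockets)
{
    return n * nsockets;
}
```

Mapping only decides which socket a rank belongs to; whether the rank is then bound to one core or to the whole socket is a separate choice, as the commit message notes.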
Rainer Keller
76469ea64a - Change the properties of a few files that obviously
don't need to be svn:executable...

This commit was SVN r21786.
2009-08-11 01:40:00 +00:00
Rainer Keller
784b9b9f5b - Based on and updated from Ken's patch: since CLE-2.1 no longer offers
the BATCH_PARTITION_ID, use the ras-alps-command.sh script to
   figure out the job's ID to query from ALPS.

   Gracefully report errors, update the help file and parse the sysconfig file

This commit was SVN r21772.
2009-08-07 01:15:09 +00:00
George Bosilca
0bf381e931 This patch tries to solve an issue on Leopard. The supposedly global
variables that are not initialized and are declared in a file that
doesn't export any globally visible function are marked as
non-initialized constants, i.e. uninitialized common symbols. For some
obscure reason, they get removed from the object files on Mac OS X.

So far I have found two solutions to this problem. One requires the addition
of "-c" to the linker command; the second one (corresponding to this
patch) forces them to become common initialized symbols.

This commit was SVN r21739.
2009-07-28 17:06:16 +00:00
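
The second approach in 0bf381e931 amounts to giving each affected global an explicit initializer. A minimal sketch, with a hypothetical variable name:

```c
/* Before (sketch): a tentative definition with no initializer, which
 * toolchains of that era emitted as an uninitialized common symbol.
 *
 *     int example_component_verbose;
 */

/* After (sketch): the explicit initializer turns it into a regular
 * defined symbol, so it is no longer treated as a common symbol. */
int example_component_verbose = 0;
```

The runtime value is the same either way, since C zero-initializes globals; only the symbol type seen by the linker changes.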
Ralph Castain
55e7365e7a Jeff correctly pointed out that char values > 127 also don't print. Adjust the xml output to handle those too.
Thanks Jeff!

This commit was SVN r21727.
2009-07-22 13:28:27 +00:00
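
A sketch of the escaping described in 55e7365e7a and 2f515c8357: bytes outside the printable ASCII range, including values above 127, are replaced so the XML stays well-formed. The `&#NNN;` numeric form and the name `xml_escape` are assumptions for illustration; the output format actually used may differ.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy inlen bytes from in to out, escaping XML
 * metacharacters and every byte below 32 or above 126.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if out is too small. */
int xml_escape(const char *in, size_t inlen, char *out, size_t outlen)
{
    size_t o = 0;

    if (outlen == 0) {
        return -1;
    }
    for (size_t i = 0; i < inlen; ++i) {
        /* Go through unsigned char so bytes > 127 are not negative. */
        unsigned char c = (unsigned char)in[i];
        char num[8];
        const char *rep = NULL;

        if (c == '&') {
            rep = "&amp;";
        } else if (c == '<') {
            rep = "&lt;";
        } else if (c == '>') {
            rep = "&gt;";
        } else if (c < 32 || c > 126) {
            snprintf(num, sizeof num, "&#%03u;", (unsigned)c);
            rep = num;
        }

        if (rep != NULL) {
            size_t n = strlen(rep);
            if (o + n >= outlen) {
                return -1;   /* no room for replacement plus NUL */
            }
            memcpy(out + o, rep, n);
            o += n;
        } else {
            if (o + 1 >= outlen) {
                return -1;
            }
            out[o++] = (char)c;
        }
    }
    out[o] = '\0';
    return (int)o;
}
```

The cast to `unsigned char` is the part that matters for the later fix: with a plain signed `char`, values above 127 compare as negative and would slip past a `c > 126` test.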
Shiqing Fan
3e24d3df70 An ORTE event fix for Windows, i.e. using socket pairs instead of pipes on Windows.
This commit was SVN r21726.
2009-07-22 07:39:52 +00:00
Ralph Castain
6c85d954f3 Use a conditioned wait to serialize launches when they come from multiple sources (e.g., an orte application that spawns multiple jobs).
This commit was SVN r21718.
2009-07-20 01:51:29 +00:00
Ralph Castain
1a5f7245c8 Create a new message handling method for serializing responses. Place recvd messages on a list, using a file descriptor and the event library to trigger processing. This is identical in design to what is used in the IOF.
Use it first in the plm_base_receive to serialize multiple comm_spawn and update_proc requests.

This commit was SVN r21717.
2009-07-19 18:07:04 +00:00
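
The design in 1a5f7245c8 (received messages go on a list, and a file descriptor wakes the processing code) can be sketched with a plain POSIX pipe. All names are hypothetical, the event library is left out (an event loop would watch `fds[0]` for readability and then call `queue_process_one`), and there is no locking, so this only shows the single-threaded shape of the idea.

```c
#include <stdlib.h>
#include <unistd.h>

typedef struct msg {
    struct msg *next;
    int payload;
} msg_t;

typedef struct {
    msg_t *head;
    msg_t *tail;
    int fds[2];   /* fds[0]: read end to watch; fds[1]: write end */
} msg_queue_t;

int queue_init(msg_queue_t *q)
{
    q->head = q->tail = NULL;
    return pipe(q->fds);
}

/* Append a message and write one byte so the read end becomes readable. */
int queue_post(msg_queue_t *q, int payload)
{
    msg_t *m = malloc(sizeof *m);
    char byte = 0;

    if (m == NULL) {
        return -1;
    }
    m->next = NULL;
    m->payload = payload;
    if (q->tail != NULL) {
        q->tail->next = m;
    } else {
        q->head = m;
    }
    q->tail = m;
    return write(q->fds[1], &byte, 1) == 1 ? 0 : -1;
}

/* Called when fds[0] is readable: consume one wake-up byte and hand back
 * the oldest message, so requests are processed strictly one at a time. */
int queue_process_one(msg_queue_t *q, int *payload_out)
{
    char byte;
    msg_t *m;

    if (read(q->fds[0], &byte, 1) != 1 || q->head == NULL) {
        return -1;
    }
    m = q->head;
    q->head = m->next;
    if (q->head == NULL) {
        q->tail = NULL;
    }
    *payload_out = m->payload;
    free(m);
    return 0;
}

void queue_destroy(msg_queue_t *q)
{
    while (q->head != NULL) {
        msg_t *m = q->head;
        q->head = m->next;
        free(m);
    }
    q->tail = NULL;
    close(q->fds[0]);
    close(q->fds[1]);
}
```

Because each posted message produces exactly one byte on the pipe, the handler runs once per message and in arrival order, which is what serializes concurrent comm_spawn and update_proc requests in this sketch.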
Ralph Castain
1d74ab6e3c Cleanup some pointer array addressing and ensure we always exit the function cleanly
This commit was SVN r21716.
2009-07-19 18:05:04 +00:00
Ralph Castain
c3ce908515 Shift a debug output to come at a better place
This commit was SVN r21715.
2009-07-19 17:56:48 +00:00
Ralph Castain
1a5a591424 Clarify some comments
This commit was SVN r21714.
2009-07-19 17:56:19 +00:00
Ralph Castain
43d532cfd3 Minor tweak to xml output
This commit was SVN r21713.
2009-07-19 17:45:01 +00:00
Ralph Castain
2f515c8357 Per request of the Eclipse team, further modify the xml output to "escape" all non-printable characters
This commit was SVN r21712.
2009-07-17 22:45:32 +00:00
Ralph Castain
210f591f1c Cleanup array addressing for opal_pointer_array
This commit was SVN r21710.
2009-07-17 22:20:30 +00:00
Ralph Castain
51a8b89a83 Treat termination of continuously operating processes as an abort
This commit was SVN r21709.
2009-07-17 22:20:05 +00:00
Ralph Castain
08e17b72cf Break a circular logic loop in the cm routed module.
This commit was SVN r21708.
2009-07-17 18:07:35 +00:00
Ralph Castain
ef20e778b3 Ensure that output ends on an appropriate suffix tag when --tag-output or --xml are selected.
When we read the input buffer, we don't always get a complete printf output - we sometimes end mid stream. We still need to add the suffix and a <CR> to keep the output working right.

This commit was SVN r21706.
2009-07-17 05:02:53 +00:00
Ralph Castain
4c1eb040b0 Enable the system to keep functioning even when multiple launches are occurring simultaneously.
This is a bit of a hack, but it does seem to allow the system to work. A better solution is being discussed.

This commit was SVN r21705.
2009-07-17 02:28:47 +00:00
Ralph Castain
c0e85a492c Deleted one too many lines...might be good to set the value of oldnode!
Thanks George.

This commit was SVN r21702.
2009-07-16 18:49:24 +00:00
George Bosilca
3e971e61f3 The system headers are supposed to be protected by #ifdef and not by #if.
This commit was SVN r21700.
2009-07-16 18:27:33 +00:00
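
The convention in 3e971e61f3 can be illustrated as follows, assuming the usual configure behavior of defining a `HAVE_..._H` macro when a header is found and leaving it undefined otherwise. `have_unistd` is a hypothetical function that only reports which branch was compiled.

```c
/* Preferred: test whether the macro is defined at all. Writing
 * "#if HAVE_UNISTD_H" instead relies on an undefined macro evaluating
 * to 0 and can trigger warnings such as -Wundef. */
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif

int have_unistd(void)
{
#ifdef HAVE_UNISTD_H
    return 1;
#else
    return 0;
#endif
}
```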
George Bosilca
ed93b967f7 Remove some warnings about uninitialized values.
This commit was SVN r21695.
2009-07-16 17:38:09 +00:00
George Bosilca
52d013baae Add a missing header.
This commit was SVN r21694.
2009-07-16 17:21:37 +00:00
Ralph Castain
007d14f238 Add a threshold reporting level to the orte notifier framework. This takes a string value:
"critical" - any error at or above the critical severity will be reported (i.e., only critical errors)
"warning" - any error at or above the warning severity will be reported (i.e., warning and critical errors)
"notice" - pretty much everything will be reported

Default to "critical" to keep down the chatter.

Obviously, only places that call orte_notifier will be affected - all other error reporting (e.g., via opal_output calls) is unaffected.

This commit was SVN r21693.
2009-07-16 13:31:23 +00:00
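
The threshold rule in 007d14f238 (report anything at or above the configured severity) can be sketched like this. The enum, its numeric values, and both function names are hypothetical; only the three string values come from the commit message.

```c
#include <string.h>

/* Hypothetical severity scale: larger means more severe. */
typedef enum {
    SEV_NOTICE = 0,
    SEV_WARNING = 1,
    SEV_CRITICAL = 2
} severity_t;

/* Translate the threshold string into a severity.
 * Returns 0 on success, -1 for an unknown or missing string. */
int parse_threshold(const char *s, severity_t *out)
{
    if (s == NULL || out == NULL) {
        return -1;
    }
    if (strcmp(s, "critical") == 0) {
        *out = SEV_CRITICAL;
    } else if (strcmp(s, "warning") == 0) {
        *out = SEV_WARNING;
    } else if (strcmp(s, "notice") == 0) {
        *out = SEV_NOTICE;
    } else {
        return -1;
    }
    return 0;
}

/* An event is reported when it is at or above the threshold. */
int should_report(severity_t event, severity_t threshold)
{
    return event >= threshold;
}
```

With the default of "critical", a warning-level event is dropped and only critical events reach the notifier.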
Ralph Castain
ae6c36ae01 Ensure that jdata->num_procs is correct when the rank_file mapper is mapping more procs than are specified in the rank_file
This commit was SVN r21690.
2009-07-15 22:45:12 +00:00
Ralph Castain
e75d9b8296 Use orte_notifier to alert sys admins to checksum violations in the csum pml.
Add ability to store the RM's jobid string to tag the notifier message so that the sys admin knows what job had the problem.

This commit was SVN r21687.
2009-07-15 19:43:26 +00:00
Ralph Castain
90a2db25e9 Modify the errmgr callback function so it passes the proc that failed instead of only the jobid.
Update the cm routed module to detect and pass orted failures.

This commit was SVN r21682.
2009-07-15 11:43:33 +00:00
Ralph Castain
247ba7e90d Use the base function to claim a slot when fault groups are not defined
This commit was SVN r21681.
2009-07-15 11:28:58 +00:00
Ralph Castain
7161b37c76 Ensure that the stdin channel is closed when we kill a local proc - all other channels will automatically be closed when the proc terminates
This commit was SVN r21680.
2009-07-15 11:28:19 +00:00
Ralph Castain
dbac602be5 Add support for the add-host and add-hostfile MPI Info keys to allow Comm_spawn users to add new hosts to those already known by mpirun.
Requires full testing once comm_spawn is fixed (Edgar is working on that now).

This commit was SVN r21664.
2009-07-14 14:34:11 +00:00
Ralph Castain
60edbc7220 Fix hetero operations and comm_spawn (to a point).
Remove all architecture references from ORTE and put them back in the modex using modex_send/recv calls.

Hetero operations are now fully supported again. Comm_spawn now works up to the point where it segfaults due to an error in the CID code - which now allows Edgar to dig further! :-)

This commit was SVN r21655.
2009-07-13 20:03:41 +00:00
Ralph Castain
1b418dd397 Fix segfault in comm_spawn. The underlying problem breaking comm_spawn, however, remains - the change to make modex non-blocking causes the system to fail due to the arch not getting properly set.
Fix for that coming shortly.

This commit was SVN r21646.
2009-07-13 15:13:06 +00:00
Ralph Castain
b97f885c00 Restore the original API to terminate individual processes instead of the entire job. This was originally removed as we didn't at that time know how to take advantage of it. Some of us are now working on proactive resilience methods that move procs prior to node failure, so this is now a required API. Modify the odls, plm, and orted functions to support this new functionality.
Continue work on the resilient mapper, completing support for fault groups.

This commit was SVN r21639.
2009-07-13 02:29:17 +00:00
Ralph Castain
e30826c6e1 Quiet some compiler warnings
This commit was SVN r21591.
2009-07-02 17:48:36 +00:00
Shiqing Fan
0b56a8a4d5 Enable IPv6 on Windows by default, and fix two type casts for IPv6 operations.
This commit was SVN r21586.
2009-07-02 14:41:03 +00:00
Ralph Castain
4adb3ed80f Print out a more meaningful and correct error message
This commit was SVN r21581.
2009-07-01 20:16:15 +00:00
Ralph Castain
f832352b45 Clean up some compiler warnings
This commit was SVN r21577.
2009-07-01 16:51:11 +00:00