Commit graph

11658 commits

Author SHA1 Message Date
Pak Lui
108921c020 typo
This commit was SVN r18387.
2008-05-06 21:37:35 +00:00
Pak Lui
0302c098be minor typo
This commit was SVN r18386.
2008-05-06 21:26:17 +00:00
Ralph Castain
d97a4f880d Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation.
Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two.

Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day.

This commit was SVN r18385.
2008-05-06 20:16:17 +00:00
Josh Hursey
c47406810e Fix AMCA orted command line.
If no AMCA parameters are passed then do not send across the path information. Only place it on the command line if the AMCA parameter is set.

This commit was SVN r18382.
2008-05-06 18:27:31 +00:00
Josh Hursey
9971bc9d95 Merge in the mca_base_select changes per RFC:
http://www.open-mpi.org/community/lists/devel/2008/04/3779.php

{{{
svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play .
}}}

Any components not in the trunk, but in one of the affected frameworks *must* be
updated. Contact the list, look at the RFC, or look at the diff for how to do this.

Sorry for the early commit of this, but I wanted to get it in today (per RFC) and
didn't know if I would have a chance later today.

This commit was SVN r18381.
2008-05-06 18:08:45 +00:00
Jeff Squyres
a06d4023b8 Oops -- missed one sys_errlist -> strerror().
This commit was SVN r18378.
2008-05-06 13:22:36 +00:00
Ralph Castain
40904dd152 Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later.
Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way.

Cleanup ignore properties on the new routed components.

This commit was SVN r18377.
2008-05-05 22:32:25 +00:00
Jeff Squyres
4154e587de strerror() is much better.
This commit was SVN r18376.
2008-05-05 21:06:07 +00:00
Aurelien Bouteiller
5ba62469a0 Add a route_is_defined implementation for the linear oob routing.
This commit was SVN r18375.
2008-05-05 19:12:41 +00:00
Aurelien Bouteiller
c06620ad70 Add a const to the parameters of opal_dss_compare.
This commit was SVN r18374.
2008-05-05 19:12:01 +00:00
Aurelien Bouteiller
2ae30fe126 Implementation of the route_is_defined stub for direct oob routing.
This commit was SVN r18373.
2008-05-05 18:23:26 +00:00
Shiqing Fan
f35a06119c Use the memchecker_convertor_call function instead of the old one. Move the function to a place where we can use the convertor.
This commit was SVN r18370.
2008-05-05 13:57:27 +00:00
Ralph Castain
b8bb990acf Rename the routed modules to more accurately reflect what they do and the role they will play in soon-to-come updates.
Add two new APIs to the routed framework - stub them out so that collaborators can work on them in various components without conflicts.

Remove a "finalize" from the select function that could cause problems as the component had not had its initialize called yet.

This commit was SVN r18369.
2008-05-05 02:59:09 +00:00
Pak Lui
f5311903ee Correct the check with AC_LINK_IFELSE per Jeff's suggestion
This commit was SVN r18368.
2008-05-05 02:13:30 +00:00
Brad Penoff
4f104ba5d1 Add header for FreeBSD.
This commit was SVN r18366.
2008-05-03 23:07:45 +00:00
Jon Mason
a3bf503e01 Remove error on rdma cm
If there are multiple QPs, RDMACM will not send a message if the
qpnum != 0.  In doing so, it will log an error unnecessarily.  This
removes that.

This commit was SVN r18363.
2008-05-02 20:12:01 +00:00
Jon Mason
3989981578 Enable support of num_proc > num_nodes
Add the logic to support using port numbers, instead of simply using
the IP address of the sending node to determine which endpoint to
connect.  Since each process calls the cpc query function, it will
generate its own port to listen on, thus enabling this to work.

This commit was SVN r18362.
2008-05-02 16:20:28 +00:00
Ralph Castain
519c15f8af Fix direct and linear xcast modes
This commit was SVN r18359.
2008-05-02 14:30:07 +00:00
Ralph Castain
8e846bf7f2 Separate the gathering of collective data by jobid
This commit was SVN r18357.
2008-05-02 12:00:08 +00:00
Jeff Squyres
ba5615a18f Merge in /tmp-public/cpc3 branch to trunk. oob/xoob still remains the
default CPC.

This commit was SVN r18356.
2008-05-02 11:52:33 +00:00
Jeff Squyres
357428f82f Per http://www.open-mpi.org/community/lists/devel/2008/04/3778.php, Ralf W.'s suggestion to remove an unnecessary escape
This commit was SVN r18354.
2008-05-01 22:33:49 +00:00
Ralph Castain
432d441b3e Cleanup a bug found by Josh that caused multiple app_contexts to keep mapping onto the first node in an allocation
Continue work on loadbalancing

Cleanup code organization in rmaps_base

This commit was SVN r18353.
2008-05-01 21:07:49 +00:00
Donald Kerr
843a35094f adding local work queue accounting
This commit was SVN r18352.
2008-05-01 21:01:51 +00:00
George Bosilca
a69ac964df Allow any order in the list of Elan vpid.
This commit was SVN r18350.
2008-05-01 20:32:03 +00:00
Ralph Castain
b2c73f6e11 Fix tree-spawn to work within the new modex system
This commit was SVN r18349.
2008-05-01 19:19:34 +00:00
Josh Hursey
dcd21d7d07 Some checkpoint/restart fixes in response to r18338 (changes in modex).
Things should be working now.

This commit was SVN r18348.

The following SVN revision numbers were found above:
  r18338 --> open-mpi/ompi@3e55fe6f6d
2008-05-01 17:48:13 +00:00
Jeff Squyres
7154b232bb Per http://www.open-mpi.org/community/lists/devel/2008/04/3778.php,
Ralf W.'s suggestion for More Automake Goodness(tm).

This commit was SVN r18347.
2008-05-01 15:32:20 +00:00
Ralph Castain
ad894b050b Set the bookmark so the first process of a comm_spawn'd job will be mapped to the same node as the spawning proc, assuming it has space. If not, then the mapper will automatically move to the next node.
This commit was SVN r18346.
2008-05-01 15:24:03 +00:00
Terry Dontje
8dd0421015 Moved ident lines to ompi_mpi_init.c and created new ompi_version_string
variable.

This commit was SVN r18345.
2008-05-01 15:06:10 +00:00
Ralph Castain
1766442591 Fix a double-free when tree-spawning
Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context

This commit was SVN r18344.
2008-05-01 14:49:56 +00:00
Jeff Squyres
a1e5139b8f Update svn:ignore
This commit was SVN r18343.
2008-05-01 14:34:17 +00:00
Ralph Castain
3e55fe6f6d Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit.
Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs.

Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node.

This commit was SVN r18338.
2008-04-30 19:49:53 +00:00
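
Note on r18338's node-name encoding: the commit message does not say how the names are encoded, so the following is only a minimal sketch of one way contiguous numbering could be collapsed into a few values. The function names and the `(prefix, width, first, count)` run format are hypothetical and are not taken from the Open MPI source.

```python
import re

def encode_nodenames(names):
    """Collapse consecutive names such as node007, node008, ... into runs.

    Each run is (prefix, width, first, count); width == 0 marks a literal
    name with no trailing number. This format is an assumption made for
    illustration only.
    """
    runs = []
    for name in names:
        m = re.fullmatch(r"(.*?)(\d+)", name)
        if m is None:
            runs.append((name, 0, 0, 1))
            continue
        prefix, digits = m.group(1), m.group(2)
        num, width = int(digits), len(digits)
        if runs:
            p, w, first, count = runs[-1]
            # Extend the previous run only if this name is its direct successor.
            if w != 0 and p == prefix and w == width and first + count == num:
                runs[-1] = (p, w, first, count + 1)
                continue
        runs.append((prefix, width, num, 1))
    return runs

def decode_nodenames(runs):
    """Expand the runs produced by encode_nodenames back into full names."""
    names = []
    for prefix, width, first, count in runs:
        if width == 0:
            names.append(prefix)
        else:
            names.extend(f"{prefix}{n:0{width}d}" for n in range(first, first + count))
    return names

cluster = [f"node{n:03d}" for n in range(1, 501)] + ["login"]
runs = encode_nodenames(cluster)
```

In this sketch, 500 consecutively numbered nodes collapse into a single run, which matches the commit's point that all node names can travel in a few bytes rather than tens of bytes per node.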
George Bosilca
f5dfc005a4 Only check for /proc/cpuinfo if we are on a supported architecture.
This commit was SVN r18331.
2008-04-29 22:36:18 +00:00
Pavel Shamis
61cc8843bf r17940 broke the XRC code.
The endpoint may be appended to the list during XOOB connection bring-up.

This commit was SVN r18328.

The following SVN revision numbers were found above:
  r17940 --> open-mpi/ompi@ebfdd133f5
2008-04-29 13:22:40 +00:00
Ralph Castain
59277e2141 Update the ignores for hg
This commit was SVN r18327.
2008-04-29 02:23:33 +00:00
Galen Shipman
ced88a338b include portals modex fun in the distro
This commit was SVN r18325.
2008-04-28 18:51:54 +00:00
Brad Penoff
c699236be2 updating SCTP BTL to configure properly with FreeBSD 7
This commit was SVN r18324.
2008-04-28 04:19:10 +00:00
George Bosilca
6e6c370917 Roll back r18274, as it's legal to have a sequence number smaller than the
expected one. It doesn't necessarily mean the message is duplicated;
it can simply signify that the message is out of sequence and the counter
overflowed.

This commit was SVN r18323.

The following SVN revision numbers were found above:
  r18274 --> open-mpi/ompi@73c9de3af9
2008-04-27 18:35:54 +00:00
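
Note on r18323: a numerically smaller sequence number can still be a newer one once the counter wraps. The sketch below shows a wraparound-aware comparison under the assumption of a 16-bit counter. The width, the names, and the half-range rule are illustrative choices, not the PML's actual code.

```python
SEQ_BITS = 16                 # assumed counter width, for illustration only
SEQ_MOD = 1 << SEQ_BITS
HALF_RANGE = SEQ_MOD // 2

def seq_distance(received, expected):
    """Signed distance from expected to received, taking wraparound into account."""
    diff = (received - expected) % SEQ_MOD
    return diff - SEQ_MOD if diff >= HALF_RANGE else diff

def classify(received, expected):
    """Label a sequence number relative to the one the receiver expects."""
    d = seq_distance(received, expected)
    if d == 0:
        return "in order"
    return "ahead" if d > 0 else "behind"
```

With these assumptions, `classify(2, 65534)` reports the message as ahead of the expected one even though 2 < 65534, which is the situation the rollback describes.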
Ralph Castain
2329af9b49 Adjust the platform files for lanl
This commit was SVN r18319.
2008-04-26 14:16:49 +00:00
Aurelien Bouteiller
611d52fa95 Fix a bug that prevented using the same port (as returned by Open_port) for several Comm_accept calls
This commit was SVN r18303.
2008-04-25 20:41:44 +00:00
Galen Shipman
19c986bb57 include ornl specifics in the dist
This commit was SVN r18301.
2008-04-25 15:30:03 +00:00
Jeff Squyres
518bd99e17 Per thread started here:
http://www.open-mpi.org/community/lists/users/2008/04/5483.php

Make the error message a bit more user-friendly.

This commit was SVN r18293.
2008-04-25 11:09:43 +00:00
Aurelien Bouteiller
c20b020ea6 Fix ticket #1275. The pml v can now be correctly deactivated on the configure command line. Also fix a dist target under some unusual circumstances.
This commit was SVN r18291.
2008-04-24 21:42:54 +00:00
George Bosilca
465f690f90 We need to force the compiler to preprocess these files as some of
them use #include. The standard way is to rename the file to .S instead
of .s.

This commit was SVN r18290.
2008-04-24 21:40:40 +00:00
Ralph Castain
4c2c6c9bd8 Ensure the pack/unpacks match for tree-spawn
This commit was SVN r18282.
2008-04-24 18:53:08 +00:00
Ralph Castain
09b6758f8c Pass the prefix dir to the remote orted when doing tree-based spawns
This commit was SVN r18280.
2008-04-24 18:38:24 +00:00
Josh Hursey
2c736873bb Fix a checkpoint/restart bug that causes a restarted application to occasionally throw a SIGSEGV or SIGPIPE due to invalid socket descriptors.
The problem was caused by a bad ordering between the restart of the ORTE level tcp connections (in the OOB - out-of-band communication) and the Open MPI level tcp connections (BTLs). Before this commit ORTE would shut down and restart the OOB completely before the OMPI level restarted its tcp connections. What would happen is that a socket descriptor used by the OMPI level on checkpoint was assigned to the ORTE level on restart. But the OMPI level had no knowledge that the socket descriptor it was previously using had been recycled, so it closed it on restart. This caused the ORTE level to break as the newly created socket descriptor was closed without its knowledge.

The fix is to have the OMPI level shut down tcp connections, allow the ORTE level to restart, and then allow the OMPI level to restart its connections. This seems obvious, and I'm surprised that this bug has not cropped up sooner. I'm confident that this specific problem has been fixed with this commit.

Thanks to Eric Roman and Tamer El Sayed for their help in identifying this problem, and patience while I was fixing it.

 * Add a new state {{{OPAL_CRS_RESTART_PRE}}}. This state identifies when we are on the down slope of the INC (finalize-like) which is useful when you want to close, but not reopen a component set for fear of interfering with a lower level.
 * Use this new state in OMPI level coordination. Here we want to make sure to play well with both the OMPI/BTL/TCP and ORTE/OOB/TCP components.
 * Update ft_event functions in PML and BML to handle the new restart state.
 * Add an additional flag to the error output in OOB/TCP so we can see what the socket descriptor was on failure as this can be helpful in debugging.

This commit was SVN r18276.
2008-04-24 17:54:22 +00:00
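
Note on r18276: the ordering problem can be reproduced with a toy model of a descriptor table that, like POSIX, always hands out the lowest free number. Everything below (the class, the layer names used as owners, the descriptor numbers 3 and 4) is a hypothetical simulation and not Open MPI code.

```python
class DescriptorTable:
    """Toy per-process descriptor table: open() returns the lowest free number."""

    def __init__(self, reserved=3):
        # Descriptors 0..reserved-1 stand in for stdin/stdout/stderr.
        self.owner = {fd: "stdio" for fd in range(reserved)}

    def open(self, who):
        fd = 0
        while fd in self.owner:
            fd += 1
        self.owner[fd] = who
        return fd

    def close(self, fd):
        self.owner.pop(fd, None)

def restart(ompi_closes_first):
    """Replay a restart; return True if ORTE's new socket is still ORTE's afterwards."""
    table = DescriptorTable()
    stale_btl, stale_oob = 3, 4      # numbers remembered from before the checkpoint
    if ompi_closes_first:
        table.close(stale_btl)       # fixed order: OMPI drops its stale socket first
    table.close(stale_oob)           # ORTE shuts down its OOB socket...
    oob_fd = table.open("orte")      # ...and reopens it, receiving the lowest free number
    if not ompi_closes_first:
        table.close(stale_btl)       # old order: OMPI closes a number ORTE now owns
    table.open("ompi")               # OMPI reopens its BTL socket
    return table.owner.get(oob_fd) == "orte"
```

In this model the old ordering leaves ORTE holding a descriptor that now belongs to the OMPI level, while closing the OMPI connections first keeps the two layers on distinct descriptors.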
George Bosilca
3ccac4f803 Oops ...
This commit was SVN r18275.
2008-04-24 15:54:52 +00:00
George Bosilca
73c9de3af9 Bark if we got a wrong sequence number. Here wrong means that the
seq number is smaller than what we expect.

This commit was SVN r18274.
2008-04-24 15:48:43 +00:00
Tim Mattox
46c6aa4ed4 Resync the trunk NEWS file with the 1.2.7 changes.
This commit was SVN r18268.
2008-04-23 18:32:19 +00:00