openmpi/orte/tools/orterun/orterun.h
Ralph Castain 7342a6f1da Per the July technical meeting:
During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file) before the server is running and ready.

At that time, we discussed creating a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun.

Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!):

--wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again.

--server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for twice this time. Default is 10 seconds, which should be plenty of time.
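The two-attempt behavior described for these flags can be sketched roughly as follows. This is only an illustration of the retry logic, not the code from this commit: `ping_fn`, `wait_for_server()`, `worst_case_wait_sec()` and `stub_ping()` are hypothetical names, and the callback merely stands in for the rml.ping call mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback standing in for the real ping call; it should
 * return true if the server answered within timeout_sec seconds. */
typedef bool (*ping_fn)(const char *uri, int timeout_sec, void *ctx);

/* Default timeout stated in the commit message. */
#define DEFAULT_SERVER_WAIT_SEC 10
/* The commit message says the ping is attempted twice. */
#define MAX_PING_ATTEMPTS 2

/* Returns the 1-based attempt number that succeeded, or 0 if every
 * attempt timed out. */
int wait_for_server(const char *uri, int timeout_sec, ping_fn ping, void *ctx)
{
    assert(ping != NULL);
    for (int attempt = 1; attempt <= MAX_PING_ATTEMPTS; ++attempt) {
        if (ping(uri, timeout_sec, ctx)) {
            return attempt;
        }
    }
    return 0;
}

/* Upper bound on the total wait: each attempt may use the full timeout. */
int worst_case_wait_sec(int timeout_sec)
{
    return MAX_PING_ATTEMPTS * timeout_sec;
}

/* Stub for experimentation: fails as many times as *ctx says, then succeeds. */
bool stub_ping(const char *uri, int timeout_sec, void *ctx)
{
    int *failures_left = ctx;
    (void)uri;
    (void)timeout_sec;
    if (*failures_left > 0) {
        --*failures_left;
        return false;
    }
    return true;
}
```

With the default of 10 seconds, `worst_case_wait_sec(DEFAULT_SERVER_WAIT_SEC)` gives 20, matching the "twice this time" note above.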

This commit has only been lightly tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case.

This commit was SVN r19152.
2008-08-04 20:29:50 +00:00

73 lines
1.8 KiB
C

/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2007      Cisco, Inc.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#ifndef ORTERUN_ORTERUN_H
#define ORTERUN_ORTERUN_H
#include "orte_config.h"
#include "opal/threads/condition.h"
#include "opal/util/cmd_line.h"
#include "orte/runtime/orte_globals.h"
BEGIN_C_DECLS
/**
 * Main body of orterun functionality
 */
int orterun(int argc, char *argv[]);

/**
 * Global struct for catching orterun command line options.
 */
struct orterun_globals_t {
    bool help;
    bool version;
    bool verbose;
    bool quiet;
    bool exit;
    bool by_node;
    bool by_slot;
    bool debugger;
    int num_procs;
    char *env_val;
    char *appfile;
    char *wdir;
    char *path;
    bool preload_binary;
    char *preload_files;
    char *preload_files_dest_dir;
    opal_mutex_t lock;
    bool sleep;
    char *ompi_server;
    bool wait_for_server;     /* set by --wait-for-server: ping the server to see if it responds */
    int server_wait_timeout;  /* set by --server-wait-time: ping timeout, in seconds */
};

/**
 * Struct holding values gleaned from the orterun command line -
 * needed by debugger init
 */
ORTE_DECLSPEC extern struct orterun_globals_t orterun_globals;
END_C_DECLS
#endif /* ORTERUN_ORTERUN_H */
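As a rough illustration of how the two new fields might be filled from a command line, here is a standalone sketch. It deliberately does not use the `opal_cmd_line` machinery that this header includes; `struct server_wait_opts` and `parse_server_wait_opts()` are hypothetical names, the hand-rolled loop is an assumption for demonstration only, and rejecting non-positive timeouts is a choice made here, not something the commit specifies. The default of 10 seconds is taken from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of orterun_globals_t: only the two new fields. */
struct server_wait_opts {
    bool wait_for_server;
    int server_wait_timeout; /* seconds */
};

/* Scans argv for the two flags. Returns 0 on success, or -1 if
 * --server-wait-time has a missing or invalid value. */
int parse_server_wait_opts(int argc, char *argv[], struct server_wait_opts *out)
{
    assert(out != NULL);
    out->wait_for_server = false;
    out->server_wait_timeout = 10; /* default per the commit message */

    for (int i = 1; i < argc; ++i) {
        if (strcmp(argv[i], "--wait-for-server") == 0) {
            out->wait_for_server = true;
        } else if (strcmp(argv[i], "--server-wait-time") == 0) {
            if (i + 1 >= argc) {
                return -1; /* flag given without a value */
            }
            char *end = NULL;
            long secs = strtol(argv[++i], &end, 10);
            /* Assumption: only whole, positive second counts are accepted. */
            if (end == argv[i] || *end != '\0' || secs <= 0 || secs > INT_MAX) {
                return -1;
            }
            out->server_wait_timeout = (int)secs;
        }
    }
    return 0;
}
```

Under these assumptions, an invocation such as `mpirun --wait-for-server --server-wait-time 30 ./app` would leave `wait_for_server` set to true and `server_wait_timeout` set to 30.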