Advanced Configuration

This section documents optional settings for advanced users. None of them is required to get your csWMPI application to run, but some of them might prove convenient.


Cluster Configuration

Machines Definition:
The machines section starts with the string "/Machines". In this section, create an entry for each machine in the cluster using the following syntax:
/Machines
<Machine name>
<device1> <machine id using the device1>
<device2> <machine id using the device2>
<device3> <machine id using the device3>
...

Each entry starts with the identification of a machine, which is its DNS Name or IP address.

You then specify the devices that the machine can use to communicate with the other machines. For each device, the machine has an identifier that is unique within that device (e.g. for the tcp device the identifier is the IP address or DNS name of the machine).

An example of an entry to specify the machine mountain, which has the devices tcp and shmem, is:
/Machines
mountain
tcp mountain.criticalsoftware.com
shmem mountain
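
As an illustration only, the entry layout described above can be read with a few lines of Python. This is a sketch, not the parser used by csWMPI: the function name parse_machines is hypothetical, and the code assumes that entries are separated by blank lines and that lines starting with "#" are comments.

```python
SAMPLE = """/Machines
mountain
tcp mountain.criticalsoftware.com
shmem mountain
"""

def parse_machines(text):
    """Parse the /Machines section into {machine: {device: identifier}}."""
    machines = {}
    current = None
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # assumption: '#' starts a comment line
        if line.startswith("/"):
            # A new section starts; only /Machines is of interest here.
            in_section = (line == "/Machines")
            current = None
            continue
        if not in_section:
            continue
        if not line:
            current = None  # assumption: a blank line ends the entry
            continue
        if current is None:
            # First line of an entry: the machine name.
            current = line
            machines[current] = {}
        else:
            # Remaining lines: "<device> <identifier>".
            device, identifier = line.rsplit(None, 1)
            machines[current][device] = identifier
    return machines

print(parse_machines(SAMPLE))
```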

Connections Definition:
After defining all the machines, it is necessary to provide information on how the processes of each machine communicate with other processes. Default communication devices can be specified for communications between processes running on the same machine (internal) and for processes on remote machines (external). You can also specify devices for specific connections between machines. Finally, you have to specify an intercomputation device, which is the device to use for computations connecting at runtime using MPI_Comm_connect and MPI_Comm_accept. Currently, only the tcp device can be used as the intercomputation device.

The format of the Connections section is shown below:
/Connections
[internal device <device>]
[external device <device>]
[intercomputation device <device>]
[<Machine1> <Machine2> <device>]

An example of the Connections section is:
/Machines 
mountain 
tcp mountain.criticalsoftware.com
shmem mountain 

squirrel 
tcp squirrel.criticalsoftware.com
shmem squirrel 

pacific 
tcp pacific.criticalsoftware.com
#Note: This machine (pacific) has no shared memory device

/Connections 
internal device shmem
external device tcp
intercomputation device tcp
pacific pacific tcp

This configuration file specifies three machines. Two of them, mountain and squirrel, have the devices tcp and shmem, while the third, pacific, has only the tcp device.

The /Connections section sets the default internal device to shmem, so by default shared memory is used for communication between processes residing on the same machine. Likewise, the default external device is tcp, so by default processes residing on different machines communicate via tcp. The intercomputation device is also set to tcp.

The defaults are overridden for the machine pacific: the last line of the example specifies that tcp is used for internal communication between processes residing on pacific.
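
The way defaults and specific connections combine can be sketched as follows. This is not csWMPI code; the function pick_device is hypothetical, and it assumes that a specific connection applies regardless of the order in which the two machines are named.

```python
def pick_device(machine_a, machine_b, internal, external, specific):
    """Return the device used between two machines.

    specific maps frozenset({m1, m2}) to a device and overrides the defaults.
    """
    pair = frozenset((machine_a, machine_b))
    if pair in specific:
        return specific[pair]
    # No specific entry: same machine uses the internal default,
    # different machines use the external default.
    return internal if machine_a == machine_b else external

# The specific connection "pacific pacific tcp" from the example above.
specific = {frozenset(("pacific", "pacific")): "tcp"}

print(pick_device("mountain", "mountain", "shmem", "tcp", specific))
print(pick_device("mountain", "squirrel", "shmem", "tcp", specific))
print(pick_device("pacific", "pacific", "shmem", "tcp", specific))
```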

Security section:
For each machine you must specify the security context for processes running on that machine. A domain/user pair defines the security context. You can set a default user name and domain name to be used on all machines. The domain name can be left empty when the same user name is used on machines outside a Windows domain or on Linux machines.

The format of the Security section is shown below:
/Security
[default user <user name>]
[default domain [<domain name>]]
[<Machine> <user name> [<domain name>]]

Using the example from the Connections Definition above, assume that the machines mountain and squirrel belong to the domain csWMPI, while the machine pacific does not belong to any domain. The Security section should then be:
/Security
default user csWMPI_user
default domain csWMPI
pacific parallel_user pacific

csWMPI_user must be an account in the domain csWMPI and the parallel_user must be an account on the pacific machine.
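
The lookup implied by this section, where a machine-specific entry takes precedence over the defaults, could look like the sketch below. The function security_context and its arguments are hypothetical names chosen for illustration.

```python
def security_context(machine, default_user, default_domain, specific):
    """Return the (user, domain) pair for a machine.

    specific maps a machine name to its own (user, domain) pair.
    """
    if machine in specific:
        return specific[machine]
    return (default_user, default_domain)

# Values taken from the Security section example above.
specific = {"pacific": ("parallel_user", "pacific")}

print(security_context("mountain", "csWMPI_user", "csWMPI", specific))
print(security_context("pacific", "csWMPI_user", "csWMPI", specific))
```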

If the passwords are already present in the system (using the csWMPIreguser tool), then csWMPI will use them directly. Otherwise you will be prompted for the passwords.

Cluster Configuration file example
This is an example of a complete cluster configuration file:
/Machines 
mountain 
startup address mountain.criticalsoftware.com
tcp mountain.criticalsoftware.com
shmem mountain 

squirrel 
startup address squirrel.criticalsoftware.com
tcp squirrel.criticalsoftware.com
shmem squirrel 

pacific 
startup address pacific.criticalsoftware.com
tcp pacific.criticalsoftware.com
#Note: This machine has no shared memory device

/Connections 
internal device shmem 
external device tcp 
intercomputation device tcp
pacific pacific tcp

/Security
default user csWMPI_user
default domain csWMPI
pacific parallel_user pacific

Creating a portable Cluster Configuration file
A portable configuration file can be created using the wildcard character "." (period). Wherever it is used, it represents the default system entry for that field. It can stand for the name of a machine (the current machine name), the name of the user (the user currently logged on and starting the computation) or the name of the domain (the domain the user is logged on to).

Below is an example of a portable configuration file:
/Machines
.
startup address .
shmem .
/Connections 
internal device shmem 
external device shmem 
/Security
default user .
default domain .

Note that wildcards are not accepted in specific connections or in specific security entries.
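
A minimal sketch of the wildcard substitution, assuming the system defaults have already been looked up by the caller. The function expand_wildcards and the field names are hypothetical; csWMPI performs this expansion itself when it reads the file.

```python
def expand_wildcards(fields, defaults):
    """Replace every '.' value with the system default for that field."""
    return {name: (defaults[name] if value == "." else value)
            for name, value in fields.items()}

# Hypothetical defaults for the machine and user starting the computation.
defaults = {"machine": "mountain", "user": "csWMPI_user", "domain": "csWMPI"}
portable = {"machine": ".", "user": ".", "domain": "."}

print(expand_wildcards(portable, defaults))
```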


Process Group

To improve ease of use and provide a Process Group configuration file that is flexible and easy to maintain, PG2 file support was added to csWMPI. All files are parsed as PG2 files unless they have the extension .pg, in which case they are parsed as old-style .pg files.

A PG2 file is a valid XML file that can include a wide range of options, as follows:
<job>
     <set>
         <executable>myapp.exe</executable>
         <arguments>"argument1 with spaces" argument2</arguments> <!-- optional -->
         <wdir>X:\temp</wdir> <!-- optional -->
         <path>X:\apps-dir</path> <!-- optional -->
         <processes>2</processes><!-- processes per machine -->
         <drivemap><!-- optional -->
             <drive>X:</drive>
             <share>\\ideafix\public</share>
         </drivemap>
         <environment><!-- optional -->
             <variable name="variable1_name">variable1_value</variable> <!-- optional -->
             <variable name="variable2_name">variable2_value</variable> <!-- optional -->
         </environment>
         <monitor>mpi</monitor> <!-- optional -->
         <machine name="machine1"/>
         <machine name="machine2"/>
         <machine name="machine3">
             <processes>4</processes><!-- optional -->
             <executable>machine3_executable.exe</executable><!-- optional -->
             <arguments>machine3_arg1</arguments><!-- optional -->
         </machine>
     </set>
</job>

Each element is described below.

<job>
....
</job>
All PG2 file elements must be inside a <job> element, as required for a well-formed XML file.

     <set>
     ....
     </set>
A set is made of a number of common options. A <job> can have multiple <set> elements. MPI process ranks will be distributed first to all processes described in the first set, then in second set and so on. A <job> must have at least one <set> element.
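
The rank ordering across sets can be sketched as below. The text only states that ranks are handed out set by set; the order within a set (machine by machine, in the order listed) is an assumption of this sketch, and assign_ranks is a hypothetical helper.

```python
def assign_ranks(sets):
    """Assign MPI ranks to processes, one set after another.

    sets is a list of sets; each set is a list of (machine, process_count).
    Returns a list of (rank, set_index, machine) tuples.
    """
    ranks = []
    for set_index, machines in enumerate(sets):
        # Assumption: within a set, machines are filled in listed order.
        for machine, count in machines:
            for _ in range(count):
                ranks.append((len(ranks), set_index, machine))
    return ranks

job = [
    [("machine1", 2), ("machine2", 1)],  # first <set>
    [("machine3", 2)],                   # second <set>
]
for rank, set_index, machine in assign_ranks(job):
    print(rank, set_index, machine)
```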

         <executable>myapp.exe</executable>
The <executable> element specifies the executable of each process in this set. It can be a relative or an absolute path. If you use an absolute path, note that it must be valid on every machine that runs the executable.

         <arguments>"argument1 with spaces" argument2</arguments> <!-- optional -->
The <arguments> specifies the arguments for each process of this set. Arguments containing spaces must be within quotes. This is an optional element, thus processes without arguments don't need to define this element.

         <wdir>X:\temp</wdir> <!-- optional -->
The <wdir> element specifies the working directory of each process. Note that it is resolved on each process's machine and must be valid on all machines of this set. It is optional; if it is not present, the working directory is the executable's directory. Network drives are mapped before the working directory of the process is set, so a directory on a mapped network drive can be used as the working directory.

         <path>X:\apps-dir</path> <!-- optional -->
The <path> element specifies the path to be searched for the executable. Note that it is resolved on each process's machine. Network drives are mapped before the executable is searched for, so the path can include mapped network drives, with the process executable located on a shared drive.
It can contain a list of paths (valid in Windows or Linux) separated by semicolons, e.g.
<path>X:\apps-dir;c:\other-appsdir;/home/csWMPI-user/apps</path>
The '.' wildcard is expanded to the current directory when the computation is started with mpiexec, and to the executable's directory when using direct-run.

         <processes>2</processes><!-- processes per machine -->
The <processes> element specifies the number of processes to be created on each machine. Note that this is NOT the total number of processes in the computation.

         <drivemap><!-- optional -->
             <drive>X:</drive>
             <share>\\ideafix\public</share>
         </drivemap>
The <drivemap> element specifies a network share that will be mapped on each machine. You can specify multiple <drivemap> elements in each set to map more than one network share. This is the easiest way to share an executable with all machines of a cluster. Note that Windows might have license limitations on sharing a file with a large number of machines.
The <drive> element specifies the drive letter and the <share> element the network share to map to that drive. The share is accessed as the owner of the process to be created.
When using MPI_Comm_spawn and MPI_Comm_spawn_multiple, if no additional drive maps are defined using the info object, the drive maps of the first set are activated in the newly spawned processes.

         <environment><!-- optional -->
             <variable name="variable1_name">variable1_value</variable> <!-- optional -->
             <variable name="variable2_name">variable2_value</variable> <!-- optional -->
         </environment>
The <variable> elements inside the <environment> element specify environment variables that will be defined in all MPI processes. You can define as many environment variables as desired (limited only by OS restrictions). When using direct-run, these environment variables are defined in rank 0 (the starting process) after MPI_Init is called.
If a PATH variable is defined here, it is added to, rather than overwriting, the PATH environment variable of each csWMPI Service/Daemon when starting the processes. Note that this PATH is not used when looking for the MPI executable; use the <path> element for that.
When using MPI_Comm_spawn and MPI_Comm_spawn_multiple, if no additional environment variables are defined using the info object, the environment variables of the first set are defined in the newly spawned processes.

         <monitor>mpi</monitor> <!-- optional -->
The csWMPI Services monitor the processes they create. By default, created processes are assumed to be MPI processes. If the created processes are not MPI processes, the csWMPI Service needs to know, so that it can monitor the behavior of the process and act accordingly. The value of the <monitor> element specifies which level of monitoring is applied and what behavior is expected from the created processes. It must be one of the following values:

  • mpi - (default value) created processes are MPI processes, and the computation aborts if a process dies unexpectedly. If the computation aborts, the MPI processes are terminated.
  • process - created processes are not MPI processes, but they create the MPI process as their child. If a created process dies, the abort sequence is NOT started. If the MPI computation aborts, these processes are terminated. This can be useful with intermediate wrappers that start the MPI processes.
  • permanent - created processes are not MPI processes, but they create the MPI processes as their children. If the created processes die, the abort sequence is NOT started. If the MPI computation aborts, these processes are NOT terminated. This option can be useful with third-party schedulers or debuggers.
  • none - created processes are NOT monitored by the csWMPI Service.

         <machine name="machine3">
             <processes>4</processes><!-- optional -->
             <executable>machine3_executable.exe</executable><!-- optional -->
             <arguments>machine3_arg1</arguments><!-- optional -->
         </machine>
The <machine> elements specify the nodes where the processes will run; a set can contain several of them. By default, when no inner elements are defined, csWMPI uses the definitions of the set to create the processes on the machine.
To configure a different executable, number of processes or arguments for a specific machine, add any of the <processes>, <executable> or <arguments> elements inside its <machine> element. If the executable is not a fully qualified name, csWMPI uses the <path> element of the <set> to look for it.
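
To make the fallback from machine-level to set-level values concrete, the following sketch reads a small PG2 document with Python's standard xml.etree.ElementTree module and reports the effective values per machine. It is an illustration, not the csWMPI parser; effective_settings and the shortened sample file are hypothetical.

```python
import xml.etree.ElementTree as ET

PG2 = """<job>
  <set>
    <executable>myapp.exe</executable>
    <processes>2</processes>
    <machine name="machine1"/>
    <machine name="machine3">
      <processes>4</processes>
      <executable>machine3_executable.exe</executable>
    </machine>
  </set>
</job>"""

def _value(machine, pg_set, tag):
    """Prefer the machine-level element, fall back to the set-level one."""
    text = machine.findtext(tag)
    return text if text is not None else pg_set.findtext(tag)

def effective_settings(pg2_text):
    """Return (machine, executable, processes, arguments) per <machine>."""
    job = ET.fromstring(pg2_text)
    result = []
    for pg_set in job.findall("set"):
        for machine in pg_set.findall("machine"):
            result.append((
                machine.get("name"),
                _value(machine, pg_set, "executable"),
                int(_value(machine, pg_set, "processes")),
                _value(machine, pg_set, "arguments"),  # None when absent
            ))
    return result

settings = effective_settings(PG2)
for entry in settings:
    print(entry)
# <processes> is per machine, so the total is the sum over all machines.
print("total processes:", sum(entry[2] for entry in settings))
```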


Error Output Redirection

When an error occurs, the normal behavior is for csWMPI to generate an error message, display it in the console of the process, and send a pop-up message to the master machine if the process is on a different machine from rank 0.

While developing applications with csWMPI, you might find it convenient to redirect error output to files. The environment variable csWMPI_MASTER_ERROR_OUTPUT sets the output filename for the master process (rank 0 of MPI_COMM_WORLD), while csWMPI_SLAVE_ERROR_OUTPUT sets the output filename for all other processes. The files are created on the machines that host the processes. If these variables have the value null (a string with the value "null"), no output is generated.
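
One possible way to prepare these two variables before launching an application from a Python script is sketched below. The helper error_output_env and the file name master_errors.log are hypothetical; only the two variable names and the special value "null" come from this section.

```python
import os

def error_output_env(master_file, slave_file, base=None):
    """Return a copy of the environment with error redirection configured."""
    env = dict(os.environ if base is None else base)
    env["csWMPI_MASTER_ERROR_OUTPUT"] = master_file
    env["csWMPI_SLAVE_ERROR_OUTPUT"] = slave_file
    return env

# Rank 0 writes errors to a file; all other ranks produce no error output.
env = error_output_env("master_errors.log", "null", base={})
print(env)
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting your application.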


Password Checking

csWMPI reads the security context from the cluster configuration file and then searches the registry/personal files for the passwords of the specified users. If a user's password is not found there, the csWMPI_PASSWORD_SEMANTICS environment variable determines how csWMPI proceeds. This environment variable can take three different values:

Value            Description
ask_user         Prompts the user for the password on stdin.
return_error     Exits the application with an error.
get_environment  Tries to get passwords from environment variables (see below).

If csWMPI_PASSWORD_SEMANTICS is assigned the value get_environment, csWMPI attempts to find environment variables with names corresponding to the domain\user pairs specified in the cluster configuration file. For example, if the cluster configuration specifies the user GALIA\asterix for a machine, that user's password should be assigned to an environment variable named galia\asterix (the name must be in lowercase letters). The value of galia\asterix should be the password for the user in clear text.
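
The naming rule for these variables can be sketched as follows. The function password_variable_name is a hypothetical helper; how an empty domain is handled is not stated in this section and is not covered by the sketch.

```python
def password_variable_name(domain, user):
    """Build the lowercase domain\\user name of the password variable."""
    return f"{domain}\\{user}".lower()

# The example from the text: user GALIA\asterix.
print(password_variable_name("GALIA", "asterix"))
```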

Storing a user's password in an environment variable can be a serious security risk, so the get_environment password semantics should be used with extreme care. This functionality was added to let users work around the "feature" that Windows does not load a user's profile when the user is logged in using Windows API calls. The mechanism is, however, also available on Linux systems.


Environment Variables

This section lists and describes the environment variables recognized by csWMPI.

MPI_ROOT
    Denotes the location of the root directory of the csWMPI installation. Default value: C:\Program Files\csWMPI.

csWMPI_CLUSTER_CONF_FILE
    Denotes the full path and file name of the cluster configuration file to use. If this environment variable is not set, the default behavior is to search in the current directory for a file named csWMPI.clusterconf (see Cluster Configuration). Default value: Not set.

csWMPI_PG_FILENAME
    Denotes the full path and file name of the process group file to use. If this environment variable is not set, the default behavior is to search in the current directory for a file named [program name].pg. If this file is not found, csWMPI attempts to read a file named csWMPI.pg (see Process Group). Default value: Not set.

csWMPI_NO_OUTPUT_PREFIX
    When using mpiexec, if this variable is set to a value other than 0, no prefix such as "Rank 0: " is added to the processes' output lines. This is the same as passing the -noprefix argument to mpiexec.

csWMPI_MASTER_ERROR_OUTPUT
    If this variable is set to a filename, the master process' output (stdout) is redirected to this file. If the variable is set to null, no output is generated (see Error Output Redirection). Default value: Not set.

csWMPI_SLAVE_ERROR_OUTPUT
    If this variable is set to a filename, the output (stdout) of the slave processes is redirected to this file. If the variable is set to null, no output is generated (see Error Output Redirection). Default value: Not set.

csWMPI_PASSWORD_SEMANTICS
    Defines how csWMPI should behave if the security context information is not found in the registry. Valid values are ask_user, get_environment, and return_error (see Password Checking). Default value: Not set.

csWMPI_COLL_SYNC_COMM_START
    MPI_Alltoall variants can suffer from floods in network switches when multiple processes send to the same target process. To avoid that, processes can synchronize with the receiver before sending. This behaviour does not occur in small communicators, and it depends on the network devices. This variable sets the minimum communicator size that requires synchronization of processes when sending/receiving. Default value: 10

csWMPI_COLL_SYNC_MSG_START
    MPI_Alltoall can suffer from floods in network switches when multiple processes send to the same target process. To avoid that, processes can synchronize with the receiver before sending. This behaviour does not occur with small messages, and it depends on the network devices. This variable sets the minimum message size that requires synchronization of processes when sending/receiving. It is only applied when the communicator size is at least the size specified through csWMPI_COLL_SYNC_COMM_START. Default value: 8192 (8 KB)

csWMPI_TCP_RENDEZVOUS_START
    Specifies the minimum size (in bytes) of messages transferred using a rendezvous protocol. The rendezvous protocol synchronizes sender and receiver, and the data is sent only once the receiver has specified the buffer to receive it, that is, once the matching receive function has been called. This reduces the number of memory copy operations and avoids allocating and deallocating big memory buffers. Default value: 1048576 (1 MB).

csWMPI_TCP_RECV_BUFFER
    Specifies the TCP socket's receive buffer in bytes. Default value: 32768 (32 KB)

csWMPI_TCP_SEND_BUFFER
    Specifies the TCP socket's send buffer in bytes. Default value: 16384 (16 KB)

csWMPI_TCP_RT_SIGNAL
    The tcp device for Linux uses a realtime signal during communication. This signal cannot be used by any other library of the process. Default value (Linux only): SIGRTMIN+2

csWMPI_SHMEM_SIZE
    Specifies the size (in bytes) of the memory region shared by processes on the machine using the shmem device. The minimum and default size of this region is 16 MB (see shmem device). Default value: Not set.

csWMPI_SHMEM_END_POINT
    The end point of the shared memory region. The default is the bottom of the address space; unfortunately some other DLLs might load into or otherwise use this region. If such conflicts occur between DLLs, you can set this variable to some unused memory region (see shmem device). Default value: Not set.

csWMPI_SHMEM_RENDEZVOUS_START
    For big messages, a rendezvous protocol is needed to avoid flooding the shared memory segment. In MS Windows, the rendezvous protocol is implemented through a zero-copy operation, copying directly from one process's address space to another's. In Linux, the message is transferred in small pieces of data. Since such a protocol requires the two participating processes to rendezvous, it is often only beneficial for messages above some size. Messages below this size are temporarily stored in the memory region shared by the processes local to a machine. Default value for MS Windows: 65536 (64 KB). Default value for Linux: 131072 (128 KB).

csWMPI_SHMEM_RT_SIGNAL
    The shmem device for Linux uses a realtime signal during communication. This signal cannot be used by any other library of the process. Default value (Linux only): SIGRTMIN+3

csWMPI_SHMEM_UNIVERSE_SIZE
    Specifies the maximum number of processes that can connect to the shared memory device on a single machine. The default value is 256 (see shmem device). Default value: Not set.
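
How the two csWMPI_COLL_SYNC_* thresholds combine, and how csWMPI_TCP_RENDEZVOUS_START selects the protocol, can be sketched as below. This is a reading of the descriptions above, not csWMPI source code: the function names are hypothetical, and the sketch assumes the variables hold plain decimal integers.

```python
import os

# Documented default values of the three thresholds used in this sketch.
DEFAULTS = {
    "csWMPI_COLL_SYNC_COMM_START": 10,
    "csWMPI_COLL_SYNC_MSG_START": 8192,
    "csWMPI_TCP_RENDEZVOUS_START": 1048576,
}

def setting(name, env=None):
    """Read a threshold from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return int(env.get(name, DEFAULTS[name]))

def alltoall_synchronizes(comm_size, msg_bytes, env=None):
    """True when both the communicator and the message are large enough."""
    return (comm_size >= setting("csWMPI_COLL_SYNC_COMM_START", env)
            and msg_bytes >= setting("csWMPI_COLL_SYNC_MSG_START", env))

def tcp_uses_rendezvous(msg_bytes, env=None):
    """True when the message reaches the tcp rendezvous threshold."""
    return msg_bytes >= setting("csWMPI_TCP_RENDEZVOUS_START", env)

# With the defaults: 16 processes exchanging 64 KB messages synchronize,
# 4 processes do not, whatever the message size.
print(alltoall_synchronizes(16, 65536, env={}))
print(alltoall_synchronizes(4, 65536, env={}))
print(tcp_uses_rendezvous(2 * 1048576, env={}))
```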



    © 2009 Critical Software SA. All trademarks and copyrights on this page are owned by their respective owners.
    cscsWMPI II™, cscsWMPI™ and PatentMPI™ are trademarks of Critical Software SA. All Rights Reserved.