
StarCCM+

1. 

See Also

The HP-MPI tight-integration notes at wiki.gridengine.info.

2. 

The Issues

PE Integration
We need SGE and StarCCM+ to talk to each other, to ensure that the application starts the SGE-determined number of processes and starts them on the compute nodes SGE has allocated.

We also want to ensure processes are tidied up correctly at the end.

StarCCM+ ships with its own MPI implementation, HP-MPI.
Licensing
We have an unlimited number of licences for StarCCM+ (for MACE users only), so the complications that arose for Fluent do not exist for StarCCM+.

3. 

A Further Problem

All ok in R410.q
With the PE set up as described below, StarCCM+ seemed to work perfectly with, for example, 24-process jobs on two 12-core nodes in the R410.q queue. That queue comprises 11 Dell R410 nodes which were part of RQ2, installed and configured by the Research Infrastructure team rather than by Alces.
But not in C6100-STD.q: what is the difference?
Running the same job in a queue comprising nodes installed and configured by Alces, the jobs sometimes failed with:
  node064.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED
  node047.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED

  Starting local server: /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/star/bin/starccm+ -rsh ssh -np 24 -machinefile machinefile.42844 -server  100_layer.sim
  ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
  Host key verification failed.
  mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.

  error: Server process ended unexpectedly (return code 255)
  mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.
Hypothesis and Proof
Hypothesis: jobs ran ok when the nodes involved had corresponding entries in ~/.ssh/known_hosts, and failed when they did not. After using pdsh -g nodes uptime to add all nodes to ~/.ssh/known_hosts, all jobs ran ok.
So what is going on?
On the R410 nodes, /etc/ssh/ssh_known_hosts includes BOTH hostnames and IP addresses; on the Alces-installed/configured nodes, only hostnames. And it is the IPs that get added to one's personal known_hosts...
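The workaround used for the tests above was simply to touch every node once with pdsh; a sketch of an alternative using ssh-keyscan, which records keys for both the hostname and the IP in one pass, is shown below (the pdsh group name nodes comes from above, but the node-name range and the use of getent are assumptions):
  # workaround used above: ssh to every node once so its key is cached
  pdsh -g nodes uptime

  # alternative sketch: harvest keys for hostname and IP together
  for n in node{001..064}; do
      ip=$(getent hosts $n | awk '{print $1}')
      ssh-keyscan $n $ip >> ~/.ssh/known_hosts 2>/dev/null
  done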
Solution Attempt One
Can we add -q to the ssh command passed via -rsh, or use
  Host node*
    LogLevel QUIET
in ~/.ssh/config? Neither seemed to help: starccm+ is a wrapper script which sources further scripts, so it is complicated to follow. Running starccm+ with -verbose we see
/opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00/bin/mpirun -f /tmp/mpi-simonh10859/machinefile.10859 -e MPI_ROOT=/opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00 -e MPI_NOBACKTRACE=1 -e MPI_FLAGS=%MPI_FLAGS -e MPI_TMPDIR=/tmp/mpi-simonh10859 -e MPI_REMSH="ssh"
and hacking the scripts to get MPI_REMSH="ssh -q" gives errors, as one might expect.
Solution Attempt Two
The scripts seem to set StrictHostKeyChecking=yes. Setting StrictHostKeyChecking=no in ~/.ssh/config seems to override this and all is well.
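A hedged example of the per-user stanza (assuming the compute nodes all match node*):
  Host node*
    StrictHostKeyChecking no
    LogLevel QUIET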
Solution
Rather than asking all users to change their SSH config, we got Alces to add IPs to /etc/ssh/ssh_known_hosts on all their nodes.
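The resulting ssh_known_hosts entries pair each hostname with its IP on a single line, along the lines of (the IP and key shown here are placeholders):
  node064.danzek.itservices.manchester.ac.uk,10.x.x.64 ssh-rsa AAAAB3NzaC1yc2EAAA...placeholder...
so ssh finds a match whichever form it looks up.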

4. 

Our Approach

5. 

Implementation: Licensing

The StarCCM+ environment module simply sets

  CDLMD_LICENSE_FILE="[email protected]:[email protected]:[email protected]"
and that is all that is required.
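For reference, a minimal sketch of a corresponding Tcl modulefile (the install path is taken from the -verbose output above; the licence-server string is a placeholder, since the real host and port details are not reproduced here):
  #%Module1.0
  ## apps/binapps/starccm/5.04 -- sketch only
  set root /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006
  prepend-path PATH $root/star/bin
  setenv CDLMD_LICENSE_FILE "port@licserver1:port@licserver2:port@licserver3"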

6. 

Implementation: PE Integration

The PE, starccm-12.pe, is

  pe_name            starccm-12.pe
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh
  stop_proc_args     /bin/true
  allocation_rule    12
  control_slaves     FALSE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE
where /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh
  #!/bin/bash

  # Convert the SGE-generated $PE_HOSTFILE (one line per host: hostname,
  # slot count, queue, processor range) into an HP-MPI-style machinefile
  # in which each hostname appears once per allocated slot.

  MACHINEFILE="machinefile.$JOB_ID"

  for host in `cat $PE_HOSTFILE | awk '{print $1}'`; do
      num=`grep $host $PE_HOSTFILE | awk '{print $2}'`
      # NB: {1..$num} does not expand with a variable, hence seq
      for i in `seq 1 $num`; do
          echo $host >> $MACHINEFILE
      done
  done
simply creates an HP-MPI-format machinefile from the hostfile provided by SGE.
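For example, for the 24-slot job shown earlier (node047 and node064 with 12 slots each), the generated machinefile.$JOB_ID contains twelve lines of node047.danzek.itservices.manchester.ac.uk followed by twelve of node064.danzek.itservices.manchester.ac.uk, which is what HP-MPI's -machinefile option expects. To make the PE usable it must also be registered with SGE and attached to the relevant queues; a sketch (exact qconf usage may vary between SGE versions) is:
  # register the PE from a file containing the definition above
  qconf -Ap starccm-12.pe
  # append it to the pe_list of a queue, e.g. C6100-STD.q
  qconf -aattr queue pe_list starccm-12.pe C6100-STD.q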

7. 

Example Qsub Scripts

In this example, we select the required PE, starccm-12.pe, ensure the script knows about environment modules by sourcing the modules.sh file, then load the required module and call StarCCM+.

#!/bin/bash

#$ -pe starccm-12.pe 24
#$ -S /bin/bash
#$ -cwd

source /etc/profile.d/modules.sh

module load apps/binapps/starccm/5.04

starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim 
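
A usage sketch, assuming the script above is saved as starccm.qsub and 100_layer.sim is in the current working directory:

  qsub starccm.qsub
  qstat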

In this example it is assumed that the required environment module has already been loaded. We must add the -V option to the script to ensure the job inherits the environment, including the settings made by the module, from our command-line session.

#!/bin/bash

#$ -pe starccm-12.pe 24
#$ -S /bin/bash
#$ -cwd
#$ -V
    # ...ensure the StarCCM+ env. module is loaded before qsubbing this script...

starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim
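
The corresponding workflow from the command line (the script name starccm_V.qsub is an assumption) is:

  module load apps/binapps/starccm/5.04
  qsub starccm_V.qsub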