
StarCCM+

1. 

See Also

The HP-MPI tight-integration notes at wiki.gridengine.info.

2. 

The Issues

PE Integration
We need SGE and StarCCM+ to talk to each other, to ensure that the application starts the SGE-determined number of processes and starts them on the compute nodes SGE has allocated.

We also want to ensure processes are tidied up correctly at the end.

StarCCM+ ships with its own MPI implementation, HP-MPI.
Licensing
We have an unlimited number of licences for StarCCM+ (for MACE users only), so the complications that arose for Fluent do not exist for StarCCM+.

3. 

A Further Problem

All ok in R410.q
With the PE set up as described below, StarCCM+ seemed to work perfectly with, for example, 24-process jobs on two 12-core nodes in the R410.q queue. That queue comprises 11 Dell R410 nodes which were part of RQ2, installed and configured by the Research Infrastructure team rather than by Alces.
But not in C6100-STD.q: what is the difference?
Running the same job in a queue comprising nodes installed and configured by Alces, the jobs sometimes failed with:
  node064.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED
  node047.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED

  Starting local server: /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/star/bin/starccm+ -rsh ssh -np 24 -machinefile machinefile.42844 -server  100_layer.sim
  ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
  Host key verification failed.
  mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.

  error: Server process ended unexpectedly (return code 255)
  mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.
Hypothesis and Proof
Hypothesis: jobs ran ok when the nodes involved had corresponding entries in ~/.ssh/known_hosts, and failed when they did not. After using pdsh -g nodes uptime to add all nodes to ~/.ssh/known_hosts, all jobs ran ok.
So what is going on?
On the R410 nodes, /etc/ssh/ssh_known_hosts includes BOTH hostnames and IP addresses; on the Alces-installed/configured nodes, only hostnames. And it is the IPs that get added to one's personal known_hosts...
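The workaround used for the tests above was simply to touch every node once with pdsh; a sketch of an alternative using ssh-keyscan, which records keys for both the hostname and the IP in one pass, is shown below (the pdsh group name nodes comes from above, but the node-name range and the use of getent are assumptions):
  # workaround used above: ssh to every node once so its key is cached
  pdsh -g nodes uptime

  # alternative sketch: harvest keys for hostname and IP together
  for n in node{001..064}; do
      ip=$(getent hosts $n | awk '{print $1}')
      ssh-keyscan $n $ip >> ~/.ssh/known_hosts 2>/dev/null
  done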
Solution Attempt One
Can we add -q to the ssh command passed via -rsh, or use
  Host node*
    LogLevel QUIET
in ~/.ssh/config? Neither seemed to help: starccm+ is a wrapper script which sources further scripts, so it is complicated to follow. Running starccm+ with -verbose we see
/opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00/bin/mpirun -f /tmp/mpi-simonh10859/machinefile.10859 -e MPI_ROOT=/opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00 -e MPI_NOBACKTRACE=1 -e MPI_FLAGS=%MPI_FLAGS -e MPI_TMPDIR=/tmp/mpi-simonh10859 -e MPI_REMSH="ssh"
and hacking the scripts to get MPI_REMSH="ssh -q" gives errors, as one might expect.
Solution Attempt Two
The scripts seem to set StrictHostKeyChecking=yes. Setting StrictHostKeyChecking=no in ~/.ssh/config seems to override this and all is well.
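A hedged example of the per-user stanza (assuming the compute nodes all match node*):
  Host node*
    StrictHostKeyChecking no
    LogLevel QUIET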
Solution
Rather than asking all users to change their SSH config, we got Alces to add IPs to /etc/ssh/ssh_known_hosts on all their nodes.
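The resulting ssh_known_hosts entries pair each hostname with its IP on a single line, along the lines of (the IP and key shown here are placeholders):
  node064.danzek.itservices.manchester.ac.uk,10.x.x.64 ssh-rsa AAAAB3NzaC1yc2EAAA...placeholder...
so ssh finds a match whichever form it looks up.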

4. 

Our Approach

5. 

Implementation: Licensing

The StarCCM+ environment module simply sets

  CDLMD_LICENSE_FILE="[email protected]:[email protected]:[email protected]"
and that is all that is required.
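For reference, a minimal sketch of a corresponding Tcl modulefile (the install path is taken from the -verbose output above; the licence-server string is a placeholder, since the real host and port details are not reproduced here):
  #%Module1.0
  ## apps/binapps/starccm/5.04 -- sketch only
  set root /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006
  prepend-path PATH $root/star/bin
  setenv CDLMD_LICENSE_FILE "port@licserver1:port@licserver2:port@licserver3"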

6. 

Implementation: PE Integration

The PE, starccm-12.pe, is

  pe_name            starccm-12.pe
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh
  stop_proc_args     /bin/true
  allocation_rule    12
  control_slaves     FALSE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE
where /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh
  #!/bin/bash

  # Convert the SGE-generated $PE_HOSTFILE (one line per host: hostname,
  # slot count, queue, processor range) into an HP-MPI-style machinefile
  # in which each hostname appears once per allocated slot.

  MACHINEFILE="machinefile.$JOB_ID"

  for host in `cat $PE_HOSTFILE | awk '{print $1}'`; do
      num=`grep $host $PE_HOSTFILE | awk '{print $2}'`
      # NB: {1..$num} does not expand with a variable, hence seq
      for i in `seq 1 $num`; do
          echo $host >> $MACHINEFILE
      done
  done
simply creates an HP-MPI-format machinefile from the hostfile provided by SGE.
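For example, for the 24-slot job shown earlier (node047 and node064 with 12 slots each), the generated machinefile.$JOB_ID contains twelve lines of node047.danzek.itservices.manchester.ac.uk followed by twelve of node064.danzek.itservices.manchester.ac.uk, which is what HP-MPI's -machinefile option expects. To make the PE usable it must also be registered with SGE and attached to the relevant queues; a sketch (exact qconf usage may vary between SGE versions) is:
  # register the PE from a file containing the definition above
  qconf -Ap starccm-12.pe
  # append it to the pe_list of a queue, e.g. C6100-STD.q
  qconf -aattr queue pe_list starccm-12.pe C6100-STD.q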

7. 

Example Qsub Scripts

In this example, we select the required PE, starccm-12.pe, ensure the script knows about environment modules by sourcing the modules.sh file, then load the required module and call StarCCM+.

#!/bin/bash

#$ -pe starccm-12.pe 24
#$ -S /bin/bash
#$ -cwd

source /etc/profile.d/modules.sh

module load apps/binapps/starccm/5.04

starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim 
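
A usage sketch, assuming the script above is saved as starccm.qsub and 100_layer.sim is in the current working directory:

  qsub starccm.qsub
  qstat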

In this example it is assumed that the required environment module has already been loaded. We must add the -V option to the script to ensure the job inherits the environment, including the settings made by the module, from our command-line session.

#!/bin/bash

#$ -pe starccm-12.pe 24
#$ -S /bin/bash
#$ -cwd
#$ -V
    # ...ensure the StarCCM+ env. module is loaded before qsubbing this script...

starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim
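
The corresponding workflow from the command line (the script name starccm_V.qsub is an assumption) is:

  module load apps/binapps/starccm/5.04
  qsub starccm_V.qsub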