StarCCM
1. See Also
The HP-MPI tight-integration notes at wiki.gridengine.info.
2. The Issues
- PE Integration
- We need SGE and StarCCM+ to talk to each other, to ensure that the application starts the SGE-determined number of processes, starts them on the right compute nodes, and tidies the processes up correctly at the end. StarCCM+ comes with its own implementation of MPI, HP's MPI.
- Licensing
- We have an unlimited number of licences for StarCCM (for MACE users only), so the complications that arose for Fluent do not exist for StarCCM+.
3. A Further Problem
- All OK in R410.q
- With the PE set up as described below, StarCCM+ seemed to work perfectly with, for example, 24-process jobs on two 12-core nodes in the R410.q queue. This queue comprises 11 Dell R410 nodes which were part of RQ2, installed and configured by the Research Infrastructure team rather than by Alces.
- But not in C6100-STD.q — what is the difference?
- Running the same job in a queue comprising nodes installed/configured by Alces, the jobs sometimes failed with:

    node064.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED
    node047.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED
    Starting local server: /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/star/bin/starccm+ -rsh ssh -np 24 -machinefile machinefile.42844 -server 100_layer.sim
    ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
    Host key verification failed.
    mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.
    error: Server process ended unexpectedly (return code 255)
    mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.
- Hypothesis and Proof
- Hypothesis: jobs ran OK when the nodes they used had corresponding entries in ~/.ssh/known_hosts, and failed when they did not. Proof: after using pdsh -g nodes uptime to add all nodes to ~/.ssh/known_hosts, all jobs ran OK.
- So what is going on?
- On the R410 nodes, /etc/ssh/ssh_known_hosts includes BOTH hostnames and IP addresses; on the Alces-installed/configured nodes, only the hostnames. And it is the IPs that get added to one's personal known_hosts...
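One way to confirm this from a login node is to search the per-user known_hosts file directly. A minimal sketch, assuming OpenSSH's ssh-keygen; the IP address here is made up for illustration:

    # Is there an entry for the node by hostname?
    ssh-keygen -F node064.danzek.itservices.manchester.ac.uk

    # ...and is there one by IP address? (10.1.1.64 is a hypothetical example)
    ssh-keygen -F 10.1.1.64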
- Solution Attempt One
- Can we add -q to the -rsh ssh command, or use

    Host node*
        LogLevel QUIET

  in ~/.ssh/config? Neither seemed to help: starccm+ is a script which includes other scripts, and so on; complicated. Running starccm+ with -verbose we see

    /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00/bin/mpirun -f /tmp/mpi-simonh10859/machinefile.10859 -e MPI_ROOT=/opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00 -e MPI_NOBACKTRACE=1 -e MPI_FLAGS=%MPI_FLAGS -e MPI_TMPDIR=/tmp/mpi-simonh10859 -e MPI_REMSH="ssh"

  and hacking this to get MPI_REMSH="ssh -q" gives errors, as one might expect.
- Solution Attempt Two
- The scripts seem to set StrictHostKeyChecking=yes. Trying StrictHostKeyChecking=no in ~/.ssh/config seems to override this and all is well.
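For reference, a minimal ~/.ssh/config stanza of this kind; the node* pattern is an assumption about the compute-node naming:

    Host node*
        StrictHostKeyChecking no

This disables host-key checking for all matching hosts for that user, which is one reason we preferred the central fix below.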
- Solution
- Rather than asking all users to change their SSH config, we got Alces to add IPs to /etc/ssh/ssh_known_hosts on all their nodes.
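For illustration, entries of that form can be generated with OpenSSH's ssh-keyscan. A sketch, assuming bash and a hypothetical node001..node064 naming scheme (not necessarily how Alces did it):

    # Collect RSA host keys for each node, by hostname AND by IP address,
    # and append them to the system-wide known-hosts file.
    # (Run as root on each node, or distribute the resulting file.)
    for n in node{001..064}; do
        ip=$(getent hosts "$n" | awk '{print $1}')
        ssh-keyscan -t rsa "$n" "$ip"
    done >> /etc/ssh/ssh_known_hosts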
4. Our Approach
- We need to create a suitable machine file for HP MPI from that provided by SGE. We do this in the parallel environment. See below!
5. Implementation: Licensing
The StarCCM+ environment module simply sets

    CDLMD_LICENSE_FILE="[email protected]:[email protected]:[email protected]"

and that is all that is required.
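For the curious, a minimal Tcl modulefile sketch that would achieve this; the module layout is an assumption, not a copy of the production modulefile, and the licence-server value is as given above:

    #%Module1.0
    ## apps/binapps/starccm/5.04 -- illustrative sketch only
    set base /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006
    prepend-path PATH $base/star/bin
    setenv CDLMD_LICENSE_FILE "[email protected]:[email protected]:[email protected]"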
6. Implementation: PE Integration
The PE, starccm-12.pe, is

    pe_name            starccm-12.pe
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh
    stop_proc_args     /bin/true
    allocation_rule    12
    control_slaves     FALSE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

where /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh

    #!/bin/bash
    # Runs as the PE's start_proc_args: build an HP-MPI style machinefile
    # (one line per slot) from the SGE-provided $PE_HOSTFILE.
    MACHINEFILE="machinefile.$JOB_ID"
    for host in `cat $PE_HOSTFILE | awk '{print $1}'`; do
        num=`grep $host $PE_HOSTFILE | awk '{print $2}'`
        ## for i in {1..$num}; do    # brace expansion cannot take a variable
        for i in `seq 1 $num`; do
            echo $host >> $MACHINEFILE
        done
    done

simply creates an HP-MPI format machinefile from that provided by SGE.
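To make the conversion concrete: for a 24-slot job on two 12-core nodes, $PE_HOSTFILE contains lines of the form (hostnames borrowed from the error output above; the queue field appears as in that output):

    node064.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED
    node047.danzek.itservices.manchester.ac.uk 12 [email protected] UNDEFINED

and the script turns this into machinefile.$JOB_ID, i.e. twelve lines of the first hostname followed by twelve lines of the second. The PE itself is added to SGE in the usual way, e.g. via qconf -Ap, and attached to the pe_list of the relevant queues.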
7. Example Qsub Scripts
In this example, we select the required PE, starccm-12.pe, ensure the script knows about environment modules by sourcing the modules.sh file, then load the required module and call StarCCM+.
    #!/bin/bash
    #$ -pe starccm-12.pe 24
    #$ -S /bin/bash
    #$ -cwd
    source /etc/profile.d/modules.sh
    module load apps/binapps/starccm/5.04
    starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim
In this example it is assumed that the required environment module has already been loaded. We must add the -V option to the script to ensure it inherits the environment, including the settings made by the module, from our command line.
    #!/bin/bash
    #$ -pe starccm-12.pe 24
    #$ -S /bin/bash
    #$ -cwd
    #$ -V
    # ...ensure the StarCCM+ env. module is loaded before qsubbing this script...
    starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim
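With this second script saved as, say, starccm-V.qsub (an illustrative name), the workflow is:

    module load apps/binapps/starccm/5.04
    qsub starccm-V.qsub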