Stuff

UoM::RCS::Talby::Danzek::SGE



Page Contents:


Page Group

How can a user influence job priority?

 -- deadline jobs
 -- posix priority
 -- resource reservation
 -- advance reservation

Bugs/Features

Troubleshooting

Job Scheduling







Parallel Environments, Host/Machine Files and Loose & Tight Integration of MPI

1. 

Miscellaneous Notes

 -- The prolog, jobscript, epilog and associated PE scripts are just 
    executed one after the other as child-processes (or sub-shells).

 -- therefore cannot, e.g., export env vars from PE to job

2. 

Parallel Environment Fields and Attributes

2.1. 

Start Proc Args and Stop Proc Args

  1. If no such procedures are required for a certain parallel environment, you can leave the fields empty.
  2. The first argument is usually the name of the start or stop procedure itself; the remaining parameters are command-line arguments to these procedures.
  3. A variety of special identifiers, which begin with a $ prefix, are available to pass internal runtime information to the procedures. The sge_pe man page contains a list and description of all available parameters. These are:
            $pe_hostfile
            $host 
            $job_owner
            $job_id
            $job_name
            $pe   
            $pe_slots
            $processors
            $queue

2.2. 

Allocation Rule

A positive integer
This specifies the number of processes which are run on each host. For example, for eight-core nodes, could set this field to 8 so that only jobs of size eight, 16, 24. . . are accepted by the scheduler for this PE and all cores on each hosts are always used. This will:
  • minimise the number of hosts on which a job runs;
  • eliminate competition for on-host resources between this job and others.
$pe_slots
Use the special denominator $pe_slots to cause all the processes of a job to be allocated on a single host (SMP).
$fill_up
$round_robin

2.3. 

Urgency Slots

For jobs submitted with a slot range, determines how a particular number of slots is chosen from this range for use in the resource-request-based priority contribution for numeric resources. In simple words:

a job's priority depends partly on the resources requested, e.g., the number of slots; the Urgency Slots value determineds how the number of slots is chosen for the priority calculation if a PE slot range has been specified.

One can specify:

min or max
Take the minimum or maximum value from the specified range.
avg
Average
An integer value

2.4. 

Control Slaves

Specifies whether SGE generates (parallel) processes — child processes under sge_execd and sge_shepherd — or whether the PE creates its own processes:

  1. If control slaves is set, SGE generates the (parallel) processes.
  2. Full control over slave tasks by the SGE is strongly preferable, e.g., to ensure reported accounting figures are correct and for (parallel) process control — see loose vs tight integration, below.

2.5. 

Job is first task

  1. Significant only if control slaves is set (TRUE).
  2. If Job is first task is set, the job script, or one of its child processes, acts as one of the parallel tasks of the parallel application. If Job is first task is unset, the job script initiates the parallel application but does not participate.
  3. For PVM, this field should usually be set (TRUE).
  4. For MPI, this field should not usually be set (FALSE).

2.6. 

Accounting Summary

  1. Significant only if control slaves is set (TRUE), i.e., if accounting data is available via sge_execd and sge_shepherd.
  2. If set, only a single accounting record is written to the accounting file and this contains the accounting summary of the whole job including all slave tasks;
  3. if unset (FALSE) an individual accounting record is written for every slave process and for the master process.

3. 

Host/Machine Files — Ensuring the Processes Run Where SGE Says!

When a multi-process job is submitted to SGE a pe_hostfile is created which should be used to tell the parallel (e.g., MPI) job on which hosts to run and how many processes to start on each. It is easy to determine the pe_hostfile path and contents by adding a couple of lines to a qsub script, for example:

  #!/bin/bash

  #$ -cwd
  #$ -S /bin/bash
  #$ -pe orte-32.pe 64

  echo "PE_HOSTFILE:"
  echo $PE_HOSTFILE
  echo
  echo "cat PE_HOSTFILE:"
  cat $PE_HOSTFILE

  /opt/gridware/mpi/gcc/openmpi/1_4_3-ib/bin/mpirun -np $NSLOTS ./mynameis
which yields
  PE_HOSTFILE:
  /opt/gridware/ge/default/spool/node104/active_jobs/12565.1/pe_hostfile

  cat PE_HOSTFILE:
  node104.danzek.itservices.manchester.ac.uk 32 [email protected] UNDEFINED
  node105.danzek.itservices.manchester.ac.uk 32 [email protected] UNDEFINED
  
 Hello! From task            5  on host node103   
 Hello! From task            9  on host node103
 .                           .  .
 .                           .  .

3.1. 

Example: The SGE pe_hostfile and OpenMPI

SGE and OpenMPI talk happily to eachother. There is nothing to do.

3.2. 

Example: The SGE pe_hostfile and HP-MPI

HP-MPI expects a different host/machine file format. A suitable file is easily created from the start procedure in the PE. For example:

  qconf -sp hp-mpi.pe

  pe_name            hpmpi-12.pe
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
  stop_proc_args     /bin/true
  allocation_rule    12
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE
where /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
  #!/bin/bash

  MACHINEFILE="machinefile.$JOB_ID"

  for host in `cat $PE_HOSTFILE | awk '{print $1}'`; do 
      num=`grep $host $PE_HOSTFILE | awk '{print $2}'`
  ##  for i in {1..$num}; do
      for i in `seq 1 $num`; do
        echo $host >> $MACHINEFILE
      done
  done 
 
and sample output likes like
  cat machinefile.44375

  node032.danzek.itservices.manchester.ac.uk
  node032.danzek.itservices.manchester.ac.uk
  node032.danzek.itservices.manchester.ac.uk
  .                                        .
  .                                        .
  node038.danzek.itservices.manchester.ac.uk
  node038.danzek.itservices.manchester.ac.uk
  .                                        .
  .                                        .

4. 

Tight Integration vs Loose Integration

When running a multi-process (e.g., MPI) job under SGE, we want reliable control of resources, full process control and correct accounting. Within Unix/Linux this is possible only if the job processes are part of a process heirarchy with SGE at the top.

Parallel processes can be started under SGE in two ways: the sge_execd daemon can start the processes; or they can be started from the PE. See Control Slaves, below.

5. 

Loose Integration

6. 

Tight Integration via RSH/SSH Wrappers

As stated above, using plain rsh/ssh does not yield tight-integration. A special invocation of qrsh can be used to achieve what we want, like this:

  1. The qmaster sends the master process to its destination execution daemon;
  2. it also reserves the jobs slots on the destination execution daemons for the slave processes — reserves only, not starts.
  3. The PE uses
            qrsh -inherit
    to start slave processes on target hosts — using qrsh this way bypasses the scheduler (contrast use of qrsh for interactive and/or GUI-based work); the job is sent directly to the target host execution daemon (sge_execd).
  4. As a security precaution, execution daemons deny such job submissions by default, but those reserved by qmaster as above are told to allow them.

Example

See $SGE_ROOT/mpi/startmpi.sh and $SGE_ROOT/rsh.