Run parallel jobs on the compute grid

From VrlWiki

Users interact with the department's compute grid through a standard interface called "Grid Engine". The grid includes spare computing resources of all kinds, in addition to clusters of dedicated, high-powered machines. The Grid Engine interface can be a little hairy, and bootstrapping yourself from the documentation is difficult, hence this HOWTO.

Please note that in Spring 2011, the department switched from using the Sun Grid Engine to the open-source variant just called "Grid Engine". The documentation on this page hasn't been thoroughly checked since then. If you notice anything inaccurate, please fix it!

This page documents how to run many instances of the same program on several machines at once, with one argument to the program varying from instance to instance. As a motivating example, suppose you need to compute a gigantic matrix whose entries can all be computed independently of one another: you could compute different rows on different machines at the same time.

Decide on an Architecture

Before you can actually submit jobs to the grid, you must decide on a machine architecture on which to run your job. Do you want to run on a 32-bit architecture or a 64-bit one? Well, there are pros and cons to each:

32-bit
Pro: no need to recompile your programs.
Con: there are only six 32-bit grid hosts, and there are fewer every semester.
64-bit
Pro: there are more than sixty 64-bit grid hosts, and of course you can cram way more data into main memory on a 64-bit architecture.
Con: every department machine that you can log into normally, from sixg to wheat to your own workstation, is 32-bit. Therefore all our programs are compiled on 32-bit machines. You must recompile your program and all the libraries it requires in order to run it on a 64-bit machine.

If you chose to use a 32-bit architecture, continue to the next section. If you chose 64-bit, keep reading...

Recompiling for a 64-bit Architecture

Log into an interactive session on one of the 64-bit hosts:

> qlogin -l arch=lx24-amd64

Now compile your program, and every library it requires, as you normally would.

Run Your Job

Write a shell script interface to your executable

To run parallel jobs, Grid Engine starts a separate instance of the job (a task) for each ID in a range you specify, scheduling the tasks onto whatever hosts are available and setting the environment variable $SGE_TASK_ID to a different value in each one (see below for how to specify the range of values). The easiest way to run your executable is via a shell script that reads this variable and turns it into a command-line argument for the program that does the real work. Here's an example script:

#!/usr/bin/tcsh
# 
# ge.sh
# 
# An example Grid Engine array script

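# Grid Engine requires task IDs to start at 1, but the program expects
# a zero-based index, so shift the ID down by one.
@ MACHINEINDEX = ( $SGE_TASK_ID - 1 )

# Pass the index as the last command-line argument. The path is relative,
# so the job must run from the directory that contains obj/.
./obj/myProgram a_bunch_of default arguments $MACHINEINDEX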
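
For the matrix example from the introduction, the zero-based index usually has to be turned into a block of rows. The following is only a sketch of that arithmetic, written in Bourne shell rather than tcsh; row_range, the 1000-row matrix, and the 100 rows per task are hypothetical choices, not part of any real program here. Depending on how the queues are configured, a Bourne shell script may also need qsub's -S /bin/sh option, which this page does not cover, so check the qsub man page before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: which zero-based rows does a given task own?
# Arguments: task ID (one-based), total number of rows, rows per task.
row_range() {
    task_id=$1; total_rows=$2; rows_per_task=$3
    first=$(( (task_id - 1) * rows_per_task ))
    last=$(( first + rows_per_task - 1 ))
    # The final task may get a short block if the rows don't divide evenly.
    if [ "$last" -ge "$total_rows" ]; then
        last=$(( total_rows - 1 ))
    fi
    echo "$first $last"
}

# Outside Grid Engine SGE_TASK_ID is unset, so fall back to 1 for dry runs.
row_range "${SGE_TASK_ID:-1}" 1000 100
```

With these numbers, submitting task IDs 1 through 10 would cover all 1000 rows, and each task could pass its first and last row to the program instead of the bare index.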

Submit the job to Grid Engine

  1. cd to the directory from which you want to run your shell script.
  2. Submit the following command, replacing <STUFF_IN_BRACKETS> with the appropriate arguments:
    qsub -l arch=<ARCH> -q <QUEUE> -m <NOTIFY> -t <RANGE> -wd <PATH> <SHELLSCRIPT>
    Arguments:
    • -l arch=<ARCH> --- your choice of either lx24-x86 to use 32-bit machines or lx24-amd64 to use 64-bit machines.
    • -q <QUEUE> --- optional. If you'd like, you can select a specific job queue to which to add your job. Choices include short.q, long.q, idle.q, and highmem.q. See tstaff's descriptions for more details.
    • -m <NOTIFY> --- optional. By default, Grid Engine doesn't push much information about the state of your job to you. If you'd like to get notifications by email, pick any combination of the following options:
      • b --- get an email when your job begins
      • e --- get an email when your job ends
      • a --- get an email if your job is aborted
      • s --- get an email if your job is suspended
      • n --- don't get any emails. Monitor your job with qstat or qmon instead.
    • -t <RANGE> --- specify the range of task IDs that will be handed out to your parallel jobs, written MIN-MAX:STEP (for example, 1-10:3 assigns 1, 4, 7, and 10). As with Matlab's ranges, task IDs are one-indexed and MAX itself is assigned when the step lands on it, although Matlab orders the fields differently (MIN:STEP:MAX).
      • MIN is an integer ≥ 1 that represents the lowest ID that will be assigned.
      • MAX is an integer ≥ MIN that represents the highest ID that will be assigned.
      • STEP is an integer ≥ 1 that represents the step size taken between assigned task IDs.
    • -wd <PATH> --- optional. Set the working directory for the host executing your job to <PATH>.
    • <SHELLSCRIPT> --- the shell script to be executed by the grid hosts.
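
To make the range rules concrete, here is a small Bourne shell sketch that lists the task IDs a given MIN-MAX:STEP would hand out, followed by a dry run that only prints a complete submission line. task_ids is a hypothetical helper, and ge.sh, short.q, and the range are placeholder choices; substitute your own.

```shell
#!/bin/sh
# Enumerate the task IDs that "-t MIN-MAX:STEP" assigns:
# MIN, MIN+STEP, MIN+2*STEP, ... stopping once MAX would be exceeded.
# Arguments: MIN, MAX, and optionally STEP (defaults to 1).
task_ids() {
    id=$1; max=$2; step=${3:-1}
    ids=""
    while [ "$id" -le "$max" ]; do
        ids="$ids $id"
        id=$(( id + step ))
    done
    # Unquoted on purpose so the leading space is dropped.
    echo $ids
}

task_ids 1 10 3     # prints: 1 4 7 10

# Dry run: print (rather than execute) a full submission line that asks
# for 64-bit hosts and an email when the job ends or is aborted.
echo qsub -l arch=lx24-amd64 -q short.q -m ea -t 1-10:3 -wd "$PWD" ge.sh
```

Printing the command first is a cheap way to check the arguments before the job actually lands in a queue.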

This command prints a single line: an error message if something went wrong, otherwise a confirmation that includes the job ID of the submitted job. Grid Engine writes the standard output and standard error of every instance of your shell script to separate files named after the job and task ID (by default <SHELLSCRIPT>.o<jobid>.<taskid> and <SHELLSCRIPT>.e<jobid>.<taskid>), in the directory where qsub was called, or in <PATH> if you passed -wd.
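
With many tasks, the number of output files grows quickly. As a sketch, assuming the default file names (job name equal to the script name, no -N, -o, -e, or -j options in play), the following Bourne shell helpers build the expected names and list the tasks that wrote anything to standard error. The function names and the job ID 12345 are made up for illustration.

```shell
#!/bin/sh
# Default output file names for one task of an array job.
# Arguments: job name, job ID, task ID.
stdout_file() { echo "$1.o$2.$3"; }
stderr_file() { echo "$1.e$2.$3"; }

# Print the task IDs of a job whose stderr file exists and is non-empty.
# Arguments: job name, job ID. Run it in the directory holding the files.
tasks_with_errors() {
    for f in "$1".e"$2".*; do
        if [ -s "$f" ]; then
            # Strip everything up to the last period, leaving the task ID.
            echo "${f##*.}"
        fi
    done
}

stderr_file ge.sh 12345 7    # prints: ge.sh.e12345.7
```

A non-empty error file doesn't always mean failure, since some programs log to standard error, but it is a quick first filter.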

Troubleshooting

I get an email saying "can't chdir to" some directory

Occasionally a grid host loses its connection to the CS network's filesystem. You can't fix that yourself, but if you tell Grid Engine that the error has been cleared, it will re-assign the task, usually to another host. The email tells you the ID of the job that went wrong; for array jobs (those that use $SGE_TASK_ID to do different things on different hosts), the ID has the form <jobid>.<taskid>: the job ID, a period, and the task ID. You can tell Grid Engine to try again by issuing this command:

qmod -c <jobid>
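
If several tasks of the same array job failed this way, you need one ID per task. The sketch below, in Bourne shell, only prints the commands so you can review them before running anything; clear_commands is a hypothetical helper, and the job ID 12345 and task IDs 7 and 19 are invented.

```shell
#!/bin/sh
# Print one "qmod -c" command per failed task of an array job.
# Arguments: job ID, followed by any number of task IDs.
clear_commands() {
    job_id=$1; shift
    for task_id in "$@"; do
        echo "qmod -c $job_id.$task_id"
    done
}

clear_commands 12345 7 19
```

Once the list looks right, piping the output to sh would run the commands.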

My script fails and the error output says "Command not found" for my executable

There are a few simple problems that could be causing this: the program doesn't have the execute bit set, it's genuinely not where you're telling the script to look for it, or you have an architecture mismatch.

  1. To check the execute bit, run ls -la in the directory containing your program and look for an x in its permissions. If it's missing, chmod +x will set it.
  2. To check for path issues, add some lines to your script before you call the executable to say where the script thinks it is and to list the directory where it's trying to find the executable. Things like pwd, ls, echo $PATH, and even mount could be useful.
  3. To check for architecture mismatch, add uname -m to your script before you call the executable. If the output is i486, i686, or something like that, the host is a 32-bit machine. If the output is x86_64, it's a 64-bit machine. Most likely, your executable was compiled on a 32-bit department machine, and you'll need to recompile it to run on a 64-bit architecture.
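
The checks above can be bundled into a few diagnostic lines. This is a Bourne shell sketch, so the syntax would need adapting before pasting into a tcsh script like ge.sh; grid_arch is a hypothetical helper that maps uname -m output onto the arch values used with qsub, and ./obj/myProgram is the example path from earlier on this page.

```shell
#!/bin/sh
# Map a `uname -m` string to the matching Grid Engine arch value.
grid_arch() {
    case "$1" in
        x86_64) echo lx24-amd64 ;;   # 64-bit host
        i?86)   echo lx24-x86 ;;     # i386, i486, i686: 32-bit host
        *)      echo unknown ;;
    esac
}

# Diagnostics to run just before the line that starts the executable.
echo "host: $(uname -n)"
echo "cwd:  $(pwd)"
echo "arch: $(uname -m) -> $(grid_arch "$(uname -m)")"
# Show the permissions, or say so if the file isn't where we expect it.
ls -l ./obj/myProgram 2>&1 || echo "executable not found at ./obj/myProgram"
```

The output ends up in the task's standard output file, so a failing task records which host it ran on and what that host could see.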

See Also