Run parallel jobs on the compute grid
Revision as of 15:34, 30 March 2009
The Sun Grid Engine (SGE) is a standard interface for submitting jobs to a compute cluster. The department uses SGE to manage all sorts of spare computing resources, including some really high-powered machines that are only accessible through SGE. Using this interface can be a little bit hairy, and bootstrapping yourself from the documentation is difficult, hence this HOWTO.
First Things First
Decide on an Architecture
Before you start, you must decide on a machine architecture on which to run your job. Do you want to run on a 32-bit architecture or a 64-bit one? Well, there are pros and cons to each:
- 32-bit
- Pro: no need to recompile your programs.
- Con: there are only six 32-bit SGE hosts.
- 64-bit
- Pro: there are sixty 64-bit SGE hosts, and of course you can cram way more data into main memory on a 64-bit architecture.
- Con: every department machine that you can log into normally, from sixg to wheat to your own workstation, is 32-bit. Therefore all our programs are compiled on 32-bit machines. You must recompile your program and all the libraries it requires in order to run it on a 64-bit machine.
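If you are not sure which of the two architectures a given machine corresponds to, a small check like the following may help. It is only a sketch: the mapping from the output of uname -m to the arch strings used later in this HOWTO (lx24-x86 and lx24-amd64) is an assumption, and the function name sge_arch_for is hypothetical.

```shell
#!/bin/sh
# sge_arch_for MACHINE
# Map a machine name as printed by `uname -m` to the SGE arch string
# used in this HOWTO. The mapping is an assumption, not taken from SGE.
sge_arch_for() {
    case $1 in
        x86_64|amd64) echo lx24-amd64 ;;  # 64-bit hosts
        i?86)         echo lx24-x86 ;;    # 32-bit hosts (i386, i686, ...)
        *)            echo unknown ;;
    esac
}

# Report the arch string matching the machine you are logged into.
sge_arch_for "$(uname -m)"
```

A program compiled on a machine that reports lx24-x86 should run on the 32-bit hosts without being recompiled.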
If you chose to use a 32-bit architecture, continue to the next section. If you chose 64-bit, keep reading...
Recompiling for a 64-bit Architecture
... I'll let you know when I figure it out...
Run Your Job
Write a shell script interface to your executable
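A minimal sketch of such a script, assuming that SGE exports the task ID of each instance in the SGE_TASK_ID environment variable (as it does for jobs submitted with -t) and that each task works on its own input file. The names run_task.sh, input_<ID>.txt and my_program are hypothetical placeholders.

```shell
#!/bin/sh
# run_task.sh (hypothetical name): wrapper that SGE runs once per task ID.

# input_for_task ID
# Print the input file name for a given task ID (naming scheme is assumed).
input_for_task() {
    printf 'input_%s.txt\n' "$1"
}

# Fall back to task 1 so the script can be tried outside of SGE.
task_id="${SGE_TASK_ID:-1}"
input="$(input_for_task "$task_id")"

echo "task ${task_id}: processing ${input}"
# Replace the line below with your real executable:
# ./my_program "$input"
```

Running the script by hand with SGE_TASK_ID=3 set in the environment is a quick way to check what task 3 would do before submitting anything.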
Submit the job to SGE
- ssh sge --- Log into the machine sge
- cd to the directory from which you want to run your job
- Submit the following command, replacing <STUFF_IN_BRACKETS> with the appropriate arguments:
    qsub -l arch=<ARCH> -q <QUEUE> -m <NOTIFY> -t <RANGE> -wd <PATH> <SHELLSCRIPT>
  Arguments:
  - -l arch=<ARCH> --- your choice of either lx24-x86 to use 32-bit machines or lx24-amd64 to use 64-bit machines.
  - -q <QUEUE> --- optional. If you'd like, you can select a specific job queue to which to add your job. Choices include short.q, long.q, idle.q, and highmem.q. See tstaff's descriptions for more details.
  - -m <NOTIFY> --- optional. By default, SGE doesn't push much information about the state of your job to you. If you'd like to get notifications by email, pick any combination of the following options:
    - b --- get an email when your job begins
    - e --- get an email when your job ends
    - a --- get an email if your job is aborted
    - s --- get an email if your job is suspended
    - n --- don't get any emails. Monitor your job with qmon instead.
  - -t <RANGE> --- specify a range for the task IDs that will be sent to your parallel jobs. The range looks like this: MIN-MAX:STEP. This is a range specification similar to Matlab's; in particular, note that task IDs are one-indexed and the range is inclusive of its maximum value.
    - MIN is an integer ≥ 1 that represents the lowest ID that will be assigned.
    - MAX is an integer ≥ MIN that represents the highest ID that will be assigned.
    - STEP is an integer ≥ 1 that represents the step size taken between assigned task IDs.
  - -wd <PATH> --- optional. Set the working directory for the host executing your job to <PATH>.
  - <SHELLSCRIPT> --- the shell script to be executed by the SGE hosts.
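To see which task IDs a given MIN-MAX:STEP range produces before submitting anything, the range can be expanded locally. The sketch below uses the standard seq utility; the function name expand_range is hypothetical, and treating a missing :STEP as a step of 1 is an assumption about how the range is meant to be read.

```shell
#!/bin/sh
# expand_range MIN-MAX[:STEP]
# Print the task IDs described by a range, separated by spaces.
expand_range() {
    spec=$1
    range=${spec%%:*}              # part before ":", e.g. "1-10"
    step=1                         # assumed default when ":STEP" is absent
    case $spec in
        *:*) step=${spec#*:} ;;    # part after ":", e.g. "3"
    esac
    min=${range%%-*}               # lowest task ID
    max=${range#*-}                # highest task ID (inclusive)
    # The unquoted substitution is intentional: it joins seq's
    # one-per-line output into a single space-separated line.
    echo $(seq "$min" "$step" "$max")
}

expand_range 1-10:3    # prints: 1 4 7 10
```

Note that the maximum is only reached when it lies on the step grid: 1-10:3 ends at 10, but 1-9:3 ends at 7.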
This command prints a single string: an error message if there is a problem, otherwise the job ID of the submitted job. SGE writes the standard output and standard error of every instance of your shell script that runs on the various hosts to files named according to the job and task ID, in the directory where qsub was called.
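As a worked example, here is one way the placeholders might be filled in. All the values are illustrative assumptions: run_task.sh is a hypothetical script name, and the queue, notification letters and range are just one possible choice. The sketch only prints the command so that it can be inspected first; it does not submit anything.

```shell
#!/bin/sh
# Illustrative values only; substitute your own.
ARCH=lx24-x86            # 32-bit hosts
QUEUE=short.q            # one of the queues listed above
NOTIFY=ea                # email when the job ends or is aborted
RANGE=1-10:3             # task IDs 1, 4, 7, 10
WORKDIR=$PWD             # run each task in the current directory
SHELLSCRIPT=run_task.sh  # hypothetical wrapper script

# Print the full qsub command line without running it.
build_qsub_command() {
    echo "qsub -l arch=$ARCH -q $QUEUE -m $NOTIFY -t $RANGE -wd $WORKDIR $SHELLSCRIPT"
}

build_qsub_command
```

Once the printed command looks right, paste it into your shell on sge to submit the job. On a stock SGE configuration the per-task output files are typically named after the script, for example run_task.sh.o<JOBID>.<TASKID> for standard output and run_task.sh.e<JOBID>.<TASKID> for standard error, but the local setup may differ.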