Main Page | Recent changes | Edit this page | Page history

Printable version | Disclaimers

Not logged in
Log in | Help
 

Typhoon(Cluster) Documentation

From AstCCwiki

The Astro Theory group maintains a parallel computer cluster named Typhoon for group computational needs. Currently, Typhoon comprises 80 processors, 70GB of RAM, and 5TB of disk space. The interesting thing, of course, is how all this computing power and storage capacity is organized into one semi-cohesive unit. Typhoon is a Beowulf-class computer cluster consisting of 1 master node and 39 slave compute nodes connected via a private 100Mb/s Ethernet network. Each node contains two roughly 2 GHz AMD AthlonMP processors, roughly 2GB of RAM, a roughly 100 GB hard disk, and runs Fedora Core Linux. For long term storage, we have 1.4TB of space available on a hardware RAID attached to the master node. The full power of the cluster is harnessed via parallel programming, of course, for which we use MPI, which allows shared memory parallelization on an individual dual CPU node, non-shared memory among different nodes through the network, and any combination of the two. Computer jobs are managed via the Torque resource manager (based on OpenPBS), in combination with the Moab scheduler (based on Maui). We also use the Condor resource manager as an "opportunistic" scheduler to gobble up any available resources not allocated by Torque (since Torque is compatible with all codes/programming languages used by our group, but Condor is not).

General information on accessing and using the cluster is detailed below. Hopefully, this documentation will answer most questions. If you need additional assistance, please email the Cluster Help List at mailto:cluster-help@astro.northwestern.edu.

Table of contents

Accessing the Cluster

Ground Rules and General Information

[friendly@master friendly]$ quota -sv
Disk quotas for user friendly (uid 599):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
       /dev/md0    457M   4758M   5008M            2613       0       0
      /dev/sda1    124G    152G    160G           10293       0       0

/dev/md0 corresponds to /home, and /dev/sda1 corresponds to /storage. This user is using only 457MB of their 4758MB soft / 5008MB hard quota on /home. They are using 124GB of their 152GB soft / 160GB hard quota on /storage. If they exceed their soft quota on either filesystem, a time will appear in the "grace" column, letting them know how long they have to reduce their disk space usage to below the soft limit before their hard quota is clamped at their current usage. The last 4 columns above correspond to limits on the number of files a user may have. The quotas here are 0, meaning that there is no limit. Every night the cluster checks each user's disk usage. Anyone over quota receives an automated email message instructing them to reduce their usage. If you find that even after compressing infrequently used files and deleting unused files you need more space than you currently have allocated on /storage, please contact mailto:cluster-help@astro.northwestern.edu to request more space.

Job Submission

Management of the cluster computing resources is handled by one primary (and one secondary) batch submission system. The primary system is Torque (based on OpenPBS, often referred to generically as PBS), which should be used in priority to Condor, which is another batch system we have installed for use in opportunistic mode, to use any computing resources that are currently unused by Torque. The batch submission systems handles the scheduling of job requests, executes submitted jobs, and handles job output and errors, all while balancing user demands on computing resources in a (hopefully) equitable manner. Below we provide an introduction to using each batch submission system that should allow you to quickly start using Typhoon effectively.

We have two methods for you to submit jobs:

Job Submission with the Default Resource Manager, Torque

Jobs are submitted via the resource manager PBS (which stands for "Portable Batch System"). The implementation we use on Typhoon is Torque.

An Introduction to PBS Job Submission & Management

PBS has the capability of handling both serial and parallel jobs. While the details of submitting each type differ somewhat and will be treated separately below, the basic process is the same:

  1. 1) you compile your code and prepare your input files as usual,
  2. 2) you create a batch script for submission to PBS with details on which files are needed to run the job, how to run the job, and how to copy the output back to the master node,
  3. 3) you submit the job to PBS in one of the available queues.

Serial Jobs

Let's jump right in with a working example. Let's say your executable is called cmc , that to run a job requires the input files pl_n1e5_s_fb0_c0.cmc and plummer_n1e5_salpeter0.2_1.2.fits , that you run the job with the command ./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out , and that the location of the input files and where you would like the output copied back to when the job is complete is /storage/fregeau/runs/testing . Here is a working PBS job submission script for such a job for the user "fregeau" (available as the file /usr/local/cluster/example.pbs" on Typhoon):

#!/bin/sh
#
# created with /usr/local/bin/beo-genpbs.pl on Mon Jan  9 15:44:11 CST 2006
# commandline: /usr/local/bin/beo-genpbs.pl -j pln1e5sfb0c0 -f \
# cmc,pl_n1e5_s_fb0_c0.cmc,plummer_n1e5_salpeter0.2_1.2.fits -c "./cmc -d \
# pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out" -t "300:00:00" -q longjob pl_n1e5_s_fb0_c0.pbs
#

# PBS directives can't use shell variables, so have to be done by hand.
#PBS -N pln1e5sfb0c0
#PBS -r n
#PBS -e /storage/fregeau/runs/testing/pln1e5sfb0c0.stderr
#PBS -o /storage/fregeau/runs/testing/pln1e5sfb0c0.stdout
#PBS -l nodes=1,cput=300:00:00,ncpus=1
#PBS -q longjob

# Define variables to make PBS script more flexible.
SRCDIR=/storage/fregeau/runs/testing
LOCDIR=/usr/local/cluster/users/`whoami`
WRKDIR=pln1e5sfb0c0_zQd6mdir

# Create working directory based on job name.
cd $LOCDIR
mkdir $WRKDIR
cd $WRKDIR

# Copy needed files to working directory.
cp $SRCDIR/cmc .
cp $SRCDIR/pl_n1e5_s_fb0_c0.cmc .
cp $SRCDIR/plummer_n1e5_salpeter0.2_1.2.fits .

# Run code.
./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out

# Copy all files back.
rcp * master:$SRCDIR && rm -f *

# And clean up.
cd ..
rmdir $WRKDIR

# This is just in case the previous command fails for some reason.
exit 0

The simplest way to explain this script is to dissect it. The first line:

#!/bin/sh

simply declares the script to be an executable "sh" script. PBS requires that the submission script be an executable shell script, such as "sh", "csh", or "bash". The next few lines:

#
# created with /usr/local/bin/beo-genpbs.pl on Mon Jan  9 15:44:11 CST 2006
# commandline: /usr/local/bin/beo-genpbs.pl -j pln1e5sfb0c0 -f \
# cmc,pl_n1e5_s_fb0_c0.cmc,plummer_n1e5_salpeter0.2_1.2.fits -c "./cmc -d \
# pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out" -t "300:00:00" -q longjob pl_n1e5_s_fb0_c0.pbs
#

are comments. In this case they say that this PBS script has been created with the "beo-genpbs.pl" utility for automatically creating PBS scripts (more on that below). The next few lines:

# PBS directives can't use shell variables, so have to be done by hand.
#PBS -N pln1e5sfb0c0
#PBS -r n
#PBS -e /storage/fregeau/runs/testing/pln1e5sfb0c0.stderr
#PBS -o /storage/fregeau/runs/testing/pln1e5sfb0c0.stdout
#PBS -l nodes=1,cput=300:00:00,ncpus=1
#PBS -q longjob

are a comment followed by several PBS directives. The "PBS -N pln1e5sfb0c0" directive specifies the job name as "pln1e5sfb0c0". "PBS -r n" tells PBS that the job is re-runnable---don't worry about what this means for now. "PBS -e /storage/fregeau/runs/testing/pln1e5sfb0c0.stderr" tells PBS to write all stderr for the job to "/storage/fregeau/runs/testing/pln1e5sfb0c0.stderr". "PBS -o /storage/fregeau/runs/testing/pln1e5sfb0c0.stdout" is analogous, but for stdout. "PBS -l nodes=1,cput=300:00:00,ncpus=1" speicifies the resources required for the job---in this case 1 node, 300 hours of CPU time, and 1 CPU. Finally, "PBS -q longjob" tells PBS to put the job in the "longjob" queue. Anything in this script that is not a comment or a PBS directive is executed by PBS on the compute node as the job. The next few lines:

# Define variables to make PBS script more flexible.
SRCDIR=/storage/fregeau/runs/testing
LOCDIR=/usr/local/cluster/users/`whoami`
WRKDIR=pln1e5sfb0c0_zQd6mdir

simply define useful shell variables that will make file manipulation and job execution later in the script much easier. In particular, "SRCDIR" is defined to be the source directory of all input files. "LOCDIR" is defined to be the "local" working directory on the compute node (the "whoami" command gets translated into the user name, "fregeau" in this case). Finally, "WRKDIR" is defined as the working directory for the job, which gets created inside LOCDIR on the compute node. Note that the name of the working directory is composed of the job name, followed by some random characters. The "beo-genpbs.pl" script adds the random characters so that two different jobs that have the same job name can run on the same compute node without interfering with each other. The next stanza:

# Create working directory based on job name.
cd $LOCDIR
mkdir $WRKDIR
cd $WRKDIR

changes directories to the "LOCDIR" on the compute node, creates the working directory, then cd's into it. The next stanza:

# Copy needed files to working directory.
cp $SRCDIR/cmc .
cp $SRCDIR/pl_n1e5_s_fb0_c0.cmc .
cp $SRCDIR/plummer_n1e5_salpeter0.2_1.2.fits .

copies all the files needed to run the job (executable and input files) from the source directory on the master node to the working directory on the compute node. The next stanza:

# Run code.
./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out

actually runs the job. After this command finishes, the next stanza is executed:

# Copy all files back.
rcp * master:$SRCDIR && rm -f *

which copies all files (executable, input files, and all output files created in that directory) back to the source directory on the master node. Note that "rcp" is used for this, since /storage is not writable from the compute nodes via NFS. Note also that if the remote copy command fails for any reason, the "rm -f" command (which removes all files from the working directory) will not be executed due to the "&&". It is arranged this way as a safety measure, in case the source directory is not accessible for some reason. Next:

# And clean up.
cd ..
rmdir $WRKDIR

the script cd's one directory up and attempts to remove the working directory. This will only succeed if the remote copy worked correctly and all files were copied back to the master node. Finally:

# This is just in case the previous command fails for some reason.
exit 0

The script claims that it has finished successfully so that PBS finishes the job. PBS scripts (such as this one) are then submitted to PBS with the "qsub" command. For example, if this script exists as "example.pbs" in the current directory, it is submitted with the command:

qsub example.pbs

To find out which queues are active and what their limits are, run the command "qstat -q" on the master node. Or you can look at the table below, which is unfortunately quite likely to be out of date.

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
shortjob           --   48:00:00 72:00:00    2     5     0   60   E R
verylongjob        --   800:00:0 1200:00:   10     5     0   10   E R
mediumjob          --   150:00:0 225:00:0    2    47    17   50   E R
longjob            --   300:00:0 450:00:0    2    15   104   40   E R
debug              --   00:20:00 00:30:00    8     0     0   10   E R
veryshortjob       --   15:00:00 22:30:00    2     0     0   60   E R
fastlargejob       --   05:00:00 07:30:00   16     0     0   20   E R
largejob           --   100:00:0 150:00:0   16     0     0   16   E R
def                --   150:00:0 225:00:0   32     0     0  240   E R
                                               ----- -----
                                                  72   121

def is the "default" routing queue, which should route jobs to the appropriate queue based on their requirements, but has a default CPU time of 150 hours if this is not specified. debug has a maximum time limit of 20 minutes, so it really is for debugging. The remaining queues are the "real" queues, and are organized both by CPU time required for a job, and number of CPUs required. veryshortjob, shortjob, mediumjob, longjob, and verylongjob are intended primarily for serial jobs, and, as their names imply, are constructed for varying CPU time requirements. fastlargejob and largejob are intended primarily for parallel jobs, with a limit of 16 CPUs max.

This table is simply the output of "qstat -q". Note that the columns "Run" and "Que" show how many jobs are running and queued, respectively, in the queues listed.

Parallel Job Submission

Here is an example of a working PBS submission script for a parallel job:

#!/bin/csh
#PBS -N ptest
#PBS -r n
#PBS -e /storage/nata/Current/WD/GC/Set_1e6/tuc47/tuc47.err
#PBS -o /storage/nata/Current/WD/GC/Set_1e6/tuc47/tuc47.log
#PBS -l nodes=2:ppn=2,cput=299:00:00
#PBS -q longjob
#PBS -m e
mkdir -p /usr/local/cluster/users/nata/tuc47
mkdir -p /usr/local/cluster/users/nata/tuc47/NS
mkdir -p /usr/local/cluster/users/nata/tuc47/LISA
mkdir -p /storage/nata/Current/WD/GC/Set_1e6/tuc47/NS
mkdir -p /storage/nata/Current/WD/GC/Set_1e6/tuc47/LISA
cd /usr/local/cluster/users/nata/tuc47
NPROCS=`wc -l < $PBS_NODEFILE`
cat $PBS_NODEFILE > "computers.file"
rsync computers.file nata@master.cluster:/storage/nata/Current/WD/GC/Set_1e6/tuc47/
rsync -c nata@master.cluster:/storage/nata/Current/WD/GC/Set_1e6/tuc47/upload.sh ./
sh upload.sh
if ( -e restart.dat) then
   sleep 60
   echo "do restart of the cluster evolution (part 2)"
   sh restart2.sh
   sleep 120
else
    echo "Fresh start"
endif
mv -f ./gc_exe tuc47
/usr/local/mpich-1.2.5/bin/mpirun -machinefile computers.file -np 4 ./tuc47 > debug
if ( -e restart1.sh) then
    sh restart1.sh
endif
sh recover.sh

The most important line is

#PBS -l nodes=2:ppn=2,cput=299:00:00

which specifies the requirements of the job. This job requires 2 nodes, 2 processors per node, and 299 hours of computing time. Note that there is no directive such as "ncpus=4". This is because PBS already knows the total number of CPUs required from nodes and ppn. Adding "ncpus=4" will in fact cause the job to hang in the PBS queue indefinitely.

Note that you should be sure to submit your job to the proper queue. For example, at present the largejob queue will only run jobs that require 3 or more nodes. You can get every last gory detail about the queues with the command

qmgr -c 'print server'

Automatic Generation of PBS Files with "beo-genpbs.pl"

The serial PBS script above was created with the useful utility "beo-genpbs.pl", which is available for all to use on Typhoon. It was created with the command:

beo-genpbs.pl -j pln1e5sfb0c0 -f cmc,pl_n1e5_s_fb0_c0.cmc,plummer_n1e5_salpeter0.2_1.2.fits \
-c "./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out" -t "300:00:00" -q longjob \
pl_n1e5_s_fb0_c0.pbs

executed from the directory "/home/fregeau/storage/runs/testing" by user "fregeau". The available options for using "beo-genpbs.pl" can be obtained via the help, and will print out something like the following:

[fregeau@master testing]$ beo-genpbs.pl -h
Generates a PBS file using a few simple rules.

USAGE:
  /usr/local/bin/beo-genpbs.pl [options...] <outfile>

OPTIONS:
  -d <srcdir>  : source directory [<current directory>]
  -j <jobname> : job name [myjob]
  -c <cmdline> : command line [echo "this is a test"]
  -f <files>   : comma-separated list of required files []
  -m <nodes>   : number of nodes required [1]
  -t <cput>    : cput [150:00:00]
  -n <ncpus>   : ncpus [1]
  -q <queue>   : queue [mediumjob]
  -h           : display this help text

EXAMPLE:
  /usr/local/bin/beo-genpbs.pl -f cmc,test.input,test.fits -c "./cmc test.input test.out" \
  -t 300:00:00 -q longjob test.pbs

PBS Job Management

There are several commands and utilities useful for monitoring and managing your jobs. The collection below should be viewed as only a partial list. Please consult the local manpages (run "man" on each of these commands and look also at the "SEE ALSO" section for related commands) for more information.

The simplest command for monitoring the state of the cluster is "qstat". When given without any flags, it prints out:

  1. A list of all jobs submitted to the batch system, in the form 'nnnnn'.master, where 'nnnnn' is the unique job id
  2. The name of the job as given by the user
  3. The owner of the job
  4. The time for which the job has been running (note that it only refreshes every ~30 secs)
  5. The queue to which the job was submitted

Troubleshooting

Common Problems

Unfortunately, no system works properly all the time, and jobs have been known to crash on Typhoon, and sometimes not even submit at all! Here is a helpful list to help analyze the problem. Please check these items before emailing us.

Here is an example of someone who is reaching their quota limit:

Disk quotas for user friday (uid 518):
    Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
      /dev/md0   4026M   6439M   7155M            4564       0       0
     /dev/sda1    186G*   177G    197G   5days   54605       0       0

As you can see, this user is over their soft limit on /dev/sda1 (which is /storage). With just 11GB more data, they would reach their hard limit, in which case they would no longer be able to write to /storage. Note the column labeled "grace". This is the remaining grace period for being above the soft quota limit. Once this grace period is exceeded, the user cannot write any more to disk until their usage is brought to below the soft limit.

Recovering Files After a Crash

If your job crashes for any reason, you should be able to recover whatever output files were generated. The files created during the job will remain in place on the compute node, in /usr/local/cluster/users/yourname/yourjob. You can do an "ls" on all compute nodes with "dsh -a -c -- ls /usr/local/cluster/users/user_name". If the output and error logs are not written to the job launch directory, log into the compute node and check out /var/spool/pbs/undelivered, where they will have the names given by the job number.

Job Submission with the Opportunistic Resource Manager, Condor

Should you use condor? : As it stands now, PBS is configured for maximal use of the cluster. However, in order for the queueing system to work fairly, it is not uncommon for several nodes to be unused at any given time. For this reason we have installed Condor on the cluster as well. Condor is another resource manager, but one that can checkpoint and suspend jobs as they are running to let other jobs take over given nodes. Thus it is the ideal opportunistic scheduling system, since it will let a user use as much of the cluster as possible, but as soon as another user submits jobs, will suspend some fraction of the original users jobs in order to fairly share the cluster. Condor sounds great, right? So why are we not using it as our primary resource manager? Simply because not all codes and programming languages work properly with Condor. A notable example is Fortran90. But it does work with C and C++. In any case, you may submit jobs to Condor as well as PBS. Condor will use any nodes that are not in use by PBS, but PBS has priority.

Or, more briefly, you have the opportunity to use more nodes than you could get with PBS, if the cluster is under-used. (Some users have often obtained 50-60 CPUs at a time.) But you only get those nodes if the cluster isn't busy.


Getting started?: To begin using Condor, see the examples in /usr/local/condor/examples/. Basically all you need to do is recompile your code with "condor_compile gcc testcode.c", for example, instead of gcc. Say you usually run your code as "testcode 10 12". Then create a job file like the following:

executable      = testcode
output          = test.stdout
error           = test.stderr
log             = test.log
arguments       = 10 12
environment     =
queue

Let's call this file "test.condor". Finally, run

condor_submit test.condor

and that's it! If your code writes output files, they will all be written to the directory from which you submitted the job. No more worrying about rcp'ing files from compute nodes back to the master node!

To check the Condor queue, run "condor_q".

There is no need to worry about different queues and the like with Condor. It works on the principle of sharing, which it is able to do efficiently thanks to checkpointing. For example, say one user has jobs running on all available compute nodes, and I submit several jobs. Condor will realize that the cluster is not being fully shared, and will pause as many of the first user's jobs (by checkpointing) as necessary so that my jobs will run, up to the point at which I am using my fair share of the cluster, which in this case is one half. If another user comes along, the same process is repeated, so that each users gets up to one third of the cluster.

Please read the documentation available at the Condor website: http://www.cs.wisc.edu/condor/. It's very good. See especially the tutorials and job submission examples. They are very succinct. Note that our default universe is the standard universe.

Standard or vanilla? A checklist....

Not all codes work with condor's implicit checkpointing mechanism. Before trying your code, take a look at the following list to make sure your code has a chance...

Examples

Examples are available in /usr/local/condor/examples on each of these machines. Please copy the files in this directory to your own directory, see the README, and compile/run them.

For example, to run the 'io' job a few times, you might try the following:

make io.remote
for i in `seq 1 10`
 do
  condor_submit io.cmd
done
<wait...>
condor_q
condor_status
<wait...>
more io.log
condor_userlog -hostname io.log

To run all the jobs, use

make all
./submit

See the README for a discussion of what each file does. Make sure you look at the log files and output files.

Vanilla Universe

The 'stanard' condor mode, in which we recommend you operate, requires you to use code which has been compiled specifically to work with condor.

However, you can also use condor in the 'vanilla' universe.

1) Default vanilla

2) Continuing vanilla If your code is sufficiently intelligent, it will be able to restart where it left off given its data files. In this case, you should use the following [contributed by Bart Willems]

there are 3 additional lines to get multiple jobs running in a vanilla universe:

should_transfer_files = YES 
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = population_v2_sun.in,WEB_birth_dat.MV1,...

The first line transfers the executable and any input file defined by the input command to the appropriate node. The second one tells condor to store the output not only on a normal program termination, but also when condor vacates the node. If there are input files native to the code, these must be specified in the third line. Wild cards do not appear to work for this command.

For more details, see http://www.cs.wisc.edu/condor/manual/v6.6.7/2_5Submitting_Job.html#SECTION00354000000000000000

Summary of commands

[This discussion is based on mail that was mailed out a long time ago to the Theory list.]

'condor_status, used as

condor_status
condor_status -submitters
condor_status -constraint 'RemoteUser == "<your-name>@<your-machine>.astro.northwestern.edu"'

Some useful variants include

condor_q -run              # shows only running jobs
condor_q -goodputs     # Shows CPU utilization, 
condor_submit <submission-specification-file>
condor_rm <jobid>

Depending on how you prepare your code and depending on the configuration file,the code will either

a) run continuously on one node, but be forced to die and restart from the beginning

b) migrate from node-to-node, resuming and stoping as necessary to manage load balancing.

Advanced topic: Condor DAGman and workflow management

If you have several stages of computation to manage, often you find yourself manually running tasks. For example, in Richard's population synthesis simulations, he wants after each simulation to run several postprocessing tasks on the data, some of which depend on others. Moreover, he may want to go back and add to the list of tasks he performs. He wants to make sure, by running a single command for each job, everything is updated. How to do that, while only recalculating the fewest possible results?

While it is certainly possible to write a script to make sure new calculations are only performed as needed, and even one to submit them to the cluster, it is far, far easier (and more portable) to make a single condor DAG, a prototype describing which commands need to be run, in which order. With a fairly low maintenance overhead, you can do anything you would ever want to do with a script.

FINISH THIS.

Further references

Examples In each machine, and on the workstations, is

/usr/local/condor/examples

Copy these files to your own directory and explore them to see how condor works.

Documentation

Main documentation 
Condor has extensive online documentation available at

http://www.cs.wisc.edu/condor/manual/index.html

Tutorials 
See also


For administrators:

Retrieved from "http://ciera.northwestern.edu/AstCCwiki/index.php?title=Typhoon%28Cluster%29_Documentation"

This page has been accessed 9622 times. This page was last modified 17:17, 16 Jan 2006.


[Main Page]
Main Page
Recent changes
Random page
Current events

Edit this page
Discuss this page
Page history
What links here
Related changes

Special pages
Bug reports