Typhoon(Cluster) Documentation
From AstCCwiki
The Astro Theory group maintains a parallel computer cluster named Typhoon for group computational needs. Currently, Typhoon comprises 80 processors, 70GB of RAM, and 5TB of disk space. The interesting thing, of course, is how all this computing power and storage capacity is organized into one semi-cohesive unit. Typhoon is a Beowulf-class computer cluster consisting of 1 master node and 39 slave compute nodes connected via a private 100Mb/s Ethernet network. Each node contains two roughly 2 GHz AMD AthlonMP processors, roughly 2GB of RAM, a roughly 100 GB hard disk, and runs Fedora Core Linux. For long term storage, we have 1.4TB of space available on a hardware RAID attached to the master node. The full power of the cluster is harnessed via parallel programming, of course, for which we use MPI, which allows shared memory parallelization on an individual dual CPU node, non-shared memory among different nodes through the network, and any combination of the two. Computer jobs are managed via the Torque resource manager (based on OpenPBS), in combination with the Moab scheduler (based on Maui). We also use the Condor resource manager as an "opportunistic" scheduler to gobble up any available resources not allocated by Torque (since Torque is compatible with all codes/programming languages used by our group, but Condor is not).
General information on accessing and using the cluster is detailed below. Hopefully, this documentation will answer most questions. If you need additional assistance, please email the Cluster Help List at mailto:cluster-help@astro.northwestern.edu.
Accessing the Cluster
- Obtaining An Account: Please contact mailto:cluster-help@astro.northwestern.edu to arrange a time to set up the account in person.
- Remote Access: The cluster can be accessed via ssh into typhoon.phys.northwestern.edu. Successful login will place you into your home directory on the master node. We recommend setting up ssh to allow key based authentication into typhoon from private workstations (see http://software.newsforge.com/software/04/03/15/211214.shtml ). In any case, keep your password secure. Under no circumstances should you type your password into a computer that you do not fully trust. In particular, you should avoid logging into Typhoon (or any machine for that matter) from publicly-used machines, such as the ISP machines. The reason is that unless the administrator of the workstation is vigilant, it is relatively easy for a cracker to break into the machine and install a key logger, which monitors all keypresses and records usernames and passwords.
Ground Rules and General Information
- User Directories: Users have two dedicated directories accessible on the master node: their home directory, /home/user_name/, and a storage directory, /storage/user_name/. These two directories are located on two different filesystems, with /storage providing about 1.4TB shared among all users, and /home providing about 150GB shared among all users. Due to the differences in available storage between the two, users should keep the bulk of their data in /storage, while /home should be used for things that do not require so much disk space, like code, etc.
- Quotas: Disk quotas are in effect on both /home and /storage to insure that there is sufficient disk space for everyone to run jobs and store data. For /storage, each user is placed into one of roughly 15 usage levels based on their cluster usage and needs, and their quota is determined according to this level. For /home, everyone gets an equal fraction of the available disk space. To determine your current usage, run the command "quota -sv", which will print something like the following:
[friendly@master friendly]$ quota -sv
Disk quotas for user friendly (uid 599):
Filesystem blocks quota limit grace files quota limit grace
/dev/md0 457M 4758M 5008M 2613 0 0
/dev/sda1 124G 152G 160G 10293 0 0
/dev/md0 corresponds to /home, and /dev/sda1 corresponds to /storage. This user is using only 457MB of their 4758MB soft / 5008MB hard quota on /home. They are using 124GB of their 152GB soft / 160GB hard quota on /storage. If they exceed their soft quota on either filesystem, a time will appear in the "grace" column, letting them know how long they have to reduce their disk space usage to below the soft limit before their hard quota is clamped at their current usage. The last 4 columns above correspond to limits on the number of files a user may have. The quotas here are 0, meaning that there is no limit. Every night the cluster checks each user's disk usage. Anyone over quota receives an automated email message instructing them to reduce their usage. If you find that even after compressing infrequently used files and deleting unused files you need more space than you currently have allocated on /storage, please contact mailto:cluster-help@astro.northwestern.edu to request more space.
- Node References: The cluster nodes communicate via a private, internal network. On this network invidual nodes are referenced as compute1.cluster, compute2.cluster, ..., compute39.cluster. The master node is master.cluster. If you need to access a node for some reason, you can rsh into it or rcp files to/from it directly from the master node without needing to type in your password. Users should not start jobs on individual nodes. User jobs must be submitted from the master node via batch submission (see below).
- Installed Software & Libraries: The cluster is running Fedora Core 2 Linux. Details about installed software and libraries can be found at the Software Available on Typhoon page. We make an effort to keep the list of available software up to date, as well as keep installed third party software up to date, but it is likely that some of the information recorded on the referenced pages will be a bit out of date. If you have software that you would like installed, please email mailto:cluster-help@astro.northwestern.edu and we'll do our best to accomodate your request.
Job Submission
Management of the cluster computing resources is handled by one primary (and one secondary) batch submission system. The primary system is Torque (based on OpenPBS, often referred to generically as PBS), which should be used in priority to Condor, which is another batch system we have installed for use in opportunistic mode, to use any computing resources that are currently unused by Torque. The batch submission systems handles the scheduling of job requests, executes submitted jobs, and handles job output and errors, all while balancing user demands on computing resources in a (hopefully) equitable manner. Below we provide an introduction to using each batch submission system that should allow you to quickly start using Typhoon effectively.
We have two methods for you to submit jobs:
- PBS, the default and recommended method
- Condor, an alternative method which is theoretically superior but has some compatibility problems in practice.
Job Submission with the Default Resource Manager, Torque
Jobs are submitted via the resource manager PBS (which stands for "Portable Batch System"). The implementation we use on Typhoon is Torque.
An Introduction to PBS Job Submission & Management
PBS has the capability of handling both serial and parallel jobs. While the details of submitting each type differ somewhat and will be treated separately below, the basic process is the same:
- 1) you compile your code and prepare your input files as usual,
- 2) you create a batch script for submission to PBS with details on which files are needed to run the job, how to run the job, and how to copy the output back to the master node,
- 3) you submit the job to PBS in one of the available queues.
Serial Jobs
Let's jump right in with a working example. Let's say your executable is called cmc , that to run a job requires the input files pl_n1e5_s_fb0_c0.cmc and plummer_n1e5_salpeter0.2_1.2.fits , that you run the job with the command ./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out , and that the location of the input files and where you would like the output copied back to when the job is complete is /storage/fregeau/runs/testing . Here is a working PBS job submission script for such a job for the user "fregeau" (available as the file /usr/local/cluster/example.pbs" on Typhoon):
#!/bin/sh # # created with /usr/local/bin/beo-genpbs.pl on Mon Jan 9 15:44:11 CST 2006 # commandline: /usr/local/bin/beo-genpbs.pl -j pln1e5sfb0c0 -f \ # cmc,pl_n1e5_s_fb0_c0.cmc,plummer_n1e5_salpeter0.2_1.2.fits -c "./cmc -d \ # pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out" -t "300:00:00" -q longjob pl_n1e5_s_fb0_c0.pbs # # PBS directives can't use shell variables, so have to be done by hand. #PBS -N pln1e5sfb0c0 #PBS -r n #PBS -e /storage/fregeau/runs/testing/pln1e5sfb0c0.stderr #PBS -o /storage/fregeau/runs/testing/pln1e5sfb0c0.stdout #PBS -l nodes=1,cput=300:00:00,ncpus=1 #PBS -q longjob # Define variables to make PBS script more flexible. SRCDIR=/storage/fregeau/runs/testing LOCDIR=/usr/local/cluster/users/`whoami` WRKDIR=pln1e5sfb0c0_zQd6mdir # Create working directory based on job name. cd $LOCDIR mkdir $WRKDIR cd $WRKDIR # Copy needed files to working directory. cp $SRCDIR/cmc . cp $SRCDIR/pl_n1e5_s_fb0_c0.cmc . cp $SRCDIR/plummer_n1e5_salpeter0.2_1.2.fits . # Run code. ./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out # Copy all files back. rcp * master:$SRCDIR && rm -f * # And clean up. cd .. rmdir $WRKDIR # This is just in case the previous command fails for some reason. exit 0
The simplest way to explain this script is to dissect it. The first line:
#!/bin/sh
simply declares the script to be an executable "sh" script. PBS requires that the submission script be an executable shell script, such as "sh", "csh", or "bash". The next few lines:
# # created with /usr/local/bin/beo-genpbs.pl on Mon Jan 9 15:44:11 CST 2006 # commandline: /usr/local/bin/beo-genpbs.pl -j pln1e5sfb0c0 -f \ # cmc,pl_n1e5_s_fb0_c0.cmc,plummer_n1e5_salpeter0.2_1.2.fits -c "./cmc -d \ # pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out" -t "300:00:00" -q longjob pl_n1e5_s_fb0_c0.pbs #
are comments. In this case they say that this PBS script has been created with the "beo-genpbs.pl" utility for automatically creating PBS scripts (more on that below). The next few lines:
# PBS directives can't use shell variables, so have to be done by hand. #PBS -N pln1e5sfb0c0 #PBS -r n #PBS -e /storage/fregeau/runs/testing/pln1e5sfb0c0.stderr #PBS -o /storage/fregeau/runs/testing/pln1e5sfb0c0.stdout #PBS -l nodes=1,cput=300:00:00,ncpus=1 #PBS -q longjob
are a comment followed by several PBS directives. The "PBS -N pln1e5sfb0c0" directive specifies the job name as "pln1e5sfb0c0". "PBS -r n" tells PBS that the job is re-runnable---don't worry about what this means for now. "PBS -e /storage/fregeau/runs/testing/pln1e5sfb0c0.stderr" tells PBS to write all stderr for the job to "/storage/fregeau/runs/testing/pln1e5sfb0c0.stderr". "PBS -o /storage/fregeau/runs/testing/pln1e5sfb0c0.stdout" is analogous, but for stdout. "PBS -l nodes=1,cput=300:00:00,ncpus=1" speicifies the resources required for the job---in this case 1 node, 300 hours of CPU time, and 1 CPU. Finally, "PBS -q longjob" tells PBS to put the job in the "longjob" queue. Anything in this script that is not a comment or a PBS directive is executed by PBS on the compute node as the job. The next few lines:
# Define variables to make PBS script more flexible. SRCDIR=/storage/fregeau/runs/testing LOCDIR=/usr/local/cluster/users/`whoami` WRKDIR=pln1e5sfb0c0_zQd6mdir
simply define useful shell variables that will make file manipulation and job execution later in the script much easier. In particular, "SRCDIR" is defined to be the source directory of all input files. "LOCDIR" is defined to be the "local" working directory on the compute node (the "whoami" command gets translated into the user name, "fregeau" in this case). Finally, "WRKDIR" is defined as the working directory for the job, which gets created inside LOCDIR on the compute node. Note that the name of the working directory is composed of the job name, followed by some random characters. The "beo-genpbs.pl" script adds the random characters so that two different jobs that have the same job name can run on the same compute node without interfering with each other. The next stanza:
# Create working directory based on job name. cd $LOCDIR mkdir $WRKDIR cd $WRKDIR
changes directories to the "LOCDIR" on the compute node, creates the working directory, then cd's into it. The next stanza:
# Copy needed files to working directory. cp $SRCDIR/cmc . cp $SRCDIR/pl_n1e5_s_fb0_c0.cmc . cp $SRCDIR/plummer_n1e5_salpeter0.2_1.2.fits .
copies all the files needed to run the job (executable and input files) from the source directory on the master node to the working directory on the compute node. The next stanza:
# Run code. ./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out
actually runs the job. After this command finishes, the next stanza is executed:
# Copy all files back. rcp * master:$SRCDIR && rm -f *
which copies all files (executable, input files, and all output files created in that directory) back to the source directory on the master node. Note that "rcp" is used for this, since /storage is not writable from the compute nodes via NFS. Note also that if the remote copy command fails for any reason, the "rm -f" command (which removes all files from the working directory) will not be executed due to the "&&". It is arranged this way as a safety measure, in case the source directory is not accessible for some reason. Next:
# And clean up. cd .. rmdir $WRKDIR
the script cd's one directory up and attempts to remove the working directory. This will only succeed if the remote copy worked correctly and all files were copied back to the master node. Finally:
# This is just in case the previous command fails for some reason. exit 0
The script claims that it has finished successfully so that PBS finishes the job. PBS scripts (such as this one) are then submitted to PBS with the "qsub" command. For example, if this script exists as "example.pbs" in the current directory, it is submitted with the command:
qsub example.pbs
- Queues:
To find out which queues are active and what their limits are, run the command "qstat -q" on the master node. Or you can look at the table below, which is unfortunately quite likely to be out of date.
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
shortjob -- 48:00:00 72:00:00 2 5 0 60 E R
verylongjob -- 800:00:0 1200:00: 10 5 0 10 E R
mediumjob -- 150:00:0 225:00:0 2 47 17 50 E R
longjob -- 300:00:0 450:00:0 2 15 104 40 E R
debug -- 00:20:00 00:30:00 8 0 0 10 E R
veryshortjob -- 15:00:00 22:30:00 2 0 0 60 E R
fastlargejob -- 05:00:00 07:30:00 16 0 0 20 E R
largejob -- 100:00:0 150:00:0 16 0 0 16 E R
def -- 150:00:0 225:00:0 32 0 0 240 E R
----- -----
72 121
def is the "default" routing queue, which should route jobs to the appropriate queue based on their requirements, but has a default CPU time of 150 hours if this is not specified. debug has a maximum time limit of 20 minutes, so it really is for debugging. The remaining queues are the "real" queues, and are organized both by CPU time required for a job, and number of CPUs required. veryshortjob, shortjob, mediumjob, longjob, and verylongjob are intended primarily for serial jobs, and, as their names imply, are constructed for varying CPU time requirements. fastlargejob and largejob are intended primarily for parallel jobs, with a limit of 16 CPUs max.
This table is simply the output of "qstat -q". Note that the columns "Run" and "Que" show how many jobs are running and queued, respectively, in the queues listed.
Parallel Job Submission
Here is an example of a working PBS submission script for a parallel job:
#!/bin/csh
#PBS -N ptest
#PBS -r n
#PBS -e /storage/nata/Current/WD/GC/Set_1e6/tuc47/tuc47.err
#PBS -o /storage/nata/Current/WD/GC/Set_1e6/tuc47/tuc47.log
#PBS -l nodes=2:ppn=2,cput=299:00:00
#PBS -q longjob
#PBS -m e
mkdir -p /usr/local/cluster/users/nata/tuc47
mkdir -p /usr/local/cluster/users/nata/tuc47/NS
mkdir -p /usr/local/cluster/users/nata/tuc47/LISA
mkdir -p /storage/nata/Current/WD/GC/Set_1e6/tuc47/NS
mkdir -p /storage/nata/Current/WD/GC/Set_1e6/tuc47/LISA
cd /usr/local/cluster/users/nata/tuc47
NPROCS=`wc -l < $PBS_NODEFILE`
cat $PBS_NODEFILE > "computers.file"
rsync computers.file nata@master.cluster:/storage/nata/Current/WD/GC/Set_1e6/tuc47/
rsync -c nata@master.cluster:/storage/nata/Current/WD/GC/Set_1e6/tuc47/upload.sh ./
sh upload.sh
if ( -e restart.dat) then
sleep 60
echo "do restart of the cluster evolution (part 2)"
sh restart2.sh
sleep 120
else
echo "Fresh start"
endif
mv -f ./gc_exe tuc47
/usr/local/mpich-1.2.5/bin/mpirun -machinefile computers.file -np 4 ./tuc47 > debug
if ( -e restart1.sh) then
sh restart1.sh
endif
sh recover.sh
The most important line is
#PBS -l nodes=2:ppn=2,cput=299:00:00
which specifies the requirements of the job. This job requires 2 nodes, 2 processors per node, and 299 hours of computing time. Note that there is no directive such as "ncpus=4". This is because PBS already knows the total number of CPUs required from nodes and ppn. Adding "ncpus=4" will in fact cause the job to hang in the PBS queue indefinitely.
Note that you should be sure to submit your job to the proper queue. For example, at present the largejob queue will only run jobs that require 3 or more nodes. You can get every last gory detail about the queues with the command
qmgr -c 'print server'
Automatic Generation of PBS Files with "beo-genpbs.pl"
The serial PBS script above was created with the useful utility "beo-genpbs.pl", which is available for all to use on Typhoon. It was created with the command:
beo-genpbs.pl -j pln1e5sfb0c0 -f cmc,pl_n1e5_s_fb0_c0.cmc,plummer_n1e5_salpeter0.2_1.2.fits \ -c "./cmc -d pl_n1e5_s_fb0_c0.cmc pl_n1e5_s_fb0_c0.out" -t "300:00:00" -q longjob \ pl_n1e5_s_fb0_c0.pbs
executed from the directory "/home/fregeau/storage/runs/testing" by user "fregeau". The available options for using "beo-genpbs.pl" can be obtained via the help, and will print out something like the following:
[fregeau@master testing]$ beo-genpbs.pl -h Generates a PBS file using a few simple rules. USAGE: /usr/local/bin/beo-genpbs.pl [options...] <outfile> OPTIONS: -d <srcdir> : source directory [<current directory>] -j <jobname> : job name [myjob] -c <cmdline> : command line [echo "this is a test"] -f <files> : comma-separated list of required files [] -m <nodes> : number of nodes required [1] -t <cput> : cput [150:00:00] -n <ncpus> : ncpus [1] -q <queue> : queue [mediumjob] -h : display this help text EXAMPLE: /usr/local/bin/beo-genpbs.pl -f cmc,test.input,test.fits -c "./cmc test.input test.out" \ -t 300:00:00 -q longjob test.pbs
PBS Job Management
There are several commands and utilities useful for monitoring and managing your jobs. The collection below should be viewed as only a partial list. Please consult the local manpages (run "man" on each of these commands and look also at the "SEE ALSO" section for related commands) for more information.
The simplest command for monitoring the state of the cluster is "qstat". When given without any flags, it prints out:
- A list of all jobs submitted to the batch system, in the form 'nnnnn'.master, where 'nnnnn' is the unique job id
- The name of the job as given by the user
- The owner of the job
- The time for which the job has been running (note that it only refreshes every ~30 secs)
- The queue to which the job was submitted
- qstat -a prints out a list similar to that above, but also shows the job limits for each submission
- qstat -q prints out a list of the queues, the limits for each queue, as well as how many jobs are running and waiting in queue for an available processor
- qstat -Q is similar, but gives less info on the limits for each queue, and more about the status (enabled/disabled) of each queue
- qstat -B gives information about the batch system as a whole
- qstat -f lists every available bit of information about every job on the system. A particularly useful trick is to try "qstat -f | grep exec", which filters out the host (compute node) on which every job is running. By checking against the output of qstat, you can figure out where your jobs are running.
- qdel is used to kill a job. Typing "qdel nnnnn" where 'nnnnn' is the job id number will kill the job, but cause it to properly write the output and error logfiles. If this fails for some reason (the compute node running the job is down, or the job has somehow become stale), try "qdel -p nnnnn". If that doesn't work contact the administrators.
- pbsnodes -a will display the status of each compute node. If "state" is listed as free, the machine has at least one free processor, whereas job-exclusive means both processors are being used. "jobs" shows the job id for any jobs running on that machine, immediately following the specific processor which is handling the job (0 or 1, followed by a slash).
Troubleshooting
Common Problems
Unfortunately, no system works properly all the time, and jobs have been known to crash on Typhoon, and sometimes not even submit at all! Here is a helpful list to help analyze the problem. Please check these items before emailing us.
- Disk Full: If your quotas on /home or /storage are reached (and you can no longer write to disk), PBS will not be able to write the temporary files needed to submit the job. To investigate, type "quota -sv" at the prompt, and see whether you have reached your quota.
Here is an example of someone who is reaching their quota limit:
Disk quotas for user friday (uid 518):
Filesystem blocks quota limit grace files quota limit grace
/dev/md0 4026M 6439M 7155M 4564 0 0
/dev/sda1 186G* 177G 197G 5days 54605 0 0
As you can see, this user is over their soft limit on /dev/sda1 (which is /storage). With just 11GB more data, they would reach their hard limit, in which case they would no longer be able to write to /storage. Note the column labeled "grace". This is the remaining grace period for being above the soft quota limit. Once this grace period is exceeded, the user cannot write any more to disk until their usage is brought to below the soft limit.
- Compute Node Disk Full: If the /usr partition on the compute node is full, your job will suffer from the same situation described above. If the root partition (/, which contains the /var directory) on a compute node is full, your job will crash and die since it would be impossible to store the output of your processes. Note that resubmitted jobs are almost always submitted to the same node as they were previously submitted to, so you will in all likelihood find the same problem if you just try resubmitting the job. You can check the disk status of all nodes with "dsh -a -c -- df -h".
- NFS Dismounts: The system only works if "/home" is NFS mounted on each compute node. To make sure of this, use the command "dsh -a -c -- mount | grep home".
- Node Crashed : Sometimes, one of the compute nodes will crash. You should notice this if you run the disk space check above, because the compute node will not respond, but a more direct way to check is with "pbsnodes -l".
- Library Problems: Sometimes you will use an executable compiled on a machine other than the master node or to compile it you will first install a library in a private directory (eg. /storage/login_name). Under these circumstances your executable might fail to run since those libraries might be missing on compute nodes. There are a few solutions:
- Recompile your code on master node without referring to privately installed libraries. We try to keep the libraries on master and compute nodes in harmony. If you do this and still get messages about a library missing it is probably trivial for us to fix.
- Recompile your code with static libraries. This is generally done by appending "-static" at the end of the linking line (at least for GCC). This will make your code larger but more portable.
- Copy the required libraries to a cluster-wide available location (e.g. /home) and use the LD_LIBRARY_PATH environment variable to let the dynamic loader check for the location of these libraries. To learn about the dynamic loader and environment variables, check the manual pages and FAQs on the web. Note that you should add commands to your pbs script for these environment variables to be effective for your run. If you take this path and come up with a successful solution, please write it up so we can post it on these pages, as this is something nontrivial to do.
- PBS System Down: PBS relies on three daemon processes to work. If one crashes, strange things happen. To check to see that PBS is running, run the commands "ps aux | grep pbs", which should show "pbs_mom" and "pbs_server", and "ps aux | grep moab", which should show a "moab" process.
- I/O Problems: NFS can be a fragile system at times. If you are writing directly to "/home" from your PBS script, change it so that you write all data to the local compute node disk, then "rcp" it back to the master node when the job is finished, as described in the PBS script example above.
- Memory Problems: We have experienced problems in the past with faulty memory chips. If you find your job crashing out of nowhere with extremely odd error messages and bus errors, or if you find that copying files actually corrupts the files, it may indicate a memory chip has failed. Errors that occur repeatedly but in slightly different ways are especially suspect for memory failure. If you think you have encountered a memory problem, please contact us immediately.
Recovering Files After a Crash
If your job crashes for any reason, you should be able to recover whatever output files were generated. The files created during the job will remain in place on the compute node, in /usr/local/cluster/users/yourname/yourjob. You can do an "ls" on all compute nodes with "dsh -a -c -- ls /usr/local/cluster/users/user_name". If the output and error logs are not written to the job launch directory, log into the compute node and check out /var/spool/pbs/undelivered, where they will have the names given by the job number.
Job Submission with the Opportunistic Resource Manager, Condor
Should you use condor? : As it stands now, PBS is configured for maximal use of the cluster. However, in order for the queueing system to work fairly, it is not uncommon for several nodes to be unused at any given time. For this reason we have installed Condor on the cluster as well. Condor is another resource manager, but one that can checkpoint and suspend jobs as they are running to let other jobs take over given nodes. Thus it is the ideal opportunistic scheduling system, since it will let a user use as much of the cluster as possible, but as soon as another user submits jobs, will suspend some fraction of the original users jobs in order to fairly share the cluster. Condor sounds great, right? So why are we not using it as our primary resource manager? Simply because not all codes and programming languages work properly with Condor. A notable example is Fortran90. But it does work with C and C++. In any case, you may submit jobs to Condor as well as PBS. Condor will use any nodes that are not in use by PBS, but PBS has priority.
Or, more briefly, you have the opportunity to use more nodes than you could get with PBS, if the cluster is under-used. (Some users have often obtained 50-60 CPUs at a time.) But you only get those nodes if the cluster isn't busy.
Getting started?:
To begin using Condor, see the examples in
/usr/local/condor/examples/. Basically all you need to do is recompile your
code with "condor_compile gcc testcode.c", for example, instead of gcc. Say
you usually run your code as "testcode 10 12". Then create a job file like
the following:
executable = testcode output = test.stdout error = test.stderr log = test.log arguments = 10 12 environment = queue
Let's call this file "test.condor". Finally, run
condor_submit test.condor
and that's it! If your code writes output files, they will all be written to the directory from which you submitted the job. No more worrying about rcp'ing files from compute nodes back to the master node!
To check the Condor queue, run "condor_q".
There is no need to worry about different queues and the like with Condor. It works on the principle of sharing, which it is able to do efficiently thanks to checkpointing. For example, say one user has jobs running on all available compute nodes, and I submit several jobs. Condor will realize that the cluster is not being fully shared, and will pause as many of the first user's jobs (by checkpointing) as necessary so that my jobs will run, up to the point at which I am using my fair share of the cluster, which in this case is one half. If another user comes along, the same process is repeated, so that each users gets up to one third of the cluster.
Please read the documentation available at the Condor website: http://www.cs.wisc.edu/condor/. It's very good. See especially the tutorials and job submission examples. They are very succinct. Note that our default universe is the standard universe.
Standard or vanilla? A checklist....
Not all codes work with condor's implicit checkpointing mechanism. Before trying your code, take a look at the following list to make sure your code has a chance...
Examples
Examples are available in /usr/local/condor/examples on each of these machines. Please copy the files in this directory to your own directory, see the README, and compile/run them.
For example, to run the 'io' job a few times, you might try the following:
make io.remote for i in `seq 1 10` do condor_submit io.cmd done <wait...> condor_q condor_status <wait...> more io.log condor_userlog -hostname io.log
To run all the jobs, use
make all ./submit
See the README for a discussion of what each file does. Make sure you look at the log files and output files.
Vanilla Universe
The 'stanard' condor mode, in which we recommend you operate, requires you to use code which has been compiled specifically to work with condor.
However, you can also use condor in the 'vanilla' universe.
1) Default vanilla
2) Continuing vanilla If your code is sufficiently intelligent, it will be able to restart where it left off given its data files. In this case, you should use the following [contributed by Bart Willems]
there are 3 additional lines to get multiple jobs running in a vanilla universe:
should_transfer_files = YES when_to_transfer_output = ON_EXIT_OR_EVICT transfer_input_files = population_v2_sun.in,WEB_birth_dat.MV1,...
The first line transfers the executable and any input file defined by the input command to the appropriate node. The second one tells condor to store the output not only on a normal program termination, but also when condor vacates the node. If there are input files native to the code, these must be specified in the third line. Wild cards do not appear to work for this command.
For more details, see http://www.cs.wisc.edu/condor/manual/v6.6.7/2_5Submitting_Job.html#SECTION00354000000000000000
Summary of commands
[This discussion is based on mail that was mailed out a long time ago to the Theory list.]
- Viewing available computational resources:
'condor_status, used as
condor_status condor_status -submitters condor_status -constraint 'RemoteUser == "<your-name>@<your-machine>.astro.northwestern.edu"'
- Viewing the scheduler: condor_q
Some useful variants include
condor_q -run # shows only running jobs condor_q -goodputs # Shows CPU utilization,
- Submitting and deleting jobs:
condor_submit <submission-specification-file> condor_rm <jobid>
Depending on how you prepare your code and depending on the configuration file,the code will either
a) run continuously on one node, but be forced to die and restart from the beginning
b) migrate from node-to-node, resuming and stoping as necessary to manage load balancing.
Advanced topic: Condor DAGman and workflow management
If you have several stages of computation to manage, often you find yourself manually running tasks. For example, in Richard's population synthesis simulations, he wants after each simulation to run several postprocessing tasks on the data, some of which depend on others. Moreover, he may want to go back and add to the list of tasks he performs. He wants to make sure, by running a single command for each job, everything is updated. How to do that, while only recalculating the fewest possible results?
While it is certainly possible to write a script to make sure new calculations are only performed as needed, and even one to submit them to the cluster, it is far, far easier (and more portable) to make a single condor DAG, a prototype describing which commands need to be run, in which order. With a fairly low maintenance overhead, you can do anything you would ever want to do with a script.
FINISH THIS.
Further references
Examples In each machine, and on the workstations, is
/usr/local/condor/examples
Copy these files to your own directory and explore them to see how condor works.
Documentation
- Main documentation
- Condor has extensive online documentation available at
http://www.cs.wisc.edu/condor/manual/index.html
- Tutorials
- See also
- http://www.cs.wisc.edu/~roy/effective_condor
- LSC datagrid tutorial : While specific to LIGO, this tutorial (http://www.lsc-group.phys.uwm.edu/lscdatagrid/LSCGridCamp/) has excellent discussion of how to use condor and condor DAGs.
For administrators:
- condor debugging and configuration examples (http://www.nesc.ac.uk/talks/308/scotland-admin-tutorial-2003-10-23_DEMO.htm)
- condor installation on mac (http://www.math.dartmouth.edu/software/resources_OSX.phtml)
![[Main Page]](/AstCCwiki/stylesheets/images/wiki.png)