Difference between revisions of "HPC - High Performance Computing"
Revision as of 12:56, 5 April 2012
Using a Scheduler
Motivation
When developing code, it is usually sufficient to test that your code is working by running "small" configurations one at a time and checking the output. Once a code has been shown to work, we typically want to run several "large" configurations. From the terminal, this consists of configuring the code and running the executable. After the first configuration finishes, we run the executable with a different configuration, and we repeat this until we have run all configurations. The problem here is that we are only running one instance of our code at a time, and we have to manually start the next configuration after the last completes. This means that if a configuration completes at 2:00 AM, the next configuration will not be started until we check on the progress later in the morning.
We could open multiple terminals and run multiple instances of the executable for different configurations. We would need to be careful not to run too many instances simultaneously though, because we could end up requiring more resources than are available. For example, if running 1 instance of our code requires 1 GB of RAM, then running 5 instances will require 5 GB. If our computer only has 4 GB of RAM, running 5 instances of our code will fail. What we would like is a way to automatically run multiple jobs, one after the other, and run multiple instances of our code at once without running too many. Enter the scheduler.
A scheduler is a system that manages the running of simulations. With a scheduler, we can just tell the scheduler what we want to run (for example, 10 different configurations of the same code), and the scheduler will do the rest. The scheduler will manage our simulations along with any simulations that other users would like to run, running as many at once as the available resources allow and starting new instances when old instances finish. All large scale high performance computer clusters used for numerical simulation use some sort of scheduler. Therefore, in order to use these clusters, you must be able to use a scheduler.
Terminology
A user submits jobs to the scheduler. The scheduler adds these jobs to a queue, and runs them as old jobs finish. In order to submit a job, you must write a submit script. This is a simple shell script that the scheduler will read and then execute to start a job running. The simplest submit script just contains the name of the command to run, but often the script will do some pre- and post-processing tasks as well. The submit script can also contain commands to configure the environment that the scheduler sets up.
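As an illustration, a submit script with pre- and post-processing steps might look like the sketch below. It is plain shell and makes several assumptions: the directory name results is hypothetical, hostname stands in for the real executable, and OMP_NUM_THREADS is only an example of an environment setting that a particular code might need.

```shell
#!/bin/sh
# --- Pre-processing: create a directory to hold this run's output ---
# "results" is a hypothetical directory name chosen for this example.
out_dir="results"
mkdir -p "$out_dir"

# --- Environment configuration ---
# Example setting only; a real code may need different variables or none.
export OMP_NUM_THREADS=1

# --- The job itself ---
# hostname stands in for the real executable.
hostname > "$out_dir/output.txt"

# --- Post-processing: summarize what was produced ---
line_count=$(wc -l < "$out_dir/output.txt" | tr -d ' ')
echo "job wrote $line_count line(s) to $out_dir/output.txt"
```

Because the script is ordinary shell, it can be tested by running it directly in a terminal before handing it to the scheduler.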
The scheduler can also work with a resource manager to determine what resources (CPUs, memory, etc.) are available, and to run jobs on the available resources. On a cluster, the resource manager and scheduler will work together and take care of running jobs on different computers in the cluster. However, there is no requirement for a scheduler to run on a cluster; you can run a scheduler on a stand-alone computer as well. Eigen, which has a 6-core hyperthreaded Intel i7 and 24 GB of RAM, has a scheduler running on it, and you can submit jobs from the laser account.
- Job
- The simulation to run. This involves running an executable that performs some calculation at some point.
- Queue
- List of jobs that are "in line" to run. The scheduler runs jobs from the queue in the order they were submitted.
- Submit
- The user submits a job to the scheduler. The scheduler places this in the queue of jobs to run.
- Submit Script
- The script that the scheduler uses to start a job.
Using PBS
One of the most commonly used schedulers is PBS (Portable Batch System). PBS is an open source scheduler that can work with an open source resource manager called Torque to manage a job queue. Other scheduling systems exist as well. To submit a job using PBS, we must first write a submit script. This is usually just a bash script. The simplest submit script contains just the executable to run.
# cat simple.sh
hostname
To submit this job to the scheduler, we use the qsub command (as in queue submit).
# qsub simple.sh
Running this command will submit simple.sh to the scheduler to be run. This command will print one line, which gives the job's ID. To check on a job's status, we run qstat (as in queue status).
# qstat
This will print out a list of jobs that are currently in the queue. Jobs marked as R are currently running, and jobs marked as Q are waiting to run. The simple.sh job will not take very long to run, so it will not show up in the qstat output unless other jobs are already running. We can submit a script that will take 10 seconds to run, which should give us enough time to see it in the queue.
# cat simple_sleep.sh
sleep 10
hostname
Submitting this job and then calling qstat immediately should allow us to see the job in the queue.
# qsub simple_sleep.sh
# qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
453.localhost             simple_sleep.sh  cclark                 0 R test
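When many jobs are queued, it can help to summarize the qstat listing instead of reading it by eye. The sketch below counts running and waiting jobs and looks up the state of one job by its ID. It assumes the column layout shown above, where the state is the fifth whitespace-separated column. So that it runs without a scheduler installed, it works on sample text: the first job is copied from the listing above, and the second job (454) is a hypothetical entry added for illustration.

```shell
#!/bin/sh
# Sample qstat output. Job 453 is copied from this page; job 454 is a
# hypothetical waiting job added for illustration.
# On a real system, replace the here-document with:  qstat_output=$(qstat)
qstat_output=$(cat <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
453.localhost             simple_sleep.sh  cclark                 0 R test
454.localhost             simple.sh        cclark                 0 Q test
EOF
)

# Skip the two header lines, then count jobs by the state in column 5.
running=$(printf '%s\n' "$qstat_output" | awk 'NR > 2 && $5 == "R" { n++ } END { print n + 0 }')
queued=$(printf '%s\n' "$qstat_output" | awk 'NR > 2 && $5 == "Q" { n++ } END { print n + 0 }')
echo "running: $running, queued: $queued"

# Look up the state of a single job by its ID (the line that qsub prints).
job_id="453.localhost"
state=$(printf '%s\n' "$qstat_output" | awk -v id="$job_id" '$1 == id { print $5 }')
echo "job $job_id is in state $state"
```

Since qsub prints the job's ID, a script can capture it with job_id=$(qsub simple_sleep.sh) and pass it to the lookup above in place of the hard-coded value.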
