HPC - High Performance Computing

From PHYSpedia
Revision as of 12:27, 5 April 2012 by Cclark (talk | contribs)

Using a Scheduler

Motivation

When developing code, it is usually sufficient to test that your code is working by running "small" configurations one at a time and checking the output. Once a code has been shown to work, we typically want to run several "large" configurations. From the terminal, this consists of configuring the code and running the executable. After the first configuration finishes, we run the executable with a different configuration, and we repeat this until we have run all configurations. The problem here is that we are only running one instance of our code at a time, and we have to manually start the next configuration after the last one completes. This means that if a configuration completes at 2:00 AM, the next configuration will not be started until we check on the progress later in the morning.

We could open multiple terminals and run multiple instances of the executable for different configurations. We would need to be careful not to run too many instances simultaneously though, because we could end up requiring more resources than are available. For example, if running 1 instance of our code requires 1 GB of RAM, then running 5 instances will require 5 GB. If our computer only has 4 GB of RAM, running 5 instances at once will exhaust the available memory, and the runs will either slow to a crawl or fail. What we would like is a way to automatically run jobs one after the other, and to run multiple instances of our code at once without running too many. Enter the scheduler.
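To make the limitation concrete, here is a minimal shell sketch of doing this by hand: it runs configurations in batches of at most MAX_JOBS instances. The names run_config, run_all, and MAX_JOBS are hypothetical, and run_config is only a stand-in for the real executable.

```shell
# Sketch only: run configurations in batches of at most MAX_JOBS.
# MAX_JOBS=4 assumes 4 GB of RAM and 1 GB per instance, as in the example above.
MAX_JOBS=4

run_config() {
    # Hypothetical placeholder for running the real executable
    # with configuration number $1.
    echo "finished configuration $1"
}

run_all() {
    count=0
    for n in "$@"; do
        run_config "$n" &          # start one instance in the background
        count=$((count + 1))
        if [ "$count" -ge "$MAX_JOBS" ]; then
            wait                   # batch is full: wait for all of it to finish
            count=0
        fi
    done
    wait                           # wait for the last, possibly partial, batch
}

run_all 1 2 3 4 5
```

Note that this crude approach waits for a whole batch to finish before starting anything new, so resources sit idle while the slowest instance in the batch completes. It also knows nothing about other users on the machine. These are the gaps a scheduler fills.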

A scheduler is a system that manages the running of simulations. With a scheduler, we just tell the scheduler what we want to run (for example, 10 different configurations of the same code), and the scheduler does the rest. The scheduler manages our simulations along with any simulations that other users would like to run, running as many at once as the available resources allow and starting new instances when old instances finish. Large-scale high performance computing clusters used for numerical simulation almost always use some sort of scheduler. Therefore, in order to use these clusters, you must be able to use a scheduler.

Terminology

A user submits jobs to the scheduler. The scheduler adds these jobs to a queue, and runs them as old jobs finish. In order to submit a job, you must write a submit script. This is a simple shell script that the scheduler will read and then execute to start a job running. The simplest submit script just contains the name of the command to run, but often the script will do some pre- and post-processing tasks as well. The submit script can also contain commands to configure the environment that the scheduler sets up.
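As an illustration of a submit script that does more than run a single command, the following sketch has a pre-processing step, the run itself, and a post-processing step. Everything in it is a hypothetical placeholder: the simulate function stands in for a real executable, and the file names and the steps=100 setting are made up for the example.

```shell
#!/bin/bash
# Sketch of a submit script with pre- and post-processing.
# All names here are illustrative assumptions, not part of any scheduler.

# Pre-processing: create a scratch directory and write a configuration file.
workdir=$(mktemp -d)
echo "steps=100" > "$workdir/config.txt"

simulate() {
    # Hypothetical stand-in for the real executable; $1 is the config file.
    echo "ran with $(cat "$1")"
}

# The run itself: this is the part the simplest submit script would contain.
simulate "$workdir/config.txt" > "$workdir/output.txt"

# Post-processing: collect the result and clean up the scratch directory.
result=$(cat "$workdir/output.txt")
echo "$result"
rm -rf "$workdir"
```

Because a submit script is an ordinary shell script, it can be tested by running it directly from the terminal before handing it to the scheduler.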

The scheduler can also work with a resource manager to determine what resources (CPUs, memory, etc.) are available, and to run jobs on the available resources. On a cluster, the resource manager and scheduler work together and take care of running jobs on different computers in the cluster. However, there is no requirement for a scheduler to run on a cluster; you can run a scheduler on a stand-alone computer as well. Eigen, which has a 6-core hyperthreaded Intel i7 and 24 GB of RAM, has a scheduler running on it, and you can submit jobs from the laser account.

Job
The simulation to run. Typically this means running an executable that performs some calculation.
Queue
List of jobs that are "in line" to run. The scheduler runs jobs from the queue in the order they were submitted.
Submit
The user submits a job to the scheduler. The scheduler places this in the queue of jobs to run.
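The three terms can be tied together with a toy model written in shell. This is only a sketch of the idea, not how PBS or any real scheduler is implemented: the queue is a plain text file, and the submit and run_queue functions are hypothetical names invented for the example.

```shell
# Toy model of job, queue, and submit. Not a real scheduler.
queue_file=$(mktemp)    # the queue: one job (a command line) per line

submit() {
    # Submitting a job appends it to the end of the queue.
    echo "$1" >> "$queue_file"
}

run_queue() {
    # Run jobs in the order they were submitted until the queue is empty.
    while [ -s "$queue_file" ]; do
        job=$(head -n 1 "$queue_file")                # oldest job
        tail -n +2 "$queue_file" > "$queue_file.tmp"  # remove it from the queue
        mv "$queue_file.tmp" "$queue_file"
        sh -c "$job"                                  # run the job
    done
}

submit "echo first job"
submit "echo second job"
submit "echo third job"
run_queue
```

A real scheduler differs in the important ways described above: it runs several jobs at once when resources allow, and it accepts new submissions while earlier jobs are still running.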

Using PBS

One of the most commonly used schedulers is PBS (Portable Batch System). PBS is an open-source scheduler that can work with an open-source resource manager called Torque to manage a job queue. Other scheduling systems exist as well. To submit a job using PBS, we must first write a submit script. This is usually just a bash script. The simplest submit script contains just the executable to run.

# cat simple.sh
hostname