HPC - High Performance Computing
=Using a Scheduler=
Motivation
When developing code, it is usually sufficient to test that your code is working by running "small" configurations one at a time and checking the output. Once a code has been shown to work, we typically want to run several "large" configurations. From the terminal, this consists of configuring the code and running the executable. After the first configuration finishes, we run the executable with a different configuration, and we repeat this until we have run all configurations. The problem here is that we are only running one instance of our code at a time, and we have to manually start each configuration after the previous one completes. This means that if a configuration completes at 2:00 AM, the next configuration will not be started until we check on the progress later in the morning.
We could open multiple terminals and run multiple instances of the executable for different configurations. We would need to be careful not to run too many instances simultaneously though, because we could end up requiring more resources than are available. For example, if running 1 instance of our code requires 1 GB of RAM, then running 5 instances will require 5 GB. If our computer only has 4 GB of RAM, running 5 instances of our code at once will exhaust the available memory, and the runs will slow down drastically or fail. What we would like is a way to automatically run multiple jobs, one after the other, and run multiple instances of our code at once without running too many. Enter the scheduler.
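The memory arithmetic above can be written as a short check. This is only a sketch: the function name `max_concurrent_instances` is made up for illustration, and it assumes that memory is the only limiting resource and that every instance needs the same amount.

```python
def max_concurrent_instances(total_ram_gb, ram_per_instance_gb):
    """Return how many instances fit in memory at the same time.

    Hypothetical helper: assumes RAM is the only limit and that all
    instances have the same memory requirement.
    """
    return total_ram_gb // ram_per_instance_gb


if __name__ == "__main__":
    # The example from the text: 4 GB of RAM, 1 GB per instance.
    limit = max_concurrent_instances(4, 1)
    print(f"At most {limit} instances can run at once")
    print(f"5 instances fit: {5 <= limit}")
```

With the numbers from the text, the limit is 4, so a fifth simultaneous instance does not fit.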
A scheduler is a system that manages the running of simulations. With a scheduler, we just tell it what we want to run (for example, 10 different configurations of the same code), and the scheduler does the rest. It manages our simulations along with any simulations that other users would like to run, running as many at once as the available resources allow and starting new instances when old instances finish. All large-scale high performance computing clusters used for numerical simulation use some sort of scheduler. Therefore, in order to use these clusters, you must be able to use a scheduler.
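To make the idea concrete, here is a minimal sketch of this behavior using Python's standard `concurrent.futures` module. It is not how a cluster scheduler is implemented. The function `run_configuration` is a hypothetical stand-in for configuring the code and running the executable, and the limit of 4 simultaneous runs is an assumed value.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_configuration(config_id):
    # Hypothetical stand-in for configuring the code and running the
    # executable; a real job would launch the simulation here.
    time.sleep(0.01)
    return f"configuration {config_id} finished"


def run_all(config_ids, max_simultaneous):
    # The pool never runs more than max_simultaneous jobs at once and
    # starts the next waiting job as soon as a running one finishes.
    with ThreadPoolExecutor(max_workers=max_simultaneous) as pool:
        return list(pool.map(run_configuration, config_ids))


if __name__ == "__main__":
    # Ten configurations, at most four running at any one time (assumed limit).
    for message in run_all(range(10), max_simultaneous=4):
        print(message)
```

Nobody has to be at the terminal at 2:00 AM here: once all ten configurations are handed over, each new one starts as soon as a slot becomes free.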
Terminology
A user submits jobs to the scheduler. The scheduler adds these jobs to a queue and runs them as old jobs finish. The scheduler can also work with a resource manager to determine what resources (CPUs, memory, etc.) are available, and to run jobs on the available resources. On a cluster, the resource manager and scheduler work together and take care of running jobs on the different computers in the cluster. A scheduler is also useful on a single computer, however: with the advent of multi-core processors, a single computer can run multiple jobs at one time. (In the old days, when one computer had one processor with one core, running two instances of a simulation at once took twice as long as running one instance, so nothing was gained over running them one after the other.)
- Job
- The simulation to run. This typically involves running an executable that performs some calculation.
- Queue
- List of jobs that are "in line" to run. In the simplest case, the scheduler runs jobs from the queue in the order they were submitted.
- Submit
- The user submits a job to the scheduler. The scheduler places this in the queue of jobs to run.
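The three terms above can be tied together in a toy model. This is a sketch under simplifying assumptions, not a real scheduler: the class `ToyScheduler` and its methods are invented for illustration, jobs are just names, the queue is strictly first-in, first-out, and the only resource is a fixed number of slots.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical first-in, first-out scheduler with a fixed number of slots."""

    def __init__(self, slots):
        self.slots = slots      # how many jobs may run at the same time
        self.queue = deque()    # jobs waiting "in line" to run
        self.running = []       # jobs that are currently running

    def submit(self, job_name):
        # Submitting places the job at the back of the queue.
        self.queue.append(job_name)
        self._start_waiting_jobs()

    def finish(self, job_name):
        # A finished job frees a slot for the next job in the queue.
        self.running.remove(job_name)
        self._start_waiting_jobs()

    def _start_waiting_jobs(self):
        # Start queued jobs, oldest first, while free slots remain.
        while self.queue and len(self.running) < self.slots:
            self.running.append(self.queue.popleft())


if __name__ == "__main__":
    scheduler = ToyScheduler(slots=2)
    for name in ["job1", "job2", "job3"]:
        scheduler.submit(name)
    print(scheduler.running, list(scheduler.queue))
    scheduler.finish("job1")
    print(scheduler.running, list(scheduler.queue))
```

With two slots, the third submitted job waits in the queue until one of the first two finishes, at which point it starts automatically.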
