
How Aurora scheduling works

In this section I try to explain how your job requests are treated on the hep partition.

Users within the same project are treated as one single user.

Inside a project, say HEP 2016/1-4, it doesn't matter who you are: if you're a member of the project, your requests will be processed FIFO (first in, first out). The first to submit will be the first to have their jobs processed.

As long as the SLURM scheduler can find resources to match the ones requested by a job, the job will be started soon after being submitted.

But if there are no resources available, the job will have to wait in line and any job submitted earlier for the same project will have a higher priority due to a longer waiting time. There are some modifications of priority based on job size, but there is no priority or fair share among members of the same project.

It's as if all members were a single person submitting jobs, with many names.
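
As a rough illustration of the FIFO behaviour described above, the ordering inside a project can be sketched in Python. This is only a toy model, not the scheduler's actual code: the `Job` class, the user names and the submit times are made up for the example, and the job-size adjustments to priority are ignored.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    user: str           # hypothetical user name, ignored when ordering
    project: str
    submit_time: float  # seconds since an arbitrary reference point


def fifo_order(jobs, project):
    """Return the pending jobs of one project, oldest submission first."""
    pending = [j for j in jobs if j.project == project]
    return sorted(pending, key=lambda j: j.submit_time)


queue = [
    Job(1, "alice", "HEP 2016/1-4", 100.0),
    Job(2, "bob", "HEP 2016/1-4", 50.0),
    Job(3, "alice", "HEP 2016/1-4", 75.0),
]

# The user name plays no role: only the submit time decides the order.
order = [j.job_id for j in fifo_order(queue, "HEP 2016/1-4")]
print(order)
```

Here bob's job comes first simply because it was submitted first, even though alice has two jobs waiting.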

The exception to this rule is that there are limits to how many jobs a single user can submit and how many resources a single user can utilise simultaneously. Thus, if a user has reached any of these limits, other members of the same project are still able to submit and run jobs within the bounds of similar, but higher limits for the project as a whole.
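
A minimal sketch of how such a two-level limit could work is shown below. The limit values are invented for the example (this page does not state the real ones), and only a limit on cores in use is modelled, not the limit on the number of submitted jobs.

```python
# Hypothetical limits, NOT the real Aurora values.
USER_CORE_LIMIT = 100
PROJECT_CORE_LIMIT = 300


def can_start(cores, user_cores_in_use, project_cores_in_use):
    """True if a job of `cores` cores keeps both the user and the project within their limits."""
    within_user = user_cores_in_use + cores <= USER_CORE_LIMIT
    within_project = project_cores_in_use + cores <= PROJECT_CORE_LIMIT
    return within_user and within_project


# alice already uses her full personal allowance, so her next job must wait.
alice_ok = can_start(cores=10, user_cores_in_use=100, project_cores_in_use=100)

# bob is in the same project but has nothing running, so his job may start.
bob_ok = can_start(cores=10, user_cores_in_use=0, project_cores_in_use=100)

print(alice_ok, bob_ok)
```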

Considerations on interactive sessions

If the cluster is busy, a request for an interactive session may take a long time to be granted, or may fail. The scheduler will happily schedule resources for a user, but if the user asks for an interactive session with, say, 6 cores and there is no machine with 6 cores free, the scheduler cannot fulfil the request at that moment.

The scheduler treats an interactive job the same as a batch job, queueing it with the FIFO strategy described above. However, an exception to the FIFO scheduling appears when a parallel job is waiting for other jobs to finish and release resources. In that case a short job can be promoted ahead of the queue if it fits into an empty slot that is reserved for later use by the parallel job. This is called backfill.

Thus, queued interactive jobs with shorter wall times have a higher probability of starting earlier.
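
The backfill condition described above can be sketched as a simple check. This is a simplified model under stated assumptions, not the real SLURM backfill algorithm: the function name and the numbers are made up, and a job is assumed to qualify only if it fits in the idle cores and its requested wall time ends before the reserved slot is needed.

```python
def can_backfill(job_cores, job_walltime_min, idle_cores, minutes_until_reservation):
    """True if the job fits in the idle cores and finishes before the waiting parallel job needs them."""
    fits_in_space = job_cores <= idle_cores
    fits_in_time = job_walltime_min <= minutes_until_reservation
    return fits_in_space and fits_in_time


# Hypothetical situation: 6 cores sit idle for 120 minutes, reserved for a parallel job.
short_session = can_backfill(job_cores=4, job_walltime_min=60,
                             idle_cores=6, minutes_until_reservation=120)
long_session = can_backfill(job_cores=4, job_walltime_min=180,
                            idle_cores=6, minutes_until_reservation=120)

print(short_session, long_session)
```

In this example the 60-minute session can jump ahead of the queue, while the 180-minute one has to wait its turn.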

Fairness among projects running on the hep partition

Fairness is maintained among the three projects using the hep partition (HEP 2016/1-3, HEP 2016/1-4, HEP 2016/1-5) by allocating each of them 1/3 of the computing power (calculated as core hours per month). This is the basis for fair share: the priority of jobs from one project is calculated with respect to how much of its target 1/3 the project has used over the last 30 days. A project that has used a large portion of its allocated time (or more than its allocation) will have a lower priority than a project that has used only a small part.
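
To make the idea concrete, here is a toy ranking of the three projects by how much of their target share they have used. It is only an illustration: the capacity and usage figures are invented, and the real scheduler combines fair share with other factors using its own formula, which this page does not describe.

```python
# Hypothetical figures: this page gives the 1/3 split but no absolute numbers.
MONTHLY_CAPACITY_CORE_HOURS = 90_000
ALLOCATION = MONTHLY_CAPACITY_CORE_HOURS / 3  # target core hours per project


def usage_ratio(used_core_hours):
    """Fraction of the project's allocation already used (1.0 means exactly on target)."""
    return used_core_hours / ALLOCATION


# Made-up core hours used over the last 30 days.
used_last_30_days = {
    "HEP 2016/1-3": 45_000,  # above its allocation
    "HEP 2016/1-4": 15_000,  # well below its allocation
    "HEP 2016/1-5": 30_000,  # exactly on target
}

# The lowest usage ratio comes first, i.e. gets the highest fair-share priority.
ranking = sorted(used_last_30_days, key=lambda p: usage_ratio(used_last_30_days[p]))
print(ranking)
```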

Note: if a job uses more memory per core than the total memory of the node divided by the number of cores of the node, the usage calculation counts the job as if it had used correspondingly more cores.
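
The note above can be illustrated with a small calculation. The node size used here (16 cores, 64 GB) is a made-up example and not necessarily an Aurora node, and rounding up to whole cores is an assumption of this sketch.

```python
def billed_cores(cores, mem_per_core_mb, node_mem_mb, node_cores):
    """Cores counted towards usage: the larger of the cores requested and the memory-equivalent cores."""
    total_mem_mb = cores * mem_per_core_mb
    # Ceiling division: how many "fair" per-core memory shares the job occupies.
    mem_equivalent = -(-(total_mem_mb * node_cores) // node_mem_mb)
    return max(cores, mem_equivalent)


# Hypothetical node: 16 cores and 65536 MB, i.e. 4096 MB per core.
heavy = billed_cores(cores=2, mem_per_core_mb=8192, node_mem_mb=65536, node_cores=16)
light = billed_cores(cores=2, mem_per_core_mb=2048, node_mem_mb=65536, node_cores=16)

print(heavy, light)
```

In this example a 2-core job asking for twice the per-core memory share is counted as 4 cores, while a 2-core job that stays below the share is counted as 2.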

Suggestion to self-regulate the usage inside a project

  • The project members should interact on a regular basis to understand what their expected computing needs are.
  • Those negotiated needs should be translated into expected resource requests.
  • These requests should be documented somewhere FIXME on the cluster, for example in a file.
  • All users should honour the expected resource requests from the above file when submitting jobs.
  • To preserve the possibility of using all of the nodes when needed, these requests should be flexible enough to be changed on the fly according to the needs of the members of the project.
aurora_cluster/how_scheduling_works.txt · Last modified: 2019/03/18 09:18 by florido