Exploiting the Grid Engine Core Binding Feature

For the 6.2u5 release of Sun Grid Engine (SGE) I designed and implemented a new feature called „core binding“. Because it was developed under an open source license, it is available to the whole Grid Engine community, i.e. it can be found in any Grid Engine fork currently available. Along with „core binding“ I enhanced the host architecture reporting and introduced new standard complexes for the host topology, as well as numerical complex attributes for the number of sockets, cores and hardware-supported threads of the execution host. Univa Grid Engine 8.0.0 and, afterwards, Univa Grid Engine 8.0.1 brought some more enhancements, but more about them later...

What is „core binding“?

Core binding refers to the capability of Grid Engine to dispatch jobs (scripts or binary programs) to specific cores on a host, not only to the host itself. The difference is that when a job is dispatched to a host, the operating system scheduler usually places the job's processes on sockets/cores according to its own policies. From time to time a process is moved from one CPU core to another, which can even include jumps to other sockets. For some applications (like HPC jobs, graphics rendering, or benchmarks) this movement between cores has a visible influence on the overall job run-time. The reason is that caches are invalidated and must be refilled, and that on NUMA architectures memory access times differ depending on which core the job is running on and on which socket the memory was allocated. Also, when a host is overallocated (more slots/processes than compute cores), the run-times of the jobs vary a lot, even when the jobs are identical. For example, when you submit 4 jobs to a dual-socket machine (with 2 cores per socket) and each job spawns 4 threads, the run-time of the longest-running job can be a multiple of the run-time of the shortest one.

With the „core binding“ feature it is now possible to dispatch jobs to specific cores. On Linux this means that the job gets a core bit mask which tells the OS scheduler that the job is only allowed to run on specific cores. When you now submit the 4 jobs (each with 4 threads) to a 4-core host and tell GE at submission time that you want 1 core for each job (-binding linear:1), each job gets one core exclusively (with the exception of OS-internal processes). All 4 jobs then run on different cores, from which the operating system scheduler can never move them away. When you compare the job run-times now, you will get very similar results. In many cases the program execution times are also shorter because of the better caching behavior. When you use a NUMA machine exclusively for a job which has fewer slots (processes/threads) than the machine has cores, you can also speed up your program by forcing it to run on different sockets. This increases the total amount of cache used.
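To make the „core bit mask“ idea a bit more concrete, here is a minimal Python sketch of what such a mask looks like and how a process can restrict itself to a set of cores on Linux. It is only an illustration and not how Grid Engine applies the binding internally; the helper names cores_to_mask and bind_current_process are made up for this post, and os.sched_setaffinity is only available on Linux.

```python
import os


def cores_to_mask(cores):
    """Return an integer bit mask with one bit set per allowed core id."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def bind_current_process(cores):
    """Restrict the calling process to the given core ids.

    Returns the resulting affinity set, or None on platforms
    that do not offer os.sched_setaffinity (anything but Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cores))  # pid 0 means "this process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Cores 1 and 3 allowed: bits 1 and 3 are set.
    print(bin(cores_to_mask([1, 3])))  # 0b1010

    if hasattr(os, "sched_getaffinity"):
        # Bind to the lowest core this process may currently use.
        first_core = min(os.sched_getaffinity(0))
        print(bind_current_process([first_core]))
```

Threads and child processes started after the call inherit the restriction, which is why a bound job cannot spread over other cores just by spawning more threads.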

In Sun Grid Engine 6.2u5 and its forks the „core binding“ functionality is turned off by default. The reason for this is the behavior on Solaris, where processor sets are used, which work a little differently. On Linux, core binding makes a process nicer, because you exclude existing resources from being used by the process (something like saying „don't let me run on cores 2 and 3“); hence no special privilege is needed. On Solaris you allocate cores via processor sets. When such a processor set is granted, you own the resources exclusively (like saying „don't let other processes, not even OS processes, run on my cores, just me“). This can be a problem if slots are not aligned to cores: a user could steal resources. On Linux, however, there is no need to disable the feature in a default installation.

Hence Univa Grid Engine (UGE) 8.0.0 changed this. On Linux no extra execution daemon parameter (execd_params) is needed anymore; core binding is active out of the box. On Solaris it must still be turned on by the administrator (execd_params ENABLE_BINDING=true) as before, provided that creating processor sets is not a security issue in the particular use case.

Univa Grid Engine 8.0.0 now also shows the number of sockets, cores and hardware-supported threads in the default qhost output. This was already part of the „core binding“ enhancement in SGE 6.2u5, but in that release it was only visible with the „qhost -cb“ parameter. To get backward-compatible output, UGE provides the „qhost -ncb“ command. Likewise, no extra „-cb“ parameter is needed anymore to see the core allocation in the qstat -j <jobnumber> output (again, backward-compatible output is available with „-ncb“).

How to submit jobs with „core binding“?

The submission parameter „-binding“ tells Grid Engine how the job should be bound to cores. Grid Engine treats it as a recommendation.

The following arguments can be used with -binding:

-binding linear:<amount>

Binds the job to <amount> successive cores. Grid Engine tries to find the socket with the most free cores first. This ensures load distribution over sockets in case of multiple core-bound jobs on the same execution host.

-binding striding:<amount>:<step-size>

Distributes the job over cores with a distance of <step-size> between them. This is useful in order to exploit all the cache of a particular execution host.

-binding explicit:<socket>,<core>:...

Here a list of socket,core pairs specifies the execution cores for the job.

In all cases, when the requested binding cannot be fulfilled on the particular execution host, the job runs without any binding. You can see the actual binding in the qstat -j <jobnumber> output.
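The three strategies can be pictured with a small Python sketch. This is a simplified model for illustration only, not Grid Engine's actual selection code. The topology string format ('S' starts a socket, 'C' is a free core, 'c' a core already in use) is an assumption loosely modeled on the topology strings Grid Engine reports, and all function names are made up. The linear variant here only looks for successive cores within one socket, and every function returns None when the request cannot be fulfilled, mirroring the „job runs without binding“ case.

```python
def parse_topology(topology):
    """Turn e.g. "SCCScc" into a list of sockets, each a list of booleans
    (True = core is free). Characters other than S/C/c are ignored."""
    sockets = []
    for ch in topology:
        if ch in "Ss":
            sockets.append([])
        elif ch in "Cc" and sockets:
            sockets[-1].append(ch == "C")
    return sockets


def linear(sockets, amount):
    """Pick <amount> successive free cores, trying the socket
    with the most free cores first."""
    if amount < 1:
        return None
    order = sorted(range(len(sockets)), key=lambda s: -sum(sockets[s]))
    for s in order:
        cores = sockets[s]
        for start in range(len(cores) - amount + 1):
            if all(cores[start:start + amount]):
                return [(s, c) for c in range(start, start + amount)]
    return None


def striding(sockets, amount, step):
    """Pick <amount> free cores that are <step> apart, counting
    cores consecutively across all sockets."""
    if amount < 1 or step < 1:
        return None
    flat = [(s, c) for s, cores in enumerate(sockets) for c in range(len(cores))]
    for start in range(len(flat)):
        picks = [start + i * step for i in range(amount)]
        if picks[-1] < len(flat) and all(sockets[s][c] for s, c in (flat[p] for p in picks)):
            return [flat[p] for p in picks]
    return None


def explicit(sockets, pairs):
    """Accept the given (socket, core) pairs only if all exist and are free."""
    for s, c in pairs:
        if not (0 <= s < len(sockets) and 0 <= c < len(sockets[s]) and sockets[s][c]):
            return None
    return list(pairs)


if __name__ == "__main__":
    host = parse_topology("ScCSCC")   # socket 0 already has one busy core
    print(linear(host, 1))            # socket 1 has more free cores
    print(striding(parse_topology("SCCSCC"), 2, 2))
    print(explicit(host, [(0, 0)]))   # core is busy, so no binding
```

In this model, a request such as linear with an amount of 1 on a host where socket 0 is partly busy lands on socket 1, which is the load distribution effect described for -binding linear above.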

Having only serial jobs in the cluster

When you have a GE cluster without any parallel jobs, using the „-binding linear:1“ parameter makes a lot of sense. It ensures that each job runs on a different core and hence ensures fairness even if a job spawns lots of threads. To enforce this request without requiring users to change their submissions, you can add the parameter to the sge_request file, which is located in the $SGE_ROOT/default/common/ directory (where „default“ is an example cell name).
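If you want to script that change, a small Python sketch like the following could do it. It assumes the sge_request file holds default submission options as plain lines and simply appends „-binding linear:1“ when that exact line is not present yet. The function name add_default_option is made up, and the demo writes to a temporary directory that merely mimics the $SGE_ROOT/default/common/ layout, so point it at your real cell directory only after checking the file yourself.

```python
import os
import tempfile


def add_default_option(request_file, option):
    """Append `option` as its own line unless an identical line exists.

    Returns True if the file was changed, False otherwise.
    """
    content = ""
    if os.path.exists(request_file):
        with open(request_file) as fh:
            content = fh.read()
    if option in (line.strip() for line in content.splitlines()):
        return False
    with open(request_file, "a") as fh:
        # Keep the new option on its own line even if the
        # existing file does not end with a newline.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(option + "\n")
    return True


if __name__ == "__main__":
    # Stand-in for $SGE_ROOT/default/common/ ("default" is the example cell name).
    with tempfile.TemporaryDirectory() as sge_root:
        common = os.path.join(sge_root, "default", "common")
        os.makedirs(common)
        request_file = os.path.join(common, "sge_request")
        print(add_default_option(request_file, "-binding linear:1"))
        print(add_default_option(request_file, "-binding linear:1"))
```

Running the function twice leaves only one copy of the option in the file, so it is safe to call from an installation script.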

..to be continued... :)