Grid Engine and Multitenancy (2015-02-22)

Univa Grid Engine can run any workload in your compute cluster. Hence Univa Grid Engine can also easily run Grid Engine itself as a job. This can be useful if you want to strictly isolate temporary Grid Engine installations from each other, or if you just want to have some fun with your own scheduler within the scheduler.

To keep the article short, let's go through this learning task using the free Univa Grid Engine trial version, which you can download here, and follow the recipe in the next sections.

Set Up the Grid Engine Demo Environment

To start from scratch we first need to set up a demo environment. This is described in an earlier article on my blog. The only requirements are having the free VirtualBox and the free Vagrant installed. Then follow that blog post to start it up (copy the demo tar.gz into the directory and do vagrant up - version 8.2.0-demo is required here; for another version you need to adapt accordingly). Now you should have a 3 node cluster (3 VMs) running the Univa Grid Engine cluster scheduler.
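In short (the package file name below is just a placeholder; use whatever the downloaded demo archive is actually called):

cd uge-demo-environment                    # directory with the Vagrantfile from that article (name is an assumption)
cp ~/Downloads/<uge-8.2.0-demo>.tar.gz .   # the downloaded demo package
vagrant up                                 # boots the master, execd1 and execd2 VMs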

Next you need to log in (password for vagrant is always vagrant, also for root):

vagrant ssh master

[vagrant@master vagrant]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR NLOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
execd1                  lx-amd64        4    1    4    4  0.01  490.0M   89.5M  928.0M     0.0
execd2                  lx-amd64        4    1    4    4  0.01  490.0M   88.8M  928.0M     0.0
master                  lx-amd64        4    1    4    4  0.05  490.0M  104.3M  928.0M  356.0K

You see 3 compute nodes.

Configure Grid Engine Resources

For running UGE as a job, UGE needs to bootstrap itself. This is an easy task since you can set up a UGE cluster with just one command and a configuration file. The critical point here is the configuration file, since it needs to be adapted depending on which hosts and resources were selected for the UGE job.
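That one command is the auto-installer shipped with the distribution. Assuming a prepared configuration file (the file name below is just an example), an unattended installation of qmaster and execution daemons looks roughly like this:

# SGE_ROOT must point to the unpacked distribution
cd $SGE_ROOT
./inst_sge -m -x -auto ./my_cluster.conf   # -m: qmaster, -x: execds, -auto: take everything from the config file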

Let’s look at the sample config file (which was used to set up our virtual Grid Engine cluster) here.

There are a few settings we need to adapt dynamically after our UGE job was scheduled:

  • SGE_ROOT: Of course each UGE needs a different root directory. It must be shared by all nodes which are part of the job / UGE installation.
  • SGE_QMASTER_PORT: Needs to be unique as well.
  • SGE_EXECD_PORT: Also needs a unique selection.
  • SGE_CLUSTER_NAME: Can be derived from the qmaster port.
  • GID_RANGE: The gid ranges must not overlap! Otherwise you can cause severe issues in accounting. Let UGE select one for you.
  • QMASTER_SPOOL_DIR: Depends on the SGE_ROOT.
  • EXEC / SUBMIT / ADMIN_HOST_LIST: All hosts which were selected by UGE for the job.

Also note that we are not running the job as root, hence we must disable init script creation. If you want to have a multi-user cluster in this Grid Engine job, then you need to set up the cluster as root.
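For illustration, once the scheduler has picked the resources for one cluster, the adapted entries of its auto-install configuration could end up looking roughly like this (the values are examples matching the resources we configure in the next section; the parameter names are those of the standard auto-install template, and the width of the GID range is an assumption):

SGE_ROOT="/vagrant/cluster1"
SGE_QMASTER_PORT="2324"
SGE_EXECD_PORT="2325"
SGE_CLUSTER_NAME="p2324"
GID_RANGE="22000-22999"
QMASTER_SPOOL_DIR="/vagrant/cluster1/spool/qmaster"
EXEC_HOST_LIST="<hosts from the job's $PE_HOSTFILE>"
SUBMIT_HOST_LIST="<hosts from the job's $PE_HOSTFILE>"
ADMIN_HOST_LIST="<hosts from the job's $PE_HOSTFILE>"
ADD_TO_RC="false"        # no init scripts, since we are not root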

There are a lot of unique things to select. But this is very easy to configure with the RSMAP resource type in Univa Grid Engine.

Here is a sample configuration script which installs these resources in UGE:

#!/bin/sh
# Append four per-job RSMAP consumables to the complex configuration.
# Column order: name shortcut type relop requestable consumable default urgency
qconf -sc > $$.tmp
echo "MT_SGE_ROOT MT_SGE_ROOT RSMAP <= YES JOB 0 0" >> $$.tmp
echo "MT_QMASTER_PORT MT_QMASTER_PORT RSMAP <= YES JOB 0 0" >> $$.tmp
echo "MT_GID_RANGE MT_GID_RANGE RSMAP <= YES JOB 0 0" >> $$.tmp
echo "MT_EXEC_SPOOL MT_EXEC_SPOOL RSMAP <= YES JOB 0 0" >> $$.tmp
qconf -Mc $$.tmp
rm $$.tmp

Just copy and paste that into an editor (vi?) on the master host and run it.

[vagrant@master vagrant]$ vi config.sh
[vagrant@master vagrant]$ ./config.sh 
vagrant@master added "MT_SGE_ROOT" to complex entry list
vagrant@master added "MT_QMASTER_PORT" to complex entry list
vagrant@master added "MT_GID_RANGE" to complex entry list

Since three clusters are enough to play with on three virtual hosts, we now add three resource instances for each type.

[vagrant@master vagrant]$ qconf -mattr exechost complex_values "MT_SGE_ROOT=3(cluster1 cluster2 cluster3)" global
"complex_values" of "exechost" is empty - Adding new element(s).
vagrant@master modified "global" in exechost list

These are our $SGE_ROOT directories. They will appear as subdirectories under /vagrant since it is shared already.

$ qconf -mattr exechost complex_values "MT_QMASTER_PORT=3(2324 2326 2328)" global

These are the port numbers the UGE installations get. Note that the execd port will simply be the qmaster port + 1, as this is the default convention.

$ qconf -mattr exechost complex_values "MT_GID_RANGE=3(22000 23000 24000)" global

The GID ranges are quite important; please don’t create a conflict with the range used by the main UGE installation.

The execd spool directories, which need to exist before a job runs, also need to be selected. (I created them as part of the Vagrant setup.)

$ qconf -mattr exechost complex_values "MT_EXEC_SPOOL=3(local1 local2 local3)" global

That’s it for the main configuration part. The hostnames are requested dynamically and derived from Grid Engine (they will appear in the $PE_HOSTFILE on the master task's host).
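For example, the granted host names are listed in the first column of that file, so the job script can collect them like this:

# $PE_HOSTFILE columns: hostname, slots, queue, processor range
HOSTS=$(awk '{print $1}' $PE_HOSTFILE | sort -u | tr '\n' ' ')
echo "got hosts: $HOSTS"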

The only thing left is adding a parallel environment so that you can request multiple hosts for your cluster:

$ qconf -ap uge
pe_name                uge
slots                  1000
user_lists             NONE
xuser_lists            NONE
start_proc_args        NONE
stop_proc_args         NONE
allocation_rule        1
control_slaves         FALSE
job_is_first_task      TRUE
urgency_slots          min
accounting_summary     FALSE
daemon_forks_slaves    FALSE
master_forks_slaves    FALSE

Then save and quit (ZZ in vi)…

vagrant@master added "uge" to parallel environment list

Allocation rule 1 here means one slot per host, so you basically match one execd of your inner UGE installation to each granted slot.

Add it to the all.q:

$ qconf -aattr queue pe_list uge all.q 
vagrant@master modified "all.q" in cluster queue list

Finished!

Writing the Job that Sets Your Cluster Up

Our job script (job.sh) bootstraps the inner cluster from the resources that Grid Engine granted to the job.
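A minimal sketch of what such a script can look like follows - it derives the granted RSMAP values, generates the install configuration, runs the auto-installer and publishes the location of the new cluster in the job context. The details are assumptions rather than the exact script: the SGE_HGR_* environment variables for the granted RSMAP ids, the template at /vagrant/template.conf, the distribution copied from /vagrant/UGE, and passwordless ssh between the VMs for the -x step (otherwise install the execds per host, e.g. via qrsh -inherit as mentioned at the end of this article).

#!/bin/sh
# Sketch of job.sh (assumptions: granted RSMAP values are exported by UGE as
# SGE_HGR_<complex name>; the outer installation lives in /vagrant/UGE; an
# auto-install template exists as /vagrant/template.conf).

# Per-cluster values granted by the scheduler.
MY_ROOT=/vagrant/$SGE_HGR_MT_SGE_ROOT
MY_QMASTER_PORT=$SGE_HGR_MT_QMASTER_PORT
MY_EXECD_PORT=$((MY_QMASTER_PORT + 1))          # default convention: qmaster port + 1
MY_GID_START=$SGE_HGR_MT_GID_RANGE
MY_EXEC_SPOOL=/$SGE_HGR_MT_EXEC_SPOOL           # assumption: local1/2/3 live directly under /

# Hosts granted by the "uge" parallel environment.
HOSTS=$(awk '{print $1}' $PE_HOSTFILE | sort -u | tr '\n' ' ')

# Give the new cluster its own SGE_ROOT on the shared file system and put a
# copy of the distribution there (alternatively unpack the UGE packages).
mkdir -p $MY_ROOT
cp -r /vagrant/UGE/bin /vagrant/UGE/lib /vagrant/UGE/util /vagrant/UGE/utilbin \
      /vagrant/UGE/inst_sge $MY_ROOT/

# Stamp the dynamic values into a copy of the auto-install template.
sed -e "s|^SGE_ROOT=.*|SGE_ROOT=\"$MY_ROOT\"|" \
    -e "s|^SGE_QMASTER_PORT=.*|SGE_QMASTER_PORT=\"$MY_QMASTER_PORT\"|" \
    -e "s|^SGE_EXECD_PORT=.*|SGE_EXECD_PORT=\"$MY_EXECD_PORT\"|" \
    -e "s|^SGE_CLUSTER_NAME=.*|SGE_CLUSTER_NAME=\"p$MY_QMASTER_PORT\"|" \
    -e "s|^GID_RANGE=.*|GID_RANGE=\"$MY_GID_START-$((MY_GID_START + 999))\"|" \
    -e "s|^QMASTER_SPOOL_DIR=.*|QMASTER_SPOOL_DIR=\"$MY_ROOT/spool/qmaster\"|" \
    -e "s|^EXECD_SPOOL_DIR=.*|EXECD_SPOOL_DIR=\"$MY_EXEC_SPOOL\"|" \
    -e "s|^EXEC_HOST_LIST=.*|EXEC_HOST_LIST=\"$HOSTS\"|" \
    -e "s|^SUBMIT_HOST_LIST=.*|SUBMIT_HOST_LIST=\"$HOSTS\"|" \
    -e "s|^ADMIN_HOST_LIST=.*|ADMIN_HOST_LIST=\"$HOSTS\"|" \
    -e "s|^ADD_TO_RC=.*|ADD_TO_RC=\"false\"|" \
    /vagrant/template.conf > $MY_ROOT/autoinstall.conf

# Unattended installation of qmaster and execds (run in a subshell so that
# the outer cluster's environment stays untouched for the qalter below).
(
  export SGE_ROOT=$MY_ROOT
  cd $MY_ROOT && ./inst_sge -m -x -auto $MY_ROOT/autoinstall.conf
)

# Publish the location of the new cluster in the job context.
qalter -ac STATUS=RUNNING,SETTINGS=$MY_ROOT $JOB_ID

# Keep the job (and with it the inner cluster) alive until it is deleted.
while true; do sleep 60; done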

Make it executable:

$ chmod +x job.sh

Now submit this script, which sets up a user-defined Grid Engine installation with 2 execution hosts (-pe uge 2).

$ qsub -N cluster -o /vagrant/UGE/job1.log -j y \
  -l MT_SGE_ROOT=1,MT_QMASTER_PORT=1,MT_GID_RANGE=1,MT_EXEC_SPOOL=1 \
  -pe uge 2 ./job.sh

With qstat you can see it running:

$ qstat -g t
job-ID     prior   name       user         state submit/start at     queue                          jclass                         master ja-task-ID 
-------------------------------------------------------------------------------------------------------------------------------------------------
     1 0.55500 cluster    vagrant      r     02/21/2015 19:00:40 all.q@execd1                                                  MASTER        
     1 0.55500 cluster    vagrant      r     02/21/2015 19:00:40 all.q@execd2                                                  SLAVE         

With qstat -j 1 you can see some interesting details about the new cluster within your normal cluster:

context:                    STATUS=RUNNING,SETTINGS=/vagrant/cluster1
binding:                    NONE
mbind:                      NONE
submit_cmd:                 qsub -N cluster -o /vagrant/UGE/job.log -j y -l MT_SGE_ROOT=1,MT_QMASTER_PORT=1,MT_GID_RANGE=1,MT_EXEC_SPOOL=1 -pe uge 2 ./job.sh
granted_license       1:    
usage                 1:    cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
resource map          1:    MT_EXEC_SPOOL=global=(local1), MT_GID_RANGE=global=(22000), MT_QMASTER_PORT=global=(2324), MT_SGE_ROOT=global=(cluster1)

In the resource map entry you can see what the Univa Grid Engine scheduler has selected (port number / cluster name / spooling directories). Also note STATUS=RUNNING in the context - this appears after the installation was performed by the job. SETTINGS points to the installation directory, which is used to access the new cluster. But first we schedule another cluster, since we have one host left.

$ qsub -N cluster -o /vagrant/UGE/job2.log -j y \
  -l MT_SGE_ROOT=1,MT_QMASTER_PORT=1,MT_GID_RANGE=1,MT_EXEC_SPOOL=1 \
  -pe uge 1 ./job.sh

This job is running as well. When submitting a third one, it stays pending of course, since we are already using all hosts (you could of course configure oversubscription).

Now, using the SETTINGS information from the job context, we access our first inner cluster:

source /vagrant/cluster1/default/common/settings.sh

Doing a qhost gives us two hosts, since we requested two:

[vagrant@master ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR NLOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
execd2                  lx-amd64        4    1    4    4  0.02  490.0M  102.5M  928.0M     0.0
master                  lx-amd64        4    1    4    4  0.04  490.0M  124.5M  928.0M  472.0K

Now you can start jobs...

[vagrant@master ~]$ qsub -b y sleep 1234

...doing qstat and qacct etc.

Finally you can go back to your "real" cluster with

[vagrant@master ~]$ source /vagrant/UGE/default/common/settings.sh 

There you can only see your two jobs which set up the UGE clusters.

When you want to delete a cluster, just use qdel.

There is one task left: cleaning up the directories after a cluster-as-a-job has finished. It is up to the reader to implement that (using an epilog script?); I just wanted to show how easy it is to run Univa Grid Engine under Grid Engine. As we have seen, the RSMAP resource type (here used as a global per-job consumable) is a great help for selecting ports etc.
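If you want to automate that, a minimal sketch of such a cleanup as a queue epilog could look like the following (assuming the granted MT_SGE_ROOT value is still visible in the epilog environment as SGE_HGR_MT_SGE_ROOT; if not, it could be parsed out of qstat -j $JOB_ID instead):

#!/bin/sh
# Hypothetical epilog for all.q: tear down and remove the inner cluster
# after the job that created it has finished.
[ -z "$SGE_HGR_MT_SGE_ROOT" ] && exit 0
INNER_ROOT=/vagrant/$SGE_HGR_MT_SGE_ROOT
[ -f $INNER_ROOT/default/common/settings.sh ] || exit 0
(
  # talk to the inner cluster, not to the outer one
  . $INNER_ROOT/default/common/settings.sh
  qconf -ke all    # shut down any surviving inner execds
  qconf -km        # shut down the inner qmaster
)
sleep 10           # give the daemons a moment to terminate
rm -rf $INNER_ROOT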

Also consider starting up the DRMAA2 d2proxy in the newly created cluster in order to access it from home (with ssh tunneling), or use it as a simple multi-cluster access toolkit (which now supports file staging as well).

Further, more advanced considerations:

  • start the execds with qrsh -inherit (export SGE_ND=1 might help here) for full accounting and process reaping
  • clean up the directories after the cluster job has finished
  • use transfer queues to forward jobs from the original cluster to the inner cluster
  • add load sensors for tracking the usage of the inner cluster from the main cluster, and remove the cluster when it is unutilized
  • spawn more inner execds with additional job types, ...