Ultra-Fast Job Submission with Go DRMAA and Univa Grid Engine (2012-10-20)

In my last article I showed two basic examples how to use the Go DRMAA binding for simple job submission and job status checks. This time I want to demonstrate how easy it is to submit thousands of (possibly different) jobs into a Grid Engine system in a very fast way. Of course fast bulk job submission can also be done with the qsub -t switch, which allows to submit array jobs with one single submit, but then your job parameters and even the job command name must be the same for all jobs.

The example below consists of four different functions. The main() function creates a new DRMAA session by just calling drmaa.MakeSession(). The defer statement below puts the session.Exit() cleanup method on a stack, which is executed after the main function finishes. Finally, the submitJobs() function is called, which requires a reference to the session object as well as the amount of worker routines, which have to be spawned concurrently.

While playing around with my system (a dual core laptop, where all components of Univa Grid Engine 8.1 are running!), 512 workers was ideal for me in terms of performance: I was able to submit 1024 jobs within 1-2 seconds! Which is an average rate of 1-2 ms per job. An unfair comparison: Using a bash script with a single loop doing 1024 qsubs took between 18-19 seconds.

package main
 
import (
  "drmaa"
  "fmt"
  "runtime"
)
 
type jobTemplate struct {
  jobname string
  arg     string
}
 
func createJobs(session *drmaa.Session, jobs chan<- jobTemplate, amount int) {
  for i := 0; i < amount; i++ {
    var jt jobTemplate
    jt.jobname = "sleep"
    jt.arg = "10"
    jobs <- jt
  }
  close(jobs)
}
 
func submitJob(session *drmaa.Session, jobs <-chan jobTemplate, done chan<- bool) {
  // as long as there are jobs to submit, do so
  for job := range jobs {
    if djt, err := session.AllocateJobTemplate(); err == nil {
      djt.SetRemoteCommand(job.jobname)
      djt.SetArg(job.arg)
      session.RunJob(&amp;djt)
      session.DeleteJobTemplate(&amp;djt)
    }
  }
  done <- true
}
 
func submitJobs(session *drmaa.Session, workers int) {
  jobsChannel := make(chan jobTemplate)
  done := make(chan bool)
  // create 1024 jobs
  go createJobs(session, jobsChannel, 1024)
  // start worker
  for i := 0; i < workers; i++ {
    go submitJob(session, jobsChannel, done)
  }
  // block until all workers have finished
  for i := 0; i < workers; i++ {
    <-done
  }
}
 
func main() {
  const workers int = 512
 
  runtime.GOMAXPROCS(4)
 
  session, err := drmaa.MakeSession()
  if err != nil {
    fmt.Println(err)
    return
  }
  defer session.Exit()
 
  submitJobs(&amp;session, workers)
}
 

The submitJobs() function creates two channels, one for sending the jobs from the job generation function (createJobs()) to the workers, and one done channel for signaling that no more jobs are left and a hence a worker finished. Then the coroutine createJobs() is started asynchronously. It simply loops 1024 times generating structs, which are filled with the job-name and the job parameter. In an real-world example this function would parse a file, which contains a job and parameter list, for example. Finally, the workers are started as coroutines/or go routines. As long as there are jobs in the jobs channel, they are processing them by allocating a DRMAA job template and submit the job template to the Grid Engine master process. When no job is left the worker sending a done message and quit. When submitJobs() was able to collect all done messages it returns to the main function.

UPDATE (2012/10/27): A simple single-threaded C DRMAA application needs between 3-4 seconds to submit 1024 jobs in my environment. A simple single-threaded Java DRMAA application about 4 seconds. Of course a multi-threaded C DRMAA application could reach a similar performance, but it would be much more sophisticate, especially when having a single source for all jobs.