Ultra-Fast Job Submission with Go DRMAA and Univa Grid Engine (2012-10-20)
In my last article I showed two basic examples of how to use the Go DRMAA binding for simple job submission and job status checks. This time I want to demonstrate how easy it is to submit thousands of (possibly different) jobs into a Grid Engine system very quickly. Of course, fast bulk job submission can also be done with the qsub -t switch, which submits an array job with a single call, but then the job parameters and even the job command name must be the same for all jobs.
The example below consists of four functions. The main() function creates a new DRMAA session by calling drmaa.MakeSession(). The defer statement that follows registers the session.Exit() cleanup call on a stack of deferred calls, which run when main() returns. Finally, the submitJobs() function is called. It takes a reference to the session object and the number of worker goroutines to spawn concurrently.
While playing around on my system (a dual-core laptop on which all components of Univa Grid Engine 8.1 are running!), 512 workers turned out to be ideal in terms of performance: I was able to submit 1024 jobs within 1-2 seconds, which is an average of roughly 1-2 ms per job. An unfair comparison: a bash script with a single loop doing 1024 qsub calls took between 18 and 19 seconds.
package main

import (
	"drmaa"
	"fmt"
	"runtime"
)

type jobTemplate struct {
	jobname string
	arg     string
}

func createJobs(session *drmaa.Session, jobs chan<- jobTemplate, amount int) {
	for i := 0; i < amount; i++ {
		var jt jobTemplate
		jt.jobname = "sleep"
		jt.arg = "10"
		jobs <- jt
	}
	close(jobs)
}

func submitJob(session *drmaa.Session, jobs <-chan jobTemplate, done chan<- bool) {
	// as long as there are jobs to submit, do so
	for job := range jobs {
		if djt, err := session.AllocateJobTemplate(); err == nil {
			djt.SetRemoteCommand(job.jobname)
			djt.SetArg(job.arg)
			session.RunJob(&djt)
			session.DeleteJobTemplate(&djt)
		}
	}
	done <- true
}

func submitJobs(session *drmaa.Session, workers int) {
	jobsChannel := make(chan jobTemplate)
	done := make(chan bool)
	// create 1024 jobs
	go createJobs(session, jobsChannel, 1024)
	// start worker
	for i := 0; i < workers; i++ {
		go submitJob(session, jobsChannel, done)
	}
	// block until all workers have finished
	for i := 0; i < workers; i++ {
		<-done
	}
}

func main() {
	const workers int = 512
	runtime.GOMAXPROCS(4)
	session, err := drmaa.MakeSession()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer session.Exit()
	submitJobs(&session, workers)
}
The submitJobs() function creates two channels: one for sending the jobs from the job generation function (createJobs()) to the workers, and a done channel on which each worker signals that no more jobs are left and that it has therefore finished. Then createJobs() is started asynchronously as a goroutine. It simply loops 1024 times, generating structs that are filled with the job name and the job parameter, and closes the jobs channel when it is done. In a real-world example this function would, for instance, parse a file containing a list of jobs and parameters. Finally, the workers are started as goroutines. As long as there are jobs in the jobs channel, each worker processes them by allocating a DRMAA job template and submitting it to the Grid Engine master process. When no job is left (the channel is closed and drained), the worker sends a done message and quits. Once submitJobs() has collected the done messages from all workers, it returns to the main function.
UPDATE (2012/10/27): A simple single-threaded C DRMAA application needs between 3 and 4 seconds to submit 1024 jobs in my environment. A simple single-threaded Java DRMAA application needs about 4 seconds. Of course, a multi-threaded C DRMAA application could reach similar performance, but it would be much more complicated to write, especially when all jobs come from a single source.