The DRMAA2 Tutorial - Introduction (1) (2013-10-05)

"Evolution is a process of creating patterns of increasing order" (Ray Kurzweil, The Singularity Is Near)

Open standards are important in the software industry. They protect investments, increase the usage of interfaces, decrease costs, and bring people with the same objectives together. With no or minimal changes, software can support multiple systems, even systems that will be built in the future. Knowledge and recipes are usually widespread, so you will find help and solutions in many different communities. There are certainly many more aspects of open standards, and they play a major role in the exponential growth of many areas.

DRMAA2 (Distributed Resource Management Application API 2) is such an open standard. It is the successor of the widespread DRMAA (Distributed Resource Management Application API). DRMAA is generally used for submitting jobs (or creating job workflows) into a compute cluster managed by a cluster resource management system like Grid Engine (or Condor / PBS / Torque / LSF, …), either from applications (like Mathematica, KNIME, …) or by users who build workflows.

DRMAA defines with its language bindings a set of functions for different programming languages. Those functions represent the least common denominator of specific functionalities of cluster schedulers like Grid Engine, PBS, and LSF.

Unlike DRMAA, with DRMAA2 you can not only submit jobs but also retrieve cluster-related information, such as host names, types and status information, or insight into the queues configured in the resource management system. You can also monitor jobs that were not submitted within the application. Overall it covers many more use cases than the old DRMAA standard. When we started several years ago with our kick-off meeting at the Super Computing event in Hamburg, we took the results of a survey of DRMAA users as a starting point. Since then lots of things were added and re-arranged.

One of my current projects is implementing DRMAA2 as a Grid Engine API. A beta version of the library will be part of Univa Grid Engine 8.2. The target language of the first implementation is C. Support for other programming languages is planned; usually they are wrappers around the C library (using mechanisms like JNI for Java or cgo for Go/#golang). This article is the first of a series which I want to publish over the next weeks and months, in which I'm going to introduce the basic usage of the new programming API.

Compiling DRMAA2 Applications for Univa Grid Engine

Compiling DRMAA2 applications for Grid Engine is not much different from compiling DRMAA applications. Like DRMAA, the DRMAA2 library will be shipped in the $SGE_ROOT/lib/$ARCH directory (where $ARCH is lx-amd64 on 64-bit Linux); the header file is located in the $SGE_ROOT/include directory.

Your DRMAA2 application can then be compiled (if $SGE_ROOT/default/common/settings.sh is sourced, which is the case when you can run qsub on the command line) with:

gcc -I$SGE_ROOT/include -L$SGE_ROOT/lib/lx-amd64 -o example example.c -ldrmaa2

and be started with

export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
./example

If you put the DRMAA2 shared library (libdrmaa2.so) into a local library path you don't need to export LD_LIBRARY_PATH, of course, but you need to take care of updating the library after a Grid Engine update.

A First Look at DRMAA2 Job Sessions

DRMAA2 comes with different types of sessions: job sessions, monitoring sessions, and reservation sessions. While job and monitoring sessions are a mandatory part of each DRMAA2 implementation, the reservation session is optional. Its availability can be discovered during application run-time (which is part of a later tutorial).

The job session is similar to what the old DRMAA offers, with the difference that DRMAA2 sessions are persistent. In Univa Grid Engine the job session names are stored in the central qmaster component. This is particularly useful when you have different processes which are "sharing" the same type of jobs (i.e. each process wants to track a specific set of jobs). Job sessions are user specific, i.e. different users can't share a session. This is implied by the rights management of Grid Engine, which disallows operations like suspending, resuming or terminating jobs of other users. Within a job session you can submit, control, and monitor jobs, but only those jobs which were submitted within that job session can be controlled and monitored there. You can have multiple different job sessions open at the same time in one process, while for the monitoring session only one makes sense.

With a monitoring session you can track the status and online usage of your own jobs, whether they were submitted in any job session, through DRMAA1, or on the command line. Like in job sessions you can also access jobs that finished during your application run-time. Within a monitoring session you can't submit or control jobs. If the Grid Engine administrative user opens a monitoring session, it gets the jobs of all users of the system. This makes the DRMAA2 API a good candidate for writing job monitoring GUIs.

A job session goes through four steps in a DRMAA2 application: it is created, opened, closed, and destroyed. Creating means that the session is allocated in the Grid Engine qmaster process; opening means that such a persistent session is made available to the DRMAA2 application. Closing leads to communication with qmaster so that the library doesn't get any more information about jobs running in this session, and finally destroying is the removal of the job session object on the qmaster. After a create, close, and destroy sequence the state on qmaster is the same as before, since the session was destroyed. No explicit open call is needed in that sequence because the session is implicitly opened by the drmaa2_create_jsession() call.

In order to leave a session persistent the destroy call can be omitted. It is important to close the session before the application exits, because otherwise the Grid Engine master process will keep the underlying communication connection open for a much longer time than needed (although there is a timeout) until it figures out that the client connection died.
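The lifecycle described above can be modelled without a cluster. The following is only a minimal sketch in Go, not DRMAA2 code: the registry type stands in for the session store inside qmaster, and all names (registry, jobSession, create, openSession, closeSession, destroy) are hypothetical. It shows that closing leaves the session name persistent while destroying removes it.

```go
package main

import (
	"errors"
	"fmt"
)

// registry is a hypothetical stand-in for the session store in qmaster.
// It only remembers which session names exist.
type registry struct {
	sessions map[string]bool
}

func newRegistry() *registry {
	return &registry{sessions: make(map[string]bool)}
}

// jobSession is the client-side handle; open is true between open and close.
type jobSession struct {
	name string
	open bool
}

// create registers the name and returns a handle that is already open,
// mirroring the implicit open on creation.
func (r *registry) create(name string) (*jobSession, error) {
	if r.sessions[name] {
		return nil, errors.New("session already exists: " + name)
	}
	r.sessions[name] = true
	return &jobSession{name: name, open: true}, nil
}

// openSession attaches to a session that was created earlier and not destroyed.
func (r *registry) openSession(name string) (*jobSession, error) {
	if !r.sessions[name] {
		return nil, errors.New("no such session: " + name)
	}
	return &jobSession{name: name, open: true}, nil
}

// closeSession detaches the handle; the name stays registered.
func (js *jobSession) closeSession() {
	js.open = false
}

// destroy removes the persistent entry from the registry.
func (r *registry) destroy(name string) {
	delete(r.sessions, name)
}

func main() {
	qmaster := newRegistry()

	js, err := qmaster.create("mysession")
	if err != nil {
		fmt.Println(err)
		return
	}
	js.closeSession()
	fmt.Println("still registered after close:", qmaster.sessions["mysession"])

	qmaster.destroy("mysession")
	fmt.Println("still registered after destroy:", qmaster.sessions["mysession"])
}
```

Omitting the destroy call in main corresponds to leaving the session persistent, so that a later process could attach to it again with openSession.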

Using DRMAA2 lists and dictionaries

The C implementation of DRMAA2 comes with two higher level data structures: a list and a dictionary. While a dictionary maps a string to another string in an efficient way (which is used for setting resource limits, for example), a list contains a collection of strings, job objects or other DRMAA2 specific data types. They are used to simplify the creation of and access to input and output values for some DRMAA2 functions.

The typical life cycle of a dictionary looks as follows: it is created, two different key/value pairs are added, the value of an already known key is changed, the value of this key is retrieved, a check is made whether a specific key is part of the dictionary, the key is deleted, and finally the dictionary is destroyed, i.e. freed. Since the strings are not allocated, the callback method (the second argument) is set to NULL, i.e. nothing needs to be freed when an element or the whole list is deleted.

When creating a list the type must be given as an argument. A typical example creates a list of strings, fills it, retrieves its length, and finally prints all values.
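For readers who know Go better than C, the same sequence of operations can be shown with Go's built-in map and slice. This is only an analogy under that assumption, not the DRMAA2 C API: the keys, values, and the function names dictDemo and listDemo are made up for illustration, and no free callback is involved because Go is garbage collected.

```go
package main

import "fmt"

// dictDemo walks through the dictionary operations: add two pairs, change a
// value, look it up, test for a key, delete that key.
func dictDemo() (value string, hadKey bool, remaining int) {
	dict := make(map[string]string)
	dict["key1"] = "value1"
	dict["key2"] = "value2"
	dict["key1"] = "changed" // overwrite the value of a known key
	value = dict["key1"]
	_, hadKey = dict["key2"] // membership test
	delete(dict, "key2")
	return value, hadKey, len(dict)
}

// listDemo creates a list of strings and fills it.
func listDemo() []string {
	list := []string{}
	list = append(list, "first")
	list = append(list, "second")
	list = append(list, "third")
	return list
}

func main() {
	value, hadKey, remaining := dictDemo()
	fmt.Println(value, hadKey, remaining)

	list := listDemo()
	fmt.Println("length:", len(list))
	for _, s := range list {
		fmt.Println(s)
	}
}
```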

The following list types are defined in DRMAA2:

typedef enum drmaa2_listtype {
   DRMAA2_STRINGLIST       = 0,
   DRMAA2_JOBLIST          = 1,
   DRMAA2_QUEUEINFOLIST    = 2,
   DRMAA2_MACHINEINFOLIST  = 3,
   DRMAA2_SLOTINFOLIST     = 4,
   DRMAA2_RESERVATIONLIST  = 5
} drmaa2_listtype; 

Querying the list of job sessions

All available job sessions can be requested with drmaa2_get_jsession_list(), which returns a list of type DRMAA2_STRINGLIST. This list can simply be processed as explained above. A typical use is to search for a specific session and, if it exists, open it.
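The search itself is a plain linear scan over the returned names. A minimal sketch in Go, assuming the session names have already been copied into a string slice; containsSession and the example names are hypothetical and not part of any DRMAA2 binding:

```go
package main

import "fmt"

// containsSession reports whether name occurs in the list of session names.
func containsSession(names []string, name string) bool {
	for _, n := range names {
		if n == name {
			return true
		}
	}
	return false
}

func main() {
	// stand-in for the string list returned by the session list query
	names := []string{"testsession", "mysession"}

	if containsSession(names, "mysession") {
		fmt.Println("session exists, it can be opened")
	} else {
		fmt.Println("session not found")
	}
}
```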

Gorge - A Go (#golang) Library for Accessing Grid Engine (2013-07-28)

It looks like the Go (#golang) programming language is becoming popular for cluster management. Kamil offers a project called Gorge on GitHub. Similar to my gestatus sub-package of the Go DRMAA implementation, Gorge reads out job status information of Grid Engine jobs by parsing the qstat XML output. It not only parses job information but also has wrappers for qstat -pri / -ext / -urg for detailed queue status information. Additionally it contains functions for accessing Grid Engine's Arco DB. Thanks for sharing!

Go DRMAA Language Binding Source Code now Hosted on Github (2013-03-17)

Today I pushed the source code of the Go DRMAA language binding to a GitHub repository: https://github.com/dgruber/drmaa

It now contains a sub-package called gestatus (Grid Engine status), which parses the qstat -xml -j output under the hood and is therefore able to deliver almost all available information about a particular job. Hence gestatus only works for Grid Engine (tested with Univa Grid Engine), while the DRMAA library itself should also be compatible with other DRMs.

For a minimal gestatus example please have a look at the example file.

You can find more Go DRMAA related examples in the Programming APIs section of this blog.

DRMAA Version 2 C Language Binding Specification Officially Released (2012-11-16)

The first language binding is now officially approved as OGF standard GFD-R-P.198. You can get the standard here.

Failure Handling in Go DRMAA (2012-11-11)

One thing not covered yet is the failure handling in Go DRMAA. The package documentation shows that an error is a pointer to a DRMAA Error type, which is a struct consisting of an error id (Id) and a human-readable error message (Message). The type also fulfills the Go error interface (i.e. in Go terminology it's an “errorer”): it returns the error message whenever the Error() method is called. That makes it simple to use in both relevant cases: when the program just has to log the error, and when the program should react based on an expected error.
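How a struct with an id and a message can satisfy the error interface is sketched below. This is a self-contained illustration and not the package source: the three ids, their numeric values, and the initSession helper are assumptions made for the example.

```go
package main

import "fmt"

// ErrorId identifies the kind of failure (values here are illustrative only).
type ErrorId int

const (
	Success ErrorId = iota
	InternalError
	AlreadyActiveSession
)

// Error carries an id for control flow and a message for logging.
type Error struct {
	Id      ErrorId
	Message string
}

// Error makes *Error satisfy the built-in error interface.
func (e *Error) Error() string {
	return e.Message
}

// initSession is a hypothetical stand-in that fails when a session is active.
func initSession(active bool) *Error {
	if active {
		return &Error{Id: AlreadyActiveSession, Message: "a session is already active"}
	}
	return nil
}

func main() {
	if err := initSession(true); err != nil {
		// case 1: just log it, since *Error can be used wherever an error is expected
		var asError error = err
		fmt.Println("log:", asError)

		// case 2: react to an expected error based on its id
		if err.Id == AlreadyActiveSession {
			fmt.Println("continuing with the existing session")
		}
	}
}
```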

The error ids are derived from the DRMAA 1.0 IDL standard. The following list shows the Go DRMAA error ids:

  • "nil" in case of success
  • ErrorId
  • InternalError
  • DrmCommunicationFailure
  • AuthFailure
  • InvalidArgument
  • NoActiveSession
  • NoMemory
  • InvalidContactString
  • DefaultContactStringError
  • NoDefaultContactStringSelected
  • DrmsInitFailed
  • AlreadyActiveSession
  • DrmsExitError
  • InvalidAttributeFormat
  • InvalidAttributeValue
  • ConflictingAttributeValues
  • TryLater
  • DeniedByDrm
  • InvalidJob
  • ResumeInconsistentState
  • SuspendInconsistentState
  • HoldInconsistentState
  • ReleaseInconsistentState
  • ExitTimeout
  • NoRusage
  • NoMoreElements
  • NoErrno

The following program demonstrates the usage of error ids for control flow decisions. Since only one session is allowed at a time, a second session Init() results in an error with the id AlreadyActiveSession. If such an error occurs during initialization, the session is usable (since it is already initialized) despite the error. In case another error occurs during initialization, the program exits with a non-zero exit code.

 

package main
 
import (
  "drmaa"
  "fmt"
  "os"
)
 
func main() {
  var session drmaa.Session
 
  session.Init("session=Init1")
  defer session.Exit()
 
  // ... 
 
  // opening a second session is illegal
  err := session.Init("session=Init2")
 
  // in case of an error check the error id for failure handling
  if err != nil {
    switch err.Id {
    case drmaa.AlreadyActiveSession:
      fmt.Println("We have already a session: ", err.Message)
      // let's continue by using the open session
    case drmaa.DrmsInitFailed:
      fmt.Println("Init failed: ", err.Message)
      os.Exit(1)
    default:
      fmt.Println("Unexpected error: ", err)
      os.Exit(2)
    }
  }
 
  contact, _ := drmaa.GetContact()
  fmt.Println("Session name: ", contact)
}
 

When run, the program produces the following output:

 

We have already a session: Initialization failed due to existing DRMAA session.
Session name: session=Init1

 

Setting Grid Engine Specific Job Submission Parameters and Fetching Job Usage Values in Go DRMAA (2012-11-04)

In order to continue the Go DRMAA API description series I present below a small program, which submits jobs with Univa Grid Engine specific submission parameters and reports all collected job usage values.

The most obvious difference between submission with qsub and with DRMAA is that a DRMAA job is expected to be a binary, while qsub by default expects the command name to be a script. But both can submit scripts as well as binaries. For qsub the “-b y” parameter can be used in order to submit binary jobs. Unlike job scripts, the binary itself is not transferred to the execution host; it must exist in the path on the execution host. DRMAA jobs that are job scripts can be submitted by setting “-b n” as a job submission parameter. Then the job script is transferred by Grid Engine to the execution host, as when submitting through qsub.

Setting job submission parameters that are not defined by the DRMAA standard is easy: they can be set using the DRMAA standardized native specification, which in Go is the SetNativeSpecification() job template method. The job output is usually written to two files, the output file (“jobname”.o<jobno>) for stdout and the error file (“jobname”.e<jobno>) for stderr. In order to tell the system that all output should go to the output file, the SetJoinFiles(true) job template method can be called. In order to submit parallel jobs, the native specification has to be used again and “-pe <pename> <slots>” has to be added. When there are more parameters that are specific to a whole job class, the job class (a new feature in Univa Grid Engine 8.1) can be set in the native specification as well (like “-jc <classname>”). Finally the remote command, which points to a shell script in the current working directory, is set.

After the job has finished, the exit status is printed (in case the job ran to completion); otherwise the signal which terminated the job is displayed. Finally, a loop over all values of the resource map prints each resource and its usage (like the resident set size etc.).

package main
 
import (
  "drmaa"
  "fmt"
  "os"
)
 
func main() {
  session, err := drmaa.MakeSession()
  if err != nil {
    fmt.Println(err)
    return
  }
  defer session.Exit()
 
  jt, err := session.AllocateJobTemplate()
  if err != nil {
    fmt.Println(err)
    return
  }
 
  // stderr output of job is written to stdout output file
  jt.SetJoinFiles(true)
  // set jobs name for accounting and qstat
  jt.SetJobName("testJob")
  // set Grid Engine specific submission parameters
  jt.SetNativeSpecification("-b n -pe mytestpe 4")
  wd, _ := os.Getwd()
  // set shell script to submit (requires "-b n" <- binary no)
  jt.SetRemoteCommand(wd + "/testjob.sh")
 
  // submit job
  id, err := session.RunJob(&jt)
  if err != nil {
    fmt.Println("Error during job submission: ", err)
    // without a job id there is nothing to wait for
    return
  }
 
  // wait until the job finishes and get the job information
  if ji, err := session.Wait(id, drmaa.TimeoutWaitForever); err == nil {
    if ji.HasExited() {
      fmt.Println("Job exited with exit status: ", ji.ExitStatus())
    }
    if ji.HasSignaled() {
      fmt.Println("Job was terminated through signal ", ji.TerminationSignal())
    }
    // report job usage
    fmt.Println("Job used following resources:")
    for resource, usage := range ji.ResourceUsage() {
      fmt.Println("Resource ", resource, " usage: ", usage)
    }
  }
}
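
When several of the switches mentioned above have to be combined, the native specification string can be assembled by a small helper instead of being written by hand. The following is a minimal sketch that only builds the string; nativeSpec is a hypothetical helper and not part of the Go DRMAA package, and it only knows the three switches discussed here (-b, -pe, -jc).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nativeSpec assembles a native specification string from the binary flag,
// an optional parallel environment with slot count, and an optional job class.
func nativeSpec(binary bool, pe string, slots int, jobClass string) string {
	parts := []string{"-b"}
	if binary {
		parts = append(parts, "y")
	} else {
		parts = append(parts, "n")
	}
	// a parallel environment request needs both a name and a slot count
	if pe != "" && slots > 0 {
		parts = append(parts, "-pe", pe, strconv.Itoa(slots))
	}
	if jobClass != "" {
		parts = append(parts, "-jc", jobClass)
	}
	return strings.Join(parts, " ")
}

func main() {
	// same request as in the program above: a script job in a 4-slot PE
	fmt.Println(nativeSpec(false, "mytestpe", 4, ""))
}
```

The resulting string could then be handed to the SetNativeSpecification() job template method shown in the program above.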
 

Ultra-Fast Job Submission with Go DRMAA and Univa Grid Engine (2012-10-20)

In my last article I showed two basic examples of how to use the Go DRMAA binding for simple job submission and job status checks. This time I want to demonstrate how easy it is to submit thousands of (possibly different) jobs into a Grid Engine system in a very fast way. Of course fast bulk job submission can also be done with the qsub -t switch, which allows submitting array jobs with a single submit, but then your job parameters and even the job command name must be the same for all jobs.

The example below consists of four different functions. The main() function creates a new DRMAA session by just calling drmaa.MakeSession(). The defer statement below it registers the session.Exit() cleanup method, which is executed when the main function returns. Finally, the submitJobs() function is called, which requires a reference to the session object as well as the number of worker goroutines that are to be spawned concurrently.

While playing around with my system (a dual core laptop, where all components of Univa Grid Engine 8.1 are running!), 512 workers were ideal for me in terms of performance: I was able to submit 1024 jobs within 1-2 seconds, which is an average rate of 1-2 ms per job. An unfair comparison: a bash script with a single loop doing 1024 qsubs took between 18 and 19 seconds.

package main
 
import (
  "drmaa"
  "fmt"
  "runtime"
)
 
type jobTemplate struct {
  jobname string
  arg     string
}
 
func createJobs(session *drmaa.Session, jobs chan<- jobTemplate, amount int) {
  for i := 0; i < amount; i++ {
    var jt jobTemplate
    jt.jobname = "sleep"
    jt.arg = "10"
    jobs <- jt
  }
  close(jobs)
}
 
func submitJob(session *drmaa.Session, jobs <-chan jobTemplate, done chan<- bool) {
  // as long as there are jobs to submit, do so
  for job := range jobs {
    if djt, err := session.AllocateJobTemplate(); err == nil {
      djt.SetRemoteCommand(job.jobname)
      djt.SetArg(job.arg)
      session.RunJob(&djt)
      session.DeleteJobTemplate(&djt)
    }
  }
  done <- true
}
 
func submitJobs(session *drmaa.Session, workers int) {
  jobsChannel := make(chan jobTemplate)
  done := make(chan bool)
  // create 1024 jobs
  go createJobs(session, jobsChannel, 1024)
  // start worker
  for i := 0; i < workers; i++ {
    go submitJob(session, jobsChannel, done)
  }
  // block until all workers have finished
  for i := 0; i < workers; i++ {
    <-done
  }
}
 
func main() {
  const workers int = 512
 
  runtime.GOMAXPROCS(4)
 
  session, err := drmaa.MakeSession()
  if err != nil {
    fmt.Println(err)
    return
  }
  defer session.Exit()
 
  submitJobs(&session, workers)
}
 

The submitJobs() function creates two channels: one for sending the jobs from the job generation function (createJobs()) to the workers, and a done channel on which a worker signals that no more jobs are left and hence that it has finished. Then createJobs() is started asynchronously as a goroutine. It simply loops 1024 times generating structs, which are filled with the job name and the job parameter. In a real-world example this function would, for instance, parse a file which contains a list of jobs and parameters. Finally, the workers are started as goroutines. As long as there are jobs in the jobs channel, they process them by allocating a DRMAA job template and submitting it to the Grid Engine master process. When no job is left, each worker sends a done message and quits. Once submitJobs() has collected all done messages, it returns to the main function.
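The fan-out pattern can be tried without a cluster by replacing the submission with a stub. The sketch below is a variation under stated assumptions and not the program above: runWorkers and the job type are hypothetical, the submit callback stands in for the DRMAA calls, and a sync.WaitGroup is used in place of the done channel to wait for the workers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job mirrors the small template struct: a command name and one argument.
type job struct {
	name string
	arg  string
}

// runWorkers sends amount jobs through a channel to the given number of
// workers and returns how many jobs were handed to submit.
func runWorkers(amount, workers int, submit func(job)) int64 {
	jobs := make(chan job)
	var handled int64
	var wg sync.WaitGroup

	// producer: generates the jobs and closes the channel when done
	go func() {
		for i := 0; i < amount; i++ {
			jobs <- job{name: "sleep", arg: "10"}
		}
		close(jobs)
	}()

	// workers: each one drains the channel until it is closed
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				submit(j)
				atomic.AddInt64(&handled, 1)
			}
		}()
	}

	// block until all workers have finished
	wg.Wait()
	return atomic.LoadInt64(&handled)
}

func main() {
	// the stub does nothing; a real program would submit the job here
	n := runWorkers(1024, 512, func(job) {})
	fmt.Println("jobs handled:", n)
}
```

Closing the jobs channel is what ends the range loops in the workers, so no separate "no more jobs" message is needed.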

UPDATE (2012/10/27): A simple single-threaded C DRMAA application needs between 3 and 4 seconds to submit 1024 jobs in my environment. A simple single-threaded Java DRMAA application needs about 4 seconds. Of course a multi-threaded C DRMAA application could reach a similar performance, but it would be much more sophisticated, especially when having a single source for all jobs.

Go DRMAA Language Binding Update (2012-10-15)

I just uploaded a slightly enhanced version of the Google Go DRMAA language binding. It now offers the missing JobInfo access methods and the JobTemplate SetArg() method, which accepts a simple string.

Documentation can be found here: http://www.gridengine.eu/DRMAA/GO/drmaa.html

And the linux_amd64 library, with README and Documentation here: http://www.gridengine.eu/DRMAA/GO/GODRMAA_02.zip

Google Go and Univa Grid Engine: A Go DRMAA Language Binding Implementation - Part 1 (2012-10-07)

Google's Go looks to me like the most interesting programming language published in recent years (of course there are others, like “Julia”, but they are more domain specific (technical computing)). It is compiled, has automatic garbage collection and multiple return values, offers pointers, methods, interfaces, built-in maps, slices, range expressions, goroutines and channels, and it's easy to use. There are a lot of articles which discuss all the features, so there is no need to go into further detail. The number of keywords is small (25), which keeps the language clearly arranged. But there is one little thing which I really miss: method overloading. Having a different method for each different parameter set could IMHO lead to a method name zoo. There is a work-around: using the empty interface{} (which is implemented by each Go type), possibly combined with an ellipsis (in Go: “...”) and type checking, but then the signature is hard to read. It would be easier if this were just possible out of the box.
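The work-around mentioned above can be sketched as follows. toArgs is a hypothetical helper written for this illustration, not a function of any DRMAA binding: it takes a variadic list of interface{} values and uses a type switch to accept both single strings and string slices. The signature no longer tells the caller which types are allowed, which is exactly the readability problem described.

```go
package main

import (
	"errors"
	"fmt"
)

// toArgs accepts any mix of string and []string values and flattens them
// into one argument list; every other type is rejected at run-time.
func toArgs(values ...interface{}) ([]string, error) {
	var args []string
	for _, v := range values {
		switch t := v.(type) {
		case string:
			args = append(args, t)
		case []string:
			args = append(args, t...)
		default:
			return nil, errors.New("unsupported argument type")
		}
	}
	return args, nil
}

func main() {
	one, _ := toArgs("10")                  // a single string
	many, _ := toArgs([]string{"-l", "-a"}) // a string slice
	_, err := toArgs(42)                    // rejected only at run-time
	fmt.Println(one, many, err)
}
```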

In order to gain more expertise in Go I developed a DRMAA (1.0) language binding in some free time. The current release is not well tested and there are still some details which I want to re-work, so I would describe the version as a first “Proof of Concept”. Nevertheless you can download it and use it for free without any bigger restrictions. There are no guarantees and I don't take any responsibility for anything bad which could happen (like an asteroid destroying your computer while using it ;). But when you have any issues feel free to contact me (Daniel), so that I can repair the defect or we can at least discuss the issue. I developed it using Univa Grid Engine 8.1 (feel free to download the 48-core limited free demo version), but it should work with any other Grid Engine fork which offers the C DRMAA 1.0 binding. Like the Java DRMAA or Python binding it internally uses the C library, hence in theory it should also work for any other job scheduler which offers a C DRMAA 1.0 binding. If you try that, with or without success, please let me know.

How to use the Go DRMAA binding


The Go DRMAA binding consists of the Go DRMAA library and a small piece of documentation. For a more detailed description of the functions please check out the DRMAA spec or the Grid Engine DRMAA man pages. Internally the Go lib needs to access a C DRMAA 1.0 library, hence you must set the LD_LIBRARY_PATH to the right location. For Univa Grid Engine it is $SGE_ROOT/lib/lx-amd64. Additionally you need to source the GE settings.sh (or settings.csh) file beforehand. Then your Go programs can be compiled and started.

Here is a small example of a command which submits a binary program with one or more arguments to the job scheduler. I called it “dsub” for DRMAA submission (similar to the POSIX standardized submission client name “qsub”).

 

package main

import (
  "drmaa"
  "fmt"
  "os"
)

func main() {
  var session drmaa.Session

  // the command to submit is required, its arguments are optional
  if len(os.Args) < 2 {
    fmt.Println("Usage: dsub command [args...]")
    return
  }
  args := os.Args[2:]

  if err := session.Init("session=dsubdrmaasession"); err == nil {
    jt, _ := session.AllocateJobTemplate()
    jt.SetRemoteCommand(os.Args[1])
    jt.SetArgs(args)
    id, _ := session.RunJob(&jt)
    fmt.Println(id)
    // free the memory allocated for the job template
    session.DeleteJobTemplate(&jt)
    session.Exit()
  } else {
    fmt.Println("Session init error: ", err)
  }
}
 

First you need to import the “drmaa” package, then you have to create a DRMAA session. The DRMAA way to do it is to call “Init(“sessionname”)”, where the session name can be the name of a previously created session or an empty string (“”), which lets Grid Engine create a random session name. In order to simplify that I also provide “drmaa.MakeSession()”, which returns a new initialized session (and can be used with the “:=” operator). In order to submit a job you need to allocate a job template, whose fields should be set accordingly. A job template can be obtained from the session method AllocateJobTemplate(). Internally there are some malloc calls, hence you must call the session method DeleteJobTemplate() in order not to create memory leaks. Running a job successfully requires at least a command / application name, which is set with the SetRemoteCommand() job template method. Arguments for the application are set by the SetArgs() method, which requires a slice of strings as its single argument. A session should always be closed with an Exit() call.

Just two examples where method overloading would have been nice: JobTemplate.SetArgs([]string) requires converting a single argument into a string slice. Hence an additional JobTemplate.SetArgs(string) could make life easier. Instead I implemented JobTemplate.SetArg(string). The other example is the Init(string) method: a Session.Init() could replace Session.Init(“”). Instead I'm using MakeSession().

An example program that prints out the status of a job (whose id is given as the first parameter) is shown below: dstat.go.

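package main

import (
  "drmaa"
  "fmt"
  "os"
)

func main() {
  var session drmaa.Session
  args := os.Args

  // check the arguments before a session is opened
  if len(args) <= 1 {
    fmt.Println("Usage: dstat jobid")
    return
  }

  if err := session.Init("session=dsubdrmaasession"); err != nil {
    fmt.Println(err)
    return
  }
  // close the session when main returns
  defer session.Exit()

  ps, err := session.JobPs(args[1])
  if err != nil {
    fmt.Println("Error during job status request: ", err)
    return
  }
  fmt.Println("Job is in state: ", ps)
}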
 

...and the most important thing: You can download the Go DRMAA library here!

Don't forget to source the $SGE_ROOT/default/common/settings.sh file and to set the LD_LIBRARY_PATH to $SGE_ROOT/lib/lx-amd64/ before you start programming in Go with the drmaa package and Univa Grid Engine.

 

The package documentation is here: Go DRMAA Language Binding Documentation

Last DRMAA Meeting about DRMAA 2 C Language Binding (2012-08-23)

The public comment period is over, and therefore we had the final meeting last night, where we went over the last issues of the upcoming DRMAAv2 C language binding. So the official final C spec will be published soon - stay tuned!

(2012-03-11) Exploiting the Univa Grid Engine JGDI API Java Interface - A JGDI Hello World Example

Doing evil things! The JGDI API is an unsupported (but available) Java interface for accessing and controlling Grid Engine. It is unsupported because there are no guarantees that the interface will not change over time or that the methods work as expected. Much of the underlying code is automatically generated during the compile process.


DRMAA Version 2 - Released Publications

The DRMAA 2 standard (final draft 8) is open for public review until August 2nd, 2011, 23:59 CET.

The PDF document can be downloaded here.

Update (2012-01-27): Final DRMAA version 2 standard published

Please download the final DRMAA2 publication here.

DRMAA Version 2

The Distributed Resource Management Application API (DRMAA) is a programming library for job submission and job management. It abstracts from vendor specific job submission methods and provides a common interface in order to enable applications to run on different distributed resource management systems (DRMs; e.g. Univa Grid Engine or Condor) without refactoring the code. It simplifies job workflow creation and bulk job submissions through standardized C function calls and Java methods. While DRMAA version 1.0 (and its predecessors like 0.95 and 0.5) has only job submission, job control and event synchronization on its agenda, the new open DRMAAv2 standard also includes host and job monitoring functionalities through a new monitoring session concept.

DRMAA - The Distributed Resource Management API

DRMAA or DRMAA version 1 is a highly adopted standard for accessing DRMs (distributed resource management systems). The DRMAA IDL API 1.0 specification can be found here and here.

The following systems are currently supported (this list is not complete; if you know of other implementations, please write me an e-mail):

  • UGE - Univa Grid Engine
  • OGE - Oracle Grid Engine
  • SGE - Sun Grid Engine and SGE 6.2u5 Open Source forks
  • Condor
  • LSF
  • PBS
  • Torque
  • IBM Tivoli LoadLeveler
  • SLURM
  • Unicore
  • Kerrighed
  • EGEE Framework
  • XGridDRMAA

Most of the systems support the C API interface, but there are others which have additionally a Java API interface or a Python Interface (Python DRMAA API spec can be found here).