Programming APIs

Modern CPUs Need Smarter Scheduling (2025-10-24)

Current and future CPUs offer more than just cores and sockets. HPC schedulers require information about L3, L2 caches, cache groups (like AMD's CCX), and power/efficiency cores to allow scenarios like "run only on power cores" or "pack application inside one core complex" to get the best and most stable performance results.

Gridware Cluster Scheduler 9.1.0 introduces topology-aware scheduling that understands these complex hardware layouts. The scheduler uses a simple notation to represent CPU topology:

Letter	Meaning
N	NUMA node
S	Socket
X	L3 cache / Chiplet
Y	L2 cache group
C	Power core
E	Efficiency core
T	Thread (SMT)

Example topology string: NSXCTTCTTCTTCTTEYEEYEE

This represents a socket with power cores (dual-threaded) and efficiency cores (grouped by L2 cache), all sharing an L3 cache—exactly the kind of heterogeneous design found in modern Intel, AMD, and ARM processors.

Check out Ernst Bablick's comprehensive blog post , where the engineer behind this feature explains the implementation with real-world examples from NVIDIA DGX Spark, Intel i9-14900HX, and AMD EPYC Zen5 systems.

Update 1: Please also check the follow up article Compute Nodes with Heterogeneous Topology in Gridware Cluster Scheduler

Update 2: Please also check the second follow up article How Binding Order and Range Shape Job Placement in Gridware Cluster Scheduler

MCP Servers Bring AI Reasoning to HPC Cluster Scheduling (2025-04-18)

The Model Context Protocol (MCP) defines a powerful and simple protocol for AI applications to interact with external tools. Its key benefit is modularity: any tool implementing an MCP server can be plugged into any AI application supporting MCP, allowing for seamless integration of specialized context and even control of external software.

Why is This Useful for HPC?

High Performance Computing (HPC) workload managers—like the venerable Open Cluster Scheduler (formerly Grid Engine)—must accommodate an incredible range of use cases. From desktops running a few sequential jobs, to massive clusters processing millions of jobs daily, requirements and configurations can look dramatically different. Admins often become translators, bridging the gap between complex user requests and the equally complex world of scheduler configurations, with diagnostics (like “why aren't my jobs running?”) rarely having a single, straightforward answer.

MCP Server for Open Cluster Scheduler

I just implemented an example MCP server for Open Cluster Scheduler for research and academic use. It enables an AI, like Claude, to translate high-level, natural language questions into low-level cluster operations and return formatted, accurate, and actionable information, potentially combined with other sources. The AI can answer "Why aren't my jobs running?" with cluster context, analyze configurations, generate job overviews, or spot patterns in job submissions.

Getting Started

The MCP server for Open Cluster Scheduler is easy to try in a containerized, simulated cluster. All code and documentation are available on GitHub.

Quickstart on MacOS with ARM / M chip

Simulate a Cluster

 git clone https://github.com/hpc-gridware/go-clusterscheduler.git
 cd go-clusterscheduler
 make build && make simulate

This launches a container with a fake (but realistic!) test cluster. Test with qhost and qsub.

Launch MCP Server
```
 cd cmd/clusterscheduler-mcp
 go build
 ./clusterscheduler-mcp
```
The MCP server opens port 8888, forwarded to your host.

Connect Your AI

Use npx mcp-remote as a wrapper to connect tools like Claude (details in their docs).

Example config for Claude:

{
  "mcpServers": {
    "gridware": {
      "command": "npx",
      "args": ["mcp-remote", "http://localhost:8888/sse"]
    }
  }
}

Interact
Once connected, use natural language to ask about job status, configuration advice, or request analyses.

Example Use Cases

Diagnose why jobs are not running with AI-aided reasoning.
Generate overviews of current and past jobs in tabular or summary form.
Clone and analyze entire cluster setups, creating a digital twin for testing or support.

Screenshots

Running jobs overview table:

Running Jobs Overview

Why is my job not running?

Conclusion

MCP bridges the gap between complex system tooling and natural language AI assistants, making powerful cluster analysis, debugging, and administration accessible even to non-experts. With a growing ecosystem of thousands of MCP servers, and modern AI’s impressive reasoning abilities, troubleshooting and tuning HPC environments has never been easier. In the world of HPC clusters: Why not building a digital twin of your HPC clusters workload manager for getting getting insights, try changes, and test the impact with your LLMs help? The building blocks are all available.

For more details, see the project documentation. Contributions and feedback are always welcome!

Running Simple PyTorch Jobs on Gridware Cluster Scheduler: Generating 1000 Cat Images (2025-01-26)

The most valuable digital assets in human history are undoubtedly cat pictures :). While they once dominated the web, today they have become a focal point for AI-generated imagery. Therefore, why not demonstrate using the Gridware Cluster Scheduler by generating 1000 high-resolution cat images?

Key components include:

Gridware Cluster Scheduler installation
Optional use of Open Cluster Scheduler
Python
PyTorch
Stable Diffusion
GPUs

System Setup

After installing the Gridware Cluster Scheduler and enabling GPU support according to the Admin Guide, NVIDIA_GPUS resources become automatically available. The GPU integration sets NVIDIA-related environment variables as selected by the scheduler for each job, and ensuring accurate per-job GPU accounting.

When using the free Open Cluster Scheduler, additional manual configuration is necessary (see this blog post).

In this example, I assume PyTorch and the required Python libraries are available on the nodes. To minimize machine dependencies, containers can be used, although this is not the focus of the blog. HPC-compatible container runtime environments such as Apptainer (formerly Singularity) and Podman can be used with the Gridware Cluster Scheduler out-of-the-box.

Submitting Cat Image Creation Jobs

To create cat images, we need input prompts, lots of them. They can be stored in an array like this:

prompts = [
    "A playful cat stretches its ears while rubbing against soft fur...",
    "A curious cat leaps gracefully to catch a tiny mouse...",
    # ... (other prompts)
]

To generate many prompts, we can automate their creation:

import random

def generate_pro_prompts():
    cameras = [
        "Canon EOS R5", "Sony α1", "Nikon Z9", "Phase One XT",
        "Hasselblad X2D", "Fujifilm GFX100 II", "Leica M11",
        "Pentax 645Z", "Panasonic S1R", "Olympus OM-1"
    ]

    lenses = [
        "85mm f/1.2", "24-70mm f/2.8", "100mm Macro f/2.8",
        "400mm f/2.8", "50mm f/1.0", "12-24mm f/4",
        "135mm f/1.8", "Tilt-Shift 24mm f/3.5", "8-15mm Fisheye",
        "70-200mm f/2.8"
    ]

    styles = [
        "award-winning wildlife", "editorial cover", "cinematic still",
        "fine art gallery", "commercial product", "documentary",
        "fashion editorial", "scientific macro", "sports action",
        "architectural interior"
    ]

    lighting = [
        "golden hour backlight", "softbox Rembrandt", "dappled forest",
        "blue hour ambient", "studio butterfly", "silhouette contrast",
        "LED ring light", "candlelit warm", "neon urban", "moonlit"
    ]

    cats = [
        "Maine Coon", "Siamese", "Bengal", "Sphynx", "Ragdoll",
        "British Shorthair", "Abyssinian", "Persian", "Scottish Fold",
        "Norwegian Forest"
    ]

    actions = [
        "mid-leap", "grooming", "playing", "sleeping", "stretching",
        "climbing", "hunting", "yawning", "curious gaze", "pouncing"
    ]

    # Generate random combinations
    prompts = []
    for _ in range(1000):
        style = random.choice(styles)
        cat = random.choice(cats)
        action = random.choice(actions)
        camera = random.choice(cameras)
        lens = random.choice(lenses)
        light = random.choice(lighting)
        aperture = round(1.4 + random.random() * 7.6, 1)  # Random f-stop between f/1.4 and f/9
        shutter_speed = 1 / random.randint(1, 4000)  # Random shutter speed
        iso = random.choice([100, 200, 400, 800, 1600, 3200, 6400])  # Random ISO

        prompt = (f"{style} photo of {cat} cat {action} | "
                  f"{camera} with {lens} | {light} lighting | "
                  f"f/{aperture} {shutter_speed:.0f}s ISO {iso} | "
                  f"Technical excellence award composition")

        prompts.append(prompt)

    return prompts

Once we have an array of prompts, we can divide them into chunks so that each batch job generates more than one image. This approach can be further optimized later (not part of this article).

The critical aspect here is job submission. The prompts are supplied to the job as environment variables using the -v switch. The -q switch assigns the gpu.q to the job, which is assumed to be configured across the GPU nodes. The -l switch selects 1 GPU device per job, ensuring the GPU integration sets the appropriate NVIDIA environment variables so that jobs don’t conflict. This is accomplished through the new qgpu utility called in the gpu.q's prolog. For the Open Cluster Scheduler, you need to configure this manually. That means configuring the NVIDIA_GPUS resource as RSMAP with the GPU ID range, while the job itself must convert the SGE_HGR_NVIDIA_GPUS environment variable set in the job to the NVIDIA environment variables (see this blog post).

The job itself, executed on the compute node, is the python3 script located on the shared storage.

Here's the submit.py script:

def main():
    prompts = generate_pro_prompts()

    parser = argparse.ArgumentParser()
    parser.add_argument('--chunk-size', type=int, required=True,
                       help='Number of prompts per job')
    args = parser.parse_args()

    # Split prompts into chunks
    chunks = [prompts[i:i+args.chunk_size]
             for i in range(0, len(prompts), args.chunk_size)]

    # Submit jobs
    for i, chunk in enumerate(chunks):
        try:
            # Serialize chunk to JSON for safe transmission
            prompts_json = json.dumps(chunk)

            subprocess.run(
                [
                    "qsub", "-j", "y", "-b", "y",
                    "-v", f"INPUT_PROMPT={prompts_json}",
                    "-q", "gpu.q",
                    "-l", "NVIDIA_GPUS=1",
                    "python3", "/home/nvidia/genai1/run.py"
                ],
                check=True
            )
            print(f"Submitted job {i+1} with {len(chunk)} prompts")
        except subprocess.CalledProcessError as e:
            print(f"Failed to submit job {i+1}. Error: {e}")

if __name__ == "__main__":
    main()

The run.py script carries out the intensive process using the cat-related prompts. Note, there is a lot of potential for obvious improvements - which is not the focus of the article. It retrieves prompts from the INPUT_PROMPT environment variable together with the unique JOB_ID assigned by the Gridware Cluster Scheduler, exactly like in SGE. The images are stored in the job's working directory, which is assumed to be shared. Learn more about the DiffusionPipeline utilizing Stable Diffusion at HuggingFace.

import logging
import os
import json
from diffusers import DiffusionPipeline
import torch
from PIL import Image

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def generate_image(prompt: str, output_path: str):
    try:
        logging.info("Loading model...")
        pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            use_safetensors=True,
            variant="fp16"
        )
        pipe.to("cuda")
    except Exception as e:
        logging.error("Pipeline init error: %s", str(e))
        return

    try:
        logging.info("Generating: %s", prompt)
        image = pipe(prompt=prompt).images[0]
        image.save(output_path)
        logging.info("Saved: %s", output_path)
    except Exception as e:
        logging.error("Generation error: %s", str(e))

if __name__ == "__main__":
    # Get job context
    job_id = os.getenv("JOB_ID", "unknown_job")

    # Parse prompts from JSON
    try:
        prompt_list = json.loads(os.getenv("INPUT_PROMPT", "[]"))
    except json.JSONDecodeError:
        logging.error("Invalid prompt format!")
        prompt_list = []

    # Process all prompts in chunk
    for idx, prompt in enumerate(prompt_list):
        if not prompt.strip():
            continue  # skip empty prompts

        output_path = f"image_{job_id}_{idx}.png"
        generate_image(prompt, output_path)

To submit the AI inference jobs to the system, simply execute:

python3 submit.py --chunk-size 5

This results in 200 queued jobs, each capable of creating 5 images.

Supervising AI Job Execution

When correctly configured, the Gridware Cluster Scheduler executes jobs on available GPUs at the appropriate time. You can check the status with qstat or get job-related information with qstat -j <jobid>. After a while, you will have your 1000 cat images. Once completed, you can also view per-job GPU usage in qacct -j <jobid>, with metrics like nvidia_energy_consumed and nvidia_power_usage_avg, as well as the submission command line with the prompts, for example:

qacct -j 2900
==============================================================
qname                              gpu.q
hostname                           XXXXXX.server.com
group                              nvidia
owner                              nvidia
project                            NONE
department                         defaultdepartment
jobname                            python3
jobnumber                          2900
taskid                             undefined
pe_taskid                          NONE
account                            sge
priority                           0
qsub_time                          2025-01-26 11:02:12.650259
submit_cmd_line                    qsub -j y -b y -v 'INPUT_PROMPT=["award-winning wildlife photo of Maine Coon cat mid-leap | Canon EOS R5 with 85mm f/1.2 | golden hour backlight lighting | f/2.4 1/1500s ISO 300 | Technical excellence award composition", "editorial cover photo of Siamese cat grooming | Sony u03b11 with 24-70mm f/2.8 | softbox Rembrandt lighting | f/2.9 1/2000s ISO 400 | Technical excellence award composition", "cinematic still photo of Bengal cat playing | Nikon Z9 with 100mm Macro f/2.8 | dappled forest lighting | f/3.4 1/2500s ISO 500 | Technical excellence award composition", "fine art gallery photo of Sphynx cat sleeping | Phase One XT with 400mm f/2.8 | blue hour ambient lighting | f/3.9 1/3000s ISO 600 | Technical excellence award composition", "commercial product photo of Ragdoll cat stretching | Hasselblad X2D with 50mm f/1.0 | studio butterfly lighting | f/4.4 1/3500s ISO 100 | Technical excellence award composition"]' -q gpu.q -l NVIDIA_GPUS=1 python3 /home/nvidia/genai1/run.py
start_time                         2025-01-26 13:13:06.172316
end_time                           2025-01-26 13:15:19.362819
granted_pe                         NONE
slots                              1
failed                             0
exit_status                        0
ru_wallclock                       133
ru_utime                           135.470
ru_stime                           1.319
ru_maxrss                          7142080
ru_ixrss                           0
ru_ismrss                          0
ru_idrss                           0
ru_isrss                           0
ru_minflt                          74728
ru_majflt                          0
ru_nswap                           0
ru_inblock                         0
ru_oublock                         14208
ru_msgsnd                          0
ru_msgrcv                          0
ru_nsignals                        0
ru_nvcsw                           2488
ru_nivcsw                          846
wallclock                          134.238
cpu                                136.789
mem                                6587.407
io                                 0.096
iow                                0.000
maxvmem                            61648994304
maxrss                             7112032256
arid                               undefined
nvidia_energy_consumed             26058.000
nvidia_power_usage_avg             158.000
nvidia_power_usage_max             158.000
nvidia_power_usage_min             0.000
nvidia_max_gpu_memory_used         0.000
nvidia_sm_clock_avg                1980.000
nvidia_sm_clock_max                1980.000
nvidia_sm_clock_min                1980.000
nvidia_mem_clock_avg               2619.000
nvidia_mem_clock_max               2619.000
nvidia_mem_clock_min               2619.000
nvidia_sm_utilization_avg          0.000
nvidia_sm_utilization_max          0.000
nvidia_sm_utilization_min          0.000
nvidia_mem_utilization_avg         0.000
nvidia_mem_utilization_max         0.000
nvidia_mem_utilization_min         0.000
nvidia_pcie_rx_bandwidth_avg       0.000
nvidia_pcie_rx_bandwidth_max       0.000
nvidia_pcie_rx_bandwidth_min       0.000
nvidia_pcie_tx_bandwidth_avg       0.000
nvidia_pcie_tx_bandwidth_max       0.000
nvidia_pcie_tx_bandwidth_min       0.000
nvidia_single_bit_ecc_count        0.000
nvidia_double_bit_ecc_count        0.000
nvidia_pcie_replay_warning_count   0.000
nvidia_critical_xid_errors         0.000
nvidia_slowdown_thermal_count      0.000
nvidia_slowdown_power_count        0.000
nvidia_slowdown_sync_boost         0.000

This example demonstrates how easily you can utilize the Gridware Cluster Scheduler to keep your GPUs engaged continuously, regardless of the frameworks, models, or input data your cluster users employ for single-GPU jobs, multi-GPU jobs, or multi-node multi-GPU jobs using MPI. Using containers or just using applications available on the command line.

Below you can find some output examples...enjoy :)

Enhancing wfl with Large Language Models: Researching the Power of GPT for (HPC/AI) Job Workflows (2023-05-14)

In today's world of ever-evolving technology, the need for efficient and intelligent job workflows is more important than ever. With the advent of large language models (LLMs) like GPT, we can now leverage the power of AI to create powerful and sophisticated job workflows. In this blog post, I'll explore how I've enhanced wfl, a versatile workflow library for Go, by integrating LLMs like OpenAI's GPT. I'll dive into three exciting use cases: job error analysis, job output transformation, and job template generation.

wfl: A Brief Overview

wfl is a flexible workflow Go library designed to simplify the process of creating and managing job workflows. It's built on top of the DRMAA2 standard and supports various backends like Docker, Kubernetes, Google Batch, and more. With wfl, users can create complex workflows with ease, focusing on the tasks at hand rather than the intricacies of job management.

Enhancing the go wfl library with LLMs

I've started to enhance wfl by integrating large language models (OpenAI), enabling users to harness the power of AI to enhance their job workflows even further. By utilizing GPT's natural language understanding capabilities, we can now create more intelligent and adaptable workflows that can be tailored to specific requirements and challenges. This not only expands the possibilities for research but also increases the efficiency of job workflows. These workflows can span various domains, including AI workflows and HPC workflows.

It's important to note that this is a first research step in applying LLMs to wfl, and I expect to find new and exciting possibilities build upon these three basic use cases.

1. Job Error Analysis

Errors are inevitable in any job workflow, but understanding and resolving them can be a time-consuming and tedious process. With the integration of LLMs in wfl, we can now analyze job errors more efficiently and intelligently. By applying a prompt to an error message, the LLM can provide a detailed explanation of the error and even suggest possible solutions. This can significantly reduce the time spent on debugging and increase overall productivity.

2. Job Output Transformation

Sometimes, the raw output of a job can be difficult to understand or may require further processing to extract valuable insights. With LLMs, we can now apply a prompt to the output of a job, transforming it into a more understandable or usable format. For example, we can use a prompt to translate the output into a different language, summarize it, or extract specific information. This can save time and effort while enabling to extract maximum value from my job outputs.

3. Job Template Generation

Creating job templates can be a complex and time-consuming process, especially when dealing with intricate workflows. With the integration of LLMs in wfl, we can now also generate job templates based on textual descriptions, making the process more intuitive and efficient. By providing a prompt describing the desired job, the LLM can generate a suitable job template that can be analyzed, customized and executed. This not only simplifies the job creation process but also enables users to explore new possibilities and ideas more quickly. Please use this with caution and do not execute generated job templates without additional security verifications! I guess such verifications when automated could be a whole new research area.

Conclusion

The integration of large language models like GPT into wfl has opened up a world of possibilities for job workflows in HPC, AI, and enterprise jobs. By leveraging the power of AI, you can now create more intelligent and adaptable workflows that can address specific challenges and requirements. Further use cases, like building whole job flows upon the building blocks, needs to be investigated

To learn more about wfl and how to harness the power of LLMs for your job workflows, visit the WFL GitHub repository: https://github.com/dgruber/wfl/

A basic sample application demonstrating the Go interface is here: https://github.com/dgruber/wfl/tree/master/examples/llm_openai

Streamline Your Machine Learning Workflows with the wfl Go Library (2023-04-10)

wfl is a versatile and user-friendly Go library designed to simplify the management and execution of workflows. In this blog post, we will explore how wfl can be employed for various machine learning tasks, including sentiment analysis, artificial neural networks (ANNs), hyperparameter tuning, and convolutional neural networks (CNNs).

Sentiment Analysis with wfl

Sentiment analysis is the process of determining the sentiment or emotion expressed in a piece of text. One straightforward approach for sentiment analysis is to use textblob, like in wfl's sentiment analysis example. This example demonstrates how to use a simple sentiment analysis model to classify marketing phrases as positive or negative.

Sentiment Analysis with ANNs and wfl

To enhance the performance of sentiment analysis, we can use an artificial neural network (ANN) trained on a dataset of labeled marketing phrases. The sentiment analysis with Keras example demonstrates how to implement, train, and deploy an ANN using the Keras library and wfl. This example shows how to use the wfl library to manage the training workflow and execute the ANN model for sentiment analysis.

Hyperparameter Tuning with wfl

Hyperparameter tuning is the process of finding the best set of hyperparameters for a machine learning model. wfl's hyperparameter tuning example demonstrates how to perform hyperparameter tuning using the Keras library and wfl. This example shows how to use wfl to manage and execute a grid search to find the optimal hyperparameters, such as learning rate, batch size, and epochs, for a deep learning model.

Hyperparameter Tuning with Google Batch

As hyperparameter tuning can be computationally expensive, it can be beneficial to distribute the workload across multiple machines. wfl's hyperparameter tuning with Google Batch example demonstrates how to use the Google Batch implementation of the DRMAA2 interface to distribute the hyperparameter tuning workload on Google Cloud, significantly accelerating the process and reducing the computational burden on your local machine.

Convolutional Neural Networks with Cifar10 and wfl

Convolutional neural networks (CNNs) are a type of deep learning model particularly suited for image classification tasks. The CNN with Cifar10 example demonstrates how to use wfl to manage the training workflow of a CNN using the Cifar10 dataset. This example shows how to use wfl to train a CNN on Google Cloud and store the trained model in a Google Cloud Storage bucket.

In conclusion, wfl is a nice tool for streamlining your machine learning workflows, from simple sentiment analysis to intricate CNNs. It offers an easy-to-use interface for managing and executing machine learning tasks, and its integration with cloud platforms like Google Cloud enables you to scale your workloads effortlessly. Give wfl a try and see how it can enhance your machine learning projects! Any feedback - especially things which don't work or are hard to figure out is welcome. Please use github for opening an issue.

Creating Kubernetes based UberCloud HPC Application Clusters using Containers (2020-12-16)

This article was originally published at UberCloud's Blog

UberCloud provides all necessary automation for integrating cloud based self-service HPC application portals in enterprise environments. Due to the significant differences inside the IT landscape of large organizations we are continuously challenged providing the necessary flexibility within our own solution stack. Hence we continuously evaluate newly adopted tools and technologies about the readiness to interact with UberCloud’s technology.

Recent adoption of Kubernetes not just for enterprise workloads but for all sorts of applications, be it on the edge, AI, or for HPC, has strong focus. We created hundreds of Kubernetes clusters on various cloud providers hosting HPC applications like Ansys, Comsol, OpenFoam and many more. We can deploy fully configured HPC clusters which are dedicated to an engineer on GKE or AKS within minutes. We can also use EKS but the deployment time of an EKS cluster is at this point in time significantly slower as on the other platforms (around 3x times). While GKE is excellent and has been my favorite service (due to its deployment speed and its good APIs), AKS has begun in the last months to get really strong. Many features which are relevant for us (like using spot instances and placement groups) and its speed in terms of AKS cluster allocation time (now even almost one minute faster as GKE - 3:30 min. from 0 to a fully configured AKS cluster) have been implement on Azure. Great!

When managing HPC applications in dedicated Kubernetes clusters one challenge remains: How to manage fleets of clusters distributed across multiple clouds? At UberCloud we are building simple tools which takes HPC application start requests and turns it into a fully automated cluster creation and configuration job. One very popular way is to put this logic behind self-service portals where the user selects an application he/she want to use. One other way is creating those HPC applications based on events in workflows, CI/CD and gitops pipelines. Use cases are automated application testing, running automated compute tasks, cloud bursting, infrastructure as code integrations, and more. To support those tasks we’ve developed a container which turns an application and infrastructure description into a managed Kubernetes cluster independent of where the job runs and on which cloud provider and regions the cluster is created.

Due to the flexibility of containers UberCloud’s cluster creation container can be used in almost all modern environments which support containers. We are using it as a Kubernetes job and as CI/CD tasks. When the job is finished, the engineer has access to a fully configured HPC desktop including a HPC cluster attached.

Another integration we just tested is Argo. Argo is a popular workflow engine targeted and working on top of Kubernetes. We have a test installation running on GKE. As the UberCloud HPC cluster creation is fully wrapped inside a container running a single binary the configuration required to integrate it in a Argo workflow is very minimal.

After the workflow (task) is finished, the engineer get’s automatically access to the freshly created remote visualization application running on a newly allocated AKS cluster spanning two node pools having GUI based remote Linux desktop access setup.

The overall AKS cluster creation, configuration, and deployment of our services and HPC application containers just took a couple of minutes. A solution targeted for IT organizations challenged by the task of rolling out HPC applications for their engineers but required to work with modern cloud based technologies.

Paper about Virtual Grid Engine (2019/12/12)

Over in Japan researchers working on the K supercomputer recently published a paper about a middleware software which they call VGE - Virtual Grid Engine. It allows them to run bio informatics software in Grid Engine style on K. Certainly a great read!

The Singularity is :%s/near/here/g - Using wfl and Go drmaa2 with Singularity (2018-12-04)

The Singularity containerization software is re-written in Go. That’s a great message and caught immediately my interest :-). For a short intro into Singularity there are many sources including wikipedia or the official documentation.

A typical problem using Docker in a broader context, i.e. including integrating it in workload managers, is that it runs as a daemon. That means requests to start a Docker container end up in processes which are children of the daemon and not of the original process which initiated the container creation request. Workload managers are built around managing and supervising child processes and have reliable mechanisms for supervising processes which detach itself from the parent processes but they don’t work very well with Docker daemon. That’s why most of the serious workload managers do container orchestration using a different mechanism (like using directly runc or rkt or even handling cgroups and namespaces directly). Supervision includes resource usage measurements and limitations which sometimes exceed Dockers native capabilities. Singularity brings back the expected behavior for workload managers. But there are more reasons people use Singularity which I don’t cover in this article.

This article is about using Singularity in Go applications using wfl and drmaa2. Since I wanted to give Singularity a try by myself I started to add support for it in the drmaa2 go binding by implementing a job tracker. It was a very short journey to get the first container running as most of the necessary functionality was already implemented.

If you are Go programmer and write Singularity workflows either for testing, performance measurements, or for running simulations I would be more than happy to get feedback about future improvements.

DRMAA2 is a standard for job management which includes basic primitives like start, stop, pause, resume. That’s enough to built workflow management applications or even simpler abstractions like wfl around it.

In order to make use of the implemented DRMAA2 calls for Singularity in your Go application you need to include it along with the drmaa2interface dependency


    "github.com/dgruber/drmaa2interface"
    "github.com/dgruber/drmaa2os"

The next step is to create a DRMAA2 job session. As the underlying implementation need to make the session names and job IDs persistent you need to provide a path to a file which is used for that purposes.


    sm, err := drmaa2os.NewSingularitySessionManager(filepath.Join(os.TempDir(), "jobs.db"))
    if err != nil {
        panic(err)
    }

Note that when Singularity is not installed the call fails.

Creating a job session is like defined in the standard. As it fails when it exists (also defined in the standard) we need to do:


   js, err := sm.CreateJobSession("jobsession", "")
    if err != nil {
        js, err = sm.OpenJobSession("jobsession")
        if err != nil {
            panic(err)
        }
    }

The JobTemplate is the key for specifying the details of the job and the container.


       jt := drmaa2interface.JobTemplate{
            RemoteCommand: "/bin/sleep",
            Args:          []string{"600"},
            JobCategory:   "shub://GodloveD/lolcow",
            OutputPath:    "/dev/stdout",
            ErrorPath:     "/dev/stderr",
        }

The RemoteCommand is the application executed within the container and therefore must be accessible inside of the container. Args is the list of arguments for the file. The JobCategory is mandatory and specifies the image to be used. Here we are downloading the image from Singularity Hub (shub). In order to see some output in the shell we need to specify the OutputPath pointing to /dev/stdout. We should do the same for the ErrorPath otherwise we will miss some error messages printed by the applications running within the container.

In order to give Singularity more parameters (which are of course not part of the standard) we can use the extensions:


    jt.ExtensionList = map[string]string{
        "debug": "true",
        "pid":   "true",
    }

Please check out the drmaa2os singularity job tracker sources for a complete list of evaluated parameters (it might be the case some are missing). debug is a global parameter (before exec) and pid is an exec specific parameter. Since they are boolean parameters it does not matter if they are set to true or just an empty string. Only when set to false or FALSE they will be ignored.

This JobTemplate will result in a process managed by the Go drmaa2 implementation started with

singularty —debug exec —pid shub://GodloveD/lolcow /bin/sleep 600

RunJob() finally creates the process and returns immediately .


    job, err := js.RunJob(jt)
    if err != nil {
        panic(err)
    }

The job ID can be used to perform action on the container. It can be suspended:


    err = job.Suspend()

and resumed


    err = job.Resume()

or killed


    err := job.Terminate()

The implementation sends the necessary signals to the process group.

You can wait until it is terminated (with a timeout or drmaa2interface.InfiniteTime).


    err = job.WaitTerminated(time.Second * 240)

Don’t forget to close the JobSession at the end and destroy it properly.

    
    js.Close()
    sm.DestroyJobSession("jobsession")

While that is easy it still requires lot’s of code when writing real applications. That’s what we have wfl for. wfl now also includes the Singularity support as it is based on the drmaa2 interfaces.

Since last Sunday wfl can make use of default JobTemplate’s which allows to set a container image and parameters like OutputPath and ErrorPath for each subsequent job run. This further reduces the LOC to be written by avoiding RunT().

For starting a Singularity container you need a workflow based on the new SingularityContext which accepts a DefaultTemplate.


    var template = drmaa2interface.JobTemplate{
        JobCategory: "shub://GodloveD/lolcow",
        OutputPath:  "/dev/stdout",
        ErrorPath:   "/dev/stderr",
    }


    flow := wfl.NewWorkflow(wfl.NewSingularityContextByCfg(wfl.SingularityConfig{DefaultTemplate: template}))

Now flow let’s allow you manage your Singularity containers. For running 100 containers executing the sleep binary in parallel and re-executing any failed containers up 3 times you can do like that.


    job := flow.Run("/bin/sleep", "15").
        Resubmit(99).
        Synchronize().
        RetryAnyFailed(3)

For more capabilities of wfl please check out also the other examples at https://github.com/dgruber/wfl.

Of course you can mix backends in your applications, i.e. you can manage OS processes, k8s jobs, Docker containers, and Singularity containers within the same application.

Any feedback welcome.

Creating Processes, Docker Containers, Kubernetes Jobs, and Cloud Foundry Tasks with wfl (2018-08-26)

There are many workload managers running in the data centers of the world. Some of them take care of running weather forecasts, end-of-day reports, reservoir simulations, or simulating functionalities before they are physically built-in in computer chips. Others are taking care of running business applications and services. Finally we have data-based workload types aggregating data and trying to discover new information through supervised and unsupervised learning methods.

All of them have a need to run jobs. Jobs are one of the old concepts in computer science which exists since the 50s of the last century. A job is a unit of work which needs to be executed. That execution needs to be controlled which is typically called job control. Examples of implementations of job control systems are command line shells (like bash, zsh, ...), but also job schedulers (like Grid Engine, SLURM, torque, pbs, LSF) and the new generation container orchestrators (like Kubernetes), or even application platforms (like Cloud Foundry).

All of the mentioned systems have different characteristics if, when, and how jobs are executed. Also how jobs are described is different from system to system. The most generic systems treat binaries or shell scripts as jobs. Container orchestrators executes applications stored in container images. Application platforms are building the executed containers in a transparent, completely automatic, easy to manage, and very secure way.

In large enterprises you will not find a single system handling all kind of workloads. That is unfortunate since it is really something which should be achieved. But the world is moving - new workload types arise, some of them are removed, and again others will stay. New workload types have different characteristics by nature. Around a decade ago I've seen the first appearance of map/reduce workloads with the advent of Hadoop. Data centers tried to figure out how to run this kind of workload side-by-side with their traditional batch jobs.

Later containers became incredibly popular and the need for managing them at scale was approached by a new class of workload managers called container orchestrators. Meanwhile also batch schedulers are enhanced to handle containers and map/reduce systems can do containers. Container orchestrators are now looking for more. They are moving up the stack were application platforms like Cloud Foundry have learned and provide functionalities since years.

But when looking at those systems you will also find lot's of similarities. As mentioned in the beginning they all have the need for running a job for different reasons. A job has a start and a defined end. It can fail to be accepted by the system, fail during execution, or succeed. Often they need more context, like environment variables to be set, job scheduling hints to be provided, or data to be staged.

Standards

If you are reading this blog you are most likely aware that there are standards already defined handling all these tasks. Even been much longer around than most of the new systems they are still applicable for new systems.

In some free time I adapted the DRMAA2 standard to Go and worked on an implementation for different systems. Currently it is working on processes, Docker, Kubernetes, and Cloud Foundry. The Grid Engine implementation is based on the C library, the others are pure Go.

When working on the implementation I wanted to write quickly test applications so I've build a much simpler abstraction around it, which I called wfl.

With wfl it makes even more fun to write applications which need to start processes, Docker containers, Kubernetes jobs, or Cloud Foundry tasks. You can even mix backend types in the same application or point to different Kubernetes clusters within one application. It is kept as simple as possible but still grants access to details of a DRMAA2 job when required.

There are a few examples in the wfl repository demonstrating its usage. A general introduction you find in the README on github.

Native DRMAA2 Compatible Go Interface (2016-04-16)

Creating an API which follows the DRMAA2 standard is not an easy task. Unlike the DRMAA(1) standard it defines more then twice as many methods (around 100) which all needs to follow a compatible behavior.

In order to simplify the DRMAA2 API creation for the Go programming language I distilled the interfaces and structs and created a new repository for it. The difference to the Go wrapper for the DRMAA2 C API is that it defines no functionality and is pure Go.

Over the long term it would be ideal to have multiple DRMAA2 compliant Go APIs, like for the operating system, containers, etc. for which one create and test DRMAA2 applications and finally let them run on the compute cluster or in the cloud without changing the application.

Reducing a Mapped Univa Grid Engine Accounting File with Glow (2016/03/06)

This article describes how the Univa Grid Engine accounting file can be processed with Glow.

The Univa Grid Engine cluster scheduler spits out a text file containing information about the resource usage of past jobs running in your cluster: The accounting file.

The accounting file is a very easy to process ASCII text file which consists of one accounting record per line which contains detailed job information about resource usage of a job. Typically it is used with the qacct utility to get useful usage aggregations or a pretty printed output for a job’s resource usage. After one job is finished a line (or when configured, one line per host for parallel jobs) is added to the file by the qmaster process.

Glow is a very interesting MapReduce framework written in Go which is very easy to use. It has two modes, a simple mode where the tasks are executed in goroutines and a more scalable approach where one application can be executed on multiple machines.

In order to process the accounting file with Glow you need to make your MapReduce application aware about the location of the accounting file. Once you source the Grid Engine’s settings.sh file the path can be derived from the default environment variables pointing to the Grid Engine installation directory.

accountingFileName := fmt.Sprintf(„%s/%s/common/accounting“, os.Getenv("SGE_ROOT"), os.Getenv("SGE_CELL"))

In order to read text files with Glow you just need to point the file location and give the amount of shards to use.

flow.New().TextFile(
    accountingFileName, 4,
)

This returns a Glow Dataset structure.

Since the accounting file has some comments at the beginning we need to filter them out. All of them have a common prefix: #. Filtering lines with Glow is simple. It expects a function which determines for a given line if it should be filtered or not. This function can be anonymous:

flow.New().TextFile(
    accountingFileName, 4,
).Filter(func(line string) bool {
    // filter out all comments
    return !strings.HasPrefix(line, "#")
})

The filter returns also a Glow Dataset which again has the same set of methods defined.

Now we can apply our first Mapper to the resulting Dataset. The Mapper should convert the accounting line to an accounting entry data structure. A while ago I’ve written a parser which you can find here The ParseLine() function is less then 30 lines of Go code thanks to Go’s reflection API. If reflection should be used or not for those things is certainly a different debate. It certainly has its power but makes applications harder to debug. Glow makes heavily use of it (but again this makes debugging harder).

In order to compile your code you need Glow and the UGE parsing API (please create issues on github when you find some).

go get github.com/chrislusf/glow
go get github.com/chrislusf/glow/flow
go get github.com/dgruber/ugego

After importing glow and ugego in your Go source the ParseLine() function can be used in a new Mapper which is called on the filtered Dataset. This mapper takes a string and a channel which has the type of the parsed accounting Entry data structure from the ugego package. In this channel the parsed Entries are written out.

Map(func(line string, ch chan accounting.Entry) {
        if e, err := accounting.ParseLine([]byte(line)); err == nil {
                ch <- e
        }
    }

Here additional filtering can be injected in a very simple way. When you are only interested in jobs which run in a specific time window you can check for it with e.StartTime and e.EndTime (both are both time.Time values) and only write the good ones back in the channel.

Finally the elements in the channel have to be reduced. That is the hard part. Reducing accounting entries would mean aggregating usage values but that does not make sense for elements like the return value. For simplicity I’m just accounting CPU time here, which is used CPU seconds.

Reduce(func(x accounting.Entry, y accounting.Entry) accounting.Entry {
    y.CPU += x.CPU
    return y
})

Reducing other elements or build counters would require to build better suited data structures. Go’s embedding can be used for that. In any case overflows needs to be avoided.

Finally we print our results on the reduced accounting entry:

Map(func(out accounting.Entry) {
    fmt.Println("Combined CPU times for all jobs.")
    fmt.Printf("%f\n", out.CPU)
}).Run()

The full program can be found here.

PS: Using grouped aggregations is not much harder. For example if you want to list the summed amount of consumed slots by job name you can create a flow.KeyValue and use ReduceByKey():


.Map(func(e accounting.Entry) flow.KeyValue {
		return flow.KeyValue{e.JobName, e.Slots}
}).ReduceByKey(func(s1, s2 int) int {
		return s1 + s2
}).Map(func(jobname string, slots int) {
		fmt.Printf("%s %d\n", jobname, slots)
}).Run()

PPS: Note that this can then be further simplified by


.Map(func(e accounting.Entry) (string, int) {
		return e.JobName, e.Slots
}).ReduceByKey(func(s1, s2 int) int {
		return s1 + s2
}).Map(func(jobname string, slots int) {
		fmt.Printf("%s %d\n", jobname, slots)
}).Run()

DRMAA2 Python wrapper created using SWIG (2016/01/02)

David Chin created a Python wrapper for Univa Grid Engine's DRMAA2 compatible C library for cluster monitoring, job and workflow management. It is available on his github account. Great!

Submitting Jobs on behalf of Others (2015-09-21)

Univa Grid Engine 8.3 is out since a while and even the first patch 8.3.1 is available for download. Lot's of great new features like the long awaited real preemption of jobs. Also a new Web Service API found its way in the product. It covers the complete functionality of Grid Engine so it is the perfect tool to integrate Grid Engine in your existing cluster management applications.

I don't want to go over all the new features and functionalities (consumable ranges, consumables requestable as soft requests, just to have some mentioned) now but I just want to highlight a little hidden feature which I'm pretty sure is overlooked but is amazingly useful. While my colleague Andre implemented the new Web Service API he needed a way to submit jobs on behalf of others. This is not an easy thing in Grid Engine since the user depends on an internal context which is setup at the very beginning of an application. Resetting that during an application run-time can't be done.

What was done is setting up new calls in the DRMAA library. Since DRMAA2 is out I'm pretty sure this functionality will find its way in DRMAA2 soon, too.

Those calls are drmaa_run_job_as(), drmaa_control_as(), and drmaa_run_bulk_jobs_as(). I know the drmaa prefix is not really compliant but the DRMAA1 standard is deprecated by OGF anyhow (since DRMAA2 is out).

The calls have an additional parameter a sudo structure which contains a user name, group name, user ID, and group ID. This can set to the user on behalf of which you want to run or control the job. In order to be successful the admin additionally must tell Grid Engine which user is allowed to submit jobs on behalf of whom. This is done in the sudomasters and sudoers user group lists. Once set (example: qconf -au service sudomasters) you can submit within one application jobs for difference users! This really opens the doors for lots of applications...

To test this I enhanced the Go DRMAA1 binding in order to support those calls. I created a special branch for it since other DRMAA C implementations don't have the call and I don't want to introduce incompatibility.

Here is a simple example of how to use it:

DRMAA2 Talk at HEPiX 2015 (2015-03-26)

Todays talk at HEPiX at Oxford University about the DRMAA2 standard. You can follow their live stream here.

Multi-clustering with Go, DRMAA2, and Grid Engine (2014-12-22)

The cluster monitoring and job submission API based on the DRMAA2 standard makes it easy to build cluster monitoring and job workflow submission applications. Univa Grid Engine supports this open standard since version 8.2.

Doing cluster monitoring is straight forward, all neccessary calls for listing (including filtering) jobs, queues, and hosts are available. Also job submission is easy and is well defined and documented.

But what when you have multiple clusters and you want to have access to all of them without constantly switching the environment (or even logging in to different hosts)?

Fortunatly the Go progamming language makes it incredible easy to build proxies for such cases. With just one line of code using the standard library you can encode and decode data structures from and to text representations, like XML or JSON. Another line of code sets up a web-server. So, going this route I ended up with a simple proxy which provides some DRMAA2 functionality for Grid Engine clusters accessible by another simple command line tool, I called uc. The whole implementation is based on my Go DRMAA2 API

If you are interested in this project, you can find it in my github repository (certainly lots of potential for improvements...;-)):

https://github.com/dgruber/ubercluster

Presentation from DRMAA2 Code Camp at OGF42 (2014-11-29)

See also the blog entry at www.wheregridenginelives.com.

Overview of DRMAA2 (2014-08-05)

DRMAA2 C API Webinar / Video Available for Download (2014-06-17)

A free webinar about the upcoming DRMAA2 C API implementation in Univa Grid Engine can be downloaded here.

DRMAA2 C API documentation / man pages

Preliminary DRMAA2 API documentation in pre-alpha version are linked here.

The DRMAA2 Tutorial - Introduction (1) (2013-10-05)

"Evolution is a process of creating patterns of increasing order" (Ray Kurzweil, The Singularity Is Near)

It is obvious that open standards are important in the software industry. They protect investments, they increase usage of interfaces, they decrease costs, they bring people with the same objectives together, and so on. With no or minimal changes the software can support multiple systems / even systems that will be built in the future. The knowledge and cooking recipes are usually widespread - you will find help/solutions in many different communities. Certainly there are many more aspects of open standards. Open standards are taking a major role in the exponential growth in many areas.

DRMAA2 (Distributed Resource Management Application API 2) is such an open standard. It is the successor of the wide-spread DRMAA (Distributed Resource Management and Application API). DRMAA is generally used for submitting jobs (or creating job workflows) into a compute cluster by using a cluster resource management system like Grid Engine (or Condor / PBS / Torque / LSF, …) for applications (like Mathematica, KNIME, …) or for users to build workflows.

DRMAA defines with its language bindings a set of functions for different programming languages. Those functions represent the least common denominator of specific functionalities of cluster schedulers like Grid Engine, PBS, and LSF.

Unlike DRMAA with DRMAA2 you can not only submit jobs, you can also get cluster related information, like getting host names, types and status information or insight about queues configured in the resource management system. You can also monitor jobs not submitted within the application. Overall it covers many more use cases than the old DRMAA standard. When we started several years ago at the Super Computing event in Hamburg with our kick-off meeting we took the results for a survey of DRMAA users as a starting point. Since that time lots of things where added and re-arranged.

One of my current projects is implementing the DRMAA2 into a Grid Engine API. A beta version of the library will be part of Univa Grid Engine 8.2. The target language of the first implementation is C. Support for other programming languages is planned, usually they are wrappers around the C library (using stuff like JNI for Java or cgo for Go/#golang). This article is the first of a series which I want to publish over the next weeks / months where I'm going to introduce the basic usage of the new programming API.

Compiling DRMAA2 Applications for Univa Grid Engine

Compiling DRMAA2 applications for Grid Engine is not much different than for DRMAA applications. The DRMAA2 library will be shipped like DRMAA in the $SGE_ROOT/lib/$ARCH (where $ARCH is lx-amd64 on 64bit Linux) directory, the header file is located in the $SGE_ROOT/include directory.

Your DRMAA2 application can be then compiled (if $SGE_ROOT/default/common/settings.sh is sourced; which is the case when you can do a qsub on command line) with:

gcc -I$SGE_ROOT/include -L$SGE_ROOT/lib/lx-amd64 <example.c> -ldrmaa2

and be started with

export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
./example

If you put the drmaa2.so in a local library path you don't need to export LD_LIBRARY_PATH of course but you need to take care of updating the library after a Grid Engine update.

A First Look at DRMAA2 Job Sessions

DRMAA2 comes with different types of sessions: job sessions, monitoring sessions, and reservation sessions. While job and monitoring sessions are a mandatory part of each DRMAA2 implemenation the reservation session is optional. The availability can be discovered during application run-time (which is part of a later tutorial).

The job session is similar to what the old DRMAA is with the difference that DRMAA2 sessions are persistent. In Univa Grid Engine the job session names is stored in the central qmaster component. This is particular useful when you have different processes which are "sharing" the same type of jobs (i.e. each process wants to track a specific set of jobs). Job sessions are user specific, i.e. different users can't share a session. This is implied by the rights management of Grid Engine. It disallows performing operations like suspend, resume or termination of jobs for other users. Within a job session you can submit jobs, control jobs, and monitor jobs. But only those jobs which are submitted within a job session can be controlled and monitored there. You can have multiple different job sessions open at the same time in one process, while for the monitoring session only one makes sense. With a monitoring session you can track the status and online usage of your own jobs, whether they are submitted in any job session, in DRMAA1, or on command line. Like in job sessions you also can access jobs finished during your application run-time. Within a monitoring session you can't submit or control jobs. If the Grid Engine administrative user opens a monitoring session, it get jobs from all users of the system. This makes the DRMAA2 API a good candidate for writing job monitoring GUIs.

Following code demonstrates how a new job session is created, opened, closed and destroyed in a DRMAA2 application. Creating means that the session is allocated in the Grid Engine qmaster process, opening means that such a persistent session is made available for the DRMAA2 application. Closing leads to communication with qmaster so that the library don't get anymore information about jobs running in this session, and finally destroying is the removal of the job session object on the qmaster. After running the below code the state on qmaster is like before since the session was destroyed. What you are missing is the open call because it is implicitly opened by the drmaa2_create_jsession() call.

In order to leave a session persistent the destroy call can be omitted. It is important to close the session before the appliciation exists because otherwise the Grid Engine master process will keep the underlaying communication connection open for a much longer time than needed (despite there is a timeout) until the Grid Engine master process figures out that the client connection died.

Using DRMAA2 lists and dictionaries

The C implementation of DRMAA2 comes with two higher level data structures: a list and a dictionary. While a dictionary maps a string to another string in a efficient way (which is used for setting resource limits for example), a list contains a collection of strings, job objects or other DRMAA2 specific data types. They are used to simplify creation and access of input and output values for some DRMAA2 functions.

The following code creates a dictionary, adds 2 different key/value pairs, changes the value of an already known key, retrieves the value of this key, checks if a specific key is part of the dict, deletes it and finally destroys the dictionary, i.e. frees it. Since the strings are not allocated the callback method (the second argument) is set to NULL, i.e. nothing needs to be freed when an element or the whole list is deleted.

When creating a list the type must be given as argument. In this example a list of strings is created, filled, the length is retrieved and finally all values are printed.

Following list types are defined in DRMAA2:

typedef enum drmaa2_listtype {
   DRMAA2_STRINGLIST       = 0,
   DRMAA2_JOBLIST          = 1,
   DRMAA2_QUEUEINFOLIST    = 2,
   DRMAA2_MACHINEINFOLIST  = 3,
   DRMAA2_SLOTINFOLIST     = 4,
   DRMAA2_RESERVATIONLIST  = 5
} drmaa2_listtype;

Querying the list of job sessions

All available job sessions can be requested with drmaa2_get_jsession_list(), which returns a list of type DRMAA2_STRINGLIST. This this can simply processed like explained above. The following code searches for a specific session, if it exists it opens it.

Gorge - A Go (#golang) Library for Accessing Grid Engine (2013-07-28)

Looks like the Go (#golang) programming language becomes popular for cluster management. Kamil offers a project called Gorge on github. Similar to my gestatus sub-package of the Go DRMAA implementation Gorge reads out job status information of Grid Engine jobs by parsing the qstat xml output. Not only job information is parsed it also has wrappers for qstat -pri / -ext /-urg for detailed queue status information. Additionally it contains functions for accessing Grid Engine's Arco DB. Thanks for sharing!

Go DRMAA Language Binding Source Code now Hosted on Github (2013-03-17)

Today I pushed the source code of the Go DRMAA language binding on a github repository: https://github.com/dgruber/drmaa

It now contains a sub-package called gestatus (Grid Engine status) which parses the qstat -xml -j under the hood and is therefore able to deliver almost all available information about a particular job. Hence gestatus only works for Grid Engine (tested with Univa Grid Engine) while the DRMAA library itself should be compatible also with other DRMs.

For a minimal gestatus example please have a look at the example file.

More Go DRMAA related examples you find in the Programming APIs section of this blog.

DRMAA Version 2 C Language Binding Specification Officially Released (2012-11-16)

The first language binding is now officially approved as OGF standard GFD-R-P.198. You can get the standard here.

Failure Handling in Go DRMAA (2012-11-11)

One thing not covered yet is the failure handling in Go DRMAA. The package documentation shows that an error is a pointer to an DRMAA Error type, which is a struct consisting of an error id (Id) and an human readable error message (Message). The type also fulfills the Go error interface (i.e. in Go terminology it's an “errorer”), which prints the error message, whenever the error.Error() method is called. That makes it simple to use in both relevant cases, when the program just has to log the error and when the program should react based on an expected error.

The error ids are derived from the DRMAA 1.0 IDL standard. The following list shows the Go DRMAA error ids:

"nil" in case of success
ErrorId
InternalError
DrmCommunicationFailure
AuthFailure
InvalidArgument
NoActiveSession
NoMemory
InvalidContactString
DefaultContactStringError
NoDefaultContactStringSelected
DrmsInitFailed
AlreadyActiveSession
DrmsExitError
InvalidAttributeFormat
InvalidAttributeValue
ConflictingAttributeValues
TryLater
DeniedByDrm
InvalidJob
ResumeInconsistentState
SuspendInconsistentState
HoldInconsistentState
ReleaseInconsistentState
ExitTimeout
NoRusage
NoMoreElements
NoErrno

The following program demonstrates the usage of error ids for control flow decisions. While there is just one session at a time allowed, a second session init() results in an error with the id AlreadyActiveSession. If such an error occurs during initialization, the session is usable (since it's already initialized) despite the error. In case an other error occurs during initialization, the program exits gracefully.

package main
 
import (
  "drmaa"
  "fmt"
  "os"
)
 
func main() {
  var session drmaa.Session
 
  session.Init("session=Init1")
  defer session.Exit()
 
  // ... 
 
  // opening a second session is illegal
  err := session.Init("session=Init2")
 
  // in case of an error check the error id for failure handling
  if err != nil {
    switch err.Id {
    case drmaa.AlreadyActiveSession:
      fmt.Println("We have already a session: ", err.Message)
      // let's continue by using the open session
    case drmaa.DrmsInitFailed:
      fmt.Println("Init failed: ", err.Message)
      os.Exit(1)
    default:
      fmt.Println("Unexpected error: ", err)
      os.Exit(2)
    }
  }
 
  contact, _ := drmaa.GetContact()
  fmt.Println("Session name: ", contact)
}

When running the program produces following output:

We have already a session: Initialization failed due to existing DRMAA session.
Session name: session=Init1

Setting Grid Engine Specific Job Submission Parameters and Fetching Job Usage Values in Go DRMAA (2012-11-04)

In order to continue the Go DRMAA API description series I present below a small program, which submits jobs with Univa Grid Engine specific submission parameters and reports all collected job usage values.

The most obvious difference between submission with qsub and with DRMAA is that a DRMAA job is expected to be a binary, while the default expectation of qsub is that the command name is a script. But both can submit scripts as well as binaries. For qsub the “-b y” parameter can be used in order to submit binary jobs. The binary itself is not transferred to the execution host, like job scripts are, they must exist in the path on the execution host. DRMAA jobs, which are job scripts can be submitted by setting the “-b n” parameter as job submission parameter. Then the job script is transferred by Grid Engine to the execution host, like submitting through qsub. Setting job submission parameters, which are not defined by the DRMAA standard is easy: They can be set with using the DRMAA standardized native specification, which is in Go the SetNativeSpecification() job template method. The job output is usually written in two files, the output file (“jobname”.o<jobno>) for stdout output and the error file (“jobname”.e<jobno>) for stderr output. In order to tell the system that all output should be in the output file, the SetJoinFiles(true) job template method can be called. In order to submit parallel jobs again the native specification has to be used and “-pe <pename> <slots>” has to be added. When having more parameters which are specific for a whole job class, the job class (which is a new feature in Univa Grid Engine 8.1) can be set in the native specification as well (like “-jc <classname>”). Finally the remote command, which points to a shell script in the current working directory is set.

After the job finished the exit status (in case the job fully ran) is printed, otherwise the signal which terminated the job is displayed. Finally a loop through all values of the resource map prints out the resource and the specific usage of this resource (like the resident segment size etc.).

package main
 
import (
  "drmaa"
  "fmt"
  "os"
)
 
func main() {
  session, err := drmaa.MakeSession()
  if err != nil {
    fmt.Println(err)
    return
  }
  defer session.Exit()
 
  jt, err := session.AllocateJobTemplate()
  if err != nil {
    fmt.Println(err)
    return
  }
 
  // stderr output of job is written to stdout output file
  jt.SetJoinFiles(true)
  // set jobs name for accounting and qstat
  jt.SetJobName("testJob")
  // set Grid Engine spefic submission parameters
  jt.SetNativeSpecification("-b n -pe mytestpe 4")
  wd, _ := os.Getwd()
  // set shell script to submit (requires "-b n" <- binary no)
  jt.SetRemoteCommand(wd + "/testjob.sh")
 
  // submit job
  id, err := session.RunJob(&amp;jt)
  if err != nil {
    fmt.Println("Error during job submission: ", err)
  }
 
  // wait until job finishs and get job information
  if ji, err := session.Wait(id, drmaa.TimeoutWaitForever); err == nil {
    if ji.HasExited() {
      fmt.Println("Job exited with exit status: ", ji.ExitStatus())
    }
    if ji.HasSignaled() {
      fmt.Println("Job was termintated through signal ", ji.TerminationSignal())
    }
    // report job usage
    fmt.Println("Job used following resources:")
    for resource, usage := range ji.ResourceUsage() {
      fmt.Println("Resource ", resource, " usage: ", usage)
    }
  }
}

Ultra-Fast Job Submission with Go DRMAA and Univa Grid Engine (2012-10-20)

In my last article I showed two basic examples how to use the Go DRMAA binding for simple job submission and job status checks. This time I want to demonstrate how easy it is to submit thousands of (possibly different) jobs into a Grid Engine system in a very fast way. Of course fast bulk job submission can also be done with the qsub -t switch, which allows to submit array jobs with one single submit, but then your job parameters and even the job command name must be the same for all jobs.

The example below consists of four different functions. The main() function creates a new DRMAA session by just calling drmaa.MakeSession(). The defer statement below puts the session.Exit() cleanup method on a stack, which is executed after the main function finishes. Finally, the submitJobs() function is called, which requires a reference to the session object as well as the amount of worker routines, which have to be spawned concurrently.

While playing around with my system (a dual core laptop, where all components of Univa Grid Engine 8.1 are running!), 512 workers was ideal for me in terms of performance: I was able to submit 1024 jobs within 1-2 seconds! Which is an average rate of 1-2 ms per job. An unfair comparison: Using a bash script with a single loop doing 1024 qsubs took between 18-19 seconds.

package main
 
import (
  "drmaa"
  "fmt"
  "runtime"
)
 
type jobTemplate struct {
  jobname string
  arg     string
}
 
func createJobs(session *drmaa.Session, jobs chan<- jobTemplate, amount int) {
  for i := 0; i < amount; i++ {
    var jt jobTemplate
    jt.jobname = "sleep"
    jt.arg = "10"
    jobs <- jt
  }
  close(jobs)
}
 
func submitJob(session *drmaa.Session, jobs <-chan jobTemplate, done chan<- bool) {
  // as long as there are jobs to submit, do so
  for job := range jobs {
    if djt, err := session.AllocateJobTemplate(); err == nil {
      djt.SetRemoteCommand(job.jobname)
      djt.SetArg(job.arg)
      session.RunJob(&amp;djt)
      session.DeleteJobTemplate(&amp;djt)
    }
  }
  done <- true
}
 
func submitJobs(session *drmaa.Session, workers int) {
  jobsChannel := make(chan jobTemplate)
  done := make(chan bool)
  // create 1024 jobs
  go createJobs(session, jobsChannel, 1024)
  // start worker
  for i := 0; i < workers; i++ {
    go submitJob(session, jobsChannel, done)
  }
  // block until all workers have finished
  for i := 0; i < workers; i++ {
    <-done
  }
}
 
func main() {
  const workers int = 512
 
  runtime.GOMAXPROCS(4)
 
  session, err := drmaa.MakeSession()
  if err != nil {
    fmt.Println(err)
    return
  }
  defer session.Exit()
 
  submitJobs(&amp;session, workers)
}

The submitJobs() function creates two channels, one for sending the jobs from the job generation function (createJobs()) to the workers, and one done channel for signaling that no more jobs are left and a hence a worker finished. Then the coroutine createJobs() is started asynchronously. It simply loops 1024 times generating structs, which are filled with the job-name and the job parameter. In an real-world example this function would parse a file, which contains a job and parameter list, for example. Finally, the workers are started as coroutines/or go routines. As long as there are jobs in the jobs channel, they are processing them by allocating a DRMAA job template and submit the job template to the Grid Engine master process. When no job is left the worker sending a done message and quit. When submitJobs() was able to collect all done messages it returns to the main function.

UPDATE (2012/10/27): A simple single-threaded C DRMAA application needs between 3-4 seconds to submit 1024 jobs in my environment. A simple single-threaded Java DRMAA application about 4 seconds. Of course a multi-threaded C DRMAA application could reach a similar performance, but it would be much more sophisticate, especially when having a single source for all jobs.

Go DRMAA Language Binding Update (2012-10-15)

I just uploaded the slightly enhanced version of the Google Go DRMAA language binding. It offers now the missing JobInfo access methods and the JobTemplate SetArg(), which accepts a simple string.

Documentation can be found here: http://www.gridengine.eu/DRMAA/GO/drmaa.html

And the linux_amd64 library, with README and Documentation here: http://www.gridengine.eu/DRMAA/GO/GODRMAA_02.zip

Google Go and Univa Grid Engine: A Go DRMAA Language Binding Implementation - Part 1 (2012-10-07)

Google's Go looks for me is like the most interesting programming language published in recent years (of course there are others, like “Julia”, but they are more domain specific (technical computing)). It is compiled, has automatic garbage collection, multiple return values, offers pointers, methods, interfaces, built-in maps, slices, range expressions, coroutines and channels, and its easy to use. There are a lot of articles which discusses all the features, so there is no need to go in further details. The amount of keywords is small (about 20), which makes it clearly arranged. But there is one little thing which I really miss: Method overloading. Having for each different parameter set a different method could IMHO lead to a method name zoo. There is a work-around to use the default interface{} (which is implemented by each Go type) which could be combined with an ellipsis (in Go: “…”) and type checking, but then the signature is hard to read. It would be easier if this would just be possible out-of-the box.

In order to gain more expertise in Go I developed a DRMAA (1.0) language binding in some free time. The current release is not well tested and there are still some details which I want to re-work, so I would describe the version as a first “Proof of Concept”. Nevertheless you can download it and use it for free without any bigger restrictions. There are no guarantees and I don't take any responsibility for anything bad which could happen (like that an asteroid destroys your computer while using ;). But when you have any issues feel free to contact me (Daniel) at This e-mail address is being protected from spambots. You need JavaScript enabled to view it. , so that I can repair the defect or we can at least discuss about the issue. I developed it using Univa Grid Engine 8.1. (feel free to download the 48-core limited free demo version), but it should work with any other Grid Engine fork which offers the C DRMAA 1.0 binding. Like the Java DRMAA or Python binding it uses internally the C library, hence in theory it should work also for any other job scheduler which offers a C DRMAA 1.0 binding. If you are trying that with success or without success, please let me know.

How to use the Go DRMAA binding

The Go DRMAA binding consists of the Go DRMAA library and a small piece of documentation. For a more detailed description of the functions please checkout the DRMAA spec or the Grid Engine DRMAA man pages. Internally the Go lib needs to access a C DRMAA 1.0 library, hence you must set the LD_LIBRARY_PATH to the right location. For Univa Grid Engine it is $SGE_ROOT/lib/lx-amd64. Additionally you need to source the GE settings.sh (or settings.csh) file before. Then your Go programs can be compiled and started. Here is a small example of a command which submits a binary program with one argument to the job scheduler. I called it “dsub” for drmaa submission (similar POSIX standardized submission client name “qsub”).

package main
 
import ("drmaa"
        "fmt"
        "os")
 
func main() {
   var session drmaa.Session
   args := os.Args[2:]
   
   if err := session.Init("session=dsubdrmaasession"); err == nil {
      jt, _ := session.AllocateJobTemplate()
      jt.SetRemoteCommand(os.Args[1])
      jt.SetArgs(args)
      id, _ := session.RunJob(&amp;jt)
      fmt.Println(id)
      session.DeleteJobTemplate(&amp;jt)
      session.Exit()
   } else {
      fmt.Println("Session init error: ", err)
   }
}

First you need to import the “drmaa” package then you have to create a DRMAA session. The DRMAA way to do it is to call “Init(“sessionname”)” where session name can be the name of a previously created session or an empty string (“”) which lets Grid Engine create a random session name. In order to simply that I also provide a “drmaa.MakeSession()” which returns a new initialized session back (which can be used with the “:=” operator). In order to submit a job you need to allocate a job template, which fields should be set accordingly. A job template can be get from the session method AllocateJobTemplate(). Internally there are some malloc calls hence you must call the session method DeleteJobTemplate() in order not to generate memory leaks. In order to run a job successfully it requires at least to provide a command / application name. This is done with the SetRemoteCommand() job template method. Arguments for the application are set by SetArgs() method, which requires a slice of strings as sing argument. A session should always be closed with an Exit() call. Just two examples where method overloading would had been nice: Session.SetArgs([]string) requires to convert a single argument into a string slice. Hence an additional Session.SetArgs(string) could make life easier. Instead I implemented Session.SetArg(string). The other example is the Init(string) method: A Session.Init() could replace Session.Init(“”). Instead I'm using the MakeSession().
An example program that prints out the status of a job (which id is given as the first parameter) is shown below: dstat.go.

package main
 
import ("drmaa"
        "fmt"
        "os")
 
func main() {
   var session drmaa.Session
   args := os.Args
 
 
   if err := session.Init("session=dsubdrmaasession"); err != nil {
      fmt.Println(err)
      return
   }
 
  if len(args) <= 1 {
    fmt.Println("Usage: dstat jobid")
    return
  }
 
   ps, _ := session.JobPs(args[1])
   fmt.Println("Job is in state: ", ps)
 
   session.Exit()
}

...and the most important thing: You can download the Go DRMAA library here!
Don't forget to source the $SGE_ROOT/default/common/settings.sh file and set the LD_LIBRARY_PATH to $SGE_ROOT/lib/lx-amd64/ before you are starting programming in Go with the drmaa pacakge and Univa Grid Engine

The package documentation is here: Go DRMAA Language Binding Documentation

Last DRMAA Meeting about DRMAA 2 C Language Binding (2012-08-23)

The public comment period was over and therefore we had yesterday night the final meeting where we went over the last issues of the upcoming DRMAAv2 C language binding. So the official final C spec will be published soon - stay tuned!

(2012-03-11) Exploiting the Univa Grid Engine JGDI API Java Interface - A JGDI Hello World Example

Doing evil things! The JGDI API is an unsupported (but available) Java interface for accessing and controlling Grid Engine. It is unsupported because there are no guarantees that the interface will change over time and that the methods are working like expected. Much of the underlaying code is automatically generated during the compile process.

DRMAA Version 2 - Released Publications

The DRMAA 2 standard (final draft 8) is until August 2nd, 2011, 23:59 CET open for public review.

The PDF document can be downloaded here.

Update (2012-01-27): Final DRMAA version 2 standard published

Please download the final DRMAA2 publication here.

DRMAA Version 2

The Distributed Resource Management Application API (DRMAA) is a programming library for job submission and job management. It abstracts about vendor specific job submission methods and provides a common interface in order to enable applications to run on different distributed management systems (DRMs; e.g. Univa Grid Engine or Condor) without refactoring the code. It simplifies job workflow creation and bulk job submissions through standardized C function calls and Java methods. While DRMAA version 1.0 (and its predecessors like 0.95 and 0.5) has only job submission, job control and event synchronizing on its agenda, the new open DRMAAv2 standard also includes host and job monitoring functionalities though a new monitoring session concept.

DRMAA - The Distributed Resource Management API

DRMAA or DRMAA version 1 is a highly adopted standard for accessing DRMs (distributed resource management systems). The DRMAA IDL API 1.0 specification can be found here and here.

Following systems are currently supported (this list is not complete, if you know other implementations, please write me a mail):

UGE - Univa Grid Engine
OGE - Oracle Grid Engine
SGE - Sun Grid Engine and SGE 6.2u5 Open Source forks
Condor
LSF
PBS
Torque
IBM Tivoli LoadLeveler
SLURM
Unicore
Kerrighed
EGEE Framework
XGridDRMAA

Most of the systems support the C API interface, but there are others which have additionally a Java API interface or a Python Interface (Python DRMAA API spec can be found here).

Nav view search

Navigation

Search

Modern CPUs Need Smarter Scheduling (2025-10-24)

MCP Servers Bring AI Reasoning to HPC Cluster Scheduling (2025-04-18)

Why is This Useful for HPC?

MCP Server for Open Cluster Scheduler

Getting Started

Quickstart on MacOS with ARM / M chip

Example Use Cases

Screenshots

Conclusion

Running Simple PyTorch Jobs on Gridware Cluster Scheduler: Generating 1000 Cat Images (2025-01-26)

System Setup

Submitting Cat Image Creation Jobs

Supervising AI Job Execution

Enhancing wfl with Large Language Models: Researching the Power of GPT for (HPC/AI) Job Workflows (2023-05-14)

wfl: A Brief Overview

Enhancing the go wfl library with LLMs

1. Job Error Analysis

2. Job Output Transformation

3. Job Template Generation

Conclusion

Streamline Your Machine Learning Workflows with the wfl Go Library (2023-04-10)

Sentiment Analysis with wfl

Sentiment Analysis with ANNs and wfl

Hyperparameter Tuning with wfl

Hyperparameter Tuning with Google Batch

Convolutional Neural Networks with Cifar10 and wfl

Creating Kubernetes based UberCloud HPC Application Clusters using Containers (2020-12-16)

Paper about Virtual Grid Engine (2019/12/12)

The Singularity is :%s/near/here/g - Using wfl and Go drmaa2 with Singularity (2018-12-04)

Creating Processes, Docker Containers, Kubernetes Jobs, and Cloud Foundry Tasks with wfl (2018-08-26)

Standards

Native DRMAA2 Compatible Go Interface (2016-04-16)

Reducing a Mapped Univa Grid Engine Accounting File with Glow (2016/03/06)

DRMAA2 Python wrapper created using SWIG (2016/01/02)

Submitting Jobs on behalf of Others (2015-09-21)

DRMAA2 Talk at HEPiX 2015 (2015-03-26)

Multi-clustering with Go, DRMAA2, and Grid Engine (2014-12-22)

Presentation from DRMAA2 Code Camp at OGF42 (2014-11-29)

Overview of DRMAA2 (2014-08-05)

DRMAA2 C API Webinar / Video Available for Download (2014-06-17)

DRMAA2 C API documentation / man pages

The DRMAA2 Tutorial - Introduction (1) (2013-10-05)

Compiling DRMAA2 Applications for Univa Grid Engine

A First Look at DRMAA2 Job Sessions

Using DRMAA2 lists and dictionaries

Querying the list of job sessions

Gorge - A Go (#golang) Library for Accessing Grid Engine (2013-07-28)

Go DRMAA Language Binding Source Code now Hosted on Github (2013-03-17)

DRMAA Version 2 C Language Binding Specification Officially Released (2012-11-16)

Failure Handling in Go DRMAA (2012-11-11)

Setting Grid Engine Specific Job Submission Parameters and Fetching Job Usage Values in Go DRMAA (2012-11-04)

Ultra-Fast Job Submission with Go DRMAA and Univa Grid Engine (2012-10-20)

Go DRMAA Language Binding Update (2012-10-15)

Google Go and Univa Grid Engine: A Go DRMAA Language Binding Implementation - Part 1 (2012-10-07)

How to use the Go DRMAA binding

...and the most important thing: You can download the Go DRMAA library here! Don't forget to source the $SGE_ROOT/default/common/settings.sh file and set the LD_LIBRARY_PATH to $SGE_ROOT/lib/lx-amd64/ before you are starting programming in Go with the drmaa pacakge and Univa Grid Engine

The package documentation is here: Go DRMAA Language Binding Documentation

Last DRMAA Meeting about DRMAA 2 C Language Binding (2012-08-23)

(2012-03-11) Exploiting the Univa Grid Engine JGDI API Java Interface - A JGDI Hello World Example

DRMAA Version 2 - Released Publications

Update (2012-01-27): Final DRMAA version 2 standard published

DRMAA Version 2

DRMAA - The Distributed Resource Management API

...and the most important thing: You can download the Go DRMAA library here!
Don't forget to source the $SGE_ROOT/default/common/settings.sh file and set the LD_LIBRARY_PATH to $SGE_ROOT/lib/lx-amd64/ before you are starting programming in Go with the drmaa pacakge and Univa Grid Engine