Add Any Resource to Gridware Cluster Scheduler / Open Cluster Scheduler (fka Sun Grid Engine) (2025-12-22)

GCS/OCS provides a flexible resource management system that has been refined over many years. Integrating new resources—whether quantum computers, tape libraries, cloud instances, hardware emulators, or something else entirely—follows a consistent pattern. Here's how it works.

Step 1: Define the Resource Type

Resources are defined in the complex configuration. To see what's already configured:

qconf -sc

To add new resources, you have two options:

Interactive editing:

qconf -mc

File-based editing:

qconf -sc > myresources
# edit myresources
qconf -Mc ./myresources

Each resource definition requires several attributes (an example entry follows the list):

  • name and shortcut – how you refer to it
  • type – MEMORY, INT, DOUBLE, BOOL, RSMAP, etc.
  • requestable – whether jobs can request it via qsub -l
  • default – typically 0; note that a non-zero default gets implicitly added to every job request
  • consumable – whether the scheduler should decrement the available amount when scheduling jobs
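
For example, a consumable integer resource for a hypothetical quantum processing unit could be defined with an entry like the following (classic column layout as printed by qconf -sc; newer releases may show additional columns):

#name   shortcut   type   relop   requestable   consumable   default   urgency
qpu     qpu        INT    <=      YES           YES          0         0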

Step 2: Initialize Resource Values

This step is optional in general, but required for consumable resources with limited capacity, such as GPUs or licenses. The key decision is at which level to initialize the value (via the complex_values field):

Global level – resource is available cluster-wide, independent of where the job runs:

qconf -me global

Host level – resource is attached to a specific execution host:

qconf -me <hostname>

Queue instance level – finer granularity than host; useful when different queues on the same host have different resource allocations:

qconf -mq <queuename>

In each case, add your resource to the complex_values field. For queue-level configuration, you can either set a value for all instances or use bracket notation to target specific queue instances on specific hosts.
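
As a sketch with hypothetical host and queue names, the corresponding complex_values lines could look like this (host configuration first, then a queue configuration using bracket notation for a per-host override):

# host configuration (qconf -me node01)
complex_values        qpu=2

# queue configuration (qconf -mq all.q): one per instance, four on node02
complex_values        qpu=1,[node02=qpu=4]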

Check resource availability with qstat -F <resourcename>.

Step 3: Request the Resource

Jobs request resources using the -l option:

qsub -l qpu=1 myjob.sh

To inspect what was requested and granted:

qstat -j <jobid>

The job environment receives granted resources as variables with the SGE_HGR_ (hard) or SGE_SGR_ (soft) prefix. Your job script can use these to configure itself appropriately.
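
Continuing the hypothetical qpu example, a job script could pick up the granted amount like this (the variable name is derived from the complex name; verify the exact spelling with env inside a test job):

#!/bin/sh
# Print the hard-granted qpu value, if any (hypothetical "qpu" consumable).
echo "Granted QPUs: ${SGE_HGR_qpu:-none}"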

Further Integration: Prolog and Epilog Scripts

For many resources, there's additional work required to actually access or release them. This is handled through prolog and epilog scripts configured in the queue configuration.

Prolog scripts typically (see the sketch after this list):

  • Read the SGE_HGR_* environment variables (which contain the hard-granted resources)
  • Interact with the resource (acquire locks, initialize hardware, provision cloud instances)
  • Write additional environment variables to the job's environment file
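
A minimal prolog sketch along these lines, assuming the hypothetical qpu resource and that the job's environment file lives in $SGE_JOB_SPOOL_DIR (verify both for your release):

#!/bin/sh
# Hypothetical prolog for a "qpu" consumable.
QPUS="${SGE_HGR_qpu:-0}"

# Acquire or initialize the resource here (locking, hardware setup, provisioning).
# ...

# Pass additional settings to the job by appending to its environment file.
echo "MY_QPU_COUNT=$QPUS" >> "$SGE_JOB_SPOOL_DIR/environment"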

Epilog scripts typically (again, a sketch follows the list):

  • Release or de-register the resource
  • Collect consumption metrics (power usage, costs, runtime statistics)
  • Update the resource usage file so consumption data appears in qacct -j output
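
And a matching epilog sketch; the name and format of the per-job resource usage file vary between releases, so that step is only indicated as a comment:

#!/bin/sh
# Hypothetical epilog for a "qpu" consumable.
QPUS="${SGE_HGR_qpu:-0}"

# Release or de-register the resource here.
# ...

# Collect consumption metrics and, if desired, write them to the per-job
# resource usage file so they appear in qacct -j (file name and format are
# release-specific; consult the documentation).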

This last point integrates directly with the extensible JSONL accounting system, making resource consumption visible and auditable.

The Starter Method

Another frequently used hook is the starter_method. Configure it in the queue with a path to a script or binary. Instead of executing the job directly, the execution daemon calls this starter script with the job script and its arguments as parameters.

From there, it's up to your starter script to:

  • Prepare the environment
  • Perform any setup required for the resource
  • Launch the actual job as a child process

Since the starter script inherits all job-related environment variables (including granted resources) and receives the job command line as arguments, it has everything needed to wrap job execution transparently.

A common use case: running jobs inside containers without requiring users to modify their job scripts. The starter method can invoke Apptainer, Enroot, or any other container runtime, passing the job script as the container's entry point. Users submit normal jobs; the infrastructure handles containerization.
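
As a sketch, such a starter method can be very small (the image path is an assumption; the execution daemon passes the job script and its arguments as the script's arguments):

#!/bin/sh
# Hypothetical starter_method: run every job inside an Apptainer container.
# "$@" holds the job script and its arguments as passed by the execution daemon.
IMAGE=/opt/images/jobenv.sif
exec apptainer exec "$IMAGE" "$@"

Point the queue's starter_method attribute at this script via qconf -mq <queuename>.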

Load Sensors

Load sensors provide dynamic information about resource states—temperatures, shared storage capacity, network utilization, or any other metric that changes over time.

They're scripts or binaries that the execution daemon triggers at configurable intervals. Each invocation reports the current value of one or more resources. These values can be provided whether or not the resource is initialized with a static value at the global, host, or queue instance level.
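
A minimal load sensor sketch following the classic protocol: the execution daemon writes a line to the sensor's stdin whenever it wants a report and sends quit at shutdown; each report is wrapped in begin/end lines. The scratch_free resource and the df-based measurement are illustrative assumptions:

#!/bin/sh
# Hypothetical load sensor reporting free scratch space in megabytes.
HOST=$(hostname)
while read -r input; do
  if [ "$input" = "quit" ]; then
    exit 0
  fi
  FREE=$(df -m /scratch | awk 'NR==2 {print $4}')
  echo "begin"
  echo "$HOST:scratch_free:$FREE"
  echo "end"
done

Register the sensor via the load_sensor parameter of the global or host configuration (qconf -mconf).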

Load sensor data is useful for:

  • Users checking cluster status before submitting jobs
  • Administrators monitoring infrastructure health
  • The scheduler detecting overloaded resources via load thresholds and putting the affected queues into an alarm state

With these interfaces—complex configuration, prolog/epilog scripts, the starter method, and load sensors—you can integrate any resource, independent of what that resource actually is.