Creating Processes, Docker Containers, Kubernetes Jobs, and Cloud Foundry Tasks with wfl (2018-08-26)

There are many workload managers running in the data centers of the world. Some of them take care of running weather forecasts, end-of-day reports, reservoir simulations, or simulations of chip functionality before it is physically built into silicon. Others run business applications and services. Finally, there are data-centric workloads that aggregate data and try to discover new information through supervised and unsupervised learning methods.

All of them need to run jobs. The job is one of the oldest concepts in computer science, dating back to the 1950s. A job is a unit of work which needs to be executed, and controlling that execution is typically called job control. Job control systems include command line shells (like bash, zsh, ...), job schedulers (like Grid Engine, SLURM, Torque, PBS, LSF), the new generation of container orchestrators (like Kubernetes), and even application platforms (like Cloud Foundry).

All of these systems have different characteristics regarding if, when, and how jobs are executed. How jobs are described also differs from system to system. The most generic systems treat binaries or shell scripts as jobs. Container orchestrators execute applications stored in container images. Application platforms build the containers they execute in a transparent, fully automated, easy-to-manage, and very secure way.

In large enterprises you will not find a single system handling all kinds of workloads. That is unfortunate, since it is something worth striving for. But the world keeps moving: new workload types arise, some disappear, and others stay. New workload types naturally have different characteristics. Around a decade ago I saw the first map/reduce workloads appear with the advent of Hadoop, and data centers had to figure out how to run them side by side with their traditional batch jobs.

Later containers became incredibly popular, and the need to manage them at scale was addressed by a new class of workload managers called container orchestrators. Meanwhile, batch schedulers have been enhanced to handle containers, and map/reduce systems can run them as well. Container orchestrators are now looking for more: they are moving up the stack into territory where application platforms like Cloud Foundry have been learning and providing functionality for years.

But when looking at those systems you will also find lots of similarities. As mentioned in the beginning, they all need to run jobs, for different reasons. A job has a start and a defined end. It can fail to be accepted by the system, fail during execution, or succeed. Often jobs need more context: environment variables to be set, job scheduling hints to be provided, or data to be staged.

Standards

If you are reading this blog you are most likely aware that there are standards already defined for handling all of these tasks. Even though they have been around much longer than most of the new systems, they are still applicable to them.

In some free time I adapted the DRMAA2 standard to Go and worked on an implementation for different systems. Currently it works with processes, Docker, Kubernetes, and Cloud Foundry. The Grid Engine implementation is based on the C library; the others are pure Go.
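To give an impression of the Go API, here is a rough sketch of submitting a job through a DRMAA2 job session, following the drmaa2os and drmaa2interface repositories as I currently read them; constructor names like NewDefaultSessionManager and the exact setup details may differ between versions.

```go
package main

import (
	"fmt"

	"github.com/dgruber/drmaa2interface"
	"github.com/dgruber/drmaa2os"
)

func main() {
	// Session manager backed by OS processes; the argument is a path
	// to the local job database the implementation keeps.
	sm, err := drmaa2os.NewDefaultSessionManager("jobs.db")
	if err != nil {
		panic(err)
	}

	js, err := sm.CreateJobSession("testsession", "")
	if err != nil {
		panic(err)
	}

	// The job template is the backend-independent job description
	// defined by the DRMAA2 standard.
	job, err := js.RunJob(drmaa2interface.JobTemplate{
		RemoteCommand:  "sleep",
		Args:           []string{"1"},
		JobEnvironment: map[string]string{"MY_SETTING": "value"},
	})
	if err != nil {
		panic(err)
	}

	// Block until the job has left the system, then inspect its state.
	job.WaitTerminated(drmaa2interface.InfiniteTime)
	fmt.Println(job.GetState())
}
```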

While working on the implementation I wanted to quickly write test applications, so I built a much simpler abstraction around it, which I called wfl.
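To give a first impression, here is a minimal example along the lines of the ones in the repository: it creates a workflow backed by plain OS processes and runs a single job.

```go
package main

import (
	"github.com/dgruber/wfl"
)

func main() {
	// The context selects the backend (processes, Docker,
	// Kubernetes, or Cloud Foundry); here plain OS processes.
	flow := wfl.NewWorkflow(wfl.NewProcessContext())

	// Run a command as a job and block until it has finished.
	flow.Run("echo", "hello wfl").Wait()
}
```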

wfl makes it even more fun to write applications which need to start processes, Docker containers, Kubernetes jobs, or Cloud Foundry tasks. You can even mix backend types in the same application or point to different Kubernetes clusters within one application. It is kept as simple as possible but still grants access to the details of a DRMAA2 job when required.
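As a sketch of how that looks, the following submits a job defined by a full DRMAA2 job template and then drops down to the DRMAA2 level for details; method names like RunT and JobInfo reflect my current reading of the wfl sources and may change.

```go
package main

import (
	"fmt"

	"github.com/dgruber/drmaa2interface"
	"github.com/dgruber/wfl"
)

func main() {
	flow := wfl.NewWorkflow(wfl.NewProcessContext())

	// RunT accepts a full DRMAA2 job template instead of just
	// a command and its arguments.
	job := flow.RunT(drmaa2interface.JobTemplate{
		RemoteCommand:  "sleep",
		Args:           []string{"1"},
		JobEnvironment: map[string]string{"MY_SETTING": "value"},
	}).Wait()

	if job.Success() {
		// Drop down to the DRMAA2 level when details are needed.
		fmt.Printf("job info: %v\n", job.JobInfo())
	}
}
```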

There are a few examples in the wfl repository demonstrating its usage. You can find a general introduction in the README on GitHub.