Written by David Liebert.
Background
I was recently asked to assist a large Australian company in automating a regular process in which they import a large amount of data from a set of proprietary databases. These databases can only be read by a Windows executable provided by a third party. My client was confident that the process could be batched into parallel streams and processed with serverless cloud technology, but they needed assistance with the containerisation and the container runtime.
It was clear from the start that, in principle, this process was a great candidate for a serverless solution, but a few challenges occurred to me straight away, especially when my client mentioned a preference for Google Cloud Run.
The challenges that struck me immediately were:
Although Google Cloud Run is very flexible about the kind of containers it can run (the marketing proudly states: “any language, any library, any binary”), it is seldom mentioned that only Linux containers are supported.
Cloud Run Services are suitable for HTTP-based and event-driven applications. I wasn’t originally certain how suitable Cloud Run would be for a batch workload.
I understood from my client that the data files would be available in a Google Cloud Storage bucket. I wasn’t sure how flexible Cloud Run could be about accessing persistent data and how easily it could distribute workload across multiple parallel batch jobs.
The obvious alternative I could have used was Kubernetes, which:
Natively supports Windows nodes
Supports parallel batch workloads (via Job resources)
Supports many types of persistent (and ephemeral) disks.
Despite Kubernetes being a valid option, I really liked the idea of using Cloud Run because it scales to zero, and I didn’t want to have to create and later destroy a Kubernetes cluster (with Windows nodes) just to solve an essentially simple problem. Also, I had little experience with Windows nodes and Windows containers on GKE and could not confirm from personal experience that they support all the Kubernetes features I might want to use. A little reading revealed that Windows GKE nodes lack quite a few features compared to Linux nodes; according to the documentation, for example, they do not support Spot VMs or preemptible instances.
Ultimately, I found it is possible to meet my client’s requirements with Cloud Run by using some relatively new features of Cloud Run and a bit of creativity. Below I will explain the different parts of the solution.
Windows Executables on Cloud Run
Although Cloud Run does not support Windows containers, it is possible to run Windows executables on Linux, at least in many cases. The Wine Project has been around since 1993 with the aim of providing a compatibility layer for running Windows software on Linux. Although it has a long history, Wine is still a work in progress, but most of the challenges are with running Windows GUI software. Windows text-mode terminal applications are far less of a challenge and generally run well.
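If you just want to check whether a particular executable behaves under Wine, a quick local test is usually enough before building anything. Here, example.exe stands in for whatever console-mode binary you need to run:

WINEDEBUG=-all wine example.exe
echo "Exit status: $?"

If the program runs and exits cleanly, it is a good candidate for the containerised approach below.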
Here’s an example of a Dockerfile for a Linux container with Wine installed:
FROM ubuntu:23.10

RUN apt-get update

RUN DEBIAN_FRONTEND=noninteractive TZ=Etc/UTC apt-get -y install tzdata && \
    dpkg --add-architecture i386 && \
    mkdir -pm755 /etc/apt/keyrings && \
    apt-get install -y wget && \
    wget -O /etc/apt/keyrings/winehq-archive.key https://dl.winehq.org/wine-builds/winehq.key && \
    wget -NP /etc/apt/sources.list.d/ https://dl.winehq.org/wine-builds/ubuntu/dists/mantic/winehq-mantic.sources && \
    apt-get update -y && \
    apt-get install --install-recommends -y winehq-stable

# This avoids ugly warning messages about having no display
ENV WINEDEBUG=-all

# Running an initial Windows command to initialise the ~/.wine configuration
RUN wine cmd /c exit

ENTRYPOINT ["/usr/bin/wine"]
I’m not going to go into detail about the above Dockerfile except to note that I chose to download and install the latest Wine release directly from the official site. All common Linux distributions include their own Wine package, which may be easier to install but possibly less up-to-date. The only slight complexity I noticed is that you need to enable the i386 architecture (dpkg --add-architecture i386 above), since 32-bit support may not be enabled by default.
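To sanity-check the image locally before deploying anything, something like the following should work (the image tag wine-batch is an arbitrary name, and example.exe again stands in for the real executable):

docker build -t wine-batch .
# The ENTRYPOINT is wine, so any arguments are passed straight to it
docker run --rm wine-batch cmd /c ver
docker run --rm -v "$PWD:/work" -w /work wine-batch example.exe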
In my case, I tested and found that my client’s text mode Windows executable ran okay and during testing I couldn’t find any example of a text mode Windows executable that didn’t run fine under Wine. Your mileage may vary, but I think it’s a reasonable option to try.
Running Batch Workloads on Cloud Run
Cloud Run Services versus batch processing
Before discussing the solution I used, I will first give a little history of Cloud Run Services, Google’s original Cloud Run product for HTTP and event-driven services. Because events are delivered to a Cloud Run Service via HTTP, it is fair to say that Cloud Run Services are exclusively for HTTP workloads (i.e., software that listens for and responds to HTTP requests).
You might think that you could run any software as a Cloud Run Service just by configuring it to listen for HTTP requests and then starting any workload you like. But it’s not that simple. A Cloud Run Service is billed according to time spent responding to an HTTP request, and there is a practical limit to the amount of time an HTTP server can take to respond to a request: an HTTP client won’t wait forever and will eventually time out. So couldn’t you just design your service to accept a request, immediately return HTTP 202 (“Accepted”), and then process the request in the background? The answer is no, because as soon as a Cloud Run Service returns an HTTP response, Google stops billing and throttles the CPU to near zero, close enough to zero that your software effectively stops making progress. Furthermore, Google may terminate a container that is not servicing an HTTP request. Cloud Run Services are simply not designed to handle long-running batch workloads.
For more detailed information about Cloud Run Services CPU allocation see this documentation.
Cloud Run Jobs (Beta feature)
Cloud Run Jobs (as distinct from Cloud Run Services) is a more recent product which is perfectly suited to batch workloads. Furthermore, it includes a nice feature which helps you split a workload across multiple tasks running in parallel. A Cloud Run job can be run with multiple tasks, each of which runs in a different container instance, and each task has two environment variables set:
$CLOUD_RUN_TASK_COUNT - the total number of tasks
$CLOUD_RUN_TASK_INDEX - the index number of the current task
If you launch a Cloud Run job with 10 tasks then each running container will have $CLOUD_RUN_TASK_COUNT set to 10 and $CLOUD_RUN_TASK_INDEX set to a unique value between 0 and 9. This makes it really easy to handle cases where the containers read data from a common source but need to avoid processing the same files.
For example, consider this simple Bash script that targets a persistent disk (more on this in the next section) mounted at /mnt:
count=0
for file in /mnt/*.dat   # glob expansion returns the files in sorted order
do
  # Each task claims every Nth file, offset by its own task index
  if [ $((count % CLOUD_RUN_TASK_COUNT)) -eq "$CLOUD_RUN_TASK_INDEX" ]
  then
    echo "process file here: $file"
  fi
  count=$((count + 1))
done
The above example builds a sorted list of the .dat files contained in the directory /mnt. It then loops through the list but only processes files which correspond to its $CLOUD_RUN_TASK_INDEX variable. For example, if $CLOUD_RUN_TASK_COUNT was 20 and $CLOUD_RUN_TASK_INDEX was 4, then the task would process the 5th, 25th, 45th, … files in the list (those with count values 4, 24, 44 and so on, since the count is zero-based).
Cloud Run Jobs have many configurable parameters, including options to control the degree of parallelism. Refer to the documentation for details.
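As a sketch, setting the task count and capping concurrency for a job named test-job (the job deployed later in this article; the numbers are illustrative) looks like this:

gcloud beta run jobs update test-job \
  --tasks=20 \
  --parallelism=5 \
  --region=australia-southeast1
gcloud beta run jobs execute test-job --region=australia-southeast1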
Mounting a Disk in a Cloud Run Job (Beta feature)
I was excited to discover that Cloud Run Jobs (and Services) now support mounting three different kinds of volumes:
Google Cloud Storage bucket - persistent object storage mounted as a file system via Cloud Storage FUSE
Google Cloud Filestore Volume - persistent NFS mounted volume
In-memory volume - non-persistent storage held in the instance’s memory
Note that the last option above is non-persistent. You might think it is similar to using the /tmp file system, which is also non-persistent in-memory storage and has always been available with Cloud Run. The differences are that the new in-memory mount can be shared between all containers running in the same instance (each task of a job runs in its own instance, so data is not shared between tasks) and that you can set an explicit size limit; by default the limit is half the memory allocated to the instance’s containers. As with the /tmp file system, writing to this volume consumes your memory allocation, and exhausting it will crash the task. You need to test carefully and ensure you allocate enough memory for your purposes.
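As a sketch, attaching an in-memory volume to a job looks roughly like this (the volume name, mount path and size limit are illustrative):

gcloud beta run jobs update test-job \
  --add-volume=name=scratch,type=in-memory,size-limit=512Mi \
  --add-volume-mount=volume=scratch,mount-path=/scratch \
  --region=australia-southeast1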
Also, note that Google Cloud Storage is an object store, not a file system. When mounted via FUSE it can be accessed like a file system, but it doesn’t fully support POSIX semantics and locking, so you could hit limitations in certain situations. And because GCS objects are immutable, a FUSE-mounted file system is likely to perform well for reading and writing entire files but poorly for small updates to large existing files. A way to address these limitations is to use both a GCS bucket and an in-memory (or Filestore) volume and copy data between them as needed.
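For example, a task script could stage a file in scratch space, let the Windows executable make its small updates there, and then write the whole file back to the bucket in one pass (the paths and update_tool.exe are illustrative, and /scratch is assumed to be an in-memory volume as described above):

# /mnt is the FUSE-mounted bucket; /scratch is fast local storage
cp /mnt/big-file.dat /scratch/
wine update_tool.exe /scratch/big-file.dat
cp /scratch/big-file.dat /mnt/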
Cloud Filestore is probably the most flexible option but it is much more expensive per GB than Cloud Storage.
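Mounting a Filestore share uses the same pair of volume flags with type=nfs (the IP address and share name below are illustrative; note that the job also needs VPC connectivity to reach the Filestore instance):

gcloud beta run jobs update test-job \
  --add-volume=name=nfs_vol,type=nfs,location=10.0.0.2:/share1 \
  --add-volume-mount=volume=nfs_vol,mount-path=/mnt \
  --region=australia-southeast1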
Below is an example showing how you can deploy a Cloud Run Job with a GCS bucket mountpoint:
gcloud beta run jobs deploy test-job \
  --image=australia-southeast1-docker.pkg.dev/xxx/yyy/zzz \
  --command=/usr/bin/bash \
  --args=-c,"/usr/bin/wine example.exe" \
  --region=australia-southeast1 \
  --add-volume=name=gcs_bkt,type=cloud-storage,bucket="xx",readonly=false \
  --add-volume-mount=volume=gcs_bkt,mount-path=/mnt \
  [email protected]
You need to ensure that the service account you specify has been granted a role which allows it to access the storage bucket.
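As an illustration, granting the service account object access on the bucket might look like the following; the account name is a placeholder, and roles/storage.objectAdmin is just one suitable role, so choose the narrowest role that fits your use case:

gcloud storage buckets add-iam-policy-binding gs://xx \
  --member="serviceAccount:[email protected]" \
  --role="roles/storage.objectAdmin"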
More details about Cloud Run volume mounts can be found in the documentation.
Are there other good options?
I briefly considered Google Cloud Functions. I’m a big fan of GCF, but it is a less flexible solution than Cloud Run and is not suitable for long-running jobs. The maximum run time has been extended in 2nd gen GCF, but it is still not suitable for jobs which might take hours to complete. Also, the GCF runtime environment is totally controlled by Google, so you can’t install OS packages.
If the requirement was for a really serious ETL/ELT pipeline then Google’s serverless solutions include Dataflow and Dataproc Serverless.
Other Google Cloud products that can be used for batch workloads include Workflows (suitable for orchestrating workloads across multiple GCP services) and Batch (which is aimed at HPC workloads and sits on top of GCE Managed Instance Groups).
Summary
It turned out that Cloud Run was a great option for my client’s batch workload. I was able to provision a scale-to-zero solution which was easy to launch, handled the workload in parallel, could access data in a storage bucket, and could run a Windows executable.
The beauty of combining ephemeral Cloud Run Jobs with persistent GCS storage is that it allows you to use a generic container which reads configuration from a GCS bucket. The “configuration” might even be a script or program that does anything at all. Obviously this means that you need to take care with permissions on the storage bucket and ensure that the Cloud Run Job runs in a security context which conforms to the principle of least privilege.
In most cases I would expect to secure this type of workload with VPC Service Controls, Cloud Run Ingress Controls, and utilise a Serverless VPC Access Connector and Private Google Access. These features are outside the scope of this document.