Spark Operator practices with Kubernetes (IBM Cloud Kubernetes Service)

yin xu
Apr 11, 2021

Spark Overview

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Deployment

So, let’s talk about how to deploy your Spark application once you have finished coding and local unit testing. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

So how does this work with Kubernetes?

A few years ago, Kubernetes was added as a natively supported (though still experimental) scheduler in Apache Spark v2.3. This means that you can submit Spark jobs to a Kubernetes cluster using the spark-submit CLI with custom flags, much like the way Spark jobs are submitted to a YARN or Apache Mesos cluster.

spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:

  • Spark creates a Spark driver running within a Kubernetes pod.
  • The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
  • When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.

Note that in the completed state, the driver pod does not use any computational or memory resources.
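To make this concrete, a hedged sketch of such a direct submission might look like the following; the API server address, container image, and jar path are placeholders you would replace with your own:

# <k8s-apiserver-host>, <your-spark-image>, and the jar path below are placeholders
$ spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    local:///path/to/examples.jar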

Is that all there is to Spark on Kubernetes? No, no, no…

Although the Kubernetes support offered by spark-submit is easy to use, there is a lot to be desired in terms of ease of management and monitoring. This is where the Kubernetes Operator for Spark (a.k.a. “the Operator”) comes into play. The Operator tries to provide useful tooling around spark-submit to make running Spark jobs on Kubernetes easier in a production setting, where it matters most. As an implementation of the operator pattern, the Operator extends the Kubernetes API using custom resource definitions (CRDs), which is one of the future directions of Kubernetes.

Let’s look at the Kubernetes Operator for Spark. The Operator project originated from the Google Cloud Platform team and was later open sourced, although Google does not officially support the product. It implements the operator pattern that encapsulates the domain knowledge of running and managing Spark applications in custom resources and defines custom controllers that operate on those custom resources.

The Operator defines two Custom Resource Definitions (CRDs), SparkApplication and ScheduledSparkApplication. These CRDs are abstractions of the Spark jobs and make them native citizens in Kubernetes. From here, you can interact with submitted Spark jobs using standard Kubernetes tooling such as kubectl via custom resource objects representing the jobs.

On their own, these CRDs simply let you store and retrieve structured representations of Spark jobs. It is only when combined with a custom controller that they become a truly declarative API. A declarative API allows you to declare or specify the desired state of your Spark job, and the Operator tries to match the actual state to the desired state you’ve chosen.

The Operator controller and the CRDs form an event loop where the controller first interprets the structured data as a record of the user’s desired state of the job, and continually takes action to achieve and maintain that state.

[Architectural diagram: components of the Operator]

Now, let’s try it:

  1. Install the Spark Operator:

The easiest way to install the Kubernetes Operator for Apache Spark is to use the Helm chart.

$ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
$ helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace

This will install the Kubernetes Operator for Apache Spark into the namespace spark-operator. By default, the operator watches and handles SparkApplications in every namespace. If you would like to limit the operator to watching and handling SparkApplications in a single namespace, e.g., default, add the following option to the helm install command:

--set sparkJobNamespace=default
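For example, the full install command with that option would look like this (same release name and namespace as above):

$ helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace --set sparkJobNamespace=default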

Looks simple, right? But here is one rock that blocked my way when I followed these official steps.

What does that mean? How can I fix it?

I tried:

a. Upgraded my Helm 2 to Helm 3: not working…

b. Uninstalled Helm 2 and installed it again: not working…

After some research, I found that this is a known Helm initialization failure with Kubernetes 1.16 (yes, my version is 1.16, and it happens on another 1.17 cluster of mine too).

Then this solution saved my life:

Before you add the Helm repo, run these steps:

kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --override spec.selector.matchLabels.'name'='tiller',spec.selector.matchLabels.'app'='helm' --output yaml | sed 's@apiVersion: extensions/v1beta1@apiVersion: apps/v1@' | kubectl apply -f -
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'

Then come back and follow the official steps above to add the Helm repo and install. If the install completes successfully, you made it!

2. Deploy your Spark application with the Spark Operator:

The first thing to prepare is a YAML file that includes everything for your Spark application. The official repo ships a spark-pi example.
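A sketch modeled on that example might look like this; the image tag, Spark version, and jar path are illustrative and may differ from the upstream version:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v3.0.0"   # illustrative image tag
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"   # path inside the image
  sparkVersion: "3.0.0"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"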

Hints:

If you look at this example YAML, here is one thing that might confuse you:

What does “local” mean here? Does it mean your local path? No, no, no… This is a cluster environment, so you need to think from the container’s perspective: “local” here means a path inside your running container image:
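For example, in the manifest above the application jar is referenced through such a path (the path shown is illustrative):

mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"   # resolved inside the container image, not on your machine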

Then, let’s submit the job:
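Assuming the manifest above has been saved as spark-pi.yaml (a filename chosen here just for illustration):

$ kubectl apply -f spark-pi.yaml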

Then we can verify that the driver is launched in the specified namespace:
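For example, assuming the default namespace used in the manifest above:

$ kubectl get pods -n default
$ kubectl get sparkapplication spark-pi -n default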

The SparkApplication controller is responsible for watching SparkApplication CRD objects and submitting Spark applications described by the specifications in the objects on behalf of the user. After an application is submitted, the controller monitors the application state and updates the status field of the SparkApplication object accordingly. For example, the status can be “SUBMITTED”, “RUNNING”, “COMPLETED”, etc.

Transition of states for an application can be retrieved from the operator’s pod logs. Here we see part of the state transition for the example application SUBMITTED -> RUNNING:
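One way to view those logs; the operator deployment name depends on your Helm release name, so list the deployments in the operator namespace first:

$ kubectl get deploy -n spark-operator                                   # find your operator deployment name
$ kubectl logs -f -n spark-operator deploy/my-release-spark-operator     # name shown is illustrative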

Follow-up needs and wants?

  1. Can I schedule this Spark job deployment? How can I integrate the Spark Operator with Kubernetes’ cron job feature?

Bingo!

The Spark Operator provides native cron support for running scheduled applications.

This is another good reason to try the Spark Operator. You can change the kind in your YAML file from the default “SparkApplication” to “ScheduledSparkApplication”, add a standard cron schedule, and you made it! That is all. You don’t have to worry about deploying manually on time or setting up a Kubernetes CronJob to achieve it, take it easy!
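A hedged sketch of what that change might look like, reusing the spark-pi manifest from above; the five-minute schedule is chosen purely for illustration:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
  namespace: default
spec:
  schedule: "*/5 * * * *"        # standard cron expression, illustrative
  concurrencyPolicy: Allow
  template:                      # same fields as a SparkApplication spec
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v3.0.0"
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
    sparkVersion: "3.0.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "512m"
      serviceAccount: spark
    executor:
      cores: 1
      instances: 1
      memory: "512m"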

For more info, see the official API documentation.

2. I have a team, and I don’t want to give everyone cluster admin access to do this. How can I enable my team to deploy like this?

Let’s take a look at access control for the Spark Operator:

A Spark driver pod needs a Kubernetes service account in the pod’s namespace that has permissions to create, get, list, and delete executor pods, and to create a Kubernetes headless service for the driver. The driver will fail and exit without the service account, unless the default service account in the pod’s namespace has the needed permissions. To submit and run a SparkApplication in a namespace, please make sure there is a service account with those permissions in the namespace and set .spec.driver.serviceAccount to the name of that service account.
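A minimal way to create such a service account, using the built-in edit ClusterRole in the default namespace; the names and namespace are just an example:

$ kubectl create serviceaccount spark                                                        # the service account referenced by .spec.driver.serviceAccount
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default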

Looks familiar? Yes, this is the standard Kubernetes RBAC concept and practice, so it is easy now. We just have to find the Role or ClusterRole created during installation that grants enough access to this service account.

Go back and check your installation logs (no worries if you cannot find them; just list the roles and cluster roles in your installation namespace):
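For example:

$ kubectl get role -n spark-operator          # namespaced roles created by the install
$ kubectl get clusterrole | grep spark        # cluster-wide roles related to the operator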

Then we find that both a Role and a ClusterRole were created, so take a look at which one is used by the service account “spark”. Yes, it is the ClusterRole, so we’ve got it. Next, create a new ClusterRoleBinding YAML to bind your user to this ClusterRole, and that is it!
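A sketch of such a binding; the user name and the ClusterRole name are placeholders you would replace with your team member and with the ClusterRole found in the previous step:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-team-binding
subjects:
- kind: User
  name: <your-team-member-id>          # placeholder: the user (or group) being granted access
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: <spark-operator-clusterrole>   # placeholder: the ClusterRole found above
  apiGroup: rbac.authorization.k8s.io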
