Improve Kubernetes scheduling for GPU-heavy apps with node templates

Kubernetes scheduling ensures that pods are matched to the right nodes so that the kubelet can run them.

The whole mechanism promotes availability and performance, often with great results. However, the default behavior is an anti-pattern from a cost perspective. Pods running on half-empty nodes equate to higher cloud bills. This problem becomes even more acute with GPU-intensive workloads.

Ideal for processing multiple datasets in parallel, GPU instances have become a preferred option for training AI models and neural networks and for running deep learning operations. They complete these tasks faster, but they are also expensive and, combined with inefficient scheduling, lead to huge bills.

This issue challenged one of CAST AI’s users: a company that develops an AI-driven security intelligence product. Their team overcame it with our platform’s node templates, an auto-scaling feature that improved the provisioning and performance of workloads that required GPU-enabled instances.

Learn how node templates can improve Kubernetes scheduling for GPU-intensive workloads.

The challenge of K8s scheduling for GPU workloads

kube-scheduler is the default Kubernetes scheduler and runs as part of the control plane. It selects nodes for newly created pods that have not been scheduled yet. By default, the scheduler tries to distribute these pods evenly across nodes.

Containers within pods can have different requirements, so the scheduler filters out any nodes that don’t meet the pod’s specific needs.

It identifies and scores all viable nodes for your pod, then chooses the one with the highest score and notifies the API server of this decision. Several factors affect this process, for example, resource requirements, hardware and software limitations, affinity specifications, etc.
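The filter-then-score flow described above can be sketched in a few lines of Python. This is a toy model, not the kube-scheduler's code: the `Node` and `Pod` classes, their fields, and the single scoring rule (prefer the node left with the most free CPU) are assumptions made for illustration, while the real scheduler combines many plugins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float  # unallocated CPU cores
    free_gpu: int    # unallocated GPUs


@dataclass
class Pod:
    name: str
    cpu: float       # requested CPU cores
    gpu: int = 0     # requested GPUs


def feasible(node: Node, pod: Pod) -> bool:
    """Filtering step: keep only nodes that can fit the pod's requests."""
    return node.free_cpu >= pod.cpu and node.free_gpu >= pod.gpu


def score(node: Node, pod: Pod) -> float:
    """Scoring step (toy rule): prefer the node left with the most free CPU,
    which spreads pods out instead of packing them tightly."""
    return node.free_cpu - pod.cpu


def pick_node(nodes: List[Node], pod: Pod) -> Optional[Node]:
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None  # no viable node: the pod stays pending
    return max(candidates, key=lambda n: score(n, pod))


nodes = [
    Node("cpu-small", free_cpu=2, free_gpu=0),
    Node("gpu-busy", free_cpu=4, free_gpu=1),
    Node("gpu-idle", free_cpu=16, free_gpu=8),
]
chosen = pick_node(nodes, Pod("trainer", cpu=2, gpu=1))
print(chosen.name if chosen else "unschedulable")
```

Note that the spreading rule sends a one-GPU pod to the emptiest eight-GPU node even though a busier node could have hosted it. That is good for availability, but it is exactly the behavior that leaves expensive capacity idle.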

Fig. 1 Overview of Kubernetes scheduling

The scheduler automates the decision-making process and delivers results quickly. However, it can be costly, because its generic approach may leave you paying for resources that are a poor fit for your environment.

Kubernetes doesn’t care about cost. Determining, tracking, and reducing spend is up to engineers, and the problem is particularly acute with GPU-intensive applications because GPU instance rates are high.
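To put a number on half-empty nodes, here is a small back-of-the-envelope sketch. The hourly price is a placeholder chosen to keep the arithmetic simple, not a real cloud rate, and the function name is our own.

```python
def cost_per_busy_gpu_hour(node_price_per_hour: float, gpus_total: int, gpus_busy: int) -> float:
    """Effective hourly price of each GPU that is actually doing work.

    You pay for the whole node, so idle GPUs inflate the price of busy ones.
    """
    if not 0 < gpus_busy <= gpus_total:
        raise ValueError("gpus_busy must be between 1 and gpus_total")
    return node_price_per_hour / gpus_busy


# Placeholder price for an 8-GPU node (assumption, NOT a real cloud rate).
NODE_PRICE = 32.0  # USD per hour

full = cost_per_busy_gpu_hour(NODE_PRICE, gpus_total=8, gpus_busy=8)
half = cost_per_busy_gpu_hour(NODE_PRICE, gpus_total=8, gpus_busy=4)
print(f"fully used: {full:.2f} USD per busy GPU-hour")
print(f"half empty: {half:.2f} USD per busy GPU-hour")
```

With these placeholder figures, a half-empty node doubles what you effectively pay for every GPU in use.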

Costly scheduling decisions

To better understand their price tag, let’s take a look at Amazon EC2 P4d, designed for machine learning and powerful computing apps in the cloud.

Powered by NVIDIA A100 Tensor Core GPUs, it delivers high throughput and low-latency networking with support for 400 Gbps instance networking. P4d promises to cut the cost of training ML models by 60% and deliver 2.5x better deep learning performance than previous-generation P3 instances.

While that sounds impressive, its on-demand hourly price is hundreds of times that of a popular instance type like C6a. That is why it is essential to keep tight control over the scheduler’s generic decisions.

Fig. 2 Price comparison of p4d and c6a

Unfortunately, when running Kubernetes on a managed service such as GKE, AKS, or Amazon Elastic Kubernetes Service (EKS), you have minimal influence over scheduler settings unless you use components such as mutating admission webhooks.

That is still not a watertight solution, because you have to be careful when writing and installing webhooks.
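To give a feel for what such a webhook involves, below is a minimal sketch of the response a mutating webhook could return to steer a pod toward particular nodes. The `AdmissionReview` envelope and the base64-encoded JSON Patch follow the Kubernetes `admission.k8s.io/v1` format. The function name and the `example.com/gpu-pool` label are hypothetical, and the HTTPS server, TLS certificates, and `MutatingWebhookConfiguration` that a real webhook needs are left out.

```python
import base64
import json


def build_admission_response(review: dict, node_selector: dict) -> dict:
    """Return an AdmissionReview that adds a nodeSelector to the incoming pod."""
    request = review["request"]
    pod_spec = request["object"]["spec"]

    # A JSON Patch "add" on an existing object member replaces it,
    # so merge with any nodeSelector the pod already carries.
    merged = {**pod_spec.get("nodeSelector", {}), **node_selector}
    patch = [{"op": "add", "path": "/spec/nodeSelector", "value": merged}]

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


# Trimmed-down AdmissionReview request, for illustration only.
review = {
    "request": {
        "uid": "example-uid",
        "object": {"spec": {"containers": [], "nodeSelector": {"disk": "ssd"}}},
    }
}
resp = build_admission_response(review, {"example.com/gpu-pool": "true"})
decoded = json.loads(base64.b64decode(resp["response"]["patch"]))
print(json.dumps(decoded, indent=2))
```

Even in this stripped-down form, a mistake in the patch or a webhook that is unavailable can block pod creation cluster-wide, which is why webhooks call for careful writing and installation.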

Node templates to the rescue

This was exactly the challenge one of CAST AI’s users faced. The company is developing an AI-powered intelligence solution for real-time threat detection across social media and news media. Its engine analyzes millions of documents simultaneously to catch emerging stories, and it also enables the automation of unique Natural Language Processing (NLP) models for intelligence and defense.

The amounts of classified and public data the product uses are growing. That means the workloads often require GPU-enabled instances, which adds cost and work.

Much of that effort can be saved by using node pools (Auto Scaling groups). But while they streamline the provisioning process, node pools can also be very cost-inefficient, forcing you to pay for capacity you don’t need.

CAST AI’s autoscaler and node templates improve on that by giving you tools for better cost control and reduction. In addition, thanks to the node template fallback feature, you can take advantage of spot instance savings and guarantee capacity even when spots are temporarily unavailable.

Node templates in action

The customer’s workloads now run on predefined groups of instances. Instead of having to manually select specific instances, the team can define their attributes globally, for example “CPU Optimized”, “Memory Optimized”, and “GPU VMs”, and the autoscaler does the rest.

This feature has given them much more flexibility as they can use different instances more freely. As AWS adds new high-performance instance families, CAST AI automatically enrolls you in them so you don’t have to additionally enable them. This is not the case with node pools, which require you to keep track of new instance types and update your configurations accordingly.

Creating a node template allowed our customer to specify general requirements: instance types, the lifecycle of the new nodes to be added, and provisioning configurations. They also identified limitations such as the instance families they didn’t want to use (p4d, p3d, p2) and the GPU manufacturer (in this case, NVIDIA).

For these specific requirements, CAST AI found five matching instances. The autoscaler now follows these constraints when adding new nodes.

Fig. 3 Node template example with GPU-enabled instances
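As a rough illustration of how constraints like these narrow down the candidates, here is a sketch that filters a small instance catalog by GPU manufacturer and excluded families. The class and field names are hypothetical and do not mirror CAST AI’s API, and the catalog is a hand-picked sample rather than the customer’s actual match list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str                         # e.g. "g5.xlarge"
    gpu_manufacturer: Optional[str]   # None for instances without a GPU

    @property
    def family(self) -> str:
        return self.name.split(".")[0]


@dataclass(frozen=True)
class NodeTemplate:
    # Hypothetical field names, chosen for this sketch only.
    gpu_manufacturer: str
    excluded_families: FrozenSet[str]


def matching_instances(template: NodeTemplate, catalog: List[InstanceType]) -> List[InstanceType]:
    """Keep instances with the requested GPU vendor outside excluded families."""
    return [
        inst
        for inst in catalog
        if inst.gpu_manufacturer == template.gpu_manufacturer
        and inst.family not in template.excluded_families
    ]


# Hand-picked sample catalog (illustrative, far from complete).
catalog = [
    InstanceType("c6a.large", None),
    InstanceType("g4ad.xlarge", "AMD"),
    InstanceType("g4dn.xlarge", "NVIDIA"),
    InstanceType("g5.xlarge", "NVIDIA"),
    InstanceType("p2.xlarge", "NVIDIA"),
    InstanceType("p4d.24xlarge", "NVIDIA"),
]
template = NodeTemplate(
    gpu_manufacturer="NVIDIA",
    excluded_families=frozenset({"p4d", "p3d", "p2"}),
)
matched = [inst.name for inst in matching_instances(template, catalog)]
print(matched)
```

Because the template describes attributes rather than listing instance names, a new instance family added to the catalog is picked up automatically as long as it satisfies the constraints.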

Once the GPU jobs are complete, the autoscaler automatically retires the GPU-enabled instances.

In addition, spot instance automation allows our customer to save up to 90% on hefty GPU VM costs without the negative impact of spot outages.

Since spot prices for GPU instances can vary enormously, it is essential to pick the most cost-effective option available at any given time. CAST AI’s spot instance automation takes care of this. It can also strike the right balance between the most diverse and the cheapest instance types.

On-demand fallback can be a boon during massive spot interruptions or periods of low availability. For example, an interrupted, improperly saved training process in a deep learning workflow can lead to serious data loss. If AWS reclaims all of the EC2 G3 or p4d spot instances your workloads are using at once, an automated fallback can save you a lot of hassle.
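The idea behind fallback can be sketched as a simple ordered retry: try every candidate type as a spot instance, and only then turn to on-demand. This is a simplified illustration and not CAST AI’s implementation. The `CapacityUnavailable` exception and the two request callables are hypothetical stand-ins for real cloud provider calls.

```python
from typing import Callable, Sequence, Tuple


class CapacityUnavailable(Exception):
    """Raised by a provider call that cannot supply the requested instance."""


def provision_with_fallback(
    instance_types: Sequence[str],
    request_spot: Callable[[str], str],
    request_on_demand: Callable[[str], str],
) -> Tuple[str, str]:
    """Try every type as spot first, then fall back to on-demand.

    Returns a (node_id, lifecycle) pair.
    """
    attempts = (("spot", request_spot), ("on-demand", request_on_demand))
    for lifecycle, request in attempts:
        for instance_type in instance_types:
            try:
                return request(instance_type), lifecycle
            except CapacityUnavailable:
                continue  # try the next candidate
    raise CapacityUnavailable("no spot or on-demand capacity for any candidate")


# Stub provider calls simulating a spot outage (for illustration only).
def spot_outage(instance_type: str) -> str:
    raise CapacityUnavailable(instance_type)


def on_demand_ok(instance_type: str) -> str:
    return f"node-{instance_type}"


node_id, lifecycle = provision_with_fallback(
    ["g5.xlarge", "g4dn.xlarge"], spot_outage, on_demand_ok
)
print(node_id, lifecycle)
```

A production autoscaler would also move workloads back to spot once capacity returns. That step is omitted here.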

Create a node template for your workload

Creating a node template is relatively quick and you can do it in three different ways.

First, by using the CAST AI user interface. It’s easy if you’ve already connected and onboarded a cluster. Enter your product account and follow the on-screen instructions.

After you name the template, you choose whether to taint the new nodes so that pods without a matching toleration are not scheduled onto them. You can also specify a custom label for the nodes created from the template.
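If taints are new to you, the sketch below models how a taint keeps pods off a node unless they carry a matching toleration. It follows the matching rules from the Kubernetes documentation (key, operator, value, effect) in simplified form. The taint key `example.com/gpu-template` is hypothetical and is not a label that CAST AI sets.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with "Exists" matches every key
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def can_schedule(node_taints: List[Taint], pod_tolerations: List[Toleration]) -> bool:
    """A pod fits only if it tolerates every hard taint on the node."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)


# Hypothetical taint placed on nodes created from a GPU template.
gpu_taint = Taint("example.com/gpu-template", "true", "NoSchedule")

plain_pod: List[Toleration] = []
gpu_pod = [Toleration(key="example.com/gpu-template", operator="Equal", value="true")]

print(can_schedule([gpu_taint], plain_pod))  # pod without a toleration
print(can_schedule([gpu_taint], gpu_pod))    # pod with a matching toleration
```

Tainting template nodes this way reserves expensive GPU capacity for the workloads that explicitly ask for it.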

Fig. 4 Node template from CAST AI

You can then associate the template with a relevant node configuration and indicate whether you want it to use only spot or only on-demand nodes.

You also get a choice of processor architecture and the option to use GPU-enabled instances. If you select this preference, CAST AI automatically runs your workloads on relevant instances, including any new families added by your cloud provider.

Finally, you can also use constraints like:

  • Compute Optimized: Helps choose instances for apps that require powerful CPUs.
  • Storage Optimized: Selects instances for apps that benefit from high IOPS.
  • Additional restrictions, such as instance family and minimum and maximum CPU and memory limits.

Keep in mind, though, that the fewer restrictions you add, the better the matches and the higher the cost savings, because the CAST AI engine has more instance types to choose from.

You can also create node templates with Terraform (you’ll find all the details on GitHub) or through the API (check the documentation).


Kubernetes scheduling can be challenging, especially when it comes to GPU-heavy applications. While the scheduler automates the provisioning process and produces quick results, it can often prove to be too generic and expensive for your application’s needs.

Node templates provide better performance and flexibility for GPU-intensive workloads. The feature also ensures that once a GPU instance is no longer needed, the autoscaler retires it and provides a lower cost option for your workload’s new requirements.

We’ve found this quality helps build AI apps faster and more reliably – and we hope it will support your efforts as well.
