5 min read

How TorchServe could scale in a Kubernetes environment using KEDA

I almost burned a 7K Euros GPU card (NVIDIA A100 PCIe GPU) to understand how a TorchServe could meet the increasing of ondemand inference requests at scale.
How TorchServe could scale in a Kubernetes environment using KEDA

Some days ago I wrote the following blog post on how TorchServe can help to deliver AI models in production ready to serve inference requests.

Serve AI models using TorchServe in Kubernetes at scale
In a tipical MLOps pratice, among the various things, we need to serve our AI models to users exposing inference APIs. I tried a production ready framework (TorchServe) installing it on Azure Kubernetes Service and tested its power to the maximum.

Because I had some Azure credits left from my free tier (200$ for the first month),
an idea came in my mind: why not to try an NVIDIA A100 PCIe GPU (Standard_NC24ads_A100_v4) with 80GB of GPU trying to use it until its limit?

(from my rapid research this card should cost something like 7K Euros 😱😱😱)

So I did it and understood what it could be a really interesting idea to scale TorchServe in a Kubernetes environment.

Below you will find the tests and then an architecture which could be implemented on a Kubernetes production scenario to scale TorchServe workload based on custom metrics.



As I did in the above blog post, I runned TorchServe with the following setup:

In this way I reached a throughput of 418 requests per second with an avg response time of 68ms

worker node CPU usage
worker node GPU usage

Full tests here https://github.com/texano00/TorchServe-Lab/blob/main/JMeter/results/aks/Standard_NC24ads_A100_v4/SamLowe_roberta-base-go_emotions/README.md


In this way I reached a throughput of 16.9 requests per second with an avg response time of 280ms

GPU usage
CPU usage

Full test here https://github.com/texano00/TorchServe-Lab/blob/main/JMeter/results/aks/Standard_NC24ads_A100_v4/facebook_detr-resnet-50/README.md

Scale TorchServe

These tests allowed me to understand better one of the benefit of using TorchServe workers which is basically the optimization of CPU and Memory compared of scaling Kubernetes pods.

In conclusion, increasing TorchServe workers instead of scaling Kubernetes pods produced a reduction of:

  • -59% of CPU load avarage
  • -20% of memory usage

GPU usage remained almost the same.
I supposed that it was possible thanks by TorchServe instances shared by multiple workers (see TorchServe internal architecture https://github.com/pytorch/serve/blob/master/docs/internals.md).

And looking to the scale process in a production environment, it could be interesting put together Kubernetes Horizontal Pod Autoscaler with TorchServe workers in order to meet the increase of inference requests.

So basically we can think a setup with a TorchServe minimum workers (based on needs) and than a pod autoscaling process based on the TorchServe custom metrics.


In this way we could take the advantages of TorchServe workers (as seen before) and Kubernetes pod autoscaling to ditribute our workload (high availability, fault tolerance, etc).

An idea of the architecture could be as follow involving mainly Keda and Prometheus:

Scaling TorchServe using Keda in Kubernetes
  • Prometheus ingests custom metrics from all TorchServe pods (setup here https://www.yuribacciarini.com/serve-ai-models-using-torchserve/#5monitoring-stack)
  • The TorchServe metrics are visible on a Grafana dashboard to monitor how TorchServe workload is performing
  • Keda configured to scale TorchServe deployment based on prometheus custom metrics
  • The same Keda exposes Prometheus like metrics which could be plotted on a Grafana dashboard to monitor how Keda is working

Important note: in order to allow the above process to work properly, you have to configure memory requests/limits of TorchServe deployment.
To do so, on NVIDIA cards, you have to install the NVIDIA device plugin (https://github.com/NVIDIA/k8s-device-plugin) in your cluster Kubernetes.

NVIDIA device plugin implements the official Kubernetes device plugin framework described here https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

Kubernetes provides a device plugin framework that you can use to advertise system hardware resources to the Kubelet.

In this way our pods are available to request a new type of resource named nvidia.com/gpu

Or, in a more realistic scenario (check NVIDIA supported cards) use Multi-Instance GPU (MIG) which allow to share in a secure way a phisical GPU and of course optimize its usage --> https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

Basically, from users side, it is a concept similar to Virtual CPU but in GPU.

Automatically scale TorchServe workers

Because TorchServe allows hot workers scale up/down through APIs without any restarts needed (https://pytorch.org/serve/management_api.html#scale-workers), it could be also interesting to automatially scale TorchServe workers based on the same custom metrics exposed by TorchServe itself.

curl -v -X PUT "http://<torch-serve-host>:8081/models/noop?min_worker=3"

< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 42adc58e-6956-4198-ad07-db6c620c4c1e
< content-length: 47
< connection: keep-alive
  "status": "Processing worker updates..."

In this case, because every TorchServe worker needs a specific amount of memory almost easy to predict, we need also to automatically update the pods requests/limits memory using the following Kubernetes feature (alpha in v1.27) which allows to update it without restarting the pod.

ref. https://kubernetes.io/docs/tasks/configure-pod-container/resize-container-resources/

This stuff is needed during both workers scale up and workers scale down to adjust the requested memory and also prevent memory waste (during TorchServe workes scale down).

In this way we ensure that the pods are always requesting the right amount of memory needed to run TorchServe with a dynamic number of internal workers.


It could be also interesting to use Vertical Pod Autoscaler instead of manual update pods requests/limits memory but I'm not sure of how much VPA is reactive and how much the TorchServe workers are fault tolerence related to not enough GPU memory.
Also in this case the above k8s feature of resizing Memory resources without restarting the pods is needed.

Finally, if Kubernetes is not able to continue running a pod after a TorchServe workers scale up,
it will put the pod in a waiting resources status so we will also be able to trigger Cluster Autoscaling mechanism to add new worker nodes and scale also on Kubernetes worker nodes side.

The main pitfall on Cluster Autoscaling is to avoid outages and serve the maximum inference requests at any time also during Cluster Autoscaling process.
An idea could be to anticipate the Kubernetes worker nodes scale UP before the TorchServe workers scaling in order to have a entire scale process as smooth as possible.

Tweets by YBacciarini