The aim of this blog post is to put together some of my knowledge about MLOps and deploy an AI service ready to serve multiple, different AI models (from HuggingFace in this case) in production.
I used TorchServe as the framework in a real Kubernetes-as-a-service scenario (Azure AKS), on a Standard_NV6ads_A10_v5 VM with an NVIDIA A10 GPU.
To do this I used the Azure free tier, which includes $200 of credit for the first month.
After the setup, which involved TorchServe and the Prometheus/Grafana stack,
I ran some stress tests using JMeter to understand how TorchServe reacts and how it can be scaled.
Some tests included:
- CPU vs GPU
- Kubernetes Pod scaling
- TorchServe configurations
- batch size
- Stress tests with JMeter
Finally, I reported the raw results along with some personal thoughts.
Enjoy the read 😎
One step back
Two things made me curious and fascinated about MLOps in the past:
- The first was the Udacity course "Intro to Machine Learning" that I took in 2017. It opened up a new world for me.
If you have a basic knowledge of Python I really recommend it, link below 👇👇👇 https://learn.udacity.com/courses/ud120
Personally, I was fascinated by the included lab where you try to find clusters in a real, huge and famous dataset coming from Enron.
What is Enron –> https://en.wikipedia.org/wiki/Enron_scandal
- The second was a project I did while working for my last employer.
It was basically a data extraction project aimed at speeding up loan processes.
The data were mainly ID cards, passports, etc.
It was 2018 and it was the first time I needed to set up a sort of MLOps pipeline, unfortunately with a lot of manual steps, in order to deliver new AI models to our users with attention to:
- time to market
- inference performance
- inference quality
- backward compatibility / no regression
Recently, I also read "Introducing MLOps: How to Scale Machine Learning in the Enterprise" (https://www.yuribacciarini.com/books/) and started studying how to deploy multiple AI models in a production environment ready to serve a huge amount of inference requests.
I tried both, but I found TorchServe easier to configure and use, so in the following tests I used, as the title says, the TorchServe framework.
The exact setup is available on GitHub; here I will report only the core steps needed to set up a TorchServe container image ready to serve multiple HuggingFace models.
The setup basically allows the creation of a container image that contains our production-ready server.
Like a hamburger, the final Docker image is composed of 3 layers:
- nvidia/cuda image
- TorchServe image
- Our custom image (models, Python dependencies, TorchServe configurations, handlers, …)
Every new release will change only the top layer, and thanks to Docker layer caching this will speed up our MLOps.
With this image we are basically ready to deploy our AI server, able to serve inference requests.
Let's see how.
1- Download the model from HuggingFace (this example uses a text-classification model)
git clone https://huggingface.co/SamLowe/roberta-base-go_emotions
2- Write a custom handler
I mainly implemented the inference function (no pre/post-processing was needed in this case).
What is a TorchServe handler? --> https://pytorch.org/serve/custom_service.html
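To give an idea of the shape of a handler, here is a minimal sketch. A real handler subclasses `BaseHandler` from `ts.torch_handler.base_handler` and loads the actual model in `initialize`; in this self-contained mock the base class and the model are stubbed out (the class name, labels and scores are purely illustrative), so only the interface shape is shown.

```python
# Sketch of a TorchServe-style custom handler (stubbed, illustrative only).
# A real handler subclasses ts.torch_handler.base_handler.BaseHandler.

class EmotionClassifierHandler:
    def initialize(self, context):
        # In a real handler: load the model/tokenizer from
        # context.system_properties["model_dir"]
        self.model = lambda text: {"label": "joy", "score": 0.98}  # stub model
        self.initialized = True

    def preprocess(self, data):
        # TorchServe passes a list of requests; each body may be bytes or str
        texts = []
        for row in data:
            body = row.get("data") or row.get("body")
            if isinstance(body, (bytes, bytearray)):
                body = body.decode("utf-8")
            texts.append(body)
        return texts

    def inference(self, inputs):
        return [self.model(text) for text in inputs]

    def postprocess(self, outputs):
        # Must return exactly one response per incoming request
        return outputs

    def handle(self, data, context):
        if not getattr(self, "initialized", False):
            self.initialize(context)
        return self.postprocess(self.inference(self.preprocess(data)))
```

The key contract is `handle(data, context)` returning a list with one entry per request; pre/post-processing hooks can be left as pass-throughs when, as in my case, they are not needed.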
3- Package the model as a .mar file for TorchServe usage
How to install torch-model-archiver --> https://github.com/pytorch/serve/blob/master/model-archiver/README.md
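Packaging looks roughly like the command below. The file names and paths are illustrative examples taken from the cloned model directory, not the exact ones from my repo:

```shell
# Illustrative packaging command; adjust file names to your model
torch-model-archiver \
  --model-name roberta-base-go_emotions \
  --version 1.0 \
  --serialized-file roberta-base-go_emotions/pytorch_model.bin \
  --handler custom_handler.py \
  --extra-files "roberta-base-go_emotions/config.json" \
  --export-path model-store
```

The resulting `.mar` lands in `model-store/`, which is the directory TorchServe will later be pointed at.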
You can iterate steps 1, 2 and 3 for every HuggingFace model you want.
On my repo https://github.com/texano00/TorchServe-Lab you can find a couple of text-classification models and an object-detection model.
4- Build the container image
In order to customize the CUDA version, you need to build your own TorchServe image.
requirements.txt is needed to install the Python libraries required by our handlers,
along with the .mar models previously generated.
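The top ("custom") layer of the hamburger could look like the hypothetical Dockerfile sketch below. The base image tag and paths are assumptions for illustration; the real Dockerfile is in my repo:

```dockerfile
# Hypothetical sketch of the custom top layer; base image and paths are illustrative
FROM my-registry/torchserve-cuda:latest

# Python dependencies required by the handlers
COPY requirements.txt /home/model-server/
RUN pip install -r /home/model-server/requirements.txt

# Previously generated .mar models and the TorchServe configuration
COPY model-store/ /home/model-server/model-store/
COPY config.properties /home/model-server/

CMD ["torchserve", "--start", "--ncs", \
     "--ts-config", "/home/model-server/config.properties", \
     "--model-store", "/home/model-server/model-store", \
     "--models", "all"]
```

Since models and handlers change far more often than CUDA or TorchServe itself, keeping them in this last layer is what makes rebuilds and pushes fast.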
5- Monitoring stack
TorchServe can be configured to export Prometheus-style metrics (see my configuration here: https://github.com/texano00/TorchServe-Lab/blob/main/kubernetes/torchserve/templates/config.yaml).
In this way we are able to set up a Prometheus/Grafana stack to monitor TorchServe with custom metrics.
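For reference, a minimal `config.properties` excerpt could look like this. The exact keys depend on the TorchServe version, so treat this as a sketch and check the config.yaml linked above for the real values:

```properties
# Hypothetical excerpt; exact keys depend on the TorchServe version
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_metrics_api=true
```

Prometheus then scrapes the metrics endpoint (port 8082 by default), and Grafana dashboards are built on top of those series.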
6- Run it!
Now you can run the image on the platform you want.
Check here for the Kubernetes installation --> https://github.com/texano00/TorchServe-Lab/tree/main/kubernetes/torchserve
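Once the server is up, inference requests go to TorchServe's inference API (port 8080 by default) at `/predictions/<model_name>`. A small standard-library sketch; the host, model name and payload shape are examples:

```python
import json
import urllib.request

def build_inference_request(host: str, model_name: str, text: str) -> urllib.request.Request:
    """Build a POST request for TorchServe's inference API (default port 8080)."""
    url = f"http://{host}:8080/predictions/{model_name}"
    return urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Against a running server you would then call, e.g.:
# with urllib.request.urlopen(
#     build_inference_request("localhost", "roberta-base-go_emotions", "great job!")
# ) as resp:
#     print(resp.read())
```

The same endpoint is what the JMeter test plans below hammer during the stress tests.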
I ran these tests from my local machine, which is of course outside the Azure environment, so the network could have impacted the results; but, in principle, the results make sense.
I configured the tests using JMeter; you can find the setup here: https://github.com/texano00/TorchServe-Lab/tree/main/JMeter
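Running a plan in non-GUI mode keeps JMeter's own overhead out of the measurements. The plan and result file names below are illustrative placeholders, not the actual files from my repo:

```shell
# Non-GUI run: -n (no GUI), -t (test plan), -l (results log)
jmeter -n -t torchserve-test-plan.jmx -l results.jtl
```

The resulting `.jtl` file is what throughput and response-time figures like the ones below are read from.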
I divided the tests in two categories based on the AI model used.
Type: Text Classification
The best I was able to achieve was 51.7 inferences per second with a tolerable degradation (+10ms) in response time.
- 1 TorchServe worker
- 2 TorchServe pod replicas
- GPU enabled
- 5 parallel users
- Avg response time ~100ms
CPU vs GPU
Going straight to the conclusions, GPU was of course more performant than CPU, probably because this text-classification model is optimized to run on GPU.
But since the performance degradation is not that huge, I'd say CPU could be a cheaper alternative to GPU (with some tuning needed, of course).
While memory was not a problem, CPU usage was the bottleneck.
With GPU enabled, instead, the CPU was of course almost idle and the GPU was not that busy.
Type: Object Detection
Here the GPU usage was higher than with the text-classification model.
The best I achieved here was 4.1 inferences per second.
- GPU enabled
- 1 TorchServe worker
- 1 pod replica
- 5 parallel users
- avg response time 1200ms
CPU vs GPU
The degradation of using CPU instead of GPU was too high in this case, so for this object-detection model CPU cannot be used at all.
To give you an idea, with the same setup above, using CPU I obtained:
- throughput: 25 inferences per minute
- avg response time 11s
All these tests allowed me to better understand how TorchServe works, so I will be able to compare it with other frameworks.
In a few hours I was able to
- understand TorchServe
- configure it
- deploy a full working container on a Kubernetes cluster
So I'd say it is quite an easy framework to manage.
The most challenging part was scaling the pods and/or TorchServe internal workers, due mainly to my limited GPU memory (4GB), but I'd say that the rollout process must be addressed in a dedicated way even with more GPU memory.
Another important thing to keep in mind is the huge size of the container images.
Just to give an example, my final container image was 10GB.
Even if Docker optimizes space using layers, this is another aspect to address in order to avoid excessively high bills.
Finally, thanks to some easy stress tests, I understood how GPU and CPU react with different AI models, and GPU is NOT always the answer.
Sometimes CPU can deliver similar performance while, of course, reducing costs.
Here is an interesting Microsoft blog post on the GPU vs CPU topic: https://azure.microsoft.com/es-es/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
I haven't yet documented how a GPU Kubernetes cluster could be optimized; as usual, before writing about it I'd like to get my hands dirty.
Maybe the next topic: MIG (NVIDIA Multi-Instance GPU) vs Time Slicing vs MPS (Multi-Process Service) 🚀🚀🚀