Google Cloud published a blog post titled "Save on GPUs: Smarter autoscaling for your GKE inferencing workloads." The article discusses how running LLM inference workloads can be costly, even with the latest open models and infrastructure.
One proposed solution is autoscaling, which helps optimize costs by meeting customer demand while paying only for the AI accelerators you need.
The article provides guidance on how to set up autoscaling for inference workloads on GKE, focusing on choosing the right metric.
I found the comparison of autoscaling metrics for GPUs particularly interesting: GPU utilization vs. batch size vs. queue size.
My takeaway is that GPU utilization is not an effective metric for autoscaling LLM workloads: it does not directly reflect how much traffic the server is handling, so scaling on it can lead to overprovisioning. Batch size and queue size, on the other hand, are direct indicators of how much traffic the inference server is experiencing, which makes them more effective metrics.
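To make the metric choice concrete, here is a minimal sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents, ceil(currentReplicas * currentMetric / targetMetric), applied to an average queue size per replica. The function name, the target of 10 queued requests, and the replica bounds are my own illustrative assumptions, not values taken from the article.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,   # e.g. average queue size per replica (assumed metric)
    target_metric: float,    # hypothetical target, not a value from the article
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,  # the HPA skips scaling when the ratio is within 10% of 1.0 by default
) -> int:
    """Approximate the HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Close enough to the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Queue is backing up: 2 replicas with 25 queued requests each, target 10.
    print(desired_replicas(2, 25, 10))
    # Traffic has dropped: 4 replicas with 2 queued requests each, target 10.
    print(desired_replicas(4, 2, 10))
```

Because queue size rises and falls with incoming traffic, the ratio in this formula tracks demand directly, which is the property the article argues GPU utilization lacks.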
Overall, the article provides a helpful overview of how to optimize the cost-performance of LLM inference workloads on GKE. I recommend it to anyone looking to deploy LLM inference workloads on GKE.