
Optimizing Cloud Scalability: Implementing Horizontal Pod Autoscalers with Kubernetes

Maurits Beudeker, Cloud Consultant
Publish date: 16 May 2024

In a recent collaboration between CloudNation and one of its clients, a key issue was at the forefront: how can the client better manage scalability in their cloud environment? In this blog, Maurits Beudeker, a Cloud Consultant at CloudNation, shares how he stepped in to tackle this challenge and the lessons he learned from the experience.

The Challenge

The client needed help with a specific problem: they occasionally received a large volume of messages all at once, more than could be processed simultaneously, which caused delays within the application. To prevent this, they wanted to use Horizontal Pod Autoscalers in their Kubernetes cluster, which allow the number of worker pods to scale dynamically with the volume of incoming requests.


Why Horizontal Pod Autoscalers? An Analogy

Imagine you’re a supermarket manager who needs to unload 100 trucks every day to keep the warehouse stocked, but you don’t know exactly when the trucks will arrive, and you only have 3 parking spots. Would you pay 30 warehouse workers for the whole day? You risk having workers stand idle when there’s no work, and if nine trucks suddenly arrive at once, 30 workers might still be too few. Wouldn’t it be fantastic to have, on demand, exactly the right number of workers for the number of trucks waiting to be unloaded, workers who magically disappear when they’re not needed (so you don’t have to pay 60 times the minimum wage)?

Three things are needed for this:
- Application metrics (application-specific statistics)
  - In the analogy, this is someone noting how many trucks are currently waiting to park and be unloaded.
- Prometheus (an application that collects and stores these statistics over time)
  - In the analogy, this is the person collecting the measurements and remembering when each one was taken.
- Horizontal Pod Autoscalers (HPAs): Kubernetes objects that can scale the number of pods of an application based on its usage.
  - These are like micromanagers who determine, based on the collected data, how many warehouse workers are currently needed to unload the trucks as quickly as possible.

Be aware that this system can be misused: if you don’t set a maximum number of workers/pods, a malicious person could cause effectively unbounded scaling, resulting in extremely high costs. It is therefore wise to set a maximum scale limit to avoid unforeseen expenses.
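
As a rough sketch of what an HPA with such a scale limit looks like in practice, a standard HorizontalPodAutoscaler that scales on CPU might be defined roughly as follows. The Deployment name `worker` and all numbers are illustrative assumptions, not the client’s actual configuration:

```yaml
# Minimal sketch of a standard HPA (autoscaling/v2).
# All names and numbers are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker          # the workload being scaled
  minReplicas: 1
  maxReplicas: 30         # the scale limit that prevents unbounded costs
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```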


From Request to Design

Armed with solid background knowledge, the team headed to the client’s location. Once there, the project scope was quickly established and the exact requirements and necessary steps were defined, resulting in a clear course of action. Collaboration with the client’s team was smooth and pleasant, which made for a good working atmosphere.

The team worked intensively on the design throughout the day, aiming for rapid implementation. Early on, however, they ran into a limitation: out of the box, Horizontal Pod Autoscalers (HPAs) can only scale based on CPU and memory usage, while the client specifically wanted to scale based on the length of the application’s message queue. Since processing the queue was a separate process with little impact on CPU usage, the standard metrics were insufficient, so the decision was made to scale on queue length instead.
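
To sketch the difference under the same assumptions as the example above, the HPA’s `metrics` section would then point at a queue-length metric rather than a resource metric. The metric name `queue_length` is a placeholder, and the metric has to be served to the HPA through the Kubernetes external (or custom) metrics API:

```yaml
# Hypothetical metrics section: scale on queue length instead of CPU.
metrics:
  - type: External
    external:
      metric:
        name: queue_length      # placeholder name for the queue-length metric
      target:
        type: AverageValue
        averageValue: "100"     # roughly one worker pod per 100 queued messages
```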

To achieve this, the application needed to provide its metrics to the Kubernetes cluster so the HPAs could use them, which required adding the Prometheus Adapter to the cluster. With the design in an advanced stage, discussions were held with the client to determine where the application metrics would be exposed, so that all the necessary pieces would fit together.
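
As an indication of what that adapter setup involves, a rule in the Prometheus Adapter configuration translates a Prometheus time series into a metric the HPA can query. A sketch for the hypothetical `queue_length` metric, exposed through the external metrics API, might look roughly like this:

```yaml
# Hypothetical Prometheus Adapter rule (kubernetes-sigs/prometheus-adapter):
# expose the Prometheus series "queue_length" via the external metrics API.
externalRules:
  - seriesQuery: 'queue_length{namespace!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
    name:
      matches: "^queue_length$"
      as: "queue_length"
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```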

However, the metrics the application published were not yet in a format Prometheus could interpret. Fortunately, the product owner was on-site, so a change could be requested immediately to expose the application metrics in a suitable format.
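
For context, Prometheus scrapes metrics over HTTP in a plain-text exposition format, so “a suitable format” essentially means serving lines like the following (a made-up metric) on a metrics endpoint:

```
# HELP queue_length Number of messages currently waiting in the queue.
# TYPE queue_length gauge
queue_length 42
```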


From Design to Implementation: A Seamless Transition

Upon returning on Thursday, the team found that the client’s developers had made the necessary adjustments, so the application-specific statistics could now be exported and used to scale the workload. Finalizing the setup was still challenging, however, because it was not yet clear which statistic should drive the scaling.

Ultimately, it was decided to scale based on the estimated remaining processing time: scaling up the number of worker pods increases the number of messages processed per second, which in turn brings the remaining time down. By the end of the day, the results could be evaluated in Grafana: whenever the queue grew long, Kubernetes automatically scaled up the worker pods, throughput increased, and the time needed to process the messages dropped until the maximum number of worker pods was reached.
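
Under the same illustrative assumptions as before, scaling on the estimated remaining time would mean pointing the HPA at that statistic instead; the metric name `queue_remaining_seconds` and the five-minute target are invented for the example:

```yaml
# Hypothetical: scale so that the estimated time to drain the queue stays
# below a target; adding worker pods brings the remaining time back down.
metrics:
  - type: External
    external:
      metric:
        name: queue_remaining_seconds   # placeholder for the remaining-time metric
      target:
        type: Value
        value: "300"                    # keep the estimated backlog under ~5 minutes
```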


Successful Implementation

The result was impressive: the system functioned excellently, and it was gratifying to see the desired effect achieved within two days. Despite the challenges, there was satisfaction at the end of the process, knowing the client was pleased with the solution. For the team, it was also a fulfilling experience to successfully complete a complex task.

Want to know more about scalability challenges in cloud environments?

Get in touch with cloud consultant Maurits to talk about your case.

