The poor state of Kubernetes horizontal pod autoscaling

Posted on January 4, 2020 by wjwh

In a previous life I did a lot of work with control systems to monitor and adjust physical systems, and there is always something extremely satisfying about seeing them in action. In computing, their use is (sadly) mostly limited to autoscaling systems, but even there the theory is often misapplied or used in an extremely limited way. Even in systems which are built entirely around scheduling workloads, like Kubernetes, the implementation is very limited. In this post I will look at the operation of the default implementation of the Kubernetes pod autoscaler, and also at some alternatives (which are also lacking).

Horizontal Pod Autoscaler

Kubernetes comes with a default autoscaler for pods called the Horizontal Pod Autoscaler (HPA). It manages the number of pods in a Deployment, StatefulSet or ReplicaSet, based on how many resources the current pods use and a preset target which must be supplied by the user.

By default, the resources monitored can be CPU or RAM used by the pods, expressed as a percentage of the amount they have requested. If you install the external metrics API, you can even autoscale your pods on arbitrary metrics like the number of messages outstanding in a message queue or the average latency of requests. Every 30 seconds (configurable) the HPA will check what the actual usage is versus how much it should be and then it will add or remove as many pods as required to bring the two back in line. As an example: say you have 10 pods running with a target CPU utilisation of 80% (to have some room for load spikes). If the pods use 90% CPU, the HPA determines it needs (90/80 * 10) = 11.25 pods to bring the utilisation back to 80%. Since it cannot run a fraction of a pod, this is rounded up to 12, so in practice 2 pods are added. If the HPA measures the actual CPU utilisation to be only 40%, it needs (40/80 * 10) = 5 pods, so it can remove half the pods to bring the average utilisation back up to 80%. After adding or removing pods, there is generally a cooldown period during which no extra pods will be added or removed.
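The arithmetic in the example above can be written down as a small Python sketch. The function name and its parameters are made up for illustration and are not part of any Kubernetes API. It also leaves out things the real HPA does around this calculation, such as the cooldown period.

```python
import math


def desired_replicas(current_replicas, current_utilisation, target_utilisation):
    """Scale the pod count by the ratio of measured to target utilisation.

    Hypothetical helper: utilisation values are percentages (e.g. 90 and 80).
    The result is rounded up because a fraction of a pod cannot be scheduled.
    """
    return math.ceil(current_replicas * current_utilisation / target_utilisation)


# The two cases from the text: 10 pods with an 80% CPU target.
scale_up = desired_replicas(10, 90, 80)    # 11.25 rounded up
scale_down = desired_replicas(10, 40, 80)  # exactly half the pods
```

With these inputs `scale_up` is 12 (two pods added) and `scale_down` is 5 (half the pods removed), matching the numbers above.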

This approach has a few problems. First off, the number of pods to be added or removed is based on a fairly short window of measurements. This means that if you have a short period of lower utilisation, the HPA will happily throw away the majority of your pods in one go. When load picks back up, you will be extremely underprovisioned. There are some issues on the repo about making this more configurable. The linked issue has been open since 2017 but no fix has been made yet. It is also not possible to specify limits on how far to scale up or down in a single step: the HPA always tries to go to the desired number of pods in one jump. This can lead to problems if the pods use significantly more resources on startup than during normal operation.

In one particular instance at $WORK we ran into a use case that suffers especially from this: autoscaling pods running a background worker process. In this case, as long as the queue has items, all workers will work as hard as they can and the HPA, seeing that CPU use is high, will add more and more pods until the queue is empty again. Then, when at last enough workers have been added to empty out the queue, there is a large surplus of workers and the HPA will remove a lot of them. We have observed instances where 80+ percent of pods were deleted in one action. After that action, there will be too few pods to keep up with the load and the cycle starts anew. This led to large “sawtooth” spikes in the number of pods used, where the number of pods would continuously rise until the queue was empty, after which most pods got deleted again.
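To make the overshoot concrete, here is a toy Python model of that situation. Everything in it is an assumption for illustration, not a description of our actual workload: each pod handles one queue item per tick, items arrive at a steady rate, the autoscaler acts on every tick, and there is no cooldown. Pod utilisation is taken to be the fraction of pods that had an item to work on.

```python
def simulate(backlog=200, arrivals_per_tick=10, pods=5, target_pct=80, ticks=16):
    """Return the pod count after each tick of a toy queue-worker model.

    All parameters are hypothetical. The first entry is the starting pod count.
    """
    queue = backlog
    history = [pods]
    for _ in range(ticks):
        queue += arrivals_per_tick
        processed = min(queue, pods)  # each pod handles one item per tick
        queue -= processed
        # utilisation = processed / pods, so pods * utilisation / target
        # simplifies to processed / target. Negated floor division is an
        # integer-only way to round the result up.
        pods = max(1, -(-(processed * 100) // target_pct))
        history.append(pods)
    return history


history = simulate()
# Fraction of pods removed (positive) or added (negative) in each step.
drops = [(before - after) / before for before, after in zip(history, history[1:])]
print(history)
print(f"largest single-step reduction: {max(drops):.0%}")
```

In this model the pod count keeps growing by roughly a quarter per tick while the backlog lasts, peaks at 75, and then falls from 67 to 13 in a single step once the queue is empty, even though 13 pods would have been enough for the steady arrival rate all along. Because arrivals are perfectly steady here, the model settles after that one collapse; it does not reproduce the repeating cycle we saw in production.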

Other solutions

As a result of the shortcomings of the default HPA, several people have written alternative versions. However, all of the ones I have seen so far suffer from at least one of these flaws:

If I could have any system I wanted, it would probably have at least the following features:

So far, none of the alternatives I have seen hits all these. Perhaps I’ll have a stab myself, although there is of course the XKCD standards situation to watch out for and I have no idea yet on how to handle that.


The promise of sophisticated pod autoscaling seems like one of the main reasons for adopting a container scheduling system like Kubernetes, but it does not really deliver. This seems to be a trend with many of the more “advanced” scheduling parts in Kubernetes: they often do not behave in the way you would expect. Many people discover fairly late that a Deployment referencing a ConfigMap will not roll itself if the ConfigMap is updated (protip: using the configmap generators from kustomize helps a lot here). Similarly, pods are only assigned to nodes when the pod is scheduled for the first time. This can lead to some surprises when you add a few new nodes (which will remain empty) and then remove the old ones too quickly. It is very possible to violate pod disruption budgets that way. You can also get some very lopsided node utilisation patterns when some nodes stay empty and some are very full. Some people have even added an automated process to kill random pods in the hope that this will result in a better distribution of pods across nodes. Most of these things are poorly documented and you only discover them when something breaks.

Don’t get me wrong, Kubernetes gets a lot of things right and while it’s a complex beast, it’s usually successful in hiding that complexity. I would just like more escape hatches for those inevitable moments when you encounter a situation that the devs did not anticipate. The need for hacky scripts from all over the internet to execute basic functions correctly makes all of Kubernetes look less appealing, and it would be much preferable for the official tools to have fewer of these surprising gotchas.