Kubernetes v1.27, released in April 2023, came with an exciting announcement - we can now resize pod CPU and memory requests and limits in place! Without deleting the pod or even restarting the containers!
This happened more than a year ago, and since then a lot of folks seem to think this feature is already publicly available or is due to become so tomorrow.
But the reality is that it was originally released as an Alpha feature and has since had no success moving to Beta due to a number of unresolved issues.
The latest status, as of June 2024, is that it has been pushed back to v1.32:
Here's the link to that comment on GitHub.
So first of all - this isn't coming tomorrow. But we can still play with the feature and understand its advantages and shortcomings. Which is exactly what I'm planning to do in this post.
Get a Cluster with Alpha Features
k3d is irreplaceable when we want to test Kubernetes Alpha features quickly and cheaply. All we need to do is pass the correct feature gate to the correct control plane component.
Install k3d
If you still haven't done so - install k3d:
with curl and bash:
or with another method of your choice listed here
In our case the component is the API server and the feature gate is called InPlacePodVerticalScaling, as can be seen here.
I'm spinning up a single-node cluster with the following config:
The Happy Path - Updating the CPU
Now let's create a pod with one container defining resource requests and limits.
You can create the pod with:
I'm using progrium/stress and setting it up for slow success by requesting a tenth of the CPU it needs and just enough memory.
stress --vm 1 --vm-bytes 128M --vm-hang 3 - this tells stress to spawn one worker that allocates 128 MB of memory and then releases it every 3 seconds.
My pod is currently allowed only 150M of memory, so I expect it to run fine.
Meanwhile, stress --cpu 1 tells the container to use one whole CPU, while it's actually allowed to use only 0.1 CPU. So it'll surely get throttled.
The container starts just fine:
After a few minutes I can also check its resource consumption by running:
It's running happily, consuming 101m of CPU and 131M of memory. All within the limits.
A Note about resizePolicy
Once we enable the InPlacePodVerticalScaling feature gate, all new pods are automatically created with a new field, resizePolicy, set for each container. If we don't set it ourselves, the default is restartPolicy: NotRequired:
This is the config I'm testing in this post. We can also set it to restartPolicy: RestartContainer, which will cause the container to be restarted when the relevant resource type is updated.
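To make the shape of the field concrete, here is a minimal sketch of a pod spec with an explicit resizePolicy, built as a Python dict and serialized to JSON (which kubectl apply -f accepts as well as YAML). The container name, the args layout and the exact resource values are assumptions pieced together from the text of this post, not a copy of the manifest actually used.

```python
import json


def stress_pod(name="stress", cpu="100m", memory="150M", restart_policy="NotRequired"):
    """Build a single-container pod manifest with an explicit resizePolicy."""
    resources = {"cpu": cpu, "memory": memory}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "stress",  # hypothetical container name
                "image": "progrium/stress",
                # assumed: the image's entrypoint is the stress binary
                "args": ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"],
                # requests == limits puts the pod in the Guaranteed QoS class
                "resources": {"requests": dict(resources), "limits": dict(resources)},
                # one entry per resource type
                "resizePolicy": [
                    {"resourceName": "cpu", "restartPolicy": restart_policy},
                    {"resourceName": "memory", "restartPolicy": restart_policy},
                ],
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(stress_pod(), indent=2))
```

Calling stress_pod(restart_policy="RestartContainer") produces the variant in which a resize restarts the container.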
Pod QoS Matters
Now let's try to increase our container's limits in-place to give it more resources and see what happens:
Oops! That didn't work!
We're getting:
So what we now know is that while we can change the values of limits and requests, we can't change the pod QoS class. I.e. the relationship between the requests and the limits has to stay consistent with the QoS class the pod started with: if we start with Guaranteed, we can’t manage requests and limits separately. And if we start with Burstable, we can’t set limits equal to requests.
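The rule is easier to see in code. Below is a simplified sketch of QoS classification and of the check it implies for a resize. It is not the API server's actual validation code: quantities are compared as plain strings, and the function names are made up for illustration.

```python
def qos_class(containers):
    """Simplified pod QoS classification.

    `containers` is a list of dicts such as
    {"requests": {"cpu": "100m", "memory": "150M"}, "limits": {...}}.
    Quantities are compared as strings, so "1" and "1000m" would not match.
    """
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        requests = c.get("requests") or {}
        limits = c.get("limits") or {}
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # a missing request defaults to the limit
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                return "Burstable"
    return "Guaranteed"


def resize_keeps_qos(old_containers, new_containers):
    """True when the resized pod stays in the QoS class it started with."""
    return qos_class(old_containers) == qos_class(new_containers)


if __name__ == "__main__":
    old = [{"requests": {"cpu": "100m", "memory": "150M"},
            "limits": {"cpu": "100m", "memory": "150M"}}]
    only_limits = [{"requests": {"cpu": "100m", "memory": "150M"},
                    "limits": {"cpu": "200m", "memory": "150M"}}]
    print(qos_class(old), qos_class(only_limits), resize_keeps_qos(old, only_limits))
```

Raising only the limits turns a Guaranteed pod into a Burstable one, which is the kind of change that gets rejected.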
Updating the Resources
Let's try to update both the requests and the limits while staying within the Guaranteed QoS:
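One way to build such an update, as a sketch: the helper below produces a patch body that sets requests and limits to the same values, and renders (without running) a kubectl patch command for it. The container name "stress" and the 200m value are assumptions for illustration only.

```python
import json
import shlex


def resize_patch(container, cpu=None, memory=None):
    """Build a patch body that sets requests and limits to the same values."""
    values = {k: v for k, v in (("cpu", cpu), ("memory", memory)) if v is not None}
    return {
        "spec": {
            "containers": [{
                "name": container,  # identifies which container to patch
                "resources": {"requests": dict(values), "limits": dict(values)},
            }]
        }
    }


def kubectl_patch_command(pod, patch):
    """Render the kubectl invocation as a shell-safe string (not executed)."""
    return "kubectl patch pod {} --patch {}".format(
        shlex.quote(pod), shlex.quote(json.dumps(patch)))


if __name__ == "__main__":
    print(kubectl_patch_command("stress", resize_patch("stress", cpu="200m")))
```

Keeping requests and limits in one dict makes it hard to change one without the other by accident, which is what the Guaranteed QoS class requires.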
If we now watch kubectl top pod stress, we will see how the container gradually gets the additional CPU time:
The CGroups Behind the Scenes
Now, being the curious cat that I am - I wanted to check how this works behind the scenes. I know there are cgroups involved in setting container resource restrictions but I like checking myself how stuff works.
The great thing with k3d is that it's very easy to get into your nodes with a simple docker exec.
Now I want to find my container and identify the path to its cgroup definition.
Find the container ID using ctr - the containerd command-line utility:
and then - find the cgroup information for my container:
which will give me something like:
The important parts here are /sys/fs/cgroup, where all the cgroup definitions are found, and the cgroupsPath, where the specific constraints for this container are defined.
You'll notice there's a hierarchy there - first we have the pod... directory and then the directory named after the container ID. This being a single-container pod, all the cgroup values will be featured in the parent folder. So that's where we're going to look.
That's right - 250 MB of memory in bytes!
And that's correct too! According to the Red Hat documentation:
The first value is the allowed time quota in microseconds for which all processes collectively in a child group can run during one period. The second value specifies the length of the period.
During a single period, when processes in a control group collectively exhaust the time specified by this quota, they are throttled for the remainder of the period and not allowed to run until the next period.
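The arithmetic behind those two files can be sketched as follows. This assumes the common default CFS period of 100000 microseconds and uses 250M and 100m purely as example values; it shows the unit conversion only, not the kubelet's or the runtime's actual code.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms

# decimal and binary suffixes used by Kubernetes memory quantities
_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9,
             "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_max(limit_millicores, period_us=CFS_PERIOD_US):
    """Return the "quota period" pair for a CPU limit given in millicores."""
    quota_us = limit_millicores * period_us // 1000
    return f"{quota_us} {period_us}"


def memory_limit_bytes(quantity):
    """Return the number of bytes a quantity such as "250M" or "250Mi" stands for."""
    # try two-letter suffixes first so "Mi" is not mistaken for "M"
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)


if __name__ == "__main__":
    print(cpu_max(100))               # a tenth of a CPU
    print(memory_limit_bytes("250M"))
```

With a 100m limit the container may run for 10000 microseconds out of every 100000, which is exactly the "tenth of a CPU" that shows up in kubectl top once throttling kicks in.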
Impact on Scheduling
Another thing I wanted to try is to update the requests to more than my node can give and check whether the scheduler will try to reschedule my pod to another node because the current one doesn't have the needed capacity.
Let's check how many CPUs my node has access to:
I got 8. So let's try to request 10 and see what happens:
Alas, while the requests got updated, nothing else happens. The pod doesn't get rescheduled or evicted. Why? No idea. Had I tried creating it with a 10 CPU request from the beginning, it would have stayed pending because there aren't any nodes large enough. So I would expect a pod with requests higher than a node can satisfy to get evicted. But maybe my thinking is flawed?
Actually, according to the official documentation, there shouldn’t be any scheduling impact. Instead the Pod status field should reflect that the current resize request is “Infeasible”. Let’s check that:
Yes - it’s reflected correctly in the status field.
Still - we now have a pod that doesn’t abide by its spec. Which is puzzling and could lead to unexpected reliability issues.
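A rough mental model of how a resize request gets classified, as a sketch only: the "Infeasible" value is the one observed above, while "Deferred" and "InProgress" are taken from the feature's documentation rather than from this experiment. The function and its parameters are hypothetical and do not mirror the kubelet's real code.

```python
def resize_status(requested_millicores, node_allocatable, other_pods_requested=0):
    """Classify a CPU resize request. All arguments are in millicores."""
    if requested_millicores > node_allocatable:
        # the node could never satisfy this, no matter what else is running
        return "Infeasible"
    if requested_millicores > node_allocatable - other_pods_requested:
        # would fit on an empty node, but not right now
        return "Deferred"
    # fits: the resize is accepted and gets applied
    return "InProgress"


if __name__ == "__main__":
    # asking for 10 CPUs on a node that has 8
    print(resize_status(10_000, node_allocatable=8_000))
```

Note that none of the branches evicts or reschedules anything, which matches what we saw: the spec changes, the status explains why nothing happened, and the pod keeps running with its old resources.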
Negating Resources
Until now all worked fine because we were only adding resources. Everybody likes having more stuff, nobody likes when stuff is taken away from them.
Let's start by taking back the CPU time we granted in the previous section:
I'm bringing the CPU requests back to 100m. Quite expectedly, in a couple of seconds kubectl top will show me that the pod's CPU consumption went down to 100m.
And the cgroup cpu.max file will get updated as expected:
But what if I try to reduce memory?
Seems to work fine. Checking the cgroups I see the config has been updated:
And what if I need to free even more memory?
Note that I'm reducing memory to 100M which should cause my container to get OOMKilled. And it seems to work:
But I see that the pod continues running!
And checking the cgroup memory.max file shows why:
The cgroup wasn't updated! Looks like something is getting in our way - protecting the container from getting less memory than it's already using. While this makes sense as a precaution - taking away memory from a running process may lead to irreversible corruption - this now leads to container limits holding an incorrect value which will surely puzzle anyone trying to understand why it's not getting OOMKilled.
I would expect some validating admission hook to tell me that memory can't be reduced. Looks like a bug to me.
Changing the resizePolicy
But what if we allow container restarts? Will the cgroup for memory get updated then?
It’s not possible to change the resizePolicy for an existing pod, so let’s create a new one:
Apply this spec by:
And now let’s reduce the memory for that restart container:
I’m setting the memory to 100m which is too low.
Pod status shows us that the resize request was actually received. After a while the container gets restarted, quite expectedly fails with RunContainerError and then goes into CrashLoopBackOff, with kubectl describe pod restart showing us that the kubelet has restarted the container but it got OOMKilled:
The puzzling thing about this is that when we look at the cgroup for the pod we see that the memory limit doesn't get updated. So it’s not totally clear what triggers the OOMKill:
Still 150M 🤷
Saving Hungry Pods
OK, we found out that, memory being an incompressible resource, we can't really reduce it in place to a value lower than what the container is already using.
But can we save an OOMing container by giving it more memory?
Let's try that with a similar pod but one that gets only 100M of memory from the get go (while trying to allocate 128):
Quite expectedly the container gets OOMKilled almost instantly:
And it will continue restarting and getting OOMKilled until we update its memory limits. So let's save it from this misery by giving it the memory it needs:
This seems to work fine:
But the pod continues getting killed:
And if we check the cgroup memory.max file, we'll see why:
Its memory limit never actually got updated!
Why? I wasn't able to find an answer for this one. Why disallow saving containers from getting killed by providing them the memory they need? I'm not aware of any technical limitations that would prevent this, and I didn't find anything in the KEP docs either.
So it looks like the only way to fix the OOMKill is still by deleting the pod and creating a new one with more memory.
Summary
In-place pod resizing is a long-awaited feature. In alpha since v1.27, it will hopefully make it to beta by v1.32, if the drawbacks and bugs get fixed.
And here are some of them I found:
- Memory can't be reduced below what is currently in use (with or without container restarts), and there's no notification about that.
- Requesting more resources than are available on the node doesn't lead to pod eviction (true for both CPU and memory).
- If a pod is getting OOMKilled, it's not possible to give it more memory to save it from getting killed.
Will these eventually get fixed? I certainly hope so. Will the feature make it to beta by v1.32? Let's keep our fingers crossed.
Something in this post isn't clear or correct? Let me know in the comments.