Coming into 2023, continuous, cost-saving optimization of Kubernetes environments has become a top priority for many DevOps and Platform Engineering teams. These teams are looking to reduce costs associated with Kubernetes without impacting the performance or stability of their applications. However, after evaluating solutions on the market, or even trying to build something homegrown, many have experienced suboptimal results.
For example, they get costs under control, only for costs to begin spiking again a few weeks later, putting them right back at square one. Or they take actions that reduce cost, but are uncertain how many resources they can safely cut. If they cut too deep, the result is poor performance and system outages.
The PerfectScale team came from companies using Kubernetes at scale and suffered from these issues firsthand, which is why PerfectScale came into existence. A little over a year ago, we started on a journey to solve this problem by building a solution that simplifies the continuous optimization process. Before we wrote a single line of code, we wanted to validate this pain point with other industry professionals.
We began an extensive process of interviewing Kubernetes professionals across various industries. After over 50 interviews, our findings showed us that, yes, optimizing Kubernetes is a major struggle across the board. Not only were they struggling to keep their costs under control, but they were also continually firefighting performance and resilience issues across their entire environment. You might expect that it’s an either-or kind of situation, but in reality, every organization suffers from both issues.
Gaps in the Kubernetes Continuous Optimization Toolset
The organizations we spoke with had a variety of monitoring and observability solutions in use across their teams, and the more cost-conscious ones even had tools specifically for their FinOps team. It would seem that leveraging these solutions, or potentially using their data to build something homegrown, would provide the information they need to easily and effectively optimize their Kubernetes environment. However, this is hardly ever the case.
There were consistent themes as to why their efforts had failed across companies, so we created a breakdown of common gaps we heard:
1. They focus on the Pets instead of the Cattle.
Observability and monitoring tools are great at giving you the health and performance metrics of a particular entity, which could be a node, a pod, a microservice, etc. With Kubernetes (and containerized applications in general), this can be problematic as you are managing thousands, if not tens of thousands, of entities that are running across numerous nodes and continue to scale up and down to meet the load demands. To efficiently optimize your environment, you need a holistic view of your entire environment (cattle) that allows you to drill into a particular entity (pet) when it needs your attention.
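As a toy illustration of that "cattle first, then pets" workflow (the pod data and helper functions here are hypothetical, not PerfectScale's implementation), the idea is to roll per-pod waste up into a cluster-wide view, and only then drill into the workloads that deserve individual attention:

```python
# Illustrative sketch: aggregate per-pod memory waste into a
# cluster-wide ("cattle") view, then surface the worst offenders
# ("pets"). All numbers are made up for the example.
from collections import defaultdict

pods = [
    # (workload, memory request MiB, actual usage MiB)
    ("checkout", 1024, 300),
    ("checkout", 1024, 320),
    ("search", 512, 480),
    ("search", 512, 470),
    ("ads", 2048, 256),
]

def cluster_view(pods):
    """Roll per-pod over-provisioning up to per-workload totals (MiB)."""
    waste = defaultdict(int)
    for workload, request, usage in pods:
        waste[workload] += max(request - usage, 0)
    return dict(waste)

def top_offenders(waste, n=2):
    """Drill down: the workloads wasting the most memory cluster-wide."""
    return sorted(waste.items(), key=lambda kv: kv[1], reverse=True)[:n]

totals = cluster_view(pods)
print(top_offenders(totals))  # [('ads', 1792), ('checkout', 1428)]
```

Note that per-pod dashboards would show "search" and "ads" side by side as individual entities; only the aggregated view reveals that a single under-utilized "ads" replica wastes more memory than both "search" replicas combined.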
2. Lack of business-oriented prioritization.
As stated above, you are using Kubernetes to manage thousands of services, and your goal is to keep them optimized (i.e. providing the desired experience at the lowest possible cost). But there are only 40 work hours in a week. To efficiently optimize your environment, you need to spend your time on the areas that are causing the biggest problems. This is impossible with the current toolset for many organizations.
Let's look at a couple of scenarios:
- Peak load performance issues: You may be plagued with alert storms for issues happening during peak load times, but the issues “self-resolve” as load decreases. This doesn’t mean the issue is fixed; it means the issue only occurs at the times when your application is getting the most traffic. This can be devastating to your business’s revenue stream. So how can you prioritize fixing an issue that is here one second, then gone the next?
- Autoscaling multiplies waste exponentially: Say you have a service or a pod that is over-provisioned by half a gigabyte of memory, which doesn't seem like much waste on its own, but this service has 100 replicas running 24/7, turning that half gigabyte into 50 GiB of continuously wasted memory. How do we detect the combined impact of this waste so we can properly prioritize and remediate the issues?
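The replica-multiplication effect above is simple arithmetic, but putting a price on it makes the prioritization argument concrete. A back-of-the-envelope sketch (the per-GiB memory price is an assumption for illustration, not a real cloud rate):

```python
# Hypothetical cloud memory price used for illustration only.
GIB_PRICE_PER_HOUR = 0.005  # $/GiB-hour (assumed)

def monthly_waste_cost(overprov_gib, replicas, hours=730):
    """Monthly cost of over-provisioned memory across all replicas."""
    return overprov_gib * replicas * hours * GIB_PRICE_PER_HOUR

# Half a GiB of slack looks trivial for one pod...
print(monthly_waste_cost(0.5, 1))    # ~1.8 $/month
# ...but across 100 replicas running 24/7 it adds up fast.
print(monthly_waste_cost(0.5, 100))  # ~182.5 $/month
```

The same half-GiB misconfiguration that is invisible on a per-pod dashboard becomes a three-figure monthly line item once the replica count is factored in, which is exactly why waste needs to be prioritized by aggregate impact.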
3. Hard to handle all the cooks in the kitchen.
It takes a village to raise a Kubernetes-based application the right way. You start with the dev team writing the code, DevOps/Platform teams ensuring the day-to-day operations, SREs working to keep the application performant and available, and FinOps trying to keep costs in check. Each team has its own set of tools, which makes it very hard to align, collaborate, and govern when changes need to be made to optimize your environment.
4. Solutions aren’t balancing cost and performance.
Some tools do offer decent solutions to optimize and reduce costs. They usually recommend “quick win” changes, like leveraging Spot or Reserved instances or moving to cheaper nodes. These recommendations are important and valid; however, they only address the tip of the iceberg and don’t dive into the bigger problem: how to properly size your pods and nodes to build cost-effective and resilient systems.
True optimization requires you to properly adjust the resources allocated to your individual services to reduce costs without jeopardizing performance. There are several solutions that promise these results, but it’s important to be mindful of how they are determining what adjustments should be made. Kubernetes is a complex platform with many layers, and if the solution is not doing the proper analysis of your systems, recommendations can do much more harm than good.
Here are two examples (out of many) that we collected from our customers who used competitive solutions before switching to PerfectScale:
- The analysis looks at a single pod and not at all its replicas, leading to recommendations based on partial data.
- The analysis does not examine load trends properly and relies on average load patterns or other “basic” heuristics. This is dangerous because it flattens the data, producing cost-saving recommendations that ultimately degrade performance and availability. To avoid this scenario, you must constantly compare multiple graphs so you don’t lose important data, whether the spikes or the averages.
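The "data flattening" trap from the second point can be sketched with a few made-up numbers (this is an illustration, not PerfectScale's algorithm): a memory request sized from the average looks efficient, while a high percentile preserves the peak-load spikes that actually matter.

```python
# Illustration with hypothetical usage samples: why averaging
# "flattens" away the spikes that a high percentile preserves.

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    s = sorted(samples)
    k = max(round(p / 100 * len(s)) - 1, 0)
    return s[k]

# Hourly memory usage (MiB) for one replica: quiet most of the day,
# with two short peak-traffic spikes.
usage = [300] * 22 + [900, 950]

mean_request = sum(usage) / len(usage)  # ~352 MiB: looks "efficient"
p99_request = percentile(usage, 99)     # 950 MiB: covers the spikes

# A request sized from the mean is exceeded during every spike,
# risking OOM kills exactly when traffic (and revenue) peaks.
print(mean_request < max(usage), p99_request >= max(usage))  # True True
```

A mean-based recommendation here would cut the request by nearly two-thirds and look like a big win on a cost dashboard, while guaranteeing memory pressure during the two hours the application is busiest.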
5. Not built for K8s continuous optimization.
The biggest issue is how the current toolset makes optimization a time-consuming and mundane activity. Each service needs to be evaluated and optimized manually and on a frequent basis. This could require one or multiple team members’ complete attention for weeks out of each quarter, which is something most teams cannot spare. Even if you had the time and manpower to pull this off, is this really the best way for your team to spend their time?
Additionally, many situations require you to make decisions quickly before performance issues impact the end-user or wasteful configuration drains the budget. The time it takes to manually identify problems, diagnose the root cause, and remediate these issues can result in devastating effects on your business.
Proactively and continuously optimizing large-scale Kubernetes environments has grown beyond human scale to accomplish efficiently, and organizations of all sizes need a better solution to simplify, automate, and govern the process.
PerfectScale: Innovating and Simplifying Kubernetes Optimization
My partners and I decided to bring PerfectScale to life because we had felt the shortcomings of the existing toolset in our own daily work, and we saw an opportunity to use our expertise to build an innovative solution that makes a meaningful difference in the lives of Kubernetes users.
Our mission sounds simple, but it’s far from trivial: we want to help our customers get the most out of Kubernetes, easily and effectively!
Our first target is to streamline how companies continuously optimize Kubernetes without the time-consuming and repetitive operations work. This gives teams a simplified way to ensure Kubernetes is resourced properly, keeping applications running with peak performance at the lowest cost possible.
We take a unique approach to provide actionable intelligence that allows teams to easily and efficiently optimize their Kubernetes environment. With PerfectScale, you get:
- Comprehensive Kubernetes Optimization. PerfectScale is the leading solution for improving the performance and stability of your environment, while continually optimizing your costs. Our solution aligns with the Well-Architected Framework, helping our customers improve the reliability, performance efficiency, cost-effectiveness, and environmental sustainability of their applications.
- Predictive and proactive intelligence. PerfectScale’s predictive, AI-guided intelligence understands the load patterns of your environment, helping you instantly pinpoint misconfigurations and act quickly to optimize your environment before performance or resilience issues impact the end user, or over-provisioned resources accumulate meaningful and costly waste.
- A completely integrated experience. PerfectScale aligns directly with your established workflows, making continuous optimization a simplified and virtually effortless task. Our recommended configuration changes can be turned into tickets or pull requests with a single click of a button, or automatically executed for effortless optimization.
PerfectScale provides results instantly, helping you ensure application performance and availability while cutting wasted resources, cloud costs, and carbon emissions. Our solution is no-code and can be implemented in a matter of minutes with a minimal footprint. Start your continuous optimization journey today with a free, 30-day trial of PerfectScale!