Many companies regard waste from big data workloads running on Apache Spark as a cost of doing business, as inevitable as rent and taxes. At the same time, one-third of companies say they will exceed their cloud budget by up to 40%. Another survey showed a 39% year-over-year increase in cloud spend over budget.
It doesn't have to be this way.
Smart organizations implement a host of FinOps activities at the cluster and application levels to remediate Spark application waste: activities such as manually tuning applications, enabling managed autoscaling, rightsizing instances, and enabling Spark Dynamic Allocation.
Table 1: A selection of cloud cost control approaches at the cluster and application levels
Each is valuable individually, and these approaches are often implemented in a mix-and-match way to enhance cost-cutting efforts across a cluster. However, each addresses only a portion of cloud waste and may have other drawbacks:
- Observability & Monitoring: A host of off-the-shelf software solutions can be deployed to identify and quantify waste. However, finding waste isn't fixing waste. Waste-mitigating recommendations often generate more tasks for developers, which can become impossible to implement at scale and can redirect effort away from innovation.
- Cluster Autoscaling: Autoscaling adds and removes instances automatically to match the volatile demands of Spark workloads, but it does not solve the underlying problem: Spark itself tends to waste the resources the autoscaler requests.
- Instance Rightsizing: Rightsizing matches instance resources to application requirements, and can even be automated with tools such as Karpenter. However, rightsizing doesn't prevent inefficient applications from creating waste, even with optimal instance types, and it does not address dynamically changing resource requirements.
- Manual Application Tuning: Manual tuning pulls application resource allocations down to peak utilization requirements. It can prevent applications from failing due to too few resources, but it doesn't eliminate the waste that occurs when the utilization curve is below its peak (which is most of the time), nor does it account for dynamically changing data characteristics. In addition, manual tuning does not scale, and developers may resist spending time tuning well-running applications for improved efficiency.
- Spark Dynamic Allocation: Spark can be configured to add more tasks to executors and kill idle executors whenever possible (a brief configuration sketch follows this list), but this does not address the fact that Spark executors can be arbitrarily overprovisioned and therefore arbitrarily wasteful.
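To make that last point concrete, the sketch below shows how Dynamic Allocation is typically enabled from PySpark. The executor counts, sizes, and timeouts are illustrative placeholders, not recommendations; note that every executor is still provisioned at whatever size the developer requested, which is exactly the per-executor overprovisioning Dynamic Allocation cannot fix.

```python
from pyspark.sql import SparkSession

# Illustrative PySpark session with Spark Dynamic Allocation enabled.
# All numeric values are placeholders, not tuning recommendations.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")
    # Let Spark grow and shrink the executor count between these bounds.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    # Release executors that sit idle for 60 seconds.
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    # Spark 3.x: track shuffle files so executors can be removed safely;
    # older versions need an external shuffle service instead.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Each executor is still created at this fixed size, so an oversized
    # executor spec stays wasteful even when the executor count scales down.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```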
With Spark Applications, Provisioning Often Means Overprovisioning
The truth is, Spark applications are highly wasteful, and the waste is primarily due to overprovisioning. The good news is that it is not your fault: overprovisioning is inherent in Spark applications because of the way resources are allocated and utilized.
In the resource utilization profile of a typical Spark application, the maximum utilization level is reached for only a small fraction of the application's execution time.
Spark developers are required to request a certain allocation level of memory and CPU for each of their applications. To prevent applications from failing due to insufficient resources, developers typically request memory and CPU resources to accommodate peak usage.
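A minimal, purely illustrative PySpark example of this peak-based provisioning is shown below; the instance counts and sizes are invented for illustration. Every value is held for the entire run, whether or not the job needs it at any given moment.

```python
from pyspark.sql import SparkSession

# Illustrative peak-based provisioning: every executor is sized for the
# worst-case stage of the job, and that allocation is held from start to
# finish. All numbers are placeholders, not recommendations.
spark = (
    SparkSession.builder
    .appName("peak-provisioned-job")
    .config("spark.executor.instances", "40")      # enough executors for the largest stage
    .config("spark.executor.memory", "16g")        # sized for the biggest memory spike
    .config("spark.executor.memoryOverhead", "2g") # extra headroom "just in case"
    .config("spark.executor.cores", "5")
    .getOrCreate()
)
```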
Some cost-conscious developers try to reduce this provisioned allocation as low as possible, aligning it with the application's peak resource requirement.
However, even if a developer reduces the allocation level to match the peak the application actually needs, they cannot effectively "bend the allocation line" in real time to follow resource requirements as they vary. As a result, waste cannot be eliminated by tweaking and tuning alone.
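The shape of the problem can be made concrete with a small, purely illustrative Python sketch. The utilization trace below is invented rather than measured, but it shows how a peak-sized allocation leaves a large gap between what is reserved and what is actually used, and roughly how much an allocation that tracked utilization could recover.

```python
# Hypothetical executor-memory utilization (GB) sampled over ten equal
# time slices of a single Spark job. These numbers are invented.
utilization_gb = [20, 35, 60, 110, 115, 118, 70, 40, 25, 15]

# "Perfectly tuned" static allocation: sized exactly to the observed peak.
peak_allocation_gb = max(utilization_gb)
allocated_area = peak_allocation_gb * len(utilization_gb)  # reserved GB-slices
used_area = sum(utilization_gb)                            # consumed GB-slices

waste_fraction = 1 - used_area / allocated_area
print(f"Waste with a peak-sized static allocation: {waste_fraction:.0%}")

# An allocation that tracked utilization slice by slice with, say, 10%
# headroom shrinks the reserved area toward the utilization curve itself.
tracking_area = sum(u * 1.10 for u in utilization_gb)
savings = 1 - tracking_area / allocated_area
print(f"Potential reduction from a tracking allocation: {savings:.0%}")
```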
The Solution: Real-Time Cost Optimization
Real-time cost optimization (RTCO) empowers businesses to eliminate this Spark application waste once and for all. RTCO systems are:
- Dynamic: Responding in real time to ever-changing application requirements
- Intelligent: Reallocating resources according to proven algorithms
- Immediate: Delivering results within minutes or hours, not days or weeks
- Continuous: Working around the clock, just as your business does
- Autonomous: Freeing developers for innovation
RTCO systems essentially "bend the allocation line" so that it conforms, second by second, to actual utilization, enabling a near-ideal utilization scenario. The difference between the old peak-based allocation level and the new RTCO-powered level can be as much as 47%.
Case Study: Apache Spark on Amazon EMR
Cloud-forward companies are beginning to adopt RTCO to eliminate application waste. One such enterprise is Autodesk, a global leader in design and manufacturing software. Autodesk's goal was to reduce costs by 50% by increasing capacity and rightsizing compute for the company's Apache Spark on Amazon EMR applications. With RTCO, Autodesk optimized its business results and reduced Amazon EC2 costs by over 50%.