
Many companies regard waste from big data workloads running on Apache Spark as a cost of doing business, as inevitable as rent and taxes. At the same time, one-third of companies say they will exceed their cloud budget by up to 40%. Another survey showed a 39% year-over-year increase in cloud spend over budget. 

It doesn’t have to be this way. 


Smart organizations implement a host of FinOps activities at the cluster and application levels to remediate Spark application waste — activities such as manually tuning applications, enabling managed autoscaling, rightsizing and enabling Spark Dynamic Allocation. 

Table 1: A selection of cloud cost control approaches at the cluster and application levels 

Each is valuable individually, and these approaches are often implemented in a mix-and-match way to enhance cost-cutting efforts across a cluster. However, each remediates only a portion of cloud costs and may have other drawbacks: 

  1. Observability & Monitoring: A host of off-the-shelf software solutions can be deployed to identify and quantify waste. However, finding waste isn’t fixing waste. Waste-mitigating recommendations often generate more tasks for developers, which can become impossible to implement at scale and can redirect efforts away from innovation. 
  2. Cluster Autoscaling: Autoscaling enables instances to be added and removed automatically to match the volatile demands of Spark workloads, but it does not solve the problem that Spark itself tends to waste the resources the autoscaler requests. 
  3. Instance Rightsizing: Rightsizing matches instance resources to application requirements. This can even be done automatically in the case of Karpenter. However, rightsizing doesn’t prevent inefficient applications from creating waste, even with optimal instance types. It also does not address dynamically changing resource requirements. 
  4. Manual Application Tuning: Manual tuning pulls application resource allocations down to peak utilization requirements while still preventing applications from failing due to too few resources. But it doesn’t eliminate the waste that occurs when utilization is below its peak (which is most of the time), nor does it account for dynamically changing data characteristics. In addition, manual tuning does not scale, and developers may be reluctant to spend time tuning applications that already run well just to improve efficiency. 
  5. Spark Dynamic Allocation: Spark can be configured to request additional executors when tasks are backlogged and to remove idle executors, but this does not address the fact that each Spark executor can be arbitrarily overprovisioned and therefore arbitrarily wasteful. 
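
As a rough illustration of the settings involved in Spark Dynamic Allocation, the sketch below assembles a spark-submit command from standard Spark configuration properties. The property names are Spark's own. The values, the helper name build_spark_submit_args and the application file job.py are illustrative assumptions, not recommendations.

```python
from typing import Dict, List

# Standard Spark configuration properties. All values are illustrative assumptions.
DYNAMIC_ALLOCATION_CONF: Dict[str, str] = {
    # Let Spark grow and shrink the number of executors with the task backlog.
    "spark.dynamicAllocation.enabled": "true",
    # Spark 3.0+: track shuffle files so no external shuffle service is needed.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # Release an executor after it has been idle this long.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # The size of each executor is still fixed up front by the developer.
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
}


def build_spark_submit_args(app: str, conf: Dict[str, str]) -> List[str]:
    """Return a spark-submit argument list with one --conf pair per property."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # "job.py" is a hypothetical application file.
    print(" ".join(build_spark_submit_args("job.py", DYNAMIC_ALLOCATION_CONF)))
```

Note that dynamic allocation changes only how many executors run. The memory and cores of each executor stay at whatever was requested, which is where the overprovisioning described above can persist.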

With Spark Applications, Provisioning Often Means Overprovisioning 

The truth is that Spark applications are highly wasteful. The waste is primarily due to overprovisioning, and the good news is that it is not your fault. Overprovisioning is inherent in Spark applications due to the way resources are allocated and utilized. 

The resource utilization profile for a typical Spark application might look like the chart below, with the maximum utilization level reached for only a small fraction of the application’s execution time: 

 

Spark developers are required to request a certain allocation level of memory and CPU for each of their applications. To prevent applications from failing due to insufficient resources, developers typically request memory and CPU resources to accommodate peak usage. 

 

 

Some cost-conscious developers might try to reduce the provisioning line as low as possible, to align with peak resource requirement levels. 

 

 

 

However, even if a developer reduces the allocation level to match the application’s peak usage, they cannot ‘bend the allocation line’ to follow resource requirements that vary in real time. As a result, waste cannot be eliminated by tweaking and tuning alone. 
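
The arithmetic behind this argument can be sketched in a few lines. The usage samples and allocation levels below are hypothetical numbers chosen only to show the calculation, and static_allocation_waste is an illustrative helper, not part of Spark.

```python
from typing import Sequence


def static_allocation_waste(usage: Sequence[float], allocation: float) -> float:
    """Fraction of allocated resource-time left unused under a fixed allocation."""
    if not usage:
        raise ValueError("usage must contain at least one sample")
    if allocation < max(usage):
        raise ValueError("allocation is below peak usage; the application could fail")
    allocated = allocation * len(usage)  # fixed allocation held for every interval
    used = sum(usage)                    # what the application actually consumed
    return (allocated - used) / allocated


# Hypothetical memory usage in GB, sampled at equal intervals.
usage_gb = [4, 6, 16, 8, 4, 2]

# A cautious request well above the peak.
overprovisioned = static_allocation_waste(usage_gb, allocation=24)
# A request tuned down to exactly the peak.
tuned_to_peak = static_allocation_waste(usage_gb, allocation=max(usage_gb))

print(f"waste with 24 GB requested: {overprovisioned:.1%}")   # 72.2%
print(f"waste when tuned to peak:   {tuned_to_peak:.1%}")     # 58.3%
```

In this made-up profile, tuning down to the peak helps, but more than half of the allocated memory-time is still unused because the peak is reached in only one interval.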

The Solution: Real-Time Cost Optimization 

Real-time cost optimization (RTCO) empowers businesses to eliminate this Spark application waste once and for all. RTCO systems are: 

  • Dynamic: Responding in real time to ever-changing application requirements 
  • Intelligent: Reallocating resources according to proven algorithms 
  • Immediate: Delivering results within minutes or hours, not days or weeks 
  • Continuous: Working around the clock, just as your business does 
  • Autonomous: Freeing developers for innovation. 

RTCO systems essentially ‘bend the allocation line’ to conform to actual utilization on a second-by-second basis, enabling a near-ideal utilization scenario. The difference between the old peak-based allocation level and the new RTCO-powered level can be as much as 47%.
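
To make the idea of bending the allocation line concrete, here is a toy comparison. It is an idealized sketch, not the algorithm of any RTCO product: it assumes the allocation can be resized instantly at every sample, and the 10% headroom, the usage samples and the function names are all hypothetical.

```python
from typing import List, Sequence


def tracking_allocation(usage: Sequence[float], headroom: float = 0.1) -> List[float]:
    """Allocation that follows each usage sample plus a safety margin."""
    return [u * (1 + headroom) for u in usage]


def reduction_vs_peak(usage: Sequence[float], headroom: float = 0.1) -> float:
    """Fraction of resource-time saved compared with holding the peak throughout."""
    peak_total = max(usage) * len(usage)
    tracked_total = sum(tracking_allocation(usage, headroom))
    return 1 - tracked_total / peak_total


# Hypothetical memory usage in GB, sampled at equal intervals.
usage_gb = [4, 6, 16, 8, 4, 2]

reduction = reduction_vs_peak(usage_gb, headroom=0.1)
print(f"allocated resource-time reduced by {reduction:.1%}")  # 54.2%
```

The result depends entirely on the shape of the made-up usage curve and says nothing about the savings a real system would achieve. The flatter the curve, the less there is to gain from tracking it.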

 

 

Case Study: Apache Spark on Amazon EMR 

Cloud-forward companies are beginning to adopt RTCO to eliminate application waste. One such enterprise is Autodesk, a global leader in design and manufacturing software. Autodesk’s goal was to reduce costs by 50% by increasing capacity and rightsizing compute for the company’s Apache Spark on Amazon EMR applications. With RTCO, Autodesk optimized their business results and successfully reduced Amazon EC2 costs by over 50%. 
