
[SpringIO2025] A cloud cost saving journey: Strategies to balance CPU for containerized Java workloads in K8s

Lecturers

Laurentiu Marinescu is a Lead Software Engineer at ASML, specializing in building resilient, cloud-native platforms with a focus on full-stack development. With expertise in problem-solving and software craftsmanship, he serves as a tech lead responsible for next-generation cloud platforms at ASML. He holds a degree from the Faculty of Economic Cybernetics and is an advocate for pair programming and emerging technologies.

Ajith Ganesan is a System Engineer at ASML with over 15 years of experience in software solutions, particularly in lithography process control applications. His work emphasizes data platform requirements and strategy, with a strong interest in AI opportunities. He holds degrees from Eindhoven University of Technology and is passionate about system design and optimization.

Abstract

This article investigates strategies for optimizing CPU resource utilization in Kubernetes environments for containerized Java workloads, emphasizing cost reduction and performance enhancement. It analyzes the trade-offs in resource allocation, including requests and limits, and presents data-driven approaches to minimize idle CPU cycles. Through examination of workload characteristics, scaling mechanisms, and JVM configurations, the discussion highlights practical implementations that balance efficiency, stability, and operational expenses in on-premises deployments.

Contextualizing Cloud Costs and CPU Utilization Challenges

The escalating costs of cloud infrastructure represent a significant challenge for organizations deploying containerized applications. Annual expenditures on cloud services have surpassed $600 billion, with many entities exceeding budgets by over 17%. In Kubernetes clusters, average CPU utilization hovers around 10%, even in large-scale environments exceeding 1,000 CPUs, where it reaches only 17%. This underutilization implies that up to 90% of provisioned resources remain idle, akin to maintaining expensive infrastructure on perpetual standby.

The inefficiency stems not from collective oversight but from inherent design trade-offs. Organizations deploy expansive clusters to ensure capacity for peak demands, yet this leads to substantial idle resources. The opportunity lies in reclaiming these for cost savings; even doubling utilization to 20% could yield significant reductions. This requires understanding application behaviors, load profiles, and the interplay between Kubernetes scheduling and Java Virtual Machine (JVM) dynamics.

In simulated scenarios with balanced nodes and containers, tight packing minimizes rollout costs but introduces risks. For instance, limited spare capacity (e.g., 25% headroom) forces containers to be upgraded one at a time, slowing rollouts and jeopardizing zero-downtime deployments. Scaling demands may fail due to resource constraints, necessitating cluster expansions that inflate expenses. These examples underscore the need for strategies that optimize utilization without compromising reliability.

Resource Allocation Strategies: Requests, Limits, and Workload Profiling

Effective CPU management in Kubernetes hinges on judicious setting of resource requests and limits. Requests guarantee minimum allocation for scheduling, while limits cap maximum usage to prevent monopolization. For Java workloads, these must align with JVM ergonomics, which adapt heap and thread pools based on detected CPU cores.
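
To make that alignment concrete, the snippet below is a minimal sketch (not taken from the talk) showing how the processor count used by JVM ergonomics can be pinned through JAVA_TOOL_OPTIONS, so that heap sizing and default thread pools follow the container's CPU limit rather than the node's core count; the flag values and image name are illustrative assumptions.

# Sketch: align JVM ergonomics with the container's CPU allocation (illustrative values)
containers:
- name: app
  image: java-app:latest
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:ActiveProcessorCount=2 -XX:MaxRAMPercentage=75.0"  # match the CPU limit; cap heap as a share of container memory
  resources:
    limits:
      cpu: "2"  # keep in step with ActiveProcessorCount

Recent JVMs (10 and later) already derive available processors from cgroup limits, so the explicit flag mainly removes ambiguity when requests and limits diverge.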

Workload profiling is essential, categorizing applications into mission-critical (requiring deterministic latency) and non-critical (tolerant of variability). In practice, reducing requests by up to 75% for critical workloads counterintuitively enhanced performance by allowing burstable access to idle resources. Experiments demonstrated halved hardware, energy, and real estate costs, along with improved stability.

A binary-search approach identified optimal request values, but its assumptions, such as peaks not occurring simultaneously, were validated through rigorous testing. For non-critical applications, minimal requests (sharing 99% of resources) maximized utilization. Scaling based on application-specific metrics, rather than default CPU thresholds, proved superior. For example, autoscaling on heap usage or queue sizes avoided premature scaling triggered by garbage collection spikes.

Code example for configuring Kubernetes resources in a Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
      - name: app
        image: java-app:latest
        resources:
          requests:
            cpu: "500m"  # Reduced request for sharing
          limits:
            cpu: "2"     # Expanded limit for bursts

This configuration enables overcommitment, assuming workload diversity prevents concurrent peaks.
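
Building on the metric-driven scaling point above, the following sketch assumes a metrics adapter (for example, the Prometheus Adapter) exposes a per-pod queue-size metric; the metric name executor_queue_size and the target value are hypothetical, not taken from the talk.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: java-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: executor_queue_size   # hypothetical custom metric served by a metrics adapter
      target:
        type: AverageValue
        averageValue: "40"          # illustrative threshold

Scaling on a backlog signal like this avoids the premature scale-outs that GC-induced CPU spikes can trigger.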

JVM and Application-Level Optimizations for Efficiency

Java workloads introduce unique considerations due to JVM behaviors like garbage collection (GC) and thread management. Default JVM settings often lead to inefficiencies; for instance, GC pauses can spike CPU usage, triggering unnecessary scaling. Tuning collectors (e.g., ZGC for low-latency) and limiting threads reduced contention.
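
As a hedged illustration of such tuning (flag values are assumptions, not the speakers' exact settings), the collector and its thread counts can be set through the same container environment mechanism:

env:
- name: JAVA_TOOL_OPTIONS
  value: "-XX:+UseZGC -XX:ConcGCThreads=1 -XX:ParallelGCThreads=2"  # low-latency collector with capped GC threads

Capping GC threads keeps collection work from saturating a small CPU allocation, at the cost of somewhat longer concurrent cycles.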

Servlet containers like Tomcat exhibited high overhead; profiling revealed excessive thread creation. Switching to Undertow, with its non-blocking I/O, halved resource usage while maintaining throughput. Reactive applications benefited from Netty, leveraging asynchronous processing for better utilization.

Thread management is critical: unbounded queues in executors caused out-of-memory errors under load. Implementing bounded queues with rejection policies ensured stability. For example:

import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class ExecutorConfig {
    @Bean
    public ThreadPoolTaskExecutor executor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);   // Limit core threads
        executor.setMaxPoolSize(20);    // Cap burst threads
        executor.setQueueCapacity(50);  // Bounded queue prevents unbounded memory growth
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy()); // Backpressure: caller runs the task when saturated
        return executor;
    }
}

Monitoring tools like Prometheus and Grafana facilitated iterative tuning, adapting to evolving workloads.
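
One common way to feed such dashboards from a Spring Boot service is to expose Micrometer metrics through the Actuator Prometheus endpoint; the sketch below assumes spring-boot-starter-actuator and micrometer-registry-prometheus are on the classpath, and the application tag is illustrative.

# application.yml: expose the Prometheus scrape endpoint
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  metrics:
    tags:
      application: java-app   # illustrative tag for filtering dashboards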

Cluster-Level Interventions and Success Metrics

Cluster-wide optimizations complement application-level efforts. Overcommitment, by reducing requests while expanding limits, smoothed resource contention. Pre-optimization graphs showed erratic throttling; post-optimization, latency decreased 10-20%, with 7x more requests handled.

Success hinged on validating assumptions through experiments. Despite risks of simultaneous scaling, diverse workloads ensured viability. Continuous monitoring—via vulnerability scans and metrics—enabled proactive adjustments.

Key metrics included reduced throttling, stabilized performance, and halved costs. Policies at namespace and node levels aligned with overcommitment strategies, incorporating backups for node failures.
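
One way to express such a namespace-level policy, sketched here with illustrative values rather than the speakers' actual settings, is a LimitRange that defaults new containers to small requests and a generous limit, consistent with the overcommitment approach:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-overcommit-defaults
  namespace: java-workloads   # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "250m"   # small guaranteed share
    default:
      cpu: "2"      # generous ceiling for bursts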

Implications for Sustainable Infrastructure Management

Optimizing CPU for Java in Kubernetes demands balancing trade-offs: determinism versus sharing, cost versus performance. Strategies emphasize application understanding, JVM tuning, and adaptive scaling. While mission-critical apps benefit from resource sharing under validated assumptions, non-critical ones maximize efficiency with minimal requests.

Future implications involve AI-driven predictions for peak avoidance, enhancing sustainability by reducing energy consumption. Organizations must iterate: monitor, fine-tune, adapt—treating efficiency as a dynamic goal.

Links: