Posts Tagged ‘SpringIO2025’
[SpringIO2025] A cloud cost saving journey: Strategies to balance CPU for containerized JAVA workloads in K8s
Lecturer
Laurentiu Marinescu is a Lead Software Engineer at ASML, specializing in building resilient, cloud-native platforms with a focus on full-stack development. With expertise in problem-solving and software craftsmanship, he serves as a tech lead responsible for next-generation cloud platforms at ASML. He holds a degree from the Faculty of Economic Cybernetics and is an advocate for pair programming and emerging technologies. Ajith Ganesan is a System Engineer at ASML with over 15 years of experience in software solutions, particularly in lithography process control applications. His work emphasizes data platform requirements and strategy, with a strong interest in AI opportunities. He holds degrees from Eindhoven University of Technology and is passionate about system design and optimization.
Abstract
This article investigates strategies for optimizing CPU resource utilization in Kubernetes environments for containerized Java workloads, emphasizing cost reduction and performance enhancement. It analyzes the trade-offs in resource allocation, including requests and limits, and presents data-driven approaches to minimize idle CPU cycles. Through examination of workload characteristics, scaling mechanisms, and JVM configurations, the discussion highlights practical implementations that balance efficiency, stability, and operational expenses in on-premises deployments.
Contextualizing Cloud Costs and CPU Utilization Challenges
The escalating costs of cloud infrastructure represent a significant challenge for organizations deploying containerized applications. Annual expenditures on cloud services have surpassed $600 billion, with many entities exceeding budgets by over 17%. In Kubernetes clusters, average CPU utilization hovers around 10%, even in large-scale environments exceeding 1,000 CPUs, where it reaches only 17%. This underutilization implies that up to 90% of provisioned resources remain idle, akin to maintaining expensive infrastructure on perpetual standby.
The inefficiency stems not from collective oversight but from inherent design trade-offs. Organizations deploy expansive clusters to ensure capacity for peak demands, yet this leads to substantial idle resources. The opportunity lies in reclaiming these for cost savings; even doubling utilization to 20% could yield significant reductions. This requires understanding application behaviors, load profiles, and the interplay between Kubernetes scheduling and Java Virtual Machine (JVM) dynamics.
In simulated scenarios with balanced nodes and containers, tight packing minimizes rollout costs but introduces risks. For instance, with limited spare capacity (e.g., 25% headroom), containers must be upgraded sequentially, which slows rollouts and can preclude zero-downtime deployments. Scaling demands may fail due to resource constraints, necessitating cluster expansions that inflate expenses. These examples underscore the need for strategies that optimize utilization without compromising reliability.
Resource Allocation Strategies: Requests, Limits, and Workload Profiling
Effective CPU management in Kubernetes hinges on judicious setting of resource requests and limits. Requests guarantee minimum allocation for scheduling, while limits cap maximum usage to prevent monopolization. For Java workloads, these must align with JVM ergonomics, which adapt heap and thread pools based on detected CPU cores.
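This container awareness can be inspected directly. As a minimal sketch in plain Java (no framework dependencies), the following prints the CPU count and heap ceiling the JVM derived from its environment:

```java
public class ErgonomicsProbe {
    public static void main(String[] args) {
        // Inside a container, this reflects the cgroup CPU limit (or an
        // explicit -XX:ActiveProcessorCount), not the host's core count.
        int cpus = Runtime.getRuntime().availableProcessors();
        // The default max heap is derived from the container memory limit
        // (25% of it by default, tunable via -XX:MaxRAMPercentage).
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM sees " + cpus + " CPU(s), max heap " + maxHeapMb + " MB");
    }
}
```

Because heap sizing and internal thread pools follow these detected values, mismatched requests/limits silently reshape JVM behavior.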
Workload profiling is essential, categorizing applications into mission-critical (requiring deterministic latency) and non-critical (tolerant of variability). In practice, counterintuitively, reducing requests by up to 75% for critical workloads enhanced performance by allowing burstable access to idle resources. Experiments demonstrated halved hardware, energy, and real estate costs, with improved stability.
A binary-search procedure identified optimal request values, but assumptions—such as non-simultaneous peaks—were validated through rigorous testing. For non-critical applications, minimal requests (sharing 99% of resources) maximized utilization. Scaling based on application-specific metrics, rather than default CPU thresholds, proved superior. For example, autoscaling on heap usage or queue sizes avoided premature scaling triggered by garbage collection spikes.
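A metric-driven autoscaler of this kind can be sketched with the standard autoscaling/v2 API, assuming a custom metric (here hypothetically named `jvm_memory_used_ratio`) is exported through a Prometheus metrics adapter:

```yaml
# Illustrative HPA scaling on a JVM heap metric instead of the default
# CPU threshold; metric name and target values are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: java-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: jvm_memory_used_ratio   # assumed custom metric
        target:
          type: AverageValue
          averageValue: "800m"          # i.e., 0.8 heap-used ratio
```

Scaling on heap or queue depth sidesteps the GC-spike problem, since short CPU bursts no longer masquerade as sustained load.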
Code example for configuring Kubernetes resources in a Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
        - name: app
          image: java-app:latest
          resources:
            requests:
              cpu: "500m"  # Reduced request for sharing
            limits:
              cpu: "2"     # Expanded limit for bursts
This configuration enables overcommitment, assuming workload diversity prevents concurrent peaks.
JVM and Application-Level Optimizations for Efficiency
Java workloads introduce unique considerations due to JVM behaviors like garbage collection (GC) and thread management. Default JVM settings often lead to inefficiencies; for instance, GC pauses can spike CPU usage, triggering unnecessary scaling. Tuning collectors (e.g., ZGC for low-latency) and limiting threads reduced contention.
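As a hedged illustration of such tuning, the container's JVM options might be injected through the environment; the specific flag values below are examples to be calibrated against measured workloads, not recommendations:

```shell
# Illustrative container JVM options: low-latency ZGC, an explicit CPU
# count for ergonomics, and heap sized to the container memory limit.
JAVA_TOOL_OPTIONS="-XX:+UseZGC -XX:ActiveProcessorCount=2 -XX:MaxRAMPercentage=75.0"
```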
Servlet containers like Tomcat exhibited high overhead; profiling revealed excessive thread creation. Switching to Undertow, with its non-blocking I/O, halved resource usage while maintaining throughput. Reactive applications benefited from Netty, leveraging asynchronous processing for better utilization.
Thread management is critical: unbounded queues in executors caused out-of-memory errors under load. Implementing bounded queues with rejection policies ensured stability. For example:
import java.util.concurrent.ThreadPoolExecutor;

import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Bean
public ThreadPoolTaskExecutor executor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);   // Limit core threads
    executor.setMaxPoolSize(20);
    executor.setQueueCapacity(50);  // Bounded queue guards against OOM
    executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
    return executor;
}
Monitoring tools like Prometheus and Grafana facilitated iterative tuning, adapting to evolving workloads.
Cluster-Level Interventions and Success Metrics
Cluster-wide optimizations complement application-level efforts. Overcommitment, by reducing requests while expanding limits, smoothed resource contention. Pre-optimization graphs showed erratic throttling; post-optimization, latency decreased 10-20%, with 7x more requests handled.
Success hinged on validating assumptions through experiments. Despite risks of simultaneous scaling, diverse workloads ensured viability. Continuous monitoring—via vulnerability scans and metrics—enabled proactive adjustments.
Key metrics included reduced throttling, stabilized performance, and halved costs. Policies at namespace and node levels aligned with overcommitment strategies, incorporating backups for node failures.
Implications for Sustainable Infrastructure Management
Optimizing CPU for Java in Kubernetes demands balancing trade-offs: determinism versus sharing, cost versus performance. Strategies emphasize application understanding, JVM tuning, and adaptive scaling. While mission-critical apps benefit from resource sharing under validated assumptions, non-critical ones maximize efficiency with minimal requests.
Future implications involve AI-driven predictions for peak avoidance, enhancing sustainability by reducing energy consumption. Organizations must iterate: monitor, fine-tune, adapt—treating efficiency as a dynamic goal.
[SpringIO2025] Real-World AI Patterns with Spring AI and Vaadin by Marcus Hellberg / Thomas Vitale
Lecturer
Marcus Hellberg is the Vice President of AI Research at Vaadin, a company specializing in tools for Java developers to build web applications. As a Java Champion with nearly 20 years of experience in Java and web development, he focuses on integrating AI capabilities into Java ecosystems. Thomas Vitale is a software engineer at Systematic, a Danish software company, with expertise in cloud-native solutions, Java, and AI. He is the author of “Cloud Native Spring in Action” and an upcoming book on developer experience on Kubernetes, and serves as a CNCF Ambassador.
- Marcus Hellberg on LinkedIn
- Marcus Hellberg on GitHub
- Thomas Vitale on LinkedIn
- Thomas Vitale on GitHub
Abstract
This article examines practical patterns for incorporating artificial intelligence into Java applications using Spring AI and Vaadin, transitioning from experimental to production-ready implementations. It analyzes techniques for memory management, guardrails, multimodality, retrieval-augmented generation, tool calling, and agents, with implications for security, user experience, and system integration. Insights emphasize robust, observable AI workflows in on-premises or cloud environments.
Memory Management and Streaming in AI Interactions
Integrating large language models (LLMs) into applications requires addressing their stateless nature, where each interaction lacks inherent context from prior exchanges. Spring AI provides advisors—interceptor-like mechanisms—to augment prompts with conversation history, enabling short-term memory. For instance, a MessageChatMemoryAdvisor retains the last N messages, ensuring continuity without manual tracking.
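The sliding-window idea behind such an advisor can be sketched in plain Java. This is a simplified stand-in for illustration, not the Spring AI implementation: retain only the last N messages and prepend them to each outgoing prompt.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified stand-in for a message-window chat memory: retains the
// last N messages so each new prompt can carry recent context.
public class WindowedChatMemory {
    private final int maxMessages;
    private final Deque<String> window = new ArrayDeque<>();

    public WindowedChatMemory(int maxMessages) {
        this.maxMessages = maxMessages;
    }

    public void add(String message) {
        window.addLast(message);
        while (window.size() > maxMessages) {
            window.removeFirst();   // evict the oldest message beyond the window
        }
    }

    public List<String> history() {
        return List.copyOf(window);
    }
}
```

An advisor built on this idea would prepend history() to the prompt before every model call, giving the stateless model the appearance of memory.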
This pattern enhances user interactions in chat-based interfaces, built here with Vaadin’s component model for server-side Java UIs. A vertical layout hosts message lists and inputs, injecting a ChatClient.Builder to construct clients with advisors. Basic interactions involve prompting the model and appending responses, but for realism, streaming via reactive fluxes improves responsiveness, subscribing to token streams and updating the UI progressively.
Code illustration:
ChatClient chatClient = builder.build();
messageInput.addSubmitListener(submitEvent -> {
    String message = submitEvent.getMessage();
    messageList.addMessage("You", message);
    MessageItem replyItem = messageList.addMessage("Assistant", "");
    chatClient.stream(new Prompt(message))
        .subscribe(response -> {
            // Append each streamed token to the assistant's message item
            replyItem.append(response.getResult().getOutput().getContent());
        });
});
Streaming suits verbose responses, reducing perceived latency, while observability integrations (e.g., OpenTelemetry) trace interactions for debugging nondeterministic behaviors.
Guardrails for Security and Validation
AI workflows must mitigate risks like sensitive data leaks or invalid outputs. Input guardrails intercept prompts, using on-premises models to check for compliance with policies, blocking unauthorized queries (e.g., personal information). Output guardrails validate responses, reprompting for corrections if deserialization fails.
Advisors enable this: a default advisor with a local chat model filters inputs/outputs. For example, querying an address might be blocked if flagged, preventing cloud exposure. This ensures determinism in structured outputs, converting unstructured text to Java objects via JSON instructions.
Implications include privacy preservation in regulated sectors and integration with Spring Security for role-based tool access.
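A minimal input-guardrail sketch in plain Java follows; a static blocklist stands in for the on-premises classifier described above, and the patterns and messages are purely illustrative:

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Simplified input guardrail: scan a prompt against policy patterns
// before it is ever forwarded to a cloud-hosted model.
public class InputGuardrail {
    private static final List<Pattern> BLOCKED = List.of(
        Pattern.compile("(?i)\\bpassword\\b"),
        Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b")   // SSN-like pattern
    );

    /** Returns a rejection reason, or empty if the prompt may proceed. */
    public static Optional<String> check(String prompt) {
        for (Pattern p : BLOCKED) {
            if (p.matcher(prompt).find()) {
                return Optional.of("Prompt blocked by policy: " + p.pattern());
            }
        }
        return Optional.empty();
    }
}
```

In the talk's setup the same check is performed by a local model via an advisor, so flagged prompts never leave the premises.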
Multimodality and Retrieval-Augmented Generation
LLMs extend beyond text through multimodality, processing images, audio, or videos. Spring AI’s entity methods augment prompts for structured extraction, e.g., parsing attendee details from images into tables for programmatic use.
Retrieval-augmented generation (RAG) combats hallucinations by embedding external data as vectors in stores like PostgreSQL. A RetrievalAugmentationAdvisor retrieves relevant documents via similarity search, augmenting prompts. Customizations allow empty contexts for fallback to model knowledge.
Example:
VectorStore vectorStore = /* PostgreSQL vector store */;
DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
    .vectorStore(vectorStore)
    .build();
RetrievalAugmentationAdvisor advisor = RetrievalAugmentationAdvisor.builder()
    .documentRetriever(retriever)
    .queryAugmenter(ContextualQueryAugmenter.builder().allowEmptyContext(true).build())
    .build();
This pattern grounds responses in proprietary data, with thresholds controlling retrieval scope.
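Under the hood, similarity search ranks stored document embeddings by cosine similarity to the query embedding. A plain-Java sketch of that ranking step, with toy vectors standing in for real embedding output:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Toy similarity search: rank documents by cosine similarity of their
// embedding vectors to a query vector, returning the top-k matches.
public class SimilaritySearch {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static List<String> topK(Map<String, double[]> docs, double[] query, int k) {
        return docs.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, double[]> e) -> -cosine(e.getValue(), query)))
            .limit(k)
            .map(Map.Entry::getKey)
            .toList();
    }
}
```

A similarity threshold on the cosine score is what lets a retriever return an empty context, triggering the fallback behavior configured above.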
Tool Calling, Agents, and Dynamic Integrations
Tool calling empowers LLMs as agents, invoking external functions for tasks like database queries. Annotations describe tools, passed to clients for dynamic selection. For products, a service might expose query/update methods:
@Tool(description = "Fetch products from database")
public List<Product> getProducts(@ToolParam(description = "Category filter") String category) {
    // Database query
}
Agents orchestrate tools, potentially via Model Context Protocol for external services. Demonstrations include theme generation from screenshots, editing CSS via file system tools, highlighting nondeterminism and the need for safeguards.
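The dispatch loop behind tool calling can be sketched in plain Java; the toy registry below stands in for the framework's tool resolution, and the tool names and outputs are illustrative only:

```java
import java.util.Map;
import java.util.function.Function;

// Toy tool registry: the model emits a tool name plus an argument, the
// runtime looks the tool up, invokes it, and feeds the result back into
// the conversation for the model's next turn.
public class ToolRegistry {
    private final Map<String, Function<String, String>> tools = Map.of(
        "getProducts", category -> "[products in category: " + category + "]",
        "getWeather", city -> "[forecast for " + city + "]"
    );

    public String dispatch(String toolName, String argument) {
        Function<String, String> tool = tools.get(toolName);
        if (tool == null) {
            // Guard against hallucinated tool names from the model
            return "error: unknown tool " + toolName;
        }
        return tool.apply(argument);
    }
}
```

The unknown-tool guard matters in practice: because model output is nondeterministic, a production dispatcher must treat every requested tool name as untrusted input.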
In conclusion, these patterns enable production AI, emphasizing modularity, security, and observability for robust Java applications.
[SpringIO2025] Spring I/O 2025 Keynote
Lecturer
The keynote features Spring leadership: Juergen Hoeller (Framework Lead), Rossen Stoyanchev (Web), Ana Maria Mihalceanu (AI), Moritz Halbritter (Boot), Mark Paluch (Data), Josh Long (Advocate), Mark Pollack (Messaging). Collectively, they steer the Spring portfolio’s technical direction and community engagement.
Abstract
The keynote unveils Spring Framework 7.0 and Boot 4.0, establishing JDK 21 and Jakarta EE 11 as baselines while advancing AOT compilation, virtual threads, structured concurrency, and AI integration. Live demonstrations and roadmap disclosures illustrate how these enhancements—combined with refined observability, web capabilities, and data access—position Spring as the preeminent platform for cloud-native Java development.
Baseline Evolution: JDK 21 and Jakarta EE 11
Spring Framework 7.0 mandates JDK 21, embracing virtual threads for lightweight concurrency and records for immutable data carriers. Jakarta EE 11 introduces the Core Profile and CDI Lite, trimming enterprise bloat. The demonstration showcases a virtual thread-per-request web handler processing 100,000 concurrent connections with minimal heap, contrasting traditional thread pools. This baseline shift enables native image compilation via Spring AOT, reducing startup to milliseconds and memory footprint by 90%.
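The thread-per-request model described here rests on JDK 21 virtual threads. A minimal plain-JDK sketch (task count reduced for illustration; not the keynote demo itself):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Each task gets its own virtual thread; blocking calls park the
// virtual thread without pinning an OS thread, so large numbers of
// concurrent "requests" remain cheap in memory.
public class VirtualThreadDemo {
    public static void main(String[] args) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(10);   // simulated blocking I/O
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                });
            }
        }   // close() waits for all submitted tasks to finish
        System.out.println("Completed: " + completed.get());
    }
}
```

The same blocking style under a platform-thread pool would need thousands of OS threads; here the carrier threads are multiplexed by the JDK scheduler.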
AOT and Native Image Optimization
Spring Boot 4.0 refines AOT processing through Project Leyden integration, pre-computing bean definitions and proxy classes at build time. Native executables start up in under 50ms, suitable for serverless platforms. The live demo compiles a Kafka Streams application to GraalVM native image, achieving sub-second cold starts and 15MB RSS—transforming deployment economics for event-driven microservices.
AI Integration and Modern Web Capabilities
Spring AI matures with function calling, tool integration, and vector database support. A live-coded agent retrieves beans from a running context to answer natural language queries about application metrics. WebFlux embraces structured concurrency, with virtual threads standing in for Schedulers.boundedElastic() and simplifying reactive code. The demonstration contrasts traditional Mono/Flux composition with straightforward sequential logic executing on virtual threads, preserving backpressure while improving readability.
Data, Messaging, and Observability Advancements
Spring Data advances R2DBC connection pooling and Redis Cluster native support. Spring for Apache Kafka 4.0 introduces configurable retry templates and Micrometer metrics out-of-the-box. Unified observability aggregates metrics, traces, and logs: Prometheus exposes 200+ Kafka client metrics, OpenTelemetry correlates spans across HTTP and Kafka, and structured logging propagates MDC context. A Grafana dashboard visualizes end-to-end latency from REST ingress to database commit, enabling proactive incident response.
Community and Future Trajectory
The keynote celebrates Spring’s global community, highlighting contributions to null-safety (JSpecify), virtual thread testing, and AOT hint generation. Planned enhancements include JDK 23 support, Project Panama integration for native memory access, and AI-driven configuration validation. The vision positions Spring as the substrate for the next decade of Java innovation, balancing cutting-edge capabilities with backward compatibility.