
Building Resilient Architectures: Patterns That Survive Failure

How to design systems that gracefully degrade, recover quickly, and scale under pressure.

1) Patterns for Graceful Degradation

When dependencies fail, your system should still provide partial service. Examples:

  • Show cached product data if the pricing service is down.
  • Allow “read-only” mode if writes are failing.
  • Provide degraded image quality if the CDN is unavailable.
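These fallbacks share one shape: try the primary dependency, and serve a cached or reduced result when it fails. A minimal sketch in Python (the fetch callable, cache policy, and the `stale` flag are illustrative assumptions, not a specific library API):

```python
import time

class PriceService:
    """Serve live prices, degrading to a recent cached price on outage."""

    def __init__(self, fetch_live_price, cache_ttl_s=300):
        self._fetch = fetch_live_price   # callable that may raise on outage
        self._cache = {}                 # product_id -> (price, timestamp)
        self._ttl = cache_ttl_s

    def get_price(self, product_id):
        try:
            price = self._fetch(product_id)
            self._cache[product_id] = (price, time.time())
            return {"price": price, "stale": False}
        except Exception:
            cached = self._cache.get(product_id)
            if cached and time.time() - cached[1] < self._ttl:
                # Degraded mode: last known price, flagged as stale
                # so the UI can say so.
                return {"price": cached[0], "stale": True}
            raise  # no usable fallback: fail explicitly
```

The same try/fallback structure covers the read-only and degraded-image cases: the fallback just returns a reduced capability instead of cached data.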

2) Circuit Breakers

Prevent cascading failures with a circuit breaker, e.g. Resilience4j (its predecessor Hystrix is now in maintenance mode):

@CircuitBreaker(name = "inventoryService", fallbackMethod = "fallbackInventory")
public Inventory getInventory(String productId) {
    return restTemplate.getForObject("/inventory/" + productId, Inventory.class);
}

public Inventory fallbackInventory(String productId, Throwable t) {
    return new Inventory(productId, 0);
}

3) Retries with Backoff

Retries should be bounded and spaced out:

@Retry(name = "paymentService", fallbackMethod = "fallbackPayment")
public PaymentResponse processPayment(PaymentRequest req) {
    return restTemplate.postForObject("/pay", req, PaymentResponse.class);
}

RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)
    // waitDuration and intervalFunction are mutually exclusive; use the
    // jittered variant: 200 ms initial delay, x2 multiplier, 50% randomization
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.5))
    .build();

4) Scaling Microservices in Kubernetes/ECS

Scaling is not just replicas—it’s smart policies:

  • Kubernetes HPA: Scale pods based on CPU or custom metrics (e.g., p95 latency).
    kubectl autoscale deployment api --cpu-percent=70 --min=3 --max=10
  • ECS: Use Service Auto Scaling with CloudWatch alarms on queue depth.
  • Pre-warm caches: Scale up before big events (e.g., Black Friday).

Fixing the “Failed to Setup IP tables” Error in Docker on WSL2

TL;DR:
If you see this error when running Docker on Windows Subsystem for Linux (WSL2):

ERROR: Failed to Setup IP tables: Unable to enable SKIP DNAT rule:
(iptables failed: iptables --wait -t nat -I DOCKER -i br-xxxx -j RETURN:
iptables: No chain/target/match by that name. (exit status 1))

👉 The cause is usually that your system is using the nftables backend for iptables, but Docker expects the legacy backend.
Switching iptables to legacy mode and restarting Docker fixes it:

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy

Then restart Docker and verify:

sudo iptables -t nat -L

You should now see the DOCKER chain listed. ✅


🔍 Understanding the Problem

When Docker starts, it configures internal network bridges using iptables.
If it cannot find or manipulate its DOCKER chain, you’ll see this “Failed to Setup IP tables” error.
This problem often occurs in WSL2 environments, where the Linux kernel uses the newer nftables system by default, while Docker still relies on the legacy iptables interface.

In short:

  • iptables-nft (default in modern WSL2) ≠ iptables-legacy (expected by Docker)
  • The mismatch causes Docker to fail to configure NAT and bridge rules

⚙️ Step-by-Step Fix

1️⃣ Check which iptables backend you’re using

sudo iptables --version
sudo update-alternatives --display iptables

If you see something like iptables v1.8.x (nf_tables), you’re using nftables.

2️⃣ Switch to legacy mode

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy

Confirm the change:

sudo iptables --version

Now it should say (legacy).

3️⃣ Restart Docker

If you’re using Docker Desktop for Windows:

wsl --shutdown
net stop com.docker.service
net start com.docker.service

or simply quit and reopen Docker Desktop.

If you’re running Docker Engine inside WSL:

sudo service docker restart

4️⃣ Verify the fix

sudo iptables -t nat -L

You should now see the DOCKER chain among the NAT rules:

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

If it appears — congratulations 🎉 — your Docker networking is fixed!


🧠 Extra Troubleshooting Tips

  • If the error persists, flush and rebuild the NAT table:
    sudo service docker stop
    sudo iptables -t nat -F
    sudo iptables -t nat -X
    sudo service docker start
    
  • Check kernel modules (for completeness):
    lsmod | grep iptable
    sudo modprobe iptable_nat
    
  • Keep Docker Desktop and WSL2 kernel up to date — many network issues are fixed in newer builds.

✅ Summary

Step             Command                                    Goal
Check backend    sudo iptables --version                    Identify nft vs legacy
Switch mode      sudo update-alternatives --set ... legacy  Use legacy backend
Restart Docker   sudo service docker restart                Reload NAT rules
Verify           sudo iptables -t nat -L                    Confirm DOCKER chain exists

🚀 Conclusion

This “Failed to Setup IP tables” issue is one of the most frequent Docker-on-WSL2 networking errors.
The root cause lies in the nftables vs legacy backend mismatch — a subtle but critical difference in Linux networking subsystems.
Once you switch to the legacy backend and restart Docker, everything should work smoothly again.

By keeping your WSL2 kernel, Docker Engine, and iptables configuration aligned, you can prevent these issues and maintain a stable developer environment on Windows.

Happy containerizing! 🐋

SRE Principles: From Error Budgets to Everyday Reliability

How to define, measure, and improve reliability with concrete metrics, playbooks, and examples you can apply this week.

In a world where users expect instant, uninterrupted access, reliability is a feature. Site Reliability Engineering (SRE) brings engineering discipline to operations with a toolkit built on error budgets, SLIs/SLOs, and automation. This post turns those ideas into specifics: exact metrics, alert rules, dashboards, code and infra changes, and a lightweight maturity model you can use to track progress.


1) What Is SRE Culture?

1.1 Error Budgets: A Contract Between Speed and Stability

An error budget is the amount of unreliability you are willing to tolerate over a period. It converts reliability targets into engineering freedom.

  • Example: SLO = 99.9% availability over 30 days → error budget = 0.1% unavailability.
  • Translation: Over 30 days (~43,200 minutes), you may “spend” up to 43.2 minutes of downtime before freezing risky changes.
  • Policy: If the budget is heavily spent (e.g., >60%), restrict deployments to reliability fixes until burn rate normalizes.
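The arithmetic above is worth making mechanical so the freeze decision is policy, not debate. A small sketch (the 60% threshold and the 26 minutes of spend are the example figures from the bullets above):

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed by an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

budget = error_budget_minutes(0.999)    # 43.2 minutes over 30 days
spent = 26.0                            # minutes of downtime so far (example)
freeze_deploys = spent / budget > 0.60  # policy trigger from the bullet above
```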

1.2 SLIs & SLOs: A Common Language

SLI (Service Level Indicator) is a measured metric; SLO (Service Level Objective) is the target for that metric.

  • Availability: SLI = % successful requests (non-5xx and within timeout); SLO = 99.9% over 30 days. Define failure modes clearly (timeouts, 5xx, dependency errors).
  • Latency: SLI = p95 end-to-end latency (ms); SLO = ≤ 300 ms (p95), ≤ 800 ms (p99). Track server time and total time (incl. downstream calls).
  • Error Rate: SLI = failed / total requests; SLO = < 0.1% over a rolling 30 days. Include client cancels/timeouts if user-impacting.
  • Durability: SLI = data-loss incidents; SLO = 0 incidents / year. Backups and restore drills must be part of policy.

1.3 Automation Over Manual Ops

  • Automated delivery: CI/CD with canary or blue–green, automated rollback on SLO breach.
  • Self-healing: Readiness/liveness probes; restart on health failure; auto-scaling based on SLI-adjacent signals (e.g., queue depth, p95 latency).
  • Runbooks & ChatOps: One-click actions (flush cache keyspace, rotate credentials, toggle feature flag) with audit trails.

2) How Do You Measure Reliability?

2.1 Availability (“The Nines”)

SLO        Max Downtime / Year    Per 30 Days
99.0%      ~3d 15h                ~7h 12m
99.9%      ~8h 46m                ~43m
99.99%     ~52m 34s               ~4m 19s
99.999%    ~5m 15s                ~26s

2.2 Latency (Percentiles, Not Averages)

Track p50/p90/p95/p99. Averages hide tail pain. Tie your alerting to user-impacting percentiles.

  • API example: p95 ≤ 300 ms, p99 ≤ 800 ms during business hours; relaxed after-hours SLOs if business permits.
  • Queue example: p99 time-in-queue ≤ 2s; backlog < 1,000 msgs for >99% of intervals.

2.3 Error Rate

Define “failed” precisely: HTTP 5xx, domain-level errors (e.g., “payment declined” may be success from a platform perspective but failure for a specific business flow—track both).

2.4 Example SLI Formulas

# Availability SLI
availability = successful_requests / total_requests

# Latency SLI
latency_p95 = percentile(latency_ms, 95)

# Error Rate SLI
error_rate = failed_requests / total_requests
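The pseudo-formulas above can be made runnable directly; a pure-Python sketch using a nearest-rank percentile (in production these come from your metrics backend, not in-process lists):

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of numbers (0 <= p <= 100)."""
    s = sorted(values)
    idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[idx]

def compute_slis(latencies_ms, successful, failed):
    """Availability, p95 latency, and error rate from raw counts."""
    total = successful + failed
    return {
        "availability": successful / total,
        "latency_p95": percentile(latencies_ms, 95),
        "error_rate": failed / total,
    }
```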

2.5 SLO-Aware Alerting (Burn-Rate Alerts)

Alert on error budget burn rate, not just raw thresholds.

  • Fast burn: 2% budget in 1 hour → page immediately (could exhaust daily budget).
  • Slow burn: 10% budget in 24 hours → open a ticket, investigate within business hours.
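Those thresholds translate into burn rates: how many times faster than "exactly on budget" you are consuming the budget. A sketch of the arithmetic for a 30-day SLO window, using the 2%/1h and 10%/24h pairs above:

```python
def burn_rate_threshold(budget_fraction, alert_window_h, slo_window_h=30 * 24):
    """Burn rate at which budget_fraction of the error budget is
    consumed within alert_window_h hours."""
    return budget_fraction * slo_window_h / alert_window_h

fast_burn = burn_rate_threshold(0.02, 1)    # page: ~14.4x the budget rate
slow_burn = burn_rate_threshold(0.10, 24)   # ticket: ~3x the budget rate

# Observed burn rate for an availability SLO:
#   burn_rate = measured_error_rate / (1 - slo_target)
# e.g. a sustained 1.44% error rate against a 99.9% SLO is a burn
# rate of 14.4, which trips the fast-burn page.
```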

3) How Do You Improve Reliability?

3.1 Code Fixes (Targeted, Measurable)

  • Database hot paths: Add missing index, rewrite N+1 queries, reduce chatty patterns; measure p95 improvement before/after.
  • Memory leaks: Fix long-lived caches, close resources; verify with heap usage slope flattening over 24h.
  • Concurrency: Replace blocking I/O with async where appropriate; protect critical sections with timeouts and backpressure.

3.2 Infrastructure Changes

  • Resilience patterns: circuit breaker, retry with jittered backoff, bulkheads, timeouts per dependency.
  • Scaling & HA: Multi-AZ / multi-region, min pod counts, HPA/VPA policies; pre-warm instances ahead of known peaks.
  • Graceful degradation: Serve cached results, partial content, or fallback modes when dependencies fail.

3.3 Observability Enhancements

  • Tracing: Propagate trace IDs across services; sample at dynamic rates during incidents.
  • Dashboards: One SLO dashboard per service showing SLI, burn rate, top 3 error classes, top 3 slow endpoints, dependency health.
  • Logging: Structure logs (JSON); include correlation IDs; ensure PII scrubbing; add request_id, tenant_id, release labels.

3.4 Reliability Improvement Playbook (Weekly Cadence)

  1. Review SLO attainment & burn-rate charts.
  2. Pick top 1–2 user-visible issues (tail latency spike, recurring 5xx).
  3. Propose one code fix and one infra/observability change.
  4. Deploy via canary; compare SLI before/after; document result.
  5. Close the loop: update runbooks, tests, alerts.

4) Incident Response: From Page to Postmortem

4.1 During the Incident

  • Own the page: acknowledge within minutes; post initial status (“investigating”).
  • Stabilize first: roll back most recent release; fail over; enable feature flag fallback.
  • Collect evidence: time-bounded logs, key metrics, traces; snapshot dashboards.
  • Comms: update stakeholders every 15–30 minutes until stable.

4.2 After the Incident (Blameless Postmortem)

  • Facts first: timeline, impact, user-visible symptoms, SLIs breached.
  • Root cause: 5 Whys; include contributing factors (alerts too noisy, missing runbook).
  • Actions: 1–2 short-term mitigations, 1–2 systemic fixes; assign owners and due dates.
  • Learning: update tests, add guardrails (pre-deploy checks, SLO gates), improve dashboards.

5) Common Anti-Patterns (and What to Do Instead)

  • Anti-pattern: Alert on every 5xx spike → Do this: alert on SLO burn rate and user-visible error budgets.
  • Anti-pattern: One giant “golden dashboard” → Do this: concise SLO dashboard + deep-dive panels per dependency.
  • Anti-pattern: Manual runbooks that require SSH → Do this: ChatOps / runbook automation with audit logs.
  • Anti-pattern: Deploying without rollback plans → Do this: canary, blue–green, auto-rollback on SLO breach.
  • Anti-pattern: No load testing → Do this: regular synthetic load/chaos drills tied to SLOs.

6) A 30-Day Quick Start

  1. Week 1: Define 2–3 SLIs and SLOs; publish error budget policy.
  2. Week 2: Build SLO dashboard; create two burn-rate alerts (fast/slow).
  3. Week 3: Add tracing to top 3 endpoints; implement circuit breaker + timeouts for the noisiest dependency.
  4. Week 4: Run a game day (controlled failure); fix 2 gaps found; document runbooks.

7) Concrete Examples & Snippets

7.1 Example SLI Prometheus (pseudo-metrics)

# Availability SLI
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Error Rate SLI
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency p95 (histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

7.2 Burn-Rate Alert (illustrative)

# Fast-burn: page if 2% of the monthly budget is burned in 1 hour
# Slow-burn: open a ticket if 10% is burned over 24 hours
# (Use your SLO window and target to compute rates)

7.3 Resilience Config (Java + Resilience4j sketch)

// Circuit breaker + retry with jittered backoff
CircuitBreakerConfig cb = CircuitBreakerConfig.custom()
  .failureRateThreshold(50f)
  .waitDurationInOpenState(Duration.ofSeconds(30))
  .permittedNumberOfCallsInHalfOpenState(5)
  .slidingWindowSize(100)
  .build();

RetryConfig retry = RetryConfig.custom()
  .maxAttempts(3)
  // jittered exponential backoff (waitDuration and intervalFunction are
  // mutually exclusive; configure only one)
  .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.2))
  .build();

7.4 Kubernetes Health Probes

livenessProbe:
  httpGet: { path: /health/liveness, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /health/readiness, port: 8080 }
  initialDelaySeconds: 10
  periodSeconds: 5

8) Lightweight SRE Maturity Model

  • Level 1: Awareness. Practices: basic monitoring, ad-hoc on-call, manual deployments. Add next: define SLIs/SLOs, create an SLO dashboard, add canary deploys.
  • Level 2: Control. Practices: burn-rate alerts, incident runbooks, partial automation. Add next: tracing, circuit breakers, chaos drills, auto-rollback.
  • Level 3: Optimization. Practices: enforced error budget policy, game days, automated rollbacks. Add next: multi-region resilience, SLO-gated releases, org-wide error budgets.

9) Sample Reliability OKRs

  • Objective: Improve checkout service reliability without slowing delivery.
    • KR1: Availability SLO from 99.5% → 99.9% (30-day window).
    • KR2: Reduce p99 latency from 1,200 ms → 600 ms at p95 load.
    • KR3: Cut incident MTTR from 45 min → 20 min via runbook automation.
    • KR4: Implement canary + auto-rollback for 100% of releases.

Conclusion

Reliability isn’t perfection—it’s disciplined trade-offs. By anchoring work to error budgets, articulating SLIs/SLOs that reflect user experience, and investing in automation, observability, and resilient design, teams deliver systems that users trust—and engineers love operating.

Next step: Pick one service. Define two SLIs and one SLO. Add a burn-rate alert and a rollback plan. Measure, iterate, and share the wins.

[DotJs2025] Supercharge Web Performance with Shared Dictionaries: The Next Frontier in HTTP Compression

In an era where digital payloads traverse global networks at breakneck speeds, the subtle art of data compression remains a cornerstone of efficient web delivery, often overlooked amid flashier optimizations. Antoine Caron, engineering manager for frontend teams at Scaleway, reignited this vital discourse at dotJS 2025, advocating for shared dictionaries as a transformative leap in HTTP efficiency. With a keen eye on performance bottlenecks, Antoine dissected how conventional compressors like Gzip and Brotli falter on repetitive assets, only to unveil a protocol that leverages prior transfers as reference tomes, slashing transfer volumes by up to 70% in real-world scenarios. This isn’t arcane theory; it’s a pragmatic evolution, already piloted in Chrome and poised for broader adoption via emerging standards.

Antoine’s clarion call stemmed from stark realities unearthed in the Web Almanac: a disconcerting fraction of sites neglect even basic compression, forfeiting gigabytes in needless transit. A Wikipedia load sans Gzip drags versus its zipped twin, a 15% velocity boon; jQuery’s minified bulk evaporates over 50KB under maximal squeeze, a 70% payload purge sans semantic sacrifice. Yet, Brotli’s binary prowess, while superior for static fare, stumbles on dynamic deltas—vendor bundles morphing across deploys. Enter shared dictionary compression: an HTTP extension where browsers cache antecedent responses as compression glossaries, enabling servers to encode novelties against these baselines. For jQuery’s trek from v3.6 to v3.7, mere 8KB suffices; YouTube’s quarterly refresh yields 70% thrift, prior payloads priming the pump.

This mechanism, rooted in Google’s erstwhile SDCH (Shared Dictionary Compression over HTTP) and revived in IETF drafts like Compression Dictionary Transport, marries client-side retention with server-side savvy. Chrome’s 2024 rollout—flagged under chrome://flags/#shared-dictionary-compression—harnesses Zstandard or Brotli atop these shared tomes, with Microsoft Edge’s ZSDCH echoing for HTTPS. Antoine emphasized pattern matching: regex directives tag vendor globs, caching layers sequester these corpora, subsequent fetches invoking them via headers like Dictionary: . Caveats abound—staticity’s stasis, cache invalidation’s curse—but mitigations like periodic refreshes or hybrid fallbacks preserve robustness.

Antoine’s vision extends to edge cases: CDN confederacies propagating dictionaries, mobile’s miserly bandwidths reaping richest rewards. As Interop 2025 mandates cross-browser parity—Safari and Firefox intent-to-ship signaling convergence—this frontier beckons builders to audit headers, prototype pilots, and pioneer payloads’ parsimony. In a bandwidth-beleaguered world, shared dictionaries don’t merely optimize; they orchestrate a leaner, more equitable web.

The Mechanics of Mutual Memory

Antoine unraveled the protocol’s weave: clients stash responses in a dedicated echelon, servers probe via Accept-Dictionary headers, encoding diffs against these reservoirs. Brotli’s static harbors, once rigid, now ripple with runtime references—Zstd’s dynamism amplifying for JS behemoths. Web Almanac’s diagnostics affirm: uncompressed ubiquity persists, yet 2025’s tide, per Chrome’s telemetry, portends proliferation.

Horizons of Header Harmony

Drafts delineate transport: dictionary dissemination via prior bodies or external anchors, invalidation via etags or TTLs. Antoine’s exhortation: audit via Lighthouse, experiment in canaries—Scaleway’s vantage yielding vendor variances tamed. As specs solidify, this symbiosis promises payloads pared, performance propelled.


[DotJs2024] Embracing Reactivity: Signals Unveiled in Modern Web Frameworks

As web architectures burgeon in intricacy, the quest for fluid state orchestration intensifies, demanding primitives that harmonize intuition with efficiency. Ruby Jane Cabagnot, an Oslo-based full-stack artisan and co-author of Practical Enterprise React, illuminated this quest at dotJS 2024. With a portfolio spanning cloud services and DevOps, Ruby dissected signals’ ascendancy in frameworks like SolidJS and Svelte, tracing their lineage from Knockout’s observables to today’s compile-time elixirs. Her exposition: a clarion call for developers to harness these sentinels, streamlining reactivity while amplifying responsiveness.

Ruby’s odyssey commenced with historical moorings: Knockout’s MVVM pioneered observables, auto-propagating UI tweaks; AngularJS echoed with bidirectional bonds, model-view symphonies. React’s virtual DOM and hooks refined declarative flows, context cascades sans impurity. Yet, SolidJS and Svelte pioneer signals—granular beacons tracking dependencies, updating solely perturbed loci. In Solid, createSignal births a reactive vessel: name tweaks ripple to inputs, paragraphs—minimal footprint, maximal sync. Svelte compiles bindings at build: $: value directives weave reactivity into markup, runtime overhead evaporated.

Vue’s ref system aligns, signals as breath-easy bindings. Ruby extolled their triad: intuitiveness supplants boilerplate bazaars; performance prunes needless re-renders, DOM diffs distilled; developer delight via declarative purity, codebases crystalline. Signals transcend UIs, infiltrating WebAssembly’s server tides, birthing omnipresent reactivity. Ruby’s entreaty: probe these pillars, propel paradigms where apps pulse as dynamically as their environs.

Evolutionary Echoes of Reactivity

Ruby retraced trails: Knockout’s observables ignited auto-updates; AngularJS’s bonds synchronized realms. React’s hooks democratized context; Solid/Svelte’s signals granularize, compile-time cunning curbing cascades—name flux mends markup sans wholesale refresh.

Signals’ Synergies in Action

Solid’s vessels auto-notify dependents; Svelte’s directives distill runtime to essence. Vue’s refs render reactivity reflexive. Ruby rejoiced: libraries obsolete, renders refined, ergonomics elevated—crafting canvases concise, performant, profound.


[AWSReInventPartnerSessions2024] Demystifying AI-First Organizational Identity: Strategic Pathways and Operational Frameworks for Enterprise Transformation

Lecturer

Beth Torres heads strategic accounts for Eviden within the Atos Group, facilitating client alignment with artificial intelligence transformation initiatives. Kevin Davis serves as CTO of the AWS business group at Eviden, architecting machine learning operations and generative operations platforms. Eric Trell functions as AWS Cloud lead for Atos, optimizing hybrid and multi-cloud infrastructures.

Abstract

This scholarly examination articulates the distinction between conventional artificial intelligence adoption and genuine AI-first organizational identity, wherein intelligence permeates decision-making, customer engagement, and product architecture. It contrasts startup-native implementations with enterprise retrofitting, delineates MLOps/GenOps operational frameworks, and establishes ethical governance across model construction, deployment guardrails, and continuous monitoring. Cloud-enabled legacy data accessibility emerges as a pivotal enabler, alongside considerations for responsible artificial intelligence stewardship.

Conceptual Differentiation: AI Adoption versus AI-First Organizational Paradigm

The progression from cloud-first to AI-first organizational models necessitates embedding artificial intelligence as foundational infrastructure rather than peripheral augmentation. Whereas startups construct products with intelligence intrinsically woven throughout, established enterprises frequently append capabilities—exemplified by chatbot overlays—onto legacy systems.

AI-first identity manifests through operational preparedness: strategic platforms enabling accelerated use-case development by abstracting foundational complexities including data acquisition, quality assurance, and infrastructure provisioning. Artificial Intelligence Centers of Excellence institutionalize this preparedness, directing resources toward rapid return-on-investment validation through structured experimentation.

MLOps and GenOps frameworks streamline model lifecycle management at enterprise scale, addressing data integrity, ethical transparency, and governance requirements. Cloud-first positioning substantially facilitates this transition; mainframe-resident operational data, previously inaccessible for generative applications, becomes replicable to AWS environments without comprehensive modernization.

Ethical Governance and Technical Enablement Mechanisms

Responsible artificial intelligence necessitates multilayered ethical consideration. A tripartite framework structures this responsibility:

During model construction, training corpora undergo scrutiny for bias, provenance, and representativeness. Deployment guardrails leverage AWS-native capabilities to enforce content policies and contextual grounding. Continuous monitoring implements anomaly detection with predefined response protocols, calibrated according to interface interactivity levels.

# Conceptual Bedrock guardrail implementation (field names are illustrative,
# not the exact boto3 API)
import boto3

bedrock = boto3.client('bedrock-runtime')
guardrail = {
    'contentPolicy': [{'blockedTopics': ['prohibited-content']}],
    'contextualGrounding': True
}
response = bedrock.invoke_model(
    modelId='anthropic.claude-3',
    body=prompt,
    guardrailConfig=guardrail
)

Security compartmentalization within Bedrock preserves data isolation for sensitive domains such as healthcare. Production readiness extends beyond prompt efficacy to encompass data validation, accuracy verification, and misinformation mitigation within innovation toolchains.

Strategic Ramifications and Transformation Imperatives

AI-first positioning defends against startup disruption by enabling comparable innovation velocity. Ethical frameworks safeguard reputational integrity while ensuring output reliability. Cloud-mediated legacy data accessibility democratizes generative capabilities across historical systems.

Organizational consequences include systematic competitive advantage through intelligence-permeated operations, regulatory alignment via auditable governance, and cultural evolution toward experimentation-driven development. The paradigm compels reevaluation of educational curricula to incorporate technology ethics as core competency.


[DevoxxUK2025] Concerto for Java and AI: Building Production-Ready LLM Applications

At DevoxxUK2025, Thomas Vitale, a software engineer at Systematic, delivered an inspiring session on integrating generative AI into Java applications to enhance his music composition process. Combining his passion for music and software engineering, Thomas showcased a “composer assistant” application built with Spring AI, addressing real-world use cases like text classification, semantic search, and structured data extraction. Through live coding and a musical performance, he demonstrated how Java developers can leverage large language models (LLMs) for production-ready applications, emphasizing security, observability, and developer experience. His talk culminated in a live composition for an audience-chosen action movie scene, blending AI-driven suggestions with human creativity.

The Why Factor for AI Integration

Thomas introduced his “Why Factor” to evaluate hype technologies like generative AI. First, identify the problem: for his composer assistant, he needed to organize and access musical data efficiently. Second, assess production readiness: LLMs must be secure and reliable for real-world use. Third, prioritize developer experience: tools like Spring AI simplify integration without disrupting workflows. By focusing on these principles, Thomas avoided blindly adopting AI, ensuring it solved specific issues, such as automating data classification to free up time for creative tasks like composing music.

Enhancing Applications with Spring AI

Using a Spring Boot application with a Thymeleaf frontend, Thomas integrated Spring AI to connect to LLMs like those from Ollama (local) and Mistral AI (cloud). He demonstrated text classification by creating a POST endpoint to categorize musical data (e.g., “Irish tin whistle” as an instrument) using a chat client API. To mitigate risks like prompt injection attacks, he employed Java enumerations to enforce structured outputs, converting free text into JSON-parsed Java objects. This approach ensured security and usability, allowing developers to swap models without code changes, enhancing flexibility for production environments.

Semantic Search and Retrieval-Augmented Generation

Thomas addressed the challenge of searching musical data by meaning, not just keywords, using semantic search. By leveraging embedding models in Spring AI, he converted text (e.g., “melancholic”) into numerical vectors stored in a PostgreSQL database, enabling searches for related terms like “sad.” He extended this with retrieval-augmented generation (RAG), where a chat client advisor retrieves relevant data before querying the LLM. For instance, asking, “What instruments for a melancholic scene?” returned suggestions like cello, based on his dataset, improving search accuracy and user experience.

Structured Data Extraction and Human Oversight

To streamline data entry, Thomas implemented structured data extraction, converting unstructured director notes (e.g., from audio recordings) into JSON objects for database storage. Spring AI facilitated this by defining a JSON schema for the LLM to follow, ensuring structured outputs. Recognizing LLMs’ potential for errors, he emphasized keeping humans in the loop, requiring users to review extracted data before saving. This approach, applied to his composer assistant, reduced manual effort while maintaining accuracy, applicable to scenarios like customer support ticket processing.

Tools and MCP for Enhanced Functionality

Thomas enhanced his application with tools, enabling LLMs to call internal APIs, such as saving composition notes. Using Spring Data, he annotated methods to make them accessible to the model, allowing automated actions like data storage. He also introduced the Model Context Protocol (MCP), implemented in Quarkus, to integrate with external music software via MIDI signals. This allowed the LLM to play chord progressions (e.g., in A minor) through his piano software, demonstrating how MCP extends AI capabilities across local processes, though he cautioned it’s not yet production-ready.

Observability and Live Composition

To ensure production readiness, Thomas integrated OpenTelemetry for observability, tracking LLM operations like token usage and prompt augmentation. During the session, he invited the audience to choose a movie scene (action won) and used his application to generate a composition plan, suggesting chord progressions (e.g., I-VI-III-VII) and instruments like percussion and strings. He performed the music live, copy-pasting AI-suggested notes into his software, fixing minor bugs, and adding creative touches, showcasing a practical blend of AI automation and human artistry.


[RivieraDev2025] Dhruv Kumar – Platform Engineering + AI: The Next-Gen DevOps

At Riviera DEV 2025, Dhruv Kumar delivered an engaging presentation on platform engineering, a discipline reshaping software delivery by addressing modern development challenges. Stepping in for Silva Devi, Dhruv, a senior product manager at CloudBees, explored how platform engineering, augmented by artificial intelligence, streamlines workflows, enhances developer productivity, and mitigates the complexities of cloud-native environments. His talk illuminated the transformative potential of internal developer platforms (IDPs) and AI-driven automation, offering a vision for a more efficient and secure software development lifecycle (SDLC).

The Challenges of Modern Software Development

Dhruv began by highlighting the evolving responsibilities of developers, who now spend only about 11% of their time coding, according to a survey by software.com. The remaining time is consumed by non-coding tasks such as testing, deployment, and managing security vulnerabilities. The shift-left movement, while intended to empower developers by integrating testing and deployment earlier in the process, often burdens them with tasks outside their core expertise. This is compounded by the transition to cloud environments, which introduces complex microservices architectures and distributed systems, creating navigation challenges and integration headaches.

Additionally, the rise of AI has accelerated software development, increasing code volume and tool proliferation, while supply chain attacks exploit these complexities, demanding constant vigilance from developers. Dhruv emphasized that these challenges—fragmented workflows, heightened security risks, and tool overload—necessitate a new approach to streamline processes and empower teams.

Platform Engineering: A Unified Approach

Platform engineering emerges as a solution to these issues, providing a cohesive framework for software delivery. Dhruv defined it as the discipline of designing toolchains and workflows that enable self-service capabilities for engineering teams in the cloud-native era. Central to this is the concept of an internal developer platform (IDP), which integrates tools and processes across the SDLC, from coding to deployment. By establishing a common SDLC model and vocabulary, platform engineering ensures that stakeholders—developers, QA, and security teams—share a unified understanding, reducing miscommunication and enhancing actionability.

Dhruv highlighted three pillars of effective platform engineering: a standardized SDLC model, secure best practices embedded in workflows, and the freedom for developers to use familiar tools. This last point, supported by a Forbes study from September 2023, underscores that happier developers, using tools they prefer, complete tasks 10% faster. By fostering collaboration and reducing context-switching, platform engineering creates an environment where developers can focus on innovation rather than operational overhead.

AI as a Catalyst for Optimization

Artificial intelligence plays a pivotal role in amplifying platform engineering’s impact. Dhruv explained that AI’s value lies not in generating code but in filtering noise and optimizing practices. By leveraging a robust SDLC data model, AI can provide actionable insights, provided it is fed high-quality data. For instance, AI-driven testing can prioritize time-intensive issues, streamline QA processes, and run only relevant tests based on code changes, reducing costs and feedback cycles. Dhruv cited examples like AI agents identifying vulnerabilities in code components or assessing risks in production ecosystems, automating fixes where appropriate.
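The change-based test selection Dhruv mentioned can be illustrated without any AI at all: at its core it is a mapping from changed files to the tests that exercise them. A minimal sketch, where the file names and the coverage map are hypothetical and would in practice come from coverage data or an AI-ranked index:

```java
import java.util.*;

public class TestSelector {
    // Hypothetical mapping from source files to the tests that cover them.
    private final Map<String, Set<String>> coverage;

    public TestSelector(Map<String, Set<String>> coverage) {
        this.coverage = coverage;
    }

    /** Return only the tests relevant to a given change set. */
    public Set<String> testsFor(Collection<String> changedFiles) {
        Set<String> selected = new TreeSet<>();
        for (String file : changedFiles) {
            selected.addAll(coverage.getOrDefault(file, Set.of()));
        }
        return selected;
    }

    public static void main(String[] args) {
        TestSelector selector = new TestSelector(Map.of(
            "PaymentService.java", Set.of("PaymentServiceTest", "CheckoutFlowTest"),
            "InventoryService.java", Set.of("InventoryServiceTest")
        ));
        // Only payment-related tests are selected for a payment change.
        System.out.println(selector.testsFor(List.of("PaymentService.java")));
        // → [CheckoutFlowTest, PaymentServiceTest]
    }
}
```

The payoff is the shrinking feedback cycle: a one-file change runs two tests instead of the whole suite, which is exactly the cost reduction Dhruv described.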

He also introduced the Model Context Protocol (MCP), an open standard that enables applications to provide context to large language models, enhancing AI’s ability to deliver precise recommendations. From troubleshooting CI/CD pipelines to onboarding new developers, AI, when integrated with platform engineering, empowers teams to address bottlenecks and scale efficiently in a cloud-native world.

Empowering Developers and Securing the Future

Dhruv concluded by emphasizing that platform engineering, bolstered by AI, re-engages all actors in the software delivery process, from developers to leadership. By normalizing data across tools and providing metrics like DORA (DevOps Research and Assessment), IDPs offer visibility into bottlenecks and investment opportunities. This holistic approach not only secures the tech stack against supply chain attacks but also fosters a culture of productivity and developer satisfaction.
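To make the DORA idea concrete, two of the four metrics can be computed from a plain list of deployment events. The event shape below is a hypothetical simplification for illustration, not CloudBees' actual data model:

```java
import java.time.*;
import java.util.*;

public class DoraMetrics {
    /** A deployment event: when the change was committed and when it reached production. */
    public record Deployment(Instant committedAt, Instant deployedAt) {}

    /** Deployment frequency: deployments per day over the observed window. */
    public static double deploymentsPerDay(List<Deployment> deps, Duration window) {
        return (double) deps.size() / Math.max(1, window.toDays());
    }

    /** Lead time for changes: mean commit-to-deploy duration. */
    public static Duration meanLeadTime(List<Deployment> deps) {
        long avgSeconds = (long) deps.stream()
            .mapToLong(d -> Duration.between(d.committedAt(), d.deployedAt()).getSeconds())
            .average()
            .orElse(0);
        return Duration.ofSeconds(avgSeconds);
    }
}
```

Once every tool in the IDP emits events in one normalized shape like this, the same two functions work across CI systems, which is the point of the common SDLC model Dhruv advocated.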

He encouraged attendees to explore CloudBees’ platform, which exemplifies these principles by breaking free from traditional platform limitations. Dhruv’s call to action urged developers to adopt platform engineering practices, leverage AI for optimization, and provide feedback to refine these evolving methodologies, ensuring a future where software delivery is both efficient and resilient.

Links:

PostHeaderIcon [DevoxxFR2025] Boosting Java Application Startup Time: JVM and Framework Optimizations

In the world of modern application deployment, particularly in cloud-native and microservice architectures, fast startup time is a crucial factor impacting scalability, resilience, and cost efficiency. Slow-starting applications can delay deployments, hinder auto-scaling responsiveness, and consume resources unnecessarily. Olivier Bourgain, in his presentation, delved into strategies for significantly accelerating the startup time of Java applications, focusing on optimizations at both the Java Virtual Machine (JVM) level and within popular frameworks like Spring Boot. He explored techniques ranging from garbage collection tuning to leveraging emerging technologies like OpenJDK’s Project Leyden and Spring AOT (Ahead-of-Time Compilation) to make Java applications lighter, faster, and more efficient from the moment they start.

The Importance of Fast Startup

Olivier began by explaining why fast startup time matters in modern environments. In microservices architectures, applications are frequently started and stopped as part of scaling events, deployments, or rolling updates. A slow startup adds to the time it takes to scale up to handle increased load, potentially leading to performance degradation or service unavailability. In serverless or function-as-a-service environments, cold starts (the time it takes for an idle instance to become ready) are directly impacted by application startup time, affecting latency and user experience. Faster startup also improves developer productivity by reducing the waiting time during local development and testing cycles. Olivier emphasized that optimizing startup time is no longer just a minor optimization but a fundamental requirement for efficient cloud-native deployments.

JVM and Garbage Collection Optimizations

Optimizing the JVM configuration and understanding garbage collection behavior are foundational steps in improving Java application startup. Olivier discussed how different garbage collectors (like G1, Parallel, or ZGC) can impact startup time and memory usage. Tuning JVM arguments related to heap size, garbage collection pauses, and just-in-time (JIT) compilation tiers can influence how quickly the application becomes responsive. While JIT compilation is crucial for long-term performance, it can introduce startup overhead as the JVM analyzes and optimizes code during initial execution. Techniques like Class Data Sharing (CDS) were mentioned as a way to reduce startup time by sharing pre-processed class metadata between multiple JVM instances. Olivier provided practical tips and configurations for optimizing JVM settings specifically for faster startup, balancing it with overall application performance.
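The Class Data Sharing technique Olivier mentioned is driven purely by launcher flags. A typical Application CDS workflow on JDK 13+ looks roughly like this (app.jar and the archive path are placeholders):

```shell
# 1. Run the application once, recording the loaded classes into an archive at exit.
java -XX:ArchiveClassesAtExit=app.jsa -jar app.jar

# 2. Subsequent starts map the shared archive, skipping repeated class parsing
#    and verification work during startup.
java -XX:SharedArchiveFile=app.jsa -jar app.jar
```

The archive must be regenerated when the application's classpath changes, so this step is usually folded into the build or container image pipeline.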

Framework Optimizations: Spring Boot and Beyond

Popular frameworks like Spring Boot, while providing immense productivity benefits, can contribute to longer startup times due to their extensive features and reliance on reflection and classpath scanning during initialization. Olivier explored strategies within the Spring ecosystem and other frameworks to mitigate this. He highlighted Spring AOT (Ahead-of-Time compilation) as a transformative technology that analyzes the application at build time and generates optimized code and configuration, reducing the work the JVM needs to do at runtime. This can significantly decrease startup time and memory footprint, making Spring Boot applications more suitable for resource-constrained environments and serverless deployments. He also pointed to Project Leyden in OpenJDK, which aims to enable static images and further ahead-of-time compilation for Java, as a future direction for improving startup performance at the platform level. Olivier demonstrated that combining these framework-specific optimizations with AOT compilation can dramatically cut the startup time of Java applications, making them competitive with languages traditionally known for fast startup.
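With Spring Boot 3 and Gradle, the AOT step can be sketched roughly as follows; the jar path is a placeholder and the exact task wiring varies by plugin version:

```shell
# Generate AOT-optimized sources at build time and package them into the jar.
./gradlew processAot bootJar

# Run with AOT mode on: generated configuration replaces runtime
# reflection and classpath scanning during startup.
java -Dspring.aot.enabled=true -jar build/libs/app.jar
```

The same AOT output is what feeds GraalVM native-image builds, so adopting it on the JVM first is a low-risk stepping stone.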

Links:

PostHeaderIcon [KotlinConf2024] DataFrame: Kotlin’s Dynamic Data Handling

At KotlinConf2024, Roman Belov, JetBrains’ Kotlin Moods group leader, showcased Kotlin DataFrame, a versatile library for managing flat and hierarchical data. Designed for general developers, not just data scientists, DataFrame handles CSV, JSON, and object subgraphs, enabling seamless data transformation and visualization. Roman demonstrated its integration with Kotlin Notebook for prototyping and a compiler plugin for dynamic type inference, using a KotlinConf app backend as an example. This talk highlighted how DataFrame empowers developers to build robust, interactive data pipelines.

DataFrame: A Versatile Data Structure

Kotlin DataFrame redefines data handling for Kotlin developers. Roman explained that, unlike traditional data classes, DataFrame supports dynamic column manipulation, akin to Excel tables. It can read, write, and transform data from formats like CSV or JSON, making it ideal for both analytics and general projects. For a KotlinConf app, DataFrame processed session data from a REST API, allowing developers to filter, sort, and pivot data effortlessly, providing a flexible alternative to rigid data class structures.

Prototyping with Kotlin Notebook

Kotlin Notebook, a plugin for IntelliJ IDEA Ultimate, enhances DataFrame’s prototyping capabilities. Roman demonstrated creating a scratch file to fetch session data via Ktor Client. The notebook’s auto-completion for dependencies, like Ktor or DataFrame, simplifies setup, downloading the latest versions from Maven Central. Interactive tables display hierarchical data, and each code fragment updates variable types, enabling rapid experimentation. This environment suits developers iterating on ideas, offering a low-friction way to test data transformations before production.

Dynamic Type Inference in Action

DataFrame’s compiler plugin, built for the K2 compiler, introduces on-the-fly type inference. Roman showed how it analyzes a DataFrame’s schema during execution, generating extension properties for columns. For example, accessing a title column in a sessions DataFrame feels like using a property, with auto-completion for column names and types. This eliminates manual schema definitions, streamlining data wrangling. Though experimental, the plugin caches schemas efficiently to preserve performance, as Roman showed when filtering multiplatform talk descriptions.

Handling Hierarchical Data

DataFrame excels with hierarchical structures, unlike flat data classes. Roman illustrated this with nested JSON from the KotlinConf API, converting categories into a DataFrame with grouped columns. Developers can navigate sub-DataFrames within cells, mirroring data class nesting. For instance, a category’s items array became a sub-DataFrame, accessible via intuitive APIs. This capability supports complex data like object subgraphs, enabling developers to transform and analyze nested structures without cumbersome manual mappings.

Building a KotlinConf Schedule

Roman walked through a practical example: creating a daily schedule for KotlinConf. Starting with session data, he converted startsAt strings to LocalDateTime, filtered out service sessions, and joined room IDs with room names from another DataFrame. Sorting by start time and pivoting by room produced a clean schedule, with nulls replaced by empty strings. The resulting HTML table, generated directly in the notebook, showcased DataFrame’s ability to transform REST API data into user-friendly outputs, all with concise, readable code.
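The shape of Roman's pipeline (filter out service sessions, sort by start time, group by room) is easy to mirror in plain Java streams. This is not the Kotlin DataFrame API, and the session fields below are a hypothetical simplification of the conference data:

```java
import java.time.LocalDateTime;
import java.util.*;
import java.util.stream.*;

public class Schedule {
    /** A simplified session record; real conference data has more fields. */
    public record Session(String title, LocalDateTime startsAt, String room, boolean service) {}

    /** Group non-service sessions by room, each room's talks ordered by start time. */
    public static Map<String, List<String>> byRoom(List<Session> sessions) {
        return sessions.stream()
            .filter(s -> !s.service())                       // drop service sessions
            .sorted(Comparator.comparing(Session::startsAt)) // order by start time
            .collect(Collectors.groupingBy(
                Session::room,
                TreeMap::new,
                Collectors.mapping(Session::title, Collectors.toList())));
    }
}
```

What DataFrame adds over this stream version is the pivot into a room-per-column table and the notebook-rendered HTML output, with no intermediate record type to declare.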

Visualizing Data with Kandy

DataFrame integrates with Kandy, JetBrains’ visualization library, to create charts. Roman demonstrated analyzing GitHub commits from the Kotlin repository, grouping them by week to plot commit counts and average message lengths. The resulting chart revealed trends, like steady growth potentially tied to CI improvements. Kandy’s simple API, paired with DataFrame’s data manipulation, makes visualization accessible. Roman encouraged exploring Kandy’s website for examples, highlighting its role in turning raw data into actionable insights.

DataFrame in Production

Moving DataFrame to production is straightforward. Roman showed copying notebook code into IntelliJ’s EAP version, importing the generated schema to access columns as properties. The compiler plugin evolves schemas dynamically, supporting operations like adding a room column and using it immediately. This approach minimizes boilerplate, as seen when serializing a schedule to JSON. Though the plugin is experimental, its integration with K2 ensures reliability, making DataFrame a practical choice for building scalable backend systems, from APIs to data pipelines.

Links: