Posts Tagged ‘AWSReInvent2025’

[AWSReInvent2025] Basketball’s AI Revolution: How AWS and the NBA Are Changing the Game

Lecturers

Chris Benyarko is Executive Vice President of Direct-to-Consumer at the NBA, overseeing fan engagement and digital strategies. Andy Oh serves as Principal of Live Sports Events at Prime Video, leading NBA broadcasting partnerships. Kristen Schaff is Global Director of Sports Partnerships at AWS, managing collaborations across major leagues. Relevant links include Chris Benyarko’s LinkedIn profile (https://www.linkedin.com/in/chris-benyarko-/) and Kristen Schaff’s LinkedIn profile (https://www.linkedin.com/in/kristen-schaff/).

Abstract

This article investigates the NBA’s digital transformation via AWS, focusing on AI-driven analytics, fan personalization, and broadcasting innovations. It analyzes partnerships enhancing game strategies, viewer experiences, and global engagement, with implications for sports technology scalability.

The NBA-AWS Partnership: Shared Vision and Technological Foundations

The NBA’s strategic alliance with AWS, formally unveiled on October 1st, is rooted in a mutual commitment to innovation and a focus on fan experiences. Chris Benyarko emphasizes that the partnership goes beyond technology provision and positions AWS as a collaborator in advancing the league’s goals. At its foundation lies a shared philosophy: the NBA’s obsession with current and future fans matches AWS’s customer-centric approach. This alignment lets the league use AWS infrastructure across its operations and accelerate the pace of technological change.

In the broader context of basketball’s ongoing evolution, the need for sophisticated, data-driven solutions has never been more pressing. AWS offers a scalable cloud platform that excels in handling complex analytics, artificial intelligence, and machine learning tasks, converting vast amounts of raw data into meaningful insights that inform decision-making at every level. Kristen Schaff highlights what drew AWS to the NBA, pointing out the league’s dynamic, fast-paced nature and its abundance of data as ideal attributes that align perfectly with AWS’s technological strengths. From player performance tracking to predictive modeling, this collaboration leverages AWS’s tools to address the unique demands of professional sports.

The methodology underpinning this partnership involves a comprehensive migration of workflows to AWS services, ensuring low-latency streaming and personalized content delivery that reaches audiences worldwide. By combining the NBA’s deep domain knowledge with AWS’s technical prowess, the alliance not only enhances current offerings but also paves the way for future innovations that could redefine the sport.

AI and Analytics Transforming Gameplay and Strategy

Artificial intelligence is at the forefront of reshaping basketball analytics, influencing everything from individual player development to collective team strategies during games. Chris Benyarko delves into the capabilities of Second Spectrum’s optical tracking system, which deploys 29 cameras in each arena to capture an astonishing 100 million data points per night. These metrics encompass detailed aspects such as player speed, defensive positioning, and shot quality, providing coaches and analysts with granular information that was previously unattainable.

AWS plays a pivotal role in this transformation by powering machine learning models that forecast game outcomes and simulate various scenarios, thereby assisting coaches in refining their tactics. The implications are significant, as teams can now gain substantial competitive advantages through data-informed decisions, while fans benefit from enriched content on platforms like NBA League Pass, including automated highlight reels that capture the most thrilling moments. Andy Oh complements this by describing how Prime Video integrates AWS for real-time statistical overlays, which add layers of depth to the viewing experience and foster greater immersion.

Nevertheless, challenges such as data latency persist, and the partnership addresses these through continuous infrastructure optimizations, ensuring that the flow of information remains timely and reliable.

Enhancing Fan Engagement Through Personalization

Personalization has emerged as a key driver in elevating fan engagement, utilizing AI to deliver content that resonates on an individual level. Chris Benyarko explains the progression of NBA League Pass, which now employs AI to generate highlights in multiple languages, offer alternate viewing streams focused on specific players, and provide predictive elements like real-time win probabilities. These features not only cater to diverse global audiences but also deepen the connection between fans and the game.

AWS’s extensive global network facilitates this by guaranteeing low-latency delivery to over 200 countries, making high-quality experiences accessible regardless of location. Kristen Schaff underscores the importance of data privacy within these personalization efforts, ensuring that the NBA’s fan-first principles are upheld through secure, unified data management practices.

An analysis of this approach reveals its potential to shift traditional passive spectatorship toward more interactive, tailored experiences, which in turn boosts viewer retention and opens new avenues for monetization through precisely targeted advertising.

Broadcasting Innovations and Latency Challenges

Prime Video’s integration of NBA content exemplifies how AWS enables groundbreaking broadcasting innovations. Andy Oh outlines the process of capturing feeds directly from arenas and minimizing transmission hops to achieve near-real-time delivery, a critical factor especially for integrations involving live betting.

Among the notable advancements is AI-generated commentary available in various languages, powered by Amazon Bedrock for natural and accurate translations. The broader implications extend to democratizing access to premium content, thereby expanding the NBA’s global footprint and attracting new demographics. However, the persistent challenge of avoiding spoilers drives an ongoing emphasis on latency reduction, with AWS tools providing the means for continuous monitoring and swift adjustments to maintain optimal performance.

Implications for Sports and Broader Industries

The NBA-AWS partnership offers valuable insights that transcend the realm of sports, demonstrating the power of real-time data platforms, personalized content delivery, and AI in production environments. Chris Benyarko envisions extending these technologies to non-professional leagues, potentially increasing participation by making advanced analytics more widely available.

Looking ahead, AI could further innovate by predicting injuries or optimizing training regimens, fundamentally altering athletic preparation and performance. These developments not only enhance the sport but also provide scalable models applicable to other industries seeking to leverage data for competitive advantage.

Conclusion

The synergy between AWS and the NBA vividly illustrates the transformative potential of AI in sports. By enhancing analytics, personalization, and broadcasting through advanced cloud technologies, this collaboration redefines fan engagement and sets a precedent for innovation across various sectors.

Links:

  • https://www.youtube.com/watch?v=pZczwGVzWxo
  • https://www.linkedin.com/in/chris-benyarko-/
  • https://www.linkedin.com/in/kristen-schaff/

[AWSReInvent2025] Introducing Nitro Isolation Engine: Transparency through Mathematics

Lecturers

JD Bean is a principal architect in AWS’s compute and ML services organization, specializing in virtualization and security innovations. Kareem Raslan serves as a senior principal engineer in AWS’s Nitro hypervisor team, focusing on hardware-software integration for cloud security. Nathan Chong is a principal applied scientist in AWS’s automated reasoning group, with expertise in formal verification and mathematical proofs. Relevant links include JD Bean’s LinkedIn profile (https://www.linkedin.com/in/jdbean/) and Nathan Chong’s LinkedIn profile (https://www.linkedin.com/in/nathan-chong-aws/).

Abstract

This article explores the AWS Nitro Isolation Engine, an advancement in the Nitro System that employs formal verification to ensure mathematical certainty in workload isolation. It examines the evolution of Nitro’s design, the application of automated reasoning for proofs, and the implications for cloud security, emphasizing compartmentalization and transparency.

The Evolution of the AWS Nitro System

The AWS Nitro System has fundamentally transformed the landscape of cloud virtualization by prioritizing enhanced security, superior performance, and accelerated innovation. JD Bean traces its development back to 2012, explaining how it culminated in a public launch in 2017 that marked a departure from conventional hypervisors such as Xen. At its core, the system relies on a customized version of the KVM hypervisor tailored specifically for cloud environments, complemented by the sixth generation of proprietary Nitro Silicon. This infrastructure underpins all EC2 instances introduced since 2018, demonstrating AWS’s commitment to reimagining virtualization.

In earlier iterations, systems like Xen depended on a component known as Dom0, which essentially functioned as a general-purpose operating system to handle essential tasks such as input/output operations, orchestration, and monitoring. However, as AWS expanded its services and built deeper relationships with customers, the limitations of Xen became increasingly apparent. The team recognized the need to push beyond these constraints, leading to a comprehensive reinvention that eliminated superfluous elements and relocated AWS-specific functions to dedicated hardware. Consequently, the Nitro System features a streamlined host operating system reduced to a minimal kernel, which not only minimizes potential attack surfaces but also enforces a policy of zero operator access, thereby isolating customer data from AWS personnel.

Within this broader context, the rise of cloud adoption has amplified the demand for confidential computing, where sensitive workloads require robust protections against unauthorized access. The Nitro architecture addresses these needs by compartmentalizing only the most critical isolation functions, which in turn optimizes efficiency and reduces vulnerabilities. This design philosophy ensures that customers can leverage the cloud’s scalability without compromising on security, setting the stage for subsequent advancements like the Nitro Isolation Engine.

Design and Implementation of the Nitro Isolation Engine

Building upon the foundational principles of the Nitro System, the Nitro Isolation Engine introduces a compact and formally verified module that significantly bolsters isolation assurances. Kareem Raslan elaborates on its compartmentalization strategy, noting how non-essential operations are shifted to user space, leaving behind a concise kernel comprising fewer than 100,000 lines of code dedicated solely to vital activities such as memory allocation and interrupt handling.

This engine is currently implemented on the Graviton 5 processor, available in preview mode, and utilizes specialized hardware extensions to facilitate secure transitions across compartments. The implementation methodology centers on rigorous specification, where the engine’s expected behaviors—such as maintaining strict workload separation—are articulated through precise mathematical models. Subsequently, the team employs tools like Isabelle to prove that the actual code aligns perfectly with these specifications, thereby guaranteeing that no deviations occur.

Nathan Chong further illuminates the process of automated reasoning, beginning with intuitive examples like the formula for the sum of the first n natural numbers and progressing to sophisticated machine-checked proofs. For the engine, this approach extends to verifying properties over potentially infinite states, which ensures that unauthorized access paths are entirely eliminated. The result is a system that not only performs efficiently but also withstands rigorous scrutiny, providing customers with unparalleled confidence in their data’s protection.

The implications of this design are profound, as it substantially diminishes the risk of exploitation by confining the trusted computing base to a minimal footprint. By verifying a smaller codebase through automated means, the engine mitigates issues stemming from legacy components, paving the way for a more secure cloud ecosystem.

Automated Reasoning and Mathematical Proofs

Automated reasoning stands as a cornerstone of the Nitro Isolation Engine, offering what the presenters describe as “transparency through mathematics” by delivering incontrovertible assurances of isolation. Nathan Chong contrasts informal proofs and specifications with their machine-checked counterparts in the Isabelle theorem prover, where each logical step is mechanically validated to prevent errors.

At the heart of this process lie core concepts such as specifications, which define the precise behaviors a system must exhibit, and proofs, which consist of finite chains of reasoning that irrefutably establish desired properties. For domains involving infinite possibilities, such as the natural numbers, techniques like mathematical induction are employed: a base case confirms the property for the initial value, while the inductive step demonstrates its preservation across subsequent values, much like a cascade of falling dominoes.
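
The domino argument can be made concrete with the sum formula mentioned earlier. The derivation below is an illustrative pen-and-paper version of the technique, not the machine-checked Isabelle proof shown in the talk; the predicate name $P(n)$ is introduced here for convenience.

```latex
\textbf{Claim.} For every natural number $n$, the statement $P(n)$ holds:
\[
  \sum_{k=0}^{n} k = \frac{n(n+1)}{2}.
\]

\textbf{Base case} ($n = 0$):
\[
  \sum_{k=0}^{0} k = 0 = \frac{0 \cdot 1}{2}.
\]

\textbf{Inductive step.} Assume $P(n)$ holds. Then
\begin{align*}
  \sum_{k=0}^{n+1} k
    &= \Bigl(\sum_{k=0}^{n} k\Bigr) + (n+1) \\
    &= \frac{n(n+1)}{2} + (n+1) \\
    &= \frac{(n+1)(n+2)}{2},
\end{align*}
which is exactly $P(n+1)$. By induction, $P(n)$ holds for all $n$.
```

The proof has only two finite parts, yet it covers infinitely many values of $n$, which is the property that distinguishes proof from testing.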

Scaling these methods to the complexities of the Nitro Isolation Engine requires advanced mathematical frameworks, including separation logic for managing memory resources, refinement techniques for bridging abstraction levels, and theorem provers to automate verification. Drawing on decades of research in formal methods, this approach ensures comprehensive coverage of real-world scenarios, including concurrent operations that could otherwise introduce subtle vulnerabilities.

An analysis of this methodology reveals its inherent value: unlike traditional testing, which is confined to finite scenarios, mathematical proofs provide exhaustive guarantees, fostering a level of trust that is essential for confidential computing environments. This not only elevates security standards but also enables organizations to innovate with greater assurance.

Implications for Cloud Security and Future Innovations

The introduction of the Nitro Isolation Engine heralds a new era in cloud security, where mathematical proofs become the benchmark for verifying system integrity. By emphasizing compartmentalization, the engine effectively minimizes the trusted computing base, thereby reducing the potential for exploits and enhancing overall resilience. Currently available as an always-on feature on Graviton 5 processors in preview, it invites users to request access through designated AWS channels, signaling AWS’s proactive stance in deploying cutting-edge security measures.

On a broader scale, the consequences extend to industries with stringent privacy requirements, such as finance and healthcare, where verifiable isolation can mitigate compliance risks and build customer confidence. AWS’s ongoing commitment to elevating security standards—evident throughout the Nitro System’s history—suggests that future innovations will continue to prioritize robust protections, allowing for rapid advancements without sacrificing safety.

This transparency through mathematics not only demystifies complex systems but also empowers users to make informed decisions about their cloud strategies, ultimately contributing to a more secure digital landscape.

Conclusion

The Nitro Isolation Engine exemplifies AWS’s unwavering dedication to pioneering secure and innovative cloud infrastructure. Through the rigorous application of formal verification, it achieves mathematical certainty in workload isolation, thereby redefining transparency and trust in the realm of virtualization.

Links:

  • https://www.youtube.com/watch?v=hqqKi3E-oG8
  • https://www.linkedin.com/in/jdbean/
  • https://www.linkedin.com/in/nathan-chong-aws/

[AWSReInvent2025] Transforming Tire Innovation: How Apollo Tyres Harnessed AWS High-Performance Computing to Redefine Engineering Velocity

Lecturers

Alex Fronasier serves as Business Development Lead for Product Engineering in North America at Amazon Web Services (AWS), championing cloud-enabled advances across manufacturing domains. Shalender Gupta is Global Head of Data Engineering, Analytics, and Reporting at Apollo Tyres, steering the organization’s worldwide data and digital strategy. Gautam, representing AWS partner expertise, contributed deep insights into bespoke HPC platform customization.

Abstract

In an industry where milliseconds of performance and fractions of material efficiency separate market leaders from followers, simulation-driven design has become the lifeblood of innovation. Apollo Tyres’ bold migration to AWS High-Performance Computing stands as a compelling case study in how purposeful cloud architecture can dramatically accelerate engineering workflows while simultaneously driving down costs. This narrative traces the company’s journey from constrained on-premises systems to a scalable, self-service HPC environment, revealing the strategic decisions, technical foundations, and cultural shifts that unlocked unprecedented gains in speed, agility, and sustainability.

The New Imperatives of Engineering Excellence

Manufacturing no longer unfolds in isolated silos; it now competes in a digital-first arena where speed is existential. Established enterprises face disruptors unencumbered by legacy infrastructure, capable of moving from concept to market at breathtaking pace. Success, therefore, hinges on two intertwined capabilities: modernizing operations through cloud and automation, and compressing product development cycles to shrink time-to-market.

Today’s products are marvels of complexity—millions of lines of code, thousands of components, and sprawling global supply chains. Managing this intricacy demands a digital thread: a continuous, traceable flow of data across the entire lifecycle, from requirements to configuration to multidisciplinary validation. Apollo Tyres illustrated this beautifully with their tire genealogy—a living digital record that links every design decision to its downstream performance implications.

Yet complexity alone does not guarantee advantage. True differentiation emerges when organizations leverage simulation to explore thousands of virtual experiments, uncovering innovations that physical prototyping could never economically reveal. Quality must be engineered in from the outset, augmented by AI, IoT, and advanced analytics, rather than inspected in at the end. Efficiency, meanwhile, is not about cutting corners but about eliminating waste through smarter, data-driven choices.

These forces—digital primacy, digital thread mastery, and simulation at scale—are mutually reinforcing. Cloud-enabled operations feed the thread; the thread supplies rich data for quality optimization; simulation accelerates both. Companies that harmonize all three are positioned to dominate.

AWS lives these principles daily. Designing much of its own hardware while orchestrating a planetary supply chain gives the company intimate familiarity with these challenges. A relentless “working backwards” philosophy—from customer needs to rapid prototyping—infuses everything from data center infrastructure to consumer devices and warehouse robotics. At the heart of this agility lies secure, cloud-native collaboration, enabling globally distributed teams to innovate seamlessly, whether crafting integrated circuits or pioneering satellite constellations.

The Anatomy of Simulation and the Allure of the Cloud

A typical engineering simulation journey begins with conceptual design, evolves into detailed model preparation with boundary conditions, proceeds to systematic exploration of design alternatives, and concludes with job execution, result analysis, and insight extraction. These cycles repeat across phases: early design space mapping builds competitive edge, mid-stage robustness testing exposes failure modes, and pre-manufacturing validation de-risks production.

Organizations are flocking to the cloud for compelling reasons. Unlimited elastic capacity banishes queue times, dramatically lifting engineer productivity. Pay-as-you-go economics paired with on-demand scaling delivers financial flexibility. Global teams collaborate without friction, while built-in resilience ensures business continuity. Cutting-edge hardware becomes instantly accessible without capital outlay, and software licenses achieve far higher utilization—driving superior ROI. Shared infrastructure even advances corporate sustainability goals.

AWS structures its HPC offering around three pillars: an intuitive front-end for job submission, virtual desktops, and high-performance remote visualization; a vast compute layer with purpose-built instances; and sophisticated data management that preserves traceability—the very essence of the digital thread.

The true power lies in workload-to-instance matching. Different simulations—structural, thermal, fluid dynamics—exhibit distinct compute, memory, or accelerator profiles. AWS’s broad portfolio allows each job to run on its optimal instance, yielding dramatic cost-performance gains. Spot instances handle interruptible workloads, on-demand serves mission-critical runs, and savings plans lock in baseline capacity. Emerging AI-driven provisioning promises to automate these decisions entirely, while GPU instances capitalize on solver redesigns that exploit parallel processing.
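
The purchasing rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration only: the `Job` fields, the option labels, and the order in which the checks are applied are assumptions made for the example, not part of any AWS API or of the platform described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    interruptible: bool  # hypothetical flag: can the solver checkpoint and resume?
    baseline: bool       # hypothetical flag: part of the steady, predictable load?


def purchase_option(job: Job) -> str:
    """Map a job to a purchasing model using the rule of thumb in the text."""
    if job.baseline:
        return "savings-plan"   # steady load is covered by committed capacity
    if job.interruptible:
        return "spot"           # tolerant of interruption, so take the discount
    return "on-demand"          # mission-critical runs that must not be interrupted


print(purchase_option(Job("design-sweep", interruptible=True, baseline=False)))
```

Checking `baseline` first is a design choice for this sketch; a real scheduler would also weigh deadlines and current Spot availability.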

Apollo Tyres’ Awakening: From Legacy Constraints to Cloud Liberation

Apollo Tyres commands respect across Asia-Pacific and Europe, with premium offerings marketed under the Vredestein banner for luxury and performance vehicles. Operating seven plants and spanning every tire category—from passenger cars to agricultural and off-road—the company faced classic HPC growing pains.

On-premises clusters imposed crushing capital burdens, interminable procurement cycles, and inflexible scaling during demand peaks. Visibility across global sites was fragmented, and manual job orchestration created bottlenecks that delayed critical insights. Tire design, after all, demands exquisitely detailed multiphysics simulation—modeling rubber compounds, structural integrity, heat dissipation, and wear under extreme conditions.

The pivot to AWS began with foundational services: AWS ParallelCluster for orchestration, Amazon DCV for seamless remote workstation access, and FSx for NetApp ONTAP for high-throughput storage. This triad enabled tight integration between simulation suites and design tools, delivering up to 59% faster runtimes and more than 60% cost reduction.

Rigorous benchmarking proved pivotal. Shalender Gupta shared a clear hierarchy: Graviton processors running Amazon Linux offered the lowest cost; if incompatible, shift to x86 AMD, then Intel; reserve Windows only for unavoidable enterprise applications. This disciplined approach shattered myths of cloud expense, revealing optimal configurations that balanced performance and economy.
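
The fallback order described above can be expressed as a small selection function. The sketch below is hypothetical: the CPU family labels and the function name are invented for illustration, and it assumes that Windows workloads cannot use Graviton.

```python
from __future__ import annotations

from typing import Iterable

# Cheapest-first order as reported in the talk summary (labels are illustrative).
CPU_PREFERENCE = ("graviton", "amd-x86", "intel-x86")


def choose_platform(supported_cpus: Iterable[str], needs_windows: bool = False) -> tuple[str, str]:
    """Return the first (cpu_family, os) pair in preference order that the application supports."""
    supported = set(supported_cpus)
    os_name = "windows" if needs_windows else "linux"
    for cpu in CPU_PREFERENCE:
        if needs_windows and cpu == "graviton":
            continue  # assumption: Windows is reserved for x86 instances
        if cpu in supported:
            return cpu, os_name
    raise ValueError("application supports none of the benchmarked CPU families")


print(choose_platform({"graviton", "intel-x86"}))
```

Encoding the order as data keeps the policy easy to revise when a new round of benchmarks changes the ranking.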

Tachyon: Placing Power Back in Engineers’ Hands

To eliminate operational friction, Apollo Tyres partnered with AWS to deploy Tachyon—a tailored, cloud-native HPC management platform. Tachyon fundamentally rebalances control: researchers gain self-service autonomy, while administrators retain comprehensive visibility and governance.

Engineers now submit, monitor, and troubleshoot jobs through an elegant interface. They provision workstations on demand from a curated catalog and navigate files effortlessly—no more IT tickets. Administrators enjoy unified observability across clusters, project-level budgeting, and seamless Active Directory integration.

Under the hood, Tachyon runs on Amazon EKS with lightweight nodes, leverages OpenSearch for metadata, uses Lambda for scheduled billing and notifications, and deploys proxy nodes close to compute clusters. Secure private connectivity via Direct Connect or VPN completes the enterprise-grade posture.

Live demonstrations revealed the platform’s finesse: granular job configuration (queues, nodes, tasks per node, memory), instant cost previews before submission, deep utilization telemetry, and direct access to simulation outputs. Workstation sharing and lifecycle monitoring further streamline collaboration.
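
A cost preview of this kind reduces to simple arithmetic over the job configuration. The following is a rough sketch, not Tachyon’s actual calculation: the `JobSpec` fields mirror the options listed above, while the hourly rate and the assumption that cost scales with the full wall-clock limit are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    queue: str
    nodes: int
    tasks_per_node: int
    wall_hours: float  # requested wall-clock limit


def preview_cost(job: JobSpec, hourly_rate_usd: float) -> float:
    """Upper-bound cost if the job runs for its full wall-clock limit."""
    if job.nodes < 1 or job.wall_hours <= 0 or hourly_rate_usd < 0:
        raise ValueError("invalid job size or hourly rate")
    return round(job.nodes * job.wall_hours * hourly_rate_usd, 2)


job = JobSpec(queue="cfd", nodes=8, tasks_per_node=64, wall_hours=3.0)
print(preview_cost(job, hourly_rate_usd=2.50))  # 8 nodes * 3 h * $2.50 = 60.0
```

Showing this number before submission lets an engineer trade node count against runtime without waiting for a billing report.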

Tachyon AI elevates the experience further. Physics-informed models accelerate simulations, while an Amazon Bedrock-powered assistant enables natural-language interaction—querying job status, generating scripts, diagnosing failures, or optimizing for cost versus speed.

The results speak volumes: simulation times fell by 60% compared to on-premises, capital expenditure shifted to controlled operational spend, engineers refocused on innovation rather than infrastructure wrangling, and virtual prototyping largely supplanted physical testing.

Wisdom Earned and Horizons Ahead

Key lessons crystallized: exhaustive benchmarking is non-negotiable for cost and performance optimization; design everything for elasticity; monitor relentlessly with budget alerts; automate wherever possible. Planning for multi-cluster scale from day one smoothed subsequent expansion.

Looking forward, Apollo Tyres envisions chemical compound simulation to optimize material performance and longevity, component rationalization to simplify the bill of materials, global rollout across all R&D centers, and AI agents that autonomously run simulations and recommend optimal designs.

By mastering cloud HPC, Apollo Tyres has not merely accelerated workflows—it has redefined what is possible in tire engineering, setting a benchmark for simulation-driven manufacturing in the digital age.

[AWSReInvent2025] Revolutionizing DevSecOps: How Cathay Pacific Achieved 75% Faster Security with Agentic AI

Lecturers

Mike Markell is a Practice Manager for AWS Professional Services in Hong Kong, where he leads digital transformation and security initiatives for major enterprises across Asia. Naresh Sharma is a senior technology leader at Cathay Pacific Airways, overseeing the airline’s global application security and DevSecOps strategy. Tony Leong is a Senior Security Architect at Cathay, specializing in building AI-powered security tooling and integrating AppSec-as-Code into high-velocity deployment pipelines.

Abstract

In the highly regulated and high-stakes environment of global aviation, managing security across more than 4,000 annual deployments presents a massive operational challenge. This article details how Cathay Pacific Airways revolutionized its “security-first” culture by moving beyond traditional security scanning to a comprehensive DevSecOps model. The core methodology centers on the implementation of Agentic AI and a RAG-based (Retrieval-Augmented Generation) assistant to solve the industry’s “false positive crisis.” By deploying “AI-powered security champions” and customized scanning rules, Cathay achieved a 75% reduction in vulnerability remediation time and a 50% reduction in security operations costs. The analysis explores the technical and cultural shifts required to empower over 1,000 developers to become proactive security practitioners while maintaining the airline’s rapid pace of innovation.

Context: The Bottleneck of Manual Security Reviews

For a global leader like Cathay Pacific, the pace of digital innovation is essential for maintaining a competitive edge in the aviation industry. However, this speed was being severely hindered by the limitations of traditional security scanning tools. The primary conflict centered on a high noise-to-signal ratio, where approximately 78% of the vulnerabilities identified by standard tools were determined to be false positives. This created a crisis where security teams were overwhelmed by alerts, leading to significant delays in the deployment of features for the airline’s fleet.

Furthermore, the manual review process required to validate these alerts created significant friction between the security and development teams. Developers often viewed security requirements as a hurdle that slowed down their ability to deliver value, while security professionals struggled to keep up with the volume of code being produced. To overcome these challenges, Cathay needed a solution that could scale with their deployment frequency—which covers everything from customer-facing apps to critical flight operation systems—without compromising on the rigorous safety standards that define the brand.

Methodology: Implementing Shift-Left Security with AI

The solution implemented by Cathay Pacific and AWS Professional Services involved a comprehensive “shift-left” strategy, which integrates security at the very beginning of the software development lifecycle. The cornerstone of this methodology is the use of Agentic AI. Unlike traditional static scanners, these AI agents act as “security champions” that provide real-time, context-aware guidance to developers as they write code. This allows for the identification of security anti-patterns and the suggestion of defensive coding practices before the code is even committed to a repository.

Another critical component of the methodology is the AppSec-as-Code library. This centralized knowledge base translates complex security policies into programmatic requirements that can be automatically enforced within CI/CD pipelines. To make this information accessible to developers, the team developed a RAG-based (Retrieval-Augmented Generation) assistant. This tool allows developers to query internal security standards using natural language, receiving accurate and context-specific advice instantly. Finally, the team moved away from “out of the box” tool configurations in favor of highly customized scanning rules. This technical fine-tuning was essential for drastically reducing the false-positive rate and ensuring that the security team only focused on legitimate threats.
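
To illustrate how a policy expressed as code and customized scanning rules might combine in a pipeline gate, here is a minimal sketch. Everything in it is hypothetical: the `Finding` shape, the rule identifiers, and the policy keys are invented for the example and do not reflect Cathay’s library or any specific scanner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule_id: str
    path: str
    severity: str  # "low", "medium", or "high"


# Hypothetical policy: which findings to suppress and which severities block a release.
POLICY = {
    "disabled_rules": {"GENERIC-HARDCODED-STRING"},  # known to be noisy in this codebase
    "ignored_path_prefixes": ("tests/", "docs/"),    # non-production code
    "block_severities": {"high"},
}


def triage(findings, policy=POLICY):
    """Drop findings that the customized rules classify as noise."""
    return [
        f for f in findings
        if f.rule_id not in policy["disabled_rules"]
        and not f.path.startswith(policy["ignored_path_prefixes"])
    ]


def gate(findings, policy=POLICY) -> bool:
    """Return True when the pipeline may proceed."""
    return not any(f.severity in policy["block_severities"] for f in triage(findings, policy))


scan = [
    Finding("GENERIC-HARDCODED-STRING", "src/app.py", "high"),
    Finding("SQL-INJECTION", "tests/test_db.py", "high"),
    Finding("SQL-INJECTION", "src/db.py", "high"),
]
print(len(triage(scan)), gate(scan))  # 1 False
```

Keeping the policy as data means a tuning decision, such as disabling a noisy rule, is reviewed and versioned like any other code change.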

Technical Analysis of Operational Gains

The implementation of AI-driven DevSecOps has yielded remarkable quantitative results for Cathay Pacific. The most significant outcome is a 75% reduction in the time required to remediate vulnerabilities. Because the AI agents filter out the vast majority of false positives and provide developers with clear, actionable fix suggestions, the entire security lifecycle has been compressed. The team also reports a 70% improvement in developer security capability, as the tools effectively serve as an automated, on-the-job training system that reinforces secure coding habits.

From a financial perspective, the automation of manual reviews and the reduction in wasted engineering time have led to a 50% cost reduction in security operations. The airline is now able to manage over 4,000 deployments annually with a higher level of confidence and lower overhead than was previously possible. A critical technical lesson learned during the journey was that “by default, no tool is perfect.” Success required a commitment to continuous customization and a willingness to collaborate with product vendors to tune their tools to the specific needs of the aviation industry. This iterative feedback loop was the key to moving from “human-in-the-loop” automation to a more efficient “AI-informed” model.

Consequences: A Cultural and Technical Transformation

The transformation at Cathay Pacific extended far beyond the technical architecture; it required a fundamental shift in the organization’s culture. The success of the project was predicated on a “can-do” spirit and the setting of ambitious targets that challenged the status quo. By providing developers with the tools to take ownership of security, the organization has fostered a culture where security is seen as a shared responsibility rather than an external constraint.

The implications for the global aviation and enterprise sectors are significant. Cathay has shown that it is possible to maintain a high-velocity deployment schedule in a safety-critical environment by using generative AI. Looking forward, the organization plans to develop more insightful dashboards that give security leaders real-time visibility into the health of the application portfolio. The journey illustrates how agentic AI can bridge the gap between agility and security, turning a potential bottleneck into a competitive advantage.


[AWSReInvent2025] Amazon S3 Performance: Architecture, Design, and Optimization for Data-Intensive Systems

Lecturer

Ian Heritage is a Senior Solutions Architect at Amazon Web Services, specializing in Amazon S3 and large-scale data storage architectures. With deep expertise in performance engineering and distributed systems, he helps organizations design and optimize their storage layers for high-throughput and low-latency applications, including machine learning training and real-time analytics. He is known in the AWS storage community for his technical deep-dives into S3’s internal mechanics and best practices for performance at scale.

Abstract

This article explores the internal architecture and performance optimization strategies of Amazon S3, the industry-leading object storage service. It analyzes the differences between S3 General Purpose and the newer S3 Express One Zone storage class, highlighting the architectural trade-off between multi-AZ durability and consistent single-digit millisecond latency. The discussion covers advanced request management techniques, including prefix partitioning, request routing, and the role of the AWS Common Runtime (CRT) in maximizing throughput. By examining these technical foundations, the article offers practical guidance for architecting storage solutions that can handle millions of requests per second and petabytes of data for modern AI and analytics workloads.

S3 Storage Class Selection for High Performance

The performance of an S3-based application is fundamentally determined by the selection of the storage class. For over a decade, S3 General Purpose (Standard) has been the default choice, offering 99.999999999% (11 9s) of durability by replicating data across at least three Availability Zones. While this provides extreme reliability, the regional replication introduces a baseline latency that may be too high for certain “request-intensive” applications, such as machine learning model checkpoints or high-frequency trading logs.

To address these needs, AWS introduced S3 Express One Zone. This storage class is designed for workloads that require consistent, single-digit millisecond latency. By storing data within a single Availability Zone and using a new, purpose-built architecture, Express One Zone can deliver data access up to 10x faster than S3 Standard, with request costs 50% lower. This class is ideal for applications that perform frequent, small I/O operations where the overhead of regional replication would be the primary bottleneck. The choice between Standard and Express One Zone is therefore a trade-off: Standard survives the loss of an Availability Zone, while Express One Zone gives up that multi-AZ resilience in exchange for the lowest latency.

Request Routing, Partitioning, and the Scale-Out Architecture

At its core, Amazon S3 is a massively distributed system that scales out to handle virtually unlimited throughput. The key to this scaling is “partitioning.” S3 automatically partitions buckets based on the object keys (names). Each partitioned prefix supports at least 3,500 PUT/COPY/POST/DELETE requests or 5,500 GET/HEAD requests per second. For many years, users were advised to use randomized prefixes to ensure even distribution across partitions.

Modern S3 architecture has evolved to handle this automatically, but understanding prefix design remains crucial for performance. When an application’s request rate increases, S3 detects the hot spot and splits the partition to handle the load. However, this process takes time. For workloads that burst from zero to millions of requests instantly, pre-partitioning or using a wide range of prefixes is still a best practice. By spreading data across multiple prefixes (e.g., bucket/prefix1/, bucket/prefix2/), an application can linearly scale its throughput to accommodate massive concurrency, limited only by the client’s network bandwidth and CPU.
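The prefix-spreading idea can be sketched in a few lines of Python. The per-prefix rates are the figures quoted above; the `shard-NNNN/` naming, the choice of SHA-256, and both function names are illustrative assumptions rather than an AWS-prescribed layout.

```python
import hashlib
import math

# Per-prefix request rates quoted in the text (requests per second).
GET_RPS_PER_PREFIX = 5500
PUT_RPS_PER_PREFIX = 3500


def prefixes_needed(target_rps, per_prefix_rps):
    """Estimate how many prefixes a target request rate would need."""
    if target_rps < 0 or per_prefix_rps <= 0:
        raise ValueError("rates must be positive")
    return max(1, math.ceil(target_rps / per_prefix_rps))


def sharded_key(key, num_prefixes):
    """Map a key to one of num_prefixes prefixes (hypothetical naming scheme).

    Hashing the key keeps the mapping deterministic, so a reader can
    recompute the full object key from the original name.
    """
    if num_prefixes < 1:
        raise ValueError("num_prefixes must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_prefixes
    return f"shard-{shard:04d}/{key}"
```

Under these assumptions, a workload targeting one million GET requests per second would need `prefixes_needed(1_000_000, GET_RPS_PER_PREFIX)`, which is 182 prefixes. A hashed prefix trades away the ability to list objects in their natural key order, so it fits best where objects are fetched by known name.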

Client-Side Optimization with AWS CRT and SDKs

While the S3 service is designed for scale, the performance experienced by the end-user is often limited by the client-side implementation. To bridge this gap, AWS developed the Common Runtime (CRT) library. The CRT is a set of open-source, C-based libraries that implement high-performance networking best practices, such as automatic request retries, congestion control, and most importantly, multipart transfers.

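# Conceptual example of configuring parallel multipart transfers in the AWS SDK for Python (Boto3).
# Boto3 can only use its CRT-based transfer client when the optional CRT
# dependency is installed (pip install "boto3[crt]"); otherwise the classic
# transfer manager handles the upload with the settings below.
import boto3
from boto3.s3.transfer import TransferConfig

# Large objects are split into parts that are transferred in parallel
config = TransferConfig(use_threads=True, max_concurrency=10)
s3 = boto3.client('s3')

s3.upload_file('large_data.zip', 'my-bucket', 'data.zip', Config=config)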

The CRT automatically breaks large objects into smaller parts and uploads or downloads them in parallel. This makes fuller use of the network capacity of the EC2 instance and mitigates the impact of single-path network congestion. For applications using the AWS CLI or SDKs for Java, Python, and C++, opting into the CRT-based clients can result in a significant throughput increase, often double or triple the speed of standard clients for large files. Additionally, the CRT handles the complexities of DNS load balancing and connection pooling, ensuring that requests are distributed efficiently across the S3 frontend fleet.
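To make the split-and-parallelize idea concrete, here is a minimal Python sketch. It is not the CRT's implementation: `plan_parts`, `parallel_fetch`, and the `read_range` callback are hypothetical names, and a thread pool stands in for the CRT's native networking.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_parts(object_size, part_size):
    """Return inclusive (start, end) byte offsets that cover the object."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size > 0")
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]


def parallel_fetch(read_range, object_size, part_size, max_workers=8):
    """Fetch every part concurrently and reassemble the parts in order.

    read_range(start, end) is a caller-supplied function (an assumption of
    this sketch) that returns the bytes for one inclusive range.
    """
    parts = plan_parts(object_size, part_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so the join is correct
        # even when parts finish out of order.
        chunks = pool.map(lambda part: read_range(*part), parts)
        return b"".join(chunks)
```

For example, `plan_parts(10, 4)` yields `[(0, 3), (4, 7), (8, 9)]`: the last part is simply shorter. In a real transfer the part size and worker count would be tuned to the instance's bandwidth.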

Case Study: Optimization for Machine Learning and Analytics

Machine learning training is a premier use case for S3 performance optimization. During the training of large language models (LLMs), hundreds or thousands of GPUs must simultaneously read training data and write model “checkpoints.” These checkpoints are multi-gigabyte files that must be saved quickly to avoid idling expensive compute resources. By combining S3 Express One Zone with the CRT-based client, researchers can achieve the throughput necessary to saturate the high-speed networking of P4 and P5 instances.

In analytics, the use of “Range Gets” (byte-range GET requests) is a critical optimization. Instead of downloading an entire 1 GB Parquet file to read a few columns, an application can request only the byte ranges it needs. This reduces the amount of data transferred and speeds up query execution. S3 is optimized to handle these range requests efficiently, and when combined with a partitioned data layout (e.g., partitioning by date or region), it enables sub-second query responses over petabytes of data. This architectural synergy between storage class, partitioning, and client-side logic is what allows S3 to serve as the foundation for the world’s largest data lakes.
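A byte-range read can be sketched as follows. The `Range` parameter and the `Body` stream are part of the Boto3 `get_object` call; the helper names and the `InMemoryS3` class are hypothetical, the latter being a stand-in that lets the sketch run without AWS credentials.

```python
import io


def range_header(start, end):
    """Build an HTTP Range header value for an inclusive byte range."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return f"bytes={start}-{end}"


def fetch_range(s3_client, bucket, key, start, end):
    """Read only bytes start..end of an object instead of the whole object."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=range_header(start, end)
    )
    return response["Body"].read()


class InMemoryS3:
    """Hypothetical stand-in for a Boto3 S3 client, for local testing only."""

    def __init__(self, objects):
        self._objects = objects  # maps (bucket, key) -> bytes

    def get_object(self, Bucket, Key, Range):
        first, last = Range[len("bytes="):].split("-")
        data = self._objects[(Bucket, Key)]
        return {"Body": io.BytesIO(data[int(first):int(last) + 1])}
```

With a real client, the same `fetch_range` call would be passed `boto3.client("s3")`. A columnar reader would typically issue one such request for the file's metadata and then one per column chunk it needs.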

Links: