Jonathan Lalou's Blog

[AWSReInvent2025] High-Performance Storage Architectures for AI/ML, Analytics, and HPC Workloads

Lecturer

Aditi is a Senior Product Manager for Amazon FSx at Amazon Web Services (AWS). With years of experience working directly with customers on high-performance workloads, she focuses on pushing the technical boundaries of what is possible with cloud storage to meet the demands of modern compute-intensive applications.

Abstract

This article examines the critical role of high-performance storage in supporting modern AI/ML, analytics, and High-Performance Computing (HPC) workloads. As organizations scale their compute resources—incorporating hundreds or thousands of CPU and GPU cores—storage often becomes the primary bottleneck, preventing linear performance scaling. We explore the technical architectures of Amazon FSx and Amazon S3, focusing on how these services address the needs of both “lift-and-shift” file-based applications and “cloud-native” S3-based data lakes. By analyzing customer use cases in genomics, media rendering, and large language model (LLM) training, we detail the methodologies for achieving peak performance at scale.

The Storage Bottleneck in Compute-Intensive Workloads

Modern high-performance workloads are characterized by their extreme reliance on massive datasets and high-core-count compute clusters. In an ideal cloud environment, adding more compute resources should lead to a proportional increase in work completed—a concept known as linear scaling. However, traditional storage solutions often fail to keep pace with the throughput demands of these clusters, leading to a performance plateau.

When storage becomes the bottleneck, compute instances sit underutilized as they compete for access to the same data store. This is particularly detrimental given that 90% to 95% of the expenditure for these workloads is typically allocated to compute resources. Consequently, an inefficient storage layer not only extends the time to insight but also significantly increases the total cost of ownership (TCO). To avoid this, storage must be architected to scale linearly alongside compute.
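The plateau and its cost can be illustrated with a back-of-envelope model. The sketch below assumes every node is purely I/O-bound and that the shared store has a hard aggregate throughput ceiling, which is a simplification. All figures (2 GB/s per node, a 64 GB/s store, $30 per node-hour) and all function names are hypothetical, chosen only for illustration.

```python
def effective_throughput(nodes: int, per_node_gb_s: float, storage_gb_s: float) -> float:
    """Aggregate throughput the cluster achieves: demand capped by the storage ceiling."""
    return min(nodes * per_node_gb_s, storage_gb_s)


def utilization(nodes: int, per_node_gb_s: float, storage_gb_s: float) -> float:
    """Fraction of the cluster's data demand that storage can actually serve."""
    demand = nodes * per_node_gb_s
    if demand == 0:
        return 0.0
    return effective_throughput(nodes, per_node_gb_s, storage_gb_s) / demand


def idle_compute_cost_per_hour(nodes: int, per_node_gb_s: float,
                               storage_gb_s: float, node_cost_per_hour: float) -> float:
    """Hourly compute spend that buys no work because nodes wait on storage."""
    idle_fraction = 1.0 - utilization(nodes, per_node_gb_s, storage_gb_s)
    return nodes * node_cost_per_hour * idle_fraction


if __name__ == "__main__":
    # Hypothetical cluster: 2 GB/s per node, 64 GB/s storage ceiling, $30 per node-hour.
    for n in (16, 32, 64, 128):
        print(n,
              effective_throughput(n, 2.0, 64.0),
              round(utilization(n, 2.0, 64.0), 2),
              idle_compute_cost_per_hour(n, 2.0, 64.0, 30.0))
```

Under these assumptions, throughput grows linearly up to 32 nodes and then stays flat. Every node added beyond that point raises the bill without raising the amount of work completed.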

Navigating the Path to the Cloud: File Systems vs. Object Storage

Organizations generally approach high-performance storage on AWS from two distinct backgrounds: those with long-standing on-premises file-based workflows and those who have built native cloud applications around object storage.

The Persistence of File-Based Architectures

Despite the rise of object storage, file systems remain the preferred interface for many researchers and developers due to three primary factors:
* Familiar Interface: The intuitive nature of files and directories simplifies complex data management for data scientists and developers.
* Granular Permissions: File systems provide robust POSIX permissions, allowing for fine-grained control over which users can read, write, or execute specific files.
* Consistent Data Access: For workloads where multiple users or compute nodes access the same data simultaneously, the strong consistency of file systems ensures that all parties see the most recent data updates.
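As a small illustration of the granular permissions mentioned above, the following sketch uses Python's standard library on a POSIX system to restrict a file to owner read/write and group read-only. The helper names and the temporary file are hypothetical; the same calls would apply to a file on any mounted POSIX-compliant file system.

```python
import os
import stat
import tempfile


def restrict_to_owner_and_group(path: str) -> int:
    """Set owner read/write, group read, no access for others (mode 0o640)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)
    # Return only the permission bits of the resulting mode.
    return stat.S_IMODE(os.stat(path).st_mode)


def describe(path: str) -> str:
    """Render the file mode in the familiar `ls -l` style."""
    return stat.filemode(os.stat(path).st_mode)


if __name__ == "__main__":
    # Hypothetical dataset file, created in a temporary location for the demo.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        sample_path = handle.name
    try:
        restrict_to_owner_and_group(sample_path)
        print(describe(sample_path))
    finally:
        os.remove(sample_path)
```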

Amazon FSx for High-Performance File Access

Amazon FSx addresses these needs by providing fully managed file systems that offer the performance of local storage with the scalability of the cloud. For “lift-and-shift” scenarios, FSx allows organizations to move their existing HPC and AI/ML pipelines to AWS without refactoring their applications.

Accelerating Generative AI and ML Workloads

The emergence of generative AI has placed a renewed emphasis on data strategy. Whether an organization is building a model from scratch or fine-tuning a foundation model, the quality and accessibility of its proprietary data are the primary differentiators.

Retrieval Augmented Generation (RAG)

To move beyond generic AI responses and reduce hallucinations, many organizations are implementing Retrieval Augmented Generation (RAG). RAG allows foundation models to access evolving, large-scale data lakes without requiring the data to be manually loaded into a prompt.

The RAG methodology involves:
1. Vectorization: Converting organizational data into vectors, numeric representations that capture semantic meaning.
2. Semantic Search: Comparing a query vector against the data lake’s vectors by their proximity in the vector space to find the most relevant information.
3. Augmentation: Feeding the retrieved context back into the model to generate a more accurate and business-specific response.
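The three steps above can be sketched in a few lines of Python. This is a toy illustration, not a production design: the word-count "embedding" stands in for a learned embedding model, the in-memory list stands in for a vector store, and every function name and sample document is hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Step 1 (toy): represent text as lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Step 2: rank documents by similarity to the query and keep the top k."""
    query_vector = vectorize(query)
    ranked = sorted(documents,
                    key=lambda doc: cosine_similarity(query_vector, vectorize(doc)),
                    reverse=True)
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Step 3: build the prompt that would be sent to the model."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "Lustre file systems deliver high throughput for training clusters.",
        "Product catalog updates are ingested nightly in batch.",
        "Snapshots create point-in-time copies of a volume.",
    ]
    question = "How are product catalog updates ingested?"
    print(augment(question, retrieve(question, docs, k=1)))
```

In a real deployment the vectors would be computed once at ingestion time and stored, rather than recomputed on every query as this sketch does.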

Ingestion and Data Strategy with Amazon S3

Amazon S3 serves as the foundational data lake for these AI workflows due to its cost-effectiveness and virtually unlimited scalability. Organizations typically utilize two ingestion patterns:
* Batch Ingestion: Suitable for static or infrequently changing data such as historical records and product catalogs.
* Real-Time Ingestion: Essential for agentic workflows where AI models must respond to the latest available information.

Modernizing Self-Managed Databases with Amazon FSx

While fully managed services like Amazon RDS are popular, certain business and technical requirements drive organizations toward self-managed database architectures on AWS.

Drivers for Self-Managed Databases

Organizations choose to self-manage databases like Oracle, SQL Server, or SAP HANA for several reasons:
* Granular Control: The ability to choose specific versions of the database engine and the underlying operating system.
* Custom Protection Policies: Implementing specific backup intervals and recovery procedures that may not be available in managed services.
* High Resilience: Deploying databases across multiple Availability Zones or Regions with custom failover configurations.

Optimization through Storage Features

A commonly overlooked point in database deployment is that the storage layer can add significant value beyond simple data persistence. Amazon FSx file systems (including FSx for NetApp ONTAP, FSx for OpenZFS, and FSx for Windows File Server) enable features like:
* Snapshots and Cloning: Facilitating rapid testing and database upgrades by creating near-instantaneous copies of production environments.
* Performance Tuning: Choosing the right FSx service can reduce the TCO and improve the performance of database environments, particularly for high-transaction workloads.

Conclusion

As compute power continues to expand, the storage layer must evolve from a passive repository into a high-performance engine. By leveraging Amazon FSx and S3, organizations can eliminate storage bottlenecks, enabling their most demanding AI, HPC, and database workloads to scale linearly and cost-effectively in the cloud.
