Overview
What is AWS S3?
Amazon S3 (Simple Storage Service) is a scalable, distributed object storage system widely used for storing and retrieving large amounts of data, including files, backups, media, and logs. It organizes data into buckets and supports features like versioning, lifecycle policies, and multi-region durability. S3 is optimized for durability, availability, and throughput at internet scale, and it abstracts away complex infrastructure, making it a common mental model when designing file storage systems in interviews.
Requirements
Functional Requirements
- createDirectory(path) - create new directory
- deleteDirectory(path) - delete a file or an empty directory (non-empty directories cannot be deleted)
- createFile(path, data) - create a new immutable file
- readFile(path, offset, length) - retrieve (partial) contents of a file
- listDirectory(path) - list children of a directory
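The five operations above can be summarized in a minimal interface sketch (Python is used purely for illustration; the method names mirror the APIs listed):

```python
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Illustrative contract for the five functional requirements."""

    @abstractmethod
    def create_directory(self, path: str) -> bool: ...

    @abstractmethod
    def delete_directory(self, path: str) -> bool: ...  # file or empty dir only

    @abstractmethod
    def create_file(self, path: str, data: bytes) -> bool: ...  # immutable once written

    @abstractmethod
    def read_file(self, path: str, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def list_directory(self, path: str) -> list[str]: ...
```

Concrete implementations of each method are developed throughout the rest of the design.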
Constraints & Assumptions
To frame the design properly, let’s break down the problem space through the key constraints & assumptions given in the interview prompt. Each of them not only narrows down our choices but also tells us what matters most in this system.
- Immutable Files
Once a file is written, we assume it can’t be modified—only read or deleted. This assumption simplifies consistency and concurrency handling, but it raises the bar for durability. Every file write must ensure the data is fully persisted, verified, and recoverable, as there's no “fix it later” path.
- Mutable Directories
Users can add or delete files and subdirectories at any time, so the directory structure must support concurrent mutations with strong correctness guarantees. This is where transactional integrity and concurrency control (e.g., locks or database semantics) come into play.
- Use Only Primitive Infrastructure Building Blocks
We’re not allowed to use ready-made distributed storage like S3 or GFS. Instead, we can only build using low-level components:
  - RDBMS (e.g., PostgreSQL) for relational, transactional metadata
  - KV stores (e.g., Redis or RocksDB) for fast, hashed lookups
  - Zookeeper for distributed coordination (locking, routing)
  - Single-machine filesystems as the physical block store
This constraint tests your ability to architect a reliable distributed system from first principles.
- Scale: Millions of Users, 100s of Petabytes
We’re targeting massive scale: billions of files and directories spread across hundreds of petabytes of data. This mandates a horizontally scalable architecture, efficient metadata sharding, and smart data placement strategies to avoid bottlenecks.
- Single Region & Data Center
The system only needs to run in a single region and data center. This removes cross-region latency and replication complexity from our current scope, letting us focus on core file system design. (But note: multi-region could be an extension question.)
Non-Functional Requirements
- Scalability - 100s of PBs & millions of directories
- Consistency - strong for directory metadata ops, eventual for block replicas
- Durability - no data loss; redundant in block storage
- Latency - low latency for file reads and directory listings
Below the line:
- Multi-tenancy - Customers logically isolated
- Extensibility - Potential for versioning, ACLs, quotas later
Core Entities
- file (immutable, content-hashed, block-indexed)
- directory (mutable, hierarchical, maps path -> children)
- block (fixed size chunk(64MB), deduplicated by hash)
- path (absolute string (/abc/def/file))
- customer (tenant root-level entity(namespace, isolation))
APIs - with Entity Mapping
- createDirectory(path: str) -> boolean
Inserts a new directory into the namespace; updates the path hierarchy under the customer.
- deleteDirectory(path: str) -> boolean
Deletes a directory from the namespace after verifying no file or sub-directory children exist.
- createFile(path: str, data: bytes) -> boolean
Creates a file entry, stores content as hashed blocks, links metadata under the directory.
- readFile(path: str, offset: int, length: int) -> bytes
Reads file metadata and fetches the corresponding block ranges by block_hash.
- listDirectory(path: str) -> list[str]
Lists all file and directory entries under the given path in the namespace.
High Level Design

Building Blocks
To handle hundreds of petabytes of storage and trillions of files efficiently, we break the system into modular components, each with a specific role in the file system pipeline. This separation of responsibilities improves scalability, clarity, and fault isolation. Below is a high-level overview of the core components and how they fit together.
API Gateway
The API Gateway is the entry point for all external requests. It exposes high-level file system APIs like putFile, getFile, createDirectory, and others. It is stateless and horizontally scalable, simply forwarding validated requests to the backend service layer. This layer may also handle basic normalization, authentication, and retry logic.
File Service
The File Service is the central orchestrator for file system operations. It coordinates with other components to handle file uploads, reads, and metadata changes. For example, on a file upload, it splits the file into chunks, stores them across storage nodes, updates metadata, and commits the full file atomically. It ensures the correctness and consistency of user-visible operations.
Metadata Store
The Metadata Store tracks the structure of the file system—what paths exist, which are directories or files, who owns them, and which blocks make up a file. It stores two main types of metadata: directory hierarchy and file-to-block mappings. This store is strongly consistent and supports transactional updates, making it the foundation of the system’s correctness.
Chunk Manager
The Chunk Manager maintains a fast-access mapping from block hash to the list of nodes that store that block. It acts like a distributed address book for locating content-addressed data. It is backed by a key-value store and optimized for high-throughput lookups during file reads and uploads. It may also track block usage for garbage collection.
Block Store
The Block Store defines the logic for handling immutable blocks of data. Files are broken into fixed-size blocks (e.g., 64MB), each addressed by a SHA256 hash. These blocks are never changed once written. The block store handles integrity checks and provides APIs to read or write a specific block by hash.
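The chunking and content-addressing step described above can be sketched as follows. The 64MB constant matches the text; a plain dict stands in for the physical store, and a small block size is used when exercising the function so the example stays compact:

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB in the real system

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split immutable file content into fixed-size blocks, each addressed
    by the SHA-256 hash of its bytes (content addressing)."""
    blocks = {}   # hash -> bytes; identical chunks dedupe to one entry
    order = []    # ordered hashes, needed to reassemble the file
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        h = hashlib.sha256(chunk).hexdigest()
        blocks[h] = chunk
        order.append(h)
    return order, blocks
```

Because the address is derived from the content, re-uploading the same block is a harmless no-op, and corruption is detectable by re-hashing.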
Storage Nodes
Storage Nodes physically store and serve the blocks on local disk. Each node handles basic upload and download operations for the blocks it holds. Nodes are simple by design and scale out horizontally. They report their status to the chunk manager and are periodically cleaned up by background workers.
Garbage Collection Workers
Garbage Collection (GC) Workers clean up unused data and metadata asynchronously. When a file is deleted or no longer referenced, GC workers remove its associated blocks from storage and update metadata entries accordingly. These workers run in the background and are carefully throttled to avoid disrupting live traffic.
Workflows
To understand how the system fulfills core file system operations, we walk through simplified workflows that explain how each API maps to internal component interactions. These flows show how the architecture supports core functional requirements while remaining modular and efficient.
createDirectory
API Signature
createDirectory(path: str) -> boolean
Intent
Create a new directory if the parent exists and the path is not already taken.
How it Works
- Client sends request to API Gateway.
- File Service parses and validates the path.
- Metadata DB ensures parent exists and inserts a new entry for the directory.
This supports hierarchical directory creation and ensures path integrity.
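As a sketch of step 3, here is what the check-and-insert transaction could look like, with SQLite standing in for the metadata DB (table and column names follow the namespace_entries schema described later):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE namespace_entries (
    path TEXT PRIMARY KEY, parent_path TEXT, type TEXT)""")
db.execute("INSERT INTO namespace_entries VALUES ('/', NULL, 'directory')")

def create_directory(path: str) -> bool:
    """Insert a directory entry only if the parent exists; the check and
    insert run in one transaction so concurrent creates cannot race."""
    parent = path.rsplit("/", 1)[0] or "/"
    try:
        with db:  # one transaction: verify parent, then insert
            row = db.execute(
                "SELECT 1 FROM namespace_entries WHERE path=? AND type='directory'",
                (parent,)).fetchone()
            if row is None:
                return False  # parent missing: reject
            db.execute(
                "INSERT INTO namespace_entries VALUES (?, ?, 'directory')",
                (path, parent))
        return True
    except sqlite3.IntegrityError:
        return False  # path already taken (primary-key constraint)
```

The primary-key constraint on path is what turns a concurrent duplicate create into a clean rejection rather than a corrupted namespace.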
deleteDirectory
API Signature
deleteDirectory(path: str) -> boolean
Intent
Remove an existing empty directory from the namespace. By design, a directory that still has children cannot be removed.
How it Works
- Client sends request to API Gateway.
- File Service checks that the path is a directory and is empty.
- Metadata DB deletes the path entry.
Supports cleanup of unused structures and keeps the namespace consistent.
createFile
API Signature
createFile(path: str, data: bytes) -> boolean
Intent
Create a new immutable file by chunking the data and storing its metadata.
How it Works
- File Service verifies the parent exists and path is free.
- File content is split into blocks and uploaded to Storage Nodes.
- Metadata DB stores the file's block list and properties.
Satisfies immutability and scalable file storage with structured metadata tracking.
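A simplified sketch of this ordering, with in-memory dicts standing in for the storage nodes and metadata DB (a tiny block size keeps the example small):

```python
import hashlib

storage_nodes = {}   # block_hash -> bytes (stand-in for replicated storage)
file_metadata = {}   # path -> ordered list of block hashes

def create_file(path: str, data: bytes, block_size: int = 4) -> bool:
    """Upload every block first; commit metadata only once all blocks are
    stored, so a crash mid-upload never leaves a visible partial file."""
    if path in file_metadata:
        return False              # path already taken
    hashes = []
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        h = hashlib.sha256(chunk).hexdigest()
        storage_nodes[h] = chunk  # keyed by hash, so retries are idempotent
        hashes.append(h)
    file_metadata[path] = hashes  # metadata commit is the final step
    return True
```

Note the invariant: the file becomes visible in metadata only after every block it references exists in storage.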
readFile
API Signature
readFile(path: str, offset: int, length: int) -> bytes
Intent
Fetch part of a file using random access on immutable block data.
How it Works
- File Service retrieves block list and calculates which blocks are needed.
- Block data is fetched from Storage Nodes and stitched together.
Supports efficient and partial file reads, aligned with performance expectations.
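The block-index arithmetic in step 1 can be sketched like this; tiny 4-byte blocks and hard-coded lookup tables keep the example self-contained:

```python
# Stand-ins for the metadata store and storage nodes (illustrative data).
BLOCK = 4
file_blocks = {"/a/f": ["h0", "h1", "h2"]}                 # path -> ordered hashes
block_data = {"h0": b"abcd", "h1": b"efgh", "h2": b"ij"}   # hash -> bytes

def read_file(path: str, offset: int, length: int) -> bytes:
    """Map (offset, length) to a contiguous block range, fetch only those
    blocks, and slice the requested bytes out of the reassembled span."""
    hashes = file_blocks[path]
    first = offset // BLOCK                    # first block touched
    last = (offset + length - 1) // BLOCK      # last block touched
    span = b"".join(block_data[h] for h in hashes[first:last + 1])
    start = offset - first * BLOCK             # offset within the span
    return span[start:start + length]
```

Because only the blocks overlapping the requested range are fetched, a small read of a huge file costs a constant number of block fetches.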
listDirectory
API Signature
listDirectory(path: str) -> list[str]
Intent
List the direct children of a given directory.
How it Works
- File Service confirms the directory exists.
- Metadata DB returns immediate child paths.
Provides directory introspection necessary for navigation and organization.
Data Schema
To support a scalable, hierarchical distributed file system, we separate responsibilities across three key schema tables. The namespace_entries table captures the logical structure of the system—both directories and files—in a unified tree, enabling efficient traversal and listing. The file_metadata table focuses solely on file-specific details, such as size and the ordered list of content-addressed blocks, allowing for precise and efficient reads. Finally, the block_locations table decouples physical storage from logical structure by mapping each unique block hash to its replica locations, enabling deduplication, replication, and parallel access. Together, these three tables ensure a clear separation of concerns—structure, metadata, and storage—which can later support strong consistency, high scalability, and modular system evolution.
1. namespace_entries
This table represents the hierarchical structure of the file system and includes both files and directories in a unified schema. The path column acts as the primary key and stores the fully qualified name (e.g., /user/data/file1.txt), which allows quick lookups and ensures uniqueness. name captures the leaf node (e.g., in a Trie structure) for display or listing purposes, and parent_path supports efficient directory traversal and listing via indexed lookups (e.g., for listDirectory). The type column distinguishes files from directories, enabling proper validation during creation and deletion operations. Timestamps like created_at and updated_at are useful for auditing, synchronization, and TTL-based policies. This table forms the logical foundation of the filesystem’s namespace.
2. file_metadata
This table stores file-specific metadata that is not applicable to directories. The path column (a foreign key to namespace_entries) links this metadata to a logical file entry. size_bytes tracks the file’s total size, helping validate offsets and calculate throughput. block_count gives a quick count of how many blocks make up the file, which is helpful during reads. block_size documents the system’s chunking strategy (e.g., fixed 64MB blocks), and block_hashes stores an ordered list of SHA256 hashes corresponding to each block—allowing for precise byte-range reads using block index arithmetic. The created_at timestamp supports audit logging and optional versioning. Together, this schema enables scalable, immutable file reads and writes with minimal metadata overhead.
3. block_locations
This table maps each content-addressed block_hash to the list of storage nodes that hold its replicas. The block_hash (SHA256 of the block content) is the primary key and ensures deduplication: identical content results in the same hash and is stored only once. replica_nodes holds an array of node identifiers that currently host this block, which supports high availability and latency-aware routing. ref_count tracks how many files reference this block—critical for efficient garbage collection when files are deleted. last_used_at helps with eviction policies or diagnosing cold storage patterns. This table abstracts the physical distribution layer and enables efficient, fault-tolerant retrieval of blocks at scale.
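The three tables can be written out as DDL. This is an illustrative sketch using SQLite types as stand-ins for a real Postgres/CockroachDB deployment; array-valued columns are stored as JSON text:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE namespace_entries (
    path        TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    parent_path TEXT,
    type        TEXT CHECK (type IN ('file', 'directory')),
    created_at  INTEGER,
    updated_at  INTEGER
);
-- parent_path index makes listDirectory an indexed lookup
CREATE INDEX idx_parent ON namespace_entries(parent_path);

CREATE TABLE file_metadata (
    path         TEXT PRIMARY KEY REFERENCES namespace_entries(path),
    size_bytes   INTEGER,
    block_count  INTEGER,
    block_size   INTEGER,
    block_hashes TEXT,   -- JSON array of ordered SHA-256 hashes
    created_at   INTEGER
);

CREATE TABLE block_locations (
    block_hash    TEXT PRIMARY KEY,
    replica_nodes TEXT,  -- JSON array of node identifiers
    ref_count     INTEGER DEFAULT 0,
    last_used_at  INTEGER
);
""")
```

In production, block_locations would typically live in the KV-based Chunk Manager rather than the relational store; it is shown here only to make the column set concrete.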

Deep Dives
1: How can we reduce read latency at scale?
In large-scale distributed file systems managing trillions of immutable files and petabytes of data, read latency is one of the most critical performance bottlenecks. This is especially evident when users fetch large files composed of dozens of blocks—each stored across multiple nodes. A naive design struggles due to high fan-out reads, slow or uneven replica response times, and expensive block-location metadata lookups. Solving this requires a layered approach that blends concurrency, smart replica routing, and strategic caching.
When a client issues a getFile(path, offset, length) request, the system performs several steps: resolving file metadata into block hashes, mapping those hashes to replica nodes, retrieving the relevant blocks, and assembling the requested byte range. Each step adds latency, and the fan-out nature of reads—especially for large files—can make the tail latency unpredictable.
To reduce this, the 1st layer of defense is Parallel Block Fetch with Smart Replica Selection. Rather than fetching blocks serially or picking replicas arbitrarily, the system queries all necessary blocks concurrently and chooses the fastest replica for each block based on real-time node health, response history, or data locality. This significantly improves baseline performance, especially for cold reads, long-range reads, or mixed workloads with varying access patterns.
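A sketch of the idea, with a hard-coded latency table standing in for real-time node health; in production the read would actually be issued to the chosen replica:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative replica map: block hash -> {node: observed latency in ms}.
replicas = {
    "h0": {"node-a": 12.0, "node-b": 3.0},
    "h1": {"node-b": 5.0, "node-c": 2.0},
}

def fetch_block(block_hash: str) -> tuple[str, str]:
    """Smart replica selection: pick the replica with the best
    observed latency for this block."""
    nodes = replicas[block_hash]
    best = min(nodes, key=nodes.get)
    return block_hash, best  # a real system would now read from `best`

def fetch_all(hashes: list[str]) -> dict[str, str]:
    """Parallel block fetch: query every needed block concurrently
    instead of serially."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(pool.map(fetch_block, hashes))
```

With serial fetches, total latency is the sum over blocks; with parallel fetches it approaches the maximum of the per-block latencies, which is what tames long-range reads.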
For even faster performance on hot or predictable paths, the system adds a 2nd layer: Block-Level Caching. Here, the File Service node maintains a lightweight in-memory L1 cache for recently accessed blocks, while a larger SSD-based L2 cache (e.g., Redis or local disk) serves frequently read content like model checkpoints or web assets. Since blocks are immutable, the caching system avoids invalidation complexity and can follow a simple cache-aside or prewarming model, making it safe and highly effective.
However, both layers rely on one foundational requirement: ultra-fast block-location resolution. This is handled via a Distributed Key-Value Store, which maps block_hash → [replica nodes]. To ensure scalability and low latency, a horizontally partitioned store like Redis Cluster or FoundationDB is used, supporting batched multi-key lookups and millisecond-level latency even at trillion-scale block counts. The use of uniform SHA256 hashes ensures even distribution, reducing the risk of hotspots.
Each entry in this store is minimal yet expressive:
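For instance, an entry might look like the following (the field names are illustrative assumptions, consistent with the block_locations schema described earlier):

```python
import hashlib

# One chunk-manager entry: block hash -> replica locations plus the
# bookkeeping fields used by GC and access monitoring.
entry = {
    "block_hash": hashlib.sha256(b"example block").hexdigest(),  # content address
    "replica_nodes": ["node-17", "node-42", "node-91"],          # current hosts
    "ref_count": 3,            # files referencing this block (for GC)
    "last_used_at": 1718000000 # unix seconds, for cold-data analysis
}
```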
This supports not only routing decisions but also enables downstream processes like garbage collection and access monitoring. Co-locating KV shards with File Service nodes and batching lookups further drives down tail latency.
In summary, this layered approach—parallel fetches, smart caching, and efficient KV lookup—forms a robust foundation for low-latency reads. It gracefully scales with the number of files and blocks, adapts to varying access patterns, and ensures consistent performance even under massive user and data growth.
Here is the improved high-level design diagram:

2: Ensuring Strong Consistency in File and Directory Operations
Strong consistency is crucial for metadata operations—especially in a hierarchical file system where directory structure and file placement must remain accurate even under concurrent access. When users are creating, deleting, or listing files and directories, the system must guarantee correctness and avoid conflicts.
We focus our strong consistency model on the namespace metadata layer, where operations like putFile, createDirectory, and deleteDirectory occur. These operations interact with the core data tables: namespace_entries and file_metadata.
Why It Matters
Imagine two users trying to write to the same directory at the same time, or one user trying to delete a folder while another is uploading a file into it. Without a consistency mechanism, this leads to orphan files, ghost directories, or data corruption.
To handle this, we apply three clear strategies:
1. Use a Transactional Metadata Store
All directory and file information is stored in a centralized metadata database (like Postgres). This allows operations to:
- Atomically check if a path exists
- Insert or delete entries with isolation
- Reject conflicting writes using unique constraints
Example: createDirectory("/project/sub") checks that the parent exists and then inserts the new path—all in a single transaction.
2. Apply Path-Level Locking on Mutations
Before modifying a directory (creating a new file, deleting a subfolder), the system acquires a lock on the parent path. This prevents race conditions between concurrent operations.
- We can use database-level advisory locks or a distributed coordinator like Zookeeper.
- This guarantees that only one operation mutates a given path at a time.
Example:
- Before inserting file1.txt into /a/b/, a lock is acquired on /a/b/.
- Other operations must wait or retry, avoiding overlapping updates.
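An in-process sketch of path-level locking; in a real deployment the lock map would be replaced by Postgres advisory locks or a Zookeeper lock service, but the serialization pattern is the same:

```python
import threading
from collections import defaultdict

# One lock per parent path (stand-in for a distributed lock service).
path_locks = defaultdict(threading.Lock)

def mutate_under(parent_path: str, mutation):
    """Run a namespace mutation while holding the parent directory's lock,
    so concurrent operations on the same directory are serialized."""
    with path_locks[parent_path]:
        return mutation()

# Four concurrent "file creates" under the same parent, all serialized.
children = []
threads = [
    threading.Thread(target=mutate_under,
                     args=("/a/b/", lambda i=i: children.append(f"file{i}.txt")))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Locks on different parent paths do not contend, so unrelated directories still mutate in parallel.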
3. Commit Metadata Only After Successful Data Writes
During putFile, blocks are uploaded to the storage layer first. The file metadata and namespace entry are written only after all blocks are confirmed. This avoids cases where an entry appears in the directory but its data is missing.
This approach ensures that:
- No partial files exist
- Reads never encounter broken references
3: How do we ensure there is no data loss?
Data loss can creep in at many points: partial uploads, service crashes, disk failures, and silent corruption. Each of these failure points must be addressed explicitly and defensively, not just assumed away. To say we’ve truly “solved durability” in a distributed file system, the system must satisfy several foundational guarantees, each addressing a different type of data loss risk in large-scale environments.
First, consider the block storage layer—the foundation of everything else. Since our files are immutable and chunked into content-addressable blocks, the system can independently replicate each block across multiple storage nodes. The obvious question is: how many replicas, and when do we consider a block "safe"? A common solution is to use a replication factor of 3, with quorum acknowledgment (e.g., 2 of 3 nodes must confirm a successful write before the block is considered durable). This provides a balance: it tolerates the failure of one replica and still retains data, without waiting on the slowest node (which helps latency too). Some systems try erasure coding to reduce storage cost, but it complicates recovery logic and isn’t great for write-heavy or low-latency workloads.
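The quorum rule itself is simple; a minimal sketch of the acknowledgment check (constants match the 2-of-3 policy described above):

```python
REPLICATION_FACTOR = 3
WRITE_QUORUM = 2

def is_durable(acks: list[bool]) -> bool:
    """A block write counts as durable once a quorum of replicas confirm,
    so the slowest replica never sits on the critical write path."""
    assert len(acks) == REPLICATION_FACTOR
    return sum(acks) >= WRITE_QUORUM
```

The third replica still completes asynchronously; the quorum only decides when the client can be told the write is safe.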
Next, let's talk about ordering and coordination between data and metadata. This is one of the most common sources of data loss or inconsistency if handled poorly. Suppose we write metadata first and crash before finishing block uploads. Now users see a file that doesn’t exist—or worse, a partial, unreadable one. To avoid this, our system writes all blocks first, and only when all are safely stored, it atomically commits metadata to the namespace (e.g., via a transaction to the namespace_entries and file_metadata tables). This sequencing ensures no file is visible until it’s fully present. We also treat block uploads as idempotent by keying them with content hashes. So, if a crash or retry happens, no duplicates are introduced.
Now consider crash recovery and orphan cleanup. Despite our best efforts, uploads can still fail mid-way—maybe the client disconnected, or a replica was unreachable. For this, we support background garbage collection (GC): the system periodically scans for blocks that are not referenced by any file metadata (i.e., blocks with ref_count = 0), and deletes them after a grace period. This means even if blocks are uploaded but never registered, they won’t waste space forever. Optionally, we can also use a write-ahead log (WAL) inside the file service—logging intent before attempting uploads—so on crash recovery, the service knows what needs cleanup.
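The GC sweep described above can be sketched as a periodic scan over block records; the field names and grace period are illustrative:

```python
import time

# Illustrative block table: hash -> {"ref_count": int, "orphaned_at": float|None}
blocks = {
    "h-live":   {"ref_count": 2, "orphaned_at": None},
    "h-orphan": {"ref_count": 0, "orphaned_at": time.time() - 7200},
    "h-recent": {"ref_count": 0, "orphaned_at": time.time() - 60},
}

def gc_sweep(grace_seconds: float = 3600) -> list[str]:
    """Delete blocks no file references, but only after a grace period so
    in-flight uploads that haven't committed metadata yet are not lost."""
    now = time.time()
    doomed = [h for h, b in blocks.items()
              if b["ref_count"] == 0
              and b["orphaned_at"] is not None
              and now - b["orphaned_at"] > grace_seconds]
    for h in doomed:
        del blocks[h]
    return doomed
```

The grace period is what makes the blocks-first write ordering safe to combine with GC: a block uploaded moments ago with ref_count still 0 is left alone.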
Finally, to safeguard against long-term data degradation, like bit rot or disk failures, we enforce hash-based verification and re-replication. Every block is identified by a SHA256 hash, which is checked at write time and optionally revalidated on read or via periodic scrubbing. When a storage node fails permanently or gets marked as "unhealthy," the system consults the chunk manager to detect which blocks have lost replicas and re-replicates from surviving copies to restore the desired replication factor.
4: How does the system support high scalability?
The answer is similar to most of the other interview questions we cover: scale horizontally with stateless servers and storage nodes. To recap, here is the high-level picture of what we can do for each potential bottleneck in the current design:
These answers come easily if you have already studied our other content on ShowOffer. But what if the interviewer wants to go deeper into numerical estimation of resource requirements? For this question, a quantitative calculation is especially helpful. We will focus on the top three layers (Metadata DB, Chunk Manager, and Storage Nodes) to estimate how much capacity we need, since they act as the core storage and logic units in our design.
Metadata Storage: Namespace Entries & File Metadata
Estimation
- ~12B namespace entries / ~100M entries per shard = ~120 DB shards (Postgres or CockroachDB)
- Shard key = customer_id or a prefix hash of /path
- Horizontally scaled with per-tenant ownership
Scale Strategy
Use logical sharding with load-aware router; add new shards over time
In practice, minimizing the cost of metadata migration during shard expansion with consistent hashing involves using virtual nodes (vnodes) and lazy, on-demand rebalancing. Each physical shard hosts multiple virtual nodes in the hash ring, spreading load more evenly. When a new shard is added, it claims a subset of vnodes, and only the entries that fall into the new vnode ranges are affected. Instead of eagerly migrating all impacted rows, the system can adopt a read-triggered migration model: when a request arrives for a path now owned by the new shard, the data is fetched from the old shard, written to the new one, and lazily garbage-collected from the source. This incremental approach spreads migration cost over time, avoids write spikes, and keeps the system fully online with no downtime or blocking.
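The vnode idea can be sketched with a small consistent-hash ring; this is an illustrative implementation, and the shard names and vnode count are arbitrary:

```python
import bisect
import hashlib

class VNodeRing:
    """Consistent-hash ring with virtual nodes: each physical shard owns
    many points on the ring, so adding a shard claims only a small,
    evenly spread slice of the keyspace."""

    def __init__(self, vnodes_per_shard: int = 64):
        self.vnodes = vnodes_per_shard
        self.ring = []  # sorted list of (hash point, shard)

    def _point(self, key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_shard(self, shard: str) -> None:
        # Each shard contributes `vnodes` points scattered around the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._point(f"{shard}#{i}"), shard))

    def owner(self, path: str) -> str:
        # A path is owned by the first vnode clockwise from its hash point.
        p = self._point(path)
        idx = bisect.bisect(self.ring, (p, "")) % len(self.ring)
        return self.ring[idx][1]
```

Adding a third shard to a two-shard ring moves roughly a third of the paths; the untouched paths never migrate, which is exactly what keeps shard expansion cheap.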
Chunk Manager (KV Store)
Estimation
- If each KV node stores ~500GB: 20TB / 500GB = ~40 KV partitions
- Each file read triggers ~10 block lookups: 100K file reads/sec → 1M block lookups/sec
- Each KV node can serve ~50K QPS → ~20 active nodes needed
Scale Strategy
Shard by block_hash prefix; use a high-throughput KV store (e.g., Redis Cluster, RocksDB, FoundationDB)
Storage Nodes
Estimation
- Raw = 200PB * 3 = 600PB total replicated storage
- 600PB / 100TB = ~6,000 storage nodes
- Each node serves 5K–10K IOPS
- With 100K active reads/sec → 3K nodes needed just for bandwidth
Scale Strategy
- Uniformly distribute blocks by hash
- Route read to least-loaded replica
- Scale nodes horizontally and monitor skew
Final Thoughts
Designing a distributed hierarchical file system at PB scale challenges you to balance immutability and flexibility, enforce strong metadata consistency, and still deliver low-latency file access across hundreds of petabytes. By breaking down the problem into core workflows, building blocks, and deep dives—latency, consistency, durability, and scalability—you showcase not only architectural thinking but also your ability to reason through trade-offs under real-world constraints.
If you found this breakdown insightful but want to practice explaining your reasoning out loud, or simulate this interview under pressure—our coaching team at ShowOffer is here to help. From mock interviews tailored to companies like OpenAI and Meta, to 1:1 strategy reviews and system design white-boarding, we’ll help you bridge the gap from good to offer-ready.
Explore our expert-led mock system design rounds at ShowOffer.io — and get the offer you deserve.