Design an S3-like distributed file system from the ground up, without relying on AWS features.

Distributed File System

Written by
Last revisited
June 25, 2025

Overview

What is AWS S3?

Amazon S3 (Simple Storage Service) is a scalable, distributed object storage system widely used for storing and retrieving large amounts of data, including files, backups, media, and logs. It organizes data into buckets and supports features like versioning, lifecycle policies, and multi-region durability. S3 is optimized for durability, availability, and throughput at internet scale, and it abstracts away complex infrastructure, making it a common mental model when designing file storage systems in interviews.

Requirements

Functional Requirements

  1. createDirectory(path) - create new directory
  2. deleteDirectory(path) - delete a file or an empty directory (non-empty directories cannot be deleted)
  3. createFile(path, data) - create a new immutable file
  4. readFile(path, offset, length) - retrieve (partial) contents of a file
  5. listDirectory(path) - list children of a directory

Constraints & Assumptions

To frame the design properly, let’s break down the problem space through the key constraints & assumptions given in the interview prompt. Each of them not only narrows down our choices but also tells us what matters most in this system.

  1. Immutable Files
    Once a file is written, we assume it can’t be modified—only read or deleted. This assumption simplifies consistency and concurrency handling, but it raises the bar for durability. Every file write must ensure the data is fully persisted, verified, and recoverable, as there's no “fix it later” path.
  2. Mutable Directories
    Users can add or delete files and subdirectories at any time, so the directory structure must support concurrent mutations with strong correctness guarantees. This is where transactional integrity and concurrency control (e.g., locks or database semantics) come into play.
  3. Use Only Primitive Infrastructure Building Blocks
    We’re not allowed to use ready-made distributed storage like S3 or GFS. Instead, we can only build using low-level components like:
    1. RDBMS (e.g., PostgreSQL) for relational, transactional metadata
    2. KV Stores (e.g., Redis or RocksDB) for fast, hashed lookups
    3. Zookeeper for distributed coordination (locking, routing)
    4. Single-machine filesystems as the physical block store
    This constraint tests your ability to architect a reliable distributed system from first principles.
  • Apache Zookeeper is a lightweight coordination service designed for distributed systems. It works like a consistent, replicated key-value store with a file-like hierarchy, where each node (called a znode) can store data and metadata. It guarantees strong consistency using a leader-based protocol (ZAB), making it ideal for managing distributed locks, leader election, and synchronized access. In our file system design, Zookeeper helps safely coordinate directory-level operations across multiple nodes, ensuring correctness when many users modify paths concurrently, which we will cover later.
  4. Scale: Millions of Users, 100s of Petabytes
    We’re targeting massive scale: billions of files and directories spread across petabytes of data. This mandates horizontally scalable architecture, efficient metadata sharding, and smart data placement strategies to avoid bottlenecks.
  5. Single Region & Data Center
    The system only needs to run in a single region and data center. This removes cross-region latency and replication complexity from our current scope, letting us focus more on core file system design. (But note: multi-region could be an extension question.)

Non-Functional Requirements

  1. Scalability - 100s of PBs & millions of directories
  2. Consistency - strong for directory metadata ops, eventual for block replicas
  3. Durability - no data loss; redundant in block storage
  4. Latency - low latency for file reads and directory listings

Below the line:

  • Multi-tenancy - Customers logically isolated
  • Extensibility - Potential for versioning, ACLs, quotas later

Common Pitfalls to Avoid

Before diving further into the design, be aware of the critical mistakes many candidates make, such as:
  • Overusing abstract cloud terms like “just like S3” without showing how it’s built from scratch.
  • Ignoring immutability constraints, especially when handling file writes or updates.
  • Conflating file and directory semantics, leading to incorrect handling of metadata operations.
  • Underestimating metadata bottlenecks — the system’s scalability often hinges more on managing trillions of paths than raw data volume.
  • Skipping consistency guarantees — assuming eventual consistency is “good enough” everywhere, even for hierarchical operations.

Core Entities

  • file (immutable, content-hashed, block-indexed)
  • directory (mutable, hierarchical, maps path -> children)
  • block (fixed-size chunk (64 MB), deduplicated by hash)
  • path (absolute string, e.g., /abc/def/file)
  • customer (tenant root-level entity (namespace, isolation))

APIs - with Entity Mapping

  • createDirectory(path: str) -> boolean
    Inserts a new directory into the namespace; updates the path hierarchy under the customer.
  • deleteDirectory(path: str) -> boolean
    Deletes a directory from the namespace after verifying no file or sub-directory children exist.
  • createFile(path: str, data: bytes) -> boolean
    Creates a file entry, stores content as hashed blocks, links metadata under the parent directory.
  • readFile(path: str, offset: int, length: int) -> bytes
    Reads file metadata and fetches the corresponding block ranges by block_hash.
  • listDirectory(path: str) -> list[str]
    Lists all file and directory entries under the given path in the namespace.

High Level Design

Building Blocks

To handle hundreds of petabytes of storage and trillions of files efficiently, we break the system into modular components, each with a specific role in the file system pipeline. This separation of responsibilities improves scalability, clarity, and fault isolation. Below is a high-level overview of the core components and how they fit together.

API Gateway

The API Gateway is the entry point for all external requests. It exposes high-level file system APIs like putFile, getFile, createDirectory, and others. It is stateless and horizontally scalable, simply forwarding validated requests to the backend service layer. This layer may also handle basic normalization, authentication, and retry logic.

File Service

The File Service is the central orchestrator for file system operations. It coordinates with other components to handle file uploads, reads, and metadata changes. For example, on a file upload, it splits the file into chunks, stores them across storage nodes, updates metadata, and commits the full file atomically. It ensures the correctness and consistency of user-visible operations.

Metadata Store

The Metadata Store tracks the structure of the file system—what paths exist, which are directories or files, who owns them, and which blocks make up a file. It stores two main types of metadata: directory hierarchy and file-to-block mappings. This store is strongly consistent and supports transactional updates, making it the foundation of the system’s correctness.

Chunk Manager

The Chunk Manager maintains a fast-access mapping from block hash to the list of nodes that store that block. It acts like a distributed address book for locating content-addressed data. It is backed by a key-value store and optimized for high-throughput lookups during file reads and uploads. It may also track block usage for garbage collection.

Block Store

The Block Store defines the logic for handling immutable blocks of data. Files are broken into fixed-size blocks (e.g., 64MB), each addressed by a SHA256 hash. These blocks are never changed once written. The block store handles integrity checks and provides APIs to read or write a specific block by hash.

Storage Nodes

Storage Nodes physically store and serve the blocks on local disk. Each node handles basic upload and download operations for the blocks it holds. Nodes are simple by design and scale out horizontally. They report their status to the chunk manager and are periodically cleaned up by background workers.

Garbage Collection Workers

Garbage Collection (GC) Workers clean up unused data and metadata asynchronously. When a file is deleted or no longer referenced, GC workers remove its associated blocks from storage and update metadata entries accordingly. These workers run in the background and are carefully throttled to avoid disrupting live traffic.

Workflows

To understand how the system fulfills core file system operations, we walk through simplified workflows that explain how each API maps to internal component interactions. These flows show how the architecture supports core functional requirements while remaining modular and efficient.

  • Workflow Structure
    Each workflow follows the same structure:
    • API Signature
    • Intent
    • How it Works (high-level component interactions)

createDirectory

API Signature

createDirectory(path: str) -> bool

Intent

Create a new directory if the parent exists and the path is not already taken.

How it Works

  • Client sends request to API Gateway.
  • File Service parses and validates the path.
  • Metadata DB ensures parent exists and inserts a new entry for the directory.

This supports hierarchical directory creation and ensures path integrity.

deleteDirectory

API Signature

deleteDirectory(path: str) -> bool

Intent

Remove an existing empty directory from the namespace. By design, a directory that still has children cannot be removed.

How it Works

  • Client sends request to API Gateway.
  • File Service checks that the path is a directory and is empty.
  • Metadata DB deletes the path entry.

Supports cleanup of unused structures and keeps the namespace consistent.

createFile

API Signature

putFile(path: str, data: bytes) -> bool

Intent

Create a new immutable file by chunking the data and storing its metadata.

How it Works

  • File Service verifies the parent exists and path is free.
  • File content is split into blocks and uploaded to Storage Nodes.
  • Metadata DB stores the file's block list and properties.

Satisfies immutability and scalable file storage with structured metadata tracking.
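
To make this write ordering concrete, here is a minimal Python sketch of the createFile (putFile) path. The in-memory dicts stand in for the Storage Nodes and the file metadata table and are purely illustrative:

    import hashlib

    BLOCK_SIZE = 64 * 1024 * 1024  # fixed 64 MB blocks

    # In-memory stand-ins for the Storage Nodes and the Metadata DB (illustration only).
    block_store: dict[str, bytes] = {}        # block_hash -> block bytes
    file_metadata: dict[str, list[str]] = {}  # path -> ordered block hashes

    def put_file(path: str, data: bytes) -> bool:
        # 1. Split the file into fixed-size blocks and content-address each one by SHA256.
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        block_hashes = [hashlib.sha256(b).hexdigest() for b in blocks]
        # 2. Upload all blocks first; keyed by hash, uploads are idempotent and deduplicated.
        for h, b in zip(block_hashes, blocks):
            block_store.setdefault(h, b)
        # 3. Only after every block is durable is the metadata committed, making the file visible.
        if path in file_metadata:
            return False  # immutable files: an existing path cannot be overwritten
        file_metadata[path] = block_hashes
        return True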

readFile

API Signature

getFile(path: str, offset: int, length: int) -> bytes

Intent

Fetch part of a file using random access on immutable block data.

How it Works

  • File Service retrieves block list and calculates which blocks are needed.
  • Block data is fetched from Storage Nodes and stitched together.

Supports efficient and partial file reads, aligned with performance expectations.
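
The block index arithmetic behind a partial read is simple; here is a sketch with a fixed 64 MB block size, where fetch_block is a placeholder for the Storage Node call:

    BLOCK_SIZE = 64 * 1024 * 1024  # fixed 64 MB blocks

    def read_range(block_hashes: list[str], offset: int, length: int, fetch_block) -> bytes:
        """Return `length` bytes starting at `offset`, touching only blocks that overlap the range."""
        first = offset // BLOCK_SIZE                # index of the first block needed
        last = (offset + length - 1) // BLOCK_SIZE  # index of the last block needed
        parts = []
        for idx in range(first, last + 1):
            block = fetch_block(block_hashes[idx])  # these fetches can run in parallel
            start = offset - idx * BLOCK_SIZE if idx == first else 0
            end = offset + length - idx * BLOCK_SIZE if idx == last else BLOCK_SIZE
            parts.append(block[start:end])
        return b"".join(parts)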

listDirectory

API Signature

listDirectory(path: str) -> List[str] (paginated)

Intent

List direct children of a given directory.

How it Works

  • File Service confirms the directory exists.
  • Metadata DB returns immediate child paths.

Provides directory introspection necessary for navigation and organization.

Data Schema

To support a scalable, hierarchical distributed file system, we need to clearly separate responsibilities across three key schema tables. The namespace_entries table captures the logical structure of the system—both directories and files—in a unified tree, enabling efficient traversal and listing. The file_metadata table focuses solely on file-specific details, such as size and the ordered list of content-addressed blocks, allowing for precise and efficient reads. Finally, the block_locations table decouples physical storage from logical structure by mapping each unique block hash to its replica locations, enabling deduplication, replication, and parallel access. Together, these three tables ensure clear separation of concerns—structure, metadata, and storage—which in future can further support strong consistency, high scalability, and modular system evolution.

1. namespace_entries

This table represents the hierarchical structure of the file system and includes both files and directories in a unified schema. The path column acts as the primary key and stores the fully qualified name (e.g., /user/data/file1.txt), which allows quick lookups and ensures uniqueness. name captures the leaf component of the path (the last segment, e.g., file1.txt, analogous to a leaf node in a trie) for display or listing purposes, and parent_path supports efficient directory traversal and listing via indexed lookups (e.g., for listDirectory). The type column distinguishes files from directories, enabling proper validation during creation and deletion operations. Timestamps like created_at and updated_at are useful for auditing, synchronization, and TTL-based policies. This table forms the logical foundation of the filesystem’s namespace.

2. file_metadata

This table stores file-specific metadata that is not applicable to directories. The path column (foreign key to namespace_entries) links this metadata to a logical file entry. size_bytes tracks the file’s total size, helping validate offsets and calculate throughput. block_count gives a quick count of how many blocks make up the file, which is helpful during reads. block_size documents the system’s chunking strategy (e.g., fixed 64MB), and block_hashes stores an ordered list of SHA256 hashes corresponding to each block—allowing for precise byte-range reads using block index arithmetic. The created_at timestamp supports audit logging and optional versioning. Together, this schema enables scalable, immutable file reads and writes with minimal metadata overhead.

3. block_locations

This table maps each content-addressed block_hash to a list of storage nodes that hold its replicas. The block_hash (SHA256 of the block content) is the primary key and ensures deduplication: identical content results in the same hash and is stored only once. replica_nodes holds an array of node identifiers that currently host this block, which supports high availability and latency-aware routing. ref_count tracks how many files reference this block—critical for efficient garbage collection when files are deleted. last_used_at helps with eviction policies or diagnosing cold storage patterns. This table abstracts the physical distribution layer and enables efficient, fault-tolerant retrieval of blocks at scale.

  • What is ref_count? Why can it be used in Garbage Collection?
    ref_count (reference count) is an integer counter stored for each block in the block_locations table. It tracks how many files currently reference that specific block.
  • Because we’re using content-addressed storage (e.g., each block is uniquely identified by its SHA256 hash), multiple files can share the same block if their content is identical. For example, if two files contain the same 64MB block at some position, that block is stored once and both files reference it.
  • Imagine 3 files:
    • /a/file1.txt → uses block hash abc123
    • /a/file2.txt → also uses block hash abc123
    • /b/duplicate.txt → also uses block hash abc123
  • Then in block_locations:

    block_hash = abc123
    replica_nodes = [node1, node3, node5]
    ref_count = 3
  • So ref_count = 3, because three different file entries point to this block.
  • This makes ref_count an efficient tracker for deletions: when a file is deleted, its blocks are only still needed if some other file references them, so the system decrements ref_count for each of the file's blocks (a minimal sketch follows this list).

    If ref_count drops to 0, the block is now orphaned and can be safely removed from the storage nodes.
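
A minimal sketch of this decrement-and-collect logic, with a plain dict standing in for the block_locations table:

    # block_hash -> {"replica_nodes": [...], "ref_count": int} (stand-in for block_locations)
    block_locations = {
        "abc123": {"replica_nodes": ["node1", "node3", "node5"], "ref_count": 3},
    }

    def on_file_deleted(block_hashes: list[str]) -> list[str]:
        """Decrement ref_count for each block of a deleted file; return hashes that became orphaned."""
        orphaned = []
        for h in block_hashes:
            entry = block_locations[h]
            entry["ref_count"] -= 1
            if entry["ref_count"] == 0:
                orphaned.append(h)  # GC workers remove these replicas after a grace period
        return orphaned

In the real system the decrement would be a transactional update on block_locations, and the replica deletion would happen asynchronously via the GC workers.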

Deep Dives

1: How can we reduce read latency at scale?

In large-scale distributed file systems managing trillions of immutable files and petabytes of data, read latency is one of the most critical performance bottlenecks. This is especially evident when users fetch large files composed of dozens of blocks—each stored across multiple nodes. A naive design struggles due to high fan-out reads, slow or uneven replica response times, and expensive block-location metadata lookups. Solving this requires a layered approach that blends concurrency, smart replica routing, and strategic caching.

When a client issues a getFile(path, offset, length) request, the system performs several steps: resolving file metadata into block hashes, mapping those hashes to replica nodes, retrieving the relevant blocks, and assembling the requested byte range. Each step adds latency, and the fan-out nature of reads—especially for large files—can make the tail latency unpredictable.

To reduce this, the 1st layer of defense is Parallel Block Fetch with Smart Replica Selection. Rather than fetching blocks serially or picking replicas arbitrarily, the system queries all necessary blocks concurrently and chooses the fastest replica for each block based on real-time node health, response history, or data locality. This significantly improves baseline performance, especially for cold reads, long-range reads, or mixed workloads with varying access patterns.
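
A sketch of this idea with asyncio is shown below: all blocks are requested concurrently, and for each block the replica with the best observed latency is tried first. The latency table and fetch_from_node coroutine are illustrative stand-ins, not part of the design above.

    import asyncio

    node_latency_ms = {"node7": 3.0, "node18": 9.0, "node25": 5.0}  # observed response history

    async def fetch_from_node(node: str, block_hash: str) -> bytes:
        await asyncio.sleep(node_latency_ms[node] / 1000)  # simulated network + disk time
        return b"<block bytes>"

    async def fetch_block(block_hash: str, replicas: list[str]) -> bytes:
        # Smart replica selection: try the historically fastest replica first, fail over in order.
        for node in sorted(replicas, key=lambda n: node_latency_ms.get(n, float("inf"))):
            try:
                return await fetch_from_node(node, block_hash)
            except Exception:
                continue  # unhealthy replica: fall back to the next one
        raise IOError(f"no replica could serve block {block_hash}")

    async def fetch_all(blocks: dict[str, list[str]]) -> dict[str, bytes]:
        # Parallel block fetch: every needed block is requested concurrently.
        results = await asyncio.gather(*(fetch_block(h, r) for h, r in blocks.items()))
        return dict(zip(blocks.keys(), results))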

For even faster performance on hot or predictable paths, the system adds a 2nd layer: Block-Level Caching. Here, the File Service node maintains a lightweight in-memory L1 cache for recently accessed blocks, while a larger SSD-based L2 cache (e.g., Redis or local disk) serves frequently read content like model checkpoints or web assets. Since blocks are immutable, the caching system avoids invalidation complexity and can follow a simple cache-aside or prewarming model, making it safe and highly effective.

However, both layers rely on one foundational requirement: ultra-fast block-location resolution. This is handled via a Distributed Key-Value Store, which maps block_hash → [replica nodes]. To ensure scalability and low latency, a horizontally partitioned store like Redis Cluster or FoundationDB is used, supporting batched multi-key lookups and millisecond-level latency even at trillion-scale block counts. The use of uniform SHA256 hashes ensures even distribution, reducing the risk of hotspots.

Each entry in this store is minimal yet expressive:

{ "block_hash": "e5a1b...", "replica_nodes": ["node7", "node18", "node25"], "ref_count": 3 }

This supports not only routing decisions but also enables downstream processes like garbage collection and access monitoring. Co-locating KV shards with File Service nodes and batching lookups further drives down tail latency.
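
For example, with a Redis-style store holding JSON values, the per-read block-location lookups can be collapsed into a single batched round trip. This is a sketch assuming redis-py and a locally reachable endpoint:

    import json
    import redis  # assumes the block-location KV store is Redis / Redis Cluster

    r = redis.Redis(host="localhost", port=6379)

    def lookup_replicas(block_hashes: list[str]) -> dict[str, list[str]]:
        # One batched MGET instead of N single-key round trips keeps tail latency low.
        raw = r.mget(block_hashes)
        return {
            h: json.loads(v)["replica_nodes"]
            for h, v in zip(block_hashes, raw)
            if v is not None
        }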

In summary, this layered approach—parallel fetches, smart caching, and efficient KV lookup—forms a robust foundation for low-latency reads. It gracefully scales with the number of files and blocks, adapts to varying access patterns, and ensures consistent performance even under massive user and data growth.

Here is the improvement reflected in our high-level design diagram:

2: Ensuring Strong Consistency in File and Directory Operations

Strong consistency is crucial for metadata operations—especially in a hierarchical file system where directory structure and file placement must remain accurate even under concurrent access. When users are creating, deleting, or listing files and directories, the system must guarantee correctness and avoid conflicts.

We focus our strong consistency model on the namespace metadata layer, where operations like putFile, createDirectory, and deleteDirectory occur. These operations interact with the core data tables: namespace_entries and file_metadata.

Why It Matters

Imagine two users trying to write to the same directory at the same time, or one user trying to delete a folder while another is uploading a file into it. Without a consistency mechanism, this leads to orphan files, ghost directories, or data corruption.

To handle this, we apply three clear strategies:

1. Use a Transactional Metadata Store

All directory and file information is stored in a centralized metadata database (like Postgres). This allows operations to:

  • Atomically check if a path exists
  • Insert or delete entries with isolation
  • Reject conflicting writes using unique constraints

Example:

  • createDirectory("/project/sub") checks if the parent exists and then inserts the new path—all in a single transaction.
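
A sketch of that single transaction against the namespace_entries table, assuming an open psycopg2 connection is passed in; the column set follows the schema described earlier:

    def create_directory(conn, path: str, parent_path: str) -> bool:
        """conn is an open psycopg2 connection; the whole check-and-insert runs in one transaction."""
        with conn:                         # commits on success, rolls back on any exception
            with conn.cursor() as cur:
                # Atomically check that the parent exists and is a directory...
                cur.execute(
                    "SELECT 1 FROM namespace_entries WHERE path = %s AND type = 'directory'",
                    (parent_path,),
                )
                if cur.fetchone() is None:
                    return False
                # ...then insert the child; the primary key on path rejects duplicates.
                cur.execute(
                    "INSERT INTO namespace_entries (path, name, parent_path, type) "
                    "VALUES (%s, %s, %s, 'directory') ON CONFLICT (path) DO NOTHING",
                    (path, path.rsplit('/', 1)[-1], parent_path),
                )
                return cur.rowcount == 1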

2. Apply Path-Level Locking on Mutations

Before modifying a directory (creating a new file, deleting a subfolder), the system acquires a lock on the parent path. This prevents race conditions between concurrent operations.

  • We can use database-level advisory locks or a distributed coordinator like Zookeeper.
  • This guarantees that only one operation mutates a given path at a time.

Example:

  • Before inserting file1.txt into /a/b/, a lock is acquired on /a/b/.
  • Other operations must wait or retry, avoiding overlapping updates.
  • Ensuring Safe Mutations via Locking
    Before allowing any mutation like putFile, createDirectory, or deleteDirectory, the system must lock the parent directory path to prevent race conditions, such as:
    • Two clients trying to create the same file at once
    • A file being created while its parent directory is being deleted
  • Option 1: Database-Level Locking (e.g., Advisory Locks in Postgres)
    In this approach, we use built-in DB-level advisory locks to coordinate access:
    SELECT pg_advisory_xact_lock(hashtext('/a/b'));
    This ensures only one transaction can modify /a/b at a time.
  • Advantages:
    • Requires no extra infrastructure
    • Lightweight and easy to adopt in Postgres
    • Locks can be acquired even if the directory doesn’t exist yet
  • Limitations:
    • The database becomes the central coordination point and may bottleneck under high concurrency
    • Difficult to scale across a sharded or replicated metadata DB
    • No built-in visibility or control over lock timeouts or stale holders
  • Option 2: Distributed Locking with Zookeeper (Recommended)
    Here, the system uses Zookeeper to acquire a distributed ephemeral lock per directory path before mutation:
    • File Service requests a lock on /locks/a/b
    • If the lock is granted, it proceeds with the operation
    • The lock is automatically released if the client crashes or times out
  • Why it fits our system better:
    • Zookeeper provides strong guarantees for lock ownership, fencing, and crash recovery
    • Works seamlessly across multiple nodes and service instances
    • Scales horizontally—no single bottleneck
    • More transparent and debuggable under high concurrency
    • Can support advanced features like fairness, lock queues, or timeouts
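
With the kazoo client library, the Zookeeper lock from Option 2 looks roughly like this; the Zookeeper hosts, lock-path layout, and mutate callback are illustrative:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def with_path_lock(directory_path: str, mutate):
        # One ephemeral lock znode per directory, e.g. /locks/a/b guards mutations under /a/b.
        lock = zk.Lock("/locks" + directory_path, identifier="file-service-1")
        with lock:            # blocks until acquired; auto-released if this client crashes
            return mutate()   # e.g. insert file1.txt under /a/b in the Metadata DB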

3. Commit Metadata Only After Successful Data Writes

During putFile, blocks are uploaded to the storage layer first. The file metadata and namespace entry are only written after all blocks are confirmed. This avoids cases where an entry appears in the directory but its data is missing.

This approach ensures that:

  • No partial files exist
  • Reads never encounter broken references

3: How do we ensure there is no data loss?

  • Durability Risks Across the Data Lifecycle
    Students often think durability is only about disk failures—but in practice, data loss is a risk at multiple points in the lifecycle:
  • 1. Before blocks are fully written
    • What if a client uploads a file but crashes midway?
    • What if we write some blocks but never update metadata?
  • 2. On storage nodes
    • What if one of the disks goes down?
    • What if a node silently corrupts a block?
  • 3. During metadata operations
    • What if a crash occurs after block upload but before metadata is committed?

Each of these failure points must be addressed explicitly and defensively, not just assumed away. To say we’ve truly “solved durability” in a distributed file system, the system must satisfy several foundational guarantees—each addressing a different type of data loss risk in large-scale environments.

First, consider the block storage layer—the foundation of everything else. Since our files are immutable and chunked into content-addressable blocks, the system can independently replicate each block across multiple storage nodes. The obvious question is: how many replicas, and when do we consider a block "safe"? A common solution is to use a replication factor of 3, with quorum acknowledgment (e.g., 2 of 3 nodes must confirm a successful write before the block is considered durable). This provides a balance: it tolerates the failure of one replica and still retains data, without waiting on the slowest node (which helps latency too). Some systems try erasure coding to reduce storage cost, but it complicates recovery logic and isn’t great for write-heavy or low-latency workloads.

  • Go with triple-replication + quorum write for simplicity, speed, and resilience—especially suitable in our single-region, PB-scale scenario.
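
A sketch of the quorum write path, where write_replica is a placeholder for the Storage Node upload call and returns True once that replica has persisted the block:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    REPLICATION_FACTOR = 3
    WRITE_QUORUM = 2  # a block counts as durable once 2 of its 3 replicas acknowledge

    def quorum_write(block_hash: str, data: bytes, nodes: list[str], write_replica) -> bool:
        pool = ThreadPoolExecutor(max_workers=REPLICATION_FACTOR)
        futures = [pool.submit(write_replica, node, block_hash, data) for node in nodes]
        acks, ok = 0, False
        for fut in as_completed(futures):   # count acknowledgements as they arrive
            try:
                if fut.result():
                    acks += 1
            except Exception:
                pass                        # a failed replica does not count toward the quorum
            if acks >= WRITE_QUORUM:
                ok = True                   # durable without waiting on the slowest replica
                break
        pool.shutdown(wait=False)           # let any straggler finish in the background
        return ok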

Next, let's talk about ordering and coordination between data and metadata. This is one of the most common sources of data loss or inconsistency if handled poorly. Suppose we write metadata first and crash before finishing block uploads. Now users see a file that doesn’t exist—or worse, a partial, unreadable one. To avoid this, our system writes all blocks first, and only when all are safely stored, it atomically commits metadata to the namespace (e.g., via a transaction to the namespace_entries and file_metadata tables). This sequencing ensures no file is visible until it’s fully present. We also treat block uploads as idempotent by keying them with content hashes. So, if a crash or retry happens, no duplicates are introduced.

  • Always commit metadata last. This atomicity between block storage and metadata registration is what allows the system to safely crash and recover without leaving garbage behind.

Now consider crash recovery and orphan cleanup. Despite our best efforts, uploads can still fail mid-way—maybe the client disconnected, or a replica was unreachable. For this, we support background garbage collection (GC): the system periodically scans for blocks that are not referenced by any file metadata (i.e., blocks with ref_count = 0), and deletes them after a grace period. This means even if blocks are uploaded but never registered, they won’t waste space forever. Optionally, we can also use a write-ahead log (WAL) inside the file service—logging intent before attempting uploads—so on crash recovery, the service knows what needs cleanup.

  • What is WAL?
    WAL (Write-Ahead Log) is a durability mechanism where all intended changes are first recorded in a log before applying them to the actual system state (e.g., database or metadata store). Think of it like a journal — before making a real change, you write a note about what you're going to change, so that if you crash in the middle, you can pick up where you left off.
  • In our file system, the File Service can maintain a WAL locally or in a lightweight replicated store. The log records operations like:
    • Client X wants to putFile(/a/b.txt) with blocks [hash1, hash2]
    • All blocks uploaded; ready to commit metadata for /a/b.txt
    • Delete directory /project123/tmp if empty
    Each entry contains enough context to retry or roll back the operation if the service crashes midway.
  • Why We Need WAL
    • Crash Recovery:
      Imagine a file upload is halfway through when the service crashes. Without WAL, it’s unclear whether:
      • The upload never started
      • Some blocks were stored
      • Metadata was partially written
      With WAL, we can replay the operation safely: either complete it or clean up.
    • Atomicity Across Multiple Systems:
      File systems involve both data (blocks) and metadata (namespace entries). WAL coordinates these:
      • First write the WAL entry
      • Then upload blocks
      • Then write metadata
      • Then mark the WAL entry as completed
    • Event Ordering & Replayability:
      In distributed systems, operations may arrive out of order or retry due to network glitches. WAL gives you a single, ordered log to reason about what happened and in what order.
    • Support for GC and Debugging:
      WAL entries can optionally be kept for a time window, offering visibility into recent changes and helping garbage collectors distinguish between active and orphaned states.
  • In our distributed file system, WAL is that draft log — the safety net that ensures no file is ever left half-born, no directory half-deleted, and no upload forgotten in the event of failure. It’s the unsung hero of high-availability systems.
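
A minimal sketch of such an intent log as an append-only JSON-lines file; the record shapes (INTENT/COMMIT) and the file name are illustrative:

    import json
    import os
    import time

    WAL_PATH = "file_service.wal"

    def wal_append(record: dict) -> None:
        record["ts"] = time.time()
        with open(WAL_PATH, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the intent must be on disk before we act on it

    # 1. Log intent before touching blocks or metadata.
    wal_append({"op": "INTENT", "type": "putFile", "path": "/a/b.txt", "blocks": ["hash1", "hash2"]})
    # 2. ...upload blocks, then commit metadata...
    # 3. Mark the operation complete so recovery knows nothing is pending.
    wal_append({"op": "COMMIT", "type": "putFile", "path": "/a/b.txt"})

    def pending_operations() -> list[dict]:
        """On crash recovery, return INTENT records that never reached COMMIT (to finish or roll back)."""
        intents, commits = {}, set()
        with open(WAL_PATH) as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "INTENT":
                    intents[rec["path"]] = rec
                elif rec["op"] == "COMMIT":
                    commits.add(rec["path"])
        return [rec for path, rec in intents.items() if path not in commits]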

Finally, to safeguard against long-term data degradation, like bit rot or disk failures, we enforce hash-based verification and re-replication. Every block is identified by a SHA256 hash, which is checked at write time and optionally revalidated on read or via periodic scrubbing. When a storage node fails permanently or gets marked as "unhealthy," the system consults the chunk manager to detect which blocks have lost replicas and re-replicates from surviving copies to restore the desired replication factor.
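
The scrub itself is straightforward when blocks are stored as files named by their SHA256 hash; here is a sketch of a storage-node scrub pass under that assumption:

    import hashlib
    import os

    def scrub(block_dir: str) -> list[str]:
        """Return hashes of blocks whose on-disk bytes no longer match their content hash."""
        corrupted = []
        for name in os.listdir(block_dir):
            with open(os.path.join(block_dir, name), "rb") as f:
                if hashlib.sha256(f.read()).hexdigest() != name:
                    corrupted.append(name)  # reported to the Chunk Manager for re-replication
        return corrupted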

4: How does the system support high scalability?

The answer is fairly straightforward and similar to most of our other interview write-ups: scale horizontally with stateless servers and storage nodes. To recap, let’s look at the full picture at a high level and see what we can do for each bottleneck in our current design:

  1. Metadata DB
     Bottleneck: too many entries; slow queries, write contention
     What we scale: shard by customer or path prefix
  2. Chunk Manager (KV)
     Bottleneck: trillions of blocks → too many key lookups
     What we scale: horizontal KV scaling by hash
  3. Storage Nodes
     Bottleneck: limited IOPS, disk capacity per node
     What we scale: add nodes + partition files across them
  4. API Layer
     Bottleneck: QPS overwhelms a single entry point
     What we scale: stateless replicas, horizontal scale
  5. Async GC & Background Jobs
     Bottleneck: lag with scale → write amplification
     What we scale: distribute GC workers across partitions

These can be answered easily if you have already reviewed our other content on ShowOffer. But what if the interviewer wants to go deeper into numerical estimation of the resource requirements? Compared to other questions, this one benefits from a more quantitative calculation. We will focus on the top three layers: the Metadata DB, the Chunk Manager, and the Storage Nodes, since they act as the core storage and logic units in our system design.

Metadata Storage: Namespace Entries & File Metadata

  • Assumptions
  • Each file or directory = 1 row in namespace_entries
  • Each file = 1 row in file_metadata (with 10 blocks on average)
  • Total entries = 10B files + 2B directories = 12B rows
  • Per-partition limit: 100M rows for good performance
  • Indexed read/write latency < 10ms per operation per shard

Estimation

  • 12B / 100M = ~120 DB shards (Postgres or CockroachDB)
  • Shard key = customer_id or prefix hash(/path)
  • Horizontally scaled with per-tenant ownership

Scale Strategy

Use logical sharding with load-aware router; add new shards over time

  • Scalable Sharding with Consistent Hashing
    To support high scalability as the number of files and directories grows into the tens of billions, we recommend using consistent hashing as the sharding strategy for the metadata layer.
  • Unlike traditional modulus-based sharding (e.g., hash(path) % N), consistent hashing allows the system to add or remove shards dynamically while only remapping a small portion of keys — greatly reducing the operational cost of rebalancing.
  • This approach provides several benefits critical to our system’s design:
    • Minimized Rebalancing Overhead: Only a fraction of metadata entries need to move when a new shard is added, avoiding full-table rewrites.
    • Predictable Latency: Since keys remain mostly stable, caches remain warm and shard hot-spots are avoided.
    • Operational Simplicity: No need to track per-entity routing tables; sharding is stateless and distributed across nodes.
    • Fault Tolerance: If a shard fails, adjacent shards can take over its hash space temporarily.
  • While a routing-table-based method (e.g., prefix → shard) may offer stronger isolation or per-tenant tuning, it introduces complexity in routing management and doesn’t scale well with millions of arbitrary path prefixes.
  • Thus, consistent hashing is the more scalable, low-maintenance solution — ideal for supporting a single-region, multi-customer system operating at petabyte scale with unpredictable file structure growth.
  • But how do we keep the cost of this migration as low as possible in practice?

In practice, minimizing the cost of metadata migration during shard expansion with consistent hashing involves using virtual nodes (vnodes) and lazy, on-demand rebalancing. Each physical shard hosts multiple virtual nodes in the hash ring, spreading load more evenly. When a new shard is added, it claims a subset of vnodes, and only the entries that fall into the new vnode ranges are affected. Instead of eagerly migrating all impacted rows, the system can adopt a read-triggered migration model: when a request arrives for a path now owned by the new shard, the data is fetched from the old shard, written to the new one, and lazily garbage-collected from the source. This incremental approach spreads migration cost over time, avoids write spikes, and keeps the system fully online with no downtime or blocking.
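
A compact sketch of a hash ring with virtual nodes; the shard names and vnode count are illustrative, and adding a shard only remaps the vnode ranges it claims:

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, shards: list[str], vnodes: int = 100):
            self.vnodes = vnodes
            self.ring: list[tuple[int, str]] = []  # (position, shard), kept sorted by position
            for shard in shards:
                self.add_shard(shard)

        def _pos(self, key: str) -> int:
            return int(hashlib.sha256(key.encode()).hexdigest(), 16)

        def add_shard(self, shard: str) -> None:
            # Each physical shard owns many small slices of the ring, which evens out load (vnodes).
            for i in range(self.vnodes):
                bisect.insort(self.ring, (self._pos(f"{shard}#vnode{i}"), shard))

        def shard_for(self, path: str) -> str:
            # A path is owned by the first vnode clockwise from its hash position.
            idx = bisect.bisect(self.ring, (self._pos(path),))
            return self.ring[idx % len(self.ring)][1]

    ring = ConsistentHashRing([f"metadata-shard-{i}" for i in range(120)])
    print(ring.shard_for("/user/data/file1.txt"))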

Chunk Manager (KV Store)

  • Assumptions
  • Each file has ~10 blocks → 10B files * 10 = 100B blocks
  • Each block_hash maps to a list of replicas (50–100 bytes)
  • Each key-value entry ≈ 200 bytes
  • Total size = 100B * 200 bytes = ~20 TB

Estimation

  • If each KV node stores ~500GB:
  • → 20TB / 500GB = ~40 KV partitions
  • Throughput = reads per file read (10 lookups/file)
  • → 100K QPS → 1M block lookups/sec
  • Each KV node can serve 50K QPS → needs ~20 active nodes

Scale Strategy

Shard by block_hash prefix; use high-QPS KV (e.g., Redis Cluster, RocksDB, FoundationDB)

Storage Nodes

  • Assumptions
  • Each file = 10 blocks @ 64MB → ~640MB average size
  • 10B files → ~6.4 Exabytes (EB) theoretical capacity
  • Assuming deduplication & real-world compression → ~200PB actual stored
  • Each node stores 100TB usable
  • Each block has 3 replicas for durability

Estimation

  • Raw = 200PB * 3 = 600PB total replicated storage
  • 600PB / 100TB = ~6,000 storage nodes
  • Each node serves 5K–10K IOPS
  • With 100K active reads/sec → 3K nodes needed just for bandwidth

Scale Strategy

  • Uniformly distribute blocks by hash
  • Route read to least-loaded replica
  • Scale nodes horizontally and monitor skew
  • What is “Skew” in Data Storage?
    In the context of sharding and consistent hashing, “skew” refers to uneven distribution of data or traffic across shards or nodes.
  • For example, even if you have 100 shards and billions of files, skew can still occur if:
    • Some shards end up with significantly more files or directories than others (data skew)
    • Certain shards receive more read/write traffic due to popular paths or hot directories (traffic skew)
  • This imbalance can lead to:
    • Hotspots: overloaded shards slow down operations
    • Wasted capacity: underutilized nodes while others hit CPU, disk, or memory limits
    • Increased tail latency: user-facing slowdowns caused by the slowest shard
  • To mitigate skew, systems often:
    • Use virtual nodes (vnodes): more granular hash ranges per physical shard allow better balancing
    • Periodically rebalance hot ranges across nodes
    • Monitor access patterns to identify and redistribute hot keys or hot prefixes
  • In short, skew is the enemy of scalability and performance in a distributed system — and good shard planning is how we fight it.

Final Thoughts

Designing a distributed hierarchical file system at PB scale challenges you to balance immutability and flexibility, enforce strong metadata consistency, and still deliver low-latency file access across hundreds of petabytes. By breaking down the problem into core workflows, building blocks, and deep dives—latency, consistency, durability, and scalability—you showcase not only architectural thinking but also your ability to reason through trade-offs under real-world constraints.

If you found this breakdown insightful but want to practice explaining your reasoning out loud, or simulate this interview under pressure—our coaching team at ShowOffer is here to help. From mock interviews tailored to companies like OpenAI and Meta, to 1:1 strategy reviews and system design white-boarding, we’ll help you bridge the gap from good to offer-ready.

Explore our expert-led mock system design rounds at ShowOffer.io — and get the offer you deserve.

Coach + Mock
Practice with a Senior+ engineer who just got an offer from your dream (FAANG) company.
Schedule Now