Functional Requirements
- Submit User Score: Allow users to submit a score using their user_id and associated score.
- Top-K Item Query: Extend the leaderboard to support top-K queries over other entities (e.g., top videos, top products).
- User Rank Lookup: Fetch the current rank and score of a given user_id within a leaderboard.
- Multi-Leaderboard and Time-Window Support: Support multiple leaderboards (e.g., global, regional, or by category) with the ability to query them across time intervals such as hourly, daily, or custom date ranges.
Non-Functional Requirements
- Low Latency
- Leaderboard updates should be reflected within 1 minute.
- Queries (e.g., top-K or rank lookup) must return within 100 milliseconds.
- High Scalability
- System should handle millions of updates per second.
- Must scale to support billions of users or items.
- Accuracy Requirements
- Leaderboard results should be exact, not approximate (unless otherwise specified as a tradeoff).
- Cost Efficiency
- Design should optimize for infrastructure cost, avoiding unnecessary compute or memory usage at scale.
High Level Design
FR1: Submit User Score
To support score submissions in a leaderboard system, the write path must:
- Accept incoming score data
- Update the appropriate leaderboard (possibly across shards or time windows)
- Maintain order for rank calculations

API Design
While submitting a user score may seem straightforward, there are several critical design considerations and discussion points that should not be overlooked.
Designing for these cases upfront prevents subtle bugs, improves system robustness, and aligns with principles like eventual consistency, fairness, and resilience under high load or failure conditions.
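Although the exact API is not specified here, a minimal sketch helps ground those considerations. The endpoint path, payload fields, and idempotency-key handling below are illustrative assumptions, not a prescribed contract:

```python
# Minimal sketch of a score-submission endpoint using FastAPI and redis-py.
# The path, payload fields, and idempotency handling are illustrative assumptions.
from fastapi import FastAPI, Header
from pydantic import BaseModel
import redis

app = FastAPI()
r = redis.Redis()

class ScoreSubmission(BaseModel):
    user_id: str
    leaderboard_id: str
    score: float

@app.post("/v1/scores")
def submit_score(body: ScoreSubmission, idempotency_key: str = Header(default=None)):
    # Deduplicate client retries: SET NX succeeds only for the first request
    # carrying a given Idempotency-Key header.
    if idempotency_key and not r.set(f"idem:{idempotency_key}", 1, nx=True, ex=3600):
        return {"status": "duplicate_ignored"}
    # Record the score; monotonic / highest-score-wins rules are covered in Deep Dive 3.
    r.zadd(f"leaderboard:{body.leaderboard_id}", {body.user_id: body.score})
    return {"status": "accepted"}
```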
FR2: Query Top K Videos

In large-scale systems, precomputing leaderboard data is a common technique to reduce read latency and improve scalability. It’s widely applicable across system design problems that require ranked metrics (e.g., top users, trending content, global scores). This module explores how to implement pre-computation using a database and batch/stream processing, along with the associated trade-offs. The main benefits of precomputation are:
- Low-latency reads for rank and top-K queries (avoid live computation)
- Reduced load on real-time stores like Redis
- Historical access to leaderboards over time (e.g., daily/weekly/monthly)
- Support for custom or low-traffic leaderboards that don’t justify always-on compute
Schema Design
A basic schema for storing precomputed rank data:
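Since the schema itself is not reproduced here, the following is one plausible layout, with assumed table and column names: one row per leaderboard × time window × entity, carrying the precomputed score and rank.

```python
# One plausible schema for precomputed ranks (table and column names are assumptions).
import sqlite3

conn = sqlite3.connect("leaderboard.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS leaderboard_ranks (
    leaderboard_id TEXT    NOT NULL,   -- e.g., 'global' or 'game_mode:us'
    time_window    TEXT    NOT NULL,   -- e.g., 'daily:2025-06-29'
    entity_id      TEXT    NOT NULL,   -- user_id, video_id, product_id, ...
    score          REAL    NOT NULL,
    rank_position  INTEGER NOT NULL,   -- precomputed rank within this window
    computed_at    TEXT    NOT NULL,   -- timestamp of the batch run that produced the row
    PRIMARY KEY (leaderboard_id, time_window, entity_id)
)
""")
conn.commit()
```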
Use tools like Apache Spark, Flink, or Airflow to periodically compute:
- The top-K users for each leaderboard × time window
- The rank of every active user
Precomputation frequency can be tuned (e.g., every 30 seconds, 1 minute, or 5 minutes) based on freshness requirements.
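A pared-down, single-process version of such a batch job (no Spark or Flink; the input format and function names are illustrative) could look like this:

```python
# Toy batch job: from raw (leaderboard_id, time_window, entity_id, score) rows,
# compute every entity's rank and the top-K per leaderboard x time window.
# A production job would express the same grouping and sorting in Spark or Flink.
from collections import defaultdict

def compute_ranks(rows, k=10):
    grouped = defaultdict(dict)
    for leaderboard_id, time_window, entity_id, score in rows:
        bucket = grouped[(leaderboard_id, time_window)]
        # Keep the highest score seen for each entity within the window.
        bucket[entity_id] = max(score, bucket.get(entity_id, float("-inf")))

    ranks, top_k = {}, {}
    for key, scores in grouped.items():
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        ranks[key] = {entity: i + 1 for i, (entity, _) in enumerate(ordered)}
        top_k[key] = ordered[:k]
    return ranks, top_k

ranks, top10 = compute_ranks([
    ("global", "daily:2025-06-29", "alice", 120),
    ("global", "daily:2025-06-29", "bob", 95),
    ("global", "daily:2025-06-29", "alice", 80),
])
# ranks[("global", "daily:2025-06-29")] == {"alice": 1, "bob": 2}
```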
In large-scale systems, precomputing leaderboard ranks and Top-K entries can drastically improve query latency and offload live computation. However, as a Senior or Staff Engineer, it's critical to understand the trade-offs that come with this optimization.
- Data Staleness
- Challenge: Precomputed rankings are only as fresh as the latest batch update.
- Mitigations:
- Reduce batch interval (e.g., from 5 minutes to 30 seconds).
- Overlay with a real-time write-through cache (e.g., Redis) for the latest updates.
- Infrastructure Complexity
- Challenge: Requires a pipeline stack—stream processors, batch jobs, metadata management, and failure recovery.
- Mitigations:
- Use managed ETL platforms (e.g., Google Dataflow, AWS Glue, Flink).
- Set up robust monitoring and alerting for job delays or failures.
- Storage Overhead
- Challenge: Every (leaderboard × time window × user) tuple consumes storage.
- Mitigations:
- Apply TTL policies to expire outdated data.
- Archive historical data to cold storage (e.g., S3, GCS).
- Compact and roll up older data (e.g., daily → weekly snapshots).
- Late or Out-of-Order Events
- Challenge: Some updates may arrive after their intended processing window.
- Mitigations:
- Leverage event-time watermarking in stream engines (e.g., Apache Flink).
- Run catch-up jobs for reprocessing affected windows.
- Requeue or flag late events for on-demand correction.
- Computational Cost at Scale
- Challenge: High cardinality (e.g., region × platform × game mode × time window) can become expensive to compute and store.
- Mitigations:
- Prioritize high-traffic or monetized leaderboards.
- Use approximate techniques (e.g., Count-Min Sketch) for low-impact or long-tail segments (to be covered in future lessons).
Optimization solutions can be found in Deep Dive 1: How to optimize top-K queries?
FR3: User Rank Lookup
To support user rank lookup in a leaderboard system, you need a fast and scalable way to determine a user’s current position relative to others.
Option 1: Real-time and Accurate Rank
A Redis ZSET stores (user_id, score) entries in sorted order. Use ZREVRANK leaderboard:<id> <user_id> to get the user's rank (higher scores map to lower rank indices, with 0 as the top) and ZSCORE leaderboard:<id> <user_id> to get the score.
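A minimal redis-py sketch of this option, with illustrative key and member names:

```python
# Minimal sketch of Option 1 with redis-py; key and member names are illustrative.
import redis

r = redis.Redis()
board = "leaderboard:global"

r.zadd(board, {"alice": 120, "bob": 95, "carol": 150})

rank = r.zrevrank(board, "alice")                  # 0-based; 0 is the highest score
score = r.zscore(board, "alice")
top3 = r.zrevrange(board, 0, 2, withscores=True)   # top-K comes from the same structure
print(rank, score, top3)
```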
Pros:
- Accurate and real-time
- Fast reads and writes (O(log N) insertion and rank lookup)
- Natively supports ranking, score updates, and top-K
Cons:
- In-memory only — high memory usage for large leaderboards
- Requires sharding if data grows (horizontal scaling adds complexity)
- Persistence depends on Redis RDB or AOF — not strongly durable by default
Option 2: Scalable reads with relaxed SLA
Periodically batch-compute user ranks (e.g., every minute) and store the results in a Redis Hash or a database for fast retrieval.
Pros:
- Constant-time lookups (O(1)) — ideal for high-read workloads
- Reduces Redis memory pressure; backend can batch compute with cheaper infra
- Flexible to offload to secondary storage (e.g., Postgres, DynamoDB)
Cons:
- May return stale data if ranks are recomputed periodically
- Requires a background job to update ranks
- Adds storage complexity (managing versioned caches, TTL, etc.)
Option 3: Count-Min Sketch
Use a Count-Min Sketch to count how often scores appear, and estimate a user’s rank by summing counts of users with higher scores.
Pros:
- Very memory-efficient — works well when leaderboard is massive (e.g., 100M+ users)
- Fast inserts and rank estimates
- Scales horizontally with minimal coordination
Cons:
- Results are approximate (bounded error)
- Cannot determine exact rank or position
- Doesn’t preserve full ordering — not suitable for deterministic user rank display
FR4: Multi-Leaderboard and Time-Window Support
Supporting multiple leaderboards across different categories and time windows adds a layer of complexity to a leaderboard system. It requires a flexible data model, scalable storage, and careful handling of both read and write paths.
- Multiple dimensions: Leaderboards by category (e.g., game mode, video genre), region (e.g., US, EU, JP), or user-defined groups (e.g., friends).
- Time-based windows: Hourly, daily, weekly, monthly, all-time, and custom ranges.
- Efficient access patterns for writing scores, retrieving top-K users, and querying user ranks.

Example Leaderboard Key Data Structure
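The exact key layout is not reproduced here, so the following is one plausible naming scheme (the segment order is an assumption, kept consistent with keys used later such as leaderboard:game_mode:us:2025-06-29):

```python
# One plausible key scheme; the segment order is an assumption, consistent with
# keys used later in this design (e.g., leaderboard:game_mode:us:2025-06-29).
def leaderboard_key(category: str, region: str, window: str, date: str) -> str:
    return f"leaderboard:{category}:{region}:{window}:{date}"

daily_us = leaderboard_key("game_mode", "us", "daily", "2025-06-29")
# -> "leaderboard:game_mode:us:daily:2025-06-29", typically backed by one Redis ZSET per key
```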
This approach ensures:
- Logical separation of leaderboards
- Easy partitioning and TTL control
- Flexibility to query and manage leaderboards individually
Normalize IDs and avoid overly dynamic keys to prevent unbounded growth in Redis or storage.
To support flexible queries, especially from a dashboard or analytics view, it helps to maintain a metadata layer that tracks available leaderboards:
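As one illustration, with assumed field names, each leaderboard can carry a metadata record such as:

```python
# Illustrative metadata record for one leaderboard (field names are assumptions);
# it could live in a Redis hash or a small relational table.
leaderboard_metadata = {
    "leaderboard_id": "game_mode:us:daily:2025-06-29",
    "category": "game_mode",
    "region": "us",
    "window": "daily",
    "starts_at": "2025-06-29T00:00:00Z",
    "ends_at": "2025-06-30T00:00:00Z",
    "status": "active",           # active | archived | purged
    "backing_store": "redis",     # redis | precomputed_db | cold_storage
}
```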
This allows:
- UI listing of active leaderboards
- Admin tools to purge or archive old ones
- Filtering, discovery, and auditing
Summary
This design addresses all four functional requirements in a modular and extensible manner:
- FR1: Submit User Score
- FR2: Top-K Item Query
- FR3: User Rank Lookup
- FR4: Multi-Leaderboard and Time-Window Support
The upcoming deep dives will explore how to scale the system and handle real-world edge cases.
Deep Dive
Deep Dive 1: How to optimize top-k queries?
When designing a system to get the top K items (like trending videos or top players), you need to balance accuracy, speed, and scalability.
There are three common strategies:
- Distributed Heaps: Precise but resource-heavy.
- Skip Lists: Efficient and sorted structure, great for rankings.
- Count-Min Sketch: Fast and memory-efficient, but gives rough estimates.
Let’s first look at distributed heaps for exact ranking, then compare them with the other approaches.
Distributed Heap
Hash table + Single host
Assume all data fits into the memory of a single server. Maintaining the top K with a heap has a time complexity of O(N log K), where N is the number of unique videos and K is the number of videos to return (a minimal sketch follows the list below). This solution has the following drawbacks:
- Scalability Issue: A single host can become a CPU bottleneck as the rate of incoming elements increases.
- Improvement: Introduce a load balancer to distribute the load across multiple servers.
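A minimal single-host sketch of the hash-table-plus-heap approach (pure Python, for illustration only):

```python
# Single-host sketch: count views in a hash table, then keep the top K with a
# size-K min-heap, which is the O(N log K) step described above.
import heapq
from collections import Counter

def top_k_videos(view_events, k):
    counts = Counter(view_events)        # hash table: video_id -> view count
    # nlargest maintains a heap of size k over the N unique videos.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

print(top_k_videos(["v1", "v2", "v1", "v3", "v1", "v2"], k=2))
# [('v1', 3), ('v2', 2)]
```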
Hash Table + Multiple Hosts + Load-Balancing
Compared with a single host, processing is distributed across servers to increase throughput. However, each server is still limited by its own memory, requiring further scalability improvements (e.g., sharding or distributed storage).
Hash Table + Multiple Hosts + Partitioning
Each processor hosts a subset of videos and maintains a list of its top K heavy hitters. The final list is created by merging these sorted lists using an n-way merge algorithm (a merge sketch follows the list of complexities below). This approach introduces the following complexities:
1. Data Replication:
- Ensures data redundancy to prevent loss if a processor fails.
- Common practice: Replication factor of 3 (primary host + 2 backups).
- Challenge: Synchronizing data during failures or updates.
2. Rebalancing:
- Node Addition: Metadata updates to redistribute data among nodes.
- Node Removal: Metadata updates to reassign data from the removed node.
- Risks: Network partitions, insufficient memory/disk, or metadata inconsistencies during transitions.
3. Hot Partitioning:
- Access patterns (e.g., viral videos, live events) cause uneven loads, slowing the system.
- Requires load-balancing mechanisms to mitigate bottlenecks.
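A small sketch of the n-way merge step, assuming each video is owned by exactly one partition so that per-shard top-K lists can be merged into an exact global top-K:

```python
# n-way merge sketch: each partition returns its own top-K list sorted by score
# descending; heapq.merge combines them and we keep the first K overall.
import heapq
from itertools import islice

def merge_top_k(partition_lists, k):
    merged = heapq.merge(*partition_lists, key=lambda kv: kv[1], reverse=True)
    return list(islice(merged, k))

shard_a = [("v1", 900), ("v7", 640), ("v3", 120)]
shard_b = [("v9", 800), ("v2", 500)]
print(merge_top_k([shard_a, shard_b], k=3))
# [('v1', 900), ('v9', 800), ('v7', 640)]
```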
Skip List
Skip List is a probabilistic data structure that maintains a sorted list of elements and allows fast search, insertion, and deletion operations—on average in O(log N) time. It offers an efficient alternative to balanced trees and is widely used in distributed systems like Redis and LevelDB due to its simplicity and performance.
Core Idea:
A Skip List consists of multiple levels of linked lists:
- The base level (Level 1) contains all elements in sorted order.
- Each higher level acts as an “express lane,” skipping over elements, thus speeding up access.
- Levels are created randomly, giving a balance between performance and space.

Advantages in Distributed Systems:
- Sorted Indexing: Great for maintaining time-ordered or score-based streams (e.g., trending videos, leaderboard scores).
- Concurrent Access: Easier to implement concurrent versions compared to balanced trees.
- Dynamic Range Queries: Efficiently handles range-based queries like “top K” or “between score A and B”.
- Sharding-Friendly: Segments of a skip list can be distributed across nodes with consistent hashing.
Insertion
To insert a value in a skip list, we first locate the position by scanning forward from the highest level down to level 0. At each level, we move forward as long as the next node’s value is less than the target. This builds a path of predecessors that we’ll later use to update pointers.
We then assign the new node a random level—higher levels are less likely, creating a layered shortcut structure. The new node is linked into all levels up to its assigned height by updating the forward pointers of the predecessor nodes.
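A compact Python sketch of the insertion procedure described above (illustrative only; Redis's actual skip-list implementation differs in detail):

```python
# Compact skip-list insertion sketch (illustrative; parameters are arbitrary).
import random

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * (level + 1)   # one forward pointer per level

class SkipList:
    MAX_LEVEL = 16
    P = 0.5   # probability of promoting a node to the next level

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, value):
        update = [None] * (self.MAX_LEVEL + 1)
        node = self.head
        # Scan forward from the highest active level down to level 0,
        # remembering the predecessor at each level.
        for lvl in range(self.level, -1, -1):
            while node.forward[lvl] and node.forward[lvl].value < value:
                node = node.forward[lvl]
            update[lvl] = node
        new_level = self._random_level()
        # If the new node is taller than the current list, grow the list.
        if new_level > self.level:
            for lvl in range(self.level + 1, new_level + 1):
                update[lvl] = self.head
            self.level = new_level
        new_node = Node(value, new_level)
        # Splice the new node into every level up to its height.
        for lvl in range(new_level + 1):
            new_node.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new_node
```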
Rebalancing (Implicit)
Unlike trees, skip lists don’t perform explicit rebalancing. Instead, balance is probabilistic. The random level assignment ensures that higher levels are sparser. Over time, this keeps the list roughly balanced without complex rotations or height adjustments. In practice, the structure stays efficient with O(log n) search and insert time.
If a new node is taller than the current maximum height, the skip list grows by adding new empty levels and adjusting the head pointer. This is rare but necessary to maintain performance as the list grows.
Challenges of insertion and rebalancing
- Random Level Variance: Occasionally, too many or too few high-level nodes can skew performance. Though rare, this randomness means worst-case time can degrade to O(n).
- Pointer Management: Updating forward pointers correctly across multiple levels is error-prone, especially when insertions or deletions affect high levels.
- Debugging: Because structure depends on random decisions, reproducing bugs or ensuring deterministic behavior in testing can be difficult.
Skip lists offer a good trade-off between simplicity and performance, but they require careful handling of randomness and pointer updates to avoid subtle bugs.
Count-Min Sketch
Count-Min Sketch is a probabilistic data structure that efficiently summarizes and queries large data streams, especially for tracking element frequency counts. It's designed to handle massive datasets where explicitly tracking every element would be impractical due to memory limitations.
This structure uses a 2D array (matrix) of counters and multiple hash functions to map stream elements to indices in each matrix row:
- Hash Functions: Multiple independent hash functions map an element to specific indices in different rows of the matrix.
- Increment Counters: For each incoming element, the counters at the hashed indices in each row are incremented.
- Estimate Count: To estimate an element's frequency, the minimum counter value across all rows for that element's hashed indices is returned.
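A minimal Count-Min Sketch in Python following that description; the width, depth, and hash construction are illustrative choices:

```python
# Minimal Count-Min Sketch (illustrative width/depth; hashes derived from salted MD5).
import hashlib

class CountMinSketch:
    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indices(self, item):
        # One index per row, using a salted hash as an independent hash function.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._indices(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum across rows bounds the overestimate caused by collisions.
        return min(self.table[row][col] for row, col in self._indices(item))

cms = CountMinSketch()
cms.add("video_42", 3)
print(cms.estimate("video_42"))   # >= 3; may overestimate, never underestimates
```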
Summary
Hash Table + Multiple Hosts + Partitioning
Best For:
- Systems requiring exact frequency counts (e.g., billing, quotas, precise analytics).
- Use cases where accuracy cannot be compromised, such as fraud detection thresholds or regulated reporting.
Skip List
Best For:
- Applications needing sorted order, range queries, or top-K rankings.
- Systems requiring fast updates and lookups with concurrent access.
- Scenarios where accurate data ordering is more valuable than full scan speed.
Count-Min Sketch
Best For:
- Approximate counting in large-scale streaming applications.
- Scenarios where memory efficiency and speed are more important than perfect accuracy.
Deep Dive 2: How does this design scale to billions of rows per day?
To scale a leaderboard system to billions of rows per day, the architecture needs improvements along several dimensions: 1) write throughput, 2) storage efficiency, 3) query performance, and 4) horizontal scalability. Many of the techniques below are standard methods for scaling services:
Write Path Scalability
Ingesting billions of score updates per day means millions of writes per second during peak load.
- Kafka-based ingestion layer: Decouple submission from processing using a message queue (e.g., Kafka, Pub/Sub)
- Batch updates: Aggregate score updates in micro-batches (e.g., 1-second or 5-second buffers) before writing to Redis or DB
- Async workers: Distribute leaderboard update logic across horizontally scaled consumers
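A simplified micro-batching worker for this write path might look like the following; the buffer interval, the keep-the-max aggregation policy, and the poll_updates placeholder are assumptions:

```python
# Simplified micro-batching worker: buffer score updates for ~1 second, keep only
# the highest score per (leaderboard, user) in the buffer, then flush in one pipeline.
# poll_updates() stands in for a Kafka/PubSub consumer and is a placeholder.
import time
import redis

r = redis.Redis()

def poll_updates():
    # Placeholder: would return a batch of (leaderboard_id, user_id, score) events.
    return []

def run_writer(buffer_seconds=1.0):
    while True:
        deadline = time.time() + buffer_seconds
        buffered = {}
        while time.time() < deadline:
            for leaderboard_id, user_id, score in poll_updates():
                key = (leaderboard_id, user_id)
                buffered[key] = max(score, buffered.get(key, float("-inf")))
            time.sleep(0.05)
        if buffered:
            pipe = r.pipeline(transaction=False)
            for (leaderboard_id, user_id), score in buffered.items():
                # GT keeps the member's existing score if it is higher (Redis 6.2+).
                pipe.zadd(f"leaderboard:{leaderboard_id}", {user_id: score}, gt=True)
            pipe.execute()
```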
Leaderboard Sharding
Single leaderboard (e.g., global all-time) can become a bottleneck.
- Shard leaderboards by region, category, or user hash (e.g., leaderboard:global:shard_1, leaderboard:game_mode:us:2025-06-29)
- Use consistent hashing or range-based partitioning
- Use separate Redis instances per shard or Redis Cluster
Efficient Storage Management
Storing high-cardinality score records and historical windows becomes expensive.
- Use Redis ZSET for active windows (e.g., hourly/daily), then archive
- Periodically compact old windows into aggregate buckets (e.g., weekly, monthly)
- Store historical data in cost-efficient storage (e.g., BigQuery, S3, ClickHouse)
Pre-aggregation and Caching
Frequent rank/top-K queries on large leaderboards become slow and expensive.
- Precompute and cache top-K results every few seconds
- Cache frequently accessed ranks (e.g., popular users) in Redis Hashes
- Use a tiered storage model:
- Hot: Redis (ZSET)
- Warm: Precomputed rank cache
- Cold: Historical DB
Time Window Isolation
Leaderboard write and read operations can interfere across time windows or overload shared system resources if not properly isolated. To address this, use separate keys per time window (e.g., leaderboard:game:daily:2025-06-29) to ensure clean separation of data. Apply TTL policies or background cleanup jobs to evict stale data and minimize memory usage. For rarely accessed or custom time ranges, support lazy materialization, generating leaderboard data on demand to avoid unnecessary precomputation and storage overhead.
Horizontal Scaling & Service Decomposition
Break the system into independent services—ScoreWriterService, RankQueryService, and AggregationService—each deployed separately with autoscaling based on load, and use distributed task queues to manage and balance batch processing jobs efficiently.

Deep Dive 3: What consistency guarantees do we need? (e.g., race conditions during late score writes)
Write Consistency
Write consistency ensures that score updates are correctly reflected in the leaderboard. Race conditions can occur when a user submits multiple scores in rapid succession (e.g., concurrent or late-arriving updates due to network delays). Typical scenarios include:
- Out-of-order writes: A lower score arrives after a higher one
- Concurrent writes: Multiple clients or devices submit at the same time
- Delayed score delivery: Events arrive late (e.g., Kafka lag, mobile offline buffering)
Here are several standard ways to improve write consistency (a minimal sketch follows this list):
- Enforce monotonic updates: Reject or ignore updates that are lower than the latest score
- Use timestamps or sequence numbers to determine whether a score is stale
- Apply last-write-wins (LWW) or highest-score-wins logic depending on the business rule
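One way to make the sequence-number check atomic is a small Lua script executed inside Redis. The key names below are illustrative, and this variant implements last-write-wins; comparing scores instead of sequence numbers would give highest-score-wins:

```python
# Hedged sketch: reject stale or out-of-order updates using a per-user sequence
# number (or event timestamp), applied atomically via a Lua script in Redis.
import redis

r = redis.Redis()

LWW_UPDATE = r.register_script("""
local seq_key, board_key = KEYS[1], KEYS[2]
local user, score, seq = ARGV[1], ARGV[2], tonumber(ARGV[3])
local last = tonumber(redis.call('HGET', seq_key, user))
if last and last >= seq then
  return 0  -- stale or out-of-order update: ignore it
end
redis.call('HSET', seq_key, user, seq)
redis.call('ZADD', board_key, score, user)
return 1
""")

def apply_update(leaderboard_id, user_id, score, seq):
    # seq can be an event timestamp or a per-user sequence number.
    return LWW_UPDATE(
        keys=[f"seq:{leaderboard_id}", f"leaderboard:{leaderboard_id}"],
        args=[user_id, score, seq],
    )
```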
Read Consistency
Guarantees that a read reflects the most recent or correct state with two options below:
- Strong consistency: Every read returns the most up-to-date score/rank
- Expensive, hard in distributed systems
- Eventual consistency: Reads may be slightly stale but will converge
- Acceptable for many use cases like analytics, daily leaderboards
Consistency can be tuned based on the use case:
- Use write-through caching for real-time results
- Cache precomputed ranks with a short TTL (e.g., refresh every 30s)
- Indicate staleness in the API (e.g., "updated_at": "2025-06-29T14:32:00Z")
Cross-Shard or Aggregation Consistency
A user's score may be updated in one shard but not yet reflected in a global merged leaderboard, so the problem becomes how to ensure correct behavior when leaderboards are partitioned by region, category, or time.
- Delay reads until partial aggregates are ready
- Use bounded staleness windows for consistency guarantees (e.g., "within 1 min")
- Employ read repair on cache miss (pull from all partitions and re-merge)
Designing a leaderboard involves balancing accuracy, performance, and complexity. For real-time systems with strict accuracy requirements, Skip Lists and precomputed ranks with Redis caching are preferred. Approximate methods like Count-Min Sketch are useful for long-tail or analytics use cases. Scalable architectures combine ingestion, precomputation, caching, and service decomposition to support billions of updates per day.