Functional Requirements
- Submit User Score: Allow users to submit a score using their user_id and associated score.
- Top-K Item Query: Extend the leaderboard to support top-K queries over other entities (e.g., top videos, top products).
- User Rank Lookup: Fetch the current rank and score of a given user_id within a leaderboard.
- Multi-Leaderboard and Time-Window Support: Support multiple leaderboards (e.g., global, regional, or by category) with the ability to query them across time intervals such as hourly, daily, or custom date ranges.
Non-Functional Requirements
- Low Latency
- Leaderboard updates should be reflected within 1 minute.
- Queries (e.g., top-K or rank lookup) must return within 100 milliseconds.
- High Scalability
- System should handle millions of updates per second.
- Must scale to support billions of users or items.
- Accuracy Requirements
- Leaderboard results should be exact, not approximate (unless otherwise specified as a tradeoff).
- Cost Efficiency
- Design should optimize for infrastructure cost, avoiding unnecessary compute or memory usage at scale.
High Level Design
FR1: Submit User Score
To support score submissions in a leaderboard system, the write path must:
- Accept incoming score data
- Update the appropriate leaderboard (possibly across shards or time windows)
- Maintain order for rank calculations

API Design
While submitting a user score may seem straightforward, there are several critical design considerations and discussion points that should not be overlooked.
Designing for these cases upfront prevents subtle bugs, improves system robustness, and aligns with principles like eventual consistency, fairness, and resilience under high load or failure conditions.
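Although the exact API is not specified here, a minimal sketch helps ground those considerations. The endpoint path, payload fields, and idempotency-key handling below are illustrative assumptions, not a prescribed contract:

```python
# Minimal sketch of a score-submission endpoint using FastAPI and redis-py.
# The path, payload fields, and idempotency handling are illustrative assumptions.
from fastapi import FastAPI, Header
from pydantic import BaseModel
import redis

app = FastAPI()
r = redis.Redis()

class ScoreSubmission(BaseModel):
    user_id: str
    leaderboard_id: str
    score: float

@app.post("/v1/scores")
def submit_score(body: ScoreSubmission, idempotency_key: str = Header(default=None)):
    # Deduplicate client retries: SET NX succeeds only for the first request
    # carrying a given Idempotency-Key header.
    if idempotency_key and not r.set(f"idem:{idempotency_key}", 1, nx=True, ex=3600):
        return {"status": "duplicate_ignored"}
    # Record the score; monotonic / highest-score-wins rules are covered in Deep Dive 3.
    r.zadd(f"leaderboard:{body.leaderboard_id}", {body.user_id: body.score})
    return {"status": "accepted"}
```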
FR2: Query Top K Videos

In large-scale systems, precomputing leaderboard data is a common technique to reduce read latency and improve scalability. It’s widely applicable across system design problems that require ranked metrics (e.g., top users, trending content, global scores). This module explores how to implement pre-computation using a database and batch/stream processing, along with the associated trade-offs. The main benefits of precomputation are:
- Low-latency reads for rank and top-K queries (avoid live computation)
- Reduced load on real-time stores like Redis
- Historical access to leaderboards over time (e.g., daily/weekly/monthly)
- Support for custom or low-traffic leaderboards that don’t justify always-on compute
Schema Design
A basic schema for storing precomputed rank data:
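Since the schema itself is not reproduced here, the following is one plausible layout, with assumed table and column names: one row per leaderboard × time window × entity, carrying the precomputed score and rank.

```python
# One plausible schema for precomputed ranks (table and column names are assumptions).
import sqlite3

conn = sqlite3.connect("leaderboard.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS leaderboard_ranks (
    leaderboard_id TEXT    NOT NULL,   -- e.g., 'global' or 'game_mode:us'
    time_window    TEXT    NOT NULL,   -- e.g., 'daily:2025-06-29'
    entity_id      TEXT    NOT NULL,   -- user_id, video_id, product_id, ...
    score          REAL    NOT NULL,
    rank_position  INTEGER NOT NULL,   -- precomputed rank within this window
    computed_at    TEXT    NOT NULL,   -- timestamp of the batch run that produced the row
    PRIMARY KEY (leaderboard_id, time_window, entity_id)
)
""")
conn.commit()
```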
Use tools like Apache Spark, Flink, or Airflow to periodically compute:
- The top-K users for each leaderboard × time window
- The rank of every active user
Precomputation frequency can be tuned (e.g., every 30 seconds, 1 minute, or 5 minutes) based on freshness requirements.
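A pared-down, single-process version of such a batch job (no Spark or Flink; the input format and function names are illustrative) could look like this:

```python
# Toy batch job: from raw (leaderboard_id, time_window, entity_id, score) rows,
# compute every entity's rank and the top-K per leaderboard x time window.
# A production job would express the same grouping and sorting in Spark or Flink.
from collections import defaultdict

def compute_ranks(rows, k=10):
    grouped = defaultdict(dict)
    for leaderboard_id, time_window, entity_id, score in rows:
        bucket = grouped[(leaderboard_id, time_window)]
        # Keep the highest score seen for each entity within the window.
        bucket[entity_id] = max(score, bucket.get(entity_id, float("-inf")))

    ranks, top_k = {}, {}
    for key, scores in grouped.items():
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        ranks[key] = {entity: i + 1 for i, (entity, _) in enumerate(ordered)}
        top_k[key] = ordered[:k]
    return ranks, top_k

ranks, top10 = compute_ranks([
    ("global", "daily:2025-06-29", "alice", 120),
    ("global", "daily:2025-06-29", "bob", 95),
    ("global", "daily:2025-06-29", "alice", 80),
])
# ranks[("global", "daily:2025-06-29")] == {"alice": 1, "bob": 2}
```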
In large-scale systems, precomputing leaderboard ranks and Top-K entries can drastically improve query latency and offload live computation. However, as a Senior or Staff Engineer, it's critical to understand the trade-offs that come with this optimization.
- Data Staleness
- Challenge: Precomputed rankings are only as fresh as the latest batch update.
- Mitigations:
- Reduce batch interval (e.g., from 5 minutes to 30 seconds).
- Overlay with a real-time write-through cache (e.g., Redis) for the latest updates.
- Infrastructure Complexity
- Challenge: Requires a pipeline stack—stream processors, batch jobs, metadata management, and failure recovery.
- Mitigations:
- Use managed ETL platforms (e.g., Google Dataflow, AWS Glue, Flink).
- Set up robust monitoring and alerting for job delays or failures.
- Storage Overhead
- Challenge: Every (leaderboard × time window × user) tuple consumes storage.
- Mitigations:
- Apply TTL policies to expire outdated data.
- Archive historical data to cold storage (e.g., S3, GCS).
- Compact and roll up older data (e.g., daily → weekly snapshots).
- Late or Out-of-Order Events
- Challenge: Some updates may arrive after their intended processing window.
- Mitigations:
- Leverage event-time watermarking in stream engines (e.g., Apache Flink).
- Run catch-up jobs for reprocessing affected windows.
- Requeue or flag late events for on-demand correction.
- Computational Cost at Scale
- Challenge: High cardinality (e.g., region × platform × game mode × time window) can become expensive to compute and store.
- Mitigations:
- Prioritize high-traffic or monetized leaderboards.
- Use approximate techniques (e.g., Count-Min Sketch) for low-impact or long-tail segments (to be covered in future lessons).
Optimization solutions can be found in Deep Dive 1: How to optimize top-K queries?
FR3: User Rank Lookup
To support user rank lookup in a leaderboard system, you need a fast and scalable way to determine a user’s current position relative to others.
Option 1: Real-time and Accurate Rank
A Redis ZSET stores (user_id, score) entries in sorted order. Use ZREVRANK leaderboard:<id> <user_id> to get the user's rank (higher scores map to lower rank indices, with 0 as the top) and ZSCORE leaderboard:<id> <user_id> to get the score.
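A minimal redis-py sketch of this option, with illustrative key and member names:

```python
# Minimal sketch of Option 1 with redis-py; key and member names are illustrative.
import redis

r = redis.Redis()
board = "leaderboard:global"

r.zadd(board, {"alice": 120, "bob": 95, "carol": 150})

rank = r.zrevrank(board, "alice")                  # 0-based; 0 is the highest score
score = r.zscore(board, "alice")
top3 = r.zrevrange(board, 0, 2, withscores=True)   # top-K comes from the same structure
print(rank, score, top3)
```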
Pros:
- Accurate and real-time
- Fast reads and writes (O(log N) insertion and rank lookup)
- Natively supports ranking, score updates, and top-K
Cons:
- In-memory only — high memory usage for large leaderboards
- Requires sharding if data grows (horizontal scaling adds complexity)
- Persistence depends on Redis RDB or AOF — not strongly durable by default
Option 2: Scalable reads with relaxed SLA
Periodically batch-compute user ranks (e.g., every minute) and store the results in a Redis Hash or a database for fast retrieval.
Pros:
- Constant-time lookups (O(1)) — ideal for high-read workloads
- Reduces Redis memory pressure; backend can batch compute with cheaper infra
- Flexible to offload to secondary storage (e.g., Postgres, DynamoDB)
Cons:
- May return stale data if ranks are recomputed periodically
- Requires a background job to update ranks
- Adds storage complexity (managing versioned caches, TTL, etc.)
Option 3: Count-Min Sketch
Use a Count-Min Sketch to count how often scores appear, and estimate a user’s rank by summing counts of users with higher scores.
Pros:
- Very memory-efficient — works well when leaderboard is massive (e.g., 100M+ users)
- Fast inserts and rank estimates
- Scales horizontally with minimal coordination
Cons:
- Results are approximate (bounded error)
- Cannot determine exact rank or position
- Doesn’t preserve full ordering — not suitable for deterministic user rank display
FR4: Multi-Leaderboard and Time-Window Support
Supporting multiple leaderboards across different categories and time windows adds a layer of complexity to a leaderboard system. It requires a flexible data model, scalable storage, and careful handling of both read and write paths.
- Multiple dimensions: Leaderboards by category (e.g., game mode, video genre), region (e.g., US, EU, JP), or user-defined groups (e.g., friends).
- Time-based windows: Hourly, daily, weekly, monthly, all-time, and custom ranges.
- Efficient access patterns for writing scores, retrieving top-K users, and querying user ranks.

Example Leaderboard Key Data Structure
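The exact key layout is not reproduced here, so the following is one plausible naming scheme (the segment order is an assumption, kept consistent with keys used later such as leaderboard:game_mode:us:2025-06-29):

```python
# One plausible key scheme; the segment order is an assumption, consistent with
# keys used later in this design (e.g., leaderboard:game_mode:us:2025-06-29).
def leaderboard_key(category: str, region: str, window: str, date: str) -> str:
    return f"leaderboard:{category}:{region}:{window}:{date}"

daily_us = leaderboard_key("game_mode", "us", "daily", "2025-06-29")
# -> "leaderboard:game_mode:us:daily:2025-06-29", typically backed by one Redis ZSET per key
```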
This approach ensures:
- Logical separation of leaderboards
- Easy partitioning and TTL control
- Flexibility to query and manage leaderboards individually
Normalize IDs and avoid overly dynamic keys to prevent unbounded growth in Redis or storage.
To support flexible queries, especially from a dashboard or analytics view, it helps to maintain a metadata layer that tracks available leaderboards:
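As one illustration, with assumed field names, each leaderboard can carry a metadata record such as:

```python
# Illustrative metadata record for one leaderboard (field names are assumptions);
# it could live in a Redis hash or a small relational table.
leaderboard_metadata = {
    "leaderboard_id": "game_mode:us:daily:2025-06-29",
    "category": "game_mode",
    "region": "us",
    "window": "daily",
    "starts_at": "2025-06-29T00:00:00Z",
    "ends_at": "2025-06-30T00:00:00Z",
    "status": "active",           # active | archived | purged
    "backing_store": "redis",     # redis | precomputed_db | cold_storage
}
```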
This allows:
- UI listing of active leaderboards
- Admin tools to purge or archive old ones
- Filtering, discovery, and auditing
Summary
This design addresses all four functional requirements in a modular and extensible manner:
- FR1: Submit User Score
- FR2: Top-K Item Query
- FR3: User Rank Lookup
- FR4: Multi-Leaderboard and Time-Window Support
The upcoming deep dives will explore how to scale the system and handle real-world edge cases.
Deep Dive
Deep Dive 1: How to optimize top-k queries?
When designing a system to get the top K items (like trending videos or top players), you need to balance accuracy, speed, and scalability.
There are three common strategies:
- Distributed Heaps: Precise but resource-heavy.
- Skip Lists: Efficient and sorted structure, great for rankings.
- Count-Min Sketch: Fast and memory-efficient, but gives rough estimates.
Let’s first look at distributed heaps for exact ranking, then compare them with the other approaches.
Distributed Heap
Hash table + Single host
Assume all data fits into the memory of a single server. Maintaining the top K with a heap has a time complexity of O(N log K), where N is the number of unique videos and K is the number of videos to return (a minimal sketch follows the list below). This solution has the following drawbacks:
- Scalability Issue: A single host can become a CPU bottleneck as the rate of incoming elements increases.
- Improvement: Introduce a load balancer to distribute the load across multiple servers.
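A minimal single-host sketch of the hash-table-plus-heap approach (pure Python, for illustration only):

```python
# Single-host sketch: count views in a hash table, then keep the top K with a
# size-K min-heap, which is the O(N log K) step described above.
import heapq
from collections import Counter

def top_k_videos(view_events, k):
    counts = Counter(view_events)        # hash table: video_id -> view count
    # nlargest maintains a heap of size k over the N unique videos.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

print(top_k_videos(["v1", "v2", "v1", "v3", "v1", "v2"], k=2))
# [('v1', 3), ('v2', 2)]
```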
Hash Table + Multiple Hosts + Load-Balancing
Compared with a single host, processing is distributed across servers to increase throughput. However, each server is still limited by its own memory, requiring further scalability improvements (e.g., sharding or distributed storage).
Hash Table + Multiple Hosts + Partitioning
Each processor hosts a subset of videos and maintains a list of its top K heavy hitters. The final list is created by merging these sorted lists using an n-way merge algorithm (a merge sketch follows the list of complexities below). This approach introduces the following complexities:
1. Data Replication:
- Ensures data redundancy to prevent loss if a processor fails.
- Common practice: Replication factor of 3 (primary host + 2 backups).
- Challenge: Synchronizing data during failures or updates.
2. Rebalancing:
- Node Addition: Metadata updates to redistribute data among nodes.
- Node Removal: Metadata updates to reassign data from the removed node.
- Risks: Network partitions, insufficient memory/disk, or metadata inconsistencies during transitions.
3. Hot Partitioning:
- Access patterns (e.g., viral videos, live events) cause uneven loads, slowing the system.
- Requires load-balancing mechanisms to mitigate bottlenecks.
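A small sketch of the n-way merge step, assuming each video is owned by exactly one partition so that per-shard top-K lists can be merged into an exact global top-K:

```python
# n-way merge sketch: each partition returns its own top-K list sorted by score
# descending; heapq.merge combines them and we keep the first K overall.
import heapq
from itertools import islice

def merge_top_k(partition_lists, k):
    merged = heapq.merge(*partition_lists, key=lambda kv: kv[1], reverse=True)
    return list(islice(merged, k))

shard_a = [("v1", 900), ("v7", 640), ("v3", 120)]
shard_b = [("v9", 800), ("v2", 500)]
print(merge_top_k([shard_a, shard_b], k=3))
# [('v1', 900), ('v9', 800), ('v7', 640)]
```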
Skip List
Skip List is a probabilistic data structure that maintains a sorted list of elements and allows fast search, insertion, and deletion operations—on average in O(log N) time. It offers an efficient alternative to balanced trees and is widely used in distributed systems like Redis and LevelDB due to its simplicity and performance.
Core Idea:
A Skip List consists of multiple levels of linked lists:
- The base level (Level 1) contains all elements in sorted order.
- Each higher level acts as an “express lane,” skipping over elements, thus speeding up access.
- Levels are created randomly, giving a balance between performance and space.

Advantages in Distributed Systems:
- Sorted Indexing: Great for maintaining time-ordered or score-based streams (e.g., trending videos, leaderboard scores).
- Concurrent Access: Easier to implement concurrent versions compared to balanced trees.
- Dynamic Range Queries: Efficiently handles range-based queries like “top K” or “between score A and B”.
- Sharding-Friendly: Segments of a skip list can be distributed across nodes with consistent hashing.
Insertion
To insert a value in a skip list, we first locate the position by scanning forward from the highest level down to level 0. At each level, we move forward as long as the next node’s value is less than the target. This builds a path of predecessors that we’ll later use to update pointers.
We then assign the new node a random level—higher levels are less likely, creating a layered shortcut structure. The new node is linked into all levels up to its assigned height by updating the forward pointers of the predecessor nodes.
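A compact Python sketch of the insertion procedure described above (illustrative only; Redis's actual skip-list implementation differs in detail):

```python
# Compact skip-list insertion sketch (illustrative; parameters are arbitrary).
import random

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * (level + 1)   # one forward pointer per level

class SkipList:
    MAX_LEVEL = 16
    P = 0.5   # probability of promoting a node to the next level

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, value):
        update = [None] * (self.MAX_LEVEL + 1)
        node = self.head
        # Scan forward from the highest active level down to level 0,
        # remembering the predecessor at each level.
        for lvl in range(self.level, -1, -1):
            while node.forward[lvl] and node.forward[lvl].value < value:
                node = node.forward[lvl]
            update[lvl] = node
        new_level = self._random_level()
        # If the new node is taller than the current list, grow the list.
        if new_level > self.level:
            for lvl in range(self.level + 1, new_level + 1):
                update[lvl] = self.head
            self.level = new_level
        new_node = Node(value, new_level)
        # Splice the new node into every level up to its height.
        for lvl in range(new_level + 1):
            new_node.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new_node
```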
Rebalancing (Implicit)
Unlike trees, skip lists don’t perform explicit rebalancing. Instead, balance is probabilistic. The random level assignment ensures that higher levels are sparser. Over time, this keeps the list roughly balanced without complex rotations or height adjustments. In practice, the structure stays efficient with O(log n) search and insert time.
If a new node is taller than the current maximum height, the skip list grows by adding new empty levels and adjusting the head pointer. This is rare but necessary to maintain performance as the list grows.
Challenges of insertion and rebalancing
- Random Level Variance: Occasionally, too many or too few high-level nodes can skew performance. Though rare, this randomness means worst-case time can degrade to O(n).
- Pointer Management: Updating forward pointers correctly across multiple levels is error-prone, especially when insertions or deletions affect high levels.
- Debugging: Because structure depends on random decisions, reproducing bugs or ensuring deterministic behavior in testing can be difficult.
Skip lists offer a good trade-off between simplicity and performance, but they require careful handling of randomness and pointer updates to avoid subtle bugs.
Count-Min Sketch
Count-Min Sketch is a probabilistic data structure that efficiently summarizes and queries large data streams, especially for tracking element frequency counts. It's designed to handle massive datasets where explicitly tracking every element would be impractical due to memory limitations.
This structure uses a 2D array (matrix) of counters and multiple hash functions to map stream elements to indices in each matrix row:
- Hash Functions: Multiple independent hash functions map an element to specific indices in different rows of the matrix.
- Increment Counters: For each incoming element, the counters at the hashed indices in each row are incremented.
- Estimate Count: To estimate an element's frequency, the minimum counter value across all rows for that element's hashed indices is returned.
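A minimal Count-Min Sketch in Python following that description; the width, depth, and hash construction are illustrative choices:

```python
# Minimal Count-Min Sketch (illustrative width/depth; hashes derived from salted MD5).
import hashlib

class CountMinSketch:
    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indices(self, item):
        # One index per row, using a salted hash as an independent hash function.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._indices(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum across rows bounds the overestimate caused by collisions.
        return min(self.table[row][col] for row, col in self._indices(item))

cms = CountMinSketch()
cms.add("video_42", 3)
print(cms.estimate("video_42"))   # >= 3; may overestimate, never underestimates
```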
Summary
Hash Table + Multiple Hosts + Partitioning
Best For:
- Systems requiring exact frequency counts (e.g., billing, quotas, precise analytics).
- Use cases where accuracy cannot be compromised, such as fraud detection thresholds or regulated reporting.
Skip List
Best For:
- Applications needing sorted order, range queries, or top-K rankings.
- Systems requiring fast updates and lookups with concurrent access.
- Scenarios where accurate data ordering is more valuable than full scan speed.
Count-Min Sketch
Best For:
- Approximate counting in large-scale streaming applications.
- Scenarios where memory efficiency and speed are more important than perfect accuracy.
Deep Dive 2: How does this design scale to billions of rows per day?
To scale a leaderboard system to billions of rows per day, the architecture needs improvements along several dimensions: 1) write throughput, 2) storage efficiency, 3) query performance, and 4) horizontal scalability. Many of the techniques below are standard methods for scaling services:
Write Path Scalability
Ingesting billions of score updates per day means millions of writes per second during peak load.
- Kafka-based ingestion layer: Decouple submission from processing using a message queue (e.g., Kafka, Pub/Sub)
- Batch updates: Aggregate score updates in micro-batches (e.g., 1-second or 5-second buffers) before writing to Redis or DB
- Async workers: Distribute leaderboard update logic across horizontally scaled consumers
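A simplified micro-batching worker for this write path might look like the following; the buffer interval, the keep-the-max aggregation policy, and the poll_updates placeholder are assumptions:

```python
# Simplified micro-batching worker: buffer score updates for ~1 second, keep only
# the highest score per (leaderboard, user) in the buffer, then flush in one pipeline.
# poll_updates() stands in for a Kafka/PubSub consumer and is a placeholder.
import time
import redis

r = redis.Redis()

def poll_updates():
    # Placeholder: would return a batch of (leaderboard_id, user_id, score) events.
    return []

def run_writer(buffer_seconds=1.0):
    while True:
        deadline = time.time() + buffer_seconds
        buffered = {}
        while time.time() < deadline:
            for leaderboard_id, user_id, score in poll_updates():
                key = (leaderboard_id, user_id)
                buffered[key] = max(score, buffered.get(key, float("-inf")))
            time.sleep(0.05)
        if buffered:
            pipe = r.pipeline(transaction=False)
            for (leaderboard_id, user_id), score in buffered.items():
                # GT keeps the member's existing score if it is higher (Redis 6.2+).
                pipe.zadd(f"leaderboard:{leaderboard_id}", {user_id: score}, gt=True)
            pipe.execute()
```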
Leaderboard Sharding
Single leaderboard (e.g., global all-time) can become a bottleneck.
- Shard leaderboards by region, category, or user hash (e.g., leaderboard:global:shard_1, leaderboard:game_mode:us:2025-06-29)
- Use consistent hashing or range-based partitioning
- Use separate Redis instances per shard or Redis Cluster
Efficient Storage Management
Storing high-cardinality score records and historical windows becomes expensive.
- Use Redis ZSET for active windows (e.g., hourly/daily), then archive
- Periodically compact old windows into aggregate buckets (e.g., weekly, monthly)
- Store historical data in cost-efficient storage (e.g., BigQuery, S3, ClickHouse)
Pre-aggregation and Caching
Frequent rank/top-K queries on large leaderboards become slow and expensive.
- Precompute and cache top-K results every few seconds
- Cache frequently accessed ranks (e.g., popular users) in Redis Hashes
- Use a tiered storage model:
- Hot: Redis (ZSET)
- Warm: Precomputed rank cache
- Cold: Historical DB
Time Window Isolation
Leaderboard write and read operations can interfere across time windows or overload shared system resources if not properly isolated. To address this, use separate keys per time window (e.g., leaderboard:game:daily:2025-06-29) to ensure clean separation of data. Apply TTL policies or background cleanup jobs to evict stale data and minimize memory usage. For rarely accessed or custom time ranges, support lazy materialization, generating leaderboard data on demand to avoid unnecessary precomputation and storage overhead.
Horizontal Scaling & Service Decomposition
Break the system into independent services—ScoreWriterService, RankQueryService, and AggregationService—each deployed separately with autoscaling based on load, and use distributed task queues to manage and balance batch processing jobs efficiently.

Deep Dive 3: What consistency guarantees do we need? (e.g., race conditions during late score writes)
Write Consistency
Write consistency ensures that score updates are correctly reflected in the leaderboard. Race conditions can occur when a user submits multiple scores in rapid succession (e.g., concurrent or late-arriving updates due to network delays). Typical scenarios include:
- Out-of-order writes: A lower score arrives after a higher one
- Concurrent writes: Multiple clients or devices submit at the same time
- Delayed score delivery: Events arrive late (e.g., Kafka lag, mobile offline buffering)
Here are several standard ways to improve write consistency (a minimal sketch follows this list):
- Enforce monotonic updates: Reject or ignore updates that are lower than the latest score
- Use timestamps or sequence numbers to determine whether a score is stale
- Apply last-write-wins (LWW) or highest-score-wins logic depending on the business rule
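One way to make the sequence-number check atomic is a small Lua script executed inside Redis. The key names below are illustrative, and this variant implements last-write-wins; comparing scores instead of sequence numbers would give highest-score-wins:

```python
# Hedged sketch: reject stale or out-of-order updates using a per-user sequence
# number (or event timestamp), applied atomically via a Lua script in Redis.
import redis

r = redis.Redis()

LWW_UPDATE = r.register_script("""
local seq_key, board_key = KEYS[1], KEYS[2]
local user, score, seq = ARGV[1], ARGV[2], tonumber(ARGV[3])
local last = tonumber(redis.call('HGET', seq_key, user))
if last and last >= seq then
  return 0  -- stale or out-of-order update: ignore it
end
redis.call('HSET', seq_key, user, seq)
redis.call('ZADD', board_key, score, user)
return 1
""")

def apply_update(leaderboard_id, user_id, score, seq):
    # seq can be an event timestamp or a per-user sequence number.
    return LWW_UPDATE(
        keys=[f"seq:{leaderboard_id}", f"leaderboard:{leaderboard_id}"],
        args=[user_id, score, seq],
    )
```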
Read Consistency
Guarantees that a read reflects the most recent or correct state with two options below:
- Strong consistency: Every read returns the most up-to-date score/rank
- Expensive, hard in distributed systems
- Eventual consistency: Reads may be slightly stale but will converge
- Acceptable for many use cases like analytics, daily leaderboards
Consistency can be tuned based on the use case:
- Use write-through caching for real-time results
- Cache precomputed ranks with a short TTL (e.g., refresh every 30s)
- Indicate staleness in the API (e.g., "updated_at": "2025-06-29T14:32:00Z")
Cross-Shard or Aggregation Consistency
A user's score may be updated in one shard but not yet reflected in a global merged leaderboard, so the problem becomes how to ensure correct behavior when leaderboards are partitioned by region, category, or time.
- Delay reads until partial aggregates are ready
- Use bounded staleness windows for consistency guarantees (e.g., "within 1 min")
- Employ read repair on cache miss (pull from all partitions and re-merge)
Designing a leaderboard involves balancing accuracy, performance, and complexity. For real-time systems with strict accuracy requirements, Skip Lists and precomputed ranks with Redis caching are preferred. Approximate methods like Count-Min Sketch are useful for long-tail or analytics use cases. Scalable architectures combine ingestion, precomputation, caching, and service decomposition to support billions of updates per day.