Data Processing and Analytics

Introduction

Design a system to log incoming requests and analytics metrics under various requirements (such as real-time monitoring, aggregation across different dimensions, and top-K patterns) over different time windows and granularities. The system focuses on stream processing, efficient data structures, and optimized counting algorithms. Several well-known design questions share this common core:

Common Design Patterns

From a system design perspective, all three systems compared here (ad click aggregation, top-K YouTube video ranking, and game leaderboards) are rooted in how they ingest, process, and store large volumes of user-generated event data to produce continuously updated metrics. Key common architectural and design considerations include:

  1. Event-Driven Ingestion Pipelines: All these systems must handle streams of incoming events—clicks, views, or scores—in real or near-real time. This typically involves:
    • Load Balancers / Ingress Points: Front-end services or APIs that accept new events.
    • Messaging / Queueing Layers: Message brokers (e.g., Kafka, Kinesis) or queues to buffer incoming events, absorb bursty traffic, and decouple ingestion from downstream processing.
  2. Horizontal Scalability & Distributed Architecture: Each system is designed for growth and redundancy:
    • Stateless Compute Tiers: Multiple processing nodes that can be scaled horizontally to handle event spikes.
    • Distributed Storage: No single machine can hold all data, so these systems rely on distributed databases or storage solutions (e.g., sharded SQL/NoSQL, columnar stores, key-value stores) for persistent storage and fast lookups.
  3. Incremental Aggregate Computation: The core function of each system is to increment or update counters or rankings as new events arrive (see the counting sketch after this list):
    • Atomic Increment Operations: Using atomic counters or distributed locks/semaphores to safely increment counts.
    • Batch or Streaming Computation: Periodic batch jobs or stream processing frameworks (e.g., Flink, Spark Streaming, Beam) that aggregate events into summarized metrics.
  4. Eventual Consistency and Data Integrity: While immediate strong consistency is often not mandatory, ensuring that the final aggregated values are correct is essential. Common patterns:
    • Idempotent Operations: Ensuring repeated event processing does not corrupt totals (e.g., deduplicating events).
    • Checkpointing and Replay: The ability to replay events from queues or logs in case of failures to reach a consistent state.
    • Consistency Models: Eventual consistency is usually acceptable, but design considerations ensure data is not permanently lost and will eventually converge to accurate counts or rankings.
  5. Fault Tolerance and High Availability: These systems must be resilient to machine failures and network issues:
    • Replication: Data is replicated across multiple nodes or data centers to withstand failures.
    • Failover Mechanisms: Automatically reroute traffic or elect new leader nodes if a component fails, keeping ingestion and updates flowing.
  6. Caching and Query Optimization: Frequently accessed data (top leaderboard entries, popular videos’ view counts, trending ad campaigns) benefits from:
    • In-Memory Caches / CDN Layers: To quickly serve common queries without hitting the storage layer every time.
    • Pre-Computed Aggregates: Pre-materialized views or partial aggregates to speed up read queries.
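
To make items 3 and 4 concrete, below is a minimal counting sketch in Python. It is illustrative only: the event fields (event_id, key, timestamp), the tumbling one-minute window, and the in-memory dedup set are assumptions, and a real pipeline would keep deduplication state in a bounded, TTL-backed store and checkpoint it along with stream offsets.

# Minimal sketch of idempotent, windowed event counting (illustrative only).
# The event_id/key/timestamp fields and WINDOW_SECONDS are assumptions,
# not tied to any specific framework.
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling one-minute windows

class WindowedCounter:
    def __init__(self):
        self.counts = defaultdict(int)  # (window_start, key) -> count
        self.seen_event_ids = set()     # dedup store; production systems bound this with a TTL

    def process(self, event_id, key, timestamp):
        if event_id in self.seen_event_ids:  # idempotency: replayed/duplicate events are ignored
            return
        self.seen_event_ids.add(event_id)
        window_start = int(timestamp) - int(timestamp) % WINDOW_SECONDS
        self.counts[(window_start, key)] += 1

    def query(self, key, window_start):
        return self.counts[(window_start, key)]

counter = WindowedCounter()
counter.process("evt-1", "campaign-42", 1700000005)
counter.process("evt-2", "campaign-42", 1700000010)
counter.process("evt-1", "campaign-42", 1700000005)   # duplicate delivery, ignored
print(counter.query("campaign-42", 1699999980))        # -> 2 (window containing both events)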

Main Distinctions

  • Ad Click Aggregation: Primarily focuses on summing large volumes of simple click events. Data structures emphasize efficient counting by keys (e.g., campaigns) and applying fraud filters. Top-K operations are less common, and time windows and batch aggregations are often acceptable.
  • Top-K YouTube Video Ranking: Similar counting problem but must maintain a global ordering (top K) from a huge set of videos. This drives the use of data structures and algorithms for partial top-K merges, approximation techniques, and memory-efficient indexing of the most popular items. Slightly delayed eventual consistency is generally acceptable.
  • Game Leaderboard: Focuses almost entirely on maintaining a continuous sorted order of a potentially large set of players. Data structures are heavily oriented around fast insertions, deletions, and rank queries (e.g., in-memory sorted sets or trees). Real-time, low-latency updates and exact ranking are crucial, and approximation is usually not acceptable due to the competitive nature of gaming.
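
As a concrete illustration of the leaderboard case, the sketch below uses a Redis sorted set (which Redis implements with a skip list internally) through the redis-py client. This is a minimal sketch under assumptions: the key name "leaderboard", the player ids, and the local connection settings are placeholders, and it assumes redis-py 3+ and a running Redis instance.

# Hypothetical leaderboard sketch on a Redis sorted set (redis-py 3+ assumed).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_score(player_id, points):
    # ZINCRBY keeps the set ordered by total score as score updates stream in
    r.zincrby("leaderboard", points, player_id)

def top_n(n=10):
    # Highest scores first, scores included
    return r.zrevrange("leaderboard", 0, n - 1, withscores=True)

def rank_of(player_id):
    # 0-based rank from the top, or None if the player has no score yet
    return r.zrevrank("leaderboard", player_id)

add_score("alice", 120)
add_score("bob", 95)
print(top_n(3), rank_of("bob"))

Because the sorted set maintains order on every write, rank and top-N reads are served directly from the ordered structure (roughly logarithmic per lookup) without any batch recomputation.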

Having compared the three systems from a system design perspective above, the following sections focus on how they handle data processing and the data structures and algorithms they might employ.

Commonly Used Data Structures

Several commonly used data structures can speed up data retrieval and aggregation in these systems:

Count-Min Sketch

Count-Min Sketch is a probabilistic data structure that efficiently summarizes and queries large data streams, especially for tracking element frequency counts. It's designed to handle massive datasets where explicitly tracking every element would be impractical due to memory limitations.

This structure uses a 2D array (matrix) of counters and multiple hash functions to map stream elements to indices in each matrix row:

  1. Hash Functions: Multiple independent hash functions map an element to specific indices in different rows of the matrix.
  2. Increment Counters: For each incoming element, the counters at the hashed indices in each row are incremented.
  3. Estimate Count: To estimate an element's frequency, the minimum counter value across all rows for that element's hashed indices is returned.
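
A workable Count-Min Sketch fits in a few dozen lines. The Python below is an illustrative sketch: the width/depth values and the salted-MD5 hashing are simplified stand-ins for properly sized, pairwise-independent hash functions.

# Illustrative Count-Min Sketch; width/depth and the salted-MD5 hashing are simplified.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width                                # counters per row
        self.depth = depth                                # rows, one per hash function
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salt the hash with the row number to get a different mapping per row.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows is the tightest estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for video_id in ["v1", "v2", "v1", "v3", "v1"]:
    cms.add(video_id)
print(cms.estimate("v1"))   # 3 (may overestimate, never underestimates)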


Skip List

A skip list is a data structure that allows fast search, insertion, and deletion operations, similar to a balanced tree. It uses multiple linked lists with elements at different "levels," where higher levels "skip" more nodes, allowing for quick traversal. This results in an average time complexity of O(log n) for these operations.

Level 3:       1  ----------------------->  16  --------> 25

Level 2:       1  ----------->  8  ------->  16  --------> 25

Level 1:       1  -->  4  --->  8  ---> 12  ->  16  -> 20 -> 25

Base Level:    1  ->  2  ->  3  ->  4  ->  5  ->  6 -> ... -> 25
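
The Python sketch below shows the core mechanics: search starts at the top level and drops down, and insert assigns each new node a random level. It is a teaching-sized version (fixed maximum level, no delete), not production code.

# Compact skip list with insert and search (illustrative; no delete, fixed max level).
import random

MAX_LEVEL = 4

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * (level + 1)   # forward[i] = next node at level i

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)     # sentinel head spanning all levels
        self.level = 0

    def _random_level(self):
        level = 0
        while random.random() < 0.5 and level < MAX_LEVEL:
            level += 1
        return level

    def search(self, value):
        node = self.head
        for i in range(self.level, -1, -1):   # walk right, then drop a level
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.value == value

    def insert(self, value):
        update = [self.head] * (MAX_LEVEL + 1)   # rightmost node visited at each level
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node
        level = self._random_level()
        self.level = max(self.level, level)
        new_node = Node(value, level)
        for i in range(level + 1):               # splice the new node into each of its levels
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

sl = SkipList()
for v in [16, 4, 25, 1, 12, 8, 20]:
    sl.insert(v)
print(sl.search(12), sl.search(13))   # True False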

Distributed Heap

Designing a distributed heap requires dividing the data and processing across multiple nodes in a cluster to balance the load and ensure efficiency. Below, I'll use a textual representation to explain the structure:

                   +-------------------+
                   |   Central Node    |
                   |   (Coordinator)   |
                   +---------+---------+
                             |
      +-------------+--------+----+-----------------+
      |             |             |                 |
+-----v-----+ +-----v-----+ +-----v-----+ ... +-----v-----+
| Worker 1  | | Worker 2  | | Worker 3  |     | Worker N  |
| (Heap A)  | | (Heap B)  | | (Heap C)  |     | (Heap D)  |
+-----+-----+ +-----+-----+ +-----+-----+     +-----+-----+
      |             |             |                 |
      +-------------+--------+----+-----------------+
                             |
                   +---------v----------+
                   | Client Request API |
                   +--------------------+
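
One common way to use this layout for top-K queries is to shard keys across workers, let each worker keep its own heap over its local counts, and have the coordinator merge the workers' partial top-K lists. The in-process Python sketch below models that merge with heapq; the Worker and coordinator_top_k names are illustrative, and in a real deployment the workers would be separate services answering over RPC.

# Sketch of the partial top-K merge a coordinator might run over per-worker heaps.
import heapq
from collections import Counter

class Worker:
    def __init__(self):
        self.counts = Counter()              # counts only for the keys routed to this worker

    def record(self, key, n=1):
        self.counts[key] += n

    def local_top_k(self, k):
        # heapq.nlargest maintains a bounded heap internally to find this worker's hottest keys
        return heapq.nlargest(k, self.counts.items(), key=lambda kv: kv[1])

def coordinator_top_k(workers, k):
    # Merging partial top-K lists is exact as long as each key lives on exactly one worker.
    merged = [item for w in workers for item in w.local_top_k(k)]
    return heapq.nlargest(k, merged, key=lambda kv: kv[1])

workers = [Worker() for _ in range(3)]
for event in ["v1", "v2", "v1", "v3", "v4", "v1", "v2", "v5"]:
    workers[hash(event) % len(workers)].record(event)   # shard by key so a key's count is not split
print(coordinator_top_k(workers, 2))                    # -> [('v1', 3), ('v2', 2)]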
