Overview
A Web Crawler (or spider/bot) systematically browses websites to collect information. It starts with a list of seed URLs, retrieves their content, extracts useful information, and discovers additional links for further crawling. Web crawlers are used for tasks such as:
- Building search engines (e.g., Google, Bing)
- Data collection/analysis (e.g., market research, news aggregation)
- Training large language models (LLMs)
This design focuses on the crawler component itself, specifically downloading, processing, and storing web pages.
Functional Requirements
- The system starts with seed URLs and autonomously discovers and crawls new URLs.
- The system should download the web pages from identified URLs, extract meaningful content (such as text data), and store this information in a structured format for future use.
- The system should enable periodic crawling to keep data updated.
- (**) The system should support crawling JavaScript-rendered and dynamic content.
Non-Functional Requirements
- [Scalability] The system should be capable of handling up to 10 billion web pages, with individual page sizes up to 5MB.
- [Performance] The system should complete the crawling of all designated pages within a specified period (days TBD).
- [Compliance | “Politeness”] The system must respect the robots.txt policies of websites to avoid overloading servers.
- [Fault Tolerance] The system must be designed to recover gracefully from failures, such as network outages or system crashes, without losing progress.
- [Data Integrity] The system should ensure that data remains consistent after failures.
Below the line
- Authentication & Authorization (protect the system from abuse by unauthorized users/admins)
- Subsequent usage of the downloaded data (e.g., building a search engine with searching + ranking, or training LLMs)
High Level Design

Web Crawling Workflow
- A client submits a crawling request to the system via an API endpoint (e.g., POST api/v1/crawl/start). The request includes the seed URLs, the desired crawl depth, and an optional schedule for recurring crawls.
- The [Crawler Scheduler] validates the request and places crawling tasks into appropriate queues (see the sketch after this list). For example, high-priority tasks are placed in a priority queue, while scheduled tasks are placed in a regular queue.
- The [Crawler Worker] fetches tasks from the queue, retrieves the robots.txt file for each domain (if accessing it for the first time), and adheres to the domain-specific politeness rules specified in the file.
- The [Crawler Worker] downloads the HTML content of the URLs, stores it in S3, and then enqueues a message in the parsing queue to process the downloaded content.
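As a rough illustration, here is a minimal sketch of how the [Crawler Scheduler] might validate a request and fan seed URLs out to queues. The queue URLs, the payload field names (seed_urls, max_depth, priority), and the validation rule are illustrative assumptions, not part of the original design.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URLs -- the real names/infrastructure are assumptions.
PRIORITY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-priority"
REGULAR_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-regular"

def schedule_crawl(request: dict) -> None:
    """Validate a POST api/v1/crawl/start payload and enqueue one task per seed URL."""
    seed_urls = request.get("seed_urls", [])
    if not seed_urls:
        raise ValueError("at least one seed URL is required")

    queue_url = PRIORITY_QUEUE_URL if request.get("priority") == "high" else REGULAR_QUEUE_URL
    for url in seed_urls:
        task = {"url": url, "depth": 0, "max_depth": request.get("max_depth", 3)}
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(task))
```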
Content Processing Workflow
Before we start, let’s see what needs to be done for each URL in this web crawler (a minimal fetch sketch follows the list):
- Resolve the IP for the given URL/domain via the local/configured DNS server;
- Fetch/download the HTML from the target web server using that IP;
- Extract text data from the HTML, or optionally render dynamic content;
- Store the extracted text data in S3;
- Put any newly discovered URLs (extracted from the content) into the crawling queue for further crawling.
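A minimal sketch of the first two steps (DNS resolution and download), assuming the standard-library resolver and the requests library; the user-agent string and timeout values are illustrative assumptions:

```python
import socket
from urllib.parse import urlparse

import requests

def fetch_html(url: str, timeout: int = 10) -> str:
    """Resolve the domain's IP (step 1), then download the HTML (step 2)."""
    host = urlparse(url).hostname
    ip = socket.gethostbyname(host)  # uses the locally configured DNS resolver
    print(f"resolved {host} -> {ip}")

    # requests re-resolves internally; the explicit lookup above mirrors the step list.
    resp = requests.get(url, headers={"User-Agent": "my-crawler/0.1"}, timeout=timeout)
    resp.raise_for_status()
    return resp.text
```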
Here is the workflow after adopting the “pipeline processing” approach described above.
- [Stage 1] [Crawler worker]
- polls messages off the crawling queue and queries the local/configured DNS server to resolve the IP for a given URL.
- downloads the HTML page from the external web server, stores it in S3, and then publishes a parsing request message to the parsing queue.
- [Stage 2] [Parsing worker]
- retrieves messages from the parsing queue and reads the corresponding HTML content from S3.
- extracts text data from the HTML, identifies new URLs embedded in the content, and adds these URLs to the crawling queue for further exploration.
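The following is a hedged sketch of the Stage 2 parsing worker described above. The bucket names, queue URL, and message fields (s3_key, url, depth, max_depth) are assumptions for illustration; it uses BeautifulSoup for extraction, though any HTML parser would do.

```python
import json
from urllib.parse import urljoin

import boto3
from bs4 import BeautifulSoup

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Hypothetical names -- adjust to the real buckets/queues.
RAW_BUCKET, TEXT_BUCKET = "crawler-raw-html", "crawler-extracted-text"
CRAWL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-regular"

def process_page(message: dict) -> None:
    """Stage 2: read HTML from S3, extract text and links, enqueue new URLs."""
    html = s3.get_object(Bucket=RAW_BUCKET, Key=message["s3_key"])["Body"].read()
    soup = BeautifulSoup(html, "html.parser")

    # Store the extracted text alongside the raw HTML.
    s3.put_object(
        Bucket=TEXT_BUCKET,
        Key=message["s3_key"] + ".txt",
        Body=soup.get_text(separator=" ", strip=True).encode("utf-8"),
    )

    # Discover new absolute URLs and hand them back to Stage 1, respecting max_depth.
    if message["depth"] + 1 > message["max_depth"]:
        return
    for anchor in soup.find_all("a", href=True):
        next_url = urljoin(message["url"], anchor["href"])
        task = {"url": next_url, "depth": message["depth"] + 1, "max_depth": message["max_depth"]}
        sqs.send_message(QueueUrl=CRAWL_QUEUE_URL, MessageBody=json.dumps(task))
```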
Storage
- A database to store metadata about URLs and domains;
- S3 for saving raw HTML and extracted content;
- An optional Redis cache to store DNS lookups and deduplication hashes for faster processing.
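To make the URL metadata concrete, here is a sketch of the per-URL record the database might hold. The field names simply consolidate attributes mentioned elsewhere in this design (status, last crawl time, retries, content hash, depth) and are assumptions rather than a fixed schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class UrlMetadata:
    """One row per URL; field names are illustrative, not a fixed schema."""
    url: str
    domain: str
    status: str                          # e.g., "pending" | "crawled" | "failed"
    depth: int                           # distance from the seed URL
    last_crawler_time: Optional[datetime] = None
    retry_count: int = 0
    content_hash: Optional[int] = None   # simhash fingerprint for deduplication
    s3_key: Optional[str] = None         # location of the raw HTML in S3
```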
Deep Dive
Handle failures gracefully and resume crawling without losing progress
- Adopt a pipeline processing approach and retry only the failed stage. Break the crawling workflow into two stages: one for fetching HTML and one for processing it. Pipeline processing isolates failures to a single retryable stage without losing all previous progress, and it also lets us scale and/or optimize each stage independently.
- Retry. Implement retries with exponential backoff and a maximum retry count. When a crawler worker fails (e.g., due to a network outage or an unavailable remote web server), it updates the database by setting the URL status to failed. The [Crawler Scheduler] schedules a retry based on [1] last_crawler_time and [2] max_retry, and puts a new message into the crawling queue.
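A minimal sketch of how the scheduler might compute the next retry time from last_crawler_time and the retry count; the base delay, jitter, and max-retry cap below are assumed values, not part of the original design.

```python
import random
from datetime import datetime, timedelta
from typing import Optional

MAX_RETRY = 5            # assumed cap on retries
BASE_DELAY_SECONDS = 60  # assumed base for the backoff curve

def next_retry_time(last_crawler_time: datetime, retry_count: int) -> Optional[datetime]:
    """Return when to re-enqueue a failed URL, or None once max_retry is exhausted."""
    if retry_count >= MAX_RETRY:
        return None  # give up; leave the URL marked as failed
    # Exponential backoff with jitter: 60s, 120s, 240s, ... plus up to 30s of noise.
    delay = BASE_DELAY_SECONDS * (2 ** retry_count) + random.uniform(0, 30)
    return last_crawler_time + timedelta(seconds=delay)
```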
Ensure politeness and avoid overloading web servers
- The first time it crawls a given domain, the crawler worker pulls the robots.txt file and persists its requirements into the DB (e.g., disallowed paths, crawl frequency, user-agent rules).
- The [Crawler Scheduler] adheres to these requirements when scheduling subsequent crawling requests.
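A sketch of the first-contact robots.txt lookup using Python's standard-library parser; the user-agent string is an assumption, and persisting to the DB is represented here by simply returning the parsed rules.

```python
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "my-crawler/0.1"  # assumed user-agent string

def load_politeness_rules(url: str) -> dict:
    """Fetch and parse robots.txt for a domain on first contact."""
    parts = urlparse(url)
    base = f"{parts.scheme}://{parts.netloc}"

    rp = robotparser.RobotFileParser()
    rp.set_url(base + "/robots.txt")
    rp.read()

    # In the real system these values would be persisted to the URL/domain metadata DB.
    return {
        "can_fetch": rp.can_fetch(USER_AGENT, url),
        "crawl_delay": rp.crawl_delay(USER_AGENT),    # None if unspecified
        "request_rate": rp.request_rate(USER_AGENT),  # None if unspecified
    }
```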
Make the crawler efficient and able to handle 10B pages in X days
- Scale crawler workers horizontally based on queue status/metrics (e.g., [1] ApproximateNumberOfMessagesVisible and [2] MessageAge);
- Dynamically scale the parser workers as well (to ensure they can keep pace with the crawlers);
- Cache DNS results in Redis or employ multiple DNS providers;
- Use simhash to avoid parsing the same HTML page twice. Hash the content of the page and store it in the DB (URL metadata). When we fetch a new URL, we hash its content and compare it against the hashes in the DB; if we find a match, we skip the page. To make the lookup fast, we can build an index on the hash column.
- Use a max_depth field to control the maximum crawl depth and avoid crawler traps.
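To make the deduplication idea concrete, here is a self-contained, simplified simhash fingerprint plus a Hamming-distance check. A production system would more likely use a dedicated simhash library and an indexed lookup; the tokenization, 64-bit width, and the 3-bit near-duplicate threshold are assumptions for illustration.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simplified 64-bit simhash fingerprint over whitespace tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when the accumulated weight is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def is_near_duplicate(fp_a: int, fp_b: int, max_distance: int = 3) -> bool:
    """Treat pages as duplicates when fingerprints differ in at most max_distance bits."""
    return bin(fp_a ^ fp_b).count("1") <= max_distance
```

In practice the parsing worker would compute the fingerprint once per page, store it in the content_hash column of the URL metadata, and skip pages whose fingerprint matches (or nearly matches) an existing row.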