# Transform Any Website into a Dynamic Graph Knowledge Base with Crawl4ai and R2R: A Step-by-Step Guide

*How to Turn Any Website into a Graph Knowledge Base With a Production-Ready Co-Pilot*

Businesses often struggle to manage and retrieve large volumes of information efficiently, which frustrates customers and internal teams alike. One remedy is a custom co-pilot with a comprehensive understanding of the business, its products, or its services: it can improve customer engagement and streamline internal knowledge access. This guide demonstrates how to transform any website's content into a dynamic, graph-powered knowledge base using the open-source tools Crawl4ai and R2R (Reason to Retrieve).

## Meet the Tools

**Crawl4ai** is an open-source web crawling and scraping framework designed for modern AI workflows. Unlike traditional scrapers, Crawl4ai is optimized to extract and structure website content in a form that large language models (LLMs) can use efficiently. It supports multiple LLM providers, including OpenAI, Groq, and Ollama, and offers deep crawling, flexible scraping strategies, and content validation.

**R2R (Reason to Retrieve)** is a production-ready AI retrieval platform that supports agentic Retrieval-Augmented Generation (RAG) through a RESTful API. It handles multimodal content, offers hybrid search, and includes user and document management. R2R ships with an intuitive UI for document ingestion and management, making it a complete solution for building intelligent knowledge interfaces.

## Step-by-Step: Turning a Website into a Graph-Based Co-Pilot

### Step 1: Scrape Website Contents with Crawl4ai

**Set up the development environment.** Use a package manager such as UV to create a virtual environment:

```bash
uv venv .venv --python=python3.12
source .venv/bin/activate
```

**Install dependencies.** Install Crawl4ai, plus Pydantic for structured data validation:

```bash
pip install crawl4ai pydantic
```

**Configure API keys.** Create a `.env` file to store your API keys securely:
```bash
echo -e "OPENAI_API_KEY=$(read -sp 'Enter OpenAI API key: ' okey && echo $okey)\nGROQ_API_KEY=$(read -sp 'Enter Groq API key: ' gkey && echo $gkey)" > .env && echo -e "\n✅ .env file created successfully"
```

**Define the data model with Pydantic.** A structured schema for product properties ensures the extracted data is clean, consistent, and machine-readable:

```python
from pydantic import BaseModel

class ProductDetails(BaseModel):
    upc: str
    type: str
    price: str
    inventory_count: int

class Product(BaseModel):
    title: str
    description: str
    details: ProductDetails
```

**Extract contents with an LLM strategy.** Use the `LLMExtractionStrategy` class from Crawl4ai to extract structured product data directly from raw HTML:

```python
import os

from crawl4ai import LLMConfig, LLMExtractionStrategy

extraction_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="groq/deepseek-r1-distill-llama-70b",
        api_token=os.getenv("GROQ_API_KEY"),
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract all product objects specified in the schema from the text.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={"temperature": 0.1, "max_tokens": 1000},
    verbose=True,
)
```

**Configure and execute the crawler** to scrape and extract the required data:

```python
import asyncio
import json

from dotenv import load_dotenv
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import ContentTypeFilter, FilterChain, URLPatternFilter

load_dotenv()

# Follow only catalogue pages, skip category/book-detail listings, HTML only.
filter_chain = FilterChain([
    URLPatternFilter(patterns=["catalogue"], reverse=False),
    URLPatternFilter(patterns=["category", "/books/"], reverse=True),
    ContentTypeFilter(allowed_types=["text/html"]),
])

# Reuses the extraction_strategy defined above.
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        include_external=False,
        max_pages=2,
        filter_chain=filter_chain,
    ),
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=extraction_strategy,
    scraping_strategy=LXMLWebScrapingStrategy(),
    verbose=True,
)

async def run_advanced_crawler():
    outputs = []
    browser_cfg = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        results = await crawler.arun("https://books.toscrape.com/index.html", config=config)
        for result in results:
            url = result.url
            if result.success:
                try:
                    data = json.loads(result.extracted_content)
                    if isinstance(data, list) and len(data) > 0:
                        if any(item.get('error', False) for item in data if isinstance(item, dict)):
                            error_items = [item for item in data
                                           if isinstance(item, dict) and item.get('error', False)]
                            error_content = error_items[0].get('content', 'Unknown error') if error_items else 'Unknown error'
                            print(f"Error in extracted data from {url}: {error_content}")
                            continue
                        for item in data:
                            if isinstance(item, dict):
                                item['source_url'] = url
                        outputs.extend(data)
                except json.JSONDecodeError:
                    print(f"Error decoding JSON from {url}: {result.extracted_content}")
            else:
                print(f"Error crawling {url}: {result.error_message}")

    if outputs:
        output_file = "extracted_products.json"
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(outputs, f, indent=4, ensure_ascii=False)
        print(f"Data for {len(outputs)} products saved to {output_file}")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())
```

### Step 2: Building and Querying the Knowledge Graph

**Install R2R.** R2R supports both lightweight and full installation modes; the full installation requires Docker. Follow the official R2R installation guide to set up the system.

**Ingest and process the data.** Upload the JSON file containing the structured product data to the R2R dashboard.
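Before handing the file to R2R, it can help to sanity-check the crawler output against the schema defined earlier, since LLM extraction occasionally emits incomplete records. A minimal sketch, assuming Pydantic v2; the sample records below are illustrative stand-ins for the contents of `extracted_products.json`:

```python
# Sanity-check records against the product schema (Pydantic v2).
from pydantic import BaseModel, ValidationError

class ProductDetails(BaseModel):
    upc: str
    type: str
    price: str
    inventory_count: int

class Product(BaseModel):
    title: str
    description: str
    details: ProductDetails

# Illustrative records; real data would be loaded from extracted_products.json.
records = [
    {
        "title": "A Light in the Attic",
        "description": "A poetry collection.",
        "details": {
            "upc": "a897fe39b1053632",
            "type": "Book",
            "price": "£51.77",
            "inventory_count": 22,
        },
    },
    {"title": "Malformed record"},  # missing fields: rejected below
]

valid = []
for record in records:
    try:
        valid.append(Product.model_validate(record))
    except ValidationError:
        pass  # skip malformed records instead of ingesting them

print(f"{len(valid)} of {len(records)} records passed validation")
```

Dropping (or logging) invalid records here keeps the downstream knowledge graph free of half-extracted entities.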
```python
from r2r import R2RClient

client = R2RClient(base_url="http://localhost:7272")
client.documents.create(file_path='./extracted_products.json')
```

Check the status of the ingested document to confirm it was processed correctly:

```python
client.documents.list()
```

**Execute searches and interact with the knowledge base.** Use R2R's RAG capabilities to ask specific questions about the ingested data:

```python
client.retrieval.rag(query="How many of A Light in the Attic are in stock?")
```

R2R extracts entities and relationships, creates vector embeddings, and returns accurate, contextually relevant answers.

**List graph properties.** List the graph's entities and relationships to gain deeper insight into the structured data:

```python
client.documents.list_entities(id='20021630-fe05-5c69-9b70-18a59bcd5a47')
client.documents.list_relationships(id='20021630-fe05-5c69-9b70-18a59bcd5a47')
```

## Conclusion

This guide outlined how to turn static website content into a dynamic, queryable graph knowledge base using Crawl4ai and R2R. By combining Crawl4ai's crawling and structuring features with R2R's knowledge graph and RAG capabilities, businesses can improve customer engagement and internal knowledge management. The approach moves beyond simple keyword search, enabling complex, multi-hop reasoning over website content.

## Industry Evaluation and Company Profiles

**Industry impact:** The integration of graph knowledge bases and AI co-pilots is changing how businesses manage and use information, enabling more precise, contextual queries that improve both customer service and internal operational efficiency.

**Company profiles:**

- **Crawl4ai:** An open-source project that bridges the gap between web content and AI systems, widely praised for its flexibility and ease of use on complex scraping tasks.
- **R2R:** A platform that automates the creation and querying of knowledge graphs. Its support for multimodal content and robust RAG capabilities make it a powerful tool for building intelligent, interactive knowledge interfaces.
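The payoff of structured extraction can be previewed even without a running R2R server: once the crawler has produced schema-shaped records, a question like the RAG query above reduces to a simple lookup. A minimal sketch in plain Python, with illustrative records standing in for `extracted_products.json`:

```python
# Illustrative records shaped like the crawler's extracted_products.json;
# real data would be loaded with json.load from that file.
records = [
    {"title": "A Light in the Attic",
     "details": {"type": "Poetry", "price": "£51.77", "inventory_count": 22}},
    {"title": "Tipping the Velvet",
     "details": {"type": "Historical Fiction", "price": "£53.74", "inventory_count": 20}},
]

# "How many of A Light in the Attic are in stock?" becomes a dictionary
# lookup once the data is structured rather than free-form HTML.
stock = {r["title"]: r["details"]["inventory_count"] for r in records}
print(stock["A Light in the Attic"])
```

R2R layers entity extraction, embeddings, and multi-hop graph traversal on top of this same structured foundation.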
