# Boost Your LLM Agents with BM25: Efficient and Explainable Retrieval on Local CPU
## Enhancing LLM Agents with BM25: A Lightweight Retrieval Solution

Every intelligent generative AI (GenAI) agent, from customer support chatbots to autonomous assistants, relies heavily on retrieving relevant information to operate effectively. Retrieval is the process of fetching contextual data that augments the AI's reasoning, ensuring accurate and up-to-date responses. However, many practitioners default to cosine similarity over vector embeddings, overlooking simpler, more efficient alternatives like BM25.

## Why BM25 Matters

### What Is Embedding + Cosine Similarity?

The dominant retrieval method in modern AI systems uses vector embeddings, which convert text into dense vectors in a high-dimensional space. Models like Sentence Transformers produce these embeddings, allowing comparison with metrics such as cosine similarity. Documents whose embeddings lie closest to the query's embedding are returned as relevant results.

### Challenges of Embedding-Based Retrieval

While powerful, embedding-based retrieval has several drawbacks, especially in structured domains such as encyclopedias, technical manuals, and product catalogs:

- **Computational cost:** Generating and comparing embeddings is resource-intensive and often requires significant GPU power.
- **Scalability:** Large datasets can create performance bottlenecks.
- **Explainability:** Embedding-based retrieval is often a black box, making it hard to understand why certain documents are chosen over others.
- **Freshness:** Embeddings reflect the model's static training data, which may not capture the latest information.

### BM25: A Viable Alternative

BM25 (Best Match 25) is a ranking function developed in the 1990s, building on probabilistic retrieval research dating back to the 1970s, and is widely used in search engines such as Elasticsearch and Apache Lucene. It scores documents based on keyword frequency and document length, making it a lightweight, explainable, and CPU-friendly solution.
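To make the "keyword frequency and document length" scoring concrete, here is a minimal from-scratch sketch of the Okapi BM25 formula. The corpus and query below are made up for illustration; real systems use tuned tokenization and the libraries shown later in this article.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    N = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / N
    # document frequency: how many docs each term appears in
    df = Counter(t for d in tokenized_docs for t in set(d))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # rare terms get higher IDF weight
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # term-frequency saturation plus document-length normalization
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

docs = [
    "bm25 ranks documents by keyword frequency".split(),
    "vector embeddings map text to dense vectors".split(),
]
print(bm25_scores("bm25 keyword".split(), docs))
```

The two free parameters, `k1` (TF saturation) and `b` (length-normalization strength), default to the commonly used values of roughly 1.5 and 0.75.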
### Key Features of BM25

- **Term frequency (TF):** Query tokens that appear more often in a document raise its score, with diminishing returns as the count grows.
- **Inverse document frequency (IDF):** Tokens that are rare across the corpus but present in the query receive higher weights.
- **Document length normalization:** Longer documents are penalized slightly to prevent bias toward longer texts.

## BM25 + Lightweight AI Wrappers

Combining BM25 with a small language model creates a highly efficient retrieval-augmented system. Using models like Gemma-2B or TinyLlama, you gain:

- **Speed:** BM25 runs entirely on CPU, reducing computational overhead.
- **Simplicity:** No additional training is required, making it easier to implement.
- **Cost-effectiveness:** Fewer resources are needed than with embedding-based models.
- **Explainability:** A clearer understanding of how and why documents are retrieved.

## Case Study: Perplexity AI

Perplexity's CEO has highlighted the company's reliance on traditional retrieval methods, including BM25, alongside modern techniques, noting the difficulty of packing comprehensive knowledge into a single vector space and emphasizing a balanced approach:

> "It’s not purely vector space. It’s not like once the content is fetched, there is some BERT model that runs on all of it and puts it into a big giant vector database, which you retrieve from, it’s not like that. Because vector embeddings are not magically working for text; they struggle with understanding what’s a relevant document to a particular query."

## Implementation: Lightweight Retrieval with BM25

### Step 1: Install Libraries

To get started, install the necessary libraries:

```bash
pip -q install whoosh rank_bm25 sentence-transformers transformers accelerate optimum --upgrade
```

### Step 2: Dataset Overview

The Wikipedia Structured Contents dataset from Kaggle was used for this implementation.
It contains JSONL files with article titles, abstracts, and infoboxes, making it ideal for keyword-driven retrieval.

### Step 3: In-Memory BM25 with rank_bm25

Tokenize the corpus:

```python
corpus = docs  # list of "title. abstract" strings
tokenized = [doc.lower().split() for doc in corpus]
```

Initialize BM25:

```python
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(tokenized)
```

Search function:

```python
def search_bm25(query, k=5):
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    top_idx = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [(titles[i], abstracts[i]) for i in top_idx]
```

### Step 4: Persistent BM25 with Whoosh

Define the schema with field boosts:

```python
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT

schema = Schema(
    title=TEXT(stored=True, field_boost=2.0),  # title matches count double
    abstract=TEXT(stored=True),
)
```

Create or clear the index directory:

```python
import os, shutil

if os.path.exists("indexdir"):
    shutil.rmtree("indexdir")
os.mkdir("indexdir")
ix = create_in("indexdir", schema)
```

Index the documents:

```python
writer = ix.writer()
for t, a in zip(titles, abstracts):
    writer.add_document(title=t, abstract=a)
writer.commit()
```

Search function:

```python
from whoosh.qparser import MultifieldParser

def search_whoosh(query, k=5):
    ix = open_dir("indexdir")
    with ix.searcher() as searcher:
        parser = MultifieldParser(["title", "abstract"], schema=ix.schema)
        q = parser.parse(query)
        res = searcher.search(q, limit=k)
        return [(r["title"], r["abstract"]) for r in res]
```

## Generating Answers: BM25 with a Small LLM

Once the retrieval system is in place, you can integrate it with a lightweight LLM to generate answers.
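Before wiring in a model, it's worth noting that turning JSONL records like those in Step 2 into the `titles` and `abstracts` lists used above takes only the standard library. The sketch below uses an in-memory `StringIO` stand-in for a file, and the field names (`title`, `abstract`) are assumptions; check them against the actual dataset schema.

```python
import io
import json

# stand-in for a real JSONL file; each line is one JSON object
sample = io.StringIO(
    '{"title": "BM25", "abstract": "A ranking function."}\n'
    '{"title": "Cosine similarity", "abstract": "A vector metric."}\n'
)

def load_jsonl(lines):
    """Collect titles and abstracts from an iterable of JSONL lines."""
    titles, abstracts = [], []
    for line in lines:
        rec = json.loads(line)
        # field names are assumptions; adjust to the dataset's schema
        titles.append(rec.get("title", ""))
        abstracts.append(rec.get("abstract", ""))
    return titles, abstracts

titles, abstracts = load_jsonl(sample)
docs = [f"{t}. {a}" for t, a in zip(titles, abstracts)]
```

With a real dump, pass `open(path, encoding="utf-8")` instead of the `StringIO` stand-in; the function streams line by line, so large files are handled without loading everything at once.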
Here's an example using the Gemma-2B model.

Load the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("google/gemma-2b-it")
mod = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it").eval()
```

RAG answer function:

```python
def generate_answer(query):
    contexts = search_bm25(query)  # or search_whoosh(query)
    ctx_text = "\n".join(f"{t}: {a}" for t, a in contexts)
    prompt = f"Answer using context:\n{ctx_text}\nQuestion: {query}\nAnswer:"
    inp = tok(prompt, return_tensors="pt", truncation=True, max_length=4096)
    out = mod.generate(**inp, max_new_tokens=150)
    return tok.decode(out[0], skip_special_tokens=True)
```

In this setup, retrieval itself takes only milliseconds on CPU, so end-to-end latency is dominated by the generation step, combining the efficiency of BM25 with the flexibility of small LLMs.

## Industry Insights and Company Profiles

Industry insiders such as Perplexity AI emphasize the practical benefits of using BM25 alongside more advanced models. By combining traditional search algorithms with contemporary AI, companies can balance accuracy, speed, and cost-effectiveness. This hybrid approach is particularly valuable in structured domains where users seek precise, factual information.

BM25's lightweight nature and CPU compatibility make it attractive for developers who want to enhance their AI applications without significant hardware investment. Search systems like Elasticsearch and Apache Lucene have long relied on BM25, proving its reliability and effectiveness at scale.

In summary, BM25 offers a robust, fast, and affordable alternative to embedding-based retrieval, especially for tasks requiring explainable, up-to-date results. Pairing BM25 with small LLMs can significantly improve the performance of GenAI agents, making them more versatile and user-friendly.
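As a closing sketch of the hybrid approach mentioned above: once each document has both a BM25 score and an embedding-similarity score, the two can be fused with a weighted sum after min-max normalization. The weights and score values below are purely illustrative.

```python
def minmax(xs):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0 for _ in xs] if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def hybrid_rank(bm25_scores, cosine_scores, alpha=0.5):
    """Blend normalized BM25 and cosine scores; alpha weights BM25."""
    b, c = minmax(bm25_scores), minmax(cosine_scores)
    fused = [alpha * bi + (1 - alpha) * ci for bi, ci in zip(b, c)]
    # return document indices, best first
    return sorted(range(len(fused)), key=lambda i: -fused[i])

# illustrative scores for three documents
order = hybrid_rank([12.0, 3.0, 0.0], [0.2, 0.9, 0.1], alpha=0.5)
print(order)
```

Normalization matters here because raw BM25 scores are unbounded while cosine similarities live in a fixed range; without it, one signal would dominate regardless of `alpha`.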
