# Upgrade Your RAG App to RAG 2.0: Build a Smarter, Conversational AI Chatbot with Memory, Better Chunking, and Search Control
Welcome back! In Part 1, you built your first Retrieval-Augmented Generation (RAG) app that answered questions from a document. But real users want more than one-off responses: they expect natural conversations, context awareness, and accurate, coherent replies. Today, you'll upgrade your RAG app to RAG 2.0, transforming it into a smart, chat-friendly assistant.

By the end of this tutorial, you'll have a Streamlit-based RAG chatbot with these features:

- Smarter document chunking using `RecursiveCharacterTextSplitter`
- A larger, more conversational language model (Flan-T5-large)
- Chat memory to maintain context across multiple turns
- A search interface that lets users see and refine retrieval results

## Quick Recap: What Is a RAG App?

A RAG application combines two core components:

1. Retrieval – finds the most relevant pieces of information from a document store
2. Generation – uses a large language model to craft clear, accurate answers based on the retrieved content

This is a solid foundation, but it can be significantly improved.

## Step 1: Prepare the Document

Create a folder named `data` and save the following text as `essay.txt`:

```text
In the early days of a startup, speed and iteration matter more than elegance or scale. Founders are constantly experimenting, often pivoting from one idea to another based on user feedback. Success depends less on building the perfect product and more on discovering what users truly need.

Many of today's biggest tech companies started as something very different from what they are now. Twitter began as a podcast platform. Instagram started as a location check-in app. What made them successful was not their original vision, but their willingness to adapt quickly.

The lesson: the key to startup success is not perfection, but learning, listening, and iterating fast.
```
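Before wiring anything up, it helps to see on this very essay why chunk boundaries matter. Below is a dependency-free sketch: `naive_chunks` and `boundary_chunks` are hand-rolled, hypothetical stand-ins for `CharacterTextSplitter` and `RecursiveCharacterTextSplitter`, not LangChain APIs, and the sketch assumes no single word exceeds the chunk size.

```python
def naive_chunks(text, size=100):
    # Split strictly every `size` characters: this can cut words in half.
    return [text[i:i + size] for i in range(0, len(text), size)]

def boundary_chunks(text, size=100):
    # Greedily pack whole words into chunks of at most `size` characters,
    # mimicking (at the word level) what a recursive splitter does.
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= size:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

# An excerpt of the essay from Step 1:
essay = ("Twitter began as a podcast platform. Instagram started as a "
         "location check-in app. What made them successful was not their "
         "original vision, but their willingness to adapt quickly.")

for chunk in naive_chunks(essay, 60):
    print(repr(chunk))   # some chunks end mid-word
print("---")
for chunk in boundary_chunks(essay, 60):
    print(repr(chunk))   # every chunk ends on a whole word
```

Running this side by side makes the retrieval problem concrete: the naive splitter hands the embedding model fragments like `"Instagram started a"`, while the boundary-aware splitter keeps each chunk readable on its own.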
## Step 2: Install Dependencies

Run this command in your terminal:

```bash
pip install streamlit langchain faiss-cpu transformers sentence-transformers
```

## Step 3: Build the RAG 2.0 App

### 3.1 Import Required Libraries

Import the necessary modules, including Streamlit, the LangChain components, the Hugging Face pipeline classes, the text splitters, and Python's built-in `re` module (used by the search feature below).

### 3.2 Fix the Chunking Logic

Earlier, you used `CharacterTextSplitter`, which splits text strictly by character count. This often breaks words mid-way, leading to incomplete or confusing chunks.

Example issue:

- Chunk 1 ends with: "Instagram started a"
- Chunk 2 starts with: "started as a check-in app" (what started?)

Solution: use `RecursiveCharacterTextSplitter`. It splits text at natural boundaries like paragraphs, sentences, and words before falling back to raw characters.

Replace the old splitter with:

```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
```

This keeps each chunk semantically coherent and improves retrieval accuracy.

### 3.3 Use a Larger, Conversational Model

Switch from `flan-t5-base` to `flan-t5-large` for better response quality and conversational fluency. The larger model handles context and tone more naturally.

Load the model and tokenizer:

```python
model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
llm = HuggingFacePipeline(pipeline=pipe)
```

### 3.4 Customize the Prompt Template

Improve answer quality with a tailored prompt that guides the model to be helpful, clear, and context-aware.

```python
custom_prompt = PromptTemplate.from_template("""
You are a helpful assistant. Use the context to answer the user's question.
If the context is unclear, respond gracefully. Be clear and complete.

{context}

Chat History:
{chat_history}

Question: {question}

Helpful Answer:
""")
```

### 3.5 Add Chat Memory

Without memory, the bot treats every query as isolated. Use `ConversationBufferMemory` to store past interactions and pass them back into the prompt.

```python
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
```

Then pass it into the RAG chain:

```python
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vector_store.as_retriever(),
    memory=memory,
    combine_docs_chain_kwargs={"prompt": custom_prompt},
)
```

Now, when users ask follow-up questions like "and what about Twitter?", the model understands the context and responds correctly.

### 3.6 Add Search Functionality

Let users see what's being retrieved. Add a sidebar search feature that shows the top matching documents with scores.

```python
search_term = st.sidebar.text_input("Enter keyword or phrase")
if search_term:
    results = vector_store.similarity_search_with_score(search_term, k=3)
    for i, (doc, score) in enumerate(results, 1):
        # Wrap each match in ** so it renders bold in the sidebar.
        highlighted = re.sub(
            f"({re.escape(search_term)})", r"**\1**",
            doc.page_content, flags=re.IGNORECASE,
        )
        st.sidebar.markdown(f"**Match {i}** (Score: {score:.4f}):")
        st.sidebar.write(highlighted.strip())
```

Lower scores mean higher similarity. This gives users control and transparency.

## Step 4: Full Streamlit App Code

Save the full code in `app.py` as shown in the article. Run it with:

```bash
streamlit run app.py
```

You'll get a fully functional chatbot with memory, search, and natural conversation.

## Final Tips

- If answers seem off, check the chunking, test retrieval separately, or refine the prompt.
- Use `retriever.get_relevant_documents("query")` to debug what's being retrieved.
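To demystify the memory step, here is a dependency-free sketch of what `ConversationBufferMemory` effectively does for this chain: it accumulates past turns and renders them into the `{chat_history}` slot of the prompt. `BufferMemory`, `save_context`'s signature, and `as_history` are hypothetical stand-ins for illustration, not LangChain APIs.

```python
class BufferMemory:
    """Tiny stand-in for ConversationBufferMemory: stores turns verbatim."""
    def __init__(self):
        self.turns = []          # list of (question, answer) pairs

    def save_context(self, question, answer):
        self.turns.append((question, answer))

    def as_history(self):
        # Render every stored turn in the format the prompt expects.
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)

PROMPT = """You are a helpful assistant. Use the context to answer.

{context}

Chat History:
{chat_history}

Question: {question}

Helpful Answer:"""

memory = BufferMemory()
memory.save_context("What did Instagram start as?", "A location check-in app.")

filled = PROMPT.format(
    context=("Instagram started as a location check-in app. "
             "Twitter began as a podcast platform."),
    chat_history=memory.as_history(),
    question="and what about Twitter?",
)
print(filled)
```

The filled prompt now carries the Instagram exchange, which is exactly why the follow-up "and what about Twitter?" makes sense to the model instead of arriving as an isolated fragment.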
## RAG 1.0 vs RAG 2.0

| Feature | RAG 1.0 | RAG 2.0 |
|---------|---------|---------|
| Chunking | Character-based, breaks words | Recursive, preserves meaning |
| Model | Smaller, less conversational | Larger, better for dialogue |
| Memory | None | Chat history preserved |
| User Control | No search | See top matches, refine queries |

## An Analogy to Remember RAG

Think of RAG like a student studying for an exam:

- Retrieval = finding the right textbook pages
- Generation = writing clear, accurate answers
- Memory = remembering earlier questions
- Search = looking up specific topics quickly

With just a few enhancements, you've turned a basic RAG demo into a powerful, user-friendly AI assistant: 10x smarter and ready for real-world use.
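One closing footnote on the search scores from Step 3.6: the "lower scores mean higher similarity" behavior comes from the vector store reporting a distance rather than a similarity (LangChain's FAISS wrapper defaults to an L2 index). A dependency-free sketch with toy three-dimensional "embeddings" makes the ranking rule concrete; `l2_distance` is a hand-rolled illustration, not the FAISS API.

```python
def l2_distance(a, b):
    # Squared Euclidean distance between two vectors:
    # 0.0 means identical; larger values mean less similar.
    return sum((x - y) ** 2 for x, y in zip(a, b))

query     = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]   # topically similar "chunk"
doc_far   = [0.0, 0.1, 0.9]   # unrelated "chunk"

# Rank documents ascending by distance, as similarity search does.
scores = sorted(
    [("close", l2_distance(query, doc_close)),
     ("far", l2_distance(query, doc_far))],
    key=lambda pair: pair[1],
)
print(scores[0][0])  # the lower-scoring document ranks first
```

So when the sidebar shows `Match 1 (Score: 0.3120)` above `Match 2 (Score: 1.0457)`, Match 1 is the better hit, not the worse one.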