Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM responses by grounding the model in external knowledge sources that supplement its internal representation of information. RAG uses facts obtained from external sources to improve the accuracy and reliability of generative AI models: it directs the large language model to reference an authoritative knowledge base outside its training data before generating a response. Implementing RAG in an LLM-based question-answering system has two main benefits: (1) it ensures that the model has access to the latest, reliable facts, and (2) it gives users access to the model's sources, so the accuracy and veracity of its claims can be checked and ultimately trusted.
RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned and its internal knowledge can be modified efficiently without retraining the entire model.
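As a rough illustration of how these two components fit together, the following Python sketch pairs a deliberately naive keyword-overlap retriever with a placeholder generator. The `KNOWLEDGE_BASE` list, the `retrieve` ranking, and the canned `generate` function are toy stand-ins introduced here for illustration, not a real search index or LLM API.

```python
# Toy illustration of the two RAG components: a retriever and a text generator.
# Both are simplistic stand-ins for a real vector search index and a real LLM.

KNOWLEDGE_BASE = [
    "Employees accrue 1.5 days of annual leave per month of service.",
    "Remote work requests must be approved by a direct manager.",
]

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query."""
    words = set(query.lower().split())
    return sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )[:top_k]

def generate(prompt: str) -> str:
    """Placeholder for the generator; a real system would call an LLM API here."""
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Use the context to answer the question.\nContext:\n{context}\nQuestion: {query}"
    return generate(prompt)

print(answer("How much annual leave do I earn per month?"))
```

In a production system, `retrieve` would query a vector database and `generate` would call a hosted model, but the composition of retrieval followed by generation stays the same.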
Benefits of Retrieval-Augmented Generation
RAG technology brings several benefits to an organization’s generative AI efforts.
- Cost-effective implementation: Chatbot development typically starts with a foundation model. A foundation model (FM) is an API-accessible LLM trained on a broad range of generalized, unlabeled data. Retraining an FM on organizational or domain-specific information is computationally and financially expensive. RAG is a more cost-effective way to introduce new data into the LLM, making generative AI techniques more widely accessible and usable.
- Provide the latest information: Even if the original training data for an LLM suits user needs, keeping that data relevant is a challenge. RAG allows developers to feed generative models with the latest research, statistics, or news. They can use RAG to connect an LLM directly to a real-time social media feed, news site, or other frequently updated information source, so the LLM can provide up-to-date information to users.
- Enhance user trust: RAG allows the LLM to present accurate information with source attribution. Output can include citations or references to the sources, and users can look up the source documents themselves if they need further clarification or more detail. This increases trust and confidence in generative AI solutions.
- More control for developers: With RAG, developers can test and improve their chat applications more efficiently. They can control and change the LLM's information sources to adapt to changing needs or cross-functional use. Developers can also restrict retrieval of sensitive information to the appropriate authorization levels and ensure that the LLM generates appropriate responses. In addition, they can troubleshoot and fix cases where the LLM references the wrong information source for a specific question. Organizations can therefore implement generative AI technologies more confidently across a wider range of applications.
Retrieval-Augmented Generation Workflow
Without RAG, the LLM takes user input and creates responses based on the information it was trained on or what it already knows. RAG introduces an information retrieval component that leverages user input to first extract information from new data sources. Both the user query and the related information are provided to the LLM. The LLM uses the new knowledge and its training data to create better responses. The following sections outline the process.
- Creating external data: New data outside the LLM's original training dataset is called external data. It can come from multiple data sources, such as APIs, databases, or document repositories, and may exist in various formats, such as files, database records, or long-form text. Another AI technique, embedding language models, converts the data into numerical vector representations and stores them in a vector database. This process creates a knowledge base that the generative AI model can understand. (The Python sketch after this list illustrates these steps with toy stand-ins.)
- Retrieve relevant information: The next step is to perform a relevance search. The user query is converted into a vector representation and matched against the vector database. For example, consider an intelligent chatbot that answers an organization's HR questions. If an employee searches for "How much annual leave do I have?", the system retrieves the annual leave policy documents along with the employee's own past leave records. These specific documents are returned because they are highly relevant to what the employee entered. Relevance is calculated and established using mathematical vector representations and similarity calculations.
- Augment the LLM prompt: Next, the RAG model augments the user input (or prompt) by adding the retrieved relevant data as context. This step uses prompt engineering techniques to communicate effectively with the LLM. The augmented prompt allows the large language model to generate accurate answers to the user's query.
- Updating external data: A natural follow-up question is: what if the external data becomes out of date? To keep the information available for retrieval current, update the documents and their embedding representations asynchronously. You can do this through automated real-time processes or periodic batch processing. This is a common challenge in data analytics, and change management can be handled with various data science methods.
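Putting the four steps together, here is a minimal, self-contained Python sketch of the workflow. It uses a toy bag-of-words "embedding" in place of a real embedding model and an in-memory dictionary in place of a vector database; the names (`index_document`, `retrieve`, `build_prompt`, `update_document`) are illustrative assumptions, not any actual RAG library's API.

```python
# Sketch of the four workflow steps with a toy bag-of-words "embedding" standing in
# for a real embedding model, and a plain dict standing in for a vector database.
import math
from collections import Counter

# --- Step 1: create external data as vectors and store them ---
vector_store: dict[str, tuple[dict[str, float], str]] = {}  # doc_id -> (embedding, text)

def embed(text: str) -> dict[str, float]:
    """Toy embedding: normalized word-count vector (a real system uses an embedding model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}

def index_document(doc_id: str, text: str) -> None:
    vector_store[doc_id] = (embed(text), text)

# --- Step 2: retrieve relevant information by vector similarity ---
def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(a[w] * b[w] for w in a.keys() & b.keys())

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(vector_store.values(), key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]

# --- Step 3: augment the prompt with the retrieved context ---
def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

# --- Step 4: update external data by re-indexing changed documents ---
def update_document(doc_id: str, new_text: str) -> None:
    index_document(doc_id, new_text)  # overwrites both the stored text and its embedding

# Example usage with the HR scenario described above
index_document("leave-policy", "Annual leave policy: employees receive 20 days of annual leave per year.")
index_document("expense-policy", "Expenses must be submitted within 30 days with receipts.")
query = "How much annual leave do I have?"
prompt = build_prompt(query, retrieve(query))
print(prompt)  # this augmented prompt would then be sent to the LLM
```

The printed prompt is what would be sent to the LLM in step 3; swapping in a real embedding model and vector store changes the `embed` and storage details but not the overall flow.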
The following diagram shows the conceptual flow of using RAG with LLM:

Image source: aws.amazon
References
[1] https://aws.amazon.com/cn/what-is/retrieval-augmented-generation/?nc1=h_ls