Happy Monday!
In this edition I will cover a technique for customising large language models (LLMs) known as Retrieval-Augmented Generation (RAG). First introduced by the Meta AI team, RAG enables developers to give an off-the-shelf LLM access to a novel knowledge base at scale, without fine-tuning.
In the wild, LLMs suffer from two main problems: a lack of up-to-date knowledge (or of any subject outside their initial training data), and a tendency to hallucinate. To address these issues, it is important to provide the LLM with more information relevant to its actual usage in a particular context. In part, RAG can be understood as an example of few-shot learning, where a prompt is augmented with additional context to help the LLM provide richer and more relevant answers. RAG is often compared with fine-tuning as alternative approaches to customising LLMs; RAG is usually the better fit for most use cases, since with RAG the size of the available training data is not a limiting factor. (Few-shot learning vs fine-tuning was covered in a previous edition of this newsletter.)
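To make the idea of prompt augmentation concrete, here is a minimal Python sketch; the build_augmented_prompt helper and the example passages are purely illustrative and not part of any particular framework.

```python
# Hypothetical sketch of prompt augmentation: retrieved passages are stitched
# into the prompt ahead of the user's question before it is sent to the LLM.

def build_augmented_prompt(question, passages):
    """Prepend retrieved passages to the question so the LLM answers in context."""
    context = "\n\n".join(f"- {p}" for p in passages)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Illustrative passages standing in for documents returned by a retriever.
passages = [
    "The Q3 pricing policy introduced a 10% discount for annual plans.",
    "Monthly plan prices were unchanged in Q3.",
]
print(build_augmented_prompt("What changed in the Q3 pricing policy?", passages))
```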
The RAG architecture is composed of two main parts, the retriever and the generator - the generator includes the LLM. Thinking in terms of an LLM dedicated to document retrieval and summarisation, the retriever would contain the embedding model, which encodes the original input text data, and the vector DB, which stores the document embeddings. User queries are in turn encoded by the embedding model and used to return the subset of most closely matching document vectors. The documents behind those vectors are combined with the user query to form the final prompt passed to the LLM. Popular frameworks like LangChain support RAG approaches to LLM customisation.
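For a rough picture of how the retriever and generator fit together (this is not LangChain code), the following self-contained sketch uses a toy hash-based bag-of-words embedding in place of a real embedding model, an in-memory list in place of a vector DB, and a stubbed generate function in place of a real LLM call; all of these names are assumptions made for illustration.

```python
import numpy as np

# Minimal end-to-end RAG sketch. The "embedding model" is a toy bag-of-words
# hash embedding and the "LLM" is a stub; both stand in for real components.

VOCAB_DIM = 512

def embed(text: str) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size bag-of-words vector."""
    vec = np.zeros(VOCAB_DIM)
    for token in text.lower().split():
        vec[hash(token) % VOCAB_DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class VectorStore:
    """In-memory stand-in for a vector DB: stores documents and their embeddings."""
    def __init__(self):
        self.docs, self.vectors = [], []

    def add(self, doc: str):
        self.docs.append(doc)
        self.vectors.append(embed(doc))

    def search(self, query: str, k: int = 2):
        """Return the k documents whose embeddings are closest to the query."""
        q = embed(query)
        scores = [float(q @ v) for v in self.vectors]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.docs[i] for i in top]

def generate(prompt: str) -> str:
    """Stub generator: a real system would call an off-the-shelf LLM here."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"

# Retriever: index documents, then fetch the closest matches for a query.
store = VectorStore()
for doc in [
    "RAG was introduced by the Meta AI team.",
    "Fine-tuning updates model weights on new data.",
    "A vector DB stores document embeddings for similarity search.",
]:
    store.add(doc)

query = "How does RAG differ from fine-tuning?"
context = "\n".join(store.search(query))

# Generator: combine retrieved documents with the user query into one prompt.
print(generate(f"Context:\n{context}\n\nQuestion: {query}"))
```

In a production setting the toy pieces would be swapped for a real embedding model, a dedicated vector DB, and an actual LLM endpoint, but the data flow - encode, search, combine, generate - stays the same.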
This is of course an oversimplification: a successful implementation will require additional effort around prompt construction and cleaning, management of the vector DB, and possibly tuning of the embedding model, amongst other things.
Thanks for reading!