Retrieval Augmented Generation in very few lines of code
Using Sentence Transformers for embeddings and Chroma vector database
Retrieval Augmented Generation (RAG) has received a lot of attention recently, many tools related to this technique flooded the AI ecosystem, being the most prominent LangChain and LlamaIndex. So, what is all this fuss about?
At its core, RAG, is a simple technique. The aforementioned projects are toolkits that provide or integrate to different services, data sources and user interfaces; they let you prototype quickly and might not even need them for production.
Let’s do some retrieval augmented generation in ~25 lines of code (or less!). The basic steps are:
- Embed and store documents.
- Query and retrieve documents.
- Generate a LLM response using the query and the retrieved documents.
Embeddings
I used multilingual-e5-large
Hugging Face’s Sentence Transformer model for encoding documents into embeddings. I needed a multilingual model which I could run locally, so I took a decision based on the MTEB Leaderboard. There you will see that the popular paid text-embedding-ada-002
model from OpenAI is not among the first positions.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-large')
This particular model accepts an input of 512 tokens at most, the tokenizer producing an average of 0.26 tokens/char, a number to take into account. When a document exceeds this limit is should be split into smaller chunks, to be treated independently or averaged somehow.
Vector database
There are plenty of options nowadays. After some research I leaned towards Chroma because I wanted a simple in-memory database that I could run from a notebook.
# pip install chromadb
import chromadb
client = chromadb.Client()
By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2
model to create embeddings. This should be fine for most situations. It also provides other Sentence Transformers and Instructor Embedding models, API wrappers around OpenAI, Cohere, and Hugging Face, plus custom embedding functions — which I used for flexibility.
To create the following function I relied on the usage instructions at the model card, mostly copy & paste. You can skip this block completely, as has been said, Chroma ships with a default embedding function.
from chromadb import EmbeddingFunction
# the model returns many hidden states per document so we must aggregate them
def average_pool(last_hidden_states, attention_mask):
last_hidden = last_hidden_states.masked_fill(~attention_mask[...,None].bool(), 0.0)
return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[...,None]
class CustomHuggingFace(EmbeddingFunction):
def __call__(self, texts):
queries = [f'query: {text}' for text in texts] # multilingual-e5-large requirement
batch_dict = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
return embeddings.tolist()
Adding data
Chroma lets you manage collections of embeddings. Collections are created with a name and an optional embedding function.
collection = client.get_or_create_collection(
name='test',
embedding_function=CustomHuggingFace(),
metadata={'hnsw:space': 'cosine'}
)
Distance function
The optional metadata
argument can be used to customize the distance method of the embedding space by setting the value of hnsw:space
. Valid options for are l2, ip (inner product), or cosine.
If Chroma is passed a list of documents
, it will automatically embed them with the collection's embedding function. Each document must have a unique associated id
.
It might be overwhelming to embed your documents all at once, it is advisable to do it by batches or one-by-one.
for i, doc in enumerate(documents):
collection.add(documents=[doc], ids=[str(i)] )
Querying
Chroma will first embed each query_text
with the collection's embedding function, and then perform the query with the generated embedding.
question = 'How long does it take to get to Mars?'
results = collection.query(
query_texts=[question],
n_results=5,
)['documents'][0]
Generation
The last step is to get the output of a LLM using the retrieved documents as context. I used ChatGPT for the occasion. Note that the prompt is composed by the context (as a system message) and the same question used to query those documents (user message).
import openai
openai.api_key = 'your API key here'
context = '\n'.join(results)
response = openai.ChatCompletion.create(
model='gpt-3.5-turbo-0613',
messages=[
{'role': 'system', 'content': context},
{'role': 'user', 'content': question},
]
)
print(response['choices'][0]['message']['content'])
Conclusion
We saw elemental steps to perform RAG and implemented them in very few lines of code. Up from here, there is a lot of room for improvement depending on what you need to do, such as embedding other kinds of documents (PDFs, SQL, …), developing agents to chat with your knowledge base, and more!
In the end it is all about dynamically providing useful context, based on the user’s query, to the LLM prompt.