Retrieval Augmented Generation in very few lines of code

Using Sentence Transformers for embeddings and Chroma vector database

Matías Battocchia
3 min read · Sep 7, 2023

Retrieval Augmented Generation (RAG) has received a lot of attention recently, and many tools related to this technique have flooded the AI ecosystem, the most prominent being LangChain and LlamaIndex. So, what is all this fuss about?

At its core, RAG is a simple technique. The aforementioned projects are toolkits that provide or integrate with different services, data sources and user interfaces; they let you prototype quickly, and you might not even need them in production.

Let’s do some retrieval augmented generation in ~25 lines of code (or less!). The basic steps are:

  1. Embed and store documents.
  2. Query and retrieve documents.
  3. Generate an LLM response using the query and the retrieved documents.

RAG basic steps. Image by author.

Embeddings

I used multilingual-e5-large, a Sentence Transformers model from Hugging Face, for encoding documents into embeddings. I needed a multilingual model that I could run locally, so I based my decision on the MTEB Leaderboard. There you will see that the popular paid text-embedding-ada-002 model from OpenAI is not among the first positions.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-large')

This particular model accepts an input of 512 tokens at most, and the tokenizer produces an average of 0.26 tokens per character, a number to take into account. When a document exceeds this limit it should be split into smaller chunks, to be treated independently or somehow averaged.
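
As an illustration (my sketch, not from the original), here is a minimal character-based splitter, assuming the ~0.26 tokens/char estimate above, which puts the limit at roughly 512 / 0.26 ≈ 1900 characters per chunk. Real pipelines usually split on sentence or paragraph boundaries instead.

# Naive fixed-size splitter; 512 tokens / 0.26 tokens per char ≈ 1900 chars.
# The overlap keeps some shared context between consecutive chunks.
MAX_CHARS = 1900

def split_document(text, max_chars=MAX_CHARS, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks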

Vector database

There are plenty of options nowadays. After some research I leaned towards Chroma because I wanted a simple in-memory database that I could run from a notebook.

# pip install chromadb

import chromadb

client = chromadb.Client()

By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings. This should be fine for most situations. It also provides other Sentence Transformers and Instructor Embedding models, API wrappers around OpenAI, Cohere, and Hugging Face, plus custom embedding functions — which I used for flexibility.
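
For reference, if the default model is enough for you, no custom code is needed at all; a minimal sketch (my example, with a hypothetical collection name) would already embed documents with all-MiniLM-L6-v2:

# No embedding_function given: Chroma falls back to its default
# Sentence Transformers model (all-MiniLM-L6-v2).
default_collection = client.get_or_create_collection(name='default-test')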

To create the following function I relied on the usage instructions in the model card, mostly copy & paste. You can skip this block completely; as mentioned, Chroma ships with a default embedding function.

import torch
from chromadb import EmbeddingFunction

# the model returns one hidden state per token, so we average them into a
# single vector per document, ignoring padding positions
def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

class CustomHuggingFace(EmbeddingFunction):
    def __call__(self, texts):
        # multilingual-e5-large expects a 'query: ' (or 'passage: ') prefix
        queries = [f'query: {text}' for text in texts]

        batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**batch_dict)

        embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return embeddings.tolist()
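
Optionally, as a quick sanity check (not in the original), you can call the function directly on a couple of texts; for multilingual-e5-large each vector should have 1024 dimensions.

# Two short texts in, two embedding vectors out.
embedder = CustomHuggingFace()
vectors = embedder(['hello world', 'hola mundo'])
print(len(vectors), len(vectors[0]))  # expect: 2 1024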

Adding data

Chroma lets you manage collections of embeddings. Collections are created with a name and an optional embedding function.

collection = client.get_or_create_collection(
    name='test',
    embedding_function=CustomHuggingFace(),
    metadata={'hnsw:space': 'cosine'},
)

Distance function

The optional metadata argument can be used to customize the distance method of the embedding space by setting the value of hnsw:space. Valid options are l2, ip (inner product), or cosine.

If Chroma is passed a list of documents, it will automatically embed them with the collection's embedding function. Each document must have a unique associated id.

Embedding all of your documents at once might be too much to handle; it is advisable to do it in batches or one by one.

for i, doc in enumerate(documents):
    collection.add(documents=[doc], ids=[str(i)])
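
If one by one is too slow, a batched variant could look like this sketch (the batch size is my arbitrary choice, not from the original):

# Add documents in batches of 32, keeping the same string ids as above.
BATCH = 32

for start in range(0, len(documents), BATCH):
    batch = documents[start:start + BATCH]
    collection.add(
        documents=batch,
        ids=[str(start + j) for j in range(len(batch))],
    )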

Querying

Chroma will first embed each query_text with the collection's embedding function, and then perform the query with the generated embedding.

question = 'How long does it take to get to Mars?'

results = collection.query(
    query_texts=[question],
    n_results=5,
)['documents'][0]
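
The query call returns more than documents: besides 'documents', the result dict should also carry 'ids' and 'distances' (one list per query text), which is handy for checking how close the matches really are. A small sketch:

# Inspect the ids and cosine distances of the retrieved documents.
raw = collection.query(query_texts=[question], n_results=5)

for doc_id, distance in zip(raw['ids'][0], raw['distances'][0]):
    print(doc_id, round(distance, 3))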

Generation

The last step is to get the output of an LLM using the retrieved documents as context. I used ChatGPT for the occasion. Note that the prompt is composed of the context (as a system message) and the same question used to query those documents (as a user message).

import openai

openai.api_key = 'your API key here'

context = '\n'.join(results)

# note: this uses the legacy openai<1.0 ChatCompletion interface, current at the time of writing
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0613',
    messages=[
        {'role': 'system', 'content': context},
        {'role': 'user', 'content': question},
    ]
)

print(response['choices'][0]['message']['content'])
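
Here the system message is just the concatenated documents. A common variation, not part of the original snippet, is to wrap the context in an explicit instruction so the model stays grounded in it:

# Variation: prepend an instruction to the retrieved context.
system_prompt = (
    'Answer the question using only the context below. '
    "If the answer is not in the context, say you don't know.\n\n"
    f'Context:\n{context}'
)

response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo-0613',
    messages=[
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': question},
    ]
)

print(response['choices'][0]['message']['content'])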

Conclusion

We saw the elemental steps to perform RAG and implemented them in very few lines of code. From here, there is a lot of room for improvement depending on what you need to do, such as embedding other kinds of documents (PDFs, SQL, …), developing agents to chat with your knowledge base, and more!

In the end it is all about dynamically providing useful context, based on the user’s query, to the LLM prompt.
