Meta AI just released Llama 3.1, one of the most capable openly available LLMs to date, with claims that it outperforms GPT-4o and Anthropic’s Claude 3.5 Sonnet on various benchmarks. The models come as a collection of 8B, 70B, and 405B parameter sizes, with 8B best suited to limited computational resources (like local deployment) and 405B setting new standards for AI.
The three models can be summarised as follows:
As we are building a local deployment, we will be using Llama 3.1 8B. As explained, this model excels at text summarisation, classification and language translation...but what about question answering?
I created a simple, local RAG demo with Llama 3.1 8B and Marqo. The project allows you to run a Knowledge Question and Answering System locally and is based on the original Marqo & Llama repo by Owen Elliot. The code for this article can be found on GitHub here.
To begin building a local RAG Q&A, we need both the frontend and backend components. This section provides information about the overall project structure and the key features included. If you're eager to start building, jump to the Setup and Installation section.
The frontend needs the following sections:
Tying this all together:
Great! Now that the frontend is established, the next (and most important) part is the RAG component.
In this demo, I used 8B parameter Llama 3.1 GGUF models to allow for smooth local deployment. If you have 16GB of RAM, I would highly recommend starting with 8B parameter models. This project has been set up to work with other GGUF models too. Feel free to experiment!
You can obtain Llama 3.1 GGUF models from lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF on the Hugging Face Hub. There are several models you can download from there but I recommend starting with Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf. The demo video in the GitHub repository uses Q2_K.
Be aware that the llama.cpp library currently needs some adjustments to work with Llama 3.1 (see this issue). We do not need to do anything on our side, but be cautious as this may result in some interesting responses!
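For reference, the backend loads these GGUF files through the llama-cpp-python bindings. A minimal sketch of loading and prompting the model on its own looks something like the following; the model path assumes the backend/models/8B/ layout described later in this article, and the parameters are just illustrative:

# Minimal sketch: loading a Llama 3.1 GGUF file with llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="backend/models/8B/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,  # context window size
)

output = llm(
    "Q: What is retrieval augmented generation? A:",
    max_tokens=128,
    stop=["Q:"],  # stop when the model starts a new question
)
print(output["choices"][0]["text"])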
A huge part of improving Llama 3.1 8B is by adding knowledge to its prompts. This is also known as Retrieval Augmented Generation (RAG). In order to do this, we need to store this knowledge somewhere. That’s where Marqo comes in!
Marqo is more than just a vector database; it's an end-to-end vector search engine for both text and images. Vector generation, storage and retrieval are handled out of the box through a single API. No need to bring your own embeddings! This full-fledged vector search solution makes storing documents straightforward and made the backend component of this project really easy to build.
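To give a feel for how little code this takes, here's a minimal sketch using the Marqo Python client. The index name matches the one this project uses (knowledge-management), while the "text" field name is just illustrative:

import marqo

# Connect to the locally running Marqo instance (started later via Docker on port 8882)
mq = marqo.Client(url="http://localhost:8882")

# Create an index; Marqo generates the embeddings for us
mq.create_index("knowledge-management")

# Add a document; tensor_fields tells Marqo which fields to embed
mq.index("knowledge-management").add_documents(
    [{"text": "your name is marqo"}],
    tensor_fields=["text"],
)

# Retrieve the most relevant documents for a query
results = mq.index("knowledge-management").search("what is your name?")
print(results["hits"][0]["_score"])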
Now that we’ve seen the key components involved in creating a RAG Question and Answering System with Llama 3.1 and Marqo, we can start deploying locally to test it out!
First, clone the repository in a suitable location locally:
git clone https://github.com/ellie-sleightholm/marqo-llama3_1.git
Then change directory:
cd marqo-llama3_1
Now we can begin deployment!
Navigate to your terminal and input the following:
cd frontend
npm i
npm run dev
Here, we install the necessary Node.js packages for the frontend project and then start the development server. This will be deployed at http://localhost:3000. To install npm for the first time, consult the npm documentation.
When navigating to http://localhost:3000, your browser should look the same as the image and video above, just without any text inputted. Keep the terminal open to track any updates on your local deployment. Now, let’s get the backend working!
As already mentioned above, to run this project locally, you will need to obtain the appropriate models. I recommend downloading the models from the lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF Hugging Face Hub.
There are several models you can download from here. I recommend starting with Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf as seen in the image below. Note, there is also a “take2” version of this. Feel free to use either.
Simply download this model and place it into a new directory, backend/models/8B/ (you need to create this directory), in the marqo-llama3_1 directory.
Now, open a new terminal inside the project and input the following:
cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
This navigates to the backend directory, creates a virtual environment, activates it, and installs the required Python packages listed in the requirements.txt file.
To run this project, you’ll also need to download NLTK (Natural Language Toolkit) data because the document_processors.py script uses NLTK's sentence tokenization functionality. To do this, launch the Python interpreter in your terminal:
python3
Then, import NLTK:
import nltk
nltk.download("all")
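For context, document_processors.py relies on this data for sentence tokenization, presumably to split added knowledge into sentences before it is stored. A quick illustration of what that looks like (the example text here is made up):

from nltk.tokenize import sent_tokenize

text = "Marqo is a vector search engine. It generates embeddings for you."
print(sent_tokenize(text))
# ['Marqo is a vector search engine.', 'It generates embeddings for you.']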
Once you have installed all the dependencies listed above, we can set up Marqo.
As already explained, for the RAG aspect of this project, I will be using Marqo, the end-to-end vector search engine.
Marqo requires Docker. To install Docker, go to the official Docker website. Ensure that Docker has at least 8GB memory and 50GB storage. In Docker Desktop, you can do this by clicking the settings icon, then Resources, and selecting 8GB memory.
Open a new terminal and use docker to run Marqo:
docker rm -f marqo
docker pull marqoai/marqo:latest
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
First, this pulls from marqoai/marqo and sets up a vector store. Next, Marqo artefacts will begin downloading. Then, once everything is set up successfully, you’ll be greeted with this lovely welcome message. This can take a little while as it downloads everything needed to begin searching.
It's important that you keep this terminal open. This will allow us to see requests to the Marqo index when we add and retrieve information during the RAG process. Note, when the project starts, the Marqo index will be empty until you add information in the 'Add Knowledge' section of the frontend.
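If you want to sanity-check that Marqo is reachable from Python before moving on, an optional quick check (assuming the default port 8882 used above) is:

import marqo

mq = marqo.Client(url="http://localhost:8882")
print(mq.get_indexes())  # lists existing indexes; empty until knowledge is added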
Great, now all that's left to do is run the web server!
To run the web server, navigate to a new terminal and input:
python3 -m flask run --debug -p 5001
This starts a Flask development server in debug mode on port 5001 using Python 3. This may take a few moments to run but you can expect your terminal output to look similar to the following:
You’ll also see a few other outputs above this. When you see this in your terminal output, navigate to http://localhost:3000 and begin inputting your questions to the Chatbot!
Now that you’ve set up this RAG Q&A locally, when you begin inputting information to the frontend, some of your terminals will begin to populate with specific information.
Let’s input “hello” into the chat window and see the response we get:
We receive a lovely hello message back. Let's see how our terminals populated when we made this interaction.
First, the terminal we ran Marqo in will show that it searched for the query inside the Marqo knowledge store with:
"POST /indexes/knowledge-management/search HTTP/1.1" 200 OK
This indicates a successful request to the Marqo index. Nice!
The terminal you used to run python3 -m flask run --debug -p 5001 will look as follows:
We can see that the local server (127.0.0.1) received HTTP requests. The first one is an OPTIONS request to /getKnowledge, and the second one is a POST request to the same endpoint. Both returned a status code of 200, indicating successful requests.
Inside these HTTP requests, we also see the query, which is “hello”, and the context from Marqo which, at the moment, is empty as we haven’t populated the Marqo Knowledge Store with any information.
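Under the hood, the handler for this endpoint roughly follows the pattern below. This is a simplified, hypothetical sketch; the real backend may use different field names and logic:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/getKnowledge", methods=["POST"])
def get_knowledge():
    # NOTE: the request/response fields here are assumptions for illustration only
    question = request.json.get("q", "")
    # 1. search the Marqo index for relevant context
    # 2. build a prompt from the question + any retrieved context
    # 3. generate an answer with Llama 3.1
    return jsonify({"answer": f"(placeholder) you asked: {question}"})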
Let’s add some information to the knowledge store and see how our terminal output changes.
If we ask Llama what its name is, we can get some funky answers out:
Let’s add some knowledge to the Marqo knowledge store. We’re going to input “your name is marqo” as seen in the image below:
When we submit any knowledge, our terminal will look as follows:
Here, we can see successful /addKnowledge HTTP requests.
When we ask the Q&A bot what its name is, it responds correctly with “marqo” and it even gives us the context that we provided it!
Let’s see what’s happening on the back-end.
We see the Marqo Knowledge Store is now populated with the information “your name is marqo” and we can see that Marqo generated a score of 0.682 when comparing this against the query. As our threshold is > 0.6, Marqo will provide this context to the LLM. Thus, the LLM knows that its name is marqo.
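In other words, any context that clears the threshold is simply folded into the prompt sent to Llama 3.1. A hedged sketch of that augmentation step (the actual prompt template used in backend/ai_chat.py may differ):

def build_prompt(question: str, context: list[str]) -> str:
    # Join whatever context cleared the relevance threshold into the prompt
    knowledge = "\n".join(context) if context else "No additional context."
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{knowledge}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("what is your name?", ["your name is marqo"]))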
When running this project, feel free to experiment with different settings.
You can change the model in backend/ai_chat.py:
LLM = Llama(
model_path="models/8B/your_model"
)
You can also change the score in the query_for_content function in backend/knowledge_store.py:
relevance_score = 0.6
This queries the Marqo knowledge store and retrieves content based on the provided query. It filters the results to include only those with a relevance score above 0.6 and returns the specified content from these results, limited to a maximum number of results as specified by the limit parameter. Feel free to change this score depending on your relevance needs.
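For reference, the filtering step looks roughly like the sketch below. The field names are illustrative and the real query_for_content implementation may differ:

import marqo

def query_for_content(query: str, limit: int = 5, relevance_score: float = 0.6):
    mq = marqo.Client(url="http://localhost:8882")
    results = mq.index("knowledge-management").search(query, limit=limit)
    # Keep only hits scoring above the relevance threshold
    return [hit for hit in results["hits"] if hit["_score"] > relevance_score]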
This project is a very simple Knowledge Q&A with Llama 3.1 8B GGUF and Marqo. It is intended to provide a base for your own further development. Such further work may include:
We will be working on some of these features in the future. To be the first notified when these features go live, sign up to our newsletter or star our repository.
If you have any questions, reach out on the Community Group.