Showcase

How to Implement Local RAG with Llama 3.2 and Marqo

Meta AI has just released Llama 3.3, a 70B model that, for some applications, approaches the performance of Llama 3.1 405B! They previously released Llama 3.2, which included small and medium-sized vision LLMs (11B and 90B) as well as lightweight, text-only models (1B and 3B). In this article, we will look at how you can build a local RAG application using Llama 3.2 1B and Marqo, the end-to-end vector search engine.

The code for this article can be found on GitHub here.

Figure 1: Video of a RAG Application using Llama 3.2 1B & Marqo.

Project Structure

To begin building a local RAG Q&A, we need both the frontend and backend components. This section provides information about the overall project structure and the key features included.

1. User Interface (UI)

The frontend needs the following sections:

  • Q&A: where the Q&A interaction takes place
  • Add Knowledge: allows the user to input additional information or knowledge (RAG component)
  • Reset Q&A button: allows the user to clear the chat box

Tying this all together:

Figure 2: Visual representation of the frontend of our Knowledge Question and Answering System.

Great! Now that the frontend is established, the next (and most important) part is setting up the RAG component.

2. Llama 3.2

In this demo, we use the 1B-parameter Llama 3.2 GGUF model to allow for smooth local deployment. This project has been set up to work with other GGUF models too, so feel free to experiment! Please note, if you wish to experiment and you have 16GB of RAM, we found the 8B models to work best.

You can obtain Llama 3.2 GGUF models from bartowski/Llama-3.2-1B-Instruct-GGUF on the Hugging Face Hub. There are several models you can download from there but we recommend starting with Llama-3.2-1B-Instruct-Q6_K_L.gguf. The demo video in the GitHub repository uses this model.
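
The backend loads the model through a Llama(model_path=...) call (see Making Changes below), which matches the llama-cpp-python API. As a standalone sanity check, and assuming llama-cpp-python is the runtime and the file above has been downloaded, you can load the model and ask it a quick question like this:


from llama_cpp import Llama

# Path assumes the file has been placed in backend/models/1B/ as described
# in the Setup and Installation section below
llm = Llama(model_path="backend/models/1B/Llama-3.2-1B-Instruct-Q6_K_L.gguf")

# Quick check that the model loads and generates locally
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
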

3. Marqo

A huge part of improving Llama 3.2 1B's responses is adding knowledge to its prompts. This is also known as Retrieval Augmented Generation (RAG). In order to do this, we need to store this knowledge somewhere. That's where Marqo comes in!

Marqo is more than just a vector database; it's an end-to-end vector search engine for both text and images. Vector generation, storage, and retrieval are handled out of the box through a single API, so there's no need to bring your own embeddings. This full-fledged vector search solution makes storing documents simple and made the backend component of this project straightforward to build.
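
To make this concrete, here is a minimal sketch of the Marqo Python client in action, assuming Marqo is already running locally (see the Run Marqo step below). The index name matches the one you'll see in the request logs later; the embedding model chosen here is illustrative and not necessarily the one the project configures:


import marqo

# Connect to the Marqo instance running in Docker
mq = marqo.Client(url="http://localhost:8882")

# Create an index; Marqo generates the embeddings itself, so there's no need
# to bring your own (the model here is illustrative)
mq.create_index("knowledge-management", model="hf/e5-base-v2")

# Store a piece of knowledge -- vectorisation happens automatically
mq.index("knowledge-management").add_documents(
    [{"text": "Your name is Marqo."}],
    tensor_fields=["text"],
)

# Retrieve the most relevant documents for a query
results = mq.index("knowledge-management").search(q="what is your name?")
for hit in results["hits"]:
    print(hit["_score"], hit["text"])
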

Setup and Installation

Now that we’ve seen the key components involved in creating a RAG Question and Answering System with Llama 3.2 and Marqo, we can start deploying locally to test it out!

Clone The Repository

First, clone the repository in a suitable location locally:


git clone https://github.com/ellie-sleightholm/marqo-llama-3.2-1B-rag


Then change directory:


cd marqo-llama-3.2-1B-rag


Now we can begin deployment!

Frontend

Navigate to your terminal and input the following:


cd frontend
npm i
npm run dev


Here, we install the necessary Node.js packages for the frontend project and then start the development server. This will be deployed at http://localhost:3000. To install npm for the first time, consult the npm documentation.

When navigating to http://localhost:3000, your browser should look the same as the image and video above, just without any text inputted. Keep the terminal open to track any updates on your local deployment. Now, let’s get the backend working!

Backend

1. Obtaining Llama 3.2 Models

As already mentioned above, to run this project locally, you will need to obtain the appropriate models. We recommend downloading the models from the bartowski/Llama-3.2-1B-Instruct-GGUF Hugging Face Hub.

There are several models you can download from here. For this demo, we use Llama-3.2-1B-Instruct-Q6_K_L.gguf, as seen in the image below.

Figure 3: The Llama models that can be downloaded and used as part of this demo.

Simply download this model and place it into a new directory, backend/models/1B/ (you will need to create this directory), inside the marqo-llama-3.2-1B-rag directory.
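
If you'd rather fetch the file programmatically than through the browser, the huggingface_hub package offers an optional alternative. This is a convenience sketch, not part of the project's documented setup:


from huggingface_hub import hf_hub_download

# Downloads the recommended checkpoint straight into backend/models/1B/
hf_hub_download(
    repo_id="bartowski/Llama-3.2-1B-Instruct-GGUF",
    filename="Llama-3.2-1B-Instruct-Q6_K_L.gguf",
    local_dir="backend/models/1B",
)
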

2. Install Dependencies

Now, open a new terminal inside the project and input the following:


cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

This navigates to the backend directory, creates a virtual environment, activates it, and installs the required Python packages listed in the requirements.txt file.

To run this project, you’ll also need to download NLTK (Natural Language Toolkit) data because the document_processors.py script uses NLTK's sentence tokenization functionality. To do this, open the Python interpreter in your terminal:


python3

Then, import NLTK:


import nltk
nltk.download("all")

Once you have installed all the dependencies listed above, we can set up Marqo.

3. Run Marqo

As already explained, for the RAG aspect of this project, we will be using Marqo, the end-to-end vector search engine.

Marqo requires Docker. To install Docker, go to the official Docker website. Ensure that Docker has at least 8GB of memory and 50GB of storage. In Docker Desktop, you can do this by clicking the settings icon, then Resources, and setting the memory to 8GB.

Open a new terminal and use docker to run Marqo:


docker rm -f marqo
docker pull marqoai/marqo:latest
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest

First, this pulls the marqoai/marqo image and sets up a vector store. Next, the Marqo artefacts will begin downloading. Once everything is set up successfully, you’ll be greeted with the welcome message shown below. This can take a little while as it downloads everything needed to begin searching.

Figure 4: Terminal output after successfully running Marqo with Docker.

It's important that you keep this terminal open. This will allow us to see requests to the Marqo index when we add and retrieve information during the RAG process. Note that when the project starts, the Marqo index will be empty until you add information in the 'Add Knowledge' section of the frontend.
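
If you want to double-check from Python that the container is reachable, a quick call with the Marqo client (pip install marqo if it isn't already in your environment) will list whatever indexes currently exist:


import marqo

# Lists the indexes on the locally running Marqo instance; run this after the
# Docker container has finished starting up
mq = marqo.Client(url="http://localhost:8882")
print(mq.get_indexes())
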

Great, now all that's left to do is run the web server!

4. Run the Web Server

To run the web server, navigate to a new terminal and input:


python3 -m flask run --debug -p 5001

This starts a Flask development server in debug mode on port 5001 using Python 3. This may take a few moments to run but you can expect your terminal output to look similar to the following:

Figure 5: Terminal output after running the web server.

You may see a few other outputs above this too. When you see this in your terminal output, navigate to http://localhost:3000 and begin inputting your questions to the chatbot!

What to Expect

Now that you’ve set up this RAG Q&A locally, when you begin inputting information to the frontend, some of your terminals will begin to populate with specific information.

Input Initial Prompt

Let’s input “hello” into the chat window and see the response we get:

Figure 6: Inputting 'hello' into our Q&A chatbot.

We receive a lovely hello message back. Let's see how our terminals populated when we made this interaction.

First, the terminal we ran Marqo in will show that it searched for the query inside the Marqo Knowledge Store:


 "POST /indexes/knowledge-management/search HTTP/1.1" 200 OK

This indicates a successful request to the Marqo index. Nice!

The terminal you used to run python3 -m flask run --debug -p 5001 will look as follows:

Figure 7: Terminal output after entering a chat message on the front end.

We can see that the local server (127.0.0.1) received HTTP requests. The first one is an OPTIONS request to /getKnowledge, and the second one is a POST request to the same endpoint. Both returned a status code of 200, indicating successful requests.

Inside these HTTP requests, we also see the query, which is “hello”, and the context from Marqo, which, at the moment, is empty because we haven’t yet populated the Marqo Knowledge Store with any information.

Let’s add some information to the knowledge store and see how our terminal output changes.

Adding Information to Marqo Knowledge Store

If we ask Llama what its name is, we can get some funky answers out:

Figure 8: Llama 3.2 1B response before adding information to Marqo Knowledge Store.

Let’s add some knowledge to the Marqo knowledge store. We’re going to input “your name is marqo” as seen in the image below:

Figure 9: Adding knowledge to the 'Add Knowledge' section on the frontend.

When we submit any knowledge, our terminal will look as follows:

Figure 10: Terminal output after adding information to the 'Add Knowledge' section.

Here, we can see successful /addKnowledge HTTP requests.

When we ask the Q&A bot what its name is, it responds correctly with “marqo” and even gives us the context that we provided it!

Let’s see what’s happening on the backend.

We see the Marqo Knowledge Store is now populated with the information “your name is marqo”, and that Marqo generated a score of 0.682 when comparing this document against the query. As this exceeds our threshold of 0.6, Marqo provides this context to the LLM. Thus, the LLM knows that its name is Marqo.

Figure 11: Terminal output after asking a question that matches a document inside our Marqo index.
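
To make the "provides this context to the LLM" step concrete, here is an illustrative sketch of how retrieved context is typically folded into the prompt in a RAG setup like this one; the exact prompt template in backend/ai_chat.py may differ:


# Documents that scored above the 0.6 threshold for this query
context = ["your name is marqo"]
question = "what is your name?"

# The retrieved knowledge is prepended to the user's question before the
# prompt is sent to Llama 3.2
prompt = (
    "Use the following context to answer the question.\n"
    "Context:\n" + "\n".join(context) + "\n"
    "Question: " + question
)
print(prompt)
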

Making Changes

When running this project, feel free to experiment with different settings.

You can change the model in backend/ai_chat.py:


from llama_cpp import Llama

LLM = Llama(
    model_path="models/1B/your_model"  # path to the GGUF file you downloaded
)

You can also change the score in the function query_for_content in backend/knowledge_store.py:


relevance_score = 0.6

This queries the Marqo knowledge store and retrieves content based on the provided query. It filters the results to include only those with a relevance score above 0.6 and returns the specified content from these results, limited to a maximum number of results as specified by the limit parameter. Feel free to change this score depending on your relevance needs.
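
As a simplified sketch of the behaviour described above, not the repository’s exact implementation and with an illustrative content field name, the filtering looks roughly like this:


import marqo

mq = marqo.Client(url="http://localhost:8882")

def query_for_content(query, content_key="text", limit=5, relevance_score=0.6):
    """Return content from hits whose score exceeds the relevance threshold."""
    results = mq.index("knowledge-management").search(q=query, limit=limit)
    return [
        hit[content_key]
        for hit in results["hits"]
        if hit["_score"] > relevance_score
    ]
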

Questions?

If you have any questions, reach out on the Community Group.

Code

GitHub

Ellie Sleightholm
Head of Software Developer Relations at Marqo