Meta AI has just released Llama 3.3, a 70B model that, for some applications, approaches the performance of Llama 3.1 405B! They previously released Llama 3.2, which included small and medium-sized vision LLMs (11B and 90B) as well as lightweight, text-only models (1B and 3B). In this article, we will look at how you can build a local RAG application using Llama 3.2 1B and Marqo, the end-to-end vector search engine.
The code for this article can be found on GitHub here.
To begin building a local RAG Q&A, we need both the frontend and backend components. This section provides information about the overall project structure and the key features included.
The frontend needs the following sections:
Tying this all together:
Great! Now that the front-end is established, the next (and most important) part is building the RAG component.
In this demo, we use the 1B parameter Llama 3.2 GGUF model to allow for smooth local deployment. The project has been set up to work with other GGUF models too, so feel free to experiment! Please note, if you wish to experiment and you have 16GB of RAM, we found the 8B models to work best.
You can obtain Llama 3.2 GGUF models from bartowski/Llama-3.2-1B-Instruct-GGUF on the Hugging Face Hub. There are several models you can download from there, but we recommend starting with Llama-3.2-1B-Instruct-Q6_K_L.gguf. The demo video in the GitHub repository uses this model.
A huge part of improving Llama 3.2 1B is adding relevant knowledge to its prompts. This is known as Retrieval Augmented Generation (RAG). In order to do this, we need to store that knowledge somewhere. That’s where Marqo comes in!
Marqo is more than just a vector database: it’s an end-to-end vector search engine for both text and images. Vector generation, storage, and retrieval are handled out of the box through a single API, so there’s no need to bring your own embeddings! This full-fledged vector search solution makes storing documents simple and meant the backend component of this project was really easy to build.
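To give a flavour of this, here’s a minimal sketch using the Marqo Python client (pip install marqo), assuming Marqo is running locally on port 8882. The index name matches the one used later in this project; the document and query are just illustrative:

import marqo

mq = marqo.Client(url="http://localhost:8882")

# create an index; Marqo generates and stores the embeddings for you
mq.create_index("knowledge-management")

# add a document; no need to compute your own vectors
mq.index("knowledge-management").add_documents(
    [{"text": "your name is marqo"}],
    tensor_fields=["text"],  # the fields to embed
)

# retrieve the most relevant documents for a query
results = mq.index("knowledge-management").search(q="what is your name?")
print(results["hits"][0]["text"], results["hits"][0]["_score"])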
Now that we’ve seen the key components involved in creating a RAG question-answering system with Llama 3.2 and Marqo, we can deploy it locally to test it out!
First, clone the repository in a suitable location locally:
git clone https://github.com/ellie-sleightholm/marqo-llama-3.2-1B-rag
Then change directory:
cd marqo-llama-3.2-1B-rag
Now we can begin deployment!
Navigate to your terminal and input the following:
cd frontend
npm i
npm run dev
Here, we install the necessary Node.js packages for the frontend project and then start the development server. This will be deployed at http://localhost:3000. To install npm for the first time, consult the npm documentation.
When navigating to http://localhost:3000, your browser should look the same as the image and video above, just without any text inputted. Keep the terminal open to track any updates on your local deployment. Now, let’s get the backend working!
As already mentioned above, to run this project locally, you will need to obtain the appropriate models. We recommend downloading them from the bartowski/Llama-3.2-1B-Instruct-GGUF Hugging Face Hub. There are several models you can download from there. For this demo, we use Llama-3.2-1B-Instruct-Q6_K_L.gguf, as seen in the image below.
Simply download this model and place it into a new directory, backend/models/1B/ (you will need to create this directory), inside the marqo-llama-3.2-1B-rag directory.
Now, open a new terminal inside the project and input the following:
cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
This navigates to the backend directory, creates a virtual environment, activates it, and installs the required Python packages listed in the requirements.txt file.
To run this project, you’ll also need to download NLTK (Natural Language Toolkit) data because the document_processors.py script uses NLTK's sentence tokenization functionality. To do this, start the Python interpreter in your terminal:
python3
Then, import NLTK:
import nltk
nltk.download("all")
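For reference, the sentence tokenization that document_processors.py relies on looks roughly like this (a minimal illustration, not the project’s exact code):

from nltk.tokenize import sent_tokenize

text = "Marqo is an end-to-end vector search engine. It generates and stores embeddings for you."
print(sent_tokenize(text))
# ['Marqo is an end-to-end vector search engine.', 'It generates and stores embeddings for you.']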
Once you have installed all the dependencies listed above, we can set up Marqo.
As already explained, for the RAG aspect of this project, we will be using Marqo, the end-to-end vector search engine.
Marqo requires Docker. To install Docker, go to the official Docker website. Ensure that Docker has at least 8GB of memory and 50GB of storage. In Docker Desktop, you can do this by clicking the settings icon, then Resources, and allocating 8GB of memory.
Open a new terminal and use docker to run Marqo:
docker rm -f marqo
docker pull marqoai/marqo:latest
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
These commands remove any existing Marqo container, pull the latest marqoai/marqo image, and run Marqo on port 8882. Once the container starts, it sets up the vector store and downloads the Marqo artefacts, and then you’ll be greeted with this lovely welcome message once everything is set up successfully. This can take a little bit of time while it downloads everything needed to begin searching.
It's important that you keep this terminal open. This will allow us to see requests to the Marqo index when we add and retrieve information during the RAG process. Note, when the project starts, the Marqo index will be empty until you add information in the 'Add Knowledge' section of the frontend.
Great, now all that's left to do is run the web server!
To run the web server, navigate to a new terminal and input:
python3 -m flask run --debug -p 5001
This starts a Flask development server in debug mode on port 5001 using Python 3. This may take a few moments to run but you can expect your terminal output to look similar to the following:
You’ll also see a few other outputs above this. When you see this in your terminal output, navigate to http://localhost:3000 and begin inputting your questions to the Chatbot!
Now that you’ve set up this RAG Q&A locally, when you begin inputting information to the frontend, some of your terminals will begin to populate with specific information.
Let’s input “hello” into the chat window and see the response we get:
We receive a lovely hello message back. Let’s see how our terminals were populated when we made this interaction.
First, the terminal we ran Marqo in will show that it searched for the query inside the Marqo Knowledge Store:
"POST /indexes/knowledge-management/search HTTP/1.1" 200 OK
This indicates a successful request to the Marqo index. Nice!
The terminal you used to run python3 -m flask run --debug -p 5001 will look as follows:
We can see that the local server (127.0.0.1) received HTTP requests. The first one is an OPTIONS request to /getKnowledge, and the second one is a POST request to the same endpoint. Both returned a status code of 200, indicating successful requests.
Inside these HTTP requests, we also see the query, which is “hello”, and the context from Marqo which, at the moment, is empty as we haven’t populated the Marqo Knowledge Store with any information.
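For context, a /getKnowledge handler along these lines could look roughly like the following. This is a hypothetical sketch, not the repository’s actual code: it assumes the Marqo Python client, Flask with flask-cors (which answers the OPTIONS preflight requests), and illustrative field names:

from flask import Flask, request, jsonify
from flask_cors import CORS  # answers the OPTIONS (CORS preflight) requests seen above
import marqo

app = Flask(__name__)
CORS(app)
mq = marqo.Client(url="http://localhost:8882")

@app.route("/getKnowledge", methods=["POST"])
def get_knowledge():
    query = request.json["q"]  # "q" is an assumed field name
    hits = mq.index("knowledge-management").search(q=query)["hits"]
    # the real backend also filters hits by a relevance score (see later)
    context = [hit.get("text", "") for hit in hits]
    return jsonify({"query": query, "context": context})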
Let’s add some information to the knowledge store and see how our terminal output changes.
If we ask Llama what its name is, we can get some funky answers out:
Let’s add some knowledge to the Marqo knowledge store. We’re going to input “your name is marqo” as seen in the image below:
When we submit any knowledge, our terminal will look as follows:
Here, we can see successful /addKnowledge HTTP requests.
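In the same hypothetical sketch from earlier, an /addKnowledge handler might simply write the submitted text into the Marqo index so later searches can retrieve it as context (again, the repository’s actual implementation may differ):

@app.route("/addKnowledge", methods=["POST"])
def add_knowledge():
    text = request.json["text"]  # field name is an assumption
    # index the new knowledge so future queries can use it as context
    mq.index("knowledge-management").add_documents(
        [{"text": text}], tensor_fields=["text"]
    )
    return jsonify({"status": "added"})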
When we ask the Q&A bot what its name is, it responds correctly with “marqo” and even gives us the context that we provided it!
Let’s see what’s happening on the back-end.
We see the Marqo Knowledge Store is now populated with the information “your name is marqo”, and Marqo generated a score of 0.682 when comparing this against the query. As this is above our threshold of 0.6, Marqo will provide this context to the LLM. Thus, the LLM knows that its name is marqo.
When running this project, feel free to experiment with different settings.
You can change the model in backend/ai_chat.py:

LLM = Llama(
    model_path="models/1B/your_model"  # point this at the GGUF file you downloaded
)
You can also change the score in the function query_for_content in backend/knowledge_store.py:

relevance_score = 0.6
This function queries the Marqo knowledge store and retrieves content based on the provided query. It filters the results to include only those with a relevance score above 0.6 and returns the specified content from these results, limited to a maximum number of results as specified by the limit parameter. Feel free to change this score depending on your relevance needs.
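As a rough sketch of that filtering logic (assuming the Marqo Python client and illustrative field names, not the repository’s exact code):

import marqo

mq = marqo.Client(url="http://localhost:8882")

def query_for_content(query, content_var="text", limit=5, relevance_score=0.6):
    # search the knowledge index, then keep only sufficiently relevant hits
    results = mq.index("knowledge-management").search(q=query, limit=limit)
    return [hit[content_var] for hit in results["hits"] if hit["_score"] > relevance_score]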
If you have any questions, reach out on the Community Group.