Course Module 4

Build Your First Vector Search Application


In the age of big data and artificial intelligence, the ability to efficiently search through large datasets has become increasingly important. Traditional keyword-based search methods often fall short in providing accurate and relevant results, especially in complex scenarios involving natural language processing and multimedia data. This is where vector search comes into play.

Before diving in, if you need help, guidance, or want to ask questions, join our Community and a member of the Marqo team will be there to help.

1. What is Vector Search?

Vector search, also known as similarity search or vector similarity search, is a method that involves representing data points (such as text, images, or other types of documents) as vectors in a multi-dimensional space. Each vector captures the semantic meaning of the data, allowing for more accurate and context-aware search results. Instead of matching keywords, vector search calculates the similarity between vectors to find the most relevant results.
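
To make the idea concrete, here is a minimal sketch of similarity search over hand-made three-dimensional vectors. The vectors, the document names, and the cosine_similarity helper are illustrative assumptions; a real system uses embeddings with hundreds of dimensions produced by a model, and cosine similarity is only one of several similarity measures in use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made up for illustration, not produced by a model).
documents = {
    "space adventure": [0.9, 0.1, 0.0],
    "fairy tale comedy": [0.0, 0.2, 0.9],
    "dream thriller": [0.4, 0.8, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # pretend this encodes "space exploration"

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, documents[name]),
    reverse=True,
)
print(ranked)
```

The document whose vector points in nearly the same direction as the query vector is ranked first, even though no keywords are compared.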

2. Why Use Vector Search?

  1. Improved Relevance: Vector search can capture the context and nuances of natural language, leading to more accurate search results.
  2. Multimodal Search: It enables searching across different types of data (e.g., text and images) simultaneously, making it versatile for various applications.
  3. Scalability: Vector search is designed to handle large-scale datasets efficiently, making it suitable for big data applications.
  4. Flexibility: It allows for complex queries that can incorporate multiple factors and weights, providing more sophisticated search capabilities.

3. Let’s Build! A Simple Search Demo

For this article, we will be using Marqo, an end-to-end vector search engine. Marqo is easy to implement (it takes only a few lines of code to set up) and handles much of the complexity for you, including embedding generation.

1. Set Up and Installation

We’ll start with downloading and installing Marqo. If you have any issues setting up Marqo, visit our Slack Community and send us your issue on the ‘get-help’ channel where we’ll be there to help!

  1. Marqo requires Docker. To install Docker go to Docker Docs and install for your operating system.
  2. Once Docker is installed, you can use it to run Marqo. First, open the Docker application and then head to your terminal and enter the following:

docker pull marqoai/marqo:latest
docker rm -f marqo
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest

First, Docker pulls the marqoai/marqo image and sets up a vector store. Next, the Marqo artefacts begin downloading. Once everything is set up successfully, you’ll be greeted with a welcome message. This can take a little while, as everything needed to begin searching is downloaded.

That’s it. It really is as easy as that! Now we’re ready to use Marqo. Keep your terminal open while we begin programming.
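
If you would like to confirm from Python that the server is reachable before moving on, a small check along these lines may help. This is a sketch, not part of Marqo: marqo_is_up is a hypothetical helper name, and it assumes the server answers a plain GET request on its root URL with HTTP status 200.

```python
import urllib.error
import urllib.request

def marqo_is_up(url="http://localhost:8882", timeout=2.0):
    """Return True if an HTTP server answers at `url` with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False

if __name__ == "__main__":
    print("Marqo reachable:", marqo_is_up())
```

If this prints False, check that the Docker container is still running and that port 8882 is mapped as in the docker run command above.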

2. Start Searching!

While Docker is running, we can use Marqo as we would any other Python library. We’ll begin with a simple example where we create an index and perform searches on movie descriptions. If you have any issues with the following code, visit our Slack Community and send us your issue on the ‘get-help’ channel where we’ll be there to help!

Let’s first install the Marqo Python client from our terminal:


pip install marqo

Now we’re ready to write our first vector search system!

Open a new Python script and begin by importing Marqo:


import marqo

Next, we need to create a Marqo client that will communicate with the Marqo server. We'll specify the server URL, which in this case is running locally on http://localhost:8882.


# Create a Marqo client
mq = marqo.Client(url="http://localhost:8882")

This step sets up the client to interact with the Marqo API, allowing us to perform various operations such as creating indexes and adding documents.

Before we create a new index, it's good practice to delete any existing index with the same name to avoid conflicts. Here, we are deleting the "movies-index" if it already exists.


# Delete the index if it already exists
try:
    mq.index("movies-index").delete()
except Exception:
    # The index did not exist yet, so there is nothing to delete
    pass

This ensures that we start with a clean slate every time we run our script.

Next, we create an index named "movies-index" using a specific machine learning model, hf/e5-base-v2. This model is designed to generate embeddings for various types of text inputs. It will be used for vectorizing the documents we add to the index.


# Create an index
mq.create_index("movies-index", model="hf/e5-base-v2")

Creating an index is crucial as it prepares Marqo to store and manage the documents we'll be working with.

Now, we add some movie descriptions to our index. These descriptions will be vectorized and stored in the index, making them searchable. We specify a 'Title' and 'Description' for each movie.


# Add documents (movie descriptions) to the index
mq.index("movies-index").add_documents(
    [
        {
            "Title": "Inception",
            "Description": "A mind-bending thriller about dream invasion and manipulation.",
        },
        {
            "Title": "Shrek",
            "Description": "An ogre's peaceful life is disrupted by a horde of fairy tale characters who need his help.",
        },
        {
            "Title": "Interstellar",
            "Description": "A team of explorers travel through a wormhole in space to ensure humanity's survival.",
        },
        {
            "Title": "The Martian",
            "Description": "An astronaut becomes stranded on Mars and must find a way to survive.",
        },
    ],
    tensor_fields=["Description"],
)

In this step, we specify that the "Description" field of each document should be used for vector search by including it in the tensor_fields parameter. The "Title" field is still stored and returned with each hit, but it is not vectorized.

With our index populated with movie descriptions, we can now perform a search query. Let's search for a movie related to space exploration.


# Perform a search query on the index
results = mq.index("movies-index").search(
    q="Which movie is about space exploration?"
)

This query searches the descriptions in our index for content related to space exploration.
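
The search call also accepts further options. As a hedged sketch, the helper below passes limit to cap the number of hits, assuming your version of the Marqo Python client accepts that keyword on search. The name top_titles is hypothetical and not part of Marqo.

```python
def top_titles(index, query, k=2):
    """Return the titles of the k best hits for `query`.

    `index` is expected to behave like mq.index("movies-index"), that is,
    to expose search(q=..., limit=...) returning a dict with a "hits" list.
    """
    response = index.search(q=query, limit=k)
    return [hit["Title"] for hit in response["hits"]]

# Usage against a running Marqo server (assumes the index built above):
# import marqo
# mq = marqo.Client(url="http://localhost:8882")
# print(top_titles(mq.index("movies-index"), "space exploration"))
```

Wrapping the call in a function like this keeps the result handling in one place, which is convenient once several parts of an application issue searches.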

Finally, we print out the search results, including the title, description, and the relevance score for each movie that matches the query.


# Print the search results
for result in results['hits']:
    print(f"Title: {result['Title']}, Description: {result['Description']}. Score: {result['_score']}")

The relevance score (_score) indicates how well each document matches the search query.

Let’s look at the outputs:


Title: Interstellar, Description: A team of explorers travel through a wormhole in space to ensure humanity's survival.. Score: 0.8173517436600624
Title: The Martian, Description: An astronaut becomes stranded on Mars and must find a way to survive.. Score: 0.8081475581626953
Title: Inception, Description: A mind-bending thriller about dream invasion and manipulation.. Score: 0.7978701791216605
Title: Shrek, Description: An ogre's peaceful life is disrupted by a horde of fairy tale characters who need his help.. Score: 0.7619883916893311

Interstellar has the highest relevance score (0.817), indicating it is the most relevant to the query "Which movie is about space exploration?". The Martian follows closely with a score of 0.808, also highly relevant to the query. Inception and Shrek have lower scores (0.798 and 0.762, respectively), indicating they are less relevant to the space exploration theme. These scores help us understand how well each movie's description aligns with the search query, allowing us to identify the most pertinent results efficiently.
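
Because all four scores lie fairly close together, an application may want to keep only hits above a cut-off. The sketch below uses the scores from the output above, rounded to three decimals. The filter_by_score helper and the 0.80 threshold are assumptions for illustration; score ranges depend on the model, so a suitable threshold has to be tuned for your own data.

```python
# Scores copied from the output above, rounded to three decimals.
hits = [
    {"Title": "Interstellar", "_score": 0.817},
    {"Title": "The Martian", "_score": 0.808},
    {"Title": "Inception", "_score": 0.798},
    {"Title": "Shrek", "_score": 0.762},
]

def filter_by_score(hits, min_score):
    """Keep only hits whose relevance score is at least `min_score`."""
    return [hit for hit in hits if hit["_score"] >= min_score]

# 0.80 is an arbitrary cut-off chosen for this toy example.
relevant = filter_by_score(hits, min_score=0.80)
print([hit["Title"] for hit in relevant])
```

With this cut-off, only the two space-themed movies remain.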

Awesome! Now that we’ve seen how to get started with a simple search demo in Marqo, let’s look at searching over different types of data!

4. Multimodal Search - Searching Images

We’ll now walk through a practical example of using the Marqo library for multimodal indexing. We'll create an index that can handle both text and image data, add a document to the index, and perform a search.

As with the previous example, we'll import the Marqo library and create a Marqo client.


import marqo

# Create a Marqo client with the specified URL
mq = marqo.Client(url="http://localhost:8882")

As with our previous example, before we create the index, it's important to delete any index with the same name that may already exist.


# Delete the multimodal index if it already exists
try:
    mq.index("my-multimodal-index").delete()
except Exception:
    # The index did not exist yet, so there is nothing to delete
    pass

Next, we'll define the settings for our index. We'll enable image indexing and specify the model to use for indexing. In this case, we're using the open_clip/ViT-B-32/laion2b_s34b_b79k model. Note that if you do not configure multimodal search, image URLs will be treated as strings.


# Settings for the index creation, enabling image indexing and specifying the model to use.
settings = {
    "treat_urls_and_pointers_as_images": True,  # allows us to treat URLs as images and index them
    "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",  # model used for indexing
}

# Create the index with the specified settings
response = mq.create_index("my-multimodal-index", **settings)

Now we'll add a document to our index. This document includes an image of a hippopotamus and a description. The image URL is treated as a tensor field.


# Add documents to the created index, including an image and its description
response = mq.index("my-multimodal-index").add_documents(
    [
        {
            "My_Image": "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg/640px-Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg",
            "Description": "The hippopotamus, also called the common hippopotamus or river hippopotamus, is a large semiaquatic mammal native to sub-Saharan Africa",
            "_id": "hippo-facts",  # unique identifier for the document
        }
    ],
    tensor_fields=["My_Image"],  # specify that "My_Image" should be treated as a tensor field
)

Finally, we can perform a search on our index. We'll search for the term "animal" and print the results.


# Search the index for the term "animal"
results = mq.index("my-multimodal-index").search("animal")

# Print the search results
import pprint
pprint.pprint(results)

After running the search query for the term "animal," we received the following output:


{'hits': [{'Description': 'The hippopotamus, also called the common '
                          'hippopotamus or river hippopotamus, is a large '
                          'semiaquatic mammal native to sub-Saharan Africa',
           'My_Image': 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg/640px-Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg',
           '_highlights': [{'My_Image': 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg/640px-Hipop%C3%B3tamo_%28Hippopotamus_amphibius%29%2C_parque_nacional_de_Chobe%2C_Botsuana%2C_2018-07-28%2C_DD_82.jpg'}],
           '_id': 'hippo-facts',
           '_score': 0.5586894792769398}],
 'limit': 10,
 'offset': 0,
 'processingTimeMs': 256,
 'query': 'animal'}

Let's break down what each part of the output means:

  • hits: This is a list of documents that matched the search query. Each item in this list represents a single document that was found.
  • hits[0]: This is the first (and in this case, the only) document in the list of hits.
    • Description: This is the description field of the document, which describes the hippopotamus.
    • My_Image: This field contains the URL of the image associated with the document.
    • _highlights: This is a list of highlighted fields that matched the search query. In this example, the My_Image field is highlighted, showing the image URL.
    • _id: This is the unique identifier for the document, which we set as "hippo-facts".
    • _score: This is the relevance score of the document with respect to the search query. A higher score indicates a higher relevance.
  • limit: This indicates the maximum number of documents returned by the search query. Here it is 10, because we did not pass a limit of our own.
  • offset: This indicates the starting point of the search results. An offset of 0 means that the results start from the first document.
  • processingTimeMs: This shows the time taken to process the search query, measured in milliseconds. In this case, it took 256 milliseconds.
  • query: This is the search query that was executed. In this example, the query was "animal".

The search output provides detailed information about the documents that match your search query, including their descriptions, image URLs, relevance scores, and more. By understanding this output, you can gain insights into how your data is being indexed and retrieved, allowing you to refine your search capabilities and improve the relevance of your results.
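
The limit and offset fields suggest how to page through larger result sets. The following is a minimal sketch, assuming the client's search method accepts limit and offset keywords matching the fields echoed in the response. The names iter_hits, page_size, and max_pages are hypothetical.

```python
def iter_hits(index, query, page_size=10, max_pages=5):
    """Yield hits page by page using the limit and offset search options."""
    for page in range(max_pages):
        response = index.search(
            q=query, limit=page_size, offset=page * page_size
        )
        hits = response["hits"]
        if not hits:
            break  # no more results
        yield from hits
        if len(hits) < page_size:
            break  # last, partially filled page

# Usage against a running Marqo server (assumes the index built above):
# import marqo
# mq = marqo.Client(url="http://localhost:8882")
# for hit in iter_hits(mq.index("my-multimodal-index"), "animal"):
#     print(hit["_id"], hit["_score"])
```

The max_pages cap is a deliberate safety limit so that a broad query cannot trigger an unbounded number of requests.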

5. Conclusion

In this article, we've walked through the steps of setting up a Marqo client, creating an index, adding documents, and performing a search query. This process allows us to efficiently search through content using vector search. Marqo makes it straightforward to implement powerful search capabilities in your applications.

If you want to see what else Marqo is capable of, visit our documentation.

6. Code

https://github.com/marqo-ai/fine-tuning-embedding-models-course