
How to Build An Ecommerce Image Search Application with Marqo's State-of-the-Art Models

November 9, 2024


Creating an image search application for ecommerce can be challenging, but Marqo Cloud and its state-of-the-art multimodal ecommerce embeddings make it easy. In this guide, we’ll walk through the entire process, from setting up a Marqo index to creating a user interface where users can refine their search with additional themes or styles. By the end, you’ll have all the tools to build a powerful search engine ready for deployment.

Step 1: Set Up Your Marqo Index

Before we can start searching, we need to create an index in Marqo Cloud. This index will store product data, including titles, categories, and images, and enable Marqo’s powerful embeddings to process them for quick, relevant search results.

1.1 Initialize and Configure Marqo

We’ll begin by initializing the Marqo client, setting up the API key for secure access, and defining the index configuration. To obtain your API Key, see this article.


import marqo
import os
from dotenv import load_dotenv

# Load environment variables to securely access the Marqo API key
load_dotenv()

# Set up Marqo Client 
api_key = os.getenv("MARQO_API_KEY")   # To find your API Key, see https://www.marqo.ai/blog/finding-my-marqo-api-key
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)
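
If the key is missing, `os.getenv` quietly returns `None`, and the problem only surfaces later when a request fails. As a minimal sketch, a small guard can make that failure explicit. The helper name `require_env` is our own invention for illustration and is not part of the Marqo client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to your .env file.")
    return value


# Hypothetical usage: api_key = require_env("MARQO_API_KEY")
```

Failing at startup with a clear message is usually easier to debug than an authentication error raised deep inside a later call.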


Alternatively, if you prefer to run Marqo locally, you can use the following instead. For information on how to set up Marqo locally, see our GitHub.


# mq = marqo.Client("http://localhost:8882", api_key=None)

1.2 Define the Index Settings

Now, let’s set up the configuration for our index. This configuration will specify the type of index, the embedding model to use, and any specific settings for handling image URLs and inference.

As part of our Marqo-Ecommerce model launch, we released two embedding models, marqo-ecommerce-embeddings-B and marqo-ecommerce-embeddings-L. The B model is smaller and faster for inference (with times of 5.1 ms for single batch text, and 5.7 ms for image) and a smaller embedding dimension (768). The L model is larger (652M parameters), has a larger embedding dimension (1024), but has better retrieval performance. Marqo-Ecommerce-L has up to 7.3% MRR and 7.4% nDCG@10 average improvement over Marqo-Ecommerce-B across the three tasks for the 4M hard evaluation. For more information, see our blog post.
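
To get a feel for what the difference in embedding dimension means in practice, here is a rough back-of-the-envelope sketch. It assumes each vector is stored as 32-bit floats and counts only the raw embeddings, ignoring any index overhead; the catalogue size of 200,000 vectors is a hypothetical figure, and the real storage footprint in Marqo Cloud may differ.

```python
BYTES_PER_FLOAT32 = 4  # Assumption: each vector component is a 32-bit float


def raw_vector_mebibytes(num_vectors: int, dimensions: int) -> float:
    """Raw size of the embeddings alone, in MiB, ignoring index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / (1024 ** 2)


num_products = 200_000  # Hypothetical catalogue size, one vector per product
size_b = raw_vector_mebibytes(num_products, 768)   # marqo-ecommerce-embeddings-B
size_l = raw_vector_mebibytes(num_products, 1024)  # marqo-ecommerce-embeddings-L
print(f"B: {size_b:.1f} MiB, L: {size_l:.1f} MiB, ratio: {size_l / size_b:.2f}x")
```

Under these assumptions, the L model needs about a third more raw vector storage than the B model, which is one factor to weigh against its better retrieval performance.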

To load marqo-ecommerce-embeddings-B into Marqo Cloud, your settings would be:


settings = {
    "type": "unstructured",  # The index type is set to 'unstructured', allowing flexible data formats.
    "model": "Marqo/marqo-ecommerce-embeddings-B",  # Specifies the embedding model for e-commerce.
    "modelProperties": {
        "name": "hf-hub:Marqo/marqo-ecommerce-embeddings-B",  # Model name on Hugging Face Hub.
        "dimensions": 768,  # Embedding dimensionality set to 768.
        "type": "open_clip"  # Model type, indicating use of the OpenCLIP architecture.
    },
    "treatUrlsAndPointersAsImages": True,  # Enables Marqo to use image URLs directly as image sources.
    "inferenceType": "marqo.CPU.large",  # Specifies the inference type, using Marqo's large CPU instance.
}

For marqo-ecommerce-embeddings-L, your settings would be:


settings = {
    "type": "unstructured",  # Set the index type as unstructured data
    "model": "Marqo/marqo-ecommerce-embeddings-L",  # Specify alternative model
    "modelProperties": {
        "name": "hf-hub:Marqo/marqo-ecommerce-embeddings-L",  # Name of the model on Hugging Face Hub
        "dimensions": 1024,  # Larger dimensionality for embeddings
        "type": "open_clip"  # Model type, using OpenCLIP architecture
    },
    "treatUrlsAndPointersAsImages": True,  # Enable image URLs as image sources
    "inferenceType": "marqo.CPU.large",  # Specify the inference type using Marqo's large CPU instance
}

For this demo, we use marqo-ecommerce-embeddings-L, but feel free to use either configuration depending on your preference and use case.
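
Since the two configurations differ only in the model name and embedding dimension, you might prefer to generate them from a single helper. The following is a minimal sketch; `build_settings` and `MODEL_DIMENSIONS` are hypothetical names introduced here, while the values are copied from the two configurations above.

```python
# Embedding dimensions taken from the two configurations above
MODEL_DIMENSIONS = {"B": 768, "L": 1024}


def build_settings(variant: str) -> dict:
    """Build index settings for marqo-ecommerce-embeddings-B or -L (hypothetical helper)."""
    if variant not in MODEL_DIMENSIONS:
        raise ValueError(f"Unknown model variant {variant!r}; expected 'B' or 'L'.")
    model = f"Marqo/marqo-ecommerce-embeddings-{variant}"
    return {
        "type": "unstructured",
        "model": model,
        "modelProperties": {
            "name": f"hf-hub:{model}",
            "dimensions": MODEL_DIMENSIONS[variant],
            "type": "open_clip",
        },
        "treatUrlsAndPointersAsImages": True,
        "inferenceType": "marqo.CPU.large",
    }


settings = build_settings("L")  # The variant used for this demo
```

Switching models then becomes a one-character change, and the model name and dimension cannot drift out of sync.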

1.3 Create the Index

We can now delete any existing index with the same name to avoid conflicts, and create a new one with our defined settings.


index_name = "marqo-ecommerce-l"  # Specify the name of the index

try:
    mq.index(index_name).delete()  # Delete the existing index if it already exists to avoid conflicts
except Exception:
    pass  # Ignore the error raised when the index does not exist yet

mq.create_index(index_name, settings_dict=settings)  # Create a new Marqo index with the specified settings

Step 2: Add Documents to the Index

With the index set up, it’s time to add our product data. We will be using a 200k dataset that includes products from all ecommerce categories. This step involves reading data from a CSV file, formatting it for Marqo, and adding the documents in batches for efficient processing. You can access this CSV in our GitHub.

2.1 Load and Prepare Product Data

We’ll start by loading our product data, stored in a CSV file, into a pandas DataFrame. Each row in the DataFrame represents a product with fields such as the image URL, query, title, and score.


import pandas as pd  # Import the pandas library for data manipulation

path_to_data = "data/marqo-gs_100k.csv"  # Define the path to the CSV file containing product data
df = pd.read_csv(path_to_data)  # Load the product data from the CSV file into a pandas DataFrame

2.2 Format Data for Marqo

Next, we convert the data into a format suitable for Marqo. This involves creating a list of dictionaries, each containing the product’s image URL, query, title, and score.


documents = []  # Initialize an empty list to store formatted documents
for index, row in df.iterrows():  # Iterate over each row in the DataFrame
    document = {
        "image_url": row["image"],
        "query": row["query"],
        "title": row["title"],
        "score": row["score"],
    }
    documents.append(document)  # Append the formatted document to the list
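
Because the image carries most of the weight in this demo, it can be worth filtering out rows without a usable image URL before uploading. The sketch below assumes that missing CSV cells arrive as non-string values (pandas represents them as float NaN). The helper `has_usable_image_url` and the sample documents are hypothetical and only mirror the shape of the list built above.

```python
def has_usable_image_url(document: dict) -> bool:
    """True if image_url is a non-empty string; NaN or None values are rejected."""
    url = document.get("image_url")
    return isinstance(url, str) and url.strip() != ""


# Hypothetical sample documents in the same shape as the list built above
sample_documents = [
    {"image_url": "https://example.com/a.jpg", "query": "mug", "title": "Blue mug", "score": 10},
    {"image_url": float("nan"), "query": "mug", "title": "Missing image", "score": 5},
    {"image_url": "", "query": "mug", "title": "Empty image", "score": 3},
]
usable = [doc for doc in sample_documents if has_usable_image_url(doc)]
print(len(usable))
```

Applying the same list comprehension to `documents` would drop such rows before they reach the index.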

2.3 Add Data in Batches

To optimize performance, we add the documents to Marqo in batches. We’ll also apply a custom mapping to weight different fields based on their importance in the search process.


batch_size = 64  # Define the batch size for uploading documents

for i in range(0, len(documents), batch_size):  # Loop through documents in batches
    batch = documents[i:i + batch_size]  # Select a batch of documents

    mq.index(index_name).add_documents(  # Add the batch of documents to the Marqo index
        batch,
        client_batch_size=batch_size,  # Set the batch size for the client
        mappings={
            "image_title_multimodal": {  # Define a multimodal field combining image, title, and query
                "type": "multimodal_combination",  # Set the field type as multimodal
                "weights": {"title": 0.1, "query": 0.1, "image_url": 0.8},  # Assign weights to each field
            }
        },
        tensor_fields=["image_title_multimodal"],  # Specify fields for tensor generation
    )
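
To build some intuition for what the weights do, here is a purely conceptual sketch of combining per-field vectors into one. It assumes a weighted sum scaled to unit length, which is our simplification for illustration; Marqo’s internal implementation may differ. The function `combine_vectors` and the toy two-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math


def combine_vectors(vectors: dict, weights: dict) -> list:
    """Weighted sum of per-field vectors, scaled to unit length (conceptual sketch only)."""
    dims = len(next(iter(vectors.values())))
    combined = [0.0] * dims
    for field, vector in vectors.items():
        for i, value in enumerate(vector):
            combined[i] += weights[field] * value
    norm = math.sqrt(sum(v * v for v in combined))
    return [v / norm for v in combined] if norm else combined


# Toy 2-dimensional vectors standing in for real embeddings (hypothetical values)
field_vectors = {"title": [1.0, 0.0], "query": [1.0, 0.0], "image_url": [0.0, 1.0]}
field_weights = {"title": 0.1, "query": 0.1, "image_url": 0.8}
combined = combine_vectors(field_vectors, field_weights)
print(combined)
```

With a weight of 0.8 on `image_url`, the combined vector points mostly in the direction of the image embedding, which is why visual similarity dominates the search results in this configuration.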

Step 3 (Optional): Monitor Index Statistics

Monitoring the index allows you to see the progress of the data upload and check resource usage. This step is optional but can be helpful for debugging and ensuring everything is set up correctly.

Code to Retrieve and Print Index Statistics

Here’s how to retrieve statistics about the number of documents and vectors, as well as storage and memory usage.


results = mq.index(index_name).get_stats()  # Retrieve statistics for the specified Marqo index
print(results)  # Print the index statistics to view details such as document count and memory usage

This will print output similar to:


{'numberOfDocuments': 1152, 'numberOfVectors': 1152, 'backend': {'memoryUsedPercentage': 2.64, 'storageUsedPercentage': 1.28}}


This can be helpful before deploying your UI to confirm that documents have been added to your index correctly.
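
If you want to automate that confirmation, one option is to compare the reported document count against the number you expected to upload. This is a minimal sketch: `upload_is_complete` is a hypothetical helper, and the sample dictionary is copied from the example output above rather than fetched from a live index.

```python
def upload_is_complete(stats: dict, expected_documents: int) -> bool:
    """True once the index reports at least the expected number of documents."""
    return stats.get("numberOfDocuments", 0) >= expected_documents


# Sample statistics copied from the example output above
sample_stats = {
    "numberOfDocuments": 1152,
    "numberOfVectors": 1152,
    "backend": {"memoryUsedPercentage": 2.64, "storageUsedPercentage": 1.28},
}
print(upload_is_complete(sample_stats, expected_documents=1152))
print(upload_is_complete(sample_stats, expected_documents=200_000))
```

In a real script you would pass the result of `get_stats()` and `len(documents)` instead of the sample values.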

Step 4: Build the User Interface

To make your search engine user-friendly, we’ll create a UI with Gradio. This interface will allow users to enter a query, specify what they want more or less of, and view the top results returned by Marqo.

4.1 Define the Search Function

Our search function takes a query, themes to emphasize, and themes to avoid, and then fetches the top search results. It also retrieves the images associated with each product and formats the results for display.


import requests  # Import the requests library for handling HTTP requests
import io  # Import io for handling byte streams
from PIL import Image  # Import PIL's Image module for image processing

def search_marqo(query, themes, negatives):
    query_weights = {query: 1.0}  # Assign a weight of 1.0 to the main query
    if themes:
        query_weights[themes] = 0.75  # Apply a positive weight to emphasize additional themes
    if negatives:
        query_weights[negatives] = -1.1  # Apply a negative weight to de-emphasize certain themes

    # Perform search on the Marqo index
    res = mq.index(index_name).search(query_weights, limit=10)  # Limit results to top 10

    # Process results to prepare for display
    products = []
    for hit in res['hits']:
        image_url = hit.get('image_url')  # Get the image URL from the search hit
        title = hit.get('title', 'No Title')  # Get the product title, default to 'No Title' if missing

        # Retrieve image from the provided URL
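        response = requests.get(image_url, timeout=10)  # Fetch the image, giving up after 10 seconds rather than hanging
        response.raise_for_status()  # Raise on HTTP errors instead of passing an error page to PIL
        image = Image.open(io.BytesIO(response.content))  # Open the image from the response content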

        # Prepare product details for display in the interface
        product_info = f'{title}'
        products.append((image, product_info))  # Append the image and details to the results list

    return products  # Return the list of processed products for display
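
The weighted-query dictionary is the heart of the search function, so it can be useful to build it in a small, separately testable helper. The sketch below uses the same weights as above (1.0, 0.75, and -1.1) and additionally skips inputs that contain only whitespace. The name `build_query_weights` is hypothetical.

```python
def build_query_weights(query: str, themes: str = "", negatives: str = "") -> dict:
    """Build the weighted-query dictionary, ignoring blank or whitespace-only inputs."""
    query_weights = {query.strip(): 1.0}
    if themes and themes.strip():
        query_weights[themes.strip()] = 0.75  # Positive weight pulls results toward the theme
    if negatives and negatives.strip():
        query_weights[negatives.strip()] = -1.1  # Negative weight pushes results away from it
    return query_weights


weights = build_query_weights("coffee machine", "silver", "  ")
print(weights)
```

The resulting dictionary can be passed to `search` in place of `query_weights`. Here the whitespace-only "Less of..." input is dropped, so only the main query and the "silver" theme influence the results.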

4.2 Set Up the Gradio Interface

With the search function ready, we can now build the interface. We’ll use text boxes for user input and a gallery to display the search results.


import gradio as gr

# Gradio Blocks Interface
with gr.Blocks(css=".orange-button { background-color: orange; color: black; }") as interface:
    gr.Markdown("Multimodal Ecommerce Search with Marqo")
    with gr.Row():
        query_input = gr.Textbox(placeholder="Coffee machine", label="Search Query")
        themes_input = gr.Textbox(placeholder="Silver", label="More of...")
        negatives_input = gr.Textbox(placeholder="Buttons", label="Less of...")

    search_button = gr.Button("Submit", elem_classes="orange-button")
    results_gallery = gr.Gallery(label="Top 10 Results", columns=4)

    # Set up button click functionality
    search_button.click(fn=search_marqo, inputs=[query_input, themes_input, negatives_input], outputs=results_gallery)

# Launch the app
interface.launch()

When launched, the interface lets users enter a main search query, add themes to refine it, and specify themes to avoid.

Step 5 (Optional): Deploy on Hugging Face Spaces

This ecommerce search demo can also be deployed to Hugging Face Spaces. Simply set up a Gradio Hugging Face Space and copy the contents of the app.py file. Note that you will need to define your Marqo API Key as a secret variable in your Hugging Face Space for this to work.

Step 6: Clean Up

If you follow the steps in this guide, you will create an index with CPU large inference and a basic storage shard. This index will cost $0.38 per hour. When you are done with the index, you can delete it with the following code:


import marqo
import os

index_name = "marqo-ecommerce-l"  # Same index name used in Step 1.3

mq = marqo.Client("https://api.marqo.ai", api_key=os.getenv("MARQO_API_KEY"))
mq.delete_index(index_name)  # Permanently delete the index


If you do not delete your index, you will continue to be charged for it.
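
To see how quickly a forgotten index adds up, here is a small arithmetic sketch based on the $0.38 hourly rate quoted above. It assumes the rate stays constant and that billing is simply hours multiplied by rate; actual Marqo Cloud pricing and billing granularity may differ. The function `running_cost` is a hypothetical helper.

```python
HOURLY_RATE_USD = 0.38  # Rate quoted above for CPU large inference plus a basic storage shard


def running_cost(hours: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of keeping the index running for the given number of hours."""
    return round(hours * hourly_rate, 2)


print(running_cost(24))       # One day
print(running_cost(24 * 30))  # A 30-day month
```

Under these assumptions, the index costs roughly nine dollars a day, so deleting it after a short experiment keeps the bill small.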

Conclusion

With these steps, you now have a fully functional ecommerce search engine powered by Marqo and its state-of-the-art embeddings. This search engine allows users to find relevant products quickly, with the added flexibility of emphasizing or de-emphasizing certain themes.