How to Fine-Tune and Deploy an Embedding Model

December 6, 2024

This blog post shows you how to fine-tune embedding models with Marqtune and deploy them using Marqo Cloud for real-world applications. It covers getting set up, preparing a dataset, fine-tuning a model, and deploying it to create a searchable index ready for production.

Introduction

Fine-tuning and deploying embedding models tailored to specific use cases has often been a challenge—until now. Marqo Cloud provides an end-to-end solution, allowing you to train, embed, retrieve, and evaluate all in one platform.

So, how does it work? Start by selecting your embedding model: Marqo offers hundreds of open-source options, or you can bring your own. Next, fine-tune and evaluate this model using Marqtune, Marqo’s platform for fine-tuning embedding models. We built Marqtune on the foundation of our new training framework, Generalized Contrastive Learning (GCL). With GCL, you can fine-tune embedding models to rank search results not only by semantic relevance but also by a ranking system defined by your search team. This means more relevant search results that cater to your business.

Once your model is trained, deploy it seamlessly on Marqo Cloud. From there, create an index, add your documents, and voilà—you’re set to deliver powerful search and recommendations. Whether you’re building a fashion search tool, an ecommerce platform, or any multimodal application, Marqo’s tools get you up and running in no time.

In this blog post, we’ll walk through the entire process. You can follow along with our Google Colab Notebook, and there is also a YouTube video that walks you through the same steps.

Step 1: Set Up

Before anything else, we need to install the libraries that make everything possible: marqtune for fine-tuning and marqo for indexing and searching.


pip install marqtune
pip install marqo

We will be using Marqtune for the fine-tuning. Sign up to Marqo Cloud to access Marqtune, and navigate to ‘API Keys’ to obtain your API Key. For more information, see this article. With our API Key, we set up the Marqtune client.


from marqtune.client import Client

api_key = 'api_key' # To find your API Key, see https://www.marqo.ai/blog/finding-my-marqo-api-key
marqtune_client = Client(url="https://marqtune.marqo.ai", api_key=api_key)
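
Hardcoding the key in a notebook or script makes it easy to leak. As a minimal sketch, the key could instead be read from an environment variable. The variable name MARQO_API_KEY and the helper load_api_key are assumptions made for this example; they are not part of the Marqo or Marqtune libraries.

```python
import os


def load_api_key(env_var: str = "MARQO_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The default variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Marqo API key."
        )
    return key


# Usage (uncomment once the variable is set in your shell):
# api_key = load_api_key()
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the client.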

Step 2: Downloading and Preparing the Dataset

We’ll use a publicly available dataset for fine-tuning. This is a multimodal dataset containing queries, titles, image URLs, and a score factor. It is a subset of the Marqo-GS-10M dataset. For this demo, we will train our model on the first three columns: query, title, and image. For information on how to train a model using a score metric, see our full Marqtune walkthrough.

This is a sample of what the dataset looks like:

"query","title","image","score"
"Easy cleaning detachable kitchen scissors","Clever Cutter Kitchen Scissors Multi-functional Kitchen Scissors 2 in 1 Scissors ...","https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqtune_test/images/5840935959251565057.webp","29"
"Collar support","Sponge cervical support soft collar neck brace cervical breathable and ...","https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqtune_test/images/1360326383107717148.webp","26"
"Customizable Buttons for Men","Personalized Pinback Buttons 3 Inches Qty 20","https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqtune_test/images/7828718866322268331.webp","84"
"Unitards","Capezio Short Sleeve Leotard - Large - Black","https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqtune_test/images/16077728302492702729.webp","46"

Let’s download the CSV file locally. We will then use this to create a dataset in Marqtune.


from urllib.request import urlopen
import gzip

base_path = "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqtune_test/datasets/v1"
training_data = "gs_100k_training.csv"
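# Download the gzipped CSV, decompress it, and save it locally.
with urlopen(f"{base_path}/{training_data}.gz") as response:
    csv_text = gzip.decompress(response.read()).decode("utf-8")
with open(training_data, "w", encoding="utf-8") as f:
    f.write(csv_text)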
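
Before uploading anything, it can help to confirm that the downloaded file has the columns this walkthrough relies on. The following is a small sketch using only the standard library; the helper name preview_csv is hypothetical, and the required column names are taken from the sample shown earlier.

```python
import csv
from itertools import islice

# Columns this walkthrough trains on (taken from the dataset sample).
REQUIRED_COLUMNS = ("query", "title", "image")


def preview_csv(lines, limit=3):
    """Return the first `limit` rows as dicts, checking required columns.

    `lines` is any iterable of CSV text lines, such as an open file.
    """
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")
    return list(islice(reader, limit))


# Usage with the downloaded file:
# with open("gs_100k_training.csv", newline="", encoding="utf-8") as f:
#     for row in preview_csv(f):
#         print(row["query"], "->", row["title"])
```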

In order to create a dataset in Marqtune, we need to identify the columns we want to use in the CSVs as well as their types by defining a data schema. For this example, we'll define the query, title and image.


data_schema = {
    "query": "text",
    "title": "text",
    "image": "image_pointer",
}

After defining the data schema we can then create the training dataset. Note that creating a dataset takes a few minutes to complete because it involves several steps:

The CSV file has to be uploaded
Some simple validations have to pass (e.g. the data schema needs to be validated against each row in the CSV input)
The URLs in the image_pointer columns are used to download the image files to the dataset
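
Since dataset creation takes a few minutes, a rough local check can catch obvious problems earlier. The sketch below is only an approximation written for this post and does not reproduce Marqtune's own validation; the helper name row_problems and its rules (non-empty values, http or https URLs for image_pointer columns) are assumptions.

```python
from urllib.parse import urlparse

data_schema = {
    "query": "text",
    "title": "text",
    "image": "image_pointer",
}


def row_problems(row, schema):
    """Return a list of problems found in one CSV row (a dict of strings)."""
    problems = []
    for column, kind in schema.items():
        value = (row.get(column) or "").strip()
        if not value:
            problems.append(f"{column}: missing or empty")
        elif kind == "image_pointer":
            parsed = urlparse(value)
            # Assumed rule for this sketch: image pointers are http(s) URLs.
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                problems.append(f"{column}: not an http(s) URL")
    return problems
```

Rows are expected to be dictionaries of strings, for example as produced by csv.DictReader.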

We create a training dataset as follows:


from marqtune.enums import DatasetType

# Create the training dataset.
training_dataset_name = f"{training_data}"
print(f"Creating training dataset ({training_dataset_name}):")
training_dataset = marqtune_client.create_dataset(
    dataset_name=training_dataset_name,
    file_path=training_data,
    dataset_type=DatasetType.TRAINING,
    data_schema=data_schema,
    query_columns=["query"],
    result_columns=["title", "image"],
    # setting wait_for_completion=True will make this a blocking call and will also print logs interactively
    wait_for_completion=True,
)

Once your dataset has been created, the Marqo Cloud console will look as follows:

We can see that our dataset is ready (see the dataset status on the right-hand side). Now we can train a model using this dataset.

Note, for this example, we only create a training dataset. If you’d like to create an evaluation dataset and evaluate this fine-tuned model, please see our Marqtune walkthrough article.

Step 3: Fine-Tuning an Embedding Model

In this section, we fine-tune a pre-trained model (laion/CLIP-ViT-B-32-laion2B-s34B-b79K) using the dataset we have just created. We specify hyperparameters such as weights for different columns and the number of training epochs.

We first define the training hyperparameters. We have chosen a minimal set of hyperparameters to get you started, primarily the left/right keys that define which columns in the input CSV we’re training on. We also specify that we want to train the model for 5 epochs. Please refer to the training parameters documentation for further information.


# Setup training hyper parameters:
training_params = {
    "leftKeys": ["query"],
    "leftWeights": [1],
    "rightKeys": ["image", "title"],
    "rightWeights": [0.9, 0.1],
    "epochs": 5,
}
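
A mismatch between keys and weights is an easy mistake to make when editing these parameters. As a hedged sketch, a local sanity check might look like the following; check_training_params is a hypothetical helper written for this post, not a Marqtune function, and the rules it enforces are assumptions.

```python
training_params = {
    "leftKeys": ["query"],
    "leftWeights": [1],
    "rightKeys": ["image", "title"],
    "rightWeights": [0.9, 0.1],
    "epochs": 5,
}


def check_training_params(params, schema_columns):
    """Raise ValueError if keys, weights, or epochs look inconsistent."""
    for side in ("left", "right"):
        keys = params[f"{side}Keys"]
        weights = params[f"{side}Weights"]
        if len(keys) != len(weights):
            raise ValueError(f"{side}Keys and {side}Weights differ in length")
        unknown = [key for key in keys if key not in schema_columns]
        if unknown:
            raise ValueError(f"{side}Keys not in the data schema: {unknown}")
    if params["epochs"] < 1:
        raise ValueError("epochs must be at least 1")


# Passes silently for the parameters used in this post.
check_training_params(training_params, {"query", "title", "image"})
```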

Next, we define the model and checkpoint that we wish to fine-tune. Then, we perform fine-tuning with Marqtune. Note that you may choose to run this training faster using more powerful hardware. You can specify this with: instance_type=InstanceType.PERFORMANCE.


from marqtune.enums import InstanceType

base_model = "ViT-B-32"
base_checkpoint = "laion2b_s34b_b79k"

model_name = f"{training_data}-model"
print(f"Training a new model ({model_name}):")
tuned_model = marqtune_client.train_model(
    dataset_id=training_dataset.dataset_id,
    model_name=model_name,
    instance_type=InstanceType.BASIC,
    base_model=f"Marqo/{base_model}.{base_checkpoint}",
    hyperparameters=training_params,
    wait_for_completion=True,
)

You can expect your terminal output to look similar to the following while the fine-tuning is occurring:


2024-12-02,13:12:35 | INFO | Train Epoch: 0 [   256/100000 (0%)] Data (t): 2.267 Batch (t): 7.688, 33.2988/s, 33.2988/s/gpu LR: 0.000000 Logit Scale: 100.000, Logit Bias: 0.000, Txt_img_0_0_loss: 1.3925 (1.3925) Txt_txt_0_1_loss: 4.3815 (4.3815) Weighted_mean_loss: 1.1798 (1.1798) Loss: 2.0866 (2.0866)
2024-12-02,13:13:57 | INFO | Train Epoch: 0 [ 25856/100000 (26%)] Data (t): 0.000 Batch (t): 0.817, 313.753/s, 313.753/s/gpu LR: 0.000005 Logit Scale: 99.986, Logit Bias: 0.000, Txt_img_0_0_loss: 0.66661 (1.0295) Txt_txt_0_1_loss: 0.48414 (2.4328) Weighted_mean_loss: 0.48736 (0.83356) Loss: 0.57618 (1.3314)
2024-12-02,13:15:19 | INFO | Train Epoch: 0 [ 51456/100000 (52%)] Data (t): 0.001 Batch (t): 0.816, 313.489/s, 313.489/s/gpu LR: 0.000010 Logit Scale: 99.977, Logit Bias: 0.000, Txt_img_0_0_loss: 0.61750 (0.89220) Txt_txt_0_1_loss: 0.30888 (1.7249) Weighted_mean_loss: 0.41900 (0.69537) Loss: 0.49072 (1.0512)
2024-12-02,13:16:40 | INFO | Train Epoch: 0 [ 77056/100000 (77%)] Data (t): 0.000 Batch (t): 0.817, 313.593/s, 313.593/s/gpu LR: 0.000015 Logit Scale: 99.962, Logit Bias: 0.000, Txt_img_0_0_loss: 0.77137 (0.86199) Txt_txt_0_1_loss: 0.28524 (1.3649) Weighted_mean_loss: 0.47234 (0.63961) Loss: 0.57508 (0.93214)
2024-12-02,13:17:53 | INFO | Train Epoch: 0 [ 99840/100000 (100%)] Data (t): 0.001 Batch (t): 0.816, 314.367/s, 314.367/s/gpu LR: 0.000019 Logit Scale: 99.932, Logit Bias: 0.000, Txt_img_0_0_loss: 0.83676 (0.85695) Txt_txt_0_1_loss: 0.34349 (1.1607) Weighted_mean_loss: 0.53495 (0.61868) Loss: 0.63799 (0.87331)
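
If you save this output to a file, the loss values can be pulled out with a regular expression to track progress. The sketch below assumes log lines shaped like the ones above; the helper name parse_train_line is hypothetical, and treating the value in parentheses as a running average is an assumption rather than something the log states.

```python
import re

# Matches "Train Epoch" lines in the format shown above.
# The final, capitalised "Loss:" field is the one captured.
TRAIN_LINE = re.compile(
    r"Train Epoch: (?P<epoch>\d+) "
    r"\[\s*(?P<seen>\d+)/(?P<total>\d+) .*"
    r"\sLoss: (?P<loss>[\d.]+) \((?P<loss_avg>[\d.]+)\)\s*$"
)


def parse_train_line(line):
    """Return epoch, progress, and loss for a training log line, or None."""
    match = TRAIN_LINE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        # Fraction of the epoch's samples processed so far.
        "progress": int(match["seen"]) / int(match["total"]),
        "loss": float(match["loss"]),
        # Assumed to be a running average (value in parentheses).
        "loss_avg": float(match["loss_avg"]),
    }
```

Lines that do not match the pattern return None, so the function can be applied to a whole log file and the results filtered.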

The Marqo Cloud console will look as follows. You can see your model being created.

If you click on your model, you will see log information.

It’s worth noting that once training has been successfully kicked off in Marqtune it will continue until completion no matter what happens to your local client session.

Once fine-tuning is complete, we need to release a checkpoint from this model so that we can use it in a Marqo Cloud index. You can do this directly in the Marqo Cloud platform or through code.

To release through code, run the following:


marqtune_client.model(tuned_model.model_id).release("epoch_5")

You can also release directly in Marqo Cloud:

Press ‘release’ and select the epoch you wish to release. In this example, we’ll release epoch 5. Note that if you click on your model, you will see an option to release any of its epochs. We will use the released checkpoint when creating an index in Marqo.

Step 4: Deploying a Fine-Tuned Model

Great! Now that we have our released checkpoint, we can load this model in Marqo Cloud, ready to perform searches and recommendations.

Initialize Marqo Client

First, we’ll initialize the Marqo client.


import marqo

marqo_client = marqo.Client("https://api.marqo.ai", api_key=api_key)

Next, we define the settings for our index. Since we are using the model we just fine-tuned, we must specify this in our settings. Here, we specify "model":f"marqtune/{model_id}/epoch_5". Note, the model_id here is the model ID of your fine-tuned model and the epoch must have been released in Marqo Cloud. You can obtain your model ID with model_id = tuned_model.model_id where tuned_model is the model you have just fine-tuned. We then create the index.


model_id = tuned_model.model_id

settings = {
    "type": "unstructured",  # Set the index type as unstructured data
    "model":f"marqtune/{model_id}/epoch_5",
    "treatUrlsAndPointersAsImages": True,  # Enable image URLs as image sources
    "inferenceType": "marqo.CPU.large",
}

index_name = "ecommerce-search"

marqo_client.create_index(index_name=index_name, settings_dict=settings)
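
The checkpoint name appears twice in this walkthrough: once when releasing the checkpoint and once in the index settings. A small helper can keep the two in sync. Both functions below are hypothetical and not part of either library, and they assume checkpoints follow the epoch_N naming used in this post.

```python
def checkpoint_name(epoch: int) -> str:
    """Build a checkpoint name such as 'epoch_5' (assumed naming scheme)."""
    if epoch < 1:
        raise ValueError("epoch must be at least 1")
    return f"epoch_{epoch}"


def marqtune_model_reference(model_id: str, epoch: int) -> str:
    """Build the value used for the "model" key in the index settings."""
    return f"marqtune/{model_id}/{checkpoint_name(epoch)}"


# Usage, with names taken from the surrounding code:
# marqtune_client.model(tuned_model.model_id).release(checkpoint_name(5))
# settings["model"] = marqtune_model_reference(tuned_model.model_id, 5)
```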

In your Marqo Cloud console, navigate to ‘indexes’ and you’ll see your index being created.

Add Documents to Index

Once our index is ready, we can then add our documents. For ease, in this example we'll use the dataset that we trained our model on.

First, load the product data from the CSV file into a pandas DataFrame:


import pandas as pd
path_to_data = "gs_100k_training.csv"  # Define the path to the CSV file containing product data
df = pd.read_csv(path_to_data)  # Load the product data from the CSV file into a pandas DataFrame

Next, we put the documents into a format that Marqo accepts.


documents = [
    {"image_url": image, "query": query, "title": title}
    for image, query, title in zip(df["image"], df["query"], df["title"])
]
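
pandas reads empty CSV cells as NaN, which is a float rather than a string. Whether this dataset contains such cells is not something this post establishes, but as a precaution the documents can be filtered before indexing. The helper clean_documents below is a hypothetical sketch, not a Marqo function.

```python
def clean_documents(documents):
    """Keep only documents whose fields are all non-empty strings."""
    cleaned = []
    for doc in documents:
        # NaN values (floats) and blank strings both fail this check.
        if all(isinstance(value, str) and value.strip() for value in doc.values()):
            cleaned.append(doc)
    return cleaned


# Usage with the list built above:
# documents = clean_documents(documents)
```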

Then, we add these documents to our index. For this example, we'll only add 124 documents but feel free to change the settings below to add as many or as few as you'd like.

In this example, we use the add_documents function with mappings defined. This defines a multimodal field with weights for each field. For example, we weight the image_url more heavily (0.8) than the title and query (0.1 each). We then use this multimodal mapping to create our vectors.


n = 124  # Set the number of documents to process
batch_size = 64  # Define the batch size for uploading documents

for i in range(0, min(len(documents), n), batch_size):  # Loop through up to 'n' documents in batches
    batch = documents[i:min(i + batch_size, n)]  # Select a batch without going past the first 'n' documents

    marqo_client.index(index_name).add_documents(  # Add the batch of documents to the Marqo index
        batch,
        client_batch_size=batch_size,  # Set the batch size for the client
        mappings={
            "image_title_multimodal": {  # Define a multimodal field combining image, title, and queries
                "type": "multimodal_combination",  # Set the field type as multimodal
                "weights": {"title": 0.1, "query": 0.1, "image_url": 0.8},  # Assign weights to each field
            }
        },
        tensor_fields=["image_title_multimodal"],  # Specify fields for tensor generation
    )
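
When the document limit is not a multiple of the batch size, the last batch needs to be shorter than the rest. A small generator makes that explicit; iter_batches is a hypothetical helper written for this post, shown here as a sketch.

```python
def iter_batches(items, batch_size, limit=None):
    """Yield consecutive batches covering at most `limit` items."""
    selected = items if limit is None else items[:limit]
    for start in range(0, len(selected), batch_size):
        yield selected[start:start + batch_size]


# With 124 documents and a batch size of 64, the batches hold 64 and 60 items.
sizes = [len(batch) for batch in iter_batches(list(range(200)), 64, limit=124)]
print(sizes)
```

Each yielded batch could then be passed to add_documents in place of the manual slicing.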

After adding documents, it is helpful to retrieve the index statistics to see how many documents and vectors the index contains.


results = marqo_client.index(index_name).get_stats()  # Retrieve statistics for the specified Marqo index
print(results)  # Print the index statistics to view details such as document count and memory usage
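
To compare the statistics against what you expected to index, a small summary helper can be useful. This sketch assumes the returned dictionary contains the keys numberOfDocuments and numberOfVectors; check the printed output above to confirm the key names for your Marqo version. The helper summarize_stats is hypothetical.

```python
def summarize_stats(stats, expected_documents):
    """Summarize index statistics against an expected document count."""
    # Key names are an assumption; adjust them to match your printed stats.
    documents = stats.get("numberOfDocuments", 0)
    vectors = stats.get("numberOfVectors", 0)
    return {
        "documents": documents,
        "vectors": vectors,
        "missing_documents": max(expected_documents - documents, 0),
    }


# Usage with the client from this post:
# print(summarize_stats(marqo_client.index(index_name).get_stats(), 124))
```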

Search Over Your Index

Now we have added documents to our index, we can perform searches. The search_marqo function below takes three inputs:

query: the input search query, e.g. 'a t-shirt'
more_of: what you want to see more of, e.g. 'stripes'
less_of: what you want to see less of, e.g. 'green'

When this function is called, it prints the image URL of the top search result and displays the image. Note that in this example we have only added 124 documents. For better results, add more documents to your index.


import requests  # Import the requests library for handling HTTP requests
import io  # Import io for handling byte streams
from PIL import Image  # Import PIL's Image module for image processing
import matplotlib.pyplot as plt  # Import matplotlib for plotting

def search_marqo(query, more_of, less_of):
    query_weights = {query: 1.0}  # Assign a weight of 1.0 to the main query
    if more_of:
        query_weights[more_of] = 0.75  # Apply a positive weight to emphasize additional themes
    if less_of:
        query_weights[less_of] = -1.1  # Apply a negative weight to de-emphasize certain themes

    # Perform search on the Marqo index
    res = marqo_client.index(index_name).search(query_weights, limit=10)  # Limit results to top 10

    # Display and plot the image URL of the top result
    if res['hits']:
        top_result = res['hits'][0]  # Get the top result
        top_image_url = top_result.get('image_url', 'No Image URL')  # Extract the image URL of the top result
        print(f"Top result image URL: {top_image_url}")  # Print the top image URL

        # Plot the image from the top result's URL
        if top_image_url != 'No Image URL':
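            response = requests.get(top_image_url, timeout=10)  # Fetch the image, giving up after 10 seconds
            response.raise_for_status()  # Raise an error if the download failed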
            image = Image.open(io.BytesIO(response.content))  # Open the image
            plt.imshow(image)  # Plot the image
            plt.axis('off')  # Hide axes
            plt.title(f"Top Result: {top_result.get('title', 'No Title')}")  # Add a title
            plt.show()  # Display the plot
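
To see what the weighted query looks like before it is sent to the index, the weight-building step can be pulled out into its own function. This is a sketch that mirrors the weights used in search_marqo; the name build_query_weights is hypothetical.

```python
def build_query_weights(query, more_of=None, less_of=None):
    """Build the weighted-query dictionary used for the search call."""
    weights = {query: 1.0}  # The main query always has weight 1.0
    if more_of:
        weights[more_of] = 0.75  # Positive weight: show more of this
    if less_of:
        weights[less_of] = -1.1  # Negative weight: show less of this
    return weights


# A striped t-shirt that is not green, using the examples from the text.
print(build_query_weights("a t-shirt", more_of="stripes", less_of="green"))
```

Keeping this step separate also makes it straightforward to experiment with different weights without touching the plotting code.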

Let’s perform a search!


query = "top"
more_of = None
less_of = None

search_marqo(query, more_of, less_of)

This returns:

Great! A top has been returned by our Marqo index. We suggest that you continue to add more documents to your index to further refine your search results. Note, you can query your index while adding documents.

Clean Up

When you are done with the index you can delete it with the following code:


marqo_client.delete_index(index_name)

If you do not delete your index you will continue to be charged for it.
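
For short experiments, one way to avoid forgetting the clean-up step is to wrap index creation and deletion in a context manager. The sketch below uses only the client calls already shown in this post (create_index, index, and delete_index); the helper temporary_index itself is hypothetical, and it does not wait for the index to become ready.

```python
from contextlib import contextmanager


@contextmanager
def temporary_index(client, index_name, settings):
    """Create an index, yield it, and delete it even if an error occurs."""
    client.create_index(index_name=index_name, settings_dict=settings)
    try:
        yield client.index(index_name)
    finally:
        # Runs on normal exit and on exceptions, so the index is not left behind.
        client.delete_index(index_name)


# Usage with the client and settings from this post:
# with temporary_index(marqo_client, "ecommerce-search", settings) as index:
#     print(index.get_stats())
```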

Conclusion

Marqo provides a fully-fledged, end-to-end platform for search and recommendations. In this article, we created a dataset, fine-tuned a base model, and deployed this on Marqo Cloud ready for production. Training and deploying AI models has never been easier!