We've developed two new state-of-the-art foundation models for generating multimodal product embeddings from images and text: Marqo-Ecommerce-B and Marqo-Ecommerce-L. These models outperform existing solutions like Amazon Titan Multimodal Embeddings by up to 88% and the best open source model (ViT-SO400M-14-SigLIP) by up to 31%.
We benchmarked a number of embedding models for multimodal product retrieval. These included the base models ViT-B-16-SigLIP and ViT-L-16-SigLIP, as well as the best open source CLIP/SigLIP model, ViT-SO400M-14-SigLIP. We also included API-based multimodal embeddings offered by Amazon-Titan-Multimodal, GCP-Vertex, Jina-V1-CLIP, and Cohere-Embedding-v3, although there are limitations on some of these private providers (see below for more details).
Our benchmarking process was divided into two distinct regimes, each using a different dataset of ecommerce product listings: marqo-ecommerce-hard and marqo-ecommerce-easy. Both datasets contained product images and text and only differed in size. The "easy" dataset is approximately 10-30 times smaller (200k vs 4M products) and is designed to accommodate rate-limited models, specifically Cohere-Embeddings-v3 and GCP-Vertex (with limits of 0.66 rps and 2 rps respectively). The "hard" dataset represents the true challenge, since it contains four million ecommerce product listings and is more representative of real-world ecommerce search scenarios. For both the marqo-ecommerce-hard and marqo-ecommerce-easy datasets, the models were benchmarked on three different tasks: GoogleShopping-Text2Image retrieval, GoogleShopping-Category2Image retrieval, and AmazonProducts-Text2Image retrieval.
We have made these datasets available on Hugging Face along with scripts to reproduce the evaluation.
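For example, the Google Shopping evaluation data can be pulled straight from Hugging Face with the datasets library. This is a minimal sketch, assuming the dataset repository name used in our GCL evaluation command later in this post; if the dataset defines multiple configurations, you may need to pass one explicitly.

from datasets import load_dataset

# Minimal sketch: load the Google Shopping evaluation dataset from Hugging Face.
# The repository name matches the one used in the GCL evaluation command below.
dataset = load_dataset("Marqo/google-shopping-general-eval")
print(dataset)  # inspect the available splits and columns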
The benchmarking results show that the Marqo-Ecommerce models consistently outperformed all other models across various metrics. Specifically, marqo-ecommerce-L achieved an average improvement of 17.6% in MRR and 20.5% in nDCG@10 when compared with the current best open source model, ViT-SO400M-14-SigLIP, across all three tasks in the marqo-ecommerce-hard dataset. When compared with the best private model, Amazon-Titan-Multimodal, we saw an average improvement of 38.9% in MRR and 45.1% in nDCG@10 across all three tasks, and 35.9% in Recall across the Text-to-Image tasks in the marqo-ecommerce-hard dataset.
While contrastive learning models like CLIP and SigLIP are powerful, they are not optimized for the needs of ecommerce. They were trained on a large collection of images, many of which aren't related to ecommerce, with little curation or domain specificity. The product data in ecommerce datasets differs significantly from general-purpose datasets, resulting in suboptimal performance when these models are used for search and recommendations. Additionally, these models were trained on data that is now several years old, and they have no understanding of recent products or trends.
We built the Marqo-Ecommerce-B and Marqo-Ecommerce-L models, which excel at ecommerce search, retrieval, and recommendation tasks. The models were trained on hundreds of millions of samples from ~50 million unique products across 20,000 Amazon ASIN categories, spanning everything from appliances and automotive to office products and pet supplies. The models were evaluated on extensive benchmark datasets that span over 4 million unique products covering the same 20,000 categories, which are taken from Amazon's product taxonomy.
The Marqo-Ecommerce embedding models are designed specifically to work seamlessly with Marqo Cloud, our end-to-end embeddings platform. Additionally, you can fine-tune our embedding models on your own product catalogs and user behavior using Marqtune, our embedding model training platform backed by our contrastive learning framework, GCL.
If you're building an ecommerce site, here's how Marqo's new models can help you:
These performance gains over the best open source model (ViT-SO400M-14-SigLIP) and best private model (Amazon-Titan Multimodal) highlight the potential of Marqo's ecommerce-specific models to significantly enhance real-world ecommerce applications, leading to improved customer satisfaction, higher conversion rates, and increased revenue for online retailers.
We've released both of our models, Marqo-Ecommerce-B and Marqo-Ecommerce-L, on Hugging Face. The B model is smaller and faster at inference (5.1 ms for single-batch text and 5.7 ms for a single image) and has a smaller embedding dimension (768). The L model is larger (652M parameters) and has a larger embedding dimension (1024), but offers better retrieval performance: Marqo-Ecommerce-L achieves up to a 7.3% average improvement in MRR and 7.4% in nDCG@10 over Marqo-Ecommerce-B across the three tasks on the 4M evaluation dataset.
Before we show you how to use these models in Marqo Cloud and/or Hugging Face, let's first take a look at their performance against existing, state-of-the-art embedding models. If you want to jump straight into using the models, skip ahead to Deploy Marqo-Ecommerce Models on Marqo Cloud or Loading Marqo-Ecommerce Models.
Here are the detailed results for three general ecommerce retrieval tasks. These tasks measure the performance of various embedding models in retrieving images based on long and short text descriptions and categories. We focus on Precision, Recall, MRR (Mean Reciprocal Rank), and nDCG to showcase how our Marqo-Ecommerce models stack up against existing solutions, such as Amazon Titan Multimodal and other popular open-weights SigLIP ViT models from Google.
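If you're less familiar with these metrics, the sketch below (illustrative only; our actual evaluations use the GCL scripts linked later in this post) shows how Recall@10 and reciprocal rank are computed for a single query with one relevant item. MRR is simply the mean of the reciprocal ranks over all queries.

from typing import List

def recall_at_k(ranked_ids: List[str], relevant_id: str, k: int = 10) -> float:
    # 1.0 if the relevant item appears in the top-k results, else 0.0
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids: List[str], relevant_id: str) -> float:
    # 1 / rank of the relevant item, or 0.0 if it was not retrieved at all
    if relevant_id in ranked_ids:
        return 1.0 / (ranked_ids.index(relevant_id) + 1)
    return 0.0

# Hypothetical example: the relevant image is ranked 3rd for this query
ranked = ["img_7", "img_2", "img_5", "img_9"]
print(recall_at_k(ranked, "img_5", k=10))  # 1.0
print(reciprocal_rank(ranked, "img_5"))    # 0.3333...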
As previously noted, our benchmarking process was structured around two distinct scenarios: marqo-ecommerce-hard and marqo-ecommerce-easy. This section looks at the comprehensive evaluation conducted on the former, using the full 4 million products across its two datasets.
In this task, we evaluate how well models retrieve relevant images when given text descriptions. This dataset has 1 million image-title pairs.
The benchmark results are:
Table 2. Benchmark results for GoogleShopping-Text2Image Retrieval task in Marqo-Ecommerce-Hard.
These results demonstrate the advantage of Marqo’s ecommerce models over other leading models in the GoogleShopping-Text2Image Retrieval task. Marqo-Ecommerce-L and Marqo-Ecommerce-B achieved top performance across key metrics, with Marqo-Ecommerce-L scoring a relative improvement of 43.7% in MRR and 35.4% in Recall@10 over Amazon-Titan-Multimodal, and 19% in MRR and 15% in Recall@10 over ViT-SO400M-14-SigLIP.
For this task, we assess the model's ability to retrieve images that correspond to a particular product category. Using the same split of 1 million image-title pairs, we evaluate how well each model can associate text inputs with categories rather than specific product titles. Categories typically consist of a few words and are shorter than titles. While the Text2Image queries have exactly one corresponding image, in the Category2Image task multiple images can correspond to a single category.
The benchmark results are:
Again, the Marqo-Ecommerce-L model has the highest scores across all metrics, with Marqo-Ecommerce-B also outperforming all other models. This includes an improvement of 88% in mAP, 52% in Precision@10, and 49.3% in nDCG@10 over Amazon-Titan. When compared with the best open source model, ViT-SO400M-14-SigLIP, we see an improvement of 31.5% in mAP, 26.4% in Precision@10, 16.3% in MRR, and 25.9% in nDCG@10. These results demonstrate how well Marqo-Ecommerce models can recognize product categories and retrieve relevant images from large datasets.
In this task, we scaled the evaluation to 3 million image-title pairs taken from the Amazon-products dataset. This test focuses on the challenge of finding the correct image based on a product's title, simulating a real-world ecommerce environment where users search for products based on text queries.
The benchmark results are:
We can see from both Figure 4 and Table 4 that the Marqo-Ecommerce embedding models led on this task too, with an improvement of 36% in Recall@10, 45% in MRR, and 43% in nDCG@10 when compared to the best private model, Amazon-Titan. When compared with the best open source model, ViT-SO400M-14-SigLIP, we saw an improvement of 15% in Recall@10 and 17% in both MRR and nDCG@10.
As mentioned, our benchmarking process was divided into two distinct scenarios: marqo-ecommerce-hard and marqo-ecommerce-easy. This section covers the latter, which features a corpus 10-30 times smaller and was designed to accommodate rate-limited models. We will look at the comprehensive evaluation conducted using the full 200k products across the two datasets. In addition to the models already benchmarked above, these benchmarks include Cohere-embedding-v3 and GCP-Vertex.
In this task, we evaluate how well models retrieve relevant images when given text descriptions. For this, we selected 100k samples from the original 1M image-title pairs.
The benchmark results are:
Again, the Marqo-Ecommerce-L model has the highest scores across all metrics, with Marqo-Ecommerce-B also outperforming all other models. This includes an improvement of 11.8% in Recall@10, 26.8% in MRR, and 22.9% in nDCG@10 when compared with Amazon-Titan. Against ViT-SO400M-14-SigLIP, we saw an improvement of 3.9% in Recall@10, 11% in MRR, and 9.2% in nDCG@10.
For this task, we assess the model's ability to retrieve images that correspond to a particular product category, using 100k samples selected from the original 1M image-title pairs.
The benchmark results are:
For the GoogleShopping-Category2Image Retrieval task, there is an improvement of 67% in mAP, 55% in Precision@10, 36.9% in MRR, and 56.5% in nDCG@10 when compared with the best private model, Amazon-Titan. Compared with the best open source model, ViT-SO400M-14-SigLIP, we saw an improvement of 21.7% in mAP, 18.5% in Precision@10, 18.6% in MRR, and 21.1% in nDCG@10.
In this task, we use 100k image-title pairs from the Amazon-Products dataset.
The benchmark results are:
For the AmazonProducts-Text2Image Retrieval task, there is an improvement of 10% in Recall@10, 36.9% in MRR, and 18.8% in nDCG@10 when compared with the best private model, Amazon-Titan. When compared with the best open source model, ViT-SO400M-14-SigLIP, we saw an improvement of 2.5% in Recall@10, 7.9% in MRR, and 6.6% in nDCG@10.
If you’re ready to integrate Marqo-Ecommerce-B or Marqo-Ecommerce-L into your application, here’s how you can load them using Hugging Face's transformers library in Python:
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image
import requests
# Choose your model: 'Marqo/marqo-ecommerce-embeddings-L' or 'Marqo/marqo-ecommerce-embeddings-B'
model_name = 'Marqo/marqo-ecommerce-embeddings-B'
# Load the model and processor
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
# Load an image and text for testing
img = Image.open(requests.get('https://raw.githubusercontent.com/marqo-ai/marqo-ecommerce-embeddings/refs/heads/main/images/dining-chairs.png', stream=True).raw).convert("RGB")
image = [img]
text = ["dining chairs", "a laptop", "toothbrushes"]
# Process the inputs
processed = processor(text=text, images=image, padding='max_length', return_tensors="pt")
processor.image_processor.do_rescale = False
# Perform inference
with torch.no_grad():
    image_features = model.get_image_features(processed['pixel_values'], normalize=True)
    text_features = model.get_text_features(processed['input_ids'], normalize=True)
# Compute the similarity probabilities
text_probs = (100 * image_features @ text_features.T).softmax(dim=-1)
print(text_probs)
# Output: [1.0000e+00, 8.3131e-12, 5.2173e-12]
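Because the embeddings are returned with normalize=True, they are unit-length, so a plain dot product between them is a cosine similarity. As a minimal follow-on sketch (the "catalog" here is just the single image embedded above, and the variable names are illustrative), you could rank a set of image embeddings against a text query like this:

# Treat the image embeddings as a tiny in-memory catalog and use the first
# text embedding ("dining chairs") as the query.
catalog_embeddings = image_features        # shape: (num_images, embedding_dim)
query_embedding = text_features[0]         # shape: (embedding_dim,)

# Dot product of unit-length vectors = cosine similarity
scores = catalog_embeddings @ query_embedding
best_match = scores.argmax().item()
print(f"Best match: index {best_match}, score {scores[best_match].item():.4f}")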
You can also load the models using the OpenCLIP framework. Here’s how to do it:
from PIL import Image
import open_clip
import requests
import torch
# Specify model from Hugging Face Hub: 'hf-hub:Marqo/marqo-ecommerce-embeddings-L' or 'hf-hub:Marqo/marqo-ecommerce-embeddings-B'
model_name = 'hf-hub:Marqo/marqo-ecommerce-embeddings-L'
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(model_name)
tokenizer = open_clip.get_tokenizer(model_name)
# Preprocess the image and tokenize text inputs
# Load an example image from a URL
img = Image.open(requests.get('https://raw.githubusercontent.com/marqo-ai/marqo-ecommerce-embeddings/refs/heads/main/images/dining-chairs.png', stream=True).raw)
image = preprocess_val(img).unsqueeze(0)
text = tokenizer(["dining chairs", "a laptop", "toothbrushes"])
# Perform inference
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
# Calculate similarity probabilities
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# Display the label probabilities
print("Label probs:", text_probs)
# [1.0000e+00, 8.3131e-12, 5.2173e-12]
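Note that we use preprocess_val, the deterministic evaluation transform returned by open_clip.create_model_and_transforms, rather than preprocess_train; the training transform typically includes random augmentations and is not intended for inference.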
These models are available in both Marqo Cloud and Marqo open source, the end-to-end vector search engine. Here’s how you can implement them yourself.
import marqo
# To obtain your API Key, visit https://www.marqo.ai/blog/finding-my-marqo-api-key
api_key = "your_api_key"
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)
# Alternatively, if you want to run Marqo locally with Docker (see https://github.com/marqo-ai/marqo for more information):
# mq = marqo.Client("http://localhost:8882", api_key=None)
# Define settings for creating an index
settings = {
    "type": "unstructured",  # Specify the type of data to be indexed
    "model": "Marqo/marqo-ecommerce-embeddings-B",  # Name of the embedding model to be used
    "modelProperties": {
        "name": "hf-hub:Marqo/marqo-ecommerce-embeddings-B",  # Full name of the model on Hugging Face Hub
        "dimensions": 768,  # Dimensions of the embeddings
        "type": "open_clip"  # Type of the model
    },
    "treatUrlsAndPointersAsImages": True,  # Treat URLs and pointers as images for indexing
}
# Create a new index called "marqo-ecommerce-b" with the specified settings
mq.create_index("marqo-ecommerce-b", settings_dict=settings)
import marqo
# To obtain your API Key, visit https://www.marqo.ai/blog/finding-my-marqo-api-key
api_key = "your_api_key"
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)
# Alternatively, if you want to run Marqo locally with Docker (see https://github.com/marqo-ai/marqo for more information):
# mq = marqo.Client("http://localhost:8882", api_key=None)
# Define settings for creating an index
settings = {
    "type": "unstructured",  # Specify the type of data to be indexed
    "model": "Marqo/marqo-ecommerce-embeddings-L",  # Name of the embedding model to be used
    "modelProperties": {
        "name": "hf-hub:Marqo/marqo-ecommerce-embeddings-L",  # Full name of the model on Hugging Face Hub
        "dimensions": 1024,  # Dimensions of the embeddings
        "type": "open_clip"  # Type of the model
    },
    "treatUrlsAndPointersAsImages": True,  # Treat URLs and pointers as images for indexing
}
# Create a new index called "marqo-ecommerce-l" with the specified settings
mq.create_index("marqo-ecommerce-l", settings_dict=settings)
For a guide on how you can build your own Ecommerce Search Application with these models, visit our article.
GCL (Generalized Contrastive Learning) is a framework, built by Marqo, that's designed to go beyond binary relevance and leverage fine-grained rankings for multimodal retrieval tasks.
Here’s how to run the evaluation. First, install git if you don't already have it, and then run the following command from your terminal:
git clone https://github.com/marqo-ai/GCL
Install the packages required by GCL and then input the following into your terminal:
cd ./GCL
MODEL=hf-hub:Marqo/marqo-ecommerce-B
outdir=/MarqoModels/GE/marqo-ecommerce-B/gs-title2image2
hfdataset=Marqo/google-shopping-general-eval
python evals/eval_hf_datasets_v1.py \
    --model_name $MODEL \
    --hf-dataset $hfdataset \
    --output-dir $outdir \
    --batch-size 1024 \
    --num_workers 8 \
    --left-key "['title']" \
    --right-key "['image']" \
    --img-or-txt "[['txt'], ['img']]" \
    --left-weight "[1]" \
    --right-weight "[1]" \
    --run-queries-cpu \
    --top-q 4000 \
    --doc-id-key item_ID \
    --context-length "[[64], [0]]"
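In this command, product titles (--left-key) serve as the text queries and product images (--right-key) as the retrieval corpus, mirroring the GoogleShopping-Text2Image task described above; the remaining flags control batching, worker count, and how retrieved items are matched to their ground-truth IDs (--doc-id-key).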
All of the scripts to perform evaluations with GCL can be found on our GitHub.
With the release of Marqo-Ecommerce-B and Marqo-Ecommerce-L, ecommerce platforms now have access to powerful, purpose-built embedding models that outperform existing solutions by up to 88%. These models are specifically tailored for the unique challenges of ecommerce, delivering highly accurate retrieval results, whether it's matching product titles to images or associating products with broader categories. The Marqo-Ecommerce models are set to transform search, retrieval, and recommendation tasks in the ecommerce industry.
Try out the following for yourself:
To learn more about how we can help drive revenue to your business, book a demo with us or read more in our Redbubble Case study.