Marqo Launches Family of Embedding Models for Ecommerce and Retail
Updated on April 14, 2026
Overview
Marqo-FashionCLIP and Marqo-FashionSigLIP are 150M parameter embedding models that:
- Outperform FashionCLIP2.0 and OpenFashionCLIP on all benchmarks (up to +57%)
- Outperform ViT-B-16-SigLIP on all benchmarks (Marqo-FashionSigLIP only)
- Run inference 10% faster than FashionCLIP2.0 and OpenFashionCLIP
- Use Generalized Contrastive Learning (GCL) to optimize over seven fashion-specific aspects: descriptions, titles, colors, details, categories, keywords, and materials
Both models are released under the Apache 2.0 license and available from Hugging Face and Marqo Cloud.
What are Marqo-FashionCLIP and Marqo-FashionSigLIP?
These are state-of-the-art multimodal models for search and recommendations in fashion. They produce embeddings for both text and images that downstream search and recommendation applications can use. Both were trained on over 1M fashion products with rich metadata and fine-tuned with GCL from base models (ViT-B-16-laion and ViT-B-16-SigLIP-webli).
The loss function contains seven components optimizing for long descriptions, product titles, colors, materials, categories, details, and keywords. This multi-part loss significantly outperformed standard text-image InfoNCE loss in contrastive learning, enabling retrieval of relevant results for both short keyword text and longer descriptive text.
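To make the idea of a multi-part loss concrete, here is a minimal sketch of one InfoNCE term per metadata field, combined as a weighted sum. It is an illustration only, not Marqo's GCL implementation: the equal weights, the temperature of 0.07, the image-to-text-only direction, and the function names `info_nce` and `multi_field_loss` are all assumptions made for this example.

```python
import math

# The seven fields named in the post; everything else here is illustrative.
FIELDS = ["descriptions", "titles", "colors", "details",
          "categories", "keywords", "materials"]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE: image i should score highest against text i."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [dot(img, txt) / temperature for txt in text_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)

def multi_field_loss(image_embs, text_embs_by_field, weights, temperature=0.07):
    """Weighted sum of one InfoNCE term per metadata field (assumed form)."""
    return sum(
        weights[field] * info_nce(image_embs, texts, temperature)
        for field, texts in text_embs_by_field.items()
    )

# Toy batch of two products with 2-dimensional unit embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = {field: [[1.0, 0.0], [0.0, 1.0]] for field in FIELDS}
mismatched = {field: [[0.0, 1.0], [1.0, 0.0]] for field in FIELDS}
weights = {field: 1.0 / len(FIELDS) for field in FIELDS}  # assumed equal weights

loss_matched = multi_field_loss(images, matched, weights)
loss_mismatched = multi_field_loss(images, mismatched, weights)
print(loss_matched < loss_mismatched)  # True
```

The loss is near zero when every field's text embedding lines up with its own image and grows when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.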
Performance Metrics
Retrieval Results
Evaluation across seven publicly available fashion datasets showed substantial improvements:
Text-to-Image Task:
- Marqo-FashionCLIP: +22% recall@1 improvement vs FashionCLIP2.0
- Marqo-FashionSigLIP: +57% recall@1 improvement vs FashionCLIP2.0
Category-to-Product Task:
- Marqo-FashionCLIP: +8% precision@1 improvement
- Marqo-FashionSigLIP: +11% precision@1 improvement
Sub-Category-to-Product Task:
- Marqo-FashionCLIP: +11% precision@1 improvement
- Marqo-FashionSigLIP: +13% precision@1 improvement
Inference Speed
Both models are 10% faster than existing fashion-specific models for combined text and image inference.
Evaluation Datasets
Seven publicly available datasets were used:
- DeepFashion (In-shop): 52,591 fashion images
- DeepFashion (Multimodal): 42,537 fashion images
- Fashion200K: 201,624 fashion images
- KAGL: 44,434 fashion images
- Atlas: 78,370 fashion images
- Polyvore: 94,096 fashion images
- iMaterialist: 721,065 fashion images
Evaluation Tasks
Three main multimodal retrieval tasks were assessed:
- Text-to-Image: Uses unique passages of text representative of longer descriptive queries (tail queries)
- Category-to-Product: Represents shorter keyword-like queries with potentially multiple correct results (head queries)
- Sub-Category-to-Product: More specific category-based retrieval
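Retrieval tasks like these are typically run by embedding the query text and every product image, then ranking the products by cosine similarity. The sketch below shows that ranking step in plain Python. The 3-dimensional vectors are toy stand-ins for real embeddings, and `l2_normalize` and `rank_by_cosine` are hypothetical helper names, not part of any Marqo API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_by_cosine(query_vec, item_vecs):
    """Return (item_index, score) pairs sorted from most to least similar."""
    q = l2_normalize(query_vec)
    scores = []
    for idx, item in enumerate(item_vecs):
        v = l2_normalize(item)
        scores.append((idx, sum(a * b for a, b in zip(q, v))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for text and image embeddings.
text_query = [0.9, 0.1, 0.0]      # e.g. the embedding of a query string
product_images = [
    [0.0, 1.0, 0.0],              # product 0
    [1.0, 0.0, 0.0],              # product 1
    [0.5, 0.5, 0.0],              # product 2
]

ranking = rank_by_cosine(text_query, product_images)
print([idx for idx, _ in ranking])  # [1, 2, 0]
```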
Metrics Used
- Text-to-image: Recall@1, Recall@10, Average Recall, Mean Reciprocal Rank (MRR)
- Category/Sub-category tasks: Precision@1, Precision@10, Average Precision, MRR
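As a rough illustration of how the rank-based metrics above are computed, the sketch below implements Recall@k, Precision@k, and MRR for a list of ranked results. The product IDs and the two example queries are invented for illustration. The post does not spell out how Average Recall and Average Precision are aggregated, so they are left out here.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_over_queries(metric, runs, **kwargs):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, rel, **kwargs) for ranked, rel in runs) / len(runs)

# Hypothetical results for two queries.
runs = [
    (["p3", "p1", "p7"], {"p3"}),        # tail query: one correct product, ranked first
    (["p2", "p5", "p9"], {"p5", "p9"}),  # head query: two correct products, first at rank 2
]

print(mean_over_queries(recall_at_k, runs, k=1))     # 0.5
print(mean_over_queries(precision_at_k, runs, k=1))  # 0.5
print(mean_over_queries(reciprocal_rank, runs))      # 0.75
```

Recall suits the text-to-image task, where each query has one target, while precision suits the category tasks, where many products can be correct for the same query.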
Getting Started
Deploy on Marqo Cloud (Recommended)
1. Sign up to Marqo Cloud
2. Install Marqo:
```bash
pip install marqo
```
3. Create an index:
```python
import marqo

settings = {
    "type": "unstructured",
    "model": "marqo-fashion-clip",  # model name
    "modelProperties": {
        "name": "ViT-B-16",  # model architecture
        "dimensions": 512,  # embedding dimensions
        "url": "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqo-fashionCLIP/marqo_fashionCLIP.pt",  # model weights
        "type": "open_clip",  # loading library
    },
    "treatUrlsAndPointersAsImages": True,
}

api_key = "your_api_key"
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)

mq.create_index("fashion-index", settings_dict=settings)
mq.index("fashion-index").search("black dress")
```
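A search only returns hits once the index holds documents. The sketch below shows one way to add a few products and read back the top results. It assumes the Marqo Python client's `add_documents(..., tensor_fields=...)` and `search(..., limit=...)` methods. The field names `title` and `image`, the SKU values, the example URLs, and the helpers `build_documents` and `index_and_search` are hypothetical.

```python
def build_documents(products):
    """Turn (id, title, image_url) tuples into Marqo document dicts."""
    return [
        {"_id": pid, "title": title, "image": image_url}
        for pid, title, image_url in products
    ]

def index_and_search(client, index_name, documents, query, limit=5):
    """Add documents to an existing index, then return the IDs of the top hits."""
    client.index(index_name).add_documents(
        documents,
        tensor_fields=["title", "image"],  # fields to embed; names are hypothetical
    )
    results = client.index(index_name).search(query, limit=limit)
    return [hit["_id"] for hit in results["hits"]]

# Hypothetical catalog entries; replace the URLs with real product images.
products = [
    ("sku-1", "Black satin slip dress", "https://example.com/images/sku-1.jpg"),
    ("sku-2", "White canvas sneakers", "https://example.com/images/sku-2.jpg"),
]
documents = build_documents(products)
```

With the `mq` client and `fashion-index` created above, a call would look like `index_and_search(mq, "fashion-index", documents, "black dress")`.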
Use from Hugging Face
```bash
pip install open_clip_torch
```
```python
import open_clip
import torch
from PIL import Image

# Load the model, image preprocessing transform, and tokenizer from the Hugging Face Hub.
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')

image = preprocess(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
Conclusion
Marqo-FashionCLIP and Marqo-FashionSigLIP represent new state-of-the-art multimodal embedding models for fashion domain search and recommendations. With performance improvements up to 57% on text-to-image retrieval and 11% on category-to-product tasks, combined with faster inference, they offer significant advantages for ecommerce applications.
Resources
- Hugging Face: https://huggingface.co/collections/Marqo/marqo-fashionclip-and-marqo-fashionsiglip-66b43f2d09a06ad2368d4af6
- GitHub: https://github.com/marqo-ai/marqo-FashionCLIP
- Generalized Contrastive Learning: https://github.com/marqo-ai/GCL
