April 14, 2026

Marqo Launches Family of Embedding Models for Ecommerce and Retail

Updated on April 14, 2026

Jesse Clark, Co-Founder & CTO, Marqo


Overview

Marqo-FashionCLIP and Marqo-FashionSigLIP are 150M parameter embedding models that:

  • Outperform FashionCLIP2.0 and OpenFashionCLIP on all benchmarks (up to +57% recall@1 on text-to-image retrieval)
  • Outperform ViT-B-16-SigLIP on all benchmarks (Marqo-FashionSigLIP only)
  • Are 10% faster for inference than FashionCLIP2.0 and OpenFashionCLIP
  • Use Generalized Contrastive Learning (GCL) to optimize over seven fashion-specific aspects: descriptions, titles, colors, details, categories, keywords, and materials

Both models are released under the Apache 2.0 license and available from Hugging Face and Marqo Cloud.

What is Marqo-FashionCLIP & Marqo-FashionSigLIP?

These are state-of-the-art multimodal models for fashion search and recommendations. They produce embeddings for both text and images, which downstream search and recommendation applications can compare directly. Both models were fine-tuned with GCL from base models (ViT-B-16-laion and ViT-B-16-SigLIP-webli) on over 1M fashion products with rich metadata.

The loss function contains seven components optimizing for long descriptions, product titles, colors, materials, categories, details, and keywords. This multi-part loss significantly outperformed standard text-image InfoNCE loss in contrastive learning, enabling retrieval of relevant results for both short keyword text and longer descriptive text.
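The idea of a multi-part loss can be sketched as a weighted sum of per-field contrastive terms. The snippet below is a minimal pure-Python illustration, not Marqo's GCL implementation: the function names, the field weights, the temperature of 0.07, and the one-directional (image-to-text) form of InfoNCE are all assumptions made for illustration, and the vectors are assumed to be unit-normalized.

```python
import math

def info_nce(image_vecs, text_vecs, temperature=0.07):
    """Image-to-text InfoNCE over a batch; row i of each list is a matching pair.

    Assumption: one direction only, for brevity. CLIP-style training
    usually averages this with the text-to-image direction.
    """
    losses = []
    for i, img in enumerate(image_vecs):
        # Similarity of image i to every text in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature
                  for txt in text_vecs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching (diagonal) text.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def multi_field_loss(image_vecs, field_text_vecs, field_weights):
    """Weighted sum of one InfoNCE term per text field (weights are hypothetical)."""
    total = 0.0
    for field, text_vecs in field_text_vecs.items():
        total += field_weights[field] * info_nce(image_vecs, text_vecs)
    return total

# Toy batch of two products with two of the seven fields.
images = [[1.0, 0.0], [0.0, 1.0]]
fields = {
    "title": [[1.0, 0.0], [0.0, 1.0]],  # text embeddings aligned with the images
    "color": [[0.0, 1.0], [1.0, 0.0]],  # text embeddings swapped (misaligned)
}
weights = {"title": 0.5, "color": 0.5}  # hypothetical per-field weights
print(multi_field_loss(images, fields, weights))
```

In this toy batch the aligned "title" field contributes almost nothing to the loss, while the misaligned "color" field dominates it, which is the signal that would push a model to improve on that aspect.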

Performance Metrics

Retrieval Results

Evaluation across seven publicly available fashion datasets showed substantial improvements:

Text-to-Image Task:

  • Marqo-FashionCLIP: +22% recall@1 improvement vs FashionCLIP2.0
  • Marqo-FashionSigLIP: +57% recall@1 improvement vs FashionCLIP2.0

Category-to-Product Task:

  • Marqo-FashionCLIP: +8% precision@1 improvement
  • Marqo-FashionSigLIP: +11% precision@1 improvement

Sub-Category-to-Product Task:

  • Marqo-FashionCLIP: +11% precision@1 improvement
  • Marqo-FashionSigLIP: +13% precision@1 improvement

Inference Speed

Both models are 10% faster than existing fashion-specific models for combined text and image inference.

Evaluation Datasets

Seven publicly available datasets were used:

  1. DeepFashion (In-shop): 52,591 fashion images
  2. DeepFashion (Multimodal): 42,537 fashion images
  3. Fashion200K: 201,624 fashion images
  4. KAGL: 44,434 fashion images
  5. Atlas: 78,370 fashion images
  6. Polyvore: 94,096 fashion images
  7. iMaterialist: 721,065 fashion images

Evaluation Tasks

Three main multimodal retrieval tasks were assessed:

  • Text-to-Image: Uses unique passages of text representative of longer descriptive queries (tail queries)
  • Category-to-Product: Represents shorter keyword-like queries with potentially multiple correct results (head queries)
  • Sub-Category-to-Product: More specific category-based retrieval

Metrics Used

  • Text-to-image: Recall@1, Recall@10, Average Recall, Mean Reciprocal Rank (MRR)
  • Category/Sub-category tasks: Precision@1, Precision@10, Average Precision, MRR
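For readers less familiar with these metrics, the sketch below shows how recall@k, precision@k, and MRR are commonly computed from a ranked result list. It is a generic illustration rather than the evaluation code behind the reported numbers; the function names, product IDs, and toy queries are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_over_queries(metric, runs, **kwargs):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, **kwargs) for ranked, relevant in runs) / len(runs)

# Two toy queries: (ranked results, set of relevant product IDs).
runs = [
    (["p3", "p1", "p7"], {"p1"}),        # first relevant item at rank 2
    (["p2", "p9", "p4"], {"p2", "p4"}),  # relevant items at ranks 1 and 3
]

print(mean_over_queries(recall_at_k, runs, k=1))     # 0.25
print(mean_over_queries(precision_at_k, runs, k=1))  # 0.5
print(mean_over_queries(reciprocal_rank, runs))      # 0.75 (MRR)
```

When a query has exactly one correct result, as with unique text passages, recall@1 reduces to the share of queries whose top result is correct. With several correct results per query, as in the category tasks, precision@k is the more natural measure.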

Getting Started

Deploy on Marqo Cloud (Recommended)

  1. Sign up to Marqo Cloud
  2. Install Marqo:

```
pip install marqo
```

  3. Create an index:

```python
import marqo

# Index settings: load Marqo-FashionCLIP as a custom open_clip model
settings = {
    "type": "unstructured",
    "model": "marqo-fashion-clip",
    "modelProperties": {
        "name": "ViT-B-16",
        "dimensions": 512,
        "url": "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqo-fashionCLIP/marqo_fashionCLIP.pt",
        "type": "open_clip",
    },
    "treatUrlsAndPointersAsImages": True,
}

api_key = "your_api_key"  # replace with your Marqo Cloud API key
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)

mq.create_index("fashion-index", settings_dict=settings)

# Run a text query against the new index
mq.index("fashion-index").search("black dress")
```

Use from Hugging Face

```
pip install open_clip_torch
```

```python
import open_clip
import torch
from PIL import Image

# Load the model, the image preprocessing transform, and the tokenizer from the Hugging Face Hub
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')

image = preprocess(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scaled similarities turned into a probability over the three labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
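The same normalize-then-dot-product step is what a search application would use to rank products for a query. The sketch below illustrates this in plain Python with made-up three-dimensional vectors standing in for real 512-dimensional embeddings; the function names, product IDs, and vector values are hypothetical, and a production system would typically use a vector index rather than a linear scan.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_products(query_vec, product_vecs, top_k=3):
    """Return the IDs of the top_k products most similar to the query embedding."""
    q = normalize(query_vec)
    scored = []
    for product_id, vec in product_vecs.items():
        p = normalize(vec)
        similarity = sum(a * b for a, b in zip(q, p))
        scored.append((similarity, product_id))
    # Highest cosine similarity first.
    scored.sort(reverse=True)
    return [product_id for _, product_id in scored[:top_k]]

# Toy "image embeddings" for three catalog items (values are invented).
catalog = {
    "black-dress": [0.9, 0.1, 0.0],
    "red-hat": [0.0, 1.0, 0.2],
    "sneakers": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]  # stand-in for the text embedding of a query

print(rank_products(query, catalog, top_k=2))  # ['black-dress', 'red-hat']
```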

Conclusion

Marqo-FashionCLIP and Marqo-FashionSigLIP represent new state-of-the-art multimodal embedding models for fashion domain search and recommendations. With performance improvements up to 57% on text-to-image retrieval and 11% on category-to-product tasks, combined with faster inference, they offer significant advantages for ecommerce applications.

Resources

  • Hugging Face: https://huggingface.co/collections/Marqo/marqo-fashionclip-and-marqo-fashionsiglip-66b43f2d09a06ad2368d4af6
  • GitHub: https://github.com/marqo-ai/marqo-FashionCLIP
  • Generalized Contrastive Learning: https://github.com/marqo-ai/GCL
