Marqo-FashionCLIP & Marqo-FashionSigLIP are 150M parameter embedding models that set a new state of the art for multimodal search and recommendations in the fashion domain, outperforming existing fashion-specific embedding models by up to +57%.
We are releasing Marqo-FashionCLIP and Marqo-FashionSigLIP under the Apache 2.0 license.
Marqo-FashionCLIP & Marqo-FashionSigLIP are two new state-of-the-art multimodal models for search and recommendations in the fashion domain. Both models produce embeddings for text and images that can be used in downstream search and recommendations applications. They were trained on over 1M fashion products with rich metadata containing detailed descriptions, colors, styles, keywords, and materials; this training data was not part of any of the evaluation datasets. The models were fine-tuned from two existing base models (ViT-B-16-laion and ViT-B-16-SigLIP-webli) using Generalized Contrastive Learning (GCL). The loss consisted of seven parts to optimize specifically for long descriptions, product titles, colors, materials, categories, details, and keywords. This multi-part loss significantly outperformed the standard text-image InfoNCE loss when fine-tuning, yielding models that retrieve more relevant results for both keyword-like short text and longer descriptive text - both of which are particularly relevant to search applications.
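To make the multi-part loss concrete, below is a minimal PyTorch sketch (not the actual GCL training code) of how per-field InfoNCE terms over different metadata fields can be combined with weights; the field names and weights shown are illustrative assumptions.

import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim), L2-normalized; symmetric image<->text InfoNCE
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def multi_part_loss(image_emb, field_embs, field_weights):
    # field_embs: dict of field name -> (batch, dim) text embeddings for that field
    # field_weights: dict of field name -> float (illustrative, not the actual GCL weights)
    total = 0.0
    for field, emb in field_embs.items():
        total = total + field_weights[field] * info_nce(image_emb, emb)
    return total / sum(field_weights.values())

# Toy example with dummy embeddings for two of the seven text fields
batch, dim = 8, 512
image_emb = F.normalize(torch.randn(batch, dim), dim=-1)
field_embs = {
    "title": F.normalize(torch.randn(batch, dim), dim=-1),
    "description": F.normalize(torch.randn(batch, dim), dim=-1),
}
loss = multi_part_loss(image_emb, field_embs, {"title": 1.0, "description": 1.0})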
Marqo-FashionCLIP and Marqo-FashionSigLIP set a new performance frontier, with improvements of up to +57% over existing fashion-specific embedding models.
Evaluation was performed using seven publicly available fashion datasets that were not included in the training data (details can be found later in the article). Each dataset is associated with different downstream tasks depending on the availability of metadata. The evaluation covered three key areas: text-to-image, category-to-product, and sub-category-to-product. The text-to-image task uses unique passages of text that are representative of longer descriptive queries (like tail queries). The category-to-product and sub-category-to-product tasks represent shorter, keyword-like queries that may have multiple correct results (akin to head queries).
The comparison included the fashion-specific CLIP models FashionCLIP2.0 and OpenFashionCLIP, as well as the two base models ViT-B-16-laion (laion2b_s34b_b88k) and ViT-B-16-SigLIP (webli), which were not fine-tuned for fashion. Results are shown in the graph below: Marqo-FashionCLIP and Marqo-FashionSigLIP improve on both the previous fashion-specific models and the base models in all areas. Compared to FashionCLIP2.0, Marqo-FashionCLIP improved recall@1 (text-to-image) and precision@1 (category/sub-category-to-product) by +22%, +8%, and +11% respectively, while Marqo-FashionSigLIP improved them by +57%, +11%, and +13% respectively. The code to reproduce the benchmarks can be found in the GitHub repository.
Our comprehensive evaluation covers a wide range of query lengths, from single-word categories to detailed descriptions, as can be seen in the histogram below. The split by task is also shown: the text-to-image task covers a broad range of query lengths (up to 140 words), while the category and sub-category tasks use queries under 15 words long, with the majority being fewer than 6 words.
The relative improvements in Mean Reciprocal Rank (MRR) are shown in the table below. All values are compared to FashionCLIP2.0 and the raw values can be found below.
We also report inference times for Marqo-FashionCLIP and Marqo-FashionSigLIP, using the combined inference time of text and image for the comparison. Marqo-FashionCLIP and Marqo-FashionSigLIP are 10% faster than the existing fashion-specific models. Full details of the methodology, along with a wider range of inference benchmark times, can be found here.
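As a rough illustration of how a combined text + image inference time can be measured with open_clip, the sketch below times repeated forward passes of both encoders. The batch size, dummy inputs, and warm-up handling are assumptions and do not replicate the full benchmark methodology linked above.

import time
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')
model.eval()

# dummy inputs; replace with real product images and titles
image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0)
text = tokenizer(["red floral summer dress"])

with torch.no_grad():
    # warm-up pass so one-off initialisation is not timed
    model.encode_image(image); model.encode_text(text)
    start = time.perf_counter()
    for _ in range(10):
        model.encode_image(image)
        model.encode_text(text)
    elapsed = (time.perf_counter() - start) / 10
print(f"combined text+image inference: {elapsed * 1000:.1f} ms")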
Evaluation was performed across seven different publicly available datasets in the fashion domain. The tasks were further broken down into three distinct categories: text, category, and sub-category. These represent different query patterns: text is akin to long natural-language queries, while category and sub-category represent shorter, specific keyword queries. Below we provide details of all the datasets.
For comprehensive evaluations, we include seven publicly available datasets: DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200K, KAGL, Atlas, Polyvore, and iMaterialist. Each dataset includes various meta information, such as categories, sub-categories, colors, genders, and text descriptions; example records are shown later in the article.
Below are the results across the three main tasks. All evaluations were multimodal, consisting of both product images and text.
Text-to-image:
Category-to-product:
Sub-category-to-product:
To deploy on Marqo Cloud (recommended):
1. Sign Up to Marqo Cloud.
2. Install Marqo:
pip install marqo
3. Create an index:
import marqo
settings = {
    "type": "unstructured",
    "model": "marqo-fashion-clip",  # model name
    "modelProperties": {
        "name": "ViT-B-16",  # model architecture
        "dimensions": 512,  # embedding dimensions
        "url": "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqo-fashionCLIP/marqo_fashionCLIP.pt",  # model weights
        "type": "open_clip"  # loading library
    },
    "treatUrlsAndPointersAsImages": True
}
api_key = "your_api_key" # replace with your api key (https://www.marqo.ai/blog/finding-my-marqo-api-key)
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)
mq.create_index("fashion-index", settings_dict=settings)
# triggers model download
mq.index("fashion-index").search("black dress")
If you want to deploy on Marqo open-source, see below.
Marqo-FashionCLIP and Marqo-FashionSigLIP can be used within Marqo or downloaded from Hugging Face and used directly with open_clip.
pip install open_clip_torch
# Marqo-FashionCLIP
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')
# Marqo-FashionSigLIP
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')
import torch
from PIL import Image
image = preprocess(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
We also released an article illustrating a simple ecommerce search with a fashion dataset if you want to see the models in action.
Marqo-FashionCLIP and Marqo-FashionSigLIP are new state-of-the-art multimodal embedding models that you can use for search and recommendations in the fashion domain. They outperform existing models by up to +57% on text-to-image retrieval and +11% on category-to-product retrieval. Contact us to learn how to use Marqo-FashionCLIP for your business, or to build custom models for any ecommerce domain.
Our community is a vibrant hub for growth and learning. If you haven't already, come and say hello on our Slack Community!
Here we provide detailed information on the evaluation datasets with example data.
DeepFashion (In-shop) contains 52,591 fashion images.
An example of the dataset is shown below:
"category1": "men",
"category2": "tees",
"category3": "tanks",
"item": "id_00006383",
"color": "Heather grey-green",
"description": "[\"The eye-popping rose print on this heathered short-sleeved tee is colored with solid swaths of hue for an almost paint-by-number effect. We think it's a cool, dapper way to stand out in a sea of stripes and solids. \", 'Lightweight knit', 'Contrast ribbed crew neck', '59% cotton, 41% polyester', '29\" full length, 44\" chest, 43\" waist, 9.5\" sleeve length', 'Measured from Medium', 'Machine wash cold', 'Made in China']",
"text": "The eye-popping rose print on this heathered short-sleeved tee is colored with solid swaths of hue for an almost paint-by-number effect. We think it's a cool, dapper way to stand out in a sea of stripes and solids. || Lightweight knit || Contrast ribbed crew neck || 59% cotton, 41% polyester || 29\" full length, 44\" chest, 43\" waist, 9.5\" sleeve length || Measured from Medium || Machine wash cold"
DeepFashion (Multimodal) contains 42,537 fashion images.
An example of the dataset is shown below:
"category1": "women",
"category2": "dresses",
"text": "This person wears a sleeveless tank top with pure color patterns. The tank top is with cotton fabric. This person wears a long pants. The pants are with chiffon fabric and graphic patterns. The person is wearing a ring on her finger. There is an accessory on her wrist."
Fashion200K contains 201,624 fashion images.
An example of the dataset is shown below:
"category1": "dresses",
"category2": "cocktail dresses",
"category3": "blue knee-length dress",
"text": "brown dress. the dress has a strapless design and a ruffled neckline. the dress is made of a material that appears to be a type of fabric. the dress is a burgundy color and has a fitted style."
KAGL contains 44,434 fashion images.
An example of the dataset is shown below:
"gender": "Women",
"category1": "Apparel",
"category2": "Topwear",
"category3": "Kurtas",
"baseColour": "Blue",
"season": "Summer",
"year": 2012,
"usage": "Ethnic",
"text": "Diva Women Blue Kurta"
Atlas contains 78,370 fashion images.
An example of the dataset is shown below:
"gender": "Men",
"category": "Western Wear",
"sub-category": "Formal Shirts",
"text": "ArrowMen Blue & Green Slim Fit Checked Formal Shirt"
Polyvore contains 94,096 fashion images.
An example of the dataset is shown below:
"category": "Loafers & Moccasins",
"text": "blue bird women banana striped cotton loafers"
iMaterialist contains 721,065 fashion images.
An example of the dataset is shown below:
"pattern": "Floral",
"neckline": "Square Necked",
"gender": "Female",
"material": "Chiffon",
"category": "Casual Dresses"
"sleeve": "Puff Sleeves",
"style": "Summer",
"color": "Pink Purple Red White"
The evaluation tasks focus on multimodal retrieval, which best represents ecommerce search frameworks. Each product can be represented in a single modality (e.g., image or text) or in a multimodal form (e.g., a weighted embedding of both image and text). Below, we outline the tasks for each dataset:
In the text-to-image task, we measured Recall@1, Recall@10, Average Recall, and Mean Reciprocal Rank (MRR), with Average Recall being the average of Recall@1 and Recall@10. For other tasks, we utilized Precision@1, Precision@10, Average Precision, and MRR, where Average Precision is the average of Precision@1 and Precision@10. Since many types of meta information such as categories are shared among many products within a dataset, precision was primarily measured. Conversely, recall was used for the text-to-image task, as titles are more distinct to each product.
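For reference, here is a minimal sketch of how these metrics can be computed from a ranked list of retrieved item IDs for a single query. The helper names and toy data are illustrative; the actual benchmark code is in the GitHub repository.

def recall_at_k(ranked_ids, relevant_ids, k):
    # fraction of relevant items that appear in the top-k results
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    # fraction of the top-k results that are relevant
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant result, 0 if none is retrieved
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

# toy example: one query with a single relevant product
ranked = ["p7", "p3", "p1", "p9"]
relevant = ["p1"]
print(recall_at_k(ranked, relevant, 1), recall_at_k(ranked, relevant, 10))
print(reciprocal_rank(ranked, relevant))  # MRR averages this value over all queries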
1. Install Marqo:
pip install marqo
docker pull marqoai/marqo:latest
docker rm -f marqo
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
2. Create an index:
import marqo
settings = {
    "type": "unstructured",
    "model": "marqo-fashion-clip",  # model name
    "modelProperties": {
        "name": "ViT-B-16",  # model architecture
        "dimensions": 512,  # embedding dimensions
        "url": "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqo-fashionCLIP/marqo_fashionCLIP.pt",  # model weights
        "type": "open_clip"  # loading library
    },
}
mq = marqo.Client()
mq.create_index("fashion-index", settings_dict=settings)
# triggers model download
mq.index("fashion-index").search("black dress")