Announcement

Marqo Launches Family of Embedding Models for Ecommerce and Retail

Marqo-FashionCLIP & Marqo-FashionSigLIP are two new state-of-the-art multimodal models for search and recommendations in the fashion domain. The models surpass the current SOTA models FashionCLIP2.0 and OpenFashionCLIP by up to 57% on 7 fashion evaluation datasets, including DeepFashion and Fashion200K. Marqo-FashionCLIP & Marqo-FashionSigLIP produce embeddings for both text and images that can then be used in downstream search and recommendations applications.

Marqo-FashionCLIP & Marqo-FashionSigLIP are 150M-parameter embedding models.

We are releasing Marqo-FashionCLIP and Marqo-FashionSigLIP under the Apache 2.0 license.

Figure 1: Average Precision@1 and Recall@1 improvement of Marqo-FashionCLIP and Marqo-FashionSigLIP compared to FashionCLIP2.0. The evaluation consisted of 7 fashion datasets covering a large range of query lengths. *OpenFashionCLIP was trained on iMaterialist.

What are Marqo-FashionCLIP & Marqo-FashionSigLIP?

Marqo-FashionCLIP & Marqo-FashionSigLIP are two new state-of-the-art multimodal models for search and recommendations in the fashion domain. Both models produce embeddings for text and images that can then be used in downstream search and recommendations applications. They were trained on over 1M fashion products with rich metadata containing detailed descriptions, colors, styles, keywords and materials; this training data does not overlap with any of the evaluation datasets. The models were fine-tuned from two existing base models (ViT-B-16-laion and ViT-B-16-SigLIP-webli) using Generalized Contrastive Learning (GCL). The loss consisted of seven parts to optimize specifically for long descriptions, product titles, colors, materials, categories, details and keywords. This multi-part loss significantly outperformed the standard text-image InfoNCE loss when fine-tuning, resulting in models that retrieve more relevant results for both keyword-like short text and longer descriptive text - both of which are particularly relevant to search applications.

Figure 2: CLIP compared to GCL. The GCL loss can be decomposed into multiple domain-specific parts, which improves training and downstream performance.
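
The GCL training code is not reproduced here, but the idea behind the multi-part loss can be sketched in a few lines. The snippet below is a simplified illustration only, not the GCL implementation: it combines one InfoNCE-style term per metadata field (title, description, color, and so on) with per-field weights, all of which are assumed values for illustration.

import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    # Standard symmetric InfoNCE over a batch of paired image/text embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

def multi_part_loss(image_emb, field_embs, field_weights):
    # Weighted sum of one contrastive term per metadata field
    # (e.g. title, description, color, material, category, details, keywords).
    total = torch.zeros((), device=image_emb.device)
    for field, text_emb in field_embs.items():
        total = total + field_weights[field] * info_nce(image_emb, text_emb)
    return total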

Performance of Marqo-FashionCLIP & Marqo-FashionSigLIP

Marqo-FashionCLIP and Marqo-FashionSigLIP set a new state of the art, with improvements of up to +57% over existing fashion-specific embedding models.

Retrieval

Evaluation of the models was performed using seven publicly available fashion datasets, none of which were included in the training data (details can be found later in the article). Each dataset is associated with different downstream tasks depending on the availability of metadata. The evaluation covered three key areas: text-to-image, category-to-product and sub-category-to-product. The text-to-image task uses unique passages of text that are representative of longer descriptive queries (like tail queries). The category-to-product and sub-category-to-product tasks represent shorter, keyword-like queries that may have multiple correct results (akin to head queries).

The comparison covered the fashion-specific CLIP models FashionCLIP2.0 and OpenFashionCLIP, and two base models, ViT-B-16-laion (laion2b_s34b_b88k) and ViT-B-16-SigLIP (webli), that were not fine-tuned for fashion. The results in the graph below show Marqo-FashionCLIP and Marqo-FashionSigLIP improving on both the previous fashion-specific models and the base models across all areas. For text-to-image, category-to-product and sub-category-to-product, Marqo-FashionCLIP improved recall@1 (text-to-image) and precision@1 (category/sub-category-to-product) over FashionCLIP2.0 by +22%, +8% and +11% respectively, while Marqo-FashionSigLIP improved them by +57%, +11% and +13% respectively. The code to reproduce the benchmarks can be found in the GitHub repository.

Figure 3: Performance of Marqo-FashionCLIP compared to previous models. The scores are the relative Precision@1 or Recall@1 improvement compared to FashionCLIP2.0. See below for more details.

Query Diversity

Our evaluation covers a wide range of query lengths, from single-word categories to detailed descriptions, as can be seen in the histogram below. The splits by task are also shown: the text-to-image task covers a broad range of query lengths (up to 140 words), while the category and sub-category tasks consist of queries fewer than 15 words long, with the majority under 6 words.

Figure 4: Histogram of all the text queries used in the evaluation benchmarks across seven datasets. The category-to-product and sub-category-to-product task queries were aggregated into a single category. The descriptive queries come from the text-to-image task.

The relative improvements in Mean Reciprocal Rank (MRR) are shown in the table below. All values are relative to FashionCLIP2.0; the raw values can be found later in the article.

Table 1: Relative improvement in MRR for descriptive (text-to-image) and keyword-like (category/sub-category) queries. Full metrics can be found later in the article.

Inference

We also report the inference times for Marqo-FashionCLIP and Marqo-FashionSigLIP. The combined inference time for text and image is used for the comparison. Marqo-FashionCLIP and Marqo-FashionSigLIP are 10% faster than the existing fashion-specific models. Full details of the methodology can be found here, along with a wide range of inference benchmark times.

Figure 5: Relative inference improvement for combined text and image. Full details can be found in “Benchmarking Models for Multi-modal Search”.
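
If you want to get a rough feel for the combined text and image encode time yourself, a simple timing loop is sketched below. This is an illustrative sketch only (not the benchmark methodology from the linked article); the image path is a placeholder, and the model is loaded with the same open_clip calls shown in the usage section later in this post.

import time
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')
model.eval()

image = preprocess(Image.open("fashion-item.png")).unsqueeze(0)  # placeholder image path
text = tokenizer(["a black dress"])

with torch.no_grad():
    model.encode_image(image)  # warm-up so one-off initialisation does not skew the timing
    model.encode_text(text)
    start = time.perf_counter()
    for _ in range(100):
        model.encode_image(image)
        model.encode_text(text)
    avg_seconds = (time.perf_counter() - start) / 100

print(f"Combined text + image encode: {avg_seconds * 1000:.1f} ms")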

Detailed Evaluation

Evaluation was performed across seven publicly available datasets in the fashion domain. The tasks were further broken down into three distinct categories: text, category and sub-category. These represent different query patterns: text is akin to long natural-language queries, while category and sub-category represent shorter, more specific keyword queries. Below we provide details of all the datasets.

Datasets

For comprehensive evaluations, we include seven publicly available datasets: DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200K, KAGL, Atlas, Polyvore, and iMaterialist. Each dataset includes various metadata; summary statistics are given in the table below.

Table 2: Summary statistics for the seven datasets used in the evaluation benchmark.

Results

Below are the results across the three main tasks. All evaluations were multimodal, consisting of both product images and text.

Text-to-image:

Table 3: Results for the text-to-image task, averaged across six datasets.

Category-to-product:

Table 4: Results for the category-to-product task, averaged across five datasets.

Sub-category-to-product:

Table 5: Results for the sub-category-to-product task, averaged across four datasets.

Try Marqo-FashionCLIP and Marqo-FashionSigLIP

Marqo-FashionCLIP and Marqo-FashionSigLIP can be used in Marqo or downloaded from Hugging Face.

To use from Hugging Face:


pip install open_clip_torch


# Load one of the two models:

# Marqo-FashionCLIP
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')

# or Marqo-FashionSigLIP
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')

import torch
from PIL import Image

image = preprocess(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

We also released an article illustrating a simple ecommerce search with a fashion dataset if you want to see the models in action.

To deploy on Marqo Cloud (recommended):

1. Sign Up to Marqo Cloud.

2. Install Marqo:


pip install marqo

3. Create an index:


import marqo

settings = {
    "type": "unstructured",
    "model": "marqo-fashion-clip",  # model name
    "modelProperties": {
        "name": "ViT-B-16",  # model architecture
        "dimensions": 512,  # embedding dimensions
        "url": "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqo-fashionCLIP/marqo_fashionCLIP.pt",  # model weights
        "type": "open_clip"  # loading library
    },
    "treatUrlsAndPointersAsImages": True
}

api_key = "your_api_key"  # replace with your api key (https://www.marqo.ai/blog/finding-my-marqo-api-key)
mq = marqo.Client("https://api.marqo.ai", api_key=api_key)

mq.create_index("fashion-index", settings_dict=settings)

# triggers model download
mq.index("fashion-index").search("black dress")

If you want to deploy on Marqo open-source, see below.

Conclusion

Marqo-FashionCLIP and Marqo-FashionSigLIP are new state-of-the-art multimodal embedding models that you can use for search and recommendations in the fashion domain. They outperform existing models by up to +57% on text-to-image retrieval and +11% on category-to-product retrieval. Contact us to learn how to use Marqo-FashionCLIP for your business, or to build custom models for any ecommerce domain.

Join the Community

Our community is a vibrant hub for growth and learning. If you haven't already, come and say hello on our Slack Community!

Code

The code to reproduce the benchmarks can be found in the GitHub repository.

Appendix A: Evaluation Datasets

Here we provide detailed information on the evaluation datasets with example data.

A.1 Evaluation Dataset Details

A.1.1 DeepFashion (In-shop)

DeepFashion (In-shop) contains 52,591 fashion images.

An example of the dataset is shown below:


"category1": "men",
"category2": "tees",
"category3": "tanks",
"item": "id_00006383",
"color": "Heather grey-green",
"description": "[\"The eye-popping rose print on this heathered short-sleeved tee is colored with solid swaths of hue for an almost paint-by-number effect. We think it's a cool, dapper way to stand out in a sea of stripes and solids. \", 'Lightweight knit', 'Contrast ribbed crew neck', '59% cotton, 41% polyester', '29\" full length, 44\" chest, 43\" waist, 9.5\" sleeve length', 'Measured from Medium', 'Machine wash cold', 'Made in China']",
"text": "The eye-popping rose print on this heathered short-sleeved tee is colored with solid swaths of hue for an almost paint-by-number effect. We think it's a cool, dapper way to stand out in a sea of stripes and solids.  || Lightweight knit || Contrast ribbed crew neck || 59% cotton, 41% polyester || 29\" full length, 44\" chest, 43\" waist, 9.5\" sleeve length || Measured from Medium || Machine wash cold"

A.1.2 DeepFashion (Multimodal)

DeepFashion (Multimodal) contains 42,537 fashion images.

An example of the dataset is shown below:


"category1": "women",
"category2": "dresses",
"text": "This person wears a sleeveless tank top with pure color patterns. The tank top is with cotton fabric. This person wears a long pants. The pants are with chiffon fabric and graphic patterns. The person is wearing a ring on her finger. There is an accessory on her wrist."

A.1.3 Fashion200K

Fashion200K contains 201,624 fashion images.

An example of the dataset is shown below:


"category1": "dresses",
"category2": "cocktail dresses",
"category3": "blue knee-length dress",
"text": "brown dress. the dress has a strapless design and a ruffled neckline. the dress is made of a material that appears to be a type of fabric. the dress is a burgundy color and has a fitted style."


A.1.4. KAGL

KAGL contains 44,434 fashion images.

An example of the dataset is shown below:


"gender": "Women",
"category1": "Apparel",
"category2": "Topwear",
"category3": "Kurtas",
"baseColour": "Blue",
"season": "Summer",
"year": 2012,
"usage": "Ethnic",
"text": "Diva Women Blue Kurta"


A.1.5 Atlas

Atlas contains 78,370 fashion images.

An example of the dataset is shown below:


"gender": "Men",
"category": "Western Wear",
"sub-category": "Formal Shirts",
"text": "ArrowMen Blue & Green Slim Fit Checked Formal Shirt"

A.1.6 Polyvore

Polyvore contains 94,096 fashion images.

An example of the dataset is shown below:


"category": "Loafers & Moccasins",
"text": "blue bird women banana striped cotton loafers"

A.1.7 iMaterialist

iMaterialist contains 721,065 fashion images.

An example of the dataset is shown below:


"pattern": "Floral",
"neckline": "Square Necked",
"gender": "Female",
"material": "Chiffon",
"category": "Casual Dresses"
"sleeve": "Puff Sleeves",
"style": "Summer",
"color": "Pink Purple Red White"

A.2 Tasks

The evaluation tasks focus on multimodal retrieval, which best represents ecommerce search frameworks. Each product can be represented in a single modality (e.g., image or text) or in a multimodal form (e.g., a weighted combination of its image and text embeddings; a rough sketch of this follows). Below, we outline the tasks for each dataset.
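
As a rough illustration of the weighted multimodal representation mentioned above, a product embedding can be formed as a weighted combination of its normalized image and text embeddings. The helper and the 0.5 weight below are illustrative assumptions, not the evaluation code.

import torch
import torch.nn.functional as F

def product_embedding(image_emb, text_emb, image_weight=0.5):
    # Weighted combination of normalized image and text embeddings for one product.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return F.normalize(image_weight * image_emb + (1.0 - image_weight) * text_emb, dim=-1)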

A.3 Metrics

In the text-to-image task, we measured Recall@1, Recall@10, Average Recall, and Mean Reciprocal Rank (MRR), with Average Recall being the average of Recall@1 and Recall@10. For the other tasks, we used Precision@1, Precision@10, Average Precision, and MRR, where Average Precision is the average of Precision@1 and Precision@10. Since many types of metadata, such as categories, are shared among many products within a dataset, precision was primarily measured for those tasks. Conversely, recall was used for the text-to-image task, as titles are more distinct to each product.
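
The metrics themselves are standard. As a reference, a minimal sketch of how Recall@K, Precision@K and reciprocal rank can be computed from a ranked result list is shown below; the benchmark code in the GitHub repository is the authoritative implementation.

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the relevant items that appear in the top-k results.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def precision_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the top-k results that are relevant.
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant result; averaging over queries gives MRR.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0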

Appendix B

B.1 Deploy on Marqo Open-Source

1. Install Marqo:


pip install marqo
docker pull marqoai/marqo:latest
docker rm -f marqo
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest

2. Create an index:


import marqo

settings = {
    "type": "unstructured",
    "model": "marqo-fashion-clip", # model name
    "modelProperties": {
        "name": "ViT-B-16", # model architecture
        "dimensions": 512, # embedding dimensions
        "url": "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqo-fashionCLIP/marqo_fashionCLIP.pt", # model weights
        "type": "open_clip" # loading library
    },
}

mq = marqo.Client()  # defaults to the local instance at http://localhost:8882

mq.create_index("fashion-index", settings_dict=settings)

# triggers model download
mq.index("fashion-index").search("black dress")

Useful Resources

Jesse N. Clark & David Jung
Jesse and David work within the Applied Sciences Division at Marqo, performing R&D in AI for search and recommendations.