Showcase

Ecommerce Image Classification with Marqo-FashionCLIP

August 14, 2024
5 mins read
Marqo have recently announced two new state-of-the-art embedding models for multimodal fashion search and recommendations: Marqo-FashionCLIP & Marqo-FashionSigLIP. In this article, we will explore how you can build your own ecommerce image classifier with Marqo-FashionCLIP.

This article will be using the open-source model Marqo-FashionCLIP for image classification.

You can find the code for this article here: Google Colab. If you are new to Google Colab, you can follow this guide on getting set up. As always, if you face any issues, join our Slack Community and a member of our team will help.

1. Installation

We begin by installing the relevant modules needed for this example: the open_clip_torch and datasets libraries, which allow us to load our embedding model and dataset, respectively.


pip install open_clip_torch
pip install datasets

2. Load a Dataset

Since we are performing ecommerce image classification, we will need an ecommerce or fashion dataset. For this demo, we choose ceyda/fashion-products-small which is a small collection of fashion products.


from datasets import load_dataset

# Load the dataset
ds = load_dataset('ceyda/fashion-products-small')

Let's take a look at the features inside this dataset by printing ds. This outputs:


DatasetDict({
    train: Dataset({
        features: ['filename', 'link', 'id', 'masterCategory', 'gender', 'subCategory', 'image'],
        num_rows: 42700
    })
})

We see that we have filename, link, id, masterCategory, gender, subCategory and image. Let's print the first example from this dataset to see what these features mean:


entry = ds['train'][0]
entry

This outputs:


{'filename': '15970.jpg',
 'link': 'http://assets.myntassets.com/v1/images/style/properties/7a5b82d1372a7a5c6de67ae7a314fd91_images.jpg',
 'id': '15970',
 'masterCategory': 'Apparel',
 'gender': 'Men',
 'subCategory': 'Topwear',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512>}

Thus, the features of the dataset are as follows:

  • filename: this is the filename of the image, indicating that the image is stored or identified with this name.
  • link: this is a URL link to the actual image file, which is hosted online. This link can be used to view or download the image.
  • id: this is a unique identifier for the image, which can be used to reference this specific item within the dataset.
  • masterCategory: this indicates the broad category under which this product falls.
  • gender: this specifies the intended gender for the product, in this case, men's clothing.
  • subCategory: this is a more specific category within the master category. "Topwear" indicates that the product is an item of clothing worn on the upper body, such as a shirt, t-shirt, or jacket.
  • image: this is a PIL (Python Imaging Library) image object, which allows for image manipulation and processing. It specifies the image mode (RGB, meaning it has red, green, and blue color channels) and the image size (384 pixels wide by 512 pixels tall).
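
To get a quick feel for the categorical fields, we can count the distinct values they contain. The snippet below is a minimal sketch (not part of the original walkthrough); the exact values and counts depend on the dataset version you load. Note that subCategory is the label we will classify against later.


from collections import Counter

# Count how many items fall under each broad and fine-grained category
master_counts = Counter(ds['train']['masterCategory'])
sub_counts = Counter(ds['train']['subCategory'])

print(f"{len(master_counts)} master categories, most common: {master_counts.most_common(3)}")
print(f"{len(sub_counts)} sub-categories, most common: {sub_counts.most_common(3)}")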

Cool, let’s look at the image!


image = entry['image']
image
Figure 1: Example image from the dataset we’re using for image classification.

As expected, it's an item of men's topwear.

We can see that the data consists of a single train split, so we will define our dataset as this split.


dataset = ds['train']

Awesome! Now that we've seen what our dataset looks like, it's time to load our model and set up preprocessing.

3. Load Marqo-FashionCLIP Model and Preprocessing

For this demo, we will be using the Marqo-FashionCLIP model, which is a state-of-the-art multimodal model for search and recommendations in the fashion domain. Note that we prefix the model name with `hf-hub:`; this specifies that the resource (in this case, a model) is located on the Hugging Face Model Hub and should be retrieved from there.

For more information about the model, visit this article. Below, the model and its preprocessing functions are loaded, and the model is moved to the appropriate device. CUDA is NVIDIA's technology for running heavy computations on the graphics card (GPU), which makes them much faster. The code checks whether a compatible GPU is available; if it is, the model runs on the GPU, and otherwise it falls back to the regular processor (CPU).


import open_clip
import torch

# Load the Marqo/marqo-fashionCLIP model and preprocessors
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionCLIP')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

4. Perform Image Classification with Marqo-FashionCLIP

Let's take a look at our Marqo-FashionCLIP model predictions for image classification on this dataset.

This code uses the model to classify three example images from our dataset by comparing their visual features with textual descriptions of subcategories. It processes and normalizes the features of the images and subcategory texts, calculates their similarity, and predicts the subcategory for each image. Finally, it visualizes the images alongside their predicted and actual subcategories in a plot.


import matplotlib.pyplot as plt

# Select indices for three example images
indices = [1, 17, 23]

# Get the list of possible subcategories from the dataset
subcategories = list(set(example['subCategory'] for example in dataset))

# Preprocess the text descriptions for each subcategory using the tokenizer
text_inputs = tokenizer([f"a photo of {c}" for c in subcategories]).to(device)

# Create a figure with subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Loop through the indices and process each image
for i, idx in enumerate(indices):
    # Select an example image from the dataset
    example = dataset[idx]
    image = example['image']
    subcategory = example['subCategory']

    # Preprocess the image
    image_input = preprocess_val(image).unsqueeze(0).to(device)

    # Calculate image and text features
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_inputs)

    # Normalize the features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Calculate similarity between image and text features
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    top_value, top_index = similarity[0].topk(1)  # avoid shadowing the outer 'indices' list

    # Display the image in the subplot
    axes[i].imshow(image)
    axes[i].set_title(f"Predicted: {subcategories[top_index.item()]}, Actual: {subcategory}")
    axes[i].axis('off')

# Show the plot
plt.tight_layout()
plt.show()

This outputs the following:

Figure 2: Predictions made by our Marqo-FashionCLIP model on our dataset. The model successfully identified these items within the dataset.

Nice! Our model is able to predict the images in this dataset.
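
If you want a rough sense of how well this zero-shot approach performs beyond three hand-picked examples, you can score a small random sample. The sketch below is a minimal extension of the code above, assuming model, preprocess_val, device, dataset, subcategories and text_inputs from the previous step are still in scope; it encodes the sub-category prompts once and reports simple accuracy on the sample.


import random

# Encode the sub-category prompts once and reuse them for every image
with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Classify a small random sample and count correct predictions
sample_ids = random.sample(range(len(dataset)), 50)
correct = 0
for idx in sample_ids:
    example = dataset[idx]
    image_input = preprocess_val(example['image']).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    predicted = subcategories[(image_features @ text_features.T).argmax().item()]
    correct += predicted == example['subCategory']

print(f"Sample accuracy: {correct / len(sample_ids):.2%}")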

Conclusion

In this article, we've used Marqo-FashionCLIP for image classification on a small fashion dataset. We performed image classification for three specific images from the dataset, computed the predicted sub-category for each, and compared these predictions with the true sub-categories.

Code

The code in this article can be found here: Ecommerce Image Classification with Marqo-FashionCLIP

Additional Resources

Ellie Sleightholm
Head of Software Developer Relations