Showcase

Ecommerce Image Classification with OpenCLIP

In the world of ecommerce, effective image classification is key to improving search results and product recommendations. Marqo’s newly launched ecommerce embedding models outperform competitors like Amazon Titan by up to 88%, setting a new standard for retrieval tasks.

In this blog post, we’ll walk you through how to use these models for image classification with the OpenCLIP library. If you’d prefer to use Hugging Face’s transformers library, see this article. By the end, you'll have a practical guide to integrating these state-of-the-art models into your own ecommerce platform and projects.

Image Classification with OpenCLIP

We have a Google Colab notebook that contains the code presented in this article (as well as the equivalent code for transformers) so you can get up and running yourself. Check it out here.
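
Before diving in, make sure the required libraries are installed. Assuming a fresh Python environment, something like the following should cover everything used in this post:


pip install open_clip_torch torch pillow requests matplotlib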

1. Load the Model and Tokenizer

First, we import the relevant modules and define our model, preprocessing transforms and tokenizer. For this example, we will be using a state-of-the-art ecommerce embedding model, Marqo/marqo-ecommerce-embeddings-L. You can find out more about this model, including benchmarking and results, in Marqo's latest blog. This model is capable of encoding both text (ecommerce items) and images into feature vectors that we can compare.


import open_clip
import torch
import requests
import numpy as np
from PIL import Image
from io import BytesIO

# Set the model and tokenizer
model_name = 'hf-hub:Marqo/marqo-ecommerce-embeddings-L'
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(model_name)

# Load tokenizer
tokenizer = open_clip.get_tokenizer(model_name)

2. Tokenize Ecommerce Items

We create a list of ecommerce items to classify and tokenize the text using the model’s tokenizer. This prepares the items for input into the model by converting them into a format the model can process.


# List of ecommerce items for classification
ecommerce_items = [
    "laptop", "notebook", "microwave", "toothbrush", "plates", "glasses",
    "bicycle helmet", "perfume", "book", "hair straightener"
]

# Tokenize the text (ecommerce items) using the tokenizer function
text = tokenizer(ecommerce_items)

# Print a message indicating text features are being encoded
print("Encoding text features..."

# Encode text features with the model in a no-gradient context (for efficiency) and using mixed precision (if on a GPU with CUDA)
with torch.no_grad(), torch.amp.autocast('cuda'):
    text_features = model.encode_text(text)
    
    # Normalize text features to have unit norm, which aids in similarity comparison
    text_features /= text_features.norm(dim=-1, keepdim=True)
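
As an optional sanity check (not part of the original walkthrough), you can confirm that there is one feature vector per ecommerce item and that each vector has unit length after normalization:


# Optional check: one row per ecommerce item, each with (approximately) unit norm
print(text_features.shape)
print(text_features.norm(dim=-1))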

3. Fetch and Process the Image

Let’s find an image we want to classify. We fetch the image from its URL and process it using the model’s preprocessing function. This prepares the image for input into the model.


import matplotlib.pyplot as plt

# URL of the image to classify
url = "https://media.met-helmets.com/app/uploads/2023/09/met-rivale-mips-road-cycling-helmet-BL3.jpg"

# Fetch and process the image
response = requests.get(url)
if response.status_code == 200:
    image = Image.open(BytesIO(response.content))
    processed_image = preprocess_val(image).unsqueeze(0)

    # Display the image
    plt.imshow(image)
    plt.axis('off')  # Hide axes for a cleaner look
    plt.show()
else:
    raise ValueError(f"Error: Unable to retrieve image from URL. Status code: {response.status_code}")

This is what the image looks like: it’s a bicycle helmet. Feel free to use a URL for a different type of ecommerce item here.

Figure 1: Image of a bicycle helmet that we will classify.

4. Encode the Image Features

With the image processed, we use the model to encode the image into a feature vector, just like we did with the text. This allows the model to compare the image to the list of ecommerce items. Finally, we calculate the similarity between the image features and the text features (ecommerce items).


# Encode the image using the model
with torch.no_grad(), torch.amp.autocast('cuda'):
    image_features = model.encode_image(processed_image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Calculate probabilities for each ecommerce item
    text_probs = (100 * image_features @ text_features.T).softmax(dim=-1)

    # Sort and get the top 5 predictions
    sorted_confidences = sorted(
        {ecommerce_items[i]: float(text_probs[0, i]) for i in range(len(ecommerce_items))}.items(),
        key=lambda x: x[1],
        reverse=True
    )
    top_5_confidences = dict(sorted_confidences[:5])
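
Because both sets of features are normalized, image_features @ text_features.T is simply the cosine similarity between the image and each item, and the factor of 100 acts as a temperature before the softmax. If you would rather inspect the raw similarities than the softmax probabilities, a small optional snippet (not in the original notebook) could look like this:


# Raw cosine similarities between the image and each ecommerce item
cosine_sims = (image_features @ text_features.T).squeeze(0)
for item, sim in zip(ecommerce_items, cosine_sims.tolist()):
    print(f"{item}: {sim:.3f}")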

5. Calculate and Display Predictions

We then display the top 5 predictions based on the highest similarity scores.


# Display the top 5 predictions
print("Top 5 Predictions:")
for item, confidence in top_5_confidences.items():
    print(f"{item}: {confidence:.2f}%")

This returns:


Top 5 Predictions:
bicycle helmet: 100.00%
glasses: 0.00%
plates: 0.00%
laptop: 0.00%
toothbrush: 0.00%

As we can see, the model correctly identifies that the image is indeed a bicycle helmet!
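
If you plan to classify many product images, it can be handy to wrap the steps above into a single helper. The sketch below simply consolidates the code from this post; the classify_image function and its signature are our own convenience wrapper, not part of OpenCLIP or the model:


def classify_image(image_url, labels, top_k=5):
    """Classify the image at image_url against a list of candidate labels."""
    # Tokenize the candidate labels
    text = tokenizer(labels)

    # Download and preprocess the image
    response = requests.get(image_url)
    response.raise_for_status()
    image = Image.open(BytesIO(response.content))
    image_input = preprocess_val(image).unsqueeze(0)

    with torch.no_grad(), torch.amp.autocast('cuda'):
        # Encode and normalize text and image features
        text_features = model.encode_text(text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        image_features = model.encode_image(image_input)
        image_features /= image_features.norm(dim=-1, keepdim=True)

        # Softmax over scaled cosine similarities
        probs = (100 * image_features @ text_features.T).softmax(dim=-1)

    scores = {label: float(probs[0, i]) for i, label in enumerate(labels)}
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

# Example usage with the items and URL from earlier
print(classify_image(url, ecommerce_items))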

Conclusion

With Marqo’s new ecommerce embedding models, you can efficiently classify images and match them with relevant ecommerce items, providing more accurate and personalized recommendations. Whether you're using Hugging Face transformers or OpenCLIP, these models offer flexibility and top-tier performance for ecommerce tasks. By following the steps in this blog, you can integrate these capabilities into your own systems and improve your ecommerce experience.

Next Steps

Run this Code in Google Colab

Check out the Models on Hugging Face

Try out the Demo on Hugging Face Space

Join Our Community

Ellie Sleightholm
Head of Developer Relations at Marqo