Building a useful ecommerce search and recommendation service requires accurate image classification. Hugging Face's transformers library is a popular tool that you can use along with Marqo's new ecommerce models to classify product images.
In this blog post, we'll walk you through how to use these models for image classification with the transformers library. If you'd prefer to use OpenCLIP, see this article instead. By the end, you'll have a practical guide to integrating these state-of-the-art models into your own ecommerce platform and projects.
We have a Google Colab notebook that contains the code presented in this article (as well as for OpenCLIP) so you can get up and running yourself. Check it out here.
First, we import the relevant modules and define our model and processor. For this example, we will be using a state-of-the-art ecommerce embedding model, Marqo/marqo-ecommerce-embeddings-L. You can find out more about this model, including benchmarking and results, in Marqo's latest blog.
Let's load this model:
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image
import requests
from io import BytesIO
# Define the model and processor
model_name = 'Marqo/marqo-ecommerce-embeddings-L'
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
Next, we create a list of ecommerce items and tokenize them with the processor, preparing the text data for input to the model. We then fetch an image from a provided URL, convert it to an RGB format suitable for processing, and display it with matplotlib for visual confirmation. This setup lets the model classify the image against the prepared item list.
# List of ecommerce items for classification - you can add more items if needed
ecommerce_items = [
    "laptop", "notebook", "microwave", "toothbrush", "plates", "glasses",
    "bicycle helmet", "perfume", "book", "hair straightener"
]
# Tokenize the text (ecommerce items) for input to the model
# The processor converts each item into a tokenized format and applies padding and truncation for uniform tensor shapes
text_inputs = processor(text=ecommerce_items, return_tensors="pt", padding="max_length", truncation=True)
# URL of the image to classify
url = "https://media.met-helmets.com/app/uploads/2023/09/met-rivale-mips-road-cycling-helmet-BL3.jpg" # Replace with any image URL of your choice
# Fetch and process the image from the specified URL
response = requests.get(url)
if response.status_code == 200:
    # Convert the image to RGB format (compatible with model requirements)
    image = Image.open(BytesIO(response.content)).convert("RGB")
else:
    # Raise an error if the image cannot be retrieved
    raise ValueError(f"Error: Unable to retrieve image from URL. Status code: {response.status_code}")
# Display the image to confirm it's loaded correctly
import matplotlib.pyplot as plt
# Use matplotlib to display the image without axes for a cleaner view
plt.imshow(image)
plt.axis('off') # Hide axes for a cleaner look
plt.show()
After running this code, we get:
This is what the image at the URL looks like: it's a bicycle helmet! Feel free to input a different ecommerce image URL here.
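If your product images live on disk rather than behind a URL, you can load them with PIL directly instead of fetching them with requests. A minimal sketch, assuming a hypothetical local file called product.jpg:
# Load a product image from a local file instead of a URL
# 'product.jpg' is a hypothetical path - replace it with your own image file
image = Image.open("product.jpg").convert("RGB")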
Once you have the image and ecommerce items ready, process both using the processor. This prepares the data to be input into the model. We also move the data to the appropriate device (GPU or CPU) for efficient processing.
# Process the image and move inputs to the device
image_inputs = processor(images=image, return_tensors="pt")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
image_inputs = {k: v.to(device) for k, v in image_inputs.items()}
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
Next, encode both the text and image into features using the model. These features are vectors that capture the underlying semantics of the image and the text. We normalize these features to make them comparable.
Then, calculate the probabilities for each ecommerce item by comparing the image features with the text features. Sort the items based on their confidence scores and display the top 5 most relevant predictions.
# Encode and normalize text and image features
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)
    # Normalize features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# Calculate probabilities and get top 5 predictions
text_probs = (100 * image_features @ text_features.T).softmax(dim=-1)
sorted_confidences = sorted(
    {ecommerce_items[i]: float(text_probs[0, i]) for i in range(len(ecommerce_items))}.items(),
    key=lambda x: x[1],
    reverse=True
)
top_5_confidences = dict(sorted_confidences[:5])
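As a side note, because text_probs is already a tensor, you could get the same top-5 ranking with torch.topk rather than building and sorting a Python dictionary. A minimal sketch of that alternative:
# Alternative: rank predictions directly on the tensor with torch.topk
top_probs, top_idxs = text_probs[0].topk(5)
top_5_confidences = {ecommerce_items[int(i)]: float(p) for p, i in zip(top_probs, top_idxs)}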
Finally, we display the top 5 predictions for our image.
# Display the top 5 predictions
print("Top 5 Predictions:")
for item, confidence in top_5_confidences.items():
    # 'confidence' is a softmax probability between 0 and 1
    print(f"{item}: {confidence:.2f}")
This returns:
Top 5 Predictions:
bicycle helmet: 1.00
glasses: 0.00
plates: 0.00
laptop: 0.00
toothbrush: 0.00
As we can see, the model correctly identifies that the image is indeed a bicycle helmet!
With Marqo’s new ecommerce embedding models, you can efficiently classify images and match them with relevant ecommerce items, providing more accurate and personalized recommendations. Whether you're using Hugging Face transformers or OpenCLIP, these models offer flexibility and top-tier performance for ecommerce tasks. By following the steps in this blog, you can integrate these capabilities into your own systems and improve your ecommerce experience.
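If you want to reuse this pipeline in your own code, one convenient pattern is to wrap the steps above in a single helper function. The sketch below is our own illustration (the function name classify_product_image and its arguments are not part of the Marqo or transformers APIs); it assumes the model, processor, image, device, and ecommerce_items defined earlier in this post:
def classify_product_image(image, labels, model, processor, device="cpu", top_k=5):
    # Tokenize the candidate labels and preprocess the image
    text_inputs = processor(text=labels, return_tensors="pt", padding="max_length", truncation=True)
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
    image_inputs = {k: v.to(device) for k, v in image_inputs.items()}

    # Embed, normalize, and compare the image and text features
    with torch.no_grad():
        image_features = model.get_image_features(**image_inputs)
        text_features = model.get_text_features(**text_inputs)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        probs = (100 * image_features @ text_features.T).softmax(dim=-1)

    # Return the top_k labels with their probabilities
    top_probs, top_idxs = probs[0].topk(min(top_k, len(labels)))
    return {labels[int(i)]: float(p) for p, i in zip(top_probs, top_idxs)}

# Example usage with the objects defined earlier in this post
predictions = classify_product_image(image, ecommerce_items, model, processor, device=device)
print(predictions)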
Check out the Models on Hugging Face