AI-Native vs Behavioral Ranking: The Future of Ecommerce Product Discovery

Visual discovery is becoming an increasingly important part of ecommerce search. Shoppers do not always describe products using precise keywords. Instead, they search using style descriptions, visual references, or natural language that reflects how they think about products. To support these behaviors, modern search systems must understand both text and visual information.
AI powered product discovery systems make this possible by learning how images, product attributes, and natural language relate to one another. This allows retailers to build discovery experiences where shoppers can search visually, explore similar products, and find items that match the style or concept they have in mind.
In this post, we explore how AI based discovery can be applied to large image collections using a playful experiment. Using a dataset of nearly 100,000 AI generated hot dog images, we demonstrate how visual similarity search works and how the same techniques can power image driven product discovery in ecommerce catalogs.
Multimodal AI is changing how people discover products online. In ecommerce, shoppers often search visually, whether they are browsing styles, looking for similar products, or trying to match an image to something in a retailer's catalog. To build high performing visual discovery experiences, search systems need to understand both text and images at scale.
Marqo is an AI native ecommerce search and product discovery platform designed to power multimodal search experiences, including image search and text to image retrieval. It is the engine behind the hot dog experiment that follows.
I was particularly interested in seeing what a generative model would produce from the same prompt across a large number of images. Using the Hugging Face diffusers library, I generated close to 93,000 images and used them as the basis for this experiment.
My original plan of 1 million hot dogs in a day quickly unraveled once I realized it would require an unattainable 695 hot dogs per minute. I had to settle for 13 hot dogs per minute and roughly 93,000 images (only 7,000 short of 100,000).
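Under some reasonable assumptions (Stable Diffusion through the diffusers library, JPEGs saved into a local folder), the generation loop might look like the sketch below. The model id, prompt, batch size, and output folder are illustrative choices, not the exact settings used.

```python
# Sketch of the generation loop: Stable Diffusion via diffusers.
# Model id, prompt, batch size, and output folder are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photo of a hot dog"
batch_size = 8

for batch in range(12000):  # ~96,000 images in total
    images = pipe([prompt] * batch_size).images
    for i, image in enumerate(images):
        image.save(f"hot-dog-100k/hot_dog_{batch * batch_size + i:06d}.jpg")
```

At 13 images per minute, a loop like this runs for several days, which is why the experiment stopped short of the original million.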

Here is a random sample of 100 images. There is a pretty wide variety of hot dogs generated, and some very interesting interpretations of what constitutes a hot dog.
To dig a bit deeper into the hot-dog-100k dataset, we can index the data using Marqo.
In ecommerce, this same workflow applies to product catalogs. Indexing images with multimodal AI makes it possible to build visual search and product discovery experiences where shoppers can search using natural language, style descriptions, or even images. The goal is not just similarity search, but commercially relevant discovery that improves conversion and revenue.
After downloading the dataset we start up Marqo:
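The usual way to run Marqo locally is as a Docker container; the image name and port below are Marqo's documented defaults, though your setup may differ:

```shell
# Pull and start the Marqo container (runs in the foreground; add -d to detach).
docker pull marqoai/marqo:latest
docker run --name marqo -p 8882:8882 marqoai/marqo:latest
```

Once the container is up, the API is available at http://localhost:8882.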
Once Marqo is up and running we can get the files ready for indexing:
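The indexing step just needs a list of "documents", one per image. A minimal sketch, assuming the generated JPEGs sit in a local hot-dog-100k/ directory and using an image_location field name chosen for this example:

```python
import glob

def make_docs(image_dir: str) -> list[dict]:
    """Wrap each image file in a document Marqo can index.

    The field name "image_location" and the use of the path as _id
    are choices made for this example, not requirements.
    """
    locators = sorted(glob.glob(f"{image_dir}/*.jpg"))
    return [{"image_location": path, "_id": path} for path in locators]

docs = make_docs("hot-dog-100k")
print(f"prepared {len(docs)} documents")
```

Using the file path as the document `_id` makes it easy to update the same documents later.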
Now we can start indexing:
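A sketch using the Marqo Python client (pip install marqo); the index name, model choice, and keyword arguments reflect the client API at the time of writing and may differ between versions:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# treat_urls_and_pointers_as_images tells Marqo to embed the image
# content itself, not the path string.
mq.create_index(
    "hot-dog-100k",
    model="open_clip/ViT-B-32/laion2b_s34b_b79k",
    treat_urls_and_pointers_as_images=True,
)

# "docs" is the list of documents prepared in the previous step;
# a single placeholder is shown here to keep the sketch self-contained.
docs = [{"image_location": "hot-dog-100k/hot_dog_000000.jpg"}]

mq.index("hot-dog-100k").add_documents(
    docs,
    tensor_fields=["image_location"],  # fields to embed as vectors
    client_batch_size=64,
)
```

Batching on the client side keeps memory use bounded while the roughly 93,000 images are embedded.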
Check we have our images in the index:
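For example, with the assumed index name from the previous step, a document count plus a quick test search:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# Document count for the index.
print(mq.index("hot-dog-100k").get_stats())

# And a sanity-check search.
results = mq.index("hot-dog-100k").search("a photo of a hot dog", limit=3)
for hit in results["hits"]:
    print(hit["image_location"], hit["_score"])
```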
One noticeable thing is the presence of some black images. These come from the diffusion pipeline's built-in safety filter, which blacks out images it deems NSFW. I couldn't be bothered to remove the filter, so some of these images remain in the dataset. We can easily remove them using Marqo, though. Since I am lazy, I will just search for a blank image using a natural language description: "a black image".
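With a multimodal index, the text query is embedded in the same space as the images, so a plain description is enough; something along these lines:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# Text-to-image search: find one of the blacked-out images by description.
results = mq.index("hot-dog-100k").search("a black image", limit=5)
black_image = results["hits"][0]["image_location"]
print(black_image)
```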

Now that we have a black image, we can use it as a query to find all the other (near-duplicate) black images and remove them from the dataset.
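A sketch of the cleanup: because the index treats image paths as images, we can search with the black image itself as the query. The filtered images are near-identical, so they cluster at the top with scores close to 1. The path and similarity threshold below are illustrative.

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# Image-to-image search using a known black image (hypothetical path).
results = mq.index("hot-dog-100k").search(
    "hot-dog-100k/some_black_image.jpg",
    limit=100,
)

# Delete everything that is essentially the same image.
to_delete = [h["_id"] for h in results["hits"] if h["_score"] > 0.99]
mq.index("hot-dog-100k").delete_documents(ids=to_delete)
```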
Our dataset should now be free of the blacked-out images.
Given the variety of hot dogs generated from a single prompt, I was keen to understand more. A quick search for the query "two hot dogs" yielded something of interest, and more followed with the queries "a hamburger" and "a face".

Armed with this survey, I created a new index with the following four documents: "one hot dog", "two hot dogs", "a hamburger" and "a face". We can use our dataset images as queries against these labels and get back a score for each image:label pair. This is effectively zero-shot learning: each category receives a score, which could be thresholded to assign a classification label to each image.
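A sketch of the label index; the index name, field name, and example image path are illustrative:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# Each category becomes a tiny text document. Images are then used as
# queries against these labels -- zero-shot classification by search.
labels = ["one hot dog", "two hot dogs", "a hamburger", "a face"]

mq.create_index("hot-dog-labels", treat_urls_and_pointers_as_images=True)
mq.index("hot-dog-labels").add_documents(
    [{"label": label, "_id": label} for label in labels],
    tensor_fields=["label"],
)

# Score one image against all four labels (path is illustrative).
results = mq.index("hot-dog-labels").search(
    "hot-dog-100k/hot_dog_000123.jpg", limit=4
)
scores = {h["label"]: h["_score"] for h in results["hits"]}
print(scores)
```

Looping this search over the whole dataset yields a score per category for every image.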
Now for each image we have the computed scores against each category which were just “documents” in our small index.
We have now calculated scores for each of the categories described previously. The next step is to update our indexed data with these scores. Re-indexing a document updates it with any new information. We can drop the image field from the documents, as it has already been indexed with the model.
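A sketch of the update step; the _id, field names, and score values below are purely illustrative:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# Re-adding a document with an existing _id updates it. The scores are
# attached as plain fields (tensor_fields is empty, so nothing new is
# embedded), and the already-indexed image field is omitted.
scored_docs = [
    {
        "_id": "hot-dog-100k/hot_dog_000123.jpg",
        "score_one_hot_dog": 0.61,
        "score_two_hot_dogs": 0.22,
        "score_a_hamburger": 0.09,
        "score_a_face": 0.08,
    },
]
mq.index("hot-dog-100k").add_documents(scored_docs, tensor_fields=[])
```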
After all this work we can now animate the hot-dog-100k dataset. We will do this by effectively "sorting" the vectors and creating a movie from the images that accompany them. Before animating, we can make things more interesting by restricting the images we search over, pre-filtering on some of the scores we calculated. This restricts the space available for the animation based on the filtering criteria.
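Marqo's search accepts a filter over stored document fields. With the hypothetical score fields from the previous step, a pre-filtered query might look like this:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# Only images whose zero-shot "a face" score is at least 0.5 are searched.
# Field name and threshold are illustrative.
results = mq.index("hot-dog-100k").search(
    "a photo of a smiling face",
    filter_string="score_a_face:[0.5 TO 1.0]",
    limit=10,
)
for hit in results["hits"]:
    print(hit["_id"], hit["_score"])
```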

To animate the images based on their sorted vectors, we take an image as a starting point (found with the query "a photo of a smiling face") and find the next closest one (as seen above). Repeating the process until no images are left walks us across the latent space. We are effectively solving a variant of the traveling salesman problem, albeit with an approximate, greedy algorithm.
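The walk itself is independent of Marqo: given the image vectors, start somewhere and repeatedly hop to the nearest unvisited neighbor. A sketch in NumPy, assuming cosine similarity:

```python
import numpy as np

def greedy_walk(vectors: np.ndarray, start: int = 0) -> list[int]:
    """Order indices by repeatedly visiting the nearest unvisited vector.

    A greedy, approximate tour of the embedding space -- a cheap
    stand-in for an exact traveling salesman solution.
    """
    # Normalize so that dot products equal cosine similarity.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    order = [start]
    visited = np.zeros(len(v), dtype=bool)
    visited[start] = True
    current = start
    while not visited.all():
        sims = v @ v[current]      # similarity of every vector to the current one
        sims[visited] = -np.inf    # never revisit
        current = int(np.argmax(sims))
        visited[current] = True
        order.append(current)
    return order
```

Mapping the returned indices back to file paths gives the frame order for the animation.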
After we have done this walk we have the “sorted” list of images. These can then be animated in the order they appear in the list.
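One way to assemble the animation, assuming Pillow and a list of image paths already in walk order:

```python
from PIL import Image

def make_animation(paths: list[str], out_path: str, ms_per_frame: int = 100) -> None:
    """Stitch the ordered images into an animated GIF."""
    frames = [Image.open(p).convert("RGB") for p in paths]
    frames[0].save(
        out_path,
        save_all=True,               # write all frames, not just the first
        append_images=frames[1:],
        duration=ms_per_frame,       # time per frame in milliseconds
        loop=0,                      # loop forever
    )
```

For roughly 93,000 frames a streaming video encoder would be more practical than a GIF, but the idea is the same.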

Visual understanding is becoming a core capability of modern product discovery systems. As ecommerce catalogs grow and shoppers increasingly search using natural language and visual cues, discovery engines must interpret both product attributes and visual characteristics in order to surface relevant results.
Experiments like the hot dog dataset illustrate how AI models can organize large image collections based on visual similarity and semantic meaning. In an ecommerce setting, the same approach allows retailers to power image search, style based discovery, and product recommendations that reflect how customers actually explore products.
Marqo applies these capabilities within an AI native ecommerce search and product discovery platform. By understanding both product content and shopper intent, Marqo enables retailers to deliver more intuitive discovery experiences that help customers find the right products faster while increasing engagement and conversion.