Product Discovery

June 11, 2026

Visual Search in Ecommerce: How Multimodal AI Is Changing Product Discovery

Ellie Sleightholm, Head of Developer Relations

Visual search is the fastest-growing capability in ecommerce product discovery. Shoppers increasingly expect to find products by how they look, not just by what words describe them. The platforms that can process both text and images in a unified model are delivering measurably higher conversion rates than those limited to text-only retrieval.

Consider the way most ecommerce search platforms process products today. They read a title, a description, and a set of tags. They have never seen the product itself. This is the equivalent of an interior designer selecting furniture from a spreadsheet without a single photograph, or a fashion buyer making purchasing decisions by reading catalog descriptions without ever seeing the clothing. The options for "black dress" are endless: a structured midi, a flowing maxi, a fitted mini, a beaded gown, a casual linen shift. The text says the same thing. The products are completely different. Until a search platform can see the product, it is guessing.

This guide covers how visual search works, why most implementations fall short, and what separates true multimodal product discovery from image-matching features bolted onto text search.

What Visual Search Means in Ecommerce

Visual search in ecommerce refers to the ability to discover products using visual information: uploading a photo, searching by style attributes, or finding visually similar items. But the term is used loosely, and the implementations behind it vary enormously.

There are three levels of visual search capability in ecommerce today:

Level 1: Image-to-Text Proxy

The most common approach. The system analyzes an uploaded image, extracts text labels (e.g., "blue dress, floral pattern, midi length"), and then runs a conventional text search using those labels. This is what most "visual search" features actually do.

Why it falls short: The system never actually sees the product. It converts visual information into text and then matches text to text. Nuances like silhouette, texture, drape, color harmony, and overall aesthetic are lost in translation. A shopper searching for "something like this" while uploading a photo of an elegant evening look gets results that match the extracted labels, not the visual impression.
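
To make the information loss concrete, here is a minimal sketch of an image-to-text proxy pipeline. Everything in it is hypothetical: the catalog, the SKU names, and the `extract_labels` stub, which stands in for whatever image tagger a real system would call. It is an illustration of the pattern, not any vendor's implementation.

```python
# Hypothetical sketch of a Level 1 image-to-text proxy pipeline.
# extract_labels is a stub; a real system would run an image tagger here.

CATALOG = {
    "sku-1": "blue floral midi dress with structured bodice",
    "sku-2": "blue floral midi dress in flowing chiffon",
    "sku-3": "red leather ankle boots",
}


def extract_labels(image_bytes: bytes) -> list[str]:
    """Stand-in for an image tagger: always returns the same label set."""
    return ["blue", "floral", "midi", "dress"]


def keyword_search(labels: list[str], catalog: dict[str, str]) -> list[tuple[str, int]]:
    """Score each product by how many labels appear in its text."""
    scored = []
    for sku, text in catalog.items():
        tokens = set(text.lower().split())
        score = sum(1 for label in labels if label in tokens)
        if score > 0:
            scored.append((sku, score))
    # Highest score first; ties broken by SKU for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


labels = extract_labels(b"...uploaded photo bytes...")
results = keyword_search(labels, CATALOG)
print(results)  # [('sku-1', 4), ('sku-2', 4)]
```

The structured dress and the flowing dress tie, because silhouette and drape never made it into the labels. Once the photo has been reduced to four words, the search has nothing left to tell them apart.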

Level 2: Separate Image Matching

A dedicated image similarity engine runs alongside the text search. When a shopper uploads a photo, the image engine finds visually similar products. When they type a query, the text engine handles it. The two systems operate independently.

Why it falls short: Text queries with visual intent cannot use the image engine. When a shopper types "minimalist gold necklace thin chain" or "dark academia aesthetic," the text engine handles it alone. The system understands what words mean but has no concept of what products look like. Visual understanding is available only through explicit image upload, not through natural language.
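
The split can be sketched as a simple router in front of two engines. The function names and return strings below are assumptions made for illustration; the point is only the control flow.

```python
# Hypothetical sketch of a Level 2 setup: two independent engines behind a router.
from typing import Optional


def text_engine(query: str) -> str:
    """Stand-in for a text-only search backend."""
    return f"text-engine results for {query!r}"


def image_engine(image: bytes) -> str:
    """Stand-in for an image-similarity backend."""
    return f"image-engine results for {len(image)}-byte upload"


def route(query: Optional[str] = None, image: Optional[bytes] = None) -> str:
    """Send uploads to the image engine and everything else to the text engine."""
    if image is not None:
        # Any accompanying text is ignored on this path.
        return image_engine(image)
    if query is not None:
        return text_engine(query)
    raise ValueError("need a query or an image")


typed = route(query="dark academia aesthetic")
uploaded = route(image=b"\x89PNG")
both = route(query="but in navy", image=b"\x89PNG")
```

A typed query with visual intent never reaches the image engine, and in this sketch a photo plus a text modifier silently drops the text. Neither path can use both signals at once.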

Level 3: Unified Multimodal Understanding

Text and images exist in the same mathematical space. The AI understands what a product looks like and what words describe it simultaneously. When a shopper types "elegant evening dress," the system returns products that look elegant, not just products with "elegant" in the title. When a shopper uploads a photo and adds "but in navy," the system processes both signals together.

This is true multimodal product discovery. It means visual understanding is present in every search, every recommendation, and every category page, not just when a shopper explicitly uploads an image.
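
One way to picture "photo plus a text modifier" in a shared space is to blend the two vectors into a single query. The sketch below uses hand-written four-dimensional toy vectors and a plain weighted average; real embeddings have hundreds of learned dimensions, and the blending rule here is an assumption chosen for simplicity, not a description of how any particular platform combines signals.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def combine(image_vec, text_vec, text_weight=0.5):
    """Blend an image vector and a text vector into one unit-length query vector."""
    image_unit, text_unit = normalize(image_vec), normalize(text_vec)
    mixed = [(1 - text_weight) * i + text_weight * t
             for i, t in zip(image_unit, text_unit)]
    return normalize(mixed)


def rank(query_vec, products):
    """Return product names ordered from most to least similar."""
    return sorted(products, key=lambda sku: cosine(query_vec, products[sku]), reverse=True)


# Toy axes (hypothetical): [evening silhouette, navy, red, sneaker shape]
PRODUCTS = {
    "red-evening-dress": [1.0, 0.0, 1.0, 0.0],
    "navy-evening-dress": [1.0, 1.0, 0.0, 0.0],
    "navy-sneaker": [0.0, 1.0, 0.0, 1.0],
}
photo_vec = [1.0, 0.0, 1.0, 0.0]  # uploaded photo of a red evening dress
navy_vec = [0.0, 1.0, 0.0, 0.0]   # the typed modifier "but in navy"

photo_only = rank(photo_vec, PRODUCTS)
photo_plus_text = rank(combine(photo_vec, navy_vec), PRODUCTS)
```

With the photo alone, the red dress ranks first. With the blended query, the navy evening dress moves to the top while the navy sneaker stays last: the silhouette from the photo and the color from the text both count.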

Why Visual Search Matters for Revenue

Visual search is not a feature checkbox. It directly impacts revenue through three mechanisms:

1. Descriptive queries convert when visual understanding is present

The highest-intent queries in ecommerce are descriptive: "gift for mom who likes gardening," "minimalist home office setup," "cottagecore dress for summer." These queries express visual and conceptual intent that text-only search cannot satisfy.

When the search engine understands what products look like, descriptive queries return visually coherent results. When it does not, results are either irrelevant or missing entirely. Redbubble generated $11M in incremental revenue after deploying multimodal search with Marqo.

2. New products surface immediately

A product with a sparse text description but a great image is invisible to text-only search. With multimodal understanding, that product is discoverable from the moment it enters the catalog because the AI recognizes its visual attributes: color palette, style, category, and aesthetic positioning.

This is critical for fashion, home goods, beauty, and any visually driven category where new products launch constantly and product images carry more information than text descriptions.

3. The entire catalog becomes discoverable

For most retailers, product text metadata is incomplete. Titles are inconsistent. Descriptions vary in quality. Attributes are partially filled. Text-only search can only work with what the text says. Multimodal search can see what the product actually is, closing the gap between incomplete metadata and the shopper's visual intent.
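
The metadata gap is easy to reproduce with a toy catalog. In the sketch below, the titles, the three-dimensional "image vectors," and the similarity threshold are all invented for illustration; a real system would get its vectors from an embedding model rather than by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy axes (hypothetical): [tapered organic shape, warm brass or wood finish,
# exposed industrial hardware]. Values are hand-assigned, not model output.
PRODUCTS = {
    "mid-century modern desk lamp": [0.9, 0.9, 0.1],
    "retro brass table lamp": [0.8, 0.9, 0.0],       # same look, different words
    "Lamp 4412": [0.85, 0.8, 0.1],                   # same look, sparse title
    "industrial pipe floor lamp": [0.1, 0.2, 0.9],
}
QUERY_TEXT = "mid-century modern desk lamp"
QUERY_VEC = [1.0, 1.0, 0.0]  # assumed embedding of the query text


def keyword_hits(query, products):
    """Titles that contain every query term."""
    terms = set(query.lower().split())
    return [title for title in products if terms <= set(title.lower().split())]


def vector_hits(query_vec, products, threshold=0.9):
    """Titles whose image vector is close to the query vector."""
    return [title for title, vec in products.items()
            if cosine(query_vec, vec) >= threshold]


text_only = keyword_hits(QUERY_TEXT, PRODUCTS)
multimodal = vector_hits(QUERY_VEC, PRODUCTS)
```

Keyword matching returns only the product whose title happens to contain the query terms. The vector comparison also returns the retro brass lamp and the sparsely titled "Lamp 4412", and still leaves out the industrial floor lamp.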

How Multimodal Search Works

Multimodal means the AI processes more than one type of input at the same time. In ecommerce, this means text (queries, product titles, descriptions) and images (product photos, shopper uploads) are understood together inside a single model, not in separate systems. When a shopper types a query, the AI draws on both textual and visual understanding to retrieve results. When a shopper uploads a photo, the same model interprets it alongside any text they provide.

True multimodal search uses embedding models that map text and images into the same vector space. This means a text query like "rustic wooden dining table" and an image of a rustic wooden dining table end up near each other mathematically, even though one is language and the other is pixels.
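
"Near each other mathematically" usually means a high cosine similarity between vectors. The sketch below assumes that measure and uses hand-written toy vectors in place of real model output, so treat the numbers as illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy axes (hypothetical): [rustic, wooden, dining table, chrome]
text_query_vec = [1.0, 1.0, 1.0, 0.0]  # stands in for "rustic wooden dining table"
image_vecs = {
    "photo: rustic wooden dining table": [0.9, 0.8, 0.9, 0.0],
    "photo: chrome bar stool": [0.0, 0.0, 0.1, 0.9],
}

# Compare one text vector against image vectors directly; no labels in between.
scores = {name: cosine(text_query_vec, vec) for name, vec in image_vecs.items()}
best = max(scores, key=scores.get)
```

Because the text vector and the image vectors live in the same space, the query can be compared with photos directly. The dining table photo scores close to 1.0 and the bar stool close to 0.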

The quality of this alignment determines everything. Generic multimodal models trained on web data understand broad concepts but miss ecommerce-specific nuances. A model that has never seen thousands of handbag variations cannot distinguish between "structured" and "slouchy," between "everyday carry" and "evening clutch."

Marqo was built by former Amazon engineers who saw firsthand that the largest ecommerce platform in the world converts at 18% while the industry average sits below 3%. They founded Marqo to bring that level of product understanding to the broader ecommerce market. At Marqo, multimodal understanding is built into every layer of the architecture:

Purpose-built embedding models trained on hundreds of millions of ecommerce products. These are not general-purpose vision models. They are trained specifically to understand product attributes, style relationships, and visual commerce vocabulary.

A dedicated model per retailer fine-tuned on each retailer's specific catalog. A fashion retailer's model understands silhouette, drape, and trend vocabulary. A home goods retailer's model understands room aesthetics, material texture, and design styles.

Text and image in one unified space. Every search query, whether typed or uploaded, benefits from visual understanding. There is no separate image engine. Visual intelligence is present in every interaction.

73-78% relevance improvement over generic embedding models on a benchmark of over 4 million products. This is the gap between general-purpose multimodal and ecommerce-trained multimodal.

Visual Search Use Cases That Drive Revenue

Fashion and Apparel

Fashion is the most visually driven ecommerce category. Shoppers describe what they want using style language: "effortless French girl style," "business casual but not boring," "streetwear with a vintage edge." These queries are inherently visual. Text-only search treats them as keyword problems. Multimodal search treats them as style problems and returns visually coherent results.

KICKS CREW deployed Marqo's multimodal search across their sneaker and streetwear catalog and saw a 17.7% lift in conversion rate and a 28% increase in cart value. Sneakers are among the most visually differentiated products in ecommerce. Color blocking, silhouette, material combinations, and design details matter more than text descriptions.

Home and Furniture

"Mid-century modern desk lamp" is a visual concept. A text search returns products with those words. A multimodal search returns products that look mid-century modern, including items described as "retro brass table lamp" or "vintage-inspired desk light" that visually match but use different vocabulary.

Beauty and Cosmetics

Color matching, finish, and aesthetic are visual attributes that text struggles to capture. "Warm-toned neutral eyeshadow palette" means something specific visually that varies enormously in text descriptions across brands.

Marketplace and Resale

Resale platforms face the hardest search problem: inconsistent listings, variable image quality, and one-of-a-kind inventory. Visual search is not optional for resale. It is the only way to make millions of unique items discoverable. Resale platforms using Marqo have seen double-digit increases in add-to-cart rate with AI-native product discovery that understands images, text, and product attributes together.

What to Ask When Evaluating Visual Search

Not all visual search claims are equal. When evaluating platforms, these questions separate real multimodal capability from marketing:

1. Is visual understanding present in text search, or only in image upload? If a shopper types a visually descriptive query, does the system understand visual attributes? Or does visual search only activate when someone uploads a photo?

2. Does the model process your product images, or convert them to text labels? Image-to-text proxy is not visual search. Ask whether the AI has actually seen your products or is working from extracted labels.

3. Is the model trained on ecommerce data or general web data? A vision model trained on ImageNet understands "dog" and "cat." It does not understand "oversized boyfriend blazer" vs "structured cropped blazer."

4. Is there a dedicated model for your catalog? A shared model treats every retailer's visual vocabulary the same. A dedicated model understands your specific product relationships, style positioning, and visual language.

5. Can you test with descriptive queries, not just image uploads? Run "dark academia aesthetic" or "minimalist Scandinavian kitchen" through the search. If the results are not visually coherent, the system does not have real visual understanding.
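
A descriptive-query test is easier to compare across vendors when it produces a number. One simple option is precision at k: of the top k results, what fraction did a merchandiser judge to be visually on target? The sketch below assumes hand-labeled judgments and captured result lists; every SKU name in it is made up.

```python
def precision_at_k(returned: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k returned items that were judged relevant."""
    top = returned[:k]
    if not top:
        return 0.0
    return sum(1 for sku in top if sku in relevant) / len(top)


# Hypothetical judgments: SKUs a merchandiser marked as visually coherent per query.
JUDGMENTS = {
    "dark academia aesthetic": {"tweed-blazer", "plaid-skirt", "leather-satchel"},
    "minimalist Scandinavian kitchen": {"oak-stool", "white-pendant-lamp"},
}

# Hypothetical result lists captured from the platform under evaluation.
RETURNED = {
    "dark academia aesthetic":
        ["tweed-blazer", "neon-hoodie", "plaid-skirt", "leather-satchel"],
    "minimalist Scandinavian kitchen":
        ["oak-stool", "gold-chandelier", "white-pendant-lamp", "red-toaster"],
}

report = {query: precision_at_k(RETURNED[query], JUDGMENTS[query], k=4)
          for query in JUDGMENTS}
print(report)
```

Here the first query scores 0.75 and the second 0.5. Running the same query set and the same judgments against each candidate platform gives a like-for-like comparison instead of an impression.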

The Revenue Impact of Multimodal Search

Visual search is not a niche feature. It is the foundation of modern product discovery. The retailers deploying true multimodal search are seeing the largest revenue improvements in the category:

A leading fast fashion retailer: $130M in attributed incremental revenue. The largest published revenue result from any ecommerce search platform.

Redbubble: $11M incremental revenue

KICKS CREW: 17.7% conversion lift, 28% increase in cart value

Kogan: $10.1M incremental revenue

Mejuri: 19.84% increase in search revenue per user

These results come from replacing text-only or AI-layered search with product-native intelligence. The largest gains consistently appear on the descriptive and visual queries where text-only platforms fail.

How Visual Understanding Powers Smarter Recommendations

Visual search is not limited to the search bar. When a platform genuinely sees products, that understanding transforms every recommendation surface across the shopping experience.

Show Me Similar. A shopper clicks on a linen blazer. A text-based system recommends other products tagged "blazer." A visually intelligent system recommends products that actually look similar: the same relaxed structure, the same natural fabric texture, the same tonal palette. The difference is the gap between reading a label and seeing the garment.

Complete the Outfit. Complete the Room. When the AI understands what a product looks like, it can recommend products that visually complement it, not just products that co-occur in purchase history. A navy linen blazer pairs with specific trousers, shoes, and bags based on visual coherence. A mid-century walnut desk pairs with a specific chair, lamp, and shelf based on design language. Text-based recommendations cannot make these connections because they have never seen the products.

Frequently Bought Together, Without Purchase History. Traditional "frequently bought together" relies on co-purchase data. If a product is new, it has no co-purchase history and receives no recommendations. When the AI has seen every product, it knows what pairs well from day one. A new seasonal handbag can be recommended alongside the right outfit the moment it enters the catalog.
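
One plausible way to get "frequently bought together" without purchase history is to borrow it from the most visually similar established product. The sketch below shows that idea with invented products, invented two-dimensional vectors, and an invented co-purchase table. It is one possible approach under those assumptions, not a description of how Marqo or any other platform computes recommendations.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical co-purchase history; the new handbag has none yet.
CO_PURCHASES = {
    "classic-tote": ["trench-coat", "ankle-boots"],
    "running-shoes": ["running-shorts", "sports-socks"],
}

# Hypothetical embeddings (toy axes: [dressy, sporty]).
EMBEDDINGS = {
    "classic-tote": [0.9, 0.1],
    "running-shoes": [0.1, 0.9],
    "new-seasonal-handbag": [0.8, 0.3],
}


def recommend(sku: str, k: int = 2) -> list[str]:
    """Use the product's own history if it exists, else its nearest neighbor's."""
    history = CO_PURCHASES.get(sku)
    if history:
        return history[:k]
    # Cold start: find the most similar product that does have history.
    donors = [other for other in CO_PURCHASES if other != sku]
    nearest = max(donors, key=lambda other: cosine(EMBEDDINGS[sku], EMBEDDINGS[other]))
    return CO_PURCHASES[nearest][:k]


new_item_recs = recommend("new-seasonal-handbag")
established_recs = recommend("running-shoes")
```

The new handbag sits closest to the classic tote in this toy space, so it inherits the trench coat and ankle boots on day one instead of showing an empty recommendation slot.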

These recommendation capabilities matter for revenue because they increase average order value, not just conversion rate. KICKS CREW saw a 28% increase in average cart value alongside a 17.7% conversion lift after deploying Marqo. Better recommendations mean shoppers find more of what they want in a single session.

For these reasons, enterprise retailers across fashion, beauty, home goods, and footwear are switching from legacy search vendors to Marqo. The platforms that can see products deliver measurably different outcomes than those still reading text about them.

Frequently Asked Questions

What is visual search in ecommerce?

Visual search allows shoppers to discover products using visual information, whether by uploading an image, searching by style attributes, or typing visually descriptive queries. True visual search uses multimodal AI that understands both product images and text in a unified model, so visual intelligence is present in every search interaction, not just image uploads.

How does multimodal search differ from image search?

Image search typically matches uploaded photos to visually similar products using a separate image engine. Multimodal search unifies text and image understanding in one model. This means text queries like "minimalist gold jewelry" benefit from visual understanding, not just keyword matching. The AI knows what minimalist jewelry looks like, not just what the word "minimalist" means.

Which ecommerce categories benefit most from visual search?

Fashion, footwear, jewelry, home goods, beauty, and any category where product appearance drives purchase decisions. Resale and marketplace platforms also benefit significantly because of inconsistent product metadata. However, even categories like electronics and sporting goods see improvement because visual search helps with style and design differentiation.

What is the revenue impact of visual search?

Retailers deploying multimodal AI-native search see 10-20% improvement in search conversion rates, with the largest gains on descriptive and style-based queries. Marqo has delivered the largest published revenue results in ecommerce search, including $130M in attributed revenue for a leading fast fashion retailer and $11M for Redbubble. These results come from true multimodal understanding, not image-matching features added to text search.

What is the difference between image-to-text proxy and true multimodal search?

Image-to-text proxy analyzes an image, extracts text labels, and runs a text search. Retrieval never uses the visual content directly; everything is converted to text first. True multimodal search processes images and text in the same mathematical space, preserving visual nuances like style, texture, color harmony, and aesthetic that text labels cannot capture. The difference is most visible on descriptive queries where visual understanding determines relevance.

Does Marqo support visual search?

Marqo's architecture is multimodal from the ground up. Product images and text exist in the same unified embedding space. Every search query benefits from visual understanding, whether the shopper types text, uploads an image, or combines both. Each retailer gets a dedicated model fine-tuned on their specific catalog's visual and textual attributes.

Marqo is an AI-native product discovery platform that understands products visually, semantically, and commercially. Book a demo to see how it performs on your catalog.

