What is Text-to-Image Search and How It Powers Ecommerce Discovery with Marqo

Text-to-image search lets ecommerce sites connect shopper intent expressed in words with visually relevant product results, making discovery faster and more intuitive. In commerce, where shoppers expect relevant and visually meaningful results, text-to-image search is not just a nice-to-have; it’s a competitive differentiator.

Marqo is an AI-native ecommerce search and product discovery platform built to improve relevance, personalization, and revenue. Unlike traditional search systems that rely on keyword matching or generic embedding models shared across many stores, Marqo trains a dedicated large language model on each retailer’s product catalog. This catalog-trained AI understands product attributes, shopper intent, and brand context, enabling more accurate text-to-image retrieval that aligns with real ecommerce behavior.

In this post, we’ll explain what text-to-image search is, how it works, and how Marqo applies it to deliver meaningful discovery experiences for ecommerce.

What is tensor search?

Tensor search is a new method of information retrieval. It is based on deep-learning models that learn the content and meaning of data directly from the data itself. This gives a better understanding of both queries and the information being searched, and greatly improves relevance. Tensor search works for more than text: it can search other forms of media such as images, video and audio directly, without using text as an intermediate representation.

Challenges with search

Envision an online store that needs to build a search function so customers can find items on its website. Now picture a customer searching for a “long sleeved top”. In a simplified view, the query gets broken down into “long”, “sleeved” and “top”. So what gets shown to that customer? At a high level, it is all the items that have the terms “long”, “sleeved” and “top” coded into their metadata. Creating and coding this metadata can be painstaking and manual, but it is necessary to make these items searchable. Now what if the customer searched for “sleesve” or “long-sleeve” instead? By “top”, did they mean a sweater, jumper, t-shirt or something else? The search function would need rules, often hand-crafted, for how to handle these nuances. This gets complex quickly, and such systems of exact matching and symbolic rules are never really complete: they need to be rewritten as language evolves. Finally, absent or misconfigured rules can have a jarring effect on end users. It is not simply that the relevance of results degrades; no results may be returned at all.
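To make the failure mode concrete, here is a minimal sketch of exact term matching against hand-coded metadata. The catalogue, the item names and the `keyword_search` helper are all hypothetical and exist only for illustration; production keyword engines add stemming, synonyms and spelling correction, which are exactly the hand-crafted rules described above.

```python
def tokenize(text):
    """Lowercase and split on whitespace. Real engines do far more."""
    return text.lower().split()


# Hypothetical catalogue: item name -> hand-coded metadata terms.
CATALOGUE = {
    "striped long sleeved top": {"long", "sleeved", "top", "striped"},
    "wool jumper": {"wool", "jumper", "knit"},
    "short sleeved t-shirt": {"short", "sleeved", "t-shirt"},
}


def keyword_search(query, catalogue=CATALOGUE):
    """Return items whose metadata contains every query term exactly."""
    terms = set(tokenize(query))
    return [name for name, tags in catalogue.items() if terms <= tags]


exact = keyword_search("long sleeved top")       # matches the striped top
misspelt = keyword_search("long sleesve top")    # typo: nothing matches
hyphenated = keyword_search("long-sleeve top")   # variant spelling: nothing matches

print(exact, misspelt, hyphenated)
```

With no rule for the typo or the hyphenated variant, the last two queries return an empty list rather than a slightly worse ranking.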

Why do we need tensor search?

Pre-trained deep learning models learn these rules from data, so they no longer need to be curated by hand. This reduces manual work and improves relevance. Additionally, the models are not confined to text but can work over any media type, such as images, video and audio. Take the previous example, where we have results for “long sleeved top”. Imagine now that a customer has found a result of interest and wants to find visually similar items with some modification: “visually similar to this long sleeved top but with a floral pattern”.

As seen in the animation above, tensor search permits this intuitive search paradigm. An initial search yields relevant results using images alone. The search is then easily refined using a combination of natural language and images, all without the need for metadata. A subtle consequence is that users can now navigate a catalogue of inventory fluently by writing naturally and selecting things they like. There is no reliance on the website owner to maintain hand-crafted, hard-to-manage ontologies for their inventory; these concepts are also learnt by the models that power tensor search.

As seen in the example above, with tensor search users aren’t limited to the keyword standard for search. They can search with questions, related terms or with images, audio or videos directly (or any combination thereof). Users can search the way they think.
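One way to picture a combined image-and-text query such as “visually similar to this long sleeved top but with a floral pattern” is as a weighted sum of embeddings that live in a shared space. The sketch below is an illustration under stated assumptions, not a description of how Marqo composes queries: the three-dimensional vectors are invented by hand, whereas a real system would obtain them from a multimodal model, and the weights and the `combine` helper are hypothetical.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalise(a), normalise(b)
    return sum(x * y for x, y in zip(a, b))


def combine(weighted_vectors):
    """Weighted sum of (vector, weight) pairs, renormalised to unit length."""
    dim = len(weighted_vectors[0][0])
    total = [0.0] * dim
    for vec, weight in weighted_vectors:
        for i, x in enumerate(normalise(vec)):
            total[i] += weight * x
    return normalise(total)


# Toy embeddings (assumed). Axes, loosely: [top-ness, floral-ness, trouser-ness].
ITEMS = {
    "plain long sleeved top": [1.0, 0.0, 0.0],
    "floral long sleeved top": [0.8, 0.6, 0.0],
    "floral trousers": [0.0, 0.6, 0.8],
}

selected_image = ITEMS["plain long sleeved top"]  # the image the shopper picked
floral_text = [0.0, 1.0, 0.0]                     # stand-in for "floral pattern"

query = combine([(selected_image, 1.0), (floral_text, 0.7)])
ranked = sorted(ITEMS, key=lambda name: cosine(query, ITEMS[name]), reverse=True)
print(ranked)
```

In this toy setup the floral top ranks first, ahead of the plain top the shopper started from, and the floral trousers rank last. No metadata is consulted at any point.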

What is a tensor?

Tensors are generalisations of vectors and matrices and can represent high-dimensional data. For example, a single entry in a database could consist of characters, words, sentences, paragraphs, pages or books. Each of these scales of data requires its own representation, and that representation needs to be flexible enough to accommodate all the variations. Our approach is to represent the data and associated queries not by a single vector but by collections of vectors: tensors.

Why use tensors?

Tensors provide a rich representation that can be scaled up or down in concert with the data. They are a generalisation of vector-based representations and provide an effective API for searching across different modalities. For example, in Marqo, tensors are built so that components of the tensor can be associated with specific parts of a document, image, or video. Not only can this improve search relevance, but it can also provide other key information like localisation and explainability (“why did this result get returned?”).

Take the following example of searching an encyclopaedia — it has entries for many different things and each entry itself can contain any or all of short text, documents, images, or videos. When a user queries this database, it should be able to work across all these forms of data and return the most relevant one. This is illustrated in the figure below where the query is not only compared against entire entries, but against each field, and even to specific locations within the field.

Operations can be performed over these tensors to return different results depending on the use case, for example “give me the most relevant sentence” from some text versus “give me the most relevant passage”. Tensors also provide a common interface for searching between modalities. For example, using natural language to search images is enabled by these representations of the data, as made famous by CLIP. Since speed is critical, pre-filtering and matching queries against a corpus are facilitated by efficient and robust algorithms like Hierarchical Navigable Small World (HNSW) graphs.
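The idea of representing an entry as a collection of vectors, and then asking for the most relevant part of it, can be sketched in a few lines. This is an assumption-laden toy: the two-dimensional unit vectors are made up, the document ids are hypothetical, and scoring a document by its best-matching chunk is just one possible operation, not necessarily the one Marqo uses internally. It also scans every chunk, which is the brute-force search that an approximate index such as HNSW is designed to avoid.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Each document is a collection of (chunk_text, unit_vector) pairs, i.e. a
# small tensor rather than one vector. The vectors are invented for the example.
DOCS = {
    "doc-shirts": [
        ("Care instructions for cotton.", [0.0, 1.0]),
        ("A long sleeved floral top.", [1.0, 0.0]),
    ],
    "doc-shoes": [
        ("Leather ankle boots.", [0.6, -0.8]),
        ("Sizing guide for boots.", [0.0, -1.0]),
    ],
}


def search(query_vec, docs=DOCS):
    """Score each document by its best chunk and report which chunk matched."""
    results = []
    for doc_id, chunks in docs.items():
        best_text, best_vec = max(chunks, key=lambda c: dot(query_vec, c[1]))
        results.append({
            "id": doc_id,
            "score": dot(query_vec, best_vec),
            "highlight": best_text,  # localisation: why this result was returned
        })
    return sorted(results, key=lambda r: r["score"], reverse=True)


hits = search([1.0, 0.0])
for hit in hits:
    print(hit)
```

Because each score is tied to a specific chunk, the result carries its own explanation: the `highlight` field names the sentence that matched. Swapping the per-chunk vectors for per-passage vectors would answer “give me the most relevant passage” with the same code.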

The best of both worlds

Although we have spent considerable time espousing the benefits of tensor search, traditional search methods still serve an important function, such as finding exact matches in text. Combining these methods with tensor search gives us the best of both worlds through hybrid search. Not only that, hybrid search can further improve relevance and robustness in the same way ensembling does in machine learning. For these reasons, such functionality is still a part of Marqo.
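As a rough illustration of how two result lists can be ensembled, the sketch below merges a lexical ranking and a tensor ranking with reciprocal rank fusion (RRF), a widely used fusion technique. The text above does not say which fusion method Marqo applies, so treat RRF, the constant `k=60` and the SKU identifiers as assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query from two retrieval methods.
lexical = ["sku-123", "sku-777", "sku-456"]   # exact keyword matches
tensor = ["sku-456", "sku-123", "sku-999"]    # embedding similarity

fused = reciprocal_rank_fusion([lexical, tensor])
print(fused)
```

Items that both methods rank highly rise to the top, while an item found by only one method still appears further down. That is the ensembling effect: an exact match missed by the tensor side, or a semantic match missed by the lexical side, is not lost.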

Marqo: tensor search for humans

In summary, the benefits of tensor search are numerous and include improved relevance, multi-modality, localisation, highlighting, and flexibility. Marqo makes tensor search accessible to all developers by exposing a familiar RESTful API. Easy-to-use JavaScript and Python clients allow for plug-and-play functionality. Developers can also explore the deep customisation options offered. Check out our getting started guide or our GitHub.

Features of Marqo include:

Simple and intuitive API.
“Batteries included” tensor search application. Create a tensor search engine in 3 lines of code.
Designed for the cloud and horizontal scalability.
Prefiltering with a query DSL (domain specific language).
Efficient approximate KNN (k-nearest-neighbours) search using the HNSW algorithm.
Automatic batching and parallel indexing.
Indexing using CPU, GPU or multiple GPUs.
Searching using CPU or GPU.
ONNX support.
Text-text, text-image, image-text and image-image search.
Hybrid search — tensor and lexical search.
Pre-trained text and image models.
Results re-ranking.
Text and image search highlighting.
Integrations with popular libraries like CLIP, Hugging Face and SBERT (and more to come!).

Text-to-image search has become an important tool in ecommerce discovery because it lets shoppers express intent in richer ways than keywords alone. When powered by a system that understands both visual content and textual intent, product discovery becomes faster, more relevant, and more engaging.

Marqo’s approach to text-to-image search is grounded in its identity as an AI-native ecommerce search and product discovery platform. By training a dedicated large language model on each retailer’s catalog and learning from real shopper behavior, Marqo delivers text-to-image relevance that aligns with business outcomes like improved conversion, higher average order value, and reduced search friction.

Whether you’re enhancing your category pages, visual browsing experience, or search bar relevance, Marqo’s catalog-trained AI provides a commerce-focused foundation for richer product discovery.