February 18, 2026

Context Is All You Need: AI Powered Ecommerce Search and Product Discovery

1. Introduction to Multimodal Search

Modern ecommerce search requires more than keyword matching. Product catalogs include images, titles, descriptions, and structured attributes, and shoppers often search using intent that cannot be captured through text alone. To deliver high performing product discovery experiences, retailers need search that understands both language and visual context.

Marqo is an AI native ecommerce search and product discovery platform built to increase conversion and revenue. Marqo trains a dedicated large language model on each retailer’s catalog, allowing it to understand product attributes, taxonomy, and brand context at a deeper level. This catalog trained intelligence enables more relevant search results, better personalization, and more intuitive discovery experiences.

In this article, we explore how combining text, images, and behavioral context improves ecommerce search relevance and personalization, and we provide practical examples of how these techniques can be applied.

Let’s dig into multimodal search. This article has three main parts:

  1. Introduction to Multimodal Search
  2. Multimodal Search in Practice
    2.1 Multimodal queries
    2.2 Negation
    2.3 Excluding Low Quality Images
    2.4 Searching with Images
    2.5 Conditional Search with Popular or Liked Items
    2.6 Searching as Prompting
    2.7 Ranking with Other Signals
    2.8 Multimodal Entities
  3. Detailed Examples (with code)
An example of a multimodal "document" that contains both images and text.

1.1 Multimodal Search

Multimodal search is search that operates over multiple modalities. There are two ways to do it: with multimodal queries and with multimodal documents. In both cases, a query or document may contain any combination of text and image data. For clarity we will stick to two modalities, text and images, but the concepts are not restricted to these and can be extended to video, audio, and other data types.

An example of multimodal search using images and text to refine a search.

1.2 Benefits

For ecommerce teams, these capabilities translate into better product discovery, higher conversion rates, and more personalized shopping journeys.

There are numerous benefits to this multimodal approach. For example:

  • Multimodal representations (encodings) of documents draw on text and images together, capturing complementary information that is not present in either modality alone (i.e. text-only or image-only search).
  • Multimodal representations allow document metadata to be updated and edited without retraining a model or re-indexing large amounts of data.
  • Relevance feedback can be easily incorporated at a document level to improve or modify results.
  • Curating queries with additional context allows for personalization and curation of results on a per query basis without additional models or fine-tuning.
  • Curation can be performed in natural language.
  • Business logic can be incorporated into the search using natural language.

2. Multimodal Search in Practice

In this section we will walk through a number of ways multimodal search can be used to improve and curate results.

2.1 Multimodal Queries

Multimodal queries are queries made up of multiple components and/or multiple modalities. Rather than relying on a single query alone, retailers can combine multiple intent signals, so that similarity is scored against a weighted collection of components instead of a single piece of text. This enables finer-grained curation of search results than a single-part query allows. We have already seen examples of this earlier in the article, where both images and text are used to curate the search.

Shown below is an example of this where the query has multiple components. The first query is for an item while the second query is used to further condition the results. This acts as a “soft” or “semantic” filter.
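As a sketch, assuming a Marqo-style client that accepts a dictionary of weighted query components (the terms, weights, and index name here are illustrative):

```python
# A weighted, multi-part query: the first component describes the item,
# the second acts as a "soft" semantic filter that conditions the results.
query = {
    "shirt": 1.0,          # primary intent
    "short sleeves": 0.5,  # soft filter: nudge results toward this concept
}

# Against a running Marqo instance this could be issued as, e.g.:
# import marqo
# mq = marqo.Client(url="http://localhost:8882")
# results = mq.index("ecommerce-index").search(query)
```

Each component contributes to the similarity score in proportion to its weight, so the second term biases the ranking without hard-filtering anything out.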

This multi-part query can be understood to be a form of manual query expansion. The animation below illustrates how the query can be used to modify search results.

An example of multimodal search using two text queries to further refine the search.

2.2 Negation

In the previous examples we saw how multiple queries can be used to condition the search. In those examples, the terms were being added with a positive weighting. Another way to utilise these queries is to use negative weighting terms to move away from particular terms or concepts. Below is an example of a query with an additional negative term:
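A sketch of such a query, in the same weighted-dictionary style as before (terms and weights are illustrative):

```python
# Positive weights attract results toward a concept; a negative weight
# pushes results away from one.
query = {
    "green shirt": 1.0,
    "short sleeves": 0.7,
    "buttons": -1.0,  # negation: steer results away from buttons
}
```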

Now the search results are also moving away from the buttons while being drawn to the green shirt and short sleeves.

An example of multimodal search using negation to avoid certain concepts - `buttons` in this case.

2.3 Excluding Low Quality Images

Negation can help avoid particular things in the returned results, like low-quality images or images with artefacts. Concepts such as low image quality or NSFW content can be easily described in natural language, as in the example query below:
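For instance, a negatively weighted natural-language description of the unwanted images might look like this (phrasing and weights are illustrative):

```python
# The unwanted property is described in plain language and given a
# negative weight so matching images are pushed down the ranking.
query = {
    "green shirt": 1.0,
    "a low quality image with artefacts and banding": -0.8,
}
```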

In the example below the initial results contain three low-quality images. These are denoted by a red mark for clarity and the poor image quality can be seen by the strong banding in the background of these images.

An example of multimodal search using negation to avoid low quality images. The low-quality images are denoted by a red dot next to them.

An alternative is to use the same query to clean up existing data by using a positive weight to actively identify low-quality images for removal.

2.4 Searching with Images

In the earlier examples we have seen how searching can be performed using weighted combinations of images and text. Searching with images alone (via image embeddings) can also be performed to utilize image similarity to find similar looking items. An example query is below:
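A sketch of an image-only query, assuming (as with Marqo) that an image URL in the query is embedded with the same model as the documents (the URL is illustrative):

```python
# The entire query is a single image; ranking is by image-to-image
# similarity in the shared embedding space.
query = {"https://example-bucket.s3.amazonaws.com/images/1234.jpg": 1.0}
```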

It can also be easily extended in the same way as with text to include multiple multimodal terms.

An example of multimodal search using an image as the query before refining the search further with natural language.

2.5 Conditional Search with Popular or Liked Items

Another way to utilize multimodal queries is to condition the query using a set of items. For example, this set could come from previously liked or purchased items (a form of similarity search). This steers the search in the direction of these items and can be used to promote particular items or themes. The method can be seen as a form of relevance feedback that uses items instead of variations on the query words themselves. To avoid any extra inference at search time, we can pre-compute the vectors for the set of items and fuse them into a context vector.

Below is an example of two sets of 4 items that are going to be used to condition the search. The contribution for each item can also be adjusted to reflect the magnitude of its popularity.

Two sets of items based on different relevance feedback mechanisms that can be used to curate the search.
Two results sets for identical queries that were conditioned on two different sets of items (image data). The search results are aligned with their conditioning.

2.6 Searching as Prompting

An alternative to constructing multi-part queries is to append specific characteristics or styles to the end of a query. This is effectively the same as "prompting" in text-to-image generation models like DALL-E and Stable Diffusion. For example, additional descriptors can be appended to a query to curate the results. An example query with additional prompting is below:
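A minimal sketch of the idea (the descriptors are illustrative):

```python
# Append style descriptors to the base query, much like prompting a
# text-to-image model.
base_query = "green shirt"
query = base_query + ", trending, stylish, high quality"
```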

The impact of this prompting on the results can be seen in the animation.

Results that are curated with prompting.

Another example query of searching as prompting:

Results that are curated with prompting.

2.7 Ranking with Other Signals

In addition to curating the search with the methods outlined above, we can modify the similarity score to allow ranking with other signals or metrics. For example, document-specific values can be used to multiply or bias the vector similarity score. This allows document-specific concepts like overall popularity to impact the ranking. Below are the regular query and search results based on vector similarity alone. There are three low-quality images in the result set, which can be identified by the strong banding in the background of the images.

Results that are based on similarity alone.

To illustrate the ability to modify the score and use other signals for ranking, we have calculated an aesthetic score metric for each item. The aesthetic score is meant to identify "aesthetic" images and rate them between 1 and 10. We can now bias the score using this document-specific (but query-independent) field. An example is below:
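A sketch of such a score modifier, in the shape used by Marqo's open-source client (field name and weight are from this example; the exact schema may vary by version):

```python
# Multiply the vector similarity score by a document field, so items
# with a higher aesthetic_score are boosted in the ranking.
score_modifiers = {
    "multiply_score_by": [
        {"field_name": "aesthetic_score", "weight": 0.1},
    ]
}
```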

Results that are based on similarity and aesthetic score.

In the image above, the results have now been biased by the aesthetic score to remove the low-quality images (which have a low aesthetic score). This example uses the aesthetic score, but any number of other scalars can be used - for example, ones based on sales and/or popularity.

2.8 Multimodal Entities

Multimodal entities or items are just that - representations that take into account multiple pieces of information. These can be images or text or some combination of both. Examples include using multiple display images for ecommerce. Using multiple images can aid retrieval and help disambiguate between the item for sale and other items in the images. If a multimodal model like CLIP is used, then the different modalities can be used together as they live in the same latent space.

3. Detailed Example

In the next section we will demonstrate how all of the above concepts can be implemented using Marqo.

3.1 Dataset

The dataset consists of ~220,000 ecommerce products with images, text, and some metadata. The items span many categories, from clothing and watches to bags, backpacks, and wallets. Along with the images, each item has an aesthetic score, a caption, and a price. We will use all of these features in the following example. Some images from the dataset are below.

Some example images from the dataset.

3.2 Installing Marqo

The first step is to set up Marqo locally and install the Python client. To start the workflow, we can run the following Docker command from a terminal (for M-series Mac users see here).
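A sketch of the Docker command, following Marqo's standard quick-start (image tag and port are Marqo's defaults at the time of writing):

```shell
# Pull and run the Marqo server; the API listens on port 8882.
docker pull marqoai/marqo:latest
docker rm -f marqo 2>/dev/null || true
docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
```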

The next step is to install the Python client (a REST API is also available).
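The client is installed from PyPI:

```shell
pip install marqo
```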

3.3 Loading the Data

Next, we load the data. The images are hosted on S3 for easy access. We use a file that contains all the image pointers as well as the metadata for them (found here).
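A sketch of the loading step; the file name, column names, and URLs below are illustrative stand-ins for the real metadata file:

```python
import csv

# In practice the metadata file would be read directly, e.g.:
# with open("ecommerce_meta_data.csv", newline="") as f:
#     documents = list(csv.DictReader(f))

# Two illustrative rows in the document shape we will index:
documents = [
    {
        "image_url": "https://example-bucket.s3.amazonaws.com/images/1.jpg",
        "caption": "a green shirt with short sleeves",
        "aesthetic_score": 6.2,
        "price": 29.99,
    },
    {
        "image_url": "https://example-bucket.s3.amazonaws.com/images/2.jpg",
        "caption": "a brown leather wallet",
        "aesthetic_score": 5.1,
        "price": 49.99,
    },
]
```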

3.4 Create the Index

Now that we have the data prepared, we can set up the index. We will use ViT-L-14 from OpenCLIP as the model, which is a very good starting point. A GPU (with at least 4GB of VRAM) is recommended; otherwise a smaller model can be used (although results may be worse).
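A sketch of the index settings in the settings-dict form of Marqo's open-source Python client (the exact shape may differ across client versions; the checkpoint name is one of the OpenCLIP ViT-L-14 weights Marqo exposes):

```python
# Treat URL fields as images so they are downloaded and embedded with CLIP.
settings = {
    "index_defaults": {
        "model": "open_clip/ViT-L-14/laion2b_s32b_b82k",
        "treat_urls_and_pointers_as_images": True,
        "normalize_embeddings": True,
    }
}

# With a running Marqo instance:
# import marqo
# mq = marqo.Client(url="http://localhost:8882")
# mq.create_index("ecommerce-index", settings_dict=settings)
```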

3.5 Add Images to the Index

Now we can add images to the index (these become vector embeddings, specifically image embeddings) which can then be searched over. We can also select the device we want to use and which fields in the data to embed. To use a GPU, change the device to cuda (see here for how to use Marqo with a GPU).
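A sketch of the indexing call (keyword names follow Marqo's Python client and may vary by version):

```python
# Only the image field is embedded; the other fields are stored as metadata.
tensor_fields = ["image_url"]
device = "cpu"  # change to "cuda" to use a GPU

# With documents prepared as above and a running instance:
# mq.index("ecommerce-index").add_documents(
#     documents,
#     tensor_fields=tensor_fields,
#     device=device,
#     client_batch_size=64,  # send documents in client-side batches
# )
```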

3.6 Searching

Now the images are indexed, we can start searching.
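A sketch of a plain natural-language search (the query text is illustrative):

```python
query = "green shirt"

# With a running Marqo instance:
# results = mq.index("ecommerce-index").search(query)
# for hit in results["hits"]:
#     print(hit["_id"], hit["_score"])
```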

3.7 Searching as Prompting

Like in the examples above, it is easy to do more specific searches by adopting a similar style to prompting.
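For example, the same search call can be issued with extra descriptors appended (descriptors are illustrative):

```python
# Prompt-style query: a base item plus style descriptors.
query = "a photo of a green shirt, product photography, high quality"
# results = mq.index("ecommerce-index").search(query)
```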

3.8 Searching with Semantic Filters

Now we can extend the searching to use multi-part queries. These can act as "semantic filters" that can be based on any words to further refine the results.
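A sketch of a weighted multi-part query acting as a semantic filter (terms and weights are illustrative):

```python
query = {
    "green shirt": 1.0,  # main query
    "cotton": 0.4,       # semantic filter: softly prefer cotton items
}
# results = mq.index("ecommerce-index").search(query)
```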

3.9 Searching with Negation

In addition to additive terms, negation can be used. Here we remove buttons from long sleeve shirt examples.
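In query form (the weight magnitudes are illustrative):

```python
query = {
    "long sleeve shirt": 1.0,
    "buttons": -0.6,  # negative weight: push results away from buttons
}
# results = mq.index("ecommerce-index").search(query)
```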

3.10 Searching with Images

In addition to text, searching can be done with images alone.
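A sketch of an image-only query against this index (the URL is an illustrative stand-in for one of the dataset's S3 images):

```python
# A single image URL as the query; it is embedded with the same CLIP
# model as the indexed images.
query = {"https://example-bucket.s3.amazonaws.com/images/1234.jpg": 1.0}
# results = mq.index("ecommerce-index").search(query)
```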

3.11 Searching with Multimodal Queries

The multi-part queries can span both text and images.
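A sketch combining an image anchor with text terms (URL, terms, and weights are illustrative):

```python
query = {
    "https://example-bucket.s3.amazonaws.com/images/1234.jpg": 1.0,  # visual anchor
    "green": 0.4,     # steer the colour
    "buttons": -0.4,  # avoid buttons
}
# results = mq.index("ecommerce-index").search(query)
```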

3.12 Searching with Ranking

We can now extend the search to also include document specific values to boost the ranking of documents in addition to the vector similarity. In this example, each document has a field called aesthetic_score which can also be used to bias the score of each document.
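A sketch of the search with a score modifier on `aesthetic_score`, in the shape used by Marqo's client (exact schema may vary by version; the weight is illustrative):

```python
# Add a scaled copy of each document's aesthetic_score to its
# similarity score, boosting more aesthetic items.
score_modifiers = {
    "add_to_score": [
        {"field_name": "aesthetic_score", "weight": 0.02},
    ]
}
# results = mq.index("ecommerce-index").search(
#     "green shirt", score_modifiers=score_modifiers
# )
```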

3.13 Searching with Popular or Liked Products

Results can be personalized at a per-query level using sets of items, such as previously liked or popular items. We do this in two stages. The first is to calculate the "context vector", a condensed representation of the items, which is pre-computed and stored to remove any additional overhead at query time. The context is generated by indexing the item sets as documents and retrieving the corresponding vectors. The first step is to create a new index to calculate the context vectors.

Then we construct the objects from the sets of items we want to use for the context.

We can now define mappings objects to determine how we want to combine the different fields. We can then index the documents.

To use the calculated vectors as context vectors, we retrieve them from the index and wrap them in a context object that is supplied at search time.
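A sketch of the retrieval and the context object, following the shapes in Marqo's client (document ID, index names, and the placeholder vector are illustrative; real vectors come from the index):

```python
# 1. Retrieve the fused vector of an indexed item set (requires a server):
# doc = mq.index("context-index").get_document(
#     document_id="liked-items-set-1", expose_facets=True
# )
# vector = doc["_tensor_facets"][0]["_embedding"]

# 2. Wrap it in a context object; each vector steers the query toward
#    its item set, with a weight controlling how strongly.
vector = [0.12, -0.03, 0.45]  # placeholder for a real embedding
context = {"tensor": [{"vector": vector, "weight": 0.3}]}

# 3. Supply the context at search time:
# results = mq.index("ecommerce-index").search("backpack", context=context)
```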

3.14 Indexing as Multimodal Objects

For the final part of this example, we demonstrate how both text and images can be combined together as a single entity and allow multimodal representations. We will create a new index in the same way as before but with a new name.

To index the documents as multimodal objects, we need to create a new field and add in what we want to use.

The next step is to index. The only change is an additional mappings object which details how we want to combine the different fields for each document.
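A sketch of the multimodal documents and the mappings object, assuming Marqo's `multimodal_combination` field type (field names, URLs, and weights are illustrative):

```python
# Each document carries a combined field holding both an image and a caption.
documents = [
    {
        "_id": "1",
        "combined_text_image": {
            "image": "https://example-bucket.s3.amazonaws.com/images/1.jpg",
            "caption": "a green shirt with short sleeves",
        },
    }
]

# The mappings object tells Marqo how to fuse the sub-fields into one vector.
mappings = {
    "combined_text_image": {
        "type": "multimodal_combination",
        "weights": {"image": 0.9, "caption": 0.1},  # image-dominant fusion
    }
}

# With a running instance:
# mq.index("ecommerce-multimodal").add_documents(
#     documents, mappings=mappings, tensor_fields=["combined_text_image"]
# )
```

Because a CLIP-style model places images and text in the same latent space, the weighted fusion yields a single vector representing the whole entity.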

Finally we can search in the same way as before.

4. Conclusion

Ecommerce search and product discovery is no longer just about matching keywords. Shoppers expect search to understand intent, product attributes, visual similarity, and context, while delivering personalized results that feel relevant in real time.

This article demonstrated how combining text and image understanding, semantic filters, negation, ranking signals, and relevance feedback can dramatically improve discovery performance. These capabilities are essential for retailers that want to increase conversion, reduce friction, and unlock more revenue from existing traffic.

Marqo is an AI native ecommerce search and product discovery platform built for these outcomes. By training a dedicated large language model on each retailer’s catalog and combining it with real time behavioral learning, Marqo delivers relevance and personalization that generic search systems cannot match.

Ready to explore better search?

Marqo drives more relevant results, smoother discovery, and higher conversions from day one.

Talk to a Search Expert