
How to Build a Multimodal Hybrid Search System with Vectors and Full Text

August 20, 2024

In this article we will cover the fundamentals of implementing ecommerce hybrid search with Marqo on real data from Amazon. We will build a basic UI to interact with Marqo and see the effects of various parameters on search. In doing this we will implement core site functionality including lexical retrieval, vector retrieval, hybrid retrieval, sort orders, and sponsored product spots.

All code for this article can be found on our GitHub here.

What is Hybrid Search?

Hybrid search in Marqo combines vector search with BM25 lexical search. There are three main mechanisms for this:

  • Retrieve with vector search and rank with lexical search
  • Retrieve with lexical search and rank with vector search
  • Retrieve with both vector and lexical search and fuse the results

Why do Hybrid Search?

While dense retrieval (vector search) is incredibly powerful, it has some shortcomings. Modern embedding models used in vector search are very good at capturing the semantics of a piece of text or an image. For many retrieval applications semantics are what matter, and vector search supports this very well. However, in some applications we also need guarantees around the inclusion of certain keywords.

A great example of this is quantities. Due to the way tokenizers work, and how embeddings encapsulate meaning in a continuous space, embedding models struggle to provide exact matches on things like quantities. For example, a query like “modern EDP for winter 50 ml” expresses a few pieces of key information: the user wants an eau de parfum (EDP), a contemporary scent, a scent suitable for winter, and it must be in a 50 ml bottle. The first three are a good fit for vector search because they are largely semantic. The quantity is not: if we show the user results that are not in 50 ml containers then we are not doing our job.

Any situation where keyword matches are important to search, for example health, beauty, technical documentation, tech products, or groceries, is a great application for hybrid search.

A Practical Hybrid Search Example

Take the query “18k gold ring”. The user has a clear information need in which all three terms in the query must be satisfied simultaneously.

A naive lexical search will match the terms but not necessarily match all of them. In this lexical search example we see that everything is 18k gold but not everything is a ring.

Figure 1: Naive lexical search for the query "18k gold ring".

If we instead do vector search we have a different problem. The improved visual understanding means that everything is now a gold ring; however, the embeddings don’t correctly capture the importance of having 18 karats.

Figure 2: Vector search for the query "18k gold ring".

We can combine the best of the two with hybrid search to get both the visual understanding of multimodal vector search and the term matching of lexical search.

Figure 3: Hybrid search for the query "18k gold ring".

Hybrid Search Methods in Detail

We mentioned three methods for hybrid search currently supported in Marqo. Here, we will go into more detail on how each works. Hybrid search can be broken into two main components, retrieval and re-ranking/fusion, each having a number of methods within them.

Figure 4: Overview of Hybrid Search Methods in Marqo.

Tensor Retrieval

For tensor retrieval we do our initial search with vector search and then re-score with lexical search. This method can be useful when vector search is already finding candidates that have your keywords but isn’t returning them in the order you desire. Given a set of documents returned from a vector search retrieval, the final result set will be ordered by the BM25 scores in relation to the query. Note that documents which don’t contain any of the keywords will get a score of 0.

This typically works best with a larger candidate set.
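The re-scoring step described above can be sketched in a few lines of Python. This is an illustration only, not Marqo's implementation: the helper names, the whitespace tokenizer, the `title` field, and the choice to compute BM25 statistics over the candidate set alone (rather than the whole index) are all assumptions made to keep the example self-contained.

```python
import math
from collections import Counter


def bm25_scores(query, texts, k1=1.2, b=0.75):
    """BM25 score of each text against the query, using the candidate set as the corpus."""
    docs = [text.lower().split() for text in texts]
    if not docs:
        return []
    avg_len = sum(len(doc) for doc in docs) / len(docs) or 1.0
    scores = []
    for doc in docs:
        term_counts = Counter(doc)
        score = 0.0
        for term in set(query.lower().split()):
            freq = term_counts[term]
            if freq == 0:
                continue  # a missing keyword contributes nothing to the score
            doc_freq = sum(1 for other in docs if term in other)
            idf = math.log(1 + (len(docs) - doc_freq + 0.5) / (doc_freq + 0.5))
            norm = freq + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank_with_lexical(query, candidates, text_field="title"):
    """Re-order vector search candidates by their lexical (BM25) score."""
    scores = bm25_scores(query, [doc[text_field] for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [{**doc, "_score": score} for score, doc in ranked]


# Pretend these came back from a vector search, in vector-similarity order.
candidates = [
    {"title": "gold plated band"},
    {"title": "18k gold ring"},
    {"title": "silver necklace"},
]
for doc in rerank_with_lexical("18k gold ring", candidates):
    print(round(doc["_score"], 3), doc["title"])
```

The candidate with no matching keyword stays in the result set but sinks to the bottom with a score of 0, which is why a larger candidate set gives the lexical re-ranker more to work with.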

Lexical Retrieval

Lexical retrieval is the opposite of tensor retrieval: we do the initial retrieval with lexical search and then re-rank with the tensor scores. This is useful when you need to be certain that the keywords are contained in your results but want a semantic ordering to the final results. Given a set of documents returned from a lexical search retrieval, the final result set will be ordered by the vector similarity scores in relation to the query. In this case all documents will get a score.

This typically works best with a larger candidate set.
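As a rough sketch of the re-ranking half of this method, the snippet below orders lexically retrieved candidates by cosine similarity to a query vector. The two-dimensional vectors, the `embedding` key, and the function names are hypothetical stand-ins for real model embeddings; Marqo performs this step internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_with_vectors(query_vector, candidates):
    """Re-order lexical search candidates by vector similarity to the query."""
    return sorted(
        candidates,
        key=lambda doc: cosine_similarity(query_vector, doc["embedding"]),
        reverse=True,
    )


# Pretend these came back from a BM25 search, so each already contains the keywords.
lexical_hits = [
    {"_id": "a", "embedding": [0.0, 1.0]},
    {"_id": "b", "embedding": [1.0, 0.0]},
    {"_id": "c", "embedding": [1.0, 1.0]},
]
print([doc["_id"] for doc in rerank_with_vectors([1.0, 0.0], lexical_hits)])
```

Unlike the lexical re-ranker, every candidate receives a meaningful score here, because a similarity can be computed for any pair of vectors.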

Disjunct Retrieval

The most common pattern is to perform a “disjunction” retrieval (fetch both tensor and lexical result sets) and fuse them. For fusion, Marqo implements Reciprocal Rank Fusion (RRF). RRF is rank based which means that the actual scores from each retrieval don’t matter, only their ordering matters. This is useful because BM25 scores and vector similarity scores are not directly comparable. The reciprocal rank score of a result is calculated as follows:

$$ \text{RRF}(d \in D) = \sum_{r \in R}\frac{1}{k+r(d)}$$

Where \( D \) is our set of retrieved documents and \( R \) is our set of retrievers, \( r(d) \) is the rank of document \( d \) with retriever \( r \). A constant smoothing factor \( k \) is included which is typically set to 60 (a number that is empirically determined). The final score of a document is the sum of the reciprocal of its rank with each retriever in the system.
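The formula translates almost directly into code. The following is a minimal sketch rather than Marqo's implementation (which, as noted later, lives in a custom searcher); the function name and the convention that each ranking is a list of document ids ordered best first are assumptions for illustration. A document missing from a ranking simply contributes nothing for that retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list of (id, score) pairs."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result of each retriever contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


lexical_ranking = ["a", "b", "c"]
tensor_ranking = ["b", "d", "a"]
for doc_id, score in reciprocal_rank_fusion([lexical_ranking, tensor_ranking]):
    print(doc_id, round(score, 4))
```

Documents that appear in both lists ("a" and "b" here) accumulate two terms and rise above documents found by only one retriever, regardless of how large or small the underlying BM25 or similarity scores were.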

In our instance we have just two retrievers, lexical search and vector search. This means that Marqo can also easily include an additional term \( \alpha \) which weights the reciprocal rank scores for each retriever to give one system more emphasis than the other. The lexical reciprocal ranks are multiplied by \( 1 - \alpha \) whereas the vector search reciprocal ranks are multiplied by \( \alpha \).
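Written out for the two retrievers, the weighted score takes the following form. The subscripted rank symbols \( r_{\text{lex}} \) and \( r_{\text{tensor}} \) are notation introduced here for illustration, and the numbers in the example are made up:

$$ \text{RRF}_{\alpha}(d) = \frac{1 - \alpha}{k + r_{\text{lex}}(d)} + \frac{\alpha}{k + r_{\text{tensor}}(d)} $$

For instance, with \( k = 60 \) and \( \alpha = 0.7 \), a document ranked 2nd by lexical search and 1st by vector search would score

$$ \frac{0.3}{62} + \frac{0.7}{61} \approx 0.0048 + 0.0115 = 0.0163 $$

Setting \( \alpha = 1 \) reduces the score to the vector search term alone, and \( \alpha = 0 \) to the lexical term alone.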

Disjunct retrieval with RRF is the default in Marqo and works well for the majority of use cases with most recall set sizes (limit parameter in Marqo).

Putting it into Practice

To demonstrate hybrid search we will implement a basic ecommerce multimodal search system using Marqo and data from Amazon. You can get the code on our GitHub here.

First we need to clone the repository and set up our environment:


git clone https://github.com/marqo-ai/marqo-hybrid-demo.git
cd marqo-hybrid-demo
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Once you have cloned the repo you can download the data we need for our index:


mkdir data
wget  -O data/amazon_products.jsonl

This dataset is 1.4 GB so allow some time for it to download. In total there are 500,000 products in the dataset.

Index settings

Hybrid search with Marqo is very similar to normal search and requires minimal code changes. The minimum requirement is that your documents have text fields for lexical search and tensor fields for vector search.

For this demo we will create a structured index with a number of attributes typical of ecommerce search, using the following settings. We define multiple fields with lexical search and one tensor field, which is a multimodal combination of the product image and the product title.


index_settings = {
    "type": "structured",
    "model": "open_clip/ViT-B-16-quickgelu/metaclip_fullcc",
    "normalizeEmbeddings": True,
    "vectorNumericType": "bfloat16",
    "annParameters": {
        "spaceType": "prenormalized-angular",
        "parameters": {"efConstruction": 512, "m": 16},
    },
    "allFields": [
        {"name": "main_category", "type": "text", "features": ["filter"]},
        {"name": "title", "type": "text", "features": ["lexical_search"]},
        {"name": "store", "type": "text", "features": ["lexical_search", "filter"]},
        {"name": "features", "type": "array", "features": ["lexical_search"]},
        {"name": "description", "type": "text", "features": ["lexical_search"]},
        {"name": "categories", "type": "array", "features": ["filter"]},
        {"name": "average_rating", "type": "float", "features": ["score_modifier"]},
        {"name": "rating_number", "type": "float", "features": ["score_modifier"]},
        {"name": "price", "type": "float", "features": ["score_modifier"]},
        {"name": "details", "type": "text", "features": ["lexical_search"]},
        {"name": "product_image", "type": "image_pointer"},
        {
            "name": "multimodal_image_title",
            "type": "multimodal_combination",
            "dependentFields": {"product_image": 0.9, "title": 0.1},
        },
        {"name": "sponsored", "type": "bool", "features": ["filter"]},
        {"name": "bid_amount", "type": "float", "features": ["filter", "score_modifier"]},
    ],
    "tensorFields": ["multimodal_image_title"],
}

Run the create index script to create an index with these settings. We use bfloat16 in this example to cut vector storage space in half at a small cost to latency.


python 2.create_index.py
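To put the bfloat16 saving in concrete terms, here is a back-of-the-envelope estimate of raw vector storage. It assumes one vector per product and 512-dimensional embeddings (the usual output size for ViT-B-16 CLIP models, stated here as an assumption rather than something the index settings confirm), and it ignores the HNSW graph and all other index overhead.

```python
def vector_storage_bytes(num_vectors, dimensions, bytes_per_value):
    """Raw size of the stored vectors, excluding any index overhead."""
    return num_vectors * dimensions * bytes_per_value


NUM_PRODUCTS = 500_000  # size of the dataset used in this article
DIMENSIONS = 512        # assumed embedding size for the ViT-B-16 model

float32_bytes = vector_storage_bytes(NUM_PRODUCTS, DIMENSIONS, 4)
bfloat16_bytes = vector_storage_bytes(NUM_PRODUCTS, DIMENSIONS, 2)

print(f"float32:  {float32_bytes / 1e9:.3f} GB")
print(f"bfloat16: {bfloat16_bytes / 1e9:.3f} GB")
```

Under these assumptions the full dataset needs roughly 1 GB of vectors in float32 and about half that in bfloat16.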

Indexing Data

Once we have built our index we can start indexing data. This process can be stopped and resumed later. I recommend indexing a few thousand products to start with and then moving on; you can always run the script again later to add more. A GPU is strongly recommended here as it will be significantly faster.


python 3.index_data.py --device "gpu"

If using a CPU, you can remove the --device flag. Don’t worry if you see some errors; this dataset has a number of dead image links.
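If you want to write your own indexing loop, one way to make it stoppable and resumable is to stream the JSONL file in batches and remember how many lines have already been sent. The helper below is a hypothetical sketch, not the code in 3.index_data.py; the function name, `batch_size`, and `skip` are assumptions for illustration.

```python
import json
from itertools import islice


def read_jsonl_batches(path, batch_size=64, skip=0):
    """Yield lists of parsed JSON records, skipping the first `skip` lines of the file."""
    with open(path, encoding="utf-8") as handle:
        lines = islice(handle, skip, None)
        records = (json.loads(line) for line in lines if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                return
            yield batch


# Usage sketch: each batch could be passed to the Marqo client's add_documents call,
# and the number of lines consumed so far stored so a later run can pass it as `skip`.
# for batch in read_jsonl_batches("data/amazon_products.jsonl", batch_size=64):
#     ...
```

Streaming keeps memory use flat, which matters for a 1.4 GB file, and small batches mean that a dead image link only affects the documents in one request.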

Marqo Hybrid Search API

To do hybrid search we simply set the search method in our request like so:


mq.index("amazon-example").search(
    q="EDP 50 ml",
    search_method="HYBRID",
)

This will fire off two parallel searches (lexical and tensor) and fuse the results together. The parallelisation and fusion are done with a custom searcher written in Java to provide better efficiency.

By default, Marqo will do a disjunct retrieval and fuse the results with RRF. Marqo exposes all the parameters of RRF described earlier. The full specification of a hybrid RRF search is as follows:


mq.index("amazon-example").search(
    q="EDP 50 ml",
    search_method="HYBRID",
    hybrid_parameters={
        "retrievalMethod": "disjunction",
        "rankingMethod": "rrf",
        "alpha": 0.5,
        "rrfK": 60,
    },
)

The retrievalMethod and rankingMethod parameters control how the hybrid search is executed. For example to retrieve with tensor search and rank with lexical search you would adjust them as follows:


mq.index("amazon-example").search(
    q="EDP 50 ml",
    search_method="HYBRID",
    hybrid_parameters={
        "retrievalMethod": "tensor",
        "rankingMethod": "lexical",
    },
)
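When switching between methods in application code, a small helper can keep the parameter combinations straight. The function below is a hypothetical convenience wrapper, not part of the Marqo client. It only allows the three mechanisms described in this article, and the `lexical`/`tensor` pairing is inferred from the pattern of the example above, so check the Marqo documentation for the full list of supported values.

```python
# The three retrieval/ranking pairings covered in this article.
ARTICLE_COMBINATIONS = {
    ("disjunction", "rrf"),
    ("tensor", "lexical"),
    ("lexical", "tensor"),
}


def build_hybrid_parameters(retrieval_method, ranking_method, alpha=None, rrf_k=None):
    """Build a hybrid_parameters dict, rejecting combinations not covered in the article."""
    if (retrieval_method, ranking_method) not in ARTICLE_COMBINATIONS:
        raise ValueError(f"Unsupported combination: {retrieval_method}/{ranking_method}")

    params = {"retrievalMethod": retrieval_method, "rankingMethod": ranking_method}

    if ranking_method == "rrf":
        if alpha is not None:
            if not 0.0 <= alpha <= 1.0:
                raise ValueError("alpha must be between 0 and 1")
            params["alpha"] = alpha
        if rrf_k is not None:
            params["rrfK"] = rrf_k
    elif alpha is not None or rrf_k is not None:
        # alpha and rrfK only have meaning when results are fused with RRF.
        raise ValueError("alpha and rrf_k only apply to the rrf ranking method")

    return params


print(build_hybrid_parameters("disjunction", "rrf", alpha=0.5, rrf_k=60))
print(build_hybrid_parameters("lexical", "tensor"))
```

The returned dict can be passed as the `hybrid_parameters` argument in the search calls shown above.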

Running the application with a UI

We can get hands on with our new search and try some different parameters by running the UI.


python app.py

This UI exposes all the parameters to Hybrid search that we discussed in the previous section so you can see how they each influence the search experience.

Extension - Sponsored Product Search

Ecommerce applications typically feature things like sponsored search. We have implemented this for the demo as well; the keen-eyed will have spotted the sponsored and bid_amount fields in the index schema.

Sponsored search injects products into search results where vendors have paid some bid amount for sponsorship. The goal of sponsored search is to boost these items without sacrificing all relevancy. In this article we implement an approach that factors in the relevance score as well as the bid amount. Usually relevance is retained well, as in the following example:

Figure 5: Normal vs sponsored results for the query "suit".

In other instances we suffer small degradations in relevancy for sponsored items:

Figure 6: Normal vs sponsored results for the query "gold chronograph".

To randomly generate some sponsored products we can run the following script.


python 4.randomly_sponsor_items.py

This script makes use of Marqo’s partial updates API which can update document metadata in place without impacting the HNSW graph in any way. To do a partial update we provide the document _id and any fields that we want to change.


response = MQ.index(INDEX_NAME).update_documents(
    [
        {"_id": _id, "sponsored": sponsored, "bid_amount": random.random()}
        for _id in batch
    ],
)

Once we have randomly sponsored some items in the index we can use the sponsored search checkbox in the UI to insert a row of sponsored items in the results.

To do a sponsored search we first need to retrieve results that have the sponsored flag set:


results = MQ.index(INDEX_NAME).search(
    q=query,
    search_method=search_type,
    limit=200,
    hybrid_parameters=hybrid_parameters,
    filter_string="sponsored:true",
)

Once we have a result set we can run an auction on the results. We implement a quality-based variant of a simple auction which uses the _score from Marqo to weight the bid_amount, so that the sort order factors in both attributes.


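from typing import List


def auction_spots_with_score(results: List[dict], n_spots: int) -> List[dict]:
    # Quality-weighted rank score: the bid scaled by Marqo's relevance score.
    for res in results:
        res["rank_score"] = res["bid_amount"] * res["_score"]

    results.sort(key=lambda x: x["rank_score"], reverse=True)

    top_sponsored = results[:n_spots]
    if not top_sponsored:
        # No sponsored candidates (or no spots requested), so nothing to price.
        return []

    # Each winner pays the smallest bid that would still match the rank score
    # of the item placed directly below it.
    for i in range(len(top_sponsored) - 1):
        next_score = top_sponsored[i + 1]["_score"]
        next_bid_amount = top_sponsored[i + 1]["bid_amount"]
        current_score = top_sponsored[i]["_score"]
        top_sponsored[i]["price_to_pay"] = (
            next_score * next_bid_amount
        ) / current_score

    # The last winner has no item below it in the row, so it pays its own bid.
    top_sponsored[-1]["price_to_pay"] = top_sponsored[-1]["bid_amount"]

    return top_sponsored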
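The pricing rule in the loop above can be written compactly. Using \( s_i \) for the _score and \( b_i \) for the bid_amount of the item in spot \( i \) (symbols introduced here for illustration), every winner except the last pays

$$ p_i = \frac{s_{i+1} \, b_{i+1}}{s_i} $$

and the last winner pays its own bid. Because the list is sorted so that \( s_i b_i \ge s_{i+1} b_{i+1} \), and assuming scores are positive, it follows that \( p_i \le b_i \): no winner pays more than it bid.

As a worked example with made-up numbers, suppose item A has \( s = 0.5 \), \( b = 0.8 \) (rank score 0.40) and item B has \( s = 0.4 \), \( b = 0.9 \) (rank score 0.36). A takes the first spot despite the lower bid, and pays

$$ p_A = \frac{0.4 \times 0.9}{0.5} = 0.72 $$

which is below its bid of 0.8.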

Summary

In this article, we explored the implementation of a basic ecommerce search system using Marqo's hybrid search capabilities. We demonstrated how combining vector search with BM25 lexical search can enhance search results by leveraging both semantic understanding and keyword matching. Much of the code in this article's accompanying repository can be easily scaled up for larger datasets and larger deployments of Marqo.

The provided dataset and UI provide a platform to experiment with different hybrid search configurations in an ecommerce setting.

Additionally we implemented a sponsored product search feature, which can also be explored via different retrieval methods.

This article is just a primer on hybrid search and how to use it in Marqo. For detailed documentation on all the API features we recommend checking out the docs and getting hands on with the source code for this article.

Code

The code for this article can be found on our GitHub.

Owen Elliott
Solutions Architect at Marqo