Showcase

Fine-Tuning Multimodal Ecommerce Search on Over 100 Million Products

July 1, 2024 · 5 mins read
While working with an e-commerce retailer with a large product catalogue of over 110 million items, we put together performance metrics using this real-world data. We constructed an index in Marqo using a ViT-B-32 model that we had fine-tuned on over 1.5 billion real query interactions using Generalised Contrastive Learning (GCL). The model was fine-tuned to match queries to a combination of title, description, and image. It also used signals from live search to learn a ranking beyond what can be obtained from binary relevance relationships alone.

The index had:

  • More than 110 million documents with one vector per document.
  • Vectors which combined product title, description, and image into a unified representation.
  • Full-precision 32-bit float vectors with 512 dimensions.
  • HNSW graphs constructed with efConstruction=512 and M=16.
  • BM25 lexical search on product titles.
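
For reference, an index along these lines can be sketched with the Marqo Python client as below. This is a minimal sketch assuming a reachable Marqo endpoint; the index name and model are illustrative (production used a custom fine-tuned ViT-B-32 checkpoint) and the settings keys follow the public index-settings schema rather than the exact production configuration.

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")  # illustrative endpoint

# Sketch of an index using the ANN parameters described above.
# Model name is illustrative; production used a custom fine-tuned ViT-B-32.
mq.create_index(
    "ecommerce-products",
    settings_dict={
        "model": "open_clip/ViT-B-32/laion2b_s34b_b79k",
        "annParameters": {
            "parameters": {"efConstruction": 512, "m": 16},
        },
    },
)
```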

We evaluated the performance of the index on random sets of queries submitted by real users. Evaluations used 100,000 queries sampled equally from the head, torso, and tail. This index was running on our middle tier marqo.balanced storage with marqo.GPU inference. On this index we could achieve 125ms average end-to-end latency at 300 QPS with >99% average recall and a median recall of 100%. This latency includes all pre- and post-processing, tokenization, inference, and database operations. The fine-tuned model improved by 73-78% over the base model across nDCG, MRR, mERR, and mAP.

First, let's provide some background. If you're already familiar with HNSW, skip ahead.

What are HNSW Graphs?

HNSW (Hierarchical Navigable Small World) graphs are data structures used to quickly find items that are close to each other in large datasets. Picture a series of maps where each map shows more detail as you go down. You start from the simplest map and move to more detailed ones, making it easier and faster to find what you’re looking for.

Figure 1: Illustration of HNSW.

HNSW graphs are used for approximate nearest neighbour search and are effective for high-dimensional data. The structure consists of multiple layers of graphs, where each layer higher in the hierarchy contains a progressively sparser subset of the data. This hierarchy allows for efficient searching by navigating through the layers, starting from the sparsest layer and moving down to the densest. HNSW graphs combine small-world network principles with hierarchical navigability, enabling fast and scalable search performance.

The HNSW algorithm has three primary parameters that impact recall, memory, and latency: M, efConstruction, and efSearch.

  • M is the number of bidirectional links formed for each node in the graph; the bottom layer of the graph typically uses 2·M links. Higher M gives better-quality retrieval at the cost of more memory and higher latency.
  • efConstruction is the number of candidates held in the heap when constructing the graph. Evaluating more candidates produces better graphs with higher recall but increases indexing time; efConstruction does not impact search latency.
  • efSearch is the size of the candidate list held in the heap at search time. Higher efSearch can increase recall at the cost of latency.
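
To make these parameters concrete, the standalone sketch below uses the open-source hnswlib library (not Marqo's internals) on a small random dataset: M and ef_construction are fixed at build time, while ef (efSearch) is varied per query.

```python
import numpy as np
import hnswlib

dim, n = 512, 10_000
data = np.random.rand(n, dim).astype(np.float32)  # stand-in for real embeddings

# Build the graph: M controls links per node, ef_construction controls how
# many candidates are evaluated while inserting each point.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=512)
index.add_items(data, np.arange(n))

# efSearch (set_ef) is the candidate list size at query time; it trades
# latency for recall and must be at least as large as k.
query = np.random.rand(1, dim).astype(np.float32)
for ef in (100, 500, 2000):
    index.set_ef(ef)
    labels, distances = index.knn_query(query, k=100)
```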

Approximate Search and Recall

Recall is the proportion of results returned by an approximate search that are actually among the true closest results. To make vector search feasible at scale, approximations have to be made. The downside is that we run the risk of missing good results when recall is low. Many vector databases, Marqo included, use HNSW to do approximate nearest neighbour search in roughly logarithmic time. It is very important to measure your recall before going into production, because when recall is low your search will not reflect offline exact KNN benchmarks. Poor recall can directly impact business metrics such as add-to-cart rate and revenue. There is no guarantee about which results will be missed when approximating search, so we strive for 100% recall whenever possible. This reduces the risk of missing popular items which attract a lot of sales.

Figure 2: Illustrative example of how approximate search can drop a document at any rank in the recall set.
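
As a concrete illustration, recall@k for a single query can be computed by comparing the IDs from an exact search with those from the approximate index; the ID lists below are hypothetical.

```python
def recall_at_k(exact_ids, approx_ids, k=100):
    """Fraction of the true top-k that the approximate search also returned."""
    exact_top_k = set(exact_ids[:k])
    return len(exact_top_k & set(approx_ids[:k])) / len(exact_top_k)

# Hypothetical example: the approximate search missed one of the true top 5.
print(recall_at_k([1, 2, 3, 4, 5], [1, 2, 3, 9, 4], k=5))  # 0.8
```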

For all recall tests we use a set of 50,000 real queries drawn from a balanced set of head, torso, and tail queries.

For this particular application we are most interested in recall. High recall is essential to getting the most out of our fine-tuned model, which has been trained on billions of examples. Result sets are fed into downstream re-rankers, so a large candidate set of 1000 is required. This means efSearch must be at least 1000. Fortunately, Marqo already defaults to efSearch=2000, which is sufficiently high to achieve near-perfect recall at any scale (when paired with our default efConstruction of 512 and default M of 16).

The top results are the most important, so we check recall@100 for this test. This is also a realistic page size for a typical e-commerce site, so observations here are transferable to simpler single-stage retrieval systems.

Recall and efSearch

In the literature and other benchmarks it is common to see recall reported as the mean of recalls over the query set. Plotting the mean and median here shows that from efSearch=100 to efSearch=2000 both are quite good across the board; at our default of efSearch=2000 the median recall is 100% and the mean is >99%.

Figure 3: Average and median recall@100 when comparing exact nearest neighbours to approximate nearest neighbours with HNSW. This experiment used 100,000 queries.

An average recall of ~93% at efSearch=100 looks like it would be quite good for a lot of use cases, and is 20x less work than efSearch=2000. But don’t jump the gun: these numbers don’t tell the whole story. Looking at the standard deviation of recall as a function of efSearch, we can see that it decreases from 14% to 4% as efSearch increases. At an efSearch of 100 there exists a cohort of queries with much worse recall than the others, which is hidden by naively looking only at the average and median.

Figure 4: Standard deviation of recall@100 when comparing exact nearest neighbours to approximate nearest neighbours with HNSW. Lower efSearch exhibits a much larger standard deviation. This experiment used 100,000 queries.

We can take this further by looking at all our summary statistics together. The hard truth of approximate search is that recall is not guaranteed to be good across the board and that the average hides a lot of behavior. In the below plot we can see that when efSearch is less than 1500 we have some queries with 0% recall.

Figure 5: Summary statistics for recall@100 when comparing exact nearest neighbours and approximate nearest neighbours with HNSW. Lower efSearch parameters exhibit much larger standard deviations and a minimum recall of as low as 0. This experiment used 100,000 queries.
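
Computing these summary statistics is straightforward once a per-query recall has been recorded (for example with the recall_at_k helper sketched earlier); in the snippet below the recalls array is a random stand-in for real measurements.

```python
import numpy as np

# One recall@100 value per query; random stand-in for real measurements.
recalls = np.random.beta(20, 1, size=100_000)

print("mean  :", recalls.mean())
print("median:", np.median(recalls))
print("std   :", recalls.std())
print("min   :", recalls.min())  # a 0.0 here flags queries that missed everything
```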

efSearch and Latency

The typical argument against using very large efSearch values is that it costs you in terms of latency. This is true, but in practice we observe that it is a very small price to pay. Average latency increases from 15ms to 30ms (a 2x increase) when going from efSearch=100 to efSearch=2000 while serving 4 search threads.

Figure 6: Latency vs efSearch. Latency increases with efSearch as is expected however in absolute terms the increase is minimal. This experiment is the average over 100,000 queries with 4 concurrent clients.

In most real-world applications, trading a small amount of latency for an improvement in recall is the sensible choice. Latency also remains consistent as efSearch increases, with the standard deviation growing by only 2ms. This makes sense: each search is single-threaded, so at this concurrency there is no contention for CPU cycles between threads; each search simply takes longer as efSearch increases.

Figure 7: Latency vs efSearch. Latency remains relatively consistent as efSearch increases; a high efSearch does not appear to notably change the consistency of latency. This experiment is the average over 100,000 queries with 4 concurrent clients.

Fine-Tuning

We fine-tuned a model on approximately 500M rows of query-product data for the index of 110M products. A fine-tuned model is necessary for several reasons:

  • Base models only have general associations and lack detailed understanding of the language used in the business. This degrades the quality of the results by showing less relevant items.
  • The base models have not been trained to use all the available product data. This results in poorer performance compared to models that have been trained to use all available data (in vs out of domain).
  • The base models have no notion of rank. Using Generalised Contrastive Learning allows the model to learn beyond binary relationships and to encode relationships that exist on a continuum. This means the model can learn rank alongside relevance (a toy illustration of this idea follows Figure 8 below).

Figure 8: GCL allows for any number of images and text associated with products to be used for training and then retrieval. Additionally, GCL can incorporate feedback like purchases to better learn relationships for queries and products.
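
The continuous-relevance idea can be sketched with a weighted contrastive objective. The toy loss below is not the exact GCL objective; it only illustrates, assuming in-batch negatives, how per-pair weights derived from interaction signals such as purchases can push a model beyond binary relevance.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(query_emb, doc_emb, weights, temperature=0.07):
    """Toy InfoNCE-style loss where each in-batch query-document pair carries a
    continuous weight (e.g. derived from clicks or purchases), so stronger
    interactions contribute more to the gradient. Not the exact GCL objective."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(query_emb))         # positives sit on the diagonal
    per_query = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_query).sum() / weights.sum()

# Hypothetical batch: 4 query/product pairs with purchase-derived weights.
q = torch.randn(4, 512)
d = torch.randn(4, 512)
w = torch.tensor([1.0, 0.2, 3.0, 0.5])
loss = weighted_contrastive_loss(q, d, w)
```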

The model was trained for 1.5B query-product samples and demonstrated improvements of 73-78% over the baseline when evaluated on a golden set of historic purchases across mAP, MRR, nDCG, and mERR.

Figure 9: Graph comparing the base model with the fine-tuned model on a golden dataset of 30k queries and associated purchased products. mAP, MRR, nDCG and mERR all improve by 73-78% over the base model.

Performance

To look at performance in terms of latency, we use a larger set of 100,000 queries randomly sampled from a balanced set of head, torso, and tail queries. We looked at performance when fetching 10, 100, and 200 results (all with efSearch=2000). Another aspect typically ignored in benchmarks is the difference between efSearch and limit, where limit is the number of results you actually want to return to the client.

Decoupling efSearch and limit is very important for performant search. Some engines (like OpenSearch with Lucene) have an intertwined limit and efSearch. This means that if you want to use efSearch=2000 with OpenSearch and Lucene, you need to actually retrieve 2,000 results, which costs you a lot in network transfer and JSON processing. For example, if you have 10 shards this means fetching 2,000 documents from each shard and then merging those 20,000 documents down to 2,000, which go back to the client, who then drops 1,900 of them for a page size of 100; this is a lot of wasted time. If efSearch and limit are decoupled then you can just fetch 100 from each shard, combine 1,000 results, and return 100 to the client while still exploring the same number of candidates in the HNSW traversal.
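
In Marqo the page size is simply the limit argument on search, independent of the candidate list explored internally by HNSW. A minimal sketch (the index name and query are illustrative):

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# Return a page of 100 results to the client; the HNSW traversal still
# explores the much larger efSearch candidate list internally.
results = mq.index("ecommerce-products").search("red leather handbag", limit=100)
print(len(results["hits"]))  # 100
```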

Note on Performance Benchmark Times

Here we are looking at end-to-end search. This includes the following steps which are performed within Marqo and constitute the core components of vector search:

  • Request parsing
  • Query inference with custom fine-tuned ViT-B-32 model
  • Vector retrieval
  • Post-processing

This covers everything except network time back to the client (i.e. the total time spent inside Marqo) and includes every step needed to do vector search.

We often see inference times ignored in vector search benchmarks, which is not reflective of the real world, where query vectors are not known until a search occurs and must be calculated on the fly. Marqo benefits from significantly reduced network latency between inference and storage infrastructure as they run within a cluster as an end-to-end system, rather than having to call out to external services like OpenAI or HuggingFace, which incurs more networking time.
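
One simple way to reproduce this kind of measurement yourself is to time whole requests from several concurrent client threads; the timings then include query inference and retrieval (plus a little client-side network time). The sketch below uses the Marqo Python client with an illustrative index name and placeholder queries.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import marqo

mq = marqo.Client(url="http://localhost:8882")  # illustrative endpoint
queries = ["red dress", "running shoes", "wireless headphones"] * 1000  # placeholders

def timed_search(q):
    # Times the full request: query inference plus vector retrieval.
    start = time.perf_counter()
    mq.index("ecommerce-products").search(q, limit=100)
    return (time.perf_counter() - start) * 1000  # milliseconds

# Four concurrent clients, mirroring the earlier latency experiments.
with ThreadPoolExecutor(max_workers=4) as pool:
    latencies = np.array(list(pool.map(timed_search, queries)))

print(f"mean={latencies.mean():.1f}ms p50={np.median(latencies):.1f}ms std={latencies.std():.1f}ms")
```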

Inference Only

Under a sustained load of 350 QPS the inference time with the ViT-B-32 model remained tightly grouped around the 50ms mark with a standard deviation of 25ms.

End-to-End Search

Limit=100

Under a sustained load of 300 QPS our average total time including inference and search is 126ms with a standard deviation of 49ms; this is searching over 110 million full-precision 32-bit float vectors with 512 dimensions. This is a realistic scenario for many medium to large search applications, both in terms of average load and page size.

Limit=200

To better understand the implication of a higher limit, and to emphasise our earlier point on the importance of having a separately configurable efSearch and limit, we ran the same test with the same load but twice the number of retrieved results.

This time there is a 39% increase in average latency. This index used 7 shards so each shard is processing an additional 100 results and the API nodes are now dealing with an additional 700 results. The HNSW search time is unchanged and all we have altered is the amount of data being serialised, de-serialised, and moved around the network. We also see a 62% increase in P75 latency.

Summary

Our findings in this benchmarking support some of our other explorations into recall and affirm our selection of high default parameters to ensure good recall. An end-to-end p50 latency of 120ms at 300 QPS over >110 million documents is more than sufficient for most real-time vector search applications. Configuring the index with more replicas, more shards, or a higher class of storage (marqo.performance) would comfortably decrease latency and give headroom for higher QPS if needed.

Code

Checking the recall of your search is easy with Marqo; check out our cookbook for an example.
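
As a starting point, a recall check can be sketched as below: run the same query once against an exact baseline and once approximately, then compare the returned IDs. The approximate and ef_search arguments are assumptions about the client version in use, and the index name and query are illustrative; the cookbook has a maintained end-to-end example.

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")
index = mq.index("ecommerce-products")  # illustrative index name

def result_ids(res):
    return [hit["_id"] for hit in res["hits"]]

query = "red leather handbag"

# Exact (brute-force) baseline vs approximate HNSW search.
# The approximate/ef_search kwargs are assumed to exist in the client version used.
exact = result_ids(index.search(query, limit=100, approximate=False))
approx = result_ids(index.search(query, limit=100, approximate=True, ef_search=2000))

recall_at_100 = len(set(exact) & set(approx)) / len(exact)
print(f"recall@100 = {recall_at_100:.3f}")
```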

Do you have millions of vectors and need real time search with near perfect recall? Try Marqo Cloud or schedule a demo with our team!

Owen Elliott
Solutions Architect