The index had:
- ~110 million products, each represented by full precision 32 bit float vectors with 512 dimensions
- 7 shards
- embeddings from a model fine-tuned on approximately 500M rows of query-product data
We evaluated the performance of the index on random sets of queries submitted by real users. Evaluations used 100,000 queries sampled equally from the head, torso, and tail. This index ran on our middle tier marqo.balanced storage with marqo.GPU inference. On this index we achieved 125ms average end-to-end latency at 300 QPS with >99% average recall and a median recall of 100%. This latency includes all the pre- and post-processing, tokenization, inference, and database operations. The fine-tuned model improved 73-78% over the base model across nDCG, MRR, mERR, and mAP.
First, let's provide some background. If you're already familiar with HNSW, skip ahead.
HNSW (Hierarchical Navigable Small World) graphs are data structures used to quickly find items that are close to each other in large datasets. Picture a series of maps where each map shows more detail as you go down. You start from the simplest map and move to more detailed ones, making it easier and faster to find what you’re looking for.
HNSW graphs are used for approximate nearest neighbor search, and are effective for high-dimensional data. The structure consists of multiple layers of graphs where each layer represents a subset of the data with decreasing density. This hierarchy allows for efficient searching by navigating through the layers, starting from the sparsest layer and moving down to the densest. HNSW graphs combine small-world network principles with hierarchical navigability, enabling fast and scalable search performance.
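To make that layer-by-layer descent concrete, here is a rough sketch of the search procedure. The `neighbours` and `distance` callables are hypothetical stand-ins for the graph's adjacency lists and the distance metric, and real implementations keep an efSearch-sized candidate heap on the bottom layer rather than tracking a single point:

```python
def greedy_layer_search(query, entry_point, layer, neighbours, distance):
    """Walk one layer greedily: hop to the closest neighbour until no hop improves."""
    current = entry_point
    current_dist = distance(query, current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(current, layer):
            d = distance(query, candidate)
            if d < current_dist:
                current, current_dist = candidate, d
                improved = True
    return current


def hnsw_search(query, entry_point, top_layer, neighbours, distance):
    """Descend from the sparsest layer to the densest, refining the entry point each time."""
    current = entry_point
    for layer in range(top_layer, -1, -1):
        current = greedy_layer_search(query, current, layer, neighbours, distance)
    return current
```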
The HNSW algorithm has three primary parameters that impact recall, memory, and latency: M, efConstruction, and efSearch.
- M is the number of bidirectional links to form between each node in the graph (the bottom layer typically uses 2·M links). This impacts recall, memory usage, and latency: higher M gives better quality retrieval but worse performance.
- efConstruction is the number of candidates to hold in the heap when constructing the graph. Evaluating more candidates gives better graphs with higher recall, but it does increase the time spent indexing; efConstruction does not impact search latency.
- efSearch is the size of the candidate list to hold in the heap at search time. Higher efSearch can increase recall at the cost of latency.
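To see where these parameters live in practice, here is a small example using the open-source hnswlib library (an illustration only, not Marqo's internals; the data and sizes are arbitrary):

```python
import hnswlib
import numpy as np

dim, num_elements = 512, 100_000
data = np.random.rand(num_elements, dim).astype(np.float32)

# M and efConstruction are fixed at build time and shape the graph.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=512, M=16)
index.add_items(data)

# efSearch (called ef in hnswlib) is set at query time and trades latency for recall.
index.set_ef(2000)
labels, distances = index.knn_query(data[:10], k=100)
```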
Recall is the proportion of the true closest results that an approximate search engine actually retrieves. To make vector search feasible at scale, approximations have to be made. The downside is that we run the risk of missing good results when recall is low. Many vector databases, Marqo included, use HNSW to do approximate nearest neighbor search in log(n) time. It is very important to measure your recall before going into production because when recall is low your search will not be reflective of offline exact KNN benchmarks. Poor recall can directly impact business metrics such as add-to-cart rate and revenue. There is no guarantee about which results in your recall set will be missed when approximating search, so we strive for 100% recall whenever possible. This reduces the risk of missing any popular items which attract a lot of sales.
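As a sketch, recall@k for a single query can be measured by comparing the approximate result IDs against an exact brute-force KNN over the same corpus (hypothetical helper functions, NumPy only):

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=100):
    """Fraction of the true top-k neighbours that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def exact_top_k(query_vec, corpus_vecs, k=100):
    """Brute-force KNN by cosine similarity; only feasible offline or on a sample."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]
```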
For all recall tests we use a set of 50,000 real queries drawn from a balanced set of head, torso, and tail queries.
For this particular application we are most interested in recall. High recall is essential to getting the most out of our fine-tuned model, which has been trained on billions of examples. Result sets are fed into downstream re-rankers, so a large candidate set of 1000 is required. This means efSearch must be at least 1000. Fortunately, Marqo already defaults to efSearch = 2000, which is sufficiently high to achieve near perfect recall at any scale (when paired with our default efConstruction of 512 and default M of 16).
The top results are the most important, so we check recall@100 for this test - this is also a realistic page size for a typical e-commerce site, so observations here are transferable to simpler single stage retrieval systems.
In the literature and other benchmarks it is common to see recall reported as the mean of recalls across the query set. Plotting the mean and median here shows that from efSearch 100 to 2000 both are quite good across the board; at our default of efSearch=2000 the median recall is 100% and the mean is >99%.
An average recall of ~93% at efSearch=100 looks like it would be quite good for a lot of use cases, and is 20x less work than efSearch=2000. But don’t jump the gun - these numbers don’t tell the whole story! Looking at the standard deviation of recall as a function of efSearch, we can see that it decreases from 14% to 4% as efSearch increases. At an efSearch of 100 there exists a cohort of queries with much worse recall than others, which is hidden by naively looking only at the average and median.
We can take this further by looking at all our summary statistics together. The hard truth of approximate search is that recall is not guaranteed to be good across the board and that the average hides a lot of behavior. In the below plot we can see that when efSearch is less than 1500 we have some queries with 0% recall.
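Given per-query recall values, these summary statistics are straightforward to compute. The sketch below assumes `recalls` holds one recall@100 value per query (the random placeholder is only there so it runs):

```python
import numpy as np

# Placeholder data; in practice this is one recall@100 value per real query.
recalls = np.random.uniform(0.5, 1.0, size=50_000)

summary = {
    "mean": recalls.mean(),
    "median": np.median(recalls),
    "std": recalls.std(),
    "min": recalls.min(),              # catches queries with 0% recall
    "p1": np.percentile(recalls, 1),   # tail behaviour hidden by the mean
}
print(summary)
```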
The typical argument against using very large efSearch values is that it costs you in terms of latency. This is true, but in practice we observe that this is a very small price to pay. Average latency increases from 15ms to 30ms (a 2x increase) when going from efSearch=100 to efSearch=2000 while serving 4 search threads.
In most real world applications, trading a small amount of latency for an improvement in recall is the sensible choice. The latency remains consistent as efSearch increases, with the standard deviation only increasing by 2ms. This makes sense: each search is single threaded, so at this concurrency there is no fighting for CPU cycles between threads; each search simply takes longer as efSearch increases.
We fine-tuned a model on approximately 500M rows of query-product data for this index of 110M products. A fine-tuned model is necessary for several reasons:
The model was trained for 1.5B query-product samples and demonstrated improvements of 73-78% over the baseline when evaluated on a golden set of historic purchases across mAP, MRR, nDCG, and mERR.
To look at performance in terms of latency, we use a larger set of 100,000 queries randomly sampled from a balanced set of head, torso, and tail queries. We looked at performance when fetching 10, 100, and 200 results (all with efSearch=2000). Another aspect typically ignored in benchmarks is the difference between efSearch and limit, where limit is the number of results you actually want to return to the client.
Decoupling efSearch and limit is very important for performant search. Some engines (like OpenSearch with Lucene) have an intertwined limit and efSearch. This means that if you want to use efSearch = 2000 with OpenSearch and Lucene, you need to actually retrieve 2000 results which costs you a lot in network and processing of JSON. For example, if you have 10 shards this means fetching 2,000 documents from each shard and then merging those 20,000 documents down to 2,000 which go back to the client who then drops 1,900 of them for a page size of 100; this is a lot of wasted time. If efSearch and limit are decoupled then you can just fetch 100 from each shard, combine 1,000 results and return 100 to the client while still having the same number of candidates explored in the HNSW traversal.
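A back-of-the-envelope calculation with the numbers from this example shows how much data movement decoupling saves:

```python
shards = 10
ef_search = 2000
page_size = 100

# Coupled efSearch and limit: every shard must return efSearch documents.
coupled_per_query = shards * ef_search      # 20,000 documents serialised and merged
coupled_to_client = ef_search               # 2,000 returned; 1,900 thrown away for a 100-result page

# Decoupled: efSearch still drives the HNSW traversal on each shard,
# but only a page of results is fetched and merged.
decoupled_per_query = shards * page_size    # 1,000 documents merged down to 100
decoupled_to_client = page_size             # 100 returned, all of them used
```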
Here we are looking at end-to-end search. This includes the following steps, which are performed within Marqo and constitute the core components of vector search:
- pre-processing of the query
- tokenization
- inference to produce the query vector
- the HNSW search and other database operations
- post-processing of the results
This is everything except for the networking time back to the client (i.e. the total time of everything in Marqo) and covers everything you need to do vector search.
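For reference, all of those steps sit behind a single search call to Marqo. A minimal sketch using the Python client follows; the URL and index name are placeholders:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

# One call covers query pre-processing, tokenization, inference,
# the HNSW search across shards, and post-processing of the merged results.
results = mq.index("products").search(q="red running shoes", limit=100)
```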
We often see inference times ignored in vector search system benchmarks, which is not reflective of the real world, where your vectors are not known until a search occurs and must be calculated on the fly. Marqo benefits from significantly reduced network latency between inference and storage infrastructure as they run within a cluster as an end-to-end system, rather than having to call out to external services like OpenAI or HuggingFace, which incurs more networking time.
Under a sustained load of 350 QPS the inference time with the ViT-B-32 model remained tightly grouped around the 50ms mark with a standard deviation of 25ms.
Under a sustained load of 300 QPS our average total time including inference and search is 126ms with a standard deviation of 49ms. This is searching over 110 million full precision 32 bit float vectors with 512 dimensions. This is a realistic scenario for many medium to large search applications, both in terms of average load and page size.
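To put that scale in perspective, the raw vector data alone, ignoring the HNSW graph, document metadata, and any replication, works out to roughly 225 GB:

```python
num_vectors = 110_000_000
dims = 512
bytes_per_float32 = 4

raw_bytes = num_vectors * dims * bytes_per_float32
print(f"{raw_bytes / 1e9:.0f} GB of raw float32 vectors")  # ~225 GB
```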
To better understand the implications of a higher limit and to emphasise our earlier point on the importance of having a separately configurable efSearch and limit, we ran the same test under the same load with twice the number of retrieved results.
This time there is a 39% increase in average latency. This index used 7 shards so each shard is processing an additional 100 results and the API nodes are now dealing with an additional 700 results. The HNSW search time is unchanged and all we have altered is the amount of data being serialised, de-serialised, and moved around the network. We also see a 62% increase in P75 latency.
Our findings in this benchmarking support some of our other explorations into recall and affirm our selection of high default parameters to ensure good recall. An end-to-end p50 latency of 120ms at 300 QPS over >110 million documents is more than sufficient for most real-time vector search applications. Configuring the index with more replicas, more shards, or a higher class of storage (marqo.performance) would comfortably decrease latency and give headroom for higher QPS if needed.
Checking the recall of your search is easy with Marqo - check out our cookbook for an example.
Do you have millions of vectors and need real time search with near perfect recall? Try Marqo Cloud or schedule a demo with our team!