Research

Understanding Recall in HNSW Search

June 4, 2024

mins read

Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW under real-world scenarios using vectors generated with deep learning models remains under-explored. This article focuses on HNSW's efficacy across a spectrum of datasets, including synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely-used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic fixed parameterisation for the duration of the paper.

Have you calculated the recall of your vector search system? We observe HNSW with under-configured parameters dropping NDCG@10 by as much as 18% on popular benchmark datasets, changing rankings of models compared to exact KNN. Additionally the order in which you insert your data into the graph can result in up to a 17% relative change in recall. This holds true for text only datasets with text embedding models and multimodal datasets with CLIP models.

Unlike other articles which search for a fixed recall and tune parameters, we focus on a fixed set of parameters. While it is great to evaluate systems at various combinations of parameters this is also a time-consuming and expensive exercise that is cost and/or computationally intractable for many users. As such, default parameters will tend to be used in the majority of implementations, these are often poorly documented and in some libraries not easily accessible via the API. Many users also choose to access Vector databases through abstraction layers such as LangChain and LLaMaIndex which further obfuscates the HNSW parameters and behaviour.

Our exploration of recall in HNSW has led to the default configuration in Marqo V2. We opt for significantly higher defaults than other vector databases with efConstruction = 512, M = 16, and efSearch = 2000. In particular, efSearch is found to have a significant impact on recall in our experiments. Marqo V2 utilises Vespa which can handle these parameters without sacrificing performance.

Quick Recap on HNSW

HNSW is an algorithm for approximate search that is used by many vector databases. HNSW offers a time complexity of n log(n) indexing of data and log(n) searching of data; this is better than doing exact KNN at search time which requires all distances to be calculated for a search complexity of n. This time complexity has made HNSW a popular choice for production search systems as indexing is typically not as time sensitive as search is. HNSW also has other nice properties, including parameterisation which makes it configurable with ways to trade-off computation time and memory usage. The algorithm works by forming a multi-layered graph where the density increases with each layer - the bottom layer contains all data and is usually implemented with twice the number of connections. This means that during search long distance connections can traject the query across the graph and land in the correct part of the base layer.

Indexing and search are very similar procedures. At indexing time a layer is randomly selected from an exponentially decaying distribution to decide the entry layer for that data point - the graph is searched and bidirectional links are formed to nearest neighbours.

Search is similar except it always starts at the entry point in the highest layer, the search continues through the layers, until the base layer where the nearest neighbours are found. An illustration of the process is given below.

There are three main parameters in HNSW:

M: The maximum number of bidirectional links to form from each node in the graph. Usually implementations will use 2 * M for base layer with all points. Higher M increases memory usage as well as latency for both indexing and search.
efConstruction: The number of candidates to consider when indexing. Typically implementations will store these candidates in a heap to efficiently get the nearest by distance to the current point. Only affects latency at indexing.
efSearch: Same as efConstruction, but for search time. Only affects latency at search.

What is Intrinsic Dimensionality?

A concept we will mention a lot in this article is ‘Intrinsic Dimensionality’. Intrinsic Dimensionality can be thought of as the minimal number of dimensions required to represent some data, in this case a collection of vectors. We can approximate intrinsic dimensionality over an entire dataset or for smaller regions of the vector space, referred to as local intrinsic dimensionality. The intrinsic dimensionality is lower than the actual dimension of the vectors and will vary with different model and dataset combinations.

For example, some CLIP models are trained on very diverse datasets with up to 5 billion image text pairs and manage to effectively represent all this information in 768 dimensional vectors. If you then generate a dataset using this model with only images of cats as the input it will likely have a much lower intrinsic dimensionality than 768 as not all dimensions of the vectors hold as much information as others.

What Influences Recall?

There are a number of factors that come into play with recall in HNSW search. The parameterisation of the algorithm itself is obviously a key factor however, properties of the models and the data itself can also have an impact.

A key attribute of the vector space that is known to influence recall is the intrinsic dimensionality of the data. Data with higher intrinsic dimensionality typically results in worse recall. Additionally, we have observed that local intrinsic dimensionality can influence recall and will showcase both of these properties further into the article.

Default Parameters of Vector Databases

Note: In this table, k represents the number of results to return. Systems with efSearch = k do not specify a default efSearch and set it to k at search time.

Many HNSW implementations use low default parameters. M = 16 is common, with Weaviate as an outlier. M affects memory usage and search latency but aids recall when the data has high intrinsic-dimensionality. efConstruction ranges from 40 to 512. For efSearch, many vector databases use k (number of results) as the efSearch parameter, which isn't ideal as recall isn't tied to the result set size. If efSearch is tied to k and you run efSearch 512 for k=10, you must process 502 extra results, increasing networking and processing costs. For example, in OpenSearch with Lucene on 10 shards, running efSearch 512 with a limit of 100 results requires retrieving 5120 results and then aggregating them to return 100, making it 5x the work outside of the graph traversal itself compared to systems with configurable efSearch.

For this article we take the median of each parameter excluding Marqo V2, as this was resultant of our research. We use a limit of 10 as this is the number of results typically evaluated at on leaderboards such as MTEB and is on the same scale as what is typical in many RAG implementations, which the selected models are often used for. This results in M = 16, efConstruction = 128 and efSearch of 10 and 40 (instead of averaging the two we will evaluate with both separately).

Datasets and Models

We consider a few cases for exploring behaviour:

Synthetic Data at specific intrinsic dimensionality
Benchmark datasets (ArguAna, TRECCOVID, NFCorpus, Quora, SCIDOCS, SciFact, and CQADupStack)
Real world e-commerce data

For the benchmark datasets we look at an assortment of popular open-source embedding models:

intfloat/e5-small
intfloat/e5-base
intfloat/e5-base-v2
intfloat/e5-small-v2
BAAI/bge-base-en
BAAI/bge-small-en
BAAI/bge-base-en-v1.5
BAAI/bge-small-en-v1.5
thenlper/gte-base
intfloat/multilingual-e5-base
intfloat/multilingual-e5-small
llmrails/ember-v1
infgrad/stella-base-en-v2
sentence-transformers/all-MiniLM-L6-v2
TaylorAI/bge-micro
intfloat/multilingual-e5-large

Recall on Synthetic Data

To see the effect of intrinsic dimensionality on recall, we need a methodology for generating data at specific intrinsic dimensionalities.

We propose a process which involves projecting different numbers of orthonormal basis vectors into a higher dimension. This lets us fix the dimensionality of the vectors but control the intrinsic dimensionality. Details on the process are in section 4.1 of our paper.

We can test the validity of this process by performing a Principal Component Analysis (PCA) on the datasets constructed with various numbers of orthonormal basis vectors and visualising the cumulative explained variance for the components with data generated from various numbers of orthonormal basis vectors. The number of components whose cumulative explained variance is greater than a sufficiently high threshold (we use 0.99) is analogous to the intrinsic dimensionality as it represents the minimum number of dimensions required to explain the data.

Evaluating the recall@10 on these datasets shows a clear degradation as the intrinsic dimensionality increases, even with only 10,000 1024 dim vectors at efConstruction=128, M=16, and efSearch=40.

Recall on Benchmark Datasets

We can see that there is a relationship between intrinsic dimensionality and the recall of Hierarchical Navigable Small World (HNSW) graphs. Our findings indicate that different vector space properties intrinsic to various models, can significantly impact recall performance. Many popular retrieval models, typically trained with contrastive loss, lack explicit control over properties like intrinsic dimensionality. These models are generally evaluated using exact K-Nearest Neighbors (KNN), which does not accurately reflect their performance in approximate nearest neighbor retrieval systems.

Our evaluation on benchmark datasets for various models shows that rankings change when models are assessed using different retrieval systems. This highlights that a leader board based on exact KNN might not truly represent the models' performance with approximate retrieval methods. The models were ranked using Normalized Discounted Cumulative Gain (NDCG).

Average NDCG@10 Scores

The average NDCG@10 scores for various models across different retrieval systems significantly shifts at low efSearch. Smaller models like all-MiniLM-L6-V2 and bge-micro improve their relative performance in approximate retrieval systems, moving up the leaderboard by 2-3 places at efSearch = 10.

Local Intrinsic Dimensionality and Retrieval Performance

With the synthetic data, we looked at the intrinsic dimensionality of the dataset as a whole. We can extend this idea to look at the Local Intrinsic Dimensionality (LID) of every point in the dataset - this involves computing the 100 nearest neighbours for every point in each dataset and then estimating the intrinsic dimensionality of that set of neighbours. The result is a local intrinsic dimensionality value for every vector in every dataset.

Constructing graphs with data sorted in descending LID order improves recall, mimicking a simulated annealing process. Conversely, ascending LID order deteriorates initial conditions for graph optimization. For smaller datasets (ArguAna, NFCorpus, SciFact, and SCIDOCS), we compute a Pearson correlation of 0.61 between recall and average path length in the final layer of the HNSW graph. This supports out hypothesis that late integration of tight localities into the graph results in longer path lengths with better recall (via better reachability).

LID Ordered Insertion and Recall

The following table shows the average recall@10 for various models on benchmark datasets, highlighting that inserting data in descending LID order generally yields higher recall for both HNSWLib and FAISS implementations. On average, descending LID order improves recall by 2.6 and 6.2 percentage points for HNSWLib and FAISS, respectively.

Higher efSearch Impact

The following demonstrates that higher efSearch values (40 in this case) yield overall higher recall, but differences between insertion orders remain present. HNSWLib and FAISS continue to show significant recall differences based on LID order, supporting the influence of LID on retrieval performance.

Relevance and Leaderboard Rankings

We can also measure the downstream impact of recall from these insertion orders. Retrieval systems are commonly evaluated with NDCG@10 - this is effected by the recall@10 where worse recall means inexact results can degrade NDCG. The following table shows the implications of different insertion orders on model rankings using NDCG at efSearch = 10. Model rankings shift under different insertion orders, indicating that vector space properties impact models unequally. Some models are more robust to changes in insertion order, affecting their performance on downstream retrieval tasks.

Real World E-commerce Data

The insertion of data in order of LID is an interesting but highly contrived example. However, it raises the question, can similar issues be observed in realistic real-world conditions? When using HNSW, indexing a document is not a stateless operation, it updates the state of the graph and the sequence of updates directly impacts the final structure.

We look at two datasets:

An e-commerce catalogue of collectibles, handbags, streetwear, sneakers, and watches; and
A homewares catalogue of home, furniture, kitchenware, wall, renovation, bed, rugs, lighting, baby, lifestyle, pet, and office.

For these datasets, the product images are embedded with a ViT-B-32 CLIP model using the laion2b_s34b_b79k checkpoint and the ViT-L-14 CLIP model using the laion2b_s32b_b82k checkpoint.

Real-world data often comes with some higher order structures, one very common example is categories in online shopping catalogues; categories contain products which inherently share semantic similarities. It is a very likely scenario that these categories could influence indexing order, either unintentionally or in unavoidable scenarios such as adding a new category into the catalogue which needs to be indexed.

So does it matter if you order your data by category? Yes, and it also matters what order of categories you do. We firmly recommend that you ensure randomisation of data ordering during indexing when using HNSW indexes with any vector search system.

This is because categories group similar items which exist in localities within the vector space. We can verify this by comparing the intrinsic dimensionality of each category to the intrinsic dimensionality of the entire dataset.

For the fashion dataset:

For the homewares dataset:

For the fashion data we can try all permutations of categories exhaustively, this is not feasible for the homewares data so in this instance we only look at a smaller set of orders.

For the fashion data we observe that certain ordering have significantly lower recall than others.

At efSearch=10

At efSearch=40

We observe disparity between orders in the homewares data as well however it is not as extreme as in the fashion data.

At efSearch=10

At efSearch=40

Summary

Recall in HNSW is not a trivial problem but we see there are clearly things we can do to improve our recall other than turning up parameters for the HNSW algorithm. Some tips and best practices we recommend for any vector search system are as follows:

It is worth checking your recall, you can have the ‘best’ embedding model in the world but if your recall is very low then it doesn’t matter
Randomly shuffle your data when indexing
Evaluate recall at points that make sense for your task to understand the implications, if your 8 top results render on screen first for a user and your page size is 48 then recall@8 and recall@48 are probably more sensible than the recall@10 and recall@100 commonly seen in benchmarks
Exact KNN benchmarks are not 1:1 with approximate ones
efSearch is very important for getting good recall, we strongly advise against using anything less than 100, ideally use at least 512 for good recall on most datasets
Maintaining a larger candidate list within the graph traversal is much more efficient than the network and serialisation/de-serialisation cost of moving data around - be aware of this when using implementations like Opensearch with Lucene where efSearch is tied to the number of results
HNSW implementations differ between vector databases in order to accommodate updates, deletions, and other operations which you need in a real production system, these impact performance and recall so testing different implementations is a good idea

This work and other extensive testing has informed the defaults in Marqo of efConstruction = 512, M = 16, and efSearch = 2000 - these parameters offer a good balanced of recall and latency. If you want good defaults for HNSW out of the box then try Marqo here.

Read the full paper studying the relationships between data, models, intrinsic dimensionality, and recall here.

‍