Showcase

Marqo V2: Performance at Scale, Predictability, and Control

March 1, 2024

mins read

We have just released Marqo 2, the next evolution of Marqo's core inference and vector search engine. Building on the success of Marqo 1, Marqo 2 offers significant improvements in performance and features, without compromising the simplicity of Marqo version 1. This article lets you dive into the second version of this open source platform and discover how it addresses the limitations of its predecessor, leveraging Marqo's inference engine in concert with Vespa for unparalleled performance in large-scale, high-throughput semantic search applications.

Background

We released Marqo 1.0 on Jul 24, 2023, as one of the first end-to-end vector search platforms. Marqo is more than just a vector database. Standard vector databases are focused on enabling fast similarity search over embeddings generated by you, often using an approximate nearest neighbour (ANN) algorithm. Marqo is more than that. We refer to Marqo as a documents-in-documents-out solution: The Marqo indexing API accepts documents, not embeddings, and the search API takes plain text (or an image URL) and returns document search results. While Marqo lets you be up and running with state of the art in semantic information retrieval in minutes, you can also work directly with embeddings, run your custom machine learning model and much more.

We are incredibly proud of what our users have been able to achieve with Marqo 1. However, we identified a critical use case in serving e-commerce workloads, which demanded a level of performance that Marqo 1's original architecture, relying on OpenSearch as a vector store, was not ideally suited for. To address these challenges and fulfil our vision, we decided to make a number of architectural changes and release Marqo 2.

Many of the pain points our users dealt with were related to Marqo 1’s backend vector store, OpenSearch. These issues would fall into two broad categories:

Marqo’s document design on OpenSearch. In order to support an arbitrary number of embeddings per document (arising from the chunking of data as well as from multiple tensor fields per document), Marqo 1’s OpenSearch documents require lots of data duplication, which leads to memory and disk storage inefficiency.
OpenSearch’s inherent limitations. Marqo 1 used the Lucene engine. Lucene indexes are made up of immutable segments that are created and merged as documents are added. In practice, this meant unpredictable and often unacceptable performance during and after large-volume indexing as Lucene’s resource-intensive segment merges took place.

As a consequence of the already restricted performance of OpenSearch, we were also unable to improve search recall as much as we wanted, as this required higher HNSW efSearch values which would in turn increase search latency.

For Marqo 2, we looked at a number of open source and proprietary vector databases, including Milvus, Vespa, OpenSearch (AWS managed), Weaviate, Redis, and Qdrant. Published benchmarks frequently fail to answer many questions that are critical in choosing a vector database for high throughput production vector search at very large scales. For instance:

In production use cases, it is common to encounter tens or even hundreds of millions of vectors. The performance of vector databases for such large numbers of vectors is a critical consideration. Most benchmarks, however, tend to focus on relatively small datasets.
How does the vector database perform under concurrent indexing and search, as well as ongoing mutation of indexed documents? Excellent search performance despite high throughput indexing and updates is a requirement for business-critical use cases.
How do different vector databases compare in terms of space complexity and memory efficiency?
What does it take to have a highly available deployment that can meet strict SLAs despite node failures?

Our internal benchmarks revealed that Vespa excelled as the optimal choice, satisfying all the criteria mentioned above, including being highly performant. For example, with 50M vectors, Vespa had a P50 latency of 16ms vs 140ms for Milvus for an infrastructure identical in cost. Vespa is highly configurable, allowing for precise control over various aspects. It offers customisation of the underlying data structures, as well as the choice between memory and disk storage at the field level. Additionally, Vespa supports multi-phase ranking, providing further flexibility in search result optimisation. A proof of concept demonstrated that this configurability was invaluable in reducing Marqo's latency by more than half and increasing its throughput by up to 2x, compared to Marqo 1.

In the following sections, we will delve into the enhancements introduced in Marqo 2, followed by a comprehensive benchmark comparing the recall and performance of Marqo 1 and Marqo 2 on Marqo Cloud. We will then provide a detailed exploration of the design and implementation of Marqo 2, and conclude with a glimpse into what the future holds for this innovative vector search solution.

What’s new in Marqo 2?

For Marqo 2.0, our focus has been on delivering everything Marqo has to offer with much higher performance, scalability and reliability. In our benchmarks, we have seen 2x improvement in throughput (QPS), 2x reduction in P50 latency and 2.3x reduction in P99 latency, as we have pushed recall up to 99%. In the context of search, recall is a metric that compares the number of relevant documents retrieved by an approximate search method to the total number of relevant documents identified by a k-nearest neighbour (k-NN) search, which serves as the ground truth.

From a functionality point of view, perhaps the most significant change in Marqo 2 would be the choice of index type. When creating a Marqo 2 index, you can choose between an unstructured index (default) or a structured index. An unstructured index will behave almost identically to Marqo 1, inferring document structure and data types as new documents are indexed. A structured index, however, will require all document fields to be defined at index creation time, together with other properties that are determined by how the data will be queried. This information helps Marqo design the underlying schema to be as space efficient and performant as possible and provides stricter validation when adding documents or searching. We will talk more about index types in the Deep dive section of this post.

Other new features in Marqo 2 are:

efSearch: When searching with Marqo 2, the efSearch parameter of the HNSW retrieval phase is configurable, so you can get the best recall. The efSearch parameter controls the trade-off between search accuracy and search speed.
Exact KNN search: Marqo 2 can perform an exact nearest neighbour search when you set the approximate parameter to false. This means Marqo is all you need to calculate recall and fine-tune your index configuration.
New space types. Marqo 2 supports euclidean, angular, dotproduct, prenormalized-angular, and hamming distance metrics. L1 and Linf distance metrics are no longer supported.
bfloat16 numeric type. With the bfloat16 numeric type, you can index up to 2x more documents with the same amount of memory for a modest recall and performance penalty.

Based on our experiments, a default value of 2000 was selected for the efSearch parameter. This value was determined to be the point of diminishing returns, where the rate of decline in the gradient of the recall curve with respect to efSearch starts to flatten out. In other words, increasing efSearch beyond 2000 results in only marginal improvements in recall, making 2000 an optimal choice for balancing accuracy and performance.

Benchmark

We used 10 million documents from the DBPedia dataset to compare Marqo 1 and Marqo 2 in terms of recall and search performance on Marqo Cloud. 2x marqo.performance shards and 1 replica were used for storage with 5x marqo.GPU nodes for inference. For embeddings, Wikipedia article titles were used with the hf/e5-base-v2 model (768 dimensions). Queries were picked randomly from a set of 1000 options. e.g., "modern skyscrapers”, "photographing historic battlefield”. We used Marqo 2’s exact nearest neighbour search to obtain reference results for recall@10 calculation. Recall@k for a query is defined as the percentage of ANN search results that are also returned with an exact KNN search, when searching with a limit of k.

Many vector database benchmarks concentrate exclusively on nearest neighbour search. However, these results may not accurately predict real-world performance, as implementing vector search involves much more than just nearest neighbour search. A typical search request involves the following stages:

Preprocessing: For instance, in image search, the image is downloaded and prepared for analysis.
Inference: This stage calculates the query embedding.
Nearest neighbour search: Here we retrieve the most similar documents based on a specified distance metric.
Ranking: The retrieved documents are further sorted based on a ranking function determined by index configuration and search parameters.
Post-processing: The final set of results is converted from the indexed format to the format suitable for user display.

The numbers in this benchmark represent the entire vector search process, as illustrated in the figure above, which includes full-context inference of 512 tokens to compute the query embedding.

Using the default configuration of Marqo 2 (i.e. efSearch not specified in the search request), we obtained the following results:

Metrics	Marqo 1	Marqo 2
Recall@10	0.81	0.97
Recall standard deviation	0.16	0.05
QPS	147.8	157.7
Avg. latency (ms)	171.98	72.11
P50 latency (ms)	170	70
P99 latency (ms)	250	140

As shown in the table above, Marqo 2 offers not just a 20% improvement in recall, but also a much lower recall standard deviation. The significant improvement to search result quality is achieved in addition to, not in spite of, an increase in QPS and a major reduction of search latency, with a 59% reduction in P50 latency and a 44% reduction in P99 latency.

While a low median latency is a must for any production system, maintaining an acceptable level of standard deviation across different queries (i.e. low P90 and P99 latency) is often the biggest challenge. Specific combinations of query and filter string are often responsible for the occasional slow search results and such combinations tend to be impossible to predict in advance. To assess how Marqo 2 performs in such cases, we indexed an additional synthetic field with the value tag_70 for 70% of the documents, tag_20 for 20% of the documents and random values for the rest. As the tables below demonstrate, the latency gap between Marqo 1 and Marqo 2 grows even larger when searching with these filter values.

Filtering for tag_70

Metrics	Marqo 1	Marqo 2	Marqo 2 Improvemnt
Avg. latency (ms)	222.34	85.72	+61.4%
P50 latency (ms)	220	83	+62%
P99 latency (ms)	300	150	+50%

Filtering for tag_20

Metrics	Marqo 1	Marqo 2	Marqo 2 Improvement
Avg. latency (ms)	292.72	106.77	+60%
P50 latency (ms)	290	100	+65%
P99 latency (ms)	400	180	+55%

Deep dive

Structured index

At the core of Marqo 2 are structured indexes. A structured index requires all fields to be defined at index creation time (adding fields to an existing index will be supported in the future). A field definition consists of a name, a type (e.g., text, bool, int etc) and zero or more features. Possible field features are lexical_search, filter and score_modifier.

This predefined configuration allows Marqo to design the underlying Vespa schema in the most optimal manner as the index configuration tells Marqo both what will be stored and, more importantly, how the stored data will be queried. The latter is determined by field features and affects the data structures used for a field.

The lexical_search feature makes a text or text array field available for lexical search by creating an inverted index enabled for BM25 ranking.
The filter feature creates an in-memory B-tree index that enables fast exact and range filtering for the field. The index is further optimised for filtering by storing a bit vector posting list.
score_modifier feature makes a numeric field available as a score modifier at search time by adding it to the score modifiers tensor.

Index metadata is stored in a system schema, marqo__settings by default, so that Marqo instances remain stateless. Depending on its features, the value of a field could be stored more than once. For example, a field with both lexical_search and filter features will create both an inverted index and a B-tree index. The value of a filter field is also stored in memory. For these fields, Marqo will avoid disk access when reading documents by accessing only the in-memory copy of the field.

Unstructured index

Structured indexes deliver the most robust and performant experience that Marqo has to offer, but they require upfront planning about data schema and query design. To offer the ease of use seen in Marqo 1, we introduced unstructured indexes. These allow users to experiment with Marqo and explore new data without making all decisions in advance. Unlike many other vector databases that necessitate defining a schema beforehand, Marqo's unstructured indexes enable rapid prototyping and vector search with just a few lines of code, without the constraints of a predefined schema.

An unstructured index takes only a single line of code to create:‍

mq.create_index("my-first-index")

‍Once created, unstructured indexes can index and search semi-structured data, including data type inference and dynamic field recognition. All text fields are lexically searchable and filterable (up to a configurable max length limit) and all numeric fields are filterable and available as score modifiers. The search API is almost identical to structured indexes, but with some modest space efficiency and performance considerations (more on this later). Under the hood though, unstructured indexes are implemented very differently.

As noted before, a Marqo index is backed by a structured Vespa schema. In order to index semi-structured data, we need to be able to handle new fields that appear in a document on-the-fly and efficiently. Some of our ideas were:

Update the Vespa schema at indexing time to add new fields as needed. Tens or hundreds of Marqo instances will need to update the schema in a serialisable manner. This requires complex concurrency management and updating a Vespa schema is a slow process.
Store all fields in a generic <string, string> dictionary. We would need to maintain a mapping from field names to data types. This mapping would need to be strongly consistent and support serialisable updates, which would have considerable performance implications. Furthermore, storing numeric values as strings is wasteful and will increase disk and memory costs significantly.

Performance at scale was front-of-mind here. We decided to use a generic, yet flexible Vespa schema design for unstructured indexes that did not require complex concurrency management, but also did not sacrifice performance. The Vespa schema backing an unstructured index comprises the following data structures:

All strings: map<string, string> storing all string fields, available for lexical search using an inverted index.
Long strings: map<string, string>. Strings longer than filterStringMaxLength, stored on disk and used only for Marqo document reconstruction.
Short strings: map<string, string>. Strings shorter than or equal to filterStringMaxLength. Stored in an in-memory B-tree index for fast filtering.
String arrays: array<string>. Flattened list made up of all string arrays in the document. Values are prefixed before being stored so that each array’s items are easily identifiable and the original array fields can be recovered. Backed by an in-memory B-tree index for fast filtering.
Numeric values. These are stored in two dictionaries, one for ints and another for floats. They are backed by in-memory B-tree index for log(n) time range and exact filtering.
Booleans: map<string, byte>. Indexed in-memory for fast filtering.
Chunks: array<string>. All tensor field chunks, prefixed with field name and flattened.
Embeddings: a mapping from integer chunk indexes to embeddings. Index in an HNSW graph for log(n) time similarity search.

How is an unstructured index different in practice?

To enable the flexibility that comes with unstructured indexes, we had to make some minor efficiency and functionality concessions. The most notable of these is how embeddings are stored when you have more than one tensor field. Structured indexes create an HNSW graph for each tensor field. Unstructured indexes, on the other hand, store all embeddings from all tensor fields in a single HNSW graph. This has implications both in terms of functionality and recall.

In terms of functionality, this difference means we were not able to implement searchable attributes with unstructured indexes. The searchable attributes feature enables the user to search only a subset of an index’s tensor fields. With a structured index, you can choose to search only one or more specific tensor fields. Behind the scenes, Marqo carries out retrieval on each tensor field’s HNSW graph, and ranks the union of the results to compute the final result set. With an unstructured index, this is not possible as there is only one HNSW graph to search.

Another consequence of this difference in the underlying HNSW data structure is its potential effect on recall. When searching more than one tensor field and for a fixed value of efSearch, a structured index will be retrieving and ranking a larger number of documents. In such scenarios, expect to need a higher efSearch with an unstructured index than a structured index to achieve your desired rate of recall.

Unstructured indexes could also be less memory efficient, though this is dependent on your data. An unstructured index will store all numeric values, as well as strings shorter than a configurable threshold (filterStringMaxLength) in memory. So if your documents contain lots of numeric values or short strings that you are not planning on using in filter strings, a structured index can be much more memory efficient as you can simply avoid using the filter feature for these fields to keep them out of memory.

What’s next

We are proud of the features and major performance improvements we have delivered with Marqo 2. In the next few months, we will be releasing a host of new, exciting features built on top of the foundation that Marqo 2 has laid, driven by the needs and feedback of our customers, including the open-source community. Some of these are high throughput partial updates, inference caching and more sophisticated result ranking options. Keep an eye out for what’s to come!

For more information and to get started with Marqo, check out the Getting Started Guide. Explore Marqo Cloud for a managed Marqo offering.