TL;DR We generalize the popular CLIP training method to accommodate any number of text and image fields when representing documents, and to encode relevance (or rank) for better first-stage retrieval. The method, Generalized Contrastive Learning (GCL), achieves a 94.5% increase in NDCG@10 and a 504% increase in ERR@10 for in-domain evaluations, and increases of 26.3 - 48.8% in NDCG@10 and 44.3 - 108.0% in ERR@10 for cold-start evaluations, all measured relative to the CLIP baseline. Compared to a keyword-search-only baseline of BM25, there is an improvement of 300 - 750% across NDCG@10 and ERR@10 for in-domain and cold-start evaluations respectively. Finally, we contribute a multi-modal benchmark dataset of 10M rows, across 100k queries and ~5M products, each with ranking data for training and evaluation. Read more in the pre-print, GitHub repository, or below.
Vector search works by representing data as learned vectors known as embeddings. Nearest neighbor search is then used to return the closest (i.e. most relevant) results. The efficacy of the method relies on producing high-quality embeddings: embeddings that accurately encode the desired similarity in the underlying vector space, so that things which are close together in the vector space are “relevant” in the downstream application. However, real-world use cases are more complex than the single-document, single-query scheme assumed by embedding models trained on binary relevance relationships.
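To make the retrieval step concrete, here is a minimal sketch of exact nearest-neighbor search over embeddings using cosine similarity. The embeddings, dimensions, and result count are placeholders; a production system would use a vector database or an approximate index rather than a brute-force scan.

```python
import numpy as np

# Placeholder document embeddings (N documents, D dimensions), L2-normalized.
doc_embeddings = np.random.randn(1000, 512).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# Placeholder query embedding, also L2-normalized.
query = np.random.randn(512).astype(np.float32)
query /= np.linalg.norm(query)

# For normalized vectors, cosine similarity reduces to a dot product.
scores = doc_embeddings @ query

# Indices of the top-10 closest (i.e. most relevant) documents.
top_k = np.argsort(-scores)[:10]
print(top_k, scores[top_k])
```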
Although vector search is very powerful and enables searching across just about any data, the current methods have some limitations. The prevailing methods for training embedding models are largely disconnected from the end use case (like search), the vector database, and the requirements of users. This means that much of the potential of vector search goes unrealized. Some of the current challenges are described below.
Current models encode and represent one piece of information with one vector. In reality, there are often multiple pieces of pertinent information for a document, and these may span multiple modalities. For example, in product search there may be a title, description, reviews, and multiple images, each with its own caption. Furthermore, different documents will have different affinities for different queries, and these affinities can be quantified using various engagement, behavioral, or judgement metrics. Existing training methods only support binary relationships of a single magnitude and cannot use this important information. GCL generalizes embedding model training to use as many pieces of information as desired.
For many retrieval applications, where a query asks a question or has only a single correct result, training models on binary relationships works well: the model only needs to learn how to match the query with the document. However, when there are degenerate queries - multiple results that satisfy some criteria of relevance - the ordering of the results is only ever learned indirectly from the many binary relationships. In reality, the ordering of results matters, even for first-stage retrieval. GCL allows the magnitude of query-document-specific relevance to be encoded in the embeddings and improves the ranking of candidate documents.
Multi-modal models like CLIP are trained to work only from image to text (and vice versa). Their text-text understanding is not as good as that of text-only models, because text-text relationships are learned indirectly through images. Many applications require both inter- and intra-modality understanding. GCL allows for any combination of inter- and intra-modal understanding by directly optimizing for it.
In developing GCL, it became apparent that there was a disconnect between publicly available datasets for embedding model training and evaluation and real-world use cases. Existing benchmarks are typically text-only or inter-modal only, and focus on the 1-1 query-result paradigm. Additionally, existing datasets have limited notions of relevance: the majority encode it as a binary relationship, while several use (up to) a handful of discrete categorizations, often on the test set only. This differs from typical real-world use cases, where relevance can be a hard binary relationship or come from continuous variables. To help with this we compiled a dataset of 10M (ranked) product-query pairs, across ~100k queries, nearly 5M products, and four evaluation splits (available here).
Having the ability to use all or subsets of the data to represent documents is critical for users. Some documents might be represented by a single piece of information, others by multiple pieces spanning multiple modalities. GCL extends contrastive methods to directly optimize for single unified embeddings built from one or many modalities. GCL also allows optimizing for the exact structure that is stored in the vector database, for example optimizing the embeddings to perform well when used individually or as fused representations with variable constituent embeddings (e.g. to conserve memory).
As highlighted in the previous section, there are several shortcomings of existing embedding model training methods with respect to vector search. To overcome these limitations, Generalized Contrastive Learning (GCL) was developed alongside a 10M row dataset for development and evaluation.
GCL works by extending existing CLIP training to define equality relationships between pieces of information. CLIP has text captions T and images V, and learns embeddings by satisfying a relationship like T = V. GCL generalizes this to a left-hand side (LHS, L) and a right-hand side (RHS, R), i.e. L = R. Instead of restricting the LHS or RHS to a single section of text or a single image, GCL allows each side to be any number of text and image fields, with a magnitude that defines the strength of the relationship. This is highlighted in the example below, where we define an equality that says: on one side we have a query ("lunar new year outfit for pet"), and it is equated to a document made up of a title ("Pet Scarf, Red Chinese New Year Themed Pet Clothes Accessory") and an image, with the strength of the relationship defined by a weight w1.
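As a rough sketch of how such weighted relationships can enter training, the snippet below shows a CLIP-style in-batch contrastive loss in which each LHS/RHS pair (e.g. a query and its fused document embedding) carries a relevance-derived weight. This is illustrative only and simplified relative to the loss described in the paper; the tensor names and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(lhs_emb, rhs_emb, weights, temperature=0.07):
    """In-batch contrastive loss where each LHS/RHS pair carries a weight
    derived from its ground-truth relevance (illustrative sketch)."""
    lhs = F.normalize(lhs_emb, dim=-1)            # (B, D) query-side embeddings
    rhs = F.normalize(rhs_emb, dim=-1)            # (B, D) document-side embeddings
    logits = lhs @ rhs.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(lhs.size(0), device=lhs.device)
    # Symmetric cross-entropy, kept per-pair so it can be scaled by the weight.
    loss_lhs = F.cross_entropy(logits, targets, reduction="none")
    loss_rhs = F.cross_entropy(logits.t(), targets, reduction="none")
    return ((loss_lhs + loss_rhs) * weights).mean() / 2
```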
The weights are converted from ground-truth relevance scores by a score-to-weight function. GCL makes it possible to learn ranking based on historical data; for example, we can attribute a higher weight to a query-document pair if many people have downloaded the document after searching with that query. Moreover, GCL generalizes traditional single-field learning by training with multiple fields, merging elements such as the title and product image into a fused embedding. This approach of using a simple mean embedding is consistent with practices in popular vector databases (e.g. to reduce memory). It is easily extended to more complex learned approaches, for example using an additional learned head to unify the embeddings.
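The sketch below illustrates these two ingredients: a simple score-to-weight function (here a clipped linear mapping, chosen purely for illustration; other mappings are possible) and a fused document embedding formed as the mean of the title and image embeddings. The max_score value and embedding dimension are placeholders.

```python
import torch
import torch.nn.functional as F

def score_to_weight(score, max_score=100.0):
    """Illustrative score-to-weight function: map a ground-truth relevance
    signal (e.g. downloads for a query-document pair) into [0, 1]."""
    return torch.clamp(score / max_score, min=0.0, max=1.0)

def fuse_fields(field_embeddings):
    """Fuse per-field embeddings (e.g. title and image) into one document
    embedding by taking their mean and re-normalizing."""
    fused = torch.stack(field_embeddings, dim=0).mean(dim=0)
    return F.normalize(fused, dim=-1)

# Example: one document represented by a title embedding and an image embedding.
title_emb = F.normalize(torch.randn(512), dim=-1)
image_emb = F.normalize(torch.randn(512), dim=-1)
doc_emb = fuse_fields([title_emb, image_emb])
weight = score_to_weight(torch.tensor(37.0))  # e.g. 37 downloads for this pair
```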
In addition to the GCL framework, we have created and released a dataset to develop and benchmark approaches on. Marqo-GS-10M consists of ~10M query-product pairs, each containing a query, title, image, and rank. The dataset is further divided into a training split and multiple evaluation splits. For a vector search system, there are varying definitions of what constitutes “in-domain” and “out-of-domain”: some use cases might have a completely known and fixed set of queries and documents, while others will be dominated by completely unseen queries and documents. Most systems will be some combination of these, and the composition will likely change over time. Marqo-GS-10M defines four distinct evaluation splits so that model performance is much better understood. The splits are as follows: in-domain (seen queries and seen documents), novel queries (unseen queries against seen documents), novel documents (seen queries against unseen documents), and zero-shot (unseen queries and unseen documents).
The table below shows the results for the different splits outlined above. Both text-only and multimodal models were rank-tuned on Marqo-GS-10M starting from pre-trained base models. The metrics shown are normalised discounted cumulative gain (nDCG), rank-biased precision (RBP), and expected reciprocal rank (ERR). See the Appendix for metric descriptions.
All metrics are between 0 and 1, with higher being better.
In addition to what has been described above, GCL is also compatible with recent popular methods like binary and Matryoshka embeddings. Testing (unreleased) has shown that GCL still outperforms CLIP-based training on Marqo-GS-10M when using binary or Matryoshka methods during training. Below are the results for Matryoshka GCL compared to GCL. Reducing the embedding dimension by a factor of 2 results in no performance degradation, while reducing it by 4x has minimal degradation (retaining >95% of the performance).
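As an illustration of how Matryoshka-style embeddings are typically consumed at inference time, the sketch below keeps only the first k dimensions of a full embedding and re-normalizes before search. The dimension sizes are placeholders, and this is not tied to the exact training setup used here.

```python
import numpy as np

def truncate_embedding(embedding, k):
    """Keep the first k dimensions of a Matryoshka-trained embedding and
    re-normalize so cosine similarity remains meaningful."""
    truncated = embedding[..., :k]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

full = np.random.randn(768).astype(np.float32)  # placeholder full-size embedding
half = truncate_embedding(full, 384)            # 2x smaller
quarter = truncate_embedding(full, 192)         # 4x smaller
```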
Below are results comparing different training and evaluation regimes for binary embeddings with GCL. The baseline is GCL trained and evaluated with float embeddings. This is compared against models trained to target both float and binary embeddings, evaluated with either float or binary embeddings. The results show that performance matches or exceeds pure float-embedding training, while preserving most (>90%) of the relevance when using binary embeddings in a single stage of retrieval.
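For context, a common way to use binary embeddings at retrieval time is to threshold each dimension at zero, pack the bits, and rank candidates by Hamming distance. The sketch below shows this generic recipe; it is not necessarily the exact binarization scheme used in our experiments, and the shapes are placeholders.

```python
import numpy as np

def binarize(embeddings):
    """Binarize float embeddings by thresholding at zero and packing to bits."""
    return np.packbits(embeddings > 0, axis=-1)

def hamming_search(query_bits, doc_bits, k=10):
    """Rank documents by Hamming distance to the query (smaller is better)."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    distances = np.unpackbits(xor, axis=-1).sum(axis=-1)
    return np.argsort(distances)[:k]

doc_bits = binarize(np.random.randn(1000, 512).astype(np.float32))
query_bits = binarize(np.random.randn(512).astype(np.float32))
top_k = hamming_search(query_bits, doc_bits)
```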
GCL extends the benefits of CLIP-style multi-modal contrastive learning while adding the flexibility to deal with many aspects of real-world data, like continuous relevance and varying data sources. To evaluate the method we also developed a 10M row multi-modal dataset with ranking information. Finally, GCL is compatible with other embedding training methods like Matryoshka and binary embeddings. Read the paper here and find the dataset here.
Appendix: Metric descriptions.
nDCG: evaluates a ranking by considering both the relevance and the rank of the documents returned. It first computes the Discounted Cumulative Gain (DCG), which sums the graded relevance scores of documents discounted by their position in the result list, and then normalises this by the ideal DCG (the DCG of a perfectly ordered list) to give a value between 0 and 1.
RBP: measures ranking quality by simulating a user’s likelihood of continuing to view successive search results. It uses a persistence parameter p to weight the relevance of each document, with earlier results receiving more weight. This metric reflects the probability that a user will keep looking through the list, capturing both the relevance of documents and the typical depth of a search session.
ERR: combines the concept of reciprocal rank with multiple relevance levels. It calculates the expected position at which a user finds a satisfactory search result, adjusted by the user's likelihood of stopping after encountering relevant documents. ERR effectively measures how well the results satisfy user queries with highly ranked and relevant results.
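For concreteness, minimal reference implementations of the three metrics are sketched below using their standard formulations (linear gain with a log discount for nDCG, a persistence parameter p for RBP, and graded stopping probabilities for ERR). The default parameters are assumptions and may differ from the exact configuration used in our evaluation.

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the returned ranking normalized by the ideal DCG."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

def rbp(relevances, p=0.9):
    """RBP: relevance weighted by the chance a user persists to each rank."""
    rel = np.asarray(relevances, dtype=float)
    return (1 - p) * (rel * p ** np.arange(rel.size)).sum()

def err_at_k(relevances, k=10, max_grade=4):
    """ERR@k: expected reciprocal rank with graded stopping probabilities."""
    rel = np.asarray(relevances, dtype=float)[:k]
    stop_prob = (2 ** rel - 1) / 2 ** max_grade
    err, p_continue = 0.0, 1.0
    for rank, prob in enumerate(stop_prob, start=1):
        err += p_continue * prob / rank
        p_continue *= 1 - prob
    return err
```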