
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking

TL;DR We generalize the popular CLIP training method to accommodate any number of texts and images when representing documents, and to encode relevance (or rank) for better first-stage retrieval. Known as Generalized Contrastive Learning (GCL), our method achieves a 94.5% increase in NDCG@10 and a 504% increase in ERR@10 in-domain, and increases of 26.3 - 48.8% in NDCG@10 and 44.3 - 108.0% in ERR@10 for cold-start evaluations, measured relative to the CLIP baseline. Compared to a keyword-search-only baseline of BM25, the improvements are 300 - 750% across NDCG@10 and ERR@10 for in-domain and cold-start evaluations respectively. Finally, we contribute a multi-modal benchmark dataset of 10M rows, across 100k queries and ~5M products, each with ranking data for training and evaluation. Read more in the pre-print, GitHub repository, or below.

1. Introduction to vector search

Vector search works by representing data as learned vectors, called embeddings. Nearest neighbor search is then used to return the closest (i.e. most relevant) results. The efficacy of the method relies on producing high-quality embeddings: ones where the desired similarity is accurately encoded in the underlying vector space, so that things that are close together in the vector space are "relevant" in the downstream application. However, real-world use cases are more complex than single-document, single-query schemes with embedding models trained on binary relevance relationships.

An animation demonstrating a simplified 2-D example of vector similarity. The points represent vectors in 2D, with a query vector shown as the orange dot. Relevant results are found by calculating the distance between the query vector and its neighbors and sorting by the closest distance.
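
To make this concrete, below is a minimal sketch of brute-force nearest neighbor search with cosine similarity. The data, shapes, and names are illustrative, not from any particular vector database:

```python
import numpy as np

# Illustrative corpus of document embeddings (n_docs x dim), L2-normalized
docs = np.random.randn(1000, 128).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query embedding, also L2-normalized
query = np.random.randn(128).astype(np.float32)
query /= np.linalg.norm(query)

# For normalized vectors, cosine similarity reduces to a dot product
scores = docs @ query

# Indices of the 10 closest (most "relevant") documents
top_k = np.argsort(-scores)[:10]
print(top_k, scores[top_k])
```

In practice, a vector database replaces this exhaustive scan with an approximate nearest neighbor index, but the relevance judgment is still just distance in the embedding space.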

2. Limitations of current embedding models for vector search

Although vector search is very powerful and enables searching across just about any data, current methods have some limitations. The prevailing methods for training embedding models are largely disconnected from the end use case (like search), the vector database, and the requirements of users. This means that much of the potential of vector search goes unrealized. Some of the current challenges are described below.

2.1 Restricted to using a single piece of information to represent a document

Current models encode and represent one piece of information with one vector. In reality, a document often has multiple pieces of pertinent information that may span multiple modalities. For example, in product search there may be a title, description, reviews, and multiple images, each with its own caption. Furthermore, different documents will have different affinities for different queries, and these affinities can be quantified using various engagement, behavioral, or judgement metrics. Existing training methods can only use binary relationships of a single magnitude and cannot exploit this important information. GCL generalizes embedding model training to use as many pieces of information as desired.

Standard model training assumes relationships are 1-1. However, in the real world there are often multiple pieces of useful information available. GCL extends CLIP by allowing models to be directly optimized to use these additional data.

2.2 No notion of rank when dealing with degenerate queries

For many retrieval applications where a query asks a question or has only a single correct result, training models on binary relationships works well: the model only needs to learn how to match the query with the document. However, when there are degenerate queries - multiple results that satisfy some criteria of relevance - the ordering of the results is only ever learned indirectly from the many binary relationships. In reality, the ordering of results matters, even for first-stage retrieval. GCL allows the magnitude of query-document specific relevance to be encoded in the embeddings, which improves the ranking of candidate documents.

An example of searching for an answer to a question compared to searching when there are multiple relevant results.

2.3 Poor text understanding when using CLIP like methods

Multi-modal models like CLIP are trained to work only from image to text (and vice versa). Their text-text understanding is not as good as that of text-only models, because the text-text relationships are learned indirectly through images. Many applications require both inter- and intra-modality understanding. GCL allows for any combination of inter- and intra-modal understanding by directly optimizing for it.

Intra-modality understanding in methods like CLIP is only ever learned indirectly through other modalities. With GCL, intra-modal understanding is optimized for directly, improving performance on these tasks.

2.4 Lack of representative datasets to develop methods for vector search

In developing GCL, it became apparent that there is a disconnect between publicly available datasets for embedding model training and evaluation and real-world use cases. Existing benchmarks are typically text-only or inter-modal only and focus on the 1-1 query-result paradigm. Additionally, existing datasets have limited notions of relevance: the majority encode it as a binary relationship, while several use (at most) a handful of discrete categorizations, often on the test set only. This differs from typical real-world use cases, where relevance can come from hard binary relationships or from continuous variables. To help with this, we compiled a dataset of 10M (ranked) query-product pairs, across ~100k queries, nearly 5M products, and four evaluation splits (available here).

An example from the Marqo-GS-10M dataset. Each query has 100 results with ranking information across 100 discrete values.

2.5 Unoptimized methods for unified embeddings

Having the ability to use all or subsets of the data to represent documents is critical for users. Some documents might be represented by a single piece of information, others by multiple pieces spanning multiple modalities. GCL extends contrastive methods with the ability to directly optimize single unified embeddings built from single or multiple modalities. GCL also allows optimizing for the exact same structure as is stored in the vector database - for example, optimizing the embeddings to perform well when used individually or as fused representations with variable constituent embeddings (e.g. to conserve memory).

GCL allows for the additional optimization of unified embeddings. This can come from a pooling mechanism or another learned model.
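
As an illustration, here is a minimal sketch of one such pooling mechanism: a weighted mean over per-field embeddings followed by re-normalization. The field names and weights are hypothetical:

```python
import numpy as np

def fuse_fields(field_embeddings, field_weights):
    """Fuse multiple per-field embeddings into one unified document
    embedding via a weighted mean, then re-normalize to unit length."""
    stacked = np.stack(field_embeddings)          # (n_fields, dim)
    weights = np.asarray(field_weights)[:, None]  # (n_fields, 1)
    fused = (weights * stacked).sum(axis=0) / weights.sum()
    return fused / np.linalg.norm(fused)

# Hypothetical document with three fields already embedded as 128-d vectors
title_emb, image_emb, description_emb = (np.random.randn(128) for _ in range(3))
doc_emb = fuse_fields([title_emb, image_emb, description_emb], [2.0, 1.0, 0.5])
```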

3. Generalized contrastive learning

As highlighted in the previous section, there are several shortcomings of existing embedding model training methods with respect to vector search. To overcome these limitations, Generalized Contrastive Learning (GCL) was developed, alongside a 10M row dataset for development and evaluation.

3.1 Method

GCL works by extending existing CLIP training to define equality relationships between pieces of information. CLIP has text captions T and images V, and learns embeddings by satisfying a relationship like T = V. GCL generalizes this to a left-hand side (LHS, L) and a right-hand side (RHS, R), i.e. L = R. Instead of restricting the LHS or RHS to a single section of text or a single image, GCL allows each side to be any number of texts or images, with a magnitude defining the strength of the relationship. This is highlighted in the example below, where we define an equality between a query ("lunar new year outfit for pet") on one side and a document made up of a title ("Pet Scarf, Red Chinese New Year Themed Pet Clothes Accessory") and an image on the other, with the relationship defined by a weight w1.

Schematic illustrating how GCL can be used for multiple pieces of information while leveraging non-binary relevance signals.

The weights are converted from ground-truth relevance scores by a score-to-weight function. This makes it possible for GCL to learn ranking from historical data: for example, we can assign a higher weight to a query-document pair if many people have downloaded the document after searching with that query. Moreover, GCL generalizes traditional single-field learning by training with multiple fields, merging elements such as the title and product image into a fused embedding. Using a simple mean embedding is consistent with practices in popular vector databases (e.g. to reduce memory), and this is easily extended to more complex learned approaches, for example using another learned head to unify the embeddings.
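
To illustrate the core idea, below is a simplified sketch of a weight-modulated contrastive (InfoNCE-style) loss in PyTorch. This is our own minimal reconstruction for illustration rather than the exact loss from the paper; `lhs_emb` and `rhs_emb` are batches of fused left- and right-hand-side embeddings, and `weights` comes from the score-to-weight function:

```python
import torch
import torch.nn.functional as F

def gcl_loss(lhs_emb, rhs_emb, weights, temperature=0.07):
    """Weighted contrastive loss: each query-document pair on the diagonal
    is a positive, with its loss term scaled by its relevance weight."""
    lhs = F.normalize(lhs_emb, dim=-1)   # (batch, dim)
    rhs = F.normalize(rhs_emb, dim=-1)   # (batch, dim)
    logits = lhs @ rhs.T / temperature   # (batch, batch) pairwise similarities
    targets = torch.arange(len(lhs), device=lhs.device)
    # Symmetric cross-entropy over both directions, as in CLIP
    per_pair = 0.5 * (F.cross_entropy(logits, targets, reduction="none")
                      + F.cross_entropy(logits.T, targets, reduction="none"))
    # Relevance weights modulate how strongly each pair is pulled together
    return (weights * per_pair).sum() / weights.sum()
```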

3.2 Marqo-GS-10M

In addition to the GCL framework, we have created and released a dataset for developing and benchmarking approaches. Marqo-GS-10M consists of ~10M query-product pairs, each containing a query, title, image, and rank. The dataset is split into a training split and multiple evaluation splits. For a vector search system, there are varying definitions of what constitutes "in-domain" and "out-of-domain". For example, some use cases have a completely known and fixed set of queries and documents, while others are dominated by completely unseen queries and documents. Most systems are some combination of these, and the composition will likely change over time. Marqo-GS-10M defines four distinct evaluation splits so that model performance can be understood much better. The splits are as follows:

  1. Training split with 80% of queries and 50% of documents.
  2. Novel query split with the other 20% of queries and the same documents as the training split.
  3. Novel corpus split with the same queries as the training split and unseen documents, equal in size to the training corpus.
  4. Zero-shot split with both unseen queries and documents.
A diagram illustrating the composition of the different test splits (left) and different rank-to-weight functions that can be used during training (right).
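
For illustration, a few plausible rank-to-weight shapes are sketched below. The exact functions and parameters used for training are described in the paper; treat these forms as assumptions:

```python
import numpy as np

def linear_weight(rank, max_rank=100):
    """Higher-ranked (smaller rank value) results get larger weights."""
    return (max_rank - rank + 1) / max_rank

def exponential_weight(rank, decay=0.05):
    """Exponentially decay the weight with rank."""
    return np.exp(-decay * (rank - 1))

def reciprocal_weight(rank):
    """Reciprocal-rank style weighting."""
    return 1.0 / rank

for fn in (linear_weight, exponential_weight, reciprocal_weight):
    print(fn.__name__, [round(float(fn(r)), 3) for r in (1, 10, 100)])
```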

3.3 Results

The table below shows the results for the different splits outlined above. Both text-only and multi-modal models were rank-tuned on Marqo-GS-10M, starting from pre-trained base models. The metrics shown are normalized discounted cumulative gain (nDCG), rank-biased precision (RBP), and expected reciprocal rank (ERR). See the Appendix for metric descriptions.

All metrics are between 0 and 1, with higher being better.

Text only: the BM25 baseline compared to pre-trained baselines, text-only contrastive learning (Cross E.), and GCL.
Image only: pre-trained baselines, CLIP training, and GCL.
Multi-modal: trained models for CLIP training and GCL.

3.4 Extensions to Matryoshka and binary embeddings

In addition to what has been described above, GCL is also compatible with recent popular methods like binary and Matryoshka embeddings. Testing (unreleased) has shown that GCL still outperforms CLIP-based training on Marqo-GS-10M when using binary or Matryoshka methods during training. Below are the results for Matryoshka GCL compared to GCL. Reducing the embedding dimension by a factor of 2 results in no performance degradation, while reducing it by a factor of 4 retains more than 95% of the performance.

Zero-shot evaluation results for unseen queries and documents for GCL and Matryoshka GCL for three different embedding dimensions.
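
Matryoshka-style training makes the leading dimensions of an embedding usable on their own. A minimal sketch of how truncated embeddings would be used at query time (the dimension sizes are illustrative):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the leading `dim` dimensions of a Matryoshka-style embedding
    and re-normalize so cosine similarity remains meaningful."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

full = np.random.randn(768)
for dim in (768, 384, 192):   # full size, 2x reduction, 4x reduction
    print(dim, truncate_embedding(full, dim).shape)
```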

Below are results comparing different training and evaluation regimes for binary embeddings with GCL. The baseline is GCL trained and evaluated with float embeddings. It is compared against models trained to target both float and binary embeddings, evaluated with either float or binary embeddings. The performance matches or exceeds that of pure float embedding training, while preserving most (>90%) of the relevance when using binary embeddings in a single stage of retrieval.

Zero-shot evaluation results for unseen queries and documents for GCL and binary embeddings trained using GCL.
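
Binary embeddings keep one bit per dimension, typically by thresholding at zero, and retrieval then uses Hamming distance. A minimal sketch, with all names and shapes our own:

```python
import numpy as np

def binarize(emb):
    """Map each dimension to a bit: 1 if positive, else 0."""
    return (emb > 0).astype(np.uint8)

docs = binarize(np.random.randn(1000, 128))
query = binarize(np.random.randn(128))

# Hamming distance: count of differing bits (lower is more similar)
hamming = (docs != query).sum(axis=1)
top_k = np.argsort(hamming)[:10]
print(top_k, hamming[top_k])
```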

4. Conclusion

GCL extends the benefits of CLIP for multi-modal contrastive learning while adding the flexibility to deal with many aspects of real-world data, like continuous relevance and varying data sources. To evaluate the method, we also developed a 10M row multi-modal dataset with ranking information. Finally, GCL is compatible with other embedding training methods like Matryoshka and binary embeddings. Read the paper here and the dataset here.

5. Appendix

Metric descriptions.

nDCG: evaluates ranking quality by considering both the relevance and the rank of the documents returned. It first computes the Discounted Cumulative Gain (DCG), which sums the graded relevance scores of documents discounted by their position in the result list, and then normalizes by the ideal DCG (that of a perfectly ordered list) to give a score between 0 and 1.

RBP: measures retrieval quality by simulating a user's likelihood of continuing to view successive search results. It uses a persistence parameter p to weight the relevance of each document, with earlier results receiving more weight. This metric reflects the probability that a user keeps looking through the list, capturing both the relevance of documents and the typical depth of a search session.

ERR: combines the concept of reciprocal rank with multiple relevance levels. It calculates the expected position at which a user finds a satisfactory result, adjusted by the user's likelihood of stopping after encountering relevant documents. ERR effectively measures how well the results satisfy user queries with highly ranked and relevant results.
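
For reference, below are minimal sketches of the three metrics using standard formulations; the graded relevance scale and parameters (e.g. the persistence p) are illustrative:

```python
import numpy as np

def ndcg(rels, k=10):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))
    dcg = ((2 ** rels - 1) * discounts).sum()
    ideal = np.sort(rels)[::-1]
    idcg = ((2 ** ideal - 1) * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

def rbp(rels, p=0.8):
    """RBP: persistence-weighted sum of relevance values in [0, 1]."""
    rels = np.asarray(rels, dtype=float)
    return (1 - p) * (rels * p ** np.arange(len(rels))).sum()

def err(rels, max_grade=4):
    """ERR: expected reciprocal rank with graded relevance."""
    stop_probs = (2 ** np.asarray(rels, dtype=float) - 1) / 2 ** max_grade
    err_score, p_continue = 0.0, 1.0
    for rank, r in enumerate(stop_probs, start=1):
        err_score += p_continue * r / rank
        p_continue *= 1 - r
    return err_score

graded = [3, 2, 3, 0, 1, 2]   # illustrative graded relevance judgments
print(ndcg(graded), rbp(graded, p=0.8), err(graded))
```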

Jesse Clark
Jesse is a co-founder and the CTO at Marqo, where he leads the applied sciences division performing R&D in AI for search and recommendations.