Matryoshka Representation Learning with CLIP for Multimodal Retrieval and Ranking

May 24, 2024
TL;DR We introduce Matryoshka Representation Learning (MRL), which enables flexible embedding sizes in vector databases and lets users balance efficiency against granularity. With MRL, embeddings can be condensed into smaller dimensions while preserving performance in multimodal retrieval and ranking tasks.
Small embeddings have lower operating costs but less granularity. Conversely, large embeddings offer more representation power, but at a higher operating cost.

In a vector database, embedding size plays a crucial role in determining the operating cost. While smaller embeddings can lead to higher efficiency and lower cost, they also offer less granularity. It would be beneficial for a vector database to provide flexible embedding sizes catering to different end-users. In this blog post, we discuss a technique that allows such flexibility.

Matryoshka Representation Learning (MRL)

Matryoshka Representation Learning (MRL) is a technique that allows flexible embedding sizes with minimal model adjustments. Once a model generates a fixed-size embedding for a sample, the first several dimensions can be used as a separate, smaller embedding. For example, if the original embedding has 512 dimensions, we can extract the first 256, 128, and 64 dimensions to create three smaller embeddings. This group of user-selected dimensions (i.e., {512, 256, 128, 64}) is referred to as a dimension set.

MRL creates a smaller embedding by selecting the first n dimensions of an existing embedding without additional computational cost.
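As a minimal sketch of this truncation (using NumPy, with a random vector standing in for a model's output), we take the first `dim` dimensions and re-normalize so the smaller embedding can still be used with cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.standard_normal(512)  # stand-in for a model's 512-dim output

def matryoshka_truncate(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize so the
    truncated vector remains usable with cosine similarity."""
    sub = embedding[:dim]
    return sub / np.linalg.norm(sub)

small = matryoshka_truncate(embedding, 64)
print(small.shape)  # (64,)
```

The slicing itself is free; the only extra work is the re-normalization of the truncated vector.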

When using MRL, the model needs to be trained with a loss computed for each sub-dimension embedding. By doing so, the model learns to condense the important information into the smaller dimensions. In this blog, we trained Generalized Contrastive Learning (GCL), our model that extends CLIP to allow multiple representations per sample, with MRL on a subset of GS-Marqo-10M. For more information on GCL and the dataset, refer to our blog.

For each dimension in the dimension set, we extract the Matryoshka representations of the image and text features and compute a loss per sub-dimension. The final loss is a weighted sum of these losses. The weights, referred to as relative importance scales, can be either all ones or predetermined values.
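This weighted sum can be sketched in PyTorch as follows. Note this is an illustrative sketch, not GCL's exact objective: we assume a generic symmetric InfoNCE (CLIP-style) contrastive loss, and the dimension set and scales are placeholders:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img, txt, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired image/text embeddings.
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

def mrl_loss(img_emb, txt_emb,
             dims=(512, 256, 128, 64),
             scales=(1.0, 1.0, 1.0, 1.0)):
    # One loss per sub-dimension (first `d` dims of each embedding),
    # combined as a weighted sum; `scales` are the relative importance scales.
    return sum(w * clip_style_loss(img_emb[:, :d], txt_emb[:, :d])
               for d, w in zip(dims, scales))

img_emb = torch.randn(8, 512)  # dummy batch of image embeddings
txt_emb = torch.randn(8, 512)  # dummy batch of text embeddings
loss = mrl_loss(img_emb, txt_emb)
```

Setting all scales to one weights every sub-dimension equally; adjusting them shifts the emphasis between sub-dimensions, which we discuss further below.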

Reducing Embedding Sizes with Minimal Impact on Performance

We compared the retrieval and ranking performance of GCL with and without MRL across in-domain, novel query, novel document, and zero-shot splits. We measured the relative reduction in normalized discounted cumulative gain (nDCG) as the embedding size decreases, with the original embedding's performance set as the 100% baseline. GCL trained with MRL maintains nDCG across the different splits, while GCL without MRL suffers a significant drop. This demonstrates that embedding sizes can be reduced with minimal impact on performance by training the model with MRL.

GCL trained with MRL shows marginal reduction in performance, while the original GCL struggles to maintain the performance.

Performance with Original Embedding Size

A crucial question is whether the GCL trained with MRL performs as well as the original GCL without MRL, using the same embedding size. This would confirm that the model trained with MRL can be used even without reducing the embedding size, offering flexibility in choosing to either reduce or maintain the original size. Otherwise, separate models would need to be trained, limiting its practical use. Our findings demonstrate that both GCL with MRL and without MRL perform similarly with the original embedding size.

When using the original embedding size, GCL trained with MRL achieves comparable results to the one without MRL.

Important notes: MRL is known to encourage a model to converge faster, which complicates a straightforward comparison between GCL with and without MRL by merely adding MRL losses during training. In our experiments, we decreased the number of epochs and modified relative importance scales to establish a setting where both GCL with MRL and GCL without MRL show similar in-domain performance. Therefore, the results shouldn't be interpreted as MRL damaging in-domain performance while enhancing novel query, novel document, and zero-shot performance. Rather, they indicate that both achieve comparable performance.

Hyperparameters and Architecture

There are crucial hyperparameters, such as the dimension set and the relative importance scales, as well as architecture choices to make. Here are some observations on them:

  • Avoid a large dimension set: We noticed a decline in performance as the number of sub-dimensions increased, because the model must learn a meaningful representation for every sub-dimension. In practice, very small dimensions like 8 or 16 offer limited usability due to their lower performance, so it's recommended to omit unnecessary sub-dimensions from the dimension set during training. We trained two GCL models with MRL: one with a small dimension set of {512, 256, 128, 64} and another with a large dimension set of {512, 256, 128, 64, 32, 16, 8}. The model with the smaller dimension set outperforms the other in all splits.
  • Experiment with the dimension set: We conducted a test with an extreme case, a dimension set of {256}. Interestingly, the performance difference between the 256-dimension embedding and the original 512-dimension embedding was less than 0.019 nDCG across all splits. We encourage you to experiment with various dimension sets.
  • Adjust relative importance scales: The relative importance scales determine the weights of the training losses for each sub-dimension. We found that setting them all to one can lead to overfitting the smaller dimensions, as they are penalized more frequently: when the dimension set is {64, 128, 256, 512}, the first 64 dimensions are penalized four times every iteration. To balance the convergence rate among sub-dimensions, or to prioritize a particular sub-dimension's performance, adjust the relative importance scales accordingly.
  • Additional projection layers may or may not be beneficial: The original paper trained a separate linear classifier head for each sub-dimension. Similarly, we experimented with two methods of introducing additional linear layers:
    • Method #1: After an embedding is truncated to a sub-dimension, it is passed through a separate linear layer that preserves that sub-dimension's size.
    • Method #2: The original embedding is passed through a linear layer that reduces it to the target dimension directly.
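The two variants could be sketched as follows in PyTorch. These are our assumed forms for illustration, since the exact layer definitions are not reproduced here; the 512-to-128 sizes are placeholders:

```python
import torch
import torch.nn as nn

full_dim, sub_dim = 512, 128
embedding = torch.randn(4, full_dim)  # batch of 4 original embeddings

# Method #1 (assumed form): truncate first, then pass through a separate
# linear layer trained for this sub-dimension.
proj_after_truncation = nn.Linear(sub_dim, sub_dim)
method1 = proj_after_truncation(embedding[:, :sub_dim])

# Method #2 (assumed form): project the full embedding straight down to the
# target size, bypassing MRL truncation entirely.
proj_from_full = nn.Linear(full_dim, sub_dim)
method2 = proj_from_full(embedding)
```

In a real setup, one such layer would be instantiated per entry in the dimension set and trained jointly with the encoder.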

Please note that Method #2 bypasses the MRL truncation process, so it is only loosely connected to MRL. We have included it nonetheless, as it serves as a straightforward baseline. Method #1 shows advantages in the novel query split but underperforms in the other splits. In contrast, Method #2 performs best in the novel document and zero-shot splits when using smaller embedding sizes. The decision to include a projection linear layer is not straightforward and might be worth experimenting with, depending on the specific use case.

There is no clear winner on whether a projection linear layer is beneficial or not. Experiments must be conducted for each use case to determine its effectiveness.


Adaptive Retrieval

The original paper introduced "adaptive retrieval" for retrieval tasks: smaller dimensions are used to first retrieve document candidates, which are then reranked using larger dimensions. In this blog, however, we only considered a first-stage retrieval system and did not apply adaptive retrieval. Implementing it could potentially improve performance further.
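As an illustration of the idea (not something we applied in our experiments), a two-stage adaptive-retrieval funnel over synthetic unit-norm embeddings could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 512)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
# Query is a slightly perturbed copy of document 42.
query = docs[42] + 0.01 * rng.standard_normal(512)
query /= np.linalg.norm(query)

def adaptive_retrieve(query, docs, shortlist=100, small_dim=64, top_k=10):
    # Stage 1: cheap shortlist using only the first `small_dim` dimensions.
    coarse = docs[:, :small_dim] @ query[:small_dim]
    candidates = np.argpartition(-coarse, shortlist)[:shortlist]
    # Stage 2: rerank the shortlist with the full embeddings.
    fine = docs[candidates] @ query
    return candidates[np.argsort(-fine)][:top_k]

top = adaptive_retrieve(query, docs)
print(top[0])  # 42: the perturbed source document is recovered first
```

The shortlist size, small dimension, and final top-k are the knobs that trade first-stage cost against reranking quality.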


Conclusion

In this blog, we analyzed the effectiveness of MRL for a CLIP-based model, GCL, in multimodal retrieval and ranking. Our findings indicate that training the model with MRL mitigates the performance degradation that otherwise occurs when the embedding size decreases. We also examined how changes to the hyperparameters and the model's architecture affect performance. Overall, MRL could enable users to select the embedding size for the same model without incurring additional computational cost. However, careful consideration and thorough experimentation are crucial to optimize the model's performance.

David Jung
Applied Scientist