There are a multitude of factors that need to be considered when selecting models and model architectures for production search and recommendation systems. Models that return more relevant results often have more parameters and require more FLOPS for inference. This creates a relevance-latency trade-off: larger models take longer to produce embeddings while retrieving better candidates.
However, this is further complicated by other differences such as context length, image input size, and embedding dimension. Longer context allows more text to be encoded per embedding but typically increases inference time. Larger images take longer to process but can yield better results. Increasing the embedding dimension takes up more memory and requires more compute when calculating distances, which impacts latency at retrieval time.
The table below provides some high-level guidance on how to reason about model specifications.
Attribute | What does it mean? | Why does it matter? |
---|---|---|
Image size | This is the size of image that the model will use as input. All images will be resized to the model input size. | Increasing image size can improve retrieval performance but can increase latency. |
Context length | This is the amount of tokenized text that the model can use as input. There is no exact rule for converting words to tokens, but roughly 1 token ≈ 0.75 words. | Longer context lengths can increase latency but allow fewer vectors to be used to represent large pieces of text. |
Embedding dimension | This is the size of the vector that the model will output. | The number of dimensions directly impacts latency in the retrieval stage since larger dimensions require more computations. |
Parameters/FLOPS | Indicates the model “size” in terms of parameters, memory and compute. | The larger these are, the more memory is required and the higher the latency typically is. |
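As a rough, back-of-the-envelope illustration of how embedding dimension affects memory (assuming uncompressed float32 vectors in a flat index; the numbers are illustrative):

```python
def index_memory_gb(num_vectors: int, embedding_dim: int, bytes_per_value: int = 4) -> float:
    """Approximate memory needed to hold a flat float32 vector index."""
    return num_vectors * embedding_dim * bytes_per_value / 1e9

# 10M items: ~20.5 GB at 512 dimensions vs ~41 GB at 1024 dimensions
print(index_memory_gb(10_000_000, 512))   # 20.48
print(index_memory_gb(10_000_000, 1024))  # 40.96
```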
CLIP and derivative models have become pervasive amongst multi-modal applications. CLIP stands for “contrastive language-image pretraining” and provides a simple and effective way to learn visual representations with natural language. CLIP has been used extensively in image generation, zero-shot classification, LLMs and generating embeddings for images and text. The latter application means that CLIP models have found popularity in multi-modal retrieval tasks like search, and it is this use case that is the focus of this post. We combine retrieval performance (from open_clip) with latency measurements, context length and image input sizes to provide a holistic and practical view of model performance. All benchmarking in this article uses the open_clip implementations.
CLIP and its derivatives [CLIPA, EVA, SigLIP] have become popular for retrieval use cases like search and recommendations. CLIP produces embeddings for images and text that live in the same latent (embedding) space, so text can be compared directly to images and vice versa. This simple property permits cross-modal search: the embedding generated from a text query is compared against a corpus of image embeddings using nearest neighbor search.
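A minimal sketch of this cross-modal comparison using the open_clip API is shown below (the checkpoint is one of the models recommended later; the image path and query text are placeholders):

```python
import torch
import open_clip
from PIL import Image

# Load model, preprocessing transform and tokenizer from open_clip
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Embed a (placeholder) image and a text query into the shared space
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a red dress with a floral print"])

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text)

# Normalise so that the dot product equals cosine similarity
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# With a corpus of image embeddings, this similarity becomes the
# scoring function for a nearest neighbor search
similarity = (text_emb @ image_emb.T).item()
print(f"text-image cosine similarity: {similarity:.3f}")
```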
The latency benchmarking was performed across two GPUs commonly used for inference - the T4 and the A10G. For each model, text and image inference (not pre-processing) was timed as an average over 100 inferences, after a warm-up of 100 inferences. The text used for benchmarking was random combinations of three-letter dictionary words. Each text inference was on a unique sentence, and all models saw the exact same set of text. Images were pre-processed and then had random noise added to them. Each image was unique, but all models saw the same set of images. Finally, all models were run using PyTorch’s AMP.
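A simplified sketch of this timing methodology is shown below (the `encode` callable and the batch of pre-processed `inputs` are placeholders; the actual harness may differ in details):

```python
import time
import torch

def benchmark(encode, inputs, warmup=100, runs=100):
    """Average per-inference latency, mirroring the warm-up + AMP setup described above."""
    with torch.no_grad(), torch.autocast(device_type="cuda"):
        # Warm-up inferences are not timed
        for x in inputs[:warmup]:
            encode(x)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for x in inputs[:runs]:
            encode(x)
        torch.cuda.synchronize()

    return (time.perf_counter() - start) / runs
```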
The retrieval results are the same as those that appear in the open_clip repository. These scores are the average retrieval performance across three different datasets (Flickr, MSCOCO and WinoGAViL) and the asymmetric text-to-image and image-to-text retrieval tasks. These tasks provide a general view of performance, but it should be noted that the scores are representative of the models on these tasks only; generalizing beyond them may yield different results, particularly in search. It is strongly suggested to develop an evaluation benchmark that represents the end use case as closely as possible.
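As a starting point for such a benchmark, a minimal recall@k calculation for text-to-image retrieval might look like the following (assuming paired, L2-normalised text and image embeddings where row i of each tensor corresponds to the same item):

```python
import torch

def recall_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor, k: int = 5) -> float:
    """Text-to-image recall@k for paired, L2-normalised embeddings of shape (N, dim)."""
    sims = text_emb @ image_emb.T                       # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices                 # top-k image indices per text query
    targets = torch.arange(text_emb.shape[0]).unsqueeze(-1)
    hits = (topk == targets).any(dim=-1)                # did the paired image appear in the top k?
    return hits.float().mean().item()
```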
Below we show the performance of the models along different dimensions. We take the average retrieval score and plot it against the different attributes that impact latency and memory.
The benchmarks shown below were run on an NVIDIA A10 Tensor Core GPU.
For an interactive version, please refer to our Hugging Face space.
If you do not know where to start, then we suggest one of the following models:
Use case | Model | Pretrained | What it is best for |
---|---|---|---|
Fastest inference | ViT-B-32 | laion2b_s34b_b79k | When the best performance at the lowest latency/memory is required. |
Best balanced | ViT-L-14 | laion2b_s32b_b82k | When low latency is still required but with much better retrieval performance. GPU recommended. |
Best all-round | xlm-roberta-large-ViT-H-14 | frozen_laion5b_s13b_b90k | When the best performance is required. Latency is increased along with memory. GPU recommended. |
Alternatively, when you have specific requirements, the table below works well for selecting a model. A good rubric to follow is:
Then select the best performing model based on the average score.
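As an illustration of this rubric, assuming the benchmark results are loaded into a pandas DataFrame with hypothetical columns such as `avg_score`, `latency_ms`, `embedding_dim` and `context_length` (these names and thresholds are illustrative, not an actual export format):

```python
import pandas as pd

# Hypothetical benchmark table; column names are illustrative
results = pd.read_csv("clip_benchmarks.csv")

# 1. Filter to models that meet the hard requirements
#    (latency budget, index memory via embedding dimension, context length)
candidates = results[
    (results["latency_ms"] <= 20)
    & (results["embedding_dim"] <= 1024)
    & (results["context_length"] >= 77)
]

# 2. Then select the best performing model based on the average retrieval score
best = candidates.sort_values("avg_score", ascending=False).iloc[0]
print(best["model"], best["pretrained"])
```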
Although some models score better on the three benchmark datasets, we have found that they may not generalize as well. In addition to the benchmark results shown, internal benchmarking has shown the models above to be very good in specific domains like product search. They are strong general-purpose models and should perform well across many tasks. It is always a good idea to develop a benchmark closely related to the task and evaluate multiple models; at the very least, a vibe check on the real use case can be used.