Introduction to CLIP and Multimodal Models

Multimodal Models and CLIP

1. What are Multimodal Models?

Multimodal models are a class of machine learning models that can process and integrate information from multiple modalities or types of data. Unlike traditional models that handle a single type of input (e.g., text or image), multimodal models can simultaneously handle and learn from different types of inputs such as text, images, audio, and more. Pretty cool!

Figure 1: The different types of data that multimodal models can handle simultaneously.

These models are significant because they mirror how humans perceive and process information. For example, when we look at an image, we often also consider contextual text (like a caption) to fully understand the image's content. Multimodal models aim to bring this level of understanding to artificial intelligence, enabling more sophisticated and contextually aware applications.

2. What is the CLIP Model?

CLIP (Contrastive Language-Image Pre-training) is a multimodal model developed by OpenAI that can understand and relate images and textual descriptions in a unified manner. CLIP was introduced in early 2021 [1] and represents a significant advancement in the field of multimodal learning.

Figure 2: An image with its corresponding caption, 'Marqo logo'.

CLIP stands out because it was trained to directly understand the relationship between images and the natural language descriptions associated with them. An example of this can be seen in the Figure above, where we have an image of Marqo’s logo with the caption ‘Marqo logo’. This enables CLIP to perform a variety of tasks, including zero-shot classification, image search, and picking the best caption for an image from a set of candidates.

3. How Does CLIP Work?

CLIP uses a unique approach to learn the relationship between images and text. In this section we’ll break down the fundamental aspects of CLIP to truly understand how it works.

i. Data Collection

As with any machine learning model, data is essential…and the more, the better! CLIP was trained on a large dataset of 400 million image-text pairs collected from the internet. This diverse dataset ensured that the model could generalize well to a wide range of concepts and scenarios.

The example used in the original CLIP paper [1] was an image of a dog with the caption ‘Pepper the aussie pup’. You can see this example, along with others behind it, in the Figure below.

Figure 3: Example of the dataset used to train CLIP model. As used in the original CLIP paper [1].

Awesome, so we’ve seen what data we’re dealing with. Let’s take a look at how we can transform this data into a machine-readable format.

ii. Dual Encoder Architecture

As we established in our previous articles on sentence embeddings and image embeddings, encoders are needed to translate text and images into fixed-dimensional vectors, or embeddings. You’d be right in thinking that because we have both text and images here, we’ll need both a text encoder and an image encoder.

CLIP employs these two separate encoders:

  • Text Encoder: This encoder processes text descriptions and converts them into a fixed-dimensional feature vector. The text encoder is usually based on a Transformer architecture similar to those used in language models like BERT.
  • Image Encoder: This encoder processes images and converts them into a fixed-dimensional feature vector. The image encoder is typically a Vision Transformer (ViT), although the original paper [1] also trained ResNet-based variants.

Figure 4: CLIP encoder architecture. Text is processed through the text encoder and images are processed through the image encoder. This creates vector embeddings to represent each [1].

Now we have the ability to generate vector embeddings for both our text and images. How do we train our model to understand these pairs? That’s where contrastive learning plays a vital role.

iii. Contrastive Learning

The core of CLIP's training process is contrastive learning. The goal is to bring the feature vectors of matching image-text pairs closer together in the embedding space while pushing the vectors of non-matching pairs further apart.

Figure 5: Contrastive Learning approach in CLIP model [1].

This is achieved through a contrastive loss function. For a training batch of N image-text pairs, CLIP computes the cosine similarity between every image embedding and every text embedding, giving an N × N similarity matrix. The model then maximizes the similarity of the N correct pairs (the diagonal of that matrix) and minimizes it for all the incorrect pairings. This process trains the model to align images with their corresponding textual descriptions effectively.
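To make the contrastive loss concrete, here is a minimal pure-Python sketch of a symmetric cross-entropy over the image-text similarity matrix, in the spirit of the pseudocode in the CLIP paper [1]. The toy 3-dimensional vectors stand in for real encoder outputs, the function names are our own, and the temperature is fixed at 0.07 for simplicity (in CLIP it is a learned parameter), so treat this as an illustration rather than CLIP's actual training code.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of each input is assumed to be a matching image-text pair.
    The fixed temperature is an assumption made for this sketch.
    """
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]

    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(image, text)) / temperature for text in texts]
        for image in images
    ]
    n = len(logits)

    # Image -> text: each row should put its probability mass on the diagonal entry
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text -> image: the same idea applied to the columns
    columns = [list(column) for column in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (image_to_text + text_to_image) / 2


# Toy embeddings (hypothetical values): text i is close to image i
toy_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_texts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_texts = toy_texts[1:] + toy_texts[:1]  # deliberately mismatched pairs

aligned_loss = clip_style_loss(toy_images, toy_texts)
shuffled_loss = clip_style_loss(toy_images, shuffled_texts)
print(f"aligned pairs:  {aligned_loss:.4f}")
print(f"shuffled pairs: {shuffled_loss:.4f}")
```

The loss is low when each image sits closest to its own caption and high when the pairs are shuffled, which is exactly the signal that pulls matching pairs together and pushes the rest apart during training.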

The key outcome of this is a single embedding space shared by both modalities. So, you can compare:

  • Images with language
  • Language with images
  • Images with images
  • Language with language.

Super powerful!

iv. Zero-Shot Learning

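Once trained, CLIP can perform various tasks without additional fine-tuning. This capability is known as zero-shot learning. For classification, each class label is turned into a short piece of text (for example, ‘a photo of a dog’) and embedded with the text encoder. Together, these label embeddings act as a classifier for the dataset: a new image is embedded with the image encoder and assigned the label whose embedding is most similar, as seen in the Figure below: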

Figure 6: Create dataset classifier from label text which is then used for zero-shot prediction.
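Here is a small sketch of that zero-shot recipe in plain Python. The embeddings are hand-written toy vectors standing in for the outputs of the text and image encoders, and `zero_shot_classify`, the label list, and the `logit_scale` value of 100 are all assumptions made for illustration; only the 'a photo of a {label}' prompt style comes from the CLIP paper [1].

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, logit_scale=100.0):
    """Return a {prompt: probability} dict for one image.

    prompt_embeddings maps each label prompt to its text embedding.
    logit_scale is an assumed value that sharpens the softmax.
    """
    prompts = list(prompt_embeddings)
    scores = [
        logit_scale * cosine_similarity(image_embedding, prompt_embeddings[p])
        for p in prompts
    ]
    return dict(zip(prompts, softmax(scores)))


# Step 1: turn class labels into text prompts
labels = ["dog", "cat", "car"]
prompts = [f"a photo of a {label}" for label in labels]

# Step 2: embed the prompts (toy vectors standing in for the text encoder)
toy_text_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
prompt_embeddings = dict(zip(prompts, toy_text_vectors))

# Step 3: embed the image (toy vector standing in for the image encoder)
toy_image_embedding = [0.8, 0.3, 0.1]

# Step 4: pick the prompt whose embedding is most similar to the image
label_probs = zero_shot_classify(toy_image_embedding, prompt_embeddings)
predicted_prompt = max(label_probs, key=label_probs.get)
print(predicted_prompt)
```

Notice that nothing here is trained on the three labels. Swapping in a different label list only changes the prompts, which is what makes the approach zero-shot.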

Awesome! By leveraging CLIP's ability to understand and relate images and text through contrastive learning, we can achieve tasks such as zero-shot classification, image search and retrieval, and matching images to the best textual description from a set of candidates, all without needing task-specific fine-tuning. This opens up new possibilities for creating more intuitive and versatile AI systems that can seamlessly integrate and process multimodal information.

Let’s take a look at CLIP in action!

4. Programming with CLIP

Let’s look at the power of CLIP through a simple programming example.

For this article, we will be using Google Colab (it’s free!). If you are new to Google Colab, you can follow this guide on getting set up - it’s super easy! For this module, you can find the notebook on Google Colab here or on GitHub here. As always, if you face any issues, join our Slack Community and a member of our team will help!

We will take three different images of cats from Unsplash. Let’s display these.

# Import relevant modules
from PIL import Image
from IPython.display import display
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel
import requests
import matplotlib.pyplot as plt
import numpy as np

# Create a list of urls. These are the urls corresponding to the images of the cats.
urls = [
    # Add the three Unsplash image URLs here (they are not included in this copy of the article)
]

# Take a list of image URLs, download each image, and open them
images = [Image.open(BytesIO(requests.get(url).content)) for url in urls]

# Display images
for image in images:
    display(image)

Figure 7: Three pictures of cats doing different things. Taken from Unsplash. Photo Credit: svklimkin, Timo Volz and Kate Stone Matheson respectively.

We want to generate embeddings for these images, and for that we use the CLIP model. We load CLIPModel to generate the embeddings and CLIPProcessor for preprocessing (e.g. resizing images and tokenizing text).

We create the model and processor. Specifically, we use OpenAI’s model clip-vit-base-patch32 [7].

# Load CLIP model and processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

We already have the images stored in the variable images. So, all that’s left is to create captions. We will create a list of captions containing the true captions as well as fake ones.

captions = ["a cat sleeping",
            "a cat yawning",
            "a cat jumping",
            "a motorbike on a road",
            "rainbow in the sky",
            "a water bottle"]

We now process the captions and images as follows:

# Process captions and images
inputs = processor(
    text=captions,         # input the captions
    images=images,         # input the images
    return_tensors='pt',   # returns pytorch tensors
    padding=True           # ensures sequences all the same length
)

All that’s left to do is get the predictions and display the results! Let’s get the predictions first.

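# Get predictions
outputs = model(**inputs)
# logits_per_image has shape (num_images, num_captions); argmax picks the
# index of the best-matching caption for each image
probs = outputs.logits_per_image.argmax(dim=1)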

Now we can display the images with their corresponding, predicted captions.

# Display images with predicted captions
for i, image in enumerate(images):
    argmax = probs[i].item()
    display(image)
    print(f"Predicted Caption: {captions[argmax]}")

These are the results:

Figure 8: Predicted captions with corresponding images using CLIP.

Pretty cool!

5. Limitations of CLIP

As you’ll know, no machine learning model is perfect, and CLIP comes with its own limitations.

  • CLIP's zero-shot performance is limited in fine-grained classification tasks, such as differentiating models of cars, species of flower and variants of aircraft
  • CLIP struggles with abstract tasks like counting the number of objects in an image
  • CLIP struggles with novel tasks such as estimating the distance to the nearest car in a photo

With these limitations in mind, we ask ourselves: can we extend CLIP to accommodate more complex and realistic data scenarios?

6. Generalized Contrastive Learning (GCL)

Generalized Contrastive Learning (GCL) is a framework developed and researched by our team here at Marqo that extends the principles of CLIP to accommodate more complex and realistic data scenarios [2]. Unlike CLIP, which primarily focuses on learning relationships between images and single text descriptions, GCL generalizes the training process to include any number of text and image pairs when representing documents. Additionally, GCL incorporates relevance (or rank) information to improve the first stage of retrieval processes.

Figure 9: Schematic illustrating how GCL can be used for multiple pieces of information while leveraging non-binary relevance signals [2].

For example, consider a query like "lunar new year outfit for pet." This query might correspond to a document with a title like "Pet Scarf, Red Chinese New Year Themed Pet Clothes Accessory" and an image of the pet scarf. GCL can learn the relationship between the query and this document, with a weight indicating how relevant the document is based on historical user interactions, such as clicks or purchases. By training on multiple types of data, GCL creates more comprehensive embeddings that better represent the documents. This approach improves the ranking and retrieval of documents by directly optimizing for relevance and incorporating various data sources into the training process.
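To give a feel for these two ideas, here is a simplified sketch in plain Python. To be clear, this is not the loss from the GCL paper [2]: it is our own toy illustration in which a document embedding is the average of its field embeddings (title and image), and each query-document pair contributes to a contrastive loss in proportion to a relevance weight. All function names, vectors, and weights below are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def combine_fields(field_embeddings):
    """Represent a document as the average of its field embeddings (an assumption)."""
    return [sum(values) / len(field_embeddings) for values in zip(*field_embeddings)]


def weighted_contrastive_loss(query_embeddings, doc_embeddings, weights, temperature=0.07):
    """Query-to-document cross-entropy where pair i is scaled by weights[i].

    Row i of the queries is assumed to match row i of the documents.
    """
    queries = [l2_normalize(q) for q in query_embeddings]
    docs = [l2_normalize(d) for d in doc_embeddings]

    total = 0.0
    for i, query in enumerate(queries):
        logits = [sum(a * b for a, b in zip(query, doc)) / temperature for doc in docs]
        peak = max(logits)
        log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
        pair_loss = log_total - logits[i]  # cross-entropy for the correct document
        total += weights[i] * pair_loss
    return total / sum(weights)


# Two toy queries and two toy documents, each document built from a title and an image
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = [
    combine_fields([[0.9, 0.1], [1.0, 0.0]]),  # title embedding, image embedding
    combine_fields([[0.7, 0.3], [0.6, 0.4]]),
]

# Hypothetical relevance weights, e.g. derived from clicks or purchases
relevance_weights = [1.0, 0.2]

uniform_loss = weighted_contrastive_loss(toy_queries, toy_docs, [1.0, 1.0])
relevance_loss = weighted_contrastive_loss(toy_queries, toy_docs, relevance_weights)
print(f"uniform weights:   {uniform_loss:.4f}")
print(f"relevance weights: {relevance_loss:.4f}")
```

With uniform weights every pair counts equally, as in standard contrastive training. With relevance weights, pairs backed by stronger user signals pull harder on the embeddings, which is the intuition behind optimizing directly for ranking.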

Awesome! Let’s take a look at it in action.

7. Programming with GCL

First, we install the relevant modules:

! pip install open_clip_torch
! pip install wget

Now, let's take an image of a pair of Oxford shoes and display it below.

from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

# Define the URL of the image
image_url = ""

# Download the image
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# Display the image
plt.imshow(image)
plt.axis('off')  # Hide axes
plt.show()

This returns:

Figure 10: An image of men’s Oxfords.

Now, let's take this image along with three captions:

  • “A dog”
  • “Vintage Style Women's Oxfords”
  • “A cat”

The idea is that we will use GCL to generate a score for each of these captions when compared to the image. We expect the scores for captions 1 and 3 to be low because the image is not a picture of a cat or a dog. Caption 2 should have the highest score because the shoes are indeed Oxfords. However, they are men’s shoes and not necessarily vintage. This is something GCL should be able to pick up on, so we expect a score that is clearly higher than those of captions 1 and 3 without being a perfect match.

Let’s see what we get!

import torch
import open_clip
import wget

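# Download the model
model_url = ""  # model checkpoint URL (not included in this copy of the article)
wget.download(model_url, "")  # second argument: local filename for the checkpoint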
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Preprocess the image
image = preprocess(image).unsqueeze(0)

# Tokenize the text
text = tokenizer(["a dog", "Vintage Style Women's Oxfords", "a cat"])
logit_scale = 10

# Calculate the features and probabilities
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

This returns:

Label probs: tensor([[0.0997, 0.7586, 0.1418]])

Awesome, we get the results:

  • A dog: 0.0997
  • Vintage Style Women's Oxfords: 0.7586
  • A cat: 0.1418

We can see that GCL clearly knows the image is not a cat or a dog due to the low scores associated with those captions. It clearly identifies the shoes as Oxfords with a score of 0.7586. This score also suggests GCL was able to understand that the shoes were men's rather than women's and were not necessarily vintage.
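One thing worth knowing when reading these scores is that they depend on the logit_scale used in the code above, which multiplies the cosine similarities before the softmax. The sketch below shows the effect using made-up cosine similarities (they are not the values produced by the model): a larger scale makes the softmax more confident, while a smaller one flattens it.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_probs(similarities, logit_scale):
    """Multiply cosine similarities by logit_scale, then apply softmax."""
    return softmax([logit_scale * s for s in similarities])


# Hypothetical cosine similarities between one image and three captions
cosine_similarities = [0.10, 0.30, 0.13]

for logit_scale in (1, 10, 100):
    probs = scaled_probs(cosine_similarities, logit_scale)
    print(logit_scale, [round(p, 4) for p in probs])
```

So the scores are best read relative to each other for a given scale, rather than as absolute measures of how well a caption fits.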

Pretty cool! If you want to try out GCL for yourself, visit our repository here with more information.

8. Conclusion

Multimodal models like CLIP represent a significant step forward in creating AI systems that can understand and interact with the world more like humans do. By integrating and learning from multiple types of data, these models can perform a wide range of tasks with greater accuracy and flexibility. Models like Marqo’s GCL extend CLIP to accommodate more complex and realistic data scenarios.

9. References

[1] A. Radford, et al. Learning Transferable Visual Models From Natural Language Supervision (2021)

[2] T. Zhu, M. Jung and J. Clark. Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking (2024)

10. Code