Foundations of Embedding Models

Before diving in, if you experience any difficulties or have any questions, join our Slack-Community and we’ll be there to help! Please post them into the ‘marqo-courses’ channel 🚀 If you’re new to Slack, don’t worry, join the channel and Ellie will send you a ‘getting started with Slack’ guide! 😊

The process of generating dense vectors involves various techniques and algorithms, each designed to capture different aspects of the underlying data. Regardless of the method used, the goal remains consistent: to transform raw data into compact, meaningful representations that capture the inherent relationships and semantics. This module will explore the foundations of generating such dense vectors through different, exciting technologies!

1. Word Embeddings

The first type of embedding we will consider is the word embedding. More specifically, we will look at the Word2Vec model [1] which, as you might have guessed, generates word embeddings: dense vector representations of words. Although Word2Vec wasn’t the first embedding generation tool released, it revolutionised how we represent and understand the semantic meaning of words.

Word2Vec, or ‘Word to Vector’, is trained to learn the relationships between words in large corpora of text. The algorithm uses one of two architectures: Continuous Bag of Words (CBOW) or Skip-Gram.

Continuous Bag of Words (CBOW) aims to predict a word based on the words surrounding it. It takes the context of the surrounding words as input and tries to predict the target word in the middle. Below is an illustrative example for the sentence “Pineapples are spiky and yellow”.

Figure 1: Visual representation of the Continuous Bag of Words (CBOW) algorithm in Word2Vec.

We see that the CBOW method takes the surrounding words and aims to predict the target word which, in this case, is ‘spiky’.

Skip-gram works the opposite way; it predicts the context words (surrounding words) based on the target word. It takes a target word as input and tries to predict the surrounding words.

Figure 2: Visual representation of the Skip-Gram algorithm in Word2Vec.

The projection depicted in both images here is what is known as the projection layer; it is a hidden layer in the network that performs the transformation from the input layer to the output layer.
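
To make the difference between the two training setups concrete, here is a minimal sketch in plain Python that builds the training examples each one would see for the example sentence. The helper names cbow_pairs and skipgram_pairs and the window size of 2 are our own illustrative assumptions; they are not part of Word2Vec's actual implementation, which goes on to learn the embeddings from examples like these.

```python
def cbow_pairs(tokens, window):
    """CBOW: (surrounding words -> target word) examples."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window):
    """Skip-gram: (target word -> one surrounding word) examples."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = "pineapples are spiky and yellow".split()
cbow = cbow_pairs(tokens, window=2)
skipgram = skipgram_pairs(tokens, window=2)

# CBOW example for the middle word: four context words predict 'spiky'
print(cbow[2])
# Skip-gram examples for the same word: 'spiky' predicts each neighbour in turn
print([pair for pair in skipgram if pair[0] == "spiky"])
```

Notice that CBOW produces one example per word, whereas skip-gram produces one example per (word, neighbour) combination.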

Choosing between the two approaches depends on the task's requirements. The skip-gram model performs well with limited amounts of data and is particularly effective at representing infrequent words. In contrast, the CBOW model produces better representations for more commonly occurring words.

Mikolov et al. [1] also showed that these embeddings support the classic word-arithmetic example we saw in the previous article, ‘King - Man + Woman = Queen’. Let’s take a look at it in action!

As a reminder: we will be using Google Colab (it’s free!). If you are new to Google Colab, you can follow this guide on getting set up - it’s super easy! For this module, you can find the notebook on Google Colab here or on GitHub here. As always, if you face any issues, join our Slack Community and a member of our team will help!

During this article, we will make use of the GPUs in Google Colab. To do this, head to ‘Runtime → Change Runtime Type’ and click ‘T4 GPU’. For more help on this, visit our guide.

To run the example, we first need to install gensim:

!pip install gensim

We could load a Word2Vec model in here but these models can take a while to download so instead, we’ll use an alternative word embedding generation tool: GloVe. If you want to learn more about GloVe, visit our community Slack channel and chat with one of our team!

We load the glove-wiki-gigaword-50 model and perform vector arithmetic. We want to use the GloVe model to find the most similar vector to ‘King - Man + Woman’. Note how ‘King’ and ‘Woman’ are set as positive (because we are adding them) and ‘Man’ as negative (we are subtracting).

# Import relevant modules
import gensim.downloader as api

# Download and load the GloVe model - this can take 10-20 seconds
glove_model = api.load("glove-wiki-gigaword-50")

# Perform the vector arithmetic: king - man + woman
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)

# Print the closest word and its similarity score
print(result)

Let’s take a look at the output.

[('queen', 0.8523604273796082)]

The result shows that the word vector for 'queen' is the closest match to the vector obtained by the operation 'king - man + woman', with a similarity score of 0.852. This demonstrates the strength of word embeddings: they capture semantic relationships and analogies between words. Pretty cool!
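
If you are curious about what is happening under the hood, the sketch below repeats the same arithmetic by hand on a tiny, made-up vocabulary. The 3-dimensional vectors are invented purely for illustration (real GloVe vectors are learned and have 50 dimensions here), and the ranking logic is a simplified stand-in for what a library does, not gensim's actual code.

```python
import math

# Made-up 3-d vectors: [royalty, gender, fruitiness]. Real embeddings are learned.
vectors = {
    "king":  [0.9,  0.7, 0.1],
    "queen": [0.9, -0.7, 0.1],
    "man":   [0.1,  0.7, 0.0],
    "woman": [0.1, -0.7, 0.0],
    "apple": [0.0,  0.0, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed element by element
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank every word that was not part of the query by its similarity to the target
query_words = {"king", "man", "woman"}
candidates = {word: cosine(target, vec) for word, vec in vectors.items() if word not in query_words}
best_word = max(candidates, key=candidates.get)
print(best_word, round(candidates[best_word], 3))
```

The similarity score is a cosine similarity, so a value close to 1 means the two vectors point in almost the same direction.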

You might be wondering, what would happen if we had the same word but it had different meanings? In the following three sentences, the word “scale” has completely different meanings:

“The fisherman removed one scale from the fish.”
“The architect drew the plans to scale.”
“The company decided to scale up production.”

If we were to embed these sentences using Word2Vec, it would look as follows:

# Import relevant modules
from gensim.models import Word2Vec

# Define your sentences
sentences = [
    ["The", "fisherman", "removed", "one", "scale", "from", "the", "fish"],
    ["The", "architect", "drew", "the", "plans", "to", "scale"],
    ["The", "company", "decided", "to", "scale", "up", "production"]
]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=10, window=5, min_count=1)
# Obtain the vector embedding corresponding to the word 'scale'
embedding_scale = model.wv['scale']

We define our sentences in the variable sentences and train our Word2Vec model with these sentences. We also set the following parameters:

  • vector_size: this parameter defines the dimensionality of the word vectors. In our case, each word will be represented by a 10-dimensional vector. This is quite short for a vector embedding but for the purpose of illustration, we use it here.
  • window: this parameter specifies the maximum distance between the current and predicted word within a sentence. window=5 means that the model will consider a context window of 5 words to the left and 5 words to the right of the target word.
  • min_count: This parameter sets the minimum frequency count for words. min_count=1 means that all words, regardless of their frequency, will be included.

When we print out embedding_scale, we obtain the following:

array([-0.0053659 ,  0.0023273 ,  0.051031  ,  0.09011646, -0.09301799,
       -0.07114912,  0.06461012,  0.08979519, -0.05018128, -0.03761054],
      dtype=float32)

A 10-dimensional vector embedding corresponding to the word ‘scale’ - as easy as that!

This approach provides us with a single embedding for the word ‘scale’ despite it having different meanings across the three sentences. This isn’t ideal. So, how would we overcome this? We need a word embedding model that encodes the context of words. Luckily for us, we have just the tool for that!

BERT (Bidirectional Encoder Representations from Transformers) [2] is another word embedding model. Like Word2Vec, BERT generates a vector embedding for each word (or token). The difference is that the embeddings BERT produces are much richer and encode the context of words. This is because of something known as the attention mechanism (we’ll explore this and BERT in more detail in Module 5). BERT can be thought of as paying attention to specific words depending on the context. This means it prioritizes the context words that have the biggest impact on a specific embedding. So, in our example above, BERT would generate a different embedding for the word "scale" in each sentence because the surrounding words provide different contexts.

Of course, Word2Vec and BERT (along with many other word embedding models) are great when considering single words but, out of the box, they do not produce a single embedding for a whole sentence [2]. Fortunately for us, sentence embeddings were built for this exact purpose.
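
To give a feel for why context changes the embedding, here is a deliberately tiny sketch of the attention idea in plain Python. Everything in it is an assumption made for illustration: the 2-dimensional vectors are made up, and the contextual_vector helper is a heavily simplified stand-in for attention. Real BERT uses learned projections, many attention heads and many layers.

```python
import math

# Made-up 2-d static vectors (one fixed vector per word, as in Word2Vec)
static = {
    "fish": [1.0, 0.0],
    "production": [0.0, 1.0],
    "scale": [0.5, 0.5],
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_vector(tokens, index):
    """Mix the vectors of all tokens, weighted by their similarity to tokens[index]."""
    query = static[tokens[index]]
    scores = [sum(q * k for q, k in zip(query, static[t])) for t in tokens]
    weights = softmax(scores)
    return [sum(w * static[t][d] for w, t in zip(weights, tokens)) for d in range(len(query))]


# The same word, 'scale', in two different contexts
fish_context = contextual_vector(["fish", "scale"], index=1)
business_context = contextual_vector(["scale", "production"], index=0)

print("static lookup:   ", static["scale"])
print("next to 'fish':  ", fish_context)
print("next to 'production':", business_context)
```

The static lookup always returns the same vector for 'scale', while the context-aware version is pulled towards whichever words surround it.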

2. Sentence Embeddings

Beyond words, these models can be extended to generate sentence and even document embeddings. These embeddings offer a more holistic understanding of textual data by capturing the semantic meaning of larger units such as sentences. While word embeddings excel at representing individual words, sentence embeddings encapsulate the meaning of entire sentences.

S-BERT (Sentence-BERT) [3] is a modification of the BERT network mentioned above that is specifically designed for generating sentence embeddings.

Figure 3: Illustration of S-BERT [3].

The key innovation of S-BERT is the use of a Siamese network structure, where two identical BERT networks that share the same weights are used to generate embeddings for two sentences simultaneously. These embeddings are then compared using a loss function that encourages similar sentences to have similar embeddings [3].
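
The Siamese idea can be sketched in a few lines: the same encoder is applied to both sentences and the two resulting vectors are compared. In the sketch below, the token vectors are made up and the encode function simply averages them (mean pooling, one of the pooling strategies described in [3]); a real S-BERT model would first run the tokens through BERT. The function names are our own.

```python
import math


def encode(token_vectors):
    """Shared 'encoder': mean-pool token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / len(token_vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Made-up token vectors for two sentences of different lengths
sentence_a_tokens = [[1.0, 0.0], [0.0, 1.0]]
sentence_b_tokens = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# The SAME encode function is used for both sentences (the Siamese part)
embedding_a = encode(sentence_a_tokens)
embedding_b = encode(sentence_b_tokens)

similarity = cosine(embedding_a, embedding_b)
print(embedding_a, embedding_b, similarity)
```

Because pooling produces a fixed-size vector regardless of sentence length, sentences with different numbers of tokens can be compared directly.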

We will describe sentence embeddings in more detail in Module 5 but for now, let’s look at a quick example from the sentence-transformers library in Python.

We begin by installing sentence-transformers:

!pip install sentence-transformers

We can now load an S-BERT model. For this example, we will load the model bert-base-nli-mean-tokens. This is one of the most commonly used S-BERT models.

# Import relevant modules
from sentence_transformers import SentenceTransformer

# Initiate the sentence transformer model
model = SentenceTransformer('bert-base-nli-mean-tokens')

Now let’s create some sentences we wish to create embeddings for. We’ve taken these from Marqo’s website:

# Create some sentences!
marqo_sentences = [
    "Join our community to discuss and ask questions",
    "Use your data to increase relevance with embedding search with Marqo",
    "AI search that understands the way your customers think",
    "Request Your Marqo Demo Today"
]

Now we have our sentences, we can create our embeddings!

# Create the embeddings 
embeddings = model.encode(marqo_sentences)

Again, it really is as simple as that! We can see the vector embeddings when we print out the variable embeddings:

array([[ 0.45324317,  0.4565915 ,  1.7168269 , ...,  0.4784594 ,
        -0.13483621, -0.26560253],
       [-0.10421544,  0.04392781,  1.9816391 , ..., -1.1041645 ,
        -0.6477818 , -0.19729616],
       [-0.31245297,  0.3985917 ,  1.3437531 , ..., -0.4828734 ,
        -1.2322669 ,  0.49748766],
       [-0.06220015, -0.06114725,  2.3531651 , ..., -0.36535743,
         0.30400008, -0.2788106 ]], dtype=float32)

As well as the shape of the embeddings when we print out embeddings.shape:

(4, 768)

So, our sentences have been translated into 768-dimensional vectors… awesome! In Module 5 we’ll explore sentence transformers in more detail and how they can be used in artificial intelligence applications.

3. Question-Answering

With the emerging presence of chat-bots in artificial intelligence, it’s only appropriate we discuss question-answering models!

A conventional approach to performing question-answering inference is using Hugging Face’s BertForQuestionAnswering [4] class. The input text is first split into smaller units known as tokens (whole words or pieces of words). This process is called tokenization and is illustrated below. The same process happens in S-BERT, as mentioned earlier.

Figure 4: Illustration of sentence tokenization and embedding.

Note that there are two special tokens, [CLS] and [SEP]:

  • [CLS] - Classification Token: you can think of it as a special marker we put at the very beginning of a sentence or a piece of text. It's like a "start here" sign. When the model processes the text, it uses this marker to help understand and summarize the whole sentence. It's especially important when we want to do something like figuring out the overall meaning or category of the text.
  • [SEP] - Separator Token: this is like a special divider or a comma that shows where one sentence ends and another begins.

Tokenization is applied to both the question and the context sentence (which contains the answer in our case). The model then predicts how likely each token in the context is to be the start or end of the answer [4]. Afterward, it selects the token with the highest probability of being the start position as the beginning of the answer and, similarly, the token with the highest probability of being the end position as the end of the answer.

Figure 5: Simplified architecture of BERT for Question-Answering.
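
Here is a minimal sketch of that selection step in plain Python. The word-level tokens and the start and end scores are made up for illustration (a real model predicts the scores and usually works on sub-word tokens), so treat this as a picture of the idea rather than the actual Hugging Face implementation.

```python
question_tokens = ["what", "is", "the", "highest", "mountain", "?"]
context_tokens = ["mount", "everest", "is", "the", "highest", "mountain"]

# BERT-style input: [CLS] question [SEP] context [SEP]
tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]

# Hypothetical per-token scores; a real model would predict these
start_scores = [0.0] * len(tokens)
end_scores = [0.0] * len(tokens)
start_scores[tokens.index("mount")] = 5.0
end_scores[tokens.index("everest")] = 5.0

# Pick the most likely start token and the most likely end token
start = max(range(len(tokens)), key=lambda i: start_scores[i])
end = max(range(len(tokens)), key=lambda i: end_scores[i])

# The answer is the span of tokens between them (inclusive)
answer = " ".join(tokens[start:end + 1])
print(answer)
```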

Let’s consider an example. We first import the relevant modules.

# Import relevant modules
from transformers import BertForQuestionAnswering, AutoTokenizer
from transformers import pipeline

Next, we load the model and the tokenizer. As a reminder, the tokenizer divides each sentence into tokens. We will use SpanBERT/spanbert-large-cased for the tokenizer and mrm8488/spanbert-large-finetuned-squadv2 for the model itself. These models might seem complex but you don’t need to worry about their details in this article.

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("SpanBERT/spanbert-large-cased")
model = BertForQuestionAnswering.from_pretrained("mrm8488/spanbert-large-finetuned-squadv2")

We will create a pipeline as well as define a question and the context. We will ask what the highest mountain in the world is while providing context about the highest mountain in the world.

# Create the pipeline for question answering
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Define question and context - change this if you wish!
question = "What is the highest mountain in the world?"
context = "Mount Everest is the highest mountain in the world"

# Input the data into our pipeline
data = {'context': context, "question": question}
# Generate predictions
predictions = qa_pipeline(data)

This returns the output:

{'score': 0.9971988201141357, 'start': 0, 'end': 13, 'answer': 'Mount Everest'}

The result indicates that the model identifies "Mount Everest" as the answer to the question, "What is the highest mountain in the world?" based on the context provided. The confidence score of 0.997 indicates that the model is highly confident in this answer. The answer is contextually appropriate, demonstrating the model's ability to extract relevant information from the given text.

You might be thinking, well of course the model will guess this correctly - the context provided is so specific. In reality, one usually has to first extract the relevant context from a database of documents. This is what is known as information retrieval (IR). An example of this is Dense Passage Retrieval (DPR) [5], a widely used dense retrieval model. It combines the power of dense vector representations with passage retrieval techniques.

DPR uses a dual encoder architecture [5], which means it has two separate components for encoding input data into embeddings. The two encoders are:

  • Context Encoder: This part of DPR is responsible for encoding passages or documents. Given a passage, it processes the text and produces a fixed-size representation (embedding) that captures the semantic meaning of the passage.
  • Query Encoder: This component is used for encoding queries. When a user inputs a query, the query encoder processes the text and produces a fixed-size representation (embedding) that captures the semantic meaning of the query.

The architecture of DPR and its two encoders can be seen in the figure below.

Figure 6: Architecture of Dense Passage Retrieval (DPR). DPR uses a dual encoder architecture with both a question and passage encoder.

Both encoders are based on BERT, which we introduced earlier. During training, DPR learns to produce embeddings that are close in the embedding space for relevant question-passage pairs, and far apart for irrelevant pairs. This allows it to effectively retrieve relevant passages when given a query.
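
The training idea can be sketched as a small loss function: for one question, score every candidate passage with a dot product and penalise the model when the relevant passage does not get the highest score. This follows the negative log-likelihood objective described in [5], but the vectors and the function name nll_loss below are made up for illustration; it is not DPR's training code.

```python
import math


def dot(a, b):
    """Dot-product similarity between a question vector and a passage vector."""
    return sum(x * y for x, y in zip(a, b))


def nll_loss(question_vec, passage_vecs, positive_index):
    """Negative log-likelihood of the relevant passage under a softmax over scores."""
    scores = [dot(question_vec, p) for p in passage_vecs]
    log_normaliser = math.log(sum(math.exp(s) for s in scores))
    return log_normaliser - scores[positive_index]


# Made-up embeddings: one question and three candidate passages
question_vec = [1.0, 0.0]
passage_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Low loss when the relevant passage is the one closest to the question...
loss_if_close = nll_loss(question_vec, passage_vecs, positive_index=0)
# ...and high loss when the relevant passage is far away
loss_if_far = nll_loss(question_vec, passage_vecs, positive_index=2)

print(round(loss_if_close, 4), round(loss_if_far, 4))
```

Minimising this loss pulls questions towards their relevant passages and pushes them away from the irrelevant ones.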

Let’s take a look at how we can implement it ourselves!

First import the relevant modules. We will need both the embedding encoder and tokenizer for both the context and question.

from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoder, DPRContextEncoderTokenizer 

We can initiate our models and tokenizers:

question_model = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')

context_model = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

Let’s now create our questions and context. We have three questions and different contexts surrounding each question.

questions = [
    "what is the largest planet in our solar system?",
    "what is the capital city of England?",
    "how many bones are there in an adult human body?"
]

contexts = [
    "The largest planet in our solar system is Jupiter.",
    "What is the largest planet in our solar system?",
    "Mars is the next planet outwards from Earth",
    "what is the capital city of England?",
    "London is the capital city of England",
    "York was the old capital city of England",
    "how many bones are there in an adult human body?",
    "there are 206 bones in the adult human body",
    "the largest bone in the body is the femur"
]

Let’s now tokenize and create the embeddings for our questions and context.

# Converts question text into token IDs
query_tokens = question_tokenizer(
    questions,              # list of questions to be tokenized
    max_length=256,         # Maximum length of tokenized sequence
    padding='max_length',   # Ensures all input sequences have the same length
    truncation=True,        # Truncates sequences longer than max_length
    return_tensors='pt'     # Returns sequences as tensors
)

# Pass the tokenized questions through 'question_model' to obtain the embeddings
query = question_model(**query_tokens)

# Converts context text into token IDs. Same arguments as above.
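context_tokens = context_tokenizer(
    contexts,               # list of contexts to be tokenized
    max_length=256,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

# Pass the tokenized context through 'context_model' to obtain the embeddings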
context = context_model(**context_tokens)

We can now take each of the generated vector embeddings and see which are closest in the vector space.

import torch
from sentence_transformers.util import cos_sim

# Loop through each query vector (embedding) generated by the question model
for i, query_vec in enumerate(query.pooler_output):
    # Calculate the cosine similarity between the current query vector and all context vectors
    probs = cos_sim(query_vec, context.pooler_output)
    # Find the index of the context vector with the highest cosine similarity to the current query vector
    argmax = torch.argmax(probs)
    # Print the original question
    print('Question:', questions[i])
    # Print the context that has the highest similarity to the question
    print('Answer:', contexts[argmax])

We get the output:

Question: what is the largest planet in our solar system?
Answer: The largest planet in our solar system is Jupiter.

Question: what is the capital city of England?
Answer: London is the capital city of England

Question: how many bones are there in an adult human body?
Answer: there are 206 bones in the adult human body

In this example, DPR correctly identified the answer to our questions. DPR is not always the most accurate method but this example has highlighted its capability for a small set of questions and corresponding context. Pretty cool!
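
In practice we often want the top few passages for a question rather than just the single best one. The sketch below ranks a handful of made-up context vectors against a made-up query vector using cosine similarity. The top_k helper is our own illustrative function, not part of any library, and the vectors stand in for the embeddings produced above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, context_vecs, k):
    """Return (context index, similarity) for the k most similar contexts."""
    scored = [(cosine(query_vec, vec), index) for index, vec in enumerate(context_vecs)]
    scored.sort(reverse=True)
    return [(index, score) for score, index in scored[:k]]


# Made-up embeddings standing in for real encoder outputs
context_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query_vec = [0.9, 0.1]

ranking = top_k(query_vec, context_vecs, k=2)
for index, score in ranking:
    print(f"context {index}: similarity {score:.3f}")
```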

4. Vision Transformers

Building on the concept of word and sentence embeddings, Vision Transformers (ViTs) [6] extend the power of transformer architectures to computer vision. We’ve already established that word embeddings transform text into numerical representations but can we do the same with images? The answer is yes! Vision Transformers (ViTs) divide images into fixed-size patches and embed these patches into vectors [6]. These are then fed into the transformer model. This allows ViTs to capture complex information within an image, much like how we’ve already established with words and sentences.
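
The patching step is easy to picture with a toy example. The sketch below cuts a made-up 4x4 'image' (a nested list of pixel values) into 2x2 patches and flattens each one. The function name image_to_patches is our own, and a real ViT would go on to pass each flattened patch through a learned projection and add position information before the transformer sees it.

```python
def image_to_patches(image, patch_size):
    """Split a 2-d grid of pixels into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A made-up 4x4 single-channel image with pixel values 0..15
image = [[row * 4 + col for col in range(4)] for row in range(4)]

patches = image_to_patches(image, patch_size=2)
for patch in patches:
    print(patch)
```

With the same arithmetic, a 224x224 image split into 32x32 patches gives 7 x 7 = 49 patches.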

We’ve seen what words look like in a vector space but we can actually extend this for sentences and images! Take the following example of an image of a cat sleeping and the caption “a cat sleeping”. Both of these can be projected into a vector space by an image and text encoder respectively.

Figure 7: Representation of both text and images in a vector space. Photo credit: Kate Stone Matheson.

We will cover Vision Transformers in more detail in Module 8 but for now, we’ll illustrate the power of them through a simple example.

We will take three different images of cats from Unsplash. Let’s display these.

# Import relevant modules
from PIL import Image
from IPython.display import display
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel
import requests
import matplotlib.pyplot as plt
import numpy as np

# Create a list of urls. These are the urls corresponding to the images of the cats.
urls = [
    # URLs of the three cat images go here
]

# Take a list of image URLs, download each image, and open them
images = [Image.open(BytesIO(requests.get(url).content)) for url in urls]

# Display images
for image in images:
    display(image)

Figure 2h: Three pictures of cats doing different things. Taken from Unsplash. Photo Credit: svklimkin, Timo Volz and Kate Stone Matheson respectively.

We want to generate embeddings for these images so we can use the CLIP (Contrastive Language–Image Pre-training) model. We load CLIPModel to generate the embeddings and CLIPProcessor for preprocessing (e.g. resizing images, etc.). We will cover CLIP in more detail in a future module but for now, just know that Vision Transformers are used within CLIP to generate image embeddings.

We create the model and processor. Specifically, we use OpenAI’s model clip-vit-base-patch32 [7].

# Load CLIP model and processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

We already have the images stored in the variable images. So, all that’s left is to create captions. We will create a list of captions containing the true captions as well as fake ones.

captions = ["a cat sleeping",
            "a cat yawning",
            "a cat jumping",
            "a motorbike on a road",
            "rainbow in the sky",
            "a water bottle"]

We now process the captions and images as follows:

# Process captions and images
inputs = processor(
    text=captions,         # input the captions
    images=images,         # inputs the images
    return_tensors='pt',   # returns pytorch tensors
    padding=True           # ensures sequences all the same length
)

All that’s left to do is get the predictions and display the results! Let’s get the predictions first.

# Get predictions
outputs = model(**inputs)
probs = outputs.logits_per_image.argmax(dim=1)

Now we can display the images with their corresponding, predicted captions:

# Display images with predicted captions
for i, image in enumerate(images):
    # Index of the caption with the highest score for this image
    argmax = probs[i].item()
    display(image)
    print(f"Predicted Caption: {captions[argmax]}")

These are the results:

Figure 8: Predicted captions with corresponding images using CLIP.

Pretty cool!
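
To make the prediction step more concrete, here is a small sketch of what happens to the image-to-caption scores. The numbers in logits_per_image are made up (one row per image, one column per caption); the real values come from the CLIP model. A softmax turns each row into probabilities, and the caption with the highest probability is chosen.

```python
import math


def softmax(scores):
    """Convert a row of scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


captions = ["a cat sleeping", "a cat yawning", "a motorbike on a road"]

# Made-up scores: rows are images, columns are captions
logits_per_image = [
    [24.0, 20.0, 10.0],
    [19.0, 25.0, 11.0],
]

predicted = []
for row in logits_per_image:
    probabilities = softmax(row)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    predicted.append(captions[best])

print(predicted)
```

Taking the argmax of the raw scores, as in the code above, picks the same caption as taking the argmax of the probabilities, because softmax preserves the ordering.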

5. Summary

In this module, we explored the various types of embedding models used in machine learning. While these models can be incredibly powerful, they aren't always perfect out of the box. This is where fine-tuning comes in—it allows us to adjust models to deliver the best possible results for specific tasks. For the remainder of this course, we'll dive deeper into the following topics and provide examples of fine-tuning for different use cases:

  • Sentence Transformers
  • Vision Transformers
  • CLIP Models

Before we get into these in more detail, we'll discuss vector databases and their crucial role in embedding generation. Join us in the next article to learn more!

6. References

[1] T. Mikolov, et al., Linguistic Regularities in Continuous Space Word Representations (2013)

[2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)

[3] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019)

[4] Transformers - BertForQuestionAnswering, Hugging Face

[5] V. Karpukhin, et al., Dense Passage Retrieval for Open-Domain Question Answering (2020)

[6] A. Dosovitskiy, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)

[7] Open AI CLIP Model, Hugging Face

7. Code