The process of generating dense vectors involves various techniques and algorithms, each designed to capture different aspects of the underlying data. Regardless of the method used, the goal remains consistent: to transform raw data into compact, meaningful representations that capture the inherent relationships and semantics. This module will explore the foundations of generating such dense vectors through different, exciting technologies!
Before diving in, if you need help, guidance, or want to ask questions, join our Community and a member of the Marqo team will be there to help.
1. Word Embeddings
The first type of embeddings we will consider are word embeddings. More specifically, we will take a look at the Word2Vec model [1] which, as you might have guessed, generates word embeddings. These are dense vector representations of words. Although Word2Vec wasn’t the first embedding generation tool released, it revolutionised how we represent and understand the semantic meaning of words.
Word2Vec, or ‘Word to Vector’, is trained to learn the relationships between words in large databases of texts. The algorithm uses one of two modes: Continuous Bag of Words (CBOW) or Skip-Gram.
Continuous Bag of Words (CBOW) aims to predict a word based on the words surrounding it. It takes the context of the surrounding words as input and tries to predict the target word in the middle. Below is an illustrative example for the sentence “Pineapples are spiky and yellow”.
We see that the CBOW method takes the surrounding words and aims to predict the target word which, in this case, is ‘spiky’.
Skip-gram works the opposite way; it predicts the context words (surrounding words) based on the target word. It takes a target word as input and tries to predict the surrounding words.
The projection depicted in both images here is what is known as the projection layer; it is a hidden layer in the network that performs the transformation from the input layer to the output layer.
Choosing between the two approaches depends on the task's requirements. The skip-gram model performs well with limited amounts of data and is particularly effective at representing infrequent words. In contrast, the CBOW model produces better representations for more commonly occurring words.
It was from this very discovery that Mikolov et al. [1] produced the classic word-arithmetic example we saw in the previous article, ‘King - Man + Woman = Queen’. Let’s take a look at it in action!
As a reminder: we will be using Google Colab (it’s free!). If you are new to Google Colab, you can follow this guide on getting set up - it’s super easy! For this module, you can find the notebook on Google Colab here or on GitHub here. As always, if you face any issues, join our Slack Community and a member of our team will help!
During this article, we will make use of the GPUs in Google Colab. To do this, head to ‘Runtime → Change Runtime Type’ and click ‘T4 GPU’. For more help on this, visit our guide.
To do this we’ll need to first install gensim:
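In a Colab cell, this is just a standard pip install (the exact command in the original notebook may differ slightly):

```python
!pip install gensim
```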
We could load a Word2Vec model in here but these models can take a while to download so instead, we’ll use an alternative word embedding generation tool: GloVe. If you want to learn more about GloVe, visit our community Slack channel and chat with one of our team!
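Here’s a minimal sketch of loading a GloVe model through gensim’s downloader and running the classic analogy. The exact model name (‘glove-wiki-gigaword-50’) is an assumption; any GloVe variant available through the downloader works the same way.

```python
import gensim.downloader as api

# Load a pre-trained GloVe model via gensim's downloader
# ('glove-wiki-gigaword-50' is an assumption; other GloVe variants work too)
glove = api.load("glove-wiki-gigaword-50")

# 'king' - 'man' + 'woman' = ?
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)
```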
Let’s take a look at the output.
The result shows that the word vector for 'queen' is the closest match to the vector obtained by the operation 'king - man + woman', with a similarity score of 0.852. This demonstrates the power of word embeddings: they capture semantic relationships and analogies between words. Pretty cool!
You might be wondering, what would happen if we had the same word but it had different meanings? In the following three sentences, the word “scale” has completely different meanings:
“The fisherman removed one scale from the fish.”
“The architect drew the plans to scale.”
“The company decided to scale up production.”
If we were to embed the word ‘scale’ from these sentences using Word2Vec, it would look as follows:
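Here is a minimal sketch with gensim (the original notebook’s snippet may differ slightly):

```python
from gensim.models import Word2Vec

# The three example sentences, tokenized into lowercase words
sentences = [
    "The fisherman removed one scale from the fish".lower().split(),
    "The architect drew the plans to scale".lower().split(),
    "The company decided to scale up production".lower().split(),
]

# Train a tiny Word2Vec model (gensim 4.x API); vector_size=10 gives 10-dimensional embeddings
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)

# Word2Vec produces ONE vector for 'scale', regardless of which sentence it came from
print(model.wv["scale"])
```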
A 10-dimensional vector embedding corresponding to the word ‘scale’ - as easy as that!
This approach provides us with a single embedding for the word ‘scale’ despite it having different meanings across the three sentences. This isn’t ideal. So, how would we overcome this? We need a word embedding model that encodes the context of words. Luckily for us, we have just the tool for that!
BERT (Bidirectional Encoder Representations from Transformers) [2] is another word embedding model. Like Word2Vec, BERT generates a vector embedding for each word (or token). The difference is that the embeddings BERT produces are much richer and encode the context of words. This is because of something known as the attention mechanism (we’ll explore this, and BERT, in more detail in Module 5). You can think of BERT as paying attention to specific words depending on the context; it works out which surrounding words have the biggest impact on a given embedding. So, in our example above, BERT would generate a different embedding for the word "scale" in each sentence because the surrounding words provide different contexts.
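As a quick sketch of this (the checkpoint ‘bert-base-uncased’ and the helper function here are our own illustrative choices, not the article’s original code), we can compare the embeddings BERT produces for ‘scale’ in two of the sentences above:

```python
import torch
from transformers import BertModel, BertTokenizer

# 'bert-base-uncased' is an assumed checkpoint; any BERT variant illustrates the point
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def scale_embedding(sentence):
    """Return BERT's contextual embedding for the token 'scale' in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return outputs.last_hidden_state[0, tokens.index("scale")]

fish = scale_embedding("The fisherman removed one scale from the fish.")
business = scale_embedding("The company decided to scale up production.")

# A cosine similarity well below 1.0 shows that 'scale' gets a different
# embedding in each sentence, because its context differs
print(torch.cosine_similarity(fish, business, dim=0).item())
```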
Of course, Word2Vec and BERT (along with many other word embedding models) are great when considering single words but, unfortunately, they cannot be directly extended to sentences [2]. Fortunately for us, sentence embeddings were built for this exact purpose.
2. Sentence Embeddings
Beyond words, these models can be extended to generate sentence and even document embeddings. These embeddings offer a more holistic understanding of textual data by capturing the semantic meaning of larger units such as sentences. While word embeddings excel at representing individual words, sentence embeddings encapsulate the meaning of entire sentences.
S-BERT (Sentence-BERT) [3] is a modification of the BERT network mentioned above, that is specifically designed for generating sentence embeddings.
The key innovation of S-BERT is the use of a Siamese network structure, where two identical BERT models are used to generate embeddings for two sentences simultaneously. These embeddings are then compared using a loss function that encourages similar sentences to have similar embeddings [3].
Now let’s create some sentences we wish to create embeddings for. We’ve taken these from Marqo’s website:
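(The sentences below are illustrative placeholders; the original notebook uses text taken directly from Marqo’s site.)

```python
# Placeholder sentences (the original notebook uses text from Marqo's website)
sentences = [
    "Marqo is an end-to-end vector search engine.",
    "Embeddings turn unstructured data into dense vectors.",
    "Vector search finds results by meaning rather than keywords.",
]
```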
Now we have our sentences, we can create our embeddings!
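A minimal sketch using the sentence-transformers library; ‘all-mpnet-base-v2’ is an assumed checkpoint, chosen because it produces 768-dimensional embeddings:

```python
from sentence_transformers import SentenceTransformer

# 'all-mpnet-base-v2' outputs 768-dimensional sentence embeddings
model = SentenceTransformer("all-mpnet-base-v2")

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768): one 768-dimensional vector per sentence
```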
So, our sentences have been translated into 768-dimensional vectors… awesome! In Module 5 we’ll explore sentence transformers in more detail and how they can be used in artificial intelligence applications.
3. Question-Answering
With the emerging presence of chatbots in artificial intelligence, it’s only appropriate we discuss question-answering models!
When a BERT-based model is used for question-answering, the question and the context are fed in together as a single sequence. Note that we have two special tokens, [CLS] and [SEP]:
- [CLS] - Classification Token: you can think of it as a special marker we put at the very beginning of a sentence or a piece of text. It's like a "start here" sign. When the model processes the text, it uses this marker to help understand and summarize the whole sentence. It's especially important when we want to do something like figuring out the overall meaning or category of the text.
- [SEP] - Separator Token: this is like a special divider or a comma that shows where one sentence ends and another begins.
The model encodes both the question and the context sentence (which contains the answer in our case). It then predicts how likely each word in the context sentence is to be the start or end of the answer [4]. Afterward, it selects the token with the highest probability of being the start position as the beginning of the answer and, similarly, the token with the highest probability of being the end position as its conclusion.
Let’s consider an example. We first import the relevant modules.
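For this sketch we only need Hugging Face’s pipeline helper:

```python
from transformers import pipeline
```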
We will create a pipeline as well as define a question and the context. We will ask what the tallest mountain in the world is while providing context about the tallest mountain in the world.
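A minimal sketch follows; the wording of the context below is illustrative (the original notebook’s context may differ), and the pipeline falls back to its default SQuAD-fine-tuned checkpoint:

```python
# Build a question-answering pipeline (uses a SQuAD-fine-tuned model by default)
qa_pipeline = pipeline("question-answering")

question = "What is the highest mountain in the world?"
context = (
    "Mount Everest, located in the Himalayas on the border of Nepal and China, "
    "is the highest mountain in the world, rising 8,849 metres above sea level."
)

result = qa_pipeline(question=question, context=context)
print(result)  # contains the answer span, its position, and a confidence score
```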
This returns the output
The result indicates that the model identifies "Mount Everest" as the answer to the question, "What is the highest mountain in the world?" based on the context provided. The confidence score of 0.997 suggests almost absolute certainty. The answer is contextually appropriate, demonstrating the model's ability to extract relevant information from the given text.
You might be thinking, well of course the model will guess this correctly - the context provided is so specific. In reality, one usually has to first extract the relevant context from a database of documents. This is what is known as information retrieval (IR). An example of this is Dense Passage Retrieval (DPR) [5], a state-of-the-art IR model that combines the power of dense vector representations with passage retrieval techniques.
DPR uses a dual encoder architecture [5], which means it has two separate components for encoding input data into embeddings. The two encoders are:
- Context Encoder: This part of DPR is responsible for encoding passages or documents. Given a passage, it processes the text and produces a fixed-size representation (embedding) that captures the semantic meaning of the passage.
- Query Encoder: This component is used for encoding queries. When a user inputs a query, the query encoder processes the text and produces a fixed-size representation (embedding) that captures the semantic meaning of the query.
The architecture of DPR and its two encoders can be seen in the figure below.
Both encoders are based on BERT (as established previously). During training, DPR learns to produce embeddings that are close in the embedding space for relevant question-answer pairs, and far apart for irrelevant pairs. This allows it to effectively retrieve relevant passages when given a query.
Let’s take a look at how we can implement it ourselves!
First import the relevant modules. We will need both the embedding encoder and tokenizer for both the context and question.
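A sketch of the imports (DPR ships with separate encoder and tokenizer classes for contexts and questions):

```python
import torch
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)
```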
We can initialise our models and tokenizers:
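Here we load the publicly available DPR checkpoints trained on Natural Questions (the exact checkpoint names are an assumption; any matching context/question encoder pair works the same way):

```python
# Pre-trained DPR checkpoints (trained on Natural Questions)
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
```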
Let’s now create our questions and context. We have three questions and different contexts surrounding each question.
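The questions and contexts below are hypothetical stand-ins for the ones used in the original notebook:

```python
# Hypothetical questions and contexts (the original notebook uses its own set)
questions = [
    "What is the capital of France?",
    "Who wrote the play Romeo and Juliet?",
    "What is the chemical symbol for gold?",
]

contexts = [
    "Paris is the capital and most populous city of France.",
    "Romeo and Juliet is a tragedy written by William Shakespeare.",
    "Gold is a chemical element with the symbol Au and atomic number 79.",
]
```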
Let’s now tokenize and create the embeddings for our questions and context.
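A minimal sketch: tokenize each list, pass it through the matching encoder, and take the pooler_output as the embedding:

```python
# Tokenize and encode the contexts; the pooler_output is the DPR embedding
ctx_inputs = ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    ctx_embeddings = ctx_encoder(**ctx_inputs).pooler_output  # shape: (3, 768)

# Do the same for the questions
question_inputs = question_tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    question_embeddings = question_encoder(**question_inputs).pooler_output  # shape: (3, 768)
```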
We can now take each of the generated vector embeddings and see which are closest in the vector space.
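DPR is trained with a dot-product objective, so we can score every question against every context and pick the best match:

```python
# Score every question against every context with a dot product
scores = question_embeddings @ ctx_embeddings.T  # shape: (3 questions, 3 contexts)

for i, question in enumerate(questions):
    best_match = scores[i].argmax().item()
    print(f"Q: {question}")
    print(f"Retrieved context: {contexts[best_match]}\n")
```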
We get the output:
In this example, DPR correctly identified the answer to each of our questions. DPR is not always the most accurate method, but this example highlights its capability on a small set of questions and their corresponding contexts. Pretty cool!
4. Vision Transformers
Building on the concept of word and sentence embeddings, Vision Transformers (ViTs) [6] extend the power of transformer architectures to computer vision. We’ve already established that word embeddings transform text into numerical representations but can we do the same with images? The answer is yes! Vision Transformers (ViTs) divide images into fixed-size patches and embed these patches into vectors [6]. These are then fed into the transformer model. This allows ViTs to capture complex information within an image, much like how we’ve already established with words and sentences.
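To make the patching step concrete, here’s a tiny sketch (just the reshaping, not a full ViT): a 224x224 RGB image split into 16x16 patches gives 196 flattened patch vectors of dimension 768, which a ViT then linearly projects and feeds to the transformer.

```python
import torch

# A dummy RGB image: 3 channels, 224 x 224 pixels
image = torch.randn(3, 224, 224)

# Split into non-overlapping 16x16 patches and flatten each patch
patch_size = 16
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)

print(patches.shape)  # (196, 768): 196 patches, each flattened to 768 values
```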
We’ve seen what words look like in a vector space but we can actually extend this for sentences and images! Take the following example of an image of a cat sleeping and the caption “a cat sleeping”. Both of these can be projected into a vector space by an image and text encoder respectively.
We will cover Vision Transformers in more detail in Module 8 but for now, we’ll illustrate the power of them through a simple example.
We will take three different images of cats from Unsplash. Let’s display these.
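The URLs below are placeholders; substitute your own image links (the originals came from Unsplash):

```python
import requests
from PIL import Image
import matplotlib.pyplot as plt

# Hypothetical image URLs: replace with your own cat images
image_urls = [
    "https://example.com/cat1.jpg",
    "https://example.com/cat2.jpg",
    "https://example.com/cat3.jpg",
]

images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]

# Display the three images side by side
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, image in zip(axes, images):
    ax.imshow(image)
    ax.axis("off")
plt.show()
```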
We now process the captions and images as follows:
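For this we use the OpenAI CLIP checkpoint on Hugging Face [7]; the candidate captions here are illustrative:

```python
from transformers import CLIPModel, CLIPProcessor

# The OpenAI CLIP checkpoint on Hugging Face [7]
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate captions, one per image (illustrative)
captions = ["a cat sleeping", "a cat sitting in a box", "a cat playing with a toy"]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
```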
All that’s left to do is get the predictions and display the results! Let’s get the predictions first.
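A sketch of the forward pass: CLIP scores every image against every caption, and a softmax over those scores gives the most likely caption per image:

```python
import torch

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image gives an image-to-caption similarity score for every pair
logits_per_image = outputs.logits_per_image  # shape: (3 images, 3 captions)
probs = logits_per_image.softmax(dim=1)      # convert scores to probabilities
predictions = probs.argmax(dim=1)            # best caption index for each image
```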
Now we can display the images with their corresponding predicted captions.
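A simple matplotlib loop does the trick:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, image, best in zip(axes, images, predictions):
    ax.imshow(image)
    ax.set_title(captions[best.item()])  # the caption CLIP assigns the highest probability
    ax.axis("off")
plt.show()
```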
These are the results:
Pretty cool!
5. Summary
In this module, we explored the various types of embedding models used in machine learning. While these models can be incredibly powerful, they aren't always perfect out of the box. This is where fine-tuning comes in—it allows us to adjust models to deliver the best possible results for specific tasks. For the remainder of this course, we'll dive deeper into the following topics and provide examples of fine-tuning for different use cases:
- Sentence Transformers
- Vision Transformers
- CLIP Models
Before we get into these in more detail, we'll discuss vector databases and their crucial role in embedding generation. Join us in the next article to learn more!
6. References
[1] T. Mikolov, et al., Linguistic Regularities in Continuous Space Word Representations (2013)
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)
[3] N. Reimers and I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019)
[4] Transformers - BertForQuestionAnswering, Hugging Face
[5] V. Karpukhin, et al., Dense Passage Retrieval for Open-Domain Question Answering (2020)
[6] A. Dosovitskiy, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)
[7] OpenAI CLIP Model, Hugging Face