Introduction to Sentence Transformers

Before diving in, if you experience any difficulties or have any questions, join our Slack-Community and we’ll be there to help! Please post them into the ‘marqo-courses’ channel 🚀 If you’re new to Slack, don’t worry, join the channel and Ellie will send you a ‘getting started with Slack’ guide! 😊

If you want to build your own embedding search applications, try out Marqo for free!

In the rapidly evolving field of Natural Language Processing (NLP), Sentence Transformers have emerged as a powerful tool for encoding sentences into high-dimensional vectors (also known as embeddings), which can then be used for various tasks such as semantic search, clustering, and sentence similarity. Sentence transformers are a type of deep learning model specifically designed to capture the semantic meaning of sentences, going beyond the capabilities of traditional word embeddings.

Traditional word embeddings like Word2Vec or GloVe focus on representing individual words in a continuous vector space, capturing semantic relationships between words. However, these models often fall short when dealing with longer texts or sentences, as they do not consider the context in which words appear. Sentence transformers address this limitation by encoding entire sentences or text fragments into fixed-size vectors, preserving this contextual meaning. This approach has significantly enhanced the performance of NLP applications, enabling more accurate and meaningful text analysis.

In this article, we’ll explore the fundamentals of Sentence Transformers and provide hands-on examples utilising Hugging Face’s Python library, sentence-transformers.

1. Introduction to Transformers

Before we jump into the fundamentals of Sentence Transformers, it’s important to understand the advancements in machine learning that led to their creation.

Previous State of the Art Translation

Previously, translating text from one language to another heavily relied on encoder-decoder architectures based on Recurrent Neural Networks (RNNs) [1] and Long Short-Term Memory Networks (LSTMs) [2]. These architectures had two main phases:

  1. Encoding Phase: The encoder processes the entire input sentence and condenses it into a single context vector, which is the final hidden state of the encoder. This context vector serves as a summary of the input sentence.
  2. Decoding Phase: The decoder then takes this context vector as its initial hidden state and generates the output sentence one word at a time.

You might be wondering what the meaning of the “hidden state” is. In very simple terms, imagine you are reading a paragraph to understand it (encoding phase). As you read each sentence, you form an idea in your mind (hidden states at each step). After finishing the paragraph, you have a clear understanding of the overall meaning (final hidden state). Now, you want to explain this paragraph to someone else (decoding phase). You start explaining using your clear understanding (initial hidden state), and you continue to elaborate sentence by sentence until you've conveyed the entire meaning. So, "hidden states" are intermediate representations of the input data at each step of the process within the neural network.
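
To make the idea of hidden states a little more concrete, here is a minimal sketch of a toy recurrent encoder. It is an illustration only: the scalar weights `w_input` and `w_hidden` and the two-dimensional token vectors are made-up assumptions, whereas a real RNN learns full weight matrices during training.

```python
import math

def encode(token_vectors, w_input=0.5, w_hidden=0.8):
    """Toy recurrent encoder: returns one hidden state per input token.

    The scalar weights are illustrative assumptions, not learned values.
    """
    hidden = [0.0] * len(token_vectors[0])  # initial hidden state
    states = []
    for x in token_vectors:
        # Each new hidden state mixes the current token with the previous state.
        hidden = [math.tanh(w_input * xi + w_hidden * hi)
                  for xi, hi in zip(x, hidden)]
        states.append(hidden)
    return states

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-d token vectors
hidden_states = encode(tokens)

# The final hidden state is the single context vector handed to the decoder.
context_vector = hidden_states[-1]
```

Every token produces a hidden state, but only the last one is passed on to the decoder in this architecture.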

The process of this architecture can be seen below for English to Spanish translation:

Figure 1: Encoder-decoder architecture. Single context vector shared between two models creating an information bottleneck. The [PAD] here is a special token used to pad the final sequence to ensure it has the same length as the input sentence.

The primary issue with this approach is the reliance on a single context vector to represent the entire input sentence. If the encoder's summary is inadequate, the quality of the translation deteriorates. This is especially true for longer sentences because of the long-range dependency problem: the difficulty that models like RNNs and LSTMs face in capturing and retaining information about relationships between distant elements in a sequence. Because one context vector is shared between the two models, it acts as an information bottleneck: all information must pass through this single point.

To address this, in 2015, Bahdanau et al. [3] proposed an innovative approach: rather than relying on a single context vector, their model incorporates all the input words, assigning varying levels of importance to each. This approach, known as the attention mechanism, allows the model to focus on the most relevant parts of the input sentence when generating the output.

The attention mechanism is inspired by the human visual processing system. When you read a book, the majority of what is in your field of vision is actually disregarded; you pay more attention to the word you are currently reading. This allows your brain to focus on what matters most while ignoring everything else.

In order to imitate the same effects in deep learning models, we assign an attention weight to each of our inputs. These weights represent the relative importance of each input element to the other input elements. This way we guide our model to pay greater attention to particular inputs that are more critical to performing the task at hand.

Figure 2: Visual representation of the attention mechanism for a single sentence. For visual purposes only.

So, with the attention mechanism, our original architecture becomes:

Figure 3: Encoder-decoder architecture with the attention mechanism. This reduces the bottleneck in the previous architecture.

This architecture evaluates a set of positions in the encoder’s hidden states to identify where the most relevant information resides. By doing so, it creates a context vector that dynamically adjusts to include significant details, enhancing the model’s ability to handle long sentences and complex dependencies. This elegant solution has since become a cornerstone in the development of advanced Natural Language Processing models and beyond.
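
As a rough sketch of how such a dynamic context vector can be computed, the snippet below scores each encoder hidden state against the current decoder state, turns the scores into attention weights with a softmax, and takes the weighted sum. Note the assumptions: the vectors are made up, and a plain dot product is used for scoring to keep things short, whereas Bahdanau et al. score with a small learned network.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to one."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return the attention weights and the resulting context vector."""
    # Score every encoder hidden state against the decoder state.
    # (Dot-product scoring is a simplifying assumption here.)
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted sum of the encoder hidden states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up hidden states
decoder_state = [2.0, 0.0]                             # made-up decoder state
weights, context = attention_context(decoder_state, encoder_states)
```

Because the weights are recomputed at every decoding step, each output word gets its own context vector rather than sharing a single fixed one.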

Attention is All You Need

The attention mechanism went on to influence further ideas, most notably the landmark paper Attention is All You Need, published in 2017 [4]. The authors introduced the transformer model, which eliminated the need for RNNs by using the attention mechanism alone, leading to superior performance and generalization capabilities. This shift revolutionized the NLP ecosystem, moving it away from RNN-based models towards transformers.

The original transformer architecture worked well for sequence-to-sequence problems, but for specific natural language problems like question answering and text summarisation, it needed improvement. There were two main concerns:

  • A lot of data was needed to train transformers from scratch
  • The architecture may not be complex enough to understand patterns to solve language problems

To address these concerns, BERT was introduced.

2. BERT

One of the most famous pre-trained models is BERT, Bidirectional Encoder Representations from Transformers, by Google AI [5]. BERT was trained on the BooksCorpus, which has over 800 million words, and English Wikipedia, which has 2.5 billion words. This extensive and diverse dataset enabled BERT to achieve state-of-the-art performance across a variety of NLP tasks.

BERT was built on the idea that different NLP problems all rely on the same fundamental understanding of language. BERT models are built in two phases. The first is the pre-training phase, where we train the model to understand the language. The second is the fine-tuning phase, where we further train the model on a specific task. This addresses the data concern of transformers: if we already have a pre-trained model that understands language, then we only need data to fine-tune the model for our specific case.

Previously, RNNs were mainly built for very specific use cases, and the architecture would change from one use case to the next. This is the beauty of transformers: they can be generalized. This means it’s possible to use the same ‘core’ of a model and change the final layers for different use cases. This feature of transformers sparked a whole new type of model in NLP: pre-trained models.

Pre-trained transformer models are trained on enormous amounts of training data. During training, the model learns patterns, structures and relationships within the data, allowing it to make accurate predictions and generate meaningful responses when presented with new, unseen data. Typically, this pre-training is done by companies like Google and OpenAI, as training on such vast datasets is expensive. Many of these models are then made available to the public to use for free, which is super helpful!

Imagine we take a question-answering platform and we want to find questions that are similar to a given input question. How would we do this with BERT?

Well, we’d take an input question “What is gravity?” and another question on the platform “How do aeroplanes fly?” and pass them into BERT as a pair. BERT would then generate word vectors, which are passed into a feed-forward layer that gives a single output known as the similarity score. The higher the score, the more similar the two questions.

Figure 4: Obtaining a similarity score between two sentences using the BERT model.

Of course, we’d have to pass in every question on the platform to determine the similarity. Once complete, you’d select the questions that had the highest similarity as the most relevant/similar questions to the input question.

There’s a big problem here. Imagine we have 10 million questions on the question-answering platform. Then, whenever a new question came in, we’d need to run the forward pass of BERT 10 million times. This is not viable.

The most obvious solution here would be to pass each question on the platform into BERT once to get a single vector that represents it, and store those vectors. When a new question comes in, we embed it and compare its vector against all the stored vectors using something like a cosine similarity metric as explained in this article. The final step would be to return the nearest neighbours as the most related questions. This setup would require us to run the BERT model once per incoming question and not 10 million times.
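
Here is a minimal sketch of that comparison step, assuming the question vectors have already been computed. The three-dimensional vectors and the question “Why do objects fall?” are made up for illustration; real embeddings would come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query_vector, stored_vectors, k=2):
    """Return the k stored questions most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vector), question)
              for question, vector in stored_vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [question for _, question in scored[:k]]

# Hypothetical pre-computed vectors for questions already on the platform.
stored_vectors = {
    "What is gravity?": [0.9, 0.1, 0.0],
    "Why do objects fall?": [0.8, 0.3, 0.1],
    "How do aeroplanes fly?": [0.1, 0.9, 0.2],
}
query_vector = [1.0, 0.2, 0.0]  # hypothetical vector for a new question

top = nearest_neighbours(query_vector, stored_vectors, k=2)
print(top)
```

Comparing vectors like this is far cheaper than a forward pass through BERT, which is what makes the approach viable at scale.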

The issue with BERT is that it only gives us word vectors so, if you want a sentence vector, you’d need to somehow aggregate the word vectors into a single vector. The most straightforward way to do this would be to take the average of the word vectors. This technique is called mean pooling. This system as outlined in the figure below is the simplest form of sentence transformer.

Figure 5: The simplest form of a sentence transformer. Pass a sentence through BERT to produce word vectors. Aggregate these word vectors into a single sentence vector by mean pooling.
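
Mean pooling itself is only a few lines. The sketch below averages made-up word vectors and, as an assumption beyond what is described above, uses an attention mask so that padding tokens do not drag the average down.

```python
def mean_pool(word_vectors, attention_mask):
    """Average the word vectors of real tokens, skipping padding (mask == 0)."""
    dim = len(word_vectors[0])
    kept = [vector for vector, keep in zip(word_vectors, attention_mask) if keep]
    return [sum(vector[i] for vector in kept) / len(kept) for i in range(dim)]

word_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
attention_mask = [1, 1, 0]                           # 1 = real token, 0 = padding

sentence_vector = mean_pool(word_vectors, attention_mask)
print(sentence_vector)
```

However long the sentence is, the result has the same number of dimensions as a single word vector.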

Okay great! Well…not so great. The sentence embedding produced here is actually really poor quality. So poor that you might be better off taking the average of GloVe embeddings and not using BERT at all! Let’s look at how we can fix that.

3. Sentence Transformers

The solution came in 2019 with Nils Reimers and Iryna Gurevych's SBERT (Sentence-BERT) [6] and the sentence-transformers library [7]. SBERT generated high-quality sentence embeddings, drastically reducing search times and outperforming previous models in semantic textual similarity tasks. Unlike BERT, SBERT allowed for efficient storage and comparison of sentence embeddings, making it highly scalable.

So, how does SBERT work?

SBERT is fine-tuned on sentence pairs using a siamese architecture. This means we have two identical BERT networks that share the same weights. This can be seen in the Figure below, where two separate sentences are passed through the same architecture. In the original SBERT paper, the authors tested three different pooling methods (mean, max and [CLS]), and mean pooling performed best.

Figure 6: SBERT siamese architecture. Creating sentence embeddings with BERT.

This siamese architecture produces sentence embeddings. Let’s call these embeddings u and v respectively. These embeddings are then combined to form a single vector embedding that represents the relationship between the two sentences. Different concatenation methods were tested but the best performing was (u, v, |u-v|) as illustrated below.

Figure 7: Concatenation of the sentence embeddings, u and v. This concatenated vector represents the relationship between the two sentences.
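
The concatenation can be sketched as follows, using tiny made-up vectors in place of real sentence embeddings:

```python
def combine(u, v):
    """Build the (u, v, |u - v|) vector from two sentence embeddings."""
    difference = [abs(a - b) for a, b in zip(u, v)]  # element-wise |u - v|
    return u + v + difference  # list concatenation, three times the length

u = [0.5, -1.0, 2.0]  # made-up embedding for sentence one
v = [0.0, 1.0, 2.0]   # made-up embedding for sentence two

combined = combine(u, v)
print(combined)
```

With 768-dimensional sentence embeddings, the combined vector would have 3 × 768 = 2304 dimensions.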

To train the sentence transformer, this combined vector is fed into a neural network, typically a feed-forward neural network (FFNN), which can be trained with various objectives depending on the task. Let’s take a look at an example: Natural Language Inference (NLI). NLI works by taking two sentences and determining whether sentence one entails or contradicts sentence two, or neither. Solving this task pushes BERT to capture the meaning of each sentence as a whole, which is why NLI was chosen for training.

An example of an NLI sentence pair can be seen below for the sentences “She purchased a new laptop from the store” and “She works in technology”. This example is considered neutral because the fact that she has purchased a laptop does not necessarily imply that she works in technology.

Figure 8: Example of an NLI sentence pair. This sentence pair is neutral.

We then minimise a loss function which trains the network. Let’s break down these steps for training sentence transformers on the NLI dataset:

  1. Embed the Sentences: For each sentence pair in the NLI dataset, embeddings are computed.
  2. Concatenation: The embeddings of the two sentences are then used to form a new, concatenated vector.
  3. Feedforward Neural Network (FFNN): This concatenated vector is then fed into a feedforward neural network. The FFNN processes this vector through one or more hidden layers and then outputs a vector of logits. Logits refer to the vector of raw (non-normalized) predictions that a classification model outputs. These are the values that come directly from the last linear transformation of a neural network, before any normalization or activation function like softmax is applied.
  4. Softmax Layer: The output vector (logits) from the FFNN is passed through a softmax layer, which converts these logits into probabilities. Each entry in this output probability vector corresponds to one of the NLI labels (entailment, contradiction, neutral). The softmax function ensures that the probabilities are non-negative and sum to one, making them interpretable as the likelihood of each class given the input sentence pair.
  5. Cross-Entropy Loss: During training, the output probabilities from the softmax layer are compared to the actual label of the sentence pair using the cross-entropy loss function. This loss function is effective for classification tasks as it penalizes the differences between the predicted probabilities and the actual one-hot encoded labels (where the true label has a probability of 1, and all others have a probability of 0).
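  6. Optimization: The cross-entropy loss is used to perform backpropagation. By minimizing this loss, the parameters of the FFNN and of the underlying BERT encoder are adjusted to better fit the NLI dataset. Updating the encoder is what improves the sentence embeddings themselves.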
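
Steps 4 and 5 above can be sketched in a few lines. The logits below are made up, and the order of the labels is an assumption chosen for this illustration; in practice the logits come from the FFNN and the label order is fixed by the dataset.

```python
import math

# Assumed label order, for illustration only.
LABELS = ["entailment", "contradiction", "neutral"]

def softmax(logits):
    """Turn raw logits into probabilities that are positive and sum to one."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Loss for a one-hot label: negative log-probability of the true class."""
    return -math.log(probabilities[true_index])

logits = [1.0, 0.5, 3.0]  # made-up FFNN output for one sentence pair
probabilities = softmax(logits)
loss = cross_entropy(probabilities, LABELS.index("neutral"))

predicted = LABELS[probabilities.index(max(probabilities))]
print(predicted, loss)
```

The loss is small when the model puts most of its probability on the true label, and grows as that probability shrinks.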

The overall architecture for our use case therefore looks like the Figure below.

Figure 9: The overall architecture for training a sentence transformer on the NLI dataset. We begin with the concatenated vector embedding that is produced by the original two sentence embeddings, u and v. This is then passed through a feed-forward neural network and a softmax layer to produce the output probabilities. These probabilities, along with the true label, are passed into a loss function to further train the model.

More generally, this architecture can be used for training on different datasets. The same idea holds: take an input that consists of pairs of sentences (or individual sentences) to be embedded. Each sentence is passed through a BERT encoder. A pooling layer is then applied to the output. For sentence pair tasks, a similarity function is used to compare the embeddings of the two sentences. Then the similarity scores are fed into a loss function which trains the sentence transformer.

Since SBERT, various sentence transformer models have been developed and optimized with different loss functions to produce accurate sentence embeddings. These models are trained on diverse sentence pairs to ensure robustness in capturing sentence similarities.

Awesome! Now that you have the background about sentence transformers, let's explore programming them with the sentence-transformers library!

4. Programming with Sentence Transformers

Now that we’ve covered the basics of sentence transformers, we can use Hugging Face’s sentence-transformers library to run some examples. This library was created by the authors of SBERT, and it’s free and easy to use!

We’ll be using Google Colab for this tutorial. If you are new to Google Colab, you can follow this guide on getting set up - it’s super easy! For this module, you can find the notebook on Google Colab here or on GitHub here. As always, if you face any issues, join our Community and a member of our team will help!

First we install it with pip:

!pip install sentence-transformers

Next, we create our Sentence Transformer model. We will use the original SBERT model bert-base-nli-mean-tokens:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

This produces the following:

  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})

The output is the SentenceTransformer object, which is composed of two key components.

  • The Transformer: this has a maximum sequence length of 128 tokens and does not lowercase its input (‘do_lower_case’ is False). The Transformer model here is the BertModel.
  • The Pooling: the sentence embedding to be produced is 768-dimensional. There are other entries which we won’t cover but, if you’re curious, ask on our Community.

We can now create some sentences to be turned into sentence embeddings. We’ll take some examples from Marqo’s website.

Let’s create a list of these sentences in Python.

marqo_sentences = [
    "With Marqo you can use your data to increase relevance with embedding search",
    "Join the conversation on our Marqo Community Channel",
    "AI search that understands the way your customers think",
    "How RedBubble increased revenue with Marqo",
    "With Marqo you can use your data to increase downloads with embedding search"
]

embeddings = model.encode(marqo_sentences)

As we’ve seen in the SentenceTransformer object above, we expect these sentence embeddings to be 768-dimensional. Let’s see if they are:

print(embeddings.shape)

This returns the output:

(5, 768)

So, yes! We have 5 sentences that are all 768-dimensional. Awesome!

Let’s use cosine similarity to compute how similar each sentence is:

import numpy as np
from sentence_transformers.util import cos_sim

# Assuming `embeddings` is a numpy array or a list of numpy arrays
embeddings = np.array(embeddings)

# Compute the cosine similarity matrix
sim = cos_sim(embeddings, embeddings).numpy()


This will produce a 5x5 matrix similar to the image below.

Figure 10: Entries of a matrix.

The value in each entry is the similarity between the sentences given by its row and column numbers. For example, the first entry (top-left) is the similarity between sentence 1 and sentence 1. Here, we expect the cosine similarity to be 1 because the two sentences are identical. The top-right entry is the similarity between sentence 1 and sentence 5.

Printing sim gives the following 5x5 matrix:

[[0.9999999  0.6356604  0.66937524 0.4040454  0.9407738 ]
 [0.6356604  0.9999999  0.6081568  0.21924865 0.58871996]
 [0.66937524 0.6081568  1.         0.2225984  0.59990084]
 [0.4040454  0.21924865 0.2225984  0.99999976 0.5078909 ]
 [0.9407738  0.58871996 0.59990084 0.5078909  1.0000001 ]]

We can see that sentence 1 and sentence 5 are similar as the (5, 1) entry (bottom-left) is 0.9407738. Take a glance yourself at the different values and come to a conclusion over which sentences you think are similar. It’s worth noting that the entries on the diagonal are roughly equal to 1. This is because these are the points at which a sentence is compared against itself—these are obviously identical and so we get maximum similarity.
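
If you’d rather not scan the matrix by eye, a small helper can find the most similar pair for you. This is a sketch that copies the values printed above into a plain Python list; `most_similar_pair` is a hypothetical helper written for this example, not part of the sentence-transformers library.

```python
# Similarity values copied from the matrix printed above.
sim = [
    [0.9999999, 0.6356604, 0.66937524, 0.4040454, 0.9407738],
    [0.6356604, 0.9999999, 0.6081568, 0.21924865, 0.58871996],
    [0.66937524, 0.6081568, 1.0, 0.2225984, 0.59990084],
    [0.4040454, 0.21924865, 0.2225984, 0.99999976, 0.5078909],
    [0.9407738, 0.58871996, 0.59990084, 0.5078909, 1.0000001],
]

def most_similar_pair(matrix):
    """Return (score, row, column) for the highest off-diagonal entry."""
    best = None
    for i in range(len(matrix)):
        # Only look above the diagonal: the matrix is symmetric and the
        # diagonal just compares each sentence with itself.
        for j in range(i + 1, len(matrix)):
            if best is None or matrix[i][j] > best[0]:
                best = (matrix[i][j], i, j)
    return best

score, i, j = most_similar_pair(sim)
print(f"Sentences {i + 1} and {j + 1} are the most similar (score {score:.4f})")
```

The indices are zero-based in the code, so 1 is added when printing to match the sentence numbering used in the text.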

In this example we used SBERT, but there are many other sentence transformer models available. Since its release, newer and faster models have been published that significantly outperform the original SBERT. Interestingly, SBERT is no longer even listed as an available model on the library’s models page!

Here are some of the best SBERT models currently available:

  • all-mpnet-base-v2
  • all-MiniLM-L6-v2
  • all-distilroberta-v1
  • paraphrase-mpnet-base-v2

Feel free to change your code to use these models and see what results you get!

Some of the advantages of using these models over SBERT include:

  • Processing sequences that are longer
  • Different and better base models (not just BertModel)
  • Additional normalisation layers

Pretty cool!

5. Summary

In this article, we’ve explored the background behind sentence transformers and started coding with Hugging Face’s Python library, sentence-transformers. In the next article, we’ll explore some of the newer models in more detail and explain how you can train and fine-tune your own sentence transformers!

6. References

[1] D. Rumelhart, et al. Learning Internal Representations by Error Propagation (1986)

[2] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory (1997)

[3] D. Bahdanau, K. Cho and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate (2015)

[4] A. Vaswani, et al. Attention is All You Need (2017)

[5] J. Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)

[6] N. Reimers and I. Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019)

[7] Sentence Transformer Library, Hugging Face

7. Code