Course Module 1

Introduction to Vector Embeddings


This module explores the basics of vector representations: what vectors are, how they represent data in information retrieval, and how similarities between them can be computed using mathematical methods.

Before diving in, if you need help, guidance, or want to ask questions, join our Community and a member of the Marqo team will be there to help.

1. Definition of Vectors

Vectors are fundamental mathematical objects used extensively in computer science and data science. A vector, \( \textbf{v} \), is typically described by a list of numbers [1]:

$$\textbf{v} = \left(6, 3, \dots, 7\right)$$

Note that the entries do not have to be integers; they can be any real numbers, including decimals, fractions and irrational numbers:

$$\textbf{v} = \left(0.1, \frac{1}{5}, \sqrt{2}, \dots, 5\right)$$

More generally, we write a vector, \( \textbf{v} \), in the form:

$$\textbf{v} = (v_1, v_2, \dots, v_n),$$

where \( n \) represents the number of dimensions of the vector (equivalently, the number of elements it contains).
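As a quick illustration (this snippet is not part of the course notebook), here is how a vector with arbitrary example values can be written in Python using NumPy:


import numpy as np

# A four-dimensional vector (n = 4) stored as a NumPy array
v = np.array([6, 3, 2, 7])

print(v)       # [6 3 2 7]
print(len(v))  # 4 -> the number of dimensions, n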

For example, a vector made up of two numbers is known as a two-dimensional vector. Take \( \textbf{v} = (1, 1) \) as an example. This can be represented on a two-dimensional \( (x, y) \) grid as follows:

Figure 1: Representation of a two-dimensional vector.

where \( \textbf{v} = (x = 1, y = 1) \).

We can also extend vectors to three dimensions, as seen below. They can be extended to even higher dimensions, but that gets a little tricky to visualise!

Figure 2: Representation of a three-dimensional vector.

Two key properties of vectors are their magnitude and direction. Both play a vital role in artificial intelligence systems, as we’ll see in the next section.

Magnitude: The magnitude (or length) of a vector, \( \textbf{v} \), is calculated by a mathematical formula known as the Euclidean norm [1],

$$||\textbf{v}|| = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2},$$

where \( n \) represents the dimension of the vector.

Direction: The direction of a vector, \( \theta \), is given by the angle it forms with the coordinate axes. In two dimensions, the direction, \( \theta \), of a vector \( \textbf{v} \) can be found using the following formula [1],

$$\theta = \tan^{-1} \left(\frac{v_2}{v_1}\right).$$

So for our previous two-dimensional example where \( \textbf{v} = (1, 1) \), the magnitude is given by,

$$||\textbf{v}|| = \sqrt{1^2 + 1^2} = \sqrt{2},$$

and the direction is,

$$\theta = \tan^{-1} \left(\frac{1}{1}\right) = \tan^{-1}(1) = 45^\circ.$$

Both these quantities can be seen below.

Figure 3: Representation of the magnitude and direction of a vector in two dimensions.
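If you’d like to verify these numbers yourself, here is a minimal NumPy sketch (separate from the course notebook) that reproduces the calculation:


import numpy as np

# The two-dimensional vector from the example above
v = np.array([1.0, 1.0])

# Magnitude: the Euclidean norm, sqrt(1^2 + 1^2) = sqrt(2)
magnitude = np.linalg.norm(v)

# Direction: the angle with the x-axis, arctan(v2 / v1), converted to degrees
theta = np.degrees(np.arctan2(v[1], v[0]))

print(magnitude)  # 1.4142135623730951
print(theta)      # 45.0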

It’s worth noting that a vector is a one-dimensional array of numbers, indexed by a single subscript. To represent data with more structure, we use what are known as tensors. These are more general mathematical objects that extend the concept of vectors, allowing for the representation of multidimensional arrays of numbers with multiple indices. This comes in handy when we have a lot of data!

So vectors are pretty cool, right? But what do they have to do with computer science and artificial intelligence systems?

2. Vector Representation

We’ve seen that vectors store information in the form of numbers. So, the next question is, can these vectors of numbers describe something more complex? A word? A sentence? Even an image? Well, the answer is yes!

Vectors can be used as numerical representations of complex information such as words, sentences, images, videos and even more. This makes sense: for a computer to understand our input, we have to convert it into a machine-readable format. What’s more, language inherently carries a lot of information, so even small snippets of text require a considerable amount of numerical data to represent them! Let’s start with an example.

The word “King” might be represented as the vector \( \textbf{v}_K = (0.5, 0.1, 0.3, \dots) \) and the word “Queen” might be represented by the vector \( \textbf{v}_Q = (0.6, 0.2, -0.4, \dots) \) [2]. We’ve added subscripts to the vectors here so we know which is “King” and which is “Queen”. If you’re wondering how we generate these vectors, we’ll be covering that in the next module!

If you look at the numbers in each of the vectors \( \textbf{v}_K \) and \( \textbf{v}_Q \), you might notice that they’re closely related. Here’s what the vectors might look like in two-dimensional space [2]:

Figure 4: Vector representation of the words ‘King’ and ‘Queen’ in two dimensions.

Both vectors are close in the vector space which indicates their semantic similarity. Semantic similarity refers to the likeness or closeness in meaning between two pieces of text, words, phrases, sentences, or documents.

This concept of representing words as vectors and measuring their semantic similarity isn't limited to just words like "King" and "Queen". We can extend this idea to represent entire sentences, paragraphs, or documents in a similar manner!

For instance, a sentence like "The dog sat on the mat" might be represented as a vector, and another sentence like "A canine laid on the rug" would have its own vector representation. Despite using different words, these sentences convey similar meanings, and thus their vectors would likely be close together in the vector space.

Furthermore, this approach isn't confined to textual data. Images, for example, can also be represented as vectors. Each pixel in an image can be assigned a value, and these values collectively form a vector representation of the image [3]. Similarly, videos, audio clips, and even more complex data structures can be represented using vectors.
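To make the image case concrete, here is a toy sketch (with made-up pixel values) showing how a tiny greyscale image can be flattened into a vector:


import numpy as np

# A 3x3 greyscale "image": each entry is a pixel intensity
image = np.array([
    [0.0, 0.5, 1.0],
    [0.2, 0.8, 0.3],
    [0.9, 0.1, 0.4],
])

# Flattening the pixel grid gives a nine-dimensional vector representation
image_vector = image.flatten()

print(image_vector)        # [0.  0.5 1.  0.2 0.8 0.3 0.9 0.1 0.4]
print(image_vector.shape)  # (9,)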

By leveraging these vector representations, we can perform various operations such as measuring similarity, performing classification tasks, or even generating new content through techniques like vector arithmetic or neural networks.

The vectors mentioned in this section are also known as vector embeddings. These embeddings are crucial when building efficient artificial intelligence systems. In the next module, we'll look into how these vector embeddings are generated and how they can be applied to solve real-world problems across different domains. First, we will establish the different types of vectors.

3. Dense vs Sparse Vectors

We now know that we can represent language as vectors but there are two options we can take with this. We can either use dense or sparse vectors.

Let’s consider the make-up of dense and sparse vectors in turn. Dense vectors have very few elements that are zero, meaning they are densely populated with numerical information:

$$\textbf{v}_{\text{dense}} = (0.3, -0.4, 0.9, 0.9, 0.7, 0.1, \dots, 0.5).$$

Dense vectors are straightforward to work with and are often used when the data has no inherent sparsity or when computational efficiency isn't a major concern.

On the other hand, sparse vectors have many elements that are zero, meaning they carry numerical information only sparsely:

$$\textbf{v}_{\text{sparse}} = (0, 0, 0, 0, 0, 1, \dots, 0).$$

These are used when the data being represented is inherently sparse, meaning there are many missing or zero values. Sparse vectors are memory-efficient and often preferred when dealing with high-dimensional data where most values are zero.
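As a rough sketch of the difference (not part of the course notebook), a dense vector can be stored as an ordinary array, while a sparse vector only needs to record its non-zero entries, for example with SciPy:


import numpy as np
from scipy.sparse import csr_matrix

# Dense vector: most entries are non-zero, so every value is stored explicitly
dense = np.array([0.3, -0.4, 0.9, 0.9, 0.7, 0.1, 0.5])

# Sparse vector: mostly zeros, so only the non-zero entries are stored
sparse = csr_matrix([[0, 0, 0, 0, 0, 1, 0]])

print(dense)
print(sparse)      # records (row, column) -> value for the non-zero entries only
print(sparse.nnz)  # 1 -> number of stored (non-zero) elements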

What does this mean in the context of representing data? Let’s consider the following sentences:


🧑‍🍳 “The chef cooked dinner for the guest.”

and,

👱‍♂️ “The guest cooked dinner for the chef.”

Both sentences have very different meanings, yet they contain exactly the same words. For this reason, sparse vectors would generate a perfect (or near-perfect) match between them: sparse vectors are based on the presence or absence of words, rather than their order or context, so they treat sentences with the same words as highly similar, even if the meanings differ significantly.
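To see this in practice, here is a quick sketch using scikit-learn’s CountVectorizer as one example of a sparse, bag-of-words representation (this is not part of the course notebook):


from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The chef cooked dinner for the guest.",
    "The guest cooked dinner for the chef.",
]

# Bag-of-words: one dimension per vocabulary word, counting occurrences
# and ignoring word order entirely
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['chef' 'cooked' 'dinner' 'for' 'guest' 'the']
print(vectors.toarray())
# [[1 1 1 1 1 2]
#  [1 1 1 1 1 2]]  -> identical vectors despite the very different meanings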

While sparse vectors capture only which words appear in a text, dense vectors are able to capture the semantic meaning behind the information they represent. The corresponding dense vector representations of the two sentences above would not be nearly as similar. This highlights the advantage of dense vectors in capturing the nuanced meanings and contextual relationships within text, making them more effective for tasks requiring semantic understanding.

If we create dense vectors for all the words in a cookbook, reduce the dimensionality of the vectors and visualise them in 3D, they may look as follows:

Figure 5: Illustration of the clustering of similar words in a vector space.

Notice how words with similar meaning (semantically similar words) are clustered together. In this case, the names of fruits form a cluster. This property of the generated vectors means that machine learning models can effectively capture the semantic relationships between words and entities.

Another famous example [2] in the field of Natural Language Processing (NLP) is the equation:

$$\text{King - Man + Woman = Queen}$$

This equation exemplifies the power of dense vectors in NLP. In this equation, words are represented as dense vectors in a high-dimensional space, capturing relationships through vector arithmetic. Here’s what it may look like in vector space:

Figure 6: Illustration of words in a vector space. Demonstrating the 'King - Man + Woman = Queen’ example.

By subtracting the vector representation of “Man” from that of “King” and adding the vector for “Woman”, we arrive at a vector very close to that of “Queen”. In other words, the model has captured the relationship between these words. Pretty cool!
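Here is what that arithmetic looks like in code, using small made-up vectors purely for illustration (real word embeddings have hundreds of dimensions):


import numpy as np

# Toy two-dimensional "word vectors", chosen by hand for illustration only
king  = np.array([0.8, 0.9])
man   = np.array([0.6, 0.2])
woman = np.array([0.3, 0.4])
queen = np.array([0.5, 1.1])

result = king - man + woman

print(result)  # [0.5 1.1] -> (approximately) equal to our toy "queen" vector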

But, how do we compute whether vectors are similar to one another?

4. Similarity Measures

Similarity measures in vector embeddings are important because they allow us to quantify how closely related words or entities are in meaning. These measures allow for efficient retrieval of similar items, relevance ranking, and personalised recommendations in applications such as information retrieval, recommendation systems, and clustering.

One of the most common similarity measures is cosine similarity. This measures the cosine of the angle between two vectors, \( \textbf{u}, \textbf{v} \) in a multi-dimensional space. Mathematically, we calculate this by,

$$\text{cosine similarity} = \cos(\theta) = \frac{\textbf{u} \cdot \textbf{v}}{||\textbf{u}|| \cdot ||\textbf{v}||},$$

where \( \textbf{u} \cdot \textbf{v} \) is the dot product of \( \textbf{u} \) and \( \textbf{v} \), and \( ||\textbf{u}|| \) and \( ||\textbf{v}|| \) are the magnitudes of the vectors \( \textbf{u} \) and \( \textbf{v} \).

Computing the cosine similarity will return a value in the range \( \cos(\theta) \in [-1, 1] \), where a value of 1 implies perfectly aligned vectors, 0 indicates orthogonal (unrelated) vectors and -1 indicates maximum dissimilarity.
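To see how the formula works on its own, here is a from-scratch NumPy version (separate from the course notebook, which uses the sentence-transformers library below):


import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the two magnitudes
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 1.0])
v = np.array([1.0, 0.0])

print(cosine_similarity(u, v))  # 0.7071... = cos(45 degrees)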

Let’s test this out through code! For this course, we will be using Google Colab (it’s free!). If you are new to Google Colab, you can follow this guide on getting set up - it’s super easy! For this module, you can find the notebook on Google Colab here or on GitHub here. As always, if you face any issues, join our Slack Community and a member of our team will help!

Let’s try this out for some vectors. We are going to use the cosine similarity function, cos_sim, from the Python library sentence-transformers. To utilise this, you will need to run:


!pip install sentence-transformers

We first import the cos_sim function and create two vectors, vector1 and vector2. We then compute the cosine similarity between these two vectors and observe the output:


from sentence_transformers.util import cos_sim

vector1 = [0.1, 0.4, 0.5, 0.0]
vector2 = [0.1, 0.4, 0.5, 0.1]

cosine_similarity = cos_sim(vector1, vector2)

cosine_similarity

The output:


tensor([[0.9883]])

Notice how the two vectors have the same entries except for the last number, which deviates by only 0.1. This is why the cosine similarity score is close to 1: the vectors are closely related.

Let's try the same but for very different vectors.


vector3 = [0.9, 0.0, -0.5, 0.0]
vector4 = [-0.9, 0.8, 0.5, 0.6]

cosine_similarity = cos_sim(vector3, vector4)

cosine_similarity

The output:


tensor([[-0.7173]])

Notice how we now have a negative value for the cosine similarity. This means that these two vectors are dissimilar. Why don’t you try out a few different vectors (try varying the number of elements inside) and see what results you get?

5. Summary

In this module we’ve covered how and why vectors are used to represent various types of information. But how do we actually generate these vector embeddings? Let’s find out in the next module!

6. References

[1] M. Corral, Vector Calculus (2022)

[2] T. Mikolov, et al., Linguistic Regularities in Continuous Space Word Representations (2013)

[3] Y. LeCun, et al., Gradient-Based Learning Applied to Document Recognition (1998)

7. Code

https://github.com/marqo-ai/fine-tuning-embedding-models-course/blob/main/1_intro_to_vector_representations.ipynb