Module 3

Introduction to Vector Databases

Before diving in, if you experience any difficulties or have any questions, join our Slack Community and we’ll be there to help! Please post your questions in the ‘marqo-courses’ channel 🚀 If you’re new to Slack, don’t worry: join the channel and Ellie will send you a ‘getting started with Slack’ guide! 😊

If you want to build your own embedding search applications, try out Marqo for free!

Vector databases have been growing in popularity over the last few years, with a surge in interest coinciding with the release of large language models (LLMs) like ChatGPT. They captured the wider developer community’s attention as people began to realise the impact that vector databases can have on such models.

This article will guide you through the fundamentals of vector databases and vector search. Specifically, we will look at concepts such as vector embeddings, vector indexes, search, similarity measures and nearest neighbor methods. In the next article we will look at how we can implement our own vector search system using Marqo!

1. What is a Vector Database?

The clue is in the name: a vector database is a database that stores information as vectors. Vectors are numerical representations of data objects, also known as vector embeddings.

The biggest advantage of a vector database is that it allows for precise and fast similarity search and retrieval of data. This is different to traditional methods that query databases based on exact matches; vector databases can be used to find the most similar or relevant data based on their contextual meaning (also known as semantic meaning). Vector databases index and store vector embeddings. Let’s take a look at what these are.

2. What are Vector Embeddings?

Vector embeddings are numerical representations of data (e.g., images, text and audio), as discussed in Module 1 of our Fine-Tuning Embedding Models Course. When these vectors are created, they capture the semantic meaning of the data. This, in turn, allows for better search and retrieval results. Representing data in this way is crucial as it makes the data far easier for computer systems to work with.

The figure below illustrates the idea behind vector embeddings. We take some data, e.g. text or audio, and pass it through the respective encoder model, which produces vector embeddings. These represent the input data in a numerical format. For more information on these embedding models, read our previous article.

Figure 1: Illustration of vector embedding generation for both text and audio.
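As a concrete illustration, here is a minimal sketch of generating text embeddings with the open-source sentence-transformers library. Our choice of library and model here is purely illustrative; any encoder model works the same way.

```python
from sentence_transformers import SentenceTransformer

# Load a small, general-purpose text encoder (model choice is illustrative).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A king sits on the throne.", "The queen waved to the crowd."]

# Each sentence becomes a fixed-length vector that captures its meaning.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384) for this model
```

Whatever the encoder, the output is always the same kind of object: a fixed-length list of numbers per input, which is exactly what a vector database stores.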

Vector databases, as mentioned, store these vector embeddings. They take advantage of the mathematical properties of embeddings where similar items are stored together. This is where vector indexing becomes important.

Because these embeddings capture the contextual meaning behind words (and other forms of data), we can phrase queries, and receive results, in a human-like way. This makes vector search engines the preferred way of searching, especially for applications that are sensitive to incorrect spelling, where exact keyword matching breaks down. In the next article, we’ll look at building our own vector search engine using Marqo.

3. What is a Vector Index?

Vector indexing is the process of carefully and cleverly organising vector embeddings to optimise retrieval. It uses specialised (and advanced) algorithms to arrange high-dimensional vectors in a neat, searchable manner, such that similar vectors are grouped together. Thus, vector indexing allows fast and precise similarity searches.

Indexing works much like the way books are arranged in a library. We don’t walk in and look through every book until we find the one we want; we go straight to the section where it’s shelved. Indexing in databases is similar: it allows for efficient and fast retrieval of the data you want.

Let’s imagine we have a group of images and we generate a vector embedding for each one. The vector index organises these vectors so that similar images are easy to find, allowing for efficient retrieval of data. We will explore appropriate vector indexing techniques in just a moment but first, we must understand the importance of vector search.

4. What is Vector Search?

What use are vector databases if we can’t retrieve information from them? That’s where vector search comes in! Vector search finds and retrieves similar objects from a vector database by looking for objects that are close together in the vector space. It is also called similarity search or semantic search.

Take the example of ‘King’ and ‘Queen’: both sit close together in the vector space because their embeddings capture the words’ semantic meaning.

Figure 2: Vector representation of the words ‘King’ and ‘Queen’ in two dimensions.

This feature extends naturally to search queries. For example, to find words similar to ‘Banana’, all we do is generate a vector embedding for the query and retrieve its nearest neighbors. The figure below illustrates this: the nearest neighbor is the word ‘Apple’.

Figure 3: Illustration of generating a vector embedding for a query (’Banana’) and retrieving its nearest neighbor (’Apple’).

Awesome! But how do we actually compute nearest neighbors?

The first thing to consider is similarity measures; these take two vector embeddings and compute the distance between them. There are several ways to do this, e.g. the following, all supported by Marqo (a short NumPy sketch implementing them follows the list):

  • Euclidean: Calculates the straight-line distance between two vectors. The range is \(d \in [0, \infty)\). A value of 0 indicates identical vectors, and increasingly large values represent increasingly dissimilar vectors.
  • Angular: Also known as cosine similarity, this measures the angle between two vectors. The range is \(\cos(\theta) \in [-1, 1]\), where a value of 1 implies identical direction, 0 indicates no similarity (orthogonal vectors), and -1 indicates maximum dissimilarity.
  • Dot Product: Calculates the dot product of two vectors, i.e. the product of their magnitudes and the cosine of the angle between them. The range is \((-\infty, \infty)\). A value of 0 represents orthogonal/perpendicular vectors, large positive values represent similar vectors, and negative values represent dissimilar vectors.
  • Hamming: Counts the number of dimensions at which two vectors differ.
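As promised, here is a minimal NumPy sketch of all four measures. The example vectors are arbitrary, and Marqo computes these internally, so this is purely illustrative:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 4.0])

# Euclidean: straight-line distance, range [0, inf); 0 means identical.
euclidean = np.linalg.norm(a - b)

# Angular / cosine similarity: range [-1, 1]; 1 means identical direction.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product: unnormalised similarity, range (-inf, inf).
dot = np.dot(a, b)

# Hamming: number of dimensions at which the two vectors differ.
hamming = int(np.sum(a != b))

print(euclidean, cosine, dot, hamming)
```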

Similarity measures are central to vector databases. Now that we can compute distances between vector embeddings, how do we find the nearest neighbors?

5. Approximate Nearest Neighbors

We’ve explored the idea of vector indexing and the different similarity measures we can compute between vectors. The simplest way to find the closest items to a given query is the k-Nearest Neighbors (kNN) algorithm [1]: compute the distance between your query and every vector in the vector space, and keep the k smallest. As you can imagine, with large datasets this gets computationally expensive.
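To make the cost concrete, here is a brute-force kNN sketch in NumPy over random stand-in embeddings. Every query must touch every one of the 10,000 vectors:

```python
import numpy as np

def knn(query: np.ndarray, database: np.ndarray, k: int = 3) -> np.ndarray:
    """Exact kNN: measure the query against *every* vector, O(n) per query."""
    distances = np.linalg.norm(database - query, axis=1)  # Euclidean to all vectors
    return np.argsort(distances)[:k]                      # indices of the k closest

database = np.random.rand(10_000, 384)  # 10k toy embeddings
query = np.random.rand(384)
print(knn(query, database))
```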

A solution to this is the Approximate Nearest Neighbor (ANN) approach [2]. Rather than computing exact distances to all points, ANN methods use probabilistic and heuristic techniques to quickly find points in the vector space that are close to a given query point. This significantly reduces computation time, though it can come at the cost of some accuracy.

In the previous example, you could pre-calculate some clusters in your vector space, e.g. ‘fruits’, ‘trees’, etc., as illustrated in the image below. Then, when you query the database for the word ‘Banana’, you begin your search in the ‘fruits’ cluster rather than comparing the query against every vector in the entire space. The algorithm essentially points you in the right direction, similar to heading straight to the ‘science-fiction’ section in a book store. It also keeps you among relevant results, since all objects have been organised by similarity. A sketch of this cluster-first idea follows the figure below.

Figure 4: Illustration of the clustering of similar words in a vector space and how vector indexing can allow for efficient retrieval.
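Here is a hedged sketch of that cluster-first idea using k-means (an IVF-style approach of our own choosing, purely for illustration; it is not Marqo’s actual index, which uses HNSW as mentioned below):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
database = rng.random((10_000, 64))  # toy embeddings

# Offline step: partition the space into clusters ('fruits', 'trees', ...).
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(database)

def ann_search(query: np.ndarray, k: int = 3) -> np.ndarray:
    cluster = kmeans.predict(query[None, :])[0]        # nearest centroid only
    members = np.where(kmeans.labels_ == cluster)[0]   # search inside that cluster
    distances = np.linalg.norm(database[members] - query, axis=1)
    return members[np.argsort(distances)[:k]]          # approximate neighbors

print(ann_search(rng.random(64)))
```

The speed-up comes from only comparing against the vectors inside one cluster; the accuracy trade-off comes from the chance that a true nearest neighbor sits just over the boundary in another cluster.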

The default ANN algorithm in Marqo is Hierarchical Navigable Small World (HNSW). For more information on this, check out Jesse, CTO of Marqo, giving a talk here.

6. Use Cases of Vector Search

There are a variety of use cases for vector search applications ranging from natural language processing (NLP) to image and audio recognition, recommendation systems and personalized content delivery. As you may have guessed from this article, the most popular use case is searching. Marqo, in particular, is used consistently in e-commerce. For example, Marqo was able to increase RedBubble’s average search revenue with the implementation of their search engine. Pretty cool stuff!

Here is an example image-search demo, built with Marqo, that performs multimodal search with weighted queries. The user inputs 'green shirt', and Marqo lets you specify what you want to see 'more of' or 'less of'; in the example below, the user asks for 'more of short sleeves'.

Figure 5: Using Marqo to perform vector search on images.
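As a rough sketch, a weighted query like the one in the demo might look like this with Marqo’s Python client. This assumes a Marqo instance already running locally with an existing image index; the index name here is hypothetical, and exact parameters may differ between client versions:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")

results = mq.index("ecommerce-images").search(  # hypothetical index name
    q={
        "green shirt": 1.0,    # the main query
        "short sleeves": 0.7,  # positive weight: 'more of'
        "buttons": -0.5,       # negative weight: 'less of'
    }
)
```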

Vector databases are also crucial for Large Language Models (LLMs) because they can provide them with long-term memory. On its own, an LLM retains nothing of a conversation once it ends (or once it falls outside the model’s context window), so storing that information, for example in a vector database, is what allows LLMs to have meaningful and consistent conversations with you.

A big problem with LLMs is their tendency to hallucinate, that is, to produce inaccurate or false information while presenting it as correct. Vector databases can help here through a process called Retrieval Augmented Generation (RAG): domain-specific context is stored as vectors, and the most relevant pieces are retrieved and handed to the LLM at query time, reducing hallucinations. Another awesome use case for vector databases!
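Here is a self-contained toy sketch of the RAG pattern. The passages, the random stand-in embeddings, and the `call_llm` placeholder are all our own illustrative stand-ins for a real vector database and a real LLM client:

```python
import numpy as np

# Toy knowledge base: in practice these passages would live in a vector
# database, and the vectors would come from a real embedding model.
passages = [
    "Marqo is an end-to-end vector search engine.",
    "HNSW is Marqo's default ANN algorithm.",
]
passage_vecs = np.random.rand(len(passages), 8)  # stand-in embeddings

def retrieve(query_vec: np.ndarray, k: int = 1) -> list[str]:
    # Cosine similarity between the query and every stored passage.
    sims = passage_vecs @ query_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [passages[i] for i in np.argsort(-sims)[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real LLM client here.
    return f"[answer grounded in]: {prompt}"

def rag_answer(question: str, query_vec: np.ndarray) -> str:
    # Prepend retrieved context so the LLM answers from stored facts.
    context = "\n".join(retrieve(query_vec))
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")

print(rag_answer("What ANN algorithm does Marqo use?", np.random.rand(8)))
```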

In the next article, we’ll be showing you how you can get set up using Marqo, an end-to-end vector search engine. This can be used for searching but can also be partnered with LLMs and used to perform RAG (as mentioned above) among many other things.

7. Summary

This article explained the fundamentals of vector databases and vector search. We saw how vector databases store information as vector embeddings: numerical representations that capture the semantic meaning of data. We covered key concepts such as vector indexing, which organises these embeddings for efficient retrieval, and vector search, which finds contextually related objects using similarity measures such as Euclidean distance. Finally, we discussed the k-Nearest Neighbors (kNN) algorithm for finding the closest items in a vector space and introduced the Approximate Nearest Neighbor (ANN) approach as a faster alternative for large datasets.

8. Want to Begin Building?

In the next article we’ll be showing you how you can get set up using Marqo, an end-to-end vector search engine. As always, if you need any help, join our community Slack where a member of our team will assist you!

9. References

[1] T. Cover and P. Hart, Nearest Neighbor Pattern Classification (1967)

[2] P. Indyk and R. Motwani, Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality (1998)