Getting Started

July 10, 2024


Vector search has emerged as a popular approach to how we, as humans, access and use information. Unlike traditional keyword-based search methods, vector search takes advantage of machine learning models to understand the underlying context within data. Vector search relies on vector databases: databases that store information as vectors. In this article, we will cover the fundamentals of vector databases and vector search. Specifically, we will look at concepts such as vector embeddings, vector indexes, search, similarity measures, and nearest neighbor methods. Let's first take a look at vector databases and why they're so important!

The clue is in the name: a vector database is a database that stores information as vectors. Vectors are numerical representations of data objects, also known as **vector embeddings**. The biggest advantage of a vector database is that it allows for precise and fast similarity search and retrieval of data. This is different to traditional methods that query databases based on exact matches; vector databases can be used to find the most similar or relevant data based on their contextual meaning (also known as semantic meaning). Vector databases index and store vector embeddings. Let’s take a look at what these are.

**Vector embeddings** are numerical representations of data, e.g., images, text, and audio. When these vectors are created, they capture the semantic meaning of the data. This, in turn, allows for better search and retrieval results. Representing data in this way is crucial, as it allows the data to be more easily understood by computer systems.
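To make the idea concrete, here is a minimal sketch of what an embedding function looks like from the outside: any input text becomes a fixed-length list of floats. The hashing scheme below is a toy stand-in invented for this example; real systems use neural encoder models instead.

```python
import math

def toy_embed(text, dim=8):
    """Toy text embedding: hash character bigrams into a fixed-size
    vector, then L2-normalise it. This is NOT a real encoder - it only
    illustrates the shape of the output: every input becomes a
    fixed-length list of floats."""
    vec = [0.0] * dim
    lowered = text.lower()
    for a, b in zip(lowered, lowered[1:]):
        vec[hash(a + b) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

emb = toy_embed("vector databases store embeddings")
print(len(emb))  # 8 - fixed dimensionality regardless of input length
```

Whatever the input, the output always has the same dimensionality, which is what lets a database compare any two items directly.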

The figure below illustrates the idea behind vector embeddings. We take some data, e.g., text or audio, and pass it through its respective encoder model, which produces vector embeddings. These then represent the input data in a numerical format.

Vector databases, as mentioned, store these vector embeddings. They take advantage of the mathematical properties of embeddings where similar items are stored together. This is where vector indexing becomes important.

Because these embeddings capture the contextual meaning behind words (and other forms of data), we can express queries and receive search results in a human-like way. This makes vector search engines a preferred approach, especially for applications that need to tolerate incorrect spelling. In the next article, we’ll look at building our own vector search engine using Marqo.

**Vector indexing** is the process of carefully and cleverly organizing vector embeddings to optimize the retrieval process. It uses specific (and advanced) algorithms to organize high-dimensional vectors in a neat and searchable manner, such that similar vectors are grouped together. Thus, vector indexing allows fast and precise similarity searches.

Indexing can be viewed much like the way books are stored in a library. We don’t go to a library and look through every book until we find the one we want; instead, we go to the specific section where the book is placed. This is similar to how indexing works in databases: it allows for efficient and fast retrieval of the data you want.

Let’s imagine we have a group of images and we generate a vector embedding for each image. The vector index organizes the vectors in such a way that similar images are easy to find, allowing for efficient retrieval of data. We will explore appropriate vector indexing techniques in just a moment, but first, we must understand the importance of **vector search**.

What use are vector databases if we can’t retrieve information from them? That’s where **vector search** comes in! Vector embeddings provide us with the tools to find and retrieve similar objects from a vector database by searching for objects that are close together in the vector space. This is called **vector search**. It can also be called **similarity search** or **semantic search**.

- **Improved Relevance**: Vector search can capture the context and nuances of natural language, leading to more accurate search results.
- **Multimodal Search**: It enables searching across different types of data (e.g., text and images) simultaneously, making it versatile for various applications.
- **Scalability**: Vector search is designed to handle large-scale datasets efficiently, making it suitable for big data applications.
- **Flexibility**: It allows for complex queries that can incorporate multiple factors and weights, providing more sophisticated search capabilities.

Vector embeddings allow us to find and retrieve similar objects from the vector database by searching for objects that are close together. Take the example of ‘King’ and ‘Queen’; both are close together in the vector space because the embeddings have understood their semantic meaning.

This feature can be extended to a search query. For example, we can find words similar to the word ‘Banana’. All we do is generate a vector embedding for this query (the **query vector**) and retrieve its nearest neighbors. The figure below illustrates this, where its nearest neighbor is the word ‘Apple’.
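A toy version of this lookup can be sketched in a few lines. The 2-D vectors below are hand-placed for illustration (real embeddings have hundreds of dimensions), chosen so that the fruit words sit near each other:

```python
import math

# Hypothetical 2-D embeddings, hand-placed so that the fruit words
# cluster together. Real encoders produce these automatically.
embeddings = {
    "Apple":  (0.9, 0.8),
    "Banana": (0.85, 0.75),
    "Car":    (-0.7, 0.2),
    "Train":  (-0.8, 0.3),
}

def nearest_neighbor(query_vec, store):
    """Return the stored item whose vector is closest to the query."""
    return min(store, key=lambda name: math.dist(query_vec, store[name]))

query = embeddings["Banana"]
others = {k: v for k, v in embeddings.items() if k != "Banana"}
print(nearest_neighbor(query, others))  # Apple
```

Because ‘Banana’ and ‘Apple’ were placed close together, the query vector for ‘Banana’ retrieves ‘Apple’ rather than the vehicle words.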

Awesome! But how do we actually compute nearest neighbors?

The first thing to consider is similarity measures; these take two vector embeddings and compute the distance between them. There are several ways to do this, and the following are all supported by Marqo:

- **Euclidean**: Calculates the straight-line distance between two vectors. The range is \(d \in [0, \infty)\): a value of 0 indicates identical vectors, and values tending upwards (towards infinity) represent increasingly dissimilar vectors.
- **Angular**: Also known as **cosine similarity**, this measures the angle between two vectors. The range is \(\cos(\theta) \in [-1, 1]\), where a value of 1 implies identical direction, 0 indicates no similarity, and -1 indicates maximum dissimilarity.
- **Dot Product**: Calculates the dot product of two vectors, which combines their magnitudes with the cosine of the angle between them. The range is \((-\infty, \infty)\): a value of 0 represents orthogonal (perpendicular) vectors, large positive values represent similar vectors, and negative values represent dissimilar vectors.
- **Hamming**: Counts the number of dimensions at which two vectors differ.

Similarity measures are a key ingredient of vector databases. Now that we can compute distances between vector embeddings, how do we find the nearest neighbors?

We’ve explored the idea of vector indexing and the different similarity measures we can use when dealing with vectors. The easiest way to find the closest items to a given query is the **k-Nearest-Neighbors (kNN) algorithm**. This involves computing the distance between your query and every vector in the vector space. As you can imagine, with large datasets this can get computationally expensive.
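Exact kNN is a one-liner over the whole collection, which is exactly why it scales poorly. A minimal sketch, using the same kind of hand-placed toy vectors as before:

```python
import math

def knn(query, vectors, k=2):
    """Exact k-nearest-neighbors: compute the distance from the query
    to EVERY stored vector, then keep the k closest. Cost is O(n) per
    query, which is what makes it expensive on large datasets."""
    scored = sorted(vectors.items(),
                    key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in scored[:k]]

# Hand-placed toy vectors for illustration.
store = {
    "Apple":  (0.9, 0.8),
    "Banana": (0.85, 0.75),
    "Car":    (-0.7, 0.2),
}
print(knn((0.86, 0.76), store, k=2))  # ['Banana', 'Apple']
```

Every query touches every stored vector; with millions of high-dimensional vectors, that linear scan becomes the bottleneck the next section addresses.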

A solution to this is the **Approximate Nearest Neighbor (ANN) approach**. This approach aims to quickly find points in a vector space that are close to a given query point by using probabilistic and heuristic methods rather than computing exact distances to all points. It significantly reduces computational time, though this can come at the cost of some accuracy.

In the previous example, you could pre-calculate some clusters in your vector space. These could be ‘fruits’, ‘trees’, etc., as illustrated in the image below. Then, when you query the database for the word ‘Banana’, you begin your search by looking at the fruits cluster rather than comparing the distance of every vector in the entire vector space. The algorithm essentially points you in the right direction to begin your search, similar to heading to the ‘science-fiction’ section in a book store. It also prevents you from deviating from relevant results, as all objects have been organised by similarity.
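The cluster idea above can be sketched in a few lines. This is a deliberately simplified, IVF-style illustration with hand-assigned clusters and hand-placed 2-D vectors; real systems learn the clusters (e.g., with k-means) or use graph-based algorithms like HNSW:

```python
import math

# Hand-assigned clusters of hand-placed toy vectors.
clusters = {
    "fruits": {"Apple": (0.9, 0.8), "Banana": (0.85, 0.75)},
    "trees":  {"Oak": (-0.7, 0.6), "Pine": (-0.8, 0.5)},
}

# Pre-compute one centroid per cluster (the mean of its members).
centroids = {
    name: tuple(sum(coords) / len(members)
                for coords in zip(*members.values()))
    for name, members in clusters.items()
}

def ann_search(query):
    # Step 1: find the nearest cluster centroid (cheap - one
    # comparison per cluster, not per vector).
    best = min(centroids, key=lambda c: math.dist(query, centroids[c]))
    # Step 2: scan only that cluster's members, not the whole space.
    members = clusters[best]
    return min(members, key=lambda m: math.dist(query, members[m]))

print(ann_search((0.87, 0.76)))  # Banana
```

Only the chosen cluster is scanned, so query cost depends on cluster size rather than the total number of vectors; the trade-off is that a true nearest neighbor sitting just across a cluster boundary can be missed.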

The default ANN algorithm in Marqo is Hierarchical Navigable Small World (HNSW). For more information on this, check out Jesse, CTO of Marqo, giving a talk here.

Traditional search engines, often known as keyword-based or text-based search engines, rely heavily on matching exact words or phrases within documents to the query terms entered by the user. This method utilizes indexing techniques, where documents are processed, and a searchable index of words and their locations is created. When a query is made, the search engine looks for documents containing the same keywords and ranks them based on factors like keyword frequency, position, and relevancy to the overall content. While this method is efficient for straightforward queries, it often struggles with understanding the context, meaning, or semantic relationships between words, leading to less effective results when dealing with nuanced or complex queries.

Vector search, on the other hand, leverages the power of machine learning and artificial intelligence to understand the meaning behind words and phrases by representing them as vectors in a high-dimensional space. As previously explored, this approach involves converting text into numerical representations (embeddings) that capture the semantic relationships between words. Vector search engines can thus compare the similarity between these vectors rather than relying on exact keyword matches. This allows for a deeper understanding of context, enabling the search engine to find relevant documents even if they don't contain the exact query terms but have related meanings. Consequently, vector search is particularly effective in handling complex queries, understanding user intent, and providing more accurate and contextually relevant results.
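To make the contrast concrete, here is a toy illustration of the keyword-matching limitation described above. The documents and query are invented for this sketch:

```python
docs = [
    "The monarch ruled the kingdom.",
    "Bananas are a yellow fruit.",
]

def keyword_search(query, documents):
    """Keyword search: return documents sharing at least one exact
    token with the query. No notion of meaning is involved."""
    query_tokens = set(query.lower().split())
    return [d for d in documents
            if query_tokens & set(d.lower().split())]

# "king" and "monarch" mean nearly the same thing, but exact token
# matching cannot see that, so the search comes back empty.
print(keyword_search("king", docs))  # []
```

A vector search engine would instead embed both the query and the documents, and ‘king’ would land close to ‘monarch’ in the vector space, so the first document would be returned.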

The main benefits of using vector search over other search methods are as follows:

- **Improved Relevance**: Provides results that better match user intent by understanding context and semantics.
- **Contextual Understanding**: Handles synonyms and nuanced queries more effectively by capturing word relationships.
- **Enhanced Precision**: Differentiates between similar-sounding words based on context, reducing ambiguity.
- **Handling Complex Queries**: Excels at managing long-tail and intricate queries with higher accuracy.
- **Personalization**: Offers a tailored search experience by integrating user data and preferences.

There are a variety of use cases for vector search applications, ranging from natural language processing (NLP) to image and audio recognition, recommendation systems, and personalized content delivery. As you may have guessed from this article, the most popular use case is **searching**. Marqo, in particular, is used extensively in e-commerce. For example, Marqo increased RedBubble’s average search revenue with the implementation of its search engine. Pretty cool stuff!

Here is an example image-search demo using Marqo designed to do multimodal search with weighted queries.

Vector databases are also crucial for Large Language Models (LLMs) as they provide them with long-term memory. An LLM forgets what you have just discussed as soon as it falls outside its context window, unless you store that information externally, for example in a vector database. This allows LLMs to have meaningful and consistent conversations with you.

A big problem with LLMs is their tendency to hallucinate: producing inaccurate or false information while presenting it as correct. Vector databases can be paired with LLMs through a process called Retrieval Augmented Generation (RAG). This stores domain-specific context in a vector database, which the LLM retrieves at query time to minimise hallucination. Another awesome use case for vector databases!
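The retrieval half of RAG can be sketched without any LLM at all: embed the question, pull the closest stored documents, and prepend them to the prompt. The `embed` function and the documents below are toy stand-ins invented for this sketch; a real pipeline would call a neural encoder and a vector database:

```python
def embed(text):
    """Placeholder embedding - a real RAG pipeline would call a
    neural encoder here. This toy version counts a few hand-picked
    keywords just to keep the example self-contained."""
    keywords = ["refund", "shipping", "warranty"]
    lowered = text.lower()
    return [lowered.count(k) for k in keywords]

# Domain-specific documents stored as (vector, text) pairs -
# the "vector database" of this sketch.
docs = [
    "Refunds are processed within 5 business days.",
    "Shipping is free on orders over $50.",
    "All products carry a 2-year warranty.",
]
store = [(embed(d), d) for d in docs]

def retrieve(query, k=1):
    """Return the k documents whose vectors best match the query
    (here scored by dot product, descending)."""
    q = embed(query)
    scored = sorted(store,
                    key=lambda pair: -sum(a * b for a, b in zip(q, pair[0])))
    return [text for _, text in scored[:k]]

def build_prompt(question):
    # Ground the LLM with retrieved context to reduce hallucination.
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?"))
```

The LLM then answers from the retrieved context rather than from its parametric memory alone, which is what reduces hallucination on domain-specific questions.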

This article explained the fundamentals of vector databases and vector search. We looked at how vector databases store information as vector embeddings, which are numerical representations that capture the semantic meaning of data. We covered key concepts such as vector indexing, which organizes these embeddings for efficient retrieval, and vector search, which finds contextually related objects using similarity measures like Euclidean distance. Additionally, we discussed the k-Nearest Neighbors (kNN) algorithm for finding the closest items in a vector space and introduced the Approximate Nearest Neighbor (ANN) approach as a faster alternative for large datasets.

Visit this article where we show you how you can get set up using Marqo, an end-to-end vector search engine. As always, if you need any help, join our Community where a member of our team will assist you.

Ellie Sleightholm

Head of Software Developer Relations at Marqo



For media inquiries, please email press@marqo.ai

© Copyright Marqo.ai 2024. All Rights Reserved.