Vector Database: A Beginner's Guide

What is a vector?

In mathematics and physics, a vector is a quantity that has both magnitude and direction. It is often represented as an arrow where the length of the arrow denotes the magnitude, and the direction in which the arrow points indicates the direction of the vector. Vectors are used to represent physical quantities such as force, velocity, and displacement, which require both magnitude and direction for a complete description.

In computer programming, the term “vector” can have a different meaning. It generally refers to a one-dimensional, dynamic array or a list data structure that can grow or shrink in size and provides random access to its elements. The elements of a vector in programming are typically of the same data type and can be accessed using an index.

In the context of vector database technology, a “vector” refers to a data representation where items such as text, images, audio, or other complex data types are converted into numerical vector embeddings. These embeddings are high-dimensional vectors that capture the essential features of the data items, allowing them to be processed and compared in a mathematical space.

What is the meaning of storing data in vectors?

Storing data in vectors refers to the representation of data in a format that consists of an ordered set of numbers, which can be thought of as points in a multi-dimensional space. Each number in the vector represents a dimension or a feature of the data point. This form of data representation is particularly useful for various computational and analytical tasks, as it allows for the application of mathematical and statistical operations that can help in understanding, processing, and deriving insights from the data.

In a nutshell, storing data in vectors is a method of representing data that facilitates efficient storage, manipulation, and analysis, especially in fields such as text mining, signal processing, and machine learning. It enables the application of algorithms that can detect patterns, similarities, and structures within the data, which is essential for tasks like clustering, denoising, and semantic analysis.

An example of vector representation of data from a real-world use case is in the field of image classification.

For instance, in handwritten character recognition, an image of a handwritten digit can be converted into a vector by flattening the pixel matrix into a one-dimensional array where each element represents the intensity of a pixel. 

Image credit: Himanshu Beniwal

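As a rough sketch of this flattening step (a toy 4x4 "image" in NumPy, standing in for a real 28x28 digit such as those in MNIST):

```python
import numpy as np

# A toy 4x4 grayscale "image" (0 = background, 255 = ink). Real handwritten-digit
# datasets such as MNIST use 28x28 pixels, which flatten to 784-dimensional vectors.
pixel_matrix = np.array([
    [0,   0, 255,   0],
    [0, 255, 255,   0],
    [0,   0, 255,   0],
    [0,   0, 255,   0],
], dtype=np.float32)

# Flatten the 2-D pixel matrix into a 1-D vector: one element per pixel intensity.
vector = pixel_matrix.flatten()

print(vector.shape)  # (16,)
print(vector)
```
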
Next, we will try to understand vector databases.

Vector databases, also known as VecDBs, are specialized database systems designed to efficiently store, retrieve, and manage high-dimensional vector representations. These vectors are often used in applications involving machine learning, particularly with large language models (LLMs), where they represent data points in a multi-dimensional space. The core concept behind vector databases is to handle the complexity and size of vector data that traditional databases may struggle with.

The core features of a vector database are:

  • Vector Embeddings:  Vector embeddings are typically generated using machine learning models, such as neural networks, which are trained to transform raw data into a vector space where similar items are represented by vectors that are close to each other. This transformation is based on the features extracted from the data, which could be words in a sentence, pixels in an image, or any other measurable attribute of the data.
  • High-Dimensional Space : The vectors in a vector database are high-dimensional, meaning they may have hundreds or even thousands of dimensions. Each dimension represents a feature or a combination of features that are learned from the data during the training of the machine learning model.
  • Similarity Search: Vector databases enable similarity searches, where the goal is to find data points that are closest (most similar) to a given query point in the vector space. This is often done using Approximate Nearest Neighbor (ANN) search algorithms.
  • Indexing: To facilitate fast retrieval, vector databases use indexing techniques such as Product Quantization (PQ), Locality-Sensitive Hashing (LSH), and Hierarchical Navigable Small World (HNSW) graphs.
  • Scalability: Vector databases are designed to handle large datasets efficiently, often using distributed systems to scale horizontally.

Diving Deep into the core operations of Vector DB:

Every database has two core operations: (a) storage and (b) retrieval. For a vector DB, we need to understand how these two core operations are designed in principle; then we can explore how different vendors implement them in their respective vector DBs.

Storage:

To understand the storage process of a vector DB, we need to be familiar with the term "vector embeddings". Let's try to define it first.

Suppose we want to compare two sentences to know whether they are meaningfully the same or not.

For example:

Sentence 1 : David goes to school everyday which is 5 miles away from his house

Sentence 2 : David’s school is 5 miles away where he goes everyday in the morning.

The next question is: how does a computer know that these two sentences are semantically (meaningfully) the same?

The answer is that we use an embedding model, which takes this kind of data and generates embeddings for each word.

We have just introduced a new term, "embeddings", but what exactly does it mean?

In the context of machine learning, embeddings are a way of representing data as a set of numerical values. These numerical values are often referred to as a “vector” because they can be thought of as a series of numbers arranged in a specific order. The purpose of embeddings is to capture the semantic meaning of the data in a way that can be used by machine learning algorithms.

For example, in natural language processing, word embeddings are used to represent words as vectors in a high-dimensional space.

Now that we have a basic understanding of embeddings, let's get back to the embedding model. What is this model, how is it built, and what is the purpose of building it?

Embedding models are neural-network-based models trained on large sets of training data made up of pairs of inputs and labeled outputs. With each training cycle, the network adjusts the weights in each layer with the goal of predicting the right output for the given input.

We will not go into the details of neural networks in this article, but embeddings are typically the outputs of the second-to-last layer of these networks.

Once trained, the embedding model transforms raw data into embeddings (numbers, i.e., float values) and knows where to place new data in the vector space.

Credit: Jay Alammar

The vectors are visualized using colors and arranged next to each word

Here the vectors for Girl and Boy look similar, and the vectors for King and Queen look similar, but the vector for Water looks very different from the others.

Word2vec is one of the embedding models used to convert words into embeddings.

Because these embeddings are points in a multi-dimensional vector space, they are called vector embeddings.

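Putting the pieces together, here is a minimal sketch of how the two example sentences from earlier could be embedded and compared, assuming the sentence-transformers library is installed (the model name is just one common choice):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumes this library is installed

# Load a pretrained embedding model (the model name is an example choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "David goes to school everyday which is 5 miles away from his house",
    "David's school is 5 miles away where he goes everyday in the morning.",
]

# Transform raw text into fixed-length float vectors (the embeddings).
embeddings = model.encode(sentences)

# Cosine similarity: values close to 1.0 mean the sentences are semantically similar.
a, b = embeddings
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cosine:.3f}")
```
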
Once we have embeddings of the data, we use a vector DB to store them. Now let's dive into how this process works and what it takes to store the data.

To answer this question, let's look at an example to see what kind of data and relationships we are talking about here.

Suppose we have geospatial data for a candy shop, a grocery store, and a person. This data is stored as vector embeddings, and we can then measure how close any two vectors are by calculating the distance between them. Data stored this way supports use cases such as finding which objects are close to each other geographically, for example finding the shop nearest to the person.

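A toy sketch of that idea, using made-up 2-D coordinates as the "embeddings" (real embeddings would have many more dimensions, but the distance logic is the same):

```python
import numpy as np

# Toy 2-D "embeddings": here simply (x, y) coordinates (the values are made up).
locations = {
    "candy_shop":    np.array([1.0, 3.0]),
    "grocery_store": np.array([4.0, 1.0]),
    "person":        np.array([2.0, 2.0]),
}

# Euclidean distance from the person to each shop; the smallest distance wins.
person = locations["person"]
for name in ("candy_shop", "grocery_store"):
    dist = np.linalg.norm(locations[name] - person)
    print(f"{name}: {dist:.2f}")

nearest = min(("candy_shop", "grocery_store"),
              key=lambda n: np.linalg.norm(locations[n] - person))
print("nearest shop:", nearest)
```
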
Retrieval:

Next, we look into two key features of a vector DB when it comes to retrieving embedding data:

  1. Vector Search
  2. Indexing Techniques

Vector Search:

In vector search, we have a query vector (the user's query in plain text converted into vector form) and the rest of the data represented as vectors in the same multi-dimensional space.

Given this, we want to search for data in the vector space, and we do that using similarity search.

But why can't we use exact search, as we do in relational databases? Here is a scenario to explain it:

Navigating through datasets to locate items that match specific criteria is a routine task in areas like e-commerce platforms, digital libraries, and content management systems. Identifying items that meet exact numerical thresholds is relatively simple through standard query syntax in structured data environments. For instance, selecting books in a library’s database that fall within a certain price range.

However, challenges arise when attempting to discern items that align with a user’s ambiguous or broadly defined search queries. These queries often lack precision and can encompass a wide array of terms. For instance, a customer might search for a broad category such as “laptops,” a more defined query like “gaming laptops,” or even a highly specific model like “Dell XPS 13.” This variability necessitates advanced search algorithms capable of interpreting and matching a broad spectrum of user inquiries to the relevant items in the inventory.

Now that we understand the need for similarity search (we have already seen it in action in Google or any other internet search engine), let's go through some of the building blocks for doing similarity search.

Distance Metrics:

Distance metrics are a means of evaluating the similarity of real-world objects represented as vector embeddings. They play a crucial role in vector search by quantifying the similarity or dissimilarity between vectors. Here are some of the key distance metrics used in vector similarity search:

  • Dot Product: Measures the similarity between two vectors by evaluating the sum of their coordinate products.
  • Cosine Similarity: Captures similarities based on orientation in high-dimensional spaces, commonly used in tasks like document retrieval and recommendation systems.
  • Euclidean Distance: Calculates the straight-line distance between two points in an n-dimensional space, representing the length of the line segment connecting the points.
  • Manhattan Distance: Also known as taxicab or city block distance, used when dealing with discrete or ordinal variables.
  • Hamming Distance: Measures the difference between two binary vectors, often used with binary data.

These distance metrics are foundational in vector search, influencing various applications such as image or text similarity search, recommendation systems, clustering, anomaly detection, and more. The choice of distance metric depends on the specific problem, model requirements, and computational resources, ensuring that the metric aligns with the data's characteristics and the objectives of the analysis.

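For concreteness, here is how these metrics look in NumPy on small made-up vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

dot = np.dot(a, b)                                      # dot product
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity
euclidean = np.linalg.norm(a - b)                       # straight-line distance
manhattan = np.sum(np.abs(a - b))                       # city-block distance

# Hamming distance applies to binary vectors: the count of positions that differ.
x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 0, 1])
hamming = np.sum(x != y)

print(dot, cosine, euclidean, manhattan, hamming)
```
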
Search:

So far we have vector embeddings, which represent our data for a given domain (text, images, etc.), and a set of distance metrics. Now we will look into search algorithms.

There are a few search algorithms commonly employed for vector search.

K-Nearest Neighbor (KNN):

  • Description: KNN is a straightforward algorithm that finds the k exact nearest neighbors of a given query point by calculating the distance between the query point and all other points in the dataset.
  • Accuracy: KNN provides exact results, ensuring the selection of the k nearest neighbors based on precise similarity.
  • Computational Cost: As the dataset grows larger, KNN becomes computationally expensive since it requires calculating distances between the query vector and all vectors in the dataset.
  • Scalability: The performance of KNN degrades as the dataset size increases, making it impractical for very large datasets.
  • Use Cases: Common use cases for KNN include relevance ranking based on natural language processing (NLP) algorithms, product recommendations, and similarity search for images or videos.

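A minimal brute-force KNN sketch in NumPy (random vectors stand in for real embeddings); the distance computation against every stored vector is exactly what becomes expensive as the dataset grows:

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.normal(size=(10_000, 128))   # 10k vectors, 128 dimensions (random stand-ins)
query = rng.normal(size=128)

k = 5
# Exact KNN: compute the distance from the query to every vector, then take the k smallest.
distances = np.linalg.norm(dataset - query, axis=1)
nearest_ids = np.argsort(distances)[:k]

print(nearest_ids, distances[nearest_ids])
```
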
Approximate Nearest Neighbor (ANN):

  • Description: ANN algorithms aim to find approximate nearest neighbors without examining every vector in the dataset. They use techniques like space partitioning and hyperplane division to efficiently find approximate nearest neighbors.
  • Efficiency: ANN algorithms can significantly reduce computational time compared to KNN by employing techniques like data indexing, data partitioning, and pruning.
  • Scalability: ANN methods can handle large-scale datasets efficiently, allowing similarity search in high-dimensional spaces.
  • Approximation Errors: ANN algorithms sacrifice some accuracy in favor of improved efficiency. The results may not always be perfectly accurate.
  • Configuration Sensitivity: The performance of ANN algorithms can be sensitive to various parameters, requiring careful tuning.
  • Use Cases: ANN algorithms are more suitable for larger datasets or high-dimensional spaces where computational efficiency is crucial. They are effective when efficiency and faster search times are priorities, even if some level of approximation is acceptable.

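As one concrete ANN example, here is a hedged sketch using the Annoy library (covered again below); the tree count and metric are arbitrary illustrative choices:

```python
import numpy as np
from annoy import AnnoyIndex  # assumes the annoy package is installed

dim = 128
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, dim))

# Build an approximate index: more trees generally means better recall but a larger index.
index = AnnoyIndex(dim, "euclidean")
for i, v in enumerate(vectors):
    index.add_item(i, v.tolist())
index.build(10)  # 10 trees

# Query: returns approximate nearest neighbours, trading a little accuracy for speed.
query = rng.normal(size=dim)
approx_ids = index.get_nns_by_vector(query.tolist(), 5)
print(approx_ids)
```
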
Next we will see the implementation of KNN and ANN in different types of indexing techniques.

Indexing Techniques:

To further improve search operations, vector DBs deploy various indexing techniques. Here is a quick introduction to the popular ones:

  • Brute-Force Indexing: This technique involves storing all vectors in a database and performing a linear scan to find the nearest neighbors for a given query vector. It is simple to implement but can be computationally expensive for large datasets.
  • Ball Tree Indexing: Ball trees are tree-based indexing structures that partition a space into balls. They are particularly useful for high-dimensional spaces and can efficiently find nearest neighbors. However, they can be memory-intensive and may not perform well with non-uniformly distributed data.
  • KD-Tree Indexing: KD-trees are a type of space partitioning tree that recursively splits the space into two parts based on a single feature. They are efficient for low-dimensional spaces and can handle non-uniformly distributed data. However, they may not be suitable for high-dimensional spaces and can suffer from the curse of dimensionality.
  • Hierarchical Navigable Small-World Graph (HNSW): HNSW is a graph-based indexing technique that uses a small-world graph to represent the data. It is efficient for finding nearest neighbors and can handle high-dimensional spaces. However, it can be memory-intensive and may not perform well with non-uniformly distributed data.
  • Annoy (Approximate Nearest Neighbors Oh Yeah): Annoy is a library for approximate nearest neighbor search. It uses a tree-based indexing structure and can handle high-dimensional spaces. It is efficient for finding nearest neighbors and can handle non-uniformly distributed data. However, it may not be suitable for very large datasets.
  • Locality-Sensitive Hashing (LSH): LSH is a technique that maps data points into a lower-dimensional space, preserving their locality. It is particularly useful for high-dimensional spaces and can efficiently find nearest neighbors. However, it may not perform well with non-uniformly distributed data.
  • Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) and t-SNE can be used to reduce the dimensionality of the data before indexing. This can improve the performance of indexing techniques and reduce memory requirements. However, it may not be suitable for all types of data and can lead to loss of information.
  • Vector Quantization: Techniques like k-means clustering and LSH-based quantization can be used to represent the data as a set of clusters. This can improve the performance of indexing techniques and reduce memory requirements. However, it may not be suitable for all types of data and can lead to loss of information.

Dive Deep into common Indexing Techniques:

To understand vector databases better, one needs a good understanding of the different types of indexing techniques used in vector search. We will cover four types here to see how they differ and where they apply:

Flat Index:

The first type of index to explore in our discussion is the most fundamental: the flat index.

These indexes are termed "flat" due to the straightforward manner in which vectors are stored within them; no alteration or compression is applied to the vectors as they are inputted. This direct approach ensures that there is no loss of data fidelity, resulting in the highest possible accuracy in search outcomes. Consequently, searches conducted using flat indexes yield impeccable search quality, as every query undergoes a thorough comparison without any approximation or clustering of vector data.

However, this fidelity comes at a significant expense in search efficiency. When employing flat indexes, the query vector, denoted as xq, is compared against every single vector stored in the index. This involves computing the distance between xq and each vector in its entirety, which can lead to substantial search durations, especially as the size of the index grows.

In summary, flat indexes stand out for their simplicity and accuracy in environments where precision is paramount. While they enable exact matching by conducting exhaustive searches through the entire dataset, the trade-off is increased computational load and slower response times, highlighting the balance between search precision and performance in vector databases.

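A small sketch of a flat index using the Faiss library, where IndexFlatL2 performs the exhaustive, exact comparison described above (the vectors are random placeholders):

```python
import numpy as np
import faiss  # assumes the faiss-cpu (or faiss-gpu) package is installed

d = 128
rng = np.random.default_rng(0)
xb = rng.normal(size=(10_000, d)).astype("float32")   # database vectors (placeholders)
xq = rng.normal(size=(1, d)).astype("float32")        # a single query vector

# Flat index: vectors are stored as-is, and every query is compared against all of them.
index = faiss.IndexFlatL2(d)
index.add(xb)

distances, ids = index.search(xq, 5)  # exact top-5 nearest neighbours
print(ids, distances)
```
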
Locality Sensitive Hashing:

Locality Sensitive Hashing (LSH) is a technique used in computer science for approximate nearest neighbor search and data clustering. It aims to reduce the dimensionality of high-dimensional data while preserving relative distances between items. LSH works by hashing similar input items into the same "buckets," allowing for efficient data clustering and nearest neighbor search. Unlike conventional hashing techniques that minimize hash collisions, LSH maximizes collisions to group similar items together.

LSH is designed to have hash functions that satisfy specific conditions:

  • If two points are close in a high-dimensional space, the probability of their hash values being the same is high.
  • If two points are far apart, the probability of their hash values being the same is low.
  • The time complexity to identify close objects is sub-linear, making the search process efficient.

One common application of LSH is in efficient approximate nearest neighbor search algorithms. By using LSH families with parameters like width and number of hash tables, the algorithm can quickly identify approximate nearest neighbors by hashing data points into buckets and retrieving points hashed into the same bucket as the query point.

LSH implementation involves generating hash values using methods like Min-wise independent permutations, Nilsimsa Hash, TLSH, and Random Projection (SimHash), among others. These methods aim to segment and hash data multiple times to identify candidate pairs or potential matches efficiently.

In summary, Locality Sensitive Hashing is a powerful technique in data processing that optimizes search processes by grouping similar items together based on hash values, enabling fast and efficient approximate nearest neighbor searches and data clustering.

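A minimal NumPy sketch of the random-projection (SimHash-style) flavour of LSH: each vector is hashed to a short bit string according to which side of a set of random hyperplanes it falls on, so nearby vectors tend to share a bucket:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 64, 8

# Random hyperplanes; the sign pattern of the projections is the hash (a bucket id).
planes = rng.normal(size=(n_planes, dim))

def lsh_hash(vector: np.ndarray) -> str:
    bits = (planes @ vector) > 0
    return "".join("1" if b else "0" for b in bits)

v = rng.normal(size=dim)
v_similar = v + rng.normal(scale=0.01, size=dim)   # a slightly perturbed copy
v_far = rng.normal(size=dim)                       # an unrelated vector

print(lsh_hash(v))
print(lsh_hash(v_similar))  # very likely the same bucket as v
print(lsh_hash(v_far))      # very likely a different bucket
```
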
Hierarchical Navigable Small World:

In the landscape of search algorithms, particularly for high-dimensional data, traditional methods like brute-force searches, tree-based structures (such as KD-trees), and hashing techniques (for example, Locality-Sensitive Hashing) have historically been the go-to solutions. However, these techniques often fall short when applied to vast, high-dimensional datasets due to their inability to scale efficiently. The computational complexity and resource demands escalate quickly as the size of the dataset increases, making these methods impractical for large-scale applications that involve extensive records and numerous dimensions.

The Navigable Small World (NSW) model presented a noteworthy advancement in addressing the scalability challenge posed by high-dimensional data. By leveraging a graph-based structure, NSW enhanced the scaling capabilities beyond what traditional methods could offer, facilitating more efficient searches across large datasets. Nonetheless, the NSW approach still exhibited polylogarithmic scaling characteristics, which, despite being an improvement, did not entirely overcome the limitations encountered in very large-scale scenarios. Search efficiency and scalability in datasets with a massive number of records and dimensions remained a significant concern.

Enter Hierarchical Navigable Small World (HNSW), a refined evolution of the NSW model that introduces a multi-layered hierarchy to the organization of data. This hierarchical structuring is akin to incorporating a "zoom out" feature in a digital map, where higher layers provide a macroscopic view of the data landscape, allowing for rapid navigation towards the region of interest before delving into more detailed searches at lower levels. HNSW significantly enhances search efficiency and scalability by enabling quicker approximation to the vicinity of the target query before executing more precise searches, thereby mitigating the limitations of polylogarithmic scaling and offering a robust solution for navigating through large, high-dimensional datasets.

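A hedged sketch of building and querying an HNSW index with the hnswlib library; parameter values such as M, ef_construction, and ef are illustrative, not recommendations:

```python
import numpy as np
import hnswlib  # assumes the hnswlib package is installed

dim = 128
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, dim)).astype("float32")

# Build the multi-layer HNSW graph; M controls graph connectivity,
# ef_construction controls build-time search depth.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
index.add_items(data, np.arange(data.shape[0]))

# ef controls the query-time trade-off between speed and recall.
index.set_ef(50)

query = rng.normal(size=(1, dim)).astype("float32")
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```
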
One real-world example of Hierarchical Navigable Small World (HNSW) implementation in online services is in recommendation systems for e-commerce platforms like Amazon. In this context, HNSW is utilized to enhance product recommendations for users based on their preferences and behavior. Here is how HNSW is applied and the benefits it provides to the service or product:

  • Real-World Example: E-commerce Recommendation Systems
    • Application: HNSW is integrated into the recommendation systems of e-commerce platforms to improve product recommendations for users.
    • Usage: The algorithm efficiently finds similar products based on user interactions, purchase history, and browsing behavior.
    • Benefits:
      • Personalized Recommendations: HNSW enables the system to deliver personalized recommendations, enhancing user engagement and satisfaction.
      • Fast Retrieval: The fast search capabilities of HNSW allow for quick retrieval of relevant items, improving the user experience by providing timely and accurate recommendations.
      • Dynamic Updates: HNSW supports real-time updates, allowing the system to adapt to changing user preferences and trends effectively.
  • How it Works:
    • Efficient Search: HNSW's fast search speeds and high recall rate (the proportion of relevant items that are successfully retrieved by a search algorithm out of the total number of relevant items in the dataset) ensure that users receive relevant recommendations quickly.
    • Scalability: The algorithm scales well with the size of the dataset, making it suitable for handling large product catalogs and diverse user preferences.
    • Personalization: By leveraging HNSW, e-commerce platforms can offer personalized recommendations tailored to each user's preferences and behavior.

In essence, HNSW graphs are developed by starting with NSW graphs and then segmenting them into several layers. As new layers are added, they progressively remove the intermediate links between vertices.

Inverted File Index (IVF):

Inverted file indexes are a pivotal element in vector databases, particularly in Approximate Nearest Neighbor (ANN) search algorithms. These indexes efficiently narrow down the search space by utilizing centroids and clustering techniques. Here are the key concepts related to inverted file indexes, focusing specifically on centroids:

Key concepts of inverted file indexes with centroids:

  1. Inverted File Index (IVF):
    1. Centroids: In the context of IVF, centroids are representative points that define clusters within the vector space. These centroids are computed using unsupervised clustering methods like k-means.
    2. Cluster Assignment: Each vector in the database is assigned to the partition with the closest centroid, creating clusters of similar vectors.
    3. Inverted Index: By correlating each centroid with a list of vectors in its cluster, an inverted index is created, enabling efficient retrieval of similar vectors.
  2. Efficiency and Speed:
    1. Partition-Based Indexing: IVF is a partition-based indexing strategy that enhances search efficiency by grouping vectors into clusters based on centroids.
    2. Rapid Search: By associating vectors with centroids and clusters, IVF accelerates the search process by focusing on relevant partitions rather than comparing every vector in the database.
  3. Memory and Storage:
    1. Memory Requirements: Storing IVF indexes involves memory considerations, requiring space for centroids, cluster assignments, and vector information.
    2. Storage Efficiency: IVF strikes a balance between memory usage and search speed, making it suitable for small to medium-sized datasets.

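A sketch of an IVF index with Faiss: training learns the centroids, each added vector goes into its nearest centroid's inverted list, and nprobe controls how many lists are scanned per query (all numbers here are illustrative):

```python
import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

d, nlist = 128, 100          # vector dimension and number of centroids/partitions
rng = np.random.default_rng(0)
xb = rng.normal(size=(10_000, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)                    # used to find the nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)   # learn the centroids (clustering step)
index.add(xb)     # assign each vector to its centroid's inverted list

index.nprobe = 8  # number of nearest partitions to scan per query
xq = rng.normal(size=(1, d)).astype("float32")
distances, ids = index.search(xq, 5)
print(ids, distances)
```
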
Vector Database Providers:

Let's check out the common indexing techniques used in some of the open-source and commercial vector DBs:

| S.No | Vector DB | Type        | Vector Index Type                             |
|------|-----------|-------------|-----------------------------------------------|
| 1    | Weaviate  | Open Source | Flat, HNSW                                    |
| 2    | Pinecone  | Commercial  | HNSW, PGA (Pinecone Graph Algorithm)          |
| 3    | Milvus    | Open Source | Flat, IVF_Flat, IVF_PQ, BIN_IVF, BIN_IVF_Flat |
| 4    | Qdrant    | Open Source | HNSW                                          |
| 5    | Faiss     | Open Source | HNSW, IVF, PQ, SQ, LSH                        |

Weaviate:

The vector indexing concepts used in Weaviate involve the utilization of vector databases to enhance the speed and efficiency of similarity searches. Weaviate employs vector indexing as a fundamental concept in its vector database, enabling semantic or context-based searches and the storage of large volumes of data without compromising performance. Vectors, which are arrays of elements capturing meaning from various data types like texts, images, and videos, play a crucial role in representing data objects in a multi-dimensional space. Weaviate supports two main vector index types: the flat index, suitable for small datasets, and the HNSW index, which is more complex but scales well for large datasets due to its logarithmic time complexity for queries.

The Hierarchical Navigable Small World (HNSW) algorithm is a key component in Weaviate, functioning as an index type that facilitates fast queries on multi-layered graphs. HNSW indexes in Weaviate offer rapid query capabilities, although rebuilding the index when adding new vectors can be resource-intensive. Additionally, Weaviate supports two types of indices: the approximate nearest neighbor (ANN) index and the inverted index. The ANN index serves vector-search queries, while the inverted index allows for filtering by properties and BM25 queries.

Pinecone:

Pinecone is a vector database that allows users to store and query vector embeddings for fast retrieval and similarity search. It is designed to be easy to use and operate, with a focus on performance and cost-efficiency at any scale. Pinecone uses purpose-built data structures and algorithms to ensure indexes stay up and running, always.

One of the algorithms used by Pinecone is random projection, which projects high-dimensional vectors to a lower-dimensional space while preserving their similarity. This allows for faster querying and indexing. Pinecone also uses Hierarchical Navigable Small World (HNSW) graphs, which are a further adaptation of navigable small world (NSW) graphs. HNSW-based ANN indexes consistently rank among the highest-performing indexes in vector similarity search.

Pinecone's own indexing algorithm, called PGA, is designed to handle the memory, compute, and scale requirements of real-world AI applications, which is a challenge for bolt-on vector indexes that are not natively designed for vector databases. Pinecone's indexing algorithms are optimized for performance and are designed to work seamlessly with the rest of the Pinecone platform.

Milvus:

The vector indexing concepts used in Milvus are crucial for efficient similarity searches. Milvus supports various types of vector indexes, including tree-based, graph-based, hash-based, and quantization-based indexes. These indexes are designed to efficiently query vectors similar to a target vector through Approximate Nearest Neighbor Search (ANNS) algorithms. ANNS focuses on finding neighbors of the target vector rather than returning the most accurate result, thus improving retrieval efficiency within an acceptable range.

Milvus provides different index types based on the data type of the embeddings. For floating-point embeddings, indexes like FLAT, IVF_FLAT, IVF_PQ, IVF_SQ8, HNSW, and SCANN are available for CPU-based searches, while GPU_IVF_FLAT and GPU_IVF_PQ are for GPU-based searches. On the other hand, for binary embeddings, indexes like BIN_FLAT and BIN_IVF_FLAT are supported.

When creating indexes in Milvus, users can specify parameters like the distance metric (e.g., cosine similarity, Euclidean distance) and index type to accelerate vector searches. Index building involves organizing data into segments, with each segment having its own index file. Indexing significantly enhances retrieval performance in Milvus.

Qdrant:

The indexing process in Qdrant is optimized for performance by using the HNSW algorithm, which is well-compatible with filtering and is one of the most accurate and fastest algorithms according to public benchmarks. HNSW limits the maximum degree of nodes on each layer of the graph to m and allows the use of the ef_construct and ef parameters to specify a search range.

Qdrant also supports sparse vector indexing, which is optimized for memory and search speed by utilizing an inverted index structure to store vectors for each non-zero dimension. The search mechanism uses the dot product to score vectors and implements optimizations to minimize the number of vectors scored, especially for dimensions with numerous vectors.

Faiss:

Faiss is a library rather than a full database, but it is a powerful tool for efficient similarity search and clustering of dense vectors, offering a range of indexing types that cater to various trade-offs in search time, quality, memory usage, and more. It primarily works as an in-memory solution rather than a standalone, managed DB service. Features of the Faiss library:

  • Basic Indexes: Faiss provides simple baseline indexes like exact search, with most indexing structures offering trade-offs in search time, quality, memory usage, training time, and more.
  • Composite Indexes: Faiss allows the creation of composite indexes by combining vector transformations and multiple indexing methods. These composite indexes can be tailored to specific needs, balancing factors like recall, latency, and memory usage effectively.
  • Indexing Methods: Faiss offers a variety of indexing methods such as Exact Search for L2, Hierarchical Navigable Small World (HNSW), Inverted File (IVF), Locality-Sensitive Hashing (LSH), Scalar Quantizer (SQ), and Product Quantizer (PQ), among others. Each method serves a specific purpose in optimizing search efficiency and accuracy.
  • GPU Implementation: Faiss boasts a state-of-the-art GPU implementation for the most relevant indexing methods, enhancing speed and performance in similarity searches. The GPU support allows for efficient processing of large datasets, making Faiss a versatile and scalable tool for similarity search tasks.
  • Memory Usage: Faiss focuses on methods that compress original vectors to scale efficiently to datasets of billions of vectors, optimizing memory usage. By compressing vectors, Faiss ensures that large datasets can be indexed without excessive memory requirements, making it suitable for handling massive amounts of data effectively.
  • Accuracy and Speed: Faiss emphasizes the trade-off between speed and accuracy, evaluating methods based on how well they match brute-force search results. The library aims to provide fast and accurate similarity search capabilities, crucial for various applications that rely on efficient nearest neighbor searches.

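As an example of the composite, compression-oriented indexes mentioned above, here is a hedged IVF+PQ sketch with Faiss; product quantization stores short codes instead of full vectors, trading some accuracy for a much smaller memory footprint (the parameters are illustrative):

```python
import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

d, nlist, m, nbits = 128, 100, 16, 8   # m sub-quantizers of nbits each (d must be divisible by m)
rng = np.random.default_rng(0)
xb = rng.normal(size=(20_000, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)   # learns both the IVF centroids and the PQ codebooks
index.add(xb)     # each vector is stored as a compact PQ code in its partition

index.nprobe = 8
xq = rng.normal(size=(1, d)).astype("float32")
distances, ids = index.search(xq, 5)
print(ids, distances)
```
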
This is not an exhaustive list of vector DBs. In the future, I will target an advanced tutorial on vector DBs focusing on types of use cases, the performance of different indexing types, how to choose a vector DB for a particular use case, and so on.

That's all on the vector DB introduction.

Did you like this blog?

Share your feedback below.
