Pgvector: How to Turn PostgreSQL into Vector Database Effortlessly
If you're using PostgreSQL and haven't heard of pgvector yet, you're missing out on a game-changing extension. Designed to supercharge your PostgreSQL database, pgvector is the ultimate tool for efficient vector storage and similarity search. It's not just an add-on; it's a full-fledged solution that integrates seamlessly with PostgreSQL, enhancing its capabilities to new heights.
But why should you care? Because in today's data-driven world, the ability to efficiently store and query large sets of vectors is crucial for machine learning, recommendation systems, and many other applications. pgvector not only makes this possible but does it with a level of efficiency and speed that's hard to match.
Why pgvector is Essential for PostgreSQL Users
What is pgvector?
pgvector is a PostgreSQL extension that specializes in storing vectors and performing similarity searches within those vectors. Unlike traditional PostgreSQL data types, pgvector is optimized for high-dimensional data, making it ideal for machine learning models, image recognition, and natural language processing tasks.
Core Features of pgvector:
- Efficient Vector Storage: pgvector compresses high-dimensional vectors without losing data integrity. This is a big deal because high-dimensional data can be a storage nightmare.
- Multiple Distance Metrics: Whether you're working with Euclidean, Cosine, or Manhattan distances, pgvector has you covered. It supports a variety of distance metrics, giving you the flexibility to choose the best one for your specific use case.
- Seamless Integration: One of the best things about pgvector is how effortlessly it integrates with PostgreSQL. You don't have to juggle different databases; pgvector works as a native extension, allowing you to perform complex queries right within PostgreSQL.
Let's get down to the numbers. With pgvector, you can search 1 million OpenAI embeddings at approximately 1800 queries per second (QPS) with a 91% accuracy rate. If you're aiming for higher accuracy, you can achieve 670 QPS with a staggering 98% accuracy. These metrics are not just impressive; they're revolutionary.
pgvector's performance doesn't just stop at high QPS and accuracy. It's also incredibly scalable. Whether you're running it on an 8-core ARM with 32 GB of RAM or a 64-core ARM with 256 GB, pgvector scales beautifully, ensuring that you get the most out of your hardware resources.
How Fast is pgvector Really? (pgvector Benchmarks)
Breaking Down pgvector's Speed and Accuracy
If you're looking for speed and accuracy in vector searches, pgvector has got you covered. The latest version, pgvector 0.4.0, has been put through rigorous testing to measure its performance. Let's get into the details:
Benchmark Methodology
- Test Runner: A Python script was used for data upload, index creation, and query execution.
- Runtime: Each test ran for 30-40 minutes, covering various levels of workload to measure performance.
- Pre-warming RAM: Before each test, 10,000 to 50,000 "warm-up" queries were executed to optimize RAM utilization.
Hardware Used for Testing
- 2XL: 8-core ARM, 32 GB RAM
- 4XL: 16-core ARM, 64 GB RAM
- 8XL: 32-core ARM, 128 GB RAM
- 12XL: 48-core ARM, 192 GB RAM
- 16XL: 64-core ARM, 256 GB RAM
Dataset
The tests used the dbpedia-entities-openai-1M dataset, which contains 1 million embeddings with 1536 dimensions, generated with OpenAI from Wikipedia articles.
Performance Results
- With Probes Set to 10: pgvector achieved an accuracy@10 of 0.91 at 380 queries per second (QPS).
- With Probes Set to 40: pgvector traded some speed for accuracy, reaching an accuracy@10 of 0.98 at 140 QPS, almost the same accuracy as Qdrant.
Scaling the Database
The performance of pgvector scales predictably with the size of the instance. For example, a 4XL instance achieved an accuracy@10 of 0.98 and a QPS of 270 with probes set to 40. An 8XL instance achieved an accuracy@10 of 0.98 and a QPS of 470, surpassing Qdrant's results.
Benchmark Results of pgvector
On a 64-core, 256 GB server, pgvector achieved around 1800 QPS with a 0.91 accuracy. This is for version 0.4.0, and it's worth noting that newer versions are expected to show even better performance.
Why pgvector Wins Over Commercial Vector Databases
After diving into the impressive benchmarks of pgvector, you might be wondering how it stacks up against commercial vector database solutions like Milvus. Let's break it down:
Ease of Integration
- Milvus: Operates as a separate service, adding complexity to your tech stack.
- pgvector: Seamlessly integrates with PostgreSQL, simplifying your tech stack.
Data Synchronization
- Milvus: Lacks native integration, leading to data synchronization issues.
- pgvector: Offers native integration, eliminating synchronization hassles.
Performance Metrics
- Milvus: Struggles to match pgvector's performance.
- pgvector: Handles 1 million OpenAI embeddings at 1800 QPS with a 91% accuracy rate.
| Metric | Milvus | pgvector |
| --- | --- | --- |
| Integration Complexity | High | Low |
| Data Synchronization | Problematic | Seamless |
| Query Speed (QPS) | Lower | 1800 |
| Accuracy | Lower | 91% |
In summary, if you're looking for a streamlined, efficient, and high-performing solution for vector storage and search, pgvector is the clear winner.
Tips for Optimizing pgvector Performance
- Adjust Your Postgres Config: Make sure it aligns with your RAM and CPU cores.
- Pre-warm Your Database: Use the warm-up technique described earlier.
- Choose the Right Distance Function: If your vectors are normalized, prefer inner-product to L2 or Cosine distances.
- Tune the Lists Constant: Increasing this can accelerate your queries. For example, the tests with OpenAI embeddings used a lists constant of 2000 instead of the suggested 1000 (see the sketch below).
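Here is a minimal SQL sketch applying the last two tips. The `my_vectors` table and `vector_field` column follow the examples later in this article, and the exact values are workload-dependent assumptions:

```sql
-- Build an IVFFlat index with a larger lists constant (2000 rather than the
-- rule-of-thumb 1000 for ~1M rows); vector_ip_ops uses inner product, a good
-- fit for normalized vectors such as OpenAI embeddings
CREATE INDEX ON my_vectors USING ivfflat (vector_field vector_ip_ops) WITH (lists = 2000);

-- At query time, raise probes to trade speed for recall
-- (the benchmarks above used 10 and 40)
SET ivfflat.probes = 40;
```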
By looking at these detailed metrics and tips, it's clear that pgvector offers robust performance for vector storage and similarity search, making it a top choice for PostgreSQL users.
What Can You Do With pgvector?
How to Use pgvector as an Open Source Vector Database
Pgvector is not just an extension; it's a transformative tool that turns your PostgreSQL database into a powerful vector database. This is particularly useful for those who are looking to perform complex operations on high-dimensional data without having to switch to a specialized database. In this section, we will delve into how to set up and use pgvector as an open-source vector database.
Installation and Setup:
- Clone the Repository: Start by cloning the pgvector GitHub repository to your local machine.
```bash
git clone https://github.com/pgvector/pgvector.git
```
- Compile and Install: Navigate to the directory and compile the extension.
```bash
cd pgvector
make
make install
```
- Database Configuration: Log in to your PostgreSQL database and create the extension. Note that the extension is named `vector`, not `pgvector`:
```sql
CREATE EXTENSION vector;
```
- Table Creation: Create a table with a vector column to store your high-dimensional data.
```sql
CREATE TABLE my_vectors (id SERIAL PRIMARY KEY, vector_field VECTOR(128));
```
Basic Operations:
- Inserting Vectors: Insert data into the vector field. pgvector uses square brackets for vector literals:
```sql
INSERT INTO my_vectors (vector_field) VALUES ('[1.1, 2.2, 3.3, ..., 128.128]');
```
- Vector Search: Perform a similarity search using the vector field.
```sql
SELECT * FROM my_vectors ORDER BY vector_field <-> '[3.3, 2.2, 1.1, ..., 128.128]' LIMIT 10;
```
- Indexing: Create an index to speed up your similarity searches. IVFFlat indexes need an operator class that matches your distance function:
```sql
CREATE INDEX idx_vector_field ON my_vectors USING ivfflat (vector_field vector_l2_ops) WITH (lists = 100);
```
Advanced Features:
- Tuning Parameters: You can tune index parameters to balance speed and accuracy. The index type is chosen when the index is created, and per-session settings control search behavior (`ivfflat.probes` for IVFFlat indexes, `hnsw.ef_search` for the HNSW indexes added in pgvector 0.5.0):
```sql
SET ivfflat.probes = 10;
SET hnsw.ef_search = 64;
```
- Batch Operations: Pgvector supports batch operations for inserting and updating vectors, which can be particularly useful for machine learning applications; see the sketch after this list.
- Monitoring and Observability: Use PostgreSQL's built-in monitoring tools to keep an eye on the performance of your vector operations.
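As an illustration of batch inserts, here is a minimal Python sketch using `psycopg2` and the `pgvector` Python package; the connection string and the random stand-in data are assumptions:

```python
import numpy as np
import psycopg2
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=mydb")  # assumed connection string
register_vector(conn)  # lets psycopg2 adapt numpy arrays to the vector type

# 1,000 random 128-dimensional vectors as stand-in data
vectors = [np.random.rand(128) for _ in range(1000)]

with conn.cursor() as cur:
    # execute_values batches all rows into a single INSERT statement
    execute_values(
        cur,
        "INSERT INTO my_vectors (vector_field) VALUES %s",
        [(v,) for v in vectors],
    )
conn.commit()
```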
k-NN Search Capabilities of pgvector
What is k-NN Search?
k-NN (k-Nearest Neighbors) search is an algorithm used to find the "k" closest points to a given point in a multi-dimensional space. It's widely used in machine learning for classification and clustering.
How pgvector Enables k-NN:
- Data Type: pgvector introduces a new data type called `vector`, which can store multi-dimensional data.
- Operators: It provides operators like `<->` for Euclidean distance and `<=>` for cosine distance to calculate similarity between vectors.
- Indexing: You can create indexes on vector columns to speed up k-NN queries.
Here's a SQL function that performs a k-NN search using pgvector:
```sql
-- Returns the k nearest rows to query_vector by cosine distance (<=>)
CREATE OR REPLACE FUNCTION knn_search(query_vector vector, k int)
RETURNS TABLE(id INT, distance FLOAT)
LANGUAGE SQL STABLE AS $$
    SELECT t.id, query_vector <=> t.vector_column AS distance
    FROM your_table t
    ORDER BY distance ASC
    LIMIT k;
$$;
```
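A hypothetical call, assuming `your_table` holds 3-dimensional vectors in `vector_column`:

```sql
-- Find the 5 nearest neighbors of the query vector [1, 2, 3]
SELECT * FROM knn_search('[1, 2, 3]'::vector, 5);
```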
Store OpenAI Embeddings with pgvector
Why Store OpenAI Embeddings?

OpenAI embeddings are high-dimensional vectors that capture the semantic meaning of text. They are useful for tasks like text similarity, clustering, and classification.
How to Store OpenAI Embeddings with pgvector
- Create a Table: Create a PostgreSQL table with a column of type `vector`.
```sql
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding VECTOR(1536)
);
```
- Insert Data: Insert the OpenAI embeddings into the `embedding` column. Vector literals use square brackets:
```sql
INSERT INTO documents (content, embedding) VALUES ('some text', '[your 1536-dim vector]');
```
- Query: Use pgvector's operators to query the embeddings. `<=>` computes cosine distance, so this finds documents within a distance threshold of the query vector; for index-assisted searches, prefer `ORDER BY ... LIMIT`.
```sql
SELECT * FROM documents WHERE embedding <=> '[query vector]' < 0.5;
```
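Putting the pieces together, here is a minimal Python sketch that embeds a piece of text with the legacy OpenAI client and stores it in the table above; the model name and connection string are assumptions:

```python
import openai  # assumes OPENAI_API_KEY is set in the environment
import psycopg2

text = "some text"
# text-embedding-ada-002 produces the 1536-dimensional vectors used above
resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
embedding = resp["data"][0]["embedding"]

conn = psycopg2.connect("dbname=mydb")  # assumed connection string
with conn.cursor() as cur:
    # str(embedding) yields '[0.1, 0.2, ...]', which casts cleanly to vector
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        (text, str(embedding)),
    )
conn.commit()
```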
PGVector with Langchain: Advanced Text Search
Langchain and PGVector can be combined to create a powerful text search and retrieval system. Below is a guide on how to integrate PGVector with Langchain to perform advanced text search operations.
Pre-requisites
- Make sure you have installed the necessary packages (such as `langchain`, `openai`, and `psycopg2`) and set up your environment variables, including your OpenAI API key and Postgres connection details.
Step 1. Initialize Langchain Components
First, initialize the Langchain components: `OpenAIEmbeddings`, `TextLoader`, and `CharacterTextSplitter`.
```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader

loader = TextLoader("your_text_file.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
```
Step 2. Initialize PGVector

Next, initialize PGVector with the connection string to your Postgres database.
```python
from langchain.vectorstores.pgvector import PGVector

CONNECTION_STRING = "your_connection_string_here"
COLLECTION_NAME = "your_collection_name_here"

db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)
```
Step 3. Perform Similarity Search

You can now perform a similarity search using PGVector's `similarity_search_with_score` method.
```python
query = "Your search query here"
docs_with_score = db.similarity_search_with_score(query)
for doc, score in docs_with_score:
    print("Score: ", score)
    print(doc.page_content)
```
Step 4. Maximal Marginal Relevance Search

For more advanced search, you can use Maximal Marginal Relevance (MMR) to optimize for both similarity to the query and diversity among the selected documents.
```python
docs_with_score = db.max_marginal_relevance_search_with_score(query)
for doc, score in docs_with_score:
    print("Score: ", score)
    print(doc.page_content)
```
Step 5. Working with an Existing VectorStore

If you already have an existing vector store, you can initialize it directly and add documents to it.
```python
from langchain.docstore.document import Document

store = PGVector(
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
)
store.add_documents([Document(page_content="new_content_here")])
```
Step 6. Overriding an Existing VectorStore

If you need to update an existing collection, you can override it.
```python
db = PGVector.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True,
)
```
Step 7. Using the VectorStore as a Retriever

Finally, you can use the vector store as a retriever for more advanced operations.
```python
retriever = store.as_retriever()
print(retriever)
```
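As a usage sketch, the retriever's `get_relevant_documents` method returns the documents that match a query:

```python
relevant_docs = retriever.get_relevant_documents("Your search query here")
for doc in relevant_docs:
    print(doc.page_content)
```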
Building AI-Powered Search with Amazon SageMaker, Amazon RDS for PostgreSQL, and pgvector
In this section, we demonstrate how to build a product catalog similarity search solution using Amazon SageMaker and Amazon RDS for PostgreSQL with the pgvector extension. We will use a pre-trained Hugging Face model to generate document embeddings and store them in an RDS for PostgreSQL database. Then, we will utilize pgvector's similarity search capabilities to find items in the product catalog that best match a customer's search query.
Steps to Implement the Solution:
- Set Up a SageMaker Notebook Instance: Create a SageMaker notebook instance to run the Python code in a Jupyter notebook.
- Data Preparation: Translate item descriptions from German to English using Amazon Translate.
- Model Hosting: Deploy a pre-trained Hugging Face model to SageMaker to generate 384-dimensional vector embeddings for the product catalog.
- Data Storage: Connect to RDS for PostgreSQL and create a table to store the raw text and the text embeddings.
- Real-Time Inference: Use SageMaker to encode the query text into embeddings.
- Similarity Search: Perform a similarity search using pgvector in the RDS for PostgreSQL database.
Prerequisites:
- An AWS account with appropriate IAM permissions.
- Familiarity with AWS services like SageMaker, RDS, and CloudFormation.
Deployment: Use an AWS CloudFormation stack to deploy the solution, which will create all the necessary resources, including networking components, a SageMaker notebook instance, and an RDS for PostgreSQL instance.
Here are some key code snippets for implementing the solution:
- Data Ingestion: Use Amazon Translate to translate item descriptions from German to English.
```python
import boto3

translate = boto3.client(service_name='translate', use_ssl=True)
# j is an item description taken from the product catalog
result = translate.translate_text(Text=str(j), SourceLanguageCode="de", TargetLanguageCode="en")
```
- Model Hosting: Deploy a pre-trained Hugging Face model to SageMaker.
```python
from sagemaker.huggingface.model import HuggingFaceModel

# hub (the model configuration) and role (the IAM role) are defined earlier in the notebook
predictor = HuggingFaceModel(env=hub, role=role).deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```
- Data Storage: Create a table in RDS for PostgreSQL to store the embeddings.
```sql
CREATE TABLE IF NOT EXISTS products (
  id bigserial PRIMARY KEY,
  description text,
  descriptions_embeddings vector(384)
);
```
- Similarity Search: Perform a similarity search using pgvector.
```sql
SELECT id, url, description, descriptions_embeddings
FROM products
ORDER BY descriptions_embeddings <-> ARRAY[...];
```
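The real-time inference step can be sketched as follows; the exact payload and response format depend on the deployed model, so treat this as an assumption:

```python
# Encode a customer's search query into an embedding with the deployed predictor
query = "black leather handbag"
response = predictor.predict({"inputs": query})
```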
How to Use pgvector in Python
You can extend the functionality of `pgvector` in Python by adding custom methods or integrating it with other Python libraries. Below is an example of how you could extend `pgvector` to include a method for calculating the Euclidean distance between two vectors in Django.
Extending Django Functionality
First, let's create a custom manager for the `Item` model that includes a method for calculating the Euclidean distance.
```python
from django.db import models
from pgvector.django import VectorField, L2Distance

class ItemManager(models.Manager):
    def euclidean_distance(self, vector):
        # L2Distance maps to pgvector's <-> operator, which already returns the
        # true (not squared) Euclidean distance, so no square root is needed
        return self.annotate(distance=L2Distance('embedding', vector))

class Item(models.Model):
    embedding = VectorField(dimensions=3)

    objects = ItemManager()
```
Now you can use this custom manager in your Django views or shell:
```python
# Get items sorted by their Euclidean distance to the vector [3, 1, 2]
items = Item.objects.euclidean_distance([3, 1, 2]).order_by('distance')

# Print the items and their distances
for item in items:
    print(f"Item ID: {item.id}, Distance: {item.distance}")
```
Extending SQLAlchemy Functionality
Similarly, you can compute Euclidean distances in SQLAlchemy using the comparator methods that pgvector's `Vector` column type provides.
```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class Item(Base):
    __tablename__ = 'items'

    id = Column(Integer, primary_key=True, autoincrement=True)
    embedding = Column(Vector(3))

# Database setup
engine = create_engine('postgresql://localhost/mydatabase')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

# l2_distance maps to pgvector's <-> operator (true Euclidean distance)
distance_expr = Item.embedding.l2_distance([3, 1, 2]).label('distance')

# Get items and their Euclidean distance to the vector [3, 1, 2]
items_with_distance = session.query(Item, distance_expr).order_by(distance_expr).all()

# Print the items and their distances
for item, distance in items_with_distance:
    print(f"Item ID: {item.id}, Distance: {distance}")
```
In this example, an `id` column serves as the primary key, and the `l2_distance` comparator that pgvector's `Vector` type adds to the `embedding` column builds the distance expression directly in the query, so no custom query class is required.
You can then use this pattern anywhere in your SQLAlchemy code:

```python
# Get items and their Euclidean distance to the vector [3, 1, 2]
distance_expr = Item.embedding.l2_distance([3, 1, 2]).label('distance')
items = session.query(Item, distance_expr).order_by(distance_expr).all()

# Print the items and their distances
for item, distance in items:
    print(f"Item ID: {item.id}, Distance: {distance}")
```
Conclusion
Pgvector is a game-changer if you're using PostgreSQL databases. It's not just an add-on; it's like a turbocharger for your database. It lets you store and search through large sets of data really quickly and accurately. This is super useful for things like machine learning, where you need to sift through tons of data fast.
What's more, pgvector fits right into PostgreSQL, so you don't have to juggle multiple databases. It's also flexible, letting you choose how you want to measure the "distance" between data points. You can even extend its features if you're using Python frameworks like Django. In short, if you're into data and using PostgreSQL, you don't want to miss out on pgvector.
FAQ
What is the use of Pgvector?
Pgvector is an extension for PostgreSQL databases that specializes in storing and searching through large sets of vectors. It's particularly useful for applications that require quick and accurate similarity searches, such as machine learning models, recommendation systems, and natural language processing tasks.
What are the advantages of Pgvector?
Pgvector offers several benefits:
- Speed: It allows for fast queries, even when dealing with high-dimensional data.
- Efficiency: It compresses data without losing its integrity, saving storage space.
- Flexibility: It supports multiple distance metrics like Euclidean, Cosine, and Manhattan, allowing you to choose the best one for your needs.
- Seamless Integration: It integrates directly into PostgreSQL, so you don't have to manage multiple databases.
- Scalability: It performs well on various hardware configurations, ensuring optimal use of resources.
Is PostgreSQL a vector database?
No, PostgreSQL is not inherently a vector database. However, with the Pgvector extension, you can effectively turn PostgreSQL into a powerful vector database capable of storing and performing similarity searches on high-dimensional data.
How to use Pgvector in PostgreSQL?
To use Pgvector in PostgreSQL, you'll need to:
- Install the extension by running `CREATE EXTENSION vector;` in your PostgreSQL database.
- Create a table with a vector column, for example: `CREATE TABLE my_vectors (id SERIAL PRIMARY KEY, vector_field VECTOR(128));`
- Insert vectors into the table: `INSERT INTO my_vectors (vector_field) VALUES ('[1.1, 2.2, 3.3, ..., 128.128]');`
- Perform similarity searches using SQL queries, like: `SELECT * FROM my_vectors ORDER BY vector_field <-> '[3.3, 2.2, 1.1, ..., 128.128]' LIMIT 10;`