Psycopg Tutorial: Using pgvector to do Retrieval Augmented Generation

If you’ve been following along with my tutorials, then you’ve already installed PostgreSQL and Pgvector as well as Haystack for Pgvector. If you haven’t done that yet, see the links below to do those first.

Introducing Psycopg

Recall how we installed Haystack integrated with Pgvector:

pip install pgvector-haystack

If you did this install, you automatically also installed the psycopg Python module. Psycopg is a Python module that lets you connect to a PostgreSQL database. As the Psycopg web site says:

“Psycopg is the most popular PostgreSQL adapter for the Python programming language. Its core is a complete implementation of the Python DB API 2.0 specifications. Several extensions allow access to many of the features offered by PostgreSQL.”

If you want to use Pgvector and PostgreSQL directly without Haystack, you can instead install Psycopg 3 by doing this:

pip install psycopg[binary]

You can find the documentation for Psycopg here, and its PyPI page here. See also this documentation. Alternatively, this documentation explains how to install Psycopg 2 if you prefer.

A Short Psycopg Tutorial

Haystack does most of the hard work for you, but sometimes you want to go ‘down to the metal’ and interact with PostgreSQL and Pgvector directly. That way, if Haystack isn’t handling things as you see fit, you can implement whatever you need using Pgvector and PostgreSQL yourself.

Inside the git repo for Pgvector (found here) there is an examples folder (found here). This tutorial is similar to the sentence_embedding.py example (found here). However, we’re going to put it all into a Python class that will allow us to re-use the class later to connect to PostgreSQL/Pgvector.

Let’s start with some imports:

from pgvector.psycopg import register_vector
import psycopg
from sentence_transformers import SentenceTransformer

Setting Up the PgvectorManager Class

Next, we’ll create a class to wrap our code into:

class PgvectorManager:
    def __init__(self,
                 dbname='postgres',
                 user='postgres',
                 password_file=r'D:\Documents\Secrets\postgres_password.txt',
                 model_name='all-MiniLM-L6-v2'):
        self.dbname = dbname
        self.user = user
        self.password = self.get_password(password_file)
        self.model_name = model_name
        # Connect to Postgres and create the vector extension
        if self.password is None:
            raise ValueError("Failed to retrieve the database password.")
        self.conn = psycopg.connect(dbname=self.dbname, autocommit=True, user=self.user, password=self.password)
        self.conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
        register_vector(self.conn)
        # Create the SentenceTransformer model
        self.model = SentenceTransformer(self.model_name)
        self.vector_size = self.model.get_sentence_embedding_dimension()
        self.texts = None
        self.embeddings = None

We’ve set up a class called PgvectorManager along with a constructor that lets us pass these parameters:

  • dbname: The name of your database. I set it to default to 'postgres'.
  • user: The user name for the login. I also set this to default to 'postgres'.
  • password_file: This is a file path to a text file where you store your password. You could set the password directly in your code, but that is bad practice because your password may escape into GitHub that way. So it is better to store the actual password in a file. I happen to use the path 'D:\Documents\Secrets\postgres_password.txt'.
  • model_name: The name of the Huggingface model we’ll use to embed our text. I defaulted it to 'all-MiniLM-L6-v2'.

You may want to note how we connect to the database. After retrieving the password from the file:

self.password = self.get_password(password_file)

We’ll see later what the function does. We then create a connection to our database like this:

self.conn = psycopg.connect(dbname=self.dbname, autocommit=True, user=self.user, password=self.password)

Then we make sure the Pgvector extension is activated, in case it isn’t already:

self.conn.execute('CREATE EXTENSION IF NOT EXISTS vector')

Then we register use of vectors via the connection:

register_vector(self.conn)

And finally, we set up the transformer model and grab the size of the vectors that the model produces:

self.model = SentenceTransformer(self.model_name)
self.vector_size = self.model.get_sentence_embedding_dimension()

How to Install Sentence Transformers Directly

If you already installed Haystack, you have already installed sentence-transformers. But if not, here is how to install it directly (as documented on the PyPI page for sentence-transformers as well as in the HuggingFace documentation on sentence-transformers):

pip install -U sentence-transformers

Writing Our Methods

We’ll need to define the get_password function:

    @staticmethod
    def get_password(password_file):
        try:
            with open(password_file, 'r') as file:
                password = file.read().strip()
        except FileNotFoundError:
            print(f"The file '{password_file}' does not exist.")
            password = None
        except Exception as e:
            print(f"An error occurred: {e}")
            password = None

        return password

We open the file, read in the password, and pass it back; if anything goes wrong, we print a message and return None instead.
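If you’d like to see that read-and-strip logic in isolation, here is a standalone sketch (read_password mirrors get_password; the temporary file is just a stand-in for your real password file):

```python
import os
import tempfile

def read_password(path):
    # Mirror of get_password: read the whole file and strip surrounding whitespace,
    # so a trailing newline in the password file doesn't end up in the password.
    try:
        with open(path, 'r') as f:
            return f.read().strip()
    except FileNotFoundError:
        print(f"The file '{path}' does not exist.")
        return None

# Write a throwaway password file to demonstrate.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('s3cret-password\n')
    tmp_path = tmp.name

print(read_password(tmp_path))       # -> s3cret-password (trailing newline stripped)
print(read_password('missing.txt'))  # prints a warning, then -> None
os.remove(tmp_path)
```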

Now let’s add several useful methods:

First a method to create the embeddings for our text using our chosen model (it works on the texts already stored on the instance, since that is how store_documents will call it):

def create_embeddings(self):
    self.embeddings = self.model.encode(self.texts)
    self.vector_size = self.embeddings.shape[1]

Next, a method to create the documents table we’ll be using:

def create_documents_table(self):
    self.conn.execute('DROP TABLE IF EXISTS documents')
    self.conn.execute('CREATE TABLE documents (id bigserial PRIMARY KEY, content text, '
                      'embedding vector('+str(self.vector_size)+'))')

Then a method that, given a list of texts (a list of strings), stores the texts and their embeddings:

def store_documents(self, texts):
    self.texts = texts
    self.create_embeddings()
    self.create_documents_table()
    self.insert_documents()

We’ll then need the method that inserts the documents into the table:

def insert_documents(self):
    if self.embeddings is None or len(self.texts) != len(self.embeddings):
        raise ValueError("Mismatch between texts and embeddings. Ensure embeddings are created before inserting.")
    for content, embedding in zip(self.texts, self.embeddings):
        self.conn.execute('INSERT INTO documents (content, embedding) VALUES (%s, %s)', (content, embedding))

Finally, we’ll create a method that finds the nearest neighbors by cosine distance (if you’ve created an HNSW index on the embedding column, Pgvector will use it; otherwise this performs an exact scan):

def find_neighbors(self, document_id=1):
    neighbors = self.conn.execute(
        'SELECT content FROM documents WHERE id != %(id)s ORDER BY embedding <=> '
        '(SELECT embedding FROM documents WHERE id = %(id)s) LIMIT 5',
        {'id': document_id}).fetchall()
    for neighbor in neighbors:
        print(neighbor[0])
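To see what that query is doing, here is a plain-Python analogue with made-up toy vectors standing in for real embeddings (cosine_distance mirrors the <=> operator): exclude the query row, order the rest by cosine distance, and keep the closest matches.

```python
import math

def cosine_distance(a, b):
    # pgvector's <=> operator: 1 minus the cosine similarity of the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

# Toy stand-ins for the documents table: id -> (content, embedding).
documents = {
    1: ('The dog is barking', [1.0, 0.1, 0.0]),
    2: ('The cat is purring', [0.2, 1.0, 0.0]),
    3: ('The bear is growling', [0.9, 0.3, 0.1]),
}

def find_neighbors(document_id, limit=5):
    # Same shape as the SQL: skip the query row, order by distance, keep the top `limit`.
    query_embedding = documents[document_id][1]
    ranked = sorted(
        (cosine_distance(query_embedding, emb), content)
        for doc_id, (content, emb) in documents.items()
        if doc_id != document_id
    )
    return [content for _, content in ranked[:limit]]

print(find_neighbors(1))  # -> ['The bear is growling', 'The cat is purring']
```

With these toy vectors, the ordering happens to match what the real embeddings produce later in this post.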

The Pgvector Similarity Functions

A bit of explanation might be helpful here. The “<=>” operator is one of the operators the Pgvector extension adds to PostgreSQL. It computes the cosine distance between two vectors; if an index such as HNSW exists on the column, PostgreSQL can use it to speed up the search.

Other operators include:

  • L2 Distance: <->
  • Negative Inner Product: <#>
  • L1 Distance: <+>

See the “Getting Started” section of the Pgvector documentation for more detail.
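These operators correspond to ordinary vector math. Here is a pure-Python illustration of what each one computes, on a pair of arbitrary example vectors:

```python
import math

a, b = [1.0, 2.0, 3.0], [2.0, 0.0, 1.0]

l2 = math.dist(a, b)                           # <->  L2 (Euclidean) distance
neg_inner = -sum(x * y for x, y in zip(a, b))  # <#>  negative inner product
cosine = 1 - sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))  # <=>  cosine distance
l1 = sum(abs(x - y) for x, y in zip(a, b))     # <+>  L1 (Manhattan) distance

print(l2, neg_inner, l1)  # -> 3.0 -5.0 5.0
print(round(cosine, 4))   # -> 0.4024
```

All four are distances, so in a query you always ORDER BY them ascending: the smaller the value, the closer the match.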

Now let’s bring the whole thing together with this function (which is NOT a method of the class):

def test_postgres_document_manager():
    model_name = 'sentence-transformers/all-mpnet-base-v2'  # 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
    manager = PgvectorManager(model_name=model_name)
    texts = [
        'The dog is barking',
        'The cat is purring',
        'The bear is growling'
    ]
    manager.store_documents(texts)
    manager.find_neighbors(1)

This function uses the class we’ve built. It first instantiates a PgvectorManager with the model I’ve selected (different from the default model) and then creates a list of texts to embed:

'The dog is barking',
'The cat is purring',
'The bear is growling'

It then stores the documents in the PostgreSQL database and finally finds the neighbors closest to the row with id = 1 (which will be ‘The dog is barking’). Go ahead and call the function like this:

if __name__ == '__main__':
    test_postgres_document_manager()

You can find the final version of the code in my GitHub repo here.

You will get back:

The bear is growling
The cat is purring

Why is this the result? Obviously ‘The dog is barking’ isn’t a neighbor to itself, so it isn’t returned. The cosine similarity ranked ‘The bear is growling’ as more similar to ‘The dog is barking’ than ‘The cat is purring’.

Checking the Results in pgAdmin4

Just to be sure this is all really working, let’s open pgAdmin4 and check what the ‘documents’ table looks like. You can look up how to run pgAdmin4, log in, and use the query tool in this post. Open the query tool and run the following query:

SELECT *
FROM documents;

Here is the result I get back:

The data output on pgAdmin4, which includes three results: The dog is barking, The cat is purring, The bear is growling.

It looks like we’re storing the text as well as the embedding vectors just like we expected.

So, there you have it. You now have a simple class that connects to PostgreSQL, activates Pgvector, and then allows you to store text and its embeddings all without using a tool like Haystack.

One note here that I think is important. If you’ll recall from the previous post, Haystack has no way to connect to PostgreSQL without setting an environment variable. I find that inconvenient. It is nicer to be able to log in using the approach I’ve laid out in this post. But, as noted, that doesn’t work in Haystack for now; the only approach that seems to work is to set the connection string in an environment variable.

Links:
