Installing Haystack for pgvector in Preparation for Retrieval Augmented Generation

Installing Haystack for pgvector in Preparation for Retrieval Augmented Generation

In our previous posts we have been preparing to do Retrieval Augmented Generation (RAG) with Large Language Models (LLMs). We first installed PostgreSQL then we installed pgvector – a PostgreSQL extension that adds a new vector field type that we will use with text embedding. We also discussed pgvector’s new index type, that of a Hierarchal Navigable Small Worlds (HNSW) Index, as explained in this post.

Installing Haystack for pgvector

We’re now ready to install Haystack for pgvector. As mentioned in our previous post, Haystack is a great tool that does more of the work for you when doing RAG using LLMs. It allows you to build a RAG pipeline that creates the queries for the LLM by first pulling back relevant documents and inserting it into the query to the LLM. This grounds the answer against the retrieved documents.

Assuming you’ve already installed PostgreSQL and pgvector using the past posts, we’re now ready to install Haystack’s pgvector module into your Python environment:

pip install pgvector-haystack

A Brief pgvector Haystack Demo

Now that we’ve installed Haystack, let’s do a quick demo of how it works to make sure that it is installed correctly.

The pgvector Haystack integration quick start tutorial (found here) explains how to setup a haystack database in PostgreSQL.

Setting Up a Connection String to the Haystack pgvector Database table

You must first setup an environment variable called PG_CONN_STR that contains a connection string for Haystack to use. To do that in Windows do the following:

Windows search, env is in the search bar and the best match of "Edit the system environment variables" is the first option. The first option is circled in red.

  1. Click the search icon
  2. Start typing “environment variables” until the right option comes up
  3. Click “Edit the system environment variables”

You will then see this modal:

System Properties window is open, near the bottom right the button "Environment Variables..." is circled in red.

Click the “Environmental Variables…” button and you’ll see this modal:

A different screen of the System Properties window. It has two sections, User variables for BNielson and System variables. Near the bottom right there are three buttons, New..., Edit..., and Delete.... New... is circled in red and is the left most button.

Click the “New…” button and you’ll see this dialog:

A popup window is shown, it is titled New System Variable. There are two text boxes, Variable name: and Variable value:. In Variable name: "PG_CONN_STR" is inputted.

Enter “PG_CONN_STR” for the variable name.

For the variable value we will create the connection string in this format:

PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME

Save this new environment variable. You may need to restart your Python environment or reboot your computer to have it detect the environment variable.

Initializing the pgvector Database Using Haystack

We’re now ready to initialize the pgvectorDocumentStore which will create a table in our PostgreSQL database in which to store our document embeddings in a vector.

You can find completed code in my git repo here.

Let’s first do some imports:

from haystack_integrations.document_stores.pgvector import pgvectorDocumentStore
from haystack import Document

The first imports the pgvectorDocumentStore that we’ll be using and the second imports the Document class that we’ll use to create fake documents to insert into our database. Now let’s now write a function to initialize the pgvectorDocumentStore:

def initialize_database(recreate_table=False):
    # Initialize the pgvectorDocumentStore
    ds = pgvectorDocumentStore(
        table_name="haystack_docs",
        embedding_dimension=768,
        vector_function="cosine_similarity",
        recreate_table=recreate_table,
        search_strategy="hnsw",
        hnsw_recreate_index_if_exists=True
    )
    if ds.count_documents() == 0:
        # Do loading of documents
        pass
    return ds

Let’s go over what this function does. It has one parameter of ‘recreate_table’ which defaults to False. This will tell the document store if we’re creating the PostgreSQL/pgvector table from scratch or not. Obviously, we don’t want to wipe out the data we’ve already loaded, so normally we want this set to False.

Next, we create the document store (‘ds’) by calling pgvectorDocumentStore which initializes the store. We pass the following parameters:

  • table_name: This is literally the name of the table that will be created inside of PostgreSQL
  • embedding_dimensions: This will be determined by what SentenceTransformer you end up using, but 768 is quite common, so for now we’ll use that.
  • vectorfunction: This specifies which vector function we’ll be using. For our purposes we’ll be using “cosinesimilarity” just like we did in our past posts (here and here)
  • recreate_table: True means wipe out any existing table and start from scratch. False means use the existing table with all existing documents in it, if any.
  • search_strategy: This specifies which type of search strategy or index type we’ll be using. We’ll specify “hnsw” which refers to hierarchy navigable small worlds as discussed in this post here.
  • hnswrecreateindexifexists: We’ll set this to True so that the HSNW index is recreated each time we initialize the document store. I’ve found that if I don’t do this I tend to get an error that things are out of date.

After we’ve setup the document store, this function then checks how many documents are in the document store. If the number is 0, then I’ve left a spot to load the data into the database. Finally, the document store (‘ds’) is returned for use by the rest of the program.

How To Pass a Connection String

It would be very convenient if you could just pass the connection string directly to Haystack’s PgvectorDocumentStore rather than having to store it as an environment variable. And there is even a parameter that allows you to pass the connection string.

But alas my testing shows that it doesn’t work as you’d expect. When I attempt to pass the connection string directly (as a string), I get the following error:

  File "D:\Documents\AI\LLMs\.venv\lib\site-packages\haystack_integrations\document_stores\pgvector\document_store.py", line 174, in _create_connection
    conn_str = self.connection_string.resolve_value() or ""
AttributeError: 'str' object has no attribute 'resolve_value'

Apparently Haystack expects you to pass type 'Secret' rather than a string. (i.e. The error above happens because it assumed you passed a variable of class ‘Secret’ rather than a string directly containing the connection string). But see documentation for the Secret class here. You can read up on secret management here.)

Here is where things get a bit confusing. The PgvectorDocumentStore class does have a parameter for a connection_string (as documented in the API reference for the PgvectorDocumentStore.) So it looks as if the Haystack developers intended to allow us to pass a connection string, but then the code assumes you're passing a Secret instead of a string. So how do you handle that?

Reviewing the Secret class, it is not immediately apparently how to pass a simple connection string to it. There is a method called 'from_token() that takes a string. But a connection string is not a token. So if this is the correct method (and as we'll see it is!) it's a misleading name!

I decided to try it out anyhow using code that look something like this:

connection_str: str = (f"postgresql://{postgres_user_name}:{postgres_password}@"
                        f"{self.postgres_host}:{postgres_port}/{postgres_db_name}")
connection_token: Secret = Secret.from_token(connection_str)

And surprisingly it worked great! So this is how to avoid an environment variable should you wish to simply pass the connection string instead. In a future post, I'll add that to my code. But for now I'm going to stick with the environment variable approach.

Storing Test Documents

Next, we need a way to test that we can write documents to the document store. Let’s create a function based on this tutorial found on the Haystack site in the documentation for the pgvectorDocumentStore:

def test_initialize_database():
    ds = initialize_database(recreate_table=True)
    ds.write_documents([
        Document(content="This is first", embedding=[0.1] * 768),
        Document(content="This is second", embedding=[0.3] * 768)
    ])
    print(ds.count_documents())

Finally let’s call testinitializedatabase to kick everything off:

 test_initialize_database()

Assuming you’ve properly installed PostgreSQL and pgvector as well as the pgvector version of Haystack (see links below if you need to know how to do this), you should see a ‘2’ print because it inserted 2 documents and the document store (‘ds’) returned a count of those documents (i.e. ‘count_documents()’).

To verify this really worked, go to the pgAdmin4 console (as explained in this post) and you should now see a new table called ‘haystack_docs’. Run a select on this table to check if the data has been properly inserted:

Select *
From haystack_docs

You should then see two rows with fake embeddings of all 0.1’s and 0.3s just like we inserted:

The app pgAdmin 4 is open and on the subdirectory of Databases (1) > postgres > Schemas (1) > public > Tables (3) > haystack_docs. haystack_docs is circled in red. On the right under Query, "select *" and second line "from haystack_device" are circled in red. Under Data Output, the second id is indicated with a red arrow.

Congratulations! You have now installed Haystack with the pgvector module installed. 

Links:

SHARE


comments powered by Disqus

Follow Us

Latest Posts

subscribe to our newsletter