Environment Setup for RAG using Python, Haystack, PostgreSQL, pgvector, and Hugging Face

This post is your one-stop shop for setting up a local environment to follow my blog posts on Retrieval Augmented Generation (RAG) using Haystack, PostgreSQL, and pgvector, along with Hugging Face for the open-source Large Language Model.

As I’ve written these blog posts about how to use Haystack with Pgvector and PostgreSQL (for use with RAG) I’ve tried to explain how I set up my environment. But over the course of so many blog posts, even I’ve started to get confused as to what I do and don’t have installed. This post will remedy that situation by carefully walking you through every step to get your environment set up.

To be sure I have this correct, I started with a fresh environment and recorded every step for both the CPU and GPU environments that will run my code. My environment is on Windows using the PyCharm IDE running Python 3.11. The operating system shouldn’t make a difference, so it shouldn’t matter that I’m on Windows.

Starting with a Fresh Environment

First, start with a fresh Python environment. In PyCharm I can do that as follows:

Select File -> Settings and you’ll get this modal:

A Python settings page in PyCharm with the Python Interpreter tab open. On the right top bar there is an Add Interpreter dropdown and "Add Local Interpreter..." is selected.

Click “Add Interpreter” and then select “Add Local Interpreter…” That will take you to a screen in PyCharm to allow you to start with a fresh environment. There will be an equivalent way to do this in all IDEs.

Next, you’ll need to upgrade your pip installer like this:

python.exe -m pip install --upgrade pip
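
If you want to confirm that your IDE really is pointing at the fresh environment, here is a minimal sketch you can run from it. It assumes the environment was created with venv or virtualenv (a conda environment will not be detected this way), and the function name in_virtual_env is just my own illustration.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the running interpreter belongs to a venv/virtualenv."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the Python install it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Interpreter path:", sys.executable)
    print("Inside a virtual environment:", in_virtual_env())
```

The interpreter path should sit inside the environment folder you just created, not inside your system-wide Python install.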

For GPU Users

GPU users need to install PyTorch directly in order to get the GPU version. CPU users can skip this step entirely because installing Haystack will also install the CPU version of PyTorch.

First, visit the PyTorch “Get Started” page:

https://pytorch.org/get-started/locally/

You should see a control on the screen that looks something like this:

A control screen with options "Stable (2.4.0)", "Windows", "Pip", "Python", and "CUDA 11.8" selected.

I’m using Windows with CUDA 11.4 installed. With CUDA 11.8 selected, this is the command it gave me to run:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

If you don’t know what version of CUDA you have, run this command at the CMD prompt:

nvcc --version

A command prompt window with the command "Nvcc --version" having been inputted. Various build versions are shown, including an 11.4 version for Cuda compilation tools.

Even though my CUDA version (11.4) wasn’t one of the options, I just picked the closest one (11.8) and it worked fine.

Once you have PyTorch installed for use with CUDA (for GPUs), you can move on to the rest of the install. If you are running on CPU only, skip this step and move on to the rest of the script.
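
To check whether the GPU build actually took, here is a minimal sketch that asks PyTorch what it can see. The helper name describe_torch_device is hypothetical; the PyTorch calls it uses (torch.cuda.is_available, torch.cuda.get_device_name, and torch.version.cuda) are standard ones. It imports PyTorch lazily so it still runs, and says so, if PyTorch isn't installed yet.

```python
import importlib.util


def describe_torch_device() -> str:
    """Summarize whether PyTorch is installed and whether it can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"

    import torch  # imported lazily so the script still runs without PyTorch

    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version the installed wheel was built for
        return (
            f"PyTorch {torch.__version__} (CUDA build {torch.version.cuda}), "
            f"GPU: {torch.cuda.get_device_name(0)}"
        )
    return f"PyTorch {torch.__version__} is installed, but only the CPU is available"


if __name__ == "__main__":
    print(describe_torch_device())
```

If you expected a GPU and this reports CPU only, the most likely cause is that the CPU wheel got installed, so rerun the pip command from the PyTorch page.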

Installing PostgreSQL and pgvector

Next, you’ll need to install PostgreSQL and the pgvector extension, which adds cosine similarity search and HNSW indexes to PostgreSQL. I have two very detailed posts on how to do that:

  1. Installing PostgreSQL: https://www.mindfiretechnology.com/blog/archive/installing-postgresql-in-preparation-for-retrieval-augmented-generation/
  2. Installing Pgvector: https://www.mindfiretechnology.com/blog/archive/installing-pgvector-in-preparation-for-retrieval-augmented-generation/

Installing Haystack for pgvector

Next, you’ll need to install the pgvector version of Haystack so that you have the pgvector modules available. I wrote a detailed blog post on how to do that here:

https://www.mindfiretechnology.com/blog/archive/installing-haystack-for-pgvector-in-preparation-for-retrieval-augmented-generation/

The actual install isn’t too hard as you just need to do the following:

pip install pgvector-haystack

However, check the post for how to either set up the PostgreSQL connection string as an environment variable or pass it directly via code.
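
As a rough illustration of the connection string, here is a minimal sketch that builds one and checks whether the environment variable is set. The variable name PG_CONN_STR is, as far as I know, the one the pgvector integration reads by default, but treat that as an assumption and confirm it against the linked post. The helper build_connection_string and its default values are purely illustrative.

```python
import os
from urllib.parse import quote


def build_connection_string(user, password, host="localhost", port=5432, database="postgres"):
    """Build a PostgreSQL connection URI, percent-encoding the credentials."""
    # Characters such as '@' or spaces in a password would otherwise break the URI.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{database}"


if __name__ == "__main__":
    # PG_CONN_STR is assumed to be the variable the integration looks for.
    if os.environ.get("PG_CONN_STR"):
        print("PG_CONN_STR is set")
    else:
        print("PG_CONN_STR is not set; an example value would look like:")
        print(build_connection_string("postgres", "your-password"))
```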

EPUB Related Installs

For my last post, I used EPUB files as the document format of choice, if only to show you how to load them, since Haystack doesn’t have a built-in component for EPUB-formatted documents. I discussed how to do this in my post on loading EPUB files into Haystack.

There are basically two installs you’ll need for this:

pip install bs4
pip install ebooklib

Hugging Face Installs

You’ll also need to install the Hugging Face Transformers package so that we can use a Hugging Face open-source Large Language Model. Here are the installs you’ll need:

pip install "transformers[torch]"
pip install sentence-transformers

You may find this Hugging Face documentation helpful if you are using a GPU, though the above instructions should work fine without this.

Other Installs

Unfortunately, installing Haystack and Hugging Face seems to miss some important installs. I found I also needed the following install to get my code to work:

pip install trafilatura
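
With the installs done, a quick way to see whether anything is still missing is to check that each package can be found by Python. Here is a minimal sketch that does so without actually importing the heavy libraries. The import names in REQUIRED_MODULES are my assumptions about what each pip package exposes (for example, sentence-transformers is imported as sentence_transformers), so adjust the list if your code imports something different.

```python
import importlib.util

# Import names assumed for the packages installed in this post.
REQUIRED_MODULES = [
    "haystack",
    "haystack_integrations",
    "bs4",
    "ebooklib",
    "transformers",
    "sentence_transformers",
    "trafilatura",
    "torch",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All expected modules were found")
```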

Specifying Versions

If you follow this script, you should be able to match my environment closely enough to run my code. However, please keep in mind that code evolves over time, and as new updates are released, my code may become outdated compared to the latest versions. Therefore, it is possible that you may still encounter issues with my code. I have experienced frustration numerous times while trying to figure out why a seemingly helpful demo fails to work properly.

To avoid this problem, you can pin each of the installs above to the exact versions I used when writing my code. Instead of running the install commands as given above, run them with a specific version like this:

pip install pgvector-haystack==0.5.1
pip install bs4==0.0.2
pip install ebooklib==0.18
pip install "transformers[torch]==4.43.2"
pip install sentence-transformers==3.0.1
pip install trafilatura==1.11.0
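
If you want to see how your environment compares to these pins, here is a minimal sketch that uses the standard library's importlib.metadata to look up what is installed. The PINNED dictionary simply repeats the versions listed above; the helper names are my own.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pinned install commands in this post.
PINNED = {
    "pgvector-haystack": "0.5.1",
    "bs4": "0.0.2",
    "ebooklib": "0.18",
    "transformers": "4.43.2",
    "sentence-transformers": "3.0.1",
    "trafilatura": "1.11.0",
}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def compare_to_pins(pins):
    """Map each package name to a (pinned, installed) version pair."""
    return {name: (wanted, installed_version(name)) for name, wanted in pins.items()}


if __name__ == "__main__":
    for name, (wanted, found) in compare_to_pins(PINNED).items():
        status = "OK" if found == wanted else "DIFFERS"
        print(f"{name}: pinned {wanted}, installed {found} [{status}]")
```

A mismatch doesn't necessarily mean my code will break, but it is the first thing to check if it does.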

Alternatively, I have requirements text files available for both the CPU and GPU setups.

You can use them like this:

pip install -r requirements-cpu.txt

Or:

pip install -r requirements-gpu.txt

And that’s it! You should now have a correctly set up environment to run not only the last few blog posts I did but also the upcoming ones that set up a RAG pipeline in Haystack.

Note: Installing directly from the requirements file will likely work for everything except PyTorch. If you are working with a CPU, it should be sufficient on its own. If you are using a GPU, you'll probably still need to do the PyTorch install as specified above.
