Using Neo4j Graph Database for Retrieval Augmented Generation (RAG)

In two previous posts (here and here) we covered how to install the Neo4j graph database for use with the Book Search Archive, our sample project built on the Mindfire Technology open-source AI stack. In this post I’ll cover the code needed to use Neo4j as a document store for Retrieval Augmented Generation (RAG).

You can find the version of the code at the time of this post here.

Adding Neo4j to Document Conversion Pipeline

Haystack’s Neo4j integration page discusses how to use the integration. Unlike the Pgvector integration with Haystack, the Neo4j integration was created and is maintained by Neo4j, and it requires a separate install:

pip install sentence-transformers # required in order to run pipeline examples given below
pip install neo4j-haystack

You can find the Neo4j documentation on the Haystack integration here.

Next, we’ll add the Neo4j document store to the document_processor.py file. This new helper function (nested inside the _initialize_document_store method) can switch between the pgvector and Neo4j document stores:

class DocumentStoreType(Enum):
    Pgvector = 1
    Neo4j = 2
…
def init_doc_store(force_recreate: bool = False) -> Union[PgvectorDocumentStore, Neo4jDocumentStore]:
    if self._document_store_type == DocumentStoreType.Pgvector:
        connection_token: Secret = Secret.from_token(self._postgres_connection_str)
        doc_store: PgvectorDocumentStore = PgvectorDocumentStore(
            connection_string=connection_token,
            table_name=self._table_name,
            embedding_dimension=self.embed_dims,
            vector_function="cosine_similarity",
            recreate_table=self._recreate_table or force_recreate,
            search_strategy="hnsw",
            hnsw_recreate_index_if_exists=True,
            hnsw_index_name=self._table_name + "_hnsw_index",
            keyword_index_name=self._table_name + "_keyword_index",
        )
        return doc_store
    elif self._document_store_type == DocumentStoreType.Neo4j:
        # https://haystack.deepset.ai/integrations/neo4j-document-store
        doc_store: Neo4jDocumentStore = Neo4jDocumentStore(
            url=self._neo4j_url,
            username=self._db_user_name,
            password=self._db_password,
            database=self._db_name,
            embedding_dim=self.embed_dims,
            embedding_field="embedding",
            index="document-embeddings",  # The name of the Vector Index in Neo4j
            node_label="Document",  # Providing a label to Neo4j nodes which store Documents
            recreate_index=self._recreate_table or force_recreate,
        )
        return doc_store
    else:
        raise ValueError(f"Unsupported document store type: {self._document_store_type}")

Now we just call this function, and the rest of the code stays mostly the same:

document_store: Union[PgvectorDocumentStore, Neo4jDocumentStore]
document_store = init_doc_store()

I had to make a few other small changes so the code can handle either type of document store, but otherwise little else changed.

Another thing to note is that the Neo4j document store (Neo4jDocumentStore) isn’t configured quite the same way as the PostgreSQL document store (PgvectorDocumentStore). With the pgvector document store I needed to build a connection string out of the login information; with Neo4j I pass the login credentials directly into the document store. Also note that Neo4j has no flag to recreate the database itself: instead you recreate the index via the recreate_index flag, which recreates the database as well.
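The difference can be sketched with a hypothetical helper (the function name and the exact URL scheme below are my assumptions for illustration, not code from the project): pgvector wants one libpq-style connection string, while Neo4jDocumentStore takes each credential as a separate argument.

```python
def build_postgres_connection_str(user: str, password: str,
                                  host: str, port: int, db_name: str) -> str:
    """Assemble the single connection string PgvectorDocumentStore expects.

    Neo4jDocumentStore needs no such step: url, username, password, and
    database are passed as individual constructor arguments instead.
    """
    return f"postgresql://{user}:{password}@{host}:{port}/{db_name}"


# Example (placeholder credentials):
conn_str = build_postgres_connection_str("postgres", "secret", "localhost", 5432, "book_archive")
print(conn_str)  # postgresql://postgres:secret@localhost:5432/book_archive
```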

Querying the Graph Database

After running the new code and loading the Neo4j database, we can then go to the Neo4j browser and query the graph we created.

Let’s start with this query:

MATCH (n:Document) RETURN count(n) AS totalEntities;

Image 1: Running the count query in the Neo4j browser.

For me, I’ve loaded 619 documents. Let’s create a query to view them as a graph:
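A standard way to pull the nodes into the Neo4j browser’s graph view is a query along these lines (the LIMIT value is arbitrary; adjust it to taste):

```cypher
MATCH (n:Document)
RETURN n
LIMIT 300;
```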

Image 2: The Document nodes viewed as a graph.

Not much of a graph, huh? It’s just a bunch of nodes, each of which is a document fragment. So, at this point this is no different than just using the PostgreSQL/pgvector database. We’ll go into detail on how to take advantage of a graph database in future posts.

But what if you want to get the actual data back? This query might work for you:

MATCH (n:Document) 
RETURN {
    id: n.id,
    properties: [key IN keys(n) WHERE key <> 'embedding' | {key: key, value: n[key]}]
} AS document 
LIMIT 25

Image 3: Document properties returned by the query.

This Cypher query returns every property of each node except the bulky embedding vector. So, it’s much like querying a table in a regular SQL relational database.

Adding Neo4j to the RAG Pipeline

Now we’re ready to add the graph database to the RAG pipeline. We use the same trick: a function that returns the appropriate document store, nearly identical to the one in the document conversion pipeline, but this time in rag_pipeline.py. After that we can add the nodes that use the Neo4j database to the RAG pipeline. This works the same way as before, with one new wrinkle: there is no Neo4j equivalent of the PgvectorKeywordRetriever component, so lexical searches must be disabled if we’re using Neo4j.

if (self._search_mode == SearchMode.LEXICAL or self._search_mode == SearchMode.HYBRID) \
        and self._document_store_type == DocumentStoreType.Pgvector:
    lex_retriever: RetrieverWrapper = RetrieverWrapper(
        PgvectorKeywordRetriever(document_store=self._document_store, top_k=self._retriever_top_k))
    rag_pipeline.add_component("lex_retriever", lex_retriever)
    rag_pipeline.connect("query_input.query", "lex_retriever.query")
    rag_pipeline.connect("lex_retriever.documents", "doc_query_collector.lexical_documents")

if self._search_mode == SearchMode.SEMANTIC or self._search_mode == SearchMode.HYBRID:
    semantic_retriever: RetrieverWrapper
    if self._document_store_type == DocumentStoreType.Neo4j:
        semantic_retriever = RetrieverWrapper(
            Neo4jEmbeddingRetriever(document_store=self._document_store, top_k=self._retriever_top_k))
    else:
        semantic_retriever = RetrieverWrapper(
            PgvectorEmbeddingRetriever(document_store=self._document_store, top_k=self._retriever_top_k))
    rag_pipeline.add_component("semantic_retriever", semantic_retriever)
    rag_pipeline.connect("query_embedder.embedding", "semantic_retriever.query")
    rag_pipeline.connect("semantic_retriever.documents", "doc_query_collector.semantic_documents")

Notice how I check for each type of document store we want to use and adjust the pipeline either way. I also changed the RetrieverWrapper to accept either kind of document store.
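A minimal sketch of what such a wrapper might look like is below. This is my assumption of the shape of the class, not the project’s actual implementation: it simply delegates to whichever retriever it was given, dispatching on whether the query is a text string (keyword retrieval) or an embedding vector (semantic retrieval).

```python
from typing import Any, Dict, List


class RetrieverWrapper:
    """Hypothetical sketch: wraps either a pgvector or Neo4j retriever behind
    one interface so the pipeline code doesn't care which store is in use."""

    def __init__(self, retriever: Any) -> None:
        # Accept any retriever exposing a Haystack-style run() method.
        self._retriever = retriever

    def run(self, query: Any) -> Dict[str, List[Any]]:
        # Keyword retrievers take a text query; embedding retrievers take a vector.
        if isinstance(query, str):
            result = self._retriever.run(query=query)
        else:
            result = self._retriever.run(query_embedding=query)
        return {"documents": result["documents"]}
```

Because both branches return the same `{"documents": [...]}` shape, the pipeline connections (`"lex_retriever.documents"`, `"semantic_retriever.documents"`) stay identical regardless of the underlying store.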

Conclusions

At this point we’re not really utilizing the power of a graph database. But we have successfully changed our code to handle either a PostgreSQL or Neo4j database as the document store. So, our open-source AI stack now includes the power of graph databases should we need that ability. This is exciting!
