
Using Neo4j Graph Database for Retrieval Augmented Generation (RAG)
- By Bruce Nielson
- ML & AI Specialist
In two previous posts (here and here) we talked about how to install Neo4j Graph Database for use with the Book Search Archive, our sample project using the Mindfire Technology open-source AI stack. In this post I’ll cover the code necessary to use Neo4j as a document store for Retrieval Augmented Generation (RAG).
You can find the version of the code at the time of this post here.
Adding Neo4j to Document Conversion Pipeline
Haystack’s Neo4j integration page discusses how to use the integration. Unlike the Pgvector integration with Haystack, the Neo4j integration was created and maintained by Neo4j and requires a separate install.
pip install sentence-transformers # required in order to run pipeline examples given below
pip install neo4j-haystack
You can find the Neo4j documentation on the Haystack integration here.
Next, we’ll add the Neo4j document store to the document_processor.py file. This new helper function, nested inside the _initialize_document_store method, can switch between the pgvector and Neo4j document stores:
class DocumentStoreType(Enum):
    Pgvector = 1
    Neo4j = 2

…

def init_doc_store(force_recreate: bool = False) -> Union[PgvectorDocumentStore, Neo4jDocumentStore]:
    if self._document_store_type == DocumentStoreType.Pgvector:
        connection_token: Secret = Secret.from_token(self._postgres_connection_str)
        doc_store: PgvectorDocumentStore = PgvectorDocumentStore(
            connection_string=connection_token,
            table_name=self._table_name,
            embedding_dimension=self.embed_dims,
            vector_function="cosine_similarity",
            recreate_table=self._recreate_table or force_recreate,
            search_strategy="hnsw",
            hnsw_recreate_index_if_exists=True,
            hnsw_index_name=self._table_name + "_hnsw_index",
            keyword_index_name=self._table_name + "_keyword_index",
        )
        return doc_store
    elif self._document_store_type == DocumentStoreType.Neo4j:
        # https://haystack.deepset.ai/integrations/neo4j-document-store
        doc_store: Neo4jDocumentStore = Neo4jDocumentStore(
            url=self._neo4j_url,
            username=self._db_user_name,
            password=self._db_password,
            database=self._db_name,
            embedding_dim=self.embed_dims,
            embedding_field="embedding",
            index="document-embeddings",  # the name of the vector index in Neo4j
            node_label="Document",  # the label given to Neo4j nodes that store Documents
            recreate_index=self._recreate_table or force_recreate,
        )
        return doc_store
Now we just call this function, and the rest of the code stays mostly the same:
document_store: Union[PgvectorDocumentStore, Neo4jDocumentStore]
document_store = init_doc_store()
There are a few other small changes I had to make so the code can handle either type of document store, but otherwise little has changed.
Another thing to note is that the Neo4j document store (Neo4jDocumentStore) isn’t configured quite the same way as the PostgreSQL document store (PgvectorDocumentStore). With the pgvector document store I needed to build a connection string out of the login information; with Neo4j I pass the login credentials directly into the document store. Also note that Neo4j has no flag to recreate the database itself; instead you recreate the index (via recreate_index), which effectively recreates the database as well.
Querying the Graph Database
After running the new code and loading the Neo4j database, we can then go to the Neo4j browser and query the graph we created.
Let’s start with this query:
MATCH (n:Document) RETURN count(n) AS totalEntities;
For me, I’ve loaded 619 documents. Let’s create a query to view them as a graph:
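The original post shows this as a screenshot; a simple Cypher query that returns the nodes so Neo4j Browser can render them in its graph view might look like this (the LIMIT is arbitrary):

```cypher
// Return Document nodes so the browser displays them graphically
MATCH (n:Document) RETURN n LIMIT 100;
```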
Not much of a graph, huh? It’s just a bunch of disconnected nodes, each of which is a document fragment. So, at this point this is no different than just using the PostgreSQL/pgvector database. We’ll go into detail on how to take advantage of a graph database in future posts.
But what if you want to get the actual data back? This query might work for you:
MATCH (n:Document)
RETURN {
id: n.id,
properties: [key IN keys(n) WHERE key <> 'embedding' | {key: key, value: n[key]}]
} AS document
LIMIT 25
This Cypher query returns all properties for each node (except the embedding vector, which is excluded for readability). So, it’s much like querying a table in a regular SQL relational database.
Adding Neo4j to the RAG Pipeline
Now we’re ready to add the graph database to the RAG pipeline. We basically use the same trick. We create a function that will return the appropriate document store. It is nearly identical to the one we did in the document conversion pipeline. But this time we add it to rag_pipeline.py. After that we can add the nodes to the RAG pipeline to utilize the Neo4j database. This works the same way as before except with a new wrinkle: we can no longer use the PgvectorKeywordRetriever component. So lexical searches must be disabled if we’re using Neo4j.
if (self._search_mode == SearchMode.LEXICAL or self._search_mode == SearchMode.HYBRID) \
        and self._document_store_type == DocumentStoreType.Pgvector:
    # Note the parentheses: without them, "or" binds looser than "and", and a
    # lexical search against a Neo4j store would wrongly build a pgvector retriever.
    lex_retriever: RetrieverWrapper = RetrieverWrapper(
        PgvectorKeywordRetriever(document_store=self._document_store, top_k=self._retriever_top_k))
    rag_pipeline.add_component("lex_retriever", lex_retriever)
    rag_pipeline.connect("query_input.query", "lex_retriever.query")
    rag_pipeline.connect("lex_retriever.documents", "doc_query_collector.lexical_documents")
if self._search_mode == SearchMode.SEMANTIC or self._search_mode == SearchMode.HYBRID:
    semantic_retriever: RetrieverWrapper
    if self._document_store_type == DocumentStoreType.Neo4j:
        semantic_retriever = RetrieverWrapper(
            Neo4jEmbeddingRetriever(document_store=self._document_store, top_k=self._retriever_top_k))
    else:
        semantic_retriever = RetrieverWrapper(
            PgvectorEmbeddingRetriever(document_store=self._document_store, top_k=self._retriever_top_k))
    rag_pipeline.add_component("semantic_retriever", semantic_retriever)
    rag_pipeline.connect("query_embedder.embedding", "semantic_retriever.query")
    rag_pipeline.connect("semantic_retriever.documents", "doc_query_collector.semantic_documents")
Notice how I check which type of document store we want to use and adjust the pipeline accordingly. I also changed the RetrieverWrapper to accept either kind of retriever.
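The RetrieverWrapper change itself isn’t shown in the post; as a rough, duck-typed sketch (the class and method names here are assumptions based on the snippet above, not the actual implementation), the wrapper only needs to delegate to whatever retriever it was handed:

```python
from typing import Any, Dict, List


class RetrieverWrapper:
    """Wraps any retriever exposing a run() method, so the pipeline
    doesn't need to know which document store backs it."""

    def __init__(self, retriever: Any) -> None:
        self._retriever = retriever

    def run(self, query: Any) -> Dict[str, List[Any]]:
        # Both embedding retrievers accept the query embedding and
        # return their matches under a "documents" key.
        result = self._retriever.run(query_embedding=query)
        return {"documents": result["documents"]}
```

Because the wrapper relies only on duck typing, swapping PgvectorEmbeddingRetriever for Neo4jEmbeddingRetriever requires no change to the pipeline wiring.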
Conclusions
At this point we’re not really utilizing the power of a graph database. But we have successfully changed our code to handle either a PostgreSQL or Neo4j database as the document store. So, our open-source AI stack now includes the power of graph databases should we need that ability. This is exciting!