A Short Explanation of Hierarchal Navigable Small Worlds (HNSW) Index for pgvector

A Short Explanation of Hierarchal Navigable Small Worlds (HNSW) Index for pgvector

In our last two posts (here and here) we first installed PostgreSQL and then installed the pgvector extension to add functionality to PostgreSQL to allow us to use PostgreSQL as our document store for use with Retrieval Augmented Generation (RAG) for use with Large Language Models (LLMs).

Before we continue on to use pgvector and PostgreSQL for RAG (similar to how we did in this post) we should first briefly explain how pgvector handles similarity comparisons. Moreover, how does it differ from our simple cosine similarity that we previously used as the basis for a semantic search (back in this post.)

Recall that a cosine similarity (as explained in this post) requires us to first embed every single passage then embed our query. Then we do a comparison of the embedded query to every single embedded passage. The result is a slow process. It can take minutes to do each query. This isn’t very useful for real life cases where we need a quick response to avoid losing our user’s interest. So how can we speed up the process?

One very clever answer is to use an algorithm called the Hierarchal Navigable Small World (HNSW) algorithm. The HNSW algorithm was first published in a 2016 paper called “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” by Malkov and Yashunin (found here).

This short post isn’t going to go into all the details of how the HNSW algorithm works in detail, but I’m going to give a high level ‘gist’ of what it does and why it significantly speeds up cosine similarity comparisons. I’ll also include some great links to posts that explain it in more detail.

Consider the way we did our semantic search back in this post. We loaded a copy of the Federalist Papers, split it up into chunks of text, and then embedded each chuck. When we did a query we had to compare to every single chuck of text using a straight linear search, i.e. starting with the first one and trying out every single one until we get to the end of the list. So, if we have 10,000 chunks of text to search through we must perform 10,000 cosine similarity calculations.

How might we utilize Artificial Intelligence algorithms to speed this search up?

To visualize this, imagine all the chunks of text existing on a graph. Except this graph is likely a 768-dimensional graph instead of a 2-D graph. But since we can’t imagine what 768 dimensions looks like, we’ll just imagine this in 2 dimensions instead, though the concept is identical.

Suppose the embedding for our query sits somewhere within this graph next to chunks of text. What we want is the chunk (or chunks) of text closest to the query.

There is an Artificial Intelligence (AI) algorithm that works something like that called the k-nearest neighbor (KNN) algorithm. In this example (taken from Wikipedia) the green circle could be thought of as our query and blue boxes and red triangles represent the chunks of text we’re comparing to.

A diagram showcasing a green filled in circle at the center, two red filled in triangles orbiting it, a blue filled in square orbiting it, and then a solid circle encircling all the beforementioned. Then, you have two more filled in blue squares orbiting that, then, a dotted line encircling all the beforementioned. Then, you have three blue filled in squares orbiting all that, and three red filled in triangles orbiting all that as well.

So, in this example, we’d probably want to return either the top 3 closest chunks of text (in the solid circle) or the top 5 (in the dotted circle).

A lot of research has gone into how to speed up a KNN search. HNSW utilizes some of these clever speed ups to KNN so that it can quickly find the best documents without having to do a comparison with every single chunk of text.

Imagine that the chunks of text are connected to each other in a graph that likes something like this (taken from the 2016 HNSW paper):

A diagram showing various light blue circles connected to each other with lines, some are connected to more than one. There is also a green circle in which several of the blue circles also are connected to. One of the light blue circles has a red arrow pointed at it, and it then has a second red arrow pointing from it to the green circle (despite not being connected to the green circle by a line).

You enter the graph where the red dotted line enters and then you traverse the graph until you find a node (i.e. the chunk of text) that has no connected nodes that are closer to your query. Because you don’t have to search every single node the process is quite a bit faster. The downside is that you may not find the closest match this way, but you should find something pretty close.

However, even with this speed up, you will quickly overwhelm the algorithm as the number of documents grows and the process will become intractable again.

So, imagine that you have multiple layers of graphs at different levels of abstraction. Something like this (taken from the 2016 paper)

A diagram with three levels, labled Layer=2, Layer=1, and Layer=0. On Layer=0, the top layer, there is a red dot and a light blue dot, a red arrow points from the red dot to the light blue dot. On Layer=1, which is the middle layer, there are four light blue dots arranged in a parallelogram, three of them are connected to each other forming the shape of a triangle, and a fourth is connected to just one corner dot, but has a red arrow pointing from it to another light blue dot, but not the one it is connected to. A red arrow descends from Layer=2 from the light blue dot to the aforementioned dot that is only connected once. A black dotted line descends from Layer=2's red dot to a light blue dot direclty under it which is connected to the previous two light blue dots, but is not connected in any way to the singular dot that is connected only once. On Layer=0, the bottom layer, you see a rendition of the previous diagram with the many light blue dots and single green dot. All four of the dots found on Layer=1 have lines dropping to connect to the dots in Layer=0.

You’d be able to search the top layer first (which is a very sparse graph, so there aren’t many nodes to search!), find the closest node, then descend a level to find a closer node, descend again, etc. The result would be a much faster search even if the number of documents got quite large.

This is in essence what the HNSW algorithm does. It creates layers of such graphs and keeps them properly connected to allow for a fast search. The downside is that the HNSW algorithm isn’t guaranteed to return the absolute closest matches.

pgvector creates a new kind of field (a vector) with a new kind of index that allows you to do a cosine similarity search using an HNSW index.

For now, that’s really all you need to know to understand how we’re going to utilize PostgreSQL and pgvector to do a faster cosine similarity search.



comments powered by Disqus

Follow Us

Latest Posts

subscribe to our newsletter