Semantic Search and Cosine Similarity

In my last post, I talked about Cosine Similarity and how to use it to find similarities between two vectors. We also talked about how we can convert two sentences into two vectors and then use Cosine Similarity to compare how similar two sentences are.

Now let’s put our new knowledge to work and use it to do a ‘semantic search’ within an entire book. (1)

What is a Semantic Search?

So, what is a ‘semantic search’? Instead of searching a book for an exact matching word or phrase, we’ll build a way to search for similar ideas across synonyms, or even ask a question and find the best matching answer.

First, we’ll need a book. Let’s pick something not too large, such as The Federalist Papers from Project Gutenberg. You can grab your own copy there or find it in my GitHub for my blog posts. Look for “Federalist Papers.epub”. Or use a book of your own, so long as it is in epub format.

Installing the Needed Software

First, install the software we’ll need for this post using the following commands:

pip install sentence-transformers
pip install ebooklib
pip install beautifulsoup4
Once those are installed, you can import the following:
from sentence_transformers import SentenceTransformer
from ebooklib import epub, ITEM_DOCUMENT
from bs4 import BeautifulSoup
import numpy as np

If you don’t already have NumPy installed, you can run ‘pip install numpy’ to install it.

Read In the Epub File

First, let’s write a function to read in the book’s epub file.

def epub_to_paragraphs(epub_file_path, min_words=0):
    paragraphs = []
    book = epub.read_epub(epub_file_path)

    for section in book.get_items_of_type(ITEM_DOCUMENT):
        paragraphs.extend(epub_sections_to_paragraphs(section, min_words=min_words))

    return paragraphs

This function takes the name of the epub file and ‘min_words’, the minimum number of words a paragraph must contain or else we strip it out. I found this useful for stripping out pages or paragraphs that have no real content, like a title page.

We then read in the epub file using “epub.read_epub”.

An epub file may contain several parts. We are interested in the actual text of the book (versus, say, the cover or the images), so we loop over “book.get_items_of_type(ITEM_DOCUMENT)” to get each text section of the book.

Getting Paragraphs

But what we really want is not ‘sections’ but individual paragraphs. So let’s write this function:

def epub_sections_to_paragraphs(section, min_words=0):
    html = BeautifulSoup(section.get_body_content(), 'html.parser')
    p_tag_list = html.find_all('p')
    paragraphs = [
        {
            'text': paragraph.get_text().strip(),
            'chapter_name': ' '.join([heading.get_text().strip() for heading in html.find_all('h1')]),
            'para_no': para_no,
        }
        for para_no, paragraph in enumerate(p_tag_list)
        if len(paragraph.get_text().split()) >= min_words
    ]
    return paragraphs

This function uses the “BeautifulSoup” library that we installed to parse each section as HTML. That lets us grab every <p> tag, each of which becomes a paragraph. We also pull the chapter name out of the <h1> headings. This is also where we filter out any paragraph that doesn’t have at least ‘min_words’ words. For now, we’ll just take everything, so min_words = 0.
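
As a quick sanity check, you can run both functions and look at what comes back (assuming “Federalist Papers.epub” is in your working directory; the exact count and text will depend on your copy of the book):

paragraphs = epub_to_paragraphs("Federalist Papers.epub")
print(len(paragraphs))   # total number of paragraphs found
print(paragraphs[0])     # a dict like {'text': '...', 'chapter_name': '...', 'para_no': 0}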

Embeddings

To work the semantic search magic, we need to be able to encode the book and later the query into ‘embeddings’ which are the vectors we’ll perform the cosine similarity on. So let’s write two functions to do this for us:

def create_embeddings(texts, model):
    return model.encode([text.replace("\n", " ") for text in texts])

This first function takes the text we want to encode as well as a model (that’s what sentence_transformers provides) that will do the encoding. We replace all newline characters with spaces on the fly so that we’re dealing only with the text itself.

def get_embeddings(model, paragraphs):
    texts = [para['text'] for para in paragraphs]
    return create_embeddings(texts, model)

This second function takes our list of paragraph dictionaries from the epub_sections_to_paragraphs function (above), grabs only the text, then passes it to create_embeddings.
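
If you’re curious what comes back: model.encode returns a NumPy array with one row per input text. For the model we’ll load below, each row is a 768-dimensional vector (the exact dimensionality varies by model):

embeddings = get_embeddings(model, paragraphs)
print(embeddings.shape)  # e.g. (number_of_paragraphs, 768) for multi-qa-mpnet-base-dot-v1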

Semantic Search

Next, we need a cosine similarity function (from our previous blog post):

def cosine_similarity(query_embedding, embeddings):
    dot_products = np.dot(embeddings, query_embedding)
    query_magnitude = np.linalg.norm(query_embedding)
    embeddings_magnitudes = np.linalg.norm(embeddings, axis=1)
    cosine_similarities = dot_products / (query_magnitude * embeddings_magnitudes)
    return cosine_similarities
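
As a quick refresher, you can check the function on a few toy vectors (hypothetical 2-D vectors, just for illustration): vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0.

toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.0])
print(cosine_similarity(toy_query, toy_embeddings))  # [1.0, 0.0, 0.707...]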

And finally, here is the function to do the actual semantic search:

def semantic_search(model, embeddings, query, top_results=5):
    query_embedding = create_embeddings([query], model)[0]
    scores = cosine_similarity(query_embedding, embeddings)
    results = np.argsort(scores)[::-1][:top_results].tolist()
    return results

This function takes the model (the sentence_transformer), the embeddings we created, and a query (the search query), as well as how many top answers to return. We embed/encode the query, run a cosine similarity against each paragraph of the book, and take the top matches.
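
The only subtle line is the np.argsort call: argsort sorts ascending, so we reverse it with [::-1] to put the highest scores first, then slice off the top results. A toy example (with made-up scores) shows the behavior:

scores = np.array([0.2, 0.9, 0.5])
print(np.argsort(scores)[::-1][:2].tolist())  # [1, 2] -- the indices of the two best scores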

Putting It All Together

With these functions in place, now we can write our code to do the semantic search:

def test_semantic_search():
    paragraphs = epub_to_paragraphs(r"Federalist Papers.epub", min_words=3)
    model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
    embeddings = get_embeddings(model, paragraphs)

    query = 'Are we a democracy or a republic?'
    results = semantic_search(model, embeddings, query)

    print("Top results:")
    for result in results:
        para_info = paragraphs[result]
        chapter_name = para_info['chapter_name']
        para_no = para_info['para_no']
        paragraph_text = para_info['text']
        print(f"Chapter: '{chapter_name}', Passage number: {para_no}, Text: '{paragraph_text}'")
        print('')

Let’s go over this in detail:

Load the book:

paragraphs = epub_to_paragraphs(r"Federalist Papers.epub", min_words=3)

Load the hugging face model to do the embeddings (turn the book into vectors so that we can do the cosine similarity) and then do the embeddings:

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
embeddings = get_embeddings(model, paragraphs)

Do the actual Semantic Search:

query = 'Are we a democracy or a republic?'
results = semantic_search(model, embeddings, query)

So here we’re asking the question “Are we a democracy or a republic?” and searching for the best answer out of the Federalist Papers. Here was the top result I got:

Passage number: 56, Text: 'A republic, by which I mean a government in which the scheme of representation takes place, opens a different prospect, and promises the cure for which we are seeking. Let us examine the points in which it varies from pure democracy, and we shall comprehend both the nature of the cure and the efficacy which it must derive from the Union.'

That’s a spot-on result!

You can find the entire code at my GitHub. The file “blog_semantic_search.py” contains all the code above. Give it a run and see how it works!

How (Why?) Does it Work?

One question you might ask is how and why this actually works. In our last blog post, we basically just counted up words to determine a similarity. But this is clearly doing something far more sophisticated. The answer is that sentence_transformers embeds text into vectors such that synonyms and related concepts end up closer to each other in the ‘vector space’. That is how and why we get such good results out of a semantic search like this using so very little code.
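
You can see this for yourself with a quick experiment, reusing the model and cosine_similarity function from above (a sketch; the exact scores will vary, but the related word should score noticeably higher than the unrelated one):

words = ['republic', 'democracy', 'banana']
vectors = model.encode(words)
print(cosine_similarity(vectors[0], vectors[1:]))
# 'democracy' should land much closer to 'republic' than 'banana' does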

For those interested in pursuing this further, I have a fuller-featured version of Semantic Search available in a different GitHub repo. This version includes loading PDFs (which I’ll cover in a future blog post) and offers many more options for breaking the text down into pages or paragraphs.

I also have a Google Colab with the same code to try out.

Notes: (1) With thanks to Dwarkesh Patel from the Dwarkesh podcast for the idea for this blog post. His Google Colab.
