Our Open-Source AI Stack: The Book Search Archive
- By Bruce Nielson
- ML & AI Specialist
In our ongoing series of AI blog posts, we’ve been developing a simple application that shows off (and tests out) our low-cost, open-source AI stack. This series has been building a way to search and query a book (we’ve been using the Federalist Papers as our example).
We have now released an improved version of this “Book Search” program, which we’re calling the “Book Search Archive”. You can find the current code (as it is being developed) in this GitHub repo.
At this point the code is similar to the code we released back in this post but with a number of improvements that we’ll be covering in future posts. The code shows how to develop a simple Retrieval Augmented Generation (RAG) system.
The goal is to develop an AI stack that allows us to deliver low-cost AI solutions to our clients, using open-source components that are still strong enough to be competitive (as discussed in this post about a low-cost Tech Support Agent to assist new tech support agents).
Future blog posts will go over the changes I’ve made in the Book Search Archive program, but feel free to try it out for yourself in the meantime.
Setting Up the Book Search Archive
The first step is to set up your environment. This post goes over how to set up your environment in detail. There is one difference, however: the new Book Search Archive repo uses more up-to-date package versions. To match the right versions, you need only look at the requirements.txt file in the repo (found here). I am planning to eventually upgrade everything, and as I upgrade Python packages I re-freeze the current requirements into the requirements.txt file. As mentioned in the setup post, you will probably have to install PyTorch separately, since a plain pip install doesn’t work right with the PyTorch packages (probably because you have to specify where to download PyTorch from, and that isn’t specified inside a requirements.txt file).
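As a rough illustration, the setup might look like the commands below. The `cu121` CUDA tag is an assumption for illustration only; pick the index URL that matches your hardware from the PyTorch website (or use the `cpu` index for a CPU-only install):

```
# Install everything except PyTorch from the frozen requirements
pip install -r requirements.txt

# Install PyTorch separately, pointing pip at the PyTorch download index
# (replace cu121 with the tag for your CUDA version, or use /whl/cpu)
pip install torch --index-url https://download.pytorch.org/whl/cu121
```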
Features of the Book Search Archive (Current)
The Book Search Archive program is very similar to what we already developed in past posts, especially these:
Main Articles on Haystack and pgvector:
- Retrieval Augmented Generation with Haystack and pgvector (part 1)
- Retrieval Augmented Generation with Haystack and pgvector (part 2)
Other Haystack Related Articles:
- Writing a Custom Haystack Pipeline Component
- Google AI Integration with Haystack
- Loading EPUB files with Haystack
- Haystack Streaming Text Generation
Hugging Face Model Related Articles:
- Using Hugging Face Generators for Retrieval Augmented Generation
- Avoiding Text Truncations in Retrieval Augmented Generation
The current release includes all functionality from those posts, plus the following new features (to be discussed in detail in future posts):
- Greatly enhanced metadata collection from EPUB files (which are actually just HTML)
  - Example: It now captures page numbers as well as chapter and section titles right out of the text and stores them as metadata. This will allow you to find the quoted text in a paperback copy of the book if desired.
- A hybrid search using lexical and semantic search together, or each one individually if preferred.
- Improved text capture: I now capture all text in the book, even text outside of a paragraph tag.
  - Example: Some books I have used have quotes at the top of a page that were getting skipped. Now, nothing gets skipped.
- Skipping sections that aren’t helpful
  - More precisely, nothing is skipped UNLESS we identify it as unhelpful, such as tables of contents or bibliographies.
  - I have added a number of ways to specify what you do NOT want included in the archive. Searching improves when it doesn’t have to wade through material that isn’t helpful to a user.
- Improved document splitting
  - It now splits not only on sentences ending with ‘.’ (the Haystack default for a ‘sentence’ split) but also accepts ‘?’, ‘!’, and even newlines.
  - If all of those still fail to produce chunks small enough for the Haystack pipeline to avoid truncation, it fails over to a word split and finds a good break point based on chunks of words. This ensures text is never truncated when embedding.
- Loading multiple books at once
  - You can now specify a directory, and it will search that directory for all EPUB files (PDFs coming soon!) and load all of them.
- Loading additional books
  - You no longer need to start fresh each time you do a load. You can now load more books into an existing document store without losing what you already had.
- A custom Document Joiner
  - Haystack comes with a Document Joiner that, at least in the version I was running, breaks text streaming from the LLM. I wrote my own version of a document joiner to avoid this problem.
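To illustrate the document-splitting idea described above, here is a minimal, self-contained sketch (not the actual Book Search Archive code): split on sentence-ending punctuation and newlines first, then fail over to fixed-size word chunks for any sentence that is still too long. The `max_words` parameter is a stand-in for whatever limit your embedding model imposes.

```python
import re

def split_text(text, max_words=50):
    """Split text into chunks on sentence boundaries ('.', '?', '!')
    or newlines; fail over to word-based chunks for over-long sentences."""
    # Split after sentence-ending punctuation, or on runs of newlines.
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.?!])\s+|\n+', text)
                 if s.strip()]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) <= max_words:
            chunks.append(sentence)
        else:
            # Fail over to word chunks so no embedding input gets truncated.
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
    return chunks
```

For example, `split_text("One. Two? Three!\nFour")` yields four separate chunks, while a 120-word run-on sentence would be broken into word chunks of at most `max_words` words each.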
Expect future posts explaining each of the above at some point.
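In the meantime, here is a sketch of one common way hybrid-search results can be combined. Reciprocal rank fusion (RRF) is a standard technique for merging a lexical ranking with a semantic ranking; I am using it purely as an illustration of the idea, not as a claim about how the Book Search Archive implements its hybrid search.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of document IDs into one ranking.
    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional smoothing constant from the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both the lexical and the semantic list rises to the top, while a document that appears in only one list is still retained further down the fused ranking.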
Future Features
We have plans to include additional features in the Book Search Archive coming soon. These might include:
- Streaming from a custom LLM on your own private server
- LangChain integration
- A full chat interface: have a conversation with the author of the book rather than just get an answer to a single question
- Additional architectures (e.g. ReAct and Chain of Thought)
- Re-ranker
- Improved Semantic Text Chunking
- Apache AGE (graph database in PostgreSQL)
- Integration with other tools such as:
  - LlamaIndex
  - LlamaParse (to create a graph knowledge base)
  - NeuralDB (embedding-free indexing)
  - DSPy (prompt tuning)
  - ElasticSearch
  - Streamlit (customizable UI interfaces)
So, stay tuned for even more great (but low-cost and simple) Artificial Intelligence solutions!