
Extending RAG capabilities to Excel with Couchbase, LlamaIndex, and Amazon Bedrock

As the world becomes increasingly data-driven, Excel remains integral to businesses, holding data that can yield invaluable insights. However, extracting meaningful information from these vast spreadsheets can be extremely time-consuming and requires specialized skill sets. This is where Generative AI and large language models can help simplify the insight-generation process, and a major component of that process is Retrieval Augmented Generation (RAG).

RAG is a powerful technique that improves the accuracy of large language models (LLMs) by giving the model access to external facts through information retrieval. Typically, an LLM takes a user's input and delivers a response based on the information it was trained on, which can be outdated or incorrect. RAG supplements that information with external data, such as a company's knowledge base or relevant documents, enabling the model to deliver factually accurate, contextually relevant responses.

This blog will guide you through building a RAG system tailored for ingesting Excel data and generating insights. We'll use LlamaIndex and LlamaParse to transform spreadsheets into a searchable knowledge base, store that data in Couchbase Vector Search for fast retrieval of context relevant to a user's query, and then use Amazon Bedrock to generate the LLM response.

What are LlamaIndex and LlamaParse?

LlamaIndex is an open-source orchestration framework designed to help developers build AI applications with large language models (LLMs), bridging the gap between custom data sources and LLMs. It lets users ingest data from various sources, such as files or vector databases, and indexes that data into intermediate representations. Users can then query this data in natural language and interact with it.

First, the data gets indexed into a vector index. This creates a searchable knowledge base specific to the domain. During querying, the system searches for relevant information based on the user’s prompt and then provides this information to the large language model in order to generate a response.

LlamaParse is a specialized component within the LlamaIndex ecosystem, designed as a powerful document-parsing platform that streamlines parsing and extracting structured information from complex documents. It parses and cleans data to ensure high-quality input for LLM use cases like RAG, and it supports many document types, including PDF, Excel, and HTML.

Building a retrieval augmented generation system with Excel data

In this blog, we will create a RAG system using a customer complaints dataset (from Kaggle). This dataset provides detailed information about consumer complaints spanning various financial products and services. The RAG system, powered by Couchbase Vector Search, will facilitate the extraction of critical information from the data.

A visual representation of the dataset is provided below.

[Image: the consumer complaints dataset from Kaggle]

Installing the Required Dependencies and Instantiating LlamaParse
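
First, install the Python packages for LlamaIndex, LlamaParse, and the Couchbase and Bedrock integrations. This is a sketch of the likely dependency set; exact package names may differ depending on your llama-index version:

```bash
pip install llama-index llama-parse \
    llama-index-vector-stores-couchbase \
    llama-index-embeddings-bedrock \
    llama-index-llms-bedrock \
    couchbase
```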

After installing the dependencies, we will now instantiate LlamaParse with the parsing instruction to parse the Excel file: 
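
A minimal sketch of the instantiation, assuming a LlamaCloud API key is set in the LLAMA_CLOUD_API_KEY environment variable (the parsing instruction below is illustrative):

```python
from llama_parse import LlamaParse

# LlamaParse picks up the API key from LLAMA_CLOUD_API_KEY
# if it is not passed explicitly via api_key=...
parser = LlamaParse(
    result_type="markdown",  # "text" is the default; "markdown" and "json" are also available
    parsing_instruction=(
        "This Excel file contains consumer complaints about financial "
        "products and services. Each row is one complaint; keep each "
        "value associated with its column header."
    ),
)
```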

When we instantiate the LlamaParse object, we pass parsing_instruction and result_type as the parsing options.

In result_type, we specify the format of our output. By default, LlamaParse returns results as parsed text; the other available options are markdown and JSON, which return a structured representation of the parsed object. In parsing_instruction, we can provide LlamaParse with additional context about the data. LlamaParse uses LLMs under the hood, allowing us to give it natural-language instructions about what to parse and how to parse it.

Next, we will load the Excel file and parse it using LlamaParse:
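
Assuming the Kaggle dataset has been downloaded locally (the file name here is a placeholder):

```python
# Parse the spreadsheet into a list of LlamaIndex Document objects.
documents = parser.load_data("./consumer_complaints.xlsx")
print(f"Parsed {len(documents)} document(s)")
```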

Storing the parsed data using a Couchbase vector index 

Before proceeding, ensure that you have a Couchbase Capella account and have set up a vector index inside Couchbase. The Couchbase documentation walks you through setting up your cluster and the vector index. Once the account and index are ready, you can proceed further.

The parsed data is stored in the documents variable, and we will now load it into Couchbase. To achieve this, the documents are converted into a VectorStoreIndex, which embeds them using Bedrock Embeddings and stores the result in the Couchbase vector store. First, we set up the embedding model:
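
This sketch assumes AWS credentials are configured in the environment and uses the Titan text embedding model (constructor parameter names can vary across llama-index versions):

```python
from llama_index.embeddings.bedrock import BedrockEmbedding

# Embedding model used to vectorize each chunk before storage.
embed_model = BedrockEmbedding(
    model_name="amazon.titan-embed-text-v1",
    region_name="us-east-1",
)
```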

We will also connect to the Couchbase instance: 
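
A minimal connection sketch using the Couchbase Python SDK; the connection string and credentials are placeholders for your own Capella cluster:

```python
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator("your-username", "your-password")
cluster = Cluster(
    "couchbases://your-cluster.cloud.couchbase.com",
    ClusterOptions(auth),
)

# Block until the cluster is ready to serve requests.
cluster.wait_until_ready(timedelta(seconds=5))
```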

Now we will use VectorStoreIndex to store the index in Couchbase. VectorStoreIndex takes the documents and splits them up into nodes, then creates vector embeddings of the text of every node using the specified embedding model, in this case Bedrock embeddings, ready to be queried by an LLM:
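
A sketch of the indexing step; the bucket, scope, collection, and index names are placeholders that should match your Capella setup (the vector store class name may differ between llama-index versions):

```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.couchbase import CouchbaseVectorStore

vector_store = CouchbaseVectorStore(
    cluster=cluster,
    bucket_name="rag_demo",
    scope_name="_default",
    collection_name="_default",
    index_name="vector_index",  # the search index created earlier
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Split the documents into nodes, embed each node with Bedrock,
# and store the embeddings in Couchbase.
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)
```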

LlamaIndex offers various file-based node parsers designed to generate nodes according to the specific content type being processed, such as JSON, Markdown, and other formats:
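
For example, since LlamaParse returned markdown here, a markdown-aware parser can split the documents into nodes explicitly (a sketch; without it, VectorStoreIndex applies a default text splitter):

```python
from llama_index.core.node_parser import MarkdownNodeParser

node_parser = MarkdownNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)

# Nodes can be indexed directly instead of raw documents.
index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
    embed_model=embed_model,
)
```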

Response generation from Amazon Bedrock

When a user uploads an Excel file, it’s parsed using LlamaParse and stored in a Couchbase vector store. For each user query, the system performs a vector search to retrieve relevant chunks of information from the stored Excel data. These relevant chunks are then used as context for the language model (Bedrock’s Mistral model) to generate a response.
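
A sketch of the query flow, assuming the Mistral 7B Instruct model ID on Bedrock (swap in whichever Bedrock model your account has access to):

```python
from llama_index.llms.bedrock import Bedrock

llm = Bedrock(
    model="mistral.mistral-7b-instruct-v0:2",
    region_name="us-east-1",
)

# The query engine embeds the question, retrieves the most similar
# chunks from Couchbase, and passes them to the LLM as context.
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("Which product category received the most complaints?")
print(response)
```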

Results

We can now check the performance of our RAG application by examining its responses to various user queries.
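
For example (the queries here are illustrative):

```python
for q in [
    "Summarize the most common complaints about mortgage servicing.",
    "Which companies have the highest number of complaints?",
]:
    print(q)
    print(query_engine.query(q), "\n")
```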

Conclusion

This blog showed how to build a Retrieval Augmented Generation (RAG) system that simplifies the analysis of massive amounts of Excel data: information is extracted from the data using LlamaParse, transformed into a VectorStoreIndex, and stored within Couchbase for retrieval.

Posted by Shivay Lamba, Developer Evangelist
