As the world around us becomes increasingly data-driven, Excel remains integral for businesses, holding invaluable insights within its sheets. However, extracting meaningful information from these vast datasets can be extremely time-consuming and often requires specialized skills. This is where Generative AI and Large Language Models can help simplify the insight-generation process. A major component of this process is Retrieval Augmented Generation (RAG).
RAG is a powerful technique that improves the accuracy of large language models (LLMs) by giving the model access to external facts through information retrieval. Typically, LLMs take a user’s input and deliver responses based only on the information they were trained on (which can be outdated or incorrect). RAG supplements this with additional data, such as a company’s knowledge base or relevant documents, enabling the model to deliver factually accurate, contextually relevant responses.
This blog will guide you through building a RAG system tailored for ingesting Excel data and generating insights. We’ll use LlamaIndex and LlamaParse to transform the spreadsheets into a searchable knowledge base, store this data in Couchbase with Vector Search for fast retrieval of relevant context based on a user’s query, and then use Amazon Bedrock for the LLM response.
What are LlamaIndex and LlamaParse?
LlamaIndex is an open-source orchestration framework designed to help developers build AI applications with large language models (LLMs). It bridges the gap between custom data sources and LLMs: users can ingest data from various sources such as files or vector databases, index that data into intermediate representations, and then query and interact with it in natural language.
First, the data gets indexed into a vector index. This creates a searchable knowledge base specific to the domain. During querying, the system searches for relevant information based on the user’s prompt and then provides this information to the large language model in order to generate a response.
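To make this flow concrete, here is a minimal sketch using LlamaIndex defaults (an in-memory vector index and whatever LLM and embedding model are configured for the environment); the directory path and query are illustrative only:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Ingest documents and index them into an in-memory vector index.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# At query time, the most relevant chunks are retrieved and passed to the LLM.
query_engine = index.as_query_engine()
response = query_engine.query("Which products receive the most complaints?")
print(response)
```

Later in this blog, the in-memory index is replaced with a Couchbase-backed vector store, and Amazon Bedrock provides the LLM and embedding models.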
LlamaParse is a specialized component within the LlamaIndex ecosystem, designed to be a powerful document-parsing platform that streamlines the process of parsing and extracting structured information from complex documents. It parses and cleans data to ensure users have high-quality input for LLM use cases like RAG. LlamaParse supports parsing different types of documents, such as PDFs, Excel, and HTML.
Building a retrieval augmented generation system with Excel data
In this blog, we will create a RAG system utilizing a customer complaints dataset (from Kaggle). This dataset provides detailed information about consumer complaints spanning various financial products and services. The RAG system, powered by Couchbase Vector Search, will facilitate the extraction of critical information from the data.
A visual representation of the dataset is provided below.
Installing the Required Dependencies and Instantiating LlamaParse
```python
!pip install llama-index llama-parse llama-index-vector-stores-couchbase llama-index-llms-bedrock fastembed llama-index-embeddings-bedrock
```
After installing the dependencies, we instantiate LlamaParse with a parsing instruction for the Excel file:
```python
from llama_parse import LlamaParse
from google.colab import userdata
from llama_index.core import SimpleDirectoryReader
import nest_asyncio

nest_asyncio.apply()

parser = LlamaParse(
    api_key=userdata.get("LLAMA_CLOUD_API_KEY"),
    parsing_instruction="""You are parsing a customer complaints dataset.
The column Company contains the company name. Please extract the Product, Sub-product, Issue,
Consumer complaint narrative, Company public response, Company, State, and ZIP code information
from the columns.""",
    result_type="markdown",
)
```
When we instantiate the LlamaParse object, we pass parsing_instruction and result_type as the parsing options.
In result_type, we specify the format of our output. By default, LlamaParse returns results as parsed text; the other available options are markdown and JSON, which return a structured representation of the parsed object. In parsing_instruction, we can provide additional context to LlamaParse about the data. LlamaParse uses LLMs under the hood, allowing us to give it natural-language instructions about what to parse and how.
Next, we will load the Excel file and parse it using LlamaParse:
```python
file_extractor = {".xlsx": parser}
documents = SimpleDirectoryReader(
    input_files=["file_name.xlsx"],
    file_extractor=file_extractor,
).load_data()
```
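Before indexing, it can be useful to sanity-check what LlamaParse produced. This optional, illustrative step simply prints part of the parsed output:

```python
# Optional: inspect the parsed output before indexing it.
print(f"Parsed {len(documents)} document(s)")
print(documents[0].text[:500])  # first 500 characters of the parsed markdown
```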
Storing the parsed data using a Couchbase vector index
Before proceeding, ensure that you have an account on Couchbase Capella and have set up a vector index inside Couchbase. You can follow this guide to set up your cluster and the vector index. Once the account and index are ready, you can proceed.
The parsed data is stored in the documents variable. Next, it needs to be loaded into Couchbase. To achieve this, the documents are converted into a VectorStoreIndex, which is then stored in the Couchbase vector store; the documents are first embedded using Bedrock Embeddings so they can be stored in a format the vector store supports:
```python
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding

llm = Bedrock(model="mistral.mistral-large-2402-v1:0", region_name="us-east-1")
embeddings = BedrockEmbedding(model="amazon.titan-embed-text-v1")
```
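One convenient (though optional) way to make these the defaults for the rest of the pipeline is LlamaIndex's global Settings object, so that indexing and querying pick up the Bedrock models without passing them explicitly:

```python
from llama_index.core import Settings

# Use the Bedrock LLM and embedding model everywhere by default.
Settings.llm = llm
Settings.embed_model = embeddings
```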
We will also connect to the Couchbase instance:
```python
def connect_to_couchbase(connection_string, db_username, db_password):
    """Connect to Couchbase"""
    from couchbase.cluster import Cluster
    from couchbase.auth import PasswordAuthenticator
    from couchbase.options import ClusterOptions
    from datetime import timedelta

    auth = PasswordAuthenticator(db_username, db_password)
    options = ClusterOptions(auth)
    cluster = Cluster(connection_string, options)

    # Wait until the cluster is ready for use.
    cluster.wait_until_ready(timedelta(seconds=5))

    return cluster
```
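Calling the helper might look like the following; the connection string and credentials are placeholders for your own Capella values:

```python
# Placeholder values; replace with your Capella connection string and database credentials.
CONNECTION_STRING = "couchbases://cb.your-endpoint.cloud.couchbase.com"
DB_USERNAME = "your-username"
DB_PASSWORD = "your-password"

cluster = connect_to_couchbase(CONNECTION_STRING, DB_USERNAME, DB_PASSWORD)
```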
Now, we will call VectorStoreIndex to store the index in Couchbase. VectorStoreIndex takes the documents and splits them into nodes, then creates vector embeddings of each node's text using the specified embedding model (in this case, Bedrock embeddings), ready to be queried by an LLM.
LlamaIndex offers various file-based node parsers designed to generate nodes according to the specific content type being processed, such as JSON, Markdown, and other formats:
```python
from llama_index.core import StorageContext, VectorStoreIndex

vector_store = get_vector_store(
    cluster,
    DB_BUCKET,
    DB_SCOPE,
    DB_COLLECTION,
    INDEX_NAME,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
```
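The get_vector_store helper referenced above is not shown in the original snippet. A minimal sketch using the llama-index-vector-stores-couchbase package might look like this; the CouchbaseVectorStore import path and constructor arguments are assumptions based on that package, so check the current documentation for your version:

```python
from llama_index.vector_stores.couchbase import CouchbaseVectorStore

def get_vector_store(cluster, db_bucket, db_scope, db_collection, index_name):
    """Return a Couchbase-backed vector store for the given keyspace and search index (illustrative)."""
    return CouchbaseVectorStore(
        cluster=cluster,
        bucket_name=db_bucket,
        scope_name=db_scope,
        collection_name=db_collection,
        index_name=index_name,
    )
```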
Response generation from Amazon Bedrock
When a user uploads an Excel file, it’s parsed using LlamaParse and stored in a Couchbase vector store. For each user query, the system performs a vector search to retrieve relevant chunks of information from the stored Excel data. These relevant chunks are then used as context for the language model (Bedrock’s Mistral model) to generate a response.
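The snippet below assumes a Streamlit app in which the chat engine has already been created and stored in session state. A sketch of that setup might look like the following; the "context" chat mode and the surrounding variable names are assumptions for illustration, not code from the original app:

```python
import streamlit as st

# Assumed setup: a context chat engine that retrieves chunks from the Couchbase
# index and passes them to the Bedrock LLM as context for each question.
if "chat_engine_rag" not in st.session_state:
    st.session_state.chat_engine_rag = index.as_chat_engine(chat_mode="context", llm=llm)

rag_response = ""
message_placeholder = st.empty()
```

The streaming loop below then appends each generated chunk to the placeholder as it arrives: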
```python
rag_stream_response = st.session_state.chat_engine_rag.stream_chat(question)
for chunk in rag_stream_response.response_gen:
    rag_response += chunk
    message_placeholder.markdown(rag_response)
```
Results
We can now check the performance of our RAG application by examining its responses to various user queries.
Conclusion
This blog showed how to build a Retrieval Augmented Generation (RAG) system that simplifies the analysis of large amounts of Excel data. We did this by extracting information from the spreadsheet using LlamaParse, transforming it into a VectorStoreIndex, storing that index in Couchbase, and using Amazon Bedrock to generate responses grounded in the retrieved context.
- Get started with Couchbase Capella developer platform for free
- Read more posts and tutorials on Generative AI (GenAI)
- Learn more about LLM Embeddings