As the world around us becomes increasingly data-driven, Excel remains integral for businesses, holding invaluable insights within its sheets. However, extracting meaningful information from these vast datasets can be extremely time-consuming and often requires specialized skills. This is where Generative AI and Large Language Models can help simplify the insight-generation process. A major component of this process is Retrieval Augmented Generation (RAG).
RAG is a powerful technique that improves the accuracy of large language models (LLMs) by giving the model access to external facts through information retrieval. Typically, LLMs take a user’s input and deliver responses based only on the information they were trained on (which can be outdated or incorrect). RAG supplements this with additional data, such as a company’s knowledge base or relevant documents, enabling the model to deliver factually accurate, contextually relevant responses.
This blog will guide you through building a RAG system tailored for ingesting Excel data and generating insights. We’ll use LlamaIndex and LlamaParse to transform the spreadsheets into a searchable knowledge base, store this data in Couchbase with Vector Search for fast retrieval of relevant context based on a user’s query, and then use Amazon Bedrock for the LLM response.
What are LlamaIndex and LlamaParse?
LlamaIndex is an open-source orchestration framework designed to help developers build AI applications with large language models (LLMs). It bridges the gap between custom data sources and LLMs: users can ingest data from various sources such as files or vector databases, index that data into intermediate representations, and then query and interact with it in natural language.
First, the data gets indexed into a vector index. This creates a searchable knowledge base specific to the domain. During querying, the system searches for relevant information based on the user’s prompt and then provides this information to the large language model in order to generate a response.
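To make this flow concrete, here is a minimal sketch using LlamaIndex defaults (an in-memory vector index and whatever LLM and embedding model are configured for the environment); the directory path and query are illustrative only:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Ingest documents and index them into an in-memory vector index.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# At query time, the most relevant chunks are retrieved and passed to the LLM.
query_engine = index.as_query_engine()
response = query_engine.query("Which products receive the most complaints?")
print(response)
```

Later in this blog, the in-memory index is replaced with a Couchbase-backed vector store, and Amazon Bedrock provides the LLM and embedding models.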
LlamaParse is a specialized component within the LlamaIndex ecosystem, designed to be a powerful document-parsing platform that streamlines the process of parsing and extracting structured information from complex documents. It parses and cleans data to ensure users have high-quality input for LLM use cases like RAG. LlamaParse supports parsing different types of documents, such as PDFs, Excel, and HTML.
Building a retrieval augmented generation system with Excel data
In this blog, we will create a RAG system utilizing a customer complaints dataset (from Kaggle). This dataset provides detailed information about consumer complaints spanning various financial products and services. The RAG system, powered by Couchbase Vector Search, will facilitate the extraction of critical information from the data.
A visual representation of the dataset is provided below.
Installing the Required Dependencies and Instantiating LlamaParse
```python
!pip install llama-index llama-parse llama-index-vector-stores-couchbase llama-index-llms-bedrock fastembed llama-index-embeddings-bedrock
```
After installing the dependencies, we instantiate LlamaParse with a parsing instruction for the Excel file:
```python
from llama_parse import LlamaParse
from google.colab import userdata
from llama_index.core import SimpleDirectoryReader
import nest_asyncio

nest_asyncio.apply()

parser = LlamaParse(
    api_key=userdata.get("LLAMA_CLOUD_API_KEY"),
    parsing_instruction="""You are parsing a customer complaints dataset.
The column Company contains the company name. Please extract the Product, Sub-product, Issue,
Consumer complaint narrative, Company public response, Company, State, and ZIP code information
from the columns.""",
    result_type="markdown",
)
```
When we instantiate the LlamaParse object, we pass parsing_instruction and result_type as the parsing options.
In result_type, we specify the format of our output. By default, LlamaParse returns results as parsed text; the other available options are markdown and JSON, which return a structured representation of the parsed object. In parsing_instruction, we can provide additional context to LlamaParse about the data. LlamaParse uses LLMs under the hood, allowing us to give it natural-language instructions about what to parse and how.
Next, we will load the Excel file and parse it using LlamaParse:
```python
file_extractor = {".xlsx": parser}
documents = SimpleDirectoryReader(
    input_files=["file_name.xlsx"],
    file_extractor=file_extractor,
).load_data()
```
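Before indexing, it can be useful to sanity-check what LlamaParse produced. This optional, illustrative step simply prints part of the parsed output:

```python
# Optional: inspect the parsed output before indexing it.
print(f"Parsed {len(documents)} document(s)")
print(documents[0].text[:500])  # first 500 characters of the parsed markdown
```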
Storing the parsed data using a Couchbase vector index
Before proceeding, ensure that you have an account on Couchbase Capella and have set up a vector index inside Couchbase. You can follow this guide to set up your cluster and the vector index. Once the account and index are ready, you can proceed.
The parsed data is stored in the documents variable. Next, it needs to be loaded into Couchbase. To achieve this, the documents are converted into a VectorStoreIndex, which is then stored in the Couchbase vector store; the documents are first embedded using Bedrock Embeddings so they can be stored in a format the vector store supports:
```python
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding

llm = Bedrock(model="mistral.mistral-large-2402-v1:0", region_name="us-east-1")
embeddings = BedrockEmbedding(model="amazon.titan-embed-text-v1")
```
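One convenient (though optional) way to make these the defaults for the rest of the pipeline is LlamaIndex's global Settings object, so that indexing and querying pick up the Bedrock models without passing them explicitly:

```python
from llama_index.core import Settings

# Use the Bedrock LLM and embedding model everywhere by default.
Settings.llm = llm
Settings.embed_model = embeddings
```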
We will also connect to the Couchbase instance:
```python
def connect_to_couchbase(connection_string, db_username, db_password):
    """Connect to Couchbase"""
    from couchbase.cluster import Cluster
    from couchbase.auth import PasswordAuthenticator
    from couchbase.options import ClusterOptions
    from datetime import timedelta

    auth = PasswordAuthenticator(db_username, db_password)
    options = ClusterOptions(auth)
    cluster = Cluster(connection_string, options)

    # Wait until the cluster is ready for use.
    cluster.wait_until_ready(timedelta(seconds=5))

    return cluster
```
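Calling the helper might look like the following; the connection string and credentials are placeholders for your own Capella values:

```python
# Placeholder values; replace with your Capella connection string and database credentials.
CONNECTION_STRING = "couchbases://cb.your-endpoint.cloud.couchbase.com"
DB_USERNAME = "your-username"
DB_PASSWORD = "your-password"

cluster = connect_to_couchbase(CONNECTION_STRING, DB_USERNAME, DB_PASSWORD)
```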
Now, we will call VectorStoreIndex to store the index in Couchbase. VectorStoreIndex takes the documents and splits them into nodes, then creates vector embeddings of each node's text using the specified embedding model (in this case, Bedrock embeddings), ready to be queried by an LLM.
LlamaIndex offers various file-based node parsers designed to generate nodes according to the specific content type being processed, such as JSON, Markdown, and other formats:
```python
from llama_index.core import StorageContext, VectorStoreIndex

vector_store = get_vector_store(
    cluster,
    DB_BUCKET,
    DB_SCOPE,
    DB_COLLECTION,
    INDEX_NAME,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
```
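The get_vector_store helper referenced above is not shown in the original snippet. A minimal sketch using the llama-index-vector-stores-couchbase package might look like this; the CouchbaseVectorStore import path and constructor arguments are assumptions based on that package, so check the current documentation for your version:

```python
from llama_index.vector_stores.couchbase import CouchbaseVectorStore

def get_vector_store(cluster, db_bucket, db_scope, db_collection, index_name):
    """Return a Couchbase-backed vector store for the given keyspace and search index (illustrative)."""
    return CouchbaseVectorStore(
        cluster=cluster,
        bucket_name=db_bucket,
        scope_name=db_scope,
        collection_name=db_collection,
        index_name=index_name,
    )
```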
Response generation from Amazon Bedrock
When a user uploads an Excel file, it’s parsed using LlamaParse and stored in a Couchbase vector store. For each user query, the system performs a vector search to retrieve relevant chunks of information from the stored Excel data. These relevant chunks are then used as context for the language model (Bedrock’s Mistral model) to generate a response.
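The snippet below assumes a Streamlit app in which the chat engine has already been created and stored in session state. A sketch of that setup might look like the following; the "context" chat mode and the surrounding variable names are assumptions for illustration, not code from the original app:

```python
import streamlit as st

# Assumed setup: a context chat engine that retrieves chunks from the Couchbase
# index and passes them to the Bedrock LLM as context for each question.
if "chat_engine_rag" not in st.session_state:
    st.session_state.chat_engine_rag = index.as_chat_engine(chat_mode="context", llm=llm)

rag_response = ""
message_placeholder = st.empty()
```

The streaming loop below then appends each generated chunk to the placeholder as it arrives: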
```python
rag_stream_response = st.session_state.chat_engine_rag.stream_chat(question)
for chunk in rag_stream_response.response_gen:
    rag_response += chunk
    message_placeholder.markdown(rag_response)
```
Results
We can now check the performance of our RAG application by examining its responses to various user queries.
Conclusion
This blog showed how to build a Retrieval Augmented Generation (RAG) system that simplifies the analysis of large amounts of Excel data. We did this by extracting information from the spreadsheet using LlamaParse, transforming it into a VectorStoreIndex, storing that index in Couchbase, and using Amazon Bedrock to generate responses grounded in the retrieved context.
- Get started with Couchbase Capella developer platform for free
- Read more posts and tutorials on Generative AI (GenAI)
- Learn more about LLM Embeddings