{"id":16210,"date":"2024-08-26T18:40:37","date_gmt":"2024-08-27T01:40:37","guid":{"rendered":"https:\/\/www.couchbase.com\/blog\/?p=16210"},"modified":"2024-09-03T11:08:00","modified_gmt":"2024-09-03T18:08:00","slug":"guide-to-data-prep-for-rag","status":"publish","type":"post","link":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/","title":{"rendered":"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">In today&#8217;s data-driven world, the ability to efficiently gather and prepare data is crucial for the success of any application. Whether you&#8217;re developing a chatbot, a recommendation system, or any AI-driven solution, the quality and structure of your data can make or break your project. In this article, we&#8217;ll take you on a journey to explore the process of information gathering and smart chunking, focusing on how to prepare data for <a href=\"https:\/\/www.couchbase.com\/blog\/an-overview-of-retrieval-augmented-generation\/\">Retrieval-Augmented Generation (RAG)<\/a> in any application with your database of choice.<\/span><\/p>\n<div id=\"attachment_16211\" style=\"width: 910px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-3.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-16211\" class=\"wp-image-16211 size-large\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-3-935x1024.png\" alt=\"\" width=\"900\" height=\"986\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-3-935x1024.png 935w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-3-274x300.png 274w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-3-768x841.png 768w, 
https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-3-300x329.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-3.png 1156w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><p id=\"caption-attachment-16211\" class=\"wp-caption-text\">High level overview of converting docs for RAG<\/p><\/div>\n<p>&nbsp;<\/p>\n<h2>Data Collection: The Foundation of RAG<\/h2>\n<h3>The Magic of Scrapy<\/h3>\n<p><span style=\"font-weight: 400;\">Imagine a spider, not the creepy kind, but a diligent librarian spider in the massive library of the internet. This spider, embodied by Scrapy&#8217;s <em>Spider<\/em> class, starts at the entrance (the starting URL) and methodically visits every room (webpage), collecting precious books (HTML pages). Whenever it finds a door to another room (a hyperlink), it opens it and continues its exploration, ensuring no room is left unchecked. This is how Scrapy works\u2014systematically and meticulously gathering every piece of information.<\/span><\/p>\n<h3>Leveraging Scrapy for Data Collection<\/h3>\n<p><a href=\"https:\/\/docs.scrapy.org\/en\/latest\/\"><span style=\"font-weight: 400;\">Scrapy<\/span><\/a><span style=\"font-weight: 400;\"> is a Python-based framework designed to extract data from websites. It&#8217;s like giving our librarian spider superpowers. With Scrapy, we can build web spiders that navigate through web pages and extract desired information with precision. In our case, we deploy Scrapy to crawl the Couchbase documentation website and download HTML pages for further processing and analysis.<\/span><\/p>\n<h3>Setting Up Your Scrapy Project<\/h3>\n<p><span style=\"font-weight: 400;\">Before our spider can start its journey, we need to set up a Scrapy project. 
Here\u2019s how you do it:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Install Scrapy<\/b><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">: If you haven&#8217;t already installed Scrapy, you can do so using pip:<\/span><\/span><\/span>\n<pre class=\"nums:false lang:default decode:true \">pip install scrapy\r\n<\/pre>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Create a New Scrapy Project<\/b><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">: Set up your new Scrapy project with the following command:<\/span><\/span><\/span>\n<pre class=\"nums:false lang:default decode:true \">scrapy startproject couchbase_docs\r\n<\/pre>\n<\/li>\n<\/ol>\n<h3>Crafting the Spider<\/h3>\n<p><span style=\"font-weight: 400;\">With the Scrapy project set up, we now create the spider that will crawl through the Couchbase documentation website and download HTML pages. 
Here\u2019s how it looks:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true\">from pathlib import Path\r\nimport scrapy\r\n\r\nclass CouchbaseSpider(scrapy.Spider):\r\n    name = \"couchbase\"\r\n    start_urls = [\"https:\/\/docs.couchbase.com\/home\/index.html\",]\r\n\r\n    def parse(self, response):\r\n        # Download HTML content of the current page\r\n        page = response.url.split(\"\/\")[-1]\r\n        filename = f\"{page}.html\"\r\n        Path(filename).write_bytes(response.body)\r\n        self.log(f\"Saved file {filename}\")\r\n\r\n        # Extract links and follow them\r\n        for href in response.css(\"ul a::attr(href)\").getall():\r\n            if href.endswith(\".html\") or \"docs.couchbase.com\" in href:\r\n                yield response.follow(href, self.parse)<\/pre>\n<h3>Running the Spider<\/h3>\n<p><span style=\"font-weight: 400;\">To run the spider and initiate the data collection process, execute the following command within the Scrapy project directory:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">scrapy crawl couchbase\r\n<\/pre>\n<p><span style=\"font-weight: 400;\">This command will start the spider, which will begin crawling the specified URLs and saving the HTML content. 
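The crawl scope hinges on the href check inside parse(). Factored out as a standalone predicate (a hypothetical helper, not part of the original spider), that rule can be unit-tested without running a crawl:

```python
# Hypothetical helper mirroring the spider's link filter: follow a link only
# if it points to an HTML page or stays on the Couchbase docs domain.
def should_follow(href: str) -> bool:
    return href.endswith(".html") or "docs.couchbase.com" in href
```

Keeping the rule in one place also makes it easy to tighten later, for example using urllib.parse to match the host exactly instead of by substring.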
The spider extracts links from each page and follows them recursively, ensuring comprehensive data collection.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">By automating data collection with Scrapy, we ensure all relevant HTML content from the Couchbase documentation website is retrieved efficiently and systematically, laying a solid foundation for further processing and analysis.<\/span><\/p>\n<h2>Extracting Text Content: Transforming Raw Data<\/h2>\n<p><span style=\"font-weight: 400;\">After collecting HTML pages from the Couchbase documentation website, the next crucial step is to extract the text content. This transforms raw data into a usable format for analysis and further processing. Additionally, we may have PDF files containing valuable data, which we&#8217;ll also extract. Here, we&#8217;ll discuss how to use Python scripts to parse HTML files and PDFs, extract text data, and store it for further processing.<\/span><\/p>\n<h3>Extracting Text from HTML Pages<\/h3>\n<p><span style=\"font-weight: 400;\">To extract text content from HTML pages, we&#8217;ll use a Python script that parses the HTML files and retrieves text data enclosed within\u00a0&lt;p&gt;\u00a0tags. 
This approach captures the main body of text from each page, excluding any HTML markup or structural elements.<\/span><\/p>\n<h4>Python Function for Text Extraction<\/h4>\n<p><span style=\"font-weight: 400;\">Below is a Python function that demonstrates how to extract text content from HTML pages and store it in text files:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">from bs4 import BeautifulSoup\r\n\r\ndef get_data(html_content):\r\n    soup = BeautifulSoup(html_content, \"html.parser\")\r\n\r\n    # Build a filename-safe title, stripping the site suffix if present\r\n    title = soup.title.string if soup.title and soup.title.string else \"untitled\"\r\n    if \" | Couchbase Docs\" in title:\r\n        title = title[:title.index(\" | Couchbase Docs\")]\r\n    title = title.replace(\" \", \"_\")\r\n\r\n    # Concatenate the text of every &lt;p&gt; tag\r\n    data = \" \".join(line.text for line in soup.find_all('p'))\r\n    return title, data<\/pre>\n<h3>How to Use<\/h3>\n<p><span style=\"font-weight: 400;\">To use the\u00a0<em>get_data()\u00a0<\/em>function, incorporate it into your Python script or application and provide the HTML content as a parameter. The function will return the extracted title and text content.<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">html_content = '&lt;html&gt;&lt;head&gt;&lt;title&gt;Sample Page&lt;\/title&gt;&lt;\/head&gt;&lt;body&gt;&lt;p&gt;This is a sample paragraph.&lt;\/p&gt;&lt;\/body&gt;&lt;\/html&gt;'\r\ntitle, text = get_data(html_content)\r\nprint(title)  # Output: Sample_Page\r\nprint(text)   # Output: This is a sample paragraph.<\/pre>\n<h3>Extracting Text Content from PDFs<\/h3>\n<p><span style=\"font-weight: 400;\">For extracting text content from PDFs, we&#8217;ll use a Python script that reads a PDF file and retrieves its data. 
This process ensures that all relevant textual information is captured for analysis.<\/span><\/p>\n<h4>Python Function for Text Extraction<\/h4>\n<p><span style=\"font-weight: 400;\">Below is a Python function that demonstrates how to extract text content from PDFs:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">from PyPDF2 import PdfReader\r\n\r\ndef extract_text_from_pdf(pdf_file):\r\n    reader = PdfReader(pdf_file)\r\n    text = ''\r\n    for page in reader.pages:\r\n        # extract_text() can return None for pages with no extractable text\r\n        text += page.extract_text() or ''\r\n    return text<\/pre>\n<h4>How to Use<\/h4>\n<p><span style=\"font-weight: 400;\">To use the\u00a0extract_text_from_pdf()\u00a0function, incorporate it into your Python script or application and provide the PDF file path as a parameter. The function will return the extracted text content.<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">pdf_path = 'sample.pdf'\r\ntext = extract_text_from_pdf(pdf_path)<\/pre>\n<p><span style=\"font-weight: 400;\">With the text content extracted and saved, we&#8217;ve completed the process of getting data from the Couchbase documentation.<\/span><\/p>\n<h2>Chunking: Making Data Manageable<\/h2>\n<p><span style=\"font-weight: 400;\">Imagine you have a lengthy novel and you want to create a summary. Instead of reading the entire book at once, you break it down into chapters, paragraphs, and sentences. This way, you can easily understand and process each part, making the task more manageable. Similarly, chunking in text processing helps in dividing large texts into smaller, meaningful units. By organizing text into manageable chunks, we can facilitate easier processing, retrieval, and analysis of information.<\/span><\/p>\n<h3>Semantic and Content Chunking for RAG<\/h3>\n<p><span style=\"font-weight: 400;\">For Retrieval-Augmented Generation (RAG), chunking is particularly important. 
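Before turning to library-based splitters, the baseline idea is worth seeing in isolation: cut the text into fixed-size windows and let consecutive windows overlap slightly, so a sentence severed at one boundary still appears whole in a neighboring chunk. A minimal, character-based sketch (illustrative only, not the method used in this article):

```python
# Minimal fixed-window chunker: chunk_size and overlap are character counts.
# Consecutive chunks share `overlap` characters so boundary text is not lost.
def fixed_chunks(text: str, chunk_size: int = 2000, overlap: int = 50):
    step = max(1, chunk_size - overlap)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The weakness of this baseline, cutting mid-word and mid-sentence, is exactly what the separator-aware and semantic approaches below are designed to avoid.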
We implemented both semantic and content chunking methods to optimize the data for the RAG process, which involves retrieving relevant information and generating responses based on that information.<\/span><\/p>\n<h4>Recursive Character Text Splitter<b><br \/>\n<\/b><\/h4>\n<p><a href=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image3-3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-16212\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image3-3-1024x172.png\" alt=\"\" width=\"900\" height=\"151\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image3-3-1024x172.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image3-3-300x51.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image3-3-768x129.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image3-3.png 1182w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><\/p>\n<p><a href=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-16213\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2-1024x489.png\" alt=\"\" width=\"900\" height=\"430\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2-1024x489.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2-300x143.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2-768x366.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2-1536x733.png 1536w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2-1320x630.png 1320w, 
https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image4-2.png 1788w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><\/p>\n<p><a href=\"https:\/\/api.python.langchain.com\/en\/latest\/character\/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html\"><span style=\"font-weight: 400;\">Recursive character text splitter<\/span><\/a><span style=\"font-weight: 400;\"> chunking by Langchain involves breaking down a piece of text into smaller chunks using recursive patterns within the characters of the text. This technique utilizes separators such as\u00a0<code>\\n\\n<\/code>\u00a0(double newline),\u00a0<code>\\n<\/code> (newline),\u00a0(space), and\u00a0<code>\"\"<\/code>\u00a0(empty string).<\/span><\/p>\n<h4>Semantic Chunking<\/h4>\n<p><a href=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image5-2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-16215\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image5-2-1024x172.png\" alt=\"\" width=\"900\" height=\"151\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image5-2-1024x172.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image5-2-300x51.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image5-2-768x129.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image5-2.png 1182w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><\/p>\n<p><a href=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-16214\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3-1024x489.png\" alt=\"\" width=\"900\" height=\"430\" 
srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3-1024x489.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3-300x143.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3-768x367.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3-1536x734.png 1536w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3-1320x631.png 1320w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image1-3.png 1785w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Semantic chunking is a text processing technique that focuses on grouping words or phrases based on their semantic meaning or context. This approach enhances understanding by creating meaningful chunks that capture the underlying relationships in the text. It is particularly useful for tasks that require detailed analysis of text structure and content organization.<\/span><\/p>\n<h3>Implementation of Semantic and Content Chunking<\/h3>\n<p><span style=\"font-weight: 400;\">For our project, we implemented both semantic and content chunking methods. Semantic chunking preserves the hierarchical structure of the text, ensuring that each chunk maintains its contextual integrity. 
Content chunking was applied to remove redundant chunks and optimize storage and processing efficiency.<\/span><\/p>\n<h4>Python Implementation<\/h4>\n<p><span style=\"font-weight: 400;\">Here\u2019s a Python implementation of semantic and content chunking:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">import hashlib\r\nfrom langchain.text_splitter import RecursiveCharacterTextSplitter\r\n\r\n# Global set to store unique chunk hash values across all files\r\nglobal_unique_hashes = set()\r\n\r\ndef hash_text(text):\r\n    # Generate a hash value for the text using SHA-256\r\n    hash_object = hashlib.sha256(text.encode())\r\n    return hash_object.hexdigest()\r\n\r\ndef chunk_text(text, title, Chunk_size=2000, Overlap=50, Length_function=len, debug_mode=0):\r\n    global global_unique_hashes\r\n\r\n    chunks = RecursiveCharacterTextSplitter(\r\n        chunk_size=Chunk_size,\r\n        chunk_overlap=Overlap,\r\n        length_function=Length_function\r\n    ).create_documents([text])\r\n\r\n    if debug_mode:\r\n        for idx, chunk in enumerate(chunks):\r\n            print(f\"Chunk {idx+1}: {chunk}\\n\")\r\n        print('\\n')\r\n\r\n    # Deduplication mechanism: keep a chunk only if its hash is new\r\n    unique_chunks = []\r\n    for chunk in chunks:\r\n        chunk_hash = hash_text(chunk.page_content)\r\n        if chunk_hash not in global_unique_hashes:\r\n            unique_chunks.append(chunk)\r\n            global_unique_hashes.add(chunk_hash)\r\n\r\n    # Prefix each surviving chunk with the page title for extra context\r\n    for sentence in unique_chunks:\r\n        sentence.page_content = title + \" \" + sentence.page_content\r\n\r\n    return unique_chunks<\/pre>\n<p><span 
style=\"font-weight: 400;\">These optimized chunks are then embedded and stored in the Couchbase cluster for efficient retrieval, ensuring seamless integration with the RAG process.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">By employing both semantic and content chunking techniques, we effectively structured and optimized textual data for the RAG process and storage in the Couchbase cluster. The next step is to embed the chunks we just generated.<\/span><\/p>\n<h2>Embedding Chunks: Mapping the Galaxy of Data<\/h2>\n<p><span style=\"font-weight: 400;\">Imagine each chunk of text as a star in a vast galaxy. By embedding these chunks, we assign each star a precise location in this galaxy, based on its characteristics and relationships with other stars. This spatial mapping allows us to navigate the galaxy more effectively, finding connections and understanding the broader universe of information.<\/span><\/p>\n<h3>Embedding Text Chunks for RAG<\/h3>\n<p><span style=\"font-weight: 400;\">Embedding text chunks is a crucial step in the Retrieval-Augmented Generation (RAG) process. It involves transforming text into numerical vectors that capture the semantic meaning and context of each chunk, making it easier for machine learning models to analyze and generate responses.<\/span><\/p>\n<h3>Utilizing the BAAI Model BGE-M3<\/h3>\n<p><span style=\"font-weight: 400;\">To embed the chunks, we use the <\/span><a href=\"https:\/\/huggingface.co\/BAAI\/bge-m3\"><span style=\"font-weight: 400;\">BAAI model BGE-M3<\/span><\/a><span style=\"font-weight: 400;\">. This model is capable of embedding text into a high-dimensional vector space, capturing the semantic meaning and context of each chunk.<\/span><\/p>\n<h3>Embedding Function<\/h3>\n<p><span style=\"font-weight: 400;\">The embedding function takes the chunks generated from the previous step and embeds each chunk into a 1024-dimensional vector space using the BAAI model BGE-M3. 
This process enhances the representation of each chunk, facilitating more accurate and contextually rich analysis.<\/span><\/p>\n<h3>Python Script for Embedding<\/h3>\n<p><span style=\"font-weight: 400;\">Here&#8217;s a Python script that demonstrates how to embed text chunks using the BAAI model BGE-M3:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">import json\r\nimport numpy as np\r\nfrom json import JSONEncoder\r\nfrom FlagEmbedding import BGEM3FlagModel\r\n\r\nembed_model = BGEM3FlagModel('BAAI\/bge-m3', use_fp16=True)\r\n\r\nclass NumpyEncoder(JSONEncoder):\r\n    def default(self, obj):\r\n        if isinstance(obj, np.ndarray):\r\n            return obj.tolist()\r\n        return JSONEncoder.default(self, obj)\r\n\r\ndef embed(chunks):\r\n    embedded_chunks = []\r\n    for sentence in chunks:\r\n        emb = embed_model.encode(str(sentence.page_content), batch_size=12, max_length=600)['dense_vecs']\r\n        embedding = np.array(emb)\r\n        np.set_printoptions(suppress=True)\r\n\r\n        json_dump = json.dumps(embedding, cls=NumpyEncoder)\r\n        embedded_chunk = {\r\n            \"data\": str(sentence.page_content),\r\n            \"vector_data\": json.loads(json_dump)\r\n        }\r\n        embedded_chunks.append(embedded_chunk)\r\n    return embedded_chunks<\/pre>\n<h3>How to Use<\/h3>\n<p><span style=\"font-weight: 400;\">To use the\u00a0embed()\u00a0function, incorporate it into your Python script or application and provide the chunks generated from the previous steps as input. 
The function will return a list of embedded chunks.<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">from langchain.docstore.document import Document\r\n\r\n# The chunks produced by the splitter expose a page_content attribute,\r\n# which embed() reads, so plain dicts will not work here\r\nchunks = [\r\n    Document(page_content=\"This is the first chunk of text.\"),\r\n    Document(page_content=\"This is the second chunk of text.\")\r\n]\r\n\r\nembedded_chunks = embed(chunks)<\/pre>\n<p><span style=\"font-weight: 400;\">These optimized chunks, now embedded in a high-dimensional vector space, are ready for storage and retrieval, ensuring efficient utilization of resources and seamless integration with the RAG process. By embedding the text chunks, we transform raw text into a format that machine learning models can efficiently process and analyze, enabling more accurate and contextually aware responses in the RAG system.<\/span><\/p>\n<h2>Storing Embedded Chunks: Ensuring Efficient Retrieval<\/h2>\n<p><span style=\"font-weight: 400;\">Once the text chunks have been embedded, the next step is to store these vectors in a database. These embedded chunks can be pushed into vector databases or traditional databases with vector search support, such as Couchbase, Elasticsearch, or Pinecone, to facilitate efficient retrieval for Retrieval-Augmented Generation (RAG) applications.<\/span><\/p>\n<h3>Vector Databases<\/h3>\n<p><span style=\"font-weight: 400;\">Vector databases are designed specifically to handle and search through high-dimensional vectors efficiently. By storing embedded chunks in a vector database, we can leverage advanced search capabilities to quickly retrieve the most relevant information based on the context and semantic meaning of queries.<\/span><\/p>\n<h3>Integrating with RAG Applications<\/h3>\n<p><span style=\"font-weight: 400;\">With the data prepared and stored, it is now ready to be used in RAG applications. 
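At query time, retrieval reduces to nearest-neighbor search: embed the question with the same model, then rank stored chunks by cosine similarity. A brute-force sketch over the {"data", "vector_data"} records produced earlier (a vector database replaces this linear scan with an approximate-nearest-neighbor index):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors; 0.0 for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, stored, k=3):
    # stored: list of {"data": ..., "vector_data": ...} records as built above
    ranked = sorted(stored, key=lambda c: cosine(query_vec, c["vector_data"]), reverse=True)
    return [c["data"] for c in ranked[:k]]
```

The retrieved chunk texts are then passed to the language model as context for generation.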
The embedded vectors enable these applications to retrieve contextually relevant information and generate more accurate and meaningful responses, enhancing the overall user experience.<\/span><\/p>\n<h2>Conclusion<\/h2>\n<p><span style=\"font-weight: 400;\">By following this guide, we have successfully prepared data for Retrieval-Augmented Generation. We covered data collection using Scrapy, text content extraction from HTML and PDFs, chunking techniques, and embedding text chunks using the BAAI model BGE-M3. These steps ensure that the data is organized, optimized, and ready for use in RAG applications.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For more such technical and engaging content, check out other <a href=\"https:\/\/www.couchbase.com\/blog\/category\/vector-search\/\">vector search-related blogs<\/a> on our website and stay tuned for the next part in this series.<\/span><\/p>\n<h3>References<\/h3>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li aria-level=\"1\">What is <a href=\"https:\/\/www.couchbase.com\/blog\/an-overview-of-retrieval-augmented-generation\/\">Retrieval Augmented Generation (RAG)<\/a>?<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/docs.scrapy.org\/en\/latest\/\"><span style=\"font-weight: 400;\">Scrapy docs<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/api.python.langchain.com\/en\/latest\/character\/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html\"><span style=\"font-weight: 400;\">LangChain Recursive Character Text Splitter docs<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/huggingface.co\/BAAI\/bge-m3\"><span style=\"font-weight: 400;\">BAAI BGE-M3 model on Hugging Face<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Contributors<\/h3>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><a href=\"https:\/\/in.linkedin.com\/in\/sanjivanipatra\"><span style=\"font-weight: 
400;\">Sanjivani Patra<\/span><\/a><\/li>\n<li><a href=\"https:\/\/www.linkedin.com\/in\/nishanth-vm?utm_source=share&amp;utm_campaign=share_via&amp;utm_content=profile&amp;utm_medium=android_app\"><span style=\"font-weight: 400;\">Nishanth VM<\/span><\/a><\/li>\n<li><a href=\"https:\/\/www.linkedin.com\/in\/ashokkumaralluri\/\"><span style=\"font-weight: 400;\">Ashok Kumar Alluri<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In today&#8217;s data-driven world, the ability to efficiently gather and prepare data is crucial for the success of any application. Whether you&#8217;re developing a chatbot, a recommendation system, or any AI-driven solution, the quality and structure of your data can [&hellip;]<\/p>\n","protected":false},"author":85513,"featured_media":16216,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[1814,1815,9973,9139,9937],"tags":[9923,9924],"ppma_author":[10012],"class_list":["post-16210","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-application-design","category-best-practices-and-tutorials","category-generative-ai-genai","category-python","category-vector-search","tag-embeddings","tag-rag-retrieval-augmented-generation"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.7.1 (Yoast SEO v25.7) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)<\/title>\n<meta name=\"description\" content=\"MData gathering, chunking &amp; embedding techniques for efficient Retrieval-Augmented Generation (RAG) with Scrapy, Python, and BAAI model.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)\" \/>\n<meta property=\"og:description\" content=\"MData gathering, chunking &amp; embedding techniques for efficient Retrieval-Augmented Generation (RAG) with Scrapy, Python, and BAAI model.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\" \/>\n<meta property=\"og:site_name\" content=\"The Couchbase Blog\" \/>\n<meta property=\"article:published_time\" content=\"2024-08-27T01:40:37+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-03T18:08:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2200\" \/>\n\t<meta property=\"og:image:height\" content=\"1400\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sanjivani Patra - Software Engineer\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sanjivani Patra - Software Engineer\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\"},\"author\":{\"name\":\"Sanjivani Patra - Software Engineer\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250\"},\"headline\":\"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)\",\"datePublished\":\"2024-08-27T01:40:37+00:00\",\"dateModified\":\"2024-09-03T18:08:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\"},\"wordCount\":1628,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png\",\"keywords\":[\"embeddings\",\"RAG retrieval-augmented generation\"],\"articleSection\":[\"Application Design\",\"Best Practices and Tutorials\",\"Generative AI (GenAI)\",\"Python\",\"Vector Search\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\",\"name\":\"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation 
(RAG)\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png\",\"datePublished\":\"2024-08-27T01:40:37+00:00\",\"dateModified\":\"2024-09-03T18:08:00+00:00\",\"description\":\"MData gathering, chunking & embedding techniques for efficient Retrieval-Augmented Generation (RAG) with Scrapy, Python, and BAAI model.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png\",\"width\":2200,\"height\":1400},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.couchbase.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"name\":\"The Couchbase Blog\",\"description\":\"Couchbase, the NoSQL 
Database\",\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\",\"name\":\"The Couchbase Blog\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"width\":218,\"height\":34,\"caption\":\"The Couchbase Blog\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250\",\"name\":\"Sanjivani Patra - Software Engineer\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/e4335546d16f0f88af0dd52b94ab89d7\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g\",\"caption\":\"Sanjivani Patra - Software Engineer\"},\"url\":\"https:\/\/www.couchbase.com\/blog\/author\/sanjivanipatra\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)","description":"MData gathering, chunking & embedding techniques for efficient Retrieval-Augmented Generation (RAG) with Scrapy, Python, and BAAI model.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/","og_locale":"en_US","og_type":"article","og_title":"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)","og_description":"MData gathering, chunking & embedding techniques for efficient Retrieval-Augmented Generation (RAG) with Scrapy, Python, and BAAI model.","og_url":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/","og_site_name":"The Couchbase Blog","article_published_time":"2024-08-27T01:40:37+00:00","article_modified_time":"2024-09-03T18:08:00+00:00","og_image":[{"width":2200,"height":1400,"url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png","type":"image\/png"}],"author":"Sanjivani Patra - Software Engineer","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Sanjivani Patra - Software Engineer","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#article","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/"},"author":{"name":"Sanjivani Patra - Software Engineer","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250"},"headline":"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)","datePublished":"2024-08-27T01:40:37+00:00","dateModified":"2024-09-03T18:08:00+00:00","mainEntityOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/"},"wordCount":1628,"commentCount":0,"publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png","keywords":["embeddings","RAG retrieval-augmented generation"],"articleSection":["Application Design","Best Practices and Tutorials","Generative AI (GenAI)","Python","Vector Search"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/","url":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/","name":"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation 
(RAG)","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png","datePublished":"2024-08-27T01:40:37+00:00","dateModified":"2024-09-03T18:08:00+00:00","description":"MData gathering, chunking & embedding techniques for efficient Retrieval-Augmented Generation (RAG) with Scrapy, Python, and BAAI model.","breadcrumb":{"@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#primaryimage","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/08\/image2-4.46.15\u202fPM.png","width":2200,"height":1400},{"@type":"BreadcrumbList","@id":"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.couchbase.com\/blog\/"},{"@type":"ListItem","position":2,"name":"A Step-by-Step Guide to Preparing Data for Retrieval-Augmented Generation (RAG)"}]},{"@type":"WebSite","@id":"https:\/\/www.couchbase.com\/blog\/#website","url":"https:\/\/www.couchbase.com\/blog\/","name":"The Couchbase Blog","description":"Couchbase, the NoSQL 
Database","publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.couchbase.com\/blog\/#organization","name":"The Couchbase Blog","url":"https:\/\/www.couchbase.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","width":218,"height":34,"caption":"The Couchbase Blog"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250","name":"Sanjivani Patra - Software Engineer","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/e4335546d16f0f88af0dd52b94ab89d7","url":"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g","caption":"Sanjivani Patra - Software Engineer"},"url":"https:\/\/www.couchbase.com\/blog\/author\/sanjivanipatra\/"}]}},"authors":[{"term_id":10012,"user_id":85513,"is_guest":0,"slug":"sanjivanipatra","display_name":"Sanjivani Patra - Software Engineer","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g","author_category":"","last_name":"Patra - Software 
Engineer","first_name":"Sanjivani","job_title":"","user_url":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/16210","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/users\/85513"}],"replies":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/comments?post=16210"}],"version-history":[{"count":0,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/16210\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media\/16216"}],"wp:attachment":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media?parent=16210"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/categories?post=16210"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/tags?post=16210"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=16210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}