In today’s data-driven world, the ability to efficiently gather and prepare data is crucial for the success of any application. Whether you’re developing a chatbot, a recommendation system, or any AI-driven solution, the quality and structure of your data can make or break your project. In this article, we’ll take you on a journey to explore the process of information gathering and smart chunking, focusing on how to prepare data for Retrieval-Augmented Generation (RAG) in any application with your database of choice.
Data Collection: The Foundation of RAG
The Magic of Scrapy
Imagine a spider, not the creepy kind, but a diligent librarian spider in the massive library of the internet. This spider, embodied by Scrapy’s Spider class, starts at the entrance (the starting URL) and methodically visits every room (webpage), collecting precious books (HTML pages). Whenever it finds a door to another room (a hyperlink), it opens it and continues its exploration, ensuring no room is left unchecked. This is how Scrapy works—systematically and meticulously gathering every piece of information.
Leveraging Scrapy for Data Collection
Scrapy is a Python-based framework designed to extract data from websites. It’s like giving our librarian spider superpowers. With Scrapy, we can build web spiders that navigate through web pages and extract desired information with precision. In our case, we deploy Scrapy to crawl the Couchbase documentation website and download HTML pages for further processing and analysis.
Setting Up Your Scrapy Project
Before our spider can start its journey, we need to set up a Scrapy project. Here’s how you do it:
- Install Scrapy: If you haven’t already installed Scrapy, you can do so using pip:
pip install scrapy
- Create a New Scrapy Project: Set up your new Scrapy project with the following command (the generated project layout is sketched after this list):
scrapy startproject couchbase_docs
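For reference, the startproject command generates a standard skeleton like the one below (shown for the couchbase_docs project; exact files can vary slightly between Scrapy versions):

couchbase_docs/
    scrapy.cfg            # deploy/configuration file
    couchbase_docs/       # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # spiders live here
            __init__.py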
Crafting the Spider
With the Scrapy project set up, we now create the spider that will crawl through the Couchbase documentation website and download HTML pages. Place the spider file in the project’s spiders/ directory. Here’s how it looks:
from pathlib import Path

import scrapy


class CouchbaseSpider(scrapy.Spider):
    name = "couchbase"
    start_urls = ["https://docs.couchbase.com/home/index.html"]

    def parse(self, response):
        # Download the HTML content of the current page
        page = response.url.split("/")[-1]
        filename = f"{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")

        # Extract links and follow them
        for href in response.css("ul a::attr(href)").getall():
            if href.endswith(".html") or "docs.couchbase.com" in href:
                yield response.follow(href, self.parse)
Running the Spider
To run the spider and initiate the data collection process, execute the following command within the Scrapy project directory:
scrapy crawl couchbase
This command will start the spider, which will begin crawling the specified URLs and saving the HTML content. The spider extracts links from each page and follows them recursively, ensuring comprehensive data collection.
By automating data collection with Scrapy, we ensure all relevant HTML content from the Couchbase documentation website is retrieved efficiently and systematically, laying a solid foundation for further processing and analysis.
Extracting Text Content: Transforming Raw Data
After collecting HTML pages from the Couchbase documentation website, the next crucial step is to extract the text content. This transforms raw data into a usable format for analysis and further processing. Additionally, we may have PDF files containing valuable data, which we’ll also extract. Here, we’ll discuss how to use Python scripts to parse HTML files and PDFs, extract text data, and store it for further processing.
Extracting Text from HTML Pages
To extract text content from HTML pages, we’ll use a Python script that parses the HTML files and retrieves text data enclosed within <p> tags. This approach captures the main body of text from each page, excluding any HTML markup or structural elements.
Python Function for Text Extraction
Below is a Python function that demonstrates how to extract text content from HTML pages and store it in text files:
from bs4 import BeautifulSoup


def get_data(html_content):
    soup = BeautifulSoup(html_content, "html.parser")

    # Derive a clean title, stripping the " | Couchbase Docs" suffix if present
    title = str(soup).split('<title>')[1].split('</title>')[0]
    if " | Couchbase Docs" in title:
        title = title[:title.index(" | Couchbase Docs")].replace(" ", "_")
    else:
        title = title.replace(" ", "_")

    # Collect the text of every <p> tag
    data = ""
    lines = soup.find_all('p')
    for line in lines:
        data += " " + line.text

    return title, data
How to Use?
To use the get_data() function, incorporate it into your Python script or application and provide the HTML content as a parameter. The function will return the extracted title and text content.
html_content = '<html><head><title>Sample Page</title></head><body><p>This is a sample paragraph.</p></body></html>'
title, text = get_data(html_content)
print(title)  # Output: Sample_Page
print(text)   # Output: This is a sample paragraph.
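To process everything the spider downloaded, you can apply get_data() to each saved page and write the text out for the chunking step. Here is a minimal sketch, assuming the HTML files sit in the current directory and are written to an extracted_text/ folder (both paths are illustrative, not from the original project):

from pathlib import Path

output_dir = Path("extracted_text")   # illustrative output folder
output_dir.mkdir(exist_ok=True)

for html_file in Path(".").glob("*.html"):
    title, text = get_data(html_file.read_text(encoding="utf-8", errors="ignore"))
    # One text file per page, named after the extracted title
    (output_dir / f"{title}.txt").write_text(text, encoding="utf-8")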
Extracting Text Content from PDFs
For extracting text content from PDFs, we’ll use a Python script that reads a PDF file and retrieves its data. This process ensures that all relevant textual information is captured for analysis.
Python Function for Text Extraction
Below is a Python function that demonstrates how to extract text content from PDFs:
from PyPDF2 import PdfReader


def extract_text_from_pdf(pdf_file):
    # Read the PDF and concatenate the text of every page
    reader = PdfReader(pdf_file)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text
How to Use?
To use the extract_text_from_pdf() function, incorporate it into your Python script or application and provide the PDF file path as a parameter. The function will return the extracted text content.
pdf_path = 'sample.pdf'
text = extract_text_from_pdf(pdf_path)
With the text content extracted and saved, we’ve completed the process of getting data from the Couchbase documentation.
Chunking: Making Data Manageable
Imagine you have a lengthy novel and you want to create a summary. Instead of reading the entire book at once, you break it down into chapters, paragraphs, and sentences. This way, you can easily understand and process each part, making the task more manageable. Similarly, chunking in text processing helps in dividing large texts into smaller, meaningful units. By organizing text into manageable chunks, we can facilitate easier processing, retrieval, and analysis of information.
Semantic and Content Chunking for RAG
For Retrieval-Augmented Generation (RAG), chunking is particularly important. We implemented both semantic and content chunking methods to optimize the data for the RAG process, which involves retrieving relevant information and generating responses based on that information.
Recursive Character Text Splitter
LangChain’s recursive character text splitter breaks a piece of text into smaller chunks by trying a prioritized list of separators in order: "\n\n" (double newline), "\n" (newline), " " (space), and "" (empty string). A chunk is split again with the next, finer separator only if it still exceeds the target chunk size, as illustrated below.
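Here is a minimal, self-contained illustration of that behavior (the sample text and sizes are made up for demonstration):

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = (
    "Couchbase is a distributed NoSQL database.\n\n"
    "It supports key-value and query workloads.\n"
    "It also offers full-text and vector search."
)

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried in order, from coarsest to finest
    chunk_size=60,
    chunk_overlap=10,
    length_function=len,
)

for doc in splitter.create_documents([sample]):
    print(repr(doc.page_content))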
Semantic Chunking
Semantic chunking is a text processing technique that focuses on grouping words or phrases based on their semantic meaning or context. This approach enhances understanding by creating meaningful chunks that capture the underlying relationships in the text. It is particularly useful for tasks that require detailed analysis of text structure and content organization.
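The implementation shown below relies on the recursive character splitter; if you want an explicitly semantic splitter, LangChain’s experimental SemanticChunker is one option. A minimal sketch, assuming the langchain_experimental package and a sentence-embedding model are available (this is not part of the original code, and the model choice is an assumption):

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

# Split wherever the embedding distance between neighbouring sentences spikes
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")  # assumed model choice
semantic_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")

docs = semantic_splitter.create_documents(["Your long extracted text goes here."])
for doc in docs:
    print(doc.page_content[:80])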
Implementation of Semantic and Content Chunking
For our project, we implemented both semantic and content chunking methods. Semantic chunking preserves the hierarchical structure of the text, ensuring that each chunk maintains its contextual integrity. Content chunking was applied to remove redundant chunks and optimize storage and processing efficiency.
Python Implementation
Here’s a Python implementation of semantic and content chunking:
import hashlib

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Global set to store unique chunk hash values across all files
global_unique_hashes = set()


def hash_text(text):
    # Generate a hash value for the text using SHA-256
    hash_object = hashlib.sha256(text.encode())
    return hash_object.hexdigest()


def chunk_text(text, title, chunk_size=2000, overlap=50, length_function=len, debug_mode=0):
    global global_unique_hashes

    chunks = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=length_function
    ).create_documents([text])

    if debug_mode:
        for idx, chunk in enumerate(chunks):
            print(f"Chunk {idx + 1}: {chunk}\n")
        print('\n')

    # Deduplication mechanism: keep a chunk only if its hash has not been seen before
    unique_chunks = []
    for chunk in chunks:
        chunk_hash = hash_text(chunk.page_content)
        if chunk_hash not in global_unique_hashes:
            unique_chunks.append(chunk)
            global_unique_hashes.add(chunk_hash)

    # Prepend the page title so every chunk keeps its source context
    for sentence in unique_chunks:
        sentence.page_content = title + " " + sentence.page_content

    return unique_chunks
These optimized chunks are then embedded and stored in the Couchbase cluster for efficient retrieval, ensuring seamless integration with the RAG process.
By employing both semantic and content chunking techniques, we effectively structured and optimized textual data for the RAG process and storage in the Couchbase cluster. The next step is to embed the chunks we just generated.
Embedding Chunks: Mapping the Galaxy of Data
Imagine each chunk of text as a star in a vast galaxy. By embedding these chunks, we assign each star a precise location in this galaxy, based on its characteristics and relationships with other stars. This spatial mapping allows us to navigate the galaxy more effectively, finding connections and understanding the broader universe of information.
Embedding Text Chunks for RAG
Embedding text chunks is a crucial step in the Retrieval-Augmented Generation (RAG) process. It involves transforming text into numerical vectors that capture the semantic meaning and context of each chunk, making it easier for machine learning models to analyze and generate responses.
Utilizing the BAAI Model BGE-M3
To embed the chunks, we use the BAAI model BGE-M3. This model is capable of embedding text into a high-dimensional vector space, capturing the semantic meaning and context of each chunk.
Embedding Function
The embedding function takes the chunks generated from the previous step and embeds each chunk into a 1024-dimensional vector space using the BAAI model BGE-M3. This process enhances the representation of each chunk, facilitating more accurate and contextually rich analysis.
Python Script for Embedding
Here’s a Python script that demonstrates how to embed text chunks using the BAAI model BGE-M3:
import json

import numpy as np
from json import JSONEncoder
from FlagEmbedding import BGEM3FlagModel  # BGEM3FlagModel is provided by the FlagEmbedding package

embed_model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)


class NumpyEncoder(JSONEncoder):
    # Make numpy arrays JSON-serializable by converting them to plain lists
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return JSONEncoder.default(self, obj)


def embed(chunks):
    embedded_chunks = []
    for sentence in chunks:
        # Encode the chunk into a 1024-dimensional dense vector
        emb = embed_model.encode(str(sentence.page_content), batch_size=12, max_length=600)['dense_vecs']
        embedding = np.array(emb)
        np.set_printoptions(suppress=True)
        json_dump = json.dumps(embedding, cls=NumpyEncoder)
        embedded_chunk = {
            "data": str(sentence.page_content),
            "vector_data": json.loads(json_dump)
        }
        embedded_chunks.append(embedded_chunk)
    return embedded_chunks
How to Use?
To use the embed() function, incorporate it into your Python script or application and provide the chunks generated from the previous steps as input. The function will return a list of embedded chunks.
from langchain.schema import Document

# In practice these would be the Document chunks returned by chunk_text() in the previous step
chunks = [
    Document(page_content="This is the first chunk of text."),
    Document(page_content="This is the second chunk of text."),
]

embedded_chunks = embed(chunks)
These optimized chunks, now embedded in a high-dimensional vector space, are ready for storage and retrieval, ensuring efficient utilization of resources and seamless integration with the RAG process. By embedding the text chunks, we transform raw text into a format that machine learning models can efficiently process and analyze, enabling more accurate and contextually aware responses in the RAG system.
Storing Embedded Chunks: Ensuring Efficient Retrieval
Once the text chunks have been embedded, the next step is to store these vectors in a database. These embedded chunks can be pushed into vector databases or traditional databases with vector search support, such as Couchbase, Elasticsearch, or Pinecone, to facilitate efficient retrieval for Retrieval-Augmented Generation (RAG) applications.
Vector Databases
Vector databases are designed specifically to handle and search through high-dimensional vectors efficiently. By storing embedded chunks in a vector database, we can leverage advanced search capabilities to quickly retrieve the most relevant information based on the context and semantic meaning of queries.
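As a concrete illustration, here is a minimal sketch of pushing the embedded chunks into Couchbase with the Python SDK; the connection string, credentials, and bucket/scope/collection names below are placeholders rather than values from the original project:

import uuid

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Placeholder connection details -- replace with your own cluster settings
cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
collection = cluster.bucket("rag_data").scope("_default").collection("_default")


def store_embedded_chunks(embedded_chunks):
    # Each document keeps the raw text next to its embedding, so a vector
    # search index can later be built on the "vector_data" field
    for chunk in embedded_chunks:
        collection.upsert(str(uuid.uuid4()), chunk)


store_embedded_chunks(embedded_chunks)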
Integrating with RAG Applications
With the data prepared and stored, it is now ready to be used in RAG applications. The embedded vectors enable these applications to retrieve contextually relevant information and generate more accurate and meaningful responses, enhancing the overall user experience.
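To make the retrieval side concrete without tying it to a particular database, here is a minimal, database-agnostic sketch that embeds a query with the same BGE-M3 model and ranks the stored chunks by cosine similarity; in a production RAG application this ranking is delegated to the vector index of the datastore you chose:

import numpy as np


def retrieve(query, embedded_chunks, top_k=3):
    # Embed the query with the same model used for the chunks
    query_vec = np.array(embed_model.encode(query, batch_size=1, max_length=600)['dense_vecs'])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Score every stored chunk against the query and keep the best matches
    scored = [
        (cosine(query_vec, np.array(chunk["vector_data"])), chunk["data"])
        for chunk in embedded_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


for score, passage in retrieve("How do I create a vector search index?", embedded_chunks):
    print(f"{score:.3f}  {passage[:80]}")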
Conclusion
By following this guide, we have successfully prepared data for Retrieval-Augmented Generation. We covered data collection using Scrapy, text content extraction from HTML and PDFs, chunking techniques, and embedding text chunks using the BAAI model BGE-M3. These steps ensure that the data is organized, optimized, and ready for use in RAG applications.
For more such technical and engaging content, check out other vector search-related blogs on our website and stay tuned for the next part in this series.