Fine-tuning machine learning models starts with well-prepared datasets. This guide walks you through how to create them, from gathering raw data to generating instruction files. By the end, you’ll be equipped with practical knowledge and tools to prepare high-quality datasets for your fine-tuning tasks.

This post continues our detailed guides on preparing data for RAG and building end-to-end RAG applications with Couchbase vector search.

High-Level Overview

Data collection/gathering

The first step is gathering data from various sources. This involves collecting raw information that will later be cleaned and organized into structured datasets.

For an in-depth, step-by-step guide on preparing data for retrieval augmented generation, please refer to our comprehensive blog post: “Step by Step Guide to Prepare Data for Retrieval Augmented Generation”.

Our approach to data collection

In our approach, we utilized multiple methods to gather all relevant data:

    1. Web scraping using Scrapy:
      • Scrapy is a powerful Python framework for extracting data from websites. It allows you to write spiders that crawl websites and scrape data efficiently. 
    2. Extracting documents from Confluence:
      • We directly downloaded documents stored within our Confluence workspace. This can also be done by using the Confluence API, which involves writing scripts to automate the extraction process.
    3. Retrieving relevant files from Git repositories:
      • Custom scripts were written to clone repositories and pull relevant files. This ensured we gathered all necessary data stored within our version control systems.

By combining these methods, we ensured a comprehensive and efficient data collection process, covering all necessary sources.
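
For illustration, here is a minimal sketch of a Scrapy spider along the lines described above; the domain, start URL, and item fields are placeholders, not the spider we actually ran:

```python
import scrapy


class DocsSpider(scrapy.Spider):
    """Crawl a documentation site and yield the raw HTML of each page."""

    name = "docs"
    allowed_domains = ["example.com"]              # placeholder domain
    start_urls = ["https://example.com/docs/"]     # placeholder start URL

    def parse(self, response):
        # Keep the raw HTML so text can be extracted in a later step
        yield {"url": response.url, "html": response.text}

        # Follow links on the page to continue the crawl
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

A spider like this can be run with `scrapy runspider docs_spider.py -o pages.json` to collect the crawled pages into a single file.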

Text content extraction

Once data is collected, the next crucial step is extracting text from documents such as web pages and PDFs. This process involves parsing these documents to obtain clean, structured text data.

For detailed steps and code examples on extracting text from these sources, refer to our comprehensive guide in the blog post: “Step by Step Guide to Prepare Data for Retrieval Augmented Generation”.

Libraries used for text extraction

    • HTML: BeautifulSoup is used to navigate HTML structures and extract text content.
    • PDFs: PyPDF2 facilitates reading PDF files and extracting text from each page.

These tools enable us to transform unstructured documents into organized text data ready for further processing.
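
As a minimal sketch of how these two libraries fit together (the file paths are placeholders):

```python
from bs4 import BeautifulSoup
from PyPDF2 import PdfReader


def extract_html_text(html_path):
    """Parse an HTML file with BeautifulSoup and return its visible text."""
    with open(html_path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return soup.get_text(separator="\n", strip=True)


def extract_pdf_text(pdf_path):
    """Read a PDF with PyPDF2 and concatenate the text of every page."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


print(extract_html_text("docs/page.html")[:500])
print(extract_pdf_text("docs/manual.pdf")[:500])
```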

Creating sample JSON data

This section focuses on generating instructions for dataset creation using functions like generate_content() and generate_instructions(), which derive questions based on domain knowledge.

Generating instructions (questions)

To generate instruction questions, we’ll follow these steps:

    1. Chunk sections: The text is chunked semantically to ensure meaningful and contextually relevant questions.
    2. Formulate questions: These chunks are sent to a large language model (LLM), which generates questions based on the content of each chunk.
    3. Create JSON format: Finally, we’ll structure the questions and associated information into a JSON format for easy access and utilization.

Sample instructions.json

Here’s an example of what the instructions.json file might look like after generating and saving the instructions:
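
The structure below is illustrative; the instruction and context fields are example names rather than a required schema, with each generated question stored alongside the chunk it came from.

```json
[
  {
    "instruction": "What is vector search and when should it be used?",
    "context": "Vector search finds semantically similar documents by comparing embeddings ..."
  },
  {
    "instruction": "How do you prepare domain documents for fine-tuning?",
    "context": "Documents are collected, cleaned, and chunked into semantically coherent sections ..."
  }
]
```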

Implementation

To implement this process:

    1. Load domain knowledge: retrieve domain-specific information from a designated file
    2. Generate instructions: utilize functions like generate_content() to break down data and formulate questions using generate_instructions()
    3. Save questions: use save_instructions() to store generated questions in a JSON file

generate_content function

The generate_content function tokenizes the domain knowledge into sentences and then generates logical questions based on those sentences:
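
A minimal sketch of such a function, assuming NLTK’s sent_tokenize for sentence splitting and the generate_instructions() helper shown in the next section; the chunk size is illustrative:

```python
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")


def generate_content(domain_knowledge, sentences_per_chunk=5):
    """Split the domain knowledge into sentence chunks and collect the
    questions the LLM generates for each chunk."""
    sentences = sent_tokenize(domain_knowledge)
    instructions = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        # generate_instructions() (next section) asks the LLM for questions
        for question in generate_instructions(chunk):
            instructions.append({"instruction": question, "context": chunk})
    return instructions
```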

generate_instructions function

This function demonstrates how to generate instruction questions using a language model API:
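
A minimal sketch using Ollama’s local REST endpoint (the same setup used for dataset generation later in this post); the prompt wording and number of questions are illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def generate_instructions(chunk, model="llama2", num_questions=3):
    """Ask a locally running LLM (via Ollama) to write questions about a chunk."""
    prompt = (
        f"Based only on the following text, write {num_questions} questions "
        f"a user might ask. Return one question per line.\n\n{chunk}"
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    # With stream=False, Ollama returns the full completion in the "response" field
    questions = response.json()["response"].splitlines()
    return [q.strip() for q in questions if q.strip()]
```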

Loading and saving domain knowledge

We use two additional functions: load_domain_knowledge() to load the domain knowledge from a file and save_instructions() to save the generated instructions to a JSON file.

load_domain_knowledge function

This function loads the domain knowledge from a specified file:
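
A minimal sketch:

```python
def load_domain_knowledge(file_path):
    """Read the domain knowledge text (e.g. domain.txt) from disk."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()
```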

save_instructions function

This function saves the generated instructions to a JSON file:
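
A minimal sketch, writing the list of instruction entries as pretty-printed JSON:

```python
import json


def save_instructions(instructions, output_path="instructions.json"):
    """Write the generated instruction entries to a JSON file."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(instructions, f, indent=2, ensure_ascii=False)
```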


Example usage

Here’s an example demonstrating how these functions work together:
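
A minimal sketch, assuming domain.txt sits alongside the script and the helpers above are defined in the same module:

```python
if __name__ == "__main__":
    domain_knowledge = load_domain_knowledge("domain.txt")
    instructions = generate_content(domain_knowledge)
    save_instructions(instructions, "instructions.json")
    print(f"Saved {len(instructions)} instructions to instructions.json")
```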

This workflow allows for efficient creation and storage of questions for dataset preparation.

Generating datasets (train, test, validate)

This section guides you through creating datasets to fine-tune models such as Mistral 7B, using Ollama’s Llama 2 to generate the training examples. To ensure accuracy, you’ll need domain knowledge stored in files like domain.txt.

Python functions for dataset creation

query_ollama function

This function asks Ollama’s Llama 2 model for answers and follow-up questions based on specific prompts and domain context:
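
A minimal sketch using Ollama’s local REST endpoint; the prompt template is illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def query_ollama(question, context, model="llama2"):
    """Ask the local Llama 2 model to answer a question using the domain context,
    and to propose a follow-up question."""
    prompt = (
        "Use the following domain context to answer the question, "
        "then suggest one follow-up question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]
```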

create_validation_file function

This function divides data into training, testing, and validation sets, saving them into separate files for model training:
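
A minimal sketch; the 80/10/10 split and the output file names are illustrative defaults:

```python
import json
import random


def create_validation_file(records, train_ratio=0.8, test_ratio=0.1, seed=42):
    """Shuffle the records and write train/test/validation JSONL files."""
    random.Random(seed).shuffle(records)
    n = len(records)
    train_end = int(n * train_ratio)
    test_end = train_end + int(n * test_ratio)

    splits = {
        "train.jsonl": records[:train_end],
        "test.jsonl": records[train_end:test_end],
        "validation.jsonl": records[test_end:],
    }
    for filename, rows in splits.items():
        with open(filename, "w", encoding="utf-8") as f:
            for row in rows:
                # JSONL: one JSON object per line
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return splits
```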

Managing dataset creation

main function

The main function coordinates dataset generation, from querying Ollama’s Llama 2 to formatting results into JSONL files for model training:
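
A minimal sketch that reuses the helpers above; the prompt/completion field names are illustrative and should match whatever format your fine-tuning tooling expects:

```python
import json


def main(domain_path="domain.txt", instructions_path="instructions.json"):
    """Load the inputs, query Llama 2 once per instruction, and write the splits."""
    domain_knowledge = load_domain_knowledge(domain_path)
    with open(instructions_path, "r", encoding="utf-8") as f:
        instructions = json.load(f)

    records = []
    for item in instructions:
        answer = query_ollama(item["instruction"], domain_knowledge)
        # One prompt/completion pair per instruction, ready for fine-tuning
        records.append({"prompt": item["instruction"], "completion": answer})

    create_validation_file(records)


if __name__ == "__main__":
    main()
```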

Using these tools

To start refining models like Mistral 7B with Ollama’s Llama 2:

    1. Prepare domain knowledge: store domain-specific details in domain.txt
    2. Generate instructions: craft a JSON file, instructions.json, with prompts for dataset creation
    3. Run the main function: execute main() with file paths to create datasets for model training and validation

Together, these Python functions let you build the training, test, and validation datasets needed to fine-tune models like Mistral 7B on your own domain.

Conclusion

That’s all for today! With these steps, you now have the knowledge and tools to prepare high-quality datasets for fine-tuning. Thank you for reading, and we hope you’ve found this guide valuable. Stay tuned for the next part in this series, and check out our other vector search-related blogs. Happy modeling, and see you next time!

References

    • Step by Step Guide to Prepare Data for Retrieval Augmented Generation (Couchbase blog)

Contributors

Sanjivani Patra, Nishanth VM, Ashok Kumar Alluri

Author

Posted by Sanjivani Patra - Software Engineer
