{"id":16493,"date":"2024-10-24T20:51:08","date_gmt":"2024-10-25T03:51:08","guid":{"rendered":"https:\/\/www.couchbase.com\/blog\/?p=16493"},"modified":"2024-10-29T05:52:44","modified_gmt":"2024-10-29T12:52:44","slug":"prepare-datasets-fine-tuning-ml-models","status":"publish","type":"post","link":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/","title":{"rendered":"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Fine-tuning machine learning models starts with having well-prepared datasets. This guide will walk you through how to create these datasets, from gathering data to making instruction files. By the end, you&#8217;ll be equipped with practical knowledge and tools to prepare high-quality datasets for your fine-tuning tasks. <\/span><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">This post continues the details guides from <a href=\"https:\/\/www.couchbase.com\/blog\/guide-to-data-prep-for-rag\/\">preparing data for RAG<\/a>, and <a href=\"https:\/\/www.couchbase.com\/blog\/rag-applications-with-vector-search-and-couchbase\/\">building end-to-end RAG applications<\/a> with Couchbase vector search.<\/span><\/p>\n<div id=\"attachment_16494\" style=\"width: 910px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-16494\" class=\"wp-image-16494 size-large\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/image1-8-1024x256.png\" alt=\"\" width=\"900\" height=\"225\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/image1-8-1024x256.png 1024w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/image1-8-300x75.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/image1-8-768x192.png 768w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/image1-8-1536x383.png 1536w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/image1-8-1320x330.png 1320w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/image1-8.png 1999w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><p id=\"caption-attachment-16494\" class=\"wp-caption-text\">High-Level Overview<\/p><\/div>\n<h2>Data collection\/gathering<\/h2>\n<p><span style=\"font-weight: 400;\">The first step is gathering data from various sources. This involves collecting raw information that will later be cleaned and organized into structured datasets.<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><span style=\"font-weight: 400;\">For an in-depth, step-by-step guide on preparing data for retrieval augmented generation, please refer to our comprehensive blog post:\u00a0<\/span><span style=\"font-weight: 400;\">&#8220;Step by Step Guide to Prepare Data for Retrieval Augmented Generation&#8221;<\/span><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3>Our approach to data collection<\/h3>\n<p><span style=\"font-weight: 400;\">In our approach, we utilized multiple methods to gather all relevant data:<\/span><\/p>\n<ol>\n<li style=\"list-style-type: none;\">\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Web scraping using Scrapy:<\/b>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><a href=\"https:\/\/docs.scrapy.org\/en\/latest\/\"><span style=\"font-weight: 400;\">Scrapy<\/span><\/a><span style=\"font-weight: 400;\"> is a powerful Python framework for extracting data from websites. It allows you to write spiders that crawl websites and scrape data efficiently.\u00a0<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Extracting documents from Confluence:<\/b>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">We directly downloaded documents stored within our Confluence workspace. But this can also be done by utilising the <\/span><a href=\"https:\/\/docs.atlassian.com\/atlassian-confluence\/REST\/6.6.0\/\"><span style=\"font-weight: 400;\">Confluence API<\/span><\/a><span style=\"font-weight: 400;\"> which would involve writing scripts to automate the extraction process.<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Retrieving relevant files from Git repositories:<\/b>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><span style=\"font-weight: 400;\">Custom scripts were written to clone repositories and pull relevant files. This ensured we gathered all necessary data stored within our version control systems.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">By combining these methods, we ensured a comprehensive and efficient data collection process, covering all necessary sources.<\/span><\/p>\n<h2>Text content extraction<\/h2>\n<p><span style=\"font-weight: 400;\">Once data is collected, the next crucial step is extracting text from documents such as web pages and PDFs. This process involves parsing these documents to obtain clean, structured text data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For detailed steps and code examples on extracting text from these sources, refer to our comprehensive guide in the blog post:\u00a0<\/span><span style=\"font-weight: 400;\">&#8220;Step by Step Guide to Prepare Data for Retrieval Augmented Generation&#8221;<\/span><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3>Libraries used for text extraction<\/h3>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>HTML:<\/b><span style=\"font-weight: 400;\">\u00a0<\/span><a href=\"https:\/\/beautiful-soup-4.readthedocs.io\/en\/latest\/\"><span style=\"font-weight: 400;\">BeautifulSoup<\/span><\/a><span style=\"font-weight: 400;\"> is used to navigate HTML structures and extract text content.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>PDFs:<\/b><span style=\"font-weight: 400;\">\u00a0<\/span><a href=\"https:\/\/pypdf2.readthedocs.io\/en\/3.x\/\"><span style=\"font-weight: 400;\">PyPDF2<\/span><\/a><span style=\"font-weight: 400;\"> facilitates reading PDF files and extracting text from each page.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">These tools enable us to transform unstructured documents into organized text data ready for further processing.<\/span><\/p>\n<h2>Creating sample JSON data<\/h2>\n<p><span style=\"font-weight: 400;\">This section focuses on generating instructions for dataset creation using functions like <code>generate_content()<\/code>\u00a0and\u00a0<code>generate_instructions()<\/code>, which derive questions based on domain knowledge.<\/span><\/p>\n<h3>Generating instructions (questions)<\/h3>\n<p><span style=\"font-weight: 400;\">To generate instruction questions, we&#8217;ll follow these steps:<\/span><\/p>\n<ol>\n<li style=\"list-style-type: none;\">\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Chunk sections:<\/b><span style=\"font-weight: 400;\">\u00a0The text is chunked semantically to ensure meaningful and contextually relevant questions.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Formulate questions:<\/b><span style=\"font-weight: 400;\">\u00a0These chunks are sent to a language model (LLM), which generates questions based on the received chunk.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Create JSON format:<\/b><span style=\"font-weight: 400;\">\u00a0Finally, we&#8217;ll structure the questions and associated information into a JSON format for easy access and utilization.<\/span><\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h3>Sample\u00a0<i>instructions.json<\/i><\/h3>\n<p><span style=\"font-weight: 400;\">Here&#8217;s an example of what the\u00a0<code>instructions.json<\/code>\u00a0file might look like after generating and saving the instructions:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">[\r\n\u00a0 \u00a0 \"What is the significance of KV-Engine in the context of Magma Storage Engine?\",\r\n\u00a0 \u00a0 \"What is the significance of Architecture in the context of Magma Storage Engine?\"\r\n]<\/pre>\n<h3>Implementation<\/h3>\n<p><span style=\"font-weight: 400;\">To implement this process:<\/span><\/p>\n<ol>\n<li style=\"list-style-type: none;\">\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Load domain knowledge<\/b><span style=\"font-weight: 400;\">: retrieve domain-specific information from a designated file<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Generate instructions<\/b><span style=\"font-weight: 400;\">: utilize functions like <code>generate_content()<\/code> to break down data and formulate questions using <code>generate_instructions()<\/code><\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Save questions<\/b><span style=\"font-weight: 400;\">: use <code>save_instructions()<\/code> to store generated questions in a JSON file<\/span><\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h4><em>generate_content<\/em> function<\/h4>\n<p><span style=\"font-weight: 400;\">The\u00a0<code>generate_content<\/code> function tokenizes the domain knowledge into sentences and then generates logical questions based on those sentences:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">def generate_content(domain_knowledge, context):\r\n\u00a0 \u00a0 questions = []\r\n\u00a0 \u00a0 # Tokenize domain knowledge into sentences\r\n\u00a0 \u00a0 sentences = nltk.sent_tokenize(domain_knowledge)\r\n\r\n\u00a0 \u00a0 # Generate logical questions based on sentences\r\n\u00a0 \u00a0 for sentence in sentences:\r\n\u00a0 \u00a0 \u00a0 \u00a0 question = generate_instructions(sentence, context)\r\n\u00a0 \u00a0 \u00a0 \u00a0 questions.append(question)\r\n\r\n\u00a0 \u00a0 return questions<\/pre>\n<h3><em>generate_instructions<\/em> function<\/h3>\n<p><span style=\"font-weight: 400;\">This function demonstrates how to generate instruction questions using a language model API:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">def generate_instructions(domain, context):\r\n\u00a0 \u00a0 prompt = \"Generate a question from the domain knowledge provided which can be answered with the domain knowledge given. Don't create or print any numbered lists, no greetings, directly print the question.\"\r\n\u00a0 \u00a0 url = 'https:\/\/localhost:11434\/api\/generate'\r\n\u00a0 \u00a0 data = {\"model\": model, \"stream\": False, \"prompt\": f\"[DOMAIN] {domain} [\/DOMAIN] [CONTEXT] {context} [\/CONTEXT] {prompt}\"}\r\n\u00a0 \u00a0 response = requests.post(url, json=data)\r\n\u00a0 \u00a0 response.raise_for_status()\r\n\r\n\u00a0 \u00a0 return response.json()['response'].strip()<\/pre>\n<h3>Loading and saving domain knowledge<\/h3>\n<p><span style=\"font-weight: 400;\">We use two additional functions:\u00a0<code>load_domain_knowledge()<\/code>\u00a0to load the domain knowledge from a file and\u00a0save_instructions()\u00a0to save the generated instructions to a JSON file.<\/span><\/p>\n<h3><em>load_domain_knowledge<\/em> function<\/h3>\n<p><span style=\"font-weight: 400;\">This function loads domain knowledge from a specified file.<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">def load_domain_knowledge(domain_file):\r\n\u00a0 \u00a0 with open(domain_file, 'r') as file:\r\n\u00a0 \u00a0 \u00a0 \u00a0 domain_knowledge = file.read()\r\n\u00a0 \u00a0 return domain_knowledge<\/pre>\n<h3><em>save_instructions<\/em> function<\/h3>\n<p><span style=\"font-weight: 400;\">This function saves the generated instructions to a JSON file:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">def save_instructions(instructions, filename):\r\n\u00a0 \u00a0 with open(filename, 'w') as file:\r\n\u00a0 \u00a0 \u00a0 \u00a0 json.dump(instructions, file, indent=4)<\/pre>\n<hr \/>\n<h2>Example usage<\/h2>\n<p><span style=\"font-weight: 400;\">Here&#8217;s an example demonstrating how these functions work together:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \"># Example usage\r\ndomain_file = \"domain_knowledge.txt\"\r\ncontext = \"sample context\"\r\ndomain_knowledge = load_domain_knowledge(domain_file)\r\ninstructions = generate_content(domain_knowledge, context)\r\nsave_instructions(instructions, \"instructions.json\")<\/pre>\n<p><span style=\"font-weight: 400;\">This workflow allows for efficient creation and storage of questions for dataset preparation.<\/span><\/p>\n<h2>Generating datasets (train, test, validate)<\/h2>\n<p><span style=\"font-weight: 400;\">This section guides you through creating datasets to fine-tune various models, such as Mistral 7B, using Ollama&#8217;s Llama2. To ensure accuracy, you&#8217;ll need domain knowledge stored in files like <code>domain.txt<\/code>.<\/span><\/p>\n<h3>Python functions for dataset creation<\/h3>\n<h4>query_ollama Function<\/h4>\n<p><span style=\"font-weight: 400;\">This function asks Ollama&#8217;s Llama 2 model for answers and follow-up questions based on specific prompts and domain context:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true\">def query_ollama(prompt, domain, context='', model='llama2'): \r\n  url = 'https:\/\/localhost:11434\/api\/generate'\r\n  data = {\"model\": model, \"stream\": False, \"prompt\": f\"[DOMAIN] {domain} [\/DOMAIN] [CONTEXT] {context} [\/CONTEXT] {prompt}\"}\r\n  response = requests.post(url, json=data)\r\n  response.raise_for_status()\r\n\r\n  followup_data = {\"model\": model, \"stream\": False, \"prompt\": response.json()['response'].strip() + \"What is a likely follow-up question or request? Return just the text of one question or request.\"}\r\n\r\n  followup_response = requests.post(url, json=followup_data)\r\n  followup_response.raise_for_status()\r\n  return response.json()['response'].strip(), followup_response.json()['response'].replace(\"\\\"\", \"\").strip()<\/pre>\n<h4><em>create_validation_file<\/em> function<\/h4>\n<p><span style=\"font-weight: 400;\">This function divides data into training, testing, and validation sets, saving them into separate files for model training:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">def create_validation_file(temp_file, train_file, valid_file, test_file):\r\nwith open(temp_file, 'r') as file:\r\n\u00a0 \u00a0 lines = file.readlines()\r\n\r\ntrain_lines = lines[:int(len(lines) * 0.8)]\r\ntest_lines = lines[int(len(lines) * 0.8):int(len(lines) * 0.9)]\r\nvalid_lines = lines[int(len(lines) * 0.9):]\r\n\r\nwith open(train_file, 'a') as file:\r\n\u00a0 \u00a0 file.writelines(train_lines)\r\n\r\nwith open(valid_file, 'a') as file:\r\n\u00a0 \u00a0 file.writelines(valid_lines)\r\n\r\nwith open(test_file, 'a') as file:\r\n\u00a0 \u00a0 file.writelines(test_lines)<\/pre>\n<h3>Managing dataset creation<\/h3>\n<h4><em>main<\/em> function<\/h4>\n<p><span style=\"font-weight: 400;\">The main function coordinates dataset generation, from querying Ollama&#8217;s Llama 2 to formatting results into JSONL files for model training:<\/span><\/p>\n<pre class=\"nums:false lang:default decode:true \">def main(temp_file, instructions_file, train_file, valid_file, test_file, domain_file, context=''):\r\n# Check if instructions file exists\r\nif not Path(instructions_file).is_file():\r\n\u00a0 \u00a0 sys.exit(f'{instructions_file} not found.')\r\n\r\n# Check if domain file exists\r\nif not Path(domain_file).is_file():\r\n\u00a0 \u00a0 sys.exit(f'{domain_file} not found.')\r\n\r\n# Load domain knowledge\r\ndomain = load_domain(domain_file)\r\n\r\n# Load instructions from file\r\nwith open(instructions_file, 'r') as file:\r\n\u00a0 \u00a0 instructions = json.load(file)\r\n\r\n# Process each instruction\r\nfor i, instruction in enumerate(instructions, start=1):\r\n\u00a0 \u00a0 print(f\"Processing ({i}\/{len(instructions)}): {instruction}\")\r\n\u00a0 \u00a0 \u00a0 \r\n\u00a0 \u00a0 # Query Ollama's llama2 model to get model answer and follow-up question\r\n\u00a0 \u00a0 answer, followup_question = query_ollama(instruction, domain, context)\r\n\u00a0 \u00a0 \u00a0 \r\n\u00a0 \u00a0 # Format the result in JSONL format\r\n\u00a0 \u00a0 result = json.dumps({\r\n\u00a0 \u00a0 \u00a0 \u00a0 'text': f'&lt;s&gt;[INST] {instruction}[\/INST] {answer}&lt;\/s&gt;[INST]{followup_question}[\/INST]'\r\n\u00a0 \u00a0 }) + \"\\n\"\r\n\u00a0 \u00a0 \u00a0 \r\n\u00a0 \u00a0 # Write the result to temporary file\r\n\u00a0 \u00a0 with open(temp_file, 'a') as file:\r\n\u00a0 \u00a0 \u00a0 \u00a0 file.write(result)\r\n\r\n# Create train, test, and validate files\r\ncreate_validation_file(temp_file, train_file, valid_file, test_file)\r\nprint(\"Done! Training, testing, and validation JSONL files created.\")<\/pre>\n<h3>Using these tools<\/h3>\n<p><span style=\"font-weight: 400;\">To start refining models like Mistral 7B with Ollama&#8217;s Llama 2:<\/span><\/p>\n<ol>\n<li style=\"list-style-type: none;\">\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Prepare domain knowledge:<\/b><span style=\"font-weight: 400;\"> store domain-specific details in <code>domain.txt<\/code><\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Generate instructions:<\/b><span style=\"font-weight: 400;\"> craft a JSON file, <code>instructions.json<\/code>, with prompts for dataset creation<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Run the main function:<\/b><span style=\"font-weight: 400;\"> execute <code>main()<\/code> with file paths to create datasets for model training and validation<\/span><\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">These Python functions empower you to develop datasets that optimize machine learning models, enhancing performance and accuracy for advanced applications.<\/span><\/p>\n<h2>Conclusion<\/h2>\n<p><span style=\"font-weight: 400;\">That&#8217;s all for today! With these steps, you now have the knowledge and tools to improve your machine learning model training process. Thank you for reading, and we hope you&#8217;ve found this guide valuable. Be sure to explore our other blogs for more insights. Stay tuned for the next part in this series and check out other <a href=\"https:\/\/www.couchbase.com\/blog\/category\/vector-search\/\">vector search-related blogs<\/a>. Happy modeling, and see you next time!<\/span><\/p>\n<h2>References<\/h2>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/docs.scrapy.org\/en\/latest\/\"><span style=\"font-weight: 400;\">Python Scrapy module docs<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/docs.atlassian.com\/atlassian-confluence\/REST\/6.6.0\/\"><span style=\"font-weight: 400;\">Confluence REST API docs<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/beautiful-soup-4.readthedocs.io\/en\/latest\/\"><span style=\"font-weight: 400;\">Beautiful soup 4 docs<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><a href=\"https:\/\/pypdf2.readthedocs.io\/en\/3.x\/\"><span style=\"font-weight: 400;\">PyPDF module docs<\/span><\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2>Contributors<\/h2>\n<p style=\"padding-left: 40px;\"><a href=\"https:\/\/in.linkedin.com\/in\/sanjivanipatra\"><span style=\"font-weight: 400;\">Sanjivani Patra<\/span><\/a> &#8211;\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/nishanth-vm?utm_source=share&amp;utm_campaign=share_via&amp;utm_content=profile&amp;utm_medium=android_app\"><span style=\"font-weight: 400;\">Nishanth VM<\/span><\/a> &#8211; <a href=\"https:\/\/www.linkedin.com\/in\/ashokkumaralluri\/\"><span style=\"font-weight: 400;\">Ashok Kumar Alluri<\/span><\/a><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Fine-tuning machine learning models starts with having well-prepared datasets. This guide will walk you through how to create these datasets, from gathering data to making instruction files. By the end, you&#8217;ll be equipped with practical knowledge and tools to prepare [&hellip;]<\/p>\n","protected":false},"author":85513,"featured_media":16495,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[1814,1815,9973,9139,9937],"tags":[9231,9923,2140,9924],"ppma_author":[10012],"class_list":["post-16493","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-application-design","category-best-practices-and-tutorials","category-generative-ai-genai","category-python","category-vector-search","tag-data-science","tag-embeddings","tag-machine-learning","tag-rag-retrieval-augmented-generation"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.7.1 (Yoast SEO v25.7) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide - The Couchbase Blog<\/title>\n<meta name=\"description\" content=\"Create high-quality datasets for fine-tuning models with this guide on data gathering, text extraction, and instruction file generation.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide\" \/>\n<meta property=\"og:description\" content=\"Create high-quality datasets for fine-tuning models with this guide on data gathering, text extraction, and instruction file generation.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/\" \/>\n<meta property=\"og:site_name\" content=\"The Couchbase Blog\" \/>\n<meta property=\"article:published_time\" content=\"2024-10-25T03:51:08+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-29T12:52:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2400\" \/>\n\t<meta property=\"og:image:height\" content=\"1256\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sanjivani Patra - Software Engineer\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sanjivani Patra - Software Engineer\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/\"},\"author\":{\"name\":\"Sanjivani Patra - Software Engineer\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250\"},\"headline\":\"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide\",\"datePublished\":\"2024-10-25T03:51:08+00:00\",\"dateModified\":\"2024-10-29T12:52:44+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/\"},\"wordCount\":919,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png\",\"keywords\":[\"data science\",\"embeddings\",\"Machine Learning (ML)\",\"RAG retrieval-augmented generation\"],\"articleSection\":[\"Application Design\",\"Best Practices and Tutorials\",\"Generative AI (GenAI)\",\"Python\",\"Vector Search\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/\",\"name\":\"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide - The Couchbase Blog\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png\",\"datePublished\":\"2024-10-25T03:51:08+00:00\",\"dateModified\":\"2024-10-29T12:52:44+00:00\",\"description\":\"Create high-quality datasets for fine-tuning models with this guide on data gathering, text extraction, and instruction file generation.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png\",\"width\":2400,\"height\":1256},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.couchbase.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"name\":\"The Couchbase Blog\",\"description\":\"Couchbase, the NoSQL Database\",\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\",\"name\":\"The Couchbase Blog\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"width\":218,\"height\":34,\"caption\":\"The Couchbase Blog\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250\",\"name\":\"Sanjivani Patra - Software Engineer\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/e4335546d16f0f88af0dd52b94ab89d7\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g\",\"caption\":\"Sanjivani Patra - Software Engineer\"},\"url\":\"https:\/\/www.couchbase.com\/blog\/author\/sanjivanipatra\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide - The Couchbase Blog","description":"Create high-quality datasets for fine-tuning models with this guide on data gathering, text extraction, and instruction file generation.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/","og_locale":"en_US","og_type":"article","og_title":"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide","og_description":"Create high-quality datasets for fine-tuning models with this guide on data gathering, text extraction, and instruction file generation.","og_url":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/","og_site_name":"The Couchbase Blog","article_published_time":"2024-10-25T03:51:08+00:00","article_modified_time":"2024-10-29T12:52:44+00:00","og_image":[{"width":2400,"height":1256,"url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png","type":"image\/png"}],"author":"Sanjivani Patra - Software Engineer","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Sanjivani Patra - Software Engineer","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#article","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/"},"author":{"name":"Sanjivani Patra - Software Engineer","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250"},"headline":"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide","datePublished":"2024-10-25T03:51:08+00:00","dateModified":"2024-10-29T12:52:44+00:00","mainEntityOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/"},"wordCount":919,"commentCount":0,"publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png","keywords":["data science","embeddings","Machine Learning (ML)","RAG retrieval-augmented generation"],"articleSection":["Application Design","Best Practices and Tutorials","Generative AI (GenAI)","Python","Vector Search"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/","url":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/","name":"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide - The Couchbase Blog","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png","datePublished":"2024-10-25T03:51:08+00:00","dateModified":"2024-10-29T12:52:44+00:00","description":"Create high-quality datasets for fine-tuning models with this guide on data gathering, text extraction, and instruction file generation.","breadcrumb":{"@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#primaryimage","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2024\/10\/blog-data-prep-for-ml-models.png","width":2400,"height":1256},{"@type":"BreadcrumbList","@id":"https:\/\/www.couchbase.com\/blog\/prepare-datasets-fine-tuning-ml-models\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.couchbase.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Preparing Datasets for Fine-Tuning ML Models: A Comprehensive Guide"}]},{"@type":"WebSite","@id":"https:\/\/www.couchbase.com\/blog\/#website","url":"https:\/\/www.couchbase.com\/blog\/","name":"The Couchbase Blog","description":"Couchbase, the NoSQL Database","publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.couchbase.com\/blog\/#organization","name":"The Couchbase Blog","url":"https:\/\/www.couchbase.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","width":218,"height":34,"caption":"The Couchbase Blog"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/bef3d836c397a5a2cd59c80a43070250","name":"Sanjivani Patra - Software Engineer","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/e4335546d16f0f88af0dd52b94ab89d7","url":"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g","caption":"Sanjivani Patra - Software Engineer"},"url":"https:\/\/www.couchbase.com\/blog\/author\/sanjivanipatra\/"}]}},"authors":[{"term_id":10012,"user_id":85513,"is_guest":0,"slug":"sanjivanipatra","display_name":"Sanjivani Patra - Software Engineer","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/fffb8781af2f43cc312cfd67e2873c3a93111077161855cc2b605e3733cb712b?s=96&d=mm&r=g","author_category":"","last_name":"Patra - Software Engineer","first_name":"Sanjivani","job_title":"","user_url":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/16493","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/users\/85513"}],"replies":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/comments?post=16493"}],"version-history":[{"count":0,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/16493\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media\/16495"}],"wp:attachment":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media?parent=16493"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/categories?post=16493"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/tags?post=16493"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=16493"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}