Generate Synthetic Training Data Guide

Overview

The Kolo project uses the following scripts and configuration file to generate and process QA data:

The following command will copy over all subfolders, documents and files into /var/kolo_data/qa_generation_input.
```
./copy_qa_input_generation.ps1 "directory"
```
If you are testing for the first time. Try copying the entire Kolo project by running this command.
```
./copy_qa_input_generation.ps1 "../"
```
Modify the config file to specify file groups, custom prompts, and the number of iterations. If you are testing with Kolo project, leave the config file untouched.
Run the copy all scripts command. This will move the configuration file into Kolo.
```
./copy_scripts.ps1
```
This will generate QA data using the LLM provider and model you choose. In the config file you can choose whether to use openai or ollama and the specified model name. By default we use openai and the model gpt-4o-mini. When using the OpenAI provider you must pass in your API key when running the generating script.
```
./generate_qa_data.ps1 -OPENAI_API_KEY "your key"
```
Multi-threaded parameters
```
./generate_qa_data.ps1 -OPENAI_API_KEY "your key" -Threads 16
```
After generating the QA prompts, this command converts the question and answer text files inside
/var/kolo_data/qa_generation_output into training data: data.jsonl and data.json in /app/.
```
./convert_qa_output.ps1
```
Note: On subsequent generations, ensure you delete the existing qa_generation_output folder by executing:
```
./delete_qa_generation_output.ps1
```
Your training data is now ready; continue by training your LLM using ./train_model_torchtune.ps1 or ./train_model_unsloth.ps1.
Follow the README guide after this step.

Config File Details

This YAML configuration file controls various aspects of the QA generation process.

Global Settings

Directories & Paths

base_dir: Location of the QA generation input files.
output_dir: Directory where QA generation output and debug files are saved.
output_base_path: The base path for output files (e.g., /var/kolo_data).

Service Endpoints

ollama_url: URL endpoint for the Ollama API (if used).

Providers

Define the API providers for generating both questions and answers. Each provider block specifies:

provider: The service to use (e.g., openai or ollama).
model: The model to be used (e.g., gpt-4o-mini).

global:
  base_dir: qa_generation_input
  output_dir: qa_generation_output
  output_base_path: /var/kolo_data
  ollama_url: http://localhost:11434/api/generate

providers:
  question:
    provider: openai # Use "ollama" or "openai"
    model: gpt-4o-mini
  answer:
    provider: openai # Use "ollama" or "openai"
    model: gpt-4o-mini

Prompts

Instruction Lists

Question Instruction List

This list defines different instructions to style the generated questions. Each entry may have multiple instructions. For example:

QuestionInstructionList:
  - name: 'CasualandFormal'
    instruction:
      - 'For each question write like a casual person.'
      - 'For each question write like a formal person.'

Usage: During question generation, each instruction is applied to a seed to create variations in tone.

Answer Instruction List

This list provides variations in the answer generation style:

AnswerInstructionList:
  - name: 'SimpleAndComplex'
    instruction:
      - 'For your answer keep it simple and short.'
      - 'For your answer give detail and reference any relevant content.'

Usage: Each answer instruction is paired with a question to generate answers with different levels of complexity or detail.

Question Generation Seeds

The GenerateQuestionLists section provides seed questions or prompts that drive the question generation process:

GenerateQuestionLists:
  - name: 'DocumentList'
    questions:
      - 'Based on the above content, generate a list of questions where the user asks how to use different things.'
      - 'Based on the above content, generate a list of questions where the user asks to summarize different parts of the content.'
      - 'Based on the above content, generate a list of questions where the user wants to learn certain parts of the content.'
      - 'Based on the above content, generate a list of questions where the user wants to understand the concepts in the content.'
      - 'Based on the above content, generate a list of questions where the users ask you to help do something for them based on various needs and requirements.'
  - name: 'CodingList'
    questions:
      - 'Based on the above content, generate a list of questions where a new user wants to learn how to use the code and what it does using different tones and styles.'
      - 'Based on the above content, generate a list of questions where the user wants to know what a specific thing does in the code.'

Usage: The seeds are combined with the instructions to produce a variety of questions, such as tailoring them to either document or coding contexts.

Prompt Templates

Prompt templates are used to construct the text sent to the language model.

FileHeaders

The FileHeaders section specifies the header prompt that will be inserted above each file content.

FileHeaders:
  - name: 'DefaultFileHeader'
    description: 'The file contents for: {file_name}'

{file_name}: Represents the file name.

Answer Prompt

Defines how to format the answer prompt:

AnswerPrompt:
  - name: 'DefaultAnswerPrompt'
    description: |
      {file_content}
      {instruction}
      {question}

Usage: Placeholders are replaced as follows:

{file_content}: Combined content of the source files.
{instruction}: The answer instruction text.
{question}: The specific question to answer.

QuestionPrompt

Defines how to format the question prompt:

For this example, there are two variants for question prompts, depending on whether the file names should be referenced or not.

QuestionPrompt:
  - name: 'NoFileName'
    description: |
      {file_content}
      {generate_question}
      {instruction}
      Use the following output format:
        1. <question 1>
        2. <question 2>
        3. <question 3>
      etc.
  - name: 'WithFileName'
    description: |
      {file_content}
      {generate_question}
      {instruction}
      Use the following output format.
        1. <question 1>
        2. <question 2>
        3. <question 3>
      etc.
      You are required to reference {file_name_list} for every single question that you generate!

Usage:

{file_content}: Combined content of the source files.
{instruction}: The question instruction text.
{generate_question}: The specific generate question instruction from the Generate Question List.
{file_name_list} is the list of file names that you can use to instruct the LLM to use when generating questions.

Note: Changing the output format may impact how well the conversion script works.

File Groups

The file_groups section organizes the files into groups that will each be processed independently. Each file group defines:

iterations: How many times the group should be processed (each iteration may generate a new set of Q&A outputs).
files: List of files that you want to use for the LLM context.
question_prompt: Which question prompt template to use (e.g., NoFileName or WithFileName).
generate_question_list: Which question generation seed list(s) to use.
question_instruction_list: Which instruction list to apply when generating questions.
file_header: Which file header template to use.
answer_prompt: Which answer prompt template to use.
answer_instruction_list: Which answer instruction list to apply when generating answers.

Example configuration for three groups:

file_groups:
  UninstallModel:
    iterations: 3
    files:
      - uninstall_model.ps1
    question_prompt: WithFileName
    generate_question_list: [CodingList]
    question_instruction_list: [CasualandFormal]
    file_header: DefaultFileHeader
    answer_prompt: DefaultAnswerPrompt
    answer_instruction_list: [SimpleAndComplex]
  README:
    iterations: 3
    files:
      - README.md
    question_prompt: NoFileName
    generate_question_list: [DocumentList]
    question_instruction_list: [CasualandFormal]
    file_header: DefaultFileHeader
    answer_prompt: DefaultAnswerPrompt
    answer_instruction_list: [SimpleAndComplex]
  DeleteModel:
    iterations: 3
    files:
      - delete_model.ps1
    question_prompt: WithFileName
    generate_question_list: [CodingList]
    question_instruction_list: [CasualandFormal]
    file_header: DefaultFileHeader
    answer_prompt: DefaultAnswerPrompt
    answer_instruction_list: [SimpleAndComplex]

See generate_qa_config.yaml for a full config example.

Debugging

If you run into issues, you can look at the debug folder inside kolo_container at /var/kolo_data/qa_generation_output using WinSCP. The debug text files will show you exactly what is being sent to the LLM during generation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate Synthetic Training Data Guide

Overview

Config File Details

Global Settings

Directories & Paths

Service Endpoints

Providers

Prompts

Instruction Lists

Question Instruction List

Answer Instruction List

Question Generation Seeds

Prompt Templates

FileHeaders

Answer Prompt

QuestionPrompt

File Groups

Debugging

FilesExpand file tree

GenerateTrainingDataGuide.md

Latest commit

History

GenerateTrainingDataGuide.md

File metadata and controls

Generate Synthetic Training Data Guide

Overview

Config File Details

Global Settings

Directories & Paths

Service Endpoints

Providers

Prompts

Instruction Lists

Question Instruction List

Answer Instruction List

Question Generation Seeds

Prompt Templates

FileHeaders

Answer Prompt

QuestionPrompt

File Groups

Debugging