- Overview
- Architecture
- Installation
- CLI Interface
- Configuration System
- Pipeline Stages
- Component Reference
- Output Formats
- Environment Variables
- Workflow Examples
- Customizing Prompts
- Extending the Toolkit
- Troubleshooting
- Best Practices
Synthetic Data Kit is a toolkit for preparing high-quality synthetic datasets to fine-tune Large Language Models (LLMs). It provides a modular command-line interface (CLI) for the complete data preparation workflow, built around four commands named after their actions: ingest, create, curate, and save-as.
- Document Parsing: Convert various file formats (PDF, HTML, YouTube, DOCX, PPTX, TXT) to clean text
- Content Generation: Generate high-quality QA pairs using local LLM inference
- Quality Control: Filter content based on quality metrics
- Format Conversion: Export to various training formats (JSONL, Alpaca, OpenAI FT, ChatML)
- Configurable: All aspects controlled via YAML configuration
- Extensible: Easy to add new parsers, generators, or output formats
Synthetic Data Kit follows a modular architecture with these main components:
graph TD
CLI[CLI Interface] --> Core
Core --> Parsers
Core --> Generators
Core --> LLMClient
Core --> FormatConverter
Parsers --> PDFParser
Parsers --> HTMLParser
Parsers --> YouTubeParser
Parsers --> DOCXParser
Parsers --> PPTParser
Parsers --> TXTParser
Generators --> QAGenerator
Generators --> COTGenerator
Config[Configuration] --> CLI
Config --> Core
Config --> LLMClient
Config --> Generators
Utils[Utilities] --> TextProcessing
Utils --> LLMProcessing
Utils --> ConfigUtils
Utils --> FormatConverter
Utils --> DatasetUtils[HF Dataset Utils]
LLMClient --> BatchProcessing[Batch Processing]
LLMProcessing --> ParseQAPairs[Parse QA Pairs]
LLMProcessing --> ParseRatings[Enhanced Rating Parser]
LLMProcessing --> ConversionUtils[Conversation Format Utils]
EnvVars[Environment Variables] -.-> Core
EnvVars -.-> LLMProcessing
synthetic-data-kit/
├── synthetic_data_kit/ # Package source code
│ ├── __init__.py # Package initialization
│ ├── cli.py # CLI entry point using Typer
│ ├── core/ # Core functionality
│ │ ├── __init__.py
│ │ ├── context.py # Application context
│ │ ├── ingest.py # Document ingestion
│ │ ├── create.py # Content creation
│ │ ├── cleanup.py # Content filtering
│ │ └── save_as.py # Format conversion
│ ├── models/ # LLM integration
│ │ ├── __init__.py
│ │ └── llm_client.py # VLLM client
│ ├── parsers/ # Document parsers
│ │ ├── __init__.py
│ │ ├── pdf_parser.py # PDF parser
│ │ ├── html_parser.py # HTML parser
│ │ ├── youtube_parser.py # YouTube parser
│ │ ├── docx_parser.py # DOCX parser
│ │ ├── ppt_parser.py # PPT parser
│ │ └── txt_parser.py # TXT parser
│ ├── generators/ # Content generators
│ │ ├── __init__.py
│ │ └── qa_generator.py # QA pair generator
│ └── utils/ # Utilities
│ ├── __init__.py
│ ├── config.py # Config handling
│ ├── text.py # Text processing
│ ├── llm_processing.py # LLM output parsing
│ └── format_converter.py # Format conversion
├── configs/ # Configuration files
│ └── config.yaml # Default configuration
├── data/ # Data directories
│ ├── pdf/ # Input PDFs
│ ├── html/ # Input HTML files
│ ├── youtube/ # YouTube transcripts
│ ├── docx/ # Input Word documents
│ ├── ppt/ # Input PowerPoint files
│ ├── txt/ # Input text files
│ ├── output/ # Parsed text outputs
│ ├── generated/ # Generated content
│ ├── cleaned/ # Filtered content
│ └── final/ # Formatted outputs
├── setup.py # Package setup script
├── pyproject.toml # Project metadata
├── MANIFEST.in # Package manifest
└── README.md # Project readme
classDiagram
class AppContext {
+config_path: Path
+config: Dict
+_ensure_data_dirs()
}
class LLMClient {
+api_base: str
+model: str
+max_retries: int
+retry_delay: float
+config: Dict
+_check_server() tuple
+chat_completion(messages, temperature, max_tokens, top_p) str
+batch_completion(message_batches, temperature, max_tokens, top_p) List[str]
}
class QAGenerator {
+client: LLMClient
+config: Dict
+generation_config: Dict
+curate_config: Dict
+generate_summary(document_text) str
+generate_qa_pairs(document_text, summary, num_pairs) List[Dict]
+rate_qa_pairs(qa_pairs, summary, threshold) Tuple[List, Dict]
+process_document(document_text, num_pairs, quality_threshold) Dict
}
class Parser {
+parse(file_path) str
+save(content, output_path) None
}
class PDFParser {
+parse(file_path) str
+save(content, output_path) None
}
class HTMLParser {
+parse(file_path) str
+save(content, output_path) None
}
class YouTubeParser {
+parse(url) str
+save(content, output_path) None
}
class CLIApp {
+callback(config)
+system_check(api_base)
+ingest(input, output_dir, name)
+create(input, content_type, output_dir, api_base, model, num_pairs, threshold)
+curate(input, output, threshold, api_base, model)
+save_as(input, format, output)
}
Parser <|-- PDFParser
Parser <|-- HTMLParser
Parser <|-- YouTubeParser
Parser <|-- DOCXParser
Parser <|-- PPTParser
Parser <|-- TXTParser
QAGenerator --> LLMClient
CLIApp --> AppContext
CLIApp --> QAGenerator
CLIApp --> Parser
sequenceDiagram
participant User
participant CLI
participant Parsers
participant LLMClient
participant QAGenerator
participant FormatConverter
User->>CLI: synthetic-data-kit ingest file.pdf
CLI->>Parsers: determine_parser(file.pdf)
Parsers-->>CLI: PDFParser
CLI->>Parsers: parse(file.pdf)
Parsers-->>CLI: Extracted text
CLI-->>User: Text saved to data/output/file.txt
User->>CLI: synthetic-data-kit create file.txt
CLI->>LLMClient: Initialize with config
CLI->>QAGenerator: process_document(text)
QAGenerator->>LLMClient: generate_summary()
LLMClient-->>QAGenerator: Summary
QAGenerator->>LLMClient: generate_qa_pairs()
LLMClient-->>QAGenerator: QA pairs
QAGenerator->>LLMClient: rate_qa_pairs()
LLMClient-->>QAGenerator: Rated pairs
QAGenerator-->>CLI: Results
CLI-->>User: QA pairs saved to data/generated/file_qa_pairs.json
User->>CLI: synthetic-data-kit curate file_qa_pairs.json -v
CLI->>LLMClient: Initialize with config
CLI->>QAGenerator: rate_qa_pairs()
QAGenerator->>LLMClient: Process in batches
LLMClient-->>QAGenerator: Batch responses
QAGenerator->>ParseRatings: Parse with multiple methods
Note over ParseRatings: Enhanced JSON parsing
alt Successful parsing
ParseRatings-->>QAGenerator: Parsed ratings
else Parsing failed
ParseRatings-->>QAGenerator: Error
QAGenerator->>LLMClient: Process individually
LLMClient-->>QAGenerator: Individual responses
QAGenerator->>ParseRatings: Parse individual results
ParseRatings-->>QAGenerator: Individual ratings
end
QAGenerator->>QAGenerator: Apply threshold & metrics
QAGenerator-->>CLI: Filtered pairs with stats
CLI-->>User: Cleaned data saved to data/cleaned/file_cleaned.json
User->>CLI: synthetic-data-kit save-as file_cleaned.json -f ft
CLI->>FormatConverter: convert_format(input, output, format)
FormatConverter-->>CLI: Converted data
CLI-->>User: Data saved to data/final/file_ft.json
- Python 3.8 or later
- VLLM for local inference (recommended)
pip install synthetic-data-kit

Or install from source:

git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .

For local inference, you'll need to install and run VLLM:
pip install vllm
# Start the VLLM server with your preferred model
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000

Synthetic Data Kit provides a Typer-based CLI interface with subcommands for each stage of the pipeline.
synthetic-data-kit [OPTIONS] COMMAND [ARGS]...
| Option | Description |
|---|---|
| `-c, --config PATH` | Path to custom configuration file |
| `--help` | Show help message |
graph LR
SDK[synthetic-data-kit] --> Ingest[ingest]
SDK --> Create[create]
SDK --> Curate[curate]
SDK --> SaveAs[save-as]
SDK --> SystemCheck[system-check]
Ingest --> PDFFile[PDF File]
Ingest --> HTMLFile[HTML File]
Ingest --> YouTubeURL[YouTube URL]
Create --> QA[QA Pairs]
Create --> Summary[Summary]
Curate --> Filter[Filter by Quality]
SaveAs --> JSONL[JSONL Format]
SaveAs --> Alpaca[Alpaca Format]
SaveAs --> FT[Fine-Tuning Format]
SaveAs --> ChatML[ChatML Format]
Verifies that the VLLM server is running.

synthetic-data-kit system-check [OPTIONS]

| Option | Description |
|---|---|
| `--api-base TEXT` | VLLM API base URL to check |
# Check default server
synthetic-data-kit system-check
# Check specific server
synthetic-data-kit system-check --api-base="http://localhost:8000/v1"

Parses documents into clean text.
synthetic-data-kit ingest [OPTIONS] INPUT

| Argument | Description |
|---|---|
| `INPUT` | File or URL to parse |

| Option | Description |
|---|---|
| `-o, --output-dir PATH` | Directory to save parsed text |
| `-n, --name TEXT` | Custom filename for output |
# Parse a PDF file
synthetic-data-kit ingest documents/paper.pdf
# Parse with custom output directory
synthetic-data-kit ingest documents/paper.pdf -o custom_dir/
# Parse a web page
synthetic-data-kit ingest "https://example.com/article"
# Parse a YouTube video
synthetic-data-kit ingest "https://www.youtube.com/watch?v=dQw4w9WgXcQ"Generates content from text files.
synthetic-data-kit create [OPTIONS] INPUT

| Argument | Description |
|---|---|
| `INPUT` | Text file to process |

| Option | Description |
|---|---|
| `--type TEXT` | Content type to generate [qa\|summary\|cot] |
| `-o, --output-dir PATH` | Directory to save generated content |
| `--api-base TEXT` | VLLM API base URL |
| `-m, --model TEXT` | Model to use |
| `-n, --num-pairs INTEGER` | Number of QA pairs to generate |
| `--threshold FLOAT` | Quality threshold (1-10) |
# Generate QA pairs
synthetic-data-kit create data/output/document.txt
# Specify number of pairs
synthetic-data-kit create data/output/document.txt -n 30
# Generate summary only
synthetic-data-kit create data/output/document.txt --type summary
# Generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/output/document.txt --type cot
# Use custom model
synthetic-data-kit create data/output/document.txt -m "meta-llama/Llama-3.3-8B-Instruct"

Filters content based on quality.
synthetic-data-kit curate [OPTIONS] INPUT

| Argument | Description |
|---|---|
| `INPUT` | File with QA pairs to clean |

| Option | Description |
|---|---|
| `-o, --output PATH` | Output file path |
| `-t, --threshold FLOAT` | Quality threshold (1-10) |
| `--api-base TEXT` | VLLM API base URL |
| `-m, --model TEXT` | Model to use |
# Clean with default settings
synthetic-data-kit curate data/generated/document_qa_pairs.json
# Set higher quality threshold
synthetic-data-kit curate data/generated/document_qa_pairs.json -t 8.5
# Specify output location
synthetic-data-kit curate data/generated/document_qa_pairs.json -o custom_path.json

Converts content to different formats.
synthetic-data-kit save-as [OPTIONS] INPUT

| Argument | Description |
|---|---|
| `INPUT` | File to convert |

| Option | Description |
|---|---|
| `-f, --format TEXT` | Output format [jsonl\|alpaca\|ft\|chatml] |
| `--storage TEXT` | Storage format [json\|hf] (default: json) |
| `-o, --output PATH` | Output file path |
# Convert to JSONL format
synthetic-data-kit save-as data/cleaned/document_cleaned.json -f jsonl
# Convert to fine-tuning format (JSON file)
synthetic-data-kit save-as data/cleaned/document_cleaned.json -f ft
# Convert to fine-tuning format (HF dataset)
synthetic-data-kit save-as data/cleaned/document_cleaned.json -f ft --storage hf
# Convert to ChatML format (HF dataset) with specific output location
synthetic-data-kit save-as data/cleaned/document_cleaned.json -f chatml --storage hf -o data/final/custom_name

Synthetic Data Kit uses a YAML-based configuration system with a central config file.
# paths: Configure input and output paths
paths:
input:
pdf: "data/pdf"
html: "data/html"
youtube: "data/youtube"
docx: "data/docx"
ppt: "data/ppt"
txt: "data/txt"
output:
parsed: "data/output"
generated: "data/generated"
cleaned: "data/cleaned"
final: "data/final"
# vllm: Configure VLLM server settings
vllm:
api_base: "http://localhost:8000/v1"
port: 8000
model: "meta-llama/Llama-3.3-70B-Instruct"
max_retries: 3
retry_delay: 1.0
# generation: Content generation parameters
generation:
temperature: 0.7
top_p: 0.95
chunk_size: 4000
overlap: 200
max_tokens: 4096
num_pairs: 25
batch_size: 32 # Number of requests to batch together
# curate: Content filtering parameters
curate:
threshold: 7.0
batch_size: 8
temperature: 0.1
# format: Export format parameters
format:
default: "jsonl"
include_metadata: true
pretty_json: true
# prompts: LLM prompts for different tasks
prompts:
summary: |
Summarize this document in 3-5 sentences, focusing on the main topic and key concepts.
qa_generation: |
Create {num_pairs} question-answer pairs from this text for LLM training.
Rules:
1. Questions must be about important facts in the text
2. Answers must be directly supported by the text
3. Return JSON format only:
[
{{
"question": "Question 1?",
"answer": "Answer 1."
}},
{{
"question": "Question 2?",
"answer": "Answer 2."
}}
]
Text:
{text}
qa_rating: |
You are a helpful JSON processor that rates question-answer pairs.
Your task is to rate each pair on a scale from 1-10 and return valid JSON with added ratings.
ONLY return a valid JSON array with the original pairs plus ratings. Do not include any explanations or text outside the JSON.
Here are the pairs to rate:
    {pairs}

You can specify a custom configuration file using the `-c` option:
synthetic-data-kit -c custom_config.yaml ingest documents/paper.pdf

The toolkit uses the following priority for configuration values:
- Command line arguments (highest priority)
- Custom configuration file (if specified)
- Default configuration values (lowest priority)
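A minimal sketch of this resolution pattern (the names here are illustrative, not the toolkit's internals):

```python
DEFAULT_MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # built-in default (illustrative)

def resolve_model(cli_model=None, file_config=None):
    """CLI flag wins, then the custom config file, then the built-in default."""
    file_model = (file_config or {}).get("vllm", {}).get("model")
    return cli_model or file_model or DEFAULT_MODEL
```

The configuration sections themselves can be read programmatically via the helpers in `synthetic_data_kit.utils.config`: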
from synthetic_data_kit.utils.config import (
load_config,
get_path_config,
get_vllm_config,
get_generation_config,
get_curate_config,
get_format_config,
get_prompt
)
# Load config from file
config = load_config("path/to/config.yaml")
# Get specific configuration sections
vllm_config = get_vllm_config(config)
generation_config = get_generation_config(config)
curate_config = get_curate_config(config)
format_config = get_format_config(config)
# Get specific path
output_dir = get_path_config(config, "output", "parsed")
# Get prompt template
summary_prompt = get_prompt(config, "summary")

The ingest stage converts various document formats to plain text.
graph TD
Input[Input Document] --> Parser{Parser Selection}
Parser -->|PDF| PDFParser[PDF Parser]
Parser -->|HTML| HTMLParser[HTML Parser]
Parser -->|YouTube| YouTubeParser[YouTube Parser]
Parser -->|DOCX| DOCXParser[DOCX Parser]
Parser -->|PPT| PPTParser[PPT Parser]
Parser -->|TXT| TXTParser[TXT Parser]
PDFParser --> TextExtraction[Text Extraction]
HTMLParser --> TextExtraction
YouTubeParser --> TextExtraction
DOCXParser --> TextExtraction
PPTParser --> TextExtraction
TXTParser --> TextExtraction
TextExtraction --> CleanText[Clean Text]
CleanText --> SaveText[Save Text File]
The toolkit selects the appropriate parser based on the file extension or URL pattern:
def determine_parser(file_path, config):
# URL handling
if file_path.startswith(('http://', 'https://')):
if 'youtube.com' in file_path or 'youtu.be' in file_path:
return YouTubeParser()
else:
return HTMLParser()
# File handling
ext = os.path.splitext(file_path)[1].lower()
parsers = {
'.pdf': PDFParser(),
'.html': HTMLParser(),
'.htm': HTMLParser(),
'.docx': DOCXParser(),
'.pptx': PPTParser(),
'.txt': TXTParser(),
}
if ext in parsers:
return parsers[ext]
else:
raise ValueError(f"Unsupported file extension: {ext}")The create stage generates content from the parsed text.
The create stage generates content from the parsed text.

graph TD
InputText[Input Text] --> Preprocessing[Text Preprocessing]
Preprocessing --> Chunking[Split into Chunks]
Chunking --> GenerateSummary[Generate Summary]
Chunking --> GenerateQA[Generate QA Pairs]
GenerateSummary --> ModelInference1[LLM Inference]
GenerateQA --> ModelInference2[LLM Inference]
ModelInference1 --> Summary[Document Summary]
ModelInference2 --> QAPairs[QA Pairs]
Summary --> Results[Results Object]
QAPairs --> Results
Results --> SaveResults[Save to JSON]
For long documents, the text is split into manageable chunks:
def split_into_chunks(text: str, chunk_size: int = 4000, overlap: int = 200) -> List[str]:
paragraphs = text.split("\n\n")
chunks = []
current_chunk = ""
for para in paragraphs:
if len(current_chunk) + len(para) > chunk_size and current_chunk:
chunks.append(current_chunk)
            # Keep some overlap for context (this excerpt carries the last few
            # sentences forward rather than using the overlap parameter directly)
sentences = current_chunk.split('. ')
if len(sentences) > 3:
current_chunk = '. '.join(sentences[-3:]) + "\n\n" + para
else:
current_chunk = para
else:
if current_chunk:
current_chunk += "\n\n" + para
else:
current_chunk = para
if current_chunk:
chunks.append(current_chunk)
    return chunks
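For example, chunking a parsed document with the default settings (the path is illustrative):

```python
with open("data/output/paper.txt", encoding="utf-8") as f:
    text = f.read()

chunks = split_into_chunks(text, chunk_size=4000, overlap=200)
print(f"Split into {len(chunks)} chunks")
```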
The cleanup stage filters content based on quality.

graph TD
InputJSON[Input JSON] --> LoadQAPairs[Load QA Pairs]
LoadQAPairs --> BatchProcessing[Process in Batches]
BatchProcessing --> QualityPrompt[Apply Rating Prompt]
QualityPrompt --> ModelInference[LLM Inference]
ModelInference --> ParseRatings[Parse Ratings with Enhanced Methods]
ParseRatings -->|Success| ApplyThreshold[Apply Quality Threshold]
ParseRatings -->|Failure| FallbackProcessing[Fallback to Individual Processing]
FallbackProcessing --> SinglePairRating[Rate Individual Pairs]
SinglePairRating --> ApplyThreshold
ApplyThreshold --> FilteredPairs[Filtered QA Pairs]
FilteredPairs --> QualityMetrics[Calculate Metrics]
FilteredPairs --> SaveResults[Save to JSON]
QualityMetrics --> SaveResults
subgraph "Enhanced JSON Parsing"
ParseRatings --> Method1[Method 1: Pretty-Printed JSON]
ParseRatings --> Method2[Method 2: Code Block Extraction]
ParseRatings --> Method3[Method 3: Regex Patterns]
ParseRatings --> Method4[Method 4: JSON5 Parser]
ParseRatings --> Method5[Method 5: Pattern Matching]
end
The curate module processes QA pairs in batches for efficiency, with robust error handling and fallback mechanisms. The system has been enhanced to handle JSON parsing edge cases and provide detailed diagnostic information.
def curate_qa_pairs(input_path, output_path, threshold=None, api_base=None, model=None, config_path=None, verbose=False):
"""Clean and filter QA pairs based on quality ratings"""
# Load input file and extract QA pairs
with open(input_path, 'r', encoding='utf-8') as f:
data = json.load(f)
qa_pairs = data.get("qa_pairs", [])
summary = data.get("summary", "")
# Initialize LLM client
client = LLMClient(config_path=config_path, api_base=api_base, model_name=model)
# Get configuration
curate_config = get_curate_config(client.config)
# Allow environment variable to override batch size for debugging
env_batch_size = os.environ.get('SDK_BATCH_SIZE')
if env_batch_size and env_batch_size.isdigit():
batch_size = int(env_batch_size)
inference_batch = int(env_batch_size)
else:
batch_size = curate_config.get("batch_size", 32)
inference_batch = curate_config.get("inference_batch", 32)
    # Process in batches with smart error handling
    batches = [qa_pairs[i:i+batch_size] for i in range(0, len(qa_pairs), batch_size)]
    filtered_pairs = []
    # (abridged) a rating prompt is built for each batch and collected in all_messages;
    # original_batch below refers to the QA pairs behind each request window, and
    # rating_temperature comes from curate_config
    for batch_start in range(0, len(all_messages), inference_batch):
        current_batch = all_messages[batch_start:batch_start + inference_batch]
        batch_responses = client.batch_completion(current_batch, temperature=rating_temperature)
# Process each response
for j, response in enumerate(batch_responses):
try:
# Pass original batch to enable fallback matching
rated_batch = parse_ratings(response, original_batch)
# Process ratings
for pair in rated_batch:
if "rating" in pair:
rating = pair["rating"]
if rating >= threshold:
filtered_pairs.append(pair)
except Exception as e:
# Attempt individual processing as fallback
if verbose:
print(f"Batch processing failed, trying individual items...")
# Process individual items in the batch as a fallback strategy
for item in original_batch:
try:
# Process single item
item_response = client.chat_completion(
[{"role": "system", "content": single_item_prompt}]
)
rated_item = parse_ratings(item_response, [item])
# Add to filtered pairs if rating meets threshold
except Exception:
if verbose:
print(f"Failed to process individual item")
# Calculate metrics and return results
    return output_path
The system includes several advanced features:

- Batch Size Configuration: Configurable batch sizes for optimal performance
- Environment Variable Overrides: `SDK_BATCH_SIZE` for debugging and testing
- Fallback Processing: If batch processing fails, falls back to single-item processing
- Robust JSON Parsing: Multiple parsing methods to handle different LLM output formats
- Verbose Mode: Detailed diagnostic information with the `-v` flag
The save-as stage converts the content to different formats.
graph TD
InputJSON[Input JSON] --> LoadContent[Load Content]
LoadContent --> FormatSelection{Format Selection}
FormatSelection -->|JSONL| JSONL[Convert to JSONL]
FormatSelection -->|Alpaca| Alpaca[Convert to Alpaca]
FormatSelection -->|FT| FT[Convert to Fine-Tuning]
FormatSelection -->|ChatML| ChatML[Convert to ChatML]
JSONL --> StorageSelection{Storage Format}
Alpaca --> StorageSelection
FT --> StorageSelection
ChatML --> StorageSelection
StorageSelection -->|JSON| SaveJSONFile[Save as JSON File]
StorageSelection -->|HF Dataset| CreateHFDataset[Create HF Dataset]
CreateHFDataset --> SaveArrow[Save in Arrow Format]
SaveJSONFile --> OutputFile[Output File]
SaveArrow --> OutputDir[Output Directory]
def convert_format(input_path, output_path, format_type):
# Load input file
with open(input_path, 'r', encoding='utf-8') as f:
data = json.load(f)
# Extract QA pairs
if "filtered_pairs" in data:
qa_pairs = data["filtered_pairs"]
elif "qa_pairs" in data:
qa_pairs = data["qa_pairs"]
else:
raise ValueError("No QA pairs found in input file")
# Convert to requested format
if format_type == "jsonl":
return to_jsonl(qa_pairs, output_path)
elif format_type == "alpaca":
return to_alpaca(qa_pairs, output_path)
elif format_type == "ft":
return to_fine_tuning(qa_pairs, output_path)
elif format_type == "chatml":
return to_chatml(qa_pairs, output_path)
else:
raise ValueError(f"Unknown format type: {format_type}")class LLMClient:
class LLMClient:
    def __init__(self,
config_path: Optional[Path] = None,
api_base: Optional[str] = None,
model_name: Optional[str] = None,
max_retries: Optional[int] = None,
retry_delay: Optional[float] = None):
"""Initialize an OpenAI-compatible client that connects to a VLLM server"""
def chat_completion(self,
messages: List[Dict[str, str]],
temperature: float = None,
max_tokens: int = None,
top_p: float = None) -> str:
"""Generate a chat completion using the VLLM OpenAI-compatible API"""
def batch_completion(self,
message_batches: List[List[Dict[str, str]]],
temperature: float = None,
max_tokens: int = None,
top_p: float = None) -> List[str]:
"""Process multiple message sets sequentially"""class QAGenerator:
class QAGenerator:
    def __init__(self,
client: LLMClient,
config_path: Optional[Path] = None):
"""Initialize the QA Generator with an LLM client and optional config"""
def generate_summary(self, document_text: str) -> str:
"""Generate a summary of the document"""
def generate_qa_pairs(self,
document_text: str,
summary: str,
num_pairs: int = 25) -> List[Dict[str, str]]:
"""Generate QA pairs from the document"""
def rate_qa_pairs(self,
qa_pairs: List[Dict[str, str]],
summary: str,
threshold: Optional[float] = None) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
"""Rate and filter QA pairs by quality"""
def process_document(self,
document_text: str,
num_pairs: int = 25,
quality_threshold: Optional[float] = None) -> Dict[str, Any]:
"""Process a document to generate, rate, and format QA pairs"""class Parser:
class Parser:
    def parse(self, file_path: str) -> str:
"""Parse a document into plain text"""
def save(self, content: str, output_path: str) -> None:
"""Save the extracted text to a file"""Each parser implements this interface:
- `PDFParser`: Uses pdfminer.six to extract text from PDF files
- `HTMLParser`: Uses BeautifulSoup4 to extract text from HTML/web pages
- `YouTubeParser`: Uses pytube and youtube-transcript-api to extract transcripts
- `DOCXParser`: Uses python-docx to extract text from Word documents
- `PPTParser`: Uses python-pptx to extract text from PowerPoint presentations
- `TXTParser`: Reads plain text files
# Text Processing
def split_into_chunks(text: str, chunk_size: int = 4000, overlap: int = 200) -> List[str]:
"""Split text into chunks with optional overlap"""
# LLM Output Processing
def parse_qa_pairs(text: str) -> List[Dict[str, str]]:
"""Parse QA pairs from LLM output"""
def parse_ratings(text: str) -> List[Dict[str, Any]]:
"""Parse rated items from LLM output"""
def convert_to_conversation_format(qa_pairs: List[Dict[str, str]]) -> List[List[Dict[str, str]]]:
"""Convert QA pairs to conversation format"""
# Format Conversion
def to_jsonl(data: List[Dict[str, Any]], output_path: str) -> str:
"""Convert data to JSONL format and save to a file"""
def to_alpaca(qa_pairs: List[Dict[str, str]], output_path: str) -> str:
"""Convert QA pairs to Alpaca format and save"""
def to_fine_tuning(qa_pairs: List[Dict[str, str]], output_path: str) -> str:
"""Convert QA pairs to fine-tuning format and save"""
def to_chatml(qa_pairs: List[Dict[str, str]], output_path: str) -> str:
"""Convert QA pairs to ChatML format and save as JSONL"""{
"summary": "Document summary text",
"qa_pairs": [
{
"question": "What is X?",
"answer": "X is..."
},
// More QA pairs...
],
"filtered_pairs": [
{
"question": "What is X?",
"answer": "X is...",
"rating": 8.5
},
// More rated pairs...
],
"conversations": [
[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is X?"},
{"role": "assistant", "content": "X is..."}
],
// More conversations...
],
"metrics": {
"total": 25,
"filtered": 18,
"retention_rate": 0.72,
"avg_score": 7.8
}
}{"question": "What is X?", "answer": "X is..."}
{"question": "How does Y work?", "answer": "Y works by..."}[
{
"instruction": "What is X?",
"input": "",
"output": "X is..."
},
{
"instruction": "How does Y work?",
"input": "",
"output": "Y works by..."
}
]

Fine-tuning (ft) format:

[
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is X?"},
{"role": "assistant", "content": "X is..."}
]
},
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "How does Y work?"},
{"role": "assistant", "content": "Y works by..."}
]
}
]{"messages":[{"role":"system","content":"You are a helpful AI assistant."},{"role":"user","content":"What is X?"},{"role":"assistant","content":"X is..."}]}
{"messages":[{"role":"system","content":"You are a helpful AI assistant."},{"role":"user","content":"How does Y work?"},{"role":"assistant","content":"Y works by..."}]}Content is stored in standard JSON files as shown in the formats above.
Content can be stored as Hugging Face datasets using the efficient Arrow format, which provides:
- Memory-efficient storage (memory-mapped files)
- Fast random access to data
- Column-oriented storage for efficient operations
- Native compatibility with the HF ecosystem
- Better performance for ML workflows
# Example of loading and using a HF dataset
from datasets import load_from_disk
# Load the dataset
dataset = load_from_disk('data/final/example_ft_hf')
# View the features
print(dataset.features)
# Example output: {'messages': [{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}]}
# Access the first example
print(dataset[0])
# Example output: {'messages': [{'role': 'system', 'content': '...'}, {'role': 'user', 'content': '...'}, ...]}
# Use with training libraries
import transformers
trainer = transformers.Trainer(
model=model,
train_dataset=dataset,
# other parameters...
)

The toolkit supports these environment variables for debugging and configuration:
| Variable | Description | Default | Example |
|---|---|---|---|
| `SDK_VERBOSE` | Enable verbose output for all operations | `false` | `export SDK_VERBOSE=true` |
| `SDK_BATCH_SIZE` | Override batch size for curate command | Config setting | `export SDK_BATCH_SIZE=1` |
Setting these variables can help with debugging and performance tuning:
# Process one QA pair at a time with detailed output
export SDK_VERBOSE=true
export SDK_BATCH_SIZE=1
synthetic-data-kit curate data/generated/results.json

# Start VLLM server (in a separate terminal)
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
# Check if server is running
synthetic-data-kit system-check
# 1. Parse a PDF document
synthetic-data-kit ingest documents/paper.pdf
# 2. Generate QA pairs from the parsed text
synthetic-data-kit create data/output/paper.txt
# 3. Clean and filter the generated content
synthetic-data-kit curate data/generated/paper_qa_pairs.json
# 4. Convert to fine-tuning format
synthetic-data-kit save-as data/cleaned/paper_cleaned.json -f ft

Create a custom configuration file `technical_docs.yaml`:
vllm:
model: "meta-llama/Llama-3.3-70B-Instruct"
generation:
temperature: 0.5
chunk_size: 3000
overlap: 300
num_pairs: 40
curate:
threshold: 8.0
temperature: 0.05
prompts:
qa_generation: |
Create {num_pairs} question-answer pairs about technical documentation.
Focus on questions that:
1. Test understanding of complex technical concepts
2. Include code examples and implementation details
3. Cover API usage patterns
Return only the JSON:
[
{{
"question": "Technical question?",
"answer": "Technical answer with code if relevant."
}}
]
Text:
    {text}

Use the custom configuration:
# Process technical documentation with custom config
synthetic-data-kit -c technical_docs.yaml ingest documentation/api_docs.pdf
synthetic-data-kit -c technical_docs.yaml create data/output/api_docs.txt
synthetic-data-kit -c technical_docs.yaml curate data/generated/api_docs_qa_pairs.json
synthetic-data-kit -c technical_docs.yaml save-as data/cleaned/api_docs_cleaned.json -f ft

# Process all PDFs in a directory
for file in documents/*.pdf; do
filename=$(basename "$file" .pdf)
# Ingest
synthetic-data-kit ingest "$file"
# Create QA pairs
synthetic-data-kit create "data/output/${filename}.txt" -n 20
# Curate
synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
# Save as fine-tuning format
synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f ft
done

To customize the summary prompt, override it in your config:

prompts:
summary: |
Create a comprehensive summary of this technical document.
Include:
1. The main topic and purpose
2. Key technical concepts and methodologies
3. Important findings or conclusions
4. System architecture or design patterns
    Focus on extracting the most technically relevant information.

To customize QA pair generation:

prompts:
qa_generation: |
You're an expert creating training data for a technical assistant.
From this text, create {num_pairs} question-answer pairs that:
1. Focus on complex technical concepts
2. Include implementation details and practical usage
3. Cover both basic and advanced topics
4. Represent realistic user queries
Each answer should be comprehensive yet concise, and include code examples where relevant.
Return as JSON:
[
{{
"question": "How does X work in system Y?",
"answer": "X works in system Y by... For example: `code example`"
}}
]
Text:
    {text}

To customize quality rating:

prompts:
qa_rating: |
Evaluate these QA pairs for a technical assistant on a scale of 1-10.
Criteria:
1. Technical accuracy (0-3 points)
2. Completeness of answer (0-3 points)
3. Relevance to practical usage (0-2 points)
4. Clear explanations (0-2 points)
Return the original pairs with ratings added:
[
{"question": "...", "answer": "...", "rating": 8}
]
QA Pairs:
    {pairs}

Create a new parser in the `parsers` directory:
# synthetic_data_kit/parsers/markdown_parser.py
import os
class MarkdownParser:
"""Parser for Markdown files"""
def parse(self, file_path: str) -> str:
"""Parse a Markdown file into plain text"""
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read()
# Remove Markdown formatting
# This is a simple example - you'd want more robust parsing
import re
# Remove headers
content = re.sub(r'#+\s+(.*)', r'\1', content)
# Remove bold/italic
content = re.sub(r'\*\*(.*?)\*\*', r'\1', content)
content = re.sub(r'\*(.*?)\*', r'\1', content)
# Remove links
content = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', content)
return content
def save(self, content: str, output_path: str) -> None:
"""Save the extracted text to a file"""
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(output_path, 'w', encoding='utf-8') as f:
            f.write(content)

Register the parser in `parsers/__init__.py`:
from synthetic_data_kit.parsers.markdown_parser import MarkdownParser

Update the parser selection in `core/ingest.py`:
def determine_parser(file_path, config):
# ... existing code ...
ext = os.path.splitext(file_path)[1].lower()
parsers = {
'.pdf': PDFParser(),
'.html': HTMLParser(),
'.htm': HTMLParser(),
'.docx': DOCXParser(),
'.pptx': PPTParser(),
'.txt': TXTParser(),
'.md': MarkdownParser(), # Add the new parser
'.markdown': MarkdownParser(),
}
    # ... rest of the function ...

Add a new converter function in `utils/format_converter.py`:
from datetime import datetime  # needed for the "created" timestamp below

def to_custom_format(qa_pairs: List[Dict[str, str]], output_path: str) -> str:
"""Convert QA pairs to a custom format and save"""
# Create the custom format structure
formatted_data = {
"version": "1.0",
"created": datetime.now().isoformat(),
"items": []
}
for pair in qa_pairs:
formatted_data["items"].append({
"input": {
"query": pair["question"]
},
"output": {
"text": pair["answer"]
},
"metadata": {
"source": "synthetic-data-kit"
}
})
# Save to file
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(formatted_data, f, indent=2)
    return output_path

Update the format conversion in `core/save_as.py`:
def convert_format(input_path, output_path, format_type, config=None):
# ... existing code ...
elif format_type == "custom":
return to_custom_format(qa_pairs, output_path)
    # ... rest of the function ...

Create a new generator in the `generators` directory:
# synthetic_data_kit/generators/cot_generator.py
from typing import Dict, List, Any, Optional
import json
from synthetic_data_kit.models.llm_client import LLMClient
from synthetic_data_kit.utils.config import get_prompt
class COTGenerator:
"""Generates chain-of-thought reasoning examples"""
def __init__(self, client: LLMClient, config_path: Optional[str] = None):
self.client = client
self.config = client.config
def generate_cot_examples(self, document_text: str, num_examples: int = 5) -> List[Dict[str, Any]]:
"""Generate chain-of-thought reasoning examples"""
# Get the prompt template
prompt_template = get_prompt(self.config, "cot_generation")
# Format the prompt
prompt = prompt_template.format(
num_examples=num_examples,
text=document_text
)
# Generate examples
messages = [{"role": "system", "content": prompt}]
response = self.client.chat_completion(messages)
# Parse response (simplified for example)
examples = []
if '[' in response and ']' in response:
start = response.find('[')
end = response.rfind(']') + 1
try:
examples = json.loads(response[start:end])
            except json.JSONDecodeError:
                print("Error parsing COT examples")
        return examples
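A hedged usage sketch of the new generator (assumes a running VLLM server and the `cot_generation` prompt added below):

```python
from synthetic_data_kit.models.llm_client import LLMClient

client = LLMClient(api_base="http://localhost:8000/v1")
generator = COTGenerator(client)

with open("data/output/paper.txt", encoding="utf-8") as f:
    document_text = f.read()

examples = generator.generate_cot_examples(document_text, num_examples=5)
```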
Add the corresponding prompt to config.yaml:

prompts:
cot_generation: |
Generate {num_examples} chain-of-thought reasoning examples from this text.
Each example should have:
1. A complex problem or question
2. Step-by-step reasoning to solve it
3. The final answer
Return as JSON:
[
{{
"question": "Complex problem?",
"reasoning": "Step 1: ... Step 2: ... Step 3: ...",
"answer": "Final answer"
}}
]
Text:
    {text}

Update the create command to use the new generator:
def process_file(...):
# ... existing code ...
elif content_type == "cot":
from synthetic_data_kit.generators.cot_generator import COTGenerator
generator = COTGenerator(client, config_path)
examples = generator.generate_cot_examples(
document_text,
num_examples=num_pairs # Reuse the num_pairs parameter
)
# Save output
output_path = os.path.join(output_dir, f"{base_name}_cot_examples.json")
with open(output_path, 'w', encoding='utf-8') as f:
json.dump({"cot_examples": examples}, f, indent=2)
return output_path
    # ... rest of the function ...

Error: VLLM server not available at http://localhost:8000/v1
Solution:
- Ensure VLLM is installed: `pip install vllm`
- Start the server: `vllm serve <model_name> --port 8000`
- Check if the port is already in use by another process
- Verify network connectivity to the server
Error parsing LLM output: Expecting property name enclosed in double quotes
Solution:
- Lower the temperature setting (e.g., 0.1) for more predictable outputs
- Improve the prompt to be more explicit about JSON formatting
- Ensure the model is capable of generating valid JSON (larger models tend to do better)
The toolkit includes a robust, multi-method JSON parsing system for handling LLM responses:
def parse_ratings(text: str, original_items: List[Dict[str, str]] = None) -> List[Dict[str, Any]]:
"""Parse rated items from LLM output with enhanced error recovery"""
# Method 1: Comprehensive approach for pretty-printed JSON
# Handles indentation and newlines in JSON from LLMs
# Method 2: Code block extraction
# Finds and parses JSON inside markdown code blocks
# Method 3: Regex-based extraction
# Uses pattern matching to find JSON-like structures
# Method 4: JSON5 parsing (more lenient)
# Applies a more forgiving parser if available
# Method 5: Pattern matching with original items
    # Uses original QA pairs to extract ratings when all else fails
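As a standalone illustration of the regex-based extraction (Method 3), a minimal version might look like this; the toolkit's actual parser layers all five methods with fallbacks:

```python
import json
import re

def extract_json_array(text: str):
    """Pull the first [...] span out of an LLM reply and try to parse it."""
    match = re.search(r"\[.*\]", text, re.DOTALL)  # greedy: outermost brackets
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```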
For optimal JSON parsing, you can:

- Install json5: `pip install json5` for enhanced JSON parsing capabilities
- Use verbose mode: run commands with the `-v` flag to see detailed parsing information
- Set environment variables: `SDK_BATCH_SIZE=1` processes one item at a time for debugging
- Adjust prompt templates: update config.yaml prompts for better JSON formatting
CUDA out of memory
Solution:
- Use a smaller model (e.g., 7B instead of 70B)
- Reduce the batch size in the configuration
- Start VLLM with memory optimization flags:
vllm serve <model> --gpu-memory-utilization 0.85 --max-model-len 4096
- If using multiple GPUs, enable tensor parallelism:
vllm serve <model> --tensor-parallel-size 4
File not found: documents/paper.pdf
Solution:
- Verify the file path is correct (absolute vs. relative)
- Check permissions on the file and directory
- Create the directory structure if it doesn't exist:
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}
# Using the built-in system-check command
synthetic-data-kit system-check --api-base="http://localhost:8000/v1"
# Direct API check
curl -X GET http://localhost:8000/v1/models

# View parsed text file
cat data/output/document.txt
# View generated QA pairs
jq . data/generated/document_qa_pairs.json
# Count QA pairs
jq '.qa_pairs | length' data/generated/document_qa_pairs.json
# View quality metrics
jq '.metrics' data/cleaned/document_cleaned.json

# Test just the parser
synthetic-data-kit ingest documents/paper.pdf -o test_output/
# Test just content creation with a small text file
echo "This is a test document." > test.txt
synthetic-data-kit create test.txt -n 2
# Test just format conversion with a known good file
synthetic-data-kit save-as known_good_data.json -f jsonl

- Source Document Selection
  - Use high-quality, accurate source materials
  - Prefer technical, factual content over subjective or opinion-based text
  - Include a diverse range of topics for better generalization
- Content Generation
  - Start with more pairs than needed (30-50% more)
  - Set a higher quality threshold (8.0+) for critical applications
  - Use lower temperature (0.1-0.3) for more consistent outputs
  - Use larger models (30B+) for more accurate generation
- Post-Processing
  - Manually review a sample of generated content (5-10%)
  - Check for hallucinations or unsupported claims
  - Validate factual accuracy of technical content
- Text Preprocessing
  - Clean document text before ingestion
  - For PDFs, ensure they are text-based, not scanned images
  - Remove irrelevant content (headers, footers, page numbers)
- Chunking Strategy (see the sketch after this list)
  - Balance chunk size with context requirements
  - Ensure sufficient overlap between chunks (10-15% of chunk size)
  - For technical content, keep related sections together
- Prompt Engineering
  - Be explicit about the expected output format
  - Include examples of desired output quality
  - Customize prompts for different content types
- Resource Management
  - Process large documents in smaller batches
  - Implement checkpointing for very large datasets
  - Use a dedicated machine for VLLM serving
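As a sketch of the overlap guideline above (assuming `split_into_chunks` lives in `utils/text.py` as listed in the project layout; 12% is just one point in the 10-15% range):

```python
from synthetic_data_kit.utils.text import split_into_chunks

with open("data/output/paper.txt", encoding="utf-8") as f:
    text = f.read()

chunk_size = 4000
overlap = int(chunk_size * 0.12)  # 10-15% of chunk_size is 400-600 characters
chunks = split_into_chunks(text, chunk_size=chunk_size, overlap=overlap)
```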