This directory contains the synthetic data generation engine for creating realistic but completely fictional social services case data.
Generate synthetic datasets that mirror real-world complexity to support:
- Testing risk flagging algorithms
- Validating sentiment analysis workflows
- Training pattern detection systems
- Benchmarking AI agent performance in `sda-casenote-reader`
Expert-authored YAML configuration files (`input-specifications/`):
- `client-profiles.yml` - Demographic patterns and risk factor combinations
- `case-complexity-levels.yml` - Service intensity and documentation patterns
- `writing-style-guides.yml` - Caseworker documentation styles
- `project-scenarios/` - Testing configurations for specific SDA projects
R scripts for synthetic data generation (`generation-engine/`):
- `client-generator.R` - Creates client demographic profiles with realistic risk factors
- `note-generator.R` - Generates case note text with authentic writing styles
- `complexity-controller.R` - Orchestrates case complexity and service patterns
- `validation-framework.R` - Ensures quality and realism of generated data
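
To make the relationship between the YAML specifications and the generation scripts concrete, here is a minimal sketch in base R. It is not the code in `client-generator.R`: the `spec` field names, the `generate_clients()` function, and the `SYN-` identifier prefix are all assumptions made for illustration. The spec is written as the nested list that a YAML parser would return.

```r
# Hypothetical specification. The field names are assumptions, not the real
# schema of client-profiles.yml. A YAML parser would return a list like this.
spec <- list(
  n_clients = 200,
  age_range = c(18, 85),
  risk_factors = list(
    housing_instability = 0.25,
    unemployment        = 0.30,
    social_isolation    = 0.15
  )
)

# Build one synthetic client table from the specification.
generate_clients <- function(spec, seed = 42) {
  set.seed(seed)  # fixed seed so the same spec reproduces the same data
  n <- spec$n_clients
  clients <- data.frame(
    client_id = sprintf("SYN-%05d", seq_len(n)),  # prefix marks rows as synthetic
    age = sample(spec$age_range[1]:spec$age_range[2], n, replace = TRUE),
    stringsAsFactors = FALSE
  )
  # Simplification: each risk factor is drawn independently at its spec rate.
  for (factor_name in names(spec$risk_factors)) {
    clients[[factor_name]] <- runif(n) < spec$risk_factors[[factor_name]]
  }
  clients
}

clients <- generate_clients(spec)
```

Independent draws keep the sketch short. Modelling the risk factor combinations described in the specifications would require correlated sampling instead.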
Generated synthetic datasets (`output-datasets/`):
- Export-ready datasets formatted for `sda-casenote-reader` integration
- Multiple project scenarios with different characteristics
- Quality validation reports and metrics
Quality assurance and testing framework (`testing-harness/`):
- Validation scripts for ensuring realistic distributions
- Privacy protection verification (confirming that all generated data is entirely fictional)
- Integration testing with SDA analytical pipelines
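
The checks listed above could take roughly the following shape in base R. This is a sketch under stated assumptions, not the project's validation code: `validate_clients()`, the column names, the `SYN-` identifier prefix, and the tolerance value are all hypothetical. The prefix check is only a simple stand-in for a real privacy verification step.

```r
# Hypothetical validation helper. Compares observed risk factor rates with the
# expected rates and checks that identifiers are unique and marked synthetic.
validate_clients <- function(clients, expected_rates, tolerance = 0.10) {
  observed <- vapply(
    names(expected_rates),
    function(nm) mean(clients[[nm]]),
    numeric(1)
  )
  rate_ok <- unname(abs(observed - unlist(expected_rates)) <= tolerance)
  data.frame(
    check = c(
      paste0("rate:", names(expected_rates)),
      "ids_unique",
      "ids_marked_synthetic"
    ),
    passed = c(
      rate_ok,
      !anyDuplicated(clients$client_id),
      all(startsWith(clients$client_id, "SYN-"))
    ),
    stringsAsFactors = FALSE
  )
}

# Tiny hand-written table, used only to exercise the helper.
clients <- data.frame(
  client_id = c("SYN-00001", "SYN-00002", "SYN-00003", "SYN-00004"),
  housing_instability = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

report <- validate_clients(clients, list(housing_instability = 0.25))
```

Returning one row per check, rather than stopping at the first failure, lets a quality report show every problem in a generated dataset at once.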
See `implementation.md` for comprehensive architecture documentation, including:
- Expert-driven specification system
- Generation pipeline workflow
- Quality validation framework
- Integration with SDA workflows
- Configure specifications: Edit YAML files in `input-specifications/`
- Generate data: Run scripts in `generation-engine/`
- Validate output: Check `output-datasets/` for generated files
- Test integration: Use `testing-harness/` for quality assurance
- Expert-Driven: Domain experts control parameters via YAML files
- Completely Fictional: No real client data, privacy-protected
- Alberta-Like: Realistic demographic patterns and terminology
- SDA-Ready: Export formats compatible with analytical workflows
- Quality Assured: Multi-level validation for realism and consistency
For detailed implementation information, see `implementation.md`.