The case-note-simulator generates realistic but completely fictional social services case data for testing and validating analytical workflows in the Strategic Data Analytics (SDA) unit. This guide describes the implementation approach, expert workflow, and integration with sda-casenote-reader.
Philosophy: Domain experts define synthetic data parameters through human-readable YAML files rather than hardcoded algorithmic assumptions.
Core Components:
- Client Archetypes (`simulation/input-specifications/client-profiles.yml`): Demographic patterns and risk factor combinations
- Case Complexity Levels (`simulation/input-specifications/case-complexity-levels.yml`): Service intensity and documentation patterns
- Writing Style Variations (`simulation/input-specifications/writing-style-guides.yml`): Caseworker documentation styles
- Project Scenarios (`simulation/input-specifications/project-scenarios/`): Testing configurations for specific SDA projects
Benefits:
- Non-technical domain experts can modify synthetic data parameters
- Specifications are version-controlled and auditable
- Multiple project scenarios can be maintained simultaneously
- Clear separation between domain knowledge and technical implementation
Modular Design: Separate R scripts handle different aspects of synthetic data generation:
- `client-generator.R`: Creates client demographic profiles with realistic risk factor co-occurrence
- `note-generator.R`: Generates case note text with authentic writing style variations
- `complexity-controller.R`: Orchestrates case complexity and service intensity patterns
- `validation-framework.R`: Ensures quality and realism of generated data
Generation Pipeline:
- Specification Loading: Read expert-authored YAML configurations
- Population Generation: Create synthetic client profiles with controlled characteristics
- Case Assignment: Apply complexity levels based on client archetypes and project needs
- Text Generation: Produce case notes using appropriate writing styles and terminology
- Quality Assurance: Validate outputs for realism and eliminate identifying patterns
- Export Formatting: Structure data for seamless integration with SDA analytical workflows
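The stages above can be pictured as a chain of small functions, each taking the previous stage's output. The following is a minimal base R sketch under stated assumptions: every function name here is a hypothetical stand-in rather than the actual generation-engine API, the specification is a plain list standing in for parsed YAML, and the export stage is left out.

```r
# Hypothetical stand-ins for the pipeline stages; not the project's real functions.
load_spec <- function() {
  # In the real pipeline this would come from an expert-authored YAML file.
  list(
    total_clients = 10,
    risk_distribution = c(low_risk = 0.35, moderate_risk = 0.40, high_risk = 0.25)
  )
}

generate_population <- function(spec, seed = 1) {
  set.seed(seed)
  data.frame(
    client_id = sprintf("SYN-%04d", seq_len(spec$total_clients)),
    risk_level = sample(names(spec$risk_distribution), spec$total_clients,
                        replace = TRUE, prob = spec$risk_distribution),
    stringsAsFactors = FALSE
  )
}

assign_complexity <- function(clients) {
  # Toy rule (an assumption): complexity simply mirrors the risk level.
  lookup <- c(low_risk = "routine", moderate_risk = "moderate", high_risk = "intensive")
  clients$complexity <- unname(lookup[clients$risk_level])
  clients
}

generate_notes <- function(clients) {
  clients$note <- paste("Client", clients$client_id, "seen for",
                        clients$complexity, "follow-up.")
  clients
}

check_quality <- function(clients) {
  # Fail early if IDs collide or any note is empty.
  stopifnot(anyDuplicated(clients$client_id) == 0, all(nzchar(clients$note)))
  clients
}

run_pipeline <- function(seed = 1) {
  spec <- load_spec()
  clients <- generate_population(spec, seed)
  clients <- assign_complexity(clients)
  clients <- generate_notes(clients)
  check_quality(clients)
}

synthetic <- run_pipeline()
```

Keeping each stage a function of its input makes it possible to rerun a single stage with a different seed or specification without touching the others.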
Multi-Level Validation:
- Linguistic Authenticity: Verify appropriate social services terminology and realistic writing patterns
- Demographic Realism: Check population distributions against Alberta-like characteristics
- Risk Factor Prevalence: Validate realistic co-occurrence of client challenges
- Temporal Patterns: Ensure case progression follows authentic service delivery timelines
Validation Metrics:
- Distribution comparisons against expected population patterns
- Co-occurrence statistics for risk factors
- Writing style consistency measures
- Temporal pattern authenticity checks
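As one illustration of the co-occurrence statistics listed above, pairwise rates can be read off a cross-product of binary risk flags. This is a sketch only: the column names and the six-client data frame are hypothetical, and the real validation framework may compute these figures differently.

```r
# Hypothetical indicator columns: 1 = risk factor present for that client.
risk_flags <- data.frame(
  housing_instability = c(1, 1, 0, 1, 0, 0),
  substance_use       = c(1, 0, 0, 1, 0, 1),
  unemployment        = c(1, 1, 0, 0, 0, 1)
)

cooccurrence_rates <- function(flags) {
  m <- as.matrix(flags)
  # crossprod(m) is t(m) %*% m: entry [i, j] counts clients with both factors i and j.
  # Dividing by the number of clients turns counts into rates.
  crossprod(m) / nrow(m)
}

rates <- cooccurrence_rates(risk_flags)
rates["housing_instability", "substance_use"]
```

The diagonal of the result holds each factor's overall prevalence, so a single matrix can be compared against both prevalence and co-occurrence expectations.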
Step 1: Copy `template-scenario.yml` to create a project-specific configuration
Step 2: Define testing objectives and target algorithms
Step 3: Specify required synthetic data patterns and characteristics
Step 4: Set validation targets and quality criteria
Client Population Design:
```yaml
population_parameters:
  total_clients: 500
  risk_distribution:
    low_risk: 0.35
    moderate_risk: 0.40
    high_risk: 0.25
```
Algorithm Testing Requirements:
```yaml
validation_targets:
  housing_risk_detection: 0.95
  substance_use_flagging: 0.88
  crisis_prediction: 0.85
```
Quality Assurance Criteria:
```yaml
quality_checks:
  terminology_appropriateness: true
  writing_style_variation: true
  temporal_pattern_realism: true
```
Generate Synthetic Data:
```r
# Load generation functions
source("./simulation/generation-engine/client-generator.R")

# Generate population for specific project
clients <- generate_client_population(
  n_clients = 500,
  scenario_file = "./simulation/input-specifications/project-scenarios/risk-assessment-validation.yml"
)

# Validate generated population
validation <- validate_client_population(clients)
```
Quality Review Process:
- Review generated client profiles for demographic realism
- Check risk factor distributions against expected patterns
- Validate case note samples for linguistic authenticity
- Verify temporal patterns match service delivery norms
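The risk distribution check in this review can be partly automated. The sketch below compares observed shares with the `risk_distribution` targets from the scenario file. The function name, the `risk_levels` input, and the 0.05 tolerance are assumptions made for illustration, not part of the existing validation code.

```r
check_risk_distribution <- function(risk_levels, targets, tolerance = 0.05) {
  # Fixing the factor levels keeps empty categories in the table with a share of 0.
  observed <- table(factor(risk_levels, levels = names(targets))) / length(risk_levels)
  data.frame(
    level = names(targets),
    target = unname(targets),
    observed = as.numeric(observed),
    within_tolerance = abs(as.numeric(observed) - unname(targets)) <= tolerance
  )
}

targets <- c(low_risk = 0.35, moderate_risk = 0.40, high_risk = 0.25)

# Simulated stand-in for the risk levels of a generated population.
set.seed(42)
simulated_levels <- sample(names(targets), 500, replace = TRUE, prob = targets)

report <- check_risk_distribution(simulated_levels, targets)
print(report)
```

Rows where `within_tolerance` is `FALSE` point the reviewer to the categories worth inspecting by hand.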
Formatted Export:
```r
export_client_population(
  clients,
  "./simulation/output-datasets/client-profiles/risk_validation_clients.csv"
)
```
Standardized Structure: Synthetic data exported in formats directly compatible with SDA analytical workflows:
- Consistent client ID systems
- Matching temporal patterns
- Preserved risk factor encoding
- Compatible text formatting
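A round-trip check is one simple way to confirm that an export preserves these properties. The sketch below uses base R `write.csv` and `read.csv` rather than `export_client_population`, whose internals are not shown in this guide. The `SYN-` ID format and the column names are hypothetical.

```r
clients <- data.frame(
  client_id = sprintf("SYN-%06d", 1:3),  # zero-padded so IDs sort and join consistently
  intake_date = as.Date(c("2021-01-15", "2021-02-03", "2021-03-22")),
  risk_level = c("low_risk", "high_risk", "moderate_risk"),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")

# Date columns are written as ISO 8601 strings (YYYY-MM-DD).
write.csv(clients, out_file, row.names = FALSE)

# Read back with explicit column types to confirm nothing changed in transit.
reloaded <- read.csv(out_file, colClasses = c("character", "Date", "character"))
```

Declaring `colClasses` on the reading side avoids silent type guessing, which is a common source of mismatched IDs and dates between tools.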
Benchmark Testing:
```r
# Export synthetic data in SDA-compatible format
export_to_sda_format(synthetic_dataset, "./simulation/testing-harness/sda_test_data.csv")

# Run SDA algorithms on synthetic data
test_analysis_pipeline("./simulation/testing-harness/sda_test_data.csv")
```
Performance Validation:
- Compare algorithm performance on synthetic vs. anonymized real data
- Verify known positive cases trigger appropriate risk flags
- Ensure control cases don't generate false positives
- Test algorithm stability across different synthetic data generations
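Because synthetic cases carry known ground truth, detection rates can be computed directly. The sketch below assumes two hypothetical logical vectors, `known_positive` and `flagged`. It also assumes that a validation target such as `housing_risk_detection: 0.95` refers to recall, which this guide does not state.

```r
detection_metrics <- function(known_positive, flagged) {
  tp <- sum(known_positive & flagged)    # positives correctly flagged
  fp <- sum(!known_positive & flagged)   # control cases wrongly flagged
  fn <- sum(known_positive & !flagged)   # positives missed
  tn <- sum(!known_positive & !flagged)  # control cases correctly left alone
  c(
    recall = tp / (tp + fn),
    false_positive_rate = fp / (fp + tn),
    precision = tp / (tp + fp)
  )
}

# Ten toy cases: four known positives, six controls.
known_positive <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
flagged        <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

metrics <- detection_metrics(known_positive, flagged)
meets_target <- metrics[["recall"]] >= 0.95
```

In this toy case one positive is missed and one control is flagged, so recall is 0.75 and the 0.95 target is not met.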
Feedback Loop:
- Generate initial synthetic dataset using expert specifications
- Run SDA algorithms on synthetic data to establish baseline performance
- Compare results to validation targets and identify gaps
- Adjust synthetic data specifications based on algorithm performance
- Regenerate and retest until validation targets are met
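The loop above can be sketched as a bounded iteration. Everything in this example is a toy assumption: `evaluate_spec` stands in for generating data and scoring an SDA algorithm, and `explicit_mention_rate` is a hypothetical specification parameter that the loop adjusts.

```r
# Toy stand-in for "generate data, run the algorithm, score it".
# Here the score rises as notes mention the risk factor more explicitly.
evaluate_spec <- function(spec) {
  min(1, 0.60 + 0.5 * spec$explicit_mention_rate)
}

refine_until_target <- function(spec, target, step = 0.1, max_rounds = 10) {
  stopifnot(max_rounds >= 1)
  history <- numeric(0)
  for (i in seq_len(max_rounds)) {
    score <- evaluate_spec(spec)
    history <- c(history, score)
    if (score >= target) break
    # Adjust the specification, then regenerate and retest on the next pass.
    spec$explicit_mention_rate <- min(1, spec$explicit_mention_rate + step)
  }
  list(spec = spec, history = history, met_target = score >= target)
}

# 0.88 matches the substance_use_flagging target shown earlier in this guide.
result <- refine_until_target(list(explicit_mention_rate = 0.3), target = 0.88)
```

The `max_rounds` cap matters in practice: if the target cannot be reached by adjusting the data alone, the loop should stop and surface the score history rather than run indefinitely.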
Human Documentation Patterns:
- Grammatical errors and spelling inconsistencies at realistic rates
- Varied writing styles reflecting different caseworker backgrounds
- Authentic social services terminology and abbreviations
- Natural variation in documentation completeness and detail
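Spelling inconsistencies at a controlled rate could be introduced along the following lines. This is a sketch rather than the note generator's actual method: the function name, the 5% default rate, and the choice of adjacent-letter transposition as the error type are all assumptions.

```r
inject_typos <- function(text, error_rate = 0.05, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  for (i in seq_along(words)) {
    n <- nchar(words[i])
    # Only words of four or more characters are eligible for a transposition.
    if (n >= 4 && runif(1) < error_rate) {
      # pos ranges over 2..(n - 2), so the first and last characters never move.
      # sample.int avoids the sample() pitfall when only one position is valid.
      pos <- 1 + sample.int(n - 3, 1)
      chars <- strsplit(words[i], "")[[1]]
      chars[c(pos, pos + 1)] <- chars[c(pos + 1, pos)]
      words[i] <- paste(chars, collapse = "")
    }
  }
  paste(words, collapse = " ")
}

note <- "Client missed scheduled appointment and did not return calls"
inject_typos(note, error_rate = 0.3, seed = 11)
```

Transposing two identical letters leaves a word unchanged, so the realised error rate can be slightly below `error_rate`.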
Demographic Realism:
- Alberta-like population distributions
- Realistic risk factor co-occurrence patterns
- Authentic family structures and geographic distributions
- Age-appropriate service engagement patterns
Complete Fictional Status:
- Systematic fictional name generation with no real-world correspondence
- Geographic obfuscation using realistic but fictional locations
- Temporal displacement preventing correlation with actual service periods
- Demographic noise injection maintaining realism while eliminating identifiability
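Two of these safeguards, temporal displacement and demographic noise, lend themselves to a short illustration. The sketch below is an assumption about how they might be implemented. The function names, the one-year shift window, and the two-year age jitter are hypothetical.

```r
displace_dates <- function(dates, max_shift_days = 365) {
  # A single shared offset moves the whole timeline while keeping
  # the intervals between events intact.
  offset <- sample(-max_shift_days:max_shift_days, 1)
  dates + offset
}

add_age_noise <- function(ages, max_jitter = 2, min_age = 0) {
  # Independent jitter per client, clamped so no age drops below min_age.
  jitter <- sample(-max_jitter:max_jitter, length(ages), replace = TRUE)
  pmax(min_age, ages + jitter)
}

set.seed(3)
service_dates <- as.Date(c("2022-01-10", "2022-02-14", "2022-06-01"))
shifted_dates <- displace_dates(service_dates)
noisy_ages <- add_age_noise(c(0, 34, 71))
```

Using one offset per dataset, rather than one per event, is a deliberate trade-off here: it preserves service delivery intervals, which the temporal validation checks depend on.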
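Multi-Ministry Flexibility:
```r
adapt_for_ministry <- function(ministry_type) {
  switch(ministry_type,
    "justice" = load_legal_terminology(),
    "health" = load_clinical_patterns(),
    "education" = load_student_services_vocab(),
    # Fail loudly instead of silently returning NULL for an unknown ministry
    stop("Unsupported ministry type: ", ministry_type)
  )
}
```
Configurable Parameters: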
- Adjustable population sizes and demographic distributions
- Flexible risk factor prevalence rates
- Customizable service delivery patterns
- Adaptable writing style variations
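One way to support this kind of configurability is to layer project overrides on top of shared defaults. The sketch below uses `utils::modifyList`, which merges nested lists recursively. The `configure` function and the `writing_style` field are hypothetical, and the numeric defaults are copied from the scenario example earlier in this guide.

```r
default_parameters <- list(
  total_clients = 500,
  risk_distribution = list(low_risk = 0.35, moderate_risk = 0.40, high_risk = 0.25),
  writing_style = "standard"  # hypothetical field
)

configure <- function(overrides = list()) {
  # Nested lists are merged element by element, so a project can override
  # a single risk share without restating the others.
  params <- utils::modifyList(default_parameters, overrides)
  shares <- unlist(params$risk_distribution)
  if (!isTRUE(all.equal(sum(shares), 1))) {
    stop("risk_distribution must sum to 1, got ", sum(shares))
  }
  params
}

pilot <- configure(list(
  total_clients = 50,
  risk_distribution = list(low_risk = 0.30, high_risk = 0.30)
))
```

Validating the merged result, rather than each override, catches the case where an individually plausible change leaves the distribution inconsistent.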
Expert Review Process:
- Domain expert review of all YAML specifications
- Validation against real-world service delivery patterns
- Cross-referencing with policy standards and program requirements
- Peer review by additional domain experts when available
Automated Validation:
- Distribution checks against expected population patterns
- Co-occurrence pattern validation for risk factors
- Linguistic authenticity scoring for generated text
- Temporal pattern consistency verification
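The temporal consistency check in particular is straightforward to sketch. The example below flags notes dated before intake and notes that appear out of chronological order within a client. The data frame layout and column names are assumptions for illustration.

```r
case_notes <- data.frame(
  client_id = c("SYN-0001", "SYN-0001", "SYN-0001", "SYN-0002", "SYN-0002"),
  intake_date = as.Date(c("2022-01-10", "2022-01-10", "2022-01-10",
                          "2022-03-01", "2022-03-01")),
  note_date = as.Date(c("2022-01-10", "2022-01-24", "2022-02-14",
                        "2022-02-20", "2022-03-15")),
  stringsAsFactors = FALSE
)

find_temporal_violations <- function(notes) {
  before_intake <- notes$note_date < notes$intake_date
  # Within each client, flag any note dated earlier than the row before it.
  # ave() returns numeric, so the logical result comes back as 0/1.
  out_of_order <- ave(
    as.numeric(notes$note_date), notes$client_id,
    FUN = function(d) c(FALSE, diff(d) < 0)
  ) == 1
  notes[before_intake | out_of_order, , drop = FALSE]
}

violations <- find_temporal_violations(case_notes)
```

Here the first note for `SYN-0002` is dated nine days before that client's intake, so it is the only row returned.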
Manual Quality Review:
- Sample review of generated case notes for authenticity
- Demographic realism assessment by domain experts
- Terminology appropriateness evaluation
- Overall believability assessment
Benchmark Testing:
- Performance comparison against established baselines
- Accuracy measurement on synthetic data with known characteristics
- False positive and false negative rate assessment
- Consistency testing across different synthetic data generations
Natural Language Processing:
- Integration with large language models for more sophisticated text generation
- Enhanced writing style modeling based on real caseworker documentation
- Dynamic terminology adaptation based on case characteristics
- Contextual error generation reflecting realistic human patterns
Complex Case Progression:
- Multi-year case evolution patterns
- Seasonal variation in service needs and crisis events
- Family system dynamics affecting individual case progression
- Economic and policy impact modeling on case outcomes
Comprehensive Synthetic Ecosystem:
- Integration with administrative data systems
- Healthcare interaction modeling
- Justice system involvement patterns
- Employment and education record generation
```bash
# Clone the repository
git clone [repository-url] case-note-simulator
cd case-note-simulator

# Install required R packages
Rscript -e "install.packages(c('yaml', 'dplyr', 'purrr', 'lubridate', 'stringr', 'jsonlite'))"
```
```r
# Load the client generator
source("./simulation/generation-engine/client-generator.R")

# Generate a small test population
test_clients <- generate_client_population(n_clients = 50)

# Review the results
head(test_clients)
validation <- validate_client_population(test_clients)
print(validation)
```
- Copy `./simulation/input-specifications/project-scenarios/template-scenario.yml`
- Customize for your specific SDA project requirements
- Generate synthetic data using your project scenario
- Export in format compatible with your analytical workflows
Regular Review Process:
- Quarterly review of client archetype specifications
- Annual validation of risk factor prevalence rates
- Ongoing updates based on policy changes and program evolution
- Continuous calibration against anonymized real-world patterns
Code Quality Standards:
- Comprehensive unit testing for all generation functions
- Performance optimization for large-scale synthetic data generation
- Documentation updates reflecting specification changes
- Version control for both code and expert specifications
Ongoing Compatibility:
- Regular testing against evolving SDA analytical workflows
- Format updates to match changing analytical requirements
- Performance benchmarking against new algorithm versions
- Collaboration support for new analytical project development
This implementation provides an expert-driven framework for generating synthetic social services data that supports rigorous testing and validation of analytical workflows while protecting privacy and preserving realism.