Skip to content

Latest commit

 

History

History
326 lines (246 loc) · 11.5 KB

File metadata and controls

326 lines (246 loc) · 11.5 KB

Case Note Simulator Implementation

📍 Location: simulation/ | 🏠 Home: simulation/README.md | 📚 Main: README.md

Overview

The case-note-simulator generates realistic but completely fictional social services case data for testing and validating analytical workflows in the Strategic Data Analytics (SDA) unit. This guide describes the implementation approach, expert workflow, and integration with sda-casenote-reader.

Repository Architecture

1. Expert-Driven Specification System

Philosophy: Domain experts define synthetic data parameters through human-readable YAML files rather than hardcoded algorithmic assumptions.

Core Components:

  • Client Archetypes (simulation/input-specifications/client-profiles.yml): Demographic patterns and risk factor combinations
  • Case Complexity Levels (simulation/input-specifications/case-complexity-levels.yml): Service intensity and documentation patterns
  • Writing Style Variations (simulation/input-specifications/writing-style-guides.yml): Caseworker documentation styles
  • Project Scenarios (simulation/input-specifications/project-scenarios/): Testing configurations for specific SDA projects

Benefits:

  • Non-technical domain experts can modify synthetic data parameters
  • Specifications are version-controlled and auditable
  • Multiple project scenarios can be maintained simultaneously
  • Clear separation between domain knowledge and technical implementation

2. Generation Engine Architecture

Modular Design: Separate R scripts handle different aspects of synthetic data generation:

  • client-generator.R: Creates client demographic profiles with realistic risk factor co-occurrence
  • note-generator.R: Generates case note text with authentic writing style variations
  • complexity-controller.R: Orchestrates case complexity and service intensity patterns
  • validation-framework.R: Ensures quality and realism of generated data

Generation Pipeline:

  1. Specification Loading: Read expert-authored YAML configurations
  2. Population Generation: Create synthetic client profiles with controlled characteristics
  3. Case Assignment: Apply complexity levels based on client archetypes and project needs
  4. Text Generation: Produce case notes using appropriate writing styles and terminology
  5. Quality Assurance: Validate outputs for realism and eliminate identifying patterns
  6. Export Formatting: Structure data for seamless integration with SDA analytical workflows

3. Quality Validation Framework

Multi-Level Validation:

  • Linguistic Authenticity: Verify appropriate social services terminology and realistic writing patterns
  • Demographic Realism: Check population distributions against Alberta-like characteristics
  • Risk Factor Prevalence: Validate realistic co-occurrence of client challenges
  • Temporal Patterns: Ensure case progression follows authentic service delivery timelines

Validation Metrics:

  • Distribution comparisons against expected population patterns
  • Co-occurrence statistics for risk factors
  • Writing style consistency measures
  • Temporal pattern authenticity checks

Expert Workflow

1. Project Initiation

Step 1: Copy template-scenario.yml to create project-specific configuration Step 2: Define testing objectives and target algorithms Step 3: Specify required synthetic data patterns and characteristics Step 4: Set validation targets and quality criteria

2. Specification Customization

Client Population Design:

population_parameters:
  total_clients: 500
  risk_distribution:
    low_risk: 0.35
    moderate_risk: 0.40  
    high_risk: 0.25

Algorithm Testing Requirements:

validation_targets:
  housing_risk_detection: 0.95
  substance_use_flagging: 0.88
  crisis_prediction: 0.85

Quality Assurance Criteria:

quality_checks:
  terminology_appropriateness: true
  writing_style_variation: true
  temporal_pattern_realism: true

3. Generation and Validation

Generate Synthetic Data:

# Load generation functions
source("./simulation/generation-engine/client-generator.R")

# Generate population for specific project
clients <- generate_client_population(
  n_clients = 500,
  scenario_file = "./simulation/input-specifications/project-scenarios/risk-assessment-validation.yml"
)

# Validate generated population
validation <- validate_client_population(clients)

Quality Review Process:

  1. Review generated client profiles for demographic realism
  2. Check risk factor distributions against expected patterns
  3. Validate case note samples for linguistic authenticity
  4. Verify temporal patterns match service delivery norms

4. Export for SDA Integration

Formatted Export:

export_client_population(
  clients,
  "./simulation/output-datasets/client-profiles/risk_validation_clients.csv"
)

Integration with sda-casenote-reader

1. Data Format Compatibility

Standardized Structure: Synthetic data exported in formats directly compatible with SDA analytical workflows:

  • Consistent client ID systems
  • Matching temporal patterns
  • Preserved risk factor encoding
  • Compatible text formatting

2. Testing Harness

Benchmark Testing:

# Export synthetic data in SDA-compatible format
export_to_sda_format(synthetic_dataset, "./simulation/testing-harness/sda_test_data.csv")

# Run SDA algorithms on synthetic data
test_analysis_pipeline("./simulation/testing-harness/sda_test_data.csv")

Performance Validation:

  • Compare algorithm performance on synthetic vs. anonymized real data
  • Verify known positive cases trigger appropriate risk flags
  • Ensure control cases don't generate false positives
  • Test algorithm stability across different synthetic data generations

3. Iterative Refinement

Feedback Loop:

  1. Generate initial synthetic dataset using expert specifications
  2. Run SDA algorithms on synthetic data to establish baseline performance
  3. Compare results to validation targets and identify gaps
  4. Adjust synthetic data specifications based on algorithm performance
  5. Regenerate and retest until validation targets are met

Key Implementation Features

1. Realistic Authenticity

Human Documentation Patterns:

  • Grammatical errors and spelling inconsistencies at realistic rates
  • Varied writing styles reflecting different caseworker backgrounds
  • Authentic social services terminology and abbreviations
  • Natural variation in documentation completeness and detail

Demographic Realism:

  • Alberta-like population distributions
  • Realistic risk factor co-occurrence patterns
  • Authentic family structures and geographic distributions
  • Age-appropriate service engagement patterns

2. Privacy Protection

Complete Fictional Status:

  • Systematic fictional name generation with no real-world correspondence
  • Geographic obfuscation using realistic but fictional locations
  • Temporal displacement preventing correlation with actual service periods
  • Demographic noise injection maintaining realism while eliminating identifiability

3. Scalability and Adaptability

Multi-Ministry Flexibility:

adapt_for_ministry <- function(ministry_type) {
  switch(ministry_type,
    "justice" = load_legal_terminology(),
    "health" = load_clinical_patterns(),
    "education" = load_student_services_vocab()
  )
}

Configurable Parameters:

  • Adjustable population sizes and demographic distributions
  • Flexible risk factor prevalence rates
  • Customizable service delivery patterns
  • Adaptable writing style variations

Quality Assurance Standards

1. Specification Validation

Expert Review Process:

  • Domain expert review of all YAML specifications
  • Validation against real-world service delivery patterns
  • Cross-referencing with policy standards and program requirements
  • Peer review by additional domain experts when available

2. Generated Data Quality

Automated Validation:

  • Distribution checks against expected population patterns
  • Co-occurrence pattern validation for risk factors
  • Linguistic authenticity scoring for generated text
  • Temporal pattern consistency verification

Manual Quality Review:

  • Sample review of generated case notes for authenticity
  • Demographic realism assessment by domain experts
  • Terminology appropriateness evaluation
  • Overall believability assessment

3. Algorithm Performance Validation

Benchmark Testing:

  • Performance comparison against established baselines
  • Accuracy measurement on synthetic data with known characteristics
  • False positive and false negative rate assessment
  • Consistency testing across different synthetic data generations

Future Enhancements

1. Advanced Text Generation

Natural Language Processing:

  • Integration with large language models for more sophisticated text generation
  • Enhanced writing style modeling based on real caseworker documentation
  • Dynamic terminology adaptation based on case characteristics
  • Contextual error generation reflecting realistic human patterns

2. Longitudinal Case Modeling

Complex Case Progression:

  • Multi-year case evolution patterns
  • Seasonal variation in service needs and crisis events
  • Family system dynamics affecting individual case progression
  • Economic and policy impact modeling on case outcomes

3. Multi-Modal Data Integration

Comprehensive Synthetic Ecosystem:

  • Integration with administrative data systems
  • Healthcare interaction modeling
  • Justice system involvement patterns
  • Employment and education record generation

Getting Started

1. Repository Setup

# Clone the repository
git clone [repository-url] case-note-simulator
cd case-note-simulator

# Install required R packages
Rscript -e "install.packages(c('yaml', 'dplyr', 'purrr', 'lubridate', 'stringr', 'jsonlite'))"

2. First Synthetic Data Generation

# Load the client generator
source("./simulation/generation-engine/client-generator.R")

# Generate a small test population
test_clients <- generate_client_population(n_clients = 50)

# Review the results
head(test_clients)
validation <- validate_client_population(test_clients)
print(validation)

3. Project-Specific Configuration

  1. Copy ./simulation/input-specifications/project-scenarios/template-scenario.yml
  2. Customize for your specific SDA project requirements
  3. Generate synthetic data using your project scenario
  4. Export in format compatible with your analytical workflows

Support and Maintenance

1. Expert Specification Updates

Regular Review Process:

  • Quarterly review of client archetype specifications
  • Annual validation of risk factor prevalence rates
  • Ongoing updates based on policy changes and program evolution
  • Continuous calibration against anonymized real-world patterns

2. Technical Maintenance

Code Quality Standards:

  • Comprehensive unit testing for all generation functions
  • Performance optimization for large-scale synthetic data generation
  • Documentation updates reflecting specification changes
  • Version control for both code and expert specifications

3. SDA Integration Support

Ongoing Compatibility:

  • Regular testing against evolving SDA analytical workflows
  • Format updates to match changing analytical requirements
  • Performance benchmarking against new algorithm versions
  • Collaboration support for new analytical project development

This implementation provides a robust, expert-driven framework for generating synthetic social services data that supports rigorous testing and validation of analytical workflows while maintaining complete privacy protection and realistic authenticity.