feat: implement high-entropy synthetic data generator skill (Issue #22) #31

Merged
rosspeili merged 1 commit into ARPAHLS:main from rosspeili:feature/synthetic-generator-skill on Apr 3, 2026

@rosspeili
Contributor

Description

This PR introduces the data_engineering/synthetic_generator skill, designed to provide agents with a robust pipeline for generating high-entropy synthetic training data.

Logic, Cognition, and Governance

  • Logic: Implemented a model-agnostic execution layer in skill.py that supports internal routing to Ollama, Gemini, and Anthropic. It features a zero-dependency entropy validator using zlib compression ratios to ensure lexical diversity and prevent "model collapse."
  • Cognition: The instructions.md (cognitive map) enforces the use of combinatorial personas and edge-case scenarios while strictly prohibiting common AI tropes and boilerplate.
  • Governance: The skill operates entirely in Python, using the SkillLoader adapter patterns to maintain strict schema compliance for input/output. It encapsulates high-temperature generation to keep the primary agent's state stable.
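As an illustration of the zlib-based entropy check described in the Logic bullet, here is a minimal sketch. It is not the code from skill.py: the function names `compression_ratio` and `passes_entropy_check` and the 0.3 threshold are assumptions made for this example. The idea is that repetitive text compresses to a small fraction of its size, while lexically diverse text does not, so a low compressed-to-raw ratio flags low diversity.

```python
import zlib
from typing import Sequence


def compression_ratio(text: str) -> float:
    """Return compressed size / raw size; lower means more repetitive text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0  # treat empty input as having no diversity
    return len(zlib.compress(raw, 9)) / len(raw)


def passes_entropy_check(samples: Sequence[str], min_ratio: float = 0.3) -> bool:
    """Hypothetical gate: accept a batch only if it does not compress too well.

    min_ratio is an illustrative threshold, not a value taken from the PR.
    """
    return compression_ratio("\n".join(samples)) >= min_ratio


# A batch that repeats one sentence versus a batch of distinct sentences.
repetitive = ["The customer asked about a refund."] * 200
varied = [
    "A night-shift nurse disputes a duplicate parking charge.",
    "Our greenhouse sensor logs froze after the firmware update.",
    "Can I export quarterly invoices as CSV without admin rights?",
    "The ferry timetable widget shows Tuesday twice on mobile.",
    "My apprentice mislabeled forty jars of buckwheat honey.",
]
```

Calling `passes_entropy_check(repetitive)` rejects the first batch, while the `varied` batch passes. Very short inputs can yield ratios above 1 because of zlib's fixed overhead, so a check like this is more meaningful on a whole batch than on a single short sample.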

Type of Change

  • 🚀 Skill Proposal: New Skill (Contains manifest.yaml, skill.py, and instructions.md)
  • 🐛 Bug Report Fix: Non-breaking change which fixes an execution error or framework bug
  • 📖 Doc Fix: Documentation Update
  • 🧠 Framework Feature / RFC Updates: Core Framework Update

Checklist

  • My code follows the Agent Code of Conduct.
  • I have included a properly formatted manifest.yaml.
  • The skill logic operates purely in Python and does not rely on arbitrary LLM code generation.
  • Requirements and env_vars are explicitly documented in the manifest.
  • I have written unit tests proving deterministic execution and schema compliance.
  • I have verified that SkillLoader successfully loads this module without missing dependency errors.

Constitution & Safety

This skill is restricted to text synthesis and deterministic entropy evaluation. It does not perform any file system modifications (leaving data persistence to the orchestrating agent), nor does it execute any generated strings as code. It strictly isolates internal LLM calls to the providers specified in the configuration.
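One way to picture the provider isolation described above is an allowlist in front of the backend dispatch. The sketch below is an assumption-laden illustration, not the routing code in skill.py: `make_router`, the stub backends, and the error types are all hypothetical, and the stubs return placeholder strings instead of calling any real Ollama, Gemini, or Anthropic API.

```python
from typing import Callable, Dict, Iterable

# A backend is modeled as a plain function from prompt to completion text.
ProviderFn = Callable[[str], str]


def make_router(
    backends: Dict[str, ProviderFn], allowed: Iterable[str]
) -> Callable[[str, str], str]:
    """Build a dispatcher that only calls providers named in the configuration."""
    allowed_set = {name.lower() for name in allowed}

    def route(provider: str, prompt: str) -> str:
        key = provider.lower()
        if key not in allowed_set:
            # Refuse anything the configuration did not list.
            raise PermissionError(f"provider {provider!r} is not enabled")
        if key not in backends:
            raise KeyError(f"no backend registered for {provider!r}")
        # The result is returned as text only; it is never evaluated as code.
        return backends[key](prompt)

    return route


# Stub backends standing in for real clients (hypothetical, no network calls).
stub_backends: Dict[str, ProviderFn] = {
    "ollama": lambda prompt: f"[ollama] {prompt}",
    "gemini": lambda prompt: f"[gemini] {prompt}",
    "anthropic": lambda prompt: f"[anthropic] {prompt}",
}

# Example configuration that enables only one provider.
route = make_router(stub_backends, allowed=["ollama"])
```

With this configuration, `route("ollama", "hello")` returns the stub's string, while `route("gemini", "hello")` raises `PermissionError` even though a gemini backend is registered.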

Related Issues

Fixes #22

…PAHLS#22)

Resolves issue ARPAHLS#22 by introducing a data engineering skill that uses model-agnostic execution and a zlib compression heuristic to estimate the diversity of generated synthetic data. Includes tests, an example dataset pipeline script, and updated docs.
@rosspeili rosspeili merged commit 8f8a963 into ARPAHLS:main Apr 3, 2026
2 of 5 checks passed
@rosspeili rosspeili deleted the feature/synthetic-generator-skill branch April 3, 2026 07:50
Development

Successfully merging this pull request may close these issues.

[New Skill]: Synthetic Data Generator (High-Entropy)