feat: implement high-entropy synthetic data generator skill (Issue #22) #31
Merged
rosspeili merged 1 commit into ARPAHLS:main on Apr 3, 2026
…PAHLS#22) Resolves issue ARPAHLS#22 by introducing a data engineering skill that leverages model-agnostic execution and zlib compression heuristics to compute synthetic diversity. Includes tests, example dataset pipeline script, and fully updated docs.
Description
This PR introduces the data_engineering/synthetic_generator skill, designed to provide agents with a robust pipeline for generating high-entropy synthetic training data.
Logic, Cognition, and Governance
- Logic: skill.py supports internal routing to Ollama, Gemini, and Anthropic. It features a zero-dependency entropy validator that uses zlib compression ratios to ensure lexical diversity and prevent "model collapse."
- Cognition: instructions.md (cognitive map) enforces the use of combinatorial personas and edge-case scenarios while strictly prohibiting common AI tropes and boilerplate.
- Governance: follows SkillLoader adapter patterns to maintain strict schema compliance for input/output. It encapsulates high-temperature generation to keep the primary agent's state stable.
Type of Change
- … manifest.yaml, skill.py, and instructions.md)
Checklist
- … manifest.yaml.
- env_vars are explicitly documented in the manifest.
- SkillLoader successfully loads this module without missing dependency errors.
Constitution & Safety
This skill is restricted to text synthesis and deterministic entropy evaluation. It does not perform any file system modifications (leaving data persistence to the orchestrating agent), nor does it execute any generated strings as code. It strictly isolates internal LLM calls to the providers specified in the configuration.
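For reviewers unfamiliar with the zlib compression-ratio heuristic mentioned in this PR, here is a minimal sketch of the general idea. It is not the code in skill.py: the function names compression_ratio and passes_entropy_check, the choice to join the batch before compressing, and the 0.35 threshold are all assumptions made for illustration. The intuition is that repetitive text compresses well (low ratio), while lexically diverse text does not (higher ratio).

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size / raw size; lower values mean more repetition."""
    raw = text.encode("utf-8")
    if not raw:
        # Empty input carries no information, so report the lowest score.
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def passes_entropy_check(samples: list[str], min_ratio: float = 0.35) -> bool:
    """Hypothetical gate: accept a batch only if it compresses poorly enough.

    The batch is joined first so that repetition across samples is
    penalised too, not only repetition inside a single sample.
    The 0.35 default is an illustrative value, not taken from the PR.
    """
    return compression_ratio("\n".join(samples)) >= min_ratio


if __name__ == "__main__":
    repetitive = ["The quick brown fox."] * 50
    varied = [
        "A retired sailor disputes a parking fine in rural Wales.",
        "Quantum chemist debugging a flaky CI job at 3am.",
        "Teenager negotiates bedtime using contract law jargon.",
    ]
    print("repetitive batch accepted:", passes_entropy_check(repetitive))
    print("varied batch accepted:", passes_entropy_check(varied))
```

The check is deterministic and uses only the standard library, which matches the zero-dependency and deterministic properties claimed above. Note that very short inputs inflate the ratio because of zlib's fixed header overhead, so any real threshold would need tuning against batch size.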
Related Issues
Fixes #22