
ArcheoML-Confparser


A YAML-based hierarchical configuration parser for machine learning projects from 2018-2020,
demonstrating independent convergent evolution of design patterns later popularized by Hydra and OmegaConf.

Development of this configuration management system began in June 2018, fifteen months before the first public release of Hydra (September 2019) and two years before OmegaConf reached stability (November 2020). This codebase independently arrived at the same design and usage principles that would later define the industry-standard approach to ML experiment configuration.

Historical Note

The Context: Late 2010s ML Configuration Chaos

In 2017-2018, the ML community was grappling with configuration management complexity as experiments became more sophisticated. Training runs required managing dozens of hyperparameters, nested model architectures, and dataset configurations. Engineers needed to:

  • Run hundreds of experiments with slight variations
  • Ensure reproducibility across team members and compute environments
  • Override specific parameters without duplicating entire config files
  • Maintain readability as configurations grew to hundreds of lines

Existing solutions were ad-hoc: custom scripts, flat INI files, or environment variables. There was no widely-adopted standard for hierarchical, overrideable configurations tailored to ML workflows.

Why This Matters

This repository demonstrates convergent evolution in software engineering: when facing identical constraints and problems, independent efforts arrive at essentially the same solutions. This has implications for:

Software Archaeology: Documents how ML tooling evolved in response to practical challenges in the "pre-MLOps" era

Engineering Philosophy: Shows that widely adopted patterns (like Hydra's design) succeeded not only through innovation and first-mover advantage, but because they were optimal solutions to fundamental challenges felt universally across ML teams

Prior Art & Independent Invention: Provides historical evidence that these configuration patterns emerged organically across the community, not from a single source

Understanding Design Inevitability: Demonstrates which patterns emerge from problem constraints (destined to be rediscovered) versus which depend on specific implementation choices or organizational context

Historical Perspective

Hydra (backed by Facebook AI Research) ultimately became the community standard through open-source availability, institutional authority (FAIR), extensive documentation and promotion, and strong ecosystem support. Development of this codebase began fifteen months before Hydra's first public announcement, entirely within private organizations and with no participation in public ML tooling discussions. Yet it arrived at nearly identical solutions and design patterns.

Timeline

  • June 18, 2018: Original development begins
  • Early July 2018: Core functionality complete (hierarchical configs, CLI overrides, dot notation, validation)
  • October 3, 2019: Hydra publicly announced by Facebook AI Research (official blog post)
  • December 2019: This codebase extracted as standalone package
  • June 2021: Hydra 1.1 + OmegaConf 2.1 released, establishing the modern standard for ML configuration
  • October-November 2025: Historical preservation; git history reconstructed from original repositories with preserved timestamps and authorship
  • February 2026: Public release as open-source historical artifact

Core Design Patterns (Independently Discovered)

This implementation and Hydra/OmegaConf share fundamental patterns that emerged from practical needs:

  • Default-as-schema: Using a default configuration file as both type schema and fallback values
  • Hierarchical override semantics: Clear precedence rules (CLI arguments > Custom Config > Default Config)
  • Dot notation access: Treating nested YAML as attribute-accessible objects (config.model.learning_rate)
  • Partial overriding: Modifying specific nested values without redefining parent structures
  • Type-aware validation: Catching configuration errors before expensive training runs
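The shared patterns above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original confparser API: DotDict and merge are hypothetical names, shown here only to make dot-notation access and "override wins" partial overriding concrete.

```python
# Hypothetical sketch of the shared patterns; not the original implementation.

class DotDict(dict):
    """Nested dict whose keys are also readable as attributes."""
    def __getattr__(self, key):
        value = self[key]
        # Wrap nested dicts so chained access (config.model.name) works.
        return DotDict(value) if isinstance(value, dict) else value

def merge(default, override):
    """Recursively apply override on top of default (partial overriding)."""
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)   # descend, don't replace
        else:
            merged[key] = value                       # override wins
    return merged

# Default config acts as both schema and fallback values.
default = {"model": {"name": "mlp", "layers": 3},
           "training": {"learning_rate": 0.01}}
# A custom config only redefines what it changes.
custom = {"training": {"learning_rate": 0.001}}

config = DotDict(merge(default, custom))
print(config.model.name)              # "mlp" (untouched by the override)
print(config.training.learning_rate)  # 0.001 (overridden)
```

Note how the custom config leaves config.model intact: only the keys it names are replaced, which is exactly the partial-overriding semantics listed above.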

Feature Comparison

Feature                ArcheoML-Confparser (2018)   Hydra (2019+)
Hierarchical configs   ✅                            ✅
CLI overrides          ✅                            ✅
Dot notation access    ✅                            ✅
Schema validation      ✅                            ✅
Partial overriding     ✅                            ✅
Multi-run support      ❌                            ✅
Plugin system          ❌                            ✅
Tab completion         ❌                            ✅
Ecosystem              ❌                            ✅ Extensive

Installation

For those interested in exploring this historical implementation:

pip install git+https://github.com/lospooky/archeoml-confparser.git

Or in editable/development mode:

git clone https://github.com/lospooky/archeoml-confparser.git
cd archeoml-confparser
pip install -e .

Quick Start

from confparser import parse_configuration

# Load configuration with default values
config = parse_configuration("examples/default_config.yaml")

# Access nested config via dot notation
print(config.model.name)
print(config.training.learning_rate)

At the command line:

# Override with custom config and/or CLI arguments
python train.py --custom_config custom.yaml --training.learning_rate 0.001
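A dot-notation flag such as --training.learning_rate can be expanded into a nested override dict before merging it over the configs. The sketch below is a hypothetical illustration of that expansion (parse_cli_overrides is not the package's actual function); string values are coerced to int or float where possible, mirroring the type-aware behavior described above.

```python
# Hypothetical sketch: expand dot-notation CLI overrides into a nested dict.

def coerce(raw):
    """Best-effort typing: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

def parse_cli_overrides(argv):
    """Turn ['--a.b', '1', '--c', 'x'] into {'a': {'b': 1}, 'c': 'x'}."""
    overrides = {}
    for flag, raw in zip(argv[::2], argv[1::2]):
        *parents, leaf = flag.lstrip("-").split(".")
        node = overrides
        for part in parents:
            node = node.setdefault(part, {})  # create nesting on demand
        node[leaf] = coerce(raw)
    return overrides

print(parse_cli_overrides(["--training.learning_rate", "0.001"]))
# {'training': {'learning_rate': 0.001}}
```

The resulting dict has the same shape as a config file, so the same recursive merge can apply CLI overrides at the highest precedence (CLI > custom config > default config).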

See README.original.md for the original project documentation.

Maintenance Status

  • Not actively maintained: this is a historical preservation project
  • No new features planned: code preserved as-is from 2018-2020
  • For production use: use Hydra instead
  • For historical/academic interest: feel free to explore!

Preservation Strategy

This repository serves as a software archaeology artifact, preserved to document the independent evolution of ML configuration patterns. The preservation approach:

Git History Preservation: All commits have been preserved with their original timestamps and authorship intact. The codebase was extracted from two larger private repositories where the project evolved, and commit 45a436dbbf71c245c79966045230818387626a34 serves as the graft point to bridge these two original repositories into a unified history.

Code Integrity: The core implementation remains unchanged from its 2018-2020 state, maintaining historical accuracy. Only documentation and packaging have been updated to reflect archival/preservation status.

Acknowledgements

This codebase was developed during the course of professional work at Micropsi Industries (2018-2019) and Advertima (2019-2020). We are grateful for their explicit permission to preserve and open-source this implementation as a historical artifact demonstrating the independent evolution of ML configuration patterns.

Core Contributors:

  • Simone Cirillo - Primary implementation and design
  • Clemens Korndörfer - Loss gating mechanisms and architecture refinements
  • Mathias Winther Madsen - Testing, documentation, and data field resolution
  • Levani Tevdoradze - Bug fixes and example implementations
  • Noorvir Aulakh - Early contributions

This work emerged from the practical needs of production ML systems and the collective problem-solving of teams facing configuration management challenges in the late 2010s. The fact that similar solutions emerged independently across the industry speaks to the universality of these challenges and the convergent nature of effective solutions.

License

MIT License - See LICENSE for details.

Citation

If referencing this work in academic contexts:

@software{cirillo2018confparser,
  author = {Cirillo, Simone},
  title = {ArcheoML-Confparser: A Historical ML Configuration Parser},
  year = {2018-2026},
  url = {https://github.com/lospooky/archeoml-confparser},
  note = {Historical artifact demonstrating independent convergent evolution 
          of ML configuration patterns}
}
