Cocina is a collection of tools for building structured Python projects. It provides configuration management, job execution, and a command-line interface.
- ConfigHandler - Unified configuration management, constants, and environment variables
- ConfigArgs - Job-specific configuration loading with structured argument access
- CLI - Command-line interface for project initialization and job execution
FROM PYPI
pip install cocina
FROM CONDA
conda install -c conda-forge cocina
pixi run cocina init --log_dir logs --package your_package_name
See cocina Configuration for detailed initialization options.
Cocina separates configuration (values that can change) from constants (values that never change) and job arguments (run-specific parameters).
- ConfigHandler (ch) - Manages constants and project configuration
  - Constants: your_module/constants.py (protected from modification)
  - General Config: config/config.yaml
  - Env Config: config/<environment-name>.yaml
  - Usage: ch.DATABASE_URL, ch.get('MAX_SCALE', 1000)
- ConfigArgs (ca) - Manages job-specific run configurations
  - Job configs: config/args/job_name.yaml
  - Usage: To run method method_name: method_name(*ca.method_name.args, **ca.method_name.kwargs)
Note: names of configuration and job directories and files can be customized in .cocina.
Traditional approach:
SOURCE = "path/to/src.parquet"
OUTPUT_DEST = "path/to/output"
def main():
    data = load_data(SOURCE, limit=1000, debug=True)
    data = process_data(data, scale=100, validate=False)
    save_data(data, OUTPUT_DEST, format="json")
if __name__ == "__main__":
    main()
With Cocina:
def run(config_args):
    data = load_data(*config_args.load_data.args, **config_args.load_data.kwargs)
    data = process_data(data, *config_args.process_data.args, **config_args.process_data.kwargs)
    save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)
All parameters are now externalized to YAML configuration files, making scripts reusable and maintainable. CLI management and argument parsing are handled by the cocina CLI.
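The call pattern above can be tried without Cocina installed. The following is a minimal sketch, not Cocina's implementation: a `types.SimpleNamespace` stands in for a ConfigArgs instance, the three step functions are placeholders that only record how they were called, and the parameter values are copied from the traditional example.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a ConfigArgs instance: each step exposes .args and .kwargs.
config_args = SimpleNamespace(
    load_data=SimpleNamespace(args=["path/to/src.parquet"], kwargs={"limit": 1000, "debug": True}),
    process_data=SimpleNamespace(args=[], kwargs={"scale": 100, "validate": False}),
    save_data=SimpleNamespace(args=["path/to/output"], kwargs={"format": "json"}),
)


# Placeholder steps (assumptions for illustration) that record their inputs.
def load_data(source, limit=None, debug=False):
    return {"source": source, "limit": limit, "debug": debug}


def process_data(data, scale=1, validate=True):
    return {**data, "scale": scale, "validate": validate}


def save_data(data, dest, format="parquet"):
    return {**data, "dest": dest, "format": format}


def run(config_args):
    # Same shape as the Cocina job above: no parameter values appear in the script.
    data = load_data(*config_args.load_data.args, **config_args.load_data.kwargs)
    data = process_data(data, *config_args.process_data.args, **config_args.process_data.kwargs)
    return save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)


result = run(config_args)
```

Changing a parameter then means editing the configuration object (in a real project, the YAML file) rather than the script.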
Project Structure:
my_project/
├── my_package/                  # Python package
│   ├── constants.py             # Project constants (protected from modification)
│   ├── ...                      # Modules
│   └── data_manager.py          # Named example python module
├── config/
│   ├── config.yaml              # Main configuration
│   ├── prod.yaml                # Production configuration overrides
│   └── args/
│       └── data_pipeline.yaml   # Job configuration
└── jobs/
    └── data_pipeline.py         # Job implementation
Configuration (config/args/data_pipeline.yaml):
extract_data:
  args: ["source_table"]
  kwargs:
    limit: 1000
    debug: false
transform_data:
  scale: 100
  validate: true
save_data:
  - "output_table"
Job Implementation (jobs/data_pipeline.py):
def run(config_args, printer=None):
    data = extract_data(*config_args.extract_data.args, **config_args.extract_data.kwargs)
    data = transform_data(data, *config_args.transform_data.args, **config_args.transform_data.kwargs)
    # Pass the transformed data through to the save step
    save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)
Running Jobs:
# Default environment
pixi run cocina job data_pipeline
# Production environment
pixi run cocina job data_pipeline --env prod
When running a job, the CLI looks for one of three entry points in the job module: a run method that takes config_args: ConfigArgs and printer: Printer, a run method that takes only config_args: ConfigArgs, or a main method that takes no arguments.
Priority ordering is:
- run(config_args, printer) | passing both a ConfigArgs and Printer instance
- run(config_args) | passing a ConfigArgs instance
- main() | for jobs without configuration (legacy scripts)
Although the main focus is on building and running configured "jobs", ConfigArgs can also be used in your code (a notebook for example):
from cocina.config_handler import ConfigArgs
import jobs.job_group_1.job_a1
# Load job-specific configuration
ca = ConfigArgs('job_group_1.job_a1')
jobs.job_group_1.job_a1.step_1(*ca.step_1.args, **ca.step_1.kwargs)
The .cocina file contains project settings and must be in your project root. It defines:
- Configuration file locations and naming conventions
- Project root directory location
- Environment variable names
Required: Every project must have a .cocina file at the root.
Options for cocina init:
- --log_dir: Enable automatic log file creation
- --package: Specify main package for constants loading
- --force: Overwrite existing .cocina file
Cocina uses YAML files in the config/ directory:
config/
├── config.yaml          # Main configuration
├── dev.yaml             # Development environment overrides
├── prod.yaml            # Production environment overrides
└── args/                # Job-specific configurations
    ├── job_name.yaml    # Individual job config
    └── group_name/      # Grouped job configs
        └── job_a.yaml
Configuration Types:
- Main Config: config.yaml - shared across all environments
- Environment Config: {env}.yaml - environment-specific overrides
- Job Config: args/{job}.yaml - job-specific parameters and arguments
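To illustrate how an environment file layers on top of the main config, here is a minimal sketch using plain dictionaries in place of parsed YAML. It assumes a shallow, key-by-key override; Cocina's actual merge rules (for example, for nested keys) may differ. `layered_config` and all keys and values are hypothetical.

```python
def layered_config(main, env_overrides=None):
    """Return the main config with environment values applied on top (shallow merge)."""
    merged = dict(main)                  # copy so the main config is left untouched
    merged.update(env_overrides or {})   # environment values win on key clashes
    return merged


# Hypothetical contents of config.yaml and prod.yaml after parsing.
main_config = {"DATABASE_URL": "sqlite:///dev.db", "BATCH_SIZE": 100}
prod_overrides = {"DATABASE_URL": "postgresql://prod-host/db"}

default_cfg = layered_config(main_config)
prod_cfg = layered_config(main_config, prod_overrides)
```

Keys missing from the environment file fall through to the main config, so an environment file only needs to list what actually differs.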
ConfigHandler manages constants and the main configuration, with environment support.
from cocina.config_handler import ConfigHandler
ch = ConfigHandler()
print(ch.DATABASE_URL) # From config.yaml
print(ch.MAX_SCALE) # From constants.py (protected)
Features:
- Loads constants from your_package/constants.py
- Loads configuration from config/config.yaml
- Environment-specific overrides from config/{env}.yaml
- Dict-style and attribute access patterns
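The access patterns listed above can be sketched with a small stand-in class. This is not ConfigHandler itself: `MiniConfig` is hypothetical, and it assumes that "protected" means a configuration key may not redefine a constant, which the real library may enforce differently.

```python
class MiniConfig:
    """Hypothetical stand-in for ConfigHandler; not Cocina's implementation."""

    def __init__(self, constants, config):
        # Assumption: a config key that collides with a constant is an error.
        clash = set(constants) & set(config)
        if clash:
            raise ValueError(f"config may not redefine constants: {sorted(clash)}")
        self._values = {**config, **constants}

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails: ch.DATABASE_URL
        try:
            return self.__dict__["_values"][name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, key):
        # Dict-style access: ch["DATABASE_URL"]
        return self._values[key]

    def get(self, key, default=None):
        # Dict-style access with a fallback: ch.get("MAX_SCALE", 1000)
        return self._values.get(key, default)


ch = MiniConfig(
    constants={"MAX_SCALE": 1000},
    config={"DATABASE_URL": "sqlite:///example.db"},
)
```

Both `ch.MAX_SCALE` and `ch["MAX_SCALE"]` read the same underlying value here, while `ch.get` returns the default for keys that are absent.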
ConfigArgs loads job-specific configurations with structured argument access.
from cocina.config_handler import ConfigArgs
ca = ConfigArgs('data_pipeline')
# Access method arguments
ca.extract_data.args # ["source_table"]
ca.extract_data.kwargs # {"limit": 1000, "debug": False}
YAML Configuration Parsing:
- Dict with args/kwargs keys → extracts args and kwargs
- Dict without special keys → args=[], kwargs=dict
- List/tuple → args=value, kwargs={}
- Single value → args=[value], kwargs={}
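The four parsing rules above can be written as one small function. This is a sketch of the rules as stated, not Cocina's parser: `split_args_kwargs` is a hypothetical name, and treating a dict as "special" when it has either an args or a kwargs key is an assumption. The inputs mirror the data_pipeline.yaml example after YAML loading.

```python
def split_args_kwargs(value):
    """Normalize one parsed YAML entry into an (args, kwargs) pair."""
    if isinstance(value, dict):
        if "args" in value or "kwargs" in value:
            # Dict with args/kwargs keys: extract them, defaulting to empty.
            return list(value.get("args") or []), dict(value.get("kwargs") or {})
        # Dict without special keys: the whole dict becomes kwargs.
        return [], dict(value)
    if isinstance(value, (list, tuple)):
        # List/tuple: positional arguments only.
        return list(value), {}
    # Single value: one positional argument.
    return [value], {}


extract = split_args_kwargs({"args": ["source_table"], "kwargs": {"limit": 1000, "debug": False}})
transform = split_args_kwargs({"scale": 100, "validate": True})
save = split_args_kwargs(["output_table"])
single = split_args_kwargs(42)
```

Each result can be unpacked directly into a call, as in `func(*args, **kwargs)`.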
Features:
- Environment-specific overrides
- Reference resolution from main config
- Dynamic value substitution
pixi run cocina init --log_dir logs --package your_package
# Run a single job
pixi run cocina job data_pipeline
# Run with specific environment
pixi run cocina job data_pipeline --env prod
# Run multiple jobs
pixi run cocina job job1 job2 job3
# Dry run (validate without executing)
pixi run cocina job data_pipeline --dry_run
Options:
- --env: Environment configuration to use (dev, prod, etc.)
- --verbose: Enable detailed output
- --dry_run: Validate configuration without running
Printer produces timestamped output with headers and optional file logging. It is a singleton class that automatically initializes when first accessed.
from cocina.printer import Printer
printer = Printer(log_dir='logs', basename='MyApp')
printer.message('Status update', count=42, status='ok')
printer.stop('Complete')
Timer provides simple timing with duration tracking.
from cocina.utils import Timer
timer = Timer()
timer.start() # Start timing
print(timer.state()) # Current elapsed time
print(timer.now()) # Current timestamp
stop_time = timer.stop() # Stop timing
print(timer.delta()) # Total duration string
See complete documentation for all utility functions and helpers.
Requirements: Managed with Pixi - no manual environment setup needed.
# All commands use pixi
pixi run jupyter lab
Style: Follows PEP8 standards. See setup.cfg for project-specific rules.
- Getting Started - Installation, initialization, and first job
- Configuration Guide - Complete configuration management
- Job System - Creating and running jobs
- CLI Reference - Command-line interface
- Examples - Detailed usage examples
- Advanced Topics - Complex patterns and extensions