Merged
Commits
30 commits
e9acb72
docs: add design philosophy guide
zengarden Mar 30, 2026
9d5863e
docs: clarify project philosophy in readme
zengarden Mar 30, 2026
2ae4ea8
docs: align homepage with project philosophy
zengarden Mar 30, 2026
e5a1c23
docs: add phase 1 minimal helpers plan
zengarden Mar 30, 2026
62d6525
docs: refine cfg-driven design philosophy
zengarden Mar 30, 2026
9305d1c
docs: add phase 1 file-by-file plan
zengarden Mar 30, 2026
033bd67
docs: add phase 1 api draft
zengarden Mar 30, 2026
0b89c71
feat: add cfg-driven checkpoint and artifact helpers
zengarden Mar 30, 2026
07859a1
feat: add checkpoint-aware mnist example flows
zengarden Mar 30, 2026
567e142
test: cover mnist validation run flow
zengarden Mar 30, 2026
09bcfc5
fix: load model state from checkpoint files
zengarden Mar 30, 2026
17d3a3e
refactor: simplify validation flow and checkpoint imports
zengarden Mar 30, 2026
b311181
refactor: inline mnist training resume state
zengarden Mar 30, 2026
a17bc69
refactor: remove ensure_run_dir helper
zengarden Mar 30, 2026
182d40e
fix: harden checkpoint format handling
zengarden Mar 30, 2026
b3085ed
fix: validate required checkpoint state
zengarden Mar 30, 2026
662744e
refactor: simplify checkpoint error types
zengarden Mar 30, 2026
c109997
fix: validate checkpoint format version
zengarden Mar 30, 2026
0953f82
refactor: remove checkpoint error helpers
zengarden Mar 30, 2026
7738d5b
refactor: remove checkpoint format version
zengarden Mar 30, 2026
bae724e
fix: register missing hydra env config
zengarden Mar 31, 2026
186a62c
fix: backfill missing hydra env config safely
zengarden Mar 31, 2026
15027ee
revert: drop hydra env backfill workaround
zengarden Mar 31, 2026
d29d07e
docs: sync phase 1 plans with current helpers
zengarden Mar 31, 2026
3268422
docs: remove unimplemented metric logging plans
zengarden Mar 31, 2026
5c6bc5f
refactor: remove config dump helper
zengarden Mar 31, 2026
18e2cfc
docs: strengthen framework design guidance
zengarden Mar 31, 2026
63d5886
feat: add resnet val checkpoint flow
zengarden Mar 31, 2026
e110ce5
feat: add resnet train checkpoint resume flow
zengarden Mar 31, 2026
16ce0a2
refactor: keep latest and best resnet checkpoints
zengarden Mar 31, 2026
18 changes: 18 additions & 0 deletions README.md
@@ -28,6 +28,24 @@ TinyExp focuses on simple, maintainable experiment management:
- Your config stays structured and easy to override.
- Your execution path stays consistent as experiments grow.

## Design Philosophy

TinyExp is intentionally light.

It is not trying to be a heavy trainer framework that owns your epoch loop, callback system, or full runtime
lifecycle. Instead, it focuses on a smaller goal:

- keep the experiment itself as the main entrypoint
- keep the training loop in user space
- make configuration and launch behavior explicit
- expose shared capabilities through focused `XXXCfg` components
- provide thin helpers rather than framework-owned control flow
- treat examples as reusable recipes, not just demos

In short, TinyExp should help you write less experiment plumbing, not less experiment logic.

For a longer explanation, see [`docs/philosophy.md`](docs/philosophy.md).

## Quick Start (1 Minute)

### Option A: Install with pip and use import-based entrypoint
38 changes: 34 additions & 4 deletions docs/index.md
@@ -2,10 +2,40 @@

A minimalist Python project for deep learning experiment management.

TinyExp lets you launch experiments with one click: the file you edit becomes the entrypoint to your experiment.
TinyExp keeps one idea at the center:
your configured experiment is your entrypoint.

# Features
Instead of splitting config, launcher, and execution across many files, TinyExp keeps them together in one experiment
definition so iteration stays fast and predictable.

## What You Get

- Experiment-centered configuration with Hydra/OmegaConf
- CLI overrides without rewriting code
- Training loops that stay close to plain PyTorch
- The same experiment definition from local debug to distributed launch

## Design Philosophy

TinyExp is intentionally light.

It is not trying to be a heavy trainer framework that owns your epoch loop, callback system, or full runtime
lifecycle. Instead, it focuses on a smaller and more explicit goal:

- keep the experiment itself as the main entrypoint
- keep the training loop in user space
- make configuration and launch behavior explicit
- expose shared capabilities through focused `XXXCfg` components
- provide thin helpers instead of framework-owned control flow
- treat examples as reusable recipes, not just demos

In short, TinyExp should help you write less experiment plumbing, not less experiment logic.

For the longer version, see [Design Philosophy](philosophy.md).

## Features

- 🚀 One-click experiment launch: The file you edit becomes the entrypoint to your experiment.
- 📊 Experiment tracking: Track your experiments with W&B.
- 🔄 Experiment management: Manage your experiments configuration with Hydra.
- 🔄 Config-driven experiment management with Hydra.
- 🧩 Thin helpers without taking over your training loop.
- 🧪 Examples that can serve as reusable experiment recipes.
276 changes: 276 additions & 0 deletions docs/phase1-api-draft.md
@@ -0,0 +1,276 @@
# Phase 1: API Draft

This document proposes a concrete API shape for the first implementation slice.

It should be read together with:

- [Design Philosophy](philosophy.md)
- [Phase 1: Minimal Helpers Plan](phase1-minimal-helpers.md)
- [Phase 1: File-by-File Implementation Plan](phase1-file-by-file-plan.md)

This is still a draft. The goal is to make the intended shape explicit before implementation expands further.

## Drafting Principles

The Phase 1 API should follow these rules:

- keep `TinyExp` small
- keep training and evaluation loops in examples
- expose shared capabilities through focused `XXXCfg` components
- make configuration override-friendly through Hydra
- keep execution explicit through method calls
- avoid introducing trainer-like control flow

## `TinyExp` Draft Surface

Phase 1 keeps the root experiment object intentionally small.

### Fields

Recommended experiment-level fields:

- `mode: str = "train"`
- `resume_from: str = ""`
- `output_root: str = "./output"`
- `exp_name: str = ...`

These fields describe the experiment as a whole rather than one isolated feature subsystem.
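Pulling the field list together, a rough sketch of the experiment-level surface could look like the following (a dataclass-style illustration, not the actual implementation; only the field names and defaults come from this draft):

```python
from dataclasses import dataclass

# Hypothetical sketch of the Phase 1 experiment-level fields.
# The real class would also compose logger_cfg / wandb_cfg / checkpoint_cfg.
@dataclass
class TinyExp:
    exp_name: str              # required: names the run directory
    mode: str = "train"        # e.g. "train" or "val"
    resume_from: str = ""      # empty string means "start fresh"
    output_root: str = "./output"
```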

### Composed config components

The experiment object should continue to expose capability-specific config components, such as:

- `logger_cfg`
- `wandb_cfg`
- `checkpoint_cfg`

Other feature configs may be added later only if they earn their place.

### Methods

Recommended Phase 1 methods on `TinyExp`:

```python
def get_run_dir(self) -> str:
    ...
```

This belongs on `TinyExp` because it is experiment-scoped, not feature-scoped.

## `CheckpointCfg` Draft

Checkpointing is the main new shared capability in Phase 1.

It should follow the same cfg-driven pattern as logger and W&B integration:

- fields define behavior and defaults
- methods perform explicit actions only when called

### Draft fields

```python
@dataclass
class CheckpointCfg:
    last_ckpt_name: str = "last.ckpt"
    best_ckpt_name: str = "best.ckpt"
```

Phase 1 should keep this deliberately small.

### Draft methods

```python
def save_checkpoint(
    self,
    *,
    run_dir: str,
    name: str,
    model=None,
    optimizer=None,
    scheduler=None,
    epoch: int | None = None,
    global_step: int | None = None,
    best_metric: float | None = None,
    extra_state: dict | None = None,
) -> str:
    ...

def load_checkpoint(
    self,
    path: str,
    *,
    model=None,
    optimizer=None,
    scheduler=None,
    strict: bool = True,
    map_location=None,
) -> dict:
    ...
```

### Responsibilities

`CheckpointCfg` should:

- define default checkpoint filenames
- save a standard checkpoint structure
- load a standard checkpoint structure
- optionally restore state into provided model / optimizer / scheduler objects

### Non-responsibilities

`CheckpointCfg` should not decide:

- when to save
- whether to save best checkpoints
- which metric is considered best
- whether resume is automatic
- how many checkpoints to retain

Those remain example-level or user-level decisions.
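To make that split concrete, a save cadence is one such user-level decision; a trivial sketch of how example code might express it (the name `should_save` and the two-epoch cadence are assumptions, not part of the draft API):

```python
# Hypothetical example-level policy: the example, not CheckpointCfg,
# decides how often to save. The helper only executes the save.
def should_save(epoch: int, save_every: int = 2) -> bool:
    """Return True on every `save_every`-th completed epoch (0-indexed)."""
    return (epoch + 1) % save_every == 0
```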

## Draft Checkpoint Format

The checkpoint format should be simple and explicit.

Recommended structure:

```python
{
    "model_state_dict": ...,
    "optimizer_state_dict": ...,
    "scheduler_state_dict": ...,
    "epoch": ...,
    "global_step": ...,
    "best_metric": ...,
    "meta": {
        "exp_name": ...,
        "exp_class": ...,
        "saved_at": ...,
    },
    ...
}
```

Notes:

- `optimizer_state_dict` is optional
- `scheduler_state_dict` is optional
- `meta` should stay lightweight
- `extra_state` can extend the structure without forcing premature abstraction
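Assembling and checking this structure is mechanical; a framework-free sketch follows (function and key names beyond the format above are assumptions, and actual persistence would go through `torch.save`/`torch.load`). The validation half mirrors the "validate required checkpoint state" commit in this PR:

```python
# Assumed minimal contract: which keys a checkpoint must always carry.
REQUIRED_KEYS = {"model_state_dict", "meta"}

def build_checkpoint_payload(model_state, *, optimizer_state=None,
                             epoch=None, global_step=None,
                             best_metric=None, exp_name="",
                             extra_state=None):
    """Assemble the draft checkpoint structure; optional parts are omitted."""
    payload = {
        "model_state_dict": model_state,
        "epoch": epoch,
        "global_step": global_step,
        "best_metric": best_metric,
        "meta": {"exp_name": exp_name},
    }
    if optimizer_state is not None:
        payload["optimizer_state_dict"] = optimizer_state
    if extra_state:
        payload.update(extra_state)  # extension point, no forced schema
    return payload

def validate_checkpoint_payload(payload: dict) -> None:
    """Fail fast when required state is missing instead of erroring mid-load."""
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"checkpoint missing required state: {sorted(missing)}")
```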

## Run Directory Draft

Run directory behavior should remain simple in Phase 1.

### Draft methods

```python
def get_run_dir(self) -> str:
    ...
```

### Expected behavior

- `get_run_dir()` returns `os.path.join(self.output_root, self.exp_name)`
- methods that write files should create parent directories when needed

Phase 1 should not add timestamped run folders, version managers, or heavier run registry behavior.

## Example Usage Draft

Below is the intended style for examples after Phase 1.

### Logger setup

```python
run_dir = self.get_run_dir()
logger = self.logger_cfg.build_logger(
    save_dir=run_dir,
    distributed_rank=accelerator.rank,
)
```

### Explicit W&B usage

```python
if self.wandb_cfg.enable_wandb:
    wandb = self.wandb_cfg.build_wandb(
        accelerator=accelerator,
        project="Baselines",
        config=cfg_dict,
    )
```

### Explicit checkpoint load

```python
resume_state = None
if self.resume_from:
    resume_state = self.checkpoint_cfg.load_checkpoint(
        self.resume_from,
        model=model,
        optimizer=optimizer,
        scheduler=scheduler,
        map_location=accelerator.device,
    )
```

### Explicit checkpoint save

```python
self.checkpoint_cfg.save_checkpoint(
    run_dir=run_dir,
    name=self.checkpoint_cfg.last_ckpt_name,
    model=accelerator.unwrap_model(model),
    optimizer=optimizer,
    scheduler=scheduler,
    epoch=epoch,
    global_step=global_step,
    best_metric=best_metric,
)
```

### Explicit best checkpoint policy in example code

```python
if best_metric is None or val_metric > best_metric:
    best_metric = val_metric
    self.checkpoint_cfg.save_checkpoint(
        run_dir=run_dir,
        name=self.checkpoint_cfg.best_ckpt_name,
        model=accelerator.unwrap_model(model),
        optimizer=optimizer,
        scheduler=scheduler,
        epoch=epoch,
        global_step=global_step,
        best_metric=best_metric,
    )
```

This is the intended balance:

- framework provides the capability
- example decides when to use it

## API Choices Deferred Beyond Phase 1

The following questions should stay open until the first implementation slice proves itself:

- whether metrics deserve their own `MetricCfg`
- whether `CheckpointCfg` should move into its own module
- whether shared recipe base classes are worth introducing

These should not be over-designed before the first slice is working.

## Success Criteria

The Phase 1 API draft is successful if it leads to implementation that feels:

- small
- explicit
- override-friendly
- consistent with the existing `XXXCfg` style
- still close to plain PyTorch

That is the standard Phase 1 should be judged against.