260 changes: 260 additions & 0 deletions backend/docs/174-zppy-links/plan.md
@@ -0,0 +1,260 @@
# Plan: Connect zppy Diagnostics to SimBoard Simulations

## Goal

Replace manual diagnostics URL entry with automated linking from zppy diagnostics outputs to existing SimBoard simulation records.

MVP is NERSC-only.

## Scope

### In

- Add required zppy provenance fields: `case_name`, `machine`, `hpc_username`
- Add required diagnostics URLs in zppy provenance
- Require standardized zppy diagnostics output locations for NERSC production runs
- Discover zppy diagnostics provenance files from configured NERSC production filesystem roots
- Confirm diagnostics completion from index page plus status files
- Match diagnostics to SimBoard records using `(case_name, machine, hpc_username)`
- Create idempotent case-scoped diagnostic links
- Maintain scanner state to avoid repeated processing

### Out

- Frontend redesign
- Changes to manual external-link workflows
- Case identity or uniqueness refactor
- Diagnostics content ingestion or indexing
- Public HTML directory scraping
- Historical backfill beyond configured provenance roots
- Non-NERSC deployments

## Core Decisions

### Match diagnostics at case scope

zppy runs against a full case output tree, not a single execution/LID. Use case identity as the primary join key:

```text
(case_name, machine, hpc_username)
```

All three fields are required. `case_name` alone is not safe as a global key because different users or machines can reuse the same name, and `CASE_HASH` is not reliable across executions.

### Do not parse public HTML directories

Avoid public directory scraping. It is fragile, web-server-coupled, slow, and expands the SSRF/content-injection attack surface.

### Use zppy provenance cfg as the primary input

SimBoard discovers zppy provenance files from configured NERSC filesystem roots. Newer zppy runs already emit provenance cfg files under diagnostics output paths, for example:

```text
post/scripts/provenance.20260303_230804_991619.cfg
```

Reference examples:

- https://github.com/E3SM-Project/zppy/blob/main/examples/post.v3.LR.historical.zppy_v3.cfg
- https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/zppy_example/v3.2.0/v3.LR.historical_0051/provenance.20260303_230804_991619.cfg

Current cfg examples expose useful fields:

- `case`: case name
- `input`: case run directory
- `output`: diagnostics filesystem root
- `www`: public diagnostics root
- `campaign`: optional campaign metadata

However, the current cfg is not yet an authoritative join source because it may lack:

- `machine`
- canonical simulation owner
- unambiguous `hpc_username`

Path-derived usernames are unsafe. Example ambiguity:

```text
input path owner: ac.wlin
output path owner: ac.zhang40
```

Therefore, zppy must enrich provenance cfg with required case identity copied from `<input>/case_scripts/env_case.xml`:

| XML field | Provenance field |
| ---------- | ---------------- |
| `CASE` | `case_name` |
| `MACH` | `machine` |
| `REALUSER` | `hpc_username` |

If any required field is missing, SimBoard skips the provenance file and logs it as invalid for linking.
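The skip-and-log rule can be sketched as a small validation step on the SimBoard side. This is a minimal sketch, assuming the cfg has already been parsed into a flat mapping of top-level fields; `JoinKey`, `extract_join_key`, and the logger name are hypothetical names used only for illustration.

```python
import logging
from typing import Mapping, NamedTuple, Optional

logger = logging.getLogger("diagnostics_link_scanner")  # hypothetical logger name

# Required join-key fields, as named in this plan.
REQUIRED_FIELDS = ("case_name", "machine", "hpc_username")


class JoinKey(NamedTuple):
    case_name: str
    machine: str
    hpc_username: str


def extract_join_key(fields: Mapping[str, str], source: str) -> Optional[JoinKey]:
    """Return the join key, or None (after logging) if any field is missing or blank."""
    missing = [name for name in REQUIRED_FIELDS if not (fields.get(name) or "").strip()]
    if missing:
        logger.warning("%s is invalid for linking; missing %s", source, ", ".join(missing))
        return None
    return JoinKey(*(fields[name].strip() for name in REQUIRED_FIELDS))
```

Returning `None` rather than raising keeps one bad provenance file from aborting a scan of the remaining files.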

For MVP, zppy should reuse existing top-level cfg fields rather than emit a new versioned normalized block.

### Require standardized output locations for production runs

For MVP, NERSC production runs must use standardized zppy diagnostics output locations. SimBoard relies on those known production roots for provenance discovery.

Custom or ad hoc layouts remain compatible with the overall design, but MVP does not need to support them.

### Require explicit diagnostics URLs in provenance

For MVP, SimBoard should not derive diagnostics URLs from path conventions. zppy should emit explicit diagnostics URLs in provenance cfg.

### Use index page plus status files as completion signal

Treat diagnostics as complete only when the expected index page and zppy status files are present.
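A minimal sketch of this completion check follows. It assumes status files are named `*.status` and contain the text `OK` when a task has finished successfully; both the naming and the content convention are assumptions to confirm against the zppy version in use, and `diagnostics_complete` is a hypothetical helper name.

```python
from pathlib import Path


def diagnostics_complete(index_page: Path, status_dir: Path) -> bool:
    """True only if the index page exists and every status file reports OK."""
    if not index_page.is_file():
        return False
    # Assumption: one "*.status" file per zppy task, containing "OK" on success.
    status_files = sorted(status_dir.glob("*.status"))
    if not status_files:
        # No status files at all: treat the run as not yet complete.
        return False
    return all(path.read_text().strip() == "OK" for path in status_files)
```

Requiring both signals avoids linking a run whose index page was written before all tasks finished.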

### Persist links, do not resolve at query time

Create database rows when diagnostics are discovered. Frontend queries should not crawl filesystems or remote URLs.

Diagnostic links are case-scoped. For MVP, store them on `Case` by adding `case_id` to `ExternalLink`. Keep the existing manual-link rendering path where possible by surfacing case-scoped diagnostic links alongside current links.

## Implementation

Implement in order: provenance contract -> scanner -> storage target -> resolver/API -> frontend verification.

### zppy

#### 1. Emit required provenance fields

For MVP, production runs must write diagnostics outputs and provenance cfg files to the standardized NERSC zppy output locations. Each provenance cfg must carry the following identity fields:

| Field | Source |
| -------------- | ------------------------- |
| `case_name` | `env_case.xml` `CASE` |
| `machine` | `env_case.xml` `MACH` |
| `hpc_username` | `env_case.xml` `REALUSER` |
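The mapping in the table above could be read with the standard library XML parser. This is a sketch, assuming `env_case.xml` stores each value as an `<entry id="NAME" value="...">` element; `read_case_identity` and `XML_TO_PROVENANCE` are hypothetical names, not existing zppy functions.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# XML entry id -> provenance field, per the table above.
XML_TO_PROVENANCE = {"CASE": "case_name", "MACH": "machine", "REALUSER": "hpc_username"}


def read_case_identity(env_case_xml: str) -> Dict[str, str]:
    """Extract identity fields from env_case.xml; raise ValueError if any is missing."""
    root = ET.parse(env_case_xml).getroot()
    found: Dict[str, str] = {}
    # Assumption: values are stored as <entry id="NAME" value="..."> elements.
    for entry in root.iter("entry"):
        field = XML_TO_PROVENANCE.get(entry.get("id", ""))
        value = (entry.get("value") or "").strip()
        if field and value:
            found[field] = value
    missing = sorted(set(XML_TO_PROVENANCE.values()) - set(found))
    if missing:
        raise ValueError(f"{env_case_xml}: missing {', '.join(missing)}")
    return found
```

A missing file surfaces as the `OSError` raised by `ET.parse`, so the caller can decide separately how to handle an absent `env_case.xml` versus an incomplete one.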

Implementation note:

- For NERSC MVP, zppy can construct explicit diagnostics URLs from cfg `www` plus `mache` machine metadata.
- `mache.MachineInfo` exposes helpers such as `web_portal_base`, `web_portal_url`, and `username`.
- Reference: https://docs.e3sm.org/mache/main/developers_guide/generated/mache.MachineInfo.html
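One way to build the explicit URL is to rebase the cfg `www` path from the portal's filesystem root onto its URL root. The sketch below takes those two roots as plain strings (for example, values read from `web_portal_base` and `web_portal_url`) and does not call `mache` itself; `diagnostics_url` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def diagnostics_url(www: str, portal_base: str, portal_url: str) -> str:
    """Map a filesystem path under portal_base to the matching URL under portal_url."""
    # relative_to raises ValueError if www is not located under portal_base.
    relative = PurePosixPath(www).relative_to(PurePosixPath(portal_base))
    return portal_url.rstrip("/") + "/" + relative.as_posix()
```

With illustrative values, `diagnostics_url("/www/root/alice/diags", "/www/root", "https://portal.example.org/e3sm/")` yields `https://portal.example.org/e3sm/alice/diags`. A `www` path outside the portal root raises `ValueError`, which is preferable to emitting a URL that cannot resolve.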

Tests:

- uses standardized NERSC production output locations
- emits `case_name`, `machine`, `hpc_username`
- emits explicit diagnostics URLs
- can construct explicit diagnostics URLs from cfg `www` plus `mache` machine metadata
- parses values from `env_case.xml`
- handles missing `env_case.xml`
- preserves existing provenance behavior

### SimBoard

#### 1. Add diagnostics scanner

Add `diagnostics_link_scanner.py`.

Responsibilities:

- scan configured NERSC production diagnostics roots for `provenance*.cfg`
- dedup with state file
- verify diagnostics completion from index page plus status files
- parse `case_name`, `machine`, `hpc_username`
- parse explicit diagnostics URLs
- call internal API with service-account auth
- skip and log if full join key is unavailable
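The discovery and dedup responsibilities above might look like the following sketch. It assumes the state file is a JSON object mapping each cfg path to the modification time at which it was last processed; that format and all function names are illustrative choices, not a settled design.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def load_state(state_file: Path) -> Dict[str, float]:
    """Return {cfg path: mtime at last processing}; empty if no state exists yet."""
    if not state_file.is_file():
        return {}
    return json.loads(state_file.read_text())


def save_state(state_file: Path, state: Dict[str, float]) -> None:
    """Persist the scanner state as JSON."""
    state_file.write_text(json.dumps(state, indent=2, sort_keys=True))


def discover_new_cfgs(roots: Iterable[Path], state: Dict[str, float]) -> Iterator[Path]:
    """Yield provenance cfgs that are new or modified since they were last processed."""
    for root in roots:
        for cfg in sorted(root.rglob("provenance*.cfg")):
            if state.get(str(cfg)) != cfg.stat().st_mtime:
                yield cfg


def mark_processed(state: Dict[str, float], cfg: Path) -> None:
    """Record a cfg as handled; call this only after the link request succeeded."""
    state[str(cfg)] = cfg.stat().st_mtime
```

Marking a cfg as processed only after a successful API call means incomplete diagnostics and transient failures are retried on the next scan.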

Tests:

- discovers cfgs
- parses required cfg identity
- handles malformed cfgs
- skips missing identity
- checks index-plus-status completion marker
- dedups state
- handles duplicate links idempotently

#### 2. Resolve link storage

Add `DiagnosticsLinkRequest` in `backend/app/features/simulation/schemas.py`.

For MVP, add `case_id` to `ExternalLink` and store diagnostic links at case scope.
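Idempotent case-scoped storage can be illustrated with a uniqueness constraint plus an upsert. The sketch below uses an in-memory SQLite table purely for illustration; the table layout, the `(case_id, kind, url)` uniqueness key, and `upsert_link` are assumptions, and the real change would go through SimBoard's own models and migrations.

```python
import sqlite3

# Hypothetical, simplified stand-in for the ExternalLink table.
SCHEMA = """
CREATE TABLE external_link (
    id INTEGER PRIMARY KEY,
    case_id INTEGER NOT NULL,
    kind TEXT NOT NULL,
    name TEXT NOT NULL,
    url TEXT NOT NULL,
    UNIQUE (case_id, kind, url)
);
"""


def upsert_link(
    conn: sqlite3.Connection, case_id: int, name: str, url: str, kind: str = "diagnostic"
) -> None:
    """Insert a case-scoped link, or refresh its name if the same link already exists."""
    conn.execute(
        "INSERT INTO external_link (case_id, kind, name, url) VALUES (?, ?, ?, ?) "
        "ON CONFLICT (case_id, kind, url) DO UPDATE SET name = excluded.name",
        (case_id, kind, name, url),
    )
    conn.commit()
```

Enforcing uniqueness in the database keeps repeated scanner submissions from creating duplicate rows even if two requests race.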

#### 3. Add matching resolver

| Input | Match |
| -------------- | ----------------------- |
| `case_name` | `Case.name` |
| `machine` | joined case simulations |
| `hpc_username` | joined case simulations |

Outcomes:

- 1 case match: create/update case-scoped links
- 0 matches: `404`
- multiple matches: `409`
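The three outcomes can be sketched over in-memory rows. `SimulationRow` is a hypothetical flattened view of a case joined to its simulations, and the `200` success status is an assumption, since the plan only fixes the `404` and `409` cases.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class SimulationRow(NamedTuple):
    """Hypothetical flattened view of a case joined to one of its simulations."""

    case_id: int
    case_name: str
    machine: str
    hpc_username: str


def resolve_case(
    rows: Iterable[SimulationRow], case_name: str, machine: str, hpc_username: str
) -> Tuple[int, Optional[int]]:
    """Return (HTTP status, case id); the id is None unless exactly one case matches."""
    wanted = (case_name, machine, hpc_username)
    case_ids = {
        row.case_id
        for row in rows
        if (row.case_name, row.machine, row.hpc_username) == wanted
    }
    if not case_ids:
        return 404, None
    if len(case_ids) > 1:
        return 409, None
    return 200, next(iter(case_ids))  # 200 is an assumed success status
```

Collecting distinct case ids means several simulations under the same case still count as a single match.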

Tests:

- matching triple creates links
- same case/machine under different user does not cross-link
- no match returns `404`
- ambiguous match returns `409`

#### 4. Add internal API endpoint

Endpoint: `POST /api/v1/diagnostics/link`

Roles: `ADMIN`, `SERVICE_ACCOUNT`

Request:

| Field | Required |
| -------------- | -------- |
| `case_name` | yes |
| `machine` | yes |
| `hpc_username` | yes |
| `diagnostics` | yes |

Diagnostics item:

| Field | Required |
| ------------------- | -------- |
| `name` | yes |
| `url` | yes |
| `kind`              | yes (must be `diagnostic`) |
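The request shape above can be sketched with standard-library dataclasses to stay dependency-free; the real `DiagnosticsLinkRequest` would presumably follow whatever schema style `schemas.py` already uses. The non-empty list rule and the http/https check are assumptions, and `parse_request` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Dict, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class DiagnosticItem:
    name: str
    url: str
    kind: str = "diagnostic"


@dataclass(frozen=True)
class DiagnosticsLinkRequest:
    case_name: str
    machine: str
    hpc_username: str
    diagnostics: List[DiagnosticItem]


def parse_request(payload: Dict[str, Any]) -> DiagnosticsLinkRequest:
    """Validate a raw payload; a ValueError here would map to a 422 response."""
    identity = {}
    for field in ("case_name", "machine", "hpc_username"):
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field} is required")
        identity[field] = value.strip()
    raw_items = payload.get("diagnostics")
    # Assumption: at least one diagnostics item must be supplied.
    if not isinstance(raw_items, list) or not raw_items:
        raise ValueError("diagnostics must be a non-empty list")
    items = []
    for raw in raw_items:
        if not isinstance(raw, dict):
            raise ValueError("each diagnostics item must be an object")
        name, url = raw.get("name"), raw.get("url")
        if not isinstance(name, str) or not name.strip():
            raise ValueError("diagnostics item name is required")
        # Assumption: only http(s) URLs are accepted.
        if not isinstance(url, str) or urlparse(url).scheme not in ("http", "https"):
            raise ValueError("diagnostics item url must be an http(s) URL")
        if raw.get("kind") != "diagnostic":
            raise ValueError("diagnostics item kind must be 'diagnostic'")
        items.append(DiagnosticItem(name=name.strip(), url=url))
    return DiagnosticsLinkRequest(diagnostics=items, **identity)
```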

Tests:

- duplicate request is idempotent
- invalid payload returns `422`
- auth required

#### 5. Keep frontend unchanged

Existing external-link rendering should display diagnostic links once rows exist.

## Fallbacks

### Curated backfill

Allow convention-based URL derivation only for controlled campaigns. Do not use it as the primary MVP path.

## Validation command

```bash
make backend-test && make pre-commit-run
```

## Risks

- **Case-scoped link migration**: diagnostics are case-scoped, but `ExternalLink` currently points at `simulation_id`.
  Mitigation: add `case_id` for MVP and keep migration/API behavior narrow.
- **Missing identity**: SimBoard cannot link a provenance file without `case_name`, `machine`, and `hpc_username`.
  Mitigation: require zppy provenance enrichment; skip and log invalid files.
- **NERSC deployment variability**: zppy roots and public URL prefixes may still vary by campaign or user layout within NERSC.
  Mitigation: use env-configured NERSC scanner roots and NERSC public-prefix mappings.
- **Provenance drift**: cfg layout and required-field coverage may vary across zppy versions.
  Mitigation: add parser tests, schema/version detection, and a documented support window.

## Remaining Open Questions

1. **NERSC deployment scope:** Which NERSC scanner roots and public URL prefixes are supported in MVP?
2. **Retroactive linking:** Does MVP include historical backfill, or only provenance files with the required join key?
3. **Case identity hardening:** Is `(case_name, machine, hpc_username)` sufficient until issue #136 is resolved?