AA222_Final_Project

Final project for AA222/CS361 - Engineering Design Optimization (Best Paper Award, Spring 2024). This project develops a probabilistic optimization workflow for future electron-positron colliders, using Gaussian-process surrogate models of beam-beam simulations to maximize luminosity under physical constraints.

Project Highlights

Physics-informed Gaussian-process surrogate for modeling the beam-beam enhancement factor $H_D$, replacing expensive Particle-In-Cell simulations during optimization.
Combination of constrained optimization and Pareto-frontier analysis to highlight the tradeoff between collider luminosity and beamstrahlung background.
Demonstrated gains across several different future collider options, showing that the approach generalizes beyond a single operating point.

These results translate machine-learning-based optimization into direct scientific impact: higher luminosity means more collision data, stronger precision reach, and potentially shorter run times for future collider programs.

Report: AA222 Final Project Paper

Repository Overview

The code runs in stages: parse the input data, fit the GP models, and run the optimization tasks. Each stage writes an artifact so results can be resumed, audited, or reused without re-running the full pipeline. The GP stage includes kernel-family comparison, replicate-aware noise modeling, and optional variance calibration.

functions/ gathers reusable logic: parse_data_functions.jl parses .ref exports and aggregates replicates, GP_abs_functions.jl builds the mean and noise GPs, GP_evaluation.jl emits diagnostics, minimize_abs_functions*.jl hold single- and multi-objective optimizers, and workflow_utils.jl manages CLI prompts plus artifact storage.
scripts/ contains workflow entry points: Part 1 (parse_abs_files.jl), Part 2 (fit_gp_model.jl), Part 3 (run_minimization.jl), the interactive orchestrator (run_workflow.jl), model-artifact comparison (compare_model_artifacts.jl), sampling utilities (random_sampling_v2.jl, random_sampling_v3.jl), and plotting helpers (make_pareto_plot.jl).
artifacts/ stores results by stage: data/ (aggregated observations), models/ (fitted GP state and diagnostics), optimizations/ (JuMP results, residual plots, summaries), and runs/ (workflow session history). Every artifact folder holds a metadata.toml, a serialized payload (*.jls), and optional plots.
Inputs reside in files_to_parse/ (raw simulator dumps), sampling_files/ (generated design sets), inputs/ (curated tables such as Pareto front data), and plots/ (ad-hoc figures that are not tied to an artifact).
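
As an illustration of the artifact layout described above (a metadata.toml next to a serialized *.jls payload), here is a minimal sketch using only the Julia standard library. The helper names `save_artifact` and `load_artifact`, the payload file name, and the metadata keys are hypothetical; the real logic lives in workflow_utils.jl and may differ.

```julia
using Dates, Serialization, TOML

# Hypothetical helper: write one artifact folder containing metadata.toml and a *.jls payload.
function save_artifact(root::AbstractString, stage::AbstractString, payload;
                       tag::AbstractString="E125")
    id = Dates.format(now(), "yyyymmdd_HHMMSS") * "_" * tag   # e.g. 20260120_220447_E125
    dir = joinpath(root, stage, id)
    mkpath(dir)
    serialize(joinpath(dir, "payload.jls"), payload)
    metadata = Dict("stage" => stage, "id" => id, "created" => string(now()))
    open(joinpath(dir, "metadata.toml"), "w") do io
        TOML.print(io, metadata)
    end
    return dir
end

# Hypothetical helper: read the metadata and payload back from an artifact folder.
function load_artifact(dir::AbstractString)
    metadata = TOML.parsefile(joinpath(dir, "metadata.toml"))
    payload = deserialize(joinpath(dir, "payload.jls"))
    return metadata, payload
end

root = mktempdir()   # stand-in for artifacts/
dir = save_artifact(root, "data", Dict("x" => [1.0, 2.0], "y" => [0.5, 0.7]))
metadata, payload = load_artifact(dir)
println(metadata["stage"], " -> ", payload["y"])
```

Because the payload uses Julia's `Serialization` format, it should be read back with the same Julia version that wrote it.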

Environment Setup

Use Julia 1.9 or newer.
Instantiate the project environment from the repo root:
```
julia --project=. -e 'using Pkg; Pkg.instantiate()'
```
Ensure Ipopt is available (e.g. julia --project=. -e 'using Pkg; Pkg.add("Ipopt_jll")'), or point JuMP at a system installation.
Set GENERATE_PLOTS=false before running scripts if you want to skip heavy plotting while iterating quickly.
Runtime overrides for GP fitting are controlled via environment variables (see Part 2 for details): RANDOM_SEARCH_SAMPLES, CV_K, COMPARE_KERNELS, KERNEL_FAMILIES, CALIBRATE_VARIANCE.
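
The environment-variable overrides above could be read along these lines. This is only a sketch: the helper names are hypothetical, the defaults are taken from the Part 2 flag list, and clamping CV_K to its documented minimum of 2 (rather than raising an error) is an assumption.

```julia
# Hypothetical helpers for reading runtime overrides from ENV.
env_int(name, default) = parse(Int, get(ENV, name, string(default)))
env_bool(name, default) = lowercase(strip(get(ENV, name, string(default)))) == "true"
env_list(name, default) = String.(strip.(split(get(ENV, name, join(default, ",")), ",")))

function gp_fit_settings()
    return (
        random_search_samples = env_int("RANDOM_SEARCH_SAMPLES", 128),
        cv_k = max(2, env_int("CV_K", 5)),   # assumption: clamp to the minimum of 2
        compare_kernels = env_bool("COMPARE_KERNELS", true),
        kernel_families = env_list("KERNEL_FAMILIES",
                                   ["se_ard", "matern52_ard", "se_plus_matern52"]),
        calibrate_variance = env_bool("CALIBRATE_VARIANCE", true),
    )
end

# Usage: override two settings for a quick run without touching the shell environment.
settings = withenv("CV_K" => "3", "COMPARE_KERNELS" => "false") do
    gp_fit_settings()
end
println(settings)
```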

Recommended Workflow

Run the orchestrator to step through data processing, GP fitting, and constrained optimization:

```
julia --project scripts/run_workflow.jl
```

The driver prompts for each stage, lets you skip to a previously saved artifact, and records the session under artifacts/runs/<timestamp>/metadata.toml. At the start of Part 1 it asks which variance-assignment strategy to use and whether to drop single-seed points entirely, applies your choices to that run, and persists them in the resulting artifacts so you can later see exactly which assumptions were used. Use --resume=<run_id> to preload artifacts from an earlier run, and pass --energy= flags or set SELECT_ENERGIES (comma-separated GeV values or indices) to pre-select energies in non-interactive environments. When no TTY is available, the script automatically re-runs all parts using the latest artifacts.

Stage Details

Part 1 - Process raw simulator exports

```
julia --project scripts/parse_abs_files.jl \
  [--input path/to/files_to_parse/subdir] \
  [--variance-strategy cascade|primary_only|two_phase] \
  [--drop-singletons true|false]
```

The script gathers energy_*.ref files, groups samples by energy, rescales units, and aggregates replicates. For singleton points it assigns variances using nearest neighbors (with logged provenance) or a global fallback. You can control where that variance comes from via --variance-strategy (or VARIANCE_STRATEGY):
cascade reproduces the legacy behavior (singletons may borrow from previously assigned singletons),
primary_only restricts borrowing to locations with true multi-seed measurements, and
two_phase first borrows from multi-seed neighbors, then runs a second kNN pass that smooths the assigned variances across all points.
Set --drop-singletons=true (or DROP_SINGLETONS=true) to discard all single-seed aggregates before saving the artifact; diagnostics still report how many were removed so you can track coverage. The processed dataset lives under artifacts/data/<timestamp>_E... with aggregated.jls, feature labels, and optional histograms in plots/. Override the input directory with --input or by setting FILES_TO_PARSE. Choose energies interactively or via SELECT_ENERGIES / --energy.
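
To make the primary_only strategy concrete, the sketch below assigns each single-seed point the mean variance of its k nearest multi-seed neighbours. The function name, the Euclidean distance, the choice of k, and the plain averaging are all assumptions for illustration; the repository's implementation (including its provenance logging and global fallback) may differ.

```julia
using LinearAlgebra, Statistics

# X: one row per design point; counts: replicates per point;
# vars: sample variance per point (entries for singletons are ignored).
function assign_singleton_variances(X::AbstractMatrix, counts::AbstractVector{<:Integer},
                                    vars::AbstractVector{<:Real}; k::Int=3)
    primary = findall(>(1), counts)           # points with true multi-seed measurements
    isempty(primary) && error("no multi-seed points to borrow from")
    out = float.(collect(vars))
    for i in findall(==(1), counts)           # single-seed points
        dists = [norm(X[i, :] .- X[j, :]) for j in primary]
        nearest = primary[partialsortperm(dists, 1:min(k, length(primary)))]
        out[i] = mean(vars[nearest])          # borrow only from multi-seed neighbours
    end
    return out
end

# Toy data (hypothetical): three multi-seed points and one singleton.
X = [0.0 0.0; 1.0 0.0; 0.0 1.0; 0.2 0.1]
counts = [3, 3, 3, 1]
vars = [0.25, 0.5, 1.0, NaN]
assigned = assign_singleton_variances(X, counts, vars; k=2)
println(assigned)
```

A cascade-style variant would also let already-assigned singletons act as donors, which is what primary_only is meant to rule out.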

Part 2 - Fit Gaussian-process models

```
julia --project scripts/fit_gp_model.jl [--data artifacts/data/<id>]
```

Select a processed data artifact, pick the energies to fit, and train mean/noise GPs with hyperparameter optimization plus k-fold cross-validation.

New surrogate-model features in Part 2:

Kernel-family comparison framework: evaluates se_ard, matern52_ard, and se_plus_matern52 (or a subset), then selects the best by CV NLPD.
Replicate-aware noise modeling: the noise GP now uses replicate counts to weight uncertainty in log-variance targets.
Variance calibration: CV out-of-fold predictions can fit a global variance scale (variance_scale) to improve uncertainty calibration.
Richer artifact payload: stores selected kernel_family, variance_scale, and per-kernel CV summaries.
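
For readers unfamiliar with the kernel names, the sketch below writes out the textbook squared-exponential and Matérn-5/2 kernels with per-dimension (ARD) lengthscales, the Gaussian NLPD, and a selection step that keeps the family with the lowest CV NLPD. Mapping these formulas onto se_ard, matern52_ard, and se_plus_matern52 is an assumption based on the names, sharing one set of hyperparameters across the two terms of the sum is a simplification, and the CV scores in the usage lines are made-up placeholders.

```julia
using Statistics

# Scaled distance with one lengthscale per input dimension (ARD).
scaled_dist(x, y, ℓ) = sqrt(sum(abs2, (x .- y) ./ ℓ))

se_ard(x, y; σ2=1.0, ℓ=ones(length(x))) = σ2 * exp(-0.5 * scaled_dist(x, y, ℓ)^2)

function matern52_ard(x, y; σ2=1.0, ℓ=ones(length(x)))
    r = scaled_dist(x, y, ℓ)
    return σ2 * (1 + sqrt(5) * r + 5 * r^2 / 3) * exp(-sqrt(5) * r)
end

# Sum kernel; here both terms share the same hyperparameters for brevity.
se_plus_matern52(x, y; kwargs...) = se_ard(x, y; kwargs...) + matern52_ard(x, y; kwargs...)

# Mean negative log predictive density for Gaussian predictions (lower is better).
gaussian_nlpd(y, μ, σ2) = mean(@. 0.5 * log(2π * σ2) + (y - μ)^2 / (2 * σ2))

# Keep the family with the lowest out-of-fold NLPD.
select_kernel(cv_nlpd::AbstractDict) = argmin(family -> cv_nlpd[family], keys(cv_nlpd))

# Placeholder scores, not results from this project.
scores = Dict("se_ard" => 0.42, "matern52_ard" => 0.37, "se_plus_matern52" => 0.40)
println(select_kernel(scores))
```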

Part 2 runtime flags (environment variables):

RANDOM_SEARCH_SAMPLES (default 128): random starts per hyperparameter search.
CV_K (default 5): CV folds (minimum 2).
COMPARE_KERNELS (true/false, default true): evaluate all configured kernel families or only the first.
KERNEL_FAMILIES (default all): comma-separated subset, e.g. se_ard,matern52_ard.
CALIBRATE_VARIANCE (true/false, default true): enable CV-based global variance scaling.
CALIBRATION_METHOD (nlpd or coverage68, default nlpd): calibration objective for selecting variance_scale.
CALIBRATION_BOUNDS (default 0.25,2.5): lower/upper bounds for variance_scale.
CALIBRATION_TARGET_COVERAGE (default 0.68): target coverage used when CALIBRATION_METHOD=coverage68.
GENERATE_PLOTS (true/false, default true): create diagnostic plots and CSVs.
SELECT_ENERGIES: comma-separated indices or energy values for non-interactive selection.
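
A global variance scale can be fitted from out-of-fold predictions in a few lines. The sketch below is one plausible reading of the two calibration methods, not the repository's code: for nlpd it uses the closed-form minimizer of the Gaussian NLPD (the mean squared standardized residual), and for coverage68 it picks the scale at which the one-sigma interval covers the target fraction of residuals. The function name and signature are hypothetical; the bounds and target default to the documented values.

```julia
using Statistics

# y: held-out targets; μ, σ2: out-of-fold predictive means and variances.
function calibrate_variance_scale(y, μ, σ2; method::Symbol=:nlpd,
                                  bounds=(0.25, 2.5), target_coverage=0.68)
    z2 = @. (y - μ)^2 / σ2                 # squared standardized residuals
    s = if method === :nlpd
        mean(z2)                           # minimizes the NLPD of N(μ, s * σ2) over s
    elseif method === :coverage68
        quantile(z2, target_coverage)      # |y - μ| <= sqrt(s * σ2) for the target fraction
    else
        throw(ArgumentError("unknown calibration method: $method"))
    end
    return clamp(s, bounds...)             # keep the scale inside the allowed range
end

# Toy residuals (hypothetical values).
y = [0.5, -0.5, 1.5, -1.5]
μ = zeros(4)
σ2 = ones(4)
println(calibrate_variance_scale(y, μ, σ2))                        # nlpd method
println(calibrate_variance_scale(y, μ, σ2; method=:coverage68))    # coverage method
```

A scale above 1 indicates the uncalibrated GP was overconfident on held-out points; a scale below 1 indicates it was too cautious.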

Example: full-quality run (single energy, all kernels, calibrated uncertainty)

```
SELECT_ENERGIES=125 \
RANDOM_SEARCH_SAMPLES=128 \
CV_K=5 \
COMPARE_KERNELS=true \
CALIBRATE_VARIANCE=true \
CALIBRATION_METHOD=nlpd \
julia --project scripts/fit_gp_model.jl --data artifacts/data/20260120_220447_E90_E125_E250
```

Example: fast/debug run

```
SELECT_ENERGIES=125 \
RANDOM_SEARCH_SAMPLES=32 \
CV_K=3 \
COMPARE_KERNELS=false \
KERNEL_FAMILIES=se_ard \
CALIBRATE_VARIANCE=true \
CALIBRATION_METHOD=coverage68 \
CALIBRATION_TARGET_COVERAGE=0.68 \
GENERATE_PLOTS=false \
julia --project scripts/fit_gp_model.jl --data artifacts/data/20260120_220447_E90_E125_E250
```

Outputs include training metrics, PIT diagnostics, per-fold CSV exports, and residual agreement plots in artifacts/models/<timestamp>_E.../plots/. Use the --data flag to bypass the interactive artifact picker.

Part 3 - Constrained optimization

```
julia --project scripts/run_minimization.jl [--model artifacts/models/<id>]
```

Choose a GP artifact, optionally customize physical box constraints and allowed beamstrahlung Y bounds, then solve the JuMP/Ipopt problem. The optimization artifact stores results.jls, a human-readable summary.txt, and residual/coverage plots filtered to the active bounds. The script echoes the optimal physical design, predicted luminosity, and metrics carried over from training/CV. Set --model to target a specific artifact or rely on the interactive picker. Part 3 automatically reuses the kernel_family stored in the selected model artifact.
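
The shape of the JuMP/Ipopt problem can be illustrated with a toy stand-in. In the sketch below the objective and the constraint are simple quadratics chosen so the answer is easy to check by hand; they are placeholders, not the GP surrogate or the beamstrahlung model, and the bounds and the limit `Y_max` are hypothetical. It assumes JuMP and Ipopt are installed in the active environment.

```julia
using JuMP, Ipopt

Y_max = 0.5                                   # hypothetical upper bound on the constraint

model = Model(Ipopt.Optimizer)
set_silent(model)

# Box constraints on two normalized design variables, with a feasible starting point.
@variable(model, 0.1 <= x[1:2] <= 1.0, start = 0.3)

# Stand-in for the beamstrahlung limit.
@constraint(model, x[1]^2 + x[2]^2 <= Y_max)

# Stand-in for the predicted luminosity to maximize.
@objective(model, Max, x[1] * x[2])

optimize!(model)
println(termination_status(model), ": x = ", value.(x),
        ", objective = ", objective_value(model))
```

For this toy problem the optimum sits on the constraint boundary at x = (0.5, 0.5). Ipopt is a local solver, so with a real surrogate it is worth restarting from several initial points.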

Model Artifact Comparison

Compare baseline vs candidate GP artifacts with a metric table and CSV export:

```
julia --project scripts/compare_model_artifacts.jl \
  --baseline artifacts/models/<baseline_id> \
  --candidate artifacts/models/<candidate_id> \
  [--out artifacts/analysis/<name>.csv]
```

The comparison reports per-energy deltas for CV metrics (nlpd, rmse, coverage68, coverage95, r2). If deserialization fails, rerun with the same Julia version that produced the artifacts.

Sampling Utilities

scripts/random_sampling_v2.jl produces stratified designs across six continuous optics variables, prepending the discrete energy choice and appending a random seed. Pass suffix tokens to create multiple files (e.g. julia --project scripts/random_sampling_v2.jl 1501 1502).
scripts/random_sampling_v3.jl fixes one optics point per invocation and generates multiple unique seeds; it also accepts range tokens (1501-1505) when producing several outputs at once. Adjust the parameter bounds inside each script before use.
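
One common way to build a stratified design is Latin hypercube sampling, sketched below for six variables with the energy prepended and a random seed appended to each row. Whether random_sampling_v2.jl stratifies this way is not stated here, so treat the method, the function name, the placeholder bounds, and the seed range as assumptions.

```julia
using Random

# Latin-hypercube style sample: each variable's range is split into n equal strata,
# and every stratum is used exactly once per variable.
function stratified_design(rng::AbstractRNG, lower::Vector{Float64},
                           upper::Vector{Float64}, n::Int)
    d = length(lower)
    design = Matrix{Float64}(undef, n, d)
    for j in 1:d
        strata = shuffle(rng, 1:n)                  # random stratum order per variable
        for i in 1:n
            u = (strata[i] - 1 + rand(rng)) / n     # uniform draw inside the stratum
            design[i, j] = lower[j] + u * (upper[j] - lower[j])
        end
    end
    return design
end

rng = MersenneTwister(1501)
lower = zeros(6)            # placeholder bounds for the six optics variables
upper = ones(6)
X = stratified_design(rng, lower, upper, 8)

# Prepend the discrete energy and append a random simulator seed to every row.
energy = 125.0
rows = [vcat(energy, X[i, :], rand(rng, 1:10^6)) for i in 1:size(X, 1)]
println(length(rows), " rows of length ", length(rows[1]))
```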

Pareto Plotting

Rebuild Pareto-front figures from a prepared table:

```
julia --project scripts/make_pareto_plot.jl inputs/values_for_pareto_frontier.txt
```

The script defaults to the provided file and writes both PDF and PNG plots to plots/.
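
For context, a Pareto front in the luminosity versus beamstrahlung tradeoff keeps only the designs that no other design beats on both objectives. The sketch below is a generic non-dominated filter, not the plotting script's code; the function name and the sample values are hypothetical.

```julia
# Return the indices of non-dominated points when maximizing `lum` and minimizing `bs`.
function pareto_front(lum::AbstractVector{<:Real}, bs::AbstractVector{<:Real})
    n = length(lum)
    keep = Int[]
    for i in 1:n
        # Point j dominates i if it is at least as good on both objectives
        # and strictly better on at least one.
        dominated = any(j -> j != i && lum[j] >= lum[i] && bs[j] <= bs[i] &&
                             (lum[j] > lum[i] || bs[j] < bs[i]), 1:n)
        dominated || push!(keep, i)
    end
    return keep
end

# Made-up values: the fifth design is worse than the second on both objectives.
lum = [1.0, 2.0, 3.0, 2.5, 1.5]
bs = [0.1, 0.3, 0.9, 0.4, 0.5]
println(pareto_front(lum, bs))
```

This pairwise check is quadratic in the number of designs, which is fine for the few hundred points typical of a plotted frontier.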

Additional Notes

Do not commit large raw simulator exports or cluster-specific edits (e.g. tailor GP_test_AA222_S3DF_SLURM.sh per run, but keep template changes minimal).
Prune outdated standalone plots in plots/ before opening a PR; artifact-specific plots already live under artifacts/.
Data artifacts record the variance-strategy and drop-singletons choices; the workflow UI shows them when you pick an existing artifact so you can align downstream runs.
For multi-objective studies, call helpers in functions/minimize_abs_functions_multiobjective.jl from a Julia REPL after loading the GP artifact.
A basic test suite now exists at test/runtests.jl. Run with:
```
julia --project test/runtests.jl
```
It covers variance calibration, kernel parameterization, replicate-aware noise helpers, and CV-NLPD sanity checks. For full validation, still compare regenerated artifacts/plots on an existing dataset (for example inputs/Y_and_product_values.txt).
