Final project for AA222/CS361 - Engineering Design Optimization (Best Paper Award, Spring 2024). This project develops a probabilistic optimization workflow for future electron-positron colliders, using Gaussian-process surrogate models of beam-beam simulations to maximize luminosity under physical constraints.
- Physics-informed Gaussian-process surrogate for modeling the beam-beam enhancement factor $H_D$, replacing expensive Particle-In-Cell simulations during optimization.
- Combination of constrained optimization and Pareto-frontier analysis to highlight the tradeoff between collider luminosity and beamstrahlung background.
- Demonstrated gains across several different future collider options, showing that the approach generalizes beyond a single operating point.
These results translate machine-learning optimization into direct scientific impact: higher luminosity means more collision data, stronger precision reach, and potentially shorter run times for future collider programs.
Report: AA222 Final Project Paper
The code can be run in stages: parse input data, fit GP model, and run optimization tasks. Each stage writes an artifact so results can be resumed, audited, or reused without re-running the full pipeline. The GP stage includes kernel-family comparison, replicate-aware noise modeling, and optional variance calibration.
- `functions/` gathers reusable logic: `parse_data_functions.jl` parses `.ref` exports and aggregates replicates, `GP_abs_functions.jl` builds the mean and noise GPs, `GP_evaluation.jl` emits diagnostics, `minimize_abs_functions*.jl` hold single- and multi-objective optimizers, and `workflow_utils.jl` manages CLI prompts plus artifact storage.
- `scripts/` contains workflow entry points: Part 1 (`parse_abs_files.jl`), Part 2 (`fit_gp_model.jl`), Part 3 (`run_minimization.jl`), the interactive orchestrator (`run_workflow.jl`), model-artifact comparison (`compare_model_artifacts.jl`), sampling utilities (`random_sampling_v2.jl`, `random_sampling_v3.jl`), and plotting helpers (`make_pareto_plot.jl`).
- `artifacts/` stores results by stage: `data/` (aggregated observations), `models/` (fitted GP state and diagnostics), `optimizations/` (JuMP results, residual plots, summaries), and `runs/` (workflow session history). Every artifact folder holds a `metadata.toml`, a serialized payload (`*.jls`), and optional plots.
- Inputs reside in `files_to_parse/` (raw simulator dumps), `sampling_files/` (generated design sets), `inputs/` (curated tables such as Pareto front data), and `plots/` (ad-hoc figures that are not tied to an artifact).
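The artifact layout described above (a `metadata.toml` next to a serialized `*.jls` payload) can be sketched with Julia's standard library alone. This is a minimal illustration, not the code in `workflow_utils.jl`: the names `save_artifact`, `load_artifact`, and `payload.jls` are hypothetical, and the timestamp format is an assumption.

```julia
using Dates, Serialization, TOML

# Hypothetical helper: write one artifact folder holding metadata.toml and payload.jls.
function save_artifact(root::AbstractString, stage::AbstractString, payload;
                       metadata=Dict{String,Any}())
    id = Dates.format(now(), "yyyymmdd_HHMMSS")   # assumed timestamp format
    dir = joinpath(root, stage, id)
    mkpath(dir)
    meta = merge(Dict{String,Any}("stage" => stage, "created" => id), metadata)
    open(joinpath(dir, "metadata.toml"), "w") do io
        TOML.print(io, meta)                       # human-readable provenance
    end
    serialize(joinpath(dir, "payload.jls"), payload)  # binary payload
    return dir
end

# Hypothetical helper: read both files back from an artifact folder.
function load_artifact(dir::AbstractString)
    meta = TOML.parsefile(joinpath(dir, "metadata.toml"))
    payload = deserialize(joinpath(dir, "payload.jls"))
    return meta, payload
end
```

Keeping the metadata in TOML means an artifact can be inspected without deserializing it, which matters because `*.jls` files are only guaranteed to load under the Julia version that wrote them.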
- Use Julia 1.9 or newer.
- Instantiate the project environment from the repo root: `julia --project=. -e 'using Pkg; Pkg.instantiate()'`
- Ensure Ipopt is available (e.g. `julia --project=. -e 'using Pkg; Pkg.add("Ipopt_jll")'`), or point JuMP at a system installation.
- Set `GENERATE_PLOTS=false` before running scripts if you want to skip heavy plotting while iterating quickly.
- Runtime overrides for GP fitting are controlled via environment variables (see Part 2 for details): `RANDOM_SEARCH_SAMPLES`, `CV_K`, `COMPARE_KERNELS`, `KERNEL_FAMILIES`, `CALIBRATE_VARIANCE`.
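As a rough illustration of how such environment overrides might be read with defaults, the sketch below parses them into a named tuple. The helper names (`env_int`, `env_bool`, `env_list`, `gp_fit_options`) are hypothetical, and clamping `CV_K` to at least 2 is an assumption about how the documented minimum is enforced; the scripts' own parsing may differ.

```julia
# Hypothetical helpers; `env` defaults to the process environment but accepts any Dict.
env_int(name, default; env=ENV) = parse(Int, get(env, name, string(default)))

function env_bool(name, default; env=ENV)
    raw = lowercase(strip(get(env, name, string(default))))
    raw == "true" && return true
    raw == "false" && return false
    error("$name must be true or false, got \"$raw\"")
end

# Comma-separated list; falls back to `default` when the variable is unset.
function env_list(name, default; env=ENV)
    haskey(env, name) || return default
    return filter(!isempty, String.(strip.(split(env[name], ','))))
end

function gp_fit_options(env=ENV)
    return (
        random_search_samples = env_int("RANDOM_SEARCH_SAMPLES", 128; env),
        cv_k = max(2, env_int("CV_K", 5; env)),   # assumption: clamp to the minimum of 2
        compare_kernels = env_bool("COMPARE_KERNELS", true; env),
        kernel_families = env_list("KERNEL_FAMILIES",
                                   ["se_ard", "matern52_ard", "se_plus_matern52"]; env),
        calibrate_variance = env_bool("CALIBRATE_VARIANCE", true; env),
    )
end
```

Passing a plain `Dict` instead of `ENV` makes the parsing easy to exercise in a REPL without touching the shell environment.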
Run the orchestrator to step through data processing, GP fitting, and constrained optimization:

    julia --project scripts/run_workflow.jl

The driver prompts for each stage, lets you skip to a previously saved artifact, and records the session under `artifacts/runs/<timestamp>/metadata.toml`. At the start of Part 1 it now walks through the variance-assignment strategies and whether to drop single-seed points entirely, applies your choices to that run, and persists them in the resulting artifacts so you can see exactly what assumptions were used later. Use `--resume=<run_id>` to preload artifacts from an earlier run, and pass `--energy=` flags or set `SELECT_ENERGIES` (comma-separated GeV values or indices) to pre-select energies in non-interactive environments. When no TTY is available the script automatically re-runs all parts using the latest artifacts.
Part 1 parses the raw simulator output:

    julia --project scripts/parse_abs_files.jl \
        [--input path/to/files_to_parse/subdir] \
        [--variance-strategy cascade|primary_only|two_phase] \
        [--drop-singletons true|false]

The script gathers `energy_*.ref` files, groups samples by energy, rescales units, and aggregates replicates. For singleton points it assigns variances using nearest neighbors (with logged provenance) or a global fallback. You can control where that variance comes from via `--variance-strategy` (or `VARIANCE_STRATEGY`):
- `cascade` reproduces the legacy behavior (singletons may borrow from previously assigned singletons),
- `primary_only` restricts borrowing to locations with true multi-seed measurements, and
- `two_phase` first borrows from multi-seed neighbors, then runs a second kNN pass that smooths the assigned variances across all points.

Set `--drop-singletons=true` (or `DROP_SINGLETONS=true`) if you would rather discard all single-seed aggregates entirely before saving the artifact; diagnostics still report how many were removed so you can track coverage. The processed dataset lives under `artifacts/data/<timestamp>_E...` with `aggregated.jls`, feature labels, and optional histograms in `plots/`. Override the input directory with `--input` or by setting `FILES_TO_PARSE`. Choose energies interactively or via `SELECT_ENERGIES` / `--energy`.
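To make the `primary_only` idea concrete, here is a minimal sketch in which each single-seed point takes the mean variance of its `k` nearest multi-seed neighbors and falls back to a global value when no multi-seed point exists. The function name, the Euclidean metric on raw features, the choice `k = 3`, and averaging the neighbor variances are all assumptions made for illustration; the logic in `parse_data_functions.jl` may differ.

```julia
using LinearAlgebra, Statistics

# Hypothetical sketch. Rows of X are design points; counts[i] is the number of
# replicates at point i; variances[i] is only trusted where counts[i] > 1.
function assign_singleton_variance(X::AbstractMatrix, counts::AbstractVector{<:Integer},
                                   variances::AbstractVector{<:Real};
                                   k::Int=3, fallback::Real)
    multi = findall(>(1), counts)            # points with true multi-seed measurements
    out = float.(collect(variances))
    for i in findall(==(1), counts)
        if isempty(multi)
            out[i] = fallback                # global fallback: nothing to borrow from
            continue
        end
        d = [norm(X[i, :] .- X[j, :]) for j in multi]
        nearest = multi[partialsortperm(d, 1:min(k, length(multi)))]
        out[i] = mean(variances[nearest])    # borrow only from multi-seed neighbors
    end
    return out
end
```

In practice the features would likely need rescaling before distances are computed, since the optics variables do not share units.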
Part 2 fits the GP surrogate:

    julia --project scripts/fit_gp_model.jl [--data artifacts/data/<id>]

Select a processed data artifact, pick the energies to fit, and train mean/noise GPs with hyperparameter optimization plus k-fold cross-validation.
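For readers unfamiliar with k-fold cross-validation, the fold assignment can be sketched as below. The helper names are hypothetical and the shuffling scheme is an assumption; the fitting script may split the data differently (for example, by keeping replicates together).

```julia
using Random

# Hypothetical helper: shuffle 1:n and deal the indices into k folds of near-equal size.
function kfold_indices(rng::AbstractRNG, n::Int, k::Int)
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    perm = randperm(rng, n)
    return [perm[f:k:n] for f in 1:k]
end

# Training indices for fold f are all indices outside its held-out set.
train_indices(folds, f) = reduce(vcat, folds[setdiff(1:length(folds), f)])
```

Each point is held out exactly once, so the out-of-fold predictions cover the whole dataset and can feed both model selection and uncertainty calibration.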
New surrogate-model features in Part 2:
- Kernel-family comparison framework: evaluates `se_ard`, `matern52_ard`, and `se_plus_matern52` (or a subset), then selects the best by CV NLPD.
- Replicate-aware noise modeling: the noise GP now uses replicate counts to weight uncertainty in log-variance targets.
- Variance calibration: CV out-of-fold predictions can fit a global variance scale (`variance_scale`) to improve uncertainty calibration.
- Richer artifact payload: stores selected `kernel_family`, `variance_scale`, and per-kernel CV summaries.
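The three kernel-family names correspond to standard covariance functions. The sketch below writes out the textbook squared-exponential and Matérn-5/2 forms with one length scale per input dimension (ARD). The keyword names `variance` and `lengthscales` are hypothetical, and treating `se_plus_matern52` as a plain sum with separate hyperparameters per component is an assumption about the repository's parameterization.

```julia
# Squared distance after scaling each dimension by its own length scale (the ARD part).
scaled_sqdist(x, y, lengthscales) = sum(abs2, (x .- y) ./ lengthscales)

# Squared-exponential ARD kernel.
se_ard(x, y; variance, lengthscales) =
    variance * exp(-0.5 * scaled_sqdist(x, y, lengthscales))

# Matern-5/2 ARD kernel.
function matern52_ard(x, y; variance, lengthscales)
    r = sqrt(scaled_sqdist(x, y, lengthscales))
    return variance * (1 + sqrt(5) * r + 5 * r^2 / 3) * exp(-sqrt(5) * r)
end

# Sum kernel; `se` and `matern` are named tuples of hyperparameters (assumed layout).
se_plus_matern52(x, y; se, matern) =
    se_ard(x, y; se...) + matern52_ard(x, y; matern...)
```

The Matérn-5/2 kernel yields rougher sample paths than the squared exponential, which is one reason to let cross-validated NLPD decide between them rather than fixing a family in advance.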
Part 2 runtime flags (environment variables):
- `RANDOM_SEARCH_SAMPLES` (default `128`): random starts per hyperparameter search.
- `CV_K` (default `5`): CV folds (minimum `2`).
- `COMPARE_KERNELS` (`true`/`false`, default `true`): evaluate all configured kernel families or only the first.
- `KERNEL_FAMILIES` (default all): comma-separated subset, e.g. `se_ard,matern52_ard`.
- `CALIBRATE_VARIANCE` (`true`/`false`, default `true`): enable CV-based global variance scaling.
- `CALIBRATION_METHOD` (`nlpd` or `coverage68`, default `nlpd`): calibration objective for selecting `variance_scale`.
- `CALIBRATION_BOUNDS` (default `0.25,2.5`): lower/upper bounds for `variance_scale`.
- `CALIBRATION_TARGET_COVERAGE` (default `0.68`): target coverage used when `CALIBRATION_METHOD=coverage68`.
- `GENERATE_PLOTS` (`true`/`false`, default `true`): create diagnostic plots and CSVs.
- `SELECT_ENERGIES`: comma-separated indices or energy values for non-interactive selection.
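For the `nlpd` calibration objective, a global variance scale has a closed form under Gaussian predictions: minimizing the mean NLPD over a multiplier `s` on the predicted variances gives `s` equal to the mean squared standardized residual, which is then clipped to the configured bounds. The sketch below illustrates that calculation under those assumptions; the function names are hypothetical and the fitting script may instead use a numerical search.

```julia
using Statistics

# Mean negative log predictive density for Gaussian predictions whose variances are scaled by s.
function gaussian_nlpd(y, mu, sigma2; s=1.0)
    v = s .* sigma2
    return mean(0.5 .* log.(2pi .* v) .+ (y .- mu) .^ 2 ./ (2 .* v))
end

# Setting d/ds NLPD(s) = 0 gives s = mean((y - mu)^2 / sigma2). NLPD decreases
# before that point and increases after it, so clamping yields the bounded optimum.
function calibrate_variance_scale(y, mu, sigma2; bounds=(0.25, 2.5))
    s = mean((y .- mu) .^ 2 ./ sigma2)
    return clamp(s, bounds[1], bounds[2])
end
```

A scale above 1 indicates the surrogate was overconfident on held-out points; a scale pinned at either bound suggests the noise model itself deserves a closer look.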
Example: full-quality run (single energy, all kernels, calibrated uncertainty)

    SELECT_ENERGIES=125 \
    RANDOM_SEARCH_SAMPLES=128 \
    CV_K=5 \
    COMPARE_KERNELS=true \
    CALIBRATE_VARIANCE=true \
    CALIBRATION_METHOD=nlpd \
    julia --project scripts/fit_gp_model.jl --data artifacts/data/20260120_220447_E90_E125_E250

Example: fast/debug run

    SELECT_ENERGIES=125 \
    RANDOM_SEARCH_SAMPLES=32 \
    CV_K=3 \
    COMPARE_KERNELS=false \
    KERNEL_FAMILIES=se_ard \
    CALIBRATE_VARIANCE=true \
    CALIBRATION_METHOD=coverage68 \
    CALIBRATION_TARGET_COVERAGE=0.68 \
    GENERATE_PLOTS=false \
    julia --project scripts/fit_gp_model.jl --data artifacts/data/20260120_220447_E90_E125_E250

Outputs include training metrics, PIT diagnostics, per-fold CSV exports, and residual agreement plots in `artifacts/models/<timestamp>_E.../plots/`. Use the `--data` flag to bypass the interactive artifact picker.
Part 3 runs the constrained optimization:

    julia --project scripts/run_minimization.jl [--model artifacts/models/<id>]

Choose a GP artifact, optionally customize physical box constraints and allowed beamstrahlung Y bounds, then solve the JuMP/Ipopt problem. The optimization artifact stores `results.jls`, a human-readable `summary.txt`, and residual/coverage plots filtered to the active bounds. The script echoes the optimal physical design, predicted luminosity, and metrics carried over from training/CV. Set `--model` to target a specific artifact or rely on the interactive picker. Part 3 automatically reuses the `kernel_family` stored in the selected model artifact.
Compare baseline vs candidate GP artifacts with a metric table and CSV export:

    julia --project scripts/compare_model_artifacts.jl \
        --baseline artifacts/models/<baseline_id> \
        --candidate artifacts/models/<candidate_id> \
        [--out artifacts/analysis/<name>.csv]

The comparison reports per-energy deltas for CV metrics (`nlpd`, `rmse`, `coverage68`, `coverage95`, `r2`).
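As a reference for what these metric names usually mean, the sketch below computes them from held-out targets, predictive means, and predictive variances, then takes candidate-minus-baseline deltas. It assumes Gaussian predictions and approximates the 68% and 95% intervals as ±1 and ±1.96 standard deviations; the function names are hypothetical, and the comparison script's exact definitions and sign convention may differ.

```julia
using Statistics

# Hypothetical helper: summary metrics for one set of out-of-fold predictions.
function cv_metrics(y, mu, sigma2)
    resid = y .- mu
    z = abs.(resid) ./ sqrt.(sigma2)          # standardized absolute residuals
    return (
        nlpd = mean(0.5 .* log.(2pi .* sigma2) .+ resid .^ 2 ./ (2 .* sigma2)),
        rmse = sqrt(mean(abs2, resid)),
        coverage68 = mean(z .<= 1.0),         # fraction inside roughly the 68% interval
        coverage95 = mean(z .<= 1.96),        # fraction inside roughly the 95% interval
        r2 = 1 - sum(abs2, resid) / sum(abs2, y .- mean(y)),
    )
end

# Candidate minus baseline, metric by metric (assumed sign convention).
metric_deltas(baseline, candidate) = map(-, candidate, baseline)
```

Lower is better for `nlpd` and `rmse`, higher is better for `r2`, and the coverage metrics are best when close to their nominal targets rather than simply large.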
If deserialization fails, rerun with the same Julia version that produced the artifacts.
scripts/random_sampling_v2.jl produces stratified designs across six continuous optics variables, prepending the discrete energy choice and appending a random seed. Pass suffix tokens to create multiple files (e.g. julia --project scripts/random_sampling_v2.jl 1501 1502).
scripts/random_sampling_v3.jl fixes one optics point per invocation and generates multiple unique seeds; it also accepts range tokens (1501-1505) when producing several outputs at once. Adjust the parameter bounds inside each script before use.
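One common way to build a stratified design is Latin hypercube sampling, sketched below together with a small parser for suffix and range tokens. Both are illustrations under stated assumptions: the README does not say which stratification scheme `random_sampling_v2.jl` uses, and the names `latin_hypercube` and `expand_tokens` are hypothetical.

```julia
using Random

# Latin hypercube: split each dimension into n equal strata and place exactly
# one sample in each stratum, with strata shuffled independently per dimension.
function latin_hypercube(rng::AbstractRNG, n::Int, lower::Vector{Float64}, upper::Vector{Float64})
    d = length(lower)
    X = Matrix{Float64}(undef, n, d)
    for j in 1:d
        strata = randperm(rng, n)
        for i in 1:n
            u = (strata[i] - 1 + rand(rng)) / n      # uniform draw inside the stratum
            X[i, j] = lower[j] + u * (upper[j] - lower[j])
        end
    end
    return X
end

# Expand tokens such as "1501" or "1501-1505" into a flat list of integers.
function expand_tokens(tokens)
    out = Int[]
    for t in tokens
        parts = split(t, '-')
        append!(out, parse(Int, first(parts)):parse(Int, last(parts)))
    end
    return out
end
```

For example, `latin_hypercube(MersenneTwister(1), 5, zeros(6), ones(6))` returns a 5×6 matrix in which every column has one value in each fifth of the unit interval.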
Rebuild Pareto-front figures from a prepared table:

    julia --project scripts/make_pareto_plot.jl inputs/values_for_pareto_frontier.txt

The script defaults to the provided file and writes both PDF and PNG plots to `plots/`.
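The Pareto front itself is the set of designs that no other design beats on both objectives at once. A minimal non-dominated filter for the luminosity versus beamstrahlung-background tradeoff might look like the sketch below; the field names `lum` and `bkg` are hypothetical and do not reflect the column layout of the input table.

```julia
# a dominates b when it is no worse on both objectives and strictly better on one
# (luminosity is maximized, background is minimized).
dominates(a, b) = a.lum >= b.lum && a.bkg <= b.bkg && (a.lum > b.lum || a.bkg < b.bkg)

# Keep the points that no other point dominates, sorted by luminosity.
function pareto_front(points)
    front = [p for p in points if !any(q -> dominates(q, p), points)]
    return sort(front; by = p -> p.lum)
end
```

This pairwise check is quadratic in the number of points, which is fine for a curated table; a sort-based sweep would be the usual choice for much larger sets.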
- Do not commit large raw simulator exports or cluster-specific edits (e.g. tailor `GP_test_AA222_S3DF_SLURM.sh` per run, but keep template changes minimal).
- Prune outdated standalone plots in `plots/` before opening a PR; artifact-specific plots already live under `artifacts/`.
- Data artifacts record the variance-strategy and drop-singletons choices; the workflow UI shows them when you pick an existing artifact so you can align downstream runs.
- For multi-objective studies, call helpers in `functions/minimize_abs_functions_multiobjective.jl` from a Julia REPL after loading the GP artifact.
- A basic test suite now exists at `test/runtests.jl`. Run it with `julia --project test/runtests.jl`. It covers variance calibration, kernel parameterization, replicate-aware noise helpers, and CV-NLPD sanity checks. For full validation, still compare regenerated artifacts/plots on an existing dataset (for example `inputs/Y_and_product_values.txt`).
