AC-BO-Hackathon · sgbaird · Jun 11, 2025 · May 28, 2025
diff --git a/.gitignore b/.gitignore
@@ -138,5 +138,11 @@ main.glo
 main.out
 mainNotes.bib
 pdfa.xmpi
-main.pdf
+
 main.synctex(busy)
+main.pdf
+main.bbl
+main.blg
+main_diff.acn
+main_diff.glo
+main_diffNotes.bib
diff --git a/_projects/json_summaries/project_21.json b/_projects/json_summaries/project_21.json
@@ -2,6 +2,6 @@
     "project_number": 21,
     "project_name": "Benchmarking Molecular Descriptors with Actively Identified Subsets (MolDAIS)",
     "video_url": "https://www.youtube.com/watch?v=uYXAe3sRUSo",
-    "summary": "This research presents a novel approach called MOLDES (Molecular Descriptors with Actively Identified Subspaces) for molecular property optimization. The method addresses the challenge of optimizing molecules in high-dimensional spaces by using molecular descriptors - sets of rotationally and translationally invariant calculations performed on molecular graphs - coupled with active subspace identification. MOLDES employs a sparse axis-aligned subspace Gaussian Process prior, which actively learns an encoding while performing Bayesian optimization.The researchers evaluated MOLDES on three case studies: experimental lipophilicity (4,200 compounds), log P optimization benchmark (250,000 molecules), and power conversion efficiency from the Harvard Clean Energy Project (30,000 compounds). In all cases, MOLDES demonstrated superior performance compared to other optimizers, particularly in larger datasets. For the log P optimization, MOLDES consistently found the optimal molecule within 100 iterations. The method also showed strong performance in constrained optimization problems, often achieving the best-case scenario and maintaining a favorable worst-case scenario compared to other methods. Overall, MOLDES proved efficient in identifying high-performing molecules in low-data regimes, offering a promising approach for molecular property optimization tasks.",
+    "summary": "This research presents a novel approach called MolDAIS (Molecular Descriptors with Actively Identified Subspaces) for molecular property optimization. The method addresses the challenge of optimizing molecules in high-dimensional spaces by using molecular descriptors - sets of rotationally and translationally invariant calculations performed on molecular graphs - coupled with active subspace identification. MolDAIS employs a sparse axis-aligned subspace Gaussian Process prior, which actively learns an encoding while performing Bayesian optimization. Recent works\\cite{sorourifar_accelerating_2024,maus_local_2023} are increasingly turning towards active encoding of molecular feature spaces. The researchers evaluated MolDAIS on three case studies: experimental lipophilicity (4,200 compounds), log P optimization benchmark (250,000 molecules), and power conversion efficiency from the Harvard Clean Energy Project (30,000 compounds). In all cases, MolDAIS demonstrated superior performance compared to other optimizers, particularly in larger datasets. For the log P optimization, MolDAIS consistently found the optimal molecule within 100 iterations. The method also showed strong performance in constrained optimization problems, often achieving the best-case scenario and maintaining a favorable worst-case scenario compared to other methods. Overall, MolDAIS proved efficient in identifying high-performing molecules in low-data regimes, offering a promising approach for molecular property optimization tasks.",
     "status": "success"
 }
diff --git a/_projects/json_summaries/project_22.json b/_projects/json_summaries/project_22.json
@@ -2,6 +2,6 @@
     "project_number": 22,
     "project_name": "Chemical Similarity-Informed Earth Mover’s Distance Kernel Bayesian Optimization for Predicting the Properties of Molecules and Molecular Mixtures",
     "video_url": "https://www.youtube.com/watch?v=I179UR8P054",
-    "summary": "This research project focuses on developing chemical similarity-informed distance functions and kernels for explainable Bayesian optimization, specifically targeting the prediction of properties for molecular mixtures. The researchers propose a novel approach that bypasses the need for embedding vectors by directly providing pairwise distances between data points in the kernel function of a Gaussian Process (GP) model.The project introduces the Earth Mover's Distance (EMD) kernel into the GP framework to calculate pairwise distances between mixtures based on individual component distances. This method was tested for predicting yields of binary reactant mixtures, demonstrating high chemical resolution in mixture analysis. The results show that the EMD kernel achieves accurate yield predictions with narrow distributions for both high and low-yield cases, indicating improved performance in distinguishing between different mixture compositions. By incorporating smooth distance metrics, the researchers successfully extended Bayesian optimization techniques from pure components to molecular mixtures, potentially enhancing the efficiency and interpretability of materials property prediction in complex chemical systems.",
+    "summary": "This research project focuses on developing chemical similarity-informed distance functions and kernels for explainable Bayesian optimization, specifically targeting the prediction of properties for molecular mixtures. The researchers propose a novel approach that bypasses the need for embedding vectors by directly providing pairwise distances between data points in the kernel function of a Gaussian Process (GP) model\\cite{moss_gaussian_2020}. The project introduces the Earth Mover's Distance (EMD) kernel\\cite{hargreaves_earth_2020} into the GP framework to calculate pairwise distances between mixtures based on individual component distances. This method was tested for predicting yields of binary reactant mixtures, demonstrating high chemical resolution in mixture analysis. The results show that the EMD kernel achieves accurate yield predictions with narrow distributions for both high and low-yield cases, indicating improved performance in distinguishing between different mixture compositions. By incorporating smooth distance metrics, the researchers successfully extended Bayesian optimization techniques from pure components to molecular mixtures, potentially enhancing the efficiency and interpretability of materials property prediction in complex chemical systems.",
     "status": "success"
 }
diff --git a/_projects/json_summaries/project_24.json b/_projects/json_summaries/project_24.json
@@ -1,8 +1,7 @@
 {
     "project_number": 24,
     "project_name": "ScattBO Benchmark - Bayesian optimisation for materials discovery",
-    "video_url": "https://github.com/AndySAnker/ScattBO/tree/main/presentation",
-    "summaries": null,
-    "status": "failed",
-    "error": "Could not determine the video ID for the URL \"https://github.com/AndySAnker/ScattBO/tree/main/presentation\"."
+    "video_url": "https://twitter.com/SodeAndy/status/1773474538631651769",
+    "summary": "This project presents ScattBO, a Python-based benchmark that simulates a self-driving laboratory (SDL) for materials discovery. An SDL is an autonomous platform that conducts machine learning-selected experiments to achieve a user-defined objective, such as synthesizing a specific material\\cite{szymanski_autonomous_2023}. SDLs can be expensive to run, which makes intelligent experimental planning essential, yet only a few people have access to a real SDL for materials discovery, so it is currently challenging to benchmark Bayesian optimization algorithms for experimental planning tasks in SDLs. ScattBO fills this gap by providing an accessible in silico simulation of an SDL: based on synthesis parameters, the benchmark 'synthesizes' a structure, calculates its scattering pattern\\cite{johansen_gpu-accelerated_2024}, and compares it to the target structure's scattering pattern. The benchmark acknowledges that scattering data may not be sufficient to conclusively validate that the target material has been synthesized\\cite{leeman_challenges_2024}, but it can include other types of data as long as they can be simulated.",
+    "status": "success"
 }
diff --git a/_projects/json_summaries/project_25.json b/_projects/json_summaries/project_25.json
@@ -2,6 +2,6 @@
     "project_number": 25,
     "project_name": "Bayesian Optimized De Novo Drug Design for Selective Kinase Targeting ",
     "video_url": "https://www.youtube.com/watch?v=nVtTYXxG7i4",
-    "summary": "This project focused on incorporating Bayesian optimization to guide de novo drug design, specifically targeting growth factor receptors for cancer therapeutics. The team built upon the doct string paper, Python library, and dataset by Garcia-Oron and Bacal, using a Gaussian process with a Matérn kernel on Morgan fingerprint representations. They employed a graph genetic algorithm to generate SMILES strings guided by the Bayesian optimization output.The researchers explored both selective and promiscuous binding scenarios. For selective binding, they optimized for binding to FGFR1 while penalizing overbinding to other growth factor receptors relative to their median. For promiscuous binding, they maximized the maximum binding affinity across multiple receptors. They found that a sigmoidal penalty function was more effective than simple absolute differences when optimizing against multiple proteins. The team also incorporated a drug-likeness measure (QED) as a penalty in the optimization process, though its effect was limited. Due to time and resource constraints, the project was unable to extensively explore the chemical space or use more accurate binding affinity calculations beyond docking. The authors suggest that future work could incorporate known unknowns through an evasion process, further optimize selective binding, and compare different molecular representations.",
+    "summary": "This project focused on incorporating Bayesian optimization to guide de novo drug design, specifically targeting growth factor receptors for cancer therapeutics. The team built upon the DOCKSTRING paper, Python library, and dataset\\cite{garcia_dockstring_2022}, using a Gaussian process with a Matérn kernel on Morgan fingerprint representations. They employed a graph genetic algorithm to generate SMILES strings guided by the Bayesian optimization output. The researchers explored both selective and promiscuous binding scenarios. For selective binding, they optimized for binding to FGFR1 while penalizing overbinding to other growth factor receptors relative to their median. For promiscuous binding, they maximized the maximum binding affinity across multiple receptors. They found that a sigmoidal penalty function was more effective than simple absolute differences when optimizing against multiple proteins. The team also incorporated a drug-likeness measure (QED)\\cite{bickerton_quantifying_2012} as a penalty in the optimization process, though its effect was limited. Due to time and resource constraints, the project was unable to extensively explore the chemical space or use more accurate binding affinity calculations beyond docking. The authors suggest that future work could incorporate known unknowns through an evasion process, further optimize selective binding, and compare different molecular representations.",
     "status": "success"
 }
diff --git a/_projects/json_summaries/project_27.json b/_projects/json_summaries/project_27.json
@@ -2,6 +2,6 @@
     "project_number": 27,
     "project_name": "How does initial warm-up data influence Bayesian optimization in low-data experimental settings?",
     "video_url": "https://www.youtube.com/watch?v=4gPTMaarQt0",
-    "summary": "This research project investigated the influence of warm-up sampling methods and dataset sizes on property optimization in low data regimes, specifically focusing on molecular property prediction. The team used the QM9 dataset and selected band gap as the optimization target. They compared two chemically-inspired sampling methods for the warm-up dataset: Morgan fingerprints and MolFormer language model fingerprints.The researchers performed dimensionality reduction on the fingerprints using PCA, projecting them into a 2D space for sampling. They conducted experiments to analyze how the warm-up dataset size affects optimization results. The most significant finding was the comparison between Morgan fingerprints and MolFormer fingerprints at a constant data regime of 50 data points. The results showed that MolFormer fingerprints substantially outperformed Morgan fingerprints, suggesting that pre-trained models on large chemical spaces can potentially improve model optimization rates. This study aims to initiate broader discussions on how dataset sizes and sampling methodologies impact final optimization tasks in molecular property prediction.",
+    "summary": "This research project investigated the influence of warm-up sampling methods and dataset sizes on property optimization in low data regimes, specifically focusing on molecular property prediction. The team used the QM9 dataset\\cite{ramakrishnan_quantum_2014} and selected band gap as the optimization target. They compared two chemically-inspired sampling methods for the warm-up dataset: Morgan fingerprints and MolFormer language model fingerprints. The researchers also referenced the GDB-17 chemical universe database\\cite{ruddigkeit_enumeration_2012} in their background work. The researchers performed dimensionality reduction on the fingerprints using PCA, projecting them into a 2D space for sampling. They conducted experiments to analyze how the warm-up dataset size affects optimization results. The most significant finding was the comparison between Morgan fingerprints and MolFormer fingerprints at a constant data regime of 50 data points. The results showed that MolFormer fingerprints substantially outperformed Morgan fingerprints, suggesting that pre-trained models on large chemical spaces can potentially improve model optimization rates. This study aims to initiate broader discussions on how dataset sizes and sampling methodologies impact final optimization tasks in molecular property prediction.",
     "status": "success"
 }
diff --git a/_projects/json_summaries/project_35.json b/_projects/json_summaries/project_35.json
@@ -2,6 +2,6 @@
     "project_number": 35,
     "project_name": "Tutorial for GAUCHE - A Library for Gaussian Processes in Chemistry",
     "video_url": "https://x.com/Ryan__Rhys/status/1820723528469262419",
-    "summary": "This research project focuses on implementing input warping for Bayesian Optimization within the Gauche library, which was previously developed by the team and published at NeurIPS 2023. The primary innovation of Gauche is the introduction of Gaussian process (GP) kernels that enable modeling of discrete entities such as SMILES strings, graphs, and bit vectors, which are common representations in molecular sciences.The motivation behind using Gaussian processes for Bayesian Optimization is their suitability for automated tasks where fine-tuning for each problem is not feasible. GPs offer a good balance between performance and simplicity, with few trainable hyperparameters that can reliably converge on each iteration of the Bayesian Optimization loop. This makes them particularly attractive as surrogate models compared to more complex alternatives like deep neural networks, which might require careful monitoring during training at each iteration. The Gauche library extends the applicability of GPs to discrete input spaces, allowing for Bayesian Optimization over molecular representations. The project team has developed a range of tutorials and applications, including molecular property prediction, protein fitness prediction, and sparse GP regression, all available in the Gauche GitHub repository.",
+    "summary": "This research project focuses on implementing input warping for Bayesian Optimization within the Gauche library\\cite{griffiths_gauche_2024}, which was previously developed by the team and published at NeurIPS 2023. The primary innovation of Gauche is the introduction of Gaussian process (GP) kernels that enable modeling of discrete entities such as SMILES strings, graphs, and bit vectors, which are common representations in molecular sciences. The motivation behind using Gaussian processes for Bayesian Optimization is their suitability for automated tasks where fine-tuning for each problem is not feasible. GPs offer a good balance between performance and simplicity, with few trainable hyperparameters that can reliably converge on each iteration of the Bayesian Optimization loop. This makes them particularly attractive as surrogate models compared to more complex alternatives like deep neural networks, which might require careful monitoring during training at each iteration. The Gauche library extends the applicability of GPs to discrete input spaces, allowing for Bayesian Optimization over molecular representations. The project team has developed a range of tutorials and applications, including molecular property prediction, protein fitness prediction, and sparse GP regression, all available in the Gauche GitHub repository.",
     "status": "success"
 }
diff --git a/_projects/json_summaries/project_36.json b/_projects/json_summaries/project_36.json
@@ -2,6 +2,6 @@
     "project_number": 36,
     "project_name": "Scalable Nonmyopic Bayesian Optimization in Dynamic Cost Settings",
     "video_url": "https://youtu.be/CXweDiS_wbI",
-    "summary": "This research project focuses on scalable Bayesian optimization in dynamic settings, addressing limitations of previous approaches that rely on myopic acquisition functions and assume fixed cost structures. The researchers introduce a novel method using non-myopic acquisition functions that incorporate a look-ahead mechanism and dynamic cost functions.The project evaluates the proposed algorithm, named HBE, through two main experimental setups. First, they use synthetic functions across 14 different environments with varying dimensions to test scalability. Second, they apply the method to a real-world protein sequence design problem, aiming to maximize a protein score. The researchers compare their HBE algorithm against six other acquisition functions, including state-of-the-art methods. To enhance practicality, they integrate automatic hyperparameter tuning to reduce the number of optimization parameters. While specific results are not provided in the given context, the approach aims to overcome suboptimal resource allocation in dynamic cost experiments and improve upon existing Bayesian optimization techniques.",
+    "summary": "This research project focuses on scalable Bayesian optimization in dynamic settings, addressing limitations of previous approaches that rely on myopic acquisition functions and assume fixed cost structures. The researchers introduce a novel method using non-myopic acquisition functions\\cite{jiang_efficient_2020} that incorporate a look-ahead mechanism and dynamic cost functions. The project evaluates the proposed algorithm, named HBE, through two main experimental setups. First, they use synthetic functions across 14 different environments with varying dimensions to test scalability. Second, they apply the method to a real-world protein sequence design problem, aiming to maximize a protein score. The researchers compare their HBE algorithm against six other acquisition functions, including state-of-the-art methods. To enhance practicality, they integrate automatic hyperparameter tuning to reduce the number of optimization parameters. While specific results are not provided in the given context, the approach aims to overcome suboptimal resource allocation in dynamic cost experiments and improve upon existing Bayesian optimization techniques.",
     "status": "success"
 }