Ocean Surface pCO₂ Reconstruction with NGBoost

Authors

Group 1: Azam Khan, Bokai He, Sarah Pariser, Zhi Wang

Contribution Statement:

Azam Khan: Set up the pipeline to train NGBoost models; wrote visualization functions & analysis
Bokai He: Statistical significance analysis, including p-value & t-test; NGBoost & XGBoost comparison
Sarah Pariser: Created masks; streamlined code/story; contributed to introduction & conclusion
Zhi Wang: Seasonality analysis & visualization

Course: EESC/STAT 4243 – Climate Prediction Challenges with Machine Learning
Spring 2025

Overview

This project investigates the use of probabilistic machine learning to reconstruct surface ocean partial pressure of CO₂ (pCO₂) from sparse observations. By implementing NGBoost, we quantify uncertainty in pCO₂ estimates and examine how adding more data — either at existing locations or in new regions — improves model confidence and performance.

The project builds on:

Gloege et al. (2021) – quantifying uncertainty in pCO₂ reconstructions
Bennington et al. (2022) – residual-based ML reconstructions

Motivation

Oceans have absorbed about 38% of anthropogenic CO₂ emissions since the industrial revolution, making them a vital carbon sink. However, future oceanic carbon uptake remains uncertain.

A major barrier to understanding ocean-atmosphere carbon flux is the sparse and uneven distribution of pCO₂ observations, particularly in the Southern Hemisphere and high-latitude oceans.

This study addresses two key questions:

Where are reconstructions statistically reliable or uncertain?
Does adding more observations — in quantity or coverage — improve confidence?

Objectives

✅ Quantify uncertainty and statistical confidence of ML-based pCO₂ reconstructions
✅ Investigate the impact of different sampling strategies:
- Adding more data at existing locations
- Adding new observations in previously unsampled areas
✅ Identify regions where new data would improve model skill the most
✅ Evaluate model performance via metrics and spatial/temporal diagnostics

Model: NGBoost vs XGBoost

| Feature | NGBoost | XGBoost |
| --- | --- | --- |
| Output | Probabilistic (e.g. N(μ, σ²)) | Point prediction |
| Uncertainty Estimation | ✅ Built-in | ❌ Not native |
| Loss Function | LogScore, CRPS | MSE, LogLoss |
| Use Case | Climate, medical, risk-sensitive | General-purpose |

We use NGBoost with a Normal output distribution and the LogScore scoring rule to predict both the mean and standard deviation of pCO₂ values.
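To make the two NGBoost scoring rules in the comparison concrete, here is a minimal sketch of how LogScore and CRPS evaluate a Normal prediction against a single observation. It uses only the standard closed forms for a Gaussian. The function names and the numeric values are hypothetical illustrations, not code or results from this project.

```python
import math


def normal_log_score(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2); lower is better."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5 * z * z


def normal_crps(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) at observation y; lower is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


# Hypothetical pCO2 values in µatm, for illustration only.
y_obs = 380.0
confident = normal_log_score(y_obs, mu=378.0, sigma=5.0)
vague = normal_log_score(y_obs, mu=378.0, sigma=50.0)
overconfident = normal_log_score(y_obs, mu=300.0, sigma=5.0)
```

With the same predicted mean, the narrower prediction scores better (`confident < vague`), while a narrow prediction centered far from the observation is penalized heavily (`overconfident` is the largest of the three). This is why a model trained on such a score has an incentive to report a realistic σ rather than an arbitrarily small one.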

Sampling Scenarios

We compare the baseline SOCAT sampling mask to six enhanced sampling strategies:

1. Add More at Existing Locations

| Mask Name | Description | Increase |
| --- | --- | --- |
| `densify_mean_pattern` | Raise low-sampled locations to global mean | +14% |
| `densify_30p` | Ensure ≥ 7 months per sampled grid cell | +30% |
| `densify_50p` | Ensure ≥ 10 months per sampled grid cell | +50% |
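The "ensure ≥ N months per sampled grid cell" idea can be sketched as follows. This is a simplified illustration, not the project's mask code: it assumes a mask stored as a dictionary from a grid-cell index to the set of sampled month indices, a 12-month window, and a deterministic rule that fills the earliest unsampled months first. The name `densify_mask` and all of these choices are hypothetical.

```python
def densify_mask(mask, min_months, n_months=12):
    """Return a copy of `mask` in which every already-sampled cell has
    at least `min_months` sampled months.

    `mask` maps (lat_idx, lon_idx) -> set of sampled month indices in
    [0, n_months). Cells with no samples are left untouched, so no new
    locations are introduced.
    """
    if min_months > n_months:
        raise ValueError("min_months cannot exceed n_months")
    densified = {}
    for cell, months in mask.items():
        new_months = set(months)
        if new_months:  # only densify cells that already have data
            # Assumption: fill the earliest unsampled months first.
            for m in range(n_months):
                if len(new_months) >= min_months:
                    break
                new_months.add(m)
        densified[cell] = new_months
    return densified


# Toy baseline mask with two sampled cells (hypothetical indices).
baseline = {
    (10, 20): {0, 5},
    (11, 20): {0, 1, 2, 3, 4, 5, 6, 7},
}
densified = densify_mask(baseline, min_months=7)
n_before = sum(len(m) for m in baseline.values())
n_after = sum(len(m) for m in densified.values())
```

In this toy case the sparsely sampled cell is raised to 7 months and the well-sampled cell is unchanged, so the relative increase is `(n_after - n_before) / n_before`. A real mask would more likely be a (time, lat, lon) array, and the choice of which months to fill could be random or follow the existing seasonal pattern.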

2. Add Data in New Locations

| Mask Name | Description | Increase |
| --- | --- | --- |
| `expand_14p` | Add new grid cells in S. Ocean, Indian, Pacific regions | +14% |
| `expand_30p` | 100 new points per basin, moderate sampling | +30% |
| `expand_50p` | 200 new points per basin, dense sampling | +50% |
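Adding a fixed number of new points per basin can be sketched in a similar way. The example below assumes the sampled cells are a set of grid-cell indices and that each basin comes with a list of candidate ocean cells. The function name `expand_mask`, the basin names, and the seeded random selection are hypothetical, chosen only to show the idea of drawing previously unsampled cells.

```python
import random


def expand_mask(sampled_cells, basin_cells, n_new_per_basin, seed=0):
    """Return a new set of cells: the original sampled cells plus up to
    `n_new_per_basin` previously unsampled cells drawn from each basin.

    `basin_cells` maps a basin name -> iterable of (lat_idx, lon_idx).
    """
    rng = random.Random(seed)  # seeded so the mask is reproducible
    expanded = set(sampled_cells)
    for basin in sorted(basin_cells):
        # Only cells that are not yet in the mask are candidates.
        candidates = sorted(c for c in basin_cells[basin] if c not in expanded)
        k = min(n_new_per_basin, len(candidates))
        expanded.update(rng.sample(candidates, k))
    return expanded


# Toy example with hypothetical cell indices.
sampled = {(0, 0), (0, 1)}
basins = {
    "southern_ocean": [(0, 0), (1, 0), (1, 1), (1, 2)],
    "indian": [(2, 0), (2, 1)],
}
expanded = expand_mask(sampled, basins, n_new_per_basin=2)
new_cells = expanded - sampled
```

Here `new_cells` holds four cells, two from each toy basin, and every original cell is retained. How often each new cell is then sampled in time (the "moderate" versus "dense" sampling in the table) would be a separate step not shown here.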

Methodology Summary

Sampling Mask Analysis – visualize SOCAT coverage & define augmentation strategies
Train NGBoost – on residuals (ESM minus truth) using SOCAT-like masks
Reconstruct pCO₂ – for the globe using trained NGBoost
Inverse Transformation – recover full pCO₂ from residuals
Evaluate – spatial metrics (bias, std, corr), uncertainty, p-values
Compare – baseline vs. augmented masks across key metrics
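The evaluation step above names bias, standard deviation, and correlation. As a minimal sketch of what such metrics look like for a single time series or flattened map, the example below compares a reconstruction against a reference "truth". The function name `evaluate`, the reading of "std" as the standard deviation of the errors, and the toy values are assumptions for illustration, not the project's implementation or results.

```python
import math
import statistics


def evaluate(recon, truth):
    """Return mean bias, error standard deviation, and Pearson correlation
    between a reconstruction and the reference values."""
    if len(recon) != len(truth):
        raise ValueError("recon and truth must have the same length")
    errors = [r - t for r, t in zip(recon, truth)]
    bias = statistics.fmean(errors)          # mean of (recon - truth)
    error_std = statistics.pstdev(errors)    # population std of the errors
    r_mean = statistics.fmean(recon)
    t_mean = statistics.fmean(truth)
    cov = sum((r - r_mean) * (t - t_mean) for r, t in zip(recon, truth))
    r_ss = sum((r - r_mean) ** 2 for r in recon)
    t_ss = sum((t - t_mean) ** 2 for t in truth)
    corr = cov / math.sqrt(r_ss * t_ss)
    return {"bias": bias, "error_std": error_std, "corr": corr}


# Hypothetical pCO2 values in µatm, for illustration only.
truth = [360.0, 370.0, 380.0, 390.0]
recon = [362.0, 371.0, 383.0, 392.0]
metrics = evaluate(recon, truth)
```

Computing the same dictionary for the baseline mask and for each augmented mask gives a direct, like-for-like comparison across sampling scenarios.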

Peer Review Instructions:

Clone the repository and navigate to the project directory:
```
git clone https://github.com/spariser/ReconstructOceanCarbonP3G1.git
cd ReconstructOceanCarbonP3G1
```
Run the cells in the `Oceanmixing_Group1.ipynb` notebook to reproduce the analysis.
Enter your username in the `username` variable in the first cell of the notebook before running it.
Restart the kernel if you get the error `lib.config not found`.
As a reviewer, ensure `runthiscell` is set to -1 to reduce run time.

Project Structure

Project3/
├── lib/                          # Helper scripts
│   ├── __init__.py
│   ├── bias_figure2.py           # Code for bias calculation and visualization
│   ├── corr_figure3.py           # Code for correlation calculation and visualization
│   ├── residual_utils.py         # Prepares data for ML; tools for dataset splitting, model evaluation, and saving files
│   ├── group1_utils.py           # Group 1: functions for data preprocessing, including loading and cleaning data, and creating training and test datasets
│   ├── config.py                 # Configuration file for the project containing constants
│   └── visualization.py          # Custom plotting class SpatialMap2 for creating high-quality spatial visualizations with colorbars and map features using Cartopy and Matplotlib
└── notebooks/
    └── Oceanmixing_Group1.ipynb  # Main notebook containing full analysis & data story
