CW3E Surface Meteorology Downloader

Download, parse, and aggregate CW3E SurfaceMetObs hourly station files into a single time-indexed dataset.

This tool automates retrieval of surface meteorological data hosted by the Center for Western Weather and Water Extremes (CW3E) at UC San Diego/SIO. It supports auto‑discovering per‑station schemas by reading each station’s DataFormat.txt, robustly handles missing files, and exports combined results to CSV and Parquet.


✨ Features

  • Auto‑schema: Reads <SITE>/DataFormat.txt to build column names and missing‑value tokens per station (the script exits with an error if the file is missing or unparseable).
  • Flexible ranges: Fetches a continuous range from start year/Julian day to end year/Julian day (inclusive), across years with leap‑year handling.
  • 404‑tolerant: Quietly skips hours whose files are missing (HTTP 404) or whose requests fail with transient network errors.
  • Memory‑only mode: Optionally skips saving raw hourly files and parses directly from the HTTP response (--no_save_raw).
  • Clean timeseries: Converts Year + Julian_Day + HHMM (end of averaging) to a proper DatetimeIndex.
  • Exports: Writes one combined dataset to CSV and Parquet.
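
To illustrate the inclusive year/Julian‑day range with leap‑year handling, here is a minimal sketch using only the standard library. The helper name iter_year_jday is hypothetical and is not taken from the script; it only shows one way such a range can be walked.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def iter_year_jday(start_year: int, start_jday: int,
                   end_year: int, end_jday: int) -> Iterator[Tuple[int, int]]:
    """Yield (year, julian_day) for every day in the inclusive range.

    Hypothetical helper: the calendar arithmetic is delegated to
    datetime.date, so leap years (366 days) are handled automatically.
    """
    start = date(start_year, 1, 1) + timedelta(days=start_jday - 1)
    end = date(end_year, 1, 1) + timedelta(days=end_jday - 1)
    day = start
    while day <= end:
        # tm_yday is the 1-based day of the year (the "Julian day" used here).
        yield day.year, day.timetuple().tm_yday
        day += timedelta(days=1)


# Cross-year range: 2025 is not a leap year, so day 365 is its last day.
print(list(iter_year_jday(2025, 365, 2026, 2)))
```

Each (year, julian_day) pair would then be expanded into hourly files for that day.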

📦 Requirements

  • Python 3.9+
  • Packages:
    pip install pandas requests pyarrow

    pyarrow (or fastparquet) is required for Parquet output.


📁 File/URL Conventions

  • Remote path pattern:
    https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs/{SITE_UPPER}/{YYYY}/{JJJ}/{site_lower}{YY}{JJJ}.{HH}m
    
    Example:
    https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs/SIO/2026/001/sio26001.00m
    
  • Station metadata/schemas:
    https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs/{SITE_UPPER}/DataFormat.txt
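
The patterns above can be turned into URLs with plain string formatting. This is a sketch, not the script's own code: the function names hourly_file_url and dataformat_url are hypothetical, and the placeholder widths (four‑digit year, three‑digit Julian day, two‑digit year and hour) are inferred from the example URL.

```python
BASE_URL = "https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs"


def hourly_file_url(site: str, year: int, jday: int, hour: int) -> str:
    """Build the URL of one hourly file (hypothetical helper)."""
    folder = f"{BASE_URL}/{site.upper()}/{year:04d}/{jday:03d}"
    # File name: lowercase site + two-digit year + Julian day + ".HHm".
    filename = f"{site.lower()}{year % 100:02d}{jday:03d}.{hour:02d}m"
    return f"{folder}/{filename}"


def dataformat_url(site: str) -> str:
    """Build the URL of a station's DataFormat.txt (hypothetical helper)."""
    return f"{BASE_URL}/{site.upper()}/DataFormat.txt"


print(hourly_file_url("sio", 2026, 1, 0))
print(dataformat_url("sio"))
```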
    

🚀 Quick Start

Assuming the script filename is cw3e_surfacemet_download.py:

python cw3e_surfacemet_download.py \
  --site_name sio \
  --start_year 2026 --start_jday 1 \
  --end_year   2026 --end_jday   2 \
  --out_folder downloads

Outputs:

  • downloads/sio_2026j001_to_2026j002.csv
  • downloads/sio_2026j001_to_2026j002.parquet

🧠 Auto‑schema via DataFormat.txt

The script tries to fetch and parse DataFormat.txt from the site folder to:

  • build column names
  • identify missing‑value (NA) tokens

If a site lacks DataFormat.txt, or the file cannot be parsed, the script exits with an error.

Print and inspect the schema (no download):

python cw3e_surfacemet_download.py --site_name sio --print_schema

🧰 Command‑line Usage

--site_name SITE        Station code (e.g., "sio"). Case‑insensitive.
--start_year YYYY       Start year (e.g., 2026).
--start_jday DDD        Start Julian day (1..365/366).
--end_year YYYY         End year (>= start_year).
--end_jday DDD          End Julian day.

--out_folder PATH       Folder for raw files and CSV/Parquet (default: ./downloads).
--timeout SECONDS       HTTP timeout (default: 30).

--delete_unparsed       If saving raw files, delete any hourly file that fails to parse.
--no_save_raw           Do not save raw hourly files (parse directly from memory).

--dataformat_url URL    Override the DataFormat.txt URL (advanced).
--print_schema          Print discovered schema & NA tokens, then exit (no download).
-h, --help              Show full help and examples.
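
For readers who want to wrap or reimplement the interface, the flags listed above map naturally onto argparse. The sketch below is an assumption about how such a parser could look, not the script's actual source. In particular, which flags are required, and the use of float for --timeout, are guesses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    p = argparse.ArgumentParser(
        description="Download and aggregate CW3E SurfaceMetObs hourly files.")
    p.add_argument("--site_name", required=True, help='Station code, e.g. "sio".')
    # Date flags are left optional here because --print_schema needs no range.
    p.add_argument("--start_year", type=int)
    p.add_argument("--start_jday", type=int)
    p.add_argument("--end_year", type=int)
    p.add_argument("--end_jday", type=int)
    p.add_argument("--out_folder", default="./downloads")
    p.add_argument("--timeout", type=float, default=30)
    p.add_argument("--delete_unparsed", action="store_true")
    p.add_argument("--no_save_raw", action="store_true")
    p.add_argument("--dataformat_url", default=None)
    p.add_argument("--print_schema", action="store_true")
    return p


args = build_parser().parse_args(["--site_name", "sio", "--print_schema"])
print(args.site_name, args.out_folder, args.timeout, args.print_schema)
```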

💡 Examples

1) Download SIO 2026 JD 001–002 (save raw files)

python cw3e_surfacemet_download.py \
  --site_name sio \
  --start_year 2026 --start_jday 1 \
  --end_year   2026 --end_jday   2 \
  --out_folder downloads

2) Same range, memory‑only (no raw files)

python cw3e_surfacemet_download.py \
  --site_name sio \
  --start_year 2026 --start_jday 1 \
  --end_year   2026 --end_jday   2 \
  --no_save_raw

3) Cross‑year range

python cw3e_surfacemet_download.py \
  --site_name sio \
  --start_year 2025 --start_jday 365 \
  --end_year   2026 --end_jday   2

4) Inspect schema only

python cw3e_surfacemet_download.py --site_name sio --print_schema

📤 Outputs

  • CSV: <site>_<startyear>j<startday>_to_<endyear>j<endday>.csv (includes time column)
  • Parquet: same tag, DatetimeIndex preserved

The DataFrame columns include Datalogger_ID plus the discovered data variables for that site. The time index reflects the end time of averaging.
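
As an illustration of how Year + Julian_Day + HHMM can become a DatetimeIndex, here is a minimal pandas sketch. The function name build_time_index is hypothetical, the column names are taken from the feature list, and the treatment of an HHMM value of 2400 (rolling over to midnight of the next day) is an assumption rather than documented behavior.

```python
import pandas as pd


def build_time_index(df: pd.DataFrame) -> pd.DatetimeIndex:
    """Combine Year, Julian_Day, and HHMM columns into timestamps.

    Hypothetical helper; each timestamp marks the end of the averaging period.
    """
    year = df["Year"].astype(int)
    jday = df["Julian_Day"].astype(int)
    hhmm = df["HHMM"].astype(int)

    # Midnight at the start of each day, parsed from "YYYYJJJ" strings.
    day_start = pd.to_datetime(
        year.astype(str) + jday.astype(str).str.zfill(3), format="%Y%j")

    # Add hours and minutes as offsets, so 2400 lands on the next day.
    stamps = (day_start
              + pd.to_timedelta(hhmm // 100, unit="h")
              + pd.to_timedelta(hhmm % 100, unit="min"))
    return pd.DatetimeIndex(stamps, name="time")


sample = pd.DataFrame({"Year": [2026, 2026],
                       "Julian_Day": [1, 1],
                       "HHMM": [2, 2400]})
print(build_time_index(sample))
```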


🔎 Notes & Behavior

  • The downloader silently skips missing hours (HTTP 404) and transient network errors.
  • Hourly tables are concatenated only if their column layouts are identical; otherwise, no combined output is written.
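
The two behaviors above can be sketched as follows. This is an assumption about one reasonable implementation, not the script's code: fetch_text and combine_hourly are hypothetical names, and treating every non‑200 status as "skip" is a simplification.

```python
from typing import List, Optional

import pandas as pd
import requests


def fetch_text(url: str, timeout: float = 30.0, session=None) -> Optional[str]:
    """Return the response body, or None if the hour should be skipped."""
    http = session or requests  # anything exposing .get(url, timeout=...)
    try:
        resp = http.get(url, timeout=timeout)
    except requests.RequestException:
        return None  # transient network error: skip this hour
    if resp.status_code != 200:
        return None  # includes HTTP 404 for missing hours
    return resp.text


def combine_hourly(frames: List[pd.DataFrame]) -> Optional[pd.DataFrame]:
    """Concatenate hourly tables only when all column layouts match."""
    if not frames:
        return None
    layout = list(frames[0].columns)
    if any(list(f.columns) != layout for f in frames[1:]):
        return None  # mismatched layouts: no combined output
    return pd.concat(frames).sort_index()
```

Passing a requests.Session as session reuses connections across the many hourly requests in a long range.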

🧪 Tips & Extensions

  • Add plotting/QA routines (e.g., temp/wind/precip) in a notebook using the Parquet output.
  • For large ranges, consider parallelizing by day/hour (e.g., concurrent.futures).
  • If your environment lacks pyarrow, comment out Parquet writing or install fastparquet.
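
As a starting point for the parallelization tip, the sketch below fans hourly downloads out over a thread pool. The name fetch_many and its fetch_one parameter are hypothetical; fetch_one stands for any callable that returns the file contents or None for a skipped hour. A small max_workers value is a sensible default for a shared data server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def fetch_many(urls: List[str],
               fetch_one: Callable[[str], Optional[str]],
               max_workers: int = 8) -> Dict[str, str]:
    """Fetch many URLs concurrently, dropping the ones that were skipped.

    Hypothetical helper: threads suit this workload because the time is
    spent waiting on the network rather than on computation.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = list(pool.map(fetch_one, urls))
    return {url: text for url, text in zip(urls, results) if text is not None}


# Stand-in fetcher for demonstration; a real one would perform HTTP requests.
demo = fetch_many(["a", "b", "c"],
                  lambda u: None if u == "b" else u.upper(),
                  max_workers=2)
print(demo)
```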

🤝 Contributing

Issues and pull requests are welcome. If you have stations with unusual DataFormat.txt styles, please share examples so we can improve the parser.


📄 License

This software is Copyright © 2026 The Regents of the University of California. All Rights Reserved.


🙏 Acknowledgments

Data provided by CW3E (Scripps Institution of Oceanography, UC San Diego). Please cite CW3E appropriately in your work and follow their data policies.