Download, parse, and aggregate CW3E SurfaceMetObs hourly station files into a single time-indexed dataset.
This tool automates retrieval of surface meteorological data hosted by the Center for Western Weather and Water Extremes (CW3E) at UC San Diego/SIO. It supports auto‑discovering per‑station schemas by reading each station’s DataFormat.txt, robustly handles missing files, and exports combined results to CSV and Parquet.
- Auto‑schema: Reads <SITE>/DataFormat.txt to build column names and missing‑value tokens per station (fallback schema provided if missing).
- Flexible ranges: Fetches a continuous range from start year/Julian day to end year/Julian day (inclusive), across years with leap‑year handling.
- 404‑tolerant: Quietly skips missing hours or transient network errors.
- Memory‑only mode: Optionally skips saving raw hourly files and parses each one directly from the HTTP response.
- Clean timeseries: Converts Year + Julian_Day + HHMM (end of averaging) to a proper DatetimeIndex.
- Exports: Writes one combined dataset to CSV and Parquet.
- Python 3.9+
- Packages:
pip install pandas requests pyarrow
pyarrow (or fastparquet) is required for Parquet output.
- Remote path pattern:
https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs/{SITE_UPPER}/{YYYY}/{JJJ}/{site_lower}{YY}{JJJ}.{HH}m
Example:
https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs/SIO/2026/001/sio26001.00m
- Station metadata/schemas:
https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs/{SITE_UPPER}/DataFormat.txt
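The path pattern above can be expressed as a small helper. This is a minimal sketch rather than the script's actual code: the function names are hypothetical, and the 0..23 hour range is an assumption based on the 00m suffix in the example.

```python
BASE_URL = "https://cw3e-datashare.ucsd.edu/CW3E_SurfaceMetObs"


def hourly_file_url(site: str, year: int, jday: int, hour: int) -> str:
    """Build the URL of one hourly file from the documented path pattern."""
    if not 1 <= jday <= 366:
        raise ValueError(f"Julian day out of range: {jday}")
    if not 0 <= hour <= 23:  # assumption: files are named by hour 00..23
        raise ValueError(f"hour out of range: {hour}")
    # Folder names use the upper-case site code, file names the lower-case one.
    return (
        f"{BASE_URL}/{site.upper()}/{year:04d}/{jday:03d}/"
        f"{site.lower()}{year % 100:02d}{jday:03d}.{hour:02d}m"
    )


def dataformat_url(site: str) -> str:
    """Build the URL of a station's DataFormat.txt."""
    return f"{BASE_URL}/{site.upper()}/DataFormat.txt"


print(hourly_file_url("sio", 2026, 1, 0))
print(dataformat_url("sio"))
```

The first call reproduces the example URL shown above for SIO on day 001 of 2026.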
Assuming the script filename is cw3e_surface_download.py:
python cw3e_surface_download.py \
--site_name sio \
--start_year 2026 --start_jday 1 \
--end_year 2026 --end_jday 2 \
--out_folder downloads
Outputs:
downloads/sio_2026j001_to_2026j002.csv
downloads/sio_2026j001_to_2026j002.parquet
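The inclusive start/end range used in the command above, and the output file tag it produces, could be computed along these lines. This is an illustrative sketch only; iter_year_jday and output_tag are hypothetical names, not functions confirmed to exist in the script.

```python
import calendar
from typing import Iterator, Tuple


def days_in_year(year: int) -> int:
    """Return 366 for leap years and 365 otherwise."""
    return 366 if calendar.isleap(year) else 365


def iter_year_jday(
    start_year: int, start_jday: int, end_year: int, end_jday: int
) -> Iterator[Tuple[int, int]]:
    """Yield every (year, Julian day) from start to end, inclusive."""
    for year, jday in ((start_year, start_jday), (end_year, end_jday)):
        if not 1 <= jday <= days_in_year(year):
            raise ValueError(f"Julian day {jday} is not valid for {year}")
    if (start_year, start_jday) > (end_year, end_jday):
        raise ValueError("start must not be after end")
    year, jday = start_year, start_jday
    while (year, jday) <= (end_year, end_jday):
        yield year, jday
        jday += 1
        if jday > days_in_year(year):  # roll over into the next year
            year, jday = year + 1, 1


def output_tag(site: str, start_year: int, start_jday: int,
               end_year: int, end_jday: int) -> str:
    """Build the file tag shared by the CSV and Parquet outputs."""
    return (f"{site.lower()}_{start_year}j{start_jday:03d}"
            f"_to_{end_year}j{end_jday:03d}")


print(list(iter_year_jday(2025, 365, 2026, 2)))
# [(2025, 365), (2026, 1), (2026, 2)]
print(output_tag("sio", 2026, 1, 2026, 2))
# sio_2026j001_to_2026j002
```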
The script tries to fetch and parse DataFormat.txt from the site folder to:
- build column names (leaving the first 4 time columns fixed: Datalogger_ID, Year, Julian_Day, HHMM)
- detect missing‑value tokens (e.g., 99999, -7999, -99.99)
If a site lacks DataFormat.txt or it’s unparseable, the tool falls back to a 13‑field schema with common CW3E variables:
MSLP_mb, Temperature_C, Relative_Humidity_pct, Wind_Speed_mps, Wind_Direction_deg, Solar_Radiation_Wm2, Battery_Voltage_V, Precipitation_mm, Max_Wind_Speed_mps
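To make the fallback schema concrete, the sketch below combines the four fixed time columns with the nine variables listed above (13 fields in total) and replaces missing-value tokens with None. It assumes comma-separated records and uses only the three example tokens; the sample line is invented for illustration, and parse_line is a hypothetical helper, not the script's parser.

```python
TIME_COLUMNS = ("Datalogger_ID", "Year", "Julian_Day", "HHMM")
FALLBACK_VARIABLES = (
    "MSLP_mb", "Temperature_C", "Relative_Humidity_pct", "Wind_Speed_mps",
    "Wind_Direction_deg", "Solar_Radiation_Wm2", "Battery_Voltage_V",
    "Precipitation_mm", "Max_Wind_Speed_mps",
)
FALLBACK_COLUMNS = TIME_COLUMNS + FALLBACK_VARIABLES  # 4 + 9 = 13 fields

# Example tokens from the text, compared numerically so "-7999" and
# "-7999.0" are treated alike.
DEFAULT_NA_VALUES = frozenset({99999.0, -7999.0, -99.99})


def parse_line(line, columns=FALLBACK_COLUMNS,
               na_values=DEFAULT_NA_VALUES, sep=","):
    """Parse one record into a dict, mapping NA tokens to None.

    Assumption: fields are separated by `sep` (a comma here).
    """
    fields = [field.strip() for field in line.strip().split(sep)]
    if len(fields) != len(columns):
        raise ValueError(
            f"expected {len(columns)} fields, got {len(fields)}")
    record = {}
    for name, raw in zip(columns, fields):
        if raw == "":
            record[name] = None
            continue
        value = float(raw)
        if value in na_values:
            record[name] = None
        elif name in TIME_COLUMNS:
            record[name] = int(value)
        else:
            record[name] = value
    return record


# Made-up sample record with a missing solar radiation reading.
sample = "101,2026,1,100,1013.2,15.4,72.0,3.1,270.0,-7999,12.6,0.0,5.2"
print(parse_line(sample))
```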
python cw3e_surface_download.py --site_name sio --print_schema
python cw3e_surface_download.py --site_name sio --print_schema --no_auto_schema

--site_name SITE Station code (e.g., "sio"). Case‑insensitive.
--start_year YYYY Start year (e.g., 2026).
--start_jday DDD Start Julian day (1..365/366).
--end_year YYYY End year (>= start_year).
--end_jday DDD End Julian day.
--out_folder PATH Folder for raw files and CSV/Parquet (default: ./downloads).
--timeout SECONDS HTTP timeout (default: 30).
--delete_unparsed If saving raw files, delete any hourly file that fails to parse.
--no_save_raw Do not save raw hourly files (parse directly from memory).
--no_auto_schema Do not read DataFormat.txt; use built‑in fallback schema.
--dataformat_url URL Override the DataFormat.txt URL (advanced).
--extra_na TOKENS ... Additional NA tokens (e.g., --extra_na -8888 -9999).
--print_schema Print discovered schema & NA tokens, then exit (no download).
-h, --help Show full help and examples.
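For readers who want to see how such options fit together, here is a hedged argparse sketch that mirrors the flags and defaults documented above. It is not the script's actual parser; in particular, making the range options optional so that --print_schema can run without them is an assumption drawn from the --print_schema examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (sketch only)."""
    p = argparse.ArgumentParser(prog="cw3e_surface_download.py")
    # str.lower makes the station code case-insensitive.
    p.add_argument("--site_name", required=True, type=str.lower)
    for name in ("--start_year", "--start_jday", "--end_year", "--end_jday"):
        p.add_argument(name, type=int)
    p.add_argument("--out_folder", default="./downloads")
    p.add_argument("--timeout", type=float, default=30)
    p.add_argument("--delete_unparsed", action="store_true")
    p.add_argument("--no_save_raw", action="store_true")
    p.add_argument("--no_auto_schema", action="store_true")
    p.add_argument("--dataformat_url")
    # Zero or more extra NA tokens, e.g. --extra_na -8888 -9999
    p.add_argument("--extra_na", nargs="*", default=[])
    p.add_argument("--print_schema", action="store_true")
    return p


def parse_args(argv):
    """Parse argv and require the date range unless --print_schema is set."""
    parser = build_parser()
    args = parser.parse_args(argv)
    range_values = (args.start_year, args.start_jday,
                    args.end_year, args.end_jday)
    if not args.print_schema and any(v is None for v in range_values):
        parser.error("start/end year and Julian day are required")
    return args


args = parse_args([
    "--site_name", "SIO",
    "--start_year", "2025", "--start_jday", "365",
    "--end_year", "2026", "--end_jday", "2",
    "--extra_na", "-8888", "-9999",
])
print(args.site_name, args.extra_na, args.out_folder, args.timeout)
```

Because the sketch defines no options that look like negative numbers, argparse accepts values such as -8888 as arguments to --extra_na rather than treating them as flags.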
python cw3e_surface_download.py \
--site_name sio \
--start_year 2026 --start_jday 1 \
--end_year 2026 --end_jday 2 \
--out_folder downloads

python cw3e_surface_download.py \
--site_name sio \
--start_year 2026 --start_jday 1 \
--end_year 2026 --end_jday 2 \
--no_save_raw

python cw3e_surface_download.py \
--site_name sio \
--start_year 2025 --start_jday 365 \
--end_year 2026 --end_jday 2 \
--extra_na -8888 -9999

python cw3e_surface_download.py --site_name sio --print_schema

- CSV: <site>_<startyear>j<startday>_to_<endyear>j<endday>.csv (includes time column)
- Parquet: same tag, DatetimeIndex preserved
The DataFrame columns include Datalogger_ID plus the discovered data variables for that site. The time index reflects the end time of averaging.
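The conversion from Year, Julian_Day, and HHMM to a timestamp can be sketched with the standard library alone. This is an illustration, not the script's implementation (which presumably builds a pandas DatetimeIndex); to_timestamp is a hypothetical name, and the possibility of an end-of-day value such as 2400 is an assumption that the arithmetic below happens to handle.

```python
from datetime import datetime, timedelta


def to_timestamp(year: int, julian_day: int, hhmm: int) -> datetime:
    """Convert Year + Julian_Day + HHMM into a datetime.

    Adding offsets to January 1 avoids month/day lookups and lets a value
    such as 2400 roll over to 00:00 of the following day.
    """
    hours, minutes = divmod(int(hhmm), 100)
    return datetime(int(year), 1, 1) + timedelta(
        days=int(julian_day) - 1, hours=hours, minutes=minutes
    )


print(to_timestamp(2026, 1, 100))    # 2026-01-01 01:00:00
print(to_timestamp(2024, 60, 1230))  # 2024-02-29 12:30:00 (leap year)
print(to_timestamp(2025, 365, 2400)) # 2026-01-01 00:00:00
```

A list of such values can be passed to pandas.DatetimeIndex if a pandas index is needed.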
- The downloader silently skips missing hours (HTTP 404) and transient network errors.
- Hourly tables are concatenated only if their column layouts are identical; otherwise, the combined output is omitted.
- NA tokens include those found in DataFormat.txt plus safe defaults. You can append more via --extra_na.
- The first four fields are always: Datalogger_ID, Year, Julian_Day, HHMM.
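The rule that hourly tables are combined only when their column layouts match can be illustrated as follows. This is a simplified, pandas-free sketch: each table is represented as a (columns, rows) pair, skipped hours as None, and combine_hourly_tables is a hypothetical name.

```python
from typing import List, Optional, Sequence, Tuple

Table = Tuple[Sequence[str], List[list]]


def combine_hourly_tables(tables: List[Optional[Table]]) -> Optional[Table]:
    """Concatenate hourly tables, or return None if layouts differ.

    Entries that are None (hours that were skipped) are ignored.
    """
    present = [t for t in tables if t is not None]
    if not present:
        return None
    first_columns = list(present[0][0])
    # Any mismatch in column names or order means no combined output.
    if any(list(columns) != first_columns for columns, _ in present):
        return None
    combined_rows: List[list] = []
    for _, rows in present:
        combined_rows.extend(rows)
    return first_columns, combined_rows


hour0 = (["Year", "Julian_Day", "HHMM"], [[2026, 1, 100]])
hour1 = (["Year", "Julian_Day", "HHMM"], [[2026, 1, 200]])
odd = (["Year", "HHMM"], [[2026, 300]])

print(combine_hourly_tables([hour0, None, hour1]))
print(combine_hourly_tables([hour0, odd]))  # None: layouts differ
```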
- Add plotting/QA routines (e.g., temp/wind/precip) in a notebook using the Parquet output.
- For large ranges, consider parallelizing by day/hour (e.g., concurrent.futures).
- If your environment lacks pyarrow, comment out Parquet writing or install fastparquet.
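As one possible shape for the parallelization idea above, the sketch below fans hourly downloads out over a thread pool and drops any hour that fails, matching the tool's skip-on-error behavior. It is not part of the script: fetch_all is a hypothetical helper, and fetch_one stands in for whatever function performs the real HTTP request (for example, a wrapper around requests.get with a timeout). The demo uses a fake fetcher so it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple


def fetch_all(
    urls: Iterable[str],
    fetch_one: Callable[[str], Optional[str]],
    max_workers: int = 8,
) -> List[Tuple[str, str]]:
    """Fetch URLs concurrently and return (url, payload) for successes.

    Results keep the input order. A URL is skipped when fetch_one returns
    None or raises; catching Exception broadly is a simplification.
    """
    def safe_fetch(url: str) -> Optional[str]:
        try:
            return fetch_one(url)
        except Exception:
            return None

    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = list(pool.map(safe_fetch, url_list))
    return [(u, p) for u, p in zip(url_list, payloads) if p is not None]


def fake_fetch(url: str) -> Optional[str]:
    """Stand-in for a real downloader: fails for URLs ending in '.03m'."""
    if url.endswith(".03m"):
        raise ConnectionError("simulated missing hour")
    return f"data from {url}"


demo_urls = [f"sio26001.{hour:02d}m" for hour in range(5)]
for url, payload in fetch_all(demo_urls, fake_fetch, max_workers=4):
    print(url, "->", payload)
```

Threads suit this workload because the time is dominated by waiting on HTTP responses; a modest max_workers also keeps the load on the data server reasonable.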
Issues and pull requests are welcome. If you have stations with unusual DataFormat.txt styles, please share examples so we can improve the parser.
This software is Copyright © 2026 The Regents of the University of California. All Rights Reserved.
Data provided by CW3E (Scripps Institution of Oceanography, UC San Diego). Please cite CW3E appropriately in your work and follow their data policies.