From cb90a8e1d6b0b6b0411ae2d9c84ffc068a231873 Mon Sep 17 00:00:00 2001 From: Jia Yu Date: Sun, 3 May 2026 22:48:09 -0700 Subject: [PATCH 1/2] [GH-2700] Add 03-fire-risk-fusion.ipynb: raster + vector fusion Continues the docker-notebook refresh series (issue #2700). The first notebook in the series that mixes Sedona's raster algebra, raster->vector zonal aggregation, and vector-on-vector distance joins in one pipeline - the workflow GeoPandas alone can't do. Pipeline answers "given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route": 1. Synthesize slope.tif + fuel.tif as 256x256 single-band float32 tiled GeoTIFFs in /tmp/fire-risk/ (slope highest east, fuel highest north). 2. Load both with sedona.read.format("raster"), keep (x, y) tile-index columns, join on those, compute composite risk via two-raster RS_MapAlgebra (`0.5 * slope + 0.5 * fuel`). Same SQL works for single-tile inputs (as here) and for multi-tile DEM-sized scenes. 3. Build a 4x4 grid of building polygons + two bisector LINESTRING roads as Spark DataFrames. 4. Compute distance from each building to its nearest road via MIN(ST_DistanceSpheroid) over the cross product (metres regardless of EPSG:4326 lon/lat units). 5. Score each building as `RS_ZonalStats(composite, footprint, 'mean') * (1 + min(dist_km, 5)/5)`. Multiplicative form means a building only ranks high when it has both high terrain risk and poor road access. 6. Rank, write top-5 as GeoParquet 1.1 (auto covering-bbox + projjson), round-trip read back to verify. 7. matplotlib panel: composite risk as basemap, footprints filled by risk_score (red = high), roads overlaid, top-5 buildings labeled. The slope-east + fuel-north synthesis means corner buildings carry the highest composite risk; the multiplicative evacuation factor then favours buildings far from the bisector roads. Built-in ground truth: B33 (NE corner) should rank top, which the harness confirms. All inputs synthesized in /tmp - no new data shipped, no network. Notebook intro flags `Requires Sedona >= 1.9.0` for the auto-tiling raster reader. Verified end-to-end via the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, DRIVER_MEM=4g, local[*], Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coords): PASS 03-fire-risk-fusion 19s elapsed. Output sanity- checked: top building B33 (NE corner) with mean_risk=0.8772 and risk_score=1.7543; ranking decreases through the NE quadrant down to B00 (SW corner) with mean_risk=0.1244 - matches the synthesis design. GeoParquet round-trip read back the top-5 correctly. --- docs/usecases/03-fire-risk-fusion.ipynb | 404 ++++++++++++++++++++++++ 1 file changed, 404 insertions(+) create mode 100644 docs/usecases/03-fire-risk-fusion.ipynb diff --git a/docs/usecases/03-fire-risk-fusion.ipynb b/docs/usecases/03-fire-risk-fusion.ipynb new file mode 100644 index 00000000000..8b3117f2c69 --- /dev/null +++ b/docs/usecases/03-fire-risk-fusion.ipynb @@ -0,0 +1,404 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "# Wildfire risk score per building\n", + "\n", + "A raster + vector fusion workflow. We answer:\n", + "\n", + "> **Given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route.**\n", + "\n", + "This is the workflow GeoPandas alone can't do — it mixes raster algebra, raster→vector aggregation, and vector-on-vector distance joins in one pipeline. Sedona makes it one SQL block.\n", + "\n", + "**Pipeline:**\n", + "1. Synthesize a slope raster and a fuel-load raster.\n", + "2. Combine into a composite risk raster with two-raster `RS_MapAlgebra`.\n", + "3. Synthesize building footprints (4×4 grid of polygons) and a small road network.\n", + "4. Score each building with `RS_ZonalStats(composite, footprint, 'mean')` × evacuation factor from `ST_DistanceSpheroid` to the nearest road.\n", + "5. Rank, write top-5 as GeoParquet, plot.\n", + "\n", + "All inputs are synthesized in the notebook so it's fully reproducible and ships no new bytes. Swap the synthesis cell for `sedona.read.format(\"raster\").load(\"...\")` over a real DEM-derived slope raster and a NLCD-derived fuel raster; everything below is unchanged.\n", + "\n", + "**Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader and proj4sedona-backed `ST_Transform`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Connect to Spark through SedonaContext" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sedona.spark import SedonaContext\n", + "\n", + "config = SedonaContext.builder().master(\"spark://localhost:7077\").getOrCreate()\n", + "sedona = SedonaContext.create(config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Synthesize the raster inputs\n", + "\n", + "Two 256×256 single-band rasters over a 0.1° AOI in eastern Iowa:\n", + "\n", + "- `slope.tif` — normalized 0-1, highest along the eastern edge (think: foothills rising toward the east).\n", + "- `fuel.tif` — normalized 0-1, highest along the northern edge (think: dense vegetation belt across the top of the AOI).\n", + "\n", + "Both written as **tiled** GeoTIFFs so the Sedona raster reader (which rejects strip-based files as \"too thin\") can ingest them." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import numpy as np\n", + "import rasterio\n", + "from rasterio.transform import from_bounds\n", + "\n", + "WORK = \"/tmp/fire-risk\"\n", + "os.makedirs(WORK, exist_ok=True)\n", + "\n", + "AOI = (-91.10, 41.50, -91.00, 41.60) # (xmin, ymin, xmax, ymax) EPSG:4326\n", + "W = H = 256\n", + "transform = from_bounds(*AOI, W, H)\n", + "rng = np.random.default_rng(7)\n", + "\n", + "ys, xs = np.mgrid[0:H, 0:W]\n", + "slope = (xs / (W - 1)) + 0.05 * rng.standard_normal((H, W))\n", + "fuel = ((H - 1 - ys) / (H - 1)) + 0.05 * rng.standard_normal((H, W))\n", + "slope = slope.clip(0, 1).astype(\"float32\")\n", + "fuel = fuel.clip(0, 1).astype(\"float32\")\n", + "\n", + "for name, arr in [(\"slope\", slope), (\"fuel\", fuel)]:\n", + " with rasterio.open(\n", + " f\"{WORK}/{name}.tif\",\n", + " \"w\",\n", + " driver=\"GTiff\",\n", + " tiled=True,\n", + " blockxsize=256,\n", + " blockysize=256,\n", + " height=H,\n", + " width=W,\n", + " count=1,\n", + " dtype=\"float32\",\n", + " crs=\"EPSG:4326\",\n", + " transform=transform,\n", + " ) as dst:\n", + " dst.write(arr, 1)\n", + " dst.set_band_description(1, name)\n", + " print(f\"{name}.tif: min={arr.min():.2f} max={arr.max():.2f} mean={arr.mean():.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Combine slope and fuel into a composite risk raster\n", + "\n", + "Load both rasters with the auto-tiling reader, keep the `(x, y)` tile-index columns, then join on those for a tile-aligned two-raster `RS_MapAlgebra`. The same query works for single-tile inputs (as here) or for a DEM-sized scene that yields thousands of tiles per layer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rasters = (\n", + " sedona.read.format(\"raster\")\n", + " .load(f\"{WORK}/*.tif\")\n", + " .selectExpr(\"split(name, '\\\\\\\\.')[0] as layer\", \"x\", \"y\", \"rast\")\n", + ")\n", + "rasters.createOrReplaceTempView(\"rasters\")\n", + "\n", + "composite = sedona.sql(\"\"\"\n", + " SELECT s.x, s.y,\n", + " RS_MapAlgebra(\n", + " s.rast, f.rast, 'D',\n", + " 'out[0] = 0.5 * rast0[0] + 0.5 * rast1[0];',\n", + " -9999.0\n", + " ) AS rast\n", + " FROM (SELECT x, y, rast FROM rasters WHERE layer = 'slope') s\n", + " JOIN (SELECT x, y, rast FROM rasters WHERE layer = 'fuel') f\n", + " ON s.x = f.x AND s.y = f.y\n", + "\"\"\")\n", + "composite.cache()\n", + "composite.createOrReplaceTempView(\"composite\")\n", + "composite.selectExpr(\n", + " \"x\",\n", + " \"y\",\n", + " \"RS_BandPixelType(rast, 1) as dtype\",\n", + " \"RS_Width(rast) as w\",\n", + " \"RS_Height(rast) as h\",\n", + ").show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Synthesize building footprints and road network\n", + "\n", + "Sixteen small square buildings on a 4×4 grid across the AOI, plus two roads (one east-west across the middle, one north-south down the centre). Buildings are 0.005° × 0.005° (~ 500 m × 500 m at this latitude); roads are simple `LINESTRING`s that bisect the AOI." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pyspark.sql import Row\n", + "\n", + "xmin, ymin, xmax, ymax = AOI\n", + "step_x = (xmax - xmin) / 4\n", + "step_y = (ymax - ymin) / 4\n", + "half = 0.0025 # building half-width in degrees\n", + "\n", + "building_rows = []\n", + "for ix in range(4):\n", + " for iy in range(4):\n", + " cx = xmin + (ix + 0.5) * step_x\n", + " cy = ymin + (iy + 0.5) * step_y\n", + " wkt = (\n", + " f\"POLYGON(({cx-half} {cy-half}, {cx+half} {cy-half}, \"\n", + " f\"{cx+half} {cy+half}, {cx-half} {cy+half}, {cx-half} {cy-half}))\"\n", + " )\n", + " building_rows.append(Row(bid=f\"B{ix}{iy}\", wkt=wkt))\n", + "\n", + "buildings = sedona.createDataFrame(building_rows).selectExpr(\n", + " \"bid\", \"ST_GeomFromText(wkt) as geom\"\n", + ")\n", + "buildings.createOrReplaceTempView(\"buildings\")\n", + "\n", + "mid_x = (xmin + xmax) / 2\n", + "mid_y = (ymin + ymax) / 2\n", + "road_rows = [\n", + " Row(road_id=\"EW\", wkt=f\"LINESTRING({xmin} {mid_y}, {xmax} {mid_y})\"),\n", + " Row(road_id=\"NS\", wkt=f\"LINESTRING({mid_x} {ymin}, {mid_x} {ymax})\"),\n", + "]\n", + "roads = sedona.createDataFrame(road_rows).selectExpr(\n", + " \"road_id\", \"ST_GeomFromText(wkt) as geom\"\n", + ")\n", + "roads.createOrReplaceTempView(\"roads\")\n", + "\n", + "print(f\"{buildings.count()} buildings, {roads.count()} roads\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Score every building\n", + "\n", + "Two SQL passes:\n", + "\n", + "- **Distance to nearest road** — group `(building × road)` cross product by building, take the `MIN(ST_DistanceSpheroid)`. Spheroidal distance returns metres regardless of the geometries' EPSG:4326 lon/lat units.\n", + "- **Risk score** — `mean composite risk × (1 + min(dist_to_road_km, 5) / 5)`. The clamp at 5 km caps the evacuation penalty so a single distant building doesn't dominate; the multiplicative form means a building only ranks high when it has *both* high terrain risk and poor road access." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "buildings_with_dist = sedona.sql(\"\"\"\n", + " SELECT b.bid, b.geom,\n", + " MIN(ST_DistanceSpheroid(b.geom, r.geom)) AS dist_m\n", + " FROM buildings b, roads r\n", + " GROUP BY b.bid, b.geom\n", + "\"\"\")\n", + "buildings_with_dist.createOrReplaceTempView(\"buildings_with_dist\")\n", + "\n", + "scored = sedona.sql(\"\"\"\n", + " SELECT b.bid,\n", + " ROUND(RS_ZonalStats(c.rast, b.geom, 'mean'), 4) AS mean_risk,\n", + " ROUND(b.dist_m / 1000.0, 2) AS dist_km,\n", + " ROUND(\n", + " RS_ZonalStats(c.rast, b.geom, 'mean')\n", + " * (1.0 + LEAST(b.dist_m, 5000.0) / 5000.0),\n", + " 4\n", + " ) AS risk_score,\n", + " b.geom\n", + " FROM buildings_with_dist b, composite c\n", + " ORDER BY risk_score DESC\n", + "\"\"\")\n", + "scored.cache()\n", + "scored.createOrReplaceTempView(\"scored\")\n", + "scored.select(\"bid\", \"mean_risk\", \"dist_km\", \"risk_score\").show(16)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Persist the top-5 highest-risk buildings\n", + "\n", + "Write top-5 to GeoParquet 1.1 — covering-bbox metadata and projjson CRS auto-populated — and read it back to confirm the round-trip." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "out_path = f\"{WORK}/top_risk_buildings.parquet\"\n", + "if os.path.exists(out_path):\n", + " shutil.rmtree(out_path)\n", + "\n", + "top5 = sedona.sql(\"SELECT * FROM scored LIMIT 5\")\n", + "top5.coalesce(1).write.format(\"geoparquet\").save(out_path)\n", + "\n", + "sedona.read.format(\"geoparquet\").load(out_path).select(\n", + " \"bid\", \"mean_risk\", \"dist_km\", \"risk_score\"\n", + ").show(truncate=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Visualize composite risk + scored buildings + roads\n", + "\n", + "One matplotlib axes: composite risk raster as the basemap, building footprints filled by `risk_score` (red = high), roads overlaid as black lines, top-5 buildings labelled." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.patches as mpatches\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.cm import ScalarMappable\n", + "from matplotlib.colors import Normalize\n", + "\n", + "scored_pdf = scored.selectExpr(\n", + " \"bid\", \"mean_risk\", \"dist_km\", \"risk_score\", \"ST_AsText(geom) as wkt\"\n", + ").toPandas()\n", + "top_ids = set(scored_pdf.head(5)[\"bid\"])\n", + "\n", + "composite_arr = 0.5 * slope + 0.5 * fuel # same expression used in SQL\n", + "\n", + "fig, ax = plt.subplots(1, 1, figsize=(7, 7))\n", + "extent = (AOI[0], AOI[2], AOI[1], AOI[3])\n", + "ax.imshow(composite_arr, vmin=0, vmax=1, cmap=\"YlOrRd\", extent=extent, alpha=0.7)\n", + "\n", + "norm = Normalize(\n", + " vmin=scored_pdf[\"risk_score\"].min(), vmax=scored_pdf[\"risk_score\"].max()\n", + ")\n", + "cmap = plt.colormaps.get_cmap(\"Reds\")\n", + "for _, row in scored_pdf.iterrows():\n", + " bbox = row[\"wkt\"].split(\"((\")[1].split(\"))\")[0]\n", + " pts = [tuple(float(x) for x in p.strip().split()) for p in bbox.split(\",\")]\n", + " xs_b = [p[0] for p in pts]\n", + " ys_b = [p[1] for p in pts]\n", + " ax.fill(\n", + " xs_b,\n", + " ys_b,\n", + " facecolor=cmap(norm(row[\"risk_score\"])),\n", + " edgecolor=\"black\",\n", + " linewidth=0.6,\n", + " )\n", + " if row[\"bid\"] in top_ids:\n", + " ax.text(\n", + " sum(xs_b[:-1]) / 4,\n", + " sum(ys_b[:-1]) / 4,\n", + " row[\"bid\"],\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=8,\n", + " fontweight=\"bold\",\n", + " )\n", + "\n", + "ax.plot([AOI[0], AOI[2]], [(AOI[1] + AOI[3]) / 2] * 2, \"k-\", linewidth=2, alpha=0.6)\n", + "ax.plot([(AOI[0] + AOI[2]) / 2] * 2, [AOI[1], AOI[3]], \"k-\", linewidth=2, alpha=0.6)\n", + "ax.set_title(\"Composite risk + buildings (red = top-5 by score)\")\n", + "ax.set_xlabel(\"longitude\")\n", + "ax.set_ylabel(\"latitude\")\n", + "fig.colorbar(\n", + " ScalarMappable(norm=norm, cmap=cmap), ax=ax, label=\"risk_score\", fraction=0.04\n", + ")\n", + "fig.tight_layout()\n", + "fig" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What just happened?\n", + "\n", + "Six SQL passes turned synthetic terrain + footprint inputs into a ranked, persisted, plotted wildfire-risk report:\n", + "\n", + "1. Loaded two synthesized GeoTIFFs with `sedona.read.format(\"raster\")`.\n", + "2. Joined them on tile coordinates and combined via two-raster `RS_MapAlgebra` into a composite risk raster.\n", + "3. Built building polygons and road `LINESTRING`s as Spark DataFrames.\n", + "4. `MIN(ST_DistanceSpheroid)` over the building × road cross product gave each building's distance to its nearest road, in metres.\n", + "5. `RS_ZonalStats(composite, footprint, 'mean')` × evacuation factor produced a per-building risk score.\n", + "6. Wrote top-5 to GeoParquet 1.1, read back to verify, plotted everything on a single matplotlib axes.\n", + "\n", + "The slope-east-fuel-north synthesis means buildings in the **north-east corner** carry the highest composite risk; the multiplicative evacuation factor then favours buildings that are also far from one of the two bisector roads. The ranking from the harness pass picks the corner buildings as expected — same SQL runs unchanged on a county-sized DEM + the OSM road network + Overture buildings, just with different inputs." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From d1d895db58661c9b5bd0b0b18c157dce26f859e1 Mon Sep 17 00:00:00 2001 From: Jia Yu Date: Mon, 4 May 2026 08:06:22 -0700 Subject: [PATCH 2/2] [GH-2700] 03-fire-risk-fusion: address review (multi-tile, ST_KNN, SRID, version note) --- docs/usecases/03-fire-risk-fusion.ipynb | 106 +++--------------------- 1 file changed, 13 insertions(+), 93 deletions(-) diff --git a/docs/usecases/03-fire-risk-fusion.ipynb b/docs/usecases/03-fire-risk-fusion.ipynb index 8b3117f2c69..a91161d5e38 100644 --- a/docs/usecases/03-fire-risk-fusion.ipynb +++ b/docs/usecases/03-fire-risk-fusion.ipynb @@ -3,45 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "\n", - "\n", - "# Wildfire risk score per building\n", - "\n", - "A raster + vector fusion workflow. We answer:\n", - "\n", - "> **Given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route.**\n", - "\n", - "This is the workflow GeoPandas alone can't do — it mixes raster algebra, raster→vector aggregation, and vector-on-vector distance joins in one pipeline. Sedona makes it one SQL block.\n", - "\n", - "**Pipeline:**\n", - "1. Synthesize a slope raster and a fuel-load raster.\n", - "2. Combine into a composite risk raster with two-raster `RS_MapAlgebra`.\n", - "3. Synthesize building footprints (4×4 grid of polygons) and a small road network.\n", - "4. Score each building with `RS_ZonalStats(composite, footprint, 'mean')` × evacuation factor from `ST_DistanceSpheroid` to the nearest road.\n", - "5. Rank, write top-5 as GeoParquet, plot.\n", - "\n", - "All inputs are synthesized in the notebook so it's fully reproducible and ships no new bytes. Swap the synthesis cell for `sedona.read.format(\"raster\").load(\"...\")` over a real DEM-derived slope raster and a NLCD-derived fuel raster; everything below is unchanged.\n", - "\n", - "**Requires Sedona ≥ 1.9.0** for the auto-tiling raster reader and proj4sedona-backed `ST_Transform`." - ] + "source": "\n\n# Wildfire risk score per building\n\nA raster + vector fusion workflow. We answer:\n\n> **Given a county's terrain steepness, fuel load, building footprints, and road network, score every building for wildfire risk weighted by distance from the nearest evacuation route.**\n\nThis is the workflow GeoPandas alone can't do \u2014 it mixes raster algebra, raster\u2192vector aggregation, and vector-on-vector distance joins in one pipeline. Sedona makes it one SQL block.\n\n**Pipeline:**\n1. Synthesize a slope raster and a fuel-load raster.\n2. Combine into a composite risk raster with two-raster `RS_MapAlgebra`.\n3. Synthesize building footprints (4\u00d74 grid of polygons) and a small road network.\n4. Score each building with **per-tile** `RS_ZonalStats(composite, footprint, 'sum'|'count')` aggregated to a per-building mean \u00d7 evacuation factor from `ST_DistanceSpheroid` to the nearest road found via `ST_KNN`.\n5. Rank, write top-5 as GeoParquet, plot.\n\nAll inputs are synthesized in the notebook so it's fully reproducible and ships no new bytes. Swap the synthesis cell for `sedona.read.format(\"raster\").load(\"...\")` over a real DEM-derived slope raster and a NLCD-derived fuel raster \u2014 and swap the small road `LINESTRING`s for a `sedona.read` of OSM PBF \u2014 and everything below is unchanged.\n\n**Requires Sedona \u2265 1.9.0** for the auto-tiling raster reader (GH-2672) and for the GeoParquet 1.1 writer's auto-populated covering-bbox + projjson CRS metadata (GH-2646, GH-2664)." }, { "cell_type": "markdown", @@ -68,10 +30,10 @@ "source": [ "## 2. Synthesize the raster inputs\n", "\n", - "Two 256×256 single-band rasters over a 0.1° AOI in eastern Iowa:\n", + "Two 256\u00d7256 single-band rasters over a 0.1\u00b0 AOI in eastern Iowa:\n", "\n", - "- `slope.tif` — normalized 0-1, highest along the eastern edge (think: foothills rising toward the east).\n", - "- `fuel.tif` — normalized 0-1, highest along the northern edge (think: dense vegetation belt across the top of the AOI).\n", + "- `slope.tif` \u2014 normalized 0-1, highest along the eastern edge (think: foothills rising toward the east).\n", + "- `fuel.tif` \u2014 normalized 0-1, highest along the northern edge (think: dense vegetation belt across the top of the AOI).\n", "\n", "Both written as **tiled** GeoTIFFs so the Sedona raster reader (which rejects strip-based files as \"too thin\") can ingest them." ] @@ -172,7 +134,7 @@ "source": [ "## 4. Synthesize building footprints and road network\n", "\n", - "Sixteen small square buildings on a 4×4 grid across the AOI, plus two roads (one east-west across the middle, one north-south down the centre). Buildings are 0.005° × 0.005° (~ 500 m × 500 m at this latitude); roads are simple `LINESTRING`s that bisect the AOI." + "Sixteen small square buildings on a 4\u00d74 grid across the AOI, plus two roads (one east-west across the middle, one north-south down the centre). Buildings are 0.005\u00b0 \u00d7 0.005\u00b0 (~ 500 m \u00d7 500 m at this latitude); roads are simple `LINESTRING`s that bisect the AOI." ] }, { @@ -226,8 +188,8 @@ "\n", "Two SQL passes:\n", "\n", - "- **Distance to nearest road** — group `(building × road)` cross product by building, take the `MIN(ST_DistanceSpheroid)`. Spheroidal distance returns metres regardless of the geometries' EPSG:4326 lon/lat units.\n", - "- **Risk score** — `mean composite risk × (1 + min(dist_to_road_km, 5) / 5)`. The clamp at 5 km caps the evacuation penalty so a single distant building doesn't dominate; the multiplicative form means a building only ranks high when it has *both* high terrain risk and poor road access." + "- **Distance to nearest road** \u2014 group `(building \u00d7 road)` cross product by building, take the `MIN(ST_DistanceSpheroid)`. Spheroidal distance returns metres regardless of the geometries' EPSG:4326 lon/lat units.\n", + "- **Risk score** \u2014 `mean composite risk \u00d7 (1 + min(dist_to_road_km, 5) / 5)`. The clamp at 5 km caps the evacuation penalty so a single distant building doesn't dominate; the multiplicative form means a building only ranks high when it has *both* high terrain risk and poor road access." ] }, { @@ -235,61 +197,19 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "buildings_with_dist = sedona.sql(\"\"\"\n", - " SELECT b.bid, b.geom,\n", - " MIN(ST_DistanceSpheroid(b.geom, r.geom)) AS dist_m\n", - " FROM buildings b, roads r\n", - " GROUP BY b.bid, b.geom\n", - "\"\"\")\n", - "buildings_with_dist.createOrReplaceTempView(\"buildings_with_dist\")\n", - "\n", - "scored = sedona.sql(\"\"\"\n", - " SELECT b.bid,\n", - " ROUND(RS_ZonalStats(c.rast, b.geom, 'mean'), 4) AS mean_risk,\n", - " ROUND(b.dist_m / 1000.0, 2) AS dist_km,\n", - " ROUND(\n", - " RS_ZonalStats(c.rast, b.geom, 'mean')\n", - " * (1.0 + LEAST(b.dist_m, 5000.0) / 5000.0),\n", - " 4\n", - " ) AS risk_score,\n", - " b.geom\n", - " FROM buildings_with_dist b, composite c\n", - " ORDER BY risk_score DESC\n", - "\"\"\")\n", - "scored.cache()\n", - "scored.createOrReplaceTempView(\"scored\")\n", - "scored.select(\"bid\", \"mean_risk\", \"dist_km\", \"risk_score\").show(16)" - ] + "source": "from pyspark.sql import Row\n\nxmin, ymin, xmax, ymax = AOI\nstep_x = (xmax - xmin) / 4\nstep_y = (ymax - ymin) / 4\nhalf = 0.0025 # building half-width in degrees\n\nbuilding_rows = []\nfor ix in range(4):\n for iy in range(4):\n cx = xmin + (ix + 0.5) * step_x\n cy = ymin + (iy + 0.5) * step_y\n wkt = (\n f\"POLYGON(({cx-half} {cy-half}, {cx+half} {cy-half}, \"\n f\"{cx+half} {cy+half}, {cx-half} {cy+half}, {cx-half} {cy-half}))\"\n )\n building_rows.append(Row(bid=f\"B{ix}{iy}\", wkt=wkt))\n\n# Set SRID=4326 on every geometry so the GeoParquet 1.1 writer can produce\n# projjson CRS metadata when we persist the top-5 in section 6.\nbuildings = sedona.createDataFrame(building_rows).selectExpr(\n \"bid\", \"ST_SetSRID(ST_GeomFromText(wkt), 4326) as geom\"\n)\nbuildings.createOrReplaceTempView(\"buildings\")\n\nmid_x = (xmin + xmax) / 2\nmid_y = (ymin + ymax) / 2\nroad_rows = [\n Row(road_id=\"EW\", wkt=f\"LINESTRING({xmin} {mid_y}, {xmax} {mid_y})\"),\n Row(road_id=\"NS\", wkt=f\"LINESTRING({mid_x} {ymin}, {mid_x} {ymax})\"),\n]\nroads = sedona.createDataFrame(road_rows).selectExpr(\n \"road_id\", \"ST_SetSRID(ST_GeomFromText(wkt), 4326) as geom\"\n)\nroads.createOrReplaceTempView(\"roads\")\n\nprint(f\"{buildings.count()} buildings, {roads.count()} roads\")" }, { "cell_type": "markdown", "metadata": {}, - "source": [ - "## 6. Persist the top-5 highest-risk buildings\n", - "\n", - "Write top-5 to GeoParquet 1.1 — covering-bbox metadata and projjson CRS auto-populated — and read it back to confirm the round-trip." - ] + "source": "## 5. Score every building\n\nThree SQL passes, each chosen to scale beyond this 16-building / 2-road toy:\n\n- **Distance to nearest road** \u2014 `ST_KNN(building, road, 1, false)` is an indexed nearest-neighbour join. For each building it returns its single nearest road in `O((B + R) log R)` instead of the `O(B \u00d7 R)` cartesian product. We then read off the actual metres distance with `ST_DistanceSpheroid`.\n- **Per-tile zonal stats** \u2014 `RS_ZonalStats(composite, footprint, 'sum')` and `'count'` per `(building, tile)`, gated by `RS_Intersects` so only intersecting tiles are visited. With a single-tile composite (as here) each building gets one tile; with a multi-tile DEM you get many.\n- **Aggregate to a per-building mean** \u2014 `SUM(tile_sum) / SUM(tile_cnt)` per building gives the correct pixel-count-weighted mean risk regardless of how many tiles the composite was sliced into.\n\nThe final score is `mean_risk \u00d7 (1 + min(dist_to_road_km, 5) / 5)`. The clamp at 5 km caps the evacuation penalty so a single distant building doesn't dominate; the multiplicative form means a building only ranks high when it has *both* high terrain risk and poor road access." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "import shutil\n", - "\n", - "out_path = f\"{WORK}/top_risk_buildings.parquet\"\n", - "if os.path.exists(out_path):\n", - " shutil.rmtree(out_path)\n", - "\n", - "top5 = sedona.sql(\"SELECT * FROM scored LIMIT 5\")\n", - "top5.coalesce(1).write.format(\"geoparquet\").save(out_path)\n", - "\n", - "sedona.read.format(\"geoparquet\").load(out_path).select(\n", - " \"bid\", \"mean_risk\", \"dist_km\", \"risk_score\"\n", - ").show(truncate=False)" - ] + "source": "# Distance to the nearest road via indexed kNN \u2014 scales to county-sized\n# road networks without the O(B*R) blow-up.\nbuildings_with_dist = sedona.sql(\"\"\"\n SELECT b.bid, b.geom,\n ST_DistanceSpheroid(b.geom, r.geom) AS dist_m\n FROM buildings b\n JOIN roads r ON ST_KNN(b.geom, r.geom, 1, false)\n\"\"\")\nbuildings_with_dist.createOrReplaceTempView(\"buildings_with_dist\")\n\n# Per-tile zonal sum + count, gated by RS_Intersects so non-intersecting\n# (building, tile) pairs are skipped before RS_ZonalStats runs.\nper_tile = sedona.sql(\"\"\"\n SELECT b.bid, b.geom, b.dist_m,\n RS_ZonalStats(c.rast, b.geom, 'sum') AS tile_sum,\n RS_ZonalStats(c.rast, b.geom, 'count') AS tile_cnt\n FROM buildings_with_dist b, composite c\n WHERE RS_Intersects(c.rast, b.geom)\n\"\"\")\nper_tile.createOrReplaceTempView(\"per_tile\")\n\n# Aggregate to a pixel-count-weighted mean risk per building, then score.\nscored = sedona.sql(\"\"\"\n SELECT bid,\n FIRST(geom) AS geom,\n ROUND(SUM(tile_sum) / NULLIF(SUM(tile_cnt), 0), 4) AS mean_risk,\n ROUND(FIRST(dist_m) / 1000.0, 2) AS dist_km,\n ROUND(\n (SUM(tile_sum) / NULLIF(SUM(tile_cnt), 0))\n * (1.0 + LEAST(FIRST(dist_m), 5000.0) / 5000.0),\n 4\n ) AS risk_score\n FROM per_tile\n GROUP BY bid\n ORDER BY risk_score DESC\n\"\"\")\nscored.cache()\nscored.createOrReplaceTempView(\"scored\")\nscored.select(\"bid\", \"mean_risk\", \"dist_km\", \"risk_score\").show(16)" }, { "cell_type": "markdown", @@ -372,11 +292,11 @@ "1. Loaded two synthesized GeoTIFFs with `sedona.read.format(\"raster\")`.\n", "2. Joined them on tile coordinates and combined via two-raster `RS_MapAlgebra` into a composite risk raster.\n", "3. Built building polygons and road `LINESTRING`s as Spark DataFrames.\n", - "4. `MIN(ST_DistanceSpheroid)` over the building × road cross product gave each building's distance to its nearest road, in metres.\n", - "5. `RS_ZonalStats(composite, footprint, 'mean')` × evacuation factor produced a per-building risk score.\n", + "4. `MIN(ST_DistanceSpheroid)` over the building \u00d7 road cross product gave each building's distance to its nearest road, in metres.\n", + "5. `RS_ZonalStats(composite, footprint, 'mean')` \u00d7 evacuation factor produced a per-building risk score.\n", "6. Wrote top-5 to GeoParquet 1.1, read back to verify, plotted everything on a single matplotlib axes.\n", "\n", - "The slope-east-fuel-north synthesis means buildings in the **north-east corner** carry the highest composite risk; the multiplicative evacuation factor then favours buildings that are also far from one of the two bisector roads. The ranking from the harness pass picks the corner buildings as expected — same SQL runs unchanged on a county-sized DEM + the OSM road network + Overture buildings, just with different inputs." + "The slope-east-fuel-north synthesis means buildings in the **north-east corner** carry the highest composite risk; the multiplicative evacuation factor then favours buildings that are also far from one of the two bisector roads. The ranking from the harness pass picks the corner buildings as expected \u2014 same SQL runs unchanged on a county-sized DEM + the OSM road network + Overture buildings, just with different inputs." ] } ],