diff --git a/docs/usecases/02-vegetation-change.ipynb b/docs/usecases/02-vegetation-change.ipynb new file mode 100644 index 00000000000..31bbde62e9d --- /dev/null +++ b/docs/usecases/02-vegetation-change.ipynb @@ -0,0 +1,257 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "# Did this farmland green up?\n", + "\n", + "A complete raster pipeline. We answer:\n", + "\n", + "> **Between two satellite scenes a season apart, which farm parcels in this AOI greened up the most?**\n", + "\n", + "End-to-end on Sedona's raster surface: load two GeoTIFFs through the new tiling raster data source, compute NDVI per scene with `RS_MapAlgebra`, take their difference for a \"greening\" delta raster, run `RS_ZonalStats` per parcel, rank, clip the delta raster to the top parcel with `RS_Clip`, round-trip through `RS_AsCOG` to prove the result is a valid Cloud-Optimized GeoTIFF, and visualize the four stages as a single matplotlib figure.\n", + "\n", + "The two source scenes are **synthesized** at the top of the notebook (256 \u00d7 256 px, 2 bands each: red + NIR, EPSG:4326) so the notebook is fully reproducible and ships no new bytes. The same code path runs unchanged against real Sentinel-2 chips \u2014 only the raster filenames change.\n", + "\n", + "**Requires Sedona \u2265 1.9.0.** The `raster` data source (auto-tiling GeoTIFF reader, GH-2672) and `RS_AsCOG` (GH-2652) land in 1.9.0; the notebook will fail on older versions of the docker image." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Connect to Spark through SedonaContext" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sedona.spark import SedonaContext\n", + "\n", + "config = SedonaContext.builder().master(\"spark://localhost:7077\").getOrCreate()\n", + "sedona = SedonaContext.create(config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Synthesize two scenes a season apart\n", + "\n", + "We build two 256\u00d7256, 2-band GeoTIFFs in `/tmp` representing red and near-infrared reflectance over the same AOI. The \"before\" scene is mostly bare; the \"after\" scene has a circular field of vegetation in the south-west corner with elevated NIR. Pixel values are scaled to the same 0-10000 reflectance range as Sentinel-2 surface-reflectance products so the NDVI math is identical to the real-world case." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "import os\n\nimport numpy as np\nimport rasterio\nfrom rasterio.transform import from_bounds\n\nWORK = \"/tmp/veg-change\"\nos.makedirs(WORK, exist_ok=True)\n\nAOI = (-91.10, 41.50, -91.00, 41.60) # (xmin, ymin, xmax, ymax) in EPSG:4326\nW = H = 256\ntransform = from_bounds(*AOI, W, H)\nrng = np.random.default_rng(42)\n\nys, xs = np.mgrid[0:H, 0:W]\nfield_mask = ((xs - 64) ** 2 + (ys - 192) ** 2) < 50**2 # circular field\n\n\ndef synth_scene(green_strength):\n red = (1500 + 200 * rng.standard_normal((H, W))).clip(0, 10000)\n nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)\n nir = np.where(field_mask, nir + green_strength, nir)\n return red.astype(\"uint16\"), nir.astype(\"uint16\")\n\n\nfor name, strength in [(\"before\", 200), (\"after\", 4000)]:\n red, nir = synth_scene(strength)\n with rasterio.open(\n f\"{WORK}/scene_{name}.tif\",\n \"w\",\n driver=\"GTiff\",\n tiled=True,\n blockxsize=256,\n blockysize=256,\n height=H,\n width=W,\n count=2,\n dtype=\"uint16\",\n crs=\"EPSG:4326\",\n transform=transform,\n ) as dst:\n dst.write(red, 1)\n dst.write(nir, 2)\n dst.set_band_description(1, \"red\")\n dst.set_band_description(2, \"nir\")\n\nfor f in (f\"{WORK}/scene_before.tif\", f\"{WORK}/scene_after.tif\"):\n print(f\"{f}: {os.path.getsize(f)} bytes\")" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## 3. Load both scenes with the `raster` data source\n\n`sedona.read.format(\"raster\")` (new in 1.9) auto-tiles GeoTIFFs and yields one `Raster`-typed row per tile. Each row carries `(x, y)` tile-index columns \u2014 keep them; we'll join on those when computing \u0394NDVI so the same query works for single-tile inputs (as here) and for multi-gigabyte rasters that yield thousands of tiles per scene." + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "scenes = (\n sedona.read.format(\"raster\")\n .load(f\"{WORK}/scene_*.tif\")\n .selectExpr(\"split(name, '\\\\\\\\.')[0] as scene\", \"x\", \"y\", \"rast\")\n)\nscenes.cache()\nscenes.show(truncate=80)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Compute NDVI per scene with `RS_MapAlgebra`\n", + "\n", + "`RS_MapAlgebra(raster, pixelType, script)` runs a single-raster pixel script. The script syntax (`out[0] = ...; rast[i]` indexing) is documented [here](../api/sql/Raster-map-algebra.md). We coerce the result to `D` (double) so the negative side of NDVI survives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "scenes.createOrReplaceTempView(\"scenes\")\n\nndvi = sedona.sql(\"\"\"\n SELECT scene, x, y,\n RS_MapAlgebra(\n rast, 'D',\n 'out[0] = (rast[1] - rast[0]) / (rast[1] + rast[0] + 0.000001);'\n ) AS rast\n FROM scenes\n\"\"\")\nndvi.cache()\nndvi.createOrReplaceTempView(\"ndvi\")\nndvi.selectExpr(\"scene\", \"x\", \"y\", \"RS_BandPixelType(rast, 1) as dtype\").show()" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## 5. Compute the greening delta raster\n\nTwo-raster `RS_MapAlgebra(rast0, rast1, pixelType, script, noDataValue)` subtracts the before-NDVI from the after-NDVI in a single pass. We join the two NDVI dataframes on the tile coordinates `(x, y)` so the same query works whether each scene produced a single tile (as here) or thousands." + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "delta = sedona.sql(\"\"\"\n SELECT a.x, a.y,\n RS_MapAlgebra(\n a.rast, b.rast, 'D',\n 'out[0] = rast0[0] - rast1[0];',\n -9999.0\n ) AS rast\n FROM (SELECT x, y, rast FROM ndvi WHERE scene = 'scene_after') a\n JOIN (SELECT x, y, rast FROM ndvi WHERE scene = 'scene_before') b\n ON a.x = b.x AND a.y = b.y\n\"\"\")\ndelta.cache()\ndelta.createOrReplaceTempView(\"delta\")\ndelta.selectExpr(\n \"x\",\n \"y\",\n \"RS_BandPixelType(rast, 1) as dtype\",\n \"RS_Width(rast) as w\",\n \"RS_Height(rast) as h\",\n).show()" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## 6. Synthesize parcel polygons and run `RS_ZonalStats`\n\nBuild a 4\u00d74 grid of square parcels covering the AOI and compute the mean \u0394NDVI per parcel. `RS_ZonalStats(raster, zone, statType)` is the canonical raster\u2192vector aggregation: every pixel inside `zone` contributes to the statistic.\n\nFor tiled inputs, the cross-join below produces one row per `(parcel, tile)`; with our single-tile delta it collapses to one row per parcel. To handle truly tiled inputs robustly you would compute `RS_ZonalStats(rast, geom, 'sum')` and `'count'` per tile and aggregate `SUM(sum) / SUM(count)` per parcel \u2014 same idiom, one extra `GROUP BY`." + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pyspark.sql import Row\n", + "\n", + "xmin, ymin, xmax, ymax = AOI\n", + "step_x = (xmax - xmin) / 4\n", + "step_y = (ymax - ymin) / 4\n", + "parcel_rows = []\n", + "for ix in range(4):\n", + " for iy in range(4):\n", + " x0, x1 = xmin + ix * step_x, xmin + (ix + 1) * step_x\n", + " y0, y1 = ymin + iy * step_y, ymin + (iy + 1) * step_y\n", + " wkt = f\"POLYGON(({x0} {y0}, {x1} {y0}, {x1} {y1}, {x0} {y1}, {x0} {y0}))\"\n", + " parcel_rows.append(Row(parcel_id=f\"P{ix}{iy}\", wkt=wkt))\n", + "\n", + "parcels = sedona.createDataFrame(parcel_rows).selectExpr(\n", + " \"parcel_id\", \"ST_GeomFromText(wkt) as geom\"\n", + ")\n", + "parcels.createOrReplaceTempView(\"parcels\")\n", + "\n", + "ranked = sedona.sql(\"\"\"\n", + " SELECT p.parcel_id,\n", + " ROUND(RS_ZonalStats(d.rast, p.geom, 'mean'), 4) AS mean_delta_ndvi\n", + " FROM parcels p, delta d\n", + " ORDER BY mean_delta_ndvi DESC\n", + "\"\"\")\n", + "ranked.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Clip the delta raster to the top-greening parcel\n", + "\n", + "`RS_Clip(raster, band, geom)` crops the raster to the polygon's extent. We pick the parcel with the highest mean \u0394NDVI and zoom in on its delta values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "top_id = ranked.first()[\"parcel_id\"]\n", + "print(f\"top-greening parcel: {top_id}\")\n", + "\n", + "clipped = sedona.sql(f\"\"\"\n", + " SELECT RS_Clip(d.rast, 1, p.geom) AS rast\n", + " FROM delta d, parcels p\n", + " WHERE p.parcel_id = '{top_id}'\n", + "\"\"\")\n", + "clipped.cache()\n", + "clipped.createOrReplaceTempView(\"clipped\")\n", + "clipped.selectExpr(\n", + " \"RS_Width(rast) as w\",\n", + " \"RS_Height(rast) as h\",\n", + ").show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Round-trip through `RS_AsCOG`\n", + "\n", + "`RS_AsCOG` returns the raster as a binary Cloud-Optimized GeoTIFF byte array. We persist it to disk, then read it back with the same `raster` data source we used to load the input \u2014 proving the output is a valid GeoTIFF that downstream tools can stream from object storage." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "cog_bytes = clipped.selectExpr(\"RS_AsCOG(rast) as cog\").first()[\"cog\"]\nwith open(f\"{WORK}/delta_topparcel_cog.tif\", \"wb\") as fh:\n fh.write(cog_bytes)\nprint(f\"COG size: {len(cog_bytes):,} bytes\")\n\nroundtrip = sedona.read.format(\"raster\").load(f\"{WORK}/delta_topparcel_cog.tif\")\nroundtrip.selectExpr(\n \"RS_Width(rast) as w\",\n \"RS_Height(rast) as h\",\n \"RS_BandPixelType(rast, 1) as dtype\",\n).show()" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. Visualize the pipeline as a 4-panel figure\n", + "\n", + "Read the rasters back via `rasterio` and lay them out side by side: NDVI before, NDVI after, \u0394NDVI across the full AOI with parcel boundaries overlaid, and the top-parcel close-up." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "import matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\n\n\ndef ndvi_arr(path):\n with rasterio.open(path) as ds:\n red = ds.read(1).astype(\"float32\")\n nir = ds.read(2).astype(\"float32\")\n return (nir - red) / (nir + red + 1e-6)\n\n\nndvi_before = ndvi_arr(f\"{WORK}/scene_before.tif\")\nndvi_after = ndvi_arr(f\"{WORK}/scene_after.tif\")\ndelta_arr = ndvi_after - ndvi_before\nwith rasterio.open(f\"{WORK}/delta_topparcel_cog.tif\") as ds:\n top_arr = ds.read(1)\n top_extent = (ds.bounds.left, ds.bounds.right, ds.bounds.bottom, ds.bounds.top)\n\nfig, axes = plt.subplots(1, 4, figsize=(16, 4))\nextent = (AOI[0], AOI[2], AOI[1], AOI[3])\naxes[0].imshow(ndvi_before, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[0].set_title(\"NDVI before\")\naxes[1].imshow(ndvi_after, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[1].set_title(\"NDVI after\")\naxes[2].imshow(delta_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=extent)\naxes[2].set_title(\"\u0394NDVI (after \u2212 before) with parcel grid\")\nxmin, ymin, xmax, ymax = AOI\nstep_x = (xmax - xmin) / 4\nstep_y = (ymax - ymin) / 4\nfor ix in range(4):\n for iy in range(4):\n x0 = xmin + ix * step_x\n y0 = ymin + iy * step_y\n axes[2].add_patch(\n mpatches.Rectangle(\n (x0, y0), step_x, step_y, fill=False, edgecolor=\"black\", linewidth=0.6\n )\n )\n axes[2].text(\n x0 + step_x / 2,\n y0 + step_y / 2,\n f\"P{ix}{iy}\",\n ha=\"center\",\n va=\"center\",\n fontsize=7,\n color=\"black\",\n )\naxes[3].imshow(top_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=top_extent)\naxes[3].set_title(f\"Top parcel ({top_id}) \u0394NDVI\")\nfor ax in axes:\n ax.set_xticks([])\n ax.set_yticks([])\nfig.tight_layout()\nfig" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What just happened?\n", + "\n", + "We turned two synthetic Sentinel-2-style scenes into a ranked parcel-level greening report in nine steps:\n", + "\n", + "1. Synthesized two 256\u00d7256 red+NIR GeoTIFFs in `/tmp`.\n", + "2. Loaded both with `sedona.read.format(\"raster\")` \u2014 the new auto-tiling reader handles GeoTIFFs of any size with the same call shape.\n", + "3. Computed per-pixel NDVI per scene with single-raster `RS_MapAlgebra`.\n", + "4. Subtracted the two NDVI rasters with two-raster `RS_MapAlgebra` to get a \u0394NDVI \"greening\" layer.\n", + "5. Synthesized a 4\u00d74 parcel grid and ranked parcels by mean \u0394NDVI via `RS_ZonalStats`.\n", + "6. Clipped the delta raster to the top-greening parcel with `RS_Clip`.\n", + "7. Encoded the result as a Cloud-Optimized GeoTIFF with `RS_AsCOG` and read it back with the `raster` data source \u2014 proving the output is valid for cloud-hosted streaming.\n", + "8. Plotted the four pipeline stages side by side.\n", + "\n", + "Swap `/tmp/scene_*.tif` for a glob over real Sentinel-2 chips and the same SQL runs on a cluster against terabytes of imagery." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}