diff --git a/.github/workflows/docker-build.yml b/.github/workflows/docker-build.yml index 4e84bc28aea..269f4adb351 100644 --- a/.github/workflows/docker-build.yml +++ b/.github/workflows/docker-build.yml @@ -24,12 +24,14 @@ on: paths: - '.github/workflows/docker-build.yml' - 'docker/**' + - 'docs/usecases/**' pull_request: branches: - '*' paths: - '.github/workflows/docker-build.yml' - 'docker/**' + - 'docs/usecases/**' env: MAVEN_OPTS: -Dmaven.wagon.httpconnectionManager.ttlSeconds=60 @@ -51,9 +53,6 @@ jobs: - spark: 4.0.1 sedona: 'latest' geotools: '33.1' - - spark: 4.0.1 - sedona: 1.8.0 - geotools: '33.1' runs-on: ${{ matrix.os }} defaults: run: diff --git a/docker/sedona-docker.dockerfile b/docker/sedona-docker.dockerfile index 2969f659cfb..b12c98d5104 100644 --- a/docker/sedona-docker.dockerfile +++ b/docker/sedona-docker.dockerfile @@ -21,8 +21,8 @@ ARG shared_workspace=/opt/workspace ARG spark_version=4.0.1 ARG hadoop_s3_version=3.4.1 ARG aws_sdk_version=2.38.2 -ARG sedona_version=1.8.0 -ARG geotools_wrapper_version=1.8.1-33.1 +ARG sedona_version=1.9.0 +ARG geotools_wrapper_version=1.9.0-33.5 ARG spark_extension_version=2.14.2 ARG zeppelin_version=0.12.0 diff --git a/docs/usecases/05-geopandas-on-spark.ipynb b/docs/usecases/05-geopandas-on-spark.ipynb new file mode 100644 index 00000000000..a43641985a7 --- /dev/null +++ b/docs/usecases/05-geopandas-on-spark.ipynb @@ -0,0 +1,235 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "\n\n# Your GeoPandas notebook, scaled with Sedona\n\nSedona ships a `sedona.spark.geopandas` package that mirrors the public GeoPandas API \u2014 same constructors, same method names, same return shapes \u2014 but runs on a Spark backend so the same code path scales from a laptop to a cluster. We answer:\n\n> **What does it look like to take a typical GeoPandas script and run it on Sedona?**\n\nAlong the way we exercise the methods that landed in 1.8 / 1.9 and that this notebook actually calls \u2014 `convex_hull`, `clip_by_rect`, `total_bounds`, `to_geopandas` \u2014 and show how to drop into SQL when the GeoPandas-style API doesn't have what you need (`ST_VoronoiPolygons` + `ST_Collect_Agg`, `ST_DistanceSpheroid`).\n\n**Requires Sedona \u2265 1.9.0.** `clip_by_rect` lands in 1.9.0 and the notebook will fail on older versions of the docker image.\n\nData is the Natural Earth countries shapefile already shipped with the docker image; no network required." + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Connect to Spark through SedonaContext\n", + "\n", + "One difference from the other example notebooks: `sedona.spark.geopandas` runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires Spark's ANSI mode to be off. We set that flag explicitly when building the session." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sedona.spark import SedonaContext\n", + "\n", + "config = (\n", + " SedonaContext.builder()\n", + " .master(\"spark://localhost:7077\")\n", + " .config(\"spark.sql.ansi.enabled\", \"false\")\n", + " .getOrCreate()\n", + ")\n", + "sedona = SedonaContext.create(config)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load a shapefile with the same `read_file` you already know\n", + "\n", + "`sedona.spark.geopandas.read_file` is a drop-in for `geopandas.read_file`. The only twist when pointing at a directory is to declare the format explicitly \u2014 the file extension can't be inferred for shapefile bundles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sedona.spark import geopandas as sgpd\n", + "\n", + "countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\", format=\"shapefile\")\n", + "print(f\"loaded {len(countries)} countries\")\n", + "print(\"columns:\", countries.columns.tolist()[:6], \"\u2026\")\n", + "countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Filter, then derive \u2014 the same idioms as vanilla GeoPandas\n", + "\n", + "Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`, `area`, and `total_bounds` all work exactly as they do in `geopandas`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "africa = countries[countries.CONTINENT == \"Africa\"]\n", + "print(f\"{len(africa)} African countries\")\n", + "\n", + "geom = africa.geometry\n", + "print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in geom.total_bounds))\n", + "\n", + "summary = africa[[\"NAME\"]].copy()\n", + "summary[\"centroid\"] = geom.centroid\n", + "summary[\"area_deg2\"] = geom.area\n", + "summary[\"hull_area_deg2\"] = geom.convex_hull.area\n", + "summary.sort_values(\"area_deg2\", ascending=False).head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n", + "\n", + "`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**, which only makes sense if a single row already contains a MultiPoint. To compute one Voronoi diagram from many points, drop into SQL: aggregate every centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in 1.8.1), then call `ST_VoronoiPolygons` on the aggregate." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "africa.spark.frame().createOrReplaceTempView(\"africa\")\n", + "\n", + "voronoi_geom = sedona.sql(\"\"\"\n", + " SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n", + " FROM africa\n", + "\"\"\").first()[0]\n", + "\n", + "stats = sedona.sql(\"\"\"\n", + " SELECT ST_NumGeometries(v) AS cells,\n", + " ROUND(ST_Area(v), 2) AS total_area_deg2,\n", + " ROUND(ST_XMin(v), 2) AS xmin,\n", + " ROUND(ST_YMin(v), 2) AS ymin,\n", + " ROUND(ST_XMax(v), 2) AS xmax,\n", + " ROUND(ST_YMax(v), 2) AS ymax\n", + " FROM (SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n", + " FROM africa)\n", + "\"\"\")\n", + "stats.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Clip the Voronoi diagram to a continental bounding rectangle\n", + "\n", + "`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the geopandas-style way to crop. We use it to confine the Voronoi cells to a generous Africa bbox so they line up cleanly with the country polygons in the final plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "from shapely.wkt import loads as wkt_loads\n\nvoronoi_shapely = (\n wkt_loads(voronoi_geom.wkt) if hasattr(voronoi_geom, \"wkt\") else voronoi_geom\n)\nvoronoi_cells = sgpd.GeoSeries([g for g in voronoi_shapely.geoms])\nprint(f\"{len(voronoi_cells)} Voronoi cells before clip\")\n\nafrica_bbox = (-20.0, -36.0, 52.0, 38.0)\nclipped = voronoi_cells.clip_by_rect(*africa_bbox)\nprint(f\"{len(clipped)} Voronoi cells after clip_by_rect\")\nclipped.head(2)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Hand off to vanilla GeoPandas for plotting\n", + "\n", + "When the data is small enough, `to_geopandas()` materializes a Sedona GeoDataFrame as a vanilla `geopandas.GeoDataFrame` so it can be plotted with the standard `.plot(ax=...)` machinery." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "africa_gp = africa.to_geopandas()\n", + "voronoi_gp = clipped.to_geopandas()\n", + "\n", + "fig, ax = plt.subplots(1, 1, figsize=(8, 8))\n", + "africa_gp.plot(ax=ax, color=\"#fdf6e3\", edgecolor=\"#586e75\", linewidth=0.6)\n", + "voronoi_gp.boundary.plot(ax=ax, color=\"#dc322f\", linewidth=0.4, alpha=0.8)\n", + "africa_gp.geometry.centroid.plot(ax=ax, color=\"#dc322f\", markersize=4)\n", + "ax.set_title(\"Africa: Voronoi catchments around country centroids\")\n", + "ax.set_xlabel(\"longitude\")\n", + "ax.set_ylabel(\"latitude\")\n", + "ax.set_xlim(africa_bbox[0], africa_bbox[2])\n", + "ax.set_ylim(africa_bbox[1], africa_bbox[3])\n", + "fig.tight_layout()\n", + "fig" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## 7. Drop into SQL whenever you need a function the API doesn't expose\n\n`.spark.frame()` returns the underlying Spark DataFrame, so the entire `ST_*` SQL catalog is one `createOrReplaceTempView` away. Here we ask which African countries' centroids are closest to (0\u00b0N, 0\u00b0E) using `ST_DistanceSpheroid` (great-circle distance in metres), without leaving the data we already loaded with the geopandas API." + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "africa.spark.frame().createOrReplaceTempView(\"africa\")\n", + "\n", + "closest = sedona.sql(\"\"\"\n", + " SELECT NAME,\n", + " ROUND(ST_DistanceSpheroid(\n", + " ST_Centroid(geometry),\n", + " ST_Point(0.0, 0.0)\n", + " ) / 1000.0, 1) AS km_from_origin\n", + " FROM africa\n", + " ORDER BY km_from_origin ASC\n", + " LIMIT 10\n", + "\"\"\")\n", + "closest.show(truncate=40)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What just happened?\n", + "\n", + "We took a small GeoPandas-style script and ran it on Sedona's Spark backend without rewriting it as Spark SQL:\n", + "\n", + "1. `read_file(..., format=\"shapefile\")` loaded the Natural Earth countries shapefile into a Sedona GeoDataFrame.\n", + "2. Vanilla GeoPandas idioms \u2014 boolean filtering, `.geometry`, `.centroid`, `.convex_hull`, `.area`, `.total_bounds` \u2014 produced the per-country summary.\n", + "3. SQL filled in the gap when the geopandas-on-Spark API didn't have an aggregator: `ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry)))` produced a single Voronoi tessellation from many points.\n", + "4. `clip_by_rect`, also new in 1.9, cropped the result to a continent.\n", + "5. `to_geopandas()` materialized the small final result as a vanilla `geopandas.GeoDataFrame` for plotting.\n", + "6. `gdf.spark.frame()` exposed the Spark DataFrame so we could mix in arbitrary SQL \u2014 for example `ST_DistanceSpheroid` \u2014 without leaving the same dataset.\n", + "\n", + "The same code path runs on a multi-node Spark cluster against billion-row datasets; only the `master(...)` URL changes." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}