Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 2 additions & 3 deletions .github/workflows/docker-build.yml
Original file line number Diff line number Diff line change
Expand Up @@ -24,12 +24,14 @@ on:
paths:
- '.github/workflows/docker-build.yml'
- 'docker/**'
- 'docs/usecases/**'
pull_request:
branches:
- '*'
paths:
- '.github/workflows/docker-build.yml'
- 'docker/**'
- 'docs/usecases/**'
env:
MAVEN_OPTS: -Dmaven.wagon.httpconnectionManager.ttlSeconds=60

Expand All @@ -51,9 +53,6 @@ jobs:
- spark: 4.0.1
sedona: 'latest'
geotools: '33.1'
- spark: 4.0.1
sedona: 1.8.0
geotools: '33.1'
runs-on: ${{ matrix.os }}
defaults:
run:
Expand Down
4 changes: 2 additions & 2 deletions docker/sedona-docker.dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,8 @@ ARG shared_workspace=/opt/workspace
ARG spark_version=4.0.1
ARG hadoop_s3_version=3.4.1
ARG aws_sdk_version=2.38.2
ARG sedona_version=1.8.0
ARG geotools_wrapper_version=1.8.1-33.1
ARG sedona_version=1.9.0
ARG geotools_wrapper_version=1.9.0-33.5
ARG spark_extension_version=2.14.2
ARG zeppelin_version=0.12.0

Expand Down
235 changes: 235 additions & 0 deletions docs/usecases/05-geopandas-on-spark.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,235 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<!--\n Licensed to the Apache Software Foundation (ASF) under one\n or more contributor license agreements. See the NOTICE file\n distributed with this work for additional information\n regarding copyright ownership. The ASF licenses this file\n to you under the Apache License, Version 2.0 (the\n \"License\"); you may not use this file except in compliance\n with the License. You may obtain a copy of the License at\n\n http://www.apache.org/licenses/LICENSE-2.0\n\n Unless required by applicable law or agreed to in writing,\n software distributed under the License is distributed on an\n \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n KIND, either express or implied. See the License for the\n specific language governing permissions and limitations\n under the License.\n-->\n\n# Your GeoPandas notebook, scaled with Sedona\n\nSedona ships a `sedona.spark.geopandas` package that mirrors the public GeoPandas API \u2014 same constructors, same method names, same return shapes \u2014 but runs on a Spark backend so the same code path scales from a laptop to a cluster. We answer:\n\n> **What does it look like to take a typical GeoPandas script and run it on Sedona?**\n\nAlong the way we exercise the methods that landed in 1.8 / 1.9 and that this notebook actually calls \u2014 `convex_hull`, `clip_by_rect`, `total_bounds`, `to_geopandas` \u2014 and show how to drop into SQL when the GeoPandas-style API doesn't have what you need (`ST_VoronoiPolygons` + `ST_Collect_Agg`, `ST_DistanceSpheroid`).\n\n**Requires Sedona \u2265 1.9.0.** `clip_by_rect` lands in 1.9.0 and the notebook will fail on older versions of the docker image.\n\nData is the Natural Earth countries shapefile already shipped with the docker image; no network required."
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Connect to Spark through SedonaContext\n",
"\n",
"One difference from the other example notebooks: `sedona.spark.geopandas` runs on top of **pandas-on-Spark** (`pyspark.pandas`), which currently requires Spark's ANSI mode to be off. We set that flag explicitly when building the session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sedona.spark import SedonaContext\n",
"\n",
"config = (\n",
" SedonaContext.builder()\n",
" .master(\"spark://localhost:7077\")\n",
" .config(\"spark.sql.ansi.enabled\", \"false\")\n",
" .getOrCreate()\n",
")\n",
"sedona = SedonaContext.create(config)"
Comment on lines +27 to +31
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Load a shapefile with the same `read_file` you already know\n",
"\n",
"`sedona.spark.geopandas.read_file` is a drop-in for `geopandas.read_file`. The only twist when pointing at a directory is to declare the format explicitly \u2014 the file extension can't be inferred for shapefile bundles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sedona.spark import geopandas as sgpd\n",
"\n",
"countries = sgpd.read_file(\"data/ne_50m_admin_0_countries_lakes\", format=\"shapefile\")\n",
"print(f\"loaded {len(countries)} countries\")\n",
"print(\"columns:\", countries.columns.tolist()[:6], \"\u2026\")\n",
"countries[[\"NAME\", \"CONTINENT\", \"POP_EST\"]].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Filter, then derive \u2014 the same idioms as vanilla GeoPandas\n",
"\n",
"Boolean indexing, `.geometry` accessor, `centroid`, `convex_hull`, `area`, and `total_bounds` all work exactly as they do in `geopandas`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"africa = countries[countries.CONTINENT == \"Africa\"]\n",
"print(f\"{len(africa)} African countries\")\n",
"\n",
"geom = africa.geometry\n",
"print(\"\\nbounding box of the continent:\", tuple(round(b, 2) for b in geom.total_bounds))\n",
"\n",
"summary = africa[[\"NAME\"]].copy()\n",
"summary[\"centroid\"] = geom.centroid\n",
"summary[\"area_deg2\"] = geom.area\n",
"summary[\"hull_area_deg2\"] = geom.convex_hull.area\n",
"summary.sort_values(\"area_deg2\", ascending=False).head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Voronoi catchments via `ST_VoronoiPolygons` + `ST_Collect_Agg`\n",
"\n",
"`GeoSeries.voronoi_polygons()` runs a Voronoi tessellation **per row**, which only makes sense if a single row already contains a MultiPoint. To compute one Voronoi diagram from many points, drop into SQL: aggregate every centroid into a single MultiPoint with the `ST_Collect_Agg` aggregator (new in 1.8.1), then call `ST_VoronoiPolygons` on the aggregate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
"\n",
"voronoi_geom = sedona.sql(\"\"\"\n",
" SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
" FROM africa\n",
"\"\"\").first()[0]\n",
"\n",
"stats = sedona.sql(\"\"\"\n",
" SELECT ST_NumGeometries(v) AS cells,\n",
" ROUND(ST_Area(v), 2) AS total_area_deg2,\n",
" ROUND(ST_XMin(v), 2) AS xmin,\n",
" ROUND(ST_YMin(v), 2) AS ymin,\n",
" ROUND(ST_XMax(v), 2) AS xmax,\n",
" ROUND(ST_YMax(v), 2) AS ymax\n",
" FROM (SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))) AS v\n",
" FROM africa)\n",
"\"\"\")\n",
"stats.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Clip the Voronoi diagram to a continental bounding rectangle\n",
"\n",
"`clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) is the geopandas-style way to crop. We use it to confine the Voronoi cells to a generous Africa bbox so they line up cleanly with the country polygons in the final plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "from shapely.wkt import loads as wkt_loads\n\nvoronoi_shapely = (\n wkt_loads(voronoi_geom.wkt) if hasattr(voronoi_geom, \"wkt\") else voronoi_geom\n)\nvoronoi_cells = sgpd.GeoSeries([g for g in voronoi_shapely.geoms])\nprint(f\"{len(voronoi_cells)} Voronoi cells before clip\")\n\nafrica_bbox = (-20.0, -36.0, 52.0, 38.0)\nclipped = voronoi_cells.clip_by_rect(*africa_bbox)\nprint(f\"{len(clipped)} Voronoi cells after clip_by_rect\")\nclipped.head(2)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Hand off to vanilla GeoPandas for plotting\n",
"\n",
"When the data is small enough, `to_geopandas()` materializes a Sedona GeoDataFrame as a vanilla `geopandas.GeoDataFrame` so it can be plotted with the standard `.plot(ax=...)` machinery."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"africa_gp = africa.to_geopandas()\n",
"voronoi_gp = clipped.to_geopandas()\n",
"\n",
"fig, ax = plt.subplots(1, 1, figsize=(8, 8))\n",
"africa_gp.plot(ax=ax, color=\"#fdf6e3\", edgecolor=\"#586e75\", linewidth=0.6)\n",
"voronoi_gp.boundary.plot(ax=ax, color=\"#dc322f\", linewidth=0.4, alpha=0.8)\n",
"africa_gp.geometry.centroid.plot(ax=ax, color=\"#dc322f\", markersize=4)\n",
"ax.set_title(\"Africa: Voronoi catchments around country centroids\")\n",
"ax.set_xlabel(\"longitude\")\n",
"ax.set_ylabel(\"latitude\")\n",
"ax.set_xlim(africa_bbox[0], africa_bbox[2])\n",
"ax.set_ylim(africa_bbox[1], africa_bbox[3])\n",
"fig.tight_layout()\n",
"fig"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 7. Drop into SQL whenever you need a function the API doesn't expose\n\n`<gdf>.spark.frame()` returns the underlying Spark DataFrame, so the entire `ST_*` SQL catalog is one `createOrReplaceTempView` away. Here we ask which African countries' centroids are closest to (0\u00b0N, 0\u00b0E) using `ST_DistanceSpheroid` (great-circle distance in metres), without leaving the data we already loaded with the geopandas API."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"africa.spark.frame().createOrReplaceTempView(\"africa\")\n",
"\n",
"closest = sedona.sql(\"\"\"\n",
" SELECT NAME,\n",
" ROUND(ST_DistanceSpheroid(\n",
" ST_Centroid(geometry),\n",
" ST_Point(0.0, 0.0)\n",
" ) / 1000.0, 1) AS km_from_origin\n",
" FROM africa\n",
" ORDER BY km_from_origin ASC\n",
" LIMIT 10\n",
"\"\"\")\n",
"closest.show(truncate=40)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What just happened?\n",
"\n",
"We took a small GeoPandas-style script and ran it on Sedona's Spark backend without rewriting it as Spark SQL:\n",
"\n",
"1. `read_file(..., format=\"shapefile\")` loaded the Natural Earth countries shapefile into a Sedona GeoDataFrame.\n",
"2. Vanilla GeoPandas idioms \u2014 boolean filtering, `.geometry`, `.centroid`, `.convex_hull`, `.area`, `.total_bounds` \u2014 produced the per-country summary.\n",
"3. SQL filled in the gap when the geopandas-on-Spark API didn't have an aggregator: `ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry)))` produced a single Voronoi tessellation from many points.\n",
"4. `clip_by_rect`, also new in 1.9, cropped the result to a continent.\n",
"5. `to_geopandas()` materialized the small final result as a vanilla `geopandas.GeoDataFrame` for plotting.\n",
"6. `gdf.spark.frame()` exposed the Spark DataFrame so we could mix in arbitrary SQL \u2014 for example `ST_DistanceSpheroid` \u2014 without leaving the same dataset.\n",
"\n",
"The same code path runs on a multi-node Spark cluster against billion-row datasets; only the `master(...)` URL changes."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Loading