A fast pipeline for processing dense urban OSM data, avoiding the slowness of the Overpass API in areas with thousands of overlapping polygons.
Alternative: For less dense regions, consider overpy instead.
- Download region-level OSM PBF files from Geofabrik
- Obtain boundary shapefiles (see Zenodo)
- Configure config.env with paths and parameters
Create a conda environment:
conda env create -f environment.yaml
conda activate vectorsynth_download
- Convert PBF to GeoParquet
  python osm2geoparquet.py
- Generate sample points within boundaries
  python generate_points.py
- Generate bounding boxes from points
  python generate_bboxes.py
- Clip OSM tags to bounding boxes
  python clip_bbox.py
- Process building heights (Microsoft Buildings dataset)
  python process_building_heights.py
  python clip_building_heights.py
  python attach_height_to_osm.py
- Combine cities and merge metadata
  python combine_points.py
- Clean and filter tags
  python clean_tags.py
- Rasterize tags to pixel grids
  python create_pixel_grids.py
- Compute text embeddings for tags
  python compute_embeddings.py
- Cleanup intermediate files and filter by coverage
  python filter_cleanup.py
- Download satellite tiles
  python download_sat.py
- Generate captions using LLaVA
  python get_llava_captions.py
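
The steps above are run one at a time. As a convenience, they could be chained with a small driver like the sketch below. This is not part of the repository: the names `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical, and it assumes every script runs without arguments from the repository root and reads its settings from config.env.

```python
import subprocess
import sys

# Script names are taken from the steps above. Running them in this order
# with no command-line arguments is an assumption.
PIPELINE_STEPS = [
    "osm2geoparquet.py",
    "generate_points.py",
    "generate_bboxes.py",
    "clip_bbox.py",
    "process_building_heights.py",
    "clip_building_heights.py",
    "attach_height_to_osm.py",
    "combine_points.py",
    "clean_tags.py",
    "create_pixel_grids.py",
    "compute_embeddings.py",
    "filter_cleanup.py",
    "download_sat.py",
    "get_llava_captions.py",
]


def build_commands(steps=PIPELINE_STEPS, start_at=None):
    """Return one argv list per step, optionally resuming at `start_at`."""
    if start_at is not None:
        # Raises ValueError if the script name is not a known step.
        steps = steps[steps.index(start_at):]
    return [[sys.executable, script] for script in steps]


def run_pipeline(start_at=None, dry_run=False):
    """Run each step in order and stop at the first failing script."""
    for cmd in build_commands(start_at=start_at):
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, and `run_pipeline(start_at="clean_tags.py")` resumes from a later step after a failure.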
- data/metadata/: Tag vocabularies, point metadata, coverage statistics
- data/pixel_tags/: Rasterized tag grids as PyTorch tensors
- data/tag_embeddings/: Precomputed text embeddings
- data/sat_images/: Satellite imagery tiles
- data/logs/: Processing logs
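
After a run, it can help to confirm that the output directories listed above were actually populated. The following is a minimal sketch, not a script from the repository: `summarize_outputs` is a hypothetical helper, and it assumes the directories live under the current working directory unless another root is passed.

```python
from pathlib import Path

# Directory names come from the output list above.
EXPECTED_OUTPUTS = [
    "data/metadata",
    "data/pixel_tags",
    "data/tag_embeddings",
    "data/sat_images",
    "data/logs",
]


def summarize_outputs(root="."):
    """Map each expected directory to its file count, or None if it is missing."""
    summary = {}
    for rel_path in EXPECTED_OUTPUTS:
        directory = Path(root) / rel_path
        if directory.is_dir():
            # Count regular files at any depth below the directory.
            summary[rel_path] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            summary[rel_path] = None
    return summary
```

A `None` value points to a step that did not run or wrote elsewhere, and a count of zero points to a step that ran but produced nothing.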