
Data Generation Pipeline

A fast pipeline for processing dense urban OSM data, designed to avoid the slowness of the Overpass API in areas with thousands of overlapping polygons.

Alternative: for less dense regions, consider the overpy library instead.
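As a sketch of the overpy route: a small region can usually be covered by a single Overpass QL bounding-box query. The helper below (an illustrative function, not part of this repository) only builds the query string, which could then be passed to `overpy.Overpass().query()`:

```python
def building_query(south, west, north, east):
    """Build an Overpass QL query for building polygons inside a bbox.

    The returned string can be passed to overpy.Overpass().query();
    only the string construction is shown here.
    """
    bbox = f"{south},{west},{north},{east}"
    return (
        "[out:json][timeout:60];"   # JSON output, 60 s server-side timeout
        f'way["building"]({bbox});' # all ways tagged as buildings in the bbox
        "out geom;"                 # include node geometry for each way
    )
```

For dense urban areas this approach hits Overpass timeouts quickly, which is exactly what the PBF-based pipeline below avoids.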

Prerequisites

  • Download region-level OSM PBF files from Geofabrik
  • Obtain boundary shapefiles (see Zenodo)
  • Configure config.env with paths and parameters
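The exact variable names in config.env are not documented here; a hypothetical fragment might look like:

```shell
# Hypothetical config.env layout -- variable names are illustrative,
# not taken from the repository.
OSM_PBF_DIR=/data/osm_pbf          # Geofabrik region-level PBF extracts
BOUNDARY_DIR=/data/boundaries      # boundary shapefiles (from Zenodo)
OUTPUT_DIR=/data                   # root for metadata, grids, images, logs
NUM_WORKERS=8                      # parallelism for the processing scripts
```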

🧑‍💻 Setting up environment

Create a conda environment:

conda env create -f environment.yaml
conda activate vectorsynth_download

Pipeline Execution Order

Vector Data Processing (run per city, then combined)

  1. Convert PBF to GeoParquet

    python osm2geoparquet.py
  2. Generate sample points within boundaries

    python generate_points.py
  3. Generate bounding boxes from points

    python generate_bboxes.py
  4. Clip OSM tags to bounding boxes

    python clip_bbox.py
  5. Process building heights (Microsoft Buildings dataset)

    python process_building_heights.py
    python clip_building_heights.py
    python attach_height_to_osm.py
  6. Combine cities and merge metadata

    python combine_points.py
  7. Clean and filter tags

    python clean_tags.py
  8. Rasterize tags to pixel grids

    python create_pixel_grids.py
  9. Compute text embeddings for tags

    python compute_embeddings.py
  10. Cleanup intermediate files and filter by coverage

    python filter_cleanup.py
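The per-script internals are not reproduced here, but the core geometry of steps 2–4 is simple enough to sketch in pure Python. Function names, the flat feature representation, and the metres-per-degree approximation are all illustrative, not taken from the scripts:

```python
import math

def sample_grid_points(south, west, north, east, step_deg):
    """Regularly sample candidate centre points inside a boundary bbox (step 2)."""
    n_lat = int((north - south) / step_deg) + 1
    n_lon = int((east - west) / step_deg) + 1
    return [(south + i * step_deg, west + j * step_deg)
            for i in range(n_lat) for j in range(n_lon)]

def bbox_around(lat, lon, half_size_m):
    """Approximate a square bbox of +/- half_size_m metres around a point (step 3)."""
    dlat = half_size_m / 111_320.0                                  # metres per degree of latitude
    dlon = half_size_m / (111_320.0 * math.cos(math.radians(lat)))  # shrinks with latitude
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

def clip_features(features, bbox):
    """Keep tagged point features whose coordinates fall inside bbox (step 4)."""
    south, west, north, east = bbox
    return [f for f in features
            if south <= f["lat"] <= north and west <= f["lon"] <= east]
```

The real scripts operate on GeoParquet tables of polygons rather than plain points, but the sampling-then-clipping flow is the same.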
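Step 8 ultimately writes PyTorch tensors (see Output Structure below), but the indexing that places a tagged feature into a pixel grid can be sketched without torch. Names and the point-feature representation below are illustrative:

```python
def rasterize_tags(features, bbox, grid_size, tag_vocab):
    """Rasterize tagged point features into a grid_size x grid_size grid of tag-id sets.

    Row 0 is the northern edge; tag_vocab maps tag strings to integer ids.
    """
    south, west, north, east = bbox
    grid = [[set() for _ in range(grid_size)] for _ in range(grid_size)]
    for f in features:
        if not (south <= f["lat"] <= north and west <= f["lon"] <= east):
            continue  # skip features outside the bbox
        row = min(int((north - f["lat"]) / (north - south) * grid_size), grid_size - 1)
        col = min(int((f["lon"] - west) / (east - west) * grid_size), grid_size - 1)
        for tag in f["tags"]:
            if tag in tag_vocab:          # drop tags outside the cleaned vocabulary
                grid[row][col].add(tag_vocab[tag])
    return grid
```

Converting each cell's id set into a multi-hot vector would then give the tensor layout stored in data/pixel_tags/.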

Satellite Imagery

  1. Download satellite tiles

    python download_sat.py
  2. Generate captions using LLaVA

    python get_llava_captions.py
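The internals of download_sat.py are not shown here, but if the tiles come from an XYZ tile server, the standard slippy-map arithmetic converts a WGS84 coordinate into tile indices. A minimal sketch (the function name is illustrative):

```python
import math

def deg2num(lat_deg, lon_deg, zoom):
    """Convert WGS84 lat/lon to slippy-map (x, y) tile indices at a zoom level."""
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom  # tiles per axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y
```

For example, `deg2num(47.3769, 8.5417, 10)` yields the zoom-10 tile covering central Zurich.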

Output Structure

  • data/metadata/: Tag vocabularies, point metadata, coverage statistics
  • data/pixel_tags/: Rasterized tag grids as PyTorch tensors
  • data/tag_embeddings/: Precomputed text embeddings
  • data/sat_images/: Satellite imagery tiles
  • data/logs/: Processing logs