
Data Generation Pipeline

A fast pipeline for processing dense urban OSM data, designed to avoid the slowness of the Overpass API in areas with thousands of overlapping polygons.

Alternative: for less dense regions, consider the overpy library instead.
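As a sketch of the overpy route: a small region can usually be covered by a single Overpass QL bounding-box query. The helper below (an illustrative function, not part of this repository) only builds the query string, which could then be passed to `overpy.Overpass().query()`:

```python
def building_query(south, west, north, east):
    """Build an Overpass QL query for building polygons inside a bbox.

    The returned string can be passed to overpy.Overpass().query();
    only the string construction is shown here.
    """
    bbox = f"{south},{west},{north},{east}"
    return (
        "[out:json][timeout:60];"   # JSON output, 60 s server-side timeout
        f'way["building"]({bbox});' # all ways tagged as buildings in the bbox
        "out geom;"                 # include node geometry for each way
    )
```

For dense urban areas this approach hits Overpass timeouts quickly, which is exactly what the PBF-based pipeline below avoids.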

Prerequisites

  • Download region-level OSM PBF files from Geofabrik
  • Obtain boundary shapefiles (see Zenodo)
  • Configure config.env with paths and parameters
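The exact variable names in config.env are not documented here; a hypothetical fragment might look like:

```shell
# Hypothetical config.env layout -- variable names are illustrative,
# not taken from the repository.
OSM_PBF_DIR=/data/osm_pbf          # Geofabrik region-level PBF extracts
BOUNDARY_DIR=/data/boundaries      # boundary shapefiles (from Zenodo)
OUTPUT_DIR=/data                   # root for metadata, grids, images, logs
NUM_WORKERS=8                      # parallelism for the processing scripts
```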

🧑‍💻 Setting up environment

Create a conda environment:

conda env create -f environment.yaml
conda activate vectorsynth_download

Pipeline Execution Order

Vector Data Processing (run per city, then combined)

  1. Convert PBF to GeoParquet

    python osm2geoparquet.py
  2. Generate sample points within boundaries

    python generate_points.py
  3. Generate bounding boxes from points

    python generate_bboxes.py
  4. Clip OSM tags to bounding boxes

    python clip_bbox.py
  5. Process building heights (Microsoft Buildings dataset)

    python process_building_heights.py
    python clip_building_heights.py
    python attach_height_to_osm.py
  6. Combine cities and merge metadata

    python combine_points.py
  7. Clean and filter tags

    python clean_tags.py
  8. Rasterize tags to pixel grids

    python create_pixel_grids.py
  9. Compute text embeddings for tags

    python compute_embeddings.py
  10. Cleanup intermediate files and filter by coverage

    python filter_cleanup.py
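The per-script internals are not reproduced here, but the core geometry of steps 2–4 is simple enough to sketch in pure Python. Function names, the flat feature representation, and the metres-per-degree approximation are all illustrative, not taken from the scripts:

```python
import math

def sample_grid_points(south, west, north, east, step_deg):
    """Regularly sample candidate centre points inside a boundary bbox (step 2)."""
    n_lat = int((north - south) / step_deg) + 1
    n_lon = int((east - west) / step_deg) + 1
    return [(south + i * step_deg, west + j * step_deg)
            for i in range(n_lat) for j in range(n_lon)]

def bbox_around(lat, lon, half_size_m):
    """Approximate a square bbox of +/- half_size_m metres around a point (step 3)."""
    dlat = half_size_m / 111_320.0                                  # metres per degree of latitude
    dlon = half_size_m / (111_320.0 * math.cos(math.radians(lat)))  # shrinks with latitude
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

def clip_features(features, bbox):
    """Keep tagged point features whose coordinates fall inside bbox (step 4)."""
    south, west, north, east = bbox
    return [f for f in features
            if south <= f["lat"] <= north and west <= f["lon"] <= east]
```

The real scripts operate on GeoParquet tables of polygons rather than plain points, but the sampling-then-clipping flow is the same.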
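Step 8 ultimately writes PyTorch tensors (see Output Structure below), but the indexing that places a tagged feature into a pixel grid can be sketched without torch. Names and the point-feature representation below are illustrative:

```python
def rasterize_tags(features, bbox, grid_size, tag_vocab):
    """Rasterize tagged point features into a grid_size x grid_size grid of tag-id sets.

    Row 0 is the northern edge; tag_vocab maps tag strings to integer ids.
    """
    south, west, north, east = bbox
    grid = [[set() for _ in range(grid_size)] for _ in range(grid_size)]
    for f in features:
        if not (south <= f["lat"] <= north and west <= f["lon"] <= east):
            continue  # skip features outside the bbox
        row = min(int((north - f["lat"]) / (north - south) * grid_size), grid_size - 1)
        col = min(int((f["lon"] - west) / (east - west) * grid_size), grid_size - 1)
        for tag in f["tags"]:
            if tag in tag_vocab:          # drop tags outside the cleaned vocabulary
                grid[row][col].add(tag_vocab[tag])
    return grid
```

Converting each cell's id set into a multi-hot vector would then give the tensor layout stored in data/pixel_tags/.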

Satellite Imagery

  1. Download satellite tiles

    python download_sat.py
  2. Generate captions using LLaVA

    python get_llava_captions.py
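The internals of download_sat.py are not shown here, but if the tiles come from an XYZ tile server, the standard slippy-map arithmetic converts a WGS84 coordinate into tile indices. A minimal sketch (the function name is illustrative):

```python
import math

def deg2num(lat_deg, lon_deg, zoom):
    """Convert WGS84 lat/lon to slippy-map (x, y) tile indices at a zoom level."""
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom  # tiles per axis at this zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y
```

For example, `deg2num(47.3769, 8.5417, 10)` yields the zoom-10 tile covering central Zurich.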

Output Structure

  • data/metadata/: Tag vocabularies, point metadata, coverage statistics
  • data/pixel_tags/: Rasterized tag grids as PyTorch tensors
  • data/tag_embeddings/: Precomputed text embeddings
  • data/sat_images/: Satellite imagery tiles
  • data/logs/: Processing logs