This project is a web scraping tool that automatically collects real estate data from Otodom, a popular property listing platform. The scraper extracts property listings and their associated data, which are then stored in a PostgreSQL database for further analysis.
The project currently scrapes only apartment sale listings in a specific city.
The database is designed to store apartment listings data, price history, photos, and extracted features. It consists of the following tables:
- locations – stores unique location details (city, district, street, etc.)
- apartments_sale_listings – main table for apartment data
- price_history – stores historical price changes
- photos – stores binary image data (BYTEA type) related to listings
- features – extracted flat features (e.g. air conditioning, balcony, parking, etc.)
💡 You can preview the structure in db/schema.sql
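To illustrate how the price_history table relates to the main listings table, the sketch below appends a history entry only when a listing's price differs from the last recorded one. The field names (`listing_id`, `price`, `recorded_at`) and the helper `record_price` are assumptions made for illustration, as is the "store only on change" rule; the actual columns are defined in db/schema.sql.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PriceRecord:
    """One hypothetical row of price_history (column names are assumed)."""
    listing_id: int
    price: int
    recorded_at: datetime


def record_price(history: list[PriceRecord], listing_id: int, price: int) -> bool:
    """Append a record if the price changed; return True when a row was added."""
    previous = [r for r in history if r.listing_id == listing_id]
    if previous and previous[-1].price == price:
        return False  # price unchanged, nothing to store
    history.append(PriceRecord(listing_id, price, datetime.now(timezone.utc)))
    return True
```

In the real project this logic would run against the database rather than an in-memory list.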
The easiest way to run the project is with Docker — no need to install PostgreSQL manually.
Requirements: Docker Desktop installed and running.
1. Create your .env file (copy from the example and fill in your password)
2. Build and run:
docker compose up --build
This will:
- start a PostgreSQL container (otodom_db)
- build the scraper image and run it once
- the scraper exits after finishing, with no background processes left running
3. To run the scraper again (database keeps its data between runs):
docker compose up
4. To stop and remove containers:
docker compose down
💡 Database data is stored in a Docker volume (otodom_pgdata) and persists between runs. To wipe the data completely:
docker compose down -v
If you prefer to run without Docker, you need PostgreSQL installed and a database created:
psql -U postgres
CREATE DATABASE apartments_for_sale_otodom;
Then set up your .env with DB_HOST=localhost and run:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 main.py
This project can also be run automatically once per day using GitHub Actions.
Required repository secrets:
- DB_HOST
- DB_PORT
- DB_NAME
- DB_USER
- DB_PASSWORD
- DB_SSLMODE
The database should be an external PostgreSQL instance, for example Neon.
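As a rough illustration of how these settings might be consumed, the sketch below reads the six variables and builds keyword arguments using standard libpq parameter names (`host`, `port`, `dbname`, `user`, `password`, `sslmode`). The function name `load_db_config` and the fallback to `prefer` when DB_SSLMODE is unset are assumptions, not the project's actual code.

```python
import os

# Variables that must be present; DB_SSLMODE is treated as optional in this sketch.
REQUIRED = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")


def load_db_config(env=os.environ) -> dict:
    """Build connection keyword arguments from environment variables."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing database settings: " + ", ".join(missing))
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        # Assumption: fall back to libpq's default mode when DB_SSLMODE is unset.
        "sslmode": env.get("DB_SSLMODE", "prefer"),
    }
```

A dictionary in this shape can be passed to a PostgreSQL driver that accepts libpq keywords. Hosted providers commonly require `sslmode=require`, which is why the variable is listed among the secrets.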
The scheduled workflow is defined in:
.github/workflows/daily-scraper.yml
It can also be triggered manually from:
GitHub → Actions → Daily Otodom Scraper → Run workflow
Application logs are uploaded as GitHub Actions artifacts after each run.
The scraper supports two modes, controlled via the SCRAPE_MODE environment variable:
- full (default): complete synchronization. Fetches all search result pages, scrapes full listing data, and checks for deleted offers. Runs daily at 5:17 AM.
docker compose run scraper_full
- latest: lightweight mode. Fetches only the first page(s) of results (newest first) and stops early once it finds an offer already in the database. Useful for catching new listings quickly. Runs hourly.
docker compose run scraper_latest
Set SCRAPE_MODE=latest in .env (or pass it as an environment variable) to use lightweight mode. Optionally set LATEST_MAX_PAGES to control how many pages to check before stopping (default: 1).
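The early-stop behaviour of the latest mode can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `read_mode` and `collect_new_offers` are hypothetical names, offers are represented by plain ID strings, and each page is a list of IDs ordered newest first.

```python
import os
from itertools import islice
from typing import Iterable


def read_mode(env=os.environ) -> tuple[str, int]:
    """Return (mode, max_pages) using the defaults described above."""
    mode = env.get("SCRAPE_MODE", "full").strip().lower()
    if mode not in ("full", "latest"):
        raise ValueError(f"Unsupported SCRAPE_MODE: {mode!r}")
    max_pages = int(env.get("LATEST_MAX_PAGES", "1"))
    return mode, max_pages


def collect_new_offers(pages: Iterable[list[str]], known_ids: set[str], max_pages: int) -> list[str]:
    """Walk at most max_pages result pages and stop at the first known offer."""
    new_ids: list[str] = []
    # islice stops after max_pages without requesting any further page.
    for offer_ids in islice(pages, max_pages):
        for offer_id in offer_ids:
            if offer_id in known_ids:
                return new_ids  # everything older is assumed to be stored already
            new_ids.append(offer_id)
    return new_ids
```

Because results are sorted newest first, hitting a known offer suggests that the remaining ones were seen in an earlier run, which keeps the hourly job cheap.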
All configuration is done via the .env file in the project root. See .env.example for the required variables.
💡 When running for the first time, the necessary tables will be automatically created if they don't already exist.
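First-run table creation of this kind is usually done with idempotent DDL. The sketch below assumes `CREATE TABLE IF NOT EXISTS` statements and uses Python's built-in sqlite3 purely as a stand-in so that it runs without a server; the project itself targets PostgreSQL. The columns shown are hypothetical and simplified; the real definitions are in db/schema.sql.

```python
import sqlite3

# Hypothetical, simplified DDL; the real schema lives in db/schema.sql.
DDL = """
CREATE TABLE IF NOT EXISTS locations (
    id INTEGER PRIMARY KEY,
    city TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS apartments_sale_listings (
    id INTEGER PRIMARY KEY,
    location_id INTEGER REFERENCES locations(id),
    price INTEGER
);
"""


def ensure_tables(conn: sqlite3.Connection) -> None:
    """Create tables when missing; safe to call on every start."""
    conn.executescript(DDL)
    conn.commit()


conn = sqlite3.connect(":memory:")
ensure_tables(conn)
ensure_tables(conn)  # second call changes nothing, existing tables are kept
```

Calling the function on every start is harmless, so no separate migration step is needed for a fresh database.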
The project has a unit test suite covering the repository and normalization layers. Tests use pytest and mock the database connection — no real DB needed to run them.
source venv/bin/activate
pytest tests/
🚧 Planned: remaining unit and integration tests for the scraping and service layers.
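The mocked-connection approach used by the existing tests can be sketched along these lines. The repository function `listing_exists`, the `offer_id` column, and the test names are hypothetical; only `unittest.mock.MagicMock` from the standard library is relied on.

```python
from unittest.mock import MagicMock


def listing_exists(conn, offer_id: str) -> bool:
    """Hypothetical repository function: check whether an offer is stored."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM apartments_sale_listings WHERE offer_id = %s",
            (offer_id,),
        )
        return cur.fetchone() is not None


def test_listing_exists_returns_true_when_row_found():
    conn = MagicMock()
    # conn.cursor() is used as a context manager, so configure __enter__.
    cursor = conn.cursor.return_value.__enter__.return_value
    cursor.fetchone.return_value = (1,)

    assert listing_exists(conn, "abc123") is True
    cursor.execute.assert_called_once()


def test_listing_exists_returns_false_when_no_row():
    conn = MagicMock()
    conn.cursor.return_value.__enter__.return_value.fetchone.return_value = None

    assert listing_exists(conn, "abc123") is False
```

pytest collects functions prefixed with `test_` automatically, so a file like this runs without a database or network access.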
🚧 Planned: contract tests for the scraping layer — to detect if Otodom changes their page structure (missing fields, changed JSON keys, etc.), the kind of breakage that currently only surfaces at runtime.
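A contract test of this kind could start as a simple check that a parsed listing still contains the keys the scraper depends on. The key names below are hypothetical placeholders, not Otodom's actual JSON structure, and `missing_keys` is an illustrative helper.

```python
REQUIRED_KEYS = {"id", "title", "price", "location"}  # hypothetical field names


def missing_keys(listing: dict, required: set[str] = REQUIRED_KEYS) -> set[str]:
    """Return the required keys that are absent from a parsed listing payload."""
    return required - set(listing)


def test_listing_payload_has_required_keys():
    # In a real contract test this payload would come from a saved or freshly fetched page.
    sample = {"id": 1, "title": "Flat", "price": 450000, "location": {"city": "X"}}
    assert missing_keys(sample) == set()
```

Reporting the missing keys by name makes it easier to see which part of the page structure changed.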
Database operations are logged using Python's logging module. Logs are saved to the logs/ directory, and logging behaviour can be adjusted via config/logging_config.py. If you are using option 3 (GitHub Actions), the logs are stored as GitHub Actions artifacts after each run.
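For orientation, a file-based logging setup of this kind might look like the sketch below. It is not the contents of config/logging_config.py: the function name `setup_logging`, the logger name, the file name scraper.log, and the message format are all assumptions.

```python
import logging
from pathlib import Path


def setup_logging(log_dir: str = "logs", level: int = logging.INFO) -> logging.Logger:
    """Configure a logger that writes to <log_dir>/scraper.log (file name assumed)."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("otodom_scraper")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / "scraper.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A call such as `setup_logging().info("inserted 3 listings")` would then append a timestamped line to the log file.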
