moneysense-data

moneysense data is a local, production-style data platform for collecting, storing, and analyzing historical market data from Polymarket, with a strong focus on historical backtesting of trading bots.

The core goal of this project is to build a reliable data foundation for researching, testing, and iterating on algorithmic trading strategies in prediction markets.

Why this exists

Building trading bots without high-quality historical data is pointless, and reliable backtesting data for Polymarket markets is hard to find.

Polymarket does not provide clean, replayable datasets out of the box, especially for high-frequency signals like price changes, spreads, and short-term market dynamics. Its historical data endpoints return only the final price of each candle, which is inadequate for backtesting: price movements within a candle can be significant.

This platform solves that by:

  • Recording live market price change events
  • Storing them in queryable, compressed formats (ClickHouse and Parquet)
  • Making it easy to replay history and run backtests

If you are experimenting with trading bots on Polymarket, this project is meant to be the data layer underneath them.


High-level architecture

[moneysense architecture diagram]

Flow:

  1. A Go-based scraper streams live Polymarket market events
  2. Events are published to Kafka-compatible infrastructure (Redpanda)
  3. The same stream is consumed by:
    • a cold-path archivist writing Parquet files to object storage (MinIO) for long-term storage and backtesting
    • (optional) a hot-path consumer writing to ClickHouse for real-time analytics and monitoring
  4. Data is used for trading bot backtesting and analytics

Note: For backtesting, you can read Parquet files directly from MinIO using DuckDB or Polars. The consumer service is optional and mainly useful for real-time monitoring and analytics.

Service Profiles:

  • --profile cold: Scraper + Archivist (Parquet only, best for backtesting)
  • --profile hot: Scraper + Consumer (ClickHouse only, best for real-time analytics)
  • --profile apps: Scraper + Consumer + Archivist (full platform)

Running locally

Start infrastructure:

The platform uses Docker Compose profiles to let you choose which services to run:

Option 1: Cold path only (Parquet archiving for backtesting)

docker compose --profile cold up -d --build

Starts: scraper + archivist → writes Parquet files to MinIO

Option 2: Hot path only (Real-time analytics)

docker compose --profile hot up -d --build

Starts: scraper + consumer → writes to ClickHouse for real-time queries

Option 3: Both paths (Full platform)

docker compose --profile apps up -d --build

Starts: scraper + consumer + archivist → both ClickHouse and Parquet

Option 4: With monitoring tools

docker compose --profile apps --profile monitoring up -d --build

Adds Redpanda Console (web UI) and Monitor (terminal dashboard)

You can modify the configuration in the docker-compose.yml file.

FAQ

Where can I find more information about the infrastructure?

You can find more information about each service in its README file in the apps directory.

apps/scraper/README.md
apps/consumer/README.md
apps/archivist/README.md

How to configure markets to scrape?

You can configure the markets to scrape by modifying the MARKETS environment variable in the docker-compose.yml file. For example:

scraper:
    build:
      context: ./apps/scraper
      dockerfile: Dockerfile
    container_name: moneysense-scraper
    profiles: ["apps"]
    depends_on:
      redpanda:
        condition: service_healthy
    volumes:
      - ./recordings:/data
    environment:
      SINK_MODE: kafka
      KAFKA_BROKERS: redpanda:9092
      KAFKA_TOPIC: moneysense.events.raw
      MARKETS: sol-updown-15m,btc-updown-15m
      DISCORD_WEBHOOK_URL: 
      LOG_FILE: logs/scraper.log

The MARKETS environment variable is a comma-separated list of market slugs to scrape, without the timestamp suffix.

How is data partitioned?

Parquet files in MinIO are partitioned by date=YYYY-MM-DD/market=.../hour=HH/ where:

  • Date and hour are in UTC - All timestamps (source_ts) are stored in UTC, and partitioning uses UTC timezone
  • This ensures consistent partitioning regardless of your local timezone
  • Example: An event at 2024-01-15T14:30:00Z will be stored in date=2024-01-15/market=.../hour=14/
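The partitioning rule above can be sketched as a small helper. This is an illustrative reconstruction (the function name `partition_prefix` is not part of the platform), showing how an event's UTC timestamp maps to its MinIO prefix:

```python
from datetime import datetime, timezone

def partition_prefix(source_ts: datetime, market: str) -> str:
    """Compute the UTC-based partition prefix for an event."""
    # Normalize to UTC so partitioning is independent of local timezone
    ts = source_ts.astimezone(timezone.utc)
    return f"date={ts:%Y-%m-%d}/market={market}/hour={ts:%H}/"

# An event at 2024-01-15T14:30:00Z lands in hour=14
print(partition_prefix(
    datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc),
    "sol-updown-15m",
))
# date=2024-01-15/market=sol-updown-15m/hour=14/
```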

How to access the data?

Accessing Cold Data (Parquet files in MinIO)

Parquet files are stored in MinIO with partitioning: date=YYYY-MM-DD/market=.../hour=HH/

Using DuckDB (Recommended for backtesting):

import duckdb

conn = duckdb.connect()

# Configure S3 access to MinIO
conn.execute("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_endpoint='localhost:9000';
    SET s3_access_key_id='minioadmin';
    SET s3_secret_access_key='minioadmin_change_me';
    SET s3_use_ssl=false;
    SET s3_url_style='path';  -- MinIO needs path-style URLs, not virtual-host style
""")

# Read Parquet files directly from MinIO
df = conn.execute("""
    SELECT * 
    FROM read_parquet('s3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=14/*.parquet')
    WHERE event_type = 'price_change'
    ORDER BY source_ts
""").df()

# Or query multiple partitions
df = conn.execute("""
    SELECT * 
    FROM read_parquet([
        's3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=14/*.parquet',
        's3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=15/*.parquet'
    ])
""").df()

Using Polars:

import polars as pl

# Read Parquet files from MinIO
df = pl.read_parquet(
    "s3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=14/*.parquet",
    storage_options={
        "endpoint_url": "http://localhost:9000",
        "access_key_id": "minioadmin",
        "secret_access_key": "minioadmin_change_me",
        "aws_allow_http": "true"
    }
)

# Filter and process
price_changes = df.filter(pl.col("event_type") == "price_change")
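Once events are loaded, a backtest is essentially a timestamp-ordered replay of `price_change` events into a strategy callback. The sketch below uses plain dicts with assumed field names (`source_ts`, `event_type`, `price`) standing in for the rows read from Parquet; it is not the platform's API, just the replay pattern:

```python
# Hypothetical event rows as they might come out of a Parquet partition
events = [
    {"source_ts": 1, "event_type": "price_change", "price": 0.52},
    {"source_ts": 3, "event_type": "price_change", "price": 0.55},
    {"source_ts": 2, "event_type": "book_update", "price": None},
]

def replay(events, on_price):
    """Feed price_change events to a strategy callback in timestamp order."""
    for ev in sorted(events, key=lambda e: e["source_ts"]):
        if ev["event_type"] == "price_change":
            on_price(ev)

prices = []
replay(events, lambda ev: prices.append(ev["price"]))
print(prices)  # [0.52, 0.55]
```

Sorting by `source_ts` before replay matters because files within an hour partition are not guaranteed to be read in event order.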

Using ClickHouse S3 Tables:

-- Create external table pointing to MinIO
CREATE TABLE events_parquet_s3
ENGINE = S3(
    'http://minio:9000/lakehouse/date=*/market=*/hour=*/*.parquet',
    'Parquet',
    'minioadmin',
    'minioadmin_change_me'
);

-- Query the data
SELECT * FROM events_parquet_s3 
WHERE market = 'sol-updown-15m' 
  AND toDate(source_ts) = '2024-01-15'
LIMIT 100;

Accessing Hot Data (ClickHouse - if consumer is running)

# Connect to ClickHouse
docker exec -it moneysense-clickhouse clickhouse-client

-- Then, inside clickhouse-client, query recent events:
SELECT * FROM polymarket.events_raw
ORDER BY ingested_ts DESC
LIMIT 100;

-- Aggregate by market
SELECT market, count() AS events,
       countIf(event_type = 'price_change') AS price_changes
FROM polymarket.events_raw
GROUP BY market;
