moneysense-data

moneysense data is a local, production-style data platform for collecting, storing, and analyzing historical market data from Polymarket, with a strong focus on historical backtesting of trading bots.

The core goal of this project is to build a reliable data foundation for researching, testing, and iterating on algorithmic trading strategies in prediction markets.

Why this exists

Building trading bots without high-quality historical data is pointless, and reliable backtesting data for Polymarket markets is hard to find.

Polymarket does not provide clean, replayable datasets out of the box, especially for high-frequency signals like price changes, spreads, and short-term market dynamics. Its historical data endpoints return only the final price of each candle, which is inadequate for backtesting: price movements within a candle can be significant.

This platform solves that by:

  • Recording live market price change events
  • Storing them in queryable, compressed formats (ClickHouse and Parquet)
  • Making it easy to replay history and run backtests

If you are experimenting with trading bots on Polymarket, this project is meant to be the data layer underneath them.


High-level architecture

[moneysense architecture diagram]

Flow:

  1. A Go-based scraper streams live Polymarket market events
  2. Events are published to Kafka-compatible infrastructure (Redpanda)
  3. The same stream is consumed by:
    • a cold-path archivist writing Parquet files to object storage (MinIO) for long-term storage and backtesting
    • (optional) a hot-path consumer writing to ClickHouse for real-time analytics and monitoring
  4. Data is used for trading bot backtesting and analytics

Note: For backtesting, you can read Parquet files directly from MinIO using DuckDB or Polars. The consumer service is optional and mainly useful for real-time monitoring and analytics.

Service Profiles:

  • --profile cold: Scraper + Archivist (Parquet only, best for backtesting)
  • --profile hot: Scraper + Consumer (ClickHouse only, best for real-time analytics)
  • --profile apps: Scraper + Consumer + Archivist (full platform)

Running locally

Start infrastructure:

The platform uses Docker Compose profiles to let you choose which services to run:

Option 1: Cold path only (Parquet archiving for backtesting)

docker compose --profile cold up -d --build

Starts: scraper + archivist → writes Parquet files to MinIO

Option 2: Hot path only (Real-time analytics)

docker compose --profile hot up -d --build

Starts: scraper + consumer → writes to ClickHouse for real-time queries

Option 3: Both paths (Full platform)

docker compose --profile apps up -d --build

Starts: scraper + consumer + archivist → both ClickHouse and Parquet

Option 4: With monitoring tools

docker compose --profile apps --profile monitoring up -d --build

Adds Redpanda Console (web UI) and Monitor (terminal dashboard)

You can modify the configuration in the docker-compose.yml file.

FAQ

Where can I find more information about the infrastructure?

You can find more information about each service in its README file in the apps directory.

apps/scraper/README.md
apps/consumer/README.md
apps/archivist/README.md

How to configure markets to scrape?

You can configure the markets to scrape by modifying the MARKETS environment variable in the docker-compose.yml file. For example:

scraper:
    build:
      context: ./apps/scraper
      dockerfile: Dockerfile
    container_name: moneysense-scraper
    profiles: ["apps"]
    depends_on:
      redpanda:
        condition: service_healthy
    volumes:
      - ./recordings:/data
    environment:
      SINK_MODE: kafka
      KAFKA_BROKERS: redpanda:9092
      KAFKA_TOPIC: moneysense.events.raw
      MARKETS: sol-updown-15m,btc-updown-15m
      DISCORD_WEBHOOK_URL: 
      LOG_FILE: logs/scraper.log

The MARKETS environment variable is a comma-separated list of market slugs to scrape, without the timestamp suffix.

How is data partitioned?

Parquet files in MinIO are partitioned by date=YYYY-MM-DD/market=.../hour=HH/ where:

  • Date and hour are in UTC - All timestamps (source_ts) are stored in UTC, and partitioning uses UTC timezone
  • This ensures consistent partitioning regardless of your local timezone
  • Example: An event at 2024-01-15T14:30:00Z will be stored in date=2024-01-15/market=.../hour=14/
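The partitioning rule above can be sketched as a small helper. This is an illustrative reconstruction (the function name `partition_prefix` is not part of the platform), showing how an event's UTC timestamp maps to its MinIO prefix:

```python
from datetime import datetime, timezone

def partition_prefix(source_ts: datetime, market: str) -> str:
    """Compute the UTC-based partition prefix for an event."""
    # Normalize to UTC so partitioning is independent of local timezone
    ts = source_ts.astimezone(timezone.utc)
    return f"date={ts:%Y-%m-%d}/market={market}/hour={ts:%H}/"

# An event at 2024-01-15T14:30:00Z lands in hour=14
print(partition_prefix(
    datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc),
    "sol-updown-15m",
))
# date=2024-01-15/market=sol-updown-15m/hour=14/
```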

How to access the data?

Accessing Cold Data (Parquet files in MinIO)

Parquet files are stored in MinIO with partitioning: date=YYYY-MM-DD/market=.../hour=HH/

Using DuckDB (Recommended for backtesting):

import duckdb

conn = duckdb.connect()

# Configure S3 access to MinIO
conn.execute("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_endpoint='localhost:9000';
    SET s3_access_key_id='minioadmin';
    SET s3_secret_access_key='minioadmin_change_me';
    SET s3_use_ssl=false;
    SET s3_url_style='path';  -- MinIO needs path-style URLs, not virtual-host style
""")

# Read Parquet files directly from MinIO
df = conn.execute("""
    SELECT * 
    FROM read_parquet('s3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=14/*.parquet')
    WHERE event_type = 'price_change'
    ORDER BY source_ts
""").df()

# Or query multiple partitions
df = conn.execute("""
    SELECT * 
    FROM read_parquet([
        's3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=14/*.parquet',
        's3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=15/*.parquet'
    ])
""").df()

Using Polars:

import polars as pl

# Read Parquet files from MinIO
df = pl.read_parquet(
    "s3://lakehouse/date=2024-01-15/market=sol-updown-15m/hour=14/*.parquet",
    storage_options={
        "endpoint_url": "http://localhost:9000",
        "access_key_id": "minioadmin",
        "secret_access_key": "minioadmin_change_me",
        "aws_allow_http": "true"
    }
)

# Filter and process
price_changes = df.filter(pl.col("event_type") == "price_change")
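Once events are loaded, a backtest is essentially a timestamp-ordered replay of `price_change` events into a strategy callback. The sketch below uses plain dicts with assumed field names (`source_ts`, `event_type`, `price`) standing in for the rows read from Parquet; it is not the platform's API, just the replay pattern:

```python
# Hypothetical event rows as they might come out of a Parquet partition
events = [
    {"source_ts": 1, "event_type": "price_change", "price": 0.52},
    {"source_ts": 3, "event_type": "price_change", "price": 0.55},
    {"source_ts": 2, "event_type": "book_update", "price": None},
]

def replay(events, on_price):
    """Feed price_change events to a strategy callback in timestamp order."""
    for ev in sorted(events, key=lambda e: e["source_ts"]):
        if ev["event_type"] == "price_change":
            on_price(ev)

prices = []
replay(events, lambda ev: prices.append(ev["price"]))
print(prices)  # [0.52, 0.55]
```

Sorting by `source_ts` before replay matters because files within an hour partition are not guaranteed to be read in event order.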

Using ClickHouse S3 Tables:

-- Create external table pointing to MinIO
CREATE TABLE events_parquet_s3
ENGINE = S3(
    'http://minio:9000/lakehouse/date=*/market=*/hour=*/*.parquet',
    'Parquet',
    'minioadmin',
    'minioadmin_change_me'
);

-- Query the data
SELECT * FROM events_parquet_s3 
WHERE market = 'sol-updown-15m' 
  AND toDate(source_ts) = '2024-01-15'
LIMIT 100;

Accessing Hot Data (ClickHouse - if consumer is running)

# Connect to ClickHouse
docker exec -it moneysense-clickhouse clickhouse-client

-- Then, inside clickhouse-client, query recent events:
SELECT * FROM polymarket.events_raw
ORDER BY ingested_ts DESC
LIMIT 100;

-- Aggregate by market
SELECT market, count() AS events,
       countIf(event_type = 'price_change') AS price_changes
FROM polymarket.events_raw
GROUP BY market;
