mcap-lancedb

mcap-lancedb turns robotics and autonomous-vehicle logs into Lance datasets. It reads MCAP / ROS bag data, decodes known message schemas, and writes a single wide table that works with Lance and LanceDB.

The practical goal: stop treating robot logs as opaque files that require a separate extract pipeline before they can be queried, curated, embedded, or used for training.

pip install "mcap-lancedb[all,cli]"

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite

from mcap_lancedb import McapSource, ingest

rows = ingest("scene-0061.mcap", "./drives.lance", mode="overwrite")
print(f"wrote {rows:,} rows")

for batch in McapSource("scene-0061.mcap", topics=["/CAM_FRONT"]):
    print(batch.num_rows, batch.schema)

Who This Is For

Robotics / AV data engineers who need a repeatable ingest path from MCAP files into a queryable table.
ML platform engineers who want one storage layout that supports scalar filtering, payload fetches, curation, and training reads.
Researchers who want to inspect drive logs with Arrow / Lance tooling instead of custom bag readers.
Agents and automation that need a small, predictable Python API for ingesting logs, discovering schema coverage, and generating trainable slices.

What You Get

Every MCAP message becomes one row. The output table has stable universal metadata columns, typed sensor struct columns, payload columns for large binary data, and safe fallbacks for unknown or undecodable messages.

| Area | What It Enables |
| --- | --- |
| Universal metadata | Filter by `log_id`, `topic`, `schema_name`, time, sequence, and source file. |
| Typed structs | Query decoded camera, LiDAR, IMU, GNSS, pose, transform, diagnostics, and Foxglove messages without reparsing payloads. |
| Blob payload columns | Keep large images, point clouds, grids, and videos retrievable without bloating metadata scans. |
| `custom` catch-all | Preserve decoded-but-unrouted fields for schemas that do not deserve a first-class column yet. |
| `raw_payload` fallback | Keep unknown bytes and decode errors instead of silently dropping data. |
| Plugin builders | Add proprietary message types without forking the package. |
| PyTorch readers | Benchmark blob-v2 and inline-frame training paths against your hardware. |

Installation

Choose the narrowest extra that matches the MCAPs you need to decode:

pip install mcap-lancedb              # base package, JSON/raw support
pip install "mcap-lancedb[ros2]"      # ROS2 CDR / ros2msg
pip install "mcap-lancedb[ros1]"      # ROS1 bag conversions
pip install "mcap-lancedb[protobuf]"  # Foxglove protobuf MCAPs
pip install "mcap-lancedb[all,cli]"   # common CLI install
pip install "mcap-lancedb[torch]"     # training readers and benchmark helpers

For local development:

git clone https://github.com/lancedb/mcap-lancedb.git
cd mcap-lancedb
uv sync --extra dev --extra docs
uv run pytest -q
uv run ruff check src tests

CLI Quickstart

Write a local Lance dataset:

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite

Append multiple logs into one dataset:

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite --log-id scene-0061
mcap-lancedb ingest scene-0103.mcap ./drives.lance --mode append --log-id scene-0103

Replace a previously ingested log safely:

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode append --replace

Filter ingest to a topic subset:

mcap-lancedb ingest scene-0061.mcap ./camera.lance \
  --mode overwrite \
  --topics /CAM_FRONT,/CAM_BACK

Write to S3:

AWS_REGION=us-east-1 \
mcap-lancedb ingest scene-0061.mcap s3://my-bucket/drives.lance --mode overwrite

Python API

One-Shot Ingest

from mcap_lancedb import ingest

rows = ingest(
    "s3://raw-logs/scene-0061.mcap",
    "s3://robotics-lake/drives.lance",
    mode="append",
    log_id="scene-0061",
    topics=["/CAM_FRONT", "/LIDAR_TOP"],
    storage_options={"region": "us-east-1"},
)

mode is passed through to lance.write_dataset and takes one of:

create — fail if the destination already exists.
overwrite — replace an existing dataset.
append — append rows to an existing dataset.

When appending, mcap-lancedb guards against duplicate log_id values. Pass replace=True to delete existing rows for that log_id before appending.
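
The append guard and replace=True can be pictured with a small in-memory model. This is only a sketch of the semantics described above, with a plain list of dicts standing in for a Lance dataset. append_log and DuplicateLogError are hypothetical names, and raising an error on a duplicate is an assumption about how the guard behaves.

```python
class DuplicateLogError(ValueError):
    """Hypothetical error for an append that would duplicate a log_id."""


def append_log(table, new_rows, log_id, replace=False):
    """Append new_rows for log_id to table (a list of dict rows).

    Models the documented behaviour only; the real writer operates on a
    Lance dataset, not a Python list.
    """
    already_present = any(row["log_id"] == log_id for row in table)
    if already_present and not replace:
        # Assumption: the guard rejects the write rather than skipping it.
        raise DuplicateLogError(f"log_id {log_id!r} already ingested")
    if already_present:
        # replace=True: drop the old rows for this log before appending.
        table[:] = [row for row in table if row["log_id"] != log_id]
    table.extend({"log_id": log_id, **row} for row in new_rows)
    return len(new_rows)
```

Re-ingesting the same log with replace=True leaves exactly one copy of its rows, which is what makes a repeated ingest safe to rerun.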

Streaming MCAP to Arrow

Use McapSource when you want batches without writing them immediately:

from mcap_lancedb import McapSource

source = McapSource(
    "scene-0061.mcap",
    batch_size=256,
    topics=["/CAM_FRONT"],
)

for batch in source.scan_as_stream():
    assert batch.schema == source.schema
    print(batch.num_rows)

McapSource is rescannable. Calling scan_as_stream() opens the MCAP again and emits the same Arrow schema each time, which makes it safe for retrying writers and agent-driven workflows.
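
Rescannability matters mostly for retries. The sketch below uses a hypothetical stand-in class (FakeSource, not part of mcap_lancedb) whose scan_as_stream() starts a fresh pass on every call, and a hypothetical write_with_retry helper that rescans after a failure instead of resuming a half-consumed iterator. Treating OSError as the retryable error is an assumption made for the example.

```python
class FakeSource:
    """Hypothetical stand-in for a rescannable source of batches."""

    def __init__(self, batches):
        self._batches = list(batches)

    def scan_as_stream(self):
        # Every call starts a fresh pass over the same data.
        return iter(self._batches)


def write_with_retry(source, write, attempts=3):
    """Call write(stream) with a fresh stream until it succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return write(source.scan_as_stream())
        except OSError as exc:  # e.g. a transient storage failure
            last_error = exc
    raise last_error
```

Because each attempt gets its own stream, a writer that failed halfway through sees the full data again on the next attempt.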

Output Schema

The default schema has 35 columns:

| Group | Count | Columns |
| --- | --- | --- |
| Universal | 10 | `log_id`, `source_uri`, `log_time_ns`, `publish_time_ns`, `sequence`, `channel_id`, `schema_id`, `topic`, `schema_name`, `schema_fingerprint` |
| Typed structs | 16 | `image`, `compressed_image`, `pointcloud`, `imu`, `navsat`, `radar_returns`, `laserscan`, `compressed_video`, `tf`, `diagnostics`, `radar_tracks`, `image_annotations`, `camera_calibration`, `pose`, `grid`, `scene_update` |
| Payload blobs | 5 | `image_data`, `compressed_image_data`, `pointcloud_data`, `compressed_video_data`, `grid_data` |
| Generic decoded fallback | 1 | `custom` |
| Raw fallback | 3 | `raw_payload`, `raw_encoding`, `decode_error` |
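
Because the universal columns exist on every row, a slice of a drive can be selected with an ordinary filter expression. The helper below is a sketch that only builds the expression string. It assumes log_time_ns holds nanoseconds since the Unix epoch, which the column name suggests but this README does not state, and to_ns and time_window_filter are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns(moment):
    """Convert a timezone-aware datetime to integer nanoseconds since the epoch."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Integer arithmetic avoids float rounding at nanosecond scale.
    return ((moment - _EPOCH) // timedelta(microseconds=1)) * 1000


def time_window_filter(topic, start, end):
    """Build a SQL-style predicate for one topic in the window [start, end)."""
    quoted = topic.replace("'", "''")  # escape single quotes in the literal
    return (
        f"topic = '{quoted}' "
        f"AND log_time_ns >= {to_ns(start)} "
        f"AND log_time_ns < {to_ns(end)}"
    )
```

The resulting string is the kind of predicate Lance accepts through the filter argument of its scan methods, for example dataset.to_table(filter=...).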

The invariants are intentionally simple so that downstream agents can rely on them:

Universal columns are populated for every row.
A known message routes into a typed struct column.
Large payload bytes live in the parallel payload column when one exists.
A decoded-but-unrouted message goes to custom.
An unknown or failed decode preserves bytes in raw_payload and records context in raw_encoding / decode_error.
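
The routing rules above can be sketched as a single function. This illustrates the invariant and is not the package's code: route_message, TYPED_COLUMNS, the decode callable, and the two schema-name mappings are assumptions made for the example, and the separate payload columns are left out for brevity.

```python
# Assumed mapping from schema name to typed column, for illustration only.
TYPED_COLUMNS = {
    "sensor_msgs/msg/Imu": "imu",
    "sensor_msgs/msg/NavSatFix": "navsat",
}


def route_message(meta, payload, encoding, decode):
    """Return one row for one message, following the documented invariant."""
    row = dict(meta)  # universal columns are always populated
    try:
        decoded = decode(payload)
    except Exception as exc:
        # Failed or unknown decode: keep the bytes and record the context.
        row.update(raw_payload=payload, raw_encoding=encoding, decode_error=str(exc))
        return row
    column = TYPED_COLUMNS.get(meta["schema_name"])
    # A known schema goes to its typed struct; anything else goes to custom.
    row[column or "custom"] = decoded
    return row
```

Every message ends up in exactly one of the three destinations, so no input is dropped even when decoding fails.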

See docs/schema.md for the column-by-column contract.

Reading Payload Bytes

Payload columns may be Lance blob columns or ordinary binary columns depending on how the dataset was written. Use fetch_blobs() so callers do not need to know the physical layout:

import lance
from mcap_lancedb import fetch_blobs

dataset = lance.dataset("./drives.lance")
images = fetch_blobs(dataset, "compressed_image_data", [0, 10, 42])

fetch_blobs() sorts and deduplicates row offsets internally, then returns bytes in the caller's original order.
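
The sort, deduplicate, and restore-order step can be sketched in a few lines. This is a generic illustration of the technique rather than the body of fetch_blobs(); fetch_in_caller_order and the read_sorted callable are hypothetical.

```python
def fetch_in_caller_order(read_sorted, offsets):
    """Fetch values for offsets, reading each unique offset once in sorted order.

    read_sorted is a hypothetical callable that takes ascending, unique row
    offsets and returns their values in the same order.
    """
    unique_sorted = sorted(set(offsets))
    values = read_sorted(unique_sorted)
    by_offset = dict(zip(unique_sorted, values))
    # Re-expand to the caller's order, including any repeated offsets.
    return [by_offset[offset] for offset in offsets]
```

Sorting lets the storage layer read in ascending order and touch each row once, while the final list comprehension hides that detail from the caller.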

Training Reads and Benchmarks

The package includes two training-oriented paths:

WideTableDataset reads existing blob payload columns lazily.
PermutationFrameDataset reads an inline binary train_frame column through lancedb.permutation.Permutation, matching the high-throughput LeRobot-style layout.

Run the benchmark on synthetic data:

mcap-lancedb benchmark --rows 2000 --batch 32 --device auto

Or against your own lake:

mcap-lancedb benchmark --lake s3://bucket/lake --table drives --json report.json

Representative CPU numbers from the development benchmark:

| Layout | Fetch-only | End-to-end CPU |
| --- | --- | --- |
| Blob-v2 per-row | 1,370 rows/s | 571 rows/s |
| Blob-v2 batched | 14,140 rows/s | 923 rows/s |
| Inline + `Permutation` | 15,980 rows/s | 921 rows/s |

The headline is not "always inline everything." It is: batch your blob fetches, measure on your hardware, and materialize inline training frames only when the fetch path is your bottleneck.
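
One rough way to read the table is to split each end-to-end figure into fetch time and everything else. The sketch below does that arithmetic under a simplifying assumption that fetching and the remaining work (decode, collation) run serially with no overlap, which the benchmark itself may not satisfy. The names are illustrative.

```python
# Throughputs (rows/s) copied from the table above: (fetch-only, end-to-end CPU).
LAYOUTS = {
    "blob-v2 per-row": (1_370, 571),
    "blob-v2 batched": (14_140, 923),
    "inline + Permutation": (15_980, 921),
}


def non_fetch_ms_per_row(fetch_rows_per_s, end_to_end_rows_per_s):
    """Estimate per-row time spent outside the fetch path, in milliseconds.

    Assumes fetch and the remaining work are serial, so their per-row
    times add up to the end-to-end per-row time.
    """
    return 1000.0 / end_to_end_rows_per_s - 1000.0 / fetch_rows_per_s


for name, (fetch, end_to_end) in LAYOUTS.items():
    estimate = non_fetch_ms_per_row(fetch, end_to_end)
    print(f"{name}: {estimate:.2f} ms/row outside fetch")
```

Under that assumption all three layouts spend roughly 1 ms per row outside the fetch path, so once fetches are batched the remaining work dominates and a faster fetch path alone changes little.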

See docs/training.md.

Custom Message Types

You can extend ingest with a Python package that registers entry points under mcap_lancedb.builders.

[project.entry-points."mcap_lancedb.builders"]
acme = "acme_lance_plugin:builders"

The entry point returns one or more BuilderPlugin objects. Each plugin can add a new typed struct column, add an optional payload column, and route exact schema names or structural matches into that column.
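
Entry-point groups are a standard Python packaging mechanism, so installed plugins can be listed with the standard library alone. The sketch below shows one way to enumerate the mcap_lancedb.builders group. It is not necessarily how mcap-lancedb performs discovery, and list_builder_entry_points is a hypothetical helper.

```python
from importlib.metadata import entry_points


def list_builder_entry_points(group="mcap_lancedb.builders"):
    """Return (name, target) pairs for every installed entry point in group."""
    found = entry_points()
    if hasattr(found, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = found.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = found.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


for name, target in list_builder_entry_points():
    print(f"{name} -> {target}")
```

With no plugins installed the loop prints nothing, which makes this a quick check that a plugin package was actually picked up by the environment.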

See examples/acme_plugin/ and docs/plugins.md.

Agent Guide

Coding agents should start with:

AGENTS.md — repository operating instructions.
docs/agent-guide.md — task-oriented map for modifying ingest, schema, docs, tests, plugins, and release automation.
tests/test_package_smoke.py — the smallest end-to-end fixture that does not require external data.

Important invariants for agents:

Keep mcap_lancedb independent of demo packages such as mcap_lancedb_demo and curate_lancedb.
Keep optional dependencies lazy. Import ROS/protobuf/torch/CLI dependencies on the paths that need them, not at package import time.
Update docs when the public schema, CLI, extras, or plugin interface changes.
Run uv run ruff check src tests and uv run pytest -q before handing off.
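
The lazy-dependency rule can be sketched as a small import helper that runs on the code path that needs the dependency. require_optional is a hypothetical name, and the error wording is illustrative rather than the package's actual message.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency on first use, with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name} is required for this code path; "
            f'install it with: pip install "mcap-lancedb[{extra}]"'
        ) from exc
```

A ROS2 decoder would call require_optional("mcap_ros2", "ros2") inside the decode function, so importing the base package never touches the optional module.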

Repository Layout

src/mcap_lancedb/
  source.py              # MCAP -> Arrow RecordBatch stream
  ingest.py              # MCAP -> Lance writer
  schema.py              # 35-column public wide schema
  decoders.py            # ROS1 / ROS2 / protobuf / JSON decoder dispatch
  messages/              # built-in message builders and structural matchers
  torch/                 # PyTorch dataset helpers
  benchmark.py           # train-loader benchmark
  _cli.py                # Typer CLI

examples/acme_plugin/    # plugin-builder example
docs/                    # human and agent documentation
tests/                   # standalone package smoke tests

Development

uv sync --extra dev --extra docs
uv run ruff check src tests
uv run pytest -q
uv run mkdocs build --strict
uv build
uv run twine check dist/*

Release dry run:

uv run scripts/release.sh --build-only

Troubleshooting

ModuleNotFoundError: mcap_ros2 or mcap_protobuf

Install the right decoder extra, for example pip install "mcap-lancedb[ros2]" or pip install "mcap-lancedb[all]".

A message lands in raw_payload

Check schema_name, raw_encoding, and decode_error on that row to see why decoding failed. A message that decoded but did not match a built-in typed column is routed to custom rather than raw_payload, so enable or inspect custom in that case.

S3 reads fail

Configure standard AWS credentials, then pass a region through storage_options from Python, or set AWS_REGION (or the equivalent cloud environment variables) when using the CLI.

Training throughput is lower than expected

Run mcap-lancedb benchmark on the target machine. CPU image decode can hide fetch-path wins; GPU decode and batched reads change the profile.

Status

mcap-lancedb is early but usable. The schema is designed to be stable, but new typed columns may be added as more robotics message families are supported.

License

Apache 2.0. See LICENSE.
