mcap-lancedb

mcap-lancedb turns robotics and autonomous-vehicle logs into Lance datasets. It reads MCAP / ROS bag data, decodes known message schemas, and writes a single wide table that works with Lance and LanceDB.

The practical goal: stop treating robot logs as opaque files that require a separate extract pipeline before they can be queried, curated, embedded, or used for training.

pip install "mcap-lancedb[all,cli]"

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite

from mcap_lancedb import McapSource, ingest

rows = ingest("scene-0061.mcap", "./drives.lance", mode="overwrite")
print(f"wrote {rows:,} rows")

for batch in McapSource("scene-0061.mcap", topics=["/CAM_FRONT"]):
    print(batch.num_rows, batch.schema)

Who This Is For

  • Robotics / AV data engineers who need a repeatable ingest path from MCAP files into a queryable table.
  • ML platform engineers who want one storage layout that supports scalar filtering, payload fetches, curation, and training reads.
  • Researchers who want to inspect drive logs with Arrow / Lance tooling instead of custom bag readers.
  • Agents and automation that need a small, predictable Python API for ingesting logs, discovering schema coverage, and generating trainable slices.

What You Get

Every MCAP message becomes one row. The output table has stable universal metadata columns, typed sensor struct columns, payload columns for large binary data, and safe fallbacks for unknown or undecodable messages.

| Area | What It Enables |
| --- | --- |
| Universal metadata | Filter by log_id, topic, schema_name, time, sequence, and source file. |
| Typed structs | Query decoded camera, LiDAR, IMU, GNSS, pose, transform, diagnostics, and Foxglove messages without reparsing payloads. |
| Blob payload columns | Keep large images, point clouds, grids, and videos retrievable without bloating metadata scans. |
| custom catch-all | Preserve decoded-but-unrouted fields for schemas that do not deserve a first-class column yet. |
| raw_payload fallback | Keep unknown bytes and decode errors instead of silently dropping data. |
| Plugin builders | Add proprietary message types without forking the package. |
| PyTorch readers | Benchmark blob-v2 and inline-frame training paths against your hardware. |

Installation

Choose the narrowest extra that matches the MCAPs you need to decode:

pip install mcap-lancedb              # base package, JSON/raw support
pip install "mcap-lancedb[ros2]"      # ROS2 CDR / ros2msg
pip install "mcap-lancedb[ros1]"      # ROS1 bag conversions
pip install "mcap-lancedb[protobuf]"  # Foxglove protobuf MCAPs
pip install "mcap-lancedb[all,cli]"   # common CLI install
pip install "mcap-lancedb[torch]"     # training readers and benchmark helpers

For local development:

git clone https://github.com/lancedb/mcap-lancedb.git
cd mcap-lancedb
uv sync --extra dev --extra docs
uv run pytest -q
uv run ruff check src tests

CLI Quickstart

Write a local Lance dataset:

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite

Append multiple logs into one dataset:

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite --log-id scene-0061
mcap-lancedb ingest scene-0103.mcap ./drives.lance --mode append --log-id scene-0103

Replace a previously ingested log safely:

mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode append --replace

Filter ingest to a topic subset:

mcap-lancedb ingest scene-0061.mcap ./camera.lance \
  --mode overwrite \
  --topics /CAM_FRONT,/CAM_BACK

Write to S3:

AWS_REGION=us-east-1 \
mcap-lancedb ingest scene-0061.mcap s3://my-bucket/drives.lance --mode overwrite

Python API

One-Shot Ingest

from mcap_lancedb import ingest

rows = ingest(
    "s3://raw-logs/scene-0061.mcap",
    "s3://robotics-lake/drives.lance",
    mode="append",
    log_id="scene-0061",
    topics=["/CAM_FRONT", "/LIDAR_TOP"],
    storage_options={"region": "us-east-1"},
)

mode is passed to lance.write_dataset and is usually one of:

  • create — fail if the destination already exists.
  • overwrite — replace an existing dataset.
  • append — append rows to an existing dataset.

When appending, mcap-lancedb guards against duplicate log_id values. Pass replace=True to delete existing rows for that log_id before appending.
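The append guard can be pictured with a small in-memory model. This is only a sketch of the semantics described above, not the package's implementation: `append_log`, the list-of-dicts "table", and the choice of `ValueError` are all hypothetical.

```python
def append_log(table, new_rows, log_id, replace=False):
    """Append new_rows for log_id to table (a list of dicts), guarding duplicates."""
    already_present = any(row["log_id"] == log_id for row in table)
    if already_present and not replace:
        raise ValueError(f"log_id {log_id!r} already exists; pass replace=True")
    if already_present:
        # Delete the old rows for this log before appending the new ones.
        table = [row for row in table if row["log_id"] != log_id]
    return table + [dict(row, log_id=log_id) for row in new_rows]


table = append_log([], [{"topic": "/CAM_FRONT"}], "scene-0061")
table = append_log(table, [{"topic": "/CAM_FRONT"}], "scene-0103")
# Re-ingesting scene-0061 only succeeds because replace=True is passed.
table = append_log(table, [{"topic": "/LIDAR_TOP"}], "scene-0061", replace=True)
```

After the last call the model holds one row per log, and the scene-0061 row is the newly ingested one rather than a duplicate.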

Streaming MCAP to Arrow

Use McapSource when you want batches without writing them immediately:

from mcap_lancedb import McapSource

source = McapSource(
    "scene-0061.mcap",
    batch_size=256,
    topics=["/CAM_FRONT"],
)

for batch in source.scan_as_stream():
    assert batch.schema == source.schema
    print(batch.num_rows)

McapSource is rescannable. Calling scan_as_stream() opens the MCAP again and emits the same Arrow schema each time, which makes it safe for retrying writers and agent-driven workflows.
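Rescannability is what makes a retry loop safe: a failed write can start over from a fresh stream instead of resuming a half-consumed iterator. A minimal sketch, in which `FakeSource`, `write_with_retry`, and the flaky writer are hypothetical stand-ins; only the `scan_as_stream()` method name comes from the text above.

```python
class FakeSource:
    """Stand-in for a rescannable source: every scan starts from the beginning."""

    def __init__(self, batches):
        self._batches = list(batches)

    def scan_as_stream(self):
        # A new iterator per call, so earlier partial reads do not matter.
        return iter(self._batches)


def write_with_retry(source, write, attempts=3):
    """Call write(stream) with a fresh stream until it succeeds."""
    for attempt in range(1, attempts + 1):
        try:
            return write(source.scan_as_stream())
        except OSError:
            if attempt == attempts:
                raise


calls = {"n": 0}


def flaky_write(stream):
    """Fail partway through the first attempt, succeed afterwards."""
    calls["n"] += 1
    consumed = []
    for batch in stream:
        consumed.append(batch)
        if calls["n"] == 1 and len(consumed) == 2:
            raise OSError("simulated transient failure")
    return consumed


written = write_with_retry(FakeSource(["b0", "b1", "b2"]), flaky_write)
```

The second attempt sees all three batches again, even though the first attempt had already consumed two of them.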

Output Schema

The default schema has 35 columns:

| Group | Count | Columns |
| --- | --- | --- |
| Universal | 10 | log_id, source_uri, log_time_ns, publish_time_ns, sequence, channel_id, schema_id, topic, schema_name, schema_fingerprint |
| Typed structs | 16 | image, compressed_image, pointcloud, imu, navsat, radar_returns, laserscan, compressed_video, tf, diagnostics, radar_tracks, image_annotations, camera_calibration, pose, grid, scene_update |
| Payload blobs | 5 | image_data, compressed_image_data, pointcloud_data, compressed_video_data, grid_data |
| Generic decoded fallback | 1 | custom |
| Raw fallback | 3 | raw_payload, raw_encoding, decode_error |

The invariant is intentionally simple for downstream agents:

  1. Universal columns are populated for every row.
  2. A known message routes into a typed struct column.
  3. Large payload bytes live in the parallel payload column when one exists.
  4. A decoded-but-unrouted message goes to custom.
  5. An unknown or failed decode preserves bytes in raw_payload and records context in raw_encoding / decode_error.

See docs/schema.md for the column-by-column contract.
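The five rules can be read as a single routing decision per message. The sketch below is one way to express it. `route`, the `TYPED_COLUMNS` and `PAYLOAD_COLUMNS` mappings, and the pairing of the two ROS 2 schema names with columns are illustrative assumptions, not the package's routing table. Universal columns (rule 1) are left out because they are set on every row regardless.

```python
TYPED_COLUMNS = {  # schema_name -> typed struct column (illustrative subset)
    "sensor_msgs/msg/CompressedImage": "compressed_image",
    "sensor_msgs/msg/Imu": "imu",
}
PAYLOAD_COLUMNS = {"compressed_image": "compressed_image_data"}


def route(schema_name, decoded, payload=None, raw=b"", encoding=None, error=None):
    """Return the non-universal columns one message would populate.

    decoded is None when decoding failed or no decoder was available.
    """
    if decoded is None:
        # Rule 5: keep the bytes and record why decoding did not happen.
        return {"raw_payload": raw, "raw_encoding": encoding, "decode_error": error}
    column = TYPED_COLUMNS.get(schema_name)
    if column is None:
        # Rule 4: decoded, but no typed column claims this schema.
        return {"custom": decoded}
    row = {column: decoded}  # Rule 2: known message, typed struct column.
    payload_column = PAYLOAD_COLUMNS.get(column)
    if payload_column is not None and payload is not None:
        row[payload_column] = payload  # Rule 3: large bytes live alongside.
    return row
```

A downstream consumer can rely on the same ordering: check the typed column first, then custom, then the raw fallback columns.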

Reading Payload Bytes

Payload columns may be Lance blob columns or ordinary binary columns depending on how the dataset was written. Use fetch_blobs() so callers do not need to know the physical layout:

import lance
from mcap_lancedb import fetch_blobs

dataset = lance.dataset("./drives.lance")
images = fetch_blobs(dataset, "compressed_image_data", [0, 10, 42])

fetch_blobs() sorts and deduplicates row offsets internally, then returns bytes in the caller's original order.
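The sort, deduplicate, and restore-order pattern is easy to get subtly wrong, so here is a minimal sketch of the general technique. It is not the body of `fetch_blobs()`: `fetch_in_caller_order` and `fake_read` are hypothetical names, and the fake reader stands in for a storage call that wants sorted, unique offsets.

```python
def fetch_in_caller_order(read_sorted, offsets):
    """Fetch one value per requested offset, preserving the caller's order."""
    unique_sorted = sorted(set(offsets))
    values = read_sorted(unique_sorted)  # one value per unique offset, same order
    by_offset = dict(zip(unique_sorted, values))
    # Re-expand to the original order, including any repeated offsets.
    return [by_offset[offset] for offset in offsets]


def fake_read(sorted_offsets):
    # Stand-in for a storage read that requires sorted, unique offsets.
    return [f"blob-{offset}".encode() for offset in sorted_offsets]


result = fetch_in_caller_order(fake_read, [42, 0, 10, 42])
# result == [b"blob-42", b"blob-0", b"blob-10", b"blob-42"]
```

The storage layer sees each offset once and in ascending order, while the caller still gets results aligned with the list it passed in.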

Training Reads and Benchmarks

The package includes two training-oriented paths:

  • WideTableDataset reads existing blob payload columns lazily.
  • PermutationFrameDataset reads an inline binary train_frame column through lancedb.permutation.Permutation, matching the high-throughput LeRobot-style layout.

Run the benchmark on synthetic data:

mcap-lancedb benchmark --rows 2000 --batch 32 --device auto

Or against your own lake:

mcap-lancedb benchmark --lake s3://bucket/lake --table drives --json report.json

Representative CPU numbers from the development benchmark:

| Layout | Fetch-only | End-to-end CPU |
| --- | --- | --- |
| Blob-v2 per-row | 1,370 rows/s | 571 rows/s |
| Blob-v2 batched | 14,140 rows/s | 923 rows/s |
| Inline + Permutation | 15,980 rows/s | 921 rows/s |

The headline is not "always inline everything." It is: batch your blob fetches, measure on your hardware, and materialize inline training frames only when the fetch path is your bottleneck.

See docs/training.md.
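One way to read the benchmark table is to ask how much per-row time is left once fetching is subtracted. The sketch below does that arithmetic on the published numbers. It assumes fetch and the remaining work (decode, collation) run serially and add up, which is a simplification and not something the benchmark itself states.

```python
# rows/s from the benchmark table: (fetch-only, end-to-end CPU)
RATES = {
    "Blob-v2 per-row": (1370, 571),
    "Blob-v2 batched": (14140, 923),
    "Inline + Permutation": (15980, 921),
}


def non_fetch_ms_per_row(fetch_rate, end_to_end_rate):
    """Per-row time not explained by fetching, under the serial assumption."""
    return (1 / end_to_end_rate - 1 / fetch_rate) * 1000


residual = {name: non_fetch_ms_per_row(*rates) for name, rates in RATES.items()}
for name, ms in residual.items():
    print(f"{name}: {ms:.2f} ms/row outside the fetch path")
```

Under that assumption all three layouts leave roughly 1 ms per row outside the fetch path, which is why the two fast layouts end up nearly tied end to end.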

Custom Message Types

You can extend ingest with a Python package that registers entry points under mcap_lancedb.builders.

[project.entry-points."mcap_lancedb.builders"]
acme = "acme_lance_plugin:builders"

The entry point returns one or more BuilderPlugin objects. Each plugin can add a new typed struct column, add an optional payload column, and route exact schema names or structural matches into that column.

See examples/acme_plugin/ and docs/plugins.md.

Agent Guide

Coding agents should start with:

Important invariants for agents:

  • Keep mcap_lancedb independent of demo packages such as mcap_lancedb_demo and curate_lancedb.
  • Keep optional dependencies lazy. Import ROS/protobuf/torch/CLI dependencies on the paths that need them, not at package import time.
  • Update docs when the public schema, CLI, extras, or plugin interface changes.
  • Run uv run ruff check src tests and uv run pytest -q before handing off.
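The lazy-dependency invariant is usually implemented by importing inside the code path that needs the module. A minimal sketch of that pattern follows. The `require` helper and its error wording are hypothetical, not taken from the package.

```python
import importlib


def require(module_name, extra):
    """Import an optional dependency on first use, with an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name} is required for this path; "
            f'install it with: pip install "mcap-lancedb[{extra}]"'
        ) from exc


# Nothing optional is imported until a caller reaches a path that needs it.
json_module = require("json", "all")  # stdlib module used here so the demo runs
```

A ROS 2 decode path would call something like `require("mcap_ros2", "ros2")` at the point of use, so importing the base package never fails for users who installed no extras.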

Repository Layout

src/mcap_lancedb/
  source.py              # MCAP -> Arrow RecordBatch stream
  ingest.py              # MCAP -> Lance writer
  schema.py              # 35-column public wide schema
  decoders.py            # ROS1 / ROS2 / protobuf / JSON decoder dispatch
  messages/              # built-in message builders and structural matchers
  torch/                 # PyTorch dataset helpers
  benchmark.py           # train-loader benchmark
  _cli.py                # Typer CLI

examples/acme_plugin/    # plugin-builder example
docs/                   # human and agent documentation
tests/                  # standalone package smoke tests

Development

uv sync --extra dev --extra docs
uv run ruff check src tests
uv run pytest -q
uv run mkdocs build --strict
uv build
uv run twine check dist/*

Release dry run:

uv run scripts/release.sh --build-only

Troubleshooting

ModuleNotFoundError: mcap_ros2 or mcap_protobuf

Install the right decoder extra, for example pip install "mcap-lancedb[ros2]" or pip install "mcap-lancedb[all]".

A message lands in raw_payload

Check schema_name, raw_encoding, and decode_error to see why the message was not decoded. A message that decoded but did not match a built-in typed column goes to custom instead, so enable or inspect that column.
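When many rows fall back, grouping them by those three columns shows which schemas need a decoder extra or a plugin. This is a sketch under assumptions: `summarize_fallbacks` and `load_fallback_rows` are hypothetical helpers, the sample rows are made up, and the filter assumes raw_payload is null on rows that were routed elsewhere and that it can be filtered with `IS NOT NULL`.

```python
from collections import Counter


def summarize_fallbacks(rows):
    """Count fallback rows by (schema_name, raw_encoding, decode_error)."""
    return Counter(
        (row["schema_name"], row["raw_encoding"], row["decode_error"]) for row in rows
    )


def load_fallback_rows(uri):
    """Read only the triage columns for rows that kept raw bytes."""
    import lance  # imported lazily so the summary helper works without it

    columns = ["schema_name", "raw_encoding", "decode_error"]
    # Assumption: raw_payload is null unless the row used the raw fallback.
    table = lance.dataset(uri).to_table(columns=columns, filter="raw_payload IS NOT NULL")
    return table.to_pylist()


# Made-up rows, shaped like the output of load_fallback_rows("./drives.lance").
sample = [
    {"schema_name": "acme/Widget", "raw_encoding": "cdr", "decode_error": "no decoder"},
    {"schema_name": "acme/Widget", "raw_encoding": "cdr", "decode_error": "no decoder"},
    {"schema_name": "acme/Gadget", "raw_encoding": "protobuf", "decode_error": None},
]
summary = summarize_fallbacks(sample)
```

Sorting the result with `summary.most_common()` puts the most frequent failure at the top.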

S3 reads fail

Configure AWS credentials as you normally would. From Python, pass the region through storage_options. From the CLI, set AWS_REGION or the equivalent cloud environment variables.

Training throughput is lower than expected

Run mcap-lancedb benchmark on the target machine. CPU image decode can hide fetch-path wins; GPU decode and batched reads change the profile.

Status

mcap-lancedb is early but usable. The schema is designed to be stable, but new typed columns may be added as more robotics message families are supported.

License

Apache 2.0. See LICENSE.
