mcap-lancedb turns robotics and autonomous-vehicle logs into Lance datasets.
It reads MCAP / ROS bag data, decodes known message schemas, and writes a single
wide table that works with Lance and LanceDB.

The practical goal: stop treating robot logs as opaque files that require a separate extract pipeline before they can be queried, curated, embedded, or used for training.

    pip install "mcap-lancedb[all,cli]"
    mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite

The same ingest from Python:

    from mcap_lancedb import McapSource, ingest

    rows = ingest("scene-0061.mcap", "./drives.lance", mode="overwrite")
    print(f"wrote {rows:,} rows")

    for batch in McapSource("scene-0061.mcap", topics=["/CAM_FRONT"]):
        print(batch.num_rows, batch.schema)

- Robotics / AV data engineers who need a repeatable ingest path from MCAP files into a queryable table.
- ML platform engineers who want one storage layout that supports scalar filtering, payload fetches, curation, and training reads.
- Researchers who want to inspect drive logs with Arrow / Lance tooling instead of custom bag readers.
- Agents and automation that need a small, predictable Python API for ingesting logs, discovering schema coverage, and generating trainable slices.
Every MCAP message becomes one row. The output table has stable universal metadata columns, typed sensor struct columns, payload columns for large binary data, and safe fallbacks for unknown or undecodable messages.
| Area | What It Enables |
|---|---|
| Universal metadata | Filter by log_id, topic, schema_name, time, sequence, and source file. |
| Typed structs | Query decoded camera, LiDAR, IMU, GNSS, pose, transform, diagnostics, and Foxglove messages without reparsing payloads. |
| Blob payload columns | Keep large images, point clouds, grids, and videos retrievable without bloating metadata scans. |
| custom catch-all | Preserve decoded-but-unrouted fields for schemas that do not deserve a first-class column yet. |
| raw_payload fallback | Keep unknown bytes and decode errors instead of silently dropping data. |
| Plugin builders | Add proprietary message types without forking the package. |
| PyTorch readers | Benchmark blob-v2 and inline-frame training paths against your hardware. |
Choose the narrowest extra that matches the MCAPs you need to decode:

    pip install mcap-lancedb               # base package, JSON/raw support
    pip install "mcap-lancedb[ros2]"       # ROS2 CDR / ros2msg
    pip install "mcap-lancedb[ros1]"       # ROS1 bag conversions
    pip install "mcap-lancedb[protobuf]"   # Foxglove protobuf MCAPs
    pip install "mcap-lancedb[all,cli]"    # common CLI install
    pip install "mcap-lancedb[torch]"      # training readers and benchmark helpers

For local development:

    git clone https://github.com/lancedb/mcap-lancedb.git
    cd mcap-lancedb
    uv sync --extra dev --extra docs
    uv run pytest -q
    uv run ruff check src tests

Write a local Lance dataset:

    mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite

Append multiple logs into one dataset:

    mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode overwrite --log-id scene-0061
    mcap-lancedb ingest scene-0103.mcap ./drives.lance --mode append --log-id scene-0103

Replace a previously ingested log safely:

    mcap-lancedb ingest scene-0061.mcap ./drives.lance --mode append --replace

Filter ingest to a topic subset:

    mcap-lancedb ingest scene-0061.mcap ./camera.lance \
        --mode overwrite \
        --topics /CAM_FRONT,/CAM_BACK

Write to S3:

    AWS_REGION=us-east-1 \
        mcap-lancedb ingest scene-0061.mcap s3://my-bucket/drives.lance --mode overwrite

From Python:

    from mcap_lancedb import ingest

    rows = ingest(
        "s3://raw-logs/scene-0061.mcap",
        "s3://robotics-lake/drives.lance",
        mode="append",
        log_id="scene-0061",
        topics=["/CAM_FRONT", "/LIDAR_TOP"],
        storage_options={"region": "us-east-1"},
    )

mode is passed to lance.write_dataset and is usually one of:

- create: fail if the destination already exists.
- overwrite: replace an existing dataset.
- append: append rows to an existing dataset.

When appending, mcap-lancedb guards against duplicate log_id values. Pass
replace=True to delete existing rows for that log_id before appending.
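
The append rules above can be pictured with a small in-memory model. This is only a sketch of the semantics described here, not the package's implementation: `append_log` and the list-of-dicts "table" are hypothetical stand-ins for the Lance dataset.

```python
def append_log(table, new_rows, log_id, replace=False):
    """Append rows for one log_id to an in-memory table (a list of dicts).

    Hypothetical model of the duplicate-log guard: an append is refused when
    the log_id is already present, unless replace=True, in which case the
    existing rows for that log_id are dropped before the new ones are added.
    """
    already_present = any(row["log_id"] == log_id for row in table)
    if already_present and not replace:
        raise ValueError(f"log_id {log_id!r} already ingested; pass replace=True")
    if already_present:
        # Delete the old rows for this log before appending the new ones.
        table = [row for row in table if row["log_id"] != log_id]
    return table + [dict(row, log_id=log_id) for row in new_rows]


table = append_log([], [{"topic": "/CAM_FRONT"}], "scene-0061")
table = append_log(table, [{"topic": "/CAM_FRONT"}], "scene-0103")
# Re-ingesting scene-0061 only succeeds because replace=True is passed.
table = append_log(table, [{"topic": "/LIDAR_TOP"}], "scene-0061", replace=True)
print([(row["log_id"], row["topic"]) for row in table])
```

The point of the guard is that re-running an ingest job cannot silently double the rows for a log; the caller has to opt in to replacement.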
Use McapSource when you want batches without writing them immediately:

    from mcap_lancedb import McapSource

    source = McapSource(
        "scene-0061.mcap",
        batch_size=256,
        topics=["/CAM_FRONT"],
    )

    for batch in source.scan_as_stream():
        assert batch.schema == source.schema
        print(batch.num_rows)

McapSource is rescannable. Calling scan_as_stream() opens the MCAP again and
emits the same Arrow schema each time, which makes it safe for retrying writers
and agent-driven workflows.
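
To see why rescanning matters for retries, here is a toy model in plain Python. `RescannableSource`, `write_with_retry`, and `flaky_write` are hypothetical names used only for illustration; the only thing borrowed from the package is the idea that every `scan_as_stream()` call starts a fresh pass.

```python
class RescannableSource:
    """Toy stand-in for a rescannable source (not the real McapSource)."""

    def __init__(self, records, batch_size):
        self._records = list(records)
        self._batch_size = batch_size

    def scan_as_stream(self):
        # Every call starts a new pass over the data, like reopening the file.
        for start in range(0, len(self._records), self._batch_size):
            yield self._records[start:start + self._batch_size]


def write_with_retry(source, write, attempts=3):
    """Call write(stream), handing it a fresh stream on every attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return write(source.scan_as_stream())
        except OSError:
            if attempt == attempts:
                raise


calls = []


def flaky_write(stream):
    """Consume the whole stream, then fail on the first call only."""
    calls.append(1)
    batches = list(stream)
    if len(calls) == 1:
        raise OSError("transient failure")
    return sum(len(batch) for batch in batches)


rows_written = write_with_retry(RescannableSource(range(10), batch_size=4), flaky_write)
print(rows_written, "rows written after", len(calls), "attempts")
```

A one-shot iterator would be exhausted by the failed first attempt; because the source is rescanned, the second attempt still sees every batch.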
The default schema has 35 columns:
| Group | Count | Columns |
|---|---|---|
| Universal | 10 | log_id, source_uri, log_time_ns, publish_time_ns, sequence, channel_id, schema_id, topic, schema_name, schema_fingerprint |
| Typed structs | 16 | image, compressed_image, pointcloud, imu, navsat, radar_returns, laserscan, compressed_video, tf, diagnostics, radar_tracks, image_annotations, camera_calibration, pose, grid, scene_update |
| Payload blobs | 5 | image_data, compressed_image_data, pointcloud_data, compressed_video_data, grid_data |
| Generic decoded fallback | 1 | custom |
| Raw fallback | 3 | raw_payload, raw_encoding, decode_error |
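
As a quick cross-check of the table, the column names can be transcribed into a dictionary and counted. `SCHEMA_GROUPS` is a hypothetical name for this transcription; the authoritative definition lives in the package's schema module.

```python
# Column names copied from the table above, grouped the same way.
SCHEMA_GROUPS = {
    "universal": [
        "log_id", "source_uri", "log_time_ns", "publish_time_ns", "sequence",
        "channel_id", "schema_id", "topic", "schema_name", "schema_fingerprint",
    ],
    "typed_structs": [
        "image", "compressed_image", "pointcloud", "imu", "navsat",
        "radar_returns", "laserscan", "compressed_video", "tf", "diagnostics",
        "radar_tracks", "image_annotations", "camera_calibration", "pose",
        "grid", "scene_update",
    ],
    "payload_blobs": [
        "image_data", "compressed_image_data", "pointcloud_data",
        "compressed_video_data", "grid_data",
    ],
    "generic_fallback": ["custom"],
    "raw_fallback": ["raw_payload", "raw_encoding", "decode_error"],
}

counts = {group: len(columns) for group, columns in SCHEMA_GROUPS.items()}
all_columns = [name for columns in SCHEMA_GROUPS.values() for name in columns]

print(counts)
print("total:", len(all_columns), "unique:", len(set(all_columns)))
```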
The invariant is intentionally simple for downstream agents:
- Universal columns are populated for every row.
- A known message routes into a typed struct column.
- Large payload bytes live in the parallel payload column when one exists.
- A decoded-but-unrouted message goes to custom.
- An unknown or failed decode preserves bytes in raw_payload and records context in raw_encoding / decode_error.
See docs/schema.md for the column-by-column contract.
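
The routing invariant can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `route_message`, the `TYPED_COLUMNS` mapping, and the choice of JSON as the decoder are hypothetical, and the real builders in the package may route differently.

```python
import json

# Hypothetical mapping from schema name to typed struct column.
TYPED_COLUMNS = {
    "sensor_msgs/msg/Imu": "imu",
    "sensor_msgs/msg/NavSatFix": "navsat",
}


def route_message(schema_name, payload, encoding, decode=None):
    """Return the non-universal columns for one message as a dict."""
    row = {"schema_name": schema_name}
    if decode is None:
        # Unknown message: keep the bytes and record why they were not decoded.
        row.update(raw_payload=payload, raw_encoding=encoding,
                   decode_error=f"no decoder for encoding {encoding!r}")
        return row
    try:
        decoded = decode(payload)
    except Exception as exc:
        # Failed decode: keep the bytes and the error instead of dropping data.
        row.update(raw_payload=payload, raw_encoding=encoding, decode_error=str(exc))
        return row
    # Known schemas go to their typed column, everything else to custom.
    row[TYPED_COLUMNS.get(schema_name, "custom")] = decoded
    return row


imu_row = route_message("sensor_msgs/msg/Imu", b'{"x": 1}', "json", json.loads)
custom_row = route_message("acme/Battery", b'{"volts": 12}', "json", json.loads)
failed_row = route_message("acme/Battery", b"not json", "json", json.loads)
unknown_row = route_message("acme/Unknown", b"\x00\x01", "cdr")

for row in (imu_row, custom_row, failed_row, unknown_row):
    print(sorted(row))
```

Every branch returns a row, which is the property downstream tools rely on: no message disappears, it only lands in a less specific column.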
Payload columns may be Lance blob columns or ordinary binary columns depending
on how the dataset was written. Use fetch_blobs() so callers do not need to
know the physical layout:

    import lance
    from mcap_lancedb import fetch_blobs

    dataset = lance.dataset("./drives.lance")
    images = fetch_blobs(dataset, "compressed_image_data", [0, 10, 42])

fetch_blobs() sorts and deduplicates row offsets internally, then returns
bytes in the caller's original order.
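
The sort, deduplicate, and reorder pattern described above looks roughly like this. It is a sketch of the general technique, not the body of `fetch_blobs()`; `fetch_in_caller_order` and the `read_sorted_unique` callback are hypothetical names, and a dictionary stands in for the dataset.

```python
def fetch_in_caller_order(read_sorted_unique, offsets):
    """Read each distinct offset once, in sorted order, then restore caller order."""
    unique_sorted = sorted(set(offsets))
    payloads = read_sorted_unique(unique_sorted)
    by_offset = dict(zip(unique_sorted, payloads))
    # Duplicates and the original ordering are restored from the lookup table.
    return [by_offset[offset] for offset in offsets]


# A dictionary standing in for blob storage, plus a log of what was requested.
storage = {i: f"blob-{i}".encode() for i in range(50)}
requests = []


def read_sorted_unique(sorted_offsets):
    requests.append(list(sorted_offsets))
    return [storage[i] for i in sorted_offsets]


result = fetch_in_caller_order(read_sorted_unique, [42, 0, 10, 42])
print(result)
print(requests)
```

Sorted, unique offsets give the storage layer one sequential-friendly request, while the caller still gets one payload per requested offset.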
The package includes two training-oriented paths:
- WideTableDataset reads existing blob payload columns lazily.
- PermutationFrameDataset reads an inline binary train_frame column through lancedb.permutation.Permutation, matching the high-throughput LeRobot-style layout.

Run the benchmark on synthetic data:

    mcap-lancedb benchmark --rows 2000 --batch 32 --device auto

Or against your own lake:

    mcap-lancedb benchmark --lake s3://bucket/lake --table drives --json report.json

Representative CPU numbers from the development benchmark:
| Layout | Fetch-only | End-to-end CPU |
|---|---|---|
| Blob-v2 per-row | 1,370 rows/s | 571 rows/s |
| Blob-v2 batched | 14,140 rows/s | 923 rows/s |
| Inline + Permutation | 15,980 rows/s | 921 rows/s |
The headline is not "always inline everything." It is: batch your blob fetches, measure on your hardware, and materialize inline training frames only when the fetch path is your bottleneck.
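
The ratios behind that advice can be worked out directly from the table. The numbers below are copied from it; the dictionary keys are hypothetical labels chosen for this example.

```python
# Rows per second, copied from the benchmark table above.
ROWS_PER_SECOND = {
    "blob_v2_per_row": {"fetch_only": 1370, "end_to_end": 571},
    "blob_v2_batched": {"fetch_only": 14140, "end_to_end": 923},
    "inline_permutation": {"fetch_only": 15980, "end_to_end": 921},
}

baseline = ROWS_PER_SECOND["blob_v2_per_row"]

# Speedup of each layout over per-row blob fetches, per measurement stage.
speedups = {
    layout: {stage: rate / baseline[stage] for stage, rate in stages.items()}
    for layout, stages in ROWS_PER_SECOND.items()
}

for layout, stages in speedups.items():
    print(f"{layout:20s} fetch x{stages['fetch_only']:.2f}  end-to-end x{stages['end_to_end']:.2f}")
```

Batching improves fetch-only throughput by roughly 10x, but end-to-end throughput by only about 1.6x, and the inline layout adds nothing further end to end. In these CPU runs the fetch path stops being the limit once reads are batched.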
See docs/training.md.
You can extend ingest with a Python package that registers entry points under
mcap_lancedb.builders.

    [project.entry-points."mcap_lancedb.builders"]
    acme = "acme_lance_plugin:builders"

The entry point returns one or more BuilderPlugin objects. Each plugin can add
a new typed struct column, add an optional payload column, and route exact
schema names or structural matches into that column.
See examples/acme_plugin/ and
docs/plugins.md.
Coding agents should start with:
- AGENTS.md: repository operating instructions.
- docs/agent-guide.md: task-oriented map for modifying ingest, schema, docs, tests, plugins, and release automation.
- tests/test_package_smoke.py: the smallest end-to-end fixture that does not require external data.

Important invariants for agents:

- Keep mcap_lancedb independent of demo packages such as mcap_lancedb_demo and curate_lancedb.
- Keep optional dependencies lazy. Import ROS/protobuf/torch/CLI dependencies on the paths that need them, not at package import time.
- Update docs when the public schema, CLI, extras, or plugin interface changes.
- Run uv run ruff check src tests and uv run pytest -q before handing off.
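
The lazy-dependency invariant usually comes down to a helper like the one below. This is a generic sketch, not code from the repository: `require_extra` and `ros2_decoder_module` are hypothetical names, and only the module name mcap_ros2 and the ros2 extra come from this README.

```python
import importlib


def require_extra(module_name, extra):
    """Import an optional dependency on first use, with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f'{module_name} is not installed; try: pip install "mcap-lancedb[{extra}]"'
        ) from exc


def ros2_decoder_module():
    # The optional import happens here, on the code path that needs it,
    # so importing the package itself never requires the ROS2 extra.
    return require_extra("mcap_ros2", "ros2")


try:
    ros2_decoder_module()
    print("ROS2 decoding is available")
except ModuleNotFoundError as error:
    print(error)
```

Keeping the import inside the function means a base install can still ingest JSON and raw messages, and users only see the install hint when they hit a ROS2 file.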

    src/mcap_lancedb/
      source.py            # MCAP -> Arrow RecordBatch stream
      ingest.py            # MCAP -> Lance writer
      schema.py            # 35-column public wide schema
      decoders.py          # ROS1 / ROS2 / protobuf / JSON decoder dispatch
      messages/            # built-in message builders and structural matchers
      torch/               # PyTorch dataset helpers
      benchmark.py         # train-loader benchmark
      _cli.py              # Typer CLI
    examples/acme_plugin/  # plugin-builder example
    docs/                  # human and agent documentation
    tests/                 # standalone package smoke tests

Development commands:

    uv sync --extra dev --extra docs
    uv run ruff check src tests
    uv run pytest -q
    uv run mkdocs build --strict
    uv build
    uv run twine check dist/*

Release dry run:

    uv run scripts/release.sh --build-only

ModuleNotFoundError: mcap_ros2 or mcap_protobuf
Install the right decoder extra, for example pip install "mcap-lancedb[ros2]"
or pip install "mcap-lancedb[all]".
A message lands in raw_payload
Check schema_name, raw_encoding, and decode_error. If the message decoded
but did not match a built-in typed column, enable or inspect custom.
S3 reads fail
Set normal AWS credentials and pass a region through storage_options from
Python, or AWS_REGION / cloud environment variables from the CLI.
Training throughput is lower than expected
Run mcap-lancedb benchmark on the target machine. CPU image decode can hide
fetch-path wins; GPU decode and batched reads change the profile.
mcap-lancedb is early but usable. The schema is designed to be stable, but new
typed columns may be added as more robotics message families are supported.
Apache 2.0. See LICENSE.