Add MiniSEED read support #666
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough
Adds a MiniSEED IO backend (MSeedV2/MSeedV3) with format detection, scanning, and reading; implements MiniSEED decoding/grouping utilities; registers entry points and an optional pymseed dependency; adds a data registry fixture, a small file-header helper, and extensive unit/integration tests.
Changes: MiniSEED IO feature
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7e903d857b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```python
sample_count = min(x.sample_count for x in segments)
source_ids = tuple(x.source_id for x in segments)
data = np.stack([x.data[:sample_count] for x in segments])
```
Preserve all samples when channel lengths differ
In _patch_from_segments, channel data are truncated to the shortest trace via sample_count = min(...) and x.data[:sample_count], so any group where traces share start/rate but have different lengths will silently drop valid trailing samples from longer channels. Because _get_group_key does not include end time or sample count, this case is reachable for real MiniSEED files with uneven channel coverage, and the truncation affects both read results and scan-derived metadata.
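One way to address this is to pad shorter channels up to the longest trace instead of truncating. The sketch below is illustrative, not the PR's actual API: `stack_segments` and the per-segment `.data` attribute are assumed names.

```python
import numpy as np


def stack_segments(segments):
    """Stack per-channel arrays, padding shorter ones with NaN.

    Hypothetical sketch: each segment exposes a 1-D ``data`` array.
    Longer channels keep their trailing samples rather than being cut
    to ``min(sample_count)``.
    """
    n_samples = max(len(seg.data) for seg in segments)
    out = np.full((len(segments), n_samples), np.nan)
    for row, seg in enumerate(segments):
        out[row, : len(seg.data)] = seg.data
    return out
```

The trade-off is that downstream code must tolerate NaN fill; if the backend guarantees rectangular patches, a warning plus truncation may be the pragmatic middle ground.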
| """Return a stable source patch ID for a MiniSEED group.""" | ||
| version, network, location, seed_channel, start_ns, sample_rate = group_key | ||
| source_key = ".".join((network, location, seed_channel)) | ||
| return f"v{version}:{source_key}:{start_ns}:{sample_rate:g}" |
Generate source_patch_id without float rounding collisions
_source_patch_id formats sample_rate with :g, which keeps only ~6 significant digits, so distinct groups can collapse to the same ID (for example, close non-integer rates). That makes source_patch_id non-unique and can cause read(..., source_patch_id=...) to match multiple logical patches or the wrong one. The ID should encode sample rate at full precision (or via an unambiguous derived value like step nanoseconds).
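A sketch of the suggested fix, using integer step nanoseconds as the unambiguous derived value (the free-standing function name and signature are illustrative, not the PR's actual helper):

```python
def source_patch_id(version, network, location, seed_channel, start_ns, sample_rate):
    """Build a patch ID that cannot collide for close non-integer rates."""
    # Encode the sampling interval as integer nanoseconds instead of a
    # %g-formatted float, which keeps only ~6 significant digits.
    step_ns = round(1e9 / sample_rate)
    source_key = ".".join((network, location, seed_channel))
    return f"v{version}:{source_key}:{start_ns}:{step_ns}"
```

For example, rates 249.9999 and 250.0001 both render as `250` under `:g` and would collapse to one ID, while their step values (4000002 ns vs 3999998 ns) stay distinct.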
🧹 Nitpick comments (1)
dascore/io/mseed/utils.py (1)
225-254: 💤 Low value
Consider documenting the truncation behavior for mismatched segment lengths.
Line 229 uses `min(x.sample_count for x in segments)`, which silently truncates longer segments when merging traces with different sample counts. While this may be intentional for handling slight timing variations, it could silently drop data. Consider adding a brief comment or logging when truncation occurs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `dascore/io/mseed/utils.py` around lines 225-254, _patches_from_segments currently computes sample_count = min(x.sample_count for x in segments) which silently truncates longer segments; update _patch_from_segments to detect when any segment.sample_count > sample_count and either add a brief inline comment explaining intentional truncation or emit a warning log that includes the group_key, the computed sample_count and the original per-segment sample counts (use the sample_count variable and the segments list to build the message) so callers are informed when data is being truncated before stacking into data = np.stack(...).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f2106f71-0b81-4a3d-96d6-1f3b0e81274e
📒 Files selected for processing (7)
- dascore/data_registry.txt
- dascore/io/mseed/__init__.py
- dascore/io/mseed/core.py
- dascore/io/mseed/utils.py
- pyproject.toml
- tests/test_io/test_common_io.py
- tests/test_io/test_mseed/test_mseed.py
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## dev #666 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 140 143 +3
Lines 12890 13273 +383
==========================================
+ Hits 12890 13273 +383
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (3)
tests/test_io/test_common_io.py (1)
93-94: ⚡ Quick win
Add the v3 reader to the shared matrix too.
This common coverage currently exercises only `MSeedV2`; if `MSeedV3` is a separate public entry point, regressions in that plugin won't be caught by the generic IO tests.
Suggested change
```diff
-from dascore.io.mseed.core import MSeedV2
+from dascore.io.mseed.core import MSeedV2, MSeedV3
 ...
     NetCDFCFV18(): ("xdas_netcdf.nc",),
     MSeedV2(): ("etna_9n_3chan_10s.mseed",),
+    MSeedV3(): ("etna_9n_3chan_10s.mseed",),
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/test_io/test_common_io.py` around lines 93-94, the shared test matrix currently only includes the MSeedV2 entry so MSeedV3 isn't exercised; update the matrix by adding a parallel MSeedV3 entry (e.g., MSeedV3(): ("etna_9n_3chan_10s.mseed",) alongside MSeedV2) so the generic IO tests also run the v3 reader; ensure the key uses the public MSeedV3 symbol exactly as declared to pick up that plugin.
dascore/io/mseed/utils.py (1)
93-99: ⚡ Quick win
Narrow this fallback to the parser failures you expect.
Catching `Exception` here will hide real bugs in `pymseed.sourceid2nslc` and silently turn unexpected failures into generic NSLC metadata.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `dascore/io/mseed/utils.py` around lines 93-99, the helper _source_id_to_nslc should only swallow the specific parsing errors from pymseed.sourceid2nslc instead of catching Exception; update the except to catch the parser error types you expect (e.g., ValueError, TypeError or the specific pymseed parse exception class if it exposes one, such as pymseed.SourceIDError) and return the empty-fallback only for those, while letting all other exceptions propagate (re-raise) so real bugs in pymseed.sourceid2nslc are not hidden.
tests/test_io/test_mseed/test_mseed.py (1)
467-473: ⚡ Quick win
Avoid a direct registry fetch in the test body.
This makes the integration test depend on downloader/network state. Reuse the shared fixture or wrap the fetch in `skip_timeout()` so the test stays stable on slow/offline runs.
Suggested change
```diff
 from dascore.io.mseed import utils as mseed_utils
 from dascore.io.mseed.core import MSeedV2
+from tests.test_io._common_io_test_utils import skip_timeout
 ...
-    path = fetch("etna_9n_3chan_10s.mseed")
+    with skip_timeout():
+        path = fetch("etna_9n_3chan_10s.mseed")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/test_io/test_mseed/test_mseed.py` around lines 467-473, the test test_read_das_station_channels currently calls fetch("etna_9n_3chan_10s.mseed") directly, which makes it depend on network/downloader timing; instead either use the shared downloader fixture (if one exists in the test suite) to obtain the path or wrap the fetch call with skip_timeout so it will be skipped on slow/offline runs, so the test no longer directly depends on network timing.
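The exception-narrowing suggestion for `_source_id_to_nslc` above can be sketched as follows. This is a hypothetical stand-in: the `parser` argument represents `pymseed.sourceid2nslc`, and the empty fallback tuple is illustrative.

```python
def source_id_to_nslc(source_id, parser):
    """Parse an FDSN source ID into (net, sta, loc, chan) fields.

    Only expected parse failures fall back to empty NSLC metadata;
    anything else propagates so real bugs are not hidden.
    """
    try:
        return parser(source_id)
    except (ValueError, TypeError):
        # Malformed source ID: return the generic fallback.
        return ("", "", "", "")
```

If pymseed exposes a dedicated parse-error class, catching that instead of `(ValueError, TypeError)` would be tighter still.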
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ad91f41d-e62f-4d59-b0ee-986f6a6a86b3
📒 Files selected for processing (5)
- dascore/io/mseed/core.py
- dascore/io/mseed/utils.py
- dascore/utils/io.py
- tests/test_io/test_common_io.py
- tests/test_io/test_mseed/test_mseed.py
✅ Files skipped from review due to trivial changes (2)
- dascore/utils/io.py
- dascore/io/mseed/core.py
🧹 Nitpick comments (1)
dascore/io/mseed/utils.py (1)
532-538: 💤 Low value
Narrow the exception handler to avoid masking unexpected errors.
The bare `except Exception` could mask bugs (e.g., logic errors in the detection code). Since `unpack` raises `struct.error` on malformed data, catching that specifically is clearer.
♻️ Proposed fix
```diff
+from struct import error as StructError, unpack
-from struct import unpack
 ...
     try:
         year, day = unpack(">HH", header[20:24])
         hour, minute, second = header[24:27]
         _unused, frac_seconds, sample_count = unpack(">BHH", header[27:32])
         data_offset, blockette_offset = unpack(">HH", header[44:48])
-    except Exception:
+    except (StructError, IndexError):
         return False
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `dascore/io/mseed/utils.py` around lines 532-538, the try/except around the unpack calls is too broad; replace the bare "except Exception" with a specific catch for struct.error (the exception raised by unpack) so only malformed/parse errors are handled — e.g., "except struct.error:" (or "from struct import error as StructError" then "except StructError:") — and ensure the struct.error symbol is imported/available; keep the existing unpack/header logic so only struct unpack failures are swallowed.
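A minimal sketch of the narrowed handler. The function name and the year/day sanity check are assumptions for illustration, not the PR's actual detection logic; only the `struct.error` narrowing mirrors the review suggestion.

```python
import struct


def looks_like_mseed2(header: bytes) -> bool:
    """Probe fixed-header fields; only malformed bytes mean 'not MiniSEED'."""
    try:
        year, day = struct.unpack(">HH", header[20:24])
        _flags, _frac_seconds, _sample_count = struct.unpack(">BHH", header[27:32])
        struct.unpack(">HH", header[44:48])
    except struct.error:
        # Too-short or malformed header: a parse failure, not a code bug.
        return False
    # Illustrative sanity bounds on the decoded BTIME fields.
    return 1900 <= year <= 2100 and 1 <= day <= 366
```

Any other exception (say, a logic error upstream handing in the wrong type) now propagates instead of being silently converted into "not MiniSEED".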
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 51f27d00-c4e1-472d-9b49-726e5e9f5176
📒 Files selected for processing (4)
- dascore/io/mseed/core.py
- dascore/io/mseed/utils.py
- tests/test_io/test_common_io.py
- tests/test_io/test_mseed/test_mseed.py
🚧 Files skipped from review as they are similar to previous changes (2)
- dascore/io/mseed/core.py
- tests/test_io/test_mseed/test_mseed.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dascore/io/mseed/utils.py`:
- Around line 354-360: can_merge currently allows merging segments with
different original encodings if their decoded dtype matches, causing
scan()/read() to disagree about patch boundaries; update can_merge (used by
_coalesce_source_segments) to require identical original encoding metadata in
addition to the existing checks (sample_rate, format_version, next_start_ns,
data.dtype). Concretely, add a equality check for the attribute that records the
source encoding (e.g., pending.encoding == seg.encoding or the actual encoding
field used on _TraceSegment in your codebase) so _coalesce_source_segments
preserves encoding boundaries the same way _coalesce_source_summaries does,
preventing source_patch_id mismatches between scan and read.
- Around line 147-149: The current next_start_ns method compounds per-sample
rounding by multiplying sample_count by sample_step_ns directly; change it to
compute the nanosecond offset with exact rational/rounded arithmetic to avoid
drift (e.g., use fractions.Fraction or Decimal to compute offset =
int(round(Fraction(self.sample_step_ns) * self.sample_count))) and then return
self.start_ns + offset; update next_start_ns to use this precise offset
calculation referencing the existing attributes start_ns, sample_count and
sample_step_ns.
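The precise-offset idea from the second inline comment can be sketched like this. The class and field names mirror the review text but are assumptions about the PR's `_TraceSegment`, not its actual definition.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class TraceSegment:
    start_ns: int
    sample_count: int
    sample_step_ns: float

    @property
    def next_start_ns(self) -> int:
        # Multiply exactly with rational arithmetic, then round once,
        # so per-sample rounding error cannot accumulate over long segments.
        offset = Fraction(self.sample_step_ns) * self.sample_count
        return self.start_ns + int(round(offset))
```

For a 6.5 Hz trace, `sample_step_ns` is a non-terminating decimal; rounding once at the end keeps the computed next-start within a nanosecond of the true value regardless of segment length.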
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4442308a-494f-4db2-9296-0de19517334a
📒 Files selected for processing (3)
- dascore/io/mseed/core.py
- dascore/io/mseed/utils.py
- tests/test_io/test_mseed/test_mseed.py
✅ Files skipped from review due to trivial changes (1)
- tests/test_io/test_mseed/test_mseed.py
🚧 Files skipped from review as they are similar to previous changes (1)
- dascore/io/mseed/core.py
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/test_io/test_common_io.py (1)
207-214: 💤 Low value
`_get_coord_trim_values` produces a degenerate single-point trim for small coordinates.
For any coordinate with fewer than 10 unique values (e.g. the 3-channel MiniSEED fixture's `channel` axis), `len // 10 == 0`, so `start_ind == stop_ind == 1` and both endpoints are the same value. The double-ended test then exercises only exact single-point selection rather than a genuine range, which is valid but less thorough.
♻️ Proposed fix
```diff
 def _get_coord_trim_values(coord):
     """Return existing coordinate values to use for read selections."""
     values = coord.values
     if len(values) == 1:
         return values[0], values[0]
-    start_ind = min(max(1, len(values) // 10), len(values) - 1)
-    stop_ind = min(max(start_ind, 2 * len(values) // 10), len(values) - 1)
+    quarter = max(1, len(values) // 4)
+    start_ind = min(quarter, len(values) - 2)
+    stop_ind = min(start_ind + max(1, len(values) // 4), len(values) - 1)
     return values[start_ind], values[stop_ind]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/test_io/test_common_io.py` around lines 207-214, the helper _get_coord_trim_values currently can return the same value for start and stop when len(values) < 10 because start_ind and stop_ind both compute to 1; change the logic so the returned endpoints always form a proper range when possible: after computing start_ind and stop_ind (or instead of the current math), ensure stop_ind > start_ind by adjusting them within bounds (e.g., if stop_ind == start_ind, decrement start_ind if >0 else increment stop_ind if < len(values)-1) and then return values[start_ind], values[stop_ind].
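The adjustment described in the prompt can be sketched as follows (helper name taken from the review; `values` is assumed to be any indexable sequence):

```python
def get_coord_trim_values(values):
    """Return two endpoints for a range selection, distinct when possible."""
    if len(values) == 1:
        return values[0], values[0]
    start_ind = min(max(1, len(values) // 10), len(values) - 1)
    stop_ind = min(max(start_ind, 2 * len(values) // 10), len(values) - 1)
    if stop_ind == start_ind:
        # Widen the degenerate single-point trim within bounds.
        if start_ind > 0:
            start_ind -= 1
        elif stop_ind < len(values) - 1:
            stop_ind += 1
    return values[start_ind], values[stop_ind]
```

For the 3-channel fixture this yields a genuine two-point range instead of selecting the same channel twice, while large coordinates keep their original ~10%/~20% trim points.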
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dascore/io/mseed/utils.py`:
- Around line 89-91: The __post_init__ method currently uses an assert (assert
len(self.data) == self.sample_count) which is removed under -O; replace it with
an explicit runtime check that raises a ValueError when len(self.data) !=
self.sample_count so the invariant is always enforced. Update the __post_init__
in the class (method name __post_init__) to perform: if len(self.data) !=
self.sample_count: raise ValueError(f"sample_count mismatch: expected
{self.sample_count}, got {len(self.data)}") so the error references the
sample_count and data lengths.
- Around line 639-645: The bare except should be narrowed to struct.error:
replace "except Exception:" with "except struct.error:" around the unpack calls
(the try block containing unpack(">HH", header[20:24]) and unpack(">BHH",
header[27:32]) etc.) and ensure the module is imported so struct.error is
available (add "import struct" at the top if not present).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f4a5e34c-a572-4421-bfd4-8650975bb98c
📒 Files selected for processing (3)
- dascore/io/mseed/utils.py
- tests/test_io/test_common_io.py
- tests/test_io/test_mseed/test_mseed.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_io/test_mseed/test_mseed.py
```python
def __post_init__(self) -> None:
    """Validate that trace metadata matches the decoded samples."""
    assert len(self.data) == self.sample_count
```
assert in __post_init__ is silently stripped under -O.
Use an explicit ValueError so the invariant is enforced regardless of optimization flags.
🛡️ Proposed fix
```diff
 def __post_init__(self) -> None:
     """Validate that trace metadata matches the decoded samples."""
-    assert len(self.data) == self.sample_count
+    if len(self.data) != self.sample_count:
+        msg = (
+            f"Decoded sample count ({len(self.data)}) does not match "
+            f"header sample_count ({self.sample_count})."
+        )
+        raise ValueError(msg)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@dascore/io/mseed/utils.py` around lines 89 - 91, The __post_init__ method
currently uses an assert (assert len(self.data) == self.sample_count) which is
removed under -O; replace it with an explicit runtime check that raises a
ValueError when len(self.data) != self.sample_count so the invariant is always
enforced. Update the __post_init__ in the class (method name __post_init__) to
perform: if len(self.data) != self.sample_count: raise ValueError(f"sample_count
mismatch: expected {self.sample_count}, got {len(self.data)}") so the error
references the sample_count and data lengths.
Summary
Summary by CodeRabbit
New Features
Tests
Chores