docs: define deals, retrievals, and metrics#164
Conversation
There was a problem hiding this comment.
Pull request overview
Adds public “source of truth” documentation for Dealbot’s Data Storage and Retrieval checks, plus a consolidated reference for check events and dashboard metrics (addressing #158, #159, #160).
Changes:
- Introduces a Data Storage Check specification and lifecycle description.
- Introduces a Retrieval Check specification, including selection logic and recorded metrics.
- Adds an Events & Metrics reference for dashboard consumers (with TBD markers for not-yet-implemented items).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| docs/checks/data-storage.md | Defines intended Data Storage check behavior, lifecycle, assertions, metrics, and configuration. |
| docs/checks/retrievals.md | Defines intended Retrieval check behavior, deal selection, assertions, recording, metrics, and configuration. |
| docs/checks/events-and-metrics.md | Documents a canonical event/metric model and links expected sources of truth in code. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
BigLep
left a comment
There was a problem hiding this comment.
Thanks for putting this together Russell. I think this will be a useful artifact going forward, and I think it's useful right now for making sure we're fully aligned on what #158 and #159 should implement. I think it was smart to start with it.
I have read through everything, but because of time I didn't comment on each thing I saw. I have a few inline comments and some more general things. If I had more time tonight, I would have just created a PR on top of yours to point out some other suggestions or changes. I think in terms of next steps it would be great if you can review/incorporate the feedback I have provided so far. I'll then checkout the change and as I read through it again, make any other changes in a PR on top so it's easy for you to review/accept/reject my items. Sound good?
Generall items
- I think we should write this document assuming the flow for when #158 and #159 are done (since they will be completed soon). It's fine/good to denote the parts that aren't complete as TBD or TBI (to be implemented). For example, lets write it assuming the IPNI advertising and verification happens as part of the sequential "data storage" flow
- What retries are we doing?
- Rather than stating line number in text can we just
#123in the URLs (more convenient and less text presented to user) - Maybe this is a losing battle, but rather than use the word "deal" throughout the document, do we use "piece" since that has a definition throughout FWSS?
- Some parts in each document seem duplicative? Maybe ask AI for ideas to reduce redundancy?
- In particular, there seems to be some duplication of the metric definitions across the files? Either reference events-and-metrics.md as much as possible or inline that content into the other two files?
- Not critical, but maybe do mermaid diagrams?
|
@BigLep I addressed all comments.. lmk if you need anything else from me |
|
Is this mean to include the thresholds from https://linear.app/filoz/document/public-definition-of-thresholds-for-dealbot-ccf29ae32583 I scanned in manually and had AI do a check as well. AI output:
|
|
@timfong888 : I am active in adding a lot more content to this including thresholds - I'll ping you when it's ready for review. |
|
#228 is where I am working on top of this PR, but it's not ready for review yet. I need to review with fresh eyes and explain my thinking better. |
|
I'll approve this once #228 is merged. |
) * docs: explain deal status as composite of sub-statuses with branching flow - Add explanation that overall deal status depends on Upload, Onchain, Discoverability, and Retrieval sub-statuses - Document branching flow: Upload → (Onchain ∥ Discoverability) → Retrieval - Add multiple diagram options showing the dependency structure - Update assertions table with retries and metric references - Add production configuration and approval methodology doc * docs: clean up FAQ and add checks directory README - Add README.md to checks directory with overview and terminology - Move 'What is DealBot?' section to README - Add MIN_NUM_DATASETS_FOR_CHECKS to configuration table with description - Clean up incomplete FAQ entries in data-storage.md - Add link to filecoin-pin FAQ for filecoinpin.contact vs cid.contact - Add SP maintenance window section to production config doc - Add check frequency information to production config doc - Fix metric name typo (dataStorageCheckMs -> retrievalCheckMs) * docs: fix typos, broken links, and grammar issues in checks docs - Fix typos: instnace→instance, appropraite→appropriate, verificaiont→verification, etc. - Fix broken anchor links in data-storage.md - Fix grammar: dealbot confirm→confirms, already knows→know, etc. - Remove TBD_VARIABLE placeholders and TODO comments - Fix missing closing backticks in formulas - Fix duplicate words and double spaces - Standardize terminology and fix inconsistent section references - Fix metric name typos and broken event links in events-and-metrics.md - Complete incomplete sentences and fix punctuation * docs: expand Prometheus/OpenTelemetry metrics label documentation Add details about checkType and providerId labels/attributes for filtering metrics * docs: use 'SP under test' consistently in checks docs Replace SP-under-test and SP under testing with SP under test / SPs under test throughout. * docs: add MAX_RETRIEVAL_CHECKS config vars to retrievals Configuration section * docs: document `MAX_RETRIEVAL_CHECKS_PER_SP_PER_CYCLE` for production * docs: redraw deal status diagram so Retrieval clearly follows Discoverability Show Upload → Onchain and Discoverability in parallel, then Retrieval only after Discoverability succeeds; deal success requires both Onchain and Retrieval. * docs: fix typo and cross-references in production config doc - bandwith → bandwidth - Link Data Storage Check and Retrieval Check to data-storage.md and retrievals.md * docs: align data-storage metric links with events-and-metrics definitions - Use pieceConfirmedOnChainMs and spIndexLocallyMs for assertions 3 and 4 - Set assertion 2 metric to n/a (no matching metric in events-and-metrics) * docs: add Sub Status Affected column to What Gets Asserted table Column links to sub-status meanings and maps each assertion to Upload, Onchain, Discoverability, or Retrieval. * docs: light polish in checks docs - events-and-metrics: fix 'metrics results' wording, Emmitted→Emitted, remove empty bullet, fix table cell - retrievals: subject-verb agreement (remain), 'on 5xx response' - production-config: clarify RANDOM_PIECE_SIZES note for retrieval downloads * docs: fix Count Related Metrics table alignment in events-and-metrics * docs: clarify count metrics use a value label (success|failure or HTTP code) * Incorporating feedback based on @SgtPooki 2026-02-05 discussion * Various improvements from reading through fresh * More fixup when doing self review * Applying copilot fixups. * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Incorporate feedback from @SgtPooki in #228 (review) * Incorporated feedback from latest round of verbal conversation and me reading through again --------- Co-authored-by: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR attempts to handle the documentation parts of #158, #159, #160.
This is all in a single PR to make the entire scope of documentation changes easier to review.