
feat: adds python script to fetch claims for sources #30

Merged
semmet95 merged 11 commits into main from feat/claim-fetch-workflow
May 18, 2026

Conversation

@semmet95
Contributor

@semmet95 semmet95 commented May 18, 2026

Summary by CodeRabbit

  • New Features

    • Added new claim records from nine news outlets (BBC, Al Jazeera, NDTV, The Guardian, NYT, Hindustan Times, India Times, Yahoo Finance, South China Morning Post).
  • New Tools

    • Introduced a command-line importer to fetch and add news claims from external sources.
  • Updates

    • API documentation example for claim submission clarified.
  • Removals

    • Google News source removed.

semmet95 added 10 commits May 17, 2026 20:30
Signed-off-by: Amit Singh <singhamitch@outlook.com>
… model

Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
@qodo-code-review

Qodo reviews are paused for this user.


@coderabbitai

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

This PR adds a CLI ingestion script that fetches news sources and candidate claims, filters candidates via multiple OpenRouter models, deduplicates against existing YAML records, writes normalized claim YAML files, updates the OpenRouter free-model list, and adds an example URI to the OpenAPI ClaimInput schema.

Changes

Claim Ingestion System

| Layer | File(s) | Summary |
|---|---|---|
| Schema, requirements & configuration | `oapi.yaml`, `scripts/newsdata_io.py`, `scripts/openrouter.py`, `requirements.txt` | Adds an example for `ClaimInput.uri`, defines ingestion/runtime constants and the OpenRouter free model list, and adds `requests` and `demjson3` to Python deps. |
| HTTP and API Helpers | `scripts/newsdata_io.py` | Implements `fetch_web_text()`, `get_sources()`, and OpenRouter request helpers (`post_openrouter()` / `req_openrouter()`) with timeout and status handling. |
| LLM-Based Claim Filtering | `scripts/newsdata_io.py` | `filter_claims()` retries across free models, prompts for a plain JSON array, normalizes `None`/`True`/`False`, strips code fences, and returns a parsed array or exits. |
| Claim Management and Deduplication | `scripts/newsdata_io.py` | `get_claims()` retrieves candidates, `update_claim_fields()`/`is_claim_new()` normalize fields and check novelty against existing `claims/` YAML, and `clean_filepath()` sanitizes filenames. |
| YAML Document Generation | `scripts/newsdata_io.py` | `create_claim_docs()` reads `oapi.yaml` example keys, forces double-quoted YAML strings, derives sanitized filenames, and writes files under `claims/<srcName>/`. |
| Main Orchestration | `scripts/newsdata_io.py` | `main()` reads env vars, downloads the falsifiable-claim skill, iterates sources with delays, applies filtering and deduplication, and persists new claim YAML files. |
| Generated Claim Records | `claims/*` | Adds nine new claim YAML files (Al Jazeera, BBC, Hindustan Times, India Times, NDTV, SCMP, The Guardian, NYT, Yahoo Finance), each with `sourceUriDigest`, `summary`, `title`, and `uri`. |
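
The reply-cleaning step that the summary attributes to `filter_claims()` (strip code fences, normalize `None`/`True`/`False`, parse a JSON array) might look roughly like the sketch below. This is not the PR's code: `parse_model_reply` is a hypothetical name, and the standard-library `json` module stands in for `demjson3` so the example has no third-party dependency.

```python
import json
import re

# Hypothetical helper, not taken from the PR. A Markdown fence marker is three
# backticks; it is built here instead of being written literally.
FENCE = "`" * 3
_FENCE_RE = re.compile(
    r"^" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)
_LITERALS = {"None": "null", "True": "true", "False": "false"}


def parse_model_reply(reply: str) -> list:
    """Return the JSON array contained in a model reply, or raise ValueError."""
    text = reply.strip()
    match = _FENCE_RE.match(text)
    if match:
        # Keep only what sits between the opening and closing fence.
        text = match.group(1).strip()
    # Map Python literals to JSON literals. This word-boundary replacement is
    # naive: it would also rewrite these words inside string values.
    text = re.sub(r"\b(None|True|False)\b", lambda m: _LITERALS[m.group(1)], text)
    parsed = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array")
    return parsed
```

Raising `ValueError` instead of exiting leaves the retry-across-models decision to the caller, which is a design choice of this sketch rather than something the PR states.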

Sequence Diagram(s)

sequenceDiagram
  participant CLI as scripts/newsdata_io.py
  participant SourceAPI as Newsdata API
  participant LLM as OpenRouter
  participant LocalFS as claims/*.yaml
  CLI->>SourceAPI: get_sources()
  SourceAPI-->>CLI: source list
  CLI->>SourceAPI: get_claims(source)
  SourceAPI-->>CLI: candidate claims
  CLI->>LLM: filter_claims(prompt, candidate claims)
  LLM-->>CLI: JSON array of validated claims
  CLI->>LocalFS: is_claim_new()/create_claim_docs()
  LocalFS-->>CLI: written claim file paths

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I nibble through the news at dawn,
prompt the models, sift the lawn,
nine YAML sprouts in tidy rows,
a rabbit's script where data grows.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary change: adding a new Python script to fetch claims for sources, which is the main focus of this PR with a 294-line script addition. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
scripts/newsdata_io.py (1)

195-203: ⚡ Quick win

Avoid rescanning all YAML claim docs for each candidate claim.

This currently does full disk I/O per claim. Build in-memory URI/title indexes once and reuse during the loop.

Also applies to: 286-289

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/newsdata_io.py` around lines 195 - 203, The is_claim_new function
currently calls get_claim_docs() and re-scans disk for every candidate; instead,
build in-memory indexes (sets or dicts) of existing URIs and titles once (e.g.,
a build_claim_indexes or cache inside module scope) and have is_claim_new
consult those sets rather than calling get_claim_docs repeatedly; update calls
at both is_claim_new and the other occurrence around lines 286-289 to accept or
reference the prebuilt uri_set/title_set (or a get_claim_indexes() cached
wrapper) so no per-claim disk I/O occurs.
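
A minimal sketch of the indexing idea in this nitpick, under stated assumptions: the existing claim documents are available as an iterable of dicts with `uri` and `title` keys (for example, the result of a single `get_claim_docs()` call, whose real return shape is not shown in this thread). `build_claim_indexes` and this `is_claim_new` signature are hypothetical, and case-insensitive title matching is an assumption of the sketch.

```python
from typing import Iterable


def build_claim_indexes(docs: Iterable[dict]) -> tuple[set[str], set[str]]:
    """Collect existing URIs and normalized titles in a single pass."""
    uris: set[str] = set()
    titles: set[str] = set()
    for doc in docs:
        uri = doc.get("uri")
        title = doc.get("title")
        if uri:
            uris.add(uri)
        if title:
            # Assumption: titles are compared case-insensitively, trimmed.
            titles.add(title.strip().lower())
    return uris, titles


def is_claim_new(claim: dict, uris: set[str], titles: set[str]) -> bool:
    """A claim is new when neither its URI nor its title is already indexed."""
    title = (claim.get("title") or "").strip().lower()
    return claim.get("uri") not in uris and title not in titles
```

The indexes would be built once before the per-source loop and passed into every check; adding each accepted claim's URI and title to the sets keeps them current without another disk scan.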
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/newsdata_io.py`:
- Around line 211-212: The script currently opens 'oapi.yaml' and writes claim
files using CWD-relative paths; change these to use script-relative repo paths
so runs outside the repo root behave correctly. Locate the open('oapi.yaml')
usage and the claim write logic (also referenced around lines 240-245) and
compute a base path from Path(__file__).resolve().parent (matching
get_claim_docs()'s resolution), then join that base to 'oapi.yaml' and to the
claim output filenames when reading/writing; update all file open calls to use
these Path objects (or their str()) so schema reads and claim writes are
repository-relative instead of CWD-relative. Ensure you update any helper
functions that construct paths so they reuse the same script_dir base.
- Around line 164-168: The code currently assumes response.json() contains a
"results" key after a 200 status; update the post-request handling (around
response, endpoint, params, src_domain_url) to safely extract results by first
attempting to parse JSON with error handling, then checking for the "results"
key (e.g., use dict.get or membership test) and only returning that list when
present; if JSON parsing fails or "results" is missing, log/print a clear error
including response.status_code and response.text (or the JSON error) and return
None instead of indexing directly.
- Around line 250-255: The code directly indexes os.environ for API_BASE_URL,
API_KEY, NEWSDATA_API_BASE_URL, NEWSDATA_API_KEY, OPENROUTER_API_KEY, and
OPENROUTER_API_BASE_URL which raises KeyError without helpful context; change
this to validate and fail fast with a clear error: use a small helper (e.g.,
get_required_env_var) or os.environ.get for each of those symbols and if missing
raise a ValueError or log an explicit message naming the missing variable(s)
before exiting so operators see which env var is absent when variables like
base_url, api_key, news_data_base_url, news_data_api_key, openrouter_api_key, or
openrouter_base_url are loaded.
- Line 120: Replace equality comparisons against None with identity checks:
change uses of "filtered_claims != None" (and other comparisons to None at lines
referenced) to "filtered_claims is not None", and similarly update
"filtered_sources == None" / "filtered_claim == None" / the comparison at line
275 to use "is None" or "is not None" as appropriate; locate these checks in
scripts/newsdata_io.py by searching for the symbols filtered_claims,
filtered_sources, filtered_claim and the surrounding conditional blocks and
update them to use "is" / "is not" to satisfy Ruff E711.
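
Two of the inline comments above (fail-fast environment validation and safe extraction of `results`) can be sketched as follows. The environment variable names come from the review comment; `get_required_env` and `extract_results` are hypothetical helpers, and the response object is only assumed to behave like a `requests` response (`.json()`, `.status_code`, `.text`).

```python
import os
import sys

# Variable names are taken from the review comment above.
REQUIRED_ENV_VARS = [
    "API_BASE_URL",
    "API_KEY",
    "NEWSDATA_API_BASE_URL",
    "NEWSDATA_API_KEY",
    "OPENROUTER_API_KEY",
    "OPENROUTER_API_BASE_URL",
]


def get_required_env(names, env=None):
    """Return {name: value} for all names, or raise naming every missing one."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}


def extract_results(response):
    """Return the "results" value from a requests-style response, else None."""
    try:
        payload = response.json()
    except ValueError as exc:  # requests' JSON decode error subclasses ValueError
        print(f"invalid JSON (status {response.status_code}): {exc}", file=sys.stderr)
        return None
    results = payload.get("results") if isinstance(payload, dict) else None
    if results is None:  # identity check, in line with Ruff E711
        print(
            f"no 'results' key (status {response.status_code}): {response.text[:200]}",
            file=sys.stderr,
        )
    return results
```

Collecting all missing names before raising means an operator sees every absent variable in one run instead of fixing them one at a time.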


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 52b7863a-a3fa-4cad-bb4b-1bb611dbd686

📥 Commits

Reviewing files that changed from the base of the PR and between 4e76050 and 73c9383.

📒 Files selected for processing (21)
  • claims/al_jazeera/eu-sa-trade-deal.yaml
  • claims/al_jazeera/us-tarrifs.yaml
  • claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
  • claims/bbc/russia_s_shadow_fleet_ships_de.yaml
  • claims/hindustan_times/china-zero-tarrif.yaml
  • claims/hindustan_times/india-gdp.yaml
  • claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
  • claims/india_times/rrb_alp_recruitment_2026_regi.yaml
  • claims/ndtv/ebola_outbreak_in_congo_kills.yaml
  • claims/south_china_morning_post/china-dutch-sanctions.yaml
  • claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
  • claims/south_china_morning_post/moore-threads-share.yaml
  • claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
  • claims/the_new_york_times/ai-backlash.yaml
  • claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
  • claims/the_new_york_times/pentagon-google-ai.yaml
  • claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
  • oapi.yaml
  • scripts/newsdata_io.py
  • scripts/openrouter.py
  • sources/Google-News.yaml
💤 Files with no reviewable changes (1)
  • sources/Google-News.yaml


@cubic-dev-ai cubic-dev-ai Bot left a comment


3 issues found across 21 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oapi.yaml">

<violation number="1" location="oapi.yaml:616">
P2: Do not modify `oapi.yaml` in this repository; it is documented as a copied artifact and must stay in sync with the canonical schema source.</violation>
</file>

<file name="scripts/newsdata_io.py">

<violation number="1" location="scripts/newsdata_io.py:4">
P1: This script imports `demjson3` and `requests`, but those packages are not declared in `requirements.txt`, so execution will fail in standard environments.</violation>

<violation number="2" location="scripts/newsdata_io.py:211">
P2: `oapi.yaml` is opened via a CWD-relative path, which makes the script fail when executed outside the repo root.</violation>
</file>
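
One way to address the CWD-relative path issue is to derive every path from the script's own location. The sketch below assumes the layout implied by this PR's file list (`scripts/newsdata_io.py`, with `oapi.yaml` and `claims/` at the repository root), so the root is taken as two levels above the script file. The helper names are hypothetical.

```python
from pathlib import Path


def repo_root(script_file: str) -> Path:
    """Repository root, assuming the script sits in <root>/scripts/."""
    return Path(script_file).resolve().parent.parent


def schema_path(script_file: str) -> Path:
    """Location of oapi.yaml, independent of the current working directory."""
    return repo_root(script_file) / "oapi.yaml"


def claim_path(script_file: str, src_name: str, filename: str) -> Path:
    """Location of a claim document under claims/<src_name>/."""
    return repo_root(script_file) / "claims" / src_name / filename
```

Inside the script these would be called with `__file__`, for example `schema_path(__file__).open(encoding="utf-8")`, so running it from any directory reads and writes the same files.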

Signed-off-by: Amit Singh <singhamitch@outlook.com>
@semmet95 semmet95 force-pushed the feat/claim-fetch-workflow branch from 73c9383 to 42b8328 Compare May 18, 2026 04:23

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@requirements.txt`:
- Line 5: The requirements.txt entry for the dependency "requests" is unpinned;
replace that single-line package entry "requests" with a pinned range (e.g.,
requests>=2.32.0,<3) so installs are reproducible and upgrades are bounded;
update the same line containing the "requests" token to include the floor and
ceiling version specifier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 4412c7bd-799d-430b-8637-6116963b77ac

📥 Commits

Reviewing files that changed from the base of the PR and between 73c9383 and 42b8328.

📒 Files selected for processing (18)
  • claims/al_jazeera/eu-sa-trade-deal.yaml
  • claims/al_jazeera/us-tarrifs.yaml
  • claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
  • claims/bbc/russia_s_shadow_fleet_ships_de.yaml
  • claims/hindustan_times/china-zero-tarrif.yaml
  • claims/hindustan_times/india-gdp.yaml
  • claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
  • claims/india_times/rrb_alp_recruitment_2026_regi.yaml
  • claims/ndtv/ebola_outbreak_in_congo_kills.yaml
  • claims/south_china_morning_post/china-dutch-sanctions.yaml
  • claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
  • claims/south_china_morning_post/moore-threads-share.yaml
  • claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
  • claims/the_new_york_times/ai-backlash.yaml
  • claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
  • claims/the_new_york_times/pentagon-google-ai.yaml
  • claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
  • requirements.txt
✅ Files skipped from review due to trivial changes (7)
  • claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
  • claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
  • claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
  • claims/bbc/russia_s_shadow_fleet_ships_de.yaml
  • claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
  • claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
  • claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • claims/india_times/rrb_alp_recruitment_2026_regi.yaml
  • claims/ndtv/ebola_outbreak_in_congo_kills.yaml

@semmet95 semmet95 merged commit 5c4eea1 into main May 18, 2026
2 checks passed
@semmet95 semmet95 deleted the feat/claim-fetch-workflow branch May 18, 2026 04:39