feat: adds python script to fetch claims for sources by semmet95 · Pull Request #30 · SatyaLens/sources

semmet95 · 2026-05-18T03:53:04Z

Summary by CodeRabbit

New Features
- Added dozens of new claim records from global news outlets (BBC, Al Jazeera, NDTV, The Guardian, NYT, Hindustan Times, India Times, Yahoo Finance, South China Morning Post).
New Tools
- Introduced a command-line importer to fetch and add news claims from external sources.
Updates
- API documentation example for claim submission clarified.
Removals
- Google News source removed.

Signed-off-by: Amit Singh <singhamitch@outlook.com>

qodo-code-review · 2026-05-18T03:53:08Z

Qodo reviews are paused for this user.

coderabbitai · 2026-05-18T03:53:15Z

📝 Walkthrough

This PR adds a CLI ingestion script that fetches news sources and candidate claims, filters candidates via multiple OpenRouter models, deduplicates against existing YAML records, writes normalized claim YAML files, updates the OpenRouter free-model list, and adds an example URI to the OpenAPI ClaimInput schema.

Changes

Claim Ingestion System

| Layer / File(s) | Summary |
|---|---|
| **Schema, requirements & configuration** `oapi.yaml`, `scripts/newsdata_io.py`, `scripts/openrouter.py`, `requirements.txt` | Adds an `example` for `ClaimInput.uri`, defines ingestion/runtime constants and the OpenRouter free model list, and adds `requests` and `demjson3` to Python deps. |
| **HTTP and API Helpers** `scripts/newsdata_io.py` | Implements `fetch_web_text()`, `get_sources()`, and OpenRouter request helpers (`post_openrouter()` / `req_openrouter()`) with timeout and status handling. |
| **LLM-Based Claim Filtering** `scripts/newsdata_io.py` | `filter_claims()` retries across free models, prompts for a plain JSON array, normalizes `None/True/False`, strips code fences, and returns a parsed array or exits. |
| **Claim Management and Deduplication** `scripts/newsdata_io.py` | `get_claims()` retrieves candidates, `update_claim_fields()`/`is_claim_new()` normalize fields and check novelty against existing `claims/` YAML, and `clean_filepath()` sanitizes filenames. |
| **YAML Document Generation** `scripts/newsdata_io.py` | `create_claim_docs()` reads `oapi.yaml` example keys, forces double-quoted YAML strings, derives sanitized filenames, and writes files under `claims/<srcName>/`. |
| **Main Orchestration** `scripts/newsdata_io.py` | `main()` reads env vars, downloads the falsifiable-claim skill, iterates sources with delays, applies filtering and deduplication, and persists new claim YAML files. |
| **Generated Claim Records** `claims/*` | Adds new claim YAML files for nine sources (Al Jazeera, BBC, Hindustan Times, India Times, NDTV, SCMP, The Guardian, NYT, Yahoo Finance), each with `sourceUriDigest`, `summary`, `title`, and `uri`. |
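The response clean-up described for `filter_claims()` (strip code fences, normalize `None/True/False`, parse a plain JSON array) might look roughly like the sketch below. This is an illustration only: the helper name `parse_model_array` is hypothetical, and it uses the standard-library `json` module rather than `demjson3`, so it is not the script's actual implementation.

```python
import json
import re


def parse_model_array(raw: str) -> list:
    """Parse a model reply that is expected to contain a plain JSON array.

    Hypothetical helper for illustration; not taken from scripts/newsdata_io.py.
    """
    text = raw.strip()
    # Drop an opening fence line such as ```json and a closing ``` fence, if present.
    text = re.sub(r"^```[a-zA-Z0-9]*\s*\n", "", text)
    text = re.sub(r"\n```\s*$", "", text)
    # Map Python-style literals to their JSON spellings.
    # Caveat: this simple version also rewrites these words inside string values.
    for py_literal, json_literal in (("None", "null"), ("True", "true"), ("False", "false")):
        text = re.sub(rf"\b{py_literal}\b", json_literal, text)
    parsed = json.loads(text)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array from the model")
    return parsed
```

A caller would typically wrap this in a `try`/`except ValueError` and move on to the next model when parsing fails, which matches the retry behaviour the summary describes.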

Sequence Diagram(s)

sequenceDiagram
  participant CLI as scripts/newsdata_io.py
  participant SourceAPI as Newsdata API
  participant LLM as OpenRouter
  participant LocalFS as claims/*.yaml
  CLI->>SourceAPI: get_sources()
  SourceAPI-->>CLI: source list
  CLI->>SourceAPI: get_claims(source)
  SourceAPI-->>CLI: candidate claims
  CLI->>LLM: filter_claims(prompt, candidate claims)
  LLM-->>CLI: JSON array of validated claims
  CLI->>LocalFS: is_claim_new()/create_claim_docs()
  LocalFS-->>CLI: written claim file paths

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I nibble through the news at dawn,
prompt the models, sift the lawn,
nine YAML sprouts in tidy rows,
a rabbit's script where data grows.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary change: adding a new Python script to fetch claims for sources, which is the main focus of this PR with a 294-line script addition. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

scripts/newsdata_io.py (1)

195-203: ⚡ Quick win

Avoid rescanning all YAML claim docs for each candidate claim.

This currently does full disk I/O per claim. Build in-memory URI/title indexes once and reuse during the loop.

Also applies to: 286-289

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/newsdata_io.py` around lines 195 - 203, The is_claim_new function
currently calls get_claim_docs() and re-scans disk for every candidate; instead,
build in-memory indexes (sets or dicts) of existing URIs and titles once (e.g.,
a build_claim_indexes or cache inside module scope) and have is_claim_new
consult those sets rather than calling get_claim_docs repeatedly; update calls
at both is_claim_new and the other occurrence around lines 286-289 to accept or
reference the prebuilt uri_set/title_set (or a get_claim_indexes() cached
wrapper) so no per-claim disk I/O occurs.
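One possible shape for the prebuilt indexes suggested above is sketched here. The names `build_claim_indexes` and `remember_claim`, the `is_claim_new` signature, and the case-insensitive title comparison are all assumptions made for illustration; the script's real functions may differ.

```python
from typing import Iterable, Mapping


def build_claim_indexes(docs: Iterable[Mapping[str, str]]) -> tuple[set[str], set[str]]:
    """Scan the already-loaded claim docs once and collect their URIs and titles."""
    uris: set[str] = set()
    titles: set[str] = set()
    for doc in docs:
        if doc.get("uri"):
            uris.add(doc["uri"].strip())
        if doc.get("title"):
            # Assumption: titles are compared case-insensitively.
            titles.add(doc["title"].strip().lower())
    return uris, titles


def is_claim_new(claim: Mapping[str, str], uris: set[str], titles: set[str]) -> bool:
    """A claim is new only if neither its URI nor its title has been seen."""
    uri = claim.get("uri", "").strip()
    title = claim.get("title", "").strip().lower()
    return uri not in uris and title not in titles


def remember_claim(claim: Mapping[str, str], uris: set[str], titles: set[str]) -> None:
    """Record an accepted claim so later candidates in the same run are deduplicated."""
    uris.add(claim.get("uri", "").strip())
    titles.add(claim.get("title", "").strip().lower())
```

The indexes would be built once from whatever `get_claim_docs()` returns, then passed into the per-claim loop, so membership checks become set lookups instead of disk scans.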

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/newsdata_io.py`:
- Around line 211-212: The script currently opens 'oapi.yaml' and writes claim
files using CWD-relative paths; change these to use script-relative repo paths
so runs outside the repo root behave correctly. Locate the open('oapi.yaml')
usage and the claim write logic (also referenced around lines 240-245) and
compute a base path from Path(__file__).resolve().parent (matching
get_claim_docs()'s resolution), then join that base to 'oapi.yaml' and to the
claim output filenames when reading/writing; update all file open calls to use
these Path objects (or their str()) so schema reads and claim writes are
repository-relative instead of CWD-relative. Ensure you update any helper
functions that construct paths so they reuse the same script_dir base.
- Around line 164-168: The code currently assumes response.json() contains a
"results" key after a 200 status; update the post-request handling (around
response, endpoint, params, src_domain_url) to safely extract results by first
attempting to parse JSON with error handling, then checking for the "results"
key (e.g., use dict.get or membership test) and only returning that list when
present; if JSON parsing fails or "results" is missing, log/print a clear error
including response.status_code and response.text (or the JSON error) and return
None instead of indexing directly.
- Around line 250-255: The code directly indexes os.environ for API_BASE_URL,
API_KEY, NEWSDATA_API_BASE_URL, NEWSDATA_API_KEY, OPENROUTER_API_KEY, and
OPENROUTER_API_BASE_URL which raises KeyError without helpful context; change
this to validate and fail fast with a clear error: use a small helper (e.g.,
get_required_env_var) or os.environ.get for each of those symbols and if missing
raise a ValueError or log an explicit message naming the missing variable(s)
before exiting so operators see which env var is absent when variables like
base_url, api_key, news_data_base_url, news_data_api_key, openrouter_api_key, or
openrouter_base_url are loaded.
- Line 120: Replace equality comparisons against None with identity checks:
change uses of "filtered_claims != None" (and other comparisons to None at lines
referenced) to "filtered_claims is not None", and similarly update
"filtered_sources == None" / "filtered_claim == None" / the comparison at line
275 to use "is None" or "is not None" as appropriate; locate these checks in
scripts/newsdata_io.py by searching for the symbols filtered_claims,
filtered_sources, filtered_claim and the surrounding conditional blocks and
update them to use "is" / "is not" to satisfy Ruff E711.
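For the `results` finding (lines 164-168), a defensive extraction step could be sketched as follows. The function name `extract_results` is hypothetical, and the `response` argument is only assumed to behave like a `requests.Response` (`status_code`, `text`, `json()`); the check on `None` also uses the identity comparison that the Ruff E711 finding asks for.

```python
from typing import Any, Optional


def extract_results(response: Any) -> Optional[list]:
    """Return the "results" list from a response-like object, or None on any problem.

    `response` is assumed to expose status_code, text, and json(), as
    requests.Response does. Hypothetical helper, not the script's own code.
    """
    if response.status_code != 200:
        print(f"request failed: HTTP {response.status_code}: {response.text[:200]}")
        return None
    try:
        payload = response.json()
    except ValueError as exc:
        # JSON decode errors raised by requests derive from ValueError.
        print(f"invalid JSON (HTTP {response.status_code}): {exc}")
        return None
    results = payload.get("results") if isinstance(payload, dict) else None
    if results is None:
        print(f'no "results" key in response (HTTP {response.status_code}): {response.text[:200]}')
        return None
    return results
```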

---

Nitpick comments:
In `@scripts/newsdata_io.py`:
- Around line 195-203: The is_claim_new function currently calls
get_claim_docs() and re-scans disk for every candidate; instead, build in-memory
indexes (sets or dicts) of existing URIs and titles once (e.g., a
build_claim_indexes or cache inside module scope) and have is_claim_new consult
those sets rather than calling get_claim_docs repeatedly; update calls at both
is_claim_new and the other occurrence around lines 286-289 to accept or
reference the prebuilt uri_set/title_set (or a get_claim_indexes() cached
wrapper) so no per-claim disk I/O occurs.
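The environment-variable finding (lines 250-255) could be handled with a small fail-fast loader along these lines. The variable names come from the review comment; the helper name `load_required_env` and the choice to treat empty values as missing are assumptions.

```python
import os
from typing import Mapping, Sequence

# Names taken from the review comment above.
REQUIRED_ENV_VARS = (
    "API_BASE_URL",
    "API_KEY",
    "NEWSDATA_API_BASE_URL",
    "NEWSDATA_API_KEY",
    "OPENROUTER_API_KEY",
    "OPENROUTER_API_BASE_URL",
)


def load_required_env(
    names: Sequence[str] = REQUIRED_ENV_VARS,
    environ: Mapping[str, str] = os.environ,
) -> dict:
    """Return the required variables, or raise one error naming every missing one."""
    # Assumption: an empty value is treated the same as an unset variable.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise ValueError("missing required environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Reporting all missing names in a single error means an operator does not have to rerun the script once per absent variable.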

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 52b7863a-a3fa-4cad-bb4b-1bb611dbd686

📥 Commits

Reviewing files that changed from the base of the PR and between 4e76050 and 73c9383.

📒 Files selected for processing (21)

claims/al_jazeera/eu-sa-trade-deal.yaml
claims/al_jazeera/us-tarrifs.yaml
claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
claims/bbc/russia_s_shadow_fleet_ships_de.yaml
claims/hindustan_times/china-zero-tarrif.yaml
claims/hindustan_times/india-gdp.yaml
claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
claims/india_times/rrb_alp_recruitment_2026_regi.yaml
claims/ndtv/ebola_outbreak_in_congo_kills.yaml
claims/south_china_morning_post/china-dutch-sanctions.yaml
claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
claims/south_china_morning_post/moore-threads-share.yaml
claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
claims/the_new_york_times/ai-backlash.yaml
claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
claims/the_new_york_times/pentagon-google-ai.yaml
claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
oapi.yaml
scripts/newsdata_io.py
scripts/openrouter.py
sources/Google-News.yaml

💤 Files with no reviewable changes (1)

sources/Google-News.yaml

cubic-dev-ai

3 issues found across 21 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="oapi.yaml">

<violation number="1" location="oapi.yaml:616">
P2: Do not modify `oapi.yaml` in this repository; it is documented as a copied artifact and must stay in sync with the canonical schema source.</violation>
</file>

<file name="scripts/newsdata_io.py">

<violation number="1" location="scripts/newsdata_io.py:4">
P1: This script imports `demjson3` and `requests`, but those packages are not declared in `requirements.txt`, so execution will fail in standard environments.</violation>

<violation number="2" location="scripts/newsdata_io.py:211">
P2: `oapi.yaml` is opened via a CWD-relative path, which makes the script fail when executed outside the repo root.</violation>
</file>
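For the CWD-relative path issue, one way to resolve paths from the script's own location is sketched below. It assumes the layout implied by the file list in this PR (`oapi.yaml` and `claims/` at the repository root, the script under `scripts/`); the helper name `repo_paths` is hypothetical.

```python
from pathlib import Path


def repo_paths(script_file: str) -> tuple[Path, Path]:
    """Return (path to oapi.yaml, claims directory) for a script in <repo>/scripts/.

    Assumes oapi.yaml and claims/ sit one level above the scripts/ directory.
    """
    repo_root = Path(script_file).resolve().parent.parent
    return repo_root / "oapi.yaml", repo_root / "claims"
```

Inside the script this would be called as `schema_path, claims_dir = repo_paths(__file__)`, after which `open(schema_path)` behaves the same regardless of the directory the script is launched from.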

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@requirements.txt`:
- Line 5: The requirements.txt entry for the dependency "requests" is unpinned;
replace that single-line package entry "requests" with a pinned range (e.g.,
requests>=2.32.0,<3) so installs are reproducible and upgrades are bounded;
update the same line containing the "requests" token to include the floor and
ceiling version specifier.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 4412c7bd-799d-430b-8637-6116963b77ac

📥 Commits

Reviewing files that changed from the base of the PR and between 73c9383 and 42b8328.

📒 Files selected for processing (18)

claims/al_jazeera/eu-sa-trade-deal.yaml
claims/al_jazeera/us-tarrifs.yaml
claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
claims/bbc/russia_s_shadow_fleet_ships_de.yaml
claims/hindustan_times/china-zero-tarrif.yaml
claims/hindustan_times/india-gdp.yaml
claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
claims/india_times/rrb_alp_recruitment_2026_regi.yaml
claims/ndtv/ebola_outbreak_in_congo_kills.yaml
claims/south_china_morning_post/china-dutch-sanctions.yaml
claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
claims/south_china_morning_post/moore-threads-share.yaml
claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
claims/the_new_york_times/ai-backlash.yaml
claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
claims/the_new_york_times/pentagon-google-ai.yaml
claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
requirements.txt

✅ Files skipped from review due to trivial changes (7)

claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
claims/bbc/russia_s_shadow_fleet_ships_de.yaml
claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

claims/india_times/rrb_alp_recruitment_2026_regi.yaml
claims/ndtv/ebola_outbreak_in_congo_kills.yaml

semmet95 added 10 commits May 17, 2026 20:30

feat: adds script to fetch claims for added sources

ef2336b

Signed-off-by: Amit Singh <singhamitch@outlook.com>

feat: updates filtered claims fields to make it compatible with claim…

c90651d

… model Signed-off-by: Amit Singh <singhamitch@outlook.com>

feat: adds logic to load all the claim docs in the repo

ee69dc1

Signed-off-by: Amit Singh <singhamitch@outlook.com>

feat: adds logic to create claim yaml docs

c19d2d3

Signed-off-by: Amit Singh <singhamitch@outlook.com>

fix: makes openrouter response process logic more robust

bbcb9a2

Signed-off-by: Amit Singh <singhamitch@outlook.com>

chore: adds example for claim input component

e93560b

Signed-off-by: Amit Singh <singhamitch@outlook.com>

fix: formats yaml output file

814982d

Signed-off-by: Amit Singh <singhamitch@outlook.com>

testing with llmrouter

acb3602

Signed-off-by: Amit Singh <singhamitch@outlook.com>

chore: removes google news source

1121069

Signed-off-by: Amit Singh <singhamitch@outlook.com>

chore: updates free model list

bd4fad0

Signed-off-by: Amit Singh <singhamitch@outlook.com>

coderabbitai Bot reviewed May 18, 2026

cubic-dev-ai Bot reviewed May 18, 2026

chore: adds new claims for sources

42b8328

Signed-off-by: Amit Singh <singhamitch@outlook.com>

semmet95 force-pushed the feat/claim-fetch-workflow branch from 73c9383 to 42b8328 Compare May 18, 2026 04:23

coderabbitai Bot reviewed May 18, 2026

semmet95 merged commit 5c4eea1 into main May 18, 2026
2 checks passed

semmet95 deleted the feat/claim-fetch-workflow branch May 18, 2026 04:39

coderabbitai Bot mentioned this pull request May 18, 2026

refactor: simplifies claim fetch logic #31

Open

semmet95 merged 11 commits into main from feat/claim-fetch-workflow
