feat: adds python script to fetch claims for sources #30
Signed-off-by: Amit Singh <singhamitch@outlook.com>
… model Signed-off-by: Amit Singh <singhamitch@outlook.com>
📝 Walkthrough
This PR adds a CLI ingestion script that fetches news sources and candidate claims, filters candidates via multiple OpenRouter models, deduplicates against existing YAML records, writes normalized claim YAML files, updates the OpenRouter free-model list, and adds an example URI to the OpenAPI ClaimInput schema.
Changes
Claim Ingestion System
Sequence Diagram(s)
sequenceDiagram
participant CLI as scripts/newsdata_io.py
participant SourceAPI as Newsdata API
participant LLM as OpenRouter
participant LocalFS as claims/*.yaml
CLI->>SourceAPI: get_sources()
SourceAPI-->>CLI: source list
CLI->>SourceAPI: get_claims(source)
SourceAPI-->>CLI: candidate claims
CLI->>LLM: filter_claims(prompt, candidate claims)
LLM-->>CLI: JSON array of validated claims
CLI->>LocalFS: is_claim_new()/create_claim_docs()
LocalFS-->>CLI: written claim file paths
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
scripts/newsdata_io.py (1)
195-203: ⚡ Quick win
Avoid rescanning all YAML claim docs for each candidate claim.
This currently does full disk I/O per claim. Build in-memory URI/title indexes once and reuse during the loop.
Also applies to: 286-289
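The index idea in this nitpick can be sketched as follows. This is a minimal illustration and not the script's actual code: `ClaimIndex` and `build_claim_index` are hypothetical names, the claim documents are assumed to be dicts with `uri` and `title` keys, and the case-insensitive title match is an assumption about how duplicates should be detected.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimIndex:
    """Hypothetical in-memory index of URIs and titles already on disk."""

    uris: set = field(default_factory=set)
    titles: set = field(default_factory=set)

    def add(self, claim: dict) -> None:
        # Assumed document shape: {"uri": ..., "title": ...}.
        if claim.get("uri"):
            self.uris.add(claim["uri"])
        if claim.get("title"):
            self.titles.add(claim["title"].strip().lower())

    def is_new(self, claim: dict) -> bool:
        # A claim counts as new only if neither its URI nor its
        # normalized title has been seen before.
        uri = claim.get("uri")
        title = (claim.get("title") or "").strip().lower()
        return uri not in self.uris and title not in self.titles


def build_claim_index(existing_docs) -> ClaimIndex:
    """Scan the existing claim documents once and index them."""
    index = ClaimIndex()
    for doc in existing_docs:
        index.add(doc)
    return index
```

With this shape, the documents are read once before the loop (for example by passing the result of `get_claim_docs()` to `build_claim_index`), each candidate is checked with `index.is_new(candidate)`, and every newly written claim is added with `index.add(candidate)` so later candidates in the same run are deduplicated too.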
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/newsdata_io.py`:
- Around line 211-212: The script currently opens 'oapi.yaml' and writes claim
files using CWD-relative paths; change these to use script-relative repo paths
so runs outside the repo root behave correctly. Locate the open('oapi.yaml')
usage and the claim write logic (also referenced around lines 240-245) and
compute a base path from Path(__file__).resolve().parent (matching
get_claim_docs()'s resolution), then join that base to 'oapi.yaml' and to the
claim output filenames when reading/writing; update all file open calls to use
these Path objects (or their str()) so schema reads and claim writes are
repository-relative instead of CWD-relative. Ensure you update any helper
functions that construct paths so they reuse the same script_dir base.
- Around line 164-168: The code currently assumes response.json() contains a
"results" key after a 200 status; update the post-request handling (around
response, endpoint, params, src_domain_url) to safely extract results by first
attempting to parse JSON with error handling, then checking for the "results"
key (e.g., use dict.get or membership test) and only returning that list when
present; if JSON parsing fails or "results" is missing, log/print a clear error
including response.status_code and response.text (or the JSON error) and return
None instead of indexing directly.
- Around line 250-255: The code directly indexes os.environ for API_BASE_URL,
API_KEY, NEWSDATA_API_BASE_URL, NEWSDATA_API_KEY, OPENROUTER_API_KEY, and
OPENROUTER_API_BASE_URL which raises KeyError without helpful context; change
this to validate and fail fast with a clear error: use a small helper (e.g.,
get_required_env_var) or os.environ.get for each of those symbols and if missing
raise a ValueError or log an explicit message naming the missing variable(s)
before exiting so operators see which env var is absent when variables like
base_url, api_key, news_data_base_url, news_data_api_key, openrouter_api_key, or
openrouter_base_url are loaded.
- Line 120: Replace equality comparisons against None with identity checks:
change uses of "filtered_claims != None" (and other comparisons to None at lines
referenced) to "filtered_claims is not None", and similarly update
"filtered_sources == None" / "filtered_claim == None" / the comparison at line
275 to use "is None" or "is not None" as appropriate; locate these checks in
scripts/newsdata_io.py by searching for the symbols filtered_claims,
filtered_sources, filtered_claim and the surrounding conditional blocks and
update them to use "is" / "is not" to satisfy Ruff E711.
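
For the finding about lines 164-168, one possible shape for the safer response handling is sketched below. `extract_results` is a hypothetical helper, not a function from the script; it only assumes a response object exposing `status_code`, `text`, and `json()`, as a `requests.Response` does, and it relies on JSON decode errors being subclasses of `ValueError`.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def extract_results(response: Any) -> Optional[list]:
    """Return the "results" list from a response, or None on any failure."""
    if response.status_code != 200:
        logger.error("Request failed: status=%s body=%s",
                     response.status_code, response.text)
        return None
    try:
        payload = response.json()
    except ValueError as exc:
        # JSON decode errors raised by json() are ValueError subclasses.
        logger.error("Invalid JSON: status=%s error=%s body=%s",
                     response.status_code, exc, response.text)
        return None
    # Guard against non-dict payloads and a missing "results" key
    # instead of indexing payload["results"] directly.
    results = payload.get("results") if isinstance(payload, dict) else None
    if results is None:
        logger.error("Response has no 'results' key: status=%s body=%s",
                     response.status_code, response.text)
        return None
    return results
```

Callers would then treat `None` as "skip this source" rather than letting a `KeyError` end the run.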
---
Nitpick comments:
In `@scripts/newsdata_io.py`:
- Around line 195-203: The is_claim_new function currently calls
get_claim_docs() and re-scans disk for every candidate; instead, build in-memory
indexes (sets or dicts) of existing URIs and titles once (e.g., a
build_claim_indexes or cache inside module scope) and have is_claim_new consult
those sets rather than calling get_claim_docs repeatedly; update calls at both
is_claim_new and the other occurrence around lines 286-289 to accept or
reference the prebuilt uri_set/title_set (or a get_claim_indexes() cached
wrapper) so no per-claim disk I/O occurs.
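
The environment-variable finding above (lines 250-255) could be handled with a small fail-fast helper along these lines. This is a sketch under assumptions: `load_required_env` is a hypothetical name, the variable names are the ones listed in the review comment, and treating an empty value as missing is a design choice not stated in the review.

```python
import os

# Variable names taken from the review comment.
REQUIRED_ENV_VARS = (
    "API_BASE_URL",
    "API_KEY",
    "NEWSDATA_API_BASE_URL",
    "NEWSDATA_API_KEY",
    "OPENROUTER_API_KEY",
    "OPENROUTER_API_BASE_URL",
)


def load_required_env(names=REQUIRED_ENV_VARS, environ=None) -> dict:
    """Return the requested variables, or raise naming every missing one."""
    env = os.environ if environ is None else environ
    # Assumption: an empty string is as unusable as an unset variable.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError(
            "Missing required environment variable(s): " + ", ".join(missing)
        )
    return {name: env[name] for name in names}
```

Collecting all missing names before raising lets an operator fix the whole configuration in one pass instead of discovering the variables one `KeyError` at a time.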
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 52b7863a-a3fa-4cad-bb4b-1bb611dbd686
📒 Files selected for processing (21)
- claims/al_jazeera/eu-sa-trade-deal.yaml
- claims/al_jazeera/us-tarrifs.yaml
- claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
- claims/bbc/russia_s_shadow_fleet_ships_de.yaml
- claims/hindustan_times/china-zero-tarrif.yaml
- claims/hindustan_times/india-gdp.yaml
- claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
- claims/india_times/rrb_alp_recruitment_2026_regi.yaml
- claims/ndtv/ebola_outbreak_in_congo_kills.yaml
- claims/south_china_morning_post/china-dutch-sanctions.yaml
- claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
- claims/south_china_morning_post/moore-threads-share.yaml
- claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
- claims/the_new_york_times/ai-backlash.yaml
- claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
- claims/the_new_york_times/pentagon-google-ai.yaml
- claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
- oapi.yaml
- scripts/newsdata_io.py
- scripts/openrouter.py
- sources/Google-News.yaml
💤 Files with no reviewable changes (1)
- sources/Google-News.yaml
3 issues found across 21 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="oapi.yaml">
<violation number="1" location="oapi.yaml:616">
P2: Do not modify `oapi.yaml` in this repository; it is documented as a copied artifact and must stay in sync with the canonical schema source.</violation>
</file>
<file name="scripts/newsdata_io.py">
<violation number="1" location="scripts/newsdata_io.py:4">
P1: This script imports `demjson3` and `requests`, but those packages are not declared in `requirements.txt`, so execution will fail in standard environments.</violation>
<violation number="2" location="scripts/newsdata_io.py:211">
P2: `oapi.yaml` is opened via a CWD-relative path, which makes the script fail when executed outside the repo root.</violation>
</file>
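The CWD-relative path issue flagged for `oapi.yaml` could be addressed with path helpers like the ones below. This is only a sketch: the function names are hypothetical, and it assumes the layout implied by the file list, with the script in `scripts/` and both `oapi.yaml` and `claims/` at the repository root.

```python
from pathlib import Path


def repo_root(script_file: str) -> Path:
    """Repository root, assuming the script lives in <repo>/scripts/."""
    return Path(script_file).resolve().parent.parent


def schema_path(script_file: str) -> Path:
    """Location of oapi.yaml, independent of the current working directory."""
    return repo_root(script_file) / "oapi.yaml"


def claim_path(script_file: str, source_dir: str, filename: str) -> Path:
    """Output path for a claim file under <repo>/claims/<source_dir>/."""
    return repo_root(script_file) / "claims" / source_dir / filename
```

Inside the script these would be called with `__file__`, for example `open(schema_path(__file__))`, so the same files are read and written whether the script is launched from the repo root or from elsewhere.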
Signed-off-by: Amit Singh <singhamitch@outlook.com>
73c9383 to 42b8328
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@requirements.txt`:
- Line 5: The requirements.txt entry for the dependency "requests" is unpinned;
replace that single-line package entry "requests" with a pinned range (e.g.,
requests>=2.32.0,<3) so installs are reproducible and upgrades are bounded;
update the same line containing the "requests" token to include the floor and
ceiling version specifier.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 4412c7bd-799d-430b-8637-6116963b77ac
📒 Files selected for processing (18)
- claims/al_jazeera/eu-sa-trade-deal.yaml
- claims/al_jazeera/us-tarrifs.yaml
- claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
- claims/bbc/russia_s_shadow_fleet_ships_de.yaml
- claims/hindustan_times/china-zero-tarrif.yaml
- claims/hindustan_times/india-gdp.yaml
- claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
- claims/india_times/rrb_alp_recruitment_2026_regi.yaml
- claims/ndtv/ebola_outbreak_in_congo_kills.yaml
- claims/south_china_morning_post/china-dutch-sanctions.yaml
- claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
- claims/south_china_morning_post/moore-threads-share.yaml
- claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
- claims/the_new_york_times/ai-backlash.yaml
- claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
- claims/the_new_york_times/pentagon-google-ai.yaml
- claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
- requirements.txt
✅ Files skipped from review due to trivial changes (7)
- claims/yahoo_finance/pet_valu_q1_earnings_call_high.yaml
- claims/hindustan_times/yogi_reduces_convoy_size_in_go.yaml
- claims/south_china_morning_post/congo_ebola_outbreak_constant.yaml
- claims/bbc/russia_s_shadow_fleet_ships_de.yaml
- claims/the_guardian/timmy_the_whale_confirmed_dead.yaml
- claims/the_new_york_times/f_licien_kabuga_dies_an_accus.yaml
- claims/al_jazeera/zimbabwe_s_diaspora_reshapes_r.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- claims/india_times/rrb_alp_recruitment_2026_regi.yaml
- claims/ndtv/ebola_outbreak_in_congo_kills.yaml