Skip to content

refactor: simplifies claim fetch logic#31

Open
semmet95 wants to merge 3 commits into
mainfrom
refactor/python-scripts
Open

refactor: simplifies claim fetch logic#31
semmet95 wants to merge 3 commits into
mainfrom
refactor/python-scripts

Conversation

@semmet95
Copy link
Copy Markdown
Contributor

@semmet95 semmet95 commented May 18, 2026

Summary by CodeRabbit

  • New Features

    • Added required source field domainUrlNewsData
    • Added pipelines to ingest claims and to ingest sources from Markdown
  • Improvements

    • Better retry and error handling for remote model/API calls
    • New helper utilities for fetching, sanitizing, and normalizing data
    • Simplified news-data fetching with centralized configuration and deduplication/validation improvements

Review Change Stack

semmet95 added 2 commits May 18, 2026 16:59
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Signed-off-by: Amit Singh <singhamitch@outlook.com>
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

Walkthrough

Adds helper utilities, modernizes OpenRouter client with retries/tools, simplifies NewsData client, introduces two CLI ingestion scripts (sources and claims), updates source OpenAPI schemas to include/require domainUrlNewsData, and switches the scraper to call the new source ingester.

Changes

Data Ingestion Infrastructure & Pipeline Refactor

Layer / File(s) Summary
OpenAPI Schema Update
oapi.yaml
OpenAPI schema adds optional domainUrlNewsData to SourceInput/SourcePatchInput and makes domainUrlNewsData required on the Source entity.
Shared Helper Module
scripts/helper.py
New helper utilities: URL/file text readers, get_oapi_spec(), get_sources(), cleanup_json_str(), clean_filepath(), and post_request().
OpenRouter Request Refactoring
scripts/openrouter.py
Reads config from env, introduces req_chat() with retry behavior (transport/429/5xx) and req_w_addons() that attaches skills/tools and iterates models.
NewsData API Simplification
scripts/newsdata_io.py
Loads NewsData config from env and rewrites get_claims(src_domain_url) to build requests internally; removed prior claim-ingest helpers.
Source Ingestion CLI
scripts/ingest_sources.py
CLI that converts scraped Markdown into deduplicated YAML source docs using OpenRouter steps and the SourceInput schema example.
Claim Ingestion CLI
scripts/ingest_claims.py
CLI that fetches claims per source via NewsData, filters/classifies with OpenRouter (falsifiable-claim skill + web search), deduplicates, normalizes, and writes claim YAML files.
Scraper integration
scripts/source_scraper.sh
Post-scrape processor switched to call scripts/ingest_sources.py instead of the previous consumer.

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

  • SatyaLens/sources#30: Refactors and earlier work on claims-fetching and ingestion pipelines closely related to the new ingest_claims.py and helper/OpenRouter changes.

Poem

🐰 Scripts hop, fetch, and tidy neat,

helpers hum, retries drum a beat,
sources born from scraped-marked lore,
claims find homes in YAML store,
rabbits cheer pipelines complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main refactoring effort: simplifying claim fetch logic by extracting it into modular helper functions and moving OpenRouter request handling to a dedicated module.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/python-scripts

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🧹 Nitpick comments (2)
scripts/ingest_sources.py (1)

101-168: 💤 Low value

main() implicitly returns None, causing SystemExit(None) which exits with code 0.

When main() completes successfully, it returns None implicitly. At line 168, raise SystemExit(main()) becomes raise SystemExit(None), which exits with code 0. This is correct behavior, but for clarity and consistency with the explicit sys.exit(1) calls for errors, consider adding an explicit return 0 at the end of main().

Proposed fix
     # Write each unique source YAML doc into the `sources/` folder
     write_source_docs(unique_src_docs, sources_dir)    
+
+    return 0

 if __name__ == "__main__":
     raise SystemExit(main())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_sources.py` around lines 101 - 168, The main() function
currently falls through and implicitly returns None, so the call raise
SystemExit(main()) becomes raise SystemExit(None); add an explicit successful
return value by appending "return 0" at the end of main() so SystemExit(main())
will raise SystemExit(0) on success; update the function defined as main() to
end with return 0 (so the invocation raise SystemExit(main()) reflects a
non-error exit code).
scripts/ingest_claims.py (1)

77-77: ⚡ Quick win

Global YAML representer modification may cause unintended side effects.

yaml.add_representer(str, quoted_str_representer) modifies the global PyYAML state, affecting all subsequent yaml.dump calls in this process. If this module is imported elsewhere or the function is called multiple times, this could have unintended consequences.

Consider using a local yaml.Dumper subclass instead.

Proposed fix using a custom Dumper class
-    # Custom representer to force double quotes around strings
-    def quoted_str_representer(dumper, data):
-        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
-    
-    yaml.add_representer(str, quoted_str_representer)
+    # Custom Dumper to force double quotes around strings
+    class QuotedDumper(yaml.SafeDumper):
+        pass
+    
+    def quoted_str_representer(dumper, data):
+        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
+    
+    QuotedDumper.add_representer(str, quoted_str_representer)

Then update the dump call:

-            yaml.dump(claim_doc, f, default_flow_style=False, allow_unicode=True, width=float('inf'))
+            yaml.dump(claim_doc, f, Dumper=QuotedDumper, default_flow_style=False, allow_unicode=True, width=float('inf'))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` at line 77, The call yaml.add_representer(str,
quoted_str_representer) mutates global PyYAML state and should be replaced with
a local Dumper to avoid global side effects; create a custom Dumper subclass
(e.g., ClaimsYAMLDumper) that registers quoted_str_representer for str and then
pass that Dumper to yaml.dump calls in this module (replace direct
yaml.dump(...) with yaml.dump(..., Dumper=ClaimsYAMLDumper)) so the representer
is scoped locally and does not affect other imports or repeated invocations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@oapi.yaml`:
- Around line 489-495: The SourceInput schema must match the required Source
contract: make domainUrlNewsData required on SourceInput (same as Source) so
POST /api/v1/source cannot accept payloads missing it; locate the
domainUrlNewsData property in the oapi.yaml SourceInput definition and add it to
that schema's required array (and remove/adjust any omitnil validation tag if it
conflicts), ensuring any paths that reference SourceInput/$ref are updated so
validation/POST scripts remain in sync with the Source model.

In `@scripts/helper.py`:
- Around line 14-16: The helper currently passes raw URLs into urlopen() (e.g.,
in get_text_from_url) which allows non-HTTP schemes like file:; before opening,
parse the URL (urllib.parse.urlparse) and explicitly allow only "http" and
"https" schemes, raising an exception (ValueError) for anything else, and apply
the same validation to the other urlopen call in the file (the call in the 61-70
range) so both get_text_from_url and the other URL-loading function reject
non-HTTP(s) schemes.
- Around line 69-75: The blanket except around urlopen(req, timeout=timeout) is
collapsing HTTPError responses into (-1, ""), so update the error handling to
catch urllib.error.HTTPError separately: in the try/except, add an except
HTTPError as e block that sets status = e.code and body =
e.read().decode("utf-8") (or "" if no body) and proceeds to return those values
(preserving 429/5xx), then keep a broader except Exception as e to handle other
failures and return (-1, ""); reference urlopen, HTTPError, status, body,
endpoint, req, and timeout when making the change.

In `@scripts/ingest_claims.py`:
- Line 130: Replace the equality check against None in the conditional that
tests the variable claims (the line containing "if claims == None or len(claims)
== 0:") with an identity check using "is None" so the condition becomes "if
claims is None or len(claims) == 0:"; this ensures correct None comparison for
the claims variable and avoids unintended behavior.
- Around line 113-114: Change the None check on the variable `sources` to use
identity comparison (`is None`) instead of `==`, and remove the unnecessary
f-string prefix on the error print (use a plain string for the message in the
`print(..., file=sys.stderr)` call) so the conditional reads a proper None check
and the printed message is a normal string.
- Around line 82-83: The loop over claim_example keys can raise KeyError when
claim is missing expected fields; update the assignment in the loop that
populates claim_doc so it uses claim.get(key, fallback) instead of claim[key],
converting the result to str and choosing a sensible fallback (e.g., empty
string or claim_example[key]) to preserve schema defaults; adjust the loop that
references claim_example, claim_doc, and claim accordingly.
- Around line 44-46: The open() call that reads YAML (the 'with open(yaml_file,
'r') as f:' inside the try block) should include encoding='utf-8' to match other
file operations and avoid platform-dependent encoding issues; update that call
to open the file with encoding='utf-8' so yaml.safe_load reads the file
consistently with other reads (see other file ops in this script for reference).
- Around line 60-63: The loop that checks for duplicates can raise KeyError when
existing YAML docs lack "uri" or "title"; in the for-loop over all_claim_docs
(the variables claim_doc, all_claim_docs and the new_claim flag), replace direct
dict access claim_doc["uri"] and claim_doc["title"] with safe lookups using
claim_doc.get("uri") and claim_doc.get("title") (or provide a default like None)
and compare to the incoming claim's values (claim["uri"], claim["title"]); this
avoids KeyError and ensures the equality check still sets new_claim = False and
breaks when a match is found while skipping or safely handling documents missing
those keys.

In `@scripts/ingest_sources.py`:
- Line 103: The usage message currently prints "Usage: openrouter.py TMP_MD"
which references the wrong script; update the print call that contains the
string "Usage: openrouter.py TMP_MD" so it instead references the correct script
name (e.g., "ingest_sources.py TMP_MD") so the usage output is accurate; locate
the print statement in scripts/ingest_sources.py (the call print("Usage:
openrouter.py TMP_MD", file=sys.stderr)) and replace the literal script name
accordingly.
- Line 58: The condition uses equality comparison to check for None ("if doc ==
None:"); change it to an identity check ("if doc is None:") to follow PEP 8.
Locate the occurrence of the "if doc == None" check in scripts/ingest_sources.py
(the variable named doc) and replace the equality operator with "is", and scan
for any other "== None" occurrences in the same file to update similarly.
- Line 80: The condition checking for a missing filename uses equality
comparison to None; change the check to use identity comparison (use "is None")
for the variable filename in that conditional (the if that currently reads "if
filename == None or filename.strip() == '':"), so replace the == None with "is
None" while keeping the rest of the logic intact to avoid false positives when
filename is an empty string.

In `@scripts/newsdata_io.py`:
- Around line 27-31: The code assumes response.json()["results"] exists;
instead, parse the JSON safely (catch JSONDecodeError), verify the top-level
payload is a dict and contains a "results" key whose value is a list before
returning it, and if not, log/print a helpful message including src_domain_url
and the raw payload or status and return None or an empty list; replace the
direct access response.json()["results"] with this safe-check logic around the
existing variables (response, endpoint, params, src_domain_url).

---

Nitpick comments:
In `@scripts/ingest_claims.py`:
- Line 77: The call yaml.add_representer(str, quoted_str_representer) mutates
global PyYAML state and should be replaced with a local Dumper to avoid global
side effects; create a custom Dumper subclass (e.g., ClaimsYAMLDumper) that
registers quoted_str_representer for str and then pass that Dumper to yaml.dump
calls in this module (replace direct yaml.dump(...) with yaml.dump(...,
Dumper=ClaimsYAMLDumper)) so the representer is scoped locally and does not
affect other imports or repeated invocations.

In `@scripts/ingest_sources.py`:
- Around line 101-168: The main() function currently falls through and
implicitly returns None, so the call raise SystemExit(main()) becomes raise
SystemExit(None); add an explicit successful return value by appending "return
0" at the end of main() so SystemExit(main()) will raise SystemExit(0) on
success; update the function defined as main() to end with return 0 (so the
invocation raise SystemExit(main()) reflects a non-error exit code).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 13a07b7d-38cc-4a3a-b2cc-900cf9915583

📥 Commits

Reviewing files that changed from the base of the PR and between 5c4eea1 and 0e13a1f.

📒 Files selected for processing (6)
  • oapi.yaml
  • scripts/helper.py
  • scripts/ingest_claims.py
  • scripts/ingest_sources.py
  • scripts/newsdata_io.py
  • scripts/openrouter.py

Comment thread oapi.yaml
Comment thread scripts/helper.py
Comment thread scripts/helper.py Outdated
Comment thread scripts/ingest_claims.py
Comment thread scripts/ingest_claims.py
Comment thread scripts/ingest_claims.py Outdated
Comment thread scripts/ingest_sources.py Outdated
Comment thread scripts/ingest_sources.py Outdated
Comment thread scripts/ingest_sources.py Outdated
Comment thread scripts/newsdata_io.py
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 issues found across 6 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread scripts/helper.py Outdated
Comment thread scripts/ingest_sources.py
Comment thread scripts/ingest_claims.py
Comment thread oapi.yaml
Comment thread scripts/newsdata_io.py
Comment thread scripts/openrouter.py
Comment thread scripts/ingest_claims.py
Comment thread scripts/ingest_claims.py
Comment thread scripts/ingest_claims.py
Comment thread scripts/ingest_sources.py Outdated
Signed-off-by: Amit Singh <singhamitch@outlook.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (2)
scripts/ingest_claims.py (2)

82-83: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hard-failing when model output misses schema keys.

Line 83 directly indexes claim[key]; any missing field causes KeyError and stops writing remaining valid claims.

Proposed fix
         for key in claim_example.keys():
-            claim_doc[key] = str(claim[key])
+            claim_doc[key] = str(claim.get(key, ""))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` around lines 82 - 83, The loop over claim_example
keys currently does claim_doc[key] = str(claim[key]) which will raise a KeyError
if the model output `claim` is missing a schema key; change this to safely
access the value (e.g., use claim.get(key) with a sensible default like "" or
None) and convert that result to string before assigning to `claim_doc[key]`,
optionally logging or collecting missing keys for diagnostics; update the loop
that iterates over `claim_example.keys()` and touches `claim_doc`, and keep all
references to `claim`, `claim_example`, `claim_doc`, and `key`.

60-62: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use safe key access when comparing against stored claim docs.

Line 61 can raise KeyError if an existing YAML doc is missing uri or title, which aborts dedupe and the run.

Proposed fix
-        for claim_doc in all_claim_docs:
-            if claim["uri"] == claim_doc["uri"] or claim["title"] == claim_doc["title"]:
+        for claim_doc in all_claim_docs:
+            if claim["uri"] == claim_doc.get("uri") or claim["title"] == claim_doc.get("title"):
                 new_claim = False
                 break
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` around lines 60 - 62, The dedupe loop can raise
KeyError when accessing claim["uri"] or claim["title"]; change the comparisons
in the for all_claim_docs loop to use safe access (e.g., claim.get("uri"),
claim_doc.get("uri"), claim.get("title"), claim_doc.get("title")) and only
compare when the keys are present (or compare None-safely), so the block that
sets new_claim = False uses these .get calls instead of direct indexing; keep
the same variables (all_claim_docs, claim_doc, claim, new_claim) and preserve
the existing control flow.
🧹 Nitpick comments (1)
scripts/openrouter.py (1)

71-82: ⚡ Quick win

Place the system skill message before the user message to align with OpenRouter's OpenAI-compatible chat format expectations.

OpenRouter routes requests to various underlying model providers, many of which strictly enforce system messages appearing first in the messages array. The current implementation appends the system message after the user content, which can lead to suboptimal performance or request failures depending on the model used.

Suggested refactor
-        payload = {
-            "model": model,
-            "messages": [
-                {
-                    "role": "user",
-                    "content": content
-                },
-            ],
-        }
-
-        if skill != "":
-            payload["messages"].append({"role": "system", "content": skill})
+        messages = []
+        if skill != "":
+            messages.append({"role": "system", "content": skill})
+        messages.append(
+            {
+                "role": "user",
+                "content": content,
+            }
+        )
+        payload = {
+            "model": model,
+            "messages": messages,
+        }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/openrouter.py` around lines 71 - 82, The payload currently appends
the system skill after the user message which can break OpenRouter's
OpenAI-compatible ordering; modify the payload construction so the system
message (when skill != "") is inserted before the user message in
payload["messages"]—locate the payload dict creation and the conditional that
references payload, "messages", skill, and content and ensure the messages array
starts with {"role":"system","content":skill} (if provided) followed by
{"role":"user","content":content}.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/helper.py`:
- Around line 29-35: The get_sources function currently calls requests.get
without catching transport exceptions; wrap the requests.get(...) call in a
try/except that catches requests.RequestException (covers Timeout,
ConnectionError, etc.), print or log the exception details (e.g., include the
exception message) and return None on error so network failures behave like
non-200 HTTP responses; update get_sources to use this try/except around the
request and keep the existing status_code check afterward.

In `@scripts/ingest_claims.py`:
- Around line 32-35: update_claim_fields currently assumes claim["description"]
and claim["link"] exist and will raise KeyError for partial model outputs;
change it to safely read those keys (use claim.get("description") and
claim.get("link") or check key presence) and only set claim["summary"] and
claim["uri"] when values are present (or set a safe default like None/empty
string). Apply the same guarded access pattern to the other place where claims
are mutated (the block that also assigns "sourceUriDigest", "summary", and
"uri") so ingestion no longer crashes on missing fields.
- Around line 159-161: The deduplication baseline all_claim_docs is loaded once
and not updated, so new claims created by create_claim_docs aren't visible to
subsequent sources; after calling create_claim_docs(new_unique_claims,
source["name"]) update the baseline (either refresh all_claim_docs from the
store or extend the in-memory all_claim_docs with the newly created claim
documents/identifiers) so that subsequent calls to
get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"])
consider claims ingested earlier in the same run.
- Around line 149-155: After parsing filtered_claims into filtered_claims_list
(from helper.cleanup_json_str), validate that the result is a list of dicts
before proceeding to get_new_claims: check isinstance(filtered_claims_list,
list) and that every item is a dict (or, if you want to accept a single dict,
coerce it into a single-item list), otherwise log an error (including the
offending payload and parse result) to stderr or process logger and continue;
update the code around filtered_claims_list and the downstream call to
get_new_claims to only run when this shape validation passes.

In `@scripts/ingest_sources.py`:
- Around line 54-64: The YAML loader may return non-dict types, so before
calling .get(...) in ingest_sources.py you should guard that the loaded `doc` is
a dict (e.g., `if not isinstance(doc, dict): warn and continue`) and likewise
ensure each `src` in `existing` is a dict before using `src.get(...)`; update
the checks around `doc = yaml.safe_load(doc_str)` and the duplicate-detection
loop (the `doc.get('name') == src.get('name') or doc.get('uri') ==
src.get('uri')` logic) to skip/log non-dict payloads to avoid AttributeError
during ingestion.
- Around line 123-137: The three calls to openrouter.req_w_addons in main() (the
call that sets raw_source_list, the call that sets source_list, and the later
call around lines 151-154) are unprotected and will surface tracebacks on
failure; wrap each openrouter.req_w_addons(...) invocation in a try/except that
catches Exception, writes a concise error message including the exception to
stderr (use the same error style as the existing checks), and exits with
sys.exit(1). Ensure you reference the call sites by their assignment targets
(raw_source_list, source_list, and the third call) and keep the existing
empty-string checks only for successful responses; if an exception occurs, log
the exception and exit immediately.

In `@scripts/openrouter.py`:
- Around line 87-90: The retry log currently prints to stdout; change it to
write to stderr so diagnostic output doesn't pollute assistant responses: in the
loop that calls req_chat(payload) (the block referencing resp and model),
replace the print(f"{model} failed, retrying with another model...") call with
an stderr-write (e.g., using sys.stderr.write or print(..., file=sys.stderr)) so
all fallback/diagnostic messages from the model fallback path go to stderr
instead of stdout.

---

Duplicate comments:
In `@scripts/ingest_claims.py`:
- Around line 82-83: The loop over claim_example keys currently does
claim_doc[key] = str(claim[key]) which will raise a KeyError if the model output
`claim` is missing a schema key; change this to safely access the value (e.g.,
use claim.get(key) with a sensible default like "" or None) and convert that
result to string before assigning to `claim_doc[key]`, optionally logging or
collecting missing keys for diagnostics; update the loop that iterates over
`claim_example.keys()` and touches `claim_doc`, and keep all references to
`claim`, `claim_example`, `claim_doc`, and `key`.
- Around line 60-62: The dedupe loop can raise KeyError when accessing
claim["uri"] or claim["title"]; change the comparisons in the for all_claim_docs
loop to use safe access (e.g., claim.get("uri"), claim_doc.get("uri"),
claim.get("title"), claim_doc.get("title")) and only compare when the keys are
present (or compare None-safely), so the block that sets new_claim = False uses
these .get calls instead of direct indexing; keep the same variables
(all_claim_docs, claim_doc, claim, new_claim) and preserve the existing control
flow.

---

Nitpick comments:
In `@scripts/openrouter.py`:
- Around line 71-82: The payload currently appends the system skill after the
user message which can break OpenRouter's OpenAI-compatible ordering; modify the
payload construction so the system message (when skill != "") is inserted before
the user message in payload["messages"]—locate the payload dict creation and the
conditional that references payload, "messages", skill, and content and ensure
the messages array starts with {"role":"system","content":skill} (if provided)
followed by {"role":"user","content":content}.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2dca088a-3bd6-43ee-9640-97fdfcd1ce21

📥 Commits

Reviewing files that changed from the base of the PR and between 0e13a1f and a69be54.

📒 Files selected for processing (5)
  • scripts/helper.py
  • scripts/ingest_claims.py
  • scripts/ingest_sources.py
  • scripts/openrouter.py
  • scripts/source_scraper.sh

Comment thread scripts/helper.py
Comment on lines +29 to +35
def get_sources(base_url: str, api_key: str):
endpoint = f"{base_url}/api/v1/sources"
headers = {"X-API-Key": api_key}
response = requests.get(endpoint, headers=headers, timeout=90)
if response.status_code != 200:
print(f"Error: failed to get all sources: {response.status_code}")
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Can requests.get()raise exceptions such as timeouts and connection errors, and isrequests.RequestException the common base class callers should catch?

💡 Result:

Yes, requests.get can raise various exceptions, including those for timeouts and connection errors [1][2]. All exceptions explicitly raised by the requests library inherit from requests.exceptions.RequestException [1][2]. Consequently, catching requests.RequestException is the recommended way to handle all errors originating from the requests library in a single block [3]. Common exceptions you may encounter include: requests.exceptions.Timeout: Raised when a request times out [1][2]. This can be further categorized into ConnectTimeout and ReadTimeout [4][5]. requests.exceptions.ConnectionError: Raised in the event of a network problem, such as a DNS failure or a refused connection [1][2]. requests.exceptions.HTTPError: Raised if a request returns an unsuccessful HTTP status code, specifically when using the Response.raise_for_status method [1][2]. requests.exceptions.TooManyRedirects: Raised if a request exceeds the configured number of maximum redirections [1][2]. While catching RequestException is useful for broad error handling, it is common practice to catch specific exceptions first to implement tailored logic, such as retrying a request after a timeout or logging a specific connection failure [6][3].

Citations:


🏁 Script executed:

# First, check if the file exists and inspect the relevant lines
if [ -f "scripts/helper.py" ]; then
  echo "=== File exists, checking lines 29-35 ==="
  sed -n '29,35p' scripts/helper.py
  echo ""
  echo "=== Full function context (around line 29) ==="
  sed -n '25,45p' scripts/helper.py
else
  echo "File not found at scripts/helper.py"
  echo "Searching for helper.py in the repo..."
  find . -name "helper.py" -type f 2>/dev/null
fi

Repository: SatyaLens/sources

Length of output: 1291


🏁 Script executed:

# Check the imports at the top of helper.py
echo "=== Imports at top of file ==="
head -20 scripts/helper.py

echo ""
echo "=== All requests library calls in helper.py ==="
grep -n "requests\." scripts/helper.py

Repository: SatyaLens/sources

Length of output: 663


Add exception handling for transport failures in get_sources().

Line 32 can raise requests.Timeout, requests.ConnectionError, or other exceptions on network issues, which aborts ingestion instead of returning None like the HTTP error case. Wrap the requests.get() call in a try-except block for requests.RequestException.

Suggested fix
 def get_sources(base_url: str, api_key: str):
     endpoint = f"{base_url}/api/v1/sources"
     headers = {"X-API-Key": api_key}
-    response = requests.get(endpoint, headers=headers, timeout=90)
+    try:
+        response = requests.get(endpoint, headers=headers, timeout=90)
+    except requests.RequestException as e:
+        print(f"Error: failed to get all sources: {e}", file=sys.stderr)
+        return None
     if response.status_code != 200:
-        print(f"Error: failed to get all sources: {response.status_code}")
+        print(f"Error: failed to get all sources: {response.status_code}", file=sys.stderr)
         return None
     return response.json()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/helper.py` around lines 29 - 35, The get_sources function currently
calls requests.get without catching transport exceptions; wrap the
requests.get(...) call in a try/except that catches requests.RequestException
(covers Timeout, ConnectionError, etc.), print or log the exception details
(e.g., include the exception message) and return None on error so network
failures behave like non-200 HTTP responses; update get_sources to use this
try/except around the request and keep the existing status_code check afterward.

Comment thread scripts/ingest_claims.py
Comment on lines +32 to +35
def update_claim_fields(srcDigest: str, claim):
claim["sourceUriDigest"] = srcDigest
claim["summary"] = claim["description"]
claim["uri"] = claim["link"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard required keys before mutating claim objects.

Line 34 and Line 35 assume description and link always exist. If the model returns a partial object, ingestion crashes with KeyError before per-source error handling can continue.

Proposed fix
 def update_claim_fields(srcDigest: str, claim):
+    if not isinstance(claim, dict):
+        return False
+    if "description" not in claim or "link" not in claim:
+        return False
     claim["sourceUriDigest"] = srcDigest
     claim["summary"] = claim["description"]
     claim["uri"] = claim["link"]
+    return True
@@
     for claim in new_claims:
-        update_claim_fields(srcUriDigest, claim)
+        if not update_claim_fields(srcUriDigest, claim):
+            continue
         new_claim = True

Also applies to: 57-59

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` around lines 32 - 35, update_claim_fields currently
assumes claim["description"] and claim["link"] exist and will raise KeyError for
partial model outputs; change it to safely read those keys (use
claim.get("description") and claim.get("link") or check key presence) and only
set claim["summary"] and claim["uri"] when values are present (or set a safe
default like None/empty string). Apply the same guarded access pattern to the
other place where claims are mutated (the block that also assigns
"sourceUriDigest", "summary", and "uri") so ingestion no longer crashes on
missing fields.

Comment thread scripts/ingest_claims.py
Comment on lines +149 to +155
try:
filtered_claims_list = json.loads(helper.cleanup_json_str(filtered_claims))
except Exception as e:
print(f"Error: failed to cleanup and unmarshal claims json string {filtered_claims}: {e}", file=sys.stderr)
continue

if len(filtered_claims_list) == 0:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate parsed JSON shape before downstream processing.

After Line 150, code assumes a list of dict claims. A valid non-list JSON payload (or mixed list) can trigger runtime failures later in get_new_claims.

Proposed fix
         try:
             filtered_claims_list = json.loads(helper.cleanup_json_str(filtered_claims))
         except Exception as e:
             print(f"Error: failed to cleanup and unmarshal claims json string {filtered_claims}: {e}", file=sys.stderr)
             continue
 
+        if not isinstance(filtered_claims_list, list):
+            print(f"Error: expected claims array for {source['name']}", file=sys.stderr)
+            continue
+        filtered_claims_list = [c for c in filtered_claims_list if isinstance(c, dict)]
+
         if len(filtered_claims_list) == 0:
             print(f"Error: no claims found for {source['name']} in: {filtered_claims}", file=sys.stderr)
             continue
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 151-151: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` around lines 149 - 155, After parsing
filtered_claims into filtered_claims_list (from helper.cleanup_json_str),
validate that the result is a list of dicts before proceeding to get_new_claims:
check isinstance(filtered_claims_list, list) and that every item is a dict (or,
if you want to accept a single dict, coerce it into a single-item list),
otherwise log an error (including the offending payload and parse result) to
stderr or process logger and continue; update the code around
filtered_claims_list and the downstream call to get_new_claims to only run when
this shape validation passes.

Comment thread scripts/ingest_claims.py
Comment on lines +159 to +161
# list of new claims to be ingested
new_unique_claims = get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"])
create_claim_docs(new_unique_claims, source["name"])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deduplication baseline is stale across sources in the same run.

all_claim_docs is loaded once and never updated, so claims created from earlier sources in this run are not considered when processing later sources.

Proposed fix
         new_unique_claims = get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"])        
         create_claim_docs(new_unique_claims, source["name"])
+        all_claim_docs.extend(new_unique_claims)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` around lines 159 - 161, The deduplication baseline
all_claim_docs is loaded once and not updated, so new claims created by
create_claim_docs aren't visible to subsequent sources; after calling
create_claim_docs(new_unique_claims, source["name"]) update the baseline (either
refresh all_claim_docs from the store or extend the in-memory all_claim_docs
with the newly created claim documents/identifiers) so that subsequent calls to
get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"])
consider claims ingested earlier in the same run.

Comment thread scripts/ingest_sources.py
Comment on lines +54 to +64
doc = yaml.safe_load(doc_str)
except Exception as e:
print(f"Warning: failed to parse source_doc; not adding it to the source list it. Error: {e}", file=sys.stderr)
continue
if doc is None:
print(f"Warning: skipping the following yaml string as it failed to load: {doc_str}", file=sys.stderr)
continue
duplicate = False
for src in existing:
if doc.get('name') == src.get('name') or doc.get('uri') == src.get('uri'):
duplicate = True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard non-dictionary YAML payloads before .get(...) access.

Line 63 and Line 79 assume yaml.safe_load(...) returns a dict. LLM output can be a list/string, which will raise AttributeError and stop ingestion.

Proposed fix
@@
-        if doc is None:
-            print(f"Warning: skipping the following yaml string as it failed to load: {doc_str}", file=sys.stderr)
+        if not isinstance(doc, dict):
+            print(f"Warning: skipping non-object yaml document: {doc_str}", file=sys.stderr)
             continue
@@
-        filename = parsed.get('name')
+        if not isinstance(parsed, dict):
+            print(f"Warning: skipping non-object yaml document: {doc_str}", file=sys.stderr)
+            continue
+
+        filename = parsed.get('name')

Also applies to: 74-80

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 55-55: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_sources.py` around lines 54 - 64, The YAML loader may return
non-dict types, so before calling .get(...) in ingest_sources.py you should
guard that the loaded `doc` is a dict (e.g., `if not isinstance(doc, dict): warn
and continue`) and likewise ensure each `src` in `existing` is a dict before
using `src.get(...)`; update the checks around `doc = yaml.safe_load(doc_str)`
and the duplicate-detection loop (the `doc.get('name') == src.get('name') or
doc.get('uri') == src.get('uri')` logic) to skip/log non-dict payloads to avoid
AttributeError during ingestion.

Comment thread scripts/ingest_sources.py
Comment on lines +123 to +137
raw_source_list = openrouter.req_w_addons(req_content, skill=md_processing_skill)
if raw_source_list == "":
print("Error: failed to get raw source list from openrouter", file=sys.stderr)
sys.exit(1)

# openrouter call to get a clean formatted list of source urls
req_content = (
"Here is a raw list of URLs of news outlets with each line containing one or more unformatted URLs:"
f"\n\n{raw_source_list}\n\n"
f"{SOURCE_CLEANUP_PROMPT}"
)
source_list = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL])
if source_list == "":
print("Error: failed to get filtered list of source urls", file=sys.stderr)
sys.exit(1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle OpenRouter call failures explicitly in main().

The three openrouter.req_w_addons(...) calls are not wrapped in error handling. If the client raises, the script exits with traceback instead of controlled stderr + non-zero exit.

Proposed fix
@@
-    raw_source_list =  openrouter.req_w_addons(req_content, skill=md_processing_skill)
+    try:
+        raw_source_list = openrouter.req_w_addons(req_content, skill=md_processing_skill)
+    except Exception as e:
+        print(f"Error: failed to get raw source list from openrouter: {e}", file=sys.stderr)
+        sys.exit(1)
@@
-    source_list = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL])
+    try:
+        source_list = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL])
+    except Exception as e:
+        print(f"Error: failed to get filtered list of source urls: {e}", file=sys.stderr)
+        sys.exit(1)
@@
-    src_docs_str = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL])
+    try:
+        src_docs_str = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL])
+    except Exception as e:
+        print(f"Error: failed to generate source docs from source urls: {e}", file=sys.stderr)
+        sys.exit(1)

Also applies to: 151-154

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_sources.py` around lines 123 - 137, The three calls to
openrouter.req_w_addons in main() (the call that sets raw_source_list, the call
that sets source_list, and the later call around lines 151-154) are unprotected
and will surface tracebacks on failure; wrap each openrouter.req_w_addons(...)
invocation in a try/except that catches Exception, writes a concise error
message including the exception to stderr (use the same error style as the
existing checks), and exits with sys.exit(1). Ensure you reference the call
sites by their assignment targets (raw_source_list, source_list, and the third
call) and keep the existing empty-string checks only for successful responses;
if an exception occurs, log the exception and exit immediately.

Comment thread scripts/openrouter.py
Comment on lines +87 to +90
resp = req_chat(payload)
if resp != "":
break
print(f"{model} failed, retrying with another model...")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Write model-fallback diagnostics to stderr.

Line 90 is the only retry log still going to stdout, which can pollute callers that pipe or capture the assistant response.

Suggested fix
-        print(f"{model} failed, retrying with another model...")
+        print(f"{model} failed, retrying with another model...", file=sys.stderr)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resp = req_chat(payload)
if resp != "":
break
print(f"{model} failed, retrying with another model...")
resp = req_chat(payload)
if resp != "":
break
print(f"{model} failed, retrying with another model...", file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/openrouter.py` around lines 87 - 90, The retry log currently prints
to stdout; change it to write to stderr so diagnostic output doesn't pollute
assistant responses: in the loop that calls req_chat(payload) (the block
referencing resp and model), replace the print(f"{model} failed, retrying with
another model...") call with an stderr-write (e.g., using sys.stderr.write or
print(..., file=sys.stderr)) so all fallback/diagnostic messages from the model
fallback path go to stderr instead of stdout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant