refactor: simplifies claim fetch logic by semmet95 · Pull Request #31 · SatyaLens/sources

semmet95 · 2026-05-18T13:51:56Z

Summary by CodeRabbit

New Features
- Added required source field domainUrlNewsData
- Added pipelines to ingest claims and to ingest sources from Markdown
Improvements
- Better retry and error handling for remote model/API calls
- New helper utilities for fetching, sanitizing, and normalizing data
- Simplified news-data fetching with centralized configuration and deduplication/validation improvements

Signed-off-by: Amit Singh <singhamitch@outlook.com>

qodo-code-review · 2026-05-18T13:52:00Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-05-18T13:52:09Z

📝 Walkthrough

Walkthrough

Adds helper utilities, modernizes OpenRouter client with retries/tools, simplifies NewsData client, introduces two CLI ingestion scripts (sources and claims), updates source OpenAPI schemas to include/require domainUrlNewsData, and switches the scraper to call the new source ingester.

Changes

Data Ingestion Infrastructure & Pipeline Refactor

Layer / File(s)	Summary
OpenAPI Schema Update `oapi.yaml`	OpenAPI schema adds optional `domainUrlNewsData` to `SourceInput`/`SourcePatchInput` and makes `domainUrlNewsData` required on the `Source` entity.
Shared Helper Module `scripts/helper.py`	New helper utilities: URL/file text readers, `get_oapi_spec()`, `get_sources()`, `cleanup_json_str()`, `clean_filepath()`, and `post_request()`.
OpenRouter Request Refactoring `scripts/openrouter.py`	Reads config from env, introduces `req_chat()` with retry behavior (transport/429/5xx) and `req_w_addons()` that attaches skills/tools and iterates models.
NewsData API Simplification `scripts/newsdata_io.py`	Loads NewsData config from env and rewrites `get_claims(src_domain_url)` to build requests internally; removed prior claim-ingest helpers.
Source Ingestion CLI `scripts/ingest_sources.py`	CLI that converts scraped Markdown into deduplicated YAML source docs using OpenRouter steps and the `SourceInput` schema example.
Claim Ingestion CLI `scripts/ingest_claims.py`	CLI that fetches claims per source via NewsData, filters/classifies with OpenRouter (falsifiable-claim skill + web search), deduplicates, normalizes, and writes claim YAML files.
Scraper integration `scripts/source_scraper.sh`	Post-scrape processor switched to call `scripts/ingest_sources.py` instead of the previous consumer.

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

SatyaLens/sources#30: Refactors and earlier work on claims-fetching and ingestion pipelines closely related to the new ingest_claims.py and helper/OpenRouter changes.

Poem

🐰 Scripts hop, fetch, and tidy neat,

helpers hum, retries drum a beat,
sources born from scraped-marked lore,
claims find homes in YAML store,
rabbits cheer pipelines complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main refactoring effort: simplifying claim fetch logic by extracting it into modular helper functions and moving OpenRouter request handling to a dedicated module.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/python-scripts

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 12

🧹 Nitpick comments (2)

scripts/ingest_sources.py (1)

101-168: 💤 Low value

main() implicitly returns None, causing SystemExit(None) which exits with code 0.

When main() completes successfully, it returns None implicitly. At line 168, raise SystemExit(main()) becomes raise SystemExit(None), which exits with code 0. This is correct behavior, but for clarity and consistency with the explicit sys.exit(1) calls for errors, consider adding an explicit return 0 at the end of main().
Proposed fix
     # Write each unique source YAML doc into the `sources/` folder
     write_source_docs(unique_src_docs, sources_dir)    
+
+    return 0

 if __name__ == "__main__":
     raise SystemExit(main())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_sources.py` around lines 101 - 168, The main() function
currently falls through and implicitly returns None, so the call raise
SystemExit(main()) becomes raise SystemExit(None); add an explicit successful
return value by appending "return 0" at the end of main() so SystemExit(main())
will raise SystemExit(0) on success; update the function defined as main() to
end with return 0 (so the invocation raise SystemExit(main()) reflects a
non-error exit code).

scripts/ingest_claims.py (1)

77-77: ⚡ Quick win

Global YAML representer modification may cause unintended side effects.

yaml.add_representer(str, quoted_str_representer) modifies the global PyYAML state, affecting all subsequent yaml.dump calls in this process. If this module is imported elsewhere or the function is called multiple times, this could have unintended consequences.

Consider using a local yaml.Dumper subclass instead.

Proposed fix using a custom Dumper class

-    # Custom representer to force double quotes around strings
-    def quoted_str_representer(dumper, data):
-        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
-    
-    yaml.add_representer(str, quoted_str_representer)
+    # Custom Dumper to force double quotes around strings
+    class QuotedDumper(yaml.SafeDumper):
+        pass
+    
+    def quoted_str_representer(dumper, data):
+        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')
+    
+    QuotedDumper.add_representer(str, quoted_str_representer)

Then update the dump call:

-            yaml.dump(claim_doc, f, default_flow_style=False, allow_unicode=True, width=float('inf'))
+            yaml.dump(claim_doc, f, Dumper=QuotedDumper, default_flow_style=False, allow_unicode=True, width=float('inf'))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` at line 77, The call yaml.add_representer(str,
quoted_str_representer) mutates global PyYAML state and should be replaced with
a local Dumper to avoid global side effects; create a custom Dumper subclass
(e.g., ClaimsYAMLDumper) that registers quoted_str_representer for str and then
pass that Dumper to yaml.dump calls in this module (replace direct
yaml.dump(...) with yaml.dump(..., Dumper=ClaimsYAMLDumper)) so the representer
is scoped locally and does not affect other imports or repeated invocations.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@oapi.yaml`:
- Around line 489-495: The SourceInput schema must match the required Source
contract: make domainUrlNewsData required on SourceInput (same as Source) so
POST /api/v1/source cannot accept payloads missing it; locate the
domainUrlNewsData property in the oapi.yaml SourceInput definition and add it to
that schema's required array (and remove/adjust any omitnil validation tag if it
conflicts), ensuring any paths that reference SourceInput/$ref are updated so
validation/POST scripts remain in sync with the Source model.

In `@scripts/helper.py`:
- Around line 14-16: The helper currently passes raw URLs into urlopen() (e.g.,
in get_text_from_url) which allows non-HTTP schemes like file:; before opening,
parse the URL (urllib.parse.urlparse) and explicitly allow only "http" and
"https" schemes, raising an exception (ValueError) for anything else, and apply
the same validation to the other urlopen call in the file (the call in the 61-70
range) so both get_text_from_url and the other URL-loading function reject
non-HTTP(s) schemes.
- Around line 69-75: The blanket except around urlopen(req, timeout=timeout) is
collapsing HTTPError responses into (-1, ""), so update the error handling to
catch urllib.error.HTTPError separately: in the try/except, add an except
HTTPError as e block that sets status = e.code and body =
e.read().decode("utf-8") (or "" if no body) and proceeds to return those values
(preserving 429/5xx), then keep a broader except Exception as e to handle other
failures and return (-1, ""); reference urlopen, HTTPError, status, body,
endpoint, req, and timeout when making the change.

In `@scripts/ingest_claims.py`:
- Line 130: Replace the equality check against None in the conditional that
tests the variable claims (the line containing "if claims == None or len(claims)
== 0:") with an identity check using "is None" so the condition becomes "if
claims is None or len(claims) == 0:"; this ensures correct None comparison for
the claims variable and avoids unintended behavior.
- Around line 113-114: Change the None check on the variable `sources` to use
identity comparison (`is None`) instead of `==`, and remove the unnecessary
f-string prefix on the error print (use a plain string for the message in the
`print(..., file=sys.stderr)` call) so the conditional reads a proper None check
and the printed message is a normal string.
- Around line 82-83: The loop over claim_example keys can raise KeyError when
claim is missing expected fields; update the assignment in the loop that
populates claim_doc so it uses claim.get(key, fallback) instead of claim[key],
converting the result to str and choosing a sensible fallback (e.g., empty
string or claim_example[key]) to preserve schema defaults; adjust the loop that
references claim_example, claim_doc, and claim accordingly.
- Around line 44-46: The open() call that reads YAML (the 'with open(yaml_file,
'r') as f:' inside the try block) should include encoding='utf-8' to match other
file operations and avoid platform-dependent encoding issues; update that call
to open the file with encoding='utf-8' so yaml.safe_load reads the file
consistently with other reads (see other file ops in this script for reference).
- Around line 60-63: The loop that checks for duplicates can raise KeyError when
existing YAML docs lack "uri" or "title"; in the for-loop over all_claim_docs
(the variables claim_doc, all_claim_docs and the new_claim flag), replace direct
dict access claim_doc["uri"] and claim_doc["title"] with safe lookups using
claim_doc.get("uri") and claim_doc.get("title") (or provide a default like None)
and compare to the incoming claim's values (claim["uri"], claim["title"]); this
avoids KeyError and ensures the equality check still sets new_claim = False and
breaks when a match is found while skipping or safely handling documents missing
those keys.

In `@scripts/ingest_sources.py`:
- Line 103: The usage message currently prints "Usage: openrouter.py TMP_MD"
which references the wrong script; update the print call that contains the
string "Usage: openrouter.py TMP_MD" so it instead references the correct script
name (e.g., "ingest_sources.py TMP_MD") so the usage output is accurate; locate
the print statement in scripts/ingest_sources.py (the call print("Usage:
openrouter.py TMP_MD", file=sys.stderr)) and replace the literal script name
accordingly.
- Line 58: The condition uses equality comparison to check for None ("if doc ==
None:"); change it to an identity check ("if doc is None:") to follow PEP 8.
Locate the occurrence of the "if doc == None" check in scripts/ingest_sources.py
(the variable named doc) and replace the equality operator with "is", and scan
for any other "== None" occurrences in the same file to update similarly.
- Line 80: The condition checking for a missing filename uses equality
comparison to None; change the check to use identity comparison (use "is None")
for the variable filename in that conditional (the if that currently reads "if
filename == None or filename.strip() == '':"), so replace the == None with "is
None" while keeping the rest of the logic intact to avoid false positives when
filename is an empty string.

In `@scripts/newsdata_io.py`:
- Around line 27-31: The code assumes response.json()["results"] exists;
instead, parse the JSON safely (catch JSONDecodeError), verify the top-level
payload is a dict and contains a "results" key whose value is a list before
returning it, and if not, log/print a helpful message including src_domain_url
and the raw payload or status and return None or an empty list; replace the
direct access response.json()["results"] with this safe-check logic around the
existing variables (response, endpoint, params, src_domain_url).

---

Nitpick comments:
In `@scripts/ingest_claims.py`:
- Line 77: The call yaml.add_representer(str, quoted_str_representer) mutates
global PyYAML state and should be replaced with a local Dumper to avoid global
side effects; create a custom Dumper subclass (e.g., ClaimsYAMLDumper) that
registers quoted_str_representer for str and then pass that Dumper to yaml.dump
calls in this module (replace direct yaml.dump(...) with yaml.dump(...,
Dumper=ClaimsYAMLDumper)) so the representer is scoped locally and does not
affect other imports or repeated invocations.

In `@scripts/ingest_sources.py`:
- Around line 101-168: The main() function currently falls through and
implicitly returns None, so the call raise SystemExit(main()) becomes raise
SystemExit(None); add an explicit successful return value by appending "return
0" at the end of main() so SystemExit(main()) will raise SystemExit(0) on
success; update the function defined as main() to end with return 0 (so the
invocation raise SystemExit(main()) reflects a non-error exit code).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 13a07b7d-38cc-4a3a-b2cc-900cf9915583

📥 Commits

Reviewing files that changed from the base of the PR and between 5c4eea1 and 0e13a1f.

📒 Files selected for processing (6)

oapi.yaml
scripts/helper.py
scripts/ingest_claims.py
scripts/ingest_sources.py
scripts/newsdata_io.py
scripts/openrouter.py

cubic-dev-ai

11 issues found across 6 files

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

Signed-off-by: Amit Singh <singhamitch@outlook.com>

coderabbitai

Actionable comments posted: 7

♻️ Duplicate comments (2)

scripts/ingest_claims.py (2)

82-83: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hard-failing when model output misses schema keys.

Line 83 directly indexes claim[key]; any missing field causes KeyError and stops writing remaining valid claims.

Proposed fix

         for key in claim_example.keys():
-            claim_doc[key] = str(claim[key])
+            claim_doc[key] = str(claim.get(key, ""))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` around lines 82 - 83, The loop over claim_example
keys currently does claim_doc[key] = str(claim[key]) which will raise a KeyError
if the model output `claim` is missing a schema key; change this to safely
access the value (e.g., use claim.get(key) with a sensible default like "" or
None) and convert that result to string before assigning to `claim_doc[key]`,
optionally logging or collecting missing keys for diagnostics; update the loop
that iterates over `claim_example.keys()` and touches `claim_doc`, and keep all
references to `claim`, `claim_example`, `claim_doc`, and `key`.

60-62: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use safe key access when comparing against stored claim docs.

Line 61 can raise KeyError if an existing YAML doc is missing uri or title, which aborts dedupe and the run.

Proposed fix

-        for claim_doc in all_claim_docs:
-            if claim["uri"] == claim_doc["uri"] or claim["title"] == claim_doc["title"]:
+        for claim_doc in all_claim_docs:
+            if claim["uri"] == claim_doc.get("uri") or claim["title"] == claim_doc.get("title"):
                 new_claim = False
                 break

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ingest_claims.py` around lines 60 - 62, The dedupe loop can raise
KeyError when accessing claim["uri"] or claim["title"]; change the comparisons
in the for all_claim_docs loop to use safe access (e.g., claim.get("uri"),
claim_doc.get("uri"), claim.get("title"), claim_doc.get("title")) and only
compare when the keys are present (or compare None-safely), so the block that
sets new_claim = False uses these .get calls instead of direct indexing; keep
the same variables (all_claim_docs, claim_doc, claim, new_claim) and preserve
the existing control flow.

🧹 Nitpick comments (1)

scripts/openrouter.py (1)

71-82: ⚡ Quick win

Place the system skill message before the user message to align with OpenRouter's OpenAI-compatible chat format expectations.

OpenRouter routes requests to various underlying model providers, many of which strictly enforce system messages appearing first in the messages array. The current implementation appends the system message after the user content, which can lead to suboptimal performance or request failures depending on the model used.

Suggested refactor

-        payload = {
-            "model": model,
-            "messages": [
-                {
-                    "role": "user",
-                    "content": content
-                },
-            ],
-        }
-
-        if skill != "":
-            payload["messages"].append({"role": "system", "content": skill})
+        messages = []
+        if skill != "":
+            messages.append({"role": "system", "content": skill})
+        messages.append(
+            {
+                "role": "user",
+                "content": content,
+            }
+        )
+        payload = {
+            "model": model,
+            "messages": messages,
+        }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/openrouter.py` around lines 71 - 82, The payload currently appends
the system skill after the user message which can break OpenRouter's
OpenAI-compatible ordering; modify the payload construction so the system
message (when skill != "") is inserted before the user message in
payload["messages"]—locate the payload dict creation and the conditional that
references payload, "messages", skill, and content and ensure the messages array
starts with {"role":"system","content":skill} (if provided) followed by
{"role":"user","content":content}.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/helper.py`:
- Around line 29-35: The get_sources function currently calls requests.get
without catching transport exceptions; wrap the requests.get(...) call in a
try/except that catches requests.RequestException (covers Timeout,
ConnectionError, etc.), print or log the exception details (e.g., include the
exception message) and return None on error so network failures behave like
non-200 HTTP responses; update get_sources to use this try/except around the
request and keep the existing status_code check afterward.

In `@scripts/ingest_claims.py`:
- Around line 32-35: update_claim_fields currently assumes claim["description"]
and claim["link"] exist and will raise KeyError for partial model outputs;
change it to safely read those keys (use claim.get("description") and
claim.get("link") or check key presence) and only set claim["summary"] and
claim["uri"] when values are present (or set a safe default like None/empty
string). Apply the same guarded access pattern to the other place where claims
are mutated (the block that also assigns "sourceUriDigest", "summary", and
"uri") so ingestion no longer crashes on missing fields.
- Around line 159-161: The deduplication baseline all_claim_docs is loaded once
and not updated, so new claims created by create_claim_docs aren't visible to
subsequent sources; after calling create_claim_docs(new_unique_claims,
source["name"]) update the baseline (either refresh all_claim_docs from the
store or extend the in-memory all_claim_docs with the newly created claim
documents/identifiers) so that subsequent calls to
get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"])
consider claims ingested earlier in the same run.
- Around line 149-155: After parsing filtered_claims into filtered_claims_list
(from helper.cleanup_json_str), validate that the result is a list of dicts
before proceeding to get_new_claims: check isinstance(filtered_claims_list,
list) and that every item is a dict (or, if you want to accept a single dict,
coerce it into a single-item list), otherwise log an error (including the
offending payload and parse result) to stderr or process logger and continue;
update the code around filtered_claims_list and the downstream call to
get_new_claims to only run when this shape validation passes.

In `@scripts/ingest_sources.py`:
- Around line 54-64: The YAML loader may return non-dict types, so before
calling .get(...) in ingest_sources.py you should guard that the loaded `doc` is
a dict (e.g., `if not isinstance(doc, dict): warn and continue`) and likewise
ensure each `src` in `existing` is a dict before using `src.get(...)`; update
the checks around `doc = yaml.safe_load(doc_str)` and the duplicate-detection
loop (the `doc.get('name') == src.get('name') or doc.get('uri') ==
src.get('uri')` logic) to skip/log non-dict payloads to avoid AttributeError
during ingestion.
- Around line 123-137: The three calls to openrouter.req_w_addons in main() (the
call that sets raw_source_list, the call that sets source_list, and the later
call around lines 151-154) are unprotected and will surface tracebacks on
failure; wrap each openrouter.req_w_addons(...) invocation in a try/except that
catches Exception, writes a concise error message including the exception to
stderr (use the same error style as the existing checks), and exits with
sys.exit(1). Ensure you reference the call sites by their assignment targets
(raw_source_list, source_list, and the third call) and keep the existing
empty-string checks only for successful responses; if an exception occurs, log
the exception and exit immediately.

In `@scripts/openrouter.py`:
- Around line 87-90: The retry log currently prints to stdout; change it to
write to stderr so diagnostic output doesn't pollute assistant responses: in the
loop that calls req_chat(payload) (the block referencing resp and model),
replace the print(f"{model} failed, retrying with another model...") call with
an stderr-write (e.g., using sys.stderr.write or print(..., file=sys.stderr)) so
all fallback/diagnostic messages from the model fallback path go to stderr
instead of stdout.

---

Duplicate comments:
In `@scripts/ingest_claims.py`:
- Around line 82-83: The loop over claim_example keys currently does
claim_doc[key] = str(claim[key]) which will raise a KeyError if the model output
`claim` is missing a schema key; change this to safely access the value (e.g.,
use claim.get(key) with a sensible default like "" or None) and convert that
result to string before assigning to `claim_doc[key]`, optionally logging or
collecting missing keys for diagnostics; update the loop that iterates over
`claim_example.keys()` and touches `claim_doc`, and keep all references to
`claim`, `claim_example`, `claim_doc`, and `key`.
- Around line 60-62: The dedupe loop can raise KeyError when accessing
claim["uri"] or claim["title"]; change the comparisons in the for all_claim_docs
loop to use safe access (e.g., claim.get("uri"), claim_doc.get("uri"),
claim.get("title"), claim_doc.get("title")) and only compare when the keys are
present (or compare None-safely), so the block that sets new_claim = False uses
these .get calls instead of direct indexing; keep the same variables
(all_claim_docs, claim_doc, claim, new_claim) and preserve the existing control
flow.

---

Nitpick comments:
In `@scripts/openrouter.py`:
- Around line 71-82: The payload currently appends the system skill after the
user message which can break OpenRouter's OpenAI-compatible ordering; modify the
payload construction so the system message (when skill != "") is inserted before
the user message in payload["messages"]—locate the payload dict creation and the
conditional that references payload, "messages", skill, and content and ensure
the messages array starts with {"role":"system","content":skill} (if provided)
followed by {"role":"user","content":content}.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2dca088a-3bd6-43ee-9640-97fdfcd1ce21

📥 Commits

Reviewing files that changed from the base of the PR and between 0e13a1f and a69be54.

📒 Files selected for processing (5)

scripts/helper.py
scripts/ingest_claims.py
scripts/ingest_sources.py
scripts/openrouter.py
scripts/source_scraper.sh

coderabbitai · 2026-05-18T16:12:47Z

+def get_sources(base_url: str, api_key: str):
+    endpoint = f"{base_url}/api/v1/sources"
+    headers = {"X-API-Key": api_key}
+    response = requests.get(endpoint, headers=headers, timeout=90)
+    if response.status_code != 200:
+        print(f"Error: failed to get all sources: {response.status_code}")
+        return None


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Can requests.get()raise exceptions such as timeouts and connection errors, and isrequests.RequestException the common base class callers should catch?

💡 Result:

Yes, requests.get can raise various exceptions, including those for timeouts and connection errors [1][2]. All exceptions explicitly raised by the requests library inherit from requests.exceptions.RequestException [1][2]. Consequently, catching requests.RequestException is the recommended way to handle all errors originating from the requests library in a single block [3]. Common exceptions you may encounter include: requests.exceptions.Timeout: Raised when a request times out [1][2]. This can be further categorized into ConnectTimeout and ReadTimeout [4][5]. requests.exceptions.ConnectionError: Raised in the event of a network problem, such as a DNS failure or a refused connection [1][2]. requests.exceptions.HTTPError: Raised if a request returns an unsuccessful HTTP status code, specifically when using the Response.raise_for_status method [1][2]. requests.exceptions.TooManyRedirects: Raised if a request exceeds the configured number of maximum redirections [1][2]. While catching RequestException is useful for broad error handling, it is common practice to catch specific exceptions first to implement tailored logic, such as retrying a request after a timeout or logging a specific connection failure [6][3].

Citations:

1: https://requests.readthedocs.io/en/master/user/quickstart/?highlight=timeout

2: https://docs.python-requests.org/en/stable/user/quickstart/

3: https://stackoverflow.com/questions/16511337/correct-way-to-try-except-using-python-requests-module

4: https://requests.readthedocs.io/en/latest/_modules/requests/exceptions/

5: https://oxylabs.io/blog/python-requests-timeout

6: https://www.pythonrequests.com/python-requests-exceptions/

🏁 Script executed:

# First, check if the file exists and inspect the relevant lines if [ -f "scripts/helper.py" ]; then echo "=== File exists, checking lines 29-35 ===" sed -n '29,35p' scripts/helper.py echo "" echo "=== Full function context (around line 29) ===" sed -n '25,45p' scripts/helper.py else echo "File not found at scripts/helper.py" echo "Searching for helper.py in the repo..." find . -name "helper.py" -type f 2>/dev/null fi

Repository: SatyaLens/sources

Length of output: 1291

🏁 Script executed:

# Check the imports at the top of helper.py echo "=== Imports at top of file ===" head -20 scripts/helper.py echo "" echo "=== All requests library calls in helper.py ===" grep -n "requests\." scripts/helper.py

Repository: SatyaLens/sources

Length of output: 663

Add exception handling for transport failures in get_sources().

Line 32 can raise requests.Timeout, requests.ConnectionError, or other exceptions on network issues, which aborts ingestion instead of returning None like the HTTP error case. Wrap the requests.get() call in a try-except block for requests.RequestException.

Suggested fix

def get_sources(base_url: str, api_key: str): endpoint = f"{base_url}/api/v1/sources" headers = {"X-API-Key": api_key} - response = requests.get(endpoint, headers=headers, timeout=90) + try: + response = requests.get(endpoint, headers=headers, timeout=90) + except requests.RequestException as e: + print(f"Error: failed to get all sources: {e}", file=sys.stderr) + return None if response.status_code != 200: - print(f"Error: failed to get all sources: {response.status_code}") + print(f"Error: failed to get all sources: {response.status_code}", file=sys.stderr) return None return response.json()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/helper.py` around lines 29 - 35, The get_sources function currently calls requests.get without catching transport exceptions; wrap the requests.get(...) call in a try/except that catches requests.RequestException (covers Timeout, ConnectionError, etc.), print or log the exception details (e.g., include the exception message) and return None on error so network failures behave like non-200 HTTP responses; update get_sources to use this try/except around the request and keep the existing status_code check afterward.

coderabbitai · 2026-05-18T16:12:47Z

+def update_claim_fields(srcDigest: str, claim):
+    claim["sourceUriDigest"] = srcDigest
+    claim["summary"] = claim["description"]
+    claim["uri"] = claim["link"]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard required keys before mutating claim objects.

Line 34 and Line 35 assume description and link always exist. If the model returns a partial object, ingestion crashes with KeyError before per-source error handling can continue.

Proposed fix

def update_claim_fields(srcDigest: str, claim): + if not isinstance(claim, dict): + return False + if "description" not in claim or "link" not in claim: + return False claim["sourceUriDigest"] = srcDigest claim["summary"] = claim["description"] claim["uri"] = claim["link"] + return True @@ for claim in new_claims: - update_claim_fields(srcUriDigest, claim) + if not update_claim_fields(srcUriDigest, claim): + continue new_claim = True

Also applies to: 57-59

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/ingest_claims.py` around lines 32 - 35, update_claim_fields currently assumes claim["description"] and claim["link"] exist and will raise KeyError for partial model outputs; change it to safely read those keys (use claim.get("description") and claim.get("link") or check key presence) and only set claim["summary"] and claim["uri"] when values are present (or set a safe default like None/empty string). Apply the same guarded access pattern to the other place where claims are mutated (the block that also assigns "sourceUriDigest", "summary", and "uri") so ingestion no longer crashes on missing fields.

coderabbitai · 2026-05-18T16:12:47Z

+        try:
+            filtered_claims_list = json.loads(helper.cleanup_json_str(filtered_claims))
+        except Exception as e:
+            print(f"Error: failed to cleanup and unmarshal claims json string {filtered_claims}: {e}", file=sys.stderr)
+            continue
+
+        if len(filtered_claims_list) == 0:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate parsed JSON shape before downstream processing.

After Line 150, code assumes a list of dict claims. A valid non-list JSON payload (or mixed list) can trigger runtime failures later in get_new_claims.

Proposed fix

try: filtered_claims_list = json.loads(helper.cleanup_json_str(filtered_claims)) except Exception as e: print(f"Error: failed to cleanup and unmarshal claims json string {filtered_claims}: {e}", file=sys.stderr) continue + if not isinstance(filtered_claims_list, list): + print(f"Error: expected claims array for {source['name']}", file=sys.stderr) + continue + filtered_claims_list = [c for c in filtered_claims_list if isinstance(c, dict)] + if len(filtered_claims_list) == 0: print(f"Error: no claims found for {source['name']} in: {filtered_claims}", file=sys.stderr) continue

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 151-151: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/ingest_claims.py` around lines 149 - 155, After parsing filtered_claims into filtered_claims_list (from helper.cleanup_json_str), validate that the result is a list of dicts before proceeding to get_new_claims: check isinstance(filtered_claims_list, list) and that every item is a dict (or, if you want to accept a single dict, coerce it into a single-item list), otherwise log an error (including the offending payload and parse result) to stderr or process logger and continue; update the code around filtered_claims_list and the downstream call to get_new_claims to only run when this shape validation passes.

coderabbitai · 2026-05-18T16:12:47Z

+        # list of new claims to be ingested
+        new_unique_claims = get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"])        
+        create_claim_docs(new_unique_claims, source["name"])


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deduplication baseline is stale across sources in the same run.

all_claim_docs is loaded once and never updated, so claims created from earlier sources in this run are not considered when processing later sources.

Proposed fix

new_unique_claims = get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"]) create_claim_docs(new_unique_claims, source["name"]) + all_claim_docs.extend(new_unique_claims)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/ingest_claims.py` around lines 159 - 161, The deduplication baseline all_claim_docs is loaded once and not updated, so new claims created by create_claim_docs aren't visible to subsequent sources; after calling create_claim_docs(new_unique_claims, source["name"]) update the baseline (either refresh all_claim_docs from the store or extend the in-memory all_claim_docs with the newly created claim documents/identifiers) so that subsequent calls to get_new_claims(all_claim_docs, filtered_claims_list, source["uriDigest"]) consider claims ingested earlier in the same run.

coderabbitai · 2026-05-18T16:12:47Z

+            doc = yaml.safe_load(doc_str)
+        except Exception as e:
+            print(f"Warning: failed to parse source_doc; not adding it to the source list it. Error: {e}", file=sys.stderr)
+            continue
+        if doc is None:
+            print(f"Warning: skipping the following yaml string as it failed to load: {doc_str}", file=sys.stderr)
+            continue
+        duplicate = False
+        for src in existing:
+            if doc.get('name') == src.get('name') or doc.get('uri') == src.get('uri'):
+                duplicate = True


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard non-dictionary YAML payloads before .get(...) access.

Line 63 and Line 79 assume yaml.safe_load(...) returns a dict. LLM output can be a list/string, which will raise AttributeError and stop ingestion.

Proposed fix

@@ - if doc is None: - print(f"Warning: skipping the following yaml string as it failed to load: {doc_str}", file=sys.stderr) + if not isinstance(doc, dict): + print(f"Warning: skipping non-object yaml document: {doc_str}", file=sys.stderr) continue @@ - filename = parsed.get('name') + if not isinstance(parsed, dict): + print(f"Warning: skipping non-object yaml document: {doc_str}", file=sys.stderr) + continue + + filename = parsed.get('name')

Also applies to: 74-80

🧰 Tools

🪛 Ruff (0.15.12)

[warning] 55-55: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/ingest_sources.py` around lines 54 - 64, The YAML loader may return non-dict types, so before calling .get(...) in ingest_sources.py you should guard that the loaded `doc` is a dict (e.g., `if not isinstance(doc, dict): warn and continue`) and likewise ensure each `src` in `existing` is a dict before using `src.get(...)`; update the checks around `doc = yaml.safe_load(doc_str)` and the duplicate-detection loop (the `doc.get('name') == src.get('name') or doc.get('uri') == src.get('uri')` logic) to skip/log non-dict payloads to avoid AttributeError during ingestion.

coderabbitai · 2026-05-18T16:12:47Z

+    raw_source_list =  openrouter.req_w_addons(req_content, skill=md_processing_skill)
+    if raw_source_list == "":
+        print("Error: failed to get raw source list from openrouter", file=sys.stderr)
+        sys.exit(1)
+
+    # openrouter call to get a clean formatted list of source urls
+    req_content = (
+        "Here is a raw list of URLs of news outlets with each line containing one or more unformatted URLs:"
+        f"\n\n{raw_source_list}\n\n"
+        f"{SOURCE_CLEANUP_PROMPT}"
+    )
+    source_list = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL])
+    if source_list == "":
+        print("Error: failed to get filtered list of source urls", file=sys.stderr)
+        sys.exit(1)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle OpenRouter call failures explicitly in main().

The three openrouter.req_w_addons(...) calls are not wrapped in error handling. If the client raises, the script exits with traceback instead of controlled stderr + non-zero exit.

Proposed fix

@@ - raw_source_list = openrouter.req_w_addons(req_content, skill=md_processing_skill) + try: + raw_source_list = openrouter.req_w_addons(req_content, skill=md_processing_skill) + except Exception as e: + print(f"Error: failed to get raw source list from openrouter: {e}", file=sys.stderr) + sys.exit(1) @@ - source_list = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL]) + try: + source_list = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL]) + except Exception as e: + print(f"Error: failed to get filtered list of source urls: {e}", file=sys.stderr) + sys.exit(1) @@ - src_docs_str = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL]) + try: + src_docs_str = openrouter.req_w_addons(req_content, tools=[openrouter.WEB_SEARCH_TOOL]) + except Exception as e: + print(f"Error: failed to generate source docs from source urls: {e}", file=sys.stderr) + sys.exit(1)

Also applies to: 151-154

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/ingest_sources.py` around lines 123 - 137, The three calls to openrouter.req_w_addons in main() (the call that sets raw_source_list, the call that sets source_list, and the later call around lines 151-154) are unprotected and will surface tracebacks on failure; wrap each openrouter.req_w_addons(...) invocation in a try/except that catches Exception, writes a concise error message including the exception to stderr (use the same error style as the existing checks), and exits with sys.exit(1). Ensure you reference the call sites by their assignment targets (raw_source_list, source_list, and the third call) and keep the existing empty-string checks only for successful responses; if an exception occurs, log the exception and exit immediately.

coderabbitai · 2026-05-18T16:12:47Z

+        resp = req_chat(payload)
+        if resp != "":
+            break
+        print(f"{model} failed, retrying with another model...")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Write model-fallback diagnostics to stderr.

Line 90 is the only retry log still going to stdout, which can pollute callers that pipe or capture the assistant response.

Suggested fix

- print(f"{model} failed, retrying with another model...") + print(f"{model} failed, retrying with another model...", file=sys.stderr)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

resp = req_chat(payload)

if resp != "":

break

print(f"{model} failed, retrying with another model...")

resp = req_chat(payload)

if resp != "":

break

print(f"{model} failed, retrying with another model...", file=sys.stderr)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/openrouter.py` around lines 87 - 90, The retry log currently prints to stdout; change it to write to stderr so diagnostic output doesn't pollute assistant responses: in the loop that calls req_chat(payload) (the block referencing resp and model), replace the print(f"{model} failed, retrying with another model...") call with an stderr-write (e.g., using sys.stderr.write or print(..., file=sys.stderr)) so all fallback/diagnostic messages from the model fallback path go to stderr instead of stdout.

semmet95 added 2 commits May 18, 2026 16:59

refactor: simplifies source ingestion logic

7428bd8

Signed-off-by: Amit Singh <singhamitch@outlook.com>

refactor: simplifies claim ingestion logic

0e13a1f

Signed-off-by: Amit Singh <singhamitch@outlook.com>

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 18, 2026

View reviewed changes

refactor: minor improvements

a69be54

Signed-off-by: Amit Singh <singhamitch@outlook.com>

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Conversation

semmet95 commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

qodo-code-review Bot commented May 18, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly Related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

semmet95 commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading