Async LangGraph with DuckDuckGo fact checking and config cleanup #124
base: main
Conversation
…ython >= 3.12 (fix): removed incorrect pinecone client init in get_rag.py (fix): centralize groq model change from llm_config.py
…arch Tool (fix): Implement new fact check subgraph (enhancement): sentiment analysis and fact check run in parallel: clean_text->extract_claims->plan_searches->execute_searches->verify_facts
…odes i.e. sentiment node and fact check node (chore): /process route awaits langgraph build compile
- Updated chunk_rag_data.py to support both new (claim, status) and old (original_claim, verdict) key formats from the fact-checker. - Added logic to correctly parse perspective whether it is a Pydantic model or a dict. - Implemented skipping of malformed facts instead of raising ValueError. - Compatibility with the new parallel DuckDuckGo fact-checking workflow.
- 'PerspectiveOutput' Pydantic model to handle reasoning as a list for claims
📝 Walkthrough

Introduces a configurable LLM_MODEL via environment variable, replacing hardcoded Groq model strings across multiple modules. Adds a multi-step fact-checking pipeline with claim extraction, search planning, and verification. Refactors the LangGraph workflow to run sentiment and fact-checking in parallel and converts the pipeline to async.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
actor State as LangGraph State
participant PA as parallel_analysis
participant SA as Sentiment Analysis
participant FC as Fact-Check Pipeline
participant EX as extract_claims_node
participant PS as plan_searches_node
participant ES as execute_searches_node
participant VF as verify_facts_node
State->>PA: run_parallel_analysis(state)
par Parallel Execution
PA->>SA: run_sentiment_sdk (async)
SA-->>PA: sentiment result
and
PA->>FC: run_fact_check_pipeline
FC->>EX: extract_claims_node
EX-->>FC: claims
FC->>PS: plan_searches_node
PS-->>FC: search_queries
FC->>ES: execute_searches_node
ES-->>FC: search_results
FC->>VF: verify_facts_node
VF-->>FC: facts, fact_check_done
end
PA->>State: return aggregated state<br/>(sentiment, claims, facts, etc.)
```
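To make the fan-out above concrete, here is a minimal, self-contained sketch of the parallel step. The function names mirror the diagram (run_sentiment_sdk, run_fact_check_pipeline), but the bodies are placeholder stubs, not the PR's actual implementations.

```python
import asyncio

def run_sentiment_sdk(text: str) -> dict:
    # Stub standing in for the synchronous Groq SDK call.
    return {"sentiment": "neutral"}

async def run_fact_check_pipeline(state: dict) -> dict:
    # Stub standing in for the async extract -> plan -> execute -> verify chain.
    return {"claims": [], "search_queries": [], "search_results": [], "facts": []}

async def run_parallel_analysis(state: dict) -> dict:
    # Run the sync SDK call in a worker thread and the async pipeline natively,
    # then merge both results back into the LangGraph state.
    sentiment_result, fact_check_result = await asyncio.gather(
        asyncio.to_thread(run_sentiment_sdk, state.get("cleaned_text", "")),
        run_fact_check_pipeline(state),
    )
    return {
        **state,
        "sentiment": sentiment_result.get("sentiment"),
        "claims": fact_check_result.get("claims", []),
        "facts": fact_check_result.get("facts", []),
        "status": "success",
    }

print(asyncio.run(run_parallel_analysis({"cleaned_text": "example article"})))
```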
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/modules/facts_check/llm_processing.py (1)
104-155: Fix `parsed` unbound usage on JSON parse failures (will crash).

On Lines 146–149, if `json.loads(content)` fails, `parsed` is never assigned but is still appended; this will raise and short-circuit the whole run. Also, returning `"claim": claim` (Line 152) only reflects the last processed item.

Proposed fix

```diff
@@
-            # Try parsing the JSON response
-            try:
-                parsed = json.loads(content)
-            except Exception as parse_err:
-                logger.error(f"LLM JSON parse error: {parse_err}")
-
-            results_list.append(parsed)
+            # Try parsing the JSON response
+            try:
+                parsed = json.loads(content)
+            except Exception as parse_err:
+                logger.error(f"LLM JSON parse error: {parse_err}")
+                parsed = {
+                    "verdict": "Unknown",
+                    "explanation": "Failed to parse verifier output as JSON.",
+                    "original_claim": claim,
+                    "source_link": source,
+                    "raw_output": content,
+                }
+
+            results_list.append(parsed)
@@
-    return {
-        "claim": claim,
-        "verifications": results_list,
-        "status": "success",
-    }
+    return {"verifications": results_list, "status": "success"}
```
🤖 Fix all issues with AI agents
In @backend/app/modules/fact_check_tool.py:
- Around line 126-129: The loop over results assumes item["claim_id"] is present
and an int; add defensive checks in the for item in results loop to (1) verify
"claim_id" in item, (2) coerce or validate it as an integer (e.g., try
int(item["claim_id"]) and handle ValueError/TypeError), (3) ensure the resulting
index is within range 0 <= c_id < len(claims), and (4) skip or log malformed
entries instead of using them when building context; update references to
item["claim_id"], c_id, claims, and context accordingly so missing/invalid
claim_id values cannot raise KeyError/TypeError or index errors.
In @backend/app/modules/vector_store/chunk_rag_data.py:
- Around line 47-48: The code calls generate_id(data["cleaned_text"]) without
validating cleaned_text which can raise ValueError; before calling generate_id
in chunk_rag_data.py, validate that data["cleaned_text"] is a non-empty string
(or fallback to a safe default id) and only then assign article_id, or wrap the
generate_id call in a try/except that catches ValueError and handles it the same
way as the existing perspective/facts error handling (e.g., log and
continue/skip); reference the article_id assignment and generate_id invocation
so you can add the guard or exception handling there.
- Around line 52-57: The attribute-check order for converting perspective_data
to p_data should prefer Pydantic v2: check for perspective_data.model_dump()
first, then fallback to perspective_data.dict(), and finally handle plain dicts;
update the conditional in chunk_rag_data.py so it calls model_dump() before
dict() (retain the isinstance(perspective_data, dict) branch) and assign the
result to p_data.
🧹 Nitpick comments (9)
backend/app/llm_config.py (1)
1-4: Good centralization pattern; consider adding optional enhancements.

The centralized LLM model configuration is a solid improvement that establishes a single source of truth for model selection across the codebase. The default model "llama-3.3-70b-versatile" is valid and available on Groq as of 2025. The environment variable pattern with a sensible default is appropriate.
Consider these optional enhancements to improve maintainability:
- Add a module-level docstring explaining the configuration's purpose and usage
- Add validation to ensure the model name is valid for the Groq API (e.g., a list of supported models)
- Consider logging which model is being used at startup for easier debugging
📝 Optional: Add module docstring
+""" +llm_config.py +------------- +Centralized configuration for LLM model selection. + +This module provides a single source of truth for the LLM model used +across the application. The model can be configured via the LLM_MODEL +environment variable. + +Environment Variables: + LLM_MODEL (str): The Groq model identifier to use. + Defaults to "llama-3.3-70b-versatile". +""" + import os # Default to a stable model LLM_MODEL = os.getenv("LLM_MODEL", "llama-3.3-70b-versatile")backend/app/modules/pipeline.py (1)
41-41: Unused import: `asyncio` is not used in this module.

The `asyncio` import was added but isn't referenced anywhere in the file. The `async`/`await` syntax doesn't require an explicit import of the module.

🧹 Remove unused import

```diff
 from app.logging.logging_config import setup_logger
 import json
-import asyncio
```

backend/app/modules/vector_store/chunk_rag_data.py (1)
108-110: Swallowing exceptions may hide upstream issues.

Returning an empty list on any exception could mask problems in the data pipeline. The current logging via `logger.exception` is good, but consider whether certain exceptions (e.g., `ValueError` for missing required fields like `cleaned_text` or `perspective`) should propagate to alert callers of invalid input.

backend/app/modules/langgraph_nodes/sentiment.py (2)
73-84: Partial failure discards successful results.

If sentiment analysis fails but fact-checking succeeds (or vice versa), the entire operation returns an error, discarding potentially useful results from the successful task. Consider returning partial results with a warning status for better resilience.

♻️ Consider partial success handling

```diff
-    if sentiment_result.get("status") == "error":
-        return {
-            "status": "error",
-            "error_from": sentiment_result.get("error_from", "sentiment_analysis"),
-            "message": sentiment_result.get("message", "Unknown error")
-        }
-    if fact_check_result.get("status") == "error":
-        return {
-            "status": "error",
-            "error_from": fact_check_result.get("error_from", "fact_checking"),
-            "message": fact_check_result.get("message", "Unknown error")
-        }
+    # Check for errors but still return partial results
+    has_error = False
+    error_sources = []
+
+    if sentiment_result.get("status") == "error":
+        has_error = True
+        error_sources.append("sentiment_analysis")
+        logger.warning(f"Sentiment analysis failed: {sentiment_result.get('message')}")
+
+    if fact_check_result.get("status") == "error":
+        has_error = True
+        error_sources.append("fact_checking")
+        logger.warning(f"Fact checking failed: {fact_check_result.get('message')}")
 
     return {
         **state,
-        "sentiment": sentiment_result.get("sentiment"),
+        "sentiment": sentiment_result.get("sentiment", "unknown"),
         "claims": fact_check_result.get("claims", []),
         "search_queries": fact_check_result.get("search_queries", []),
         "search_results": fact_check_result.get("search_results", []),
         "facts": fact_check_result.get("facts", []),
-        "status": "success"
+        "status": "partial_error" if has_error else "success",
+        "error_sources": error_sources if has_error else None
     }
```
60-60: Minor: Redundant exception object in `logging.exception`.

`logging.exception` automatically includes exception info; you don't need to pass `e` explicitly.

🧹 Fix

```diff
-        logger.exception(f"Error in fact_check_pipeline: {e}")
+        logger.exception("Error in fact_check_pipeline")
```

backend/app/modules/fact_check_tool.py (2)
109-110: Unused exception variable `e`.

The exception is caught but never used. Either log it or remove the variable binding.

🧹 Log the error or remove unused variable

```diff
-        except Exception as e:
-            return {"claim_id": q.get("claim_id"), "result": "Search failed"}
+        except Exception:
+            logger.warning(f"Search failed for claim {q.get('claim_id')}")
+            return {"claim_id": q.get("claim_id"), "result": "Search failed"}
```
48-48: Use `logging.exception` instead of `logging.error` to capture stack traces.

Per static analysis, `logging.exception` automatically includes the traceback, which is valuable for debugging LLM/network failures.

♻️ Replace logger.error with logger.exception

```diff
-        logger.error(f"Error extraction claims: {e}")
+        logger.exception("Error extracting claims")
```

```diff
-        logger.error(f"Failed to plan searches: {e}")
+        logger.exception("Failed to plan searches")
```

```diff
-        logger.error(f"Verification failed: {e}")
+        logger.exception("Verification failed")
```

Also applies to: 91-91, 165-165
backend/app/modules/langgraph_builder.py (2)
47-57: Inconsistent type annotation style.

Line 49 uses the Python 3.9+ built-in `list[dict]` syntax, while lines 55-57 use `typing.List`. For consistency, use one style throughout.

🧹 Standardize to built-in generics (Python 3.9+)

```diff
-from typing import List, Any
+from typing import Any
 from langgraph.graph import StateGraph
 from typing_extensions import TypedDict
-
 ...
 class MyState(TypedDict):
     cleaned_text: str
     facts: list[dict]
     sentiment: str
     perspective: str
     score: int
     retries: int
     status: str
-    claims: List[str]
-    search_queries: List[Any]
-    search_results: List[Any]
+    claims: list[str]
+    search_queries: list[Any]
+    search_results: list[Any]
```
87-100: Complex nested conditional logic is correct but hard to read.

The branching logic handles error → retry threshold → score threshold correctly. Consider extracting it to a named function for clarity.

♻️ Extract to named function for readability

```diff
+def _judge_routing(state):
+    if state.get("status") == "error":
+        return "error_handler"
+    if state.get("score", 0) < 70:
+        if state.get("retries", 0) >= 3:
+            return "store_and_send"
+        return "generate_perspective"
+    return "store_and_send"
+
 ...
 graph.add_conditional_edges(
     "judge_perspective",
-    lambda state: (
-        "error_handler"
-        if state.get("status") == "error"
-        else (
-            "store_and_send"
-            if state.get("retries", 0) >= 3
-            else "generate_perspective"
-        )
-        if state.get("score", 0) < 70
-        else "store_and_send"
-    ),
+    _judge_routing,
 )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
README.md
backend/app/llm_config.py
backend/app/modules/bias_detection/check_bias.py
backend/app/modules/chat/get_rag_data.py
backend/app/modules/chat/llm_processing.py
backend/app/modules/fact_check_tool.py
backend/app/modules/facts_check/llm_processing.py
backend/app/modules/langgraph_builder.py
backend/app/modules/langgraph_nodes/generate_perspective.py
backend/app/modules/langgraph_nodes/judge.py
backend/app/modules/langgraph_nodes/sentiment.py
backend/app/modules/pipeline.py
backend/app/modules/vector_store/chunk_rag_data.py
backend/app/routes/routes.py
backend/pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (7)
backend/app/modules/langgraph_nodes/judge.py (1)
backend/app/logging/logging_config.py (1)
setup_logger(4-39)
backend/app/modules/langgraph_nodes/generate_perspective.py (1)
backend/app/logging/logging_config.py (1)
setup_logger(4-39)
backend/app/modules/vector_store/chunk_rag_data.py (1)
backend/app/utils/generate_chunk_id.py (1)
generate_id(29-33)
backend/app/modules/fact_check_tool.py (1)
backend/app/logging/logging_config.py (1)
setup_logger(4-39)
backend/app/modules/langgraph_builder.py (5)
backend/app/modules/langgraph_nodes/sentiment.py (1)
run_parallel_analysis(35-94)
backend/app/modules/langgraph_nodes/generate_perspective.py (1)
generate_perspective(50-87)
backend/app/modules/langgraph_nodes/judge.py (1)
judge_perspective(34-74)
backend/app/modules/langgraph_nodes/store_and_send.py (1)
store_and_send(26-54)
backend/app/modules/langgraph_nodes/error_handler.py (1)
error_handler(21-30)
backend/app/modules/langgraph_nodes/sentiment.py (2)
backend/app/logging/logging_config.py (1)
setup_logger(4-39)
backend/app/modules/fact_check_tool.py (4)
extract_claims_node(22-49)
plan_searches_node(51-92)
execute_searches_node(94-115)
verify_facts_node(117-166)
backend/app/routes/routes.py (1)
backend/app/modules/pipeline.py (1)
run_langgraph_workflow(68-72)
🪛 Ruff (0.14.10)
backend/app/modules/vector_store/chunk_rag_data.py
41-41: Abstract raise to an inner function
(TRY301)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
106-106: Consider moving this statement to an else block
(TRY300)
109-109: Redundant exception object included in logging.exception call
(TRY401)
backend/app/modules/fact_check_tool.py
45-45: Consider moving this statement to an else block
(TRY300)
47-47: Do not catch blind exception: Exception
(BLE001)
48-48: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
88-88: Consider moving this statement to an else block
(TRY300)
90-90: Do not catch blind exception: Exception
(BLE001)
91-91: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
108-108: Consider moving this statement to an else block
(TRY300)
109-109: Do not catch blind exception: Exception
(BLE001)
109-109: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
162-162: Consider moving this statement to an else block
(TRY300)
164-164: Do not catch blind exception: Exception
(BLE001)
165-165: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
backend/app/modules/langgraph_nodes/sentiment.py
60-60: Redundant exception object included in logging.exception call
(TRY401)
🔇 Additional comments (20)
backend/app/modules/bias_detection/check_bias.py (2)
30-30: Good adoption of centralized configuration.

The import of LLM_MODEL from app.llm_config correctly adopts the centralized configuration pattern introduced in this PR. This improves maintainability by eliminating hard-coded model names.
65-65: Code correctly uses the centralized LLM_MODEL configuration.

The implementation at line 65 properly references the centralized `LLM_MODEL` variable from `llm_config.py`, which defaults to "llama-3.3-70b-versatile". This is consistent with the rest of the backend codebase and represents the intended pattern for centralized model configuration. No model switch is occurring; the code has always used the centralized variable approach. The instruction to return only a number (line 57) is clear, and the temperature setting of 0.3 supports deterministic output.
Likely an incorrect or invalid review comment.
README.md (1)
170-170: Remove HF_TOKEN from required environment variables in README.md.

The documentation lists HF_TOKEN as a required environment variable, but the backend does not actually use it. The embedding models ("all-MiniLM-L6-v2") loaded in `embed_query.py` and `embed.py` are public Hugging Face models that require no authentication token. Remove this from the required environment variables list to avoid misleading developers.

Likely an incorrect or invalid review comment.
backend/app/modules/chat/get_rag_data.py (1)
34-34: Code correctly uses the Pinecone 7.3.0+ initialization API.

This is the required pattern for Pinecone SDK v7.x+. The `api_key` keyword argument (with the value from the environment variable) is the correct and only supported way to initialize the Pinecone client in v7.x, replacing the deprecated `pinecone.init()` pattern.
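For reference, a minimal sketch of the v7-style initialization being described; the index name and environment variable shown here are illustrative, not taken from the PR.

```python
import os

from pinecone import Pinecone

# v7.x client: construct with api_key instead of the removed pinecone.init().
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Illustrative index handle; the real index name lives in the project's config.
index = pc.Index("example-index")
```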
19-19: The upgrade to newspaper4k is compatible with current usage.

The code uses only the standard Article API (`Article`, `download()`, `parse()`, and basic properties like `title`, `text`, `authors`, `publish_date`), which newspaper4k maintains for backward compatibility. The Python 3.13 requirement also exceeds newspaper4k's minimum Python version (3.8+, later 3.10+), so there are no compatibility issues.

If stricter version pinning is desired for additional stability (e.g., `>=0.9.4.1,<1.0.0`), it's optional but not required for this implementation.
30-64: Config-driven `LLM_MODEL` wiring looks good; ensure invalid model names fail fast.

If `LLM_MODEL` is missing or unsupported, Groq will likely error at runtime; consider validating/logging the chosen model once at startup (or on first call) to make misconfiguration obvious.
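A minimal sketch of the fail-fast idea, assuming a hand-maintained allow-list; the model names listed are illustrative and would need to track Groq's actual catalog.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Illustrative allow-list; keep in sync with the Groq models the app supports.
SUPPORTED_MODELS = {"llama-3.3-70b-versatile", "llama-3.1-8b-instant"}

LLM_MODEL = os.getenv("LLM_MODEL", "llama-3.3-70b-versatile")

if LLM_MODEL not in SUPPORTED_MODELS:
    raise ValueError(
        f"Unsupported LLM_MODEL {LLM_MODEL!r}; expected one of {sorted(SUPPORTED_MODELS)}"
    )

logger.info("Using LLM model: %s", LLM_MODEL)
```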
32-70: LLM model centralization is consistent here.

backend/app/routes/routes.py (1)
70-75: Correctly awaiting the async LangGraph workflow (no thread hop).

Just ensure the workflow's heavy work is truly async (or safely executed off the event loop) so `/process` doesn't block other requests under load.
22-31: Good: centralized `LLM_MODEL` used for the judge LLM.

If the configured model changes, confirm it still reliably outputs a single integer with `max_tokens=10` (some models get verbose despite instructions).
23-45: Model centralization + structured output plumbing looks coherent.
35-37: The `List[str]` change to the `reasoning` field is intentional and already handled downstream.

The schema change was explicitly made in commit ef1f0bc to handle reasoning as a list. The main consumer in `backend/app/modules/vector_store/chunk_rag_data.py` (lines 66–69) already has defensive handling that supports both `List[str]` and string inputs:

```python
if isinstance(raw_reasoning, list):
    p_reason = "\n".join(raw_reasoning)
else:
    p_reason = str(raw_reasoning)
```

However, verify that all other consumers of `PerspectiveOutput` (if any exist) properly handle the list type to avoid silent failures.
60-79: The review concerns are largely based on incorrect assumptions about the codebase.

Type safety (not an issue): Facts are always dictionaries; they come from JSON parsing in `fact_check_tool.py` (lines 137–162) and are typed as `list[dict]` in the state. Using `.get()` is correct.

Status vs. verdict field (not an issue): The `status` field in facts contains `True`/`False`/`Unverified` (the fact-check result), not `"success"`. The fallback to `'verdict'` is defensive code used elsewhere in the codebase (`chunk_rag_data.py` line 151) to handle alternative structures; no semantic confusion occurs here.

Blocking call (not an issue): LangGraph automatically executes sync nodes in a thread pool when the graph is invoked via `ainvoke()` (as shown in `pipeline.py` line 24). The sync `chain.invoke()` call does not block the event loop.

Likely an incorrect or invalid review comment.
backend/app/modules/pipeline.py (1)
68-71: LGTM! Async conversion is correctly implemented.

The function signature change to `async def` and the use of `await _LANGGRAPH_WORKFLOW.ainvoke(state)` properly enable asynchronous execution of the LangGraph workflow. This aligns with the route layer changes that now await this function.
82-103: Good defensive handling for dynamic fact schemas.

The fallback field lookups (`claim`/`original_claim`, `status`/`verdict`, etc.) and the skip logic for invalid facts are well-implemented. This prevents crashes when encountering malformed or varied fact structures from the new fact-checking pipeline.
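A small sketch of that fallback pattern; the helper name `normalize_fact` is hypothetical, and the field names follow the review discussion.

```python
def normalize_fact(fact: dict) -> dict | None:
    # Accept both the new (claim, status) and old (original_claim, verdict) shapes.
    claim = fact.get("claim") or fact.get("original_claim")
    status = fact.get("status") or fact.get("verdict")
    if not claim:
        return None  # malformed entry: skip instead of raising
    return {"claim": claim, "status": status or "Unverified"}

facts = [
    {"claim": "X happened", "status": "True"},
    {"original_claim": "Y happened", "verdict": "Unverified"},
    {"result": "missing claim"},
]
print([f for f in (normalize_fact(x) for x in facts) if f])
```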
35-94: Well-structured parallel execution with proper async patterns.

The design correctly uses `asyncio.to_thread` for the synchronous Groq SDK call while running the async fact-check pipeline natively. The `asyncio.gather` effectively parallelizes both operations.
119-119: Good: Centralized LLM model configuration.

Using `LLM_MODEL` from the config instead of a hardcoded model name improves maintainability and allows environment-based configuration.
17-18: Module-level initialization may cause import-time failures.

The `Groq` client and `DuckDuckGoSearchRun` are initialized when the module is imported. If `GROQ_API_KEY` is not set or the search tool initialization fails, the entire module import will fail, potentially breaking application startup.

Consider lazy initialization or wrapping in a factory function to defer initialization until first use.
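One possible shape for that lazy initialization, assuming the groq SDK and langchain_community DuckDuckGo tool this PR relies on; the getter names are hypothetical.

```python
import os
from functools import lru_cache

from groq import Groq
from langchain_community.tools import DuckDuckGoSearchRun

@lru_cache(maxsize=1)
def get_groq_client() -> Groq:
    # Construct the client on first use so a missing key fails with a clear
    # error at call time instead of breaking module import.
    api_key = os.getenv("GROQ_API_KEY")
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return Groq(api_key=api_key)

@lru_cache(maxsize=1)
def get_search_tool() -> DuckDuckGoSearchRun:
    return DuckDuckGoSearchRun()
```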
22-49: Well-structured fact-checking pipeline.

The four-step approach (extract → plan → execute → verify) is clean, and each node handles errors gracefully by returning empty defaults. The use of `asyncio.to_thread` for blocking SDK calls and `asyncio.gather` for parallel searches is appropriate.

Also applies to: 51-92, 94-115, 117-166
backend/app/modules/langgraph_builder.py (2)
60-109: Workflow structure is well-designed for parallel execution.

The refactored graph correctly starts with `parallel_analysis`, chains through perspective generation/judging with retry logic, and properly routes errors. The conditional edges handle all expected states.
63-64: LangGraph's `StateGraph` explicitly supports async node functions; this is documented and demonstrated in the LangGraph quickstart and examples. The `run_parallel_analysis` function is correctly defined as async (line 35 of sentiment.py) and returns a dictionary, not an async generator. The pattern of adding the async function as a node, compiling the graph, and invoking via `ainvoke` is the correct, fully supported approach. No action needed.
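A minimal, self-contained sketch of that pattern (async node, compile, then `ainvoke`); the state fields and node body are illustrative, not the project's real graph.

```python
import asyncio

from langgraph.graph import END, StateGraph
from typing_extensions import TypedDict

class MyState(TypedDict, total=False):
    cleaned_text: str
    sentiment: str
    facts: list

async def parallel_analysis(state: MyState) -> MyState:
    # Async node: LangGraph awaits this natively when the graph is run via ainvoke().
    return {**state, "sentiment": "neutral", "facts": []}

graph = StateGraph(MyState)
graph.add_node("parallel_analysis", parallel_analysis)
graph.set_entry_point("parallel_analysis")
graph.add_edge("parallel_analysis", END)
workflow = graph.compile()

async def main():
    result = await workflow.ainvoke({"cleaned_text": "example article"})
    print(result)

asyncio.run(main())
```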
```python
for item in results:
    c_id = item["claim_id"]
    if c_id < len(claims):
        context += f"\nClaim: {claims[c_id]}\nEvidence: {item['result']}\n"
```
Potential KeyError or type mismatch for claim_id.
The code assumes item["claim_id"] exists and is an integer comparable to len(claims). If the LLM returns malformed JSON or the search fails, claim_id might be missing, None, or a non-integer, causing runtime errors.
🛡️ Add defensive handling
```diff
 for item in results:
-    c_id = item["claim_id"]
-    if c_id < len(claims):
+    c_id = item.get("claim_id")
+    if c_id is not None and isinstance(c_id, int) and 0 <= c_id < len(claims):
         context += f"\nClaim: {claims[c_id]}\nEvidence: {item['result']}\n"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for item in results:
    c_id = item.get("claim_id")
    if c_id is not None and isinstance(c_id, int) and 0 <= c_id < len(claims):
        context += f"\nClaim: {claims[c_id]}\nEvidence: {item['result']}\n"
```
🤖 Prompt for AI Agents
In @backend/app/modules/fact_check_tool.py around lines 126 - 129, The loop over
results assumes item["claim_id"] is present and an int; add defensive checks in
the for item in results loop to (1) verify "claim_id" in item, (2) coerce or
validate it as an integer (e.g., try int(item["claim_id"]) and handle
ValueError/TypeError), (3) ensure the resulting index is within range 0 <= c_id
< len(claims), and (4) skip or log malformed entries instead of using them when
building context; update references to item["claim_id"], c_id, claims, and
context accordingly so missing/invalid claim_id values cannot raise
KeyError/TypeError or index errors.
```python
article_id = generate_id(data["cleaned_text"])
chunks = []
```
Potential unhandled ValueError from generate_id.
Per the generate_id function in backend/app/utils/generate_chunk_id.py, it raises ValueError if text is empty or not a string. If cleaned_text is an empty string, this will raise before reaching the perspective/facts handling logic, potentially bypassing the graceful error handling at line 108-110.
🛡️ Add validation before calling generate_id
```diff
+    if not data.get("cleaned_text"):
+        logger.warning("Missing or empty cleaned_text. Returning empty chunks.")
+        return []
+
     article_id = generate_id(data["cleaned_text"])
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if not data.get("cleaned_text"):
    logger.warning("Missing or empty cleaned_text. Returning empty chunks.")
    return []

article_id = generate_id(data["cleaned_text"])
chunks = []
```
🤖 Prompt for AI Agents
In @backend/app/modules/vector_store/chunk_rag_data.py around lines 47 - 48, The
code calls generate_id(data["cleaned_text"]) without validating cleaned_text
which can raise ValueError; before calling generate_id in chunk_rag_data.py,
validate that data["cleaned_text"] is a non-empty string (or fallback to a safe
default id) and only then assign article_id, or wrap the generate_id call in a
try/except that catches ValueError and handles it the same way as the existing
perspective/facts error handling (e.g., log and continue/skip); reference the
article_id assignment and generate_id invocation so you can add the guard or
exception handling there.
```python
if hasattr(perspective_data, "dict"):
    p_data = perspective_data.dict()
elif hasattr(perspective_data, "model_dump"):
    p_data = perspective_data.model_dump()
elif isinstance(perspective_data, dict):
    p_data = perspective_data
```
🧩 Analysis chain
🏁 Script executed:
find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "Pipfile" \) | head -20
Repository: AOSSIE-Org/Perspective
Length of output: 90
🏁 Script executed:
cd backend && find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" \) 2>/dev/null | xargs cat 2>/dev/null | grep -i pinecone
Repository: AOSSIE-Org/Perspective
Length of output: 48
🏁 Script executed:
cat -n backend/app/modules/vector_store/chunk_rag_data.py | sed -n '50,70p'
Repository: AOSSIE-Org/Perspective
Length of output: 1000
🏁 Script executed:
cat backend/pyproject.toml | grep -A 20 -B 5 -i pydantic
Repository: AOSSIE-Org/Perspective
Length of output: 48
🏁 Script executed:
cat backend/pyproject.toml
Repository: AOSSIE-Org/Perspective
Length of output: 744
🏁 Script executed:
cd backend && rg -B 5 "perspective_data = data\[" app/modules/vector_store/chunk_rag_data.py -A 2
Repository: AOSSIE-Org/Perspective
Length of output: 275
🏁 Script executed:
cd backend && find . -type f -name "*.py" | xargs rg "class.*BaseModel" | head -10
Repository: AOSSIE-Org/Perspective
Length of output: 258
🏁 Script executed:
cd backend && rg -A 5 "class PerspectiveOutput" app/modules/langgraph_nodes/generate_perspective.py
Repository: AOSSIE-Org/Perspective
Length of output: 308
Reverse the order to check .model_dump() before .dict() for Pydantic v2 compatibility.
The project uses Pydantic v2 (via FastAPI and LangChain dependencies), where .dict() has been removed entirely. While the code functionally works because hasattr() checks fail for the removed method and fall through to .model_dump(), checking a non-existent method first is inefficient and unclear. Reverse the order to check .model_dump() first, then .dict() as a potential fallback for legacy compatibility.
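A small sketch of the suggested ordering; the helper name `to_plain_dict` is hypothetical.

```python
def to_plain_dict(perspective_data):
    # Prefer Pydantic v2's model_dump(); fall back to the legacy dict(); pass dicts through.
    if hasattr(perspective_data, "model_dump"):
        return perspective_data.model_dump()
    if hasattr(perspective_data, "dict"):
        return perspective_data.dict()
    if isinstance(perspective_data, dict):
        return perspective_data
    raise TypeError(f"Unsupported perspective payload: {type(perspective_data)!r}")
```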
🤖 Prompt for AI Agents
In @backend/app/modules/vector_store/chunk_rag_data.py around lines 52 - 57, The
attribute-check order for converting perspective_data to p_data should prefer
Pydantic v2: check for perspective_data.model_dump() first, then fallback to
perspective_data.dict(), and finally handle plain dicts; update the conditional
in chunk_rag_data.py so it calls model_dump() before dict() (retain the
isinstance(perspective_data, dict) branch) and assign the result to p_data.
This PR improves the overall pipeline performance, stability and configuration.
Key changes included in this PR
Replaced Google Search tool with a native DuckDuckGo based fact checking pipeline
Refactored the LangGraph workflow to run asynchronously with parallel execution
Added a central LLM model config and updated all modules to use it
Improved fact parsing and perspective generation to handle dynamic schemas
Fixed crashes in RAG chunking by safely handling missing or invalid fields
Upgraded newspaper3k to newspaper4k for better compatibility
Minor documentation update for required environment variables
Overall, this makes the system faster, more reliable, and easier to configure and maintain.

Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Chores
✏️ Tip: You can customize this high-level summary in your review settings.