
Conversation


@khushal1512 commented Jan 9, 2026

This PR improves overall pipeline performance, stability, and configurability.

Key changes included in this PR

  • Replaced the Google Search tool with a native DuckDuckGo-based fact-checking pipeline

  • Refactored the LangGraph workflow to run asynchronously with parallel execution

  • Added a central LLM model config and updated all modules to use it

  • Improved fact parsing and perspective generation to handle dynamic schemas

  • Fixed crashes in RAG chunking by safely handling missing or invalid fields

  • Upgraded newspaper3k to newspaper4k for better compatibility

  • Minor documentation update for required environment variables

Overall, this makes the system faster, more reliable, and easier to configure and maintain.
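
For reference, the centralized model configuration described above follows roughly this shape (a minimal sketch: the LLM_MODEL default matches llm_config.py, while the consuming call site is illustrative and assumes the Groq Python SDK rather than any specific module in this PR):

import os

# backend/app/llm_config.py -- single source of truth for the model name
LLM_MODEL = os.getenv("LLM_MODEL", "llama-3.3-70b-versatile")

# A consuming module would then pass the configured name instead of a hardcoded one:
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
response = client.chat.completions.create(
    model=LLM_MODEL,
    messages=[{"role": "user", "content": "Summarize this article in one sentence."}],
)
print(response.choices[0].message.content)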

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced configurable LLM model selection via environment configuration
    • Added comprehensive fact-checking workflow with claims extraction, search planning, and verdict verification
    • Enabled parallel processing of sentiment analysis and fact-checking for improved efficiency
    • Enhanced output structure for better reasoning transparency
  • Bug Fixes

    • Improved error handling and data validation in vector storage
    • Strengthened perspective data normalization and field extraction
  • Chores

    • Updated article parsing library dependency
    • Added support for new authentication environment variable

✏️ Tip: You can customize this high-level summary in your review settings.

…ython >= 3.12

(fix): removed incorrect pinecone client init in get_rag.py

(fix): centralize groq model change from llm_config.py
…arch Tool

(fix): Implement new fact check subgraph

(enhancement): sentiment analysis and fact check run in parallel: clean_text -> extract_claims -> plan_searches -> execute_searches -> verify_facts
…odes, i.e. sentiment node and fact check node

(chore): /process route awaits langgraph build compile
- Updated chunk_rag_data.py to support both new (claim, status) and old (original_claim, verdict) key formats from the fact-checker.

- Added logic to correctly parse perspective whether it is a Pydantic model or a dict.

- Implemented skipping of malformed facts instead of raising ValueError.

- Ensured compatibility with the new parallel DuckDuckGo fact-checking workflow.
- Updated the 'PerspectiveOutput' Pydantic model to handle reasoning as a list for claims.

coderabbitai bot commented Jan 9, 2026

📝 Walkthrough


Introduces configurable LLM_MODEL via environment variable, replacing hardcoded Groq model strings across multiple modules. Adds a multi-step fact-checking pipeline with claim extraction, search planning, and verification. Refactors LangGraph workflow to run sentiment and fact-checking in parallel and converts the pipeline to async.

Changes

Configuration & Setup: README.md, backend/app/llm_config.py
Documents the HF_TOKEN environment variable; introduces the new LLM_MODEL configuration constant with a "llama-3.3-70b-versatile" default.

Model Configurability: backend/app/modules/bias_detection/check_bias.py, backend/app/modules/chat/llm_processing.py, backend/app/modules/facts_check/llm_processing.py, backend/app/modules/langgraph_nodes/generate_perspective.py, backend/app/modules/langgraph_nodes/judge.py
Replaces hardcoded "gemma2-9b-it" model references with the configurable LLM_MODEL variable across LLM client initializations.

Fact-Checking Pipeline: backend/app/modules/fact_check_tool.py
Introduces a new module with four async nodes (extract_claims_node, plan_searches_node, execute_searches_node, verify_facts_node) implementing a multi-step verification workflow with error handling.

LangGraph Workflow Refactoring: backend/app/modules/langgraph_builder.py, backend/app/modules/langgraph_nodes/sentiment.py
Restructures the workflow from a sequential pipeline to a parallel_analysis stage; adds a new run_parallel_analysis function coordinating sentiment analysis and fact-checking concurrently; updates the MyState schema with new fields (claims, search_queries, search_results, facts).

Output Structure Updates: backend/app/modules/langgraph_nodes/generate_perspective.py
Changes PerspectiveOutput.reasoning from string to List[str] with a reasoning_steps alias; enables structured LLM output.

Data Processing: backend/app/modules/chat/get_rag_data.py, backend/app/modules/vector_store/chunk_rag_data.py
Converts Pinecone initialization to use a keyword argument; refactors fact chunking with flexible field mapping (claim/original_claim, status/verdict, reason/explanation) and robust type handling (see the field-mapping sketch after this list).

Async Pipeline & Routes: backend/app/modules/pipeline.py, backend/app/routes/routes.py
Converts run_langgraph_workflow to async with ainvoke; updates the route to directly await the async function instead of using asyncio.to_thread.

Dependencies: backend/pyproject.toml
Upgrades newspaper3k>=0.2.8 to newspaper4k>=0.9.4.1.
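
As a rough illustration of the flexible field mapping noted under Data Processing (a sketch only; the helper name and exact defaults are assumptions, not code from chunk_rag_data.py):

def normalize_fact(fact) -> dict | None:
    """Map old (original_claim/verdict) and new (claim/status) fact-checker
    key formats onto one schema, skipping malformed entries instead of raising."""
    if not isinstance(fact, dict):
        return None

    claim = fact.get("claim") or fact.get("original_claim")
    status = fact.get("status") if "status" in fact else fact.get("verdict")
    reason = fact.get("reason") or fact.get("explanation", "")

    if not claim or status is None:
        return None  # skip instead of raising ValueError

    # reasoning/explanation may arrive as a list of steps or a plain string
    if isinstance(reason, list):
        reason = "\n".join(str(step) for step in reason)

    return {"claim": claim, "status": status, "reason": str(reason)}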

Sequence Diagram(s)

sequenceDiagram
    actor State as LangGraph State
    participant PA as parallel_analysis
    participant SA as Sentiment Analysis
    participant FC as Fact-Check Pipeline
    participant EX as extract_claims_node
    participant PS as plan_searches_node
    participant ES as execute_searches_node
    participant VF as verify_facts_node

    State->>PA: run_parallel_analysis(state)
    par Parallel Execution
        PA->>SA: run_sentiment_sdk (async)
        SA-->>PA: sentiment result
    and
        PA->>FC: run_fact_check_pipeline
        FC->>EX: extract_claims_node
        EX-->>FC: claims
        FC->>PS: plan_searches_node
        PS-->>FC: search_queries
        FC->>ES: execute_searches_node
        ES-->>FC: search_results
        FC->>VF: verify_facts_node
        VF-->>FC: facts, fact_check_done
    end
    PA->>State: return aggregated state<br/>(sentiment, claims, facts, etc.)
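The coordination the diagram shows can be reduced to roughly the following sketch (stand-in node bodies; the real ones call Groq and DuckDuckGo, and only the asyncio wiring reflects the PR):

import asyncio

def run_sentiment_sdk(state: dict) -> dict:
    # Stand-in for the synchronous Groq SDK call
    return {"sentiment": "neutral", "status": "success"}

async def run_fact_check_pipeline(state: dict) -> dict:
    # Stand-in for extract_claims -> plan_searches -> execute_searches -> verify_facts
    return {"claims": [], "search_queries": [], "search_results": [], "facts": [], "status": "success"}

async def run_parallel_analysis(state: dict) -> dict:
    sentiment_result, fact_check_result = await asyncio.gather(
        asyncio.to_thread(run_sentiment_sdk, state),  # blocking SDK call off the event loop
        run_fact_check_pipeline(state),               # natively async pipeline
    )
    return {
        **state,
        "sentiment": sentiment_result.get("sentiment"),
        "claims": fact_check_result.get("claims", []),
        "search_queries": fact_check_result.get("search_queries", []),
        "search_results": fact_check_result.get("search_results", []),
        "facts": fact_check_result.get("facts", []),
        "status": "success",
    }

# asyncio.run(run_parallel_analysis({"cleaned_text": "..."}))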

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 Through forests of models, now configurable and free,
Parallel pipelines harvest facts with glee,
Claims extracted, searches planned with care,
Async hops make the workflow fair! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)

Docstring Coverage ⚠️ Warning: Docstring coverage is 11.11%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
Title Check ✅ Passed: The title accurately summarizes the three main changes: async LangGraph workflow, DuckDuckGo-based fact-checking pipeline, and LLM configuration centralization.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/modules/facts_check/llm_processing.py (1)

104-155: Fix parsed unbound usage on JSON parse failures (will crash).

On Line 146–149, if json.loads(content) fails, parsed is never assigned, but still appended—this will raise and short-circuit the whole run. Also, returning "claim": claim (Line 152) only reflects the last processed item.

Proposed fix
@@
-            # Try parsing the JSON response
-            try:
-                parsed = json.loads(content)
-            except Exception as parse_err:
-                logger.error(f"LLM JSON parse error: {parse_err}")
-
-            results_list.append(parsed)
+            # Try parsing the JSON response
+            try:
+                parsed = json.loads(content)
+            except Exception as parse_err:
+                logger.error(f"LLM JSON parse error: {parse_err}")
+                parsed = {
+                    "verdict": "Unknown",
+                    "explanation": "Failed to parse verifier output as JSON.",
+                    "original_claim": claim,
+                    "source_link": source,
+                    "raw_output": content,
+                }
+
+            results_list.append(parsed)
@@
-        return {
-            "claim": claim,
-            "verifications": results_list,
-            "status": "success",
-        }
+        return {"verifications": results_list, "status": "success"}
🤖 Fix all issues with AI agents
In @backend/app/modules/fact_check_tool.py:
- Around line 126-129: The loop over results assumes item["claim_id"] is present
and an int; add defensive checks in the for item in results loop to (1) verify
"claim_id" in item, (2) coerce or validate it as an integer (e.g., try
int(item["claim_id"]) and handle ValueError/TypeError), (3) ensure the resulting
index is within range 0 <= c_id < len(claims), and (4) skip or log malformed
entries instead of using them when building context; update references to
item["claim_id"], c_id, claims, and context accordingly so missing/invalid
claim_id values cannot raise KeyError/TypeError or index errors.

In @backend/app/modules/vector_store/chunk_rag_data.py:
- Around line 47-48: The code calls generate_id(data["cleaned_text"]) without
validating cleaned_text which can raise ValueError; before calling generate_id
in chunk_rag_data.py, validate that data["cleaned_text"] is a non-empty string
(or fallback to a safe default id) and only then assign article_id, or wrap the
generate_id call in a try/except that catches ValueError and handles it the same
way as the existing perspective/facts error handling (e.g., log and
continue/skip); reference the article_id assignment and generate_id invocation
so you can add the guard or exception handling there.
- Around line 52-57: The attribute-check order for converting perspective_data
to p_data should prefer Pydantic v2: check for perspective_data.model_dump()
first, then fallback to perspective_data.dict(), and finally handle plain dicts;
update the conditional in chunk_rag_data.py so it calls model_dump() before
dict() (retain the isinstance(perspective_data, dict) branch) and assign the
result to p_data.
🧹 Nitpick comments (9)
backend/app/llm_config.py (1)

1-4: Good centralization pattern; consider adding optional enhancements.

The centralized LLM model configuration is a solid improvement that establishes a single source of truth for model selection across the codebase. The default model "llama-3.3-70b-versatile" is valid and available on Groq as of 2025. The environment variable pattern with a sensible default is appropriate.

Consider these optional enhancements to improve maintainability:

  1. Add a module-level docstring explaining the configuration's purpose and usage
  2. Add validation to ensure the model name is valid for the Groq API (e.g., a list of supported models)
  3. Consider logging which model is being used at startup for easier debugging
📝 Optional: Add module docstring
+"""
+llm_config.py
+-------------
+Centralized configuration for LLM model selection.
+
+This module provides a single source of truth for the LLM model used
+across the application. The model can be configured via the LLM_MODEL
+environment variable.
+
+Environment Variables:
+    LLM_MODEL (str): The Groq model identifier to use. 
+                     Defaults to "llama-3.3-70b-versatile".
+"""
+
 import os
 
 # Default to a stable model
 LLM_MODEL = os.getenv("LLM_MODEL", "llama-3.3-70b-versatile")
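
For points 2 and 3, one possible shape (illustrative only; the known-model set is a placeholder, not an authoritative list of Groq models):

import logging
import os

logger = logging.getLogger(__name__)

# Placeholder set -- would need to be kept in sync with the models Groq actually offers.
_KNOWN_MODELS = {"llama-3.3-70b-versatile", "gemma2-9b-it"}

LLM_MODEL = os.getenv("LLM_MODEL", "llama-3.3-70b-versatile")

if LLM_MODEL not in _KNOWN_MODELS:
    logger.warning("LLM_MODEL %r is not in the known-model list; Groq may reject it.", LLM_MODEL)
else:
    logger.info("Using LLM model: %s", LLM_MODEL)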
backend/app/modules/pipeline.py (1)

41-41: Unused import: asyncio is not used in this module.

The asyncio import was added but isn't referenced anywhere in the file. The async/await syntax doesn't require an explicit import of the module.

🧹 Remove unused import
 from app.logging.logging_config import setup_logger
 import json
-import asyncio
backend/app/modules/vector_store/chunk_rag_data.py (1)

108-110: Swallowing exceptions may hide upstream issues.

Returning an empty list on any exception could mask problems in the data pipeline. The current logging via logger.exception is good, but consider whether certain exceptions (e.g., ValueError for missing required fields like cleaned_text or perspective) should propagate to alert callers of invalid input.

backend/app/modules/langgraph_nodes/sentiment.py (2)

73-84: Partial failure discards successful results.

If sentiment analysis fails but fact-checking succeeds (or vice versa), the entire operation returns an error, discarding potentially useful results from the successful task. Consider returning partial results with a warning status for better resilience.

♻️ Consider partial success handling
-    if sentiment_result.get("status") == "error":
-        return {
-            "status": "error",
-            "error_from": sentiment_result.get("error_from", "sentiment_analysis"),
-            "message": sentiment_result.get("message", "Unknown error")
-        }
-    if fact_check_result.get("status") == "error":
-        return {
-            "status": "error",
-            "error_from": fact_check_result.get("error_from", "fact_checking"),
-            "message": fact_check_result.get("message", "Unknown error")
-        }
+    # Check for errors but still return partial results
+    has_error = False
+    error_sources = []
+    
+    if sentiment_result.get("status") == "error":
+        has_error = True
+        error_sources.append("sentiment_analysis")
+        logger.warning(f"Sentiment analysis failed: {sentiment_result.get('message')}")
+    
+    if fact_check_result.get("status") == "error":
+        has_error = True
+        error_sources.append("fact_checking")
+        logger.warning(f"Fact checking failed: {fact_check_result.get('message')}")

     return {
         **state,
-        "sentiment": sentiment_result.get("sentiment"),
+        "sentiment": sentiment_result.get("sentiment", "unknown"),
         "claims": fact_check_result.get("claims", []),
         "search_queries": fact_check_result.get("search_queries", []),
         "search_results": fact_check_result.get("search_results", []),
         "facts": fact_check_result.get("facts", []),
-        "status": "success"
+        "status": "partial_error" if has_error else "success",
+        "error_sources": error_sources if has_error else None
     }

60-60: Minor: Redundant exception object in logging.exception.

Per static analysis, logging.exception automatically includes exception info; you don't need to pass e explicitly.

🧹 Fix
-            logger.exception(f"Error in fact_check_pipeline: {e}")
+            logger.exception("Error in fact_check_pipeline")
backend/app/modules/fact_check_tool.py (2)

109-110: Unused exception variable e.

The exception is caught but never used. Either log it or remove the variable binding.

🧹 Log the error or remove unused variable
-        except Exception as e:
-            return {"claim_id": q.get("claim_id"), "result": "Search failed"}
+        except Exception:
+            logger.warning(f"Search failed for claim {q.get('claim_id')}")
+            return {"claim_id": q.get("claim_id"), "result": "Search failed"}

48-48: Use logging.exception instead of logging.error to capture stack traces.

Per static analysis, logging.exception automatically includes the traceback, which is valuable for debugging LLM/network failures.

♻️ Replace logger.error with logger.exception
-        logger.error(f"Error extraction claims: {e}")
+        logger.exception("Error extracting claims")
-        logger.error(f"Failed to plan searches: {e}")
+        logger.exception("Failed to plan searches")
-        logger.error(f"Verification failed: {e}")
+        logger.exception("Verification failed")

Also applies to: 91-91, 165-165

backend/app/modules/langgraph_builder.py (2)

47-57: Inconsistent type annotation style.

Line 49 uses Python 3.9+ built-in list[dict] syntax, while lines 55-57 use typing.List. For consistency, use one style throughout.

🧹 Standardize to built-in generics (Python 3.9+)
-from typing import List, Any
+from typing import Any
 from langgraph.graph import StateGraph
 from typing_extensions import TypedDict
-
 ...

 class MyState(TypedDict):
     cleaned_text: str
     facts: list[dict]
     sentiment: str
     perspective: str
     score: int
     retries: int
     status: str
-    claims: List[str]
-    search_queries: List[Any]
-    search_results: List[Any]
+    claims: list[str]
+    search_queries: list[Any]
+    search_results: list[Any]

87-100: Complex nested conditional logic is correct but hard to read.

The branching logic handles error → retry threshold → score threshold correctly. Consider extracting to a named function for clarity.

♻️ Extract to named function for readability
+def _judge_routing(state):
+    if state.get("status") == "error":
+        return "error_handler"
+    if state.get("score", 0) < 70:
+        if state.get("retries", 0) >= 3:
+            return "store_and_send"
+        return "generate_perspective"
+    return "store_and_send"
+
 ...
     graph.add_conditional_edges(
         "judge_perspective",
-        lambda state: (
-            "error_handler"
-            if state.get("status") == "error"
-            else (
-                "store_and_send"
-                if state.get("retries", 0) >= 3
-                else "generate_perspective"
-            )
-            if state.get("score", 0) < 70
-            else "store_and_send"
-        ),
+        _judge_routing,
     )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 19c83ed and ef1f0bc.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • README.md
  • backend/app/llm_config.py
  • backend/app/modules/bias_detection/check_bias.py
  • backend/app/modules/chat/get_rag_data.py
  • backend/app/modules/chat/llm_processing.py
  • backend/app/modules/fact_check_tool.py
  • backend/app/modules/facts_check/llm_processing.py
  • backend/app/modules/langgraph_builder.py
  • backend/app/modules/langgraph_nodes/generate_perspective.py
  • backend/app/modules/langgraph_nodes/judge.py
  • backend/app/modules/langgraph_nodes/sentiment.py
  • backend/app/modules/pipeline.py
  • backend/app/modules/vector_store/chunk_rag_data.py
  • backend/app/routes/routes.py
  • backend/pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (7)
backend/app/modules/langgraph_nodes/judge.py (1)
backend/app/logging/logging_config.py (1)
  • setup_logger (4-39)
backend/app/modules/langgraph_nodes/generate_perspective.py (1)
backend/app/logging/logging_config.py (1)
  • setup_logger (4-39)
backend/app/modules/vector_store/chunk_rag_data.py (1)
backend/app/utils/generate_chunk_id.py (1)
  • generate_id (29-33)
backend/app/modules/fact_check_tool.py (1)
backend/app/logging/logging_config.py (1)
  • setup_logger (4-39)
backend/app/modules/langgraph_builder.py (5)
backend/app/modules/langgraph_nodes/sentiment.py (1)
  • run_parallel_analysis (35-94)
backend/app/modules/langgraph_nodes/generate_perspective.py (1)
  • generate_perspective (50-87)
backend/app/modules/langgraph_nodes/judge.py (1)
  • judge_perspective (34-74)
backend/app/modules/langgraph_nodes/store_and_send.py (1)
  • store_and_send (26-54)
backend/app/modules/langgraph_nodes/error_handler.py (1)
  • error_handler (21-30)
backend/app/modules/langgraph_nodes/sentiment.py (2)
backend/app/logging/logging_config.py (1)
  • setup_logger (4-39)
backend/app/modules/fact_check_tool.py (4)
  • extract_claims_node (22-49)
  • plan_searches_node (51-92)
  • execute_searches_node (94-115)
  • verify_facts_node (117-166)
backend/app/routes/routes.py (1)
backend/app/modules/pipeline.py (1)
  • run_langgraph_workflow (68-72)
🪛 Ruff (0.14.10)
backend/app/modules/vector_store/chunk_rag_data.py

41-41: Abstract raise to an inner function

(TRY301)


41-41: Avoid specifying long messages outside the exception class

(TRY003)


106-106: Consider moving this statement to an else block

(TRY300)


109-109: Redundant exception object included in logging.exception call

(TRY401)

backend/app/modules/fact_check_tool.py

45-45: Consider moving this statement to an else block

(TRY300)


47-47: Do not catch blind exception: Exception

(BLE001)


48-48: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


88-88: Consider moving this statement to an else block

(TRY300)


90-90: Do not catch blind exception: Exception

(BLE001)


91-91: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


108-108: Consider moving this statement to an else block

(TRY300)


109-109: Do not catch blind exception: Exception

(BLE001)


109-109: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


162-162: Consider moving this statement to an else block

(TRY300)


164-164: Do not catch blind exception: Exception

(BLE001)


165-165: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

backend/app/modules/langgraph_nodes/sentiment.py

60-60: Redundant exception object included in logging.exception call

(TRY401)

🔇 Additional comments (20)
backend/app/modules/bias_detection/check_bias.py (2)

30-30: Good adoption of centralized configuration.

The import of LLM_MODEL from app.llm_config correctly adopts the centralized configuration pattern introduced in this PR. This improves maintainability by eliminating hard-coded model names.


65-65: Code correctly uses the centralized LLM_MODEL configuration.

The implementation at line 65 properly references the centralized LLM_MODEL variable from llm_config.py, which defaults to "llama-3.3-70b-versatile". This is consistent with the rest of the backend codebase and represents the intended pattern for centralized model configuration.

No model switch is occurring—the code has always used the centralized variable approach. The instruction to return only a number (line 57) is clear, and the temperature setting of 0.3 supports deterministic output.

Likely an incorrect or invalid review comment.

README.md (1)

170-170: Remove HF_TOKEN from required environment variables in README.md.

The documentation lists HF_TOKEN as a required environment variable, but the backend does not actually use it. The embedding models ("all-MiniLM-L6-v2") loaded in embed_query.py and embed.py are public Hugging Face models that require no authentication token. Remove this from the required environment variables list to avoid misleading developers.

Likely an incorrect or invalid review comment.

backend/app/modules/chat/get_rag_data.py (1)

34-34: Code correctly uses Pinecone 7.3.0+ initialization API.

This is the required pattern for Pinecone SDK v7.x+. The api_key keyword argument (with value from the environment variable) is the correct and only supported way to initialize the Pinecone client in v7.x, replacing the deprecated pinecone.init() pattern.

backend/pyproject.toml (1)

19-19: The upgrade to newspaper4k is compatible with current usage.

The code uses only the standard Article API (Article, download(), parse(), and basic properties like title, text, authors, publish_date), which newspaper4k maintains for backward compatibility. The Python 3.13 requirement also exceeds newspaper4k's minimum version (3.8+, later 3.10+), so there are no compatibility issues.

If stricter version pinning is desired for additional stability (e.g., >=0.9.4.1,<1.0.0), it's optional but not required for this implementation.

backend/app/modules/chat/llm_processing.py (1)

30-64: Config-driven LLM_MODEL wiring looks good; ensure invalid model names fail fast.

If LLM_MODEL is missing/unsupported, Groq will likely error at runtime; consider validating/logging the chosen model once at startup (or on first call) to make misconfig obvious.

backend/app/modules/facts_check/llm_processing.py (1)

32-70: LLM model centralization is consistent here.

backend/app/routes/routes.py (1)

70-75: Correctly awaiting the async LangGraph workflow (no thread hop).

Just ensure the workflow’s heavy work is truly async (or safely executed off the event loop) so /process doesn’t block other requests under load.

backend/app/modules/langgraph_nodes/judge.py (1)

22-31: Good: centralized LLM_MODEL used for judge LLM.

If the configured model changes, confirm it still reliably outputs a single integer with max_tokens=10 (some models get verbose despite instructions).

backend/app/modules/langgraph_nodes/generate_perspective.py (3)

23-45: Model centralization + structured output plumbing looks coherent.


35-37: The List[str] change to reasoning field is intentional and already handled downstream.

The schema change was explicitly made in commit ef1f0bc to handle reasoning as a list. The main consumer in backend/app/modules/vector_store/chunk_rag_data.py (lines 66–69) already has defensive handling that supports both List[str] and string inputs:

if isinstance(raw_reasoning, list):
    p_reason = "\n".join(raw_reasoning)
else:
    p_reason = str(raw_reasoning)

However, verify that all other consumers of PerspectiveOutput (if any exist) properly handle the list type to avoid silent failures.


60-79: The review concerns are largely based on incorrect assumptions about the codebase.

Type safety (not an issue): Facts are always dictionaries—they come from JSON parsing in fact_check_tool.py (line 137–162) and are typed as list[dict] in the state. Using .get() is correct.

Status vs. verdict field (not an issue): The status field in facts contains True/False/Unverified (fact-check result), not "success". The fallback to 'verdict' is defensive code used elsewhere in the codebase (chunk_rag_data.py line 151) to handle alternative structures—no semantic confusion occurs here.

Blocking call (not an issue): LangGraph automatically executes sync nodes in a thread pool when the graph is invoked via ainvoke() (as shown in pipeline.py line 24). The sync chain.invoke() call does not block the event loop.

Likely an incorrect or invalid review comment.

backend/app/modules/pipeline.py (1)

68-71: LGTM! Async conversion is correctly implemented.

The function signature change to async def and the use of await _LANGGRAPH_WORKFLOW.ainvoke(state) properly enables asynchronous execution of the LangGraph workflow. This aligns with the route layer changes that now await this function.
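
Schematically, the async conversion looks like this (a sketch: the placeholder node and the commented FastAPI route shape are assumptions, only the ainvoke/await pattern comes from the diff):

import asyncio
from langgraph.graph import StateGraph, START, END
from typing_extensions import TypedDict

class MyState(TypedDict, total=False):
    cleaned_text: str
    status: str

async def parallel_analysis(state: MyState) -> MyState:
    # Placeholder node standing in for the real parallel stage
    return {**state, "status": "success"}

_graph = StateGraph(MyState)
_graph.add_node("parallel_analysis", parallel_analysis)
_graph.add_edge(START, "parallel_analysis")
_graph.add_edge("parallel_analysis", END)
_LANGGRAPH_WORKFLOW = _graph.compile()

async def run_langgraph_workflow(state: MyState) -> MyState:
    # ainvoke runs the compiled graph asynchronously; the /process route awaits
    # this directly instead of hopping through asyncio.to_thread.
    return await _LANGGRAPH_WORKFLOW.ainvoke(state)

# asyncio.run(run_langgraph_workflow({"cleaned_text": "example"}))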

backend/app/modules/vector_store/chunk_rag_data.py (1)

82-103: Good defensive handling for dynamic fact schemas.

The fallback field lookups (claim/original_claim, status/verdict, etc.) and the skip logic for invalid facts are well-implemented. This prevents crashes when encountering malformed or varied fact structures from the new fact-checking pipeline.

backend/app/modules/langgraph_nodes/sentiment.py (2)

35-94: Well-structured parallel execution with proper async patterns.

The design correctly uses asyncio.to_thread for the synchronous Groq SDK call while running the async fact-check pipeline natively. The asyncio.gather effectively parallelizes both operations.


119-119: Good: Centralized LLM model configuration.

Using LLM_MODEL from the config instead of a hardcoded model name improves maintainability and allows environment-based configuration.

backend/app/modules/fact_check_tool.py (2)

17-18: Module-level initialization may cause import-time failures.

The Groq client and DuckDuckGoSearchRun are initialized when the module is imported. If GROQ_API_KEY is not set or the search tool initialization fails, the entire module import will fail, potentially breaking application startup.

Consider lazy initialization or wrapping in a factory function to defer initialization until first use.
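
A minimal way to express that (illustrative; the import paths assume the groq SDK and langchain_community, both already used by the PR):

from functools import lru_cache

from groq import Groq
from langchain_community.tools import DuckDuckGoSearchRun

@lru_cache(maxsize=1)
def get_groq_client() -> Groq:
    # Built on first use, so a missing GROQ_API_KEY surfaces at call time,
    # not when the module is imported.
    return Groq()

@lru_cache(maxsize=1)
def get_search_tool() -> DuckDuckGoSearchRun:
    return DuckDuckGoSearchRun()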


22-49: Well-structured fact-checking pipeline.

The four-step approach (extract → plan → execute → verify) is clean and each node handles errors gracefully by returning empty defaults. The use of asyncio.to_thread for blocking SDK calls and asyncio.gather for parallel searches is appropriate.

Also applies to: 51-92, 94-115, 117-166

backend/app/modules/langgraph_builder.py (2)

60-109: Workflow structure is well-designed for parallel execution.

The refactored graph correctly starts with parallel_analysis, chains through perspective generation/judging with retry logic, and properly routes errors. The conditional edges handle all expected states.


63-64: LangGraph's StateGraph explicitly supports async node functions—this is documented and demonstrated in the LangGraph quickstart and examples. The run_parallel_analysis function is correctly defined as async (line 35 of sentiment.py) and returns a dictionary, not an async generator. The pattern of adding the async function as a node, compiling the graph, and invoking via ainvoke is the correct, fully supported approach. No action needed.

Comment on lines +126 to +129
for item in results:
    c_id = item["claim_id"]
    if c_id < len(claims):
        context += f"\nClaim: {claims[c_id]}\nEvidence: {item['result']}\n"

⚠️ Potential issue | 🟡 Minor

Potential KeyError or type mismatch for claim_id.

The code assumes item["claim_id"] exists and is an integer comparable to len(claims). If the LLM returns malformed JSON or the search fails, claim_id might be missing, None, or a non-integer, causing runtime errors.

🛡️ Add defensive handling
     for item in results:
-        c_id = item["claim_id"]
-        if c_id < len(claims):
+        c_id = item.get("claim_id")
+        if c_id is not None and isinstance(c_id, int) and 0 <= c_id < len(claims):
             context += f"\nClaim: {claims[c_id]}\nEvidence: {item['result']}\n"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    for item in results:
-        c_id = item["claim_id"]
-        if c_id < len(claims):
-            context += f"\nClaim: {claims[c_id]}\nEvidence: {item['result']}\n"
+    for item in results:
+        c_id = item.get("claim_id")
+        if c_id is not None and isinstance(c_id, int) and 0 <= c_id < len(claims):
+            context += f"\nClaim: {claims[c_id]}\nEvidence: {item['result']}\n"

Comment on lines 47 to 48
article_id = generate_id(data["cleaned_text"])
chunks = []

⚠️ Potential issue | 🟡 Minor

Potential unhandled ValueError from generate_id.

Per the generate_id function in backend/app/utils/generate_chunk_id.py, it raises ValueError if text is empty or not a string. If cleaned_text is an empty string, this will raise before reaching the perspective/facts handling logic, potentially bypassing the graceful error handling at line 108-110.

🛡️ Add validation before calling generate_id
+        if not data.get("cleaned_text"):
+            logger.warning("Missing or empty cleaned_text. Returning empty chunks.")
+            return []
+
         article_id = generate_id(data["cleaned_text"])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        article_id = generate_id(data["cleaned_text"])
-        chunks = []
+        if not data.get("cleaned_text"):
+            logger.warning("Missing or empty cleaned_text. Returning empty chunks.")
+            return []
+
+        article_id = generate_id(data["cleaned_text"])
+        chunks = []

Comment on lines +52 to +57
if hasattr(perspective_data, "dict"):
    p_data = perspective_data.dict()
elif hasattr(perspective_data, "model_dump"):
    p_data = perspective_data.model_dump()
elif isinstance(perspective_data, dict):
    p_data = perspective_data

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "Pipfile" \) | head -20

Repository: AOSSIE-Org/Perspective

Length of output: 90


🏁 Script executed:

cd backend && find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" \) 2>/dev/null | xargs cat 2>/dev/null | grep -i pydantic

Repository: AOSSIE-Org/Perspective

Length of output: 48


🏁 Script executed:

cat -n backend/app/modules/vector_store/chunk_rag_data.py | sed -n '50,70p'

Repository: AOSSIE-Org/Perspective

Length of output: 1000


🏁 Script executed:

cat backend/pyproject.toml | grep -A 20 -B 5 -i pydantic

Repository: AOSSIE-Org/Perspective

Length of output: 48


🏁 Script executed:

cat backend/pyproject.toml

Repository: AOSSIE-Org/Perspective

Length of output: 744


🏁 Script executed:

cd backend && rg -B 5 "perspective_data = data\[" app/modules/vector_store/chunk_rag_data.py -A 2

Repository: AOSSIE-Org/Perspective

Length of output: 275


🏁 Script executed:

cd backend && find . -type f -name "*.py" | xargs rg "class.*BaseModel" | head -10

Repository: AOSSIE-Org/Perspective

Length of output: 258


🏁 Script executed:

cd backend && rg -A 5 "class PerspectiveOutput" app/modules/langgraph_nodes/generate_perspective.py

Repository: AOSSIE-Org/Perspective

Length of output: 308


Reverse the order to check .model_dump() before .dict() for Pydantic v2 compatibility.

The project uses Pydantic v2 (via FastAPI and LangChain dependencies), where .dict() has been removed entirely. While the code functionally works because hasattr() checks fail for the removed method and fall through to .model_dump(), checking a non-existent method first is inefficient and unclear. Reverse the order to check .model_dump() first, then .dict() as a potential fallback for legacy compatibility.
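
A drop-in reordering of the quoted block (same branches, Pydantic v2 checked first):

if hasattr(perspective_data, "model_dump"):      # Pydantic v2
    p_data = perspective_data.model_dump()
elif hasattr(perspective_data, "dict"):          # legacy Pydantic v1 fallback
    p_data = perspective_data.dict()
elif isinstance(perspective_data, dict):
    p_data = perspective_data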

