
Knowledge Agent

An autonomous AI agent for intelligently updating, maintaining, and curating a LightRAG knowledge base.

About The Project

This project provides an autonomous AI agent, the "Knowledge Agent", that proactively maintains, expands, and curates a local LightRAG knowledge base. It turns the knowledge base from a static repository into a living, self-improving system, keeping the information it contains accurate, relevant, and up to date.

The Knowledge Agent is designed to solve the challenges of maintaining a static knowledge base:

  • Staleness: Information quickly becomes outdated without a process for continuous updates.
  • Incompleteness: The knowledge base is limited to manually selected documents, creating information silos and knowledge gaps.
  • High Maintenance Overhead: The manual effort required to find new sources, ingest them, and fix data quality issues is significant and does not scale.
  • Data Quality Degradation: As more data is added, inconsistencies and duplicates can accumulate, reducing the reliability of RAG outputs and polluting the knowledge graph.

The agent now features a robust document processing pipeline that can fetch raw web content (including PDFs and HTML), generate clean markdown, and store it in a structured database for further analysis and summarization.

Architecture

The Knowledge Agent uses a multi-agent architecture, where a primary Orchestrator Agent manages the overall workflow by delegating tasks to a team of specialized sub-agents.

+---------------------+
| Orchestrator Agent  |
+----------+----------+
           |
           v
+----------+----------+
|      Sub-Agents     |
+----------+----------+
           |
           v
+----------+----------+
|       MCP Servers   |
+---------------------+
  • Orchestrator (Knowledge Agent): The project manager. It holds the high-level plan and the overall state. It invokes the appropriate sub-agent for each task and handles the flow of information between them.
  • Sub-Agents: A team of specialized agents, each with a specific role in the knowledge management lifecycle.
  • MCP Servers: All agents interact with the outside world and the knowledge base exclusively through tools provided by MCP servers.
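
To make this concrete, the orchestration can be pictured as a LangGraph state graph in which each sub-agent is a node that reads from and writes to a shared state. The following is a minimal sketch under that assumption; the state shape, node bodies, and three-agent subset are illustrative, not the project's actual implementation.

from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class MaintenanceState(TypedDict):
    reports: dict  # reports accumulated so far, keyed by agent name

def analyst(state: MaintenanceState) -> dict:
    # The real node would call the LLM with analyst_prompt.txt and the
    # LightRAG MCP tools; here we just record a placeholder report.
    return {"reports": {**state["reports"], "analyst": "knowledge gaps ..."}}

def researcher(state: MaintenanceState) -> dict:
    return {"reports": {**state["reports"], "researcher": "candidate URLs ..."}}

def curator(state: MaintenanceState) -> dict:
    return {"reports": {**state["reports"], "curator": "ingested sources ..."}}

graph = StateGraph(MaintenanceState)
graph.add_node("analyst", analyst)
graph.add_node("researcher", researcher)
graph.add_node("curator", curator)
graph.add_edge(START, "analyst")
graph.add_edge("analyst", "researcher")
graph.add_edge("researcher", "curator")
graph.add_edge("curator", END)

app = graph.compile()
final_state = app.invoke({"reports": {}})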

Frameworks and Libraries

  • LangChain: A framework for developing applications powered by language models.
  • LangGraph: A library for building stateful, multi-agent applications with LLMs.
  • langchain-mcp-adapters: Used for connecting to and using tools from MCP servers.
  • ChatOpenAI: LangChain's interface to OpenAI-compatible chat models, used as the LLM driving the agents.
  • pydantic: Used for data validation and settings management.
  • psycopg2-binary: A PostgreSQL adapter for Python.
  • python-dotenv: A library for managing environment variables.
  • json-repair: A library for repairing malformed JSON.
  • requests: A library for making HTTP requests to download web content.
  • pdfplumber: A library for extracting text from PDF documents.
  • Trafilatura: A tool for fast and accurate extraction of main content from HTML.
  • Playwright: A library for browser automation, used as a fallback for complex websites.
  • beautifulsoup4: A library for parsing HTML content.
  • html2text: A library for converting HTML to markdown.
  • tiktoken: A tool for counting tokens to ensure content fits within the LLM's context window.

Agent Roles

  • Analyst: Identifies knowledge gaps and stale information in the knowledge base by analyzing its content and structure.
  • Researcher: Acts as the primary research arm of the agent. It breaks down research tasks and manages the entire content acquisition pipeline:
    • Planner: Creates a strategic, diversified search plan using advanced search operators.
    • Content Processor: Uses a hybrid strategy to extract clean, reader-mode content. It first tries the fast and accurate trafilatura library; if that fails to return quality content, it falls back to full browser rendering with Playwright to handle complex, JavaScript-heavy sites (a sketch of this step appears after this list).
    • Refiner: If the initial search plan is unsuccessful, the refiner adjusts the strategy to find the missing information.
    • Summarizer: Generates a concise summary from the clean markdown content. Before summarizing, the content is passed through a filter that truncates it to a safe token limit (16k) to ensure efficiency and prevent context window errors.
  • Curator: Takes the URLs from the Researcher, decides which ones are relevant, and then ingests the approved content into the knowledge base.
  • Auditor: Scans the knowledge graph for data quality issues like duplicate entities, inconsistent naming, and messy relationships.
  • Fixer: Corrects the data quality issues identified by the Auditor, with a human approval step for destructive operations.
  • Advisor: Analyzes recurring error patterns and suggests improvements to the LightRAG system's configuration to prevent future issues.
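
To make the Researcher's content pipeline concrete, here is a minimal sketch of the hybrid extraction and pre-summarization truncation steps. It assumes trafilatura's markdown output (available in recent releases) and uses a crude length check as the quality gate; the function names are illustrative, and the real pipeline may differ.

import html2text
import tiktoken
import trafilatura
from playwright.sync_api import sync_playwright

MAX_TOKENS = 16_000  # safe token limit applied before summarization

def extract_markdown(url: str) -> str:
    # Fast path: trafilatura fetches the page and extracts the main content.
    downloaded = trafilatura.fetch_url(url)
    if downloaded:
        markdown = trafilatura.extract(downloaded, output_format="markdown")
        if markdown and len(markdown) > 200:  # crude quality gate (assumption)
            return markdown

    # Fallback: render the page in a real browser, then convert to markdown.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html2text.HTML2Text().handle(html)

def truncate_for_summary(markdown: str) -> str:
    # Keep the content within the Summarizer's context budget.
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(markdown)
    return encoding.decode(tokens[:MAX_TOKENS])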

Getting Started

Prerequisites

  • Python 3.12+
  • uv package manager
  • A running LightRAG instance
  • Running MCP servers for tools (e.g., Google Search)
  • PostgreSQL database

Installation

  1. Clone the repository:

    git clone https://github.com/fvanevski/knowledge-agent.git
    cd knowledge-agent
  2. Install the dependencies using uv:

    uv sync
  3. Set up the environment variables by creating a .env file in the root directory. You can use the .env.example file as a template.

  4. Install Playwright's browser binaries:

    uv run python -m playwright install

Usage

The Knowledge Agent is executed via the run.py script. You can select a workflow with command-line arguments; a sketch of how these flags might be wired follows the list below.

Workflows

  • Full Maintenance (--maintenance): This is the default workflow and runs all the sub-agents in sequence to perform a full maintenance cycle on the knowledge base.

    uv run python run.py --maintenance
  • Analyze (--analyze): Identifies knowledge gaps and stale information.

    uv run python run.py --analyze
  • Research (--research): Finds new sources for the topics identified by the Analyst.

    uv run python run.py --research
  • Curate (--curate): Ranks search results and ingests approved new content into the knowledge base.

    uv run python run.py --curate
  • Audit (--audit): Reviews the knowledge base for data quality issues.

    uv run python run.py --audit
  • Fix (--fix): Corrects the data quality issues found by the Auditor.

    uv run python run.py --fix
  • Advise (--advise): Provides recommendations for systemic improvements.

    uv run python run.py --advise
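
A plausible wiring of these flags with argparse looks like the sketch below. The flag names match the usage above, but the dispatch logic is an assumption about run.py's internals.

import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Knowledge Agent workflows")
    group = parser.add_mutually_exclusive_group()
    for flag, help_text in [
        ("--maintenance", "run the full maintenance cycle (default)"),
        ("--analyze", "identify knowledge gaps and stale information"),
        ("--research", "find new sources for identified topics"),
        ("--curate", "rank search results and ingest approved content"),
        ("--audit", "review the knowledge base for quality issues"),
        ("--fix", "correct issues found by the Auditor"),
        ("--advise", "recommend systemic improvements"),
    ]:
        group.add_argument(flag, action="store_true", help=help_text)
    args = parser.parse_args()
    if not any(vars(args).values()):
        args.maintenance = True  # default workflow
    # ... dispatch to the selected workflow here ...

if __name__ == "__main__":
    main()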

Configuration

The Knowledge Agent requires an mcp.json file in the root directory to configure the connections to the MCP tool servers. The file maps server names to their configurations, for example:

{
    "google_search": {
        "command": "uv",
        "args": ["run", "python", "google_search_mcp.py"],
        "cwd": "/workspace/mcp_servers/google_search_mcp",
        "transport": "stdio"
    },
    "lightrag": {
        "command": "uv",
        "args": ["run", "python", "lightrag_mcp.py"],
        "cwd": "/workspace/mcp_servers/lightrag_mcp",
        "transport": "stdio"
    },
    "file_tools": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace/knowledge_agent", "/workspace/LightRAG"],
        "transport": "stdio"
    },
    "deepwiki": {
        "url": "https://mcp.deepwiki.com/sse",
        "transport": "sse"
    }
}
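
At startup, the agents can load this file and turn each server into LangChain tools via langchain-mcp-adapters. Below is a minimal sketch, assuming the MultiServerMCPClient interface from recent releases of that package:

import asyncio
import json

from langchain_mcp_adapters.client import MultiServerMCPClient

async def load_mcp_tools() -> list:
    # Read the server definitions and expose every MCP tool to the agents.
    with open("mcp.json") as f:
        servers = json.load(f)
    client = MultiServerMCPClient(servers)
    return await client.get_tools()

tools = asyncio.run(load_mcp_tools())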

Prompts

The behavior of each sub-agent is guided by a system prompt located in the prompts/ directory. These prompts define the agent's persona, goals, and expected output format.

  • analyst_prompt.txt: Guides the Analyst in identifying knowledge gaps.
  • planner_prompt.txt: Guides the Researcher's Planner in creating a search strategy.
  • refiner_prompt.txt: Guides the Researcher's Refiner in adjusting the search strategy.
  • summarizer_prompt.txt: Guides the Researcher's Summarizer in creating a concise summary.
  • search_ranker_prompt.txt: Guides the Curator in ranking search results for ingestion.
  • ingester_prompt.txt: Guides the Curator in ingesting new sources.
  • auditor_prompt.txt: Guides the Auditor in identifying data quality issues.
  • fixer_prompt.txt: Guides the Fixer in correcting data quality issues.
  • advisor_prompt.txt: Guides the Advisor in providing recommendations.

Logging

The agent's operations are logged to a file in the logs/ directory. The logs are in JSON format and include the timestamp, log level, agent name, and the message, providing a detailed record of the agent's activity.
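
A structured log line like that can be produced with a small JSON formatter on top of Python's standard logging module. The sketch below is illustrative; the project's actual formatter and file naming may differ.

import datetime
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per log record.
        return json.dumps({
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "level": record.levelname,
            "agent": getattr(record, "agent", "orchestrator"),
            "message": record.getMessage(),
        })

handler = logging.FileHandler("logs/agent.log")  # path is an assumption
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("knowledge_agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("analysis run started", extra={"agent": "analyst"})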

Database

The Knowledge Agent uses a PostgreSQL database to store the reports generated by the sub-agents and to cache processed web content. The db_utils.py file contains the functions for creating the tables and interacting with the database.

The database schema consists of tables for each agent's reports and a central documents table:

  • analyst_reports
  • researcher_reports
  • curator_reports
  • auditor_reports
  • fixer_reports
  • advisor_reports
  • documents

The documents table stores processed web content and has the following structure (a DDL sketch follows the list):

  • id: Primary key (integer)
  • url: The unique URL of the source document (text)
  • raw_document: The raw binary content of the document (BYTEA)
  • markdown_content: The processed, clean markdown version of the content (text)
  • summary: A concise summary of the document (text)
  • created_at: Timestamp of when the document was first added
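
From those column descriptions, the table can be created with DDL along these lines, here executed through psycopg2. The exact types and the connection string are assumptions; see db_utils.py for the authoritative schema.

import psycopg2

DOCUMENTS_DDL = """
CREATE TABLE IF NOT EXISTS documents (
    id SERIAL PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    raw_document BYTEA,
    markdown_content TEXT,
    summary TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);
"""

conn = psycopg2.connect("dbname=knowledge_agent")  # hypothetical DSN
with conn, conn.cursor() as cur:  # the connection context commits on success
    cur.execute(DOCUMENTS_DDL)
conn.close()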

Workflow Details

The maintenance workflow is the most comprehensive, executing the full lifecycle of knowledge management. Here is a step-by-step breakdown of the process:

  1. Analysis: The Analyst examines the knowledge base to identify areas that are outdated or incomplete. It generates a report detailing these knowledge gaps.
  2. Research: The Researcher takes the Analyst's report and executes the entire content acquisition pipeline:
    • The Planner develops a set of targeted, diversified search queries.
    • The agent executes these searches. For each resulting URL, it uses the hybrid content processor (Trafilatura with a Playwright fallback) to extract clean, main content and generate high-quality markdown.
    • All artifacts (raw document, markdown, and summary) are stored in the documents table in the database.
    • If the initial searches are insufficient, the Refiner adjusts the plan and tries again.
  3. Curation: The Curator ranks the URLs from the Researcher, decides which ones belong in the knowledge base, and then ingests the approved content.
  4. Audit: The Auditor scans the knowledge graph for inconsistencies, duplicates, and other data quality issues, producing a report of its findings.
  5. Fix: The Fixer takes the Auditor's report and attempts to correct the identified issues. For any destructive changes (e.g., deleting an entity), it requires human approval (a minimal gate is sketched below).
  6. Advise: Finally, the Advisor analyzes the reports from all the other agents, identifies recurring problems, and suggests systemic improvements to the LightRAG configuration or the agent's own processes.
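
The human approval step in step 5 can be as simple as a console confirmation gate wrapped around each destructive operation. A minimal sketch follows; the function name, prompt wording, and example entities are illustrative.

def confirm_destructive(description: str) -> bool:
    # Require an explicit "y" before a destructive fix is applied.
    answer = input(f"Approve destructive change? {description} [y/N]: ")
    return answer.strip().lower() == "y"

if confirm_destructive("merge duplicate entities 'LightRAG' and 'Light-RAG'"):
    # ... perform the merge via the LightRAG MCP tools ...
    pass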