diff --git a/docs/_toc.yml b/docs/_toc.yml index d067c48..bddbc3b 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -18,15 +18,13 @@ parts: numbered: True chapters: - file: ingestion_service + - file: deployment_braikbservices - caption: BrainKB User Interface numbered: True chapters: - file: brainkbui - - caption: Deployment - numbered: True - chapters: + - file: ui_developer_document - file: deployment_userinterface - - file: deployment_braikbservices - caption: StructSense numbered: True chapters: diff --git a/docs/deployment_braikbservices.md b/docs/deployment_braikbservices.md index c2960f0..5d8f4d0 100644 --- a/docs/deployment_braikbservices.md +++ b/docs/deployment_braikbservices.md @@ -1,4 +1,4 @@ -# Deployment of BrainKB Services +# Microservices Deployment BrainKB consists of multiple service components, as highlighted in the {ref}`brainkb_architecture_figure`. All of the service components can be deployed independently. However, there are a few dependencies, such as setting up the PostgreSQL database that is used by JWT Users and Scope Manager, that need to be set up first. diff --git a/docs/deployment_userinterface.md b/docs/deployment_userinterface.md index 657e6b4..27524cb 100644 --- a/docs/deployment_userinterface.md +++ b/docs/deployment_userinterface.md @@ -1,4 +1,4 @@ -# Deployment of User Interface +# Deployment This section provides information regarding the deployment of the BrainKB UI in both development and production modes. ```{note} diff --git a/docs/structsense_configuration.md b/docs/structsense_configuration.md index bed3f56..89888b4 100644 --- a/docs/structsense_configuration.md +++ b/docs/structsense_configuration.md @@ -9,15 +9,13 @@ Pass the YAML via CLI, e.g. `--config config/ner_agent.yaml`. 
- `agent_config` - `task_config` -**Do not replace** runtime variables in braces `{}`: -- `{literature}` — input text (e.g., extracted PDF content) -- `{extracted_structured_information}` — extractor output -- `{aligned_structured_information}` — alignment output -- `{judged_structured_information_with_human_feedback}` — judge output -- `{modification_context}`, `{user_feedback_text}` — inputs to feedback agent +**Do not replace variables** enclosed in curly braces (`{}`); they are dynamically populated at runtime. Names must match the pipeline input map (see `config_template` for examples): +- **Extraction input:** `{input_text}` — input text (e.g. PDF content or raw text) +- **Alignment input:** `{extracted_structured_information}` — output from the extractor agent +- **Judge input:** `{aligned_structured_information}` — output from the alignment agent +- **Human feedback input:** `{judged_structured_information_with_human_feedback}` — output from the judge agent; `{modification_context}` and `{user_feedback_text}` — user feedback for the feedback agent -**Config Template**\ -A blank template is available in [config_template](https://github.com/sensein/structsense/blob/main/config_template/config.yaml). +A blank template, as well as templates for tasks such as `NER`, `Resource Extraction`, and `PDF2ReproSchema`, is available in `config_template/`. See **Templates**. 
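For illustration, a task entry in the YAML might embed one of these runtime variables as follows (a hypothetical fragment for orientation only; see the real templates in `config_template/` for the exact field contents):

```yaml
task_config:
  extraction_task:
    agent_id: extractor_agent       # must match an agent ID in agent_config
    description: >
      Extract structured entities from the following text: {input_text}
    expected_output: >
      A JSON object listing the extracted entities.
```

The `{input_text}` placeholder is left verbatim in the file; the pipeline substitutes the actual document content at runtime.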
## Agent Configuration @@ -63,7 +61,7 @@ Run without a paid API key: ```bash structsense-cli extract \ --source SOME.pdf \ - --config ner_config_gpt.yaml \ + --config ner-config.yaml \ --env_file .env ``` @@ -77,7 +75,7 @@ Required task IDs (do not rename): - `humanfeedback_task` Each task includes: -- `description` — includes expected input (e.g., `{literature}`) +- `description` — includes expected input (e.g., `{input_text}`) - `expected_output` — **JSON** output format or example - `agent_id` — must match an agent ID from `agent_config` @@ -106,6 +104,20 @@ embedder_config: model: nomic-embed-text:latest ``` +### Experiment Tracking (optional) +| Variable | Description | Default | +|---|---|---| +| `ENABLE_WEIGHTSANDBIAS` | Enable W&B | `false` | +| `ENABLE_MLFLOW` | Enable MLflow | `false` | +| `MLFLOW_TRACKING_URL` | MLflow tracking URL | `http://localhost:5000` | + +### Minimal (no tracking, no knowledge source) +```bash +ENABLE_WEIGHTSANDBIAS=false +ENABLE_MLFLOW=false +ENABLE_KG_SOURCE=false +``` +## Legacy ### Knowledge Source (Vector DB) `WEAVIATE_*` environment variables are optional and only needed if you enable a knowledge source for schema/ontology lookup. @@ -146,17 +158,25 @@ embedder_config: > If Ollama runs on host and Weaviate in Docker, use `http://host.docker.internal:11434`. > If both are in Docker on the same host network, use `http://localhost:11434`. 
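The host-vs-container rule in the note above can be made explicit in a small helper. This is only a sketch; `ollama_endpoint` is a hypothetical name, not part of StructSense:

```python
def ollama_endpoint(weaviate_in_docker: bool, ollama_in_docker: bool, port: int = 11434) -> str:
    """Pick the Ollama URL Weaviate should use, per the deployment layout above."""
    if weaviate_in_docker and not ollama_in_docker:
        # A container reaching a service on the Docker host needs the special DNS name.
        return f"http://host.docker.internal:{port}"
    # Same host (or both containers on the same host network): plain localhost works.
    return f"http://localhost:{port}"

print(ollama_endpoint(weaviate_in_docker=True, ollama_in_docker=False))
# → http://host.docker.internal:11434
```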
-### Experiment Tracking (optional) -| Variable | Description | Default | -|---|---|---| -| `ENABLE_WEIGHTSANDBIAS` | Enable W&B | `false` | -| `ENABLE_MLFLOW` | Enable MLflow | `false` | -| `MLFLOW_TRACKING_URL` | MLflow tracking URL | `http://localhost:5000` | - - -## Example `.env` +### Example `.env` ```bash -ENABLE_KG_SOURCE=false -OLLAMA_API_ENDPOINT=http://localhost:11434 -OLLAMA_MODEL=nomic-embed-text:v1.5 +WEAVIATE_API_KEY=your_api_key +WEAVIATE_HTTP_HOST=localhost +WEAVIATE_HTTP_PORT=8080 +WEAVIATE_HTTP_SECURE=false + +WEAVIATE_GRPC_HOST=localhost +WEAVIATE_GRPC_PORT=50051 +WEAVIATE_GRPC_SECURE=false + +WEAVIATE_TIMEOUT_INIT=30 +WEAVIATE_TIMEOUT_QUERY=60 +WEAVIATE_TIMEOUT_INSERT=120 + +OLLAMA_API_ENDPOINT=http://host.docker.internal:11434 +OLLAMA_MODEL=nomic-embed-text + +ENABLE_WEAVE=true +ENABLE_MLFLOW=true +MLFLOW_TRACKING_URL=http://localhost:5000 ``` diff --git a/docs/structsense_examples.md b/docs/structsense_examples.md index 6a28967..41f11b5 100644 --- a/docs/structsense_examples.md +++ b/docs/structsense_examples.md @@ -1,9 +1,18 @@ -# Examples +# Tutorials & Examples -- See the [example/](https://github.com/sensein/structsense/tree/main/example) directory for usage demonstrations and reference configs. +- See the `tutorial/` directory for usage demonstrations. +- See the `example/` directory for task-specific reference configs that can be used for `StructSense`. +- A blank configuration template is provided under `config_template/`. -## Example Use Cases -**For more information about StructSense use cases, see the [StructSense paper on arXiv](https://arxiv.org/html/2507.03674v2#S5)** -- Neuroscience Named Entity Extraction from text -- Resource (i.e. models, datasets) Extraction -- ReproSchema Extraction + +## Blank Configuration Template + +A starting template is provided in `config_template/`. +Note that the `config_template/` folder also contains configuration files for the `NER`, `Resource Extraction`, and `PDF2ReproSchema` tasks. 
+ +Before modifying, read: +- **Configuration Overview & Template** +- **Agents** +- **Tasks** +- **Embeddings & Knowledge** +- **Environment Variables (see `.env_example` from the `StructSense` repository)** diff --git a/docs/structsense_getting_started.md b/docs/structsense_getting_started.md index c3d5ac3..63adbfa 100644 --- a/docs/structsense_getting_started.md +++ b/docs/structsense_getting_started.md @@ -7,6 +7,8 @@ pip install structsense ``` Alternatively, you can install the latest version of StructSense from the source code on GitHub: +**Note:** The latest updates are not pushed to PyPI, so for now it is recommended to install from GitHub. + ```bash git clone https://github.com/sensein/structsense.git cd structsense @@ -20,15 +22,17 @@ StructSense supports **Python >=3.10,<3.13**. ## Requirements + ### PDF Extraction with Grobid StructSense supports PDF extraction using **[Grobid](https://grobid.readthedocs.io/en/latest/Introduction/)** (default) or an external API service. #### Default: Grobid -By default, StructSense uses Grobid for PDF extraction. You can install and run Grobid either with Docker or in a non-Docker setup. +StructSense uses Grobid for PDF extraction. You can install and run Grobid either with Docker or in a non-Docker setup. We recommend using Docker for easier setup and dependency management. ##### Run Grobid with Docker + ```bash docker pull lfoppiano/grobid:0.8.0 docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0 @@ -58,31 +62,180 @@ In our default setup, Ollama is used for embedding generation. You can also use -## Running +## Using StructSense (CLI and Python) + +### Command-line (CLI) + +After installing (`pip install -e .`), the entry point is **`structsense-cli`**. + +#### Full pipeline (extract) + +Runs extraction → alignment → judge → optional human feedback and returns the final structured result. 
+ +```bash +structsense-cli extract \ + --config path/to/config.yaml \ + --source path/to/file.pdf \ + --env_file .env \ + --save_file result.json +``` + +| Option | Description | +|--------|-------------| +| `--config` | **(Required)** Path to YAML config (agent + task + embedder). | +| `--source` | **(Required)** Input: path to a PDF/text file, a folder, or a text string. | +| `--api_key` | OpenRouter (or other) API key; can also be set in `.env` as `OPENROUTER_API_KEY`. | +| `--env_file` | Path to `.env` (default: `.env` in current directory). | +| `--save_file` | Save the result JSON to this path. | +| `--enable_chunking` | Enable chunking for long documents (flag). | +| `--chunk_size` | Chunk size in characters (e.g. `2000`); used when chunking is enabled. | +| `--max_workers` | Max parallel workers for chunked extraction. | +| `--downstream_max_input_chars` | Max input length for alignment/judge (default 80000). | +| `--max_extraction_chunk_chars` | Cap per-chunk size for extraction (default 25000). | + +**With OpenRouter (API key):** -### Using OpenRouter ```bash structsense-cli extract \ --source somefile.pdf \ - --api_key \ + --api_key \ --config someconfig.yaml \ --env_file .env \ - --save_file result.json # optional + --save_file result.json ``` -### Using Ollama (Local) +**With Ollama (local, no API key):** + ```bash structsense-cli extract \ --source somefile.pdf \ --config someconfig.yaml \ - --env_file .env_file \ - --save_file result.json # optional + --env_file .env \ + --save_file result.json ``` -### Chunking -Disabled by default. Enable with: +**With chunking (recommended for long PDFs):** + ```bash ---chunking True +structsense-cli extract \ + --config config.yaml \ + --source file.pdf \ + --enable_chunking \ + --chunk_size 2000 \ + --save_file result.json +``` + +#### Single agent–task (run-agent) + +Run one agent and one task only (e.g. 
extractor only), without the full pipeline: + +```bash +structsense-cli run-agent \ + --config path/to/config.yaml \ + --agent_key extractor_agent \ + --task_key extraction_task \ + --source path/to/file.pdf \ + --env_file .env \ + --save_file result.json +``` + +Use the same chunking/worker options as `extract` when needed. + + +### Python (programmatic) + +Use **StructSenseFlow** as the single entry point. Run the **full pipeline** with `information_extraction_task()`, or a **single agent** with `kickoff(agent_key, task_key)` or `extraction()`. + +**API key when running via Python:** For OpenRouter (or other cloud LLMs), either pass `api_key="your-key"` to `StructSenseFlow(...)` or set `OPENROUTER_API_KEY` in a `.env` file and pass `env_file=".env"`. The key is injected into the agent LLM config so all agents use it. Get an OpenRouter key at [openrouter.ai/keys](https://openrouter.ai/keys). If you get `401 User not found`, the key is missing or invalid. + +#### Full pipeline (recommended) + +```python +import asyncio +from structsense.app import StructSenseFlow + +# Config can be paths to YAML files or dicts +flow = StructSenseFlow( + agent_config="path/to/config.yaml", + task_config="path/to/config.yaml", + embedder_config="path/to/config.yaml", + input_source="path/to/file.pdf", # or a text string, or path to .txt + enable_chunking=True, + chunk_size=2000, + max_workers=8, + env_file=".env", + api_key=None, # or set OPENROUTER_API_KEY in .env +) + +# Run full pipeline: extraction → alignment → judge → human feedback (if enabled) +result = asyncio.run(flow.information_extraction_task()) + +# Result is a dict: entities, key_terms, resources, judged_terms, concept_mapping, etc. 
+print(result.get("task_type"), result.get("elapsed_time")) + +# Save to file +import json +with open("result.json", "w") as f: + json.dump(result, f, indent=2, default=str) +``` + +#### Single agent (one agent–task pair) + +You can run **any** single agent–task pair with `kickoff(agent_key=..., task_key=...)`. For the extractor only, the convenience method is `extraction()`. For the **full pipeline** (extraction → alignment → judge → humanfeedback), use `information_extraction_task()`. + +```python +import asyncio +from structsense.app import StructSenseFlow + +flow = StructSenseFlow( + agent_config="path/to/config.yaml", + task_config="path/to/config.yaml", + embedder_config="path/to/config.yaml", + input_source="path/to/file.pdf", # or source_text="raw text" + enable_chunking=True, + chunk_size=2000, +) + +# Run only the extractor (convenience method) +result = asyncio.run(flow.extraction()) + +# Or run any specific agent–task pair +result = asyncio.run(flow.kickoff( + agent_key="extractor_agent", + task_key="extraction_task", +)) +# Other pairs: alignment_agent/alignment_task, judge_agent/judge_task, +# humanfeedback_agent/humanfeedback_task +``` + +**Note:** Alignment, judge, and humanfeedback tasks are designed to receive **output from the previous stage** when run in the full pipeline. When you run them alone via `kickoff(...)`, they receive the raw `source_text` as input (useful for debugging or custom flows). + +#### Passing config as dicts + +```python +import asyncio +import yaml +from structsense.app import StructSenseFlow + +with open("ner-config.yaml") as f: + all_config = yaml.safe_load(f) + +flow = StructSenseFlow( + agent_config=all_config["agent_config"], + task_config=all_config["task_config"], + embedder_config=all_config.get("embedder_config", {}), + input_source="path/to/file.pdf", # or source_text="raw text" + enable_chunking=True, + chunk_size=2000, + max_workers=8, + env_file=".env", # optional; loads OPENROUTER_API_KEY etc. 
+ api_key=None, # or pass key here; injected into LLM config +) +result = asyncio.run(flow.information_extraction_task()) + +import json +with open("result.json", "w") as f: + json.dump(result, f, indent=2, default=str) ``` @@ -91,9 +244,8 @@ Disabled by default. Enable with: The `docker/` directory contains **Docker Compose** files for running the following components: - **Grobid** – for PDF extraction -- **Weaviate** – In our StructSense architecture, Weaviate acts as the vector database responsible for storing the ontology, effectively serving as the Ontology database. -These Compose files allow you to quickly stand up a complete local **StructSense** stack. +These Compose files allow you to quickly stand up a complete local **StructSense** stack. If you prefer not to install dependencies system-wide, you can use the provided Docker Compose setup to run everything in **container mode**. This makes it easy to isolate services and manage your environment with minimal setup. diff --git a/docs/structsense_troubleshooting.md b/docs/structsense_troubleshooting.md index d684ed7..d164f1f 100644 --- a/docs/structsense_troubleshooting.md +++ b/docs/structsense_troubleshooting.md @@ -23,11 +23,34 @@ Ensure Python version is **>=3.10,<3.13**. ## FAQ -**Q: Do I need Weaviate to run StructSense?** -A: No. Set `ENABLE_KG_SOURCE=false` to run without a vector DB. +**Q: Why does the agent prompt “Would you like to view your execution traces?”** +A: This happens when execution tracing or telemetry is enabled (the default). You can disable the prompt by turning off tracing and telemetry via environment variables. + +```bash +CREWAI_TRACING_ENABLED=false +CREWAI_DISABLE_TELEMETRY=true +CREWAI_DISABLE_TRACING=true +CREWAI_TELEMETRY=false +OTEL_SDK_DISABLED=true +ENABLE_CREW_MEMORY=false +``` +**Q: I am seeing non-fatal agent memory errors. What should I do?** +A: This is commonly related to agent memory being enabled without a valid OpenAI key. 
If you don’t need memory, disable it explicitly. + +```bash +ENABLE_CREW_MEMORY=false +``` + +**Q: How do chunk sizes affect performance and accuracy?** +A: Smaller chunk sizes generally improve extraction accuracy, but they also increase processing time. Larger chunks run faster but may reduce accuracy—choose based on your priority. + +**Q: Where can I find developer documentation?** +A: Developer documentation is available in the repository under `Developer.md`. **Q: Can I use local models without API keys?** A: Yes, via **Ollama**. Update agent configs to use the Ollama base URL and model. **Q: Where do I find a minimal `.env`?** A: See **Environment Variables → Minimal** section. + + diff --git a/docs/ui_developer_document.md b/docs/ui_developer_document.md new file mode 100644 index 0000000..03b4cb0 --- /dev/null +++ b/docs/ui_developer_document.md @@ -0,0 +1 @@ +# Developer Documentation \ No newline at end of file