6 changes: 2 additions & 4 deletions docs/_toc.yml
@@ -18,15 +18,13 @@ parts:
    numbered: True
    chapters:
      - file: ingestion_service
      - file: deployment_braikbservices
  - caption: BrainKB User Interface
    numbered: True
    chapters:
      - file: brainkbui
  - caption: Deployment
    numbered: True
    chapters:
      - file: ui_developer_document
      - file: deployment_userinterface
      - file: deployment_braikbservices
  - caption: StructSense
    numbered: True
    chapters:
2 changes: 1 addition & 1 deletion docs/deployment_braikbservices.md
@@ -1,4 +1,4 @@
# Deployment of BrainKB Services
# Microservices Deployment

BrainKB consists of multiple service components, as highlighted in {ref}`brainkb_architecture_figure`. All of the service components can be deployed independently. However, there are a few dependencies, such as the PostgreSQL database used by JWT Users and Scope Manager, that need to be set up first.

2 changes: 1 addition & 1 deletion docs/deployment_userinterface.md
@@ -1,4 +1,4 @@
# Deployment of User Interface
# Deployment
**Review comment (medium):**

To improve clarity, consider making the title more specific, as this file is now part of the 'BrainKB User Interface' section in the table of contents. A more descriptive title would help readers understand the context better.

Suggested change
# Deployment
# User Interface Deployment

This section provides information regarding the deployment of the BrainKB UI in both development and production modes.

```{note}
64 changes: 42 additions & 22 deletions docs/structsense_configuration.md
@@ -9,15 +9,13 @@ Pass the YAML via CLI, e.g. `--config config/ner_agent.yaml`.
- `agent_config`
- `task_config`

**Do not replace** runtime variables in braces `{}`:
- `{literature}` — input text (e.g., extracted PDF content)
- `{extracted_structured_information}` — extractor output
- `{aligned_structured_information}` — alignment output
- `{judged_structured_information_with_human_feedback}` — judge output
- `{modification_context}`, `{user_feedback_text}` — inputs to feedback agent
**Do not replace variables** enclosed in curly braces (`{}`); they are dynamically populated at runtime. Names must match the pipeline input map (see `config_template` for examples):
- **Extraction input:** `{input_text}` — input text (e.g. PDF content or raw text)
- **Alignment input:** `{extracted_structured_information}` — output from the extractor agent
- **Judge input:** `{aligned_structured_information}` — output from the alignment agent
- **Human feedback input:** `{judged_structured_information_with_human_feedback}` — output from the judge agent; `{modification_context}` and `{user_feedback_text}` — user feedback for the feedback agent
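
For instance, a task `description` embeds a placeholder verbatim and StructSense substitutes it at runtime. A minimal sketch (the wording is illustrative; the field layout follows `config_template`):

```yaml
extraction_task:
  description: >
    Extract structured information from the following text:
    {input_text}
```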

**Config Template**\
A blank template is available in [config_template](https://github.com/sensein/structsense/blob/main/config_template/config.yaml).
A blank template, as well as templates for tasks such as `NER`, `Resource Extraction`, and `PDF2ReproSchema`, is available in `config_template/`. See **Templates**.

<!--Agent Configuration -->
## Agent Configuration
@@ -63,7 +61,7 @@ Run without a paid API key:
```bash
structsense-cli extract \
--source SOME.pdf \
--config ner_config_gpt.yaml \
--config ner-config.yaml \
--env_file .env
```

@@ -77,7 +75,7 @@ Required task IDs (do not rename):
- `humanfeedback_task`

Each task includes:
- `description` — includes expected input (e.g., `{literature}`)
- `description` — includes expected input (e.g., `{input_text}`)
- `expected_output` — **JSON** output format or example
- `agent_id` — must match an agent ID from `agent_config`
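
Putting the three fields together, one task entry might look like the following sketch (the `expected_output` shape is up to you; `extractor_agent` must be defined in `agent_config`):

```yaml
extraction_task:
  description: >
    Extract entities from the text below and return them as JSON.
    {input_text}
  expected_output: >
    {"entities": [{"label": "...", "text": "..."}]}
  agent_id: extractor_agent
```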

@@ -106,6 +104,20 @@ embedder_config:
model: nomic-embed-text:latest
```
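
For reference, a full `embedder_config` block might look like this sketch; the `provider`/`config` nesting is an assumption based on `config_template`, with only the Ollama model name taken from the fragment above:

```yaml
embedder_config:
  provider: ollama          # assumed provider key
  config:
    api_base: http://localhost:11434
    model: nomic-embed-text:latest
```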

### Experiment Tracking (optional)
| Variable | Description | Default |
|---|---|---|
| `ENABLE_WEIGHTSANDBIAS` | Enable W&B | `false` |
| `ENABLE_MLFLOW` | Enable MLflow | `false` |
| `MLFLOW_TRACKING_URL` | MLflow tracking URL | `http://localhost:5000` |
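
For example, to enable MLflow tracking against a local server, a `.env` could contain:

```bash
ENABLE_MLFLOW=true
MLFLOW_TRACKING_URL=http://localhost:5000
```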

### Minimal (no tracking, no knowledge source)
```bash
ENABLE_WEIGHTSANDBIAS=false
ENABLE_MLFLOW=false
ENABLE_KG_SOURCE=false
```
## Legacy
**Review comment (medium):**

The `## Legacy` heading here might cause confusion. It implies that using a 'Knowledge Source (Vector DB)' is a legacy feature, yet the `## Environment Variables` section below provides detailed current configuration for Weaviate (a vector DB). To avoid ambiguity, please clarify if using a vector DB is a legacy approach. If so, consider moving all Weaviate-related environment variable documentation under this Legacy section.

### Knowledge Source (Vector DB)
`WEAVIATE_*` environment variables are optional and only needed if you enable a knowledge source for schema/ontology lookup.
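
For example, enabling the knowledge source could look like this in `.env` (hosts and ports mirror the example below; adjust them to your Weaviate deployment):

```bash
ENABLE_KG_SOURCE=true
WEAVIATE_API_KEY=your_api_key
WEAVIATE_HTTP_HOST=localhost
WEAVIATE_HTTP_PORT=8080
WEAVIATE_GRPC_HOST=localhost
WEAVIATE_GRPC_PORT=50051
```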

@@ -146,17 +158,25 @@ embedder_config:
> If Ollama runs on host and Weaviate in Docker, use `http://host.docker.internal:11434`.
> If both are in Docker on the same host network, use `http://localhost:11434`.

### Experiment Tracking (optional)
| Variable | Description | Default |
|---|---|---|
| `ENABLE_WEIGHTSANDBIAS` | Enable W&B | `false` |
| `ENABLE_MLFLOW` | Enable MLflow | `false` |
| `MLFLOW_TRACKING_URL` | MLflow tracking URL | `http://localhost:5000` |

<!-- Example .env -->
## Example `.env`
### Example `.env`
```bash
ENABLE_KG_SOURCE=false
OLLAMA_API_ENDPOINT=http://localhost:11434
OLLAMA_MODEL=nomic-embed-text:v1.5
WEAVIATE_API_KEY=your_api_key
WEAVIATE_HTTP_HOST=localhost
WEAVIATE_HTTP_PORT=8080
WEAVIATE_HTTP_SECURE=false

WEAVIATE_GRPC_HOST=localhost
WEAVIATE_GRPC_PORT=50051
WEAVIATE_GRPC_SECURE=false

WEAVIATE_TIMEOUT_INIT=30
WEAVIATE_TIMEOUT_QUERY=60
WEAVIATE_TIMEOUT_INSERT=120

OLLAMA_API_ENDPOINT=http://host.docker.internal:11434
OLLAMA_MODEL=nomic-embed-text

ENABLE_WEAVE=true
ENABLE_MLFLOW=true
MLFLOW_TRACKING_URL=http://localhost:5000
```
23 changes: 16 additions & 7 deletions docs/structsense_examples.md
@@ -1,9 +1,18 @@
# Examples
# Tutorials & Examples

- See the [example/](https://github.com/sensein/structsense/tree/main/example) directory for usage demonstrations and reference configs.
- See the `tutorial/` directory for usage demonstrations.
- See the `example/` directory for task-specific reference configs that can be used with `StructSense`.
- A configuration template is provided under `config_template/`.

## Example Use Cases
**For more information about StructSense use cases, see the [StructSense paper on arXiv](https://arxiv.org/html/2507.03674v2#S5).**
- Neuroscience Named Entity Extraction from text
- Resource (e.g., models, datasets) Extraction
- ReproSchema Extraction

## Blank Configuration Template

A starting template is provided in `config_template/`.
Note that the `config_template/` folder also contains configuration files for the `NER`, `Resource Extraction`, and `PDF2ReproSchema` tasks.

Before modifying, read:
- **Configuration Overview & Template**
- **Agents**
- **Tasks**
- **Embeddings & Knowledge**
- **Environment Variables (see `.env_example` from the `StructSense` repository)**
178 changes: 165 additions & 13 deletions docs/structsense_getting_started.md
@@ -7,6 +7,8 @@ pip install structsense
```
Alternatively, you can install the latest version of StructSense from the source code on GitHub:

**Note:** The latest updates are not pushed to PyPI, so for now it is recommended to install from GitHub.

```bash
git clone https://github.com/sensein/structsense.git
cd structsense
@@ -20,15 +22,17 @@ StructSense supports **Python >=3.10,<3.13**.

<!-- # Requirements -->
## Requirements

### PDF Extraction with Grobid

StructSense supports PDF extraction using **[Grobid](https://grobid.readthedocs.io/en/latest/Introduction/)** (default) or an external API service.

#### Default: Grobid
By default, StructSense uses Grobid for PDF extraction. You can install and run Grobid either with Docker or in a non-Docker setup.
StructSense uses Grobid for PDF extraction. You can install and run Grobid either with Docker or in a non-Docker setup.
We recommend using Docker for easier setup and dependency management.

##### Run Grobid with Docker

```bash
docker pull lfoppiano/grobid:0.8.0
docker run --init -p 8070:8070 -e JAVA_OPTS="-XX:+UseZGC" lfoppiano/grobid:0.8.0
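
# Optional sanity check (sketch): Grobid exposes an isalive endpoint
# that returns "true" once the service is ready.
curl http://localhost:8070/api/isalive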
@@ -58,31 +62,180 @@ In our default setup, Ollama is used for embedding generation. You can also use

<!--Running -->

## Running
## Using StructSense (CLI and Python)

### Command-line (CLI)

After installing (`pip install -e .`), the entry point is **`structsense-cli`**.

#### Full pipeline (extract)

Runs extraction → alignment → judge → optional human feedback and returns the final structured result.

```bash
structsense-cli extract \
--config path/to/config.yaml \
--source path/to/file.pdf \
--env_file .env \
--save_file result.json
```

| Option | Description |
|--------|-------------|
| `--config` | **(Required)** Path to YAML config (agent + task + embedder). |
| `--source` | **(Required)** Input: path to a PDF/text file, a folder, or a text string. |
| `--api_key` | OpenRouter (or other) API key; can also be set in `.env` as `OPENROUTER_API_KEY`. |
| `--env_file` | Path to `.env` (default: `.env` in current directory). |
| `--save_file` | Save the result JSON to this path. |
| `--enable_chunking` | Enable chunking for long documents (flag). |
| `--chunk_size` | Chunk size in characters (e.g. `2000`); used when chunking is enabled. |
| `--max_workers` | Max parallel workers for chunked extraction. |
| `--downstream_max_input_chars` | Max input length for alignment/judge (default 80000). |
| `--max_extraction_chunk_chars` | Cap per-chunk size for extraction (default 25000). |
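
As a sketch combining the long-document options from the table above (all values are illustrative):

```bash
structsense-cli extract \
  --config config.yaml \
  --source long_paper.pdf \
  --enable_chunking \
  --chunk_size 2000 \
  --max_workers 8 \
  --downstream_max_input_chars 80000 \
  --max_extraction_chunk_chars 25000
```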

**With OpenRouter (API key):**

### Using OpenRouter
```bash
structsense-cli extract \
--source somefile.pdf \
--api_key <YOUR_API_KEY> \
--api_key <YOUR_OPENROUTER_API_KEY> \
--config someconfig.yaml \
--env_file .env \
--save_file result.json # optional
--save_file result.json
```

### Using Ollama (Local)
**With Ollama (local, no API key):**

```bash
structsense-cli extract \
--source somefile.pdf \
--config someconfig.yaml \
--env_file .env_file \
--save_file result.json # optional
--env_file .env \
--save_file result.json
```

### Chunking
Disabled by default. Enable with:
**With chunking (recommended for long PDFs):**

```bash
--chunking True
structsense-cli extract \
--config config.yaml \
--source file.pdf \
--enable_chunking \
--chunk_size 2000 \
--save_file result.json
```

#### Single agent–task (run-agent)

Run one agent and one task only (e.g. extractor only), without the full pipeline:

```bash
structsense-cli run-agent \
--config path/to/config.yaml \
--agent_key extractor_agent \
--task_key extraction_task \
--source path/to/file.pdf \
--env_file .env \
--save_file result.json
```

Use the same chunking/worker options as `extract` when needed.
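
For example, a chunked extractor-only run might look like this (values are illustrative):

```bash
structsense-cli run-agent \
  --config config.yaml \
  --agent_key extractor_agent \
  --task_key extraction_task \
  --source long_paper.pdf \
  --enable_chunking \
  --chunk_size 2000 \
  --max_workers 8
```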


### Python (programmatic)

Use **StructSenseFlow** as the single entry point. Run the **full pipeline** with `information_extraction_task()`, or a **single agent** with `kickoff(agent_key, task_key)` or `extraction()`.

**API key when running via Python:** For OpenRouter (or other cloud LLMs), either pass `api_key="your-key"` to `StructSenseFlow(...)` or set `OPENROUTER_API_KEY` in a `.env` file and pass `env_file=".env"`. The key is injected into the agent LLM config so all agents use it. Get an OpenRouter key at [openrouter.ai/keys](https://openrouter.ai/keys). If you get `401 User not found`, the key is missing or invalid.
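
A minimal `.env` carrying the key might look like this (the file name is whatever you pass as `env_file`):

```bash
OPENROUTER_API_KEY=your_openrouter_key
```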

#### Full pipeline (recommended)

```python
import asyncio
from structsense.app import StructSenseFlow

# Config can be paths to YAML files or dicts
flow = StructSenseFlow(
    agent_config="path/to/config.yaml",
    task_config="path/to/config.yaml",
    embedder_config="path/to/config.yaml",
    input_source="path/to/file.pdf",  # or a text string, or path to .txt
    enable_chunking=True,
    chunk_size=2000,
    max_workers=8,
    env_file=".env",
    api_key=None,  # or set OPENROUTER_API_KEY in .env
)

# Run full pipeline: extraction → alignment → judge → human feedback (if enabled)
result = asyncio.run(flow.information_extraction_task())

# Result is a dict: entities, key_terms, resources, judged_terms, concept_mapping, etc.
print(result.get("task_type"), result.get("elapsed_time"))

# Save to file
import json
with open("result.json", "w") as f:
    json.dump(result, f, indent=2, default=str)
```

#### Single agent (one agent–task pair)

You can run **any** single agent–task pair with `kickoff(agent_key=..., task_key=...)`. For the extractor only, the convenience method is `extraction()`. For the **full pipeline** (extraction → alignment → judge → humanfeedback), use `information_extraction_task()`.

```python
import asyncio
from structsense.app import StructSenseFlow

flow = StructSenseFlow(
    agent_config="path/to/config.yaml",
    task_config="path/to/config.yaml",
    embedder_config="path/to/config.yaml",
    input_source="path/to/file.pdf",  # or source_text="raw text"
    enable_chunking=True,
    chunk_size=2000,
)

# Run only the extractor (convenience method)
result = asyncio.run(flow.extraction())

# Or run any specific agent–task pair
result = asyncio.run(flow.kickoff(
    agent_key="extractor_agent",
    task_key="extraction_task",
))
# Other pairs: alignment_agent/alignment_task, judge_agent/judge_task,
# humanfeedback_agent/humanfeedback_task
```

**Note:** Alignment, judge, and humanfeedback tasks are designed to receive **output from the previous stage** when run in the full pipeline. When you run them alone via `kickoff(...)`, they receive the raw `source_text` as input (useful for debugging or custom flows).

#### Passing config as dicts

```python
import asyncio
import yaml
from structsense.app import StructSenseFlow

with open("ner-config.yaml") as f:
    all_config = yaml.safe_load(f)

flow = StructSenseFlow(
    agent_config=all_config["agent_config"],
    task_config=all_config["task_config"],
    embedder_config=all_config.get("embedder_config", {}),
    input_source="path/to/file.pdf",  # or source_text="raw text"
    enable_chunking=True,
    chunk_size=2000,
    max_workers=8,
    env_file=".env",  # optional; loads OPENROUTER_API_KEY etc.
    api_key=None,  # or pass key here; injected into LLM config
)
result = asyncio.run(flow.information_extraction_task())

import json
with open("result.json", "w") as f:
    json.dump(result, f, indent=2, default=str)
```

<!-- Docker -->
@@ -91,9 +244,8 @@ Disabled by default. Enable with:
The `docker/` directory contains **Docker Compose** files for running the following components:

- **Grobid** – for PDF extraction
- **Weaviate** – In our StructSense architecture, Weaviate acts as the vector database responsible for storing the ontology, effectively serving as the Ontology database.

These Compose files allow you to quickly stand up a complete local **StructSense** stack.
- These Compose files allow you to quickly stand up a complete local **StructSense** stack.

If you prefer not to install dependencies system-wide, you can use the provided Docker Compose setup to run everything in **container mode**.
This makes it easy to isolate services and manage your environment with minimal setup.
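
Assuming the Compose files sit under `docker/`, bringing the stack up is typically:

```bash
cd docker
docker compose up -d   # starts Grobid, Weaviate, etc. in the background
```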