Lightweight gRPC microservice that extracts and summarizes financial risk factors from SEC-style corporate filings (example: 10-K). The service locates the risk section of a document, splits it into focused chunks, runs a fine-tuned LLM (via a remote HF Space) to identify risk categories and summaries, and returns structured results with provenance metadata.
- Status: Prototype
- Language: Python
- Extracts structured risk information from long financial documents (10-K, 10-Q style reports).
- Provides provenance and integrity metadata for compliance and auditability (audit id, lineage, integrity hash).
- Integrates local pre-processing (text extraction, clustering) with lightweight remote inference (HF Space / Gradio client).
- gRPC API: `ExtractRisks` RPC that accepts a simple query and returns categorized risk summaries.
- Document parsing: PDF text extraction via `PyMuPDF` and pattern matching for typical SEC item markers (Item 1A Risk Factors).
- Chunking: sentence-level clustering + lexical heuristics to form inference-friendly chunks.
- Remote inference: batch inference via a Hugging Face Space (using `gradio_client`) so the model can be hosted separately.
- Provenance: generation timestamps, audit id, data-lineage and a SHA-256 integrity hash are returned with results.
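The provenance fields listed above could be assembled along the following lines. This is a minimal sketch and not the code in `utils.py`: the helper names `build_provenance` and `verify_integrity`, the field names, and the choice to hash a canonical JSON serialization of the results are all assumptions made for illustration.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def _canonical_bytes(results):
    # Serialize deterministically so identical results always hash identically.
    return json.dumps(results, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_provenance(results, source_document, model_id):
    """Attach audit metadata to a list of extracted risk records (hypothetical helper)."""
    return {
        "audit_id": str(uuid.uuid4()),
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "data_lineage": {"source_document": source_document, "model": model_id},
        "integrity_hash": hashlib.sha256(_canonical_bytes(results)).hexdigest(),
    }


def verify_integrity(results, provenance):
    """Recompute the SHA-256 hash and compare it with the stored one."""
    recomputed = hashlib.sha256(_canonical_bytes(results)).hexdigest()
    return recomputed == provenance["integrity_hash"]
```

A consumer that receives the results and the provenance record can call `verify_integrity` to detect any modification made after generation.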
The model is fine-tuned from the base TinyLlama-1.1B model using PEFT/LoRA (r=16, targeting the q_proj, k_proj, v_proj, and o_proj modules) on the Gretel-AI synthetic financial risk analysis dataset [1] (1,034 samples total). Training uses 4-bit quantization (NF4 with double quantization) for efficiency, the AdamW optimizer, and early stopping (patience=2). The process runs on an A100 GPU, with 30 epochs (effective batch size of 32 via gradient accumulation). See the accompanying notebook (`finance_llm_risk_extractor_training_and_evaluation_notebook.ipynb`) for the full code, including data splits (80/20 train/val plus a held-out evaluation set) and tokenization handling (max length=2048).
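The hyperparameters above map onto keyword arguments of `peft.LoraConfig`, `transformers.BitsAndBytesConfig`, `transformers.TrainingArguments`, and `transformers.EarlyStoppingCallback`. The sketch below keeps them as plain dictionaries so it can be read and run without GPU libraries; the notebook remains the authoritative source. The 4 × 8 split between per-device batch size and accumulation steps is an assumption (only the product, 32, is stated), and values that are not stated, such as LoRA alpha and dropout, are deliberately left out.

```python
# Would be passed as peft.LoraConfig(**LORA_KWARGS).
LORA_KWARGS = {
    "r": 16,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumed: TinyLlama is a causal language model
}

# Would be passed as transformers.BitsAndBytesConfig(**QUANT_KWARGS).
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# Would be passed as transformers.TrainingArguments(output_dir=..., **TRAINING_KWARGS).
TRAINING_KWARGS = {
    "num_train_epochs": 30,
    "per_device_train_batch_size": 4,  # assumption: only the product below is stated
    "gradient_accumulation_steps": 8,  # assumption: only the product below is stated
}

# Would be passed as transformers.EarlyStoppingCallback(**EARLY_STOPPING_KWARGS).
EARLY_STOPPING_KWARGS = {"early_stopping_patience": 2}

MAX_SEQ_LENGTH = 2048  # tokenizer truncation length

# Effective batch size = per-device batch size x gradient accumulation steps.
effective_batch_size = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)
```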
The fine-tuned model was evaluated against the base TinyLlama-1.1B model using 207 held-out examples from the source dataset [1]. These examples were not used during fine-tuning and were reserved exclusively for evaluation. A known limitation is that the training dataset is synthetic: it was generated by fine-tuning Phi-3-mini-128k-instruct on 14,306 real SEC filings (2023–2024) with differential privacy (ε=8, δ=1.2e-06). However, Gretel-AI designed it specifically "for training models to extract key risk factors and generate structured summaries from financial documents", which makes it a viable training dataset for the task at hand.
All experiments were conducted on an A100 GPU to ensure consistent performance measurements.
- Total execution time for base model: 11 minutes and 6 seconds
- Total execution time for fine-tuned model: 14 minutes and 52 seconds
Prompt: Extract financial risks. Output as [CATEGORIES] | SUMMARY output:
Example input from evaluation dataset: "Item 8.01. Other Events.
On November 22, 2022, the Company announced that its Board of Directors authorized the repurchase of up to $1 billion of the Company's outstanding common shares, which includes the $1.1 billion remaining authorization amount under the Company's prior repurchase program. The new repurchase authorization does not have a specific expiration date. This decision reflects the Company's ongoing commitment to optimizing its capital structure and delivering value to shareholders....."
Example corresponding expected output from the evaluation dataset (Ground Truth, GT): ['LIQUIDITY', 'DEBT'] | $1B share repurchase authorization may lead to increased debt exposure
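Outputs in the `[CATEGORIES] | SUMMARY` shape shown above can be split into a category list and a summary string. The following is a minimal sketch, assuming the category part is always a Python-style list literal as in the example; the function name `parse_model_output` is hypothetical and is not taken from the repository.

```python
import ast
import re

# A bracketed list, a pipe separator, then free-text summary.
OUTPUT_PATTERN = re.compile(r"^\s*(\[.*?\])\s*\|\s*(.+?)\s*$", re.DOTALL)


def parse_model_output(text):
    """Return (categories, summary), or None if the text does not follow the format."""
    match = OUTPUT_PATTERN.match(text)
    if match is None:
        return None
    try:
        # literal_eval parses the list safely without executing arbitrary code.
        categories = ast.literal_eval(match.group(1))
    except (ValueError, SyntaxError):
        return None
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        return None
    return [c.strip().upper() for c in categories], match.group(2)
```

Returning `None` for malformed text makes it straightforward to count format violations separately from extraction errors.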
Multiple complementary metrics were used to capture different aspects of model behaviour. The statistical tests are exploratory; no multiple-comparison correction was applied because the metrics are not independent.
- Cosine similarity (embedding-based): Used to quantify the degree of transformation between the input and the generated output. Lower similarity indicates greater abstraction, though this metric alone does not capture semantic correctness or completeness.
- Ground-truth similarity: Generated outputs were compared against synthetic financial risk analysis text at both summary and full-output (both the risk categories and the summary) levels.
- Structured category extraction metrics: Precision, recall, F1 (micro), Jaccard similarity, and exact match rate were computed for extracted risk categories.
- Statistical significance testing: Paired t-tests and Wilcoxon signed-rank tests were used, alongside effect sizes, to assess whether observed differences were unlikely to be due to random variation.
- Inference efficiency: Latency, throughput, and GPU memory usage were measured to characterise deployment-relevant trade-offs.
The base model exhibited very high similarity to the input text, indicating a strong tendency toward reproduction:
- Summary-level: ≈ 0.92
- Full-output: ≈ 0.92
The fine-tuned model produced substantially lower input–output similarity:
- Summary-level: 0.55 (mean = 0.546)
- Full-output: 0.60 (mean = 0.595)
These values are close to the GT transformation baseline:
- GT input → summary: 0.53
- GT input → full output: 0.55
This suggests that the fine-tuned model more frequently departs from surface-level rewriting and operates closer to GT abstraction. Variance increased relative to the base model, reflecting less uniform transformation behaviour.
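For reference, the similarity scores above are cosine similarities between embedding vectors. The sketch below shows the computation on toy two-dimensional vectors; the embedding model is not named in this section, so the function takes precomputed vectors and the example values are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors: a near-copy of the input scores close to 1.0,
# while a more abstract rewrite points in a different direction.
input_vec = [1.0, 0.0]
copy_vec = [0.99, 0.05]
abstract_vec = [1.0, 1.0]

copy_score = cosine_similarity(input_vec, copy_vec)
abstract_score = cosine_similarity(input_vec, abstract_vec)
```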
Differences between the base and fine-tuned models were statistically significant:
- Summary-level:
  - t-test p = 0.0167, small effect size (Cohen's dz ≈ 0.20)
  - Wilcoxon p < 0.001, small effect size (Wilcoxon r ≈ 0.23)
- Full-output:
  - t-test p < 1e-7, moderate effect size (Cohen's dz ≈ 0.45)
  - Wilcoxon p < 1e-10, moderate effect size (Wilcoxon r ≈ 0.45)
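Cohen's dz for paired samples is the mean of the per-example differences divided by their standard deviation, and the paired t statistic is dz multiplied by the square root of the sample size. The sketch below computes both with the standard library on made-up scores; the function name `paired_effect` is hypothetical. P-values, as well as the Wilcoxon test, would typically come from `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`, which are not used here to keep the example dependency-free.

```python
import math
import statistics


def paired_effect(base_scores, tuned_scores):
    """Cohen's dz and paired t statistic for two aligned score lists."""
    if len(base_scores) != len(tuned_scores) or len(base_scores) < 2:
        raise ValueError("need two aligned lists with at least two items")
    diffs = [t - b for b, t in zip(base_scores, tuned_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; dz is undefined")
    dz = mean_diff / sd_diff
    return {
        "mean_diff": mean_diff,
        "cohens_dz": dz,
        "t_statistic": dz * math.sqrt(len(diffs)),
        "degrees_of_freedom": len(diffs) - 1,
    }


# Illustrative values only, not scores from the evaluation.
effect = paired_effect([0.50, 0.55, 0.60, 0.52], [0.58, 0.60, 0.61, 0.60])
```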
When evaluated against ground-truth outputs:
- Summary-level similarity
  - Base model: 0.54
  - Fine-tuned model: 0.58
  - Mean improvement: +0.04
- Full-output similarity
  - Base model: 0.56
  - Fine-tuned model: 0.64
  - Mean improvement: +0.08
Improvements were more pronounced for full outputs and were statistically significant. Increased variance for the fine-tuned model indicates greater diversity in generated structure and content.
Substantial improvements were observed in structured risk category extraction:
| Metric | Base Model | Fine-Tuned Model |
|---|---|---|
| Exact Match Rate | ~6.3% | ~35.3% |
| Precision (Micro) | 0.00 | 0.65 |
| Recall (Micro) | 0.00 | 0.66 |
| F1 Score (Micro) | 0.00 | 0.65 |
| Jaccard Similarity | ~0.06 | ~0.55 |
The base model produced no true positives across the evaluation set because it generated unstructured text rather than valid category labels. In comparison, the fine-tuned model generated meaningful structured outputs in the majority of cases. Partial matches and omissions remain common, as reflected in the gap between exact match rate and micro-averaged scores.
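The category metrics in the table can be computed from per-example sets of predicted and expected labels. The following is a minimal sketch under stated assumptions: the function name `category_metrics` is hypothetical, labels other than `LIQUIDITY` and `DEBT` are made up for the example, and treating two empty label sets as an exact match with Jaccard 1.0 is a convention chosen here, which may differ from the notebook.

```python
def category_metrics(predicted, expected):
    """Micro P/R/F1, mean Jaccard, and exact match rate over aligned label lists."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two aligned, non-empty lists of label lists")
    tp = fp = fn = exact = 0
    jaccard_total = 0.0
    for pred, gold in zip(predicted, expected):
        p, g = set(pred), set(gold)
        tp += len(p & g)   # labels predicted and expected
        fp += len(p - g)   # labels predicted but not expected
        fn += len(g - p)   # labels expected but missed
        exact += int(p == g)
        union = p | g
        jaccard_total += len(p & g) / len(union) if union else 1.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n = len(expected)
    return {
        "precision_micro": precision,
        "recall_micro": recall,
        "f1_micro": f1,
        "jaccard": jaccard_total / n,
        "exact_match_rate": exact / n,
    }


scores = category_metrics(
    predicted=[["LIQUIDITY", "DEBT"], ["MARKET"], []],
    expected=[["LIQUIDITY", "DEBT"], ["MARKET", "LEGAL"], []],
)
```

Under this convention, an example where both sets are empty counts as an exact match without contributing a true positive, so exact match rate and micro-averaged precision can diverge.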
Prompt compliance was evaluated explicitly:
- Base model: Failed to follow the required structured format in all 207 cases.
- Fine-tuned model: Prompt compliance errors occurred in 9 out of 207 cases (~4.3%). All 9 failures were associated with context window limitations, rather than systematic instruction misinterpretation.
Fine-tuning substantially reduced format-related failures, though it did not eliminate them entirely.
ROUGE scores were higher for the fine-tuned model under beam-based decoding:
- ROUGE-1: 0.24 (vs. 0.07)
- ROUGE-2: 0.12 (vs. 0.03)
- ROUGE-L: 0.20 (vs. 0.06)
However, ROUGE primarily measures lexical overlap and does not directly reflect semantic correctness, abstraction quality, or structural validity. It is therefore treated as a secondary analysis metric.
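To make the lexical nature of ROUGE concrete, here is a simplified ROUGE-1 F1 computed from unigram counts. It is a sketch only: it uses lowercase whitespace tokenization with no stemming, so its values will not necessarily match those of a full ROUGE implementation, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "share repurchase may increase debt",
    "share repurchase authorization may lead to increased debt exposure",
)
```

In this example "increase" and "increased" do not count as a match, which illustrates why a paraphrase with the right meaning can still score low.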
Fine-tuning introduced expected performance trade-offs:
- Average latency
  - Base model: ~14.1 s
  - Fine-tuned model: ~18.6 s
- Throughput
  - Base model: ~0.71 samples/s
  - Fine-tuned model: ~0.54 samples/s
- GPU memory usage
  - No measurable increase during inference for either model
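Latency and throughput figures of this kind can be collected with a small timing harness. The sketch below is one possible way to do it and is not the notebook's measurement code: `measure_inference` and `fake_generate` are hypothetical names, and the stand-in generator only sleeps briefly instead of running a model.

```python
import time


def measure_inference(generate_fn, batches):
    """Time generate_fn over batches; report per-batch latency and overall throughput."""
    if not batches:
        raise ValueError("need at least one batch")
    latencies = []
    n_samples = 0
    start = time.perf_counter()
    for batch in batches:
        t0 = time.perf_counter()
        generate_fn(batch)
        latencies.append(time.perf_counter() - t0)
        n_samples += len(batch)
    total = time.perf_counter() - start
    return {
        "n_samples": n_samples,
        "avg_latency_s": sum(latencies) / len(latencies),
        "throughput_samples_per_s": n_samples / total if total > 0 else float("inf"),
        "total_s": total,
    }


def fake_generate(batch):
    # Stand-in for a model call (assumption): sleeps instead of generating text.
    time.sleep(0.001)
    return ["" for _ in batch]


stats = measure_inference(fake_generate, [["a", "b"], ["c"]])
```

Because latency is recorded per call while throughput is counted per sample, the two figures are not simple reciprocals of each other when a call processes more than one sample.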
Across the held-out evaluation set, fine-tuning was associated with:
- Reduced input–output similarity, closely matching GT transformation levels
- Statistically significant differences in abstraction behaviour relative to the base model
- Meaningful improvements in structured risk category extraction
- A large reduction in prompt-format violations
- Moderate trade-offs in inference latency and throughput
Overall, the fine-tuned model is better aligned with the intended abstraction and extraction task than the base model under beam-based decoding, while still exhibiting limitations in consistency, completeness, and efficiency.
- `server.py`: gRPC server implementation. Binds to `:50051` by default and serves the `RiskExtractor` RPC.
- `client.py`: example client showing how to call the service.
- `utils.py`: core extraction, chunking, inference and provenance utilities.
- `proto/risk_extractor.proto`: canonical service and message definitions (used to generate the `risk_extractor_pb2*.py` files).
- `LLM_Model_Test_v8_gretel/`: local artifacts (adapter, tokenizer, optimizer) for a fine-tuned TinyLlama variant (not required for the remote Space-based inference approach).
- `requirements.txt`: pinned Python dependencies for development.
Prerequisites
- Python 3.10+ (3.11 recommended)
- `git` and a working internet connection to install packages
- Optional: GPU + drivers if you plan to run local model inference
- Clone and enter the repository

      git clone <repo-url>
      cd risk_extractor_microservice

- Create and activate a virtual environment (Windows PowerShell shown; on macOS/Linux, activate with `source .venv/bin/activate` instead)

      python -m venv .venv; .\.venv\Scripts\Activate.ps1

- Install dependencies

      pip install -r requirements.txt

  Note: `requirements.txt` includes many NLP and ML packages. If you only need to run the gRPC server and call a remote HF Space for inference, you can try installing a reduced set focused on gRPC, `gradio_client`, and `pymupdf` first.

- (Optional) Generate Python gRPC bindings from `.proto`

  If you modify `proto/risk_extractor.proto`, regenerate the Python stubs with:

      python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/risk_extractor.proto

- Run the server

      python server.py

  The server listens on `localhost:50051` by default.

- Run the example client

      python client.py

  The `client.py` script demonstrates creating an `ExtractRisksRequest` and printing a JSON-like result. Edit the `query` and `space_url` fields in `client.py` to target a different document or inference space.
Usage Example (from client.py)

    import time

    import risk_extractor_pb2  # generated from proto/risk_extractor.proto

    # Build the request message sent to the gRPC service.
    request = risk_extractor_pb2.ExtractRisksRequest(
        query="What are the risks in the report from 2022?",
        user_id="user123",
        query_timestamp=int(time.time()),
        space_url="mcnamacl/tinyllama-inference"
    )

- Document extraction uses `PyMuPDF` to read PDFs and then looks for common SEC markers (Item 1A). The heuristics are intentionally simple; adapt them if your document formats differ.
- Chunking uses `sentence-transformers` to embed sentences, then clusters them with `KMeans` to select candidate "risky" sentences and pack them into token-limited chunks.
- Inference is performed by calling a remote Hugging Face Space via `gradio_client.Client`. The repo expects the Space to accept a list of prompts at the `/infer` api_name and return a list of outputs in a simple format: `[CATEGORIES] | SUMMARY`.
- The code includes utilities for loading a quantized TinyLlama base model plus a PEFT adapter; those heavy tasks require GPU and are optional if you use remote inference.
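The section-location and chunk-packing steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the logic in `utils.py`: the function names, the regular expressions, the "longest candidate wins" rule for skipping table-of-contents entries, and the whitespace token count with a limit of 256 are all hypothetical choices. The embedding and `KMeans` selection step is omitted.

```python
import re

ITEM_1A = re.compile(r"item\s+1a\.?\s*[:\-]?\s*risk\s+factors", re.IGNORECASE)
NEXT_ITEM = re.compile(r"item\s+(1b|2)\b", re.IGNORECASE)


def extract_risk_section(text):
    """Return the longest span between an 'Item 1A Risk Factors' marker and the next item."""
    best = ""
    for match in ITEM_1A.finditer(text):
        end_match = NEXT_ITEM.search(text, match.end())
        end = end_match.start() if end_match else len(text)
        candidate = text[match.end():end].strip()
        # A table-of-contents hit yields a very short span, so keep the longest one.
        if len(candidate) > len(best):
            best = candidate
    return best


def pack_sentences(sentences, max_tokens=256):
    """Greedily pack sentences into chunks of at most max_tokens whitespace tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # rough proxy for the model's token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single over-long sentence still becomes its own chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In a real pipeline the token count would come from the model's tokenizer rather than from whitespace splitting.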
- Uploaded documents and queries may be sent to a remote HF Space depending on configuration. Do not send sensitive or regulated data to an external space unless you control it and understand the data handling policies.
- File an issue in this repository's issue tracker for bugs or feature requests.
- For questions about the `.proto` service, see `proto/risk_extractor.proto`.
- For contribution guidance, see `CONTRIBUTING.md` (if present).
- Maintainer: repository owner (see repo metadata).
- Contributions are welcome. Please open issues or pull requests.
- Keep PRs small, include tests where applicable, and describe the motivation and design in the PR description.
- See `CONTRIBUTING.md` for detailed contribution guidelines.
- When experimenting with the model locally, prefer a controlled virtualenv and install optional heavy packages only when needed.
- To speed up prototyping, stub out `batch_generate` in `utils.py` to return canned outputs instead of calling a Space.
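A stub of this kind might look like the sketch below. The real signature of `batch_generate` is not shown in this README, so the parameters here (`prompts` and an ignored `space_url`) are assumptions; adjust them to match the function in `utils.py`.

```python
CANNED_OUTPUT = "['LIQUIDITY', 'DEBT'] | Canned summary for offline prototyping"


def batch_generate(prompts, space_url=None):
    """Offline stand-in: return one canned '[CATEGORIES] | SUMMARY' string per prompt."""
    # space_url is accepted and ignored so existing call sites keep working (assumed parameter).
    return [CANNED_OUTPUT for _ in prompts]


outputs = batch_generate(["chunk one", "chunk two"], space_url="unused")
```

Because the canned string follows the same output format as the Space, downstream parsing and provenance code can be exercised without network access.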
See the project LICENSE file for license details.
- Built with `PyMuPDF`, `sentence-transformers`, `transformers`, `gradio_client` and other community libraries. See `requirements.txt` for full dependency details.
- Run the server and try `client.py` against a controlled HF Space.
- Add `CONTRIBUTING.md` and `docs/` materials if you want to grow the project documentation.
Gen AI Usage Disclosure
This README and the code scaffolding for this repo were created in part using VSCode Copilot and GPT-5. All content has been manually reviewed, modified as necessary, and validated by the author.