Risk Extractor Microservice

Lightweight gRPC microservice that extracts and summarizes financial risk factors from SEC-style corporate filings (example: 10-K). The service locates the risk section of a document, splits it into focused chunks, runs a fine-tuned LLM (via a remote HF Space) to identify risk categories and summaries, and returns structured results with provenance metadata.

Badges

  • Status: Prototype
  • Language: Python

Why this project exists

  • Extract structured risk information from long financial documents (10-K, 10-Q style reports) via a small, self-contained microservice.
  • Provide provenance and integrity metadata for compliance and auditability (audit id, lineage, integrity hash).
  • Combine local pre-processing (text extraction, chunking, clustering) with lightweight remote inference (HF Space / Gradio client).

Key Features

  • gRPC API: ExtractRisks RPC that accepts a simple query and returns categorized risk summaries.
  • Document parsing: PDF text extraction via PyMuPDF and pattern matching for typical SEC item markers (Item 1A Risk Factors).
  • Chunking: sentence-level clustering + lexical heuristics to form inference-friendly chunks.
  • Remote inference: batch inference via a Hugging Face Space (using gradio_client) so the model can be hosted separately.
  • Provenance: generation timestamps, an audit id, data lineage, and a SHA-256 integrity hash are returned with results (a minimal sketch follows this list).
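These provenance fields are straightforward to produce. The snippet below is a minimal, illustrative sketch (the helper name and field names are hypothetical, not the exact code in utils.py) of how a timestamp, audit id, lineage record, and SHA-256 integrity hash can be attached to a result.

import hashlib
import json
import time
import uuid

def attach_provenance(result: dict, source_document: str) -> dict:
    # Illustrative helper: add audit/lineage metadata to an extraction result.
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return {
        **result,
        "generated_at": int(time.time()),                       # generation timestamp
        "audit_id": str(uuid.uuid4()),                          # unique id for this run
        "data_lineage": {"source_document": source_document},   # where the text came from
        "integrity_hash": hashlib.sha256(payload).hexdigest(),  # SHA-256 over the result payload
    }

A consumer can verify integrity by re-serialising the result (minus the provenance fields) the same way and comparing digests.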

Fine-tuning

The model is fine-tuned from the base TinyLlama-1.1B model using PEFT/LoRA (r=16, targeting q/k/v/o_proj modules) on the Gretel-AI synthetic financial risk analysis dataset¹ (1,034 samples total). Training uses 4-bit quantization (NF4 with double-quant) for efficiency, the AdamW optimizer, and early stopping (patience=2). The process runs on an A100 GPU for 30 epochs (effective batch size=32 via gradient accumulation). See the accompanying notebook (finance_llm_risk_extractor_training_and_evaluation_notebook.ipynb) for the full code, including data splits (80/20 train/val + held-out eval) and tokenization handling (max length=2048).
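For reference, the snippet below sketches the quantization and adapter configuration described above. It is a simplified reconstruction rather than the notebook code: r=16, the q/k/v/o_proj target modules, and NF4 double quantization follow the description, while the base checkpoint name, lora_alpha, lora_dropout, and compute dtype are illustrative assumptions (see the notebook for the canonical values).

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",              # NF4
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: compute dtype not stated above
)

base_model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # assumption: exact base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                          # assumption
    lora_dropout=0.05,                      # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)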

Model Evaluation

Evaluation Setup

The fine-tuned model was evaluated against the base TinyLlama-1.1B model using 207 held-out examples from the source dataset¹. These examples were not used during fine-tuning and were reserved exclusively for evaluation. A known limitation is that the training dataset is synthetic: it was generated by fine-tuning Phi-3-mini-128k-instruct on 14,306 real SEC filings (2023–2024) with differential privacy (ε = 8, δ = 1.2e-06). However, Gretel-AI designed it specifically "for training models to extract key risk factors and generate structured summaries from financial documents", which makes it a viable training dataset for the task at hand.

All experiments were conducted on an A100 GPU to ensure consistent performance measurements.

Total execution time for the base model: 11 minutes and 6 seconds.
Total execution time for the fine-tuned model: 14 minutes and 52 seconds.

Prompt: Extract financial risks. Output as [CATEGORIES] | SUMMARY output:

Example input from evaluation dataset: "Item 8.01. Other Events.

On November 22, 2022, the Company announced that its Board of Directors authorized the repurchase of up to $1 billion of the Company's outstanding common shares, which includes the $1.1 billion remaining authorization amount under the Company's prior repurchase program. The new repurchase authorization does not have a specific expiration date. This decision reflects the Company's ongoing commitment to optimizing its capital structure and delivering value to shareholders....."

Example corresponding expected output from the evaluation dataset (Ground Truth, GT): ['LIQUIDITY', 'DEBT'] | $1B share repurchase authorization may lead to increased debt exposure

Metrics and Methodology

Multiple complementary metrics were used to capture different aspects of model behaviour. The statistical tests are exploratory; no multiple-comparison correction was applied because the metrics are not independent.

  • Cosine similarity (embedding-based)
    Used to quantify the degree of transformation between the input and the generated output.
    Lower similarity indicates greater abstraction, though this metric alone does not capture semantic correctness or completeness.

  • Ground-truth similarity
    Generated outputs were compared against the ground-truth synthetic risk analysis text at both the summary level and the full-output level (risk categories plus summary).

  • Structured category extraction metrics
    Precision, recall, F1 (micro), Jaccard similarity, and exact match rate were computed for extracted risk categories.

  • Statistical significance testing
    Paired t-tests and Wilcoxon signed-rank tests were used, alongside effect sizes, to assess whether observed differences were unlikely to be due to random variation.

  • Inference efficiency
    Latency, throughput, and GPU memory usage were measured to characterise deployment-relevant trade-offs.


Transformation Quality

The base model exhibited very high similarity to the input text, indicating a strong tendency toward reproduction:

  • Summary-level: ≈ 0.92
  • Full-output: ≈ 0.92

The fine-tuned model produced substantially lower input–output similarity:

  • Summary-level: 0.55 (mean = 0.546)
  • Full-output: 0.60 (mean = 0.595)

These values are close to the GT transformation baseline:

  • GT input → summary: 0.53
  • GT input → full output: 0.55

This suggests that the fine-tuned model more frequently departs from surface-level rewriting and operates closer to GT abstraction. Variance increased relative to the base model, reflecting less uniform transformation behaviour.

Differences between the base and fine-tuned models were statistically significant:

  • Summary-level:
    • t-test p = 0.0167, small effect size (Cohen's dz ≈ 0.20)
    • Wilcoxon p < 0.001, small effect size (Wilcoxon r ≈ 0.23)
  • Full-output:
    • t-test p < 1e-7, moderate effect size (Cohen's dz ≈ 0.45)
    • Wilcoxon p < 1e-10, moderate effect size (Wilcoxon r ≈ 0.45)
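As an illustration of this methodology (not the exact evaluation code), the sketch below computes per-example input–output similarities with sentence-transformers and applies the paired tests and effect size used above; the embedding model name is an assumption, and the inputs/outputs are aligned lists of strings, one per evaluation example.

from scipy.stats import ttest_rel, wilcoxon
from sentence_transformers import SentenceTransformer

def paired_similarity_tests(inputs, base_outputs, finetuned_outputs):
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: embedding model not specified above

    def sims(a, b):
        ea = embedder.encode(a, normalize_embeddings=True)
        eb = embedder.encode(b, normalize_embeddings=True)
        return (ea * eb).sum(axis=1)  # cosine similarity of paired rows

    base_sims = sims(inputs, base_outputs)       # input vs base-model output
    ft_sims = sims(inputs, finetuned_outputs)    # input vs fine-tuned output
    diff = ft_sims - base_sims
    _, t_p = ttest_rel(ft_sims, base_sims)       # paired t-test
    _, w_p = wilcoxon(ft_sims, base_sims)        # Wilcoxon signed-rank test
    dz = diff.mean() / diff.std(ddof=1)          # Cohen's dz for paired samples
    return {"t_p": t_p, "wilcoxon_p": w_p, "cohens_dz": dz}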

Output Quality vs Ground Truth

When evaluated against ground-truth outputs:

  • Summary-level similarity

    • Base model: 0.54
    • Fine-tuned model: 0.58
    • Mean improvement: +0.04
  • Full-output similarity

    • Base model: 0.56
    • Fine-tuned model: 0.64
    • Mean improvement: +0.08

Improvements were more pronounced for full outputs and were statistically significant. Increased variance for the fine-tuned model indicates greater diversity in generated structure and content.


Structured Risk Category Extraction

Substantial improvements were observed in structured risk category extraction:

Metric               Base Model   Fine-Tuned Model
Exact Match Rate     ~6.3%        ~35.3%
Precision (Micro)    0.00         0.65
Recall (Micro)       0.00         0.66
F1 Score (Micro)     0.00         0.65
Jaccard Similarity   ~0.06        ~0.55

The base model produced no true positives across the evaluation set: it emitted unstructured text rather than valid category labels, so it scores zero under the structured extraction metrics. In comparison, the fine-tuned model generated meaningful structured outputs in the majority of cases. Partial matches and omissions remain common, as reflected in the gap between the exact match rate and the micro-averaged scores.
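These set-based metrics are simple to reproduce. The sketch below (dependency-free, mirroring the methodology description rather than the exact evaluation code) computes micro-averaged precision/recall/F1, mean Jaccard similarity, and the exact match rate from aligned lists of predicted and ground-truth category sets.

def category_metrics(predicted: list[set[str]], truth: list[set[str]]) -> dict:
    tp = fp = fn = 0
    jaccard_total = 0.0
    exact = 0
    for pred, gt in zip(predicted, truth):
        tp += len(pred & gt)
        fp += len(pred - gt)
        fn += len(gt - pred)
        union = pred | gt
        jaccard_total += len(pred & gt) / len(union) if union else 1.0
        exact += int(pred == gt)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    n = len(truth)
    return {
        "precision_micro": precision,
        "recall_micro": recall,
        "f1_micro": f1,
        "jaccard_mean": jaccard_total / n,
        "exact_match_rate": exact / n,
    }

For example, category_metrics([{"LIQUIDITY", "DEBT"}], [{"LIQUIDITY"}]) yields precision 0.5, recall 1.0, and Jaccard 0.5.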


Prompt Compliance and Reliability

Prompt compliance was evaluated explicitly:

  • Base model:
    Failed to follow the required structured format in all 207 cases.

  • Fine-tuned model:
    Prompt compliance errors occurred in 9 out of 207 cases (~4.3%).
    All 9 failures were associated with context window limitations, rather than systematic instruction misinterpretation.

Fine-tuning substantially reduced format-related failures, though it did not eliminate them entirely.


ROUGE Scores

ROUGE scores were higher for the fine-tuned model under beam-based decoding:

  • ROUGE-1: 0.24 (vs. 0.07)
  • ROUGE-2: 0.12 (vs. 0.03)
  • ROUGE-L: 0.20 (vs. 0.06)

However, ROUGE primarily measures lexical overlap and does not directly reflect semantic correctness, abstraction quality, or structural validity. It is therefore treated as a secondary analysis metric.
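For reproducibility, these overlap scores can be computed with the rouge-score package; the snippet below is illustrative and may not match the notebook's exact configuration (e.g. stemming or aggregation), and the model output shown is hypothetical.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "['LIQUIDITY', 'DEBT'] | $1B share repurchase authorization may lead to increased debt exposure",  # ground truth
    "['DEBT'] | Share repurchase program may increase leverage",                                        # model output (hypothetical)
)
print({name: s.fmeasure for name, s in scores.items()})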


Inference Efficiency

Fine-tuning introduced expected performance trade-offs:

  • Average latency

    • Base model: ~14.1 s
    • Fine-tuned model: ~18.6 s
  • Throughput

    • Base model: ~0.71 samples/s
    • Fine-tuned model: ~0.54 samples/s
  • GPU memory usage

    • No measurable increase during inference for either model
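A minimal sketch of how such measurements can be taken on a single GPU follows; it is illustrative rather than the actual benchmarking code, and generate_fn stands in for whichever generation callable is being timed.

import time
import torch

def benchmark(generate_fn, prompts):
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    outputs = [generate_fn(p) for p in prompts]
    elapsed = time.perf_counter() - start
    stats = {
        "avg_latency_s": elapsed / len(prompts),             # average seconds per example
        "throughput_samples_per_s": len(prompts) / elapsed,  # examples per second
    }
    if torch.cuda.is_available():
        stats["peak_gpu_mem_gb"] = torch.cuda.max_memory_allocated() / 1e9
    return outputs, stats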

Summary

Across the held-out evaluation set, fine-tuning was associated with:

  • Reduced input–output similarity, closely matching GT transformation levels
  • Statistically significant differences in abstraction behaviour relative to the base model
  • Meaningful improvements in structured risk category extraction
  • A large reduction in prompt-format violations
  • Moderate trade-offs in inference latency and throughput

Overall, the fine-tuned model is better aligned with the intended abstraction and extraction task than the base model under beam-based decoding, while still exhibiting limitations in consistency, completeness, and efficiency.

Repository Layout

  • server.py — gRPC server implementation. Binds to :50051 by default and serves the RiskExtractor RPC.
  • client.py — example client showing how to call the service.
  • utils.py — core extraction, chunking, inference and provenance utilities.
  • proto/risk_extractor.proto — canonical service and message definitions (used to generate the risk_extractor_pb2*.py files).
  • LLM_Model_Test_v8_gretel/ — local artifacts (adapter, tokenizer, optimizer) for a fine-tuned TinyLlama variant (not required for the remote Space-based inference approach).
  • requirements.txt — pinned Python dependencies for development.

Quickstart (developer)

Prerequisites

  • Python 3.10+ (3.11 recommended)
  • git and a working internet connection to install packages
  • Optional: GPU + drivers if you plan to run local model inference
  1. Clone and enter the repository
git clone <repo-url>
cd risk_extractor_microservice
  2. Create and activate a virtual environment (Windows PowerShell shown; on macOS/Linux use python -m venv .venv && source .venv/bin/activate)
python -m venv .venv; .\.venv\Scripts\Activate.ps1
  3. Install dependencies
pip install -r requirements.txt

Note: requirements.txt includes many NLP and ML packages. If you only need to run the gRPC server and call a remote HF Space for inference, you can try installing a reduced set focused on gRPC, gradio_client, and pymupdf first.

  4. (Optional) Generate Python gRPC bindings from .proto

If you modify proto/risk_extractor.proto, regenerate the Python stubs with:

python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/risk_extractor.proto
  5. Run the server
python server.py

The server listens on localhost:50051 by default.

  6. Run the example client
python client.py

The client.py script demonstrates creating an ExtractRisksRequest and printing a JSON-like result. Edit the query and space_url fields in client.py to target a different document or inference space.

Usage Example (from client.py)

import time
import risk_extractor_pb2  # generated from proto/risk_extractor.proto

request = risk_extractor_pb2.ExtractRisksRequest(
    query="What are the risks in the report from 2022?",
    user_id="user123",
    query_timestamp=int(time.time()),
    space_url="mcnamacl/tinyllama-inference",
)
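To actually send the request, open a channel to the server and call the generated stub. The sketch below assumes the stub class generated from proto/risk_extractor.proto is named RiskExtractorStub in risk_extractor_pb2_grpc; check the generated files (or client.py) if your names differ.

import grpc
import risk_extractor_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    stub = risk_extractor_pb2_grpc.RiskExtractorStub(channel)  # assumption: stub class name
    response = stub.ExtractRisks(request)                      # RPC defined in the .proto
    print(response)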

Implementation Notes

  • Document extraction uses PyMuPDF to read PDFs and then looks for common SEC markers (Item 1A). The heuristics are intentionally simple — adapt them if your document formats differ.
  • Chunking uses sentence-transformers to embed sentences, then clusters them with KMeans to select candidate "risky" sentences and pack them into token-limited chunks (a simplified sketch follows these notes).
  • Inference is performed by calling a remote Hugging Face Space via gradio_client.Client. The repo expects the Space to accept a list of prompts at the /infer api_name and return a list of outputs in a simple format: [CATEGORIES] | SUMMARY.
  • The code includes utilities for loading a quantized TinyLlama base model plus a PEFT adapter; those heavy tasks require GPU and are optional if you use remote inference.
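The chunking step is sketched below in simplified form. It is not the actual utils.py implementation: the embedding model, cluster count, keyword lexicon, and character budget are illustrative assumptions, and the real code packs chunks by token count rather than characters.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

RISK_TERMS = ("risk", "adverse", "uncertain", "litigation", "regulatory")  # illustrative lexicon

def make_chunks(sentences, n_clusters=8, max_chars=2000):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: embedding model
    embeddings = model.encode(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    # Lexical heuristic: keep clusters whose sentences mention risk-related vocabulary.
    selected = []
    for c in range(n_clusters):
        members = [s for s, lab in zip(sentences, labels) if lab == c]
        if any(term in s.lower() for s in members for term in RISK_TERMS):
            selected.extend(members)

    # Pack selected sentences into size-limited chunks.
    chunks, current = [], ""
    for s in selected:
        if current and len(current) + len(s) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += " " + s
    if current.strip():
        chunks.append(current.strip())
    return chunks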

Security & Privacy

  • Uploaded documents and queries may be sent to a remote HF Space depending on configuration. Do not send sensitive or regulated data to an external space unless you control it and understand the data handling policies.

Where to get help

  • File an issue in this repository's issue tracker for bugs or feature requests.
  • For questions about the .proto service, see proto/risk_extractor.proto.
  • For contribution guidance, see CONTRIBUTING.md (if present).

Maintainers & Contributing

  • Maintainer: repository owner (see repo metadata).
  • Contributions are welcome. Please open issues or pull requests.
  • Keep PRs small, include tests where applicable, and describe the motivation and design in the PR description.
  • See CONTRIBUTING.md for detailed contribution guidelines.

Development tips

  • When experimenting with the model locally, prefer a controlled virtualenv and install optional heavy packages only when needed.
  • To speed up prototyping, stub out batch_generate in utils.py to return canned outputs instead of calling a Space (a minimal sketch follows).
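For example, a quick stand-in could look like the following; the real batch_generate signature in utils.py may differ, and this assumes it takes a list of prompts and returns a list of "[CATEGORIES] | SUMMARY" strings.

import utils

def fake_batch_generate(prompts, *args, **kwargs):
    # Canned outputs in the expected "[CATEGORIES] | SUMMARY" format (assumed signature).
    return ["['LIQUIDITY', 'DEBT'] | Placeholder summary for testing" for _ in prompts]

utils.batch_generate = fake_batch_generate  # avoid calling the remote Space during development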

License

See the project LICENSE file for license details.

Acknowledgements

  • Built with PyMuPDF, sentence-transformers, transformers, gradio_client and other community libraries. See requirements.txt for full dependency details.

Next steps

  • Run the server and try client.py against a controlled HF Space.
  • Add CONTRIBUTING.md and docs/ materials if you want to grow the project documentation.

Gen AI Usage Disclosure

This README and the code scaffolding for this repo were created in part using VSCode Copilot and GPT-5. All content has been manually reviewed, modified as necessary, and validated by the author.

Footnotes

  1. Synthetic Financial Risk Analysis Dataset, Gretel-AI, 2024.
